A Comparison of Item Exposure Control Methods in Computerized Adaptive Testing


Journal of Educational Measurement
Winter 1998, Vol. 35, No. 4, pp. 311-327

A Comparison of Item Exposure Control Methods in Computerized Adaptive Testing

Javier Revuelta and Vicente Ponsoda
Universidad Autónoma de Madrid

Two new methods for item exposure control were proposed. In the Progressive method, as the test progresses, the influence of a random component on item selection is reduced and the importance of item information becomes increasingly more prominent. In the Restricted Maximum Information method, no item is allowed to be exposed in more than a predetermined proportion of tests. Both methods were compared with six other item-selection methods (Maximum Information, One Parameter, McBride and Martin, Randomesque, Sympson and Hetter, and Random Item Selection) with regard to test precision and item exposure variables. Results showed that the Restricted method was useful for reducing maximum exposure rates and that the Progressive method reduced the number of unused items. Both did well regarding precision. Thus, a combined Progressive-Restricted method may be useful to control item exposure without a serious decrease in test precision.

One of the main goals of computerized adaptive testing (CAT) is to obtain precise ability estimates with a small number of items. To achieve this goal, items are selected specifically for each examinee from a large bank. Selection is based on characteristics of examinees (their provisional estimated ability) and items (their difficulty and discrimination parameters). Thus, a different subset of items may be administered to each person (Hambleton, Swaminathan, & Rogers, 1991).

The Maximum Information method (MI) is widely used to select items during the testing session (Wainer, 1990, p. 111). It selects the unused item of the bank that provides the most information at the last estimated ability. Item information is greater as the item difficulty approaches the ability level of the examinee, as the discrimination parameter increases, and as the pseudochance level approaches zero (Hambleton & Swaminathan, 1985). Each item bank contains "good" and "poor" candidates for MI selection at a particular ability level; items with high discrimination (a) values are extremely good candidates if their difficulties lie close to the test taker's proficiency. In practice, some items are used in most test administrations, but others are rarely (if ever) selected (Wainer, 1990; Mills & Stocking, 1995). For instance, Hulin, Drasgow, and Parsons (1983), in a simulation study, administered 4600 tests and found that 141 of the 260 items were never administered with the MI method.

Author note: The authors wish to thank two anonymous reviewers for their extensive and thoughtful comments and Carmen Ximénez for her help in the preparation of this manuscript. This research was partially supported by three DGICYT grants (PS , PS and PS ).
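The item banks analyzed below are calibrated under the three-parameter logistic (3PL) model, with discrimination a_i, difficulty b_i, and pseudochance c_i for item i. For reference, a standard statement of the model and of the item information function it implies (the usual textbook formulas with scaling constant D = 1.7, supplied here for orientation rather than reproduced from the original article):

P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-D a_i (\theta - b_i)]}, \qquad D = 1.7

I_i(\theta) = D^2 a_i^2 \, \frac{1 - P_i(\theta)}{P_i(\theta)} \left[ \frac{P_i(\theta) - c_i}{1 - c_i} \right]^2

The MI method evaluates I_i(\theta) at the provisional ability estimate and administers the unused item with the largest value; the formula makes visible why items with high a, low c, and b near \theta dominate selection.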

The proportion of times an item is used (its Item Exposure Rate) depends, then, on (a) its psychometric properties, (b) what other items are available in the pool, and (c) the distribution of ability of the examinees.

Items with a high exposure rate may produce some undesirable effects. For example, if an examinee is evaluated several times with adaptive tests from the same item pool, he or she could be asked the same questions, and may have learned the correct response. The most frequently used items may soon become popular and lose their original psychometric properties, causing a decrease in the test's validity. From the test developer's perspective, it is also undesirable to create a large item pool and use an item selection method that leaves a large percentage of the items unused. It seems sensible to demand that all the items be administered at some time. This also guarantees more variety in the items the examinees receive. In short, test developers want to be sure that all items are used (for economic reasons), but that no item is overused (for security reasons).

Item exposure control strategies have two main aims: (a) to prevent overexposure of some items, and (b) to increase the use rate of seldom- or never-selected items. These strategies achieve better control of item exposure by introducing several changes in the MI method, but not without cost: they also produce a loss of precision in ability estimation. In general, item exposure control strategies can be classified in two groups (Stocking, 1993):

(a) Methods adding a random component to the MI item selection method. One such method selects the first item at random from the optimal five items, the second from the optimal four, the third from a group of three, the fourth from a group of two, and subsequently the optimal item is chosen (McBride & Martin, 1983).

(b) Methods based on assigning a parameter to each item to control its maximum exposure. Sympson and Hetter (1985) developed a probabilistic method to achieve control over maximum exposure rates. The method acts in the same way as the MI method and selects the optimal item. However, this item will actually be administered not on 100% of the occasions when it is selected as optimal, as in the MI method, but only in 100k_i% of the tests, where k_i is the probability that item i is administered, given that it has been selected. Items with low exposure rates under the MI method will have k_i values close to one. Items with extremely high exposure rates will have smaller k_i values. The set of k_i values is determined through a series of simulations. Several refinements have recently been added to this method. Davey and Parshall (1995) extended the method to prevent not only individual item overuse, but also overexposure of item clusters. Stocking (1993) and Stocking and Lewis (1995) adapted Sympson and Hetter's method to structured item banks.

Strategies derived from Sympson and Hetter's method present the important advantage of direct control of unconditional and conditional exposure rates. For a bank of n items, one exposure-rate parameter per item has to be determined by simulation in the unconditional case. In the conditional-on-ability case (Stocking & Lewis, 1995), r exposure-rate parameters per item are needed, where r is the number of ability levels. Moreover, parameter values depend on CAT and item bank characteristics, such as number of items, stopping rule, and so on. Replacing or adding items to the bank, modifying test length, and so on, may cause changes in item functioning (Stocking & Lewis, 1995). Although this method is theoretically interesting, it would be useful to find more manageable strategies that provide similar precision in ability estimation.

The current report introduces two new methods for exposure control and compares them with several existing ones. The methods considered in this paper are described below.

Methods

Maximum Information Method (MI)
The information provided by each unused item (I_i) at the last estimated ability is computed. The most informative item is administered.

One Parameter Method (1P)
This method is a modified MI method in which the item discrimination and pseudochance parameters have no role: the unused item with the difficulty (b) parameter closest to the last estimated ability is selected and presented. The aim of this method is to increase the exposure of the items with a low a parameter and to reduce that of very discriminating items. One example of the application of this method may be found in Dodd (1990). She compared the 1P to the MI method in a computerized adaptive test measuring attitudes, based on the rating scale model.

McBride and Martin Method (MM)
As described above, a random component is added to the selection of the initial items. The first item is selected at random among the 5 most informative items, the second among the 4, and so on. The fifth and subsequent items are selected to be optimal. Other randomization schemes similar to the MM method have been proposed (Stocking & Swanson, 1993; Lewis, Subhiyah, & Morrison, 1995; Morrison, Subhiyah, & Nungester, 1995). In the Morrison et al. scheme, the first five items are selected at random from the optimal ten. For the sixth and following items, the most informative item is administered.

Randomesque Method (RA)
Selection is always made at random among the 5 most informative items (Kingsbury & Zara, 1989; Morrison et al., 1995).

Sympson and Hetter Method (SH)
As mentioned above, in the unconditional case, this method assigns to each item a k_i parameter ranging from 0 to 1. Once the most informative item has been selected, a random number from the uniform (0, 1) distribution is generated. If this random number is lower than k_i, the item is actually administered; if not, the item is set aside, the next-most-informative item is identified, a new random value is generated, and so on. Items are always selected from the set of items that have been neither administered nor set aside. The k_i parameters are assigned by an iterative process of repeated simulations, until a maximum-exposure target is attained (0.40 in the simulations to be reported). An extensive description of this method can be found in Sympson and Hetter (1985), Stocking (1993), and Hetter and Sympson (1997).
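The SH administration step is simple to state in code. Below is a minimal Python sketch, not the operational implementation: the function names and array layout are assumptions, and the k_i values are taken as already calibrated.

```python
import numpy as np

D = 1.7  # logistic scaling constant

def info(theta, a, b, c):
    """Fisher information of 3PL items at ability theta (vectorized)."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def sympson_hetter_select(theta, a, b, c, k, unavailable, rng):
    """One SH step: walk the items in decreasing order of information,
    administering each candidate with probability k[i]; a candidate that
    fails its lottery is set aside for the remainder of this test."""
    for i in np.argsort(-info(theta, a, b, c)):
        if unavailable[i]:
            continue
        unavailable[i] = True        # administered or set aside, either way
        if rng.random() < k[i]:      # the exposure-control lottery
            return int(i)
    raise RuntimeError("item pool exhausted")
```

Because an item that fails its lottery is set aside, each test considers every item at most once, matching the description above.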

Restricted Maximum Information Method (Rk)
This was proposed as a practical alternative to the SH method. It avoids the complexities involved in the assignment of the k_i parameters. Items are selected by the MI method, but no item is allowed to be exposed in more than 100k% of the tests. When an item attains this limit it cannot be administered in the current test (Revuelta, 1995; Revuelta & Ponsoda, 1996). Suppose that a test has been administered t times, and let a_i be the number of times item i has been administered in those t tests. The exposure rate for item i is then a_i/t. The set of available items for the next test is composed only of the items with exposure rates below k. Items are then selected from this reduced item pool by the MI method. The set of items that may be administered thus changes from test to test. A particular item will be available for some tests, then will not be available; but after being unavailable for one or more tests, its quotient a_i/t will decrease and the item will again be available (when its exposure rate falls below k again). The parameter k is the maximum allowed exposure rate. The only restriction on k is that, as some items will be unable to be administered in some tests, k's value must be greater than the reciprocal of the integer quotient of bank size and test length (maximum test length, in variable-length tests), to ensure that there will be enough available items for any test application. The SH method has a similar restriction (Stocking, 1993).

Progressive Method (PR)
This method was proposed by Revuelta (1995) and Revuelta and Ponsoda (1996). It also adds to the MI method a random component, whose contribution is important at the beginning of the test and increasingly less influential as the test progresses. These are the steps to administer a new item (let h be the number of items already administered to this particular examinee) in a test with a maximum length of m items:

1. For each unused item, the information I_i at the ability estimated from the previous h items is computed. Let H be the highest information value obtained.
2. A random value R_i from the uniform (0, H) distribution is drawn for each item.
3. The relative serial position of the item is defined as s = h/m. A weight is computed for each unused item as a linear combination of the random and information components, according to the formula w_i = (1 - s)R_i + sI_i.
4. The item with the highest weight is administered.

Serial position values (s) increase linearly from 0 (for the first item to be administered) to a value close to 1 (for the last item). Since item information is multiplied by s and the random component by 1 - s, the importance of the two components changes in opposite directions as the testing session progresses: the information component gains the importance that the random component loses.

The formula was motivated by the following considerations: when applying the MI method, the contribution of the initial items to test precision is seldom great, since these items are very informative, but at ability estimates that very often differ markedly from the final estimates. It was therefore supposed that the PR method would reduce differences among items in item exposure rate, without producing a serious loss in precision, if the random component affected mainly the initial item selections.
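The Rk restriction and the PR weighting combine naturally into the PRk method introduced later in the paper. The following Python sketch is illustrative only: the bookkeeping names are assumptions (counts holds per-item administration counts over the t completed tests, updated between tests), and the item informations at the current ability are passed in precomputed, for example by the info() function in the SH sketch above. Setting k_max = 1.0 disables the restriction and recovers the pure Progressive method.

```python
import numpy as np

def progressive_restricted_select(I, h, m, counts, t, k_max, used, rng):
    """PRk step: among unused items whose running exposure rate counts[i]/t
    is below k_max, administer the item maximizing
        w_i = (1 - s) * R_i + s * I_i,  with s = h / m,
    where R_i ~ Uniform(0, H) and H is the largest available information."""
    rate = counts / max(t, 1)              # exposure rates over past tests
    available = (~used) & (rate < k_max)   # the Rk restriction
    s = h / m                              # relative serial position
    H = I[available].max()
    w = (1 - s) * rng.uniform(0.0, H, size=I.shape) + s * I
    w[~available] = -np.inf                # mask restricted or used items
    return int(np.argmax(w))
```

In the studies reported below, k_max is 0.40 (the R40 and PR40 variants) or 0.15 (PR15).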

Two different studies are reported. Study One uses a real item bank and gives an initial impression of the methods in two conditions: fixed-length and variable-length tests. In Study Two, the methods MI, SH, PR, and Rk are explored in more detail. Simulated item banks were used in this case, making possible the comparison of conditions differing in test length and in the discrimination parameters of the items.

Study One

As mentioned above, in this study the described methods are compared using a real item pool (Ponsoda, Olea, & Revuelta, 1994) designed to evaluate examinees' English vocabulary. Due to the exploratory nature of the study, no strong predictions about results can be advanced. However, it was expected that the MI method would produce the best precision and the greatest differences among items in item exposure rates, since it aims at optimal precision and does not attempt any kind of item exposure control.

Method

Item Selection
The six item selection methods described above (MM, RA, PR, R40, SH, and 1P) were compared with the MI method. A method in which items are strictly selected at random from the bank was also included as a control (CO) condition. The k_i parameters of the SH method were estimated for each condition using 0.4 as the maximum desirable exposure rate. This same value was also used for the k rate of the Rk method, so this method will be called R40 hereafter. The MI and CO methods were expected to produce the maximum and minimum ability estimation precision, respectively. With regard to item exposure control, these two methods were also expected to be the best (CO) and the poorest (MI) of the entire set.

Conditions
The eight methods were tested in two different conditions: (a) fixed test length (35 items), and (b) variable test length (the stopping rule was a standard error of ability lower than 0.22, or a maximum length of 50 items).

Procedure
The same 2000 simulees received a CAT under the two conditions and eight methods. Examinees' true ability parameters were normally distributed, N(0, 1). The initial ability of all simulated subjects at the beginning of the test was taken as zero. The program ADTEST (Ponsoda et al., 1994) was used for running the simulations. Abilities were estimated by the maximum-likelihood procedure.
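One simple way to realize the maximum-likelihood ability estimator for a 3PL response pattern is a grid search, sketched below in Python; the grid bounds and function name are assumptions, and ADTEST's actual optimizer may well differ.

```python
import numpy as np

def ml_ability(responses, a, b, c, grid=np.linspace(-4, 4, 321)):
    """Maximum-likelihood 3PL ability estimate by grid search.
    responses: 0/1 array for the administered items, whose parameters
    are in a, b, c. All-correct or all-wrong patterns have no finite
    MLE; the grid bounds cap the estimate in that case."""
    theta = grid[:, None]                                  # grid x items
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    loglik = (responses * np.log(p)
              + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(loglik)])
```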

A real item bank of 221 items was employed. The item bank was calibrated using the ASCAL program (Assessment Systems Corporation, 1988). The descriptive statistics of the bank (mean, standard deviation, minimum, and maximum) were as follows: a parameter (1.10, 0.36, 0.40, 2.06); b parameter (-0.11, 1.50, -3, 3); and c parameter (0.21, 0.06, 0.05, 0.39). Further details can be found in Ponsoda, Wise, Olea, and Revuelta (1997).

Three indices were computed to compare ability estimate precision among the methods: (a) overall bias between estimated and true ability, (b) the standard deviation of the difference between estimated and true ability (Se), and (c) the mean test length (only for variable-length tests). The simulation provided, for each simulee, the true and estimated ability. The variable D, error in ability estimation, was defined as the signed difference between the estimated and the true ability. Overall bias and Se were computed as the mean and the standard deviation of D, respectively, over the 2000 simulees.

When the 2000 simulated subjects had completed the test, the number of times each item was administered could be computed. As mentioned above, the variable "Exposure Rate", ranging from 0 to 1, is the proportion of times an item had been administered across the 2000 tests. The following indices were used to compare the methods: (a) the percentage of items never administered in the 2000 simulations, (b) the coefficient of variation (CV) of the variable "Exposure Rate", and (c) the minimum and maximum values of this variable. The exposure rate distribution, grouped in ten intervals, was also computed for each item selection method.

The CV is computed as the standard deviation of the variable exposure rate, multiplied by 100 and divided by its mean. As a measure of dispersion, it will be large when some items have high exposure rates while others are seldom used. The CV may be preferable to the variance when the exposure control methods and/or conditions to be compared differ markedly in the mean exposure rate they produce. This would be the case, for example, when comparing different test lengths.

The simulation corresponding to each condition and method was repeated five times. The same 2000 simulated subjects were used at each combination of condition and method. The variables described above were computed for each repetition. The results shown below represent the means of the five repetitions.
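These exposure indices are direct functions of the per-item administration counts; a minimal Python sketch (the function name and output layout are assumptions):

```python
import numpy as np

def exposure_summary(counts, n_tests):
    """Exposure indices for one method/condition: CV of the exposure
    rates, min and max rates (x100), and % of items never administered."""
    rates = counts / n_tests
    return {
        "CV": 100 * rates.std() / rates.mean(),
        "min_x100": 100 * rates.min(),
        "max_x100": 100 * rates.max(),
        "never_pct": 100 * np.mean(counts == 0),
    }
```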

Results

Precision
In the fixed-length condition, the MI, MM, and RA methods yielded the highest precision in ability estimation. The poorest precision was produced by the CO and 1P methods. The remaining three methods (SH, R40, and PR) yielded a precision slightly inferior to that of the best three methods (see Table 1). In the variable-length condition, the variable "test length" ("Items" column in Table 1) made clear the differences in precision among the methods. The MI, MM, and RA methods produced the shortest tests. The R40 and PR methods needed approximately three more items to reach the same degree of precision, and SH needed four. The longest tests were administered in the 1P and CO conditions. The 1P method needed on average 12 more items than the MI method. In the CO method, the standard error stopping rule was never reached, and all the simulated subjects thus received 50 items. Bias values were very small, and no clear differences in bias emerged among the methods.

Table 1
Ability Estimation Precision in the Fixed- and Variable-Length Conditions, by Method
(Columns: Bias, Se, and Items for the fixed- and variable-length conditions; rows: the methods MI, MM, RA, PR, R40, SH, 1P, and CO. The numeric entries are not recoverable from the source text.)
Note. "Bias" means the overall difference between estimated and true ability; "Se" means the standard deviation of the differences between estimated and true abilities; "Items" means the mean number of administered items. "Bias" and "Se" are multiplied by 1000. The stopping rules were two: fixed-length (35 items) and variable-length (50 items or standard error <= 0.22).

The information provided by the first item given, second item given, and so on, evaluated at the final ability estimate for each of the 2000 simulees in the fixed-length test, was computed. Figure 1 shows the mean across simulees of these information values. The results showed that the MI method administered the best items at the beginning of the test. The methods R40, SH, and RA provided nearly identical mean information values, slightly below those of the MI method, throughout the test. The Progressive method provided poor items at the beginning, but they were more informative than those provided by the MI method from the eleventh item until the end of the test. The MM method produced a curve similar to MI. Finally, the 1P method produced a curve lying midway between the MI and CO methods.

Exposure Control
As seen above, in the variable-length condition, different test lengths were obtained. Item exposure depended on test length because, as length increased, more items had to be included in the test, so that the percentage of unused items and the coefficient of variation should decrease. This circumstance would make a fair comparison among the methods with regard to item exposure rate more difficult in the variable-length condition. For this reason, only the results for the fixed-length condition are given.

[Figure 1. Mean Information by Item Position. The results for the methods RA, 1P, and MM are not shown, as they could not be easily distinguished from other curves. The plotted curves themselves are not recoverable from the source text.]

Table 2
Item Exposure Rate Distributions, by Method and Statistic, in the Fixed-Length Condition
(Columns: the methods MI, MM, RA, PR, R40, SH, 1P, and CO. Rows: the ten exposure rate intervals (x100), followed by the CV, the minimum and maximum rates (x100), and the percentage of items never administered. The numeric entries are largely not recoverable from the source text.)
Note. The entries of the table are the percentage of items in the bank at each exposure rate interval. Item pool size is 221 items. "CV" stands for coefficient of variation. Exposure rate, minimum, and maximum values are multiplied by 100.

As Table 2 shows, the MI, MM, RA, and PR methods provided the highest coefficients of variation (over 100). The R40 and SH methods showed values close to 100, and the CV of the 1P method was clearly lower. As expected, when items were strictly selected at random (CO method), item exposure rates were quite homogeneous, and the coefficient of variation was small. The methods MI, MM, RA, R40, and SH provided percentages of items never administered ranging from 15 to 25; thus an important part of the bank was never used. However, this was not the case for the other three methods in Table 2 (PR, 1P, and CO). They had a zero percentage of items never administered, indicating that all the items were used at least once. With the MI and 1P methods, the maximum values are 100: since all the simulated subjects start with the same initial estimated ability, all the tests begin with the same first item. As expected, the maximum exposure rate (x100) for the method R40 was 40. The CO method provided the highest minimum and the lowest maximum. For all the methods except two (1P and CO), around 50% of the items were grouped in the first interval, meaning that half the items of the bank had an extremely low exposure rate: they were administered in less than 10% of the tests.

Item exposure rates provided by the MI method are expected to be related to item characteristics.

Correlations of item exposure rates with item parameters were computed. Exposure rates correlated 0.80 with discrimination (p < .001) and 0.02 with difficulty (p > .05), and correlated negatively with the absolute value of difficulty (p < .01) and with pseudochance (p < .01). These results confirmed that the most popular items have high discrimination, medium difficulty, and low pseudochance levels.

Taking the results for precision and exposure rate together, we can draw some conclusions. First, the greater the precision, the greater the differences in exposure rate among items. The best methods with regard to precision are the poorest with regard to exposure rate control; inversely, low-precision methods are superior to high-precision methods in exposure rate control. Second, if test administrators are concerned only with precision, they should apply the MI or MM method; if their main concern is item exposure rate, the CO or 1P method would be the best choice. As test administrators are expected to be concerned with precision, but at the same time interested in a more or less tight control of exposure rates, some other methods may be a better choice. The main advantage of the PR method is that it increased the minimum exposure rate and reduced the number of unused items without a serious decrease in test precision. However, it produced a maximum rate that was too high. R40 and SH kept the maximum rate under control and showed adequate precision, but the minimum rate remained equal to that of the MI method.

These results suggested that a new method, resulting from combining the Progressive and Restricted methods, would perform well in precision and exposure (both in maximum and minimum rate). This method (called PRk) is a progressive one, but no item is allowed to be exposed in more than 100k% of the tests. As in the Restricted method, the set of available items is determined before each test, but the Progressive method is then applied for item selection instead of the MI method. Thus, the methods MI, PR, SH, Rk, and PRk were chosen for more detailed scrutiny.

Study Two

The results of the study just described may have been produced by specific characteristics of the simulation, such as the psychometric properties of the items, the stopping rules, the distribution of ability, and so on. This second study again compared the methods MI, PR, Rk, PRk, and SH, and attempted to find out whether the conclusions of Study One held for different item banks.

Precision and exposure rates depend on item parameters and test length. As seen above, item discrimination parameters correlated with item exposure rates when the MI method was used. Test length should also affect item exposure rates; longer tests may produce fewer exposure rate differences among items, since the less popular items may have to be administered when the bank is running out of items. No exposure rate differences among items would, of course, emerge in the extreme case of a test length equaling bank size.

Method

Conditions
Three discrimination parameter distributions, three test lengths, and two item pool sizes were considered. Test lengths were 20, 40, or 60 items. Item pool sizes were 500 or 1000 items. Discrimination parameter distributions were as follows: (a) Lognormal (-0.25, 0.5), which produced a mean of 0.93 in the item bank; (b) Lognormal (0, 0.5), mean of 1.07; and (c) Lognormal (0.25, 0.5), mean of 1.30. The methods MI, PR, Rk, PRk, SH, and CO were applied in each of these 18 (3 x 3 x 2) conditions. For the SH, Rk, and PRk methods the desired maximum rate was again fixed at 0.4; thus these last two methods will again be called R40 and PR40. Finally, the PRk method was also considered with k = 0.15. The PR15 method was added to explore its efficiency when a tight control of item exposure is needed. The mean exposure rate in the previous study's CO method was 0.15, below most typical maximum exposure rates in CAT (e.g., the rate applied by Potenza & Stocking, 1997).

Procedure
In each condition a simulated CAT was administered to the same 2000 simulated subjects, as in Study One. Initial estimated abilities were again fixed at 0. A simulated item bank was created for each of the six combinations of bank size x discrimination distribution. For all six item banks, the difficulty parameters were drawn from a normal (0, 1) distribution and the pseudochance parameters from a Beta (5, 17) distribution (which resulted in a mean of 0.23 in the bank). See Baker (1992) for a description of these parameter distributions.
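The Study Two banks can be generated directly from the stated distributions. A Python sketch (the helper name and seed are assumptions; the first lognormal argument is the mean of log a):

```python
import numpy as np

def simulate_bank(n_items, mu_log_a, rng):
    """One simulated bank: a ~ Lognormal(mu_log_a, 0.5),
    b ~ Normal(0, 1), c ~ Beta(5, 17) (mean 5/22, about 0.23)."""
    a = rng.lognormal(mean=mu_log_a, sigma=0.5, size=n_items)
    b = rng.normal(0.0, 1.0, size=n_items)
    c = rng.beta(5.0, 17.0, size=n_items)
    return a, b, c

# For example, the 500-item bank with the middle a distribution:
a, b, c = simulate_bank(500, 0.0, np.random.default_rng(1))
```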

The indices of precision and exposure rate applied in the previous study were also used here. Exposure rate distributions will not be reported. As in the first study, the simulation corresponding to each condition was repeated five times, using the same 2000 true abilities. The results shown are the means of the five repetitions.

Results

Precision was related to the method applied. The MI method was the most precise, followed by the PR, R40, PR40, and SH methods. The PR15 method was less precise, and CO was the poorest in terms of precision. It should be noted that PR40 produced Se values similar to those of R40. Increases in the a parameters and in item pool size were also related to improved precision. Finally, as expected, longer tests gave lower bias and Se values. Bias values were quite low (see Table 3), and the methods did not show noticeable differences in bias. The results seem to suggest, however, that bias is slightly higher for the control and PR15 methods.

As can be seen in Table 3 (bank size 500), with the PR15 method, test precision is lower in the 60-item than in the 40-item condition. This result was unexpected, and we do not know the reason for it. It suggests that the increase in the number of items had both a positive and a negative effect. All things being equal, longer tests have more precision. In this case, however, in some specific conditions it seemed as if longer tests were made up of poorer items than shorter tests, and the precision increase generated by their extra items could not compensate for their lower item quality. This possibility requires further exploration.

As shown in Tables 4 and 5, test length also affected the exposure rates. Longer tests gave more homogeneous rates (i.e., smaller coefficients of variation) and smaller percentages of items never administered. In almost all conditions the minimum exposure rate was 0, due to the large item pool size compared to test length. With regard to the methods, marked differences were observed. The MI method gave the highest coefficients of variation and the highest maximum rates. The PR method provided a low number of unused items, but its maximum rate increased with test length, as there was no direct control over it. The methods SH and R40 had control over the maximum rate, but the number of items never used was similar to that found with the MI method. PRk had control over the maximum rate, and its number of unused items and CVs were as low as those provided by the PR method.

By combining results on precision and exposure rates, some conclusions may be reached. The MI method gave the highest precision but the poorest exposure control. The remaining methods were less precise, while their results in exposure varied. The PR method produced the lowest number of unused items, but the maximum rate was not under control. The methods R40 and SH permitted this control but provided an unacceptably high percentage of unused items. The combined PRk method gave good results in both variables, and for some maximum exposure rates (0.40 in our case) it yielded good precision when compared to the other exposure control methods.

Table 3
Ability Estimation Precision by a Distribution, Test Length, Method, and Bank Size
(Rows: the methods MI, PR, R40, SH, PR40, PR15, and CO, crossed with test length and a distribution; columns: Bias and Se, for bank sizes 500 and 1000. The numeric entries are not recoverable from the source text.)
Note. The "Bias" and "Se" indices are multiplied by 1000. The means of the a parameters in the three distributions are 0.93 (1), 1.07 (2), and 1.30 (3).

Discussion

The current work compared the precision and item exposure rates of eight item-selection methods. Results showed that the MI method may compromise item bank security, although it would be a good choice for item selection if test peculiarities allowed us to disregard the exposure control problem. This would be the case, for example, if no examinee were to receive the test more than once, or if no examinee were expected to pass on information about the test to another examinee.

None of the considered exposure control methods is fully satisfactory. Decisions about the particular method to apply should depend on the level of exposure control required. To prevent most examinees from answering the same items at the beginning of the test, the methods MM and RA are good candidates.

Table 4
Coefficient of Variation, Minimum Rate, Maximum Rate, and Percentage of Items Never Used, by a Distribution, Test Length, and Method, in the 500 Bank Size Condition
(Rows: CV, minimum rate, maximum rate, and percentage of items never used, for the methods MI, PR, R40, SH, PR40, PR15, and CO, crossed with test length and a distribution. The numeric entries are largely not recoverable from the source text; the MI maximum rate was 100.0 in every condition.)
Note. Minimum and maximum values are multiplied by 100. The means of the a parameters in the three distributions are 0.93 (1), 1.07 (2), and 1.30 (3).

If the test administrator is mainly concerned with reducing extremely high exposure rates, then the SH or Rk methods are good choices. The SH method is widely used in operational CATs and has also been extended to structured item banks (Stocking, 1993). One difficulty of the SH method is the assignment of the k parameters. In this research, the set of k parameters in both studies had to be obtained via simulation, for each different condition, before running the reported simulations.

In the Rk method, when the first examinee receives the test, all the items are available. Items administered to the first examinee will become unavailable for the second and following examinees, until they again reach an exposure rate below k. In consequence, a possible shortcoming of the method is that ability estimation precision may be better for some examinees than for others, depending on when they take the test. However, as the SH and Rk methods did not differ in precision, it seems safe to conclude that this property of the Rk method had no consequences in the current studies.

As two reviewers pointed out, the use of the Rk method may face some difficulties if testing takes place at different CAT sites, since simultaneous control of item exposures across the different sites would be required.

Table 5
Coefficient of Variation, Minimum Rate, Maximum Rate, and Percentage of Items Never Used, by a Distribution, Test Length, and Method, in the 1000 Bank Size Condition
(Same layout as Table 4. The numeric entries are largely not recoverable from the source text; the MI maximum rate was again 100.0 in every condition.)
Note. Minimum and maximum values are multiplied by 100. The means of the a parameters in the three distributions are 0.93 (1), 1.07 (2), and 1.30 (3).

In this case several possibilities may be considered. If the whole item bank is stored in each computer administering the test, the Rk method may be applied independently on each computer. If the test is stored on a single server and transmitted to the terminals over a computer network, this server may control the exposure rate across all the terminals, without the need to count how many times each particular item is exposed at each single terminal.

The PR method produced greater control over exposure rates (except for the maximum rate) at a small cost in precision, when compared to the methods MI, MM, and RA. One advantage of the method is the flexibility offered by the possibility of manipulating its formula. The efficiency of the PR method for controlling item exposure rates may be regulated by the magnitude of the random component, which could be related to variables affecting test precision (such as the variability in the item discrimination parameters or the maximum test length). The random and information components are linearly combined to produce the weights governing item selection.

Other formulas could, of course, have been proposed. More research is needed to explore the efficiency of alternative expressions for guiding item selection.

The combined method PRk reduced the maximum exposure rates and also the number of unused items. Moreover, its precision was similar to that of the Rk method. The combined method, then, provided the best overall results.

The precision of the 1P method was clearly lower than that of the MI method, though with regard to item exposure rates it was superior to the methods discussed above. Its maximum rate was 1 since, as with the MI method, every test administered the same first item. The influence of the a parameters on item selection differed across the methods applied: they had no role in the 1P method, but they correlated (r = 0.8) with exposure rates in the MI method. As one referee pointed out, this unequal influence of the discrimination parameter may also affect the amount of bias the methods produce when (as is the case here) maximum likelihood procedures are used to estimate ability. Kim and Nicewander (1993) found that maximum likelihood estimates became more biased as the a parameter was increased from 0.5 to 1 and 2. Our data did not reveal any noticeable differences in bias among methods (the highest absolute bias found in both studies was negligible). However, a slight increase in bias was registered for the control and PR15 methods in the shortest tests, indicating that differences in bias among the methods may be discovered when other conditions are explored.

The influence of the item selection method on test validity also needs more research. Green, Bock, Humphreys, Linn, and Reckase (1984) suggested that item discrimination parameters are related to the dimensionality of the bank. Items with low discrimination parameters are less related to the trait measured by the entire bank. Methods using items with lower a (such as the 1P method) may then compromise the unidimensionality of the CAT, and therefore its validity and predictive power.

The current study did not take test overlap into account. Methods revealed as efficient for exposure rate control may not be so for exposure rate control conditional on examinee ability. Methods not including a random component, such as MI and 1P, should produce a high degree of item overlap, as they may administer the same items to same-ability examinees. The Rk and PRk methods may be easily extended to prevent overlap by computing the exposure rates in the preceding tests conditional on ability levels. Overlap should not be so problematic for those methods in which a random component is present. Further research should pay attention to this problem (Davey & Parshall, 1995; Stocking, 1993).

Non-psychometric constraints on item selection, such as item content, item type, and so on, were not considered, but they should be taken into account before applying these methods to more realistic contexts in which administered items must also conform to a test plan. For the SH method this extension has been accomplished by Stocking (1993). Other characteristics of the evaluation, such as the relationship between examinees' abilities and item difficulties, may also affect exposure rates. If the mean ability of examinees is well above (or below) the mean of the item difficulties, controlling exposure rates may be quite costly in terms of test precision, as the few informative items will be unavailable for most tests. The relationship between the two distributions, and the effects of this relationship on precision and exposure rates, need further research.

In conclusion, none of the methods proposed may yet be considered a final solution to the exposure control problem. Methods for controlling item exposure rates apply suboptimal item selection strategies in order to diminish the exposure rate differences among items produced by the MI method. Simulations with real and simulated item banks showed that the precision loss is not important for most of the methods tried. Moreover, this loss in precision may, of course, be compensated by a small increase in test length. More visible differences among methods were found regarding exposure control. The Progressive Restricted method seems to perform well on precision and exposure control, and no parameters have to be determined by previous simulations. This method is therefore a good choice for keeping maximum exposure rates under control.

References

Assessment Systems Corporation. (1988). User's manual for the MicroCAT testing system, version 3. St. Paul, MN: Author.
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.
Davey, T., & Parshall, C. G. (1995, April). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Dodd, B. G. (1990). The effect of item selection procedure and stepsize on computerized adaptive measurement using the rating scale model. Applied Psychological Measurement, 14.
Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Applications to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Kim, J. K., & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58.
Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2.
Lewis, M. J., Subhiyah, R. G., & Morrison, C. A. (1995, April). A comparison of classification agreement between adaptive and full-length tests under the 1-PL and 2-PL models. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New horizons in testing. New York: Academic Press.

Mills, C. N., & Stocking, M. L. (1995). Practical issues in large-scale high-stakes computerized adaptive testing (Technical Report RR-95-23). Princeton, NJ: Educational Testing Service.
Morrison, C., Subhiyah, R., & Nungester, R. (1995, April). Item exposure rates for unconstrained and content-balanced computerized adaptive tests. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Ponsoda, V., Olea, J., & Revuelta, J. (1994). ADTEST: A computer adaptive test based on the maximum information principle. Educational and Psychological Measurement, 54.
Ponsoda, V., Wise, S. L., Olea, J., & Revuelta, J. (1997). An investigation of self-adapted testing in a Spanish high school population. Educational and Psychological Measurement, 57.
Potenza, M. T., & Stocking, M. L. (1997). Flawed items in computerized adaptive testing. Journal of Educational Measurement, 34.
Revuelta, J. (1995). El control de la exposición de los ítems en tests adaptativos informatizados [Item exposure control in computerized adaptive tests]. Unpublished master's dissertation, Universidad Autónoma de Madrid, Spain.
Revuelta, J., & Ponsoda, V. (1996). Métodos sencillos para el control de las tasas de exposición en tests adaptativos informatizados [Simple methods for item exposure control in CATs]. Psicológica, 17.
Stocking, M. L. (1993). Controlling item exposure rates in a realistic adaptive testing paradigm (Technical Report RR-93-2). Princeton, NJ: Educational Testing Service.
Stocking, M. L., & Lewis, C. (1995). A new method of controlling item exposure in computerized adaptive testing (Technical Report RR-95-25). Princeton, NJ: Educational Testing Service.
Stocking, M. L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17.
Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. In Proceedings of the 27th Annual Meeting of the Military Testing Association. San Diego, CA: Navy Personnel Research and Development Center.
Wainer, H. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.

Authors

JAVIER REVUELTA is Associate Professor, Facultad de Psicología, Universidad Autónoma de Madrid, Canto Blanco 28049, Madrid, Spain; javier.revuelta@uam.es. Degree: PhD, University of Madrid. Specializations: computerized testing, IRT methods.

VICENTE PONSODA is Professor, Facultad de Psicología, Universidad Autónoma de Madrid, Canto Blanco 28049, Madrid, Spain; vicente.ponsoda@uam.es. Degree: PhD, University of Madrid. Specialization: computerized testing.


More information

ANXIETY A brief guide to the PROMIS Anxiety instruments:

ANXIETY A brief guide to the PROMIS Anxiety instruments: ANXIETY A brief guide to the PROMIS Anxiety instruments: ADULT PEDIATRIC PARENT PROXY PROMIS Pediatric Bank v1.0 Anxiety PROMIS Pediatric Short Form v1.0 - Anxiety 8a PROMIS Item Bank v1.0 Anxiety PROMIS

More information

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting

More information

Computerized Adaptive Testing With the Bifactor Model

Computerized Adaptive Testing With the Bifactor Model Computerized Adaptive Testing With the Bifactor Model David J. Weiss University of Minnesota and Robert D. Gibbons Center for Health Statistics University of Illinois at Chicago Presented at the New CAT

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

PAIN INTERFERENCE. ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.1 Pain Interference PROMIS-Ca Bank v1.0 Pain Interference*

PAIN INTERFERENCE. ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.1 Pain Interference PROMIS-Ca Bank v1.0 Pain Interference* PROMIS Item Bank v1.1 Pain Interference PROMIS Item Bank v1.0 Pain Interference* PROMIS Short Form v1.0 Pain Interference 4a PROMIS Short Form v1.0 Pain Interference 6a PROMIS Short Form v1.0 Pain Interference

More information

Adaptive Estimation When

Adaptive Estimation When Adaptive Estimation When the Unidimensionality Assumption of IRT is Violated Valerie Greaud Folk Syracuse University Bert F. Green Johns Hopkins University This study examined some effects of using a unidimensional

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

INTRODUCTION TO ASSESSMENT OPTIONS

INTRODUCTION TO ASSESSMENT OPTIONS DEPRESSION A brief guide to the PROMIS Depression instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.0 Depression PROMIS Pediatric Item Bank v2.0 Depressive Symptoms PROMIS Pediatric

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla

More information

COMPUTING READER AGREEMENT FOR THE GRE

COMPUTING READER AGREEMENT FOR THE GRE RM-00-8 R E S E A R C H M E M O R A N D U M COMPUTING READER AGREEMENT FOR THE GRE WRITING ASSESSMENT Donald E. Powers Princeton, New Jersey 08541 October 2000 Computing Reader Agreement for the GRE Writing

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Item-Level Examiner Agreement. A. J. Massey and Nicholas Raikes*

Item-Level Examiner Agreement. A. J. Massey and Nicholas Raikes* Item-Level Examiner Agreement A. J. Massey and Nicholas Raikes* Cambridge Assessment, 1 Hills Road, Cambridge CB1 2EU, United Kingdom *Corresponding author Cambridge Assessment is the brand name of the

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

Constrained Multidimensional Adaptive Testing without intermixing items from different dimensions

Constrained Multidimensional Adaptive Testing without intermixing items from different dimensions Psychological Test and Assessment Modeling, Volume 56, 2014 (4), 348-367 Constrained Multidimensional Adaptive Testing without intermixing items from different dimensions Ulf Kroehne 1, Frank Goldhammer

More information

ANXIETY. A brief guide to the PROMIS Anxiety instruments:

ANXIETY. A brief guide to the PROMIS Anxiety instruments: ANXIETY A brief guide to the PROMIS Anxiety instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS Bank v1.0 Anxiety PROMIS Short Form v1.0 Anxiety 4a PROMIS Short Form v1.0 Anxiety 6a PROMIS Short

More information

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA Data Analysis: Describing Data CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA In the analysis process, the researcher tries to evaluate the data collected both from written documents and from other sources such

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

An Introduction to Missing Data in the Context of Differential Item Functioning

An Introduction to Missing Data in the Context of Differential Item Functioning A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

MEANING AND PURPOSE. ADULT PEDIATRIC PARENT PROXY PROMIS Item Bank v1.0 Meaning and Purpose PROMIS Short Form v1.0 Meaning and Purpose 4a

MEANING AND PURPOSE. ADULT PEDIATRIC PARENT PROXY PROMIS Item Bank v1.0 Meaning and Purpose PROMIS Short Form v1.0 Meaning and Purpose 4a MEANING AND PURPOSE A brief guide to the PROMIS Meaning and Purpose instruments: ADULT PEDIATRIC PARENT PROXY PROMIS Item Bank v1.0 Meaning and Purpose PROMIS Short Form v1.0 Meaning and Purpose 4a PROMIS

More information

Section 5. Field Test Analyses

Section 5. Field Test Analyses Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 26 for Mixed Format Tests Kyong Hee Chon Won-Chan Lee Timothy N. Ansley November 2007 The authors are grateful to

More information

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Teodora M. Salubayba St. Scholastica s College-Manila dory41@yahoo.com Abstract Mathematics word-problem

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek.

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek. An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests Xuan Tan Sooyeon Kim Insu Paek Bihua Xiang ETS, Princeton, NJ Paper presented at the annual meeting of the

More information

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Jonny B. Pornel, Vicente T. Balinas and Giabelle A. Saldaña University of the Philippines Visayas This paper proposes that

More information

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS 3 CONCEPTUAL FOUNDATIONS OF STATISTICS In this chapter, we examine the conceptual foundations of statistics. The goal is to give you an appreciation and conceptual understanding of some basic statistical

More information

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments:

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments: PROMIS Bank v1.0 - Physical Function* PROMIS Short Form v1.0 Physical Function 4a* PROMIS Short Form v1.0-physical Function 6a* PROMIS Short Form v1.0-physical Function 8a* PROMIS Short Form v1.0 Physical

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Educational Psychology Papers and Publications Educational Psychology, Department of 1-2016 The Matching Criterion Purification

More information

International Journal of Education and Research Vol. 5 No. 5 May 2017

International Journal of Education and Research Vol. 5 No. 5 May 2017 International Journal of Education and Research Vol. 5 No. 5 May 2017 EFFECT OF SAMPLE SIZE, ABILITY DISTRIBUTION AND TEST LENGTH ON DETECTION OF DIFFERENTIAL ITEM FUNCTIONING USING MANTEL-HAENSZEL STATISTIC

More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

Reliability & Validity Dr. Sudip Chaudhuri

Reliability & Validity Dr. Sudip Chaudhuri Reliability & Validity Dr. Sudip Chaudhuri M. Sc., M. Tech., Ph.D., M. Ed. Assistant Professor, G.C.B.T. College, Habra, India, Honorary Researcher, Saha Institute of Nuclear Physics, Life Member, Indian

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

PHYSICAL STRESS EXPERIENCES

PHYSICAL STRESS EXPERIENCES PHYSICAL STRESS EXPERIENCES A brief guide to the PROMIS Physical Stress Experiences instruments: PEDIATRIC PROMIS Pediatric Bank v1.0 - Physical Stress Experiences PROMIS Pediatric Short Form v1.0 - Physical

More information

CHAPTER 3 METHOD AND PROCEDURE

CHAPTER 3 METHOD AND PROCEDURE CHAPTER 3 METHOD AND PROCEDURE Previous chapter namely Review of the Literature was concerned with the review of the research studies conducted in the field of teacher education, with special reference

More information

Identifying Non- Effortful Student Behavior on Adaptive Tests: Steven L. Wise, Lingling Ma, and Robert A. Theaker. Northwest Evaluation Association

Identifying Non- Effortful Student Behavior on Adaptive Tests: Steven L. Wise, Lingling Ma, and Robert A. Theaker. Northwest Evaluation Association 1 Identifying Non- Effortful Student Behavior on Adaptive Tests: Implications for Test Fraud Detection Steven L. Wise, Lingling Ma, and Robert A. Theaker Northwest Evaluation Association Paper presented

More information

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN By FRANCISCO ANDRES JIMENEZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF

More information

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus.

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. Department of Political Science UNIVERSITY OF CALIFORNIA, SAN DIEGO Philip G. Roeder Research Prospectus Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. A

More information

Biserial Weights: A New Approach

Biserial Weights: A New Approach Biserial Weights: A New Approach to Test Item Option Weighting John G. Claudy American Institutes for Research Option weighting is an alternative to increasing test length as a means of improving the reliability

More information

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National

More information

Fundamental Clinical Trial Design

Fundamental Clinical Trial Design Design, Monitoring, and Analysis of Clinical Trials Session 1 Overview and Introduction Overview Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics, University of Washington February 17-19, 2003

More information

Likert Scaling: A how to do it guide As quoted from

Likert Scaling: A how to do it guide As quoted from Likert Scaling: A how to do it guide As quoted from www.drweedman.com/likert.doc Likert scaling is a process which relies heavily on computer processing of results and as a consequence is my favorite method

More information

A Comparison of Four Test Equating Methods

A Comparison of Four Test Equating Methods A Comparison of Four Test Equating Methods Report Prepared for the Education Quality and Accountability Office (EQAO) by Xiao Pang, Ph.D. Psychometrician, EQAO Ebby Madera, Ph.D. Psychometrician, EQAO

More information

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data A C T Research Report Series 87-12 A Comparison Study of the Unidimensional IRT Estimation of Compensatory and Noncompensatory Multidimensional Item Response Data Terry Ackerman September 1987 For additional

More information

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data Item Response Theory: Methods for the Analysis of Discrete Survey Response Data ICPSR Summer Workshop at the University of Michigan June 29, 2015 July 3, 2015 Presented by: Dr. Jonathan Templin Department

More information

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy

More information

Examining Factors Affecting Language Performance: A Comparison of Three Measurement Approaches

Examining Factors Affecting Language Performance: A Comparison of Three Measurement Approaches Pertanika J. Soc. Sci. & Hum. 21 (3): 1149-1162 (2013) SOCIAL SCIENCES & HUMANITIES Journal homepage: http://www.pertanika.upm.edu.my/ Examining Factors Affecting Language Performance: A Comparison of

More information

SESUG '98 Proceedings

SESUG '98 Proceedings Generating Item Responses Based on Multidimensional Item Response Theory Jeffrey D. Kromrey, Cynthia G. Parshall, Walter M. Chason, and Qing Yi University of South Florida ABSTRACT The purpose of this

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985)

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Confirmations and Contradictions Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Estimates of the Deterrent Effect of Capital Punishment: The Importance of the Researcher's Prior Beliefs Walter

More information

Smoking Social Motivations

Smoking Social Motivations Smoking Social Motivations A brief guide to the PROMIS Smoking Social Motivations instruments: ADULT PROMIS Item Bank v1.0 Smoking Social Motivations for All Smokers PROMIS Item Bank v1.0 Smoking Social

More information

Saville Consulting Wave Professional Styles Handbook

Saville Consulting Wave Professional Styles Handbook Saville Consulting Wave Professional Styles Handbook PART 4: TECHNICAL Chapter 19: Reliability This manual has been generated electronically. Saville Consulting do not guarantee that it has not been changed

More information

Applying the Minimax Principle to Sequential Mastery Testing

Applying the Minimax Principle to Sequential Mastery Testing Developments in Social Science Methodology Anuška Ferligoj and Andrej Mrvar (Editors) Metodološki zvezki, 18, Ljubljana: FDV, 2002 Applying the Minimax Principle to Sequential Mastery Testing Hans J. Vos

More information