BAYESIAN ITEM SELECTION CRITERIA FOR ADAPTIVE TESTING

WIM J. VAN DER LINDEN

UNIVERSITY OF TWENTE


PSYCHOMETRIKA, VOL. 63, NO. 2, JUNE 1998

Owen (1975) proposed an approximate empirical Bayes procedure for item selection in computerized adaptive testing (CAT). The procedure replaces the true posterior by a normal approximation with closed-form expressions for its first two moments. This approximation was necessary to minimize the computational complexity involved in a fully Bayesian approach, but it is no longer needed given the computational power currently available for adaptive testing. This paper suggests several item selection criteria for adaptive testing that are all based on the use of the true posterior. Some of the statistical properties of the ability estimator produced by these criteria are discussed and empirically characterized.

Key words: adaptive testing, item response theory, Bayesian statistics, item selection criteria.

Portions of this paper were presented at the annual meeting of the Psychometric Society, Minneapolis, Minnesota. The author is indebted to Wim M. M. Tielen for his computational support. Requests for reprints should be sent to W. J. van der Linden, Department of Educational Measurement and Data Analysis, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands. E-mail: vanderlinden@edte.utwente.nl

Introduction

Computerized adaptive testing (CAT) is based on the principle of selecting items to match the current estimate of the ability of the examinee. An important choice is how to translate this principle into a formal criterion of item selection implementable as a computer algorithm. Since the early days of adaptive testing, two item selection criteria have been popular: the maximum-information criterion and an approximate Bayesian criterion proposed by Owen (1975).

It is the purpose of this paper to introduce several new criteria for item selection in CAT which are all Bayesian in the sense that they are based on the true posterior distribution of the ability of the examinee. The criteria can be used as an alternative to Owen's criterion, which is based on a normal approximation to the posterior. The approximation was introduced at a time when the numerical complexity involved in a fully Bayesian approach was a practical problem. For the computers currently in use in CAT programs, however, this complexity is no longer a problem.

The paper is organized as follows: First, the maximum-information criterion and Owen's criterion are reviewed. Subsequently, several Bayesian criteria for item selection are introduced. Next, some statistical properties of the final ability estimators for an adaptive test based on these criteria are discussed. The last section of the paper presents results from a simulation study run to characterize the properties of these estimators empirically.

Model

The two-parameter logistic (2-PL) model will be used as the response model under which the items in the pool have satisfactory fit. However, the results obtained in this paper generalize easily to any (unidimensional) item response theory (IRT) model. To introduce the model, a random response variable $U_i$ is defined to denote a correct ($U_i = 1$) or an incorrect ($U_i = 0$) response to item $i$. The model is given by the following equation for the probability of success on item $i$ for an examinee with (fixed) ability $\theta \in (-\infty, \infty)$:

$$p_i(\theta) \equiv \mathrm{Prob}\{U_i = 1 \mid \theta\} = \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}. \qquad (1)$$

Location parameter $b_i \in (-\infty, \infty)$ and scale parameter $a_i \in [0, \infty)$ in this model are commonly interpreted as the difficulty and discriminating power of item $i$, respectively.

Maximum-Information Criterion

To present the maximum-information criterion, the following notation is needed. The items in the pool are denoted by $i = 1, \ldots, I$. For convenience, an adaptive test of fixed length $n$ will be assumed, but the presentation can easily be extended to CAT with a stopping rule based on a prespecified level of (posterior) precision of the ability estimator. For another argument for a fixed test length, see the discussion on constrained adaptive testing below. The rank of the items in the test is denoted by index $k = 1, \ldots, n$; it follows that $i_k$ is the index of the item in the pool administered as the $k$th item in the test. Suppose $k - 1$ items have already been selected. The indices of these items form the set $S_{k-1} \equiv \{i_1, \ldots, i_{k-1}\}$; the set of items remaining in the pool is denoted as $R_k \equiv \{1, \ldots, I\} \setminus S_{k-1}$.

For responses $U_{i_1} = u_{i_1}, \ldots, U_{i_{k-1}} = u_{i_{k-1}}$ obtained on the first $k - 1$ items, the likelihood function is

$$L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) \equiv \prod_{j=1}^{k-1} \frac{\{\exp[a_{i_j}(\theta - b_{i_j})]\}^{u_{i_j}}}{1 + \exp[a_{i_j}(\theta - b_{i_j})]}. \qquad (2)$$

A maximum-likelihood (ML) estimator of $\theta$ based on these responses is a maximizer of (2) over $\theta$, that is,

$$\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}} \equiv \arg\max_\theta \{L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) : \theta \in (-\infty, \infty)\}. \qquad (3)$$

Fisher's information about the unknown value of $\theta$ in the response variables associated with the $k - 1$ items is defined as

$$I_{U_{i_1}, \ldots, U_{i_{k-1}}}(\theta) \equiv -\mathrm{E}\left[\frac{\partial^2}{\partial\theta^2} \ln L(\theta \mid U_{i_1}, \ldots, U_{i_{k-1}})\right] = \sum_{j=1}^{k-1} \frac{[p'_{i_j}(\theta)]^2}{p_{i_j}(\theta)[1 - p_{i_j}(\theta)]}, \qquad (4)$$

where $p'_{i_j}(\theta) \equiv \frac{\partial}{\partial\theta} p_{i_j}(\theta)$ and the last step in (4) is a well-known result for the model in (1) (for a derivation see, for example, Hambleton & Swaminathan, 1985, sec. 6.3). Note that (4) is additive because of conditional independence of the response variables given $\theta$.

The maximum-information criterion common in adaptive testing selects the $k$th item such that maximum information is obtained at $\theta = \hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}}$, that is,

$$i_k = \arg\max_j \{I_{U_{i_1}, \ldots, U_{i_{k-1}}, U_j}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}}) : j \in R_k\}. \qquad (5)$$

Because the information measure is additive, the criterion is equivalent to

$$i_k = \arg\max_j \{I_{U_j}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}}) : j \in R_k\}. \qquad (6)$$
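To make the criterion concrete, the following minimal sketch (not part of the original paper) computes the 2-PL item information $I_j(\theta) = a_j^2\, p_j(\theta)[1 - p_j(\theta)]$, which follows from (4) because $p'_j(\theta) = a_j\, p_j(\theta)[1 - p_j(\theta)]$, and applies the criterion in (6); the function names and the example pool are illustrative assumptions.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Success probability p_i(theta) under the 2-PL model in (1)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one 2-PL item: a^2 * p * (1 - p), from (4)."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def select_max_info(theta_hat, a, b, remaining):
    """Criterion (6): the remaining item with maximum information at theta_hat."""
    infos = item_information(theta_hat, a[remaining], b[remaining])
    return remaining[int(np.argmax(infos))]

# Illustrative pool of 300 items, matching the parameter ranges used in the
# simulation section below.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.5, 300)
b = rng.uniform(-4.0, 4.0, 300)
remaining = np.arange(300)
print(select_max_info(0.0, a, b, remaining))  # index of the item to administer
```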

Owen's Criterion

As an alternative to the maximum-information criterion, Owen (1975) proposed an approximate empirical Bayes procedure for adaptive testing based on the following three-parameter normal-ogive model:

$$p_i(\theta) = c_i + (1 - c_i)\,\Phi[a_i(\theta - b_i)], \qquad (7)$$

where $\Phi(\cdot)$ is the normal distribution function and $c_i$ a lower asymptote to model the probability of guessing item $i$ correctly. For a vector of responses to the first $k - 1$ items, the likelihood function is given by (2) with the logistic factor replaced by the normal ogive. Assuming a prior $g(\theta)$, the following expression for the posterior distribution of $\theta$ after $k - 1$ items is obtained:

$$g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) = \frac{L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,g(\theta)}{\int L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,g(\theta)\,d\theta}. \qquad (8)$$

Owen's procedure uses (8) as an update for the posterior with the choice of a normal density for the prior $g(\theta)$. Item $k$ is chosen to satisfy

$$|b_{i_k} - \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})| < \delta \qquad (9)$$

for a small value of $\delta$, where $\mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$ is the expectation of $\theta$ over (8), now generally known as the expected a posteriori (EAP) estimator of $\theta$. The procedure is stopped as soon as the variance of (8) is smaller than a prespecified threshold value.

It should be noted that the likelihood in (8) does not have the normal family as its class of conjugate distributions. Therefore, if a normal prior is chosen, the posterior is not normal, and repeated updating of the posterior using (8) soon leads to a posterior that could not be calculated in real-time CAT by the computers available in the 1970s. Owen therefore proposed to replace the true posterior by a normal approximation with the same mean and variance, which is then used as the new prior in the update in (8). He also presented closed-form expressions for this mean and variance (see the Appendix). The approximation was motivated by showing that for mild bounds on $\delta$ the estimator $\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}} \equiv \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$ converges to the true value of $\theta$ in mean square as $k \to \infty$ (Owen, 1975, Theorem 2).

Owen also referred to the criterion of minimization of the preposterior risk under a quadratic loss function as an alternative to (9). This criterion selects the item that minimizes the expected posterior variance. Computationally it is more involved than the criterion in (9) with a normal approximation to the posterior, and for this reason the latter became widely popular as Owen's procedure of adaptive testing. An extensive simulation study of the statistical properties of Owen's procedure is reported in Weiss and McBride (1984). A generalization of the procedure to the case of multidimensional adaptive testing is discussed in Bloxom and Vale (1987). The criterion of minimum expected posterior variance will be returned to later in this paper.
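As a concrete illustration of the update in (8), the following sketch evaluates the posterior on a discrete grid of $\theta$ values under the normal-ogive model in (7) and computes the EAP estimate used in (9) together with the posterior variance; the grid, prior, and item parameters are illustrative assumptions, and Owen's actual procedure replaces this numerical update by the normal approximation in the Appendix.

```python
import numpy as np
from scipy.stats import norm

theta = np.linspace(-4.0, 4.0, 161)     # quadrature grid for theta
dtheta = theta[1] - theta[0]

def p_3pno(theta, a, b, c):
    """Success probability under the 3-parameter normal-ogive model in (7)."""
    return c + (1.0 - c) * norm.cdf(a * (theta - b))

def posterior_update(post, u, a, b, c):
    """Multiply in the likelihood of response u and renormalize, as in (8)."""
    p = p_3pno(theta, a, b, c)
    post = post * (p if u == 1 else 1.0 - p)
    return post / (post.sum() * dtheta)

post = norm.pdf(theta)                  # standard normal prior g(theta)
post = posterior_update(post / (post.sum() * dtheta), 1, a=1.2, b=0.5, c=0.2)
eap = (theta * post).sum() * dtheta     # EAP estimator used in (9)
var = ((theta - eap) ** 2 * post).sum() * dtheta
print(eap, var)
```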
Constrained Adaptive Testing

Before presenting several new criteria for item selection in CAT, an important impetus for the search for improved criteria is discussed. Most operational adaptive testing programs find it necessary to base item selection not only on statistical criteria but also to impose constraints on the item selection procedure to control such attributes as item content, frequency of previous item exposure, or expected response time for the examinees.
In addition, constraints may be necessary to deal with an item-set structure of the pool or to prevent items with clues to each other from being in the same test. In fact, the implicit ideal in these programs is adaptive tests meeting the same specifications, and hence demonstrating the same validity, as their fixed-format precursors but with a much shorter length. The number of constraints on the item selection procedure needed to realize this ideal can easily run into the hundreds. In addition, a fixed test length may be necessary to maintain the same set of specifications across examinees.

Constraining item selection in CAT generally has an unfavorable effect on the (posterior) precision of the ability estimator, in particular if CAT is used in the framework of a test battery with a large series of short tests (Brown & Weiss, 1977; Gialluca & Weiss, 1979). For such cases, knowledge of item selection criteria extracting maximum information from the items in the pool becomes an important goal. In addition, several strategies are possible to minimize the unfavorable effect of the constraints on the ability estimator (van der Linden & Reese, 1998). For example, it can be decided to administer a relatively less severely constrained section of the test first, so that the majority of the constraints is imposed when the ability estimator is already in the neighborhood of its true value. Alternatively, a short unconstrained adaptive pretest can be administered, whereupon the main test follows with the constraints reformulated to allow for the attributes of the items already selected. In either case, the item selection criterion should demonstrate its efficiency primarily for short tests.

Bayesian Criteria

A Bayesian approach to adaptive testing is loosely defined as any approach which uses a prior or posterior distribution to define rules for: (a) selecting the first item; (b) estimating $\theta$; (c) selecting the next item; or (d) stopping the test. According to this convention, the use of an informative prior to select the first item in a CAT procedure is thus an example of a Bayesian adaptive testing procedure (van der Linden, in press). Owen's procedure is Bayesian in that the item selection criterion in (9) is based on the mean of the (approximate) posterior and the posterior variance is used to stop the test. However, his procedure does not base item selection on the full posterior, which, in a Bayesian framework, is the best reflection of the uncertainty in the current ability estimate.

This section introduces several alternative criteria for item selection based on the full posterior. The first two criteria generalize the idea of maximum information in a Bayesian fashion. The next criterion is the one of minimum expected posterior variance also discussed in Owen (1975). The fourth criterion combines the ideas of posterior-weighted information and preposterior prediction underlying the first and second criteria, respectively. Finally, some other Bayesian procedures of item selection are alluded to.

Maximum Posterior-Weighted Information

The first criterion reformulates the maximum-information criterion in a Bayesian fashion by first choosing the appropriate information measure and then taking its expectation across the posterior distribution.
The main motivation for this criterion is the observation that, as long as the posterior distribution of the estimator has not yet converged to a single point, it may be suboptimal to select an item with maximum information at a point estimate, thereby ignoring values of the ability parameter with substantial likelihood in its neighborhood. This point will be discussed further below, but first the criterion is given in its mathematical form.

If the $k$th item is selected, responses to the first $k - 1$ items are already known. Hence, these data can no longer be represented by random variables but only by the (fixed) values of their realizations.

As a consequence, Fisher's information, defined as an expected value across random data, is no longer a valid measure. A typical Bayesian choice is to use the observed information measure

$$J_{u_{i_1}, \ldots, u_{i_{k-1}}}(\theta) \equiv -\frac{\partial^2}{\partial\theta^2} \ln L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}), \qquad (10)$$

which reflects the curvature of the observed likelihood function at the value of $\theta$ relative to the metric chosen for $\theta$. Though the distinction between the two information measures is important, for known item parameters the model in (1) defines an exponential family, which has the well-known property of the second derivative of the log-likelihood function being independent of the data (e.g., Andersen, 1980, sec. 3.3). As a consequence of this property, Fisher's information and the observed information measure are equal. A direct derivation of the equality for logistic IRT models is given in Veerkamp (1996). Thus, it holds that

$$J_{u_{i_1}, \ldots, u_{i_{k-1}}}(\theta) = I_{U_{i_1}, \ldots, U_{i_{k-1}}}(\theta). \qquad (11)$$

However, to retain generality, a notational distinction between the two information measures will be maintained.

The proposal is to select the next item to maximize the expected value of the observed information $J(\theta)$ over the posterior distribution of $\theta$. The index of the $k$th item according to this criterion is

$$i_k = \arg\max_j \left\{ \int J_{U_j}(\theta)\,g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,d\theta : j \in R_k \right\}, \qquad (12)$$

where $g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$ is the posterior update obtained from (8).

The same idea of not ignoring likely values of the ability parameter other than the one with maximum likelihood when the items are selected is present in the likelihood-weighted information criterion introduced in Luecht (1995) and Veerkamp and Berger (1997), in which the posterior in (12) is replaced by the likelihood function associated with the data. The latter authors also offer an interval information criterion defined as

$$\arg\max_j \left\{ \int_{\theta_L}^{\theta_R} I_j(\theta)\,d\theta : j \in R_k \right\},$$

where $\theta_L$ and $\theta_R$ are the end points of an interval to be chosen by the psychometrician. However, in the 2-PL model the integral is known to approach the value of the discrimination parameter $a_j$ for $\theta_L \to -\infty$ and $\theta_R \to \infty$, and item selection then becomes independent of the value of the item difficulty parameter $b_j$, a phenomenon contrary to the well-known importance of the difficulty parameter in CAT item selection. On the other hand, for $\theta_L, \theta_R \to \hat\theta$ the criterion approaches the maximum-information criterion in (5). Generally, the likelihood-weighted information criterion is an improvement over the interval information criterion, but for the first few items the likelihood function is still relatively flat and item selection may again capitalize on large values of $a_j$ independent of the value of $b_j$. The only way out seems to be to use the posterior distribution of the ability parameter for weighing the information, as is done in the criterion in (12). This criterion allows for incorporating data on background variables with a known statistical relation to $\theta$ in the posterior distribution, which is then not flat. The statistical methodology needed to estimate an empirical prior from background variables and an example from adaptive testing are given in van der Linden (in press).
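A minimal sketch of the criterion in (12), assuming a uniform grid representation of the posterior as in the earlier sketch and using the 2-PL identity $J_{U_j}(\theta) = I_{U_j}(\theta) = a_j^2\, p_j(\theta)[1 - p_j(\theta)]$ from (11); the helper names are hypothetical.

```python
import numpy as np

def posterior_weighted_info(theta, post, a, b):
    """Per-item value of the integral in (12), evaluated on a uniform grid."""
    dtheta = theta[1] - theta[0]
    p = 1.0 / (1.0 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))
    info = a[:, None] ** 2 * p * (1.0 - p)       # J_j(theta) = I_j(theta), by (11)
    return (info * post[None, :]).sum(axis=1) * dtheta

def select_posterior_weighted(theta, post, a, b, remaining):
    """Criterion (12): maximize posterior-weighted information over R_k."""
    values = posterior_weighted_info(theta, post, a[remaining], b[remaining])
    return remaining[int(np.argmax(values))]

# Illustrative use with a discretized N(0, 1) posterior and a tiny pool.
theta = np.linspace(-4.0, 4.0, 161)
post = np.exp(-theta ** 2 / 2.0)
post /= post.sum() * (theta[1] - theta[0])
a = np.array([0.8, 1.2, 1.5])
b = np.array([-1.0, 0.0, 2.0])
print(select_posterior_weighted(theta, post, a, b, np.arange(3)))
```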
Maximum Expected Information

The following criterion predicts the probability distribution of the responses of the examinee on each item $j \in R_k$ and selects the item with maximum expected information over this probability distribution.
More specifically, to maintain the Bayesian framework, for each item $j \in R_k$ the distribution of $U_j$ is taken to be the posterior predictive distribution given the responses to the earlier items. This distribution has probability function

$$p_j(U_j = u_j \mid u_{i_1}, \ldots, u_{i_{k-1}}) = \int p_j(U_j = u_j \mid \theta)\,g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,d\theta. \qquad (13)$$

However, the ability parameter is still estimated as a point estimate, $\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}}$, derived from the posterior distribution. The estimate can be calculated, for example, using the method of maximum a posteriori (MAP) or expected a posteriori (EAP) estimation (for an alternative, see the criterion in (16) below). If item $j$ were chosen and response $U_j = 0$ obtained, the new point estimate of $\theta$ would be $\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}$ and the observed information would be equal to $J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0})$. Analogously, if $U_j = 1$, the new estimate would become $\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}$ and the observed information would be equal to $J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1})$. The maximum expected information criterion selects as the $k$th item

$$i_k = \arg\max_j \{ p_j(U_j = 0 \mid u_{i_1}, \ldots, u_{i_{k-1}})\, J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}) + p_j(U_j = 1 \mid u_{i_1}, \ldots, u_{i_{k-1}})\, J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}) : j \in R_k \}. \qquad (14)$$

Note that the criterion in (14) not only evaluates the observed information measure associated with $U_j = 0$ and $U_j = 1$ but that re-evaluation of the measure for $u_{i_1}, \ldots, u_{i_{k-1}}$ is also implied. Further, since its two terms are evaluated at different values of $\theta$, the expression in (14), though an expected value, is not an instance of Fisher's information defined in (4).

Minimum Expected Posterior Variance

If the information measures in (14) are replaced by the posterior variances of $\theta$ for $U_j = 0$ and $U_j = 1$, the following criterion is obtained:

$$i_k = \arg\min_j \{ p_j(U_j = 0 \mid u_{i_1}, \ldots, u_{i_{k-1}})\, \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_j = 0) + p_j(U_j = 1 \mid u_{i_1}, \ldots, u_{i_{k-1}})\, \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_j = 1) : j \in R_k \}. \qquad (15)$$

Though the use of information measures for item selection is a well-established practice in IRT, the reciprocal of the information measure is only a large-sample approximation to the true variance of the posterior. Therefore, the criterion in (15) is a small-sample alternative to the one in (14). As already noted, the same criterion was proposed as an alternative to (9) in Owen (1975). Reviews of the criterion can be found, for example, in Thissen and Mislevy (1990) and Weiss (1982).
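The preposterior logic shared by (14) and (15) can be sketched as follows, again on a grid: for each candidate item, both updated posteriors are formed and the target quantity is averaged under the posterior predictive probabilities in (13). The sketch implements (15); evaluating the updated observed information at the updated point estimates instead gives (14). The names and the 2-PL response function are assumptions for illustration.

```python
import numpy as np

def expected_posterior_variance(theta, post, a_j, b_j):
    """Expected posterior variance of theta for one candidate item, as in (15)."""
    dtheta = theta[1] - theta[0]
    p = 1.0 / (1.0 + np.exp(-a_j * (theta - b_j)))   # 2-PL response function
    p1 = (p * post).sum() * dtheta                   # predictive Prob{U_j = 1}, eq. (13)
    post1 = p * post
    post1 = post1 / (post1.sum() * dtheta)           # posterior given a correct response
    post0 = (1.0 - p) * post
    post0 = post0 / (post0.sum() * dtheta)           # posterior given an incorrect response
    def var(g):
        m = (theta * g).sum() * dtheta
        return ((theta - m) ** 2 * g).sum() * dtheta
    return (1.0 - p1) * var(post0) + p1 * var(post1)

def select_min_expected_variance(theta, post, a, b, remaining):
    """Criterion (15): minimize expected posterior variance over R_k."""
    values = [expected_posterior_variance(theta, post, a[j], b[j]) for j in remaining]
    return remaining[int(np.argmin(values))]

# Illustrative use with a discretized N(0, 1) posterior and a tiny pool.
theta = np.linspace(-4.0, 4.0, 161)
post = np.exp(-theta ** 2 / 2.0)
post /= post.sum() * (theta[1] - theta[0])
a = np.array([0.8, 1.2, 1.5])
b = np.array([-1.0, 0.0, 2.0])
print(select_min_expected_variance(theta, post, a, b, np.arange(3)))
```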
Maximum Expected Posterior-Weighted Information

In the criterion in (14), observed information is predicted for wrong and correct responses and the expectation is taken over these predictions. However, rather than evaluating observed information at predicted point estimates of the ability parameter, its expectation over the full predicted posteriors could be used. Let $g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_j = 0)$ and $g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_j = 1)$ be the posteriors of $\theta$ after a wrong and a correct response to item $j$, respectively. The proposal is to select as the $k$th item

$$i_k = \arg\max_j \Big\{ p_j(U_j = 0 \mid u_{i_1}, \ldots, u_{i_{k-1}}) \int J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}(\theta)\, g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, u_j = 0)\, d\theta + p_j(U_j = 1 \mid u_{i_1}, \ldots, u_{i_{k-1}}) \int J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}(\theta)\, g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, u_j = 1)\, d\theta : j \in R_k \Big\}. \qquad (16)$$

Note that this criterion combines the ideas underlying the criteria in (12) and (14): Like (12), observed information is weighted by a posterior density, but at the same time the criterion shares the idea of preposterior prediction with (14).
Additional Criteria

The above criteria do not constitute an exhaustive set of posterior-based criteria for item selection. For example, it is an easy step to generalize the criteria in (14) through (16) to predictions two or more items ahead. For larger item pools the combinatorial complexity of such criteria would quickly exceed the possibility of application to real-time CAT, but for small pools the idea seems attractive. Chang and Ying (1996) propose to replace Fisher's information by the Kullback-Leibler measure; the same substitution could easily be made in the criteria in (12), (14), and (16). Finally, maximum posterior variance between the groups who score the item correctly and incorrectly was proposed as an item selection criterion by Wainer, Lewis, Kaplan, and Braswell (1991) (for an empirical comparison with the maximum-information criterion, see Schnipke & Green, 1995).

Large-Sample Equivalence

As is well known in Bayesian statistics, for $k \to \infty$ the posteriors $g(\theta \mid u_{i_1}, \ldots, U_{i_k} = 0)$ and $g(\theta \mid u_{i_1}, \ldots, U_{i_k} = 1)$ converge to a common (degenerate) distribution at the true ability value. Consequently, the posterior-based estimates $\hat\theta_{u_{i_1}, \ldots, u_{i_k}=0}$ and $\hat\theta_{u_{i_1}, \ldots, u_{i_k}=1}$ tend to the same value. Therefore, asymptotically the criteria in (12), (14), and (16) lead to the selection of the same items as the maximum-information criterion in (6). A comparable statement can be made for the criterion in (15), since both posterior variances converge to the reciprocal of Fisher's information. Thus, the choice of a Bayesian item selection criterion is not expected to lead to improved ability estimation in adaptive testing for long tests.

Statistical Properties of Criteria

For conventional linear tests, "inward bias" of the EAP and MAP estimators of $\theta$ is a typical Bayesian result. More precisely, if the prior is centered at $\theta = 0$, estimators of positive values of $\theta$ are negatively biased, whereas estimators of negative values show a positive bias. However, this bias is usually offset by a favorable mean-squared error, in particular if the prior is informative. Of course, the precise magnitudes of these effects depend on the choice of the prior and the length of the test.

For the MLE of $\theta$, the opposite holds. These estimators are "outwardly biased" and typically have a larger mean-squared error than Bayesian estimators.

For the logistic model in (1), Lord (1983) and Samejima (1993) derive the following approximation to the bias function of the MLE:

$$B(\hat\theta) \equiv \mathrm{E}(\hat\theta - \theta) \approx I(\theta)^{-2} \sum_{i=1}^{n} a_i I_i(\theta)\,[p_i(\theta) - .5], \qquad (17)$$

which is accurate to the order $n^{-1}$. As $I(\theta)^{-2}$ is always positive, the term $a_i I_i(\theta)[p_i(\theta) - .5]$ in (17) allows for an evaluation of the sign of the contribution of an individual item to the bias of the MLE. For an item with $\theta = b_i$, it holds that $p_i(\theta) = .5$, and the corresponding term in (17) vanishes. However, if $\theta > b_i$, a positive contribution to the bias is obtained, whereas the opposite occurs for $\theta < b_i$. As a consequence, for a test with all items located at the same value, the MLE of $\theta$ is always "biased away from where the items are."

For a conventional linear test, the bias function of the chosen ability estimator is fixed by design. In an adaptive test, however, the choice of the next item matches the current ability estimate, and, as a consequence, the bias added to the final estimator by the item depends on the bias already present in the current estimator. For an adaptive test with ML estimation of ability and item selection according to the criterion of maximum information in (6), this dependency creates a perfect negative feedback mechanism, as is shown by the following argument. For the response model in (1), the maximum-information criterion in (6) implies

$$b_{i_k} = \hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}}. \qquad (18)$$

Therefore, it follows from (17) that

$$\mathrm{E}(\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}} - \theta) > 0 \;\Rightarrow\; \mathrm{E}(b_{i_k} - \theta) > 0 \;\Rightarrow\; \mathrm{E}(p_{i_k}(\theta) - .5) < 0 \;\Rightarrow\; \mathrm{E}(\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_k}} - \theta) < \mathrm{E}(\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}} - \theta), \qquad (19)$$

whereas the opposite occurs for $\mathrm{E}(\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}} - \theta) < 0$. (In the second and third steps, the expectations are taken over random selection of the $k$th item.) It can be concluded that each time the current estimator of $\theta$ has a negative or positive error, the next item is automatically selected to counteract the error.

As is well known, for a flat prior a Bayesian procedure with the maximum a posteriori estimator is identical to ML estimation. If the posterior is symmetric, the same holds for the EAP estimator. For an informative prior, the behavior can nicely be illustrated using Owen's equations (Appendix, Equations A3-A6). Suppose the prior is located at $\theta = 0$ and the difficulty of the first item has the same value, but the examinee has ability $\theta > 0$. The probability of a correct response is then larger than the probability of an incorrect response, and Equation A4 shows that the EAP estimator tends to increase by a positive amount, which is smaller the more informative the prior (i.e., the smaller the variance in this equation). For ability $\theta < 0$, Equation A3 shows that the same holds in the opposite direction. Thus, for a noninformative prior the negative feedback mechanism above can be expected to hold (provided the posterior is symmetric), but as the prior becomes more informative the process changes and the value of the estimator can be expected to move gradually from the location of the prior to the true value of the ability parameter. If the prior is strong relative to the length of the test, large bias in the final estimator can be expected, unless the prior is located exactly at the ability of the examinee. However, larger bias does not imply a larger MSE, since it may be offset by a smaller variance of the estimator due to the information in the prior.
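The approximation in (17) is easily evaluated numerically. The following sketch (illustrative, not from the paper) computes it for a hypothetical test with all items located at $b_i = 0$, showing the outward bias of the MLE.

```python
import numpy as np

def mle_bias_approx(theta, a, b):
    """Bias approximation (17) for the MLE on a fixed 2-PL test."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    info = a ** 2 * p * (1.0 - p)            # item informations I_i(theta)
    return (a * info * (p - 0.5)).sum() / info.sum() ** 2

a = np.full(20, 1.0)                         # hypothetical test: 20 identical items
b = np.zeros(20)                             # all located at b_i = 0
for th in (-1.0, 0.0, 1.0):
    print(th, round(mle_bias_approx(th, a, b), 4))  # bias points away from b = 0
```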

Whether or not a Bayesian adaptive procedure actually outperforms one with maximum-information item selection and ML estimation of ability depends on the choice of such quantities as the prior, the initial item, and the length of the test. Further, for an actual item pool, both procedures may be hampered differently by the fact that the item parameters are not spread densely enough in certain ability intervals. Therefore, to get a more quantitative evaluation of the statistical properties of the Bayesian item selection criteria relative to the maximum-information criterion, a simulation study was run. The results for the item selection criteria in this paper complement earlier results reported for Bayesian and maximum-likelihood ability estimation for conventional linear tests (Kim & Nicewander, 1993; Warm, 1989) and adaptive tests (Warm, 1989).

Empirical Results

The item pool consisted of 300 items with response functions following the two-parameter logistic model in (1) and values for their item parameters drawn from $a_i \sim U(.5, 1.5)$ and $b_i \sim U(-4.0, 4.0)$. The following procedures were compared:

1. maximum information, with $\theta \sim U(-4.0, 4.0)$ as prior;
2. maximum posterior-weighted information, with $N(\beta x, \sigma^2_{\theta\mid x})$ as prior;
3. maximum expected information, with $N(\beta x, \sigma^2_{\theta\mid x})$ as prior;
4. minimum expected posterior variance, with $N(\beta x, \sigma^2_{\theta\mid x})$ as prior;
5. maximum expected posterior-weighted information, with $N(\beta x, \sigma^2_{\theta\mid x})$ as prior,

where $x$ is assumed to be a background variable with known regression on $\theta$. Estimates of the regression parameters can be used to define empirical priors for the abilities of the examinees (van der Linden, in press). The standard setup consisted of $\beta x = .5$ and $\sigma^2_{\theta\mid x} = .87$ for the last four procedures, MAP estimation of $\theta$ for all point estimates during the test, and ML estimation for the final estimates of $\theta$. The choice of ML estimation for the final estimator of $\theta$ for all procedures was motivated by the fact that background information can be used to improve the choice of the initial item but is generally not accepted as a source of data for ability estimation. Note that the combination of MAP estimation and a uniform prior for $\theta$ in the first procedure yields an MLE for $\theta$.

The bias and mean-squared error (MSE) functions of the final estimator of $\theta$ were studied as a function of test length ($n$ = 5, 10, 20, 30). Estimates of the functions were calculated for $\theta = -4.0(.5)4.0$ on the basis of 25 replications of the procedures for each value. The CPU times for the four Bayesian criteria to select an item on a PC with a Pentium 133 processor were all on the order of seconds. The results are given in Figure 1.

For all test lengths the maximum-information criterion had the worst MSE function, followed by the maximum posterior-weighted information criterion. The best MSE functions were obtained for the maximum expected information, minimum expected posterior variance, and maximum expected posterior-weighted information criteria. The differences among the MSE functions for these three criteria were always negligible. Even after 30 items, the MSE functions for the first two and the last three criteria still differed by a factor equal to 2. The critical feature seems to be the use of the posterior predictive distribution of the item responses given in (13). Using the posterior only for weighing observed information, as in the second criterion in (12), is not enough.
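For readers who want to reproduce the general design, a compressed, self-contained sketch of a single simulated adaptive test is given below. It selects items by the posterior-weighted information criterion in (12) on a grid posterior; the seed, test length, true ability, and the final EAP estimate printed at the end are illustrative simplifications rather than the exact study settings (the study used ML for the final estimate).

```python
import numpy as np

rng = np.random.default_rng(1)                   # arbitrary seed
a = rng.uniform(0.5, 1.5, 300)                   # pool as described above
b = rng.uniform(-4.0, 4.0, 300)
theta_true, n = 1.0, 20                          # one simulee, test length 20
grid = np.linspace(-4.0, 4.0, 161)
dg = grid[1] - grid[0]
post = np.exp(-(grid - 0.5) ** 2 / (2 * 0.87))   # prior N(.5, .87), per the setup
post /= post.sum() * dg
remaining = list(range(300))

for k in range(n):
    # Criterion (12): posterior-weighted information, computed pool-wide.
    p = 1.0 / (1.0 + np.exp(-a[:, None] * (grid[None, :] - b[:, None])))
    crit = ((a[:, None] ** 2 * p * (1 - p)) * post[None, :]).sum(axis=1) * dg
    j = max(remaining, key=lambda i: crit[i])    # best item still in the pool
    p_true = 1.0 / (1.0 + np.exp(-a[j] * (theta_true - b[j])))
    u = rng.random() < p_true                    # simulate the response
    lik = 1.0 / (1.0 + np.exp(-a[j] * (grid - b[j])))
    post *= lik if u else (1.0 - lik)
    post /= post.sum() * dg                      # posterior update (8)
    remaining.remove(j)

print((grid * post).sum() * dg)                  # final EAP estimate of theta
```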
A comparison between the criteria in (14) and (16) shows that, once the posterior is used in the predictive distribution of the item responses, using it for additional weighing of observed information does not yield any further increase in efficiency. Also, in spite of the fact that the observed information in (14) can be motivated only by its large-sample equivalence to the posterior variance in (15), after 10 items no differences in efficiency between the two criteria were found.

[Figure 1(a), n = 5: Estimated bias and MSE functions for five item selection criteria (Maximum Information: solid; Maximum Posterior-Weighted Information: dashed; Maximum Expected Information: dotted; Minimum Expected Posterior Variance: bold solid; Maximum Expected Posterior-Weighted Information: dashed-dotted).]

Finally, observe that for all criteria the MSE functions were U-shaped after 5 items but became flat after 10 items.

[Figure 1(b), n = 10: Estimated bias and MSE functions for five item selection criteria (line types as in Figure 1(a)).]

The bias functions of all five criteria showed excellent results after 20 items; these functions were all flat at a bias value of 0. The same result was obtained for the last three criteria after 10 items. Again, these three criteria were the ones based on the posterior predictive distribution of the item responses.

[Figure 1(c), n = 20: Estimated bias and MSE functions for five item selection criteria (line types as in Figure 1(a)).]

Finally, observe that all criteria showed inward bias at 5 and 10 items but that the bias for the first two criteria (the maximum-information and maximum posterior-weighted information criteria) was strongest.

[Figure 1(d), n = 30: Estimated bias and MSE functions for five item selection criteria (line types as in Figure 1(a)).]

Also, the direction of the bias for the maximum-information criterion is opposite to the one for a conventional linear test. This reversal is assumed to be the result of the negative feedback mechanism discussed earlier.

Discussion

Application of item selection criteria for adaptive testing based on the true posterior distribution of the ability parameter is no longer hampered by the costs of computing. The results in this paper show that for short tests these criteria are clearly superior to the maximum-information criterion with ML estimation of the ability parameter. In addition, the criteria based on the posterior predictive distribution of the item responses showed a substantial further improvement, both in terms of their mean-squared error and their bias. These last criteria had uniformly better efficiency over the whole range of $\theta$ values studied after 10 items and remained superior for longer tests. These findings suggest that these criteria should be the candidates for application in adaptive tests of any length.

The model used in this paper was the 2-PL model in (1). However, several operational CAT programs use the 3-PL model, with a guessing parameter, rather than the 2-PL model. As is well known, the effect of the presence of a guessing parameter is a reduction of information at the lower tail of the information function and therefore an increase in (posterior) variance (Hambleton & Swaminathan, 1985, chap. 6). Typically, four-choice items are used and the guessing parameter has values in the neighborhood of .25. In such cases, this parameter can be expected to have an effect on the (posterior) precision of the ability estimator for the less able examinees. However, since the variation in the values of the guessing parameter is mostly small, its presence is not expected to produce large differential effects on the item selection criteria in this paper. Empirical research is needed to verify this expectation.

Appendix

Owen's (1975) equations for updating the mean and variance of a (normalized) posterior are not easily found in the CAT literature. Using the notation in this paper, let $\Delta_{i_k}$ and $\Psi_{i_k}$ be defined as

$$\Delta_{i_k} \equiv \frac{b_{i_k} - \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})}{[a_{i_k}^{-2} + \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})]^{1/2}} \qquad (A1)$$

and

$$\Psi_{i_k} \equiv c_{i_k} + (1 - c_{i_k})\,\Phi(-\Delta_{i_k}). \qquad (A2)$$

Note that $\Delta_{i_k}$ can be interpreted as a standardized difference between the difficulty of item $i_k$ and the value of the EAP estimator based on the previous $k - 1$ items. If the posterior $g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$ is normal, then, for the 3-parameter normal-ogive model in (7), its expectation, $\mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$, and variance, $\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$, have updates for $U_{i_k} = 0$ and $U_{i_k} = 1$ that can be written as

$$\mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_{i_k} = 0) = \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) - \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,[a_{i_k}^{-2} + \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})]^{-1/2}\,\frac{\phi(\Delta_{i_k})}{\Phi(\Delta_{i_k})}; \qquad (A3)$$

$$\mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_{i_k} = 1) = \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) + (1 - c_{i_k})\,\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,[a_{i_k}^{-2} + \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})]^{-1/2}\,\frac{\phi(\Delta_{i_k})}{\Psi_{i_k}}; \qquad (A4)$$

$$\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_{i_k} = 0) = \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) \left\{ 1 - \left(1 + \frac{a_{i_k}^{-2}}{\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})}\right)^{-1} \frac{\phi(\Delta_{i_k})}{\Phi(\Delta_{i_k})} \left[\frac{\phi(\Delta_{i_k})}{\Phi(\Delta_{i_k})} + \Delta_{i_k}\right] \right\}; \qquad (A5)$$

$$\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_{i_k} = 1) = \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) \left\{ 1 - \left(1 + \frac{a_{i_k}^{-2}}{\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})}\right)^{-1} \frac{(1 - c_{i_k})\,\phi(\Delta_{i_k})}{\Psi_{i_k}} \left[\frac{(1 - c_{i_k})\,\phi(\Delta_{i_k})}{\Psi_{i_k}} - \Delta_{i_k}\right] \right\}, \qquad (A6)$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the normal density and distribution function, respectively.

If the item pool is dense enough, $\Delta_{i_k}$ can be set to zero and $\Psi_{i_k}$ takes the value $.5(1 + c_{i_k})$, whereupon the equations simplify drastically. Note that the first two equations show that the next EAP estimate is equal to the previous estimate plus a correction which is negative for a wrong ($U_{i_k} = 0$) and positive for a correct response ($U_{i_k} = 1$). The size of the correction depends on the variance of the estimator. The last two equations show that at each update the posterior variance decreases, but that the decrease for a correct response is smaller due to the presence of the factor $1 - c_{i_k}$, which accounts for the probability of guessing the correct response under the 3-parameter normal-ogive model.
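A direct transcription of (A1)-(A6) into code may be useful; the following sketch assumes the symbol choices above, and the function name is hypothetical rather than Owen's original implementation.

```python
from math import sqrt
from scipy.stats import norm

def owen_update(mean, var, a, b, c, u):
    """Updated (mean, variance) of the normal posterior approximation, (A1)-(A6)."""
    s = sqrt(a ** -2 + var)
    delta = (b - mean) / s                       # Delta in (A1)
    shrink = 1.0 / (1.0 + a ** -2 / var)         # (1 + a^-2/Var)^-1 in (A5)-(A6)
    if u == 0:                                   # incorrect response: (A3) and (A5)
        r = norm.pdf(delta) / norm.cdf(delta)
        return mean - var / s * r, var * (1.0 - shrink * r * (r + delta))
    psi = c + (1.0 - c) * norm.cdf(-delta)       # Psi in (A2)
    r = (1.0 - c) * norm.pdf(delta) / psi        # correct response: (A4) and (A6)
    return mean + var / s * r, var * (1.0 - shrink * r * (r - delta))

# Example: standard normal prior, correct answer on an item at the prior mean.
print(owen_update(0.0, 1.0, a=1.0, b=0.0, c=0.2, u=1))
```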
References

Andersen, E. B. (1980). Discrete statistical models with social science applications. Amsterdam: North-Holland.

Bloxom, B., & Vale, C. D. (1987, June). Multidimensional adaptive testing: An approximate procedure for updating. Paper presented at the annual meeting of the Psychometric Society, Montreal, Canada.

Brown, J. M., & Weiss, D. J. (1977). An adaptive testing strategy for achievement test batteries (Research Report 77-6). Minneapolis, MN: Psychometrics Program, Department of Psychology, University of Minnesota.

Chang, H.-H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20.

Gialluca, K. A., & Weiss, D. J. (1979). Efficiency of an adaptive inter-subtest branching strategy in the measurement of classroom achievement (Research Report 79-6). Minneapolis, MN: Psychometrics Program, Department of Psychology, University of Minnesota.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Kim, J. K., & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58.

Lord, F. M. (1983). Unbiased estimators of ability parameters, of their variance, and of their parallel-forms reliability. Psychometrika, 48.

Luecht, R. M. (1995). Some alternative CAT item selection heuristics (Internal report). Philadelphia, PA: National Board of Medical Examiners.

Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70.

Samejima, F. (1993). The bias function of the maximum likelihood estimate of ability for the dichotomous response level. Psychometrika, 58.
Schnipke, D. L., & Green, B. F. (1995). A comparison of item selection routines in linear and adaptive tests. Journal of Educational Measurement, 32.

Thissen, D., & Mislevy, R. J. (1990). In H. Wainer (Ed.), Computerized adaptive testing: A primer. Hillsdale, NJ: Erlbaum.

van der Linden, W. J. (in press). Empirical initialization of the ability estimator in adaptive testing algorithms. Applied Psychological Measurement.

van der Linden, W. J., & Reese, L. M. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22.

Veerkamp, W. J. J. (1996). Statistical inference for adaptive testing (Internal report). Enschede, The Netherlands: University of Twente, Department of Educational Measurement and Data Analysis.

Veerkamp, W. J. J., & Berger, M. P. F. (1997). Some new item selection criteria for adaptive testing. Journal of Educational and Behavioral Statistics, 22.

Wainer, H., Lewis, C., Kaplan, B., & Braswell, J. (1991). Building algebra testlets: A comparison of hierarchical and linear structures. Journal of Educational Measurement, 28.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory with tests of finite length. Psychometrika, 54.

Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6.

Weiss, D. J., & McBride, J. R. (1984). Bias and information of Bayesian adaptive testing. Applied Psychological Measurement, 8.

Manuscript received 9/9/96
Final version received 8/1/97


More information

Remarks on Bayesian Control Charts

Remarks on Bayesian Control Charts Remarks on Bayesian Control Charts Amir Ahmadi-Javid * and Mohsen Ebadi Department of Industrial Engineering, Amirkabir University of Technology, Tehran, Iran * Corresponding author; email address: ahmadi_javid@aut.ac.ir

More information

Modeling Randomness in Judging Rating Scales with a Random-Effects Rating Scale Model

Modeling Randomness in Judging Rating Scales with a Random-Effects Rating Scale Model Journal of Educational Measurement Winter 2006, Vol. 43, No. 4, pp. 335 353 Modeling Randomness in Judging Rating Scales with a Random-Effects Rating Scale Model Wen-Chung Wang National Chung Cheng University,

More information

Introduction to Bayesian Analysis 1

Introduction to Bayesian Analysis 1 Biostats VHM 801/802 Courses Fall 2005, Atlantic Veterinary College, PEI Henrik Stryhn Introduction to Bayesian Analysis 1 Little known outside the statistical science, there exist two different approaches

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

A Race Model of Perceptual Forced Choice Reaction Time

A Race Model of Perceptual Forced Choice Reaction Time A Race Model of Perceptual Forced Choice Reaction Time David E. Huber (dhuber@psyc.umd.edu) Department of Psychology, 1147 Biology/Psychology Building College Park, MD 2742 USA Denis Cousineau (Denis.Cousineau@UMontreal.CA)

More information

ST440/550: Applied Bayesian Statistics. (10) Frequentist Properties of Bayesian Methods

ST440/550: Applied Bayesian Statistics. (10) Frequentist Properties of Bayesian Methods (10) Frequentist Properties of Bayesian Methods Calibrated Bayes So far we have discussed Bayesian methods as being separate from the frequentist approach However, in many cases methods with frequentist

More information

Classical Psychophysical Methods (cont.)

Classical Psychophysical Methods (cont.) Classical Psychophysical Methods (cont.) 1 Outline Method of Adjustment Method of Limits Method of Constant Stimuli Probit Analysis 2 Method of Constant Stimuli A set of equally spaced levels of the stimulus

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Impact of Violation of the Missing-at-Random Assumption on Full-Information Maximum Likelihood Method in Multidimensional Adaptive Testing

Impact of Violation of the Missing-at-Random Assumption on Full-Information Maximum Likelihood Method in Multidimensional Adaptive Testing A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

During the past century, mathematics

During the past century, mathematics An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Centre for Education Research and Policy

Centre for Education Research and Policy THE EFFECT OF SAMPLE SIZE ON ITEM PARAMETER ESTIMATION FOR THE PARTIAL CREDIT MODEL ABSTRACT Item Response Theory (IRT) models have been widely used to analyse test data and develop IRT-based tests. An

More information

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Journal of Educational Measurement Summer 2010, Vol. 47, No. 2, pp. 227 249 Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Jimmy de la Torre and Yuan Hong

More information

Rasch Versus Birnbaum: New Arguments in an Old Debate

Rasch Versus Birnbaum: New Arguments in an Old Debate White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow?

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Journal of Educational and Behavioral Statistics Fall 2006, Vol. 31, No. 3, pp. 241 259 An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Michael C. Edwards The Ohio

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

Computerized Adaptive Testing With the Bifactor Model

Computerized Adaptive Testing With the Bifactor Model Computerized Adaptive Testing With the Bifactor Model David J. Weiss University of Minnesota and Robert D. Gibbons Center for Health Statistics University of Illinois at Chicago Presented at the New CAT

More information

Binomial Test Models and Item Difficulty

Binomial Test Models and Item Difficulty Binomial Test Models and Item Difficulty Wim J. van der Linden Twente University of Technology In choosing a binomial test model, it is important to know exactly what conditions are imposed on item difficulty.

More information

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 10: Introduction to inference (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 17 What is inference? 2 / 17 Where did our data come from? Recall our sample is: Y, the vector

More information

Full-Information Item Factor Analysis: Applications of EAP Scores

Full-Information Item Factor Analysis: Applications of EAP Scores Full-Information Item Factor Analysis: Applications of EAP Scores Eiji Muraki National Opinion Research Center George Engelhard, Jr. Emory University The full-information item factor analysis model proposed

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 26 for Mixed Format Tests Kyong Hee Chon Won-Chan Lee Timothy N. Ansley November 2007 The authors are grateful to

More information

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch.

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch. S05-2008 Imputation of Categorical Missing Data: A comparison of Multivariate Normal and Abstract Multinomial Methods Holmes Finch Matt Margraf Ball State University Procedures for the imputation of missing

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Journal of Social and Development Sciences Vol. 4, No. 4, pp. 93-97, Apr 203 (ISSN 222-52) Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Henry De-Graft Acquah University

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

Exploiting Auxiliary Infornlation

Exploiting Auxiliary Infornlation Exploiting Auxiliary Infornlation About Examinees in the Estimation of Item Parameters Robert J. Mislovy Educationai Testing Service estimates can be The precision of item parameter increased by taking

More information

properties of these alternative techniques have not

properties of these alternative techniques have not Methodology Review: Item Parameter Estimation Under the One-, Two-, and Three-Parameter Logistic Models Frank B. Baker University of Wisconsin This paper surveys the techniques used in item response theory

More information

María Verónica Santelices 1 and Mark Wilson 2

María Verónica Santelices 1 and Mark Wilson 2 On the Relationship Between Differential Item Functioning and Item Difficulty: An Issue of Methods? Item Response Theory Approach to Differential Item Functioning Educational and Psychological Measurement

More information

A Race Model of Perceptual Forced Choice Reaction Time

A Race Model of Perceptual Forced Choice Reaction Time A Race Model of Perceptual Forced Choice Reaction Time David E. Huber (dhuber@psych.colorado.edu) Department of Psychology, 1147 Biology/Psychology Building College Park, MD 2742 USA Denis Cousineau (Denis.Cousineau@UMontreal.CA)

More information

Section 5. Field Test Analyses

Section 5. Field Test Analyses Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken

More information

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017) DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;

More information

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests University of Massachusetts - Amherst ScholarWorks@UMass Amherst Dissertations 2-2012 An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests Wai Yan

More information

On the purpose of testing:

On the purpose of testing: Why Evaluation & Assessment is Important Feedback to students Feedback to teachers Information to parents Information for selection and certification Information for accountability Incentives to increase

More information

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Jonny B. Pornel, Vicente T. Balinas and Giabelle A. Saldaña University of the Philippines Visayas This paper proposes that

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

Combining Risks from Several Tumors Using Markov Chain Monte Carlo

Combining Risks from Several Tumors Using Markov Chain Monte Carlo University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln U.S. Environmental Protection Agency Papers U.S. Environmental Protection Agency 2009 Combining Risks from Several Tumors

More information