BAYESIAN ITEM SELECTION CRITERIA FOR ADAPTIVE TESTING

WIM J. VAN DER LINDEN

UNIVERSITY OF TWENTE


PSYCHOMETRIKA, VOL. 63, NO. 2, JUNE 1998

Owen (1975) proposed an approximate empirical Bayes procedure for item selection in computerized adaptive testing (CAT). The procedure replaces the true posterior by a normal approximation with closed-form expressions for its first two moments. This approximation was necessary to minimize the computational complexity involved in a fully Bayesian approach, but it is no longer needed given the computational power currently available for adaptive testing. This paper suggests several item selection criteria for adaptive testing that are all based on the use of the true posterior. Some of the statistical properties of the ability estimator produced by these criteria are discussed and empirically characterized.

Key words: adaptive testing, item response theory, Bayesian statistics, item selection criteria.

Portions of this paper were presented at the annual meeting of the Psychometric Society, Minneapolis, Minnesota. The author is indebted to Wim M. M. Tielen for his computational support. Requests for reprints should be sent to W. J. van der Linden, Department of Educational Measurement and Data Analysis, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands. E-mail: vanderlinden@edte.utwente.nl

Introduction

Computerized adaptive testing (CAT) is based on the principle of selecting items to match the current estimate of the ability of the examinee. An important choice is how to translate this principle into a formal criterion of item selection implementable as a computer algorithm. Since the early days of adaptive testing, two item selection criteria have been popular: the maximum-information criterion and an approximate Bayesian criterion proposed by Owen (1975).

It is the purpose of this paper to introduce several new criteria for item selection in CAT which are all Bayesian in the sense that they are based on the true posterior distribution of the ability of the examinee. The criteria can be used as an alternative to Owen's criterion, which is based on a normal approximation to the posterior. The approximation was introduced at a time when the numerical complexity involved in a fully Bayesian approach was a practical problem. For the computers currently in use in CAT programs, however, this complexity is no longer a problem.

The paper is organized as follows: First, the maximum-information criterion and Owen's criterion are reviewed. Subsequently, several Bayesian criteria for item selection are introduced. Next, some statistical properties of the final ability estimators for an adaptive test based on these criteria are discussed. The last section of the paper presents results from a simulation study run to characterize the properties of these estimators empirically.

Model

The two-parameter logistic (2-PL) model will be used as the response model under which the items in the pool have satisfactory fit. However, the results obtained in this paper generalize easily to any (unidimensional) item response theory (IRT) model. To introduce the model, a random response variable $U_i$ is defined to denote a correct ($U_i = 1$) or an incorrect ($U_i = 0$) response to item $i$. The model is given by the following equation for the probability of success on item $i$ for an examinee with (fixed) ability $\theta \in (-\infty, \infty)$:

$$p_i(\theta) \equiv \mathrm{Prob}\{U_i = 1 \mid \theta\} = \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}. \qquad (1)$$

Location parameter $b_i \in (-\infty, \infty)$ and scale parameter $a_i \in [0, \infty)$ in this model are commonly interpreted as the difficulty and discriminating power of item $i$, respectively.

Maximum-Information Criterion

To present the maximum-information criterion, the following notation is needed. The items in the pool are denoted by $i = 1, \ldots, I$. For convenience, an adaptive test of fixed length $n$ will be assumed, but the presentation can easily be extended to CAT with a stopping rule based on a prespecified level of (posterior) precision of the ability estimator. For another argument for a fixed test length, see the discussion on constrained adaptive testing below. The rank of the items in the test is denoted by index $k = 1, \ldots, n$; it follows that $i_k$ is the index of the item in the pool administered as the $k$th item in the test. Suppose $k - 1$ items have already been selected. The indices of these items form the set $S_{k-1} \equiv \{i_1, \ldots, i_{k-1}\}$; the set of items remaining in the pool is denoted as $R_k \equiv \{1, \ldots, I\} \setminus S_{k-1}$.

For responses $U_{i_1} = u_{i_1}, \ldots, U_{i_{k-1}} = u_{i_{k-1}}$ obtained on the first $k - 1$ items, the likelihood function is

$$L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) \equiv \prod_{j=1}^{k-1} \frac{\{\exp[a_{i_j}(\theta - b_{i_j})]\}^{u_{i_j}}}{1 + \exp[a_{i_j}(\theta - b_{i_j})]}. \qquad (2)$$

A maximum-likelihood (ML) estimator of $\theta$ based on these responses is a maximizer of (2) over $\theta$, that is,

$$\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}} \equiv \arg\max_\theta \{L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) : \theta \in (-\infty, \infty)\}. \qquad (3)$$

Fisher's information about the unknown value of $\theta$ in the response variables associated with the $k - 1$ items is defined as

$$I_{U_{i_1}, \ldots, U_{i_{k-1}}}(\theta) \equiv -\mathrm{E}\left[\frac{\partial^2}{\partial\theta^2} \ln L(\theta \mid U_{i_1}, \ldots, U_{i_{k-1}})\right] = \sum_{j=1}^{k-1} \frac{[p'_{i_j}(\theta)]^2}{p_{i_j}(\theta)[1 - p_{i_j}(\theta)]}, \qquad (4)$$

where $p'_{i_j}(\theta) \equiv \frac{\partial}{\partial\theta} p_{i_j}(\theta)$ and the last step in (4) is a well-known result for the model in (1) (for a derivation see, for example, Hambleton & Swaminathan, 1985, sec. 6.3). Note that (4) is additive because of conditional independence of the response variables given $\theta$.

The maximum-information criterion common in adaptive testing selects the $k$th item such that maximum information is obtained at $\theta = \hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}}$, that is,

$$i_k = \arg\max_j \{I_{U_{i_1}, \ldots, U_{i_{k-1}}, U_j}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}}) : j \in R_k\}. \qquad (5)$$

Because the information measure is additive, the criterion is equivalent to

$$i_k = \arg\max_j \{I_{U_j}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}}) : j \in R_k\}. \qquad (6)$$
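To make the criterion concrete, the following minimal sketch (not part of the original paper) computes the 2-PL item information $I_j(\theta) = a_j^2\, p_j(\theta)[1 - p_j(\theta)]$, which follows from (4) because $p'_j(\theta) = a_j\, p_j(\theta)[1 - p_j(\theta)]$, and applies the criterion in (6); the function names and the example pool are illustrative assumptions.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Success probability p_i(theta) under the 2-PL model in (1)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one 2-PL item: a^2 * p * (1 - p), from (4)."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def select_max_info(theta_hat, a, b, remaining):
    """Criterion (6): the remaining item with maximum information at theta_hat."""
    infos = item_information(theta_hat, a[remaining], b[remaining])
    return remaining[int(np.argmax(infos))]

# Illustrative pool of 300 items, matching the parameter ranges used in the
# simulation section below.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.5, 300)
b = rng.uniform(-4.0, 4.0, 300)
remaining = np.arange(300)
print(select_max_info(0.0, a, b, remaining))  # index of the item to administer
```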

Owen's Criterion

As an alternative to the maximum-information criterion, Owen (1975) proposed an approximate empirical Bayes procedure for adaptive testing based on the following three-parameter normal-ogive model:

$$p_i(\theta) = c_i + (1 - c_i)\,\Phi[a_i(\theta - b_i)], \qquad (7)$$

where $\Phi(\cdot)$ is the normal distribution function and $c_i$ a lower asymptote to model the probability of guessing item $i$ correctly. For a vector of responses to the first $k - 1$ items, the likelihood function is given by (2) with the logistic factor replaced by the normal ogive. Assuming a prior $g(\theta)$, the following expression for the posterior distribution of $\theta$ after $k - 1$ items is obtained:

$$g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) = \frac{L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,g(\theta)}{\int L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,g(\theta)\,d\theta}. \qquad (8)$$

Owen's procedure uses (8) as an update for the posterior with the choice of a normal density for the prior $g(\theta)$. Item $k$ is chosen to satisfy

$$|b_{i_k} - \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})| < \delta \qquad (9)$$

for a small value of $\delta$, where $\mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$ is the expectation of $\theta$ over (8), now generally known as the expected a posteriori (EAP) estimator of $\theta$. The procedure is stopped as soon as the variance of (8) is smaller than a prespecified threshold value.

It should be noted that the likelihood in (8) does not have the normal family as its class of conjugate distributions. Therefore, if a normal prior is chosen, the posterior is not normal, and repeated updating of the posterior using (8) soon leads to a posterior that could not be calculated in real-time CAT by the computers available in the 1970s. Owen therefore proposed to replace the true posterior by a normal approximation with the same mean and variance, which is then used as the new prior in the update in (8). He also presented closed-form expressions for this mean and variance (see the Appendix). The approximation was motivated by showing that for mild bounds on $\delta$ the estimator $\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}} \equiv \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$ converges to the true value of $\theta$ in mean square as $k \to \infty$ (Owen, 1975, Theorem 2).

Owen also referred to the criterion of minimization of the preposterior risk under a quadratic loss function as an alternative to (9). This criterion selects the item that minimizes the expected posterior variance. Computationally it is more involved than the criterion in (9) with a normal approximation to the posterior, and for this reason the latter became widely popular as Owen's procedure of adaptive testing. An extensive simulation study of the statistical properties of Owen's procedure is reported in Weiss and McBride (1984). A generalization of the procedure to the case of multidimensional adaptive testing is discussed in Bloxom and Vale (1987). The criterion of minimum expected posterior variance will be returned to later in this paper.
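As a concrete illustration of the update in (8), the following sketch evaluates the posterior on a discrete grid of $\theta$ values under the normal-ogive model in (7) and computes the EAP estimate used in (9) together with the posterior variance; the grid, prior, and item parameters are illustrative assumptions, and Owen's actual procedure replaces this numerical update by the normal approximation in the Appendix.

```python
import numpy as np
from scipy.stats import norm

theta = np.linspace(-4.0, 4.0, 161)     # quadrature grid for theta
dtheta = theta[1] - theta[0]

def p_3pno(theta, a, b, c):
    """Success probability under the 3-parameter normal-ogive model in (7)."""
    return c + (1.0 - c) * norm.cdf(a * (theta - b))

def posterior_update(post, u, a, b, c):
    """Multiply in the likelihood of response u and renormalize, as in (8)."""
    p = p_3pno(theta, a, b, c)
    post = post * (p if u == 1 else 1.0 - p)
    return post / (post.sum() * dtheta)

post = norm.pdf(theta)                  # standard normal prior g(theta)
post = posterior_update(post / (post.sum() * dtheta), 1, a=1.2, b=0.5, c=0.2)
eap = (theta * post).sum() * dtheta     # EAP estimator used in (9)
var = ((theta - eap) ** 2 * post).sum() * dtheta
print(eap, var)
```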
Constrained Adaptive Testing

Before presenting several new criteria for item selection in CAT, an important impetus for the search for improved criteria is discussed. Most operational adaptive testing programs find it necessary to base item selection not only on statistical criteria but also to impose constraints on the item selection procedure to control such attributes as item content, frequency of previous item exposure, or expected response time for the examinees.
In addition, constraints may be necessary to deal with an item-set structure of the pool or to prevent items with clues to each other from being in the same test. In fact, the implicit ideal in these programs is adaptive tests meeting the same specifications, and hence demonstrating the same validity, as their fixed-format precursors but with a much shorter length. The number of constraints on the item selection procedure needed to realize this ideal can easily run into the hundreds. In addition, a fixed test length may be necessary to maintain the same set of specifications across examinees.

Constraining item selection in CAT generally has an unfavorable effect on the (posterior) precision of the ability estimator, in particular if CAT is used in the framework of a test battery with a large series of short tests (Brown & Weiss, 1977; Gialluca & Weiss, 1979). For such cases, knowledge of item selection criteria extracting maximum information from the items in the pool becomes an important goal. In addition, several strategies are possible to minimize the unfavorable effect of the constraints on the ability estimator (van der Linden & Reese, 1998). For example, it can be decided to administer a relatively less severely constrained section of the test first, so that the majority of the constraints is imposed when the ability estimator is already in the neighborhood of its true value. Alternatively, a short unconstrained adaptive pretest can be administered, whereupon the main test follows with the constraints reformulated to allow for the attributes of the items already selected. In either case, the item selection criterion should demonstrate its efficiency primarily for short tests.

Bayesian Criteria

A Bayesian approach to adaptive testing is loosely defined as any approach which uses a prior or posterior distribution to define rules for: (a) selecting the first item; (b) estimating $\theta$; (c) selecting the next item; or (d) stopping the test. According to this convention, the use of an informative prior to select the first item in a CAT procedure is thus an example of a Bayesian adaptive testing procedure (van der Linden, in press). Owen's procedure is Bayesian in that the item selection criterion in (9) is based on the mean of the (approximate) posterior and the posterior variance is used to stop the test. However, his procedure does not base item selection on the full posterior, which, in a Bayesian framework, is the best reflection of the uncertainty in the current ability estimate.

This section introduces several alternative criteria for item selection based on the full posterior. The first two criteria generalize the idea of maximum information in a Bayesian fashion. The next criterion is the one of minimum expected posterior variance also discussed in Owen (1975). The fourth criterion combines the ideas of posterior-weighted information and preposterior prediction underlying the first and second criteria, respectively. Finally, some other Bayesian procedures of item selection are alluded to.

Maximum Posterior-Weighted Information

The first criterion reformulates the maximum-information criterion in a Bayesian fashion by first choosing the appropriate information measure and then taking its expectation across the posterior distribution.
The main motivation for this criterion is the observation that, as long as the posterior distribution of the estimator has not yet converged to a single point, it may be suboptimal to select an item with maximum information at a point estimate, thereby ignoring values of the ability parameter with substantial likelihood in its neighborhood. This point will be discussed further below, but first the criterion is given in its mathematical form.

If the $k$th item is selected, responses to the first $k - 1$ items are already known. Hence, these data can no longer be represented by random variables but only by the (fixed) values of their realizations.

As a consequence, Fisher's information, defined as an expected value across random data, is no longer a valid measure. A typical Bayesian choice is to use the observed information measure

$$J_{u_{i_1}, \ldots, u_{i_{k-1}}}(\theta) \equiv -\frac{\partial^2}{\partial\theta^2} \ln L(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}), \qquad (10)$$

which reflects the curvature of the observed likelihood function at the value of $\theta$ relative to the metric chosen for $\theta$. Though the distinction between the two information measures is important, for known item parameters the model in (1) defines an exponential family, which has the well-known property of the second derivative of the log-likelihood function being independent of the data (e.g., Andersen, 1980, sec. 3.3). As a consequence of this property, Fisher's information and the observed information measure are equal. A direct derivation of the equality for logistic IRT models is given in Veerkamp (1996). Thus, it holds that

$$J_{u_{i_1}, \ldots, u_{i_{k-1}}}(\theta) = I_{U_{i_1}, \ldots, U_{i_{k-1}}}(\theta). \qquad (11)$$

However, to retain generality, a notational distinction between the two information measures will be maintained.

The proposal is to select the next item to maximize the expected value of the observed information $J(\theta)$ over the posterior distribution of $\theta$. The index of the $k$th item according to this criterion is

$$i_k = \arg\max_j \left\{ \int J_{U_j}(\theta)\,g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,d\theta : j \in R_k \right\}, \qquad (12)$$

where $g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$ is the posterior update obtained from (8).

The same idea of not ignoring likely values of the ability parameter other than the one with maximum likelihood when the items are selected is present in the likelihood-weighted information criterion introduced in Luecht (1995) and Veerkamp and Berger (1997), in which the posterior in (12) is replaced by the likelihood function associated with the data. The latter authors also offer an interval information criterion defined as

$$\arg\max_j \left\{ \int_{\theta_L}^{\theta_R} I_j(\theta)\,d\theta : j \in R_k \right\},$$

where $\theta_L$ and $\theta_R$ are the end points of an interval to be chosen by the psychometrician. However, in the 2-PL model the integral is known to approach the value of the discrimination parameter $a_j$ for $\theta_L \to -\infty$ and $\theta_R \to \infty$, and item selection then becomes independent of the value of the item difficulty parameter $b_j$, a phenomenon contrary to the well-known importance of the difficulty parameter in CAT item selection. On the other hand, for $\theta_L, \theta_R \to \hat\theta$ the criterion approaches the maximum-information criterion in (5). Generally, the likelihood-weighted information criterion is an improvement over the interval information criterion, but for the first few items the likelihood function is still relatively flat and item selection may again capitalize on large values of $a_j$ independent of the value of $b_j$. The only way out seems to be to use the posterior distribution of the ability parameter for weighing the information, as is done in the criterion in (12). This criterion allows for incorporating data on background variables with a known statistical relation to $\theta$ in the posterior distribution, which is then not flat. The statistical methodology needed to estimate an empirical prior from background variables and an example from adaptive testing are given in van der Linden (in press).
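A minimal sketch of the criterion in (12), assuming a uniform grid representation of the posterior as in the earlier sketch and using the 2-PL identity $J_{U_j}(\theta) = I_{U_j}(\theta) = a_j^2\, p_j(\theta)[1 - p_j(\theta)]$ from (11); the helper names are hypothetical.

```python
import numpy as np

def posterior_weighted_info(theta, post, a, b):
    """Per-item value of the integral in (12), evaluated on a uniform grid."""
    dtheta = theta[1] - theta[0]
    p = 1.0 / (1.0 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))
    info = a[:, None] ** 2 * p * (1.0 - p)       # J_j(theta) = I_j(theta), by (11)
    return (info * post[None, :]).sum(axis=1) * dtheta

def select_posterior_weighted(theta, post, a, b, remaining):
    """Criterion (12): maximize posterior-weighted information over R_k."""
    values = posterior_weighted_info(theta, post, a[remaining], b[remaining])
    return remaining[int(np.argmax(values))]

# Illustrative use with a discretized N(0, 1) posterior and a tiny pool.
theta = np.linspace(-4.0, 4.0, 161)
post = np.exp(-theta ** 2 / 2.0)
post /= post.sum() * (theta[1] - theta[0])
a = np.array([0.8, 1.2, 1.5])
b = np.array([-1.0, 0.0, 2.0])
print(select_posterior_weighted(theta, post, a, b, np.arange(3)))
```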
Maximum Expected Information

The following criterion predicts the probability distribution of the responses of the examinee on each item $j \in R_k$ and selects the item with maximum expected information over this probability distribution.
More specifically, to maintain the Bayesian framework, for each item $j \in R_k$ the distribution of $U_j$ is taken to be the posterior predictive distribution given the responses to the earlier items. This distribution has probability function

$$p_j(U_j = u_j \mid u_{i_1}, \ldots, u_{i_{k-1}}) = \int p_j(U_j = u_j \mid \theta)\,g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,d\theta. \qquad (13)$$

However, the ability parameter is still estimated as a point estimate, $\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}}$, derived from the posterior distribution. The estimate can be calculated, for example, using the method of maximum a posteriori (MAP) or expected a posteriori (EAP) estimation (for an alternative, see the criterion in (16) below). If item $j$ were chosen and response $U_j = 0$ obtained, the new point estimate of $\theta$ would be $\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}$ and the observed information would be equal to $J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0})$. Analogously, if $U_j = 1$, the new estimate would become $\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}$ and the observed information would be equal to $J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1})$. The maximum expected information criterion selects as the $k$th item

$$i_k = \arg\max_j \{ p_j(U_j = 0 \mid u_{i_1}, \ldots, u_{i_{k-1}})\, J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}) + p_j(U_j = 1 \mid u_{i_1}, \ldots, u_{i_{k-1}})\, J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}(\hat\theta_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}) : j \in R_k \}. \qquad (14)$$

Note that the criterion in (14) not only evaluates the observed information measure associated with $U_j = 0$ and $U_j = 1$ but that re-evaluation of the measure for $u_{i_1}, \ldots, u_{i_{k-1}}$ is also implied. Further, since its two terms are evaluated at different values of $\theta$, the expression in (14), though an expected value, is not an instance of Fisher's information defined in (4).

Minimum Expected Posterior Variance

If the information measures in (14) are replaced by the posterior variances of $\theta$ for $U_j = 0$ and $U_j = 1$, the following criterion is obtained:

$$i_k = \arg\min_j \{ p_j(U_j = 0 \mid u_{i_1}, \ldots, u_{i_{k-1}})\, \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_j = 0) + p_j(U_j = 1 \mid u_{i_1}, \ldots, u_{i_{k-1}})\, \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_j = 1) : j \in R_k \}. \qquad (15)$$

Though the use of information measures for item selection is a well-established practice in IRT, the reciprocal of the information measure is only a large-sample approximation to the true variance of the posterior. Therefore, the criterion in (15) is a small-sample alternative to the one in (14). As already noted, the same criterion was proposed as an alternative to (9) in Owen (1975). Reviews of the criterion can be found, for example, in Thissen and Mislevy (1990) and Weiss (1982).
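The preposterior logic shared by (14) and (15) can be sketched as follows, again on a grid: for each candidate item, both updated posteriors are formed and the target quantity is averaged under the posterior predictive probabilities in (13). The sketch implements (15); evaluating the updated observed information at the updated point estimates instead gives (14). The names and the 2-PL response function are assumptions for illustration.

```python
import numpy as np

def expected_posterior_variance(theta, post, a_j, b_j):
    """Expected posterior variance of theta for one candidate item, as in (15)."""
    dtheta = theta[1] - theta[0]
    p = 1.0 / (1.0 + np.exp(-a_j * (theta - b_j)))   # 2-PL response function
    p1 = (p * post).sum() * dtheta                   # predictive Prob{U_j = 1}, eq. (13)
    post1 = p * post
    post1 = post1 / (post1.sum() * dtheta)           # posterior given a correct response
    post0 = (1.0 - p) * post
    post0 = post0 / (post0.sum() * dtheta)           # posterior given an incorrect response
    def var(g):
        m = (theta * g).sum() * dtheta
        return ((theta - m) ** 2 * g).sum() * dtheta
    return (1.0 - p1) * var(post0) + p1 * var(post1)

def select_min_expected_variance(theta, post, a, b, remaining):
    """Criterion (15): minimize expected posterior variance over R_k."""
    values = [expected_posterior_variance(theta, post, a[j], b[j]) for j in remaining]
    return remaining[int(np.argmin(values))]

# Illustrative use with a discretized N(0, 1) posterior and a tiny pool.
theta = np.linspace(-4.0, 4.0, 161)
post = np.exp(-theta ** 2 / 2.0)
post /= post.sum() * (theta[1] - theta[0])
a = np.array([0.8, 1.2, 1.5])
b = np.array([-1.0, 0.0, 2.0])
print(select_min_expected_variance(theta, post, a, b, np.arange(3)))
```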
Maximum Expected Posterior-Weighted Information

In the criterion in (14), observed information is predicted for wrong and correct responses and the expectation is taken over these predictions. However, rather than evaluating observed information at predicted point estimates of the ability parameter, its expectation over the full predicted posteriors could be used. Let $g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_j = 0)$ and $g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_j = 1)$ be the posteriors of $\theta$ after a wrong and a correct response to item $j$, respectively. The proposal is to select as the $k$th item

$$i_k = \arg\max_j \Big\{ p_j(U_j = 0 \mid u_{i_1}, \ldots, u_{i_{k-1}}) \int J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=0}(\theta)\, g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, u_j = 0)\, d\theta + p_j(U_j = 1 \mid u_{i_1}, \ldots, u_{i_{k-1}}) \int J_{u_{i_1}, \ldots, u_{i_{k-1}}, u_j=1}(\theta)\, g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, u_j = 1)\, d\theta : j \in R_k \Big\}. \qquad (16)$$

Note that this criterion combines the ideas underlying the criteria in (12) and (14): Like (12), observed information is weighted by a posterior density, but at the same time the criterion shares the idea of preposterior prediction with (14).
Additional Criteria

The above criteria do not constitute an exhaustive set of posterior-based criteria for item selection. For example, it is an easy step to generalize the criteria in (14) through (16) to predictions two or more items ahead. For larger item pools the combinatorial complexity of such criteria would quickly exceed the possibility of application to real-time CAT, but for small pools the idea seems attractive. Chang and Ying (1996) propose to replace Fisher's information by the Kullback-Leibler measure; the same substitution could easily be made in the criteria in (12), (14), and (16). Finally, maximum posterior variance between the groups who score the item correctly and incorrectly was proposed as an item selection criterion by Wainer, Lewis, Kaplan, and Braswell (1991) (for an empirical comparison with the maximum-information criterion, see Schnipke & Green, 1995).

Large-Sample Equivalence

As is well known in Bayesian statistics, for $k \to \infty$ the posteriors $g(\theta \mid u_{i_1}, \ldots, U_{i_k} = 0)$ and $g(\theta \mid u_{i_1}, \ldots, U_{i_k} = 1)$ converge to a common (degenerate) distribution at the true ability value. Consequently, the posterior-based estimates $\hat\theta_{u_{i_1}, \ldots, u_{i_k}=0}$ and $\hat\theta_{u_{i_1}, \ldots, u_{i_k}=1}$ tend to the same value. Therefore, asymptotically the criteria in (12), (14), and (16) lead to the selection of the same items as the maximum-information criterion in (6). A comparable statement can be made for the criterion in (15), since both posterior variances converge to the reciprocal of Fisher's information. Thus, the choice of a Bayesian item selection criterion is not expected to lead to improved ability estimation in adaptive testing for long tests.

Statistical Properties of Criteria

For conventional linear tests, "inward bias" of the EAP and MAP estimators of $\theta$ is a typical Bayesian result. More precisely, if the prior is centered at $\theta = 0$, estimators of positive values of $\theta$ are negatively biased, whereas estimators of negative values show a positive bias. However, this bias is usually offset by a favorable mean-squared error, in particular if the prior is informative. Of course, the precise magnitudes of these effects depend on the choice of the prior and the length of the test.

For the MLE of $\theta$, the opposite holds. These estimators are "outwardly biased" and typically have a larger mean-squared error than Bayesian estimators.

For the logistic model in (1), Lord (1983) and Samejima (1993) derive the following approximation to the bias function of the MLE:

$$B(\hat\theta) \equiv \mathrm{E}(\hat\theta - \theta) \approx I(\theta)^{-2} \sum_{i=1}^{n} a_i I_i(\theta)\,[p_i(\theta) - .5], \qquad (17)$$

which is accurate to the order $n^{-1}$. As $I(\theta)^{-2}$ is always positive, the term $a_i I_i(\theta)[p_i(\theta) - .5]$ in (17) allows for an evaluation of the sign of the contribution of an individual item to the bias of the MLE. For an item with $\theta = b_i$, it holds that $p_i(\theta) = .5$, and the corresponding term in (17) vanishes. However, if $\theta > b_i$, a positive contribution to the bias is obtained, whereas the opposite occurs for $\theta < b_i$. As a consequence, for a test with all items located at the same value, the MLE of $\theta$ is always "biased away from where the items are."

For a conventional linear test, the bias function of the chosen ability estimator is fixed by design. In an adaptive test, however, the choice of the next item matches the current ability estimate, and, as a consequence, the bias added to the final estimator by the item depends on the bias already present in the current estimator. For an adaptive test with ML estimation of ability and item selection according to the criterion of maximum information in (6), this dependency creates a perfect negative feedback mechanism, as is shown by the following argument. For the response model in (1), the maximum-information criterion in (6) implies

$$b_{i_k} = \hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}}. \qquad (18)$$

Therefore, it follows from (17) that

$$\mathrm{E}(\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}} - \theta) > 0 \;\Rightarrow\; \mathrm{E}(b_{i_k} - \theta) > 0 \;\Rightarrow\; \mathrm{E}(p_{i_k}(\theta) - .5) < 0 \;\Rightarrow\; \mathrm{E}(\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_k}} - \theta) < \mathrm{E}(\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}} - \theta), \qquad (19)$$

whereas the opposite occurs for $\mathrm{E}(\hat\theta^{\,\mathrm{ML}}_{u_{i_1}, \ldots, u_{i_{k-1}}} - \theta) < 0$. (In the second and third steps, the expectations are taken over random selection of the $k$th item.) It can be concluded that each time the current estimator of $\theta$ has a negative or positive error, the next item is automatically selected to counteract the error.

As is well known, for a flat prior a Bayesian procedure with the maximum a posteriori estimator is identical to ML estimation. If the posterior is symmetric, the same holds for the EAP estimator. For an informative prior, the behavior can nicely be illustrated using Owen's equations (Appendix, Equations A3-A6). Suppose the prior is located at $\theta = 0$ and the difficulty of the first item has the same value, but the examinee has ability $\theta > 0$. The probability of a correct response is then larger than the probability of an incorrect response, and Equation A4 shows that the EAP estimator tends to increase by a positive amount, which is smaller the more informative the prior (i.e., the smaller the variance in this equation). For ability $\theta < 0$, Equation A3 shows that the same holds in the opposite direction. Thus, for a noninformative prior the negative feedback mechanism above can be expected to hold (provided the posterior is symmetric), but as the prior becomes more informative the process changes and the value of the estimator can be expected to move gradually from the location of the prior to the true value of the ability parameter. If the prior is strong relative to the length of the test, large bias in the final estimator can be expected, unless the prior is located exactly at the ability of the examinee. However, larger bias does not imply a larger MSE, since it may be offset by a smaller variance of the estimator due to the information in the prior.
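The approximation in (17) is easily evaluated numerically. The following sketch (illustrative, not from the paper) computes it for a hypothetical test with all items located at $b_i = 0$, showing the outward bias of the MLE.

```python
import numpy as np

def mle_bias_approx(theta, a, b):
    """Bias approximation (17) for the MLE on a fixed 2-PL test."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    info = a ** 2 * p * (1.0 - p)            # item informations I_i(theta)
    return (a * info * (p - 0.5)).sum() / info.sum() ** 2

a = np.full(20, 1.0)                         # hypothetical test: 20 identical items
b = np.zeros(20)                             # all located at b_i = 0
for th in (-1.0, 0.0, 1.0):
    print(th, round(mle_bias_approx(th, a, b), 4))  # bias points away from b = 0
```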

Whether or not a Bayesian adaptive procedure actually outperforms one with maximum-information item selection and ML estimation of ability depends on the choice of such quantities as the prior, the initial item, and the length of the test. Further, for an actual item pool, both procedures may be hampered differently by the fact that the item parameters are not spread densely enough in certain ability intervals. Therefore, to get a more quantitative evaluation of the statistical properties of the Bayesian item selection criteria relative to the maximum-information criterion, a simulation study was run. The results for the item selection criteria in this paper complement earlier results reported for Bayesian and maximum-likelihood ability estimation for conventional linear tests (Kim & Nicewander, 1993; Warm, 1989) and adaptive tests (Warm, 1989).

Empirical Results

The item pool consisted of 300 items with response functions following the two-parameter logistic model in (1) and values for their item parameters drawn from $a_i \sim U(.5, 1.5)$ and $b_i \sim U(-4.0, 4.0)$. The following procedures were compared:

1. maximum information, with $\theta \sim U(-4.0, 4.0)$ as prior;
2. maximum posterior-weighted information, with $N(\beta x, \sigma^2_{\theta\mid x})$ as prior;
3. maximum expected information, with $N(\beta x, \sigma^2_{\theta\mid x})$ as prior;
4. minimum expected posterior variance, with $N(\beta x, \sigma^2_{\theta\mid x})$ as prior;
5. maximum expected posterior-weighted information, with $N(\beta x, \sigma^2_{\theta\mid x})$ as prior,

where $x$ is assumed to be a background variable with known regression on $\theta$. Estimates of the regression parameters can be used to define empirical priors for the abilities of the examinees (van der Linden, in press). The standard setup consisted of $\beta x = .5$ and $\sigma^2_{\theta\mid x} = .87$ for the last four procedures, MAP estimation of $\theta$ for all point estimates during the test, and ML estimation for the final estimates of $\theta$. The choice of ML estimation for the final estimator of $\theta$ for all procedures was motivated by the fact that background information can be used to improve the choice of the initial item but is generally not accepted as a source of data for ability estimation. Note that the combination of MAP estimation and a uniform prior for $\theta$ in the first procedure yields an MLE for $\theta$.

The bias and mean-squared error (MSE) functions of the final estimator of $\theta$ were studied as a function of test length ($n$ = 5, 10, 20, 30). Estimates of the functions were calculated for $\theta = -4.0(.5)4.0$ on the basis of 25 replications of the procedures for each value. The CPU times for the four Bayesian criteria to select an item on a PC with a Pentium 133 processor were all on the order of seconds. The results are given in Figure 1.

For all test lengths the maximum-information criterion had the worst MSE function, followed by the maximum posterior-weighted information criterion. The best MSE functions were obtained for the maximum expected information, minimum expected posterior variance, and maximum expected posterior-weighted information criteria. The differences among the MSE functions for these three criteria were always negligible. Even after 30 items, the MSE functions for the first two and the last three criteria still differed by a factor equal to 2. The critical feature seems to be the use of the posterior predictive distribution of the item responses given in (13). Using the posterior only for weighing observed information, as in the second criterion in (12), is not enough.
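For readers who want to reproduce the general design, a compressed, self-contained sketch of a single simulated adaptive test is given below. It selects items by the posterior-weighted information criterion in (12) on a grid posterior; the seed, test length, true ability, and the final EAP estimate printed at the end are illustrative simplifications rather than the exact study settings (the study used ML for the final estimate).

```python
import numpy as np

rng = np.random.default_rng(1)                   # arbitrary seed
a = rng.uniform(0.5, 1.5, 300)                   # pool as described above
b = rng.uniform(-4.0, 4.0, 300)
theta_true, n = 1.0, 20                          # one simulee, test length 20
grid = np.linspace(-4.0, 4.0, 161)
dg = grid[1] - grid[0]
post = np.exp(-(grid - 0.5) ** 2 / (2 * 0.87))   # prior N(.5, .87), per the setup
post /= post.sum() * dg
remaining = list(range(300))

for k in range(n):
    # Criterion (12): posterior-weighted information, computed pool-wide.
    p = 1.0 / (1.0 + np.exp(-a[:, None] * (grid[None, :] - b[:, None])))
    crit = ((a[:, None] ** 2 * p * (1 - p)) * post[None, :]).sum(axis=1) * dg
    j = max(remaining, key=lambda i: crit[i])    # best item still in the pool
    p_true = 1.0 / (1.0 + np.exp(-a[j] * (theta_true - b[j])))
    u = rng.random() < p_true                    # simulate the response
    lik = 1.0 / (1.0 + np.exp(-a[j] * (grid - b[j])))
    post *= lik if u else (1.0 - lik)
    post /= post.sum() * dg                      # posterior update (8)
    remaining.remove(j)

print((grid * post).sum() * dg)                  # final EAP estimate of theta
```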
A comparison between the criteria in (14) and (16) shows that, once the posterior is used in the predictive distribution of the item responses, using it for additional weighing of observed information does not yield any further increase in efficiency. Also, in spite of the fact that the observed information in (14) can be motivated only by its large-sample equivalence to the posterior variance in (15), after 10 items no differences in efficiency between the two criteria were found.

[Figure 1(a), n = 5: Estimated bias and MSE functions for five item selection criteria (Maximum Information: solid; Maximum Posterior-Weighted Information: dashed; Maximum Expected Information: dotted; Minimum Expected Posterior Variance: bold solid; Maximum Expected Posterior-Weighted Information: dashed-dotted).]

Finally, observe that for all criteria the MSE functions were U-shaped after 5 items but became flat after 10 items.

[Figure 1(b), n = 10: Estimated bias and MSE functions for five item selection criteria (line types as in Figure 1(a)).]

The bias functions of all five criteria showed excellent results after 20 items; these functions were all flat at a bias value of 0. The same result was obtained for the last three criteria after 10 items. Again, these three criteria were the ones based on the posterior predictive distribution of the item responses.

[Figure 1(c), n = 20: Estimated bias and MSE functions for five item selection criteria (line types as in Figure 1(a)).]

Finally, observe that all criteria showed inward bias at 5 and 10 items but that the bias for the first two criteria (the maximum-information and maximum posterior-weighted information criteria) was strongest.

[Figure 1(d), n = 30: Estimated bias and MSE functions for five item selection criteria (line types as in Figure 1(a)).]

Also, the direction of the bias for the maximum-information criterion is opposite to the one for a conventional linear test. This reversal is assumed to be the result of the negative feedback mechanism discussed earlier.

Discussion

Application of item selection criteria for adaptive testing based on the true posterior distribution of the ability parameter is no longer hampered by the costs of computing. The results in this paper show that for short tests these criteria are clearly superior to the maximum-information criterion with ML estimation of the ability parameter. In addition, the criteria based on the posterior predictive distribution of the item responses showed a substantial further improvement, both in terms of their mean-squared error and their bias. These last criteria had uniformly better efficiency over the whole range of $\theta$ values studied after 10 items and remained superior for longer tests. These findings suggest that these criteria should be the candidates for application in adaptive tests of any length.

The model used in this paper was the 2-PL model in (1). However, several operational CAT programs use the 3-PL model, with a guessing parameter, rather than the 2-PL model. As is well known, the effect of the presence of a guessing parameter is a reduction of information at the lower tail of the information function and therefore an increase in (posterior) variance (Hambleton & Swaminathan, 1985, chap. 6). Typically, four-choice items are used and the guessing parameter has values in the neighborhood of .25. In such cases, this parameter can be expected to have an effect on the (posterior) precision of the ability estimator for the less able examinees. However, since the variation in the values of the guessing parameter is mostly small, its presence is not expected to produce large differential effects on the item selection criteria in this paper. Empirical research is needed to verify this expectation.

Appendix

Owen's (1975) equations for updating the mean and variance of a (normalized) posterior are not easily found in the CAT literature. Using the notation in this paper, let $\Delta_{i_k}$ and $\Psi_{i_k}$ be defined as

$$\Delta_{i_k} \equiv \frac{b_{i_k} - \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})}{[a_{i_k}^{-2} + \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})]^{1/2}} \qquad (A1)$$

and

$$\Psi_{i_k} \equiv c_{i_k} + (1 - c_{i_k})\,\Phi(-\Delta_{i_k}). \qquad (A2)$$

Note that $\Delta_{i_k}$ can be interpreted as a standardized difference between the difficulty of item $i_k$ and the value of the EAP estimator based on the previous $k - 1$ items. If the posterior $g(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$ is normal, then, for the 3-parameter normal-ogive model in (7), its expectation, $\mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$, and variance, $\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})$, have updates for $U_{i_k} = 0$ and $U_{i_k} = 1$ that can be written as

$$\mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_{i_k} = 0) = \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) - \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,[a_{i_k}^{-2} + \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})]^{-1/2}\,\frac{\phi(\Delta_{i_k})}{\Phi(\Delta_{i_k})}; \qquad (A3)$$

$$\mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_{i_k} = 1) = \mathrm{E}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) + (1 - c_{i_k})\,\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})\,[a_{i_k}^{-2} + \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})]^{-1/2}\,\frac{\phi(\Delta_{i_k})}{\Psi_{i_k}}; \qquad (A4)$$

$$\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_{i_k} = 0) = \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) \left\{ 1 - \left(1 + \frac{a_{i_k}^{-2}}{\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})}\right)^{-1} \frac{\phi(\Delta_{i_k})}{\Phi(\Delta_{i_k})} \left[\frac{\phi(\Delta_{i_k})}{\Phi(\Delta_{i_k})} + \Delta_{i_k}\right] \right\}; \qquad (A5)$$

$$\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}, U_{i_k} = 1) = \mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}}) \left\{ 1 - \left(1 + \frac{a_{i_k}^{-2}}{\mathrm{Var}(\theta \mid u_{i_1}, \ldots, u_{i_{k-1}})}\right)^{-1} \frac{(1 - c_{i_k})\,\phi(\Delta_{i_k})}{\Psi_{i_k}} \left[\frac{(1 - c_{i_k})\,\phi(\Delta_{i_k})}{\Psi_{i_k}} - \Delta_{i_k}\right] \right\}, \qquad (A6)$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the normal density and distribution function, respectively.

If the item pool is dense enough, $\Delta_{i_k}$ can be set to zero and $\Psi_{i_k}$ takes the value $.5(1 + c_{i_k})$, whereupon the equations simplify drastically. Note that the first two equations show that the next EAP estimate is equal to the previous estimate plus a correction which is negative for a wrong ($U_{i_k} = 0$) and positive for a correct response ($U_{i_k} = 1$). The size of the correction depends on the variance of the estimator. The last two equations show that at each update the posterior variance decreases, but that the decrease for a correct response is smaller due to the presence of the factor $1 - c_{i_k}$, which accounts for the probability of guessing the correct response under the 3-parameter normal-ogive model.
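A direct transcription of (A1)-(A6) into code may be useful; the following sketch assumes the symbol choices above, and the function name is hypothetical rather than Owen's original implementation.

```python
from math import sqrt
from scipy.stats import norm

def owen_update(mean, var, a, b, c, u):
    """Updated (mean, variance) of the normal posterior approximation, (A1)-(A6)."""
    s = sqrt(a ** -2 + var)
    delta = (b - mean) / s                       # Delta in (A1)
    shrink = 1.0 / (1.0 + a ** -2 / var)         # (1 + a^-2/Var)^-1 in (A5)-(A6)
    if u == 0:                                   # incorrect response: (A3) and (A5)
        r = norm.pdf(delta) / norm.cdf(delta)
        return mean - var / s * r, var * (1.0 - shrink * r * (r + delta))
    psi = c + (1.0 - c) * norm.cdf(-delta)       # Psi in (A2)
    r = (1.0 - c) * norm.pdf(delta) / psi        # correct response: (A4) and (A6)
    return mean + var / s * r, var * (1.0 - shrink * r * (r - delta))

# Example: standard normal prior, correct answer on an item at the prior mean.
print(owen_update(0.0, 1.0, a=1.0, b=0.0, c=0.2, u=1))
```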
References

Andersen, E. B. (1980). Discrete statistical models with social science applications. Amsterdam: North-Holland.

Bloxom, B., & Vale, C. D. (1987, June). Multidimensional adaptive testing: An approximate procedure for updating. Paper presented at the annual meeting of the Psychometric Society, Montreal, Canada.

Brown, J. M., & Weiss, D. J. (1977). An adaptive testing strategy for achievement test batteries (Research Report 77-6). Minneapolis, MN: Psychometrics Program, Department of Psychology, University of Minnesota.

Chang, H.-H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20.

Gialluca, K. A., & Weiss, D. J. (1979). Efficiency of an adaptive inter-subtest branching strategy in the measurement of classroom achievement (Research Report 79-6). Minneapolis, MN: Psychometrics Program, Department of Psychology, University of Minnesota.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Kim, J. K., & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58.

Lord, F. M. (1983). Unbiased estimators of ability parameters, of their variance, and of their parallel-forms reliability. Psychometrika, 48.

Luecht, R. M. (1995). Some alternative CAT item selection heuristics (Internal report). Philadelphia, PA: National Board of Medical Examiners.

Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70.

Samejima, F. (1993). The bias function of the maximum likelihood estimate of ability for the dichotomous response level. Psychometrika, 58.
Schnipke, D. L., & Green, B. F. (1995). A comparison of item selection routines in linear and adaptive tests. Journal of Educational Measurement, 32.

Thissen, D., & Mislevy, R. J. (1990). In H. Wainer (Ed.), Computerized adaptive testing: A primer. Hillsdale, NJ: Erlbaum.

van der Linden, W. J. (in press). Empirical initialization of the ability estimator in adaptive testing algorithms. Applied Psychological Measurement.

van der Linden, W. J., & Reese, L. M. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22.

Veerkamp, W. J. J. (1996). Statistical inference for adaptive testing (Internal report). Enschede, The Netherlands: University of Twente, Department of Educational Measurement and Data Analysis.

Veerkamp, W. J. J., & Berger, M. P. F. (1997). Some new item selection criteria for adaptive testing. Journal of Educational and Behavioral Statistics, 22.

Wainer, H., Lewis, C., Kaplan, B., & Braswell, J. (1991). Building algebra testlets: A comparison of hierarchical and linear structures. Journal of Educational Measurement, 28.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory with tests of finite length. Psychometrika, 54.

Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6.

Weiss, D. J., & McBride, J. R. (1984). Bias and information of Bayesian adaptive testing. Applied Psychological Measurement, 8.

Manuscript received 9/9/96
Final version received 8/1/97


More information

Remarks on Bayesian Control Charts

Remarks on Bayesian Control Charts Remarks on Bayesian Control Charts Amir Ahmadi-Javid * and Mohsen Ebadi Department of Industrial Engineering, Amirkabir University of Technology, Tehran, Iran * Corresponding author; email address: ahmadi_javid@aut.ac.ir

More information

Modeling Randomness in Judging Rating Scales with a Random-Effects Rating Scale Model

Modeling Randomness in Judging Rating Scales with a Random-Effects Rating Scale Model Journal of Educational Measurement Winter 2006, Vol. 43, No. 4, pp. 335 353 Modeling Randomness in Judging Rating Scales with a Random-Effects Rating Scale Model Wen-Chung Wang National Chung Cheng University,

More information

Introduction to Bayesian Analysis 1

Introduction to Bayesian Analysis 1 Biostats VHM 801/802 Courses Fall 2005, Atlantic Veterinary College, PEI Henrik Stryhn Introduction to Bayesian Analysis 1 Little known outside the statistical science, there exist two different approaches

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

A Race Model of Perceptual Forced Choice Reaction Time

A Race Model of Perceptual Forced Choice Reaction Time A Race Model of Perceptual Forced Choice Reaction Time David E. Huber (dhuber@psyc.umd.edu) Department of Psychology, 1147 Biology/Psychology Building College Park, MD 2742 USA Denis Cousineau (Denis.Cousineau@UMontreal.CA)

More information

ST440/550: Applied Bayesian Statistics. (10) Frequentist Properties of Bayesian Methods

ST440/550: Applied Bayesian Statistics. (10) Frequentist Properties of Bayesian Methods (10) Frequentist Properties of Bayesian Methods Calibrated Bayes So far we have discussed Bayesian methods as being separate from the frequentist approach However, in many cases methods with frequentist

More information

Classical Psychophysical Methods (cont.)

Classical Psychophysical Methods (cont.) Classical Psychophysical Methods (cont.) 1 Outline Method of Adjustment Method of Limits Method of Constant Stimuli Probit Analysis 2 Method of Constant Stimuli A set of equally spaced levels of the stimulus

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Impact of Violation of the Missing-at-Random Assumption on Full-Information Maximum Likelihood Method in Multidimensional Adaptive Testing

Impact of Violation of the Missing-at-Random Assumption on Full-Information Maximum Likelihood Method in Multidimensional Adaptive Testing A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

During the past century, mathematics

During the past century, mathematics An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Centre for Education Research and Policy

Centre for Education Research and Policy THE EFFECT OF SAMPLE SIZE ON ITEM PARAMETER ESTIMATION FOR THE PARTIAL CREDIT MODEL ABSTRACT Item Response Theory (IRT) models have been widely used to analyse test data and develop IRT-based tests. An

More information

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Journal of Educational Measurement Summer 2010, Vol. 47, No. 2, pp. 227 249 Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Jimmy de la Torre and Yuan Hong

More information

Rasch Versus Birnbaum: New Arguments in an Old Debate

Rasch Versus Birnbaum: New Arguments in an Old Debate White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow?

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Journal of Educational and Behavioral Statistics Fall 2006, Vol. 31, No. 3, pp. 241 259 An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Michael C. Edwards The Ohio

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

Computerized Adaptive Testing With the Bifactor Model

Computerized Adaptive Testing With the Bifactor Model Computerized Adaptive Testing With the Bifactor Model David J. Weiss University of Minnesota and Robert D. Gibbons Center for Health Statistics University of Illinois at Chicago Presented at the New CAT

More information

Binomial Test Models and Item Difficulty

Binomial Test Models and Item Difficulty Binomial Test Models and Item Difficulty Wim J. van der Linden Twente University of Technology In choosing a binomial test model, it is important to know exactly what conditions are imposed on item difficulty.

More information

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 10: Introduction to inference (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 17 What is inference? 2 / 17 Where did our data come from? Recall our sample is: Y, the vector

More information

Full-Information Item Factor Analysis: Applications of EAP Scores

Full-Information Item Factor Analysis: Applications of EAP Scores Full-Information Item Factor Analysis: Applications of EAP Scores Eiji Muraki National Opinion Research Center George Engelhard, Jr. Emory University The full-information item factor analysis model proposed

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 26 for Mixed Format Tests Kyong Hee Chon Won-Chan Lee Timothy N. Ansley November 2007 The authors are grateful to

More information

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch.

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch. S05-2008 Imputation of Categorical Missing Data: A comparison of Multivariate Normal and Abstract Multinomial Methods Holmes Finch Matt Margraf Ball State University Procedures for the imputation of missing

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Journal of Social and Development Sciences Vol. 4, No. 4, pp. 93-97, Apr 203 (ISSN 222-52) Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Henry De-Graft Acquah University

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

Exploiting Auxiliary Infornlation

Exploiting Auxiliary Infornlation Exploiting Auxiliary Infornlation About Examinees in the Estimation of Item Parameters Robert J. Mislovy Educationai Testing Service estimates can be The precision of item parameter increased by taking

More information

properties of these alternative techniques have not

properties of these alternative techniques have not Methodology Review: Item Parameter Estimation Under the One-, Two-, and Three-Parameter Logistic Models Frank B. Baker University of Wisconsin This paper surveys the techniques used in item response theory

More information

María Verónica Santelices 1 and Mark Wilson 2

María Verónica Santelices 1 and Mark Wilson 2 On the Relationship Between Differential Item Functioning and Item Difficulty: An Issue of Methods? Item Response Theory Approach to Differential Item Functioning Educational and Psychological Measurement

More information

A Race Model of Perceptual Forced Choice Reaction Time

A Race Model of Perceptual Forced Choice Reaction Time A Race Model of Perceptual Forced Choice Reaction Time David E. Huber (dhuber@psych.colorado.edu) Department of Psychology, 1147 Biology/Psychology Building College Park, MD 2742 USA Denis Cousineau (Denis.Cousineau@UMontreal.CA)

More information

Section 5. Field Test Analyses

Section 5. Field Test Analyses Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken

More information

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017) DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;

More information

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests

An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests University of Massachusetts - Amherst ScholarWorks@UMass Amherst Dissertations 2-2012 An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests Wai Yan

More information

On the purpose of testing:

On the purpose of testing: Why Evaluation & Assessment is Important Feedback to students Feedback to teachers Information to parents Information for selection and certification Information for accountability Incentives to increase

More information

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Jonny B. Pornel, Vicente T. Balinas and Giabelle A. Saldaña University of the Philippines Visayas This paper proposes that

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

Combining Risks from Several Tumors Using Markov Chain Monte Carlo

Combining Risks from Several Tumors Using Markov Chain Monte Carlo University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln U.S. Environmental Protection Agency Papers U.S. Environmental Protection Agency 2009 Combining Risks from Several Tumors

More information