A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests


David Shin
Pearson Educational Measurement
May 2007
rr0701

Using assessment and research to promote learning

Pearson Educational Measurement (PEM) is the most comprehensive provider of educational assessment products, services, and solutions. As a pioneer in educational measurement, PEM has been a trusted partner in district, state, and national assessments for more than 50 years. PEM helps educators and parents use assessment and research to promote learning and academic achievement. PEM Research Reports provide dissemination of PEM research and assessment-related articles prior to publication. PEM reports in .pdf format may be downloaded at:

Because the world is complex and resources are often limited, test scores often serve both to rank individuals and to provide diagnostic feedback (Wainer, Vevea, Camacho, Reeve, Rosa, Nelson, Swygert, and Thissen, 2000). These two purposes create challenges from the view of content coverage. Standardized achievement tests serve the purpose of ranking very well. To serve the purpose of diagnosis, standardized achievement tests must have clusters of items that yield interpretable scores. These clusters could be learning objectives, subtests, or learning standards. Scores on these clusters of items will be referred to as objective scores in this paper. If there are a large number of items measuring an objective on a test, the estimation of an objective score might be precise and reliable. However, in many cases the number of items is fewer than optimal for the level of reliability desired. This condition is problematic, but it often exists in practice (Pommerich, Nicewander, and Hanson, 1999). The purpose of this paper is to review and evaluate a number of methods that attempt to provide a more precise and reliable estimation of objective scores. The first section of this paper reviews a number of different methods for estimating objective scores. The second section presents a study evaluating a subset of the more practical of these methods.

Review of Current Objective Score Estimation Methods

A number of studies have been devoted to proposing methods for the more precise and reliable estimation of objective scores (Yen, 1987; Yen, Sykes, Ito, and Julian, 1997; Bock, Thissen, and Zimowski, 1997; Pommerich et al., 1999; Wainer et al., 2000; Gessaroli, 2004; Kahraman and Kamata, 2004; Tate, 2004; Shin, Ansley, Tsai, and Mao,

2005). This section provides a review of these methods. All of these methods, implicitly or explicitly, estimate the objective scores using collateral test information. This review notes how different methods take advantage of collateral information in different ways. The IRT domain score was used as the estimate of the objective score by Bock, Thissen, and Zimowski (1997). The collateral test information is used in the sense that, when the item parameters are calibrated, the data from items in the other objectives contribute to the estimation of the item parameters in the objective of interest. For example, if items 1 to 6 measure the first objective and items 7 to 12 measure the second objective on the same test, then in calibrating the test the data from items 7 to 12 contribute to the estimation of the item parameters of items 1 to 6. Bock compared the proportion-correct score with the IRT domain score computed from theta estimated by either maximum likelihood estimation (MLE) or Bayesian estimation and found that the objective score estimated from the IRT approach is more accurate than the proportion-correct score. Several studies extended the work of Bock, Thissen, and Zimowski (1997). Pommerich et al. (1999) similarly used the IRT domain score to estimate the objective score. The only difference is that they estimated the score not at the individual level but at the group level. Tate (2004) extended Bock's study to the multidimensional case with different dimensionality and degrees of correlation between subsets of the test. The main purpose of Tate's study was to find the better method between MLE and expected a posteriori (EAP) estimation for estimating the objective score. He found that the choice of estimation approach depends on the intended uses of the objective scores (Tate, 2004, p. 107).

Adopting a somewhat different approach, Kahraman and Kamata (2004) also tried to estimate the objective score using IRT models. Kahraman and Kamata used out-of-scale items, which are explicitly a kind of collateral test information, to help estimate the objective score. The out-of-scale items are items that measure different objectives but are on the same test as the objective of interest. For example, if the score of Objective One is estimated, then the items in Objectives Two, Three, and so on are out-of-scale items. They controlled the number of out-of-scale items, the correlation between objectives, and the discrimination of the items, and found that the correlation between objectives needs to be at least 0.5 for moderate-discrimination out-of-scale items and 0.9 for high-discrimination items in order to take advantage of the out-of-scale items. Wainer et al. (2000) used an empirical Bayesian (EB) method to compute the objective score. The basic concept of this method is similar to Kelley's (1927) regressed score. Indeed, it is a multivariate version of Kelley's method. In Kelley's method, only one test score is used to estimate the regressed score, but in Wainer's method, several other objective scores are used to estimate the objective score of interest. Yen (1987) and Yen et al. (1997) combined the IRT and EB methods to compute the objective score using a method labeled the Objective Performance Index (OPI). They used the IRT domain score as the prior and assumed that the prior had a beta distribution. With the further assumption that the likelihood of the objective score follows a binomial distribution, the posterior distribution is then also a beta distribution. The mean and standard deviation (SD) of the posterior distribution are the estimated objective score and its standard error.
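To make the regression idea concrete, the snippet below sketches Kelley's univariate regressed score, which the Wainer et al. (2000) method extends to several objectives at once. It is only an illustration; the score, reliability, and mean values are hypothetical.

```python
def kelley_regressed_score(observed, reliability, group_mean):
    """Kelley's regressed (shrunken) true-score estimate: pull the observed
    score toward the group mean in proportion to the unreliability."""
    return group_mean + reliability * (observed - group_mean)

# Hypothetical example: a short objective scored out of 12 with reliability .60
print(kelley_regressed_score(observed=9.0, reliability=0.60, group_mean=6.0))  # 7.8
```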

Gessaroli (2004) used multidimensional IRT (MIRT) to compute the objective score. Gessaroli (2004) then compared it to Wainer's EB method (Wainer et al., 2000). He found that the EB method gave almost the same results as the MIRT method. Shin et al. (2005) applied the Markov chain Monte Carlo (MCMC) technique to estimate the objective scores of the IRT, OPI, and Wainer et al. methods. Shin et al. (2005) then compared these MCMC versions to their original non-MCMC counterparts. They found that the MCMC alternatives performed either the same as or slightly better than the non-MCMC methods. Some of the methods reviewed in this paper may not be practical for use in a large-scale testing program. For example, the method of Pommerich et al. (1999) only estimated the objective score for groups. It may not meet the need of a large-scale test that reports individual objective scores. The method of Gessaroli (2004) involved MIRT, which is rarely used in large-scale tests. The method of Kahraman and Kamata (2004) required certain conditions in order to take advantage of the method, and these conditions may not be met in many large-scale tests (p. 417). The study of Tate (2004) was very similar to Bock's study (Bock, Thissen, and Zimowski, 1997). The MCMC IRT and MCMC OPI methods in Shin et al. (2005) are too time-consuming and hence may not be practical. However, other methods reviewed in this paper may be suitable for use in a large-scale testing program. The method of Yen et al. (1997) is currently used for some state tests. The Bock, Thissen, and Zimowski (1997) method is convenient to implement in tests that use IRT. The Wainer et al. (2000) method and the MCMC Wainer method of Shin et al. (2005) performed better than the other methods in the

study of Shin et al. (2005). Therefore, these methods were included in the study reported in the next section.

Evaluation of Selected Objective Score Estimation Methods

The study reported in this section compares five methods that use collateral test information to estimate the objective score for a mixed-format test. These methods include an adjusted version of Bock et al.'s item response theory (IRT) approach (Bock et al., 1997), Yen's objective performance index (OPI) approach (Yen et al., 1997), Wainer et al.'s regressed score approach (Wainer et al., 2000), Shin et al.'s MCMC regressed score approach (Shin et al., 2005), and the proportion-correct score approach. They are referred to as the Bock method, the Yen method, the Wainer method, the Shin method, and the proportion-correct method hereafter. In addition to comparing these five methods using a common data set, the present study extends earlier work by including mixed-format tests. Only Yen et al. (1997) considered the case of a mixed-format test. As more large-scale tests require the reporting of objective scores and mixed-format tests become more common, it is necessary to conduct a study that compares different objective score estimation methods for mixed-format tests. In previous studies (Yen, 1987; Yen et al., 1997; Wainer et al., 2000; Pommerich et al., 1999; Bock et al., 1997; Shin et al., 2005), the number of items in an objective and the correlation between objectives were the two main factors that affected the estimation results. In the current study, the proportion of polytomous items in a test and the student sample size were also studied. The performance of the methods under

different combinations of these four factors was compared. Six main questions were investigated in this study: (1) What is the order of the objective score reliabilities estimated from the different methods, and how are they influenced by the four factors studied? (2) What is the accuracy of the nominal 95% confidence/credibility interval (95CI) of each method, and how is it influenced by the four factors studied? (3) What is the order of the widths of the 95CI for each method, and how are they influenced by the four factors studied? (4) What is the magnitude and order of the absolute bias (Bias) of the different methods, and how are they influenced by the four factors studied? (5) What is the magnitude and order of the standard deviation of estimation (SD) of the different methods, and how are they influenced by the four factors studied? (6) What is the magnitude and order of the root-mean-square error (RMSE) of the different methods, and how are they influenced by the four factors studied? The first question addresses the reliability of the objective score estimated by each method. As mentioned previously, one of the reasons to use methods other than the proportion-correct method is that the proportion-correct objective score is not reliable given the limited number of items in each objective. Therefore, the objective score estimated from the selected method must be more reliable than the proportion-correct score. However, because some of the estimating methods, such as the IRT method and the OPI method, do not provide a way to estimate reliability, it is empirically impossible to compare the reliability of each method directly. In the present study, since the true score and the estimated score for each method are both available from the simulation

process, it becomes possible to compute the correlation between the true score and the estimated score, and the reliability of each method can be obtained through the following equation:

\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \left(\frac{\sigma_{TX}}{\sigma_T \sigma_X}\right)^2 = \rho_{TX}^2,   (1)

where \rho_{XX'}, \sigma_X, \sigma_T, \sigma_{TX}, and \rho_{TX} are the reliability, the standard error of the estimated score, the standard error of the true score, the covariance of the true and estimated scores, and the correlation between the true score and the estimated score, respectively. The second question concerns the accuracy of the nominal 95% confidence/credibility interval (95CI) of each method. For the objective score estimated from the proportion-correct method, the 95CI is a confidence interval. For the other methods, the 95CI represents a credibility interval. Very often when an objective score is reported, the 95CI is also presented. However, the nominal 95CI may not really cover the true score 95% of the time. Through the simulation process in the current study, the lower and upper bounds of each estimated objective score's 95CI were computed, and the actual percentage of times that the 95CI covered the true score, the percent coverage (PC), was then computed for each method to answer this question. The third question is about the widths of the 95CI for each method. It is possible for a method to have a 95CI that covers the true score 95% of the time but is very wide. For example, a 95CI of the proportion-correct method may range from 4% to 95%. Obviously this range will cover the true proportion-correct score well, because the whole range of the proportion-correct score is just from 0% to 100%. However, such a 95CI is practically meaningless because its range is too wide.
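In the simulation, Equation (1) reduces to the squared correlation between the true and estimated objective scores, and the percent coverage and width of the 95CI are equally direct to tabulate. A minimal sketch (NumPy assumed; the arrays are whatever the simulation produced):

```python
import numpy as np

def simulated_reliability(true_scores, estimated_scores):
    """Equation (1): reliability as the squared correlation between the true
    and estimated objective scores across examinees."""
    return np.corrcoef(true_scores, estimated_scores)[0, 1] ** 2

def coverage_and_width(lower, upper, true_scores):
    """Percent coverage (PC) of the nominal 95CI and its average width."""
    lower, upper, true_scores = map(np.asarray, (lower, upper, true_scores))
    covered = (lower <= true_scores) & (true_scores <= upper)
    return 100 * covered.mean(), (upper - lower).mean()
```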

Therefore, in this study the widths of the 95CI for each method were compared. It is necessary to consider both the coverage of the 95CI and its width at the same time in order to find a better estimating method. The fourth to sixth questions concern the Bias, SD, and RMSE. They are defined as follows:

Bias = E(W) - \tau,   (2)

SD = \sqrt{E[W - E(W)]^2},   (3)

RMSE = \sqrt{E[(W - \tau)^2]} = \sqrt{Bias^2 + SD^2},   (4)

where W is the point estimator and \tau is the true parameter. To summarize, six criteria, (1) reliability, (2) percent coverage of the true score for a nominal 95% confidence/credibility interval (95PC), (3) the width of the 95% confidence/credibility interval (95CI), (4) Bias, (5) SD, and (6) RMSE, were used as the dependent variables in the comparisons of the different methods. A more desirable estimating method should have high score reliability, a narrow 95CI, accurate percent coverage of the nominal 95CI, small SD, Bias close to zero, and, consequently, small RMSE.
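The three accuracy criteria in Equations (2) to (4) can be computed directly from replicated estimates of a single true objective score. A small sketch with hypothetical values (NumPy assumed):

```python
import numpy as np

def bias_sd_rmse(estimates, true_value):
    """Equations (2)-(4): Bias = E(W) - tau, SD = sqrt(E[W - E(W)]^2), and
    RMSE = sqrt(E[(W - tau)^2]) = sqrt(Bias^2 + SD^2)."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    sd = estimates.std()  # population SD over the replications
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    return bias, sd, rmse

# Hypothetical: 100 replicated estimates of a true objective score of 0.70
rng = np.random.default_rng(0)
print(bias_sd_rmse(rng.normal(0.72, 0.05, size=100), true_value=0.70))
```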

Procedures

Simulated Responses

For the purpose of this study, data were simulated to represent different conditions found in test data. Four factors were considered in generating the simulated data. Detailed information about these four factors is provided in Table 1. In all, 3 × 3 × 3 × 3 = 81 conditions were considered in the simulation study.

Table 1. Simulation Factors and Number of Levels

Factor                           No. of levels   Description
Number of examinees              3               50, 500, 1000
Test length                      3               6, 12, or 18 items for each objective
Correlation between objectives   3               approximately 1.0, 0.8, and 0.5
Ratio of CR/MC items             3               0%, 20%, or 50% (in number of items)

The present study used empirical item parameters from a state testing program. Within this item pool there are 30 MC items and 18 CR items (each with three score categories: 0, 1, and 2). The ability (\theta) values of the examinees for the different objectives were simulated from a standardized multivariate normal distribution with correlation coefficients equal to 1.0, 0.8, or 0.5. For each of the 81 conditions, 100 simulation trials (i.e., response vectors) were used. For each condition, with item parameters assumed to be known, the simulation process involved the following steps:

1. Generate \theta s for each of the examinees from a standardized multivariate normal distribution. The generated values were restricted to be between -3 and 3. The correlation coefficients between \theta s were .5, .8, or 1.0.
2. Compute P_ij(\theta) using the IRT 3PL equation for item i in objective j, and P_ijk(\theta) using the generalized partial credit model equation for category k of item i in objective j. The item parameters were randomly selected with replacement from the item pool. P_ij(\theta) and P_ijk(\theta) are defined in the Bock method section of this paper.
3. Use P_ij(\theta) and P_ijk(\theta) from step 2 to compute the true score for each objective. For example, if items 1 through 6 were in objective 1, the true score of

objective 1 was the total of the P_ij(\theta) and the weighted P_ijk(\theta) of items 1 through 6. This true objective score was used as the baseline for comparing the different methods of estimating the objective score.
4. Generate responses y_ij using P_ij(\theta) and P_ijk(\theta) from step 2. y_ij is either a sample from the binomial distribution with probability P_ij(\theta) or a sample from the multinomial distribution with probabilities equal to P_ijk(\theta), where k is the index for the category level and ranges from 0 to 2.
5. Use the data from step 4 to estimate the objective scores using the different estimating methods. The details of each method are described later.
6. Repeat steps 4 and 5 100 times and compute the objective score reliability, 95CI width, percent coverage of the nominal 95CI, Bias, SD, and RMSE for further analyses.

Note that in the simulation process the data were simulated according to the correlation between \theta s rather than the correlation between the objective scores. However, because the objective scores are a monotonically increasing function of \theta, the correlation between the objective scores is approximately equal to the correlation between \theta s.
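The sketch below condenses these generation steps for a single objective under one condition: abilities for two correlated objectives are drawn, 3PL probabilities are computed for MC items and generalized partial credit probabilities for three-category CR items, and true and observed proportion-of-maximum scores are formed for the first objective. All item parameters, the sample size, and the correlation are illustrative placeholders, not the state item pool actually used in the study (NumPy assumed).

```python
import numpy as np

rng = np.random.default_rng(2007)
D = 1.7  # scaling constant used in the 3PL equation

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def p_gpc(theta, a, b_steps):
    """Generalized partial credit category probabilities (categories 0..len(b_steps))."""
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b_steps)))))
    num = np.exp(steps - steps.max())
    return num / num.sum()

# Step 1: abilities for two objectives, correlation 0.8, restricted to [-3, 3]
n_examinees, rho = 500, 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
thetas = np.clip(rng.multivariate_normal([0.0, 0.0], cov, size=n_examinees), -3, 3)

# Illustrative item parameters for one objective: 4 MC items and 2 CR items
mc_items = [dict(a=1.0, b=-0.5, c=0.2), dict(a=1.2, b=0.0, c=0.2),
            dict(a=0.8, b=0.5, c=0.2), dict(a=1.1, b=1.0, c=0.2)]
cr_items = [dict(a=0.9, b_steps=[-0.3, 0.6]), dict(a=1.0, b_steps=[0.0, 0.8])]
max_points = len(mc_items) + 2 * len(cr_items)    # n_j in the paper's notation

true_scores, observed_scores = [], []
for theta in thetas[:, 0]:                        # objective 1 uses the first theta
    # Steps 2-3: expected (true) objective score
    t = sum(p_3pl(theta, **it) for it in mc_items)
    t += sum(np.dot([0, 1, 2], p_gpc(theta, **it)) for it in cr_items)
    true_scores.append(t / max_points)
    # Step 4: simulated item responses
    x = sum(rng.random() < p_3pl(theta, **it) for it in mc_items)
    x += sum(rng.choice(3, p=p_gpc(theta, **it)) for it in cr_items)
    observed_scores.append(x / max_points)        # proportion-correct objective score
```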

Objective Scores

The simulated data were used as input for the different estimating methods to compute the objective scores. These estimated objective scores were then used in the comparison of the estimating methods. The following are brief descriptions of the estimating methods.

Bock method

The data from the simulation procedure were used to estimate the examinees' \theta values for the whole test. The estimated parameters, \hat{\theta}, were then entered into the equations

P_{ij}(\hat{\theta}) = c_{ij} + (1 - c_{ij}) \frac{\exp[1.7 a_{ij}(\hat{\theta} - b_{ij})]}{1 + \exp[1.7 a_{ij}(\hat{\theta} - b_{ij})]}

or

P_{ijk}(\hat{\theta}) = \frac{\exp\left[\sum_{c=1}^{k} a_{ij}(\hat{\theta} - b_{ijc})\right]}{\sum_{h=1}^{m_i} \exp\left[\sum_{c=1}^{h} a_{ij}(\hat{\theta} - b_{ijc})\right]}

to estimate P_{ij}(\hat{\theta}) and P_{ijk}(\hat{\theta}), where i, j, k, c, and m_i represent the item, the objective, the score level index, the current computed score level, and the total number of score levels for item i, respectively. The objective score for objective j, IRT T_j, was then computed as

IRT T_j = \frac{1}{n_j} \sum_{i=1}^{I_j} E_{ij}(\hat{\theta}),   (5)

where i indexes items, I_j is the number of items in objective j, and n_j is the maximum possible number of points in objective j. Note that n_j equals \sum_{i=1}^{I_j} (m_i - 1). For MC items,

E_{ij}(\hat{\theta}) = P_{ij}(\hat{\theta});   (6)

for CR items,

E_{ij}(\hat{\theta}) = \sum_{k=1}^{m_i} (k - 1) P_{ijk}(\hat{\theta}).   (7)

Bayes estimates of the scale scores, \hat{\theta}, were used to compute the IRT objective scores, with a normal (0, 1) prior distribution for the abilities. Usually when the objective scores are estimated, the item parameters (i.e., the item pool) already exist. Therefore, in this study the item parameters were assumed known. The variance of IRT T_j can be expressed as

Var(IRT T_j) = \frac{\sum_{i=1}^{n_{MC}} P_{ij}(\hat{\theta})[1 - P_{ij}(\hat{\theta})] + \sum_{i=1}^{n_{CR}} \left\{ \sum_{k=1}^{m_i} (k - 1)^2 P_{ijk}(\hat{\theta}) - \left[ \sum_{k=1}^{m_i} (k - 1) P_{ijk}(\hat{\theta}) \right]^2 \right\}}{n_j^2},   (8)

where n_{MC} and n_{CR} are the numbers of MC and CR items, respectively, and n_j represents, as defined previously, the maximum possible number of points in objective j.
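A sketch of the Bock-style objective score and its variance (Equations (5) to (8)), reusing the p_3pl and p_gpc helpers and the illustrative item lists from the simulation sketch above; the theta estimate passed in stands in for the Bayes estimate described in the text.

```python
import numpy as np

def irt_objective_score(theta_hat, mc_items, cr_items):
    """Equations (5)-(8): expected proportion-of-maximum score for one
    objective at the estimated theta, and the variance of that score."""
    n_j = len(mc_items) + sum(len(it["b_steps"]) for it in cr_items)  # max points
    expected, variance = 0.0, 0.0
    for it in mc_items:                      # Equation (6): MC expected score
        p = p_3pl(theta_hat, **it)
        expected += p
        variance += p * (1 - p)
    for it in cr_items:                      # Equation (7): CR expected score
        probs = p_gpc(theta_hat, **it)
        scores = np.arange(len(probs))       # categories scored 0 .. m_i - 1
        e = np.dot(scores, probs)
        expected += e
        variance += np.dot(scores ** 2, probs) - e ** 2
    return expected / n_j, variance / n_j ** 2   # Equations (5) and (8)

# Hypothetical use with the item lists defined in the simulation sketch:
# irt_t, irt_var = irt_objective_score(0.4, mc_items, cr_items)
```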

Yen method

The following steps were used to estimate Yen's T (Yen et al., 1997):

1. Estimate the IRT item parameters for all selected items.
2. Estimate \theta for the whole test (including all objectives).
3. For each objective, calculate the IRT objective score (Equation (5)), \hat{T}_j, where j indexes the objective.
4. Obtain

Q = \sum_{j=1}^{J} \frac{n_j (x_j / n_j - \hat{T}_j)^2}{\hat{T}_j (1 - \hat{T}_j)}.   (9)

If Q > \chi^2(J, .10), then Yen's T_j is x_j / n_j, with

p_j = x_j   (10)

and

q_j = n_j - x_j,   (11)

where x_j is the observed score obtained in objective j, n_j is the maximum number of points that can be obtained in objective j, and J is the number of objectives. If Q \le \chi^2(J, .10), then

p_j = \hat{T}_j n_j^* + x_j   (12)

and

q_j = (1 - \hat{T}_j) n_j^* + n_j - x_j.   (13)

The Yen T_j is then defined to be

Yen T_j = \frac{p_j}{p_j + q_j} = \frac{\hat{T}_j n_j^* + x_j}{n_j^* + n_j},

where

n_j^* = \frac{\mu(T_j)[1 - \mu(T_j)]}{\sigma^2(T_j)} - 1,  with  \sigma^2(T_j) = \left[\frac{1}{n_j}\sum_{i=1}^{I_j} E'_{ij}(\hat{\theta})\right]^2 \frac{1}{I(\theta, \hat{\theta})}.   (14)

For MC items,

E'_{ij}(\hat{\theta}) = \frac{1.7 a_{ij} [1 - P_{ij}(\hat{\theta})] [P_{ij}(\hat{\theta}) - c_{ij}]}{1 - c_{ij}}   (15)

and

I(\theta, \hat{\theta}) = \sum_{i=1}^{n_{MC}} \frac{[E'_{ij}(\hat{\theta})]^2}{P_{ij}(\hat{\theta})[1 - P_{ij}(\hat{\theta})]} = \sum_{i=1}^{n_{MC}} \frac{(1.7 a_{ij})^2 [1 - P_{ij}(\hat{\theta})] [P_{ij}(\hat{\theta}) - c_{ij}]^2}{P_{ij}(\hat{\theta}) (1 - c_{ij})^2},   (16)

where n_{MC} is the total number of MC items in the whole test. For CR items,

E'_{ij}(\hat{\theta}) = a_{ij} \left\{ \sum_{k=1}^{m_i} (k - 1)^2 P_{ijk}(\hat{\theta}) - [E_{ij}(\hat{\theta})]^2 \right\}   (17)

and

I(\theta, \hat{\theta}) = \sum_{i=1}^{n_{CR}} \frac{[E'_{ij}(\hat{\theta})]^2}{\sum_{k=1}^{m_i} (k - 1)^2 P_{ijk}(\hat{\theta}) - [E_{ij}(\hat{\theta})]^2} = \sum_{i=1}^{n_{CR}} a_{ij}^2 \left\{ \sum_{k=1}^{m_i} (k - 1)^2 P_{ijk}(\hat{\theta}) - [E_{ij}(\hat{\theta})]^2 \right\},   (18)

where n_{CR} is the total number of CR items in the whole test. The definitions of E_{ij}(\hat{\theta}) are given in Equations (6) and (7). The variance of Yen T_j can be expressed as

Var(Yen T_j) = \frac{p_j q_j}{(p_j + q_j)^2 (p_j + q_j + 1)},   (19)

where p_j and q_j are defined in Equations (10) to (13).
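A compressed sketch of the OPI computation in Equations (9) to (14) and (19) for one examinee, assuming the IRT objective scores, the observed objective scores, the maximum points per objective, and the effective prior sizes n*_j (Equation (14)) have already been computed. SciPy is assumed for the chi-square cutoff, and all numeric values are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def yen_opi(x, n, t_hat, n_star, alpha=0.10):
    """Equations (9)-(14) and (19): OPI posterior mean and variance per objective.
    x, n, t_hat, n_star are arrays over objectives; t_hat is the IRT objective
    score (Equation (5)) and n_star the effective prior size (Equation (14))."""
    x, n, t_hat, n_star = map(np.asarray, (x, n, t_hat, n_star))
    q = np.sum(n * (x / n - t_hat) ** 2 / (t_hat * (1 - t_hat)))   # Equation (9)
    if q > chi2.ppf(1 - alpha, df=len(x)):
        p, qq = x, n - x                       # prior discarded: Equations (10)-(11)
    else:
        p = t_hat * n_star + x                 # Equations (12)-(13)
        qq = (1 - t_hat) * n_star + n - x
    opi = p / (p + qq)                                             # Equation (14)
    var = p * qq / ((p + qq) ** 2 * (p + qq + 1))                  # Equation (19)
    return opi, var

# Hypothetical: three objectives with 6 points each
print(yen_opi(x=[4, 5, 2], n=[6, 6, 6], t_hat=[0.62, 0.71, 0.45], n_star=[8.0, 8.0, 8.0]))
```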

Wainer method

In vector notation, for the multivariate situation involving several objective scores collected in the vector x,

REG T = \bar{x} + B(x - \bar{x}).   (20)

Here \bar{x} is the mean vector, which contains the mean of each objective involved, and B is a matrix that is the multivariate analog of the estimated reliability for each objective. All that is needed to calculate REG T is an estimate of B. The equation for calculating B is

B = S_{true} (S_{obs})^{-1},   (21)

where S_{obs} is the observed variance-covariance matrix of the objective scores and S_{true} is the variance-covariance matrix of the true objective scores. Because errors are uncorrelated with true scores, it is easy to see that the off-diagonal elements of the two matrices are equal: \sigma_{\tau_{jv}\tau_{jv'}} = \sigma_{x_{jv}x_{jv'}}, where \sigma_{\tau_{jv}\tau_{jv'}} and \sigma_{x_{jv}x_{jv'}} are the off-diagonal elements of \Sigma_{true} and \Sigma_{obs}, the population variance-covariance matrices of the true objective scores and the observed objective scores. It is in the diagonal elements of S_{obs} and S_{true} that the difference arises. However, if the diagonal elements of S_{obs} are multiplied by the reliability, \sigma_{\tau}^2 / \sigma_x^2, of the subscale in question, the results are the diagonal elements of S_{true}. It is customary to estimate reliability with Cronbach's coefficient alpha (Wainer et al., 2000). The score variance of the estimates for the vth objective is the vth diagonal element of the matrix

C = S_{true} (S_{obs})^{-1} S_{true}.   (22)
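A minimal sketch of the Wainer et al. (2000) augmentation in Equations (20) to (22), taking the subscale reliabilities (e.g., coefficient alpha) as given. The data and reliabilities below are hypothetical; NumPy is assumed.

```python
import numpy as np

def wainer_augmented_scores(scores, reliabilities):
    """Equations (20)-(22): regress each examinee's vector of objective scores
    toward the mean vector using B = S_true (S_obs)^(-1)."""
    scores = np.asarray(scores, dtype=float)          # examinees x objectives
    s_obs = np.cov(scores, rowvar=False)
    s_true = s_obs.copy()
    np.fill_diagonal(s_true, np.diag(s_obs) * np.asarray(reliabilities))
    b = s_true @ np.linalg.inv(s_obs)                 # Equation (21)
    means = scores.mean(axis=0)
    reg_t = means + (scores - means) @ b.T            # Equation (20)
    c = s_true @ np.linalg.inv(s_obs) @ s_true        # Equation (22): score variances on the diagonal
    return reg_t, np.diag(c)

# Hypothetical: 500 examinees, 3 objectives, coefficient-alpha reliabilities
rng = np.random.default_rng(1)
x = rng.multivariate_normal([0.6, 0.65, 0.55], 0.01 * np.eye(3) + 0.005, size=500)
print(wainer_augmented_scores(x, reliabilities=[0.55, 0.60, 0.50])[0][:2])
```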

Shin method

Instead of using the empirical Bayes approach of the regressed score method, the MCMC regressed score method uses a fully Bayesian model. Here is an example with two objectives to illustrate the fully Bayesian estimation. If there are two objectives called m and v, the observed scores and true scores of these two objectives are x_m, x_v, \tau_m, and \tau_v, respectively. The fully Bayesian model results in the posterior distributions (\tau_m | x_m, x_v), MREG T_m, and (\tau_v | x_m, x_v), MREG T_v. From classical test theory (CTT),

x_{pm} = \tau_{pm} + \epsilon_{pm}   (23)

and

x_{pv} = \tau_{pv} + \epsilon_{pv},   (24)

where p indexes the examinee. These can be re-parameterized via the CTT equations as

x_{pm} = \tau_{pm} + \epsilon_{pm},  \epsilon_{pm} \sim N(0, \sigma_{\epsilon m}^2),   (25)

where

\tau_{pm} = \mu_m + \delta_{pm},  \delta_{pm} \sim N(0, \sigma_m^2);   (26)

and

x_{pv} = \tau_{pv} + \epsilon_{pv},  \epsilon_{pv} \sim N(0, \sigma_{\epsilon v}^2),   (27)

where

\tau_{pv} = \mu_v + \delta_{pv},  \delta_{pv} \sim N(0, \sigma_v^2).   (28)

\epsilon_{pm} and \epsilon_{pv} are the error terms, \mu_m and \mu_v are the common true scores for objective m and objective v, and \delta_{pm} and \delta_{pv} are the unique true scores for examinee p on objective m and objective v. In the fully Bayesian model, instead of replacing the unknown parameters with maximum likelihood estimates (MLE) as the empirical Bayesian method does, it is necessary to give prior distributions to the unknown parameters such as \sigma_m^2, \sigma_v^2, \sigma_{\epsilon m}^2, \sigma_{\epsilon v}^2, and \sigma_{mv}. If it is assumed that

\begin{pmatrix} x_{pm} \\ x_{pv} \\ \tau_{pm} \\ \tau_{pv} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_m \\ \mu_v \\ \mu_m \\ \mu_v \end{pmatrix},
\begin{pmatrix}
\sigma_m^2 + \sigma_{\epsilon m}^2 & \sigma_{mv} & \sigma_m^2 & \sigma_{mv} \\
\sigma_{mv} & \sigma_v^2 + \sigma_{\epsilon v}^2 & \sigma_{mv} & \sigma_v^2 \\
\sigma_m^2 & \sigma_{mv} & \sigma_m^2 & \sigma_{mv} \\
\sigma_{mv} & \sigma_v^2 & \sigma_{mv} & \sigma_v^2
\end{pmatrix} \right),   (29)

then

\tau_{pm} | x_{pm}, x_{pv} \sim N\left( \mu_m + \Sigma_{21} \Sigma_{11}^{-1} \begin{pmatrix} x_{pm} - \mu_m \\ x_{pv} - \mu_v \end{pmatrix},\; \sigma_m^2 - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{21}' \right)   (30)

and

\tau_{pv} | x_{pm}, x_{pv} \sim N\left( \mu_v + \Sigma_{31} \Sigma_{11}^{-1} \begin{pmatrix} x_{pm} - \mu_m \\ x_{pv} - \mu_v \end{pmatrix},\; \sigma_v^2 - \Sigma_{31} \Sigma_{11}^{-1} \Sigma_{31}' \right),   (31)

where

\Sigma_{11} = \begin{pmatrix} \sigma_m^2 + \sigma_{\epsilon m}^2 & \sigma_{mv} \\ \sigma_{mv} & \sigma_v^2 + \sigma_{\epsilon v}^2 \end{pmatrix},   (32)

\Sigma_{12} = \Sigma_{21}',   (33)

\Sigma_{21} = (\sigma_m^2, \sigma_{mv}),   (34)

and

\Sigma_{31} = (\sigma_{mv}, \sigma_v^2).   (35)

The priors for \mu_m, \mu_v, \sigma_m^2, \sigma_v^2, \sigma_{\epsilon m}^2, \sigma_{\epsilon v}^2, and \sigma_{mv} need to be specified. This model was coded in WinBUGS to estimate the fully Bayesian regressed objective scores. The variance of the estimate is the variance of the posterior distribution, and the 95% credibility interval is the interval between the .025 and .975 quantiles of the posterior distribution.
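The study fit this model in WinBUGS, with the posterior summarized by MCMC draws. Purely as an illustration of what the conditional posterior in Equations (30) to (35) looks like once the variance components are known, the sketch below plugs hypothetical values into the normal conditioning formula; in the actual Shin method these quantities are sampled rather than fixed.

```python
import numpy as np

def conditional_tau_m(x_m, x_v, mu_m, mu_v, var_m, var_v, var_em, var_ev, cov_mv):
    """Equations (30) and (32)-(35): posterior mean and variance of the true
    score tau_m for objective m given the observed scores on objectives m and v,
    with all variance components treated as known."""
    sigma_11 = np.array([[var_m + var_em, cov_mv],
                         [cov_mv, var_v + var_ev]])      # Equation (32)
    sigma_21 = np.array([var_m, cov_mv])                 # Equation (34)
    resid = np.array([x_m - mu_m, x_v - mu_v])
    w = sigma_21 @ np.linalg.inv(sigma_11)
    post_mean = mu_m + w @ resid
    post_var = var_m - w @ sigma_21
    return post_mean, post_var

# Hypothetical values for one examinee
print(conditional_tau_m(x_m=0.72, x_v=0.65, mu_m=0.60, mu_v=0.58,
                        var_m=0.010, var_v=0.012, var_em=0.008, var_ev=0.009,
                        cov_mv=0.007))
```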

Results

The results of the comparisons of the methods are presented in Figures 1 to 24. Figures 1 to 4 present the comparison of the estimated reliability under the different factors (the number of examinees, the number of items in each objective, the correlation between objectives, and the ratio of CR/MC items). For brevity, unless otherwise noted, the reliability in the following text refers to the estimated reliability. Figures 5 to 8 present the comparison of the width of the 95CI for each method. Figures 9 to 12 present the comparison of the percent coverage of the 95CI for each method. Figures 13 to 16 present the comparison of the absolute bias under the different factors. Figures 17 to 20 present the comparison of the SD for each method, and Figures 21 to 24 present the comparison of the RMSE for each method.

Comparison of Reliability

Figure 1. Reliability for different numbers of examinees

Figure 1 is the comparison of reliability for different numbers of examinees. From Figure 1 it can be seen that:

(1) The number of examinees did not impact the reliability of the objective score, because the reliability for different numbers of examinees remains approximately the same except for the Bock method. However, the largest reliability for the Bock method is .74, which is only about .03 higher than the lowest reliability, .71, for this method.
(2) The order of the magnitude of the objective score reliability for the methods is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.

Figure 2. Reliability for different numbers of items in each objective

Figure 2 is the comparison of reliability for different numbers of items in each objective. From Figure 2 it can be seen that:

(1) The number of items in each objective had an impact on the reliability of the objective score: the reliability increased as more items were included in each objective. The slope of the increase was greater from 6 to 12 items and became flatter from 12 to 18 items.
(2) The order of the magnitude of the reliability of the objective score for the methods is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
(3) For the Bock method, the reliability became the same for the 12- and 18-item cases.

Figure 3. Reliability for different correlations between objectives

Figure 3 is the comparison of reliability for different correlations between objectives. From Figure 3 it can be seen that:
(1) The correlation between objectives had an impact on the reliability of the objective score: the reliability increased as the correlation became higher. However, the effect is not shown for the proportion-correct method.
(2) The order of the magnitude of the reliability of the objective score for the methods is approximately: Shin = Wainer > Yen > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
(3) The Bock method had a unique performance in this case. When the correlation between objectives equaled .5, it had the lowest reliability, and when the correlation equaled one, it had the highest reliability. This finding is related to the assumption of the Bock method. Because Bock's objective score is actually the IRT domain score, it needs to satisfy the IRT assumption that each objective tests the same thing (i.e., the unidimensionality assumption). Therefore, when the correlation between objectives is 0.5, the assumption is violated, making it the worst method for estimating the objective score. However, when the correlation was 1.0, the assumption was met and it was the best method for the estimation.

Figure 4. Reliability for different ratios of CR/MC items

Figure 4 is the comparison of reliability for different ratios of CR/MC items in each objective. From Figure 4 it can be seen that:
(1) The ratio of CR/MC items in each objective had an impact on the reliability of the objective score: the reliability increased as the ratio of CR/MC items in each objective increased.
(2) The order of the magnitude of the reliability of the objective score for the methods is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
(3) Because the CR items used in this study all have the same number of categories (3 categories), more CR items means that more possible points can be obtained in each

objective. That is, if there are more possible points in each objective, the reliability of the estimated objective scores will be higher. To sum up: as the number of items per objective, the correlation between objectives, and the ratio of CR/MC items increased, the reliability of the estimated objective score also increased. The number of examinees did not affect the reliability of the objective score. Among the five methods studied, the estimated objective scores from the Shin and Wainer methods generally have the highest reliability, except for the case when the correlation between objectives is 1.0. In that case, the Bock method yielded the highest reliability.

Comparison of the Width of 95CI

Figure 5. Width of 95CI for different numbers of examinees

Figure 5 is the comparison of the width of the 95CI for different numbers of examinees. From Figure 5 it can be seen that:
(1) The number of examinees did not have an impact on the width of the 95CI.
(2) The order of the magnitude of the width of the 95CI for the methods is: Proportion-correct > Bock > Shin = Wainer > Yen.

Figure 6. Width of 95CI for different numbers of items in each objective

Figure 6 is the comparison of the width of the 95CI for different numbers of items per objective. From Figure 6 it can be seen that:
(1) The number of items in each objective had an impact on the width of the 95CI. As the number of items per objective increased, the width of the 95CI decreased.

(2) The order of the magnitude of the width of the 95CI for the methods is: Proportion-correct > Bock > Shin = Wainer > Yen.

Figure 7. Width of 95CI for different correlations between objectives

Figure 7 is the comparison of the width of the 95CI for different correlations between objectives. From Figure 7 it can be seen that:
(1) The correlation between objectives did not have an impact on the width of the 95CI for the Bock, proportion-correct, and Yen methods, but it had a slight effect on the Wainer and Shin methods. For these two methods, as the correlation between objectives increased, the width of the 95CI slightly decreased.
(2) The order of the magnitude of the width of the 95CI for the methods is: Proportion-correct > Bock > Shin = Wainer > Yen.

Figure 8. Width of 95CI for different ratios of CR/MC items

Figure 8 is the comparison of the width of the 95CI for different ratios of CR/MC items. From Figure 8 it can be seen that:
(1) The ratio of CR/MC items did not have an impact on the width of the 95CI except for the Bock method. For the Bock method, as the ratio of CR/MC items increased, the width of the 95CI slightly decreased.
(2) The order of the magnitude of the width of the 95CI for the methods is: Proportion-correct > Bock > Shin = Wainer > Yen.
To sum up: of the four factors studied, only the number of items per objective had an impact on the width of the 95CI. More items per objective tended to lead to a narrower width

of the 95CI. The order of the magnitude of the width of the 95CI is consistent across the different factors. The Yen method has the narrowest 95CI.

Comparison of the Percent Coverage of 95CI

Figure 9. Percent coverage of 95CI for different numbers of examinees

Figure 9 is the comparison of the percent coverage of the 95CI for different numbers of examinees. From Figure 9 it can be seen that:
(1) Generally, the number of examinees did not have an impact on the percent coverage of the 95CI.
(2) Almost all the methods have a conservative 95CI, because their percent coverages of the 95CI are all larger than 95%. That means the nominal 95CI covers the true score more than 95% of the time.

Figure 10. Percent coverage of 95CI for different numbers of items in each objective

Figure 10 is the comparison of the percent coverage of the 95CI for different numbers of items in each objective. From Figure 10 it can be seen that:
(1) For the Bock and Yen methods, the number of items in each objective had an impact on the percent coverage of the 95CI. Generally, as the number of items per objective increased, the percent coverage of the 95CI decreased.
(2) Almost all the methods have a conservative 95CI, because their percent coverages of the 95CI are all larger than 95%. That means the nominal 95CI covers the true score more than 95% of the time.

Figure 11. Percent coverage of 95CI for different correlations between objectives

Figure 11 is the comparison of the percent coverage of the 95CI for different correlations between objectives. From Figure 11 it can be seen that:
(1) For the Bock and Yen methods, the correlation between objectives has an obvious impact on the percent coverage of the 95CI.
(2) Almost all the methods have a conservative 95CI, because their percent coverages of the 95CI are all larger than 95%. That means the nominal 95CI covers the true score more than 95% of the time.

Figure 12. Percent coverage of 95CI for different ratios of CR/MC items

Figure 12 is the comparison of the percent coverage of the 95CI for different ratios of CR/MC items. From Figure 12 it can be seen that:
(1) For the Yen method, the CR/MC ratio has some impact on the percent coverage of the 95CI, but the pattern is not consistent across situations.
(2) Almost all the methods have a conservative 95CI, because their percent coverages of the 95CI are all larger than 95%. That means the nominal 95CI covers the true score more than 95% of the time.
To sum up: the 95CIs of most of the methods, except for the Yen method, are conservative. That is, the nominal 95CI covered the true score more than 95% of the time in the

simulation study. The Yen method has the 95CI that is closest to its nominal value, 95%. This may be due to the non-symmetrical property of the 95CI for Yen's method.

Comparison of Bias

Figure 13. Bias for different numbers of examinees

Figure 13 is the comparison of bias for different numbers of examinees. From Figure 13 it can be seen that:
(1) The number of examinees did not impact the Bias of the objective score, because the Bias for different numbers of examinees remains approximately the same for each method.

(2) The order of the Bias for the methods was: Bock > Wainer > Shin > Yen > proportion-correct. The magnitudes of the Bias ranged from approximately 0.01 to 0.05. That is, if the perfect score is 100, the Bias is around 1 to 5 points.

Figure 14. Bias for different numbers of items in each objective

Figure 14 is the comparison of bias for different numbers of items in each objective. From Figure 14 it can be seen that:
(1) The number of items in each objective had an impact on the Bias of the objective scores for the Wainer, Shin, and Yen methods: the Bias increased as more items were included in each objective. The slope of the increase was greater from 6 to 12 items and became flatter from 12 to 18 items. The number of items did not affect the Bias for the proportion-correct method. For the Bock method, the Bias decreased as the number of items

increased from 6 to 12 but increased slightly as the number of items increased from 12 to 18.
(2) The order of the Bias for the methods was approximately: Bock > Wainer > Shin > Yen > proportion-correct.

Figure 15. Bias for different correlations between objectives

Figure 15 is the comparison of bias for different correlations between objectives. From Figure 15 it can be seen that:
(1) The correlation between objectives had an impact on the Bias for each method, because the Bias increased as the correlation became higher. However, the effect was not shown for the proportion-correct method.

(2) The order of the Bias for the methods was approximately: Bock > Wainer > Shin > Yen > proportion-correct.
(3) The Bock method had a unique performance in this case. When the correlation between objectives equaled .5, it had the highest Bias, but when the correlation equaled one, it had a relatively low Bias. This finding is related to the assumption of the Bock method. Because Bock's objective score is actually the IRT domain score, it needs to satisfy the IRT assumption that each objective tests the same thing (i.e., the unidimensionality assumption). Therefore, when the correlation between objectives is 0.5, the assumption is violated, making it the worst method for estimating the objective score. However, when the correlation was 1.0, the assumption was met and the Bias was lower.

Figure 16. Bias for different ratios of CR/MC items

Figure 16 is the comparison of bias for different ratios of CR/MC items in each objective. From Figure 16 it can be seen that:
(1) The ratio of CR/MC items in each objective did not impact the Bias of the objective score.
(2) The order of the Bias for the methods was approximately: Bock > Wainer > Shin > Yen > proportion-correct.
To sum up: the Bias was affected by two factors, the number of items per objective and the correlation between objectives. As the number of items per objective or the correlation between objectives increased, the Bias of the estimated objective score also increased. The number of examinees and the ratio of CR/MC items did not affect the Bias of the objective score. Among the five methods studied, the order of the magnitude of the Bias is consistent across the four factors. Generally, the order was: Bock > Wainer > Shin > Yen > proportion-correct. The magnitude of the Bias ranged from approximately 0.01 to 0.05.

Comparison of the SD

Figure 17. SD for different numbers of examinees

Figure 17 is the comparison of the SD for different numbers of examinees. From Figure 17 it can be seen that:
(1) The number of examinees did not impact the SD.
(2) The order of the SD for the methods is: Proportion-correct > Yen > Shin > Wainer > Bock.

Figure 18. SD for different numbers of items in each objective

Figure 18 is the comparison of the SD for different numbers of items per objective. From Figure 18 it can be seen that:
(1) The number of items in each objective had an impact on the SD. As the number of items per objective increased, the SD decreased.
(2) Generally, the order of the SD for the methods is: Proportion-correct > Yen > Shin > Wainer > Bock.

Figure 19. SD for different correlations between objectives

Figure 19 is the comparison of the SD for different correlations between objectives. From Figure 19 it can be seen that:
(1) The correlation between objectives had an impact on the SD, except for the proportion-correct method. For the other methods, as the correlation between objectives increased, the SD decreased.
(2) The order of the SD for the methods is: Proportion-correct > Yen > Shin > Wainer > Bock.

Figure 20. SD for different ratios of CR/MC items

Figure 20 is the comparison of the SD for different ratios of CR/MC items. From Figure 20 it can be seen that:
(1) The ratio of CR/MC items did not impact the SD.
(2) The order of the SD for the methods is: Proportion-correct > Yen > Shin > Wainer > Bock.
To sum up: of the four factors studied, only the number of items per objective and the correlation between objectives had an impact on the SD. More items per objective or a higher correlation between objectives tended to lead to a smaller SD. The order of the SD is consistent across the different factors: Proportion-correct > Yen > Shin > Wainer > Bock. The magnitude of the SD ranged upward from approximately 0.07.

Comparison of the RMSE

Figure 21. RMSE for different numbers of examinees

Figure 21 is the comparison of the RMSE for different numbers of examinees. From Figure 21 it can be seen that:
(1) Generally, the number of examinees did not impact the RMSE.
(2) The order of the RMSE for the methods is: Proportion-correct > Yen > Bock > Shin = Wainer.

Figure 22. RMSE for different numbers of items in each objective

Figure 22 is the comparison of the RMSE for different numbers of items in each objective. From Figure 22 it can be seen that:
(1) The number of items in each objective had an impact on the RMSE for each method. Generally, as the number of items per objective increased, the RMSE decreased.
(2) The order of the RMSE for the methods is: Proportion-correct > Yen > Bock > Shin = Wainer.

Figure 23. RMSE for different correlations between objectives

Figure 23 is the comparison of the RMSE for different correlations between objectives. From Figure 23 it can be seen that:
(1) Except for the proportion-correct method, the correlation between objectives had an impact on the RMSE. As the correlation between objectives increased, the RMSE decreased.
(2) The order of the RMSE for the methods was generally: Proportion-correct > Yen > Bock > Shin = Wainer.
(3) It should be noted that the Bock method was greatly affected by the correlation between objectives. When the correlation equals 0.5, the assumption of the Bock method is strongly violated and its RMSE became high. However, when the

correlation equals 1.0, the Bock assumption is perfectly met and its RMSE became the lowest.

Figure 24. RMSE for different ratios of CR/MC items

Figure 24 is the comparison of the RMSE for different ratios of CR/MC items. From Figure 24 it can be seen that:
(1) The CR/MC ratio had a slight impact on the RMSE for each method. As the ratio increased, the RMSE slightly decreased.
(2) The order of the RMSE for the methods was generally: Proportion-correct > Yen > Bock > Shin = Wainer.
To sum up: of the four factors studied, only the number of items per objective and the correlation between objectives had an impact on the RMSE. More items per objective or

a higher correlation between objectives tended to lead to a smaller RMSE. The order of the RMSE is consistent across the different factors and was generally: Proportion-correct > Yen > Bock > Shin = Wainer. The magnitude of the RMSE ranged upward from approximately 0.08.

Discussion

Based on the results of this simulation study, it appears that using estimation methods other than the proportion-correct method improved the estimation of objective scores in terms of the reliability of the objective score. Factors that affect the reliability of objective scores included the number of items per objective, the correlation between objectives, and the ratio of CR/MC items. The number of examinees did not affect the reliability of the objective score. As the values of these factors increased, the reliability of the estimated objective score also increased. Only the number of items per objective affected the width of the 95CI and the percent coverage of the 95CI. Generally, the objective scores estimated from the Wainer and Shin methods had the highest reliability; their reliabilities ranged upward from approximately 0.68. The objective scores estimated from the Yen and proportion-correct methods had relatively lower reliability. The range of reliabilities was from 0.59 to 0.75 for the proportion-correct method and from 0.65 to 0.80 for the Yen method. The studied factors seem to have had a larger impact on the Bock method, especially the correlation between objectives. When the correlation was 1.0, the Bock method had the highest reliability (around 0.85). Table 2 shows the effective gain in number of items for each method versus the proportion-correct method, computed using the Spearman-Brown prophecy formula. It can be seen from Table 2 that the use of the Wainer and Shin methods can lead to an effective gain of

up to 1.63 times the number of items in a subscore. The Bock method may yield an effective gain of up to 1.89 times the number of items in an objective. For example, if there are 6 items in an objective, using the Wainer and Shin methods may produce reliability as if there were 9.78 (6 * 1.63 = 9.78) items in that objective, compared to the proportion-correct method. In other words, to achieve the score reliability of the Wainer/Shin methods with the proportion-correct method, the number of items per objective must be increased by a factor of up to 1.63.

Table 2. The Effective Gain in Number of Items for Each Method (minimum and maximum original reliability and minimum and maximum number of items gained for the Wainer/Shin, Yen, Bock, and proportion-correct methods)
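The effective gains in Table 2 follow from the Spearman-Brown prophecy formula: given the proportion-correct reliability and the reliability reached by an augmented score, the gain factor is the test-lengthening factor that would produce the same improvement. A small sketch with illustrative reliabilities:

```python
def spearman_brown_gain(rho_base, rho_method):
    """Lengthening factor k such that a test k times as long with reliability
    rho_base would reach rho_method (Spearman-Brown prophecy formula)."""
    return (rho_method * (1 - rho_base)) / (rho_base * (1 - rho_method))

# Illustrative values: proportion-correct reliability .70 vs. an augmented score of .79
k = spearman_brown_gain(0.70, 0.79)
print(round(k, 2), round(6 * k, 2))   # with 6 items, this behaves like about 6*k items
```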

On the percentage scale, the widths of the 95CI are approximately 27%, 35%, 50%, and 55% for the Yen method, the Wainer and Shin methods, the Bock method, and the proportion-correct method, respectively. That means that if an objective score is reported based on the Yen method, 95% of the time the true objective score will fall in a range with a width equal to 27%. Similarly, 95% of the time a score based on the Wainer and Shin methods will fall in a range with a width equal to 35%, and so on. Basically, the narrower the width, the more precise the estimation. However, this needs to be evaluated together with the percent coverage of the nominal CI (confidence or credibility interval). Generally, a good estimator should have a narrow 95CI and an accurate percent coverage of the nominal CI. By combining the results of the 95CI width and the percent coverage of the 95CI, it can be seen that Yen's method provided a narrow 95CI and a relatively accurate percent coverage of the 95CI. The other methods are all too conservative, because their 95CIs covered the true value more than 95% of the time. Therefore, an adjusted 95CI should be developed and studied in order to obtain a more precise 95CI for these methods. In general, using estimation methods other than the proportion-correct method improved the estimation of objective scores in terms of SD and RMSE. The proportion-correct method had the smallest Bias but the largest SD. Because the magnitude of the Bias is much smaller than that of the SD, the RMSE of the proportion-correct method became the largest. Therefore, the other methods are expected to estimate the objective scores better than the proportion-correct method. Although the Bock method had the smallest SD, it also had the largest Bias. The Wainer and Shin methods had SDs similar to the Bock method but smaller Bias, and therefore slightly smaller RMSE than the Bock method. Factors that affected the RMSE, SD, and Bias of the objective scores included the number of items per objective and the correlation between objectives. The number of examinees and the ratio of CR/MC items did not affect the RMSE, SD, and Bias of the objective scores. Table 3 shows the maximum and minimum RMSE values for each method. Using methods other than the proportion-correct method reduces the RMSE by as much as 0.055, which is 33 percent of the proportion-correct RMSE.


More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, April 23-25, 2003 The Classification Accuracy of Measurement Decision Theory Lawrence Rudner University

More information

LUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp.

LUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp. LUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp. Traditional test development focused on one purpose of the test, either ranking test-takers

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota

Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased. Ben Babcock and David J. Weiss University of Minnesota Termination Criteria in Computerized Adaptive Tests: Variable-Length CATs Are Not Biased Ben Babcock and David J. Weiss University of Minnesota Presented at the Realities of CAT Paper Session, June 2,

More information

Discriminant Analysis with Categorical Data

Discriminant Analysis with Categorical Data - AW)a Discriminant Analysis with Categorical Data John E. Overall and J. Arthur Woodward The University of Texas Medical Branch, Galveston A method for studying relationships among groups in terms of

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract Item Parameter Recovery for the Two-Parameter Testlet Model with Different Estimation Methods Yong Luo National Center for Assessment in Saudi Arabia Abstract The testlet model is a popular statistical

More information

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS By OU ZHANG A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

Comparison of Computerized Adaptive Testing and Classical Methods for Measuring Individual Change

Comparison of Computerized Adaptive Testing and Classical Methods for Measuring Individual Change Comparison of Computerized Adaptive Testing and Classical Methods for Measuring Individual Change Gyenam Kim Kang Korea Nazarene University David J. Weiss University of Minnesota Presented at the Item

More information

Linking Mixed-Format Tests Using Multiple Choice Anchors. Michael E. Walker. Sooyeon Kim. ETS, Princeton, NJ

Linking Mixed-Format Tests Using Multiple Choice Anchors. Michael E. Walker. Sooyeon Kim. ETS, Princeton, NJ Linking Mixed-Format Tests Using Multiple Choice Anchors Michael E. Walker Sooyeon Kim ETS, Princeton, NJ Paper presented at the annual meeting of the American Educational Research Association (AERA) and

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Journal of Social and Development Sciences Vol. 4, No. 4, pp. 93-97, Apr 203 (ISSN 222-52) Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Henry De-Graft Acquah University

More information

Copyright. Kelly Diane Brune

Copyright. Kelly Diane Brune Copyright by Kelly Diane Brune 2011 The Dissertation Committee for Kelly Diane Brune Certifies that this is the approved version of the following dissertation: An Evaluation of Item Difficulty and Person

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek.

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek. An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests Xuan Tan Sooyeon Kim Insu Paek Bihua Xiang ETS, Princeton, NJ Paper presented at the annual meeting of the

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data A C T Research Report Series 87-12 A Comparison Study of the Unidimensional IRT Estimation of Compensatory and Noncompensatory Multidimensional Item Response Data Terry Ackerman September 1987 For additional

More information

Analyzing data from educational surveys: a comparison of HLM and Multilevel IRT. Amin Mousavi

Analyzing data from educational surveys: a comparison of HLM and Multilevel IRT. Amin Mousavi Analyzing data from educational surveys: a comparison of HLM and Multilevel IRT Amin Mousavi Centre for Research in Applied Measurement and Evaluation University of Alberta Paper Presented at the 2013

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 39 Evaluation of Comparability of Scores and Passing Decisions for Different Item Pools of Computerized Adaptive Examinations

More information

Small-area estimation of mental illness prevalence for schools

Small-area estimation of mental illness prevalence for schools Small-area estimation of mental illness prevalence for schools Fan Li 1 Alan Zaslavsky 2 1 Department of Statistical Science Duke University 2 Department of Health Care Policy Harvard Medical School March

More information

A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia

A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia Paper 109 A SAS Macro to Investigate Statistical Power in Meta-analysis Jin Liu, Fan Pan University of South Carolina Columbia ABSTRACT Meta-analysis is a quantitative review method, which synthesizes

More information

The Use of Item Statistics in the Calibration of an Item Bank

The Use of Item Statistics in the Calibration of an Item Bank ~ - -., The Use of Item Statistics in the Calibration of an Item Bank Dato N. M. de Gruijter University of Leyden An IRT analysis based on p (proportion correct) and r (item-test correlation) is proposed

More information

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch.

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch. S05-2008 Imputation of Categorical Missing Data: A comparison of Multivariate Normal and Abstract Multinomial Methods Holmes Finch Matt Margraf Ball State University Procedures for the imputation of missing

More information

Differential Item Functioning

Differential Item Functioning Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item

More information

Score Tests of Normality in Bivariate Probit Models

Score Tests of Normality in Bivariate Probit Models Score Tests of Normality in Bivariate Probit Models Anthony Murphy Nuffield College, Oxford OX1 1NF, UK Abstract: A relatively simple and convenient score test of normality in the bivariate probit model

More information

Standard Errors of Correlations Adjusted for Incidental Selection

Standard Errors of Correlations Adjusted for Incidental Selection Standard Errors of Correlations Adjusted for Incidental Selection Nancy L. Allen Educational Testing Service Stephen B. Dunbar University of Iowa The standard error of correlations that have been adjusted

More information

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX Paper 1766-2014 Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX ABSTRACT Chunhua Cao, Yan Wang, Yi-Hsin Chen, Isaac Y. Li University

More information

Method Comparison for Interrater Reliability of an Image Processing Technique in Epilepsy Subjects

Method Comparison for Interrater Reliability of an Image Processing Technique in Epilepsy Subjects 22nd International Congress on Modelling and Simulation, Hobart, Tasmania, Australia, 3 to 8 December 2017 mssanz.org.au/modsim2017 Method Comparison for Interrater Reliability of an Image Processing Technique

More information

An Assessment of The Nonparametric Approach for Evaluating The Fit of Item Response Models

An Assessment of The Nonparametric Approach for Evaluating The Fit of Item Response Models University of Massachusetts Amherst ScholarWorks@UMass Amherst Open Access Dissertations 2-2010 An Assessment of The Nonparametric Approach for Evaluating The Fit of Item Response Models Tie Liang University

More information

A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS

A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS A BAYESIAN SOLUTION FOR THE LAW OF CATEGORICAL JUDGMENT WITH CATEGORY BOUNDARY VARIABILITY AND EXAMINATION OF ROBUSTNESS TO MODEL VIOLATIONS A Thesis Presented to The Academic Faculty by David R. King

More information

International Journal of Education and Research Vol. 5 No. 5 May 2017

International Journal of Education and Research Vol. 5 No. 5 May 2017 International Journal of Education and Research Vol. 5 No. 5 May 2017 EFFECT OF SAMPLE SIZE, ABILITY DISTRIBUTION AND TEST LENGTH ON DETECTION OF DIFFERENTIAL ITEM FUNCTIONING USING MANTEL-HAENSZEL STATISTIC

More information

Kelvin Chan Feb 10, 2015

Kelvin Chan Feb 10, 2015 Underestimation of Variance of Predicted Mean Health Utilities Derived from Multi- Attribute Utility Instruments: The Use of Multiple Imputation as a Potential Solution. Kelvin Chan Feb 10, 2015 Outline

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Jonny B. Pornel, Vicente T. Balinas and Giabelle A. Saldaña University of the Philippines Visayas This paper proposes that

More information

Bayesian Estimation of a Meta-analysis model using Gibbs sampler

Bayesian Estimation of a Meta-analysis model using Gibbs sampler University of Wollongong Research Online Applied Statistics Education and Research Collaboration (ASEARC) - Conference Papers Faculty of Engineering and Information Sciences 2012 Bayesian Estimation of

More information

Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education

Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores Shudong Wang NWEA Liru Zhang Delaware Department of Education Paper to be presented at the annual meeting of the National

More information

An Empirical Assessment of Bivariate Methods for Meta-analysis of Test Accuracy

An Empirical Assessment of Bivariate Methods for Meta-analysis of Test Accuracy Number XX An Empirical Assessment of Bivariate Methods for Meta-analysis of Test Accuracy Prepared for: Agency for Healthcare Research and Quality U.S. Department of Health and Human Services 54 Gaither

More information

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 4-2014 Modeling DIF with the Rasch Model: The Unfortunate Combination

More information

A Comparison of DIMTEST and Generalized Dimensionality Discrepancy. Approaches to Assessing Dimensionality in Item Response Theory. Ray E.

A Comparison of DIMTEST and Generalized Dimensionality Discrepancy. Approaches to Assessing Dimensionality in Item Response Theory. Ray E. A Comparison of DIMTEST and Generalized Dimensionality Discrepancy Approaches to Assessing Dimensionality in Item Response Theory by Ray E. Reichenberg A Thesis Presented in Partial Fulfillment of the

More information

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN By FRANCISCO ANDRES JIMENEZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF

More information

STATISTICS AND RESEARCH DESIGN

STATISTICS AND RESEARCH DESIGN Statistics 1 STATISTICS AND RESEARCH DESIGN These are subjects that are frequently confused. Both subjects often evoke student anxiety and avoidance. To further complicate matters, both areas appear have

More information

Methods Research Report. An Empirical Assessment of Bivariate Methods for Meta-Analysis of Test Accuracy

Methods Research Report. An Empirical Assessment of Bivariate Methods for Meta-Analysis of Test Accuracy Methods Research Report An Empirical Assessment of Bivariate Methods for Meta-Analysis of Test Accuracy Methods Research Report An Empirical Assessment of Bivariate Methods for Meta-Analysis of Test Accuracy

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

for Scaling Ability and Diagnosing Misconceptions Laine P. Bradshaw James Madison University Jonathan Templin University of Georgia Author Note

for Scaling Ability and Diagnosing Misconceptions Laine P. Bradshaw James Madison University Jonathan Templin University of Georgia Author Note Combing Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions Laine P. Bradshaw James Madison University Jonathan Templin University

More information

Section 5. Field Test Analyses

Section 5. Field Test Analyses Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken

More information

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses Journal of Modern Applied Statistical Methods Copyright 2005 JMASM, Inc. May, 2005, Vol. 4, No.1, 275-282 1538 9472/05/$95.00 Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement

More information

Multidimensional Modeling of Learning Progression-based Vertical Scales 1

Multidimensional Modeling of Learning Progression-based Vertical Scales 1 Multidimensional Modeling of Learning Progression-based Vertical Scales 1 Nina Deng deng.nina@measuredprogress.org Louis Roussos roussos.louis@measuredprogress.org Lee LaFond leelafond74@gmail.com 1 This

More information

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION

More information

On Test Scores (Part 2) How to Properly Use Test Scores in Secondary Analyses. Structural Equation Modeling Lecture #12 April 29, 2015

On Test Scores (Part 2) How to Properly Use Test Scores in Secondary Analyses. Structural Equation Modeling Lecture #12 April 29, 2015 On Test Scores (Part 2) How to Properly Use Test Scores in Secondary Analyses Structural Equation Modeling Lecture #12 April 29, 2015 PRE 906, SEM: On Test Scores #2--The Proper Use of Scores Today s Class:

More information

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS 3 CONCEPTUAL FOUNDATIONS OF STATISTICS In this chapter, we examine the conceptual foundations of statistics. The goal is to give you an appreciation and conceptual understanding of some basic statistical

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that

More information

TECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock

TECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock 1 TECHNICAL REPORT The Added Value of Multidimensional IRT Models Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock Center for Health Statistics, University of Illinois at Chicago Corresponding

More information

Applications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis

Applications. DSC 410/510 Multivariate Statistical Methods. Discriminating Two Groups. What is Discriminant Analysis DSC 4/5 Multivariate Statistical Methods Applications DSC 4/5 Multivariate Statistical Methods Discriminant Analysis Identify the group to which an object or case (e.g. person, firm, product) belongs:

More information

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve

More information

Applying the Minimax Principle to Sequential Mastery Testing

Applying the Minimax Principle to Sequential Mastery Testing Developments in Social Science Methodology Anuška Ferligoj and Andrej Mrvar (Editors) Metodološki zvezki, 18, Ljubljana: FDV, 2002 Applying the Minimax Principle to Sequential Mastery Testing Hans J. Vos

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Research Report Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Xueli Xu Matthias von Davier April 2010 ETS RR-10-10 Listening. Learning. Leading. Linking Errors in Trend Estimation

More information

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns The Influence of Test Characteristics on the Detection of Aberrant Response Patterns Steven P. Reise University of California, Riverside Allan M. Due University of Minnesota Statistical methods to assess

More information

Decision consistency and accuracy indices for the bifactor and testlet response theory models

Decision consistency and accuracy indices for the bifactor and testlet response theory models University of Iowa Iowa Research Online Theses and Dissertations Summer 2014 Decision consistency and accuracy indices for the bifactor and testlet response theory models Lee James LaFond University of

More information

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales University of Iowa Iowa Research Online Theses and Dissertations Summer 2013 Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales Anna Marie

More information