A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests


A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests
David Shin
Pearson Educational Measurement
May 2007
rr0701
Using assessment and research to promote learning

Pearson Educational Measurement (PEM) is the most comprehensive provider of educational assessment products, services and solutions. As a pioneer in educational measurement, PEM has been a trusted partner in district, state and national assessments for more than 50 years. PEM helps educators and parents use assessment and research to promote learning and academic achievement. PEM Research Reports provide dissemination of PEM research and assessment-related articles prior to publication. PEM reports in .pdf format may be downloaded at: http://www.pearsonedmeasurement.com/research/research.htm

Because the world is complex and resources are often limited, test scores often serve both to rank individuals and to provide diagnostic feedback (Wainer, Vevea, Camacho, Reeve, Rosa, Nelson, Swygert, and Thissen, 2000). These two purposes create challenges from the standpoint of content coverage. Standardized achievement tests serve the purpose of ranking very well. To serve the purpose of diagnosis, standardized achievement tests must have clusters of items that yield interpretable scores. These clusters could be learning objectives, subtests, or learning standards. Scores on these clusters of items are referred to as objective scores in this paper. If a large number of items measure an objective on a test, the estimation of an objective score might be precise and reliable. However, in many cases the number of items is fewer than optimal for the level of reliability desired. This condition is problematic, but it often exists in practice (Pommerich, Nicewander, and Hanson, 1999). The purpose of this paper is to review and evaluate a number of methods that attempt to provide a more precise and reliable estimation of objective scores. The first section of this paper reviews a number of different methods for estimating objective scores. The second section presents a study evaluating a subset of the more practical of these methods.

Review of Current Objective Score Estimation Methods

A number of studies have proposed methods for more precise and reliable estimation of objective scores (Yen, 1987; Yen, Sykes, Ito, and Julian, 1997; Bock, Thissen, and Zimowski, 1997; Pommerich et al., 1999; Wainer et al., 2000; Gessaroli, 2004; Kahraman and Kamata, 2004; Tate, 2004; Shin, Ansley, Tsai, and Mao,

2005). This section provides a review of these methods. All of these methods, implicitly or explicitly, estimate the objective scores using collateral test information. This review notes how the different methods take advantage of collateral information in different ways.

The IRT domain score was used as the estimate of the objective score by Bock, Thissen, and Zimowski (1997). The collateral test information is used in the sense that, when the item parameters are calibrated, the data from items in the other objectives contribute to the estimation of the item parameters in the objective of interest. For example, if items 1 to 6 measure the first objective and items 7 to 12 measure the second objective on the same test, then in calibrating the test, the data from items 7 to 12 contribute to the estimation of the item parameters of items 1 to 6. Bock compared the proportion-correct score with the IRT domain score computed from theta estimated by either maximum likelihood estimation (MLE) or Bayesian estimation and found that the objective score based on the IRT estimate is more accurate than the proportion-correct score.

Several studies extended the work of Bock, Thissen, and Zimowski (1997). Pommerich et al. (1999) similarly used the IRT domain score to estimate the objective score; the only difference is that they estimated the score not at the individual level but at the group level. Tate (2004) extended Bock's study to the multidimensional case with different dimensionalities and degrees of correlation between subsets of the test. The main purpose of Tate's study was to determine which of MLE and expected a posteriori (EAP) estimation better estimates the objective score. He found that the choice of estimation approach depends on the intended uses of the objective scores (Tate, 2004, p. 107).
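To make the domain-score idea concrete, here is a minimal sketch in Python (with hypothetical 3PL item-parameter arrays a, b, c and a made-up theta estimate); it illustrates the general approach only and is not the implementation used in any of the studies cited:

import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    # 3PL probability of a correct response to each multiple-choice item
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def irt_domain_score(theta_hat, a, b, c, objective_items):
    # Expected proportion of the maximum possible score on one objective,
    # evaluated at the theta estimated from the whole test (the collateral
    # information enters through that whole-test theta).
    p = p_3pl(theta_hat, a[objective_items], b[objective_items], c[objective_items])
    return p.sum() / len(objective_items)

# Hypothetical example: items 0-5 form Objective One on a 12-item MC test.
a = np.full(12, 1.0)
b = np.linspace(-1.5, 1.5, 12)
c = np.full(12, 0.2)
print(irt_domain_score(theta_hat=0.3, a=a, b=b, c=c, objective_items=np.arange(6)))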

Adopting a somewhat different approach, Kahraman and Kamata (2004) also estimated the objective score using IRT models. Kahraman and Kamata used out-of-scale items, which are explicitly a kind of collateral test information, to help estimate the objective score. The out-of-scale items are items that measure different objectives but appear on the same test as the objective of interest. For example, if the score of Objective One is being estimated, then the items in Objectives Two, Three, and so on are out-of-scale items. They varied the number of out-of-scale items, the correlation between objectives, and the discrimination of the items, and found that the correlation between objectives needs to be at least 0.5 for moderate-discrimination out-of-scale items and 0.9 for high-discrimination items in order to benefit from the out-of-scale items.

Wainer et al. (2000) used an empirical Bayes (EB) method to compute the objective score. The basic concept of this method is similar to Kelley's (1927) regressed score; indeed, it is a multivariate version of Kelley's method. In Kelley's method, only one test score is used to estimate the regressed score, but in Wainer's method, the other objective scores are also used to estimate the objective score of interest.

Yen (1987) and Yen et al. (1997) combined the IRT and EB approaches to compute the objective score using a method labeled the Objective Performance Index (OPI). They used the IRT domain score as the prior and assumed that the prior had a beta distribution. With the additional assumption that the likelihood of the objective score follows a binomial distribution, the posterior distribution is also a beta distribution. The mean and standard deviation (SD) of the posterior distribution serve as the estimated objective score and its standard error.
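As a rough illustration of the OPI idea just described, the following Python sketch performs the beta-binomial update: the IRT domain score T is treated as the mean of a beta prior, the prior weight n_star is a hypothetical value standing in for the information-based effective sample size that Yen's procedure actually derives, and the observed score x out of n possible points updates the prior. This is a simplified sketch, not Yen's full OPI algorithm.

def opi_sketch(T, n_star, x, n):
    # Beta prior with mean T and effective sample size n_star,
    # updated by observing x points earned out of n possible points.
    alpha = T * n_star + x
    beta = (1.0 - T) * n_star + (n - x)
    post_mean = alpha / (alpha + beta)  # estimated objective score
    post_sd = (alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))) ** 0.5  # its standard error
    return post_mean, post_sd

# Hypothetical values: domain score 0.70, prior weight 8, observed 4 of 6 points.
print(opi_sketch(T=0.70, n_star=8.0, x=4, n=6))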

Gessaroli (2004) used multidimensional IRT (MIRT) to compute the objective score and compared it to Wainer's EB method (Wainer et al., 2000). He found that the EB method produced almost the same results as the MIRT method. Shin et al. (2005) applied the Markov chain Monte Carlo (MCMC) technique to estimate the objective scores for the IRT, OPI, and Wainer et al. methods, and compared these MCMC versions to their original non-MCMC counterparts. They found that the MCMC alternatives performed either the same as or slightly better than the non-MCMC methods.

Some of the methods reviewed in this paper may not be practical for use in a large-scale testing program. For example, the method of Pommerich et al. (1999) estimates objective scores only for groups, so it may not meet the needs of a large-scale test that reports individual objective scores. The method of Gessaroli (2004) involves MIRT, which is rarely used in large-scale testing. The method of Kahraman and Kamata (2004) requires certain conditions in order to be advantageous, and these conditions may not be met in many large-scale tests (p. 417). The study of Tate (2004) was very similar to Bock's study (Bock, Thissen, and Zimowski, 1997). The MCMC IRT and MCMC OPI methods in Shin et al. (2005) are too time-consuming and hence may not be practical.

However, other methods reviewed in this paper may be suitable for use in a large-scale testing program. The method of Yen et al. (1997) is currently used for some state tests. The Bock, Thissen, and Zimowski (1997) method is convenient to implement for tests that use IRT. The Wainer et al. (2000) method and the MCMC Wainer method of Shin et al. (2005) performed better than the other methods in the

study of Shin et al. (2005). Therefore, these methods were included in the study reported in the next section.

Evaluation of Selected Objective Score Estimation Methods

The study reported in this section compares five methods that use collateral test information to estimate objective scores for a mixed-format test. These methods include an adjusted version of Bock et al.'s item response theory (IRT) approach (Bock et al., 1997), Yen's objective performance index (OPI) approach (Yen et al., 1997), Wainer et al.'s regressed score approach (Wainer et al., 2000), Shin et al.'s MCMC regressed score approach (Shin et al., 2005), and the proportion-correct score approach. They are referred to hereafter as the Bock method, the Yen method, the Wainer method, the Shin method, and the proportion-correct method.

In addition to comparing these five methods using a common data set, the present study extends earlier work by including mixed-format tests. Only Yen et al. (1997) considered the case of a mixed-format test. As more large-scale tests require the reporting of objective scores and mixed-format tests become more common, a study that compares different objective score estimation methods for mixed-format tests is needed. In previous studies (Yen, 1987; Yen et al., 1997; Wainer et al., 2000; Pommerich et al., 1999; Bock et al., 1997; Shin et al., 2005), the number of items in an objective and the correlation between objectives were the two main factors that affected the estimation results. In the current study, the proportion of polytomous items in a test and the student sample size were also studied. The performance of the methods under

different combinations of these four factors was compared. Six main questions were investigated in this study:

(1) What is the order of the objective score reliabilities estimated from the different methods, and how are they influenced by the four factors studied?
(2) How accurate is the nominal 95% confidence/credibility interval (95CI) of each method, and how is it influenced by the four factors studied?
(3) What is the order of the widths of the 95CI for each method, and how are they influenced by the four factors studied?
(4) What is the magnitude and order of the absolute bias (Bias) of the different methods, and how are they influenced by the four factors studied?
(5) What is the magnitude and order of the standard deviation of estimation (SD) of the different methods, and how are they influenced by the four factors studied?
(6) What is the magnitude and order of the root-mean-square error (RMSE) of the different methods, and how are they influenced by the four factors studied?

The first question addresses the reliability of the objective score estimated by each method. As mentioned previously, one reason to use methods other than the proportion-correct method is that the proportion-correct objective score is not reliable given the limited number of items in each objective. Therefore, the objective score estimated from the selected method must be more reliable than the proportion-correct score. However, because some of the estimating methods, such as the IRT method and the OPI method, do not provide a way to estimate reliability, it is not possible to compare the reliability of each method empirically from operational data. In the present study, since the true score and the estimated score of each method are both available from the simulation

process, it is possible to compute the correlation between the true score and the estimated score, and the reliability of each method can be obtained through the following equation:

\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \left( \frac{\sigma_{TX}}{\sigma_T \sigma_X} \right)^2 = \rho_{TX}^2,   (1)

where \rho_{XX'}, \sigma_X, \sigma_T, \sigma_{TX}, and \rho_{TX} are the reliability, the standard deviation of the estimated score, the standard deviation of the true score, the covariance of the true and estimated scores, and the correlation between the true and estimated scores, respectively.

The second question concerns the accuracy of the nominal 95% confidence/credibility intervals (95CI) of each method. For the objective score estimated from the proportion-correct method, the 95CI is a confidence interval; for the other methods, the 95CI is a credibility interval. Very often when an objective score is reported, its 95CI is also presented. However, the nominal 95CI may not actually cover the true score 95% of the time. Through the simulation process in the current study, the lower and upper bounds of each estimated objective score were computed, and the actual percentage of times that the 95CI covered the true score, the percent coverage (PC), was then computed for each method to answer this question.

The third question concerns the widths of the 95CI for each method. It is possible for a method to have a 95CI that covers the true score 95% of the time but is very wide. For example, a 95CI for the proportion-correct method might range from 4% to 95%. Obviously this range will cover the true proportion-correct score well, because the whole range of the proportion-correct score is only 0% to 100%. However, such a 95CI is practically meaningless because its range is too wide. Therefore, in this

study, the widths of the 95CI for each method were compared. It is necessary to consider both the coverage of the 95CI and its width at the same time in order to identify a better estimating method.

The fourth to sixth questions concern the Bias, SD, and RMSE. They are defined as follows:

\text{Bias} = |E(W) - \theta|,   (2)

\text{SD} = \sqrt{ E\{ [W - E(W)]^2 \} },   (3)

\text{RMSE} = \sqrt{ E[(W - \theta)^2] } = \sqrt{ \text{Bias}^2 + \text{SD}^2 },   (4)

where W is the point estimator and \theta is the true parameter.

To summarize, six criteria, (1) reliability, (2) percent coverage of the true score for a nominal 95% confidence/credibility interval (95PC), (3) the width of the 95% confidence/credibility interval (95CI), (4) Bias, (5) SD, and (6) RMSE, were used as the dependent variables in the comparisons of the different methods. A more desirable estimating method should have high score reliability, a narrow 95CI, accurate percent coverage for the nominal 95CI, small SD, Bias that is close to zero, and, consequently, small RMSE.

Procedures

Simulated Responses

For the purpose of this study, data were simulated to represent different conditions found in test data. Four factors were considered in generating the simulated data. Detailed information about these four factors is provided in Table 1. In all, 3 x 3 x 3 x 3 = 81 conditions were considered in the simulation study.
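Before describing the simulation design in detail, here is a minimal sketch in Python of how the six criteria above could be computed (assuming the true scores, point estimates, and interval bounds produced by the simulation are held in NumPy arrays); it illustrates the definitions only and is not the study's actual analysis code:

import numpy as np

def replication_criteria(true_score, estimates, lowers, uppers):
    # Bias, SD, RMSE, coverage, and interval width for one simulated examinee's
    # objective score, computed over the simulation replications (Equations 2-4).
    bias = abs(estimates.mean() - true_score)
    sd = np.sqrt(np.mean((estimates - estimates.mean()) ** 2))
    rmse = np.sqrt(np.mean((estimates - true_score) ** 2))
    coverage = np.mean((lowers <= true_score) & (true_score <= uppers))
    width = np.mean(uppers - lowers)
    return bias, sd, rmse, coverage, width

def reliability(true_scores, estimates):
    # Equation (1): squared correlation between true and estimated objective
    # scores across examinees.
    return np.corrcoef(true_scores, estimates)[0, 1] ** 2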

Table 1. Simulation Factors and Number of Levels

Factor                            No. of levels   Description
Number of examinees               3               50, 500, 1000
Test length                       3               6, 12, or 18 items for each objective
Correlation between objectives    3               Approximately 1.0, 0.8, and 0.5
Ratio of CR/MC items              3               0%, 20%, or 50% (in number of items)

The present study used empirical item parameters from a state testing program. Within this item pool there are 30 MC items and 18 CR items (each with three score categories: 0, 1, and 2). The ability (theta) values for the examinees on the different objectives were simulated from a standardized multivariate normal distribution with correlation coefficients equal to 1.0, 0.8, or 0.5. For each of the 81 conditions, 100 simulation trials (i.e., sets of response vectors) were used. For each condition, with item parameters assumed to be known, the simulation process involved the following steps:

1. Generate thetas for each of the examinees from a standardized multivariate normal distribution. The generated theta values were restricted to be between -3 and 3. The correlation coefficients between the thetas were .5, .8, or 1.0.
2. Compute P_ij(theta) using the IRT 3PL equation for item i in objective j, and P_ijk(theta) using the generalized partial credit model equation for category k of item i in objective j. The item parameters were randomly selected with replacement from the item pool. P_ij(theta) and P_ijk(theta) are defined in the Bock method section of this paper.
3. Use P_ij(theta) and P_ijk(theta) from step 2 to compute the true score for each objective. For example, if items 1 through 6 were in objective 1, the true score of

objective 1 was the sum of the P_ij(theta) and weighted P_ijk(theta) values for items 1 through 6. This true objective score was used as the baseline against which the different methods of estimating the objective score were compared.
4. Generate responses y_ij using P_ij(theta) and P_ijk(theta). Each y_ij is either a sample from the binomial distribution with probability P_ij(theta) or a sample from the multinomial distribution with probabilities equal to P_ijk(theta), where k is the index for the category level and ranges from 0 to 2.
5. Use the data from step 4 to estimate the objective scores with the different estimating methods. The details of each method are described later.
6. Repeat steps 4 and 5 100 times and compute the objective score reliability, 95CI, percent coverage of the nominal 95CI, Bias, SD, and RMSE for further analyses.

It should be noted that in the simulation process the data were simulated according to the correlation between the thetas rather than the correlation between the objective scores. However, because the objective scores are monotonically increasing functions of theta, the correlation between the objective scores is approximately equal to the correlation between the thetas.

Objective Scores

The simulated data were used as input to the different estimating methods to compute the objective scores. These estimated objective scores were then used to compare the estimating methods. Brief descriptions of the estimating methods follow.
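Before turning to the individual methods, the following minimal Python sketch (with hypothetical item parameters and two objectives; not the study's actual code) illustrates steps 1 through 4 above: drawing correlated abilities, computing 3PL and generalized partial credit probabilities, forming the true objective score, and sampling item responses.

import numpy as np

rng = np.random.default_rng(1)

def p_3pl(theta, a, b, c, D=1.7):
    # 3PL correct-response probabilities for MC items
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def p_gpc(theta, a, b_steps, D=1.7):
    # Generalized partial credit category probabilities; b_steps holds the step
    # parameters for categories 1..m-1 (category 0 has cumulative sum 0).
    cum = np.concatenate(([0.0], np.cumsum(D * a * (theta - b_steps))))
    e = np.exp(cum - cum.max())
    return e / e.sum()

# Step 1: correlated abilities for two objectives, truncated to [-3, 3]
n_examinees, rho = 500, 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
thetas = np.clip(rng.multivariate_normal([0.0, 0.0], cov, size=n_examinees), -3, 3)

# Hypothetical parameters: six MC items plus one three-category CR item per objective
a_mc, b_mc, c_mc = np.full(6, 1.0), np.linspace(-1, 1, 6), np.full(6, 0.2)
a_cr, b_cr = 1.0, np.array([-0.5, 0.5])

for theta_1, theta_2 in thetas[:3]:              # first few examinees, objective 1 only
    p_mc = p_3pl(theta_1, a_mc, b_mc, c_mc)      # step 2: MC probabilities
    p_cr = p_gpc(theta_1, a_cr, b_cr)            # step 2: CR category probabilities (k = 0, 1, 2)
    true_score = (p_mc.sum() + np.arange(3) @ p_cr) / (6 + 2)  # step 3: expected proportion of 8 points
    y_mc = rng.binomial(1, p_mc)                 # step 4: MC responses
    y_cr = rng.choice(3, p=p_cr)                 # step 4: CR response
    print(y_mc, y_cr, round(true_score, 3))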

Bock method

The data from the simulation procedure were used to estimate the examinees' theta values for the whole test. The estimates, \hat\theta, were then entered into the equations

P_{ij}(\hat\theta) = c_{ij} + (1 - c_{ij}) \frac{\exp[1.7 a_{ij}(\hat\theta - b_{ij})]}{1 + \exp[1.7 a_{ij}(\hat\theta - b_{ij})]}

or

P_{ijk}(\hat\theta) = \frac{\exp\left[\sum_{c=0}^{k} a_{ij}(\hat\theta - b_{ijc})\right]}{\sum_{h=0}^{m_i - 1} \exp\left[\sum_{c=0}^{h} a_{ij}(\hat\theta - b_{ijc})\right]}

to estimate P_{ij}(\hat\theta) and P_{ijk}(\hat\theta), where i, j, k, c, and m_i represent the item, the objective, the score level index, the score level counted in the summation, and the total number of score levels for item i, respectively. The objective score for objective j, IRT T_j, was then computed by

\text{IRT } T_j = \frac{1}{n_j} \sum_{i=1}^{I_j} \mu_{ij}(\hat\theta),   (5)

where i indexes items, I_j is the number of items in objective j, and n_j is the maximum possible number of points in objective j. Note that n_j = \sum_{i=1}^{I_j} (m_i - 1). For MC items,

\mu_{ij}(\hat\theta) = P_{ij}(\hat\theta);   (6)

for CR items,

\mu_{ij}(\hat\theta) = \sum_{k=1}^{m_i} (k - 1) P_{ijk}(\hat\theta).   (7)

Bayes estimates of the scale scores, \hat\theta, were used to compute the IRT objective scores, with a normal (0, 1) prior distribution for the abilities. Usually, when objective scores are estimated, the item parameters (i.e., the item pool) already exist; therefore, in this study the item parameters were assumed known. The variance of IRT T_j can be expressed as

\text{Var}(\text{IRT } T_j) = \frac{\sum_{i=1}^{n_{MC}} P_{ij}(\hat\theta)[1 - P_{ij}(\hat\theta)] + \sum_{i=1}^{n_{CR}} \left\{ \sum_{k=1}^{m_i} (k - 1)^2 P_{ijk}(\hat\theta) - \left[ \sum_{k=1}^{m_i} (k - 1) P_{ijk}(\hat\theta) \right]^2 \right\}}{n_j^2},   (8)

where n_{MC} and n_{CR} are the numbers of MC and CR items, respectively, and n_j is, as defined previously, the maximum possible number of points in objective j.

Yen method

The following steps were used to estimate Yen's T (Yen et al., 1997):

1. Estimate the IRT item parameters for all selected items.
2. Estimate theta for the whole test (including all objectives).
3. For each objective, calculate IRT T_j (see Equation 5), where j indexes the objective.
4. Obtain

Q = \sum_{j=1}^{J} \frac{n_j \left( \frac{x_j}{n_j} - T_j \right)^2}{T_j (1 - T_j)}.   (9)

If Q > \chi^2(J, .10), then the Yen T_j is \hat T_j = x_j / n_j, with

p_j = x_j   (10)

and

q_j = n_j - x_j,   (11)

where x_j is the observed score obtained in objective j, n_j is the maximum number of points that can be obtained in objective j, and J is the number of objectives. If Q \le \chi^2(J, .10), then

p_j = T_j n_j^{*} + x_j   (12)

and

q_j = (1 - T_j) n_j^{*} + n_j - x_j.   (13)

The Yen T_j, \hat T_j, is then defined to be

\hat T_j = \frac{p_j}{p_j + q_j} = \frac{T_j n_j^{*} + x_j}{n_j^{*} + n_j},   (14)

where

n_j^{*} = \frac{\mu(T_j)[1 - \mu(T_j)]}{\sigma^2(T_j)} - 1, \quad \mu(T_j) = \frac{1}{n_j} \sum_{i=1}^{I_j} \mu_{ij}(\hat\theta), \quad \sigma^2(T_j) = \frac{1}{I(\theta, \hat\theta)} \left[ \frac{1}{n_j} \sum_{i=1}^{I_j} \mu_{ij}'(\hat\theta) \right]^2.

For MC items,

\mu_{ij}'(\hat\theta) = \frac{1.7 a_{ij} [1 - P_{ij}(\hat\theta)][P_{ij}(\hat\theta) - c_{ij}]}{1 - c_{ij}}   (15)

and

I(\theta, \hat\theta) = \sum_{i=1}^{n_{MC}} \frac{[\mu_{ij}'(\hat\theta)]^2}{P_{ij}(\hat\theta)[1 - P_{ij}(\hat\theta)]} = \sum_{i=1}^{n_{MC}} \frac{(1.7 a_{ij})^2 [1 - P_{ij}(\hat\theta)][P_{ij}(\hat\theta) - c_{ij}]^2}{P_{ij}(\hat\theta)(1 - c_{ij})^2},   (16)

where n_{MC} is the total number of MC items in the whole test. For CR items,

\mu_{ij}'(\hat\theta) = a_{ij} \sum_{k=1}^{m_i} (k - 1) P_{ijk}(\hat\theta) [(k - 1) - \mu_{ij}(\hat\theta)] = a_{ij} \left\{ \sum_{k=1}^{m_i} (k - 1)^2 P_{ijk}(\hat\theta) - \mu_{ij}(\hat\theta)^2 \right\}   (17)

and

I(\theta, \hat\theta) = \sum_{i=1}^{n_{CR}} \frac{[\mu_{ij}'(\hat\theta)]^2}{\sum_{k=1}^{m_i} (k - 1)^2 P_{ijk}(\hat\theta) - \mu_{ij}(\hat\theta)^2} = \sum_{i=1}^{n_{CR}} a_{ij}^2 \left[ \sum_{k=1}^{m_i} (k - 1)^2 P_{ijk}(\hat\theta) - \mu_{ij}(\hat\theta)^2 \right],   (18)

where n_{CR} is the total number of CR items in the whole test. The definitions of \mu_{ij}(\hat\theta) are given in Equations (6) and (7). The variance of the Yen T_j can be expressed as

\text{Var}(\text{Yen } T_j) = \frac{p_j q_j}{(p_j + q_j)^2 (p_j + q_j + 1)},   (19)

where p_j and q_j are defined in Equations (10) to (13).

Wainer method

In vector notation, for the multivariate situation involving several objective scores collected in the vector x,

\text{REG } T = \bar{x} + \hat{B}(x - \bar{x}).   (20)

Here \bar{x} is the mean vector containing the means of each objective involved, and \hat{B} is a matrix that is the multivariate analog of the estimated reliability of each objective. All that is needed to calculate REG T is an estimate of B. The equation for calculating B is

B = S_{\text{true}} (S_{\text{obs}})^{-1},   (21)

where S_{\text{obs}} is the observed variance-covariance matrix of the objective scores and S_{\text{true}} is the variance-covariance matrix of the true objective scores. Because errors are uncorrelated with true scores, the off-diagonal elements of \Sigma_{\text{true}} and \Sigma_{\text{obs}}, the population variance-covariance matrices of the true objective scores and the observed objective scores, are equal. It is in the diagonal elements of S_{\text{obs}} and S_{\text{true}} that the difference arises. However, if the diagonal elements of S_{\text{obs}} are multiplied by the reliability, \sigma_\tau^2 / \sigma_x^2, of the subscale in question, the results are the diagonal elements of S_{\text{true}}. It is customary to estimate this reliability with Cronbach's coefficient alpha (Wainer et al., 2000). The score variance of the estimates for the vth objective is the vth diagonal element of the matrix

C = S_{\text{true}} (S_{\text{obs}})^{-1} S_{\text{true}}.   (22)

Shin method

Instead of using the empirical Bayes approach of the regressed score method, the MCMC regressed score method uses a fully Bayesian model. Here is an example with two objectives to illustrate the fully Bayesian estimation. If there are two objectives, called M and V, the observed scores and true scores of these two objectives are x_m, x_v, \tau_m, and \tau_v, respectively. The fully Bayesian model yields the posterior distributions (\tau_m \mid x_m, x_v), denoted MREG T_m, and (\tau_v \mid x_m, x_v), denoted MREG T_v. From classical test theory (CTT),

x_{pm} = \tau_{pm} + \epsilon_{pm}   (23)

and

x_{pv} = \tau_{pv} + \epsilon_{pv},   (24)

where p indexes the examinee. These can be re-parameterized via CTT equations as

x_{pm} = \tau_{pm} + \epsilon_{pm}, \quad \epsilon_{pm} \sim N(0, \sigma_{\epsilon_m}^2),   (25)

where

\tau_{pm} = \mu_m + \delta_{pm}, \quad \delta_{pm} \sim N(0, \sigma_m^2);   (26)

and

x_{pv} = \tau_{pv} + \epsilon_{pv}, \quad \epsilon_{pv} \sim N(0, \sigma_{\epsilon_v}^2),   (27)

where

\tau_{pv} = \mu_v + \delta_{pv}, \quad \delta_{pv} \sim N(0, \sigma_v^2).   (28)

Here \epsilon_{pm} and \epsilon_{pv} are the error terms, \mu_m and \mu_v are the common true scores for objective m and objective v, and \delta_{pm} and \delta_{pv} are the unique true-score components for examinee p on objective m and objective v. In the fully Bayesian model, instead of replacing the unknown parameters with maximum likelihood estimates (MLEs), as the empirical Bayes method does, it is necessary to place prior distributions on the unknown parameters such as \sigma_m^2, \sigma_v^2, \sigma_{\epsilon_m}^2, \sigma_{\epsilon_v}^2, and \sigma_{mv}. If it is assumed that

\begin{pmatrix} x_{pm} \\ x_{pv} \\ \tau_{pm} \\ \tau_{pv} \end{pmatrix} \sim N \left( \begin{pmatrix} \mu_m \\ \mu_v \\ \mu_m \\ \mu_v \end{pmatrix}, \begin{pmatrix} \sigma_m^2 + \sigma_{\epsilon_m}^2 & \sigma_{mv} & \sigma_m^2 & \sigma_{mv} \\ \sigma_{mv} & \sigma_v^2 + \sigma_{\epsilon_v}^2 & \sigma_{mv} & \sigma_v^2 \\ \sigma_m^2 & \sigma_{mv} & \sigma_m^2 & \sigma_{mv} \\ \sigma_{mv} & \sigma_v^2 & \sigma_{mv} & \sigma_v^2 \end{pmatrix} \right),   (29)

then

\tau_{pm} \mid x_{pm}, x_{pv} \sim N \left( \mu_m + \Sigma_{21} \Sigma_{11}^{-1} \begin{pmatrix} x_{pm} - \mu_m \\ x_{pv} - \mu_v \end{pmatrix}, \ \sigma_m^2 - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \right),   (30)

and

\tau_{pv} \mid x_{pm}, x_{pv} \sim N \left( \mu_v + \Sigma_{31} \Sigma_{11}^{-1} \begin{pmatrix} x_{pm} - \mu_m \\ x_{pv} - \mu_v \end{pmatrix}, \ \sigma_v^2 - \Sigma_{31} \Sigma_{11}^{-1} \Sigma_{13} \right),   (31)

where

\Sigma_{11} = \begin{pmatrix} \sigma_m^2 + \sigma_{\epsilon_m}^2 & \sigma_{mv} \\ \sigma_{mv} & \sigma_v^2 + \sigma_{\epsilon_v}^2 \end{pmatrix},   (32)

\Sigma_{21} = (\sigma_m^2, \ \sigma_{mv}),   (33)

\Sigma_{12} = \Sigma_{21}',   (34)

and

\Sigma_{31} = \Sigma_{13}' = (\sigma_{mv}, \ \sigma_v^2).   (35)

The priors for \mu_m, \mu_v, \sigma_m^2, \sigma_v^2, \sigma_{\epsilon_m}^2, \sigma_{\epsilon_v}^2, and \sigma_{mv} need to be specified. This model was coded in WinBUGS to estimate the fully Bayesian regressed objective scores. The variance of the estimate is the variance of the posterior distribution, and the 95% credibility interval is the interval between the .025 and .975 quantiles of the posterior distribution.

Results

The results of the comparison of the methods are presented in Figures 1 to 24. Figures 1 to 4 show the comparison of estimated reliability across the factors studied (the number of examinees, the number of items in each objective, the correlation between objectives, and the ratio of CR/MC items). For brevity, unless otherwise noted, the reliability in the following text refers to the estimated reliability. Figures

5 to 8 show the comparison of the width of the 95CI for each method. Figures 9 to 12 show the comparison of the percent coverage of the 95CI for each method. Figures 13 to 16 show the comparison of the absolute bias across the factors studied (the number of examinees, the number of items in each objective, the correlation between objectives, and the ratio of CR/MC items). Figures 17 to 20 show the comparison of the SD for each method. Figures 21 to 24 show the comparison of the RMSE for each method.

Comparison of Reliability

[Figure 1. Reliability for different numbers of examinees]

Figure 1 compares reliability for different numbers of examinees. From Figure 1 it can be seen that:

(1) The number of examinees did not impact the reliability of the objective score: the reliability for different numbers of examinees remained approximately the same, except for the Bock method. Even there, the largest reliability for the Bock method is .74, which is only about .03 higher than the lowest reliability, .71, for that method.
(2) The order of the magnitude of the objective score reliability across methods is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.

[Figure 2. Reliability for different numbers of items in each objective]

Figure 2 compares reliability for different numbers of items in each objective. From Figure 2 it can be seen that:

(1) The number of items in each objective affected the reliability of the objective score: the reliability increased as more items were included in each objective. The increase was steeper from 6 to 12 items and then became flatter from 12 to 18 items.
(2) The order of the magnitude of the objective score reliability across methods is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
(3) For the Bock method, the reliability became the same for the 12- and 18-item cases.

[Figure 3. Reliability for different correlations between objectives]

Figure 3 compares reliability for different correlations between objectives. From Figure 3 it can be seen that:
(1) The correlation between objectives affected the reliability of the objective score: the reliability increased as the correlation became higher. However, this effect did not appear for the proportion-correct method.
(2) The order of the magnitude of the objective score reliability across methods is approximately: Shin = Wainer > Yen > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
(3) The Bock method behaved uniquely in this case. When the correlation between objectives equaled .5, it had the lowest reliability, and when the correlation equaled 1.0, it had the highest reliability. This finding is related to the assumption underlying the Bock method. Because Bock's objective score is actually the IRT domain score, it needs to satisfy the IRT assumption that each objective measures the same thing (i.e., the unidimensionality assumption). When the correlation between objectives is 0.5, the assumption is violated, making it the worst method for estimating the objective score; when the correlation is 1.0, the assumption is met and it is the best method.

[Figure 4. Reliability for different ratios of CR/MC items]

Figure 4 compares reliability for different ratios of CR/MC items in each objective. From Figure 4 it can be seen that:
(1) The ratio of CR/MC items in each objective affected the reliability of the objective score: the reliability increased as the ratio of CR/MC items in each objective increased.
(2) The order of the magnitude of the objective score reliability across methods is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
(3) Because the CR items used in this study all have the same number of categories (three), more CR items means more possible points can be obtained in each

objective. That is, if there are more possible points in each objective, the reliability of the estimated objective scores will be higher.

To sum up: as the number of items per objective, the correlation between objectives, and the ratio of CR/MC items increased, the reliability of the estimated objective score also increased. The number of examinees did not affect the reliability of the objective score. Among the five methods studied, the estimated objective scores from the Shin and Wainer methods generally had the highest reliability, except when the correlation between objectives was 1.0; in that case, the Bock method yielded the highest reliability.

Comparison of the Width of 95CI

[Figure 5. Width of 95CI for different numbers of examinees]

Figure 5 compares the width of the 95CI for different numbers of examinees. From Figure 5 it can be seen that:
(1) The number of examinees did not affect the width of the 95CI.
(2) The order of the magnitude of the width of the 95CI across methods is: proportion-correct > Bock > Shin = Wainer > Yen.

[Figure 6. Width of 95CI for different numbers of items in each objective]

Figure 6 compares the width of the 95CI for different numbers of items per objective. From Figure 6 it can be seen that:

() The order of the magnitude of the width of 95CI for each method is: Proportioncorrect > Bock > Shin = Wainer > Yen..6.5 Width of 95CI.4 Methods Bock.3 Yen Wainer Shin. 0.5 0.8 1.0 Proportion-correct Correlation between objectives Figure 7. Width of 95CI for different correlation between objectives Figure 7 is the comparison of the width of 95CI for different correlation between objectives. From Figure 7 it can be seen that: (1) The correlation between objectives did not have impact on the width of 95CI for the Bock, proportion-correct, and Yen methods; but has slight effect on the Wainer and Shin methods. For these two methods, as the correlation between objectives increased, the width of 95CI slightly decreased. () The order of the magnitude of the width of 95CI for each method is: Proportioncorrect > Bock > Shin = Wainer > Yen. 6

[Figure 8. Width of 95CI for different ratios of CR/MC items]

Figure 8 compares the width of the 95CI for different ratios of CR/MC items. From Figure 8 it can be seen that:
(1) The ratio of CR/MC items did not affect the width of the 95CI except for the Bock method. For the Bock method, as the ratio of CR/MC items increased, the width of the 95CI slightly decreased.
(2) The order of the magnitude of the width of the 95CI across methods is: proportion-correct > Bock > Shin = Wainer > Yen.

To sum up: of the four factors studied, only the number of items per objective affected the width of the 95CI. More items per objective tended to lead to a narrower width

of the 95CI. The order of the magnitude of the width of the 95CI is consistent across the different factors. The Yen method has the narrowest 95CI.

Comparison of the Percent Coverage of 95CI

[Figure 9. Percent coverage of 95CI for different numbers of examinees]

Figure 9 compares the percent coverage of the 95CI for different numbers of examinees. From Figure 9 it can be seen that:
(1) Generally, the number of examinees did not affect the percent coverage of the 95CI.
(2) Almost all the methods have a conservative 95CI, because their percent coverage of the 95CI is larger than 95%. That means the nominal 95CI covers the true score more than 95% of the time.

[Figure 10. Percent coverage of 95CI for different numbers of items in each objective]

Figure 10 compares the percent coverage of the 95CI for different numbers of items in each objective. From Figure 10 it can be seen that:
(1) For the Bock and Yen methods, the number of items in each objective affected the percent coverage of the 95CI. Generally, as the number of items per objective increased, the percent coverage of the 95CI decreased.
(2) Almost all the methods have a conservative 95CI, because their percent coverage of the 95CI is larger than 95%. That means the nominal 95CI covers the true score more than 95% of the time.

[Figure 11. Percent coverage of 95CI for different correlations between objectives]

Figure 11 compares the percent coverage of the 95CI for different correlations between objectives. From Figure 11 it can be seen that:
(1) For the Bock and Yen methods, the correlation between objectives has an obvious impact on the percent coverage of the 95CI.
(2) Almost all the methods have a conservative 95CI, because their percent coverage of the 95CI is larger than 95%. That means the nominal 95CI covers the true score more than 95% of the time.

[Figure 12. Percent coverage of 95CI for different ratios of CR/MC items]

Figure 12 compares the percent coverage of the 95CI for different ratios of CR/MC items. From Figure 12 it can be seen that:
(1) For the Yen method, the CR/MC ratio has some impact on the percent coverage of the 95CI, but the pattern is not consistent across situations.
(2) Almost all the methods have a conservative 95CI, because their percent coverage of the 95CI is larger than 95%. That means the nominal 95CI covers the true score more than 95% of the time.

To sum up: the 95CIs for most of the methods, except the Yen method, are conservative. That is, the nominal 95CI covered the true score more than 95% of the time in the

simulation study. The Yen method has a 95CI that is closer to its nominal value of 95%. This may be due to the non-symmetrical property of the 95CI for the Yen method.

Comparison of Bias

[Figure 13. Bias for different numbers of examinees]

Figure 13 compares the Bias for different numbers of examinees. From Figure 13 it can be seen that:
(1) The number of examinees did not impact the Bias of the objective score: the Bias for different numbers of examinees remains approximately the same for each method.

(2) The order of the Bias across methods was: Bock > Wainer > Shin > Yen > proportion-correct. The magnitude of the Bias ranged from approximately 0.01 to 0.05. That is, if the perfect score is 100, the Bias is around 1 to 5 points.

[Figure 14. Bias for different numbers of items in each objective]

Figure 14 compares the Bias for different numbers of items in each objective. From Figure 14 it can be seen that:
(1) The number of items in each objective affected the Bias of the objective scores for the Wainer, Shin, and Yen methods: the Bias increased as more items were included in each objective. The increase was steeper from 6 to 12 items and then became flatter from 12 to 18 items. The number of items did not affect the Bias for the proportion-correct method. For the Bock method, the Bias decreased as the number of items

increased from 6 to 12, but increased slightly as the number of items increased from 12 to 18.
(2) The order of the Bias across methods was approximately: Bock > Wainer > Shin > Yen > proportion-correct.

[Figure 15. Bias for different correlations between objectives]

Figure 15 compares the Bias for different correlations between objectives. From Figure 15 it can be seen that:
(1) The correlation between objectives affected the Bias for each method: the Bias increased as the correlation became higher. However, this effect did not appear for the proportion-correct method.

(2) The order of the Bias across methods was approximately: Bock > Wainer > Shin > Yen > proportion-correct.
(3) The Bock method behaved uniquely in this case. When the correlation between objectives equaled .5, it had the highest Bias, but when the correlation equaled 1.0, its Bias was relatively low. This finding is related to the assumption underlying the Bock method. Because Bock's objective score is actually the IRT domain score, it needs to satisfy the IRT assumption that each objective measures the same thing (i.e., the unidimensionality assumption). When the correlation between objectives is 0.5, the assumption is violated, making it the worst method for estimating the objective score; when the correlation is 1.0, the assumption is met and the Bias is lower.

[Figure 16. Bias for different ratios of CR/MC items]

Figure 16 compares the Bias for different ratios of CR/MC items in each objective. From Figure 16 it can be seen that:
(1) The ratio of CR/MC items in each objective did not impact the Bias of the objective score.
(2) The order of the Bias across methods was approximately: Bock > Wainer > Shin > Yen > proportion-correct.

To sum up: the Bias was affected by two factors, the number of items per objective and the correlation between objectives. As the number of items per objective or the correlation between objectives increased, the Bias of the estimated objective score also increased. The number of examinees and the ratio of CR/MC items did not affect the Bias of the objective score. Among the five methods studied, the order of the magnitude of the Bias is consistent across the four factors. Generally, the order was: Bock > Wainer > Shin > Yen > proportion-correct. The magnitude of the Bias ranged approximately from 0.01 to 0.08.

Comparison of the SD

[Figure 17. SD for different numbers of examinees]

Figure 17 compares the SD for different numbers of examinees. From Figure 17 it can be seen that:
(1) The number of examinees did not impact the SD.
(2) The order of the SD across methods is: proportion-correct > Yen > Shin > Wainer > Bock.

[Figure 18. SD for different numbers of items in each objective]

Figure 18 compares the SD for different numbers of items per objective. From Figure 18 it can be seen that:
(1) The number of items per objective affected the SD: as the number of items per objective increased, the SD decreased.
(2) Generally, the order of the SD across methods is: proportion-correct > Yen > Shin > Wainer > Bock.

[Figure 19. SD for different correlations between objectives]

Figure 19 compares the SD for different correlations between objectives. From Figure 19 it can be seen that:
(1) The correlation between objectives affected the SD except for the proportion-correct method. For the other methods, as the correlation between objectives increased, the SD decreased.
(2) The order of the SD across methods is: proportion-correct > Yen > Shin > Wainer > Bock.

[Figure 20. SD for different ratios of CR/MC items]

Figure 20 compares the SD for different ratios of CR/MC items. From Figure 20 it can be seen that:
(1) The ratio of CR/MC items did not impact the SD.
(2) The order of the SD across methods is: proportion-correct > Yen > Shin > Wainer > Bock.

To sum up: of the four factors studied, only the number of items per objective and the correlation between objectives affected the SD. More items per objective or a higher correlation between objectives tended to lead to a smaller SD. The order of the SD is consistent across the different factors: proportion-correct > Yen > Shin > Wainer > Bock. The magnitude of the SD ranged approximately from 0.07 to 0.16.

Comparison of the RMSE

[Figure 21. RMSE for different numbers of examinees]

Figure 21 compares the RMSE for different numbers of examinees. From Figure 21 it can be seen that:
(1) Generally, the number of examinees did not impact the RMSE.
(2) The order of the RMSE across methods is: proportion-correct > Yen > Bock > Shin ≈ Wainer.

[Figure 22. RMSE for different numbers of items in each objective]

Figure 22 compares the RMSE for different numbers of items in each objective. From Figure 22 it can be seen that:
(1) The number of items in each objective affected the RMSE for each method. Generally, as the number of items per objective increased, the RMSE decreased.
(2) The order of the RMSE across methods is: proportion-correct > Yen > Bock > Shin ≈ Wainer.

[Figure 23. RMSE for different correlations between objectives]

Figure 23 compares the RMSE for different correlations between objectives. From Figure 23 it can be seen that:
(1) Except for the proportion-correct method, the correlation between objectives affected the RMSE: as the correlation between objectives increased, the RMSE decreased.
(2) The order of the RMSE across methods was generally: proportion-correct > Yen > Bock > Shin ≈ Wainer.
(3) It should be noted that the Bock method was greatly affected by the correlation between objectives. When the correlation equals 0.5, the assumption of the Bock method is strongly violated and its RMSE became high. However, when the

correlation equals 1.0, the Bock assumption is perfectly met and its RMSE became the lowest.

[Figure 24. RMSE for different ratios of CR/MC items]

Figure 24 compares the RMSE for different ratios of CR/MC items. From Figure 24 it can be seen that:
(1) The CR/MC ratio had a slight impact on the RMSE for each method. As the ratio increased, the RMSE slightly decreased.
(2) The order of the RMSE across methods was generally: proportion-correct > Yen > Bock > Shin ≈ Wainer.

To sum up: of the four factors studied, only the number of items per objective and the correlation between objectives affected the RMSE. More items per objective or

a higher correlation between objectives tended to lead to a smaller RMSE. The order of the RMSE is consistent across the different factors and was generally: proportion-correct > Yen > Bock > Shin ≈ Wainer. The magnitude of the RMSE ranged approximately from 0.08 to 0.16.

Discussion

Based on the results of this simulation study, it appears that using estimation methods other than the proportion-correct method improved the estimation of objective scores in terms of the reliability of the objective score. Factors that affected the reliability of objective scores included the number of items per objective, the correlation between objectives, and the ratio of CR/MC items. The number of examinees did not affect the reliability of the objective score. As the values of these factors increased, the reliability of the estimated objective score also increased. Only the number of items per objective affected the width of the 95CI and the percent coverage of the 95CI.

Generally, the objective scores estimated from the Wainer and Shin methods had the highest reliability, ranging approximately from 0.68 to 0.83. The objective scores estimated from the Yen and proportion-correct methods had relatively lower reliability; the range of reliabilities was from 0.59 to 0.75 for the proportion-correct method and from 0.65 to 0.80 for the Yen method. The studied factors seem to have had a larger impact on the Bock method, especially the correlation between objectives. When the correlation was 1.0, the Bock method had the highest reliability (around 0.85). Table 2 shows the effective gain in the number of items for each method versus the proportion-correct method, computed using the Spearman-Brown prophecy formula. It can be seen from Table 2 that the use of the Wainer and Shin methods can lead to an effective gain of

up to 1.63 times the number of items in a subscore. The Bock method may yield an effective gain of up to 1.89 times the number of items in an objective. For example, if there are 6 items in an objective, using the Wainer and Shin methods may yield the reliability that would be obtained with 9.78 (6 x 1.63 = 9.78) items in that objective under the proportion-correct method. In other words, to achieve the score reliability of the Wainer/Shin methods with the proportion-correct method, the number of items per objective must be increased by a factor of 1.63.

Table 2. The Effective Gain in Number of Items for Each Method

                               Wainer/Shin   Yen    Bock   Proportion-correct
Original reliability    min    0.68          0.65   0.58   0.59
                        max    0.83          0.80   0.85   0.75
Number of items gained  min    1.48          1.29   0.96   1.00
                        max    1.63          1.33   1.89   1.00

On the percentage scale, the widths of the 95CI are approximately 27%, 35%, 50%, and 55% for the Yen method, the Wainer and Shin methods, the Bock method, and the proportion-correct method, respectively. That means that if an objective score is reported based on the Yen method, 95% of the time the true objective score will fall in a range whose width equals 27%. Similarly, 95% of the time a score based on the Wainer and Shin methods will fall in a range with width equal to 35%, and so on. Basically, the narrower the width, the more precise the estimation. However, this needs to be evaluated together with the percent coverage of the nominal CI (confidence or credibility interval). Generally, a good estimator should have a narrow 95CI and an accurate

percent coverage of the nominal CI. By combining the results for the 95CI width and the percent coverage of the 95CI, it can be seen that the Yen method provided a narrow 95CI and a relatively accurate percent coverage of the 95CI. The other methods are all too conservative, because their 95CIs covered the true value more than 95% of the time. Therefore, an adjusted 95CI should be developed and studied in order to obtain a more precise 95CI for these methods.

In general, using estimation methods other than the proportion-correct method improved the estimation of objective scores in terms of SD and RMSE. The proportion-correct method had the smallest Bias but the largest SD. Because the magnitude of the Bias is much smaller than that of the SD, the RMSE of the proportion-correct method became the largest. Therefore, the other methods are expected to estimate the objective scores better than the proportion-correct method. Although the Bock method had the smallest SD, it also had the largest Bias. The Wainer and Shin methods had SDs similar to the Bock method but smaller Bias, and therefore slightly smaller RMSE than the Bock method. Factors that affected the RMSE, SD, and Bias of objective scores included the number of items per objective and the correlation between objectives. The number of examinees and the ratio of CR/MC items did not affect the RMSE, SD, and Bias of objective scores. Table 3 shows the maximum and minimum RMSE values for each method. Using methods other than the proportion-correct method reduces the RMSE by as much as 0.055, which is 33 percent of the proportion-correct RMSE.
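As a worked illustration of the effective-gain computation behind Table 2 (using the reliabilities reported above and the standard Spearman-Brown prophecy relation, solved for the lengthening factor k):

\rho_k = \frac{k \rho}{1 + (k - 1)\rho}
\quad\Longrightarrow\quad
k = \frac{\rho_{\text{method}} (1 - \rho_{\text{pc}})}{\rho_{\text{pc}} (1 - \rho_{\text{method}})},
\qquad
k_{\max} = \frac{0.83 \times (1 - 0.75)}{0.75 \times (1 - 0.83)} \approx 1.63,

which reproduces the maximum gain reported for the Wainer/Shin methods in Table 2.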