Decision consistency and accuracy indices for the bifactor and testlet response theory models

University of Iowa, Iowa Research Online
Theses and Dissertations, Summer 2014

Decision consistency and accuracy indices for the bifactor and testlet response theory models
Lee James LaFond, University of Iowa
Copyright 2014 Lee LaFond

This dissertation is available at Iowa Research Online.

Recommended Citation: LaFond, Lee James. "Decision consistency and accuracy indices for the bifactor and testlet response theory models." PhD (Doctor of Philosophy) thesis, University of Iowa, 2014.

Part of the Educational Psychology Commons.

DECISION CONSISTENCY AND ACCURACY INDICES FOR THE BIFACTOR AND TESTLET RESPONSE THEORY MODELS

by Lee James LaFond

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations in the Graduate College of The University of Iowa

August 2014

Thesis Supervisor: Associate Professor Won-Chan Lee

Copyright by LEE JAMES LAFOND 2014. All Rights Reserved.

Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Lee James LaFond has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations at the August 2014 graduation.

Thesis Committee: Won-Chan Lee, Thesis Supervisor; Robert Brennan; Kate Cowles; Deborah Harris; Michael Kolen; Donald Yarbrough

ACKNOWLEDGMENTS

I would like to thank several people for all of their valuable help in both writing this thesis and guiding my experience in graduate school. First, I would like to thank Dr. Won-Chan Lee, who went above the call of duty in serving as my dissertation chair. In addition to the thoughtful and thorough feedback he provided for this study, I am also extremely grateful for all of the programming work that went into developing mirt-class. Next, I would like to thank Dr. Donald Yarbrough, who in addition to serving on my committee, also served as my academic advisor and supervisor at the Center for Evaluation and Assessment. I cannot think of a better person to have served as a mentor and guide to the field of program evaluation. I also would like to thank all of the other members of my committee: Dr. Robert Brennan, Dr. Michael Kolen, Dr. Deborah Harris, and Dr. Kate Cowles. I am deeply honored to have such key leaders in the field guide me through this process. Finally, I would like to thank Dr. Anna Topczewski for her friendship as a fellow graduate student in the program, and for all of the assistance she provided for the simulation study.

ABSTRACT

The primary goal of this study was to develop a new procedure for estimating decision consistency and accuracy indices using the bifactor and testlet response theory (TRT) models. This study is the first to investigate decision consistency and accuracy from a multidimensional perspective, and the results have shown that the bifactor model at least behaved in a way that met the author's expectations and represents a potentially useful procedure. The TRT model, on the other hand, did not meet the author's expectations and generally showed poor model performance. The multidimensional decision consistency and accuracy indices proposed in this study appear to provide good performance, at least for the bifactor model, in the case of a testlet effect of large magnitude. For practitioners examining a test containing testlets for decision consistency and accuracy, a recommended first step is to check for dimensionality. If the testlets show a significant degree of multidimensionality, then the multidimensional indices proposed here can be recommended, as the simulation study showed an improved level of performance over unidimensional IRT models. However, if there is not a significant degree of multidimensionality, then the unidimensional IRT models and indices would perform as well as, or even better than, the multidimensional models.

Another goal of this study was to compare methods for numerical integration used in the calculation of decision consistency and accuracy indices. This study investigated a new method (the M method) that samples ability estimates through a Monte Carlo approach. In summary, the M method seems to be just as accurate as the other commonly used methods for numerical integration, but it has some practical advantages over the D and P methods. As previously mentioned, it is not nearly as computationally intensive as the D method. Also, the P method requires large sample sizes. In addition, the P method has a conceptual disadvantage in that the conditioning variable, in theory, should be the true theta, not an estimated theta. The M method avoids both of these issues and seems to provide equally accurate estimates of decision consistency and accuracy indices, which makes it a strong option, particularly in multidimensional cases.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF SYMBOLS

CHAPTER 1: INTRODUCTION
    Decision Consistency
    Decision Accuracy
    Measurement Models
        Unidimensional 3PL Item Response Theory Model
        Bifactor IRT Model
        Testlet Response Theory Model
    Decision Consistency Indices
    Decision Accuracy Indices
    Purpose of the Study

CHAPTER 2: LITERATURE REVIEW
    Introduction
    The Development of Decision Consistency and Accuracy Indices
        Carver Method
        Swaminathan-Hambleton-Algina Method
        Strong True Score Models
        Subkoviak and Compound Multinomial Models
        Livingston-Lewis Procedure
        Item Response Theory Methods
    Item Response Theory
    Bifactor Model
    Testlet Response Theory
    Model Fit
    Summary

CHAPTER 3: METHODOLOGY
    Data
    Establishing Cut-Scores
    Decision Consistency and Accuracy Under the Bifactor and TRT Models
    Model Specification
    Estimation
    Testlet Response Model
    Analysis
    Summary

CHAPTER 4: RESULTS
    Simulation Study Results
        Population Values of Phi and Gamma
        Estimated Values of Phi and Gamma
        Bias
        Standard Errors
        RMSE
        Summary
    D Method versus M Method Comparison
        Unidimensional Models
        Multidimensional Models
        Summary
    Real Data Results
        Unidimensionality, Model Fit, and Local Independence
        Marginal Classification Indices
        Summary

CHAPTER 5: DISCUSSION
    Purpose 1
    Purposes 2 and 3
    Purpose 4
    Limitations and Future Research
    Summary and Conclusion

REFERENCES

APPENDIX A: SAMPLE FLEXMIRT BIFACTOR SYNTAX
APPENDIX B: SAMPLE FLEXMIRT TRT SYNTAX

LIST OF TABLES

1.1. Mastery and Non-Mastery Outcomes on Two Administrations of the Same Test
Descriptive Statistics for Summed-Scores of Tests and Testlets for Form I
Descriptive Statistics for Summed-Scores of Tests and Testlets for Form II
Descriptive Statistics for Summed-Scores of Tests and Testlets for Form III
Summary of Models Employed in the Study
Summary of Model Priors for Estimation
Summary of Model Fit Statistics Employed in the Study
Simulation Phi and Gamma Results Based on Population Item Statistics
Simulation Study Decision Consistency Index (Phi) Means Under Various Models
Simulation Study Decision Accuracy Index (Gamma) Means Under Various Models
Simulation Study Decision Consistency Index Bias Under Various Models
Simulation Study Decision Accuracy Index Bias Under Various Models
Simulation Study Decision Consistency Index Absolute Value Bias Under Various Models
Simulation Study Decision Accuracy Index Absolute Value Bias Under Various Models
Simulation Study Decision Consistency Index SEs Under Various Models
Simulation Study Decision Accuracy Index SEs Under Various Models
Simulation Study Decision Consistency Index RMSEs Under Various Models
Simulation Study Decision Accuracy Index RMSEs Under Various Models
D method vs M method Decision Consistency Index Comparison for the UIRT Model
D method vs M method Decision Accuracy Index Comparison for the UIRT Model
D method vs M method Decision Consistency Index Comparison for the GRM Model
D method vs M method Decision Accuracy Index Comparison for the GRM Model
D method vs M method Decision Consistency Index Comparison for the Bifactor Model
D method vs M method Decision Accuracy Index Comparison for the Bifactor Model
D method vs M method Decision Consistency Index Comparison for the TRT Model
D method vs M method Decision Accuracy Index Comparison for the TRT Model
Summary Statistics for Test A Forms and Testlets
Summary Statistics for Test B Forms and Testlets
DIMTEST p-value Results for All Forms of Test A and Test B
Fit and LD Statistics for Test A
4.24. Fit and LD Statistics for Test B
Actual and Estimated Proportions for Test A with 50% Cut-score
Actual and Estimated Proportions for Test A with 80% Cut-score
Actual and Estimated Proportions for Test A with Both 50% and 80% Cut-scores
Actual and Estimated Proportions for Test B with 50% Cut-score
Actual and Estimated Proportions for Test B with 80% Cut-score
Actual and Estimated Proportions for Test B with Both 50% and 80% Cut-scores
Decision Consistency Scores (Phi) for Test A
Decision Accuracy Scores (Gamma) for Test A
Decision Consistency Scores (Phi) for Test B
Decision Accuracy Scores (Gamma) for Test B
Comparison of Estimated A Parameters for IRT Models in Test B Form I

LIST OF SYMBOLS

a_i: IRT discrimination/slope parameter
b_i: IRT difficulty/location parameter
c_i: IRT pseudo-chance parameter
d_i: IRT multidimensional intercept parameter
g(i): Testlet with nested item i
g(θ): Distribution of ability
g(τ): True score density function
h: Index denoting a specific classification category
i: Index denoting a specific item
j: Number of score points on a polytomous item
K: Total number of classification categories
n: Number of items in item set
N: Number of examinees
p: Probability of earning a particular score
Q: Number of quadrature points
S: Number of specific ability factors
S_X^2: Item fit statistic
u: An item score
U_i: A random variable representing responses for item i
x: Test summed-score
x_k: Cut-score in raw score metric (k = 1, 2, ..., K)
y: Vector of item responses
AIC: Akaike Information Criterion
BIC: Bayesian Information Criterion
CTT: Classical Test Theory
IRT: Item Response Theory
MIRT: Multidimensional Item Response Theory
TRT: Testlet Response Theory
UIRT: Unidimensional Item Response Theory
1PL: One-parameter logistic IRT model
2PL: Two-parameter logistic IRT model
3PL: Three-parameter logistic IRT model
β: Vector of item parameters
φ: Marginal decision consistency or agreement index
φ_c: Chance agreement
φ_θ: Conditional classification consistency index
γ: Marginal accuracy index
γ_θ: Conditional classification accuracy index
γ_θ^+: Conditional false positive error rate
γ_θ^-: Conditional false negative error rate
γ^+: Marginal false positive error rate
γ^-: Marginal false negative error rate
η: True classification
η: Accurate decision category
θ: Ability
θ: Vector of abilities
κ: Coefficient kappa
τ: True score
π: True proportion-correct score
Γ_g(i): Testlet effect for testlet with nested item i
σ^2_g(i): Variance of testlet effect
Χ^2: Local dependence index

CHAPTER 1
INTRODUCTION

A common application of test scores is to determine levels of examinee performance relative to specified cut-scores. Correspondingly, it is useful to know to what degree the classification is both consistent and accurate. Decision consistency describes the degree to which test takers are re-classified into the same category over parallel replications. Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores (Lee, 2010). Decision consistency and accuracy related to test scores are of great practical value to psychometricians, practitioners, and test takers. The consequences of NCLB and the role of high-stakes testing make proper determination of proficiency critical. For example, a high school student who is erroneously deemed non-proficient may be denied graduation in some states. As a result, there is a general expectation and need for test classifications to be both stable and accurate.

In order to determine proficiency levels, tests use items of various formats that have estimable parameters based on a chosen measurement model. One common type of item is the dichotomously scored multiple-choice item. This type of item is appealing because such items are generally regarded as reliable and efficient and can be objectively scored. One common usage of the multiple-choice item is in conjunction with passages. With this type of format, several items can draw from a common stimulus, thereby forming a testlet. Estimating parameters of these types of items with item response theory must be done with care due to the assumption of local independence. Under local independence, items are conditionally independent of each other given ability. More specifically, the likelihood of a particular test taker getting a certain item correct does not depend on how they performed on other items (Lord, 1980).

However, the existence of a common stimulus or passage can violate this assumption. For example, a student who is struggling to comprehend a particular reading passage is likely to find all of the related items more difficult in a correlated fashion.

Decision Consistency

As noted previously, decision consistency is the degree to which a test consistently classifies members of the same group into the same category over replication. For a straightforward example, see Table 1.1. An examination of the results of the two administrations of this particular test reveals that out of the 100 test takers: 50 demonstrated mastery on both administrations, 15 demonstrated non-mastery on both administrations, 10 demonstrated mastery on the first but not the second administration, and 25 demonstrated mastery on the second but not the first. The 50 test takers who demonstrated mastery both times and the 15 who demonstrated non-mastery both times show decision consistency; those 65 test takers had the same classification over both replications of the test. The remaining 35 test takers had inconsistent results over replication, demonstrating mastery on one administration but non-mastery on the other. Thus, in this particular example only 65 out of 100, or 65%, display decision consistency between the two replications. This degree of consistency is likely undesirable for those involved with the usage of this particular test. For example, the results for the 35% who had inconsistent classifications are inconclusive and not of any particular use. In addition, such a lack of consistency might suggest a lack of reliability of the test scores and the possibility that many of the classification results may be due only to chance. While 100% consistency may not be attainable, it is in the best interest of test makers to ensure that the classification of test results is as stable as possible.

Decision Accuracy

While decision consistency is desirable, it should be noted that a high level of consistency does not automatically imply that the results reflect the test takers' true ability or nature (Huynh, 1990). The accuracy or validity of these decisions is commonly assessed through content or validity studies, but indices for decision accuracy have been developed as well. Again, decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores. More specifically, decision accuracy describes how closely a test taker's true score classification aligns with their observed score classification (Haertel, 2006).

In decision accuracy, the concepts of false positives and false negatives are useful to consider. For example, a false positive would be a result of mastery when in reality the true classification of the student should be non-mastery. Similarly, a false negative would be a result of non-mastery when the true classification is mastery. Both of these results represent errors in classification and a lack of decision accuracy. More generally, a false positive occurs when a test taker is classified into a higher category than the true score indicates, and a false negative occurs when a test taker is classified into a lower category than the true score indicates.

Measurement Models

A primary consideration in determining decision consistency and accuracy is that the true ability and true scores of test takers are not known and must be estimated. In addition, item parameters such as difficulty, discrimination, and the pseudo-guessing parameter need to be estimated as well. Measurement models have been developed according to sets of assumptions for the purpose of calculating these estimates. In this study, the item response theory (IRT) framework is discussed with three specific measurement models: the unidimensional three-parameter logistic IRT (UIRT) model, the testlet response theory model, and the bifactor IRT model.

Unidimensional 3PL Item Response Theory Model

Unidimensional item response theory is a key measurement model in psychometrics and plays a foundational role in this study. Lord (1980) frames item response theory as a theory of statistical estimation that uses latent characterizations of individuals as predictors of observed responses. Hambleton (1993) describes several benefits and characteristics of item response theory in comparison to classical test theory. For example, one defining characteristic of IRT is that the item-ability relationship is clearly defined, whereas in classical test theory it is not specified. For the UIRT model, the probability that a person of a given ability gives a correct response can be represented by:

\[ P(\theta) = c_i + (1 - c_i)\,\frac{\exp\{1.7[a_i(\theta - b_i)]\}}{1 + \exp\{1.7[a_i(\theta - b_i)]\}}, \tag{1.1} \]

where θ is the latent trait (ability, skill, etc.), P(θ) is the probability of a correct response given a particular ability level, a_i is the discriminating power of the item, b_i is the difficulty of the item, and c_i is the pseudo-guessing parameter.

However, while UIRT is quite flexible and powerful, it is based on the key assumptions of unidimensionality and local independence. The unidimensionality assumption states that the observations on the items are solely a function of a single continuous latent person variable (de Ayala, 2009). For example, for a mathematics test to meet this assumption, a single latent mathematics proficiency variable is assumed to underlie the test taker's performance. This assumption can be violated from a number of perspectives. First, items can contain multiple content strands, which can lead to multidimensionality. For example, the performance of a test taker on a math problem that contains a significant amount of text in the stimulus may also partially depend on their level of reading ability.
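As an illustration, the following Python sketch evaluates Equation 1.1 for a single item; the function name and the item parameter values are made up for the example.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model (Equation 1.1)."""
    z = D * a * (theta - b)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

# Illustrative item: moderate discrimination, average difficulty, some guessing.
# At theta = b the probability is c + (1 - c)/2, here 0.60.
print(p_3pl(theta=0.0, a=1.2, b=0.0, c=0.2))
```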

Multidimensionality can also exist from a format perspective: students may perform differentially well on, for example, multiple-choice versus constructed-response items. As a result, it is common practice for practitioners to conduct dimensionality assessments to see if the assumption is violated to the extent that the model no longer fits the data. If multidimensionality exists, the degree of the violation should be examined to see if the model can still provide a reasonable representation.

The second assumption, local independence, states that the probability of correctly responding to any particular item is independent of the performance on any other items on the test. Lord (1980) defines this relationship as

\[ P(U_i = 1 \mid \theta) = P(U_i = 1 \mid \theta, u_j, u_k, \ldots) \quad (i \neq j, k, \ldots), \tag{1.2} \]

where θ denotes ability and u is the response for items i, j, k, etc. This assumption can be violated in a number of ways. For example, one item response may clue the response of another, thereby creating a dependency. Also, it is a common practice in assessment to tie a number of items to a particular stimulus or passage. As a result, dependency may exist between the items due to the common stimulus, resulting in a violation of local independence. This particular effect is of primary concern to this study and leads to the choice of the other two models to be examined.

Bifactor IRT Model

Full-information item bifactor analysis (Gibbons et al., 2007; Gibbons & Hedeker, 1992) allows for multidimensionality between item types. A sample item bifactor measurement structure could have the following factor pattern:

\[ \begin{pmatrix} a_{10} & a_{11} & 0 \\ a_{20} & a_{21} & 0 \\ a_{30} & 0 & a_{32} \\ a_{40} & 0 & a_{42} \\ a_{50} & 0 & a_{52} \end{pmatrix} \]

The first subscript denotes the item and the second corresponds to a dimension. For example, the above pattern could represent the factor structure for five reading items, with two nested within one passage and three in another. The first column represents a general reading dimension, while the second and third columns are dimensions specific to each of the passages.
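The zero/nonzero layout of such a pattern follows directly from a testlet assignment, as in the sketch below; the item-to-passage mapping is the illustrative one just described.

```python
import numpy as np

# Specific (passage) factor for each of the five items: items 1-2 belong to
# passage 1 and items 3-5 to passage 2, matching the pattern shown above.
testlet_of_item = [0, 0, 1, 1, 1]
n_items = len(testlet_of_item)
n_specific = max(testlet_of_item) + 1

# Column 0 marks the general-factor loading; each item also loads on exactly
# one specific factor, so every row has two nonzero entries.
pattern = np.zeros((n_items, 1 + n_specific), dtype=int)
pattern[:, 0] = 1
pattern[np.arange(n_items), 1 + np.array(testlet_of_item)] = 1
print(pattern)
```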

Thus, this model controls for the passage effect that potentially violates the unidimensionality assumption of the UIRT model. This study will focus on the bifactor IRT model (Cai, Yang, & Hansen, 2011), which is an extension of the standard UIRT model. For dichotomous items, the bifactor model with general factor θ_0 and one specific factor θ_s is:

\[ P(\theta_0, \theta_s) = c_i + (1 - c_i)\,\frac{\exp[1.7(a_{0i}\theta_0 + a_{si}\theta_s + d_i)]}{1 + \exp[1.7(a_{0i}\theta_0 + a_{si}\theta_s + d_i)]}. \tag{1.3} \]

Above, c_i is the pseudo-guessing parameter, d_i is the item intercept, a_{0i} is the item slope on the general factor, and a_{si} is the item slope on the specific factor s. Note that the item slopes are similar in interpretation to discrimination in the UIRT model, but are specific to each factor.
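A minimal Python sketch of Equation 1.3 follows; the function name and parameter values are illustrative only.

```python
import numpy as np

def p_bifactor(theta0, theta_s, a0, a_s, d, c, D=1.7):
    """Correct-response probability for a dichotomous item under the bifactor
    model (Equation 1.3), with a general and one specific ability."""
    z = D * (a0 * theta0 + a_s * theta_s + d)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

# Illustrative item: general slope 1.2, specific (passage) slope 0.6.
print(p_bifactor(theta0=0.5, theta_s=-1.0, a0=1.2, a_s=0.6, d=0.0, c=0.2))
```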

Testlet Response Theory Model

Testlets are a popular way of structuring items in which multiple items are attached to a single stimulus or passage. One common usage is with reading passages from which several items draw inferences. Wainer, Bradlow, and Wang (2007) state that, for a typical reading passage followed by four to six associated items, the local independence assumption does not hold. Reducing the length of the passage was found to reduce this effect, but the construct measured was then no longer the same. Attaching only one item to the passage also eliminates the problem, but at the cost of being very inefficient in terms of the time needed to read the passage relative to the amount of information gained. Thus, there is a need for a model where the unit of test construction is smaller than the whole test but larger than a single item. Wainer and Kiely (1987) argued for such a model and proposed the testlet as the unit of construction. Of particular interest to this study is testlet response theory (TRT), which was developed by Wainer et al. (2007). Essentially, it expands the 3PL UIRT model to:

\[ P(\theta) = c_i + (1 - c_i)\,\frac{\exp\{1.7[a_i(\theta - b_i - \Gamma_{g(i)})]\}}{1 + \exp\{1.7[a_i(\theta - b_i - \Gamma_{g(i)})]\}}, \tag{1.4} \]

where Γ_g(i) is the testlet effect of a particular respondent for item i nested within testlet g(i). Note that if Γ_g(i) = 0, there is no testlet effect and the model simplifies to the UIRT model. However, as described in DeMars (2006), the testlet response model is simply a constrained bifactor model. In the bifactor model, testlet slopes are independent of general slopes; under the testlet response model, the testlet slopes are proportional to the general slopes. Thus, while the estimation of ability and item parameters is impacted, the same general model could be used to estimate decision consistency and accuracy indices.
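The following sketch evaluates Equation 1.4; the function name, parameter values, and the size of the testlet effect are made up for illustration.

```python
import numpy as np

def p_trt(theta, gamma_g, a, b, c, D=1.7):
    """Correct-response probability under the TRT model (Equation 1.4).
    gamma_g is the respondent's effect for the testlet containing the item;
    setting gamma_g = 0 recovers the unidimensional 3PL model."""
    z = D * a * (theta - b - gamma_g)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

# A respondent who struggles with this passage (positive testlet effect)
# has a lower success probability than the 3PL model alone would give.
print(p_trt(theta=0.0, gamma_g=0.5, a=1.2, b=0.0, c=0.2))
print(p_trt(theta=0.0, gamma_g=0.0, a=1.2, b=0.0, c=0.2))
```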

Decision Consistency Indices

Classification accuracy and consistency indices can be estimated using a wide variety of models. In the summed-score metric, indices exist both for item response theory (Huynh, 1990) and classical test theory (Livingston & Lewis, 1995). More recently, theta-metric indices have been developed by Rudner (2001) and Guo (2006). Specifically, this study will focus on the IRT methods described in Schulz et al. (1999), Wang et al. (2000), and Lee et al. (2002), and more recently generalized in Lee (2010). Note, however, that these methods were developed with the UIRT model in mind, and the primary purpose of this study is to develop a procedure for the TRT and bifactor models.

Assume that a test score is found by summing all of the item scores on a particular form. Also, let x_1, x_2, ..., x_{K-1} denote a set of cut-scores that are used to classify examinees into K mutually exclusive categories. So, a score less than x_1 would be placed in the first category, a score equal to or greater than x_1 but less than x_2 would be placed in the second category, and so on. The conditional category probability can be computed by summing conditional summed-score probabilities for all scores that belong to category h:

\[ p_\theta(h) = \sum_{x = x_{h-1}}^{x_h - 1} \Pr(X = x \mid \theta), \tag{1.5} \]

where h = 1, 2, ..., K. Once the above probability is calculated, it is useful to know the probability of an individual with a given ability level being placed in the same category on two separate parallel administrations of a test. This is the conditional classification consistency index (φ_θ), which can be calculated as:

\[ \phi_\theta = \sum_{h=1}^{K} [p_\theta(h)]^2. \tag{1.6} \]

Finally, given the above, classification consistency across all levels of ability can be calculated. Given the distribution of ability, g(θ), the marginal classification consistency index φ is computed as:

\[ \phi = \int \phi_\theta\, g(\theta)\, d\theta. \tag{1.7} \]

Decision Accuracy Indices

Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores. However, true scores, true ability levels, and true classifications are not known, and statistical techniques and assumptions need to be employed in order for an estimate to be made. First, suppose that the expected summed-score of the test taker is their true score τ. Next, suppose a set of true cut-scores on the summed-score metric, τ_1, τ_2, ..., τ_{K-1}, determines the true categorization of each test taker with θ or τ. Also, assume that the conditional probabilities, p_θ(h), from Equation 1.5 are known. Finally, the true categorical status, η (= 1, 2, ..., K), can be determined by comparing the expected summed-score for θ with the true cut-scores. Following from the above, the conditional classification accuracy index is:

\[ \gamma_\theta = p_\theta(\eta), \quad \text{for } \theta \in \eta. \tag{1.8} \]

Next, by integrating over all ability levels, the marginal classification accuracy index is:

\[ \gamma = \int \gamma_\theta\, g(\theta)\, d\theta. \tag{1.9} \]
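To make Equations 1.5-1.9 concrete, the sketch below computes φ and γ for a tiny 3PL test by brute-force enumeration of response patterns; all item parameters, cut-scores, and the discretized normal ability distribution are invented for the example (an operational implementation would use a recursion such as the Lord-Wingersky algorithm discussed in Chapter 2).

```python
import numpy as np
from itertools import product

# Invented parameters for a 5-item 3PL test, one observed cut-score (X >= 3
# is "mastery"), and one true cut-score on the expected summed-score metric.
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
c = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
cuts, tau_cuts = [3], [3.0]

def p_correct(theta):
    z = 1.7 * a * (theta - b)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def summed_score_dist(theta):
    """Pr(X = x | theta) by enumerating response patterns (fine for short tests)."""
    p = p_correct(theta)
    dist = np.zeros(len(a) + 1)
    for resp in product([0, 1], repeat=len(a)):
        dist[sum(resp)] += np.prod(np.where(resp, p, 1.0 - p))
    return dist

def category_probs(theta):
    """p_theta(h), Equation 1.5: summed-score probabilities within each category."""
    dist = summed_score_dist(theta)
    edges = [0] + cuts + [len(a) + 1]
    return np.array([dist[edges[h]:edges[h + 1]].sum() for h in range(len(edges) - 1)])

def true_category(theta):
    """Category of the expected summed-score relative to the true cut-scores."""
    return int(np.searchsorted(tau_cuts, p_correct(theta).sum(), side="right"))

# Marginalize over a discretized standard normal ability distribution.
nodes = np.linspace(-4.0, 4.0, 81)
weights = np.exp(-0.5 * nodes ** 2)
weights /= weights.sum()

phi = sum(w * (category_probs(t) ** 2).sum() for t, w in zip(nodes, weights))         # Eqs. 1.6-1.7
gamma = sum(w * category_probs(t)[true_category(t)] for t, w in zip(nodes, weights))  # Eqs. 1.8-1.9
print(round(phi, 3), round(gamma, 3))
```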

Note that in order for the indices in Equations 1.7 and 1.9 to be estimated, it is necessary to approximate the integral associated with the θ distribution. There are two approaches typically employed: the D method and the P method (Lee, 2010). The D method is a distributional approach that uses estimated quadrature points and weights, replacing the integral with summations. The P method uses individual θ estimates to calculate the individual conditional classification indices, which are then averaged over all examinees. Lee (2010) found that the D and P methods produced similar results, but suggested that the D method be used when the focus of investigation is at the group level, and the P method when the focus is on the individual.
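The contrast can be sketched as follows; phi_theta here is a toy stand-in for a model-based conditional consistency index such as Equation 1.6, and the ability estimates are simulated.

```python
import numpy as np

def phi_theta(theta):
    """Toy stand-in for a conditional consistency index computed from a model."""
    return 0.6 + 0.3 * np.tanh(theta) ** 2

# D method: marginalize over an assumed ability distribution with quadrature.
nodes = np.linspace(-4.0, 4.0, 41)
weights = np.exp(-0.5 * nodes ** 2)
weights /= weights.sum()
phi_D = np.sum(weights * phi_theta(nodes))

# P method: average the conditional index over individual ability estimates.
theta_hat = np.random.default_rng(1).normal(size=5000)   # simulated estimates
phi_P = phi_theta(theta_hat).mean()

print(round(phi_D, 3), round(phi_P, 3))   # similar values, as in Lee (2010)
```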

Purpose of the Study

The goal of this study is to investigate decision consistency and accuracy indices based on the UIRT, TRT, and bifactor models. The UIRT model has been thoroughly investigated in past research and will be the most straightforward of the models. However, the bifactor model is multidimensional and will be the primary focus of this study. The presence of multiple thetas does not fit with the current methods for estimating indices, and as a result a new procedure needs to be developed. This new procedure is the first to address multidimensionality in estimating decision consistency and accuracy indices, and as such represents a meaningful contribution to the literature. The TRT model, which is a constrained version of the bifactor model, will use the same general procedures as the bifactor model. Specifically, the purposes of this study are:

1. Develop a new procedure for estimating decision consistency and accuracy indices using the bifactor and TRT models.
2. Compare decision consistency and accuracy indices between the UIRT, TRT, and bifactor models using simulated and real testlet data from various sources. In addition, use the 4P BB and GRM models as a baseline comparison.
3. Investigate how the placement of cut-scores and the degree of multidimensionality affect the estimates of decision consistency and accuracy indices under the UIRT, TRT, and bifactor models.
4. Compare the different numerical integration methods used to calculate the indices.

Ideally, it is the hope of this study to offer practitioners another credible option for measuring the degree of decision consistency and accuracy in assessments. At present, no other decision consistency or accuracy index explicitly accounts for multidimensionality. In particular, the indices presented in this study directly address the multidimensionality caused by the presence of testlets. In tests that employ testlets, these new indices have the potential to provide a more accurate picture of decision consistency and accuracy, making them a potentially useful new procedure for practitioners.

Table 1.1. Mastery and Non-Mastery Outcomes on Two Administrations of the Same Test

                              Administration 2
Administration 1      Mastery    Non-Mastery    Total
Mastery                  50          10           60
Non-Mastery              25          15           40
Total                    75          25          100

CHAPTER 2
LITERATURE REVIEW

This chapter reviews the literature on decision consistency and accuracy indices in order to provide a history and broader context that will serve as a foundation for this study. First, this chapter provides the background and motivation for criterion-referenced testing. Next, a history of the development and evolution of consistency and accuracy indices is provided. In addition, the background is explored for the three models employed in this study: the unidimensional item response theory model, the testlet response theory model, and the bifactor model. Finally, there is a discussion of how indices will be estimated for the different models and a brief overview of assessing model fit.

Introduction

In order for a test score to be meaningfully interpreted, a test needs to be referenced in some way. In other words, the score needs to be compared to something external to the test as a point of reference. For example, one possible way to do this is through norm-referenced testing. In norm-referenced testing, derived scores (e.g., percentile ranks, grade-equivalent scores) are constructed in a way that conveys information about the relative standing of the test taker to others in a defined group. This defined group is referred to as the norm group, and the derived scores are known as norm-referenced scores (Nitko, 1980). However, for some uses norm-referenced scores can be insufficient. For example, you may want to know if a particular student has mastered the prerequisite skills necessary to be successful in a math course. Here a norm-referenced interpretation is not particularly useful: you know the position of the student within the norm group, but you do not know if the student has sufficient mastery. What is needed is some sort of criterion by which mastery is gauged, and this is the motivation for criterion-referenced testing.

A common application of criterion-referenced test scores is to determine levels of test taker performance relative to specified cut-scores. With regard to this, it is useful to know to what degree the classification is both consistent and accurate. Decision consistency describes the degree to which test takers are re-classified into the same category over parallel replications. Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores (Lee, 2010). Decision consistency and accuracy related to test scores are of great practical value to psychometricians, practitioners, and test takers. The consequences of NCLB and the role of high-stakes testing make proper determination of proficiency critical. For example, a high school student who is erroneously deemed non-proficient may be denied graduation in some states. Similarly, a nurse who is mistakenly deemed proficient when there are deficiencies may not be capable of providing adequate care. In particular, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) state that "when a test or combination of measures is used to make categorical decisions, estimates should be provided of the percentage of examinees who would be classified in the same way on two applications of the procedure, using the same form or alternate forms of the instrument" (p. 35). As a result, there is a general expectation and need for test classifications to be examined as part of the validation process.

In order to determine proficiency levels, tests use items of various formats that have estimable parameters based on a chosen measurement model. One common type of item is the dichotomously scored multiple-choice item. This type of item is appealing because such items are generally regarded as reliable and efficient and can be objectively scored. One common usage of the multiple-choice item is in association with passages. With this type of format, several items can draw from a common stimulus, thereby forming a testlet. Estimating parameters for these types of items with item response theory must be done with care due to the assumption of local independence.

Under local independence, items are conditionally independent of each other given ability; in other words, the likelihood of a particular test taker getting a certain item correct does not depend on how they performed on other items (Lord, 1980). However, the existence of a common stimulus or passage can violate this assumption.

The Development of Decision Consistency and Accuracy Indices

Carver Method

The earliest formal method for determining classification consistency appears in Carver (1970). Essentially, this method compared the percentage of mastery on two parallel administrations of a test. If the two percentages were equal, the test was considered reliable. The clear weakness of this method is that even if the two percentages are identical, the test could still be unreliable with regard to the performance of individual test takers. For example, half of the test takers could be considered masters on the first administration and the other half could be masters on the second: the results are reversed, yet the percentage of mastery is the same, which is an unstable result.

Swaminathan-Hambleton-Algina Method

Following Carver's method is the Swaminathan-Hambleton-Algina method, developed in Hambleton and Novick (1973) and Swaminathan et al. (1974). This method suggested that the proportion of individuals consistently classified should be the measure of reliability of decision consistency. Referring back to Table 1.1, fifty out of one hundred students demonstrated mastery on both administrations and fifteen demonstrated non-mastery on both administrations. Therefore, the percentage of consistent classification can be calculated as:

\[ \phi = \sum_{k=1}^{m} \phi_{kk}, \tag{2.1} \]

28 15 The upper limit of φ is 1.00, which is perfect consistency, and the lower limit is generally the proportion of consistent decisions that could be expected by chance. This is defined as: m φ c = k=1 p k. p.k, (2.2) where p k. and p.k are the proportions assigned to category k on both forms. For example, for Table 1.1 this is calculated as: φ c = ( ) ( ) + ( ) ( ) =.51. Note that while the proportion of consistent decisions in the example is 0.65, the proportion expected by chance is very high at This suggests that a sizeable proportion of the consistent decisions is due to chance at this particular cut-score, and not due to the reliability of the test itself. Swaminathan et al. (1974) suggested using Cohen s (1960) kappa coefficient to remove the chance proportion to determine the proportion of consistent decisions that can be expected beyond chance. The kappa coefficient is calculated by: κ = φ φ c = =.29. (2.3) 1 φ c 1.51 Here, the upper limit is again 1.00, suggesting perfect consistency, while the lower limit is theoretically lower than zero. In this particular example, the low kappa value suggests that once the proportion for chance agreement is removed the remaining degree of classification consistency is quite low. Strong True Score Models Note that the Swaminathan-Hambleton-Algina Method still requires the administration of two parallel forms of the test. However, often it is not practical or possible to give two parallel administrations of the same test in an achievement test context. Thus, it is useful to have an index that can be calculated from a single administration of a test. The beta-binomial model developed by Huynh (1976) is an example of one such index.

Strong True Score Models

Note that the Swaminathan-Hambleton-Algina method still requires the administration of two parallel forms of the test. However, it is often not practical or possible to give two parallel administrations of the same test in an achievement testing context. Thus, it is useful to have an index that can be calculated from a single administration of a test. The beta-binomial model developed by Huynh (1976) is an example of one such index.

Hanson and Brennan (1990) compared three different strong true score beta-binomial models (two-parameter, four-parameter, and four-parameter compound binomial) with regard to estimating classification indices. Strong true score models consider the probability that the summed-score random variable X of a test (with n dichotomously scored items) equals i (i = 0, ..., n), as:

\[ \Pr(X = i) = \int_0^1 \Pr(X = i \mid \pi)\, g(\pi)\, d\pi, \tag{2.4} \]

where π is the proportion-correct true score, g(π) is the true score density function, and Pr(X = i | π) is the conditional error distribution. Here, g(π) is assumed to belong to a certain parametric class, and Pr(X = i | π) is assumed to be either binomial or an approximation of a compound binomial distribution.

Each of the three models Hanson and Brennan examined has its own set of assumptions. For the two-parameter beta-binomial model, the true score distribution is beta and the conditional error distribution is binomial. For the four-parameter beta-binomial model, the true score distribution is four-parameter beta (Lord, 1965) and the conditional error distribution is binomial. For the four-parameter beta compound binomial model, the true score distribution is four-parameter beta and the conditional error distribution is a two-term approximation to the compound binomial distribution.

Here, classification consistency is defined as the consistency with which examinees are categorized on the basis of two independent administrations. Independence is defined so that the summed-scores on the two administrations (X_1 and X_2) are conditionally independent and identically distributed. Assuming two categories of classification, the bivariate distribution of X_1 and X_2 is:

\[ \Pr(X_1 = i, X_2 = j) = \int_0^1 \Pr(X_1 = i \mid \pi)\, \Pr(X_2 = j \mid \pi)\, g(\pi)\, d\pi. \tag{2.5} \]

Again assuming two categories of classification, from Equation 2.5 the classification index φ is defined as:

\[ \phi = \sum_{i=0}^{x_0 - 1}\sum_{j=0}^{x_0 - 1} \Pr(X_1 = i, X_2 = j) + \sum_{i=x_0}^{n}\sum_{j=x_0}^{n} \Pr(X_1 = i, X_2 = j). \tag{2.6} \]

The classification index φ is the probability that two summed-scores from parallel independent administrations are either both less than the cut-score x_0 (non-mastery) or both greater than or equal to x_0 (mastery). As given in Equation 2.3, the coefficient κ = (φ − φ_c)/(1 − φ_c). Here, the probability of chance agreement φ_c is given by:

\[ \phi_c = \left[\sum_{i=0}^{x_0 - 1} \Pr(X_1 = i)\right]\left[\sum_{j=0}^{x_0 - 1} \Pr(X_2 = j)\right] + \left[\sum_{i=x_0}^{n} \Pr(X_1 = i)\right]\left[\sum_{j=x_0}^{n} \Pr(X_2 = j)\right]. \tag{2.7} \]

Note that since X_1 and X_2 are independent and identically distributed,

\[ \sum_{i=0}^{x_0 - 1} \Pr(X_1 = i) = \sum_{j=0}^{x_0 - 1} \Pr(X_2 = j) = p_0, \tag{2.8} \]

where p_0 is the marginal probability that a test taker scores below the cut-score x_0. Also,

\[ \sum_{i=x_0}^{n} \Pr(X_1 = i) = \sum_{j=x_0}^{n} \Pr(X_2 = j) = p_1, \tag{2.9} \]

where p_1 is the marginal probability that a test taker scores equal to or greater than the cut-score x_0. As a result, φ_c = p_0^2 + p_1^2, and φ_c does not depend on the actual pair of test scores for a test taker. Thus, given the assumptions made, only a single administration is necessary to calculate the indices for classification consistency.

In their comparison of the three beta-binomial models, Hanson and Brennan found that the two-parameter model often demonstrates lack of fit. They recommended that before the two-parameter model is used, the adequacy of the model in fitting the raw observed scores be evaluated. In the cases where the two-parameter model does not fit, the four-parameter models may provide better fit. According to their results, the four-parameter beta-binomial and the four-parameter beta compound binomial models provided very similar results. If neither the two-parameter nor the four-parameter models fit, they suggest using a more complex model similar to those discussed by Wilcox (1981).
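As an illustration, the sketch below evaluates Equations 2.4-2.9 for a two-parameter beta-binomial model; the beta shape parameters, test length, and cut-score are made up, and the integral over π is approximated on a grid.

```python
import numpy as np
from scipy.stats import beta, binom

# Two-parameter beta-binomial sketch: true proportion-correct scores follow a
# Beta(alpha, beta_) density and conditional errors are binomial.  The shape
# parameters, test length n, and cut-score x0 below are illustrative only.
alpha, beta_, n, x0 = 8.0, 4.0, 20, 12

grid = np.linspace(0.0005, 0.9995, 1000)       # grid over the true score pi
w = beta.pdf(grid, alpha, beta_)
w /= w.sum()                                   # discrete weights approximating g(pi)

scores = np.arange(n + 1)
cond = binom.pmf(scores[:, None], n, grid[None, :])   # Pr(X = i | pi)

marg = cond @ w                                # Pr(X = i), Equation 2.4
joint = (cond * w) @ cond.T                    # Pr(X1 = i, X2 = j), Equation 2.5

low, high = scores < x0, scores >= x0
phi = joint[np.ix_(low, low)].sum() + joint[np.ix_(high, high)].sum()   # Eq. 2.6
p0, p1 = marg[low].sum(), marg[high].sum()                              # Eqs. 2.8-2.9
phi_c = p0 ** 2 + p1 ** 2                                               # Eq. 2.7
kappa = (phi - phi_c) / (1.0 - phi_c)                                   # Eq. 2.3
print(round(phi, 3), round(kappa, 3))
```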

Subkoviak and Compound Multinomial Models

Lee (2005) notes that the strong true score models use a distributional approach, where assumptions are made concerning the distributional form of the true scores. Subkoviak (1976) employed an individual approach in which no such assumptions are made. The Subkoviak procedure estimates classification consistency one examinee at a time, and then averages over examinees to create an overall consistency index for the entire sample group. Lee (2005) extended Subkoviak's work using the compound multinomial procedure (see also Lee, Brennan, & Wan, 2009). Lee proposed a multinomial error model for a test with undifferentiated polytomous items, and a compound multinomial model for a test containing a mixture of items. The multinomial procedure reduces to Subkoviak's procedure when items are dichotomously scored.

Assume there is a test that contains n polytomous items, each with j score points, c_1 < c_2 < ... < c_j. Also, assume that X_1, X_2, ..., X_j are the random variables representing the number of items scored at each of the possible score points. Under this procedure, each examinee's response pattern would follow a multinomial distribution:

\[ \Pr(X_1 = x_1, X_2 = x_2, \ldots, X_j = x_j \mid \boldsymbol{\pi}) = \frac{n!}{x_1!\, x_2! \cdots x_j!}\, \pi_1^{x_1} \pi_2^{x_2} \cdots \pi_j^{x_j}, \tag{2.10} \]

where π = {π_1, π_2, ..., π_j} can be estimated by the observed proportions of items scored with the corresponding points. From here, the probability density function of Y can be determined by summing over all sets of X_1, X_2, ..., X_j that yield a total score of y:

\[ \Pr(Y = y \mid \boldsymbol{\pi}) = \sum_{y} \Pr(X_1 = x_1, X_2 = x_2, \ldots, X_j = x_j \mid \boldsymbol{\pi}), \tag{2.11} \]

where y = c_1 x_1 + c_2 x_2 + ... + c_j x_j is the sum of the item scores. Once this density function is determined, the φ and κ indices can be calculated.
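The sketch below enumerates Equation 2.10 over all count vectors and accumulates Equation 2.11; the number of items, the score points, and the proportions are all invented for the example.

```python
import numpy as np
from itertools import product
from math import factorial

# Compound-multinomial sketch: n polytomous items, each scored with one of
# the points in c; pi holds the proportions of items at each score point.
n = 6
c = [0, 1, 2]                       # possible score points c_1 < c_2 < c_3
pi = np.array([0.2, 0.5, 0.3])      # illustrative proportions

def pr_counts(x):
    """Multinomial probability of the count vector x (Equation 2.10)."""
    coef = factorial(n) / np.prod([factorial(k) for k in x])
    return coef * np.prod(pi ** np.array(x))

# Pr(Y = y | pi): sum multinomial probabilities over all count vectors whose
# weighted total equals y (Equation 2.11).
dist = {}
for x in product(range(n + 1), repeat=len(c)):
    if sum(x) == n:
        y = sum(ck * xk for ck, xk in zip(c, x))
        dist[y] = dist.get(y, 0.0) + pr_counts(x)

print(sum(dist.values()))           # ~1.0; the distribution is proper
print(dist[2 * n])                  # probability of a perfect score
```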

Livingston-Lewis Procedure

Livingston and Lewis (1995) described a procedure for estimating the accuracy and consistency of classifications based on test scores, using the concept of effective test length. Effective test length refers to the number of discrete, dichotomously scored, locally independent test items needed to produce total scores having the same precision as the scores actually being used. The original test score is transformed onto a new scale with a maximum equal to the effective test length. The true score distribution on the new scale is then estimated by fitting a two- or four-parameter beta model. Also, the conditional distribution of scores on the new scale, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Once the parameters of both distributions are known, classification consistency is estimated in the same way as demonstrated in Hanson and Brennan (1990).

Item Response Theory Methods

All of the previously discussed models are based on either summed-scores or scale-scores and fall under the umbrella of classical test theory. Huynh (1990) first explored procedures for consistency indices based on latent trait models. Huynh demonstrated how the Rasch model could be used to project the bivariate frequency of test scores based on equivalent test scores; this distribution is then used to estimate consistency indices such as φ and κ. Estimating these indices using IRT requires the determination of the marginal distribution of summed-scores through integration over the distribution of the latent trait θ. There are two different approaches possible: the D method and the P method (Lee, 2010). The D method is a distributional approach using estimated quadrature points and weights, and it replaces the integral with summations. The P method uses individual θ estimates to calculate the individual conditional classification indices, which are then averaged over all examinees. Lee (2010) found that the D and P methods produced similar results, but suggested that the D method be used when the focus of investigation is at the group level and the P method when the focus is on the individual. Huynh's work has been further developed by Schulz et al. (1999), Wang et al. (2000), and Lee et al. (2002). More recently, theta-metric indices have been developed by Rudner (2001) and Guo (2006). Lee (2010) generalized their work, and the following procedures reflect that form.

Given θ and g(θ), the latent trait being measured and its distribution respectively, the marginal probability of the total summed-score X is given by:

\[ \Pr(X = x) = \int \Pr(X = x \mid \theta)\, g(\theta)\, d\theta. \tag{2.12} \]

Note that Pr(X = x | θ) is the conditional summed-score distribution. Due to the IRT assumption of conditional independence, the probability of a response pattern is the product of the probabilities of the item responses given θ. Typically, a recursive formula such as the Lord-Wingersky algorithm (1984) is employed to calculate the conditional summed-score distribution for dichotomous items. In addition, when all items are dichotomous, a compound binomial model is used for modeling conditional number-correct score distributions.

Assume that a test score is found by summing all of the item scores on a particular form. Also, let x_1, x_2, ..., x_{K-1} denote a set of cut-scores that are used to classify examinees into K mutually exclusive categories. A score less than x_1 would be placed in the first category, a score equal to or greater than x_1 but less than x_2 would be placed in the second category, and so on. The conditional category probability can be computed by summing conditional summed-score probabilities for all scores that belong to category h:

\[ p_\theta(h) = \sum_{x = x_{h-1}}^{x_h - 1} \Pr(X = x \mid \theta), \tag{2.13} \]

where h = 1, 2, ..., K. Once the above probability is calculated, it is useful to know the probability of an individual with a given ability level being placed in the same category on two separate parallel administrations of a test. This is the conditional classification consistency index (φ_θ), which can be calculated as:

\[ \phi_\theta = \sum_{h=1}^{K} [p_\theta(h)]^2. \tag{2.14} \]

Finally, given the above, classification consistency across all levels of ability can be calculated. Given the distribution of ability, g(θ), the marginal classification consistency index φ is computed as:

\[ \phi = \int \phi_\theta\, g(\theta)\, d\theta. \tag{2.15} \]
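A minimal version of the Lord-Wingersky recursion for dichotomous items might look like the following; the probabilities passed in are illustrative values of the item response functions at a single θ.

```python
import numpy as np

def lord_wingersky(p):
    """Conditional summed-score distribution Pr(X = x | theta) for dichotomous
    items, given the vector p of correct-response probabilities at theta."""
    dist = np.array([1.0])                       # start from a zero-item "test"
    for pi in p:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1.0 - pi)            # item answered incorrectly
        new[1:] += dist * pi                     # item answered correctly
        dist = new
    return dist

# Illustrative probabilities for a 4-item test at one theta value.
print(lord_wingersky([0.8, 0.6, 0.7, 0.5]))      # sums to 1 across scores 0..4
```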

Note that in this context, the coefficient κ can be expressed as κ = (φ − φ_c)/(1 − φ_c). In addition, the chance probability, φ_c, can be expressed as

\[ \phi_c = \sum_{h=1}^{K} [p(h)]^2, \]

where p(h) is the marginal category probability obtained by integrating p_θ(h) over θ.

Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores. However, true scores, true ability levels, and true classifications are not known, and statistical techniques and assumptions need to be employed in order for an informed guess to be made. First, suppose that the expected summed-score of the test taker is their true score τ. Next, suppose a set of true cut-scores on the summed-score metric, τ_1, τ_2, ..., τ_{K-1}, determines the true categorization of each test taker with θ or τ. Also, assume that the conditional probabilities, p_θ(h), from Equation 2.13 are known. Finally, the true categorical status, η (= 1, 2, ..., K), can be determined by comparing the expected summed-score for θ with the true cut-scores. Following from the above, the conditional classification accuracy index is:

\[ \gamma_\theta = p_\theta(\eta), \quad \text{for } \theta \in \eta. \tag{2.16} \]

Next, by integrating over all ability levels, the marginal classification accuracy index is:

\[ \gamma = \int \gamma_\theta\, g(\theta)\, d\theta. \tag{2.17} \]

Classification accuracy can also be framed in terms of false positive and false negative error rates. The conditional false positive error rate is the probability that a test taker of a given ability is classified into a category higher than the test taker's true category. This can be expressed as:

\[ \gamma_\theta^{+} = \sum_{h=\eta+1}^{K} p_\theta(h), \quad \text{for } \theta \in \eta, \tag{2.18} \]

where η is the accurate decision category. In contrast, the conditional false negative error rate, or the probability that a test taker of a given ability is classified into a category lower than the test taker's true category, is given by:

\[ \gamma_\theta^{-} = \sum_{h=1}^{\eta-1} p_\theta(h), \quad \text{for } \theta \in \eta. \tag{2.19} \]

Correspondingly, the marginal false positive and false negative error rates are given by:

\[ \gamma^{+} = \int \gamma_\theta^{+}\, g(\theta)\, d\theta \tag{2.20} \]

and

\[ \gamma^{-} = \int \gamma_\theta^{-}\, g(\theta)\, d\theta. \tag{2.21} \]
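A small numerical sketch of Equations 2.18-2.21 follows; the θ grid, the conditional category probabilities, and the accurate categories are all invented for illustration (categories are 0-indexed in the code).

```python
import numpy as np

# Sketch for K = 3 categories on a coarse theta grid.  Each row of p_theta
# holds illustrative conditional category probabilities p_theta(h); eta holds
# the accurate (true) category of the corresponding theta value.
nodes = np.array([-1.5, -0.5, 0.5, 1.5])
weights = np.exp(-0.5 * nodes ** 2)
weights /= weights.sum()
p_theta = np.array([[0.80, 0.18, 0.02],
                    [0.55, 0.40, 0.05],
                    [0.10, 0.55, 0.35],
                    [0.02, 0.28, 0.70]])
eta = np.array([0, 1, 1, 2])

fp_cond = np.array([p[e + 1:].sum() for p, e in zip(p_theta, eta)])   # Eq. 2.18
fn_cond = np.array([p[:e].sum() for p, e in zip(p_theta, eta)])       # Eq. 2.19
fp_marginal = np.sum(weights * fp_cond)                               # Eq. 2.20
fn_marginal = np.sum(weights * fn_cond)                               # Eq. 2.21
print(round(fp_marginal, 3), round(fn_marginal, 3))
```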

Item Response Theory

Item response theory (IRT) is a key measurement model in psychometrics. Lord (1980) frames item response theory as a theory of statistical estimation that uses latent characterizations of individuals as predictors of observed responses. While IRT is quite flexible and powerful, it is based on the key assumptions of unidimensionality and local independence. The unidimensionality assumption states that the observations on the items are solely a function of a single continuous latent person variable (de Ayala, 2009). For example, on a mathematics test there is assumed to be a single latent mathematics proficiency variable that underlies the test taker's performance. This assumption can be violated from a number of perspectives. First, items can contain multiple content strands, which can lead to multidimensionality. For example, the performance of a test taker on a math problem that contains a significant amount of text in the stimulus may also partially depend on their level of reading ability. Multidimensionality can also exist from a format perspective: students may perform differentially well on, for example, multiple-choice versus constructed-response items. It should be noted, however, that no test is likely to be perfectly unidimensional.

The second assumption, local independence, states that the probability of correctly responding to any particular item is independent of the performance on any other items on the test. As noted in Chapter 1, Lord (1980) defines this relationship as

\[ P(u_i = 1 \mid \theta) = P(u_i = 1 \mid \theta, u_j, u_k, \ldots) \quad (i \neq j, k, \ldots). \tag{2.22} \]

Here, θ denotes ability and u is the response for items i, j, k, etc. This assumption can be violated in a number of ways, including through the usage of passages.


More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Rasch Versus Birnbaum: New Arguments in an Old Debate

Rasch Versus Birnbaum: New Arguments in an Old Debate White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo

More information

USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS. Jing-Ru Xu

USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS. Jing-Ru Xu USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS By Jing-Ru Xu A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements

More information

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION

More information

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,

More information

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION Timothy Olsen HLM II Dr. Gagne ABSTRACT Recent advances

More information

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract Item Parameter Recovery for the Two-Parameter Testlet Model with Different Estimation Methods Yong Luo National Center for Assessment in Saudi Arabia Abstract The testlet model is a popular statistical

More information

A DIFFERENTIAL RESPONSE FUNCTIONING FRAMEWORK FOR UNDERSTANDING ITEM, BUNDLE, AND TEST BIAS ROBERT PHILIP SIDNEY CHALMERS

A DIFFERENTIAL RESPONSE FUNCTIONING FRAMEWORK FOR UNDERSTANDING ITEM, BUNDLE, AND TEST BIAS ROBERT PHILIP SIDNEY CHALMERS A DIFFERENTIAL RESPONSE FUNCTIONING FRAMEWORK FOR UNDERSTANDING ITEM, BUNDLE, AND TEST BIAS ROBERT PHILIP SIDNEY CHALMERS A DISSERTATION SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT

More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

More information

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS
Item Response Theory and Latent Variable Modeling for Surveys with Complex Sampling Design: The Case of the National Longitudinal Survey of Children and Youth in Canada (André Cyr and Alexander Davies)
Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria
ABERRANT RESPONSE PATTERNS AS A MULTIDIMENSIONAL PHENOMENON: USING FACTOR-ANALYTIC MODEL COMPARISON TO DETECT CHEATING (John Michael Clark III)
Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices
THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH
A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing (Leslie Keng and Tsung-Han Ho)
A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model
Exploring dimensionality of scores for mixed-format tests
The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing
Selecting a data collection design for linking in educational measurement: Taking differential motivation into account
Lec 02: Estimation & Hypothesis Testing in Animal Ecology
The Added Value of Multidimensional IRT Models (Technical Report; Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock)
Applying the Minimax Principle to Sequential Mastery Testing
Scaling Item Difficulty Estimates from Nonequivalent Groups (GMAC Research Report RR-09-03)
Differential Item Functioning Amplification and Cancellation in a Reading Test
THE EFFECT OF SAMPLE SIZE ON ITEM PARAMETER ESTIMATION FOR THE PARTIAL CREDIT MODEL (Centre for Education Research and Policy)
Blending Psychometrics with Bayesian Inference Networks: Measuring Hundreds of Latent Variables Simultaneously
Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods
COMBINING SCALING AND CLASSIFICATION: A PSYCHOMETRIC MODEL FOR SCALING ABILITY AND DIAGNOSING MISCONCEPTIONS (Laine P. Bradshaw)
Comparing DIF methods for data with dual dependency
UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT (Debra White Moore)
Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items
Estimating the Validity of a Multiple-Choice Test Item Having k Correct Alternatives (Rand R. Wilcox)
Differential Item Functioning (Lecture #11, ICPSR Item Response Theory Workshop)
Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods
Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns (John M. Clark III, Pearson)
An Evaluation of Item Difficulty and Person ... (Kelly Diane Brune)
An exploration of decision consistency indices for one form tests
Analyzing data from educational surveys: a comparison of HLM and Multilevel IRT (Amin Mousavi)
The effects of ordinal data on coefficient alpha
A structural equation modeling approach for examining position effects in large scale assessments
Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model
Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study
Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG (Clement A. Stone, University of Pittsburgh)
Nonparametric IRT Methodology for Detecting DIF in Moderate-to-Small Scale Measurement: Operating Characteristics and a Comparison with the Mantel-Haenszel (Bruno D. Zumbo and Petronilla M. Witarsa, University of British Columbia)
Correlational Research: Correlational Designs
Psychological testing (Lecture 12, Mikołaj Winiewski)
REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT (Qi Chen)
A Comparison of Logistic Regression Models for DIF Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality (Research and Evaluation Methodology Program, University of Florida)
THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN (Moatasim A. Barri)
Examining Factors Affecting Language Performance: A Comparison of Three Measurement Approaches
An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests
Research Methods and Statistics Specialty Area Exam, Part I: Statistics (October 28, 2015)
The Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times (Suk Keun Im)
Building Evaluation Scales for NLP using Item Response Theory
A Comparison of Several Goodness-of-Fit Statistics
THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL
On the purpose of testing
A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL
A Comparison of Four Test Equating Methods
Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century
Comprehensive Statistical Analysis of a Mathematics Placement Test
The Optimal Design of the Dual-purpose Test (Xiao Luo, 2013; directed by Dr. Richard M. Luecht)
Assessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures (Dubravka Svetina)
Having your cake and eating it too: multiple dimensions and a composite
Polytomous Item Response Theory Models, Chapter 1: Introduction (Ostini & Nering)
An Investigation of vertical scaling with item response theory using a multistage testing framework
Selection of Linking Items
Statistical Methods and Reasoning for the Clinical Sciences
Item Response Theory: Methods for the Analysis of Discrete Survey Response Data
Evaluating the quality of analytic ratings with Mokken scaling
A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning (H. Jane Rogers and Hariharan Swaminathan)