Decision consistency and accuracy indices for the bifactor and testlet response theory models
University of Iowa
Iowa Research Online
Theses and Dissertations
Summer 2014

Decision consistency and accuracy indices for the bifactor and testlet response theory models

Lee James LaFond
University of Iowa

Copyright 2014 Lee LaFond

This dissertation is available at Iowa Research Online.

Recommended Citation: LaFond, Lee James. "Decision consistency and accuracy indices for the bifactor and testlet response theory models." PhD (Doctor of Philosophy) thesis, University of Iowa, 2014. Part of the Educational Psychology Commons.
DECISION CONSISTENCY AND ACCURACY INDICES FOR THE BIFACTOR AND TESTLET RESPONSE THEORY MODELS

by Lee James LaFond

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations in the Graduate College of The University of Iowa

August 2014

Thesis Supervisor: Associate Professor Won-Chan Lee
Copyright by LEE JAMES LAFOND 2014 All Rights Reserved
Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Lee James LaFond has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations at the August 2014 graduation.

Thesis Committee:
Won-Chan Lee, Thesis Supervisor
Robert Brennan
Kate Cowles
Deborah Harris
Michael Kolen
Donald Yarbrough
ACKNOWLEDGMENTS

I would like to thank several people for all of their valuable help in both writing this thesis and helping guide my experience in graduate school. First, I would like to thank Dr. Won-Chan Lee, who went above the call of duty in serving as my dissertation chair. In addition to the thoughtful and thorough feedback he provided for this study, I am also extremely grateful for all of the programming work that went into developing mirt-class. Next, I would like to thank Dr. Donald Yarbrough, who in addition to serving on my committee, also served as my academic advisor and supervisor at the Center for Evaluation and Assessment. I cannot think of a better person to have served as a mentor and guide to the field of program evaluation. I also would like to thank all of the other members of my committee: Dr. Robert Brennan, Dr. Michael Kolen, Dr. Deborah Harris, and Dr. Kate Cowles. I am deeply honored to have such key leaders in the field guide me through this process. Finally, I would like to thank Dr. Anna Topczewski for her friendship as a fellow graduate student in the program, and for all of the assistance she provided for the simulation study.
ABSTRACT

The primary goal of this study was to develop a new procedure for estimating decision consistency and accuracy indices using the bifactor and testlet response theory (TRT) models. This study is the first to investigate decision consistency and accuracy from a multidimensional perspective, and the results have shown that the bifactor model at least behaved in a way that met the author's expectations and represents a potentially useful procedure. The TRT model, on the other hand, did not meet the author's expectations and generally showed poor model performance. The multidimensional decision consistency and accuracy indices proposed in this study appear to provide good performance, at least for the bifactor model, in the case of a testlet effect of large magnitude. For practitioners examining a test containing testlets for decision consistency and accuracy, a recommended first step is to check for dimensionality. If the testlets show a significant degree of multidimensionality, then the multidimensional indices proposed here can be recommended, as the simulation study showed an improved level of performance over unidimensional IRT models. However, if there is not a significant degree of multidimensionality, then the unidimensional IRT models and indices would perform as well as, or even better than, the multidimensional models.

Another goal of this study was to compare methods for numerical integration used in the calculation of decision consistency and accuracy indices. This study investigated a new method (M method) that sampled ability estimates through a Monte Carlo approach. In summary, the M method seems to be just as accurate as the other commonly used methods for numerical integration, but it has some practical advantages over the D and P methods. It is not nearly as computationally intensive as the D method, and it does not require the large sample sizes of the P method. In addition, the P method has a conceptual disadvantage in that the conditioning variable, in theory, should be the true theta, not an estimated theta. The M method avoids both of these issues and seems to provide equally accurate estimates of decision consistency and accuracy indices, which makes it a strong option, particularly in multidimensional cases.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF SYMBOLS
CHAPTER 1: INTRODUCTION
    Decision Consistency
    Decision Accuracy
    Measurement Models
        Unidimensional 3PL Item Response Theory Model
        Bifactor IRT Model
        Testlet Response Theory Model
    Decision Consistency Indices
    Decision Accuracy Indices
    Purpose of the Study
CHAPTER 2: LITERATURE REVIEW
    Introduction
    The Development of Decision Consistency and Accuracy Indices
        Carver Method
        Swaminathan-Hambleton-Algina Method
        Strong True Score Models
        Subkoviak and Compound Multinomial Models
        Livingston-Lewis Procedure
        Item Response Theory Methods
    Item Response Theory
    Bifactor Model
    Testlet Response Theory
    Model Fit
    Summary
CHAPTER 3: METHODOLOGY
    Data
    Establishing Cut-Scores
    Decision Consistency and Accuracy Under the Bifactor and TRT Models
    Model Specification
    Estimation
    Testlet Response Model
    Analysis
    Summary
CHAPTER 4: RESULTS
    Simulation Study Results
        Population Values of Phi and Gamma
        Estimated Values of Phi and Gamma
        Bias
        Standard Errors
        RMSE
        Summary
    D Method versus M Method Comparison
        Unidimensional Models
        Multidimensional Models
        Summary
    Real Data Results
        Unidimensionality, Model Fit, and Local Independence
        Marginal Classification Indices
        Summary
CHAPTER 5: DISCUSSION
    Purpose 1
    Purposes 2 and 3
    Purpose 4
    Limitations and Future Research
    Summary and Conclusion
REFERENCES
APPENDIX A: SAMPLE FLEXMIRT BIFACTOR SYNTAX
APPENDIX B: SAMPLE FLEXMIRT TRT SYNTAX
LIST OF TABLES

Table 1.1. Mastery and Non-Mastery Outcomes on Two Administrations of the Same Test
Descriptive Statistics for Summed-Scores of Tests and Testlets for Form I
Descriptive Statistics for Summed-Scores of Tests and Testlets for Form II
Descriptive Statistics for Summed-Scores of Tests and Testlets for Form III
Summary of Models Employed in the Study
Summary of Model Priors for Estimation
Summary of Model Fit Statistics Employed in the Study
Simulation Phi and Gamma Results Based on Population Item Statistics
Simulation Study Decision Consistency Index (Phi) Means Under Various Models
Simulation Study Decision Accuracy Index (Gamma) Means Under Various Models
Simulation Study Decision Consistency Index Bias Under Various Models
Simulation Study Decision Accuracy Index Bias Under Various Models
Simulation Study Decision Consistency Index Absolute Value Bias Under Various Models
Simulation Study Decision Accuracy Index Absolute Value Bias Under Various Models
Simulation Study Decision Consistency Index SEs Under Various Models
Simulation Study Decision Accuracy Index SEs Under Various Models
Simulation Study Decision Consistency Index RMSEs Under Various Models
Simulation Study Decision Accuracy Index RMSEs Under Various Models
D Method vs. M Method Decision Consistency Index Comparison for the UIRT Model
D Method vs. M Method Decision Accuracy Index Comparison for the UIRT Model
D Method vs. M Method Decision Consistency Index Comparison for the GRM Model
D Method vs. M Method Decision Accuracy Index Comparison for the GRM Model
D Method vs. M Method Decision Consistency Index Comparison for the Bifactor Model
D Method vs. M Method Decision Accuracy Index Comparison for the Bifactor Model
D Method vs. M Method Decision Consistency Index Comparison for the TRT Model
D Method vs. M Method Decision Accuracy Index Comparison for the TRT Model
Summary Statistics for Test A Forms and Testlets
Summary Statistics for Test B Forms and Testlets
DIMTEST p-value Results for All Forms of Test A and Test B
Fit and LD Statistics for Test A
Table 4.24. Fit and LD Statistics for Test B
Actual and Estimated Proportions for Test A with 50% Cut-score
Actual and Estimated Proportions for Test A with 80% Cut-score
Actual and Estimated Proportions for Test A with Both 50% and 80% Cut-scores
Actual and Estimated Proportions for Test B with 50% Cut-score
Actual and Estimated Proportions for Test B with 80% Cut-score
Actual and Estimated Proportions for Test B with Both 50% and 80% Cut-scores
Decision Consistency Scores (Phi) for Test A
Decision Accuracy Scores (Gamma) for Test A
Decision Consistency Scores (Phi) for Test B
Decision Accuracy Scores (Gamma) for Test B
Comparison of Estimated A Parameters for IRT Models in Test B Form I
LIST OF SYMBOLS

a_i : IRT discrimination/slope parameter
b_i : IRT difficulty/location parameter
c_i : IRT pseudo-chance parameter
d_i : IRT multidimensional intercept parameter
g(i) : Testlet with nested item i
g(θ) : Distribution of ability
g(τ) : True score density function
h : Index denoting a specific classification category
i : Index denoting a specific item
j : Number of score points on a polytomous item
K : Total number of classification categories
n : Number of items in item set
N : Number of examinees
p : Probability of earning a particular score
Q : Number of quadrature points
S : Number of specific ability factors
S-X² : Item fit statistic
u : An item score
U_i : A random variable representing responses for item i
x : Test summed-score
x_k : Cut-score in raw score metric (k = 1, 2, …, K − 1)
y : Vector of item responses
AIC : Akaike Information Criterion
BIC : Bayesian Information Criterion
CTT : Classical Test Theory
IRT : Item Response Theory
MIRT : Multidimensional Item Response Theory
TRT : Testlet Response Theory
UIRT : Unidimensional Item Response Theory
1PL : One-parameter logistic IRT model
2PL : Two-parameter logistic IRT model
3PL : Three-parameter logistic IRT model
β : Vector of item parameters
φ : Marginal decision consistency or agreement index
φ_c : Chance agreement
φ_θ : Conditional classification consistency index
γ : Marginal accuracy index
γ_θ : Conditional classification accuracy index
γ_θ⁺ : Conditional false positive error rate
γ_θ⁻ : Conditional false negative error rate
γ⁺ : Marginal false positive error rate
γ⁻ : Marginal false negative error rate
η : True classification
η̂ : Accurate decision category
θ : Ability
θ (vector) : Vector of abilities
κ : Coefficient kappa
τ : True score
π : True proportion-correct score
Γ_g(i) : Testlet effect for testlet with nested item i
σ²_g(i) : Variance of testlet effect
Χ² : Local dependence index
CHAPTER 1
INTRODUCTION

A common application of test scores is to determine levels of examinee performance relative to specified cut-scores. Correspondingly, it is useful to know to what degree the classification is both consistent and accurate. Decision consistency describes the degree to which test takers are re-classified into the same category over parallel replications. Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores (Lee, 2010). Decision consistency and accuracy related to test scores are of great practical value to psychometricians, practitioners, and test takers. The consequences of NCLB and the role of high-stakes testing make proper determination of proficiency critical. For example, a high school student who is erroneously deemed non-proficient may be denied graduation in some states. As a result, there is a general expectation and need for test classifications to be both stable and accurate. In order to determine proficiency levels, tests use items of various formats that have estimable parameters based on a chosen measurement model. One common type of item employed is the dichotomously scored multiple-choice item. This type of item is appealing because such items are generally regarded as reliable and efficient and can be objectively scored. One common usage of the multiple-choice item is in conjunction with passages. With this type of format, several items can draw from a common stimulus, thereby forming a testlet. Estimating parameters of these types of items with item response theory must be done with care due to the assumption of local independence. Under local independence, item responses are conditionally independent of one another given ability. More specifically, the likelihood of a particular test taker getting a certain item correct does not depend on how they performed on other items (Lord, 1980). However,
the existence of a common stimulus or passage can violate this assumption. For example, a student who is struggling to comprehend a particular reading passage is likely to find all of the related items more difficult in a correlated fashion.

Decision Consistency

As noted previously, decision consistency is the degree to which a test consistently classifies members of the same group into the same category over replication. For a straightforward example, see Table 1.1. An examination of the results of the two administrations of this particular test reveals that out of the 100 test takers: 50 demonstrated mastery on both administrations, 15 demonstrated non-mastery on both administrations, 10 demonstrated mastery on the first but not the second administration, and 25 demonstrated mastery on the second but not the first. The 50 test takers who demonstrated mastery both times and the 15 who demonstrated non-mastery both times show decision consistency; those 65 test takers had the same classification over both replications of the test. The remaining 35 test takers had inconsistent results over replication, demonstrating mastery on one administration but non-mastery on the other. Thus, in this particular example only 65 out of 100, or 65%, display decision consistency between the two replications. This degree of consistency is likely undesirable for those involved with the usage of this particular test. For example, the results of the 35% who had inconsistent results would be inconclusive and not of any particular use. In addition, such a lack of consistency might suggest a lack of reliability of the test scores and the possibility that many of the classification results may be due only to chance. While 100% consistency may not be attainable, it is in the best interest of test makers to ensure that the classification of test results is as stable as possible.
Decision Accuracy

While decision consistency is desirable, it should be noted that a high level of consistency does not automatically imply that the results necessarily reflect the test takers' true ability or nature (Huynh, 1990). The accuracy or validity of these decisions is commonly assessed through content or validity studies, but indices for decision accuracy have been developed as well. Again, decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores. More specifically, decision accuracy describes how closely a test taker's true score classification aligns with their observed score classification (Haertel, 2006). In decision accuracy, the concepts of false positives and false negatives are useful to consider. For example, a false positive would be a result of mastery when in reality the true classification of the student should be non-mastery. Similarly, a false negative would be a result of non-mastery when the true classification is mastery. Both of these results represent errors in classification and a lack of decision accuracy. More generally, a false positive occurs when a test taker is classified at a higher category than the true score indicates, and a false negative occurs when a test taker is classified at a lower category than the true score indicates.

Measurement Models

A primary consideration in determining decision consistency and accuracy is that the true ability and true scores of test takers are not known and must be estimated. In addition, item parameters such as difficulty, discrimination, and the pseudo-guessing parameter need to be estimated as well. Measurement models have been developed according to a set of assumptions for the purposes of calculating these estimates.
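The accurate/false-positive/false-negative trichotomy described above can be sketched directly. The following is a minimal illustration; the function name and the integer category codes are assumptions for the example, not notation from this study:

```python
def classification_errors(true_cats, observed_cats):
    """Label each decision by comparing observed and true category indices."""
    labels = []
    for t, o in zip(true_cats, observed_cats):
        if o > t:
            labels.append("false positive")   # classified above the true category
        elif o < t:
            labels.append("false negative")   # classified below the true category
        else:
            labels.append("accurate")
    return labels
```

For example, with true categories [1, 0, 1] and observed categories [1, 1, 0], the three decisions are labeled accurate, false positive, and false negative, respectively.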
In this study, the item response theory (IRT) framework is discussed with three specific measurement models: the unidimensional three-parameter logistic IRT (UIRT) model, the testlet response theory model, and the bifactor IRT model.
Unidimensional 3PL Item Response Theory Model

Unidimensional item response theory is a key measurement model in psychometrics and plays a foundational role in this study. Lord (1980) frames item response theory as a theory of statistical estimation that uses latent characterizations of individuals as predictors of observed responses. Hambleton (1993) describes several benefits and characteristics of item response theory in comparison to classical test theory. For example, one defining characteristic of IRT is that the item-ability relationship is clearly defined, whereas in classical test theory it is not specified. For the UIRT model, the probability that a person of a given ability gives a correct response can be represented by:

P(θ) = c_i + (1 − c_i) · exp{1.7[a_i(θ − b_i)]} / (1 + exp{1.7[a_i(θ − b_i)]}),   (1.1)

where θ is the latent trait (ability, skill, etc.), P(θ) is the probability of a correct response given a particular ability level, parameter a_i is the discriminating power of the item, parameter b_i is the difficulty of the item, and parameter c_i is the pseudo-guessing parameter. However, while UIRT is quite flexible and powerful, it is based on the key assumptions of unidimensionality and local independence. The unidimensionality assumption states that the observations on the items are solely a function of a single continuous latent person variable (de Ayala, 2009). For example, to meet this assumption on a mathematics test, there is assumed to be a single latent mathematics proficiency variable that underlies the test taker's performance. This assumption can be violated from a number of perspectives. First, items can contain multiple content strands, which can lead to multidimensionality. For example, the performance of a test taker on a math problem which contains a significant amount of text in the stimulus may also partially depend on their level of reading ability. Multidimensionality can also exist from a format
perspective. Students may perform differentially well on, for example, multiple-choice versus constructed-response items. As a result, it is common practice for practitioners to conduct dimensionality assessments to see if the assumption is violated to the extent that the model no longer fits the data. If multidimensionality exists, the degree of the violation should be examined to see if the model can still provide a reasonable representation. The second assumption, local independence, states that the probability of correctly responding to any particular item is independent of the performance on any other items on the test. Lord (1980) defines this relationship as:

P(U_i = 1 | θ) = P(U_i = 1 | θ, u_j, u_k, …), (i ≠ j, k, …),   (1.2)

where θ denotes ability and u is the response for items i, j, k, etc. This assumption can be violated in a number of ways. For example, one item response may clue the response of another, thereby creating a dependency. Also, it is a common practice in assessment to tie a number of items to a particular stimulus or passage. As a result, dependency may exist between the items due to the common stimulus, resulting in a violation of local independence. This particular effect is of primary concern to this study and leads to the choice of the other two models to be examined.

Bifactor IRT Model

Full-information item bifactor analysis (Gibbons et al., 2007; Gibbons & Hedeker, 1992) allows for multidimensionality between item types. A sample item bifactor measurement structure could have the following factor pattern:

[ a_10  a_11  0    ]
[ a_20  a_21  0    ]
[ a_30  0     a_32 ]
[ a_40  0     a_42 ]
[ a_50  0     a_52 ]

The first subscript denotes the item and the second corresponds to a dimension. For example, the above pattern could represent the factor structure for 5 reading items with 2 nested within one passage and 3 in another. The first column represents a general reading
dimension, while the second and third are dimensions specific to each of the passages. Thus, this model controls for the passage effect that potentially violates the unidimensionality assumption in the UIRT model. This study will focus on the bifactor IRT model (Cai, Yang, & Hansen, 2011), which is an extension of the standard UIRT model. For dichotomous items, the bifactor model with general factor θ_0 and one specific factor θ_s is:

P(θ_0, θ_s) = c_i + (1 − c_i) · exp[1.7(a_0i θ_0 + a_si θ_s + d_i)] / (1 + exp[1.7(a_0i θ_0 + a_si θ_s + d_i)]).   (1.3)

Above, c_i is the pseudo-guessing parameter, d_i is the item intercept, a_0i is the item slope on the general factor, and a_si is the item slope on the specific factor s. Note that the item slopes are similar in interpretation to discrimination in the UIRT model, but are specific to each factor.

Testlet Response Theory Model

Testlets are a popular way of structuring items in which multiple items are attached to a single stimulus or passage. One common usage is with reading passages from which several items draw inferences. Wainer, Bradlow, and Wang (2007) state that, for a typical reading passage followed by four to six associated items, the local independence assumption does not hold. Reducing the length of the passage was found to reduce this effect, but it was discovered that the construct measured was not the same. Attaching only one item to the passage also eliminates the problem, but at the cost of being very inefficient in terms of the time needed to read the passage relative to the amount of information gained. Thus, there is a need for a model where the unit of test construction is smaller than the whole test but larger than a single item. Wainer and Kiely (1987) argued for such a model and proposed the testlet as a unit of construction. Of particular interest to this study is testlet response theory (TRT), which was developed by Wainer et al. (2007). Essentially it expands the 3PL UIRT model to:
P(θ) = c_i + (1 − c_i) · exp{1.7[a_i(θ − b_i − Γ_g(i))]} / (1 + exp{1.7[a_i(θ − b_i − Γ_g(i))]}),   (1.4)

where Γ_g(i) is the testlet effect of a particular respondent for item i nested within testlet g(i). Note that if Γ_g(i) = 0, there is no testlet effect and the model simplifies to the UIRT model. However, as described in DeMars (2006), the testlet response model is simply a constrained bifactor model. In the bifactor model, testlet slopes are independent of general slopes; under the testlet response model, the testlet slopes are proportional to the general slopes. Thus, while the estimation of ability and item parameters is impacted, the same general model could be used to estimate decision consistency and accuracy indices.

Decision Consistency Indices

Classification accuracy and consistency indices can be estimated using a wide variety of models. In the summed-score metric, indices exist both for item response theory (Huynh, 1990) and classical test theory (Livingston & Lewis, 1995). More recently, theta-metric indices have been developed by Rudner (2001) and Guo (2006). Specifically, this study will focus on the IRT methods described in Schulz et al. (1999), Wang et al. (2000), and Lee et al. (2002), more recently generalized in Lee (2010). Note, however, that these methods were developed with the UIRT model in mind, and the primary purpose of this study is to develop a procedure for the TRT and bifactor models. Assume that a test score is found by summing all of the item scores on a particular form. Also, let x_1, x_2, …, x_{K−1} denote a set of cut-scores that are used to classify examinees into K mutually exclusive categories. So, a score less than x_1 would be placed in the first category, a score equal to or greater than x_1 but less than x_2 would be placed in the second category, and so on.
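DeMars's (2006) observation that TRT is a constrained bifactor model can be made concrete with a small sketch. The two response functions below follow Equations 1.3 and 1.4; the function names are illustrative, and the parameterization used to reproduce a TRT item with the bifactor form (testlet slope equal to the general slope, intercept d = −a·b, specific factor set to −Γ) is one convenient choice, not the only one:

```python
import math

def p_bifactor(theta0, theta_s, a0, a_s, d, c, D=1.7):
    """Bifactor model (Eq. 1.3): separate slopes on the general and specific factors."""
    z = D * (a0 * theta0 + a_s * theta_s + d)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

def p_trt(theta, gamma_g, a, b, c, D=1.7):
    """TRT model (Eq. 1.4): the testlet effect shifts the effective ability."""
    z = D * a * (theta - b - gamma_g)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# With gamma_g = 0 the TRT model reduces to the 3PL UIRT model.
# With theta_s = -gamma_g, a_s = a, and d = -a * b, the bifactor form
# reproduces the TRT probability exactly, illustrating the constraint.
```

For instance, p_trt(θ, Γ, a, b, c) and p_bifactor(θ, −Γ, a, a, −a·b, c) give identical probabilities for any θ and Γ.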
The conditional category probability can be computed by summing conditional summed-score probabilities for all scores that belong to category h:

p_θ(h) = Σ_{x = x_{h−1}}^{x_h − 1} Pr(X = x | θ),   (1.5)
where h = 1, 2, …, K. Once the above probability is calculated, it is useful to know the probability of an individual with a given ability level being placed twice in the same category on two separate parallel administrations of a test. This is the conditional classification consistency index (φ_θ), which can be calculated as:

φ_θ = Σ_{h=1}^{K} [p_θ(h)]².   (1.6)

Finally, given the above, classification consistency across all levels of ability can be calculated. Given the distribution of ability, g(θ), the marginal classification consistency index φ is computed as:

φ = ∫ φ_θ g(θ) dθ.   (1.7)

Decision Accuracy Indices

Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores. However, true scores, true ability levels, and true classifications are not known, and statistical techniques and assumptions need to be employed in order for an estimate to be made. First, suppose that the expected summed-score of a test taker is their true score τ. Next, suppose a set of true cut-scores on the summed-score metric, τ_1, τ_2, …, τ_{K−1}, determines the true categorization of each test taker with θ or τ. Also, assume that the conditional probabilities, p_θ(h), from Equation 1.5 are known. Finally, the true categorical status, η (= 1, 2, …, K), can be determined by comparing the expected summed-score for θ with the true cut-scores. Following from the above, the conditional classification accuracy index is:

γ_θ = p_θ(η),   (1.8)

where η is the true category for ability θ. Next, by integrating over all ability levels, the marginal classification accuracy index is:

γ = ∫ γ_θ g(θ) dθ.   (1.9)
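Equations 1.5 through 1.9 can be sketched computationally for the UIRT (3PL) case. The sketch below uses the well-known Lord-Wingersky recursion to obtain the conditional summed-score distribution; all function names are illustrative, and the category index here is zero-based rather than the 1, …, K indexing used in the text:

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (Eq. 1.1)."""
    z = D * a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

def summed_score_dist(theta, items):
    """Lord-Wingersky recursion: Pr(X = x | theta) for every summed score x."""
    dist = [1.0]
    for (a, b, c) in items:
        p = p_3pl(theta, a, b, c)
        new = [0.0] * (len(dist) + 1)
        for x, pr in enumerate(dist):
            new[x] += pr * (1.0 - p)   # item answered incorrectly
            new[x + 1] += pr * p       # item answered correctly
        dist = new
    return dist

def category_probs(dist, cuts):
    """Eq. 1.5: collapse the score distribution into K category probabilities."""
    bounds = [0] + list(cuts) + [len(dist)]
    return [sum(dist[bounds[h]:bounds[h + 1]]) for h in range(len(bounds) - 1)]

def phi_theta(p_h):
    """Eq. 1.6: conditional classification consistency."""
    return sum(p * p for p in p_h)

def gamma_theta(p_h, eta):
    """Eq. 1.8: conditional accuracy, the probability of the true category eta."""
    return p_h[eta]
```

For a two-item test with a = 1, b = 0, c = 0 and a cut-score of 1, an examinee at θ = 0 has category probabilities (0.25, 0.75), so φ_θ = 0.625 and, if the true category is the upper one, γ_θ = 0.75.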
Note that in order for the indices in Equations 1.7 and 1.9 to be estimated, it is necessary to approximate the integral associated with the θ distribution. There are two different approaches typically employed: the D method and the P method (Lee, 2010). The D method is a distributional approach using estimated quadrature points and weights, and it replaces the integral with summations. The P method uses individual θ estimates to calculate the individual conditional classification indices, which are then averaged over all examinees. Lee (2010) found that the D and P methods produced similar results, but suggested that the D method be used when the focus of investigation is at the group level, and the P method when the focus is on the individual.

Purpose of the Study

The goal of this study is to investigate decision consistency and accuracy indices based on the UIRT, TRT, and bifactor models. The UIRT model has been thoroughly investigated in past research and will be the most straightforward of the models. The bifactor model, however, is multidimensional and will be the primary focus of this study. The presence of multiple thetas does not fit with the current methods for estimating indices, and as a result a new procedure needs to be developed. This new procedure is the first to address multidimensionality in estimating decision consistency and accuracy indices, and as such represents a meaningful contribution to the literature. The TRT model, which is a constrained version of the bifactor model, will use the same general procedures as the bifactor model. Specifically, the purposes of this study are:

1. Develop a new procedure for estimating decision consistency and accuracy indices using the bifactor and TRT models.

2. Compare decision consistency and accuracy indices between the UIRT, TRT, and bifactor models using simulated and real testlet data from various sources. In addition, use the four-parameter beta-binomial (4P BB) and graded response (GRM) models as a baseline comparison.
3. Investigate how the placement of cut-scores and the degree of multidimensionality affect the estimates of decision consistency and accuracy indices between the UIRT, TRT, and bifactor models.

4. Compare the different numerical integration methods used to calculate the indices.

Ideally, this study hopes to offer practitioners another credible option for measuring the degree of decision consistency and accuracy in assessments. At present, no other decision consistency or accuracy index explicitly accounts for multidimensionality. In particular, the indices presented in this study directly address the multidimensionality caused by the presence of testlets. In tests that employ testlets, these new indices have the potential to provide a more accurate picture of decision consistency and accuracy, making them a potentially useful new procedure for practitioners.
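Purpose 4 concerns the numerical integration needed for the marginal indices in Equations 1.7 and 1.9. The sketch below contrasts the D method (quadrature points and weights) with the Monte Carlo sampling idea behind the M method investigated in this study; it assumes a standard normal g(θ), and the function names are illustrative:

```python
import random

def marginal_index_D(cond_fn, quad_points, quad_weights):
    """D method: replace the integral with a weighted sum over quadrature points."""
    return sum(cond_fn(q) * w for q, w in zip(quad_points, quad_weights))

def marginal_index_M(cond_fn, n_draws=20000, seed=7):
    """M method: average the conditional index over Monte Carlo draws from g(theta)."""
    rng = random.Random(seed)
    return sum(cond_fn(rng.gauss(0.0, 1.0)) for _ in range(n_draws)) / n_draws
```

Either function can be handed any conditional index, such as the φ_θ of Equation 1.6. If the conditional index is constant over θ, both methods return that constant exactly (provided the quadrature weights sum to one), which is a convenient sanity check.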
Table 1.1. Mastery and Non-Mastery Outcomes on Two Administrations of the Same Test

                          Administration 2
Administration 1    Mastery    Non-Mastery    Total
Mastery                50          10           60
Non-Mastery            25          15           40
Total                  75          25          100
CHAPTER 2
LITERATURE REVIEW

This chapter reviews the literature on decision consistency and accuracy indices in order to provide a history and broader context that will serve as a foundation for this study. First, this chapter provides background and motivation for criterion-referenced testing. Next, a history of the development and evolution of consistency and accuracy indices is provided. In addition, the background is explored for the three models employed in this study: the unidimensional item response theory model, the testlet response theory model, and the bifactor model. Finally, there is a discussion of how indices will be estimated for the different models and a brief overview of assessing model fit.

Introduction

In order for a test score to be meaningfully interpreted, a test needs to be referenced in some way. In other words, the score needs to be compared to something external to the test as a point of reference for comparisons. For example, one possible way to do this is through norm-referenced testing. In norm-referenced testing, derived scores (e.g., percentile ranks, grade-equivalent scores) are constructed in a way that conveys information about the relative standing of the test taker to others in the defined group. This defined group is referred to as the norm group, and the derived scores are known as norm-referenced scores (Nitko, 1980). However, for some uses norm-referenced scores can be insufficient. For example, you may want to know if a particular student has mastered the prerequisite skills necessary to be successful in a math course. Here a norm-referenced interpretation is not particularly useful: you know the position of the student within the norm group, but you do not know if the student has sufficient mastery. What is needed here is some criterion by which mastery is gauged, and this is the motivation for criterion-referenced testing.
A common application of criterion-referenced test scores is to determine levels of test taker performance relative to specified cut-scores. With regard to this, it is useful to know to what degree the classification is both consistent and accurate. Decision consistency describes the degree to which test takers are re-classified into the same category over parallel replication. Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores (Lee, 2010). Decision consistency and accuracy related to test scores are of great practical value to psychometricians, practitioners, and test takers. The consequences of NCLB and the role of high-stakes testing make proper determination of proficiency critical. For example, a high school student who is erroneously deemed non-proficient may be denied graduation in some states. Similarly, a nurse who is mistakenly deemed proficient despite deficiencies may not be capable of providing adequate care. In particular, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) state, "when a test or combination of measures is used to make categorical decisions, estimates should be provided of the percentage of examinees who would be classified in the same way on two applications of the procedure, using the same form or alternate forms of the instrument" (p. 35). As a result, there is a general expectation and need for test classifications to be examined as part of the validation process. In order to determine proficiency levels, tests use items of various formats that have estimable parameters based on a chosen measurement model. One common type of item employed is the dichotomously scored multiple-choice item. This type of item is appealing because such items are generally regarded as reliable and efficient and can be objectively scored. One common usage of the multiple-choice item is in association with passages.
With this type of format, several items can draw from a common stimulus, thereby forming a testlet. Estimating parameters for these types of items with item response theory must be done with care due to the assumption of local independence. With assumed local
independence, items are conditionally independent of one another given ability. In other words, the likelihood of a particular test taker getting a certain item correct does not depend on how they performed on other items (Lord, 1980). However, the existence of a common stimulus or passage can violate this assumption.

The Development of Decision Consistency and Accuracy Indices

Carver Method

The earliest formal method for determining classification consistency appears in Carver (1970). Essentially, this method compared the percentage of mastery on two parallel administrations of a test. If the two percentages were equal, then the test was considered reliable. The clear weakness of this method is that even if the two percentages are identical, the test could still be unreliable with regard to the performance of individual test takers. For example, half of the test takers could be considered masters on the first administration of the test and the other half could be masters on the second. The results are reversed but yield the same percentage of mastery, which is an unstable outcome.

Swaminathan-Hambleton-Algina Method

Following Carver's method is the Swaminathan-Hambleton-Algina method developed in Hambleton and Novick (1973) and Swaminathan et al. (1974). This method suggested that the proportion of individuals consistently classified serve as the measure of decision consistency. Referring back to Table 1.1, fifty out of one hundred students demonstrated mastery on both administrations and fifteen demonstrated non-mastery on both administrations. Therefore, the proportion of consistent classification can be calculated as:

φ = Σ_{k=1}^{m} φ_{kk},  (2.1)

where φ_{kk} is the proportion of individuals consistently classified in the kth category on both administrations. Thus, for Table 1.1 this is calculated as:

φ = .50 + .15 = .65.
The upper limit of φ is 1.00, which is perfect consistency, and the lower limit is generally the proportion of consistent decisions that could be expected by chance. This is defined as:

φ_c = Σ_{k=1}^{m} p_{k·} p_{·k},  (2.2)

where p_{k·} and p_{·k} are the proportions assigned to category k on the first and second forms, respectively. For example, for Table 1.1 this is calculated as:

φ_c = (p_{1·})(p_{·1}) + (p_{2·})(p_{·2}) = .51.

Note that while the proportion of consistent decisions in the example is .65, the proportion expected by chance is quite high at .51. This suggests that a sizeable proportion of the consistent decisions is due to chance at this particular cut-score, and not due to the reliability of the test itself. Swaminathan et al. (1974) suggested using Cohen's (1960) kappa coefficient to remove the chance proportion and determine the proportion of consistent decisions that can be expected beyond chance. The kappa coefficient is calculated by:

κ = (φ − φ_c) / (1 − φ_c) = (.65 − .51) / (1 − .51) = .29.  (2.3)

Here, the upper limit is again 1.00, suggesting perfect consistency, while the lower limit is theoretically lower than zero. In this particular example, the low kappa value suggests that once the proportion of chance agreement is removed, the remaining degree of classification consistency is quite low.

Strong True Score Models

Note that the Swaminathan-Hambleton-Algina method still requires the administration of two parallel forms of the test. However, it is often not practical or possible to give two parallel administrations of the same test in an achievement testing context. Thus, it is useful to have an index that can be calculated from a single administration of a test. The beta-binomial model developed by Huynh (1976) is an example of one such index.
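The φ, φ_c, and κ computations above can be carried out directly from a classification table. The following is a minimal sketch; the table values are hypothetical and are not the Table 1.1 data:

```python
# Hypothetical 2x2 classification table (proportions): rows are
# categories on form 1, columns are categories on form 2.
table = [
    [0.15, 0.10],   # non-master on form 1: (non-master, master) on form 2
    [0.25, 0.50],   # master on form 1
]

def consistency_indices(table):
    """Return (phi, phi_c, kappa) per Equations 2.1-2.3."""
    m = len(table)
    # phi: sum of the diagonal (same category on both administrations)
    phi = sum(table[k][k] for k in range(m))
    # Marginal proportions for each category on the two forms.
    row = [sum(table[k]) for k in range(m)]
    col = [sum(table[i][k] for i in range(m)) for k in range(m)]
    # phi_c: agreement expected by chance from the marginals.
    phi_c = sum(row[k] * col[k] for k in range(m))
    # kappa: agreement beyond chance.
    kappa = (phi - phi_c) / (1 - phi_c)
    return phi, phi_c, kappa

phi, phi_c, kappa = consistency_indices(table)
print(round(phi, 2), round(phi_c, 2), round(kappa, 2))  # 0.65 0.55 0.22
```

The same function extends unchanged to K > 2 categories, since Equations 2.1 and 2.2 sum over all categories.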
Hanson and Brennan (1990) compared three different strong true score beta-binomial models (two-parameter, four-parameter, and four-parameter compound binomial) with regard to estimating classification indices. Strong true score models consider the probability that the summed-score random variable X of a test (with n dichotomously scored items) equals i (i = 0, …, n) as:

Pr(X = i) = ∫_0^1 Pr(X = i | π) g(π) dπ,  (2.4)

where π is the proportion-correct true score, g(π) is the true score density function, and Pr(X = i | π) is the conditional error distribution. Here, g(π) is assumed to belong to a certain parametric class, and Pr(X = i | π) is assumed to be either binomial or an approximation of a compound binomial distribution. Each of the three models Hanson and Brennan examined has its own set of assumptions. For the two-parameter beta-binomial model, the true score distribution is beta and the conditional error distribution is binomial. For the four-parameter beta-binomial model, the true score distribution is four-parameter beta (Lord, 1965) and the conditional error distribution is binomial. For the four-parameter beta compound binomial model, the true score distribution is four-parameter beta and the conditional error distribution is a two-term approximation to the compound binomial distribution. Here, classification consistency is defined as the consistency with which examinees are categorized on the basis of two independent administrations. Independence is defined so that the summed-scores on the two administrations (X_1 and X_2) are conditionally independent and identically distributed. Assuming two categories of classification, the bivariate distribution of X_1 and X_2 is:

Pr(X_1 = i, X_2 = j) = ∫_0^1 Pr(X_1 = i | π) Pr(X_2 = j | π) g(π) dπ.  (2.5)

Again assuming two categories of classification, from Equation 2.5 the classification index φ is defined as:

φ = Σ_{i=0}^{x_0−1} Σ_{j=0}^{x_0−1} Pr(X_1 = i, X_2 = j) + Σ_{i=x_0}^{n} Σ_{j=x_0}^{n} Pr(X_1 = i, X_2 = j).  (2.6)
The classification index φ is the probability that two summed-scores from parallel independent administrations are either both less than the cut-score x_0 (non-mastery) or both greater than or equal to x_0 (mastery). As given in Equation 2.3, the coefficient κ = (φ − φ_c) / (1 − φ_c). Here, the probability of chance agreement φ_c is given by:

φ_c = [Σ_{i=0}^{x_0−1} Pr(X_1 = i)] [Σ_{j=0}^{x_0−1} Pr(X_2 = j)] + [Σ_{i=x_0}^{n} Pr(X_1 = i)] [Σ_{j=x_0}^{n} Pr(X_2 = j)].  (2.7)

Note that since X_1 and X_2 are independent and identically distributed,

Σ_{i=0}^{x_0−1} Pr(X_1 = i) = Σ_{j=0}^{x_0−1} Pr(X_2 = j) = p_0,  (2.8)

where p_0 is the marginal probability that a test taker scores below the cut-score x_0. Also,

Σ_{i=x_0}^{n} Pr(X_1 = i) = Σ_{j=x_0}^{n} Pr(X_2 = j) = p_1,  (2.9)

where p_1 is the marginal probability that a test taker scores equal to or greater than the cut-score x_0. As a result, φ_c = p_0² + p_1², and φ_c does not depend on the actual pair of test scores for a test taker. Thus, given the assumptions made, only a single administration is necessary to calculate the indices for classification consistency. In their comparison of the three beta-binomial models, Hanson and Brennan found that the two-parameter model often demonstrates lack of fit. They recommended that before the two-parameter model is used, the adequacy of the model in fitting the raw observed scores be evaluated. In cases where the two-parameter model does not fit, the four-parameter models may provide better fit. According to their results, the four-parameter beta-binomial and the four-parameter beta compound binomial models provided very similar results. If neither the two-parameter nor four-parameter models fit, they suggest using a more complex model similar to those discussed by Wilcox (1981).

Subkoviak and Compound Multinomial Models

Lee (2005) notes that the strong true score models use a distributional approach, where assumptions are made concerning the distributional form of the true scores.
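Under the two-parameter beta-binomial model, the integral in Equation 2.5 has a closed form: the product of two binomial likelihoods against a beta(a, b) density integrates to a ratio of beta functions, so φ in Equation 2.6 can be computed without numerical integration. A sketch, with a hypothetical item count, beta parameters, and cut-score:

```python
from math import comb, lgamma, exp

def log_beta(x, y):
    """Natural log of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def joint_prob(i, j, n, a, b):
    """Pr(X1 = i, X2 = j) under the two-parameter beta-binomial model
    (beta(a, b) true scores, binomial conditional errors): the closed
    form of Equation 2.5."""
    return comb(n, i) * comb(n, j) * exp(
        log_beta(a + i + j, b + 2 * n - i - j) - log_beta(a, b))

def phi_beta_binomial(n, a, b, cut):
    """Equation 2.6: probability that two parallel administrations
    classify a test taker consistently (both below or both at/above
    the cut-score)."""
    below = sum(joint_prob(i, j, n, a, b)
                for i in range(cut) for j in range(cut))
    above = sum(joint_prob(i, j, n, a, b)
                for i in range(cut, n + 1) for j in range(cut, n + 1))
    return below + above

# Illustrative (hypothetical) values: 20 items, beta(8, 4) true scores,
# and a cut-score of 14 correct.
print(round(phi_beta_binomial(20, 8.0, 4.0, 14), 3))
```

The four-parameter models follow the same pattern but require the linearly transformed beta density of Lord (1965) in place of beta(a, b).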
Subkoviak (1976) employed an individual approach where no such assumptions are made. The Subkoviak procedure estimates classification consistency one examinee at a time, and then averages over examinees to create an overall consistency index for the entire sample group. Lee (2005) extended Subkoviak's work using the compound multinomial procedure (see also Lee, Brennan, & Wan, 2009). Lee proposed a multinomial error model for a test with undifferentiated polytomous items, and a compound multinomial model for a test containing a mixture of item types. The multinomial procedure reduces to Subkoviak's procedure when items are dichotomously scored. Assume there is a test that contains n polytomous items, each with j score points, c_1 < c_2 < ⋯ < c_j. Also, assume that X_1, X_2, …, X_j are the random variables representing the number of items scored at each of the possible score points. Under this procedure, each examinee's response pattern follows a multinomial distribution:

Pr(X_1 = x_1, X_2 = x_2, …, X_j = x_j | π) = [n! / (x_1! x_2! ⋯ x_j!)] π_1^{x_1} π_2^{x_2} ⋯ π_j^{x_j},  (2.10)

where π = {π_1, π_2, …, π_j} can be estimated by the observed proportions of items scored at the corresponding points. From here the probability density function of the total score Y can be determined by summing over all sets of x_1, x_2, …, x_j that yield a total score of y:

Pr(Y = y | π) = Σ Pr(X_1 = x_1, X_2 = x_2, …, X_j = x_j | π),  (2.11)

where the sum is over all sets with y = c_1 x_1 + c_2 x_2 + ⋯ + c_j x_j as the sum of the item scores. Once this density function is determined, the φ and κ indices can be calculated.

Livingston-Lewis Procedure

Livingston and Lewis (1995) described a procedure for estimating the accuracy and consistency of classifications based on test scores using the notion of effective test length. Effective test length refers to the number of discrete, dichotomously scored, locally independent test items needed to produce total scores having the same precision as the scores actually being used.
The original test score is transformed onto a new scale with a
maximum equal to the effective test length. The true score distribution on the new scale is then estimated by fitting a 2- or 4-parameter beta model. Also, the conditional distribution of scores on the new scale, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Once the parameters of both distributions are known, classification consistency is estimated in the same way as demonstrated in Hanson and Brennan (1990).

Item Response Theory Methods

All of the previously discussed models are based on either summed-scores or scale-scores and fall under the umbrella of classical test theory. Huynh (1990) first explored procedures for consistency indices based on latent trait models. Huynh demonstrated how the Rasch model could be used to project the bivariate frequency distribution of scores on two equivalent tests. This distribution is then used to estimate consistency indices such as φ and κ. Estimating these indices using IRT requires determining the marginal distribution of summed-scores through integration over the distribution of the latent trait θ. There are two possible approaches: the D method and the P method (Lee, 2010). The D method is a distributional approach using estimated quadrature points and weights, and it replaces the integral with summations. The P method uses individual θ estimates to calculate individual conditional classification indices, which are then averaged over all examinees. Lee (2010) found that the D and P methods produced similar results, but suggested that the D method be used when the focus of investigation is at the group level and the P method when the focus is on the individual. Huynh's work has been further developed by Schulz et al. (1999), Wang et al. (2000), and Lee et al. (2002). More recently, theta-metric indices have been developed by Rudner (2001) and Guo (2006).
Lee (2010) generalized their work and the following procedures reflect that form.
Given θ and g(θ), the latent trait being measured and its distribution respectively, the marginal probability of the total summed-score X is given by:

Pr(X = x) = ∫ Pr(X = x | θ) g(θ) dθ.  (2.12)

Note that Pr(X = x | θ) is the conditional summed-score distribution. Also, due to the IRT assumption of conditional independence, the probability of a response pattern given θ is the product of the probabilities of the item responses. Typically, a recursive formula, such as the Lord-Wingersky algorithm (Lord & Wingersky, 1984), is employed to calculate the conditional summed-score distribution for dichotomous items. In addition, when all items are dichotomous, a compound binomial model is used for modeling conditional number-correct score distributions. Assume that a test score is found by summing all of the item scores on a particular form. Also, let x_1, x_2, …, x_{K−1} denote a set of cut-scores that are used to classify examinees into K mutually exclusive categories. A score less than x_1 is placed in the first category, a score equal to or greater than x_1 but less than x_2 is placed in the second category, and so on. The conditional category probability can be computed by summing conditional summed-score probabilities for all scores that belong to category h:

p_θ(h) = Σ_{x=x_{h−1}}^{x_h − 1} Pr(X = x | θ),  (2.13)

where h = 1, 2, …, K. Once this probability is calculated, it is useful to know the probability of an individual with a given ability level being placed twice in the same category on two separate parallel administrations of a test. This is the conditional classification consistency index, φ_θ, which can be calculated as:

φ_θ = Σ_{h=1}^{K} [p_θ(h)]².  (2.14)

Finally, given the above, classification consistency across all levels of ability can be calculated. Given the distribution of ability, g(θ), the marginal classification consistency index φ is computed as:

φ = ∫ φ_θ g(θ) dθ.  (2.15)
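The chain from Equation 2.12 through Equation 2.15 can be sketched numerically. The following assumes a 2PL item response function and approximates a standard normal ability distribution with equally spaced quadrature points, in the spirit of the D method; the item parameters and cut-score are hypothetical:

```python
import math

def lord_wingersky(p):
    """Conditional summed-score distribution for dichotomous items,
    where p[k] = Pr(item k correct | theta); returns Pr(X = x | theta)
    for x = 0, ..., n."""
    dist = [1.0]
    for pk in p:
        new = [0.0] * (len(dist) + 1)
        for x, q in enumerate(dist):
            new[x] += q * (1.0 - pk)   # item answered incorrectly
            new[x + 1] += q * pk       # item answered correctly
        dist = new
    return dist

def category_probs(dist, cuts):
    """Equation 2.13: conditional category probabilities from the
    summed-score distribution and cut-scores x_1 < ... < x_{K-1}."""
    bounds = [0] + list(cuts) + [len(dist)]
    return [sum(dist[bounds[h]:bounds[h + 1]])
            for h in range(len(bounds) - 1)]

def two_pl(a, b, theta):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical 2PL parameters (a, b) for a short test; mastery cut at 3/5.
items = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0), (1.5, 0.5), (1.1, 1.0)]
cuts = [3]

# Crude equally spaced quadrature over a standard normal g(theta).
nodes = [x / 10.0 for x in range(-40, 41)]
weights = [math.exp(-t * t / 2.0) for t in nodes]
total = sum(weights)
weights = [w / total for w in weights]

# Equations 2.14 and 2.15: conditional phi_theta averaged over g(theta).
phi = 0.0
for theta, w in zip(nodes, weights):
    dist = lord_wingersky([two_pl(a, b, theta) for a, b in items])
    p_cat = category_probs(dist, cuts)
    phi += w * sum(p * p for p in p_cat)
print(round(phi, 3))
```

Replacing the quadrature loop with a sum over individual θ estimates gives the P method instead.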
Note that in this context, the coefficient κ can be expressed as κ = (φ − φ_c) / (1 − φ_c). In addition, the chance probability φ_c can be expressed as φ_c = Σ_{h=1}^{K} [p(h)]², where p(h) is the marginal category probability obtained by integrating p_θ(h) over θ. Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores. However, true scores, true ability levels, and true classifications are not known, and statistical techniques and assumptions need to be employed in order for an informed estimate to be made. First, suppose that the expected summed-score of the test taker is their true score τ. Next, suppose a set of true cut-scores on the summed-score metric, τ_1, τ_2, …, τ_{K−1}, determines the true categorization of each test taker with θ or τ. Also, assume that the conditional probabilities p_θ(h) from Equation 2.13 are known. Finally, the true categorical status, η (= 1, 2, …, K), can be determined by comparing the expected summed-score for θ with the true cut-scores. Following from the above, the conditional classification accuracy index is:

γ_θ = p_θ(η),  (2.16)

where η is the true category for a test taker with ability θ. Next, by integrating over all ability levels, the marginal classification accuracy index is:

γ = ∫ γ_θ g(θ) dθ.  (2.17)

Classification accuracy can also be framed in terms of false positive and false negative error rates. A conditional false positive error rate is the probability that a test taker of a given ability is classified into a category higher than the test taker's true category. This can be expressed as:

γ_θ^+ = Σ_{h=η+1}^{K} p_θ(h),  (2.18)

where η is the accurate decision category. In contrast, the conditional false negative error rate, or the probability that a test taker of a given ability is classified into a category lower than the test taker's true category, is given by:

γ_θ^− = Σ_{h=1}^{η−1} p_θ(h).  (2.19)
Correspondingly, the marginal false positive and false negative error rates are given by:

γ^+ = ∫ γ_θ^+ g(θ) dθ  (2.20)

and

γ^− = ∫ γ_θ^− g(θ) dθ.  (2.21)

Item Response Theory

Item response theory (IRT) is a key measurement model in psychometrics. Lord (1980) frames item response theory as a theory of statistical estimation that uses latent characterizations of individuals as predictors of observed responses. While IRT is quite flexible and powerful, it is based on the key assumptions of unidimensionality and local independence. The unidimensionality assumption states that the observations on the items are solely a function of a single continuous latent person variable (de Ayala, 2009). For example, on a mathematics test there is assumed to be a single latent mathematics proficiency variable that underlies the test taker's performance. This assumption can be violated from a number of perspectives. First, items can contain multiple content strands, which can lead to multidimensionality. For example, the performance of a test taker on a math problem that contains a significant amount of text in the stimulus may also partially depend on their level of reading ability. Multidimensionality can also arise from a format perspective. Students may perform differentially well on, for example, multiple-choice versus constructed-response items. It should be noted, however, that no test is likely to be perfectly unidimensional. The second assumption, local independence, states that the probability of correctly responding to any particular item is independent of the performance on any other items on the test, conditional on ability. As noted in Chapter 1, Lord (1980) defines this relationship as:

P(u_i = 1 | θ) = P(u_i = 1 | θ, u_j, u_k, …),  (i ≠ j, k, …),  (2.22)

where θ denotes ability and u_i is the response to item i. This assumption can be violated in a number of ways, including through the usage of passages.
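Returning to the accuracy indices above: given the conditional category probabilities p_θ(h) and the true category η, Equations 2.16, 2.18, and 2.19 simply partition that probability vector into mass at, above, and below the true category. A minimal sketch with hypothetical probabilities:

```python
def accuracy_indices(p_cat, eta):
    """Conditional accuracy (Eq. 2.16), false positive rate (Eq. 2.18),
    and false negative rate (Eq. 2.19); p_cat holds p_theta(h) for
    h = 1, ..., K, and eta is the (1-indexed) true category."""
    gamma = p_cat[eta - 1]            # mass in the true category
    false_pos = sum(p_cat[eta:])      # mass above the true category
    false_neg = sum(p_cat[:eta - 1])  # mass below the true category
    return gamma, false_pos, false_neg

# Hypothetical conditional category probabilities for K = 3 categories,
# with the middle category as the true one.
gamma, fp, fn = accuracy_indices([0.10, 0.70, 0.20], eta=2)
print(gamma, fp, fn)  # the three rates sum to 1
```

The marginal indices in Equations 2.17, 2.20, and 2.21 then follow by averaging these conditional values over g(θ) with quadrature weights.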
More informationBlending Psychometrics with Bayesian Inference Networks: Measuring Hundreds of Latent Variables Simultaneously
Blending Psychometrics with Bayesian Inference Networks: Measuring Hundreds of Latent Variables Simultaneously Jonathan Templin Department of Educational Psychology Achievement and Assessment Institute
More informationScoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods
James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical
More informationCOMBINING SCALING AND CLASSIFICATION: A PSYCHOMETRIC MODEL FOR SCALING ABILITY AND DIAGNOSING MISCONCEPTIONS LAINE P. BRADSHAW
COMBINING SCALING AND CLASSIFICATION: A PSYCHOMETRIC MODEL FOR SCALING ABILITY AND DIAGNOSING MISCONCEPTIONS by LAINE P. BRADSHAW (Under the Direction of Jonathan Templin and Karen Samuelsen) ABSTRACT
More informationComparing DIF methods for data with dual dependency
DOI 10.1186/s40536-016-0033-3 METHODOLOGY Open Access Comparing DIF methods for data with dual dependency Ying Jin 1* and Minsoo Kang 2 *Correspondence: ying.jin@mtsu.edu 1 Department of Psychology, Middle
More informationUNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT. Debra White Moore
UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT by Debra White Moore B.M.Ed., University of North Carolina, Greensboro, 1989 M.A., University of Pittsburgh,
More informationUsing Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items
University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations May 215 Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items Tamara Beth
More informationEstimating the Validity of a
Estimating the Validity of a Multiple-Choice Test Item Having k Correct Alternatives Rand R. Wilcox University of Southern California and University of Califarnia, Los Angeles In various situations, a
More informationDifferential Item Functioning
Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item
More informationParameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods
Journal of Modern Applied Statistical Methods Volume 11 Issue 1 Article 14 5-1-2012 Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian
More informationRunning head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note
Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,
More informationCopyright. Kelly Diane Brune
Copyright by Kelly Diane Brune 2011 The Dissertation Committee for Kelly Diane Brune Certifies that this is the approved version of the following dissertation: An Evaluation of Item Difficulty and Person
More informationAn exploration of decision consistency indices for one form tests
Retrospective Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 1983 An exploration of decision consistency indices for one form tests Randi Louise Hagen Iowa State University
More informationAnalyzing data from educational surveys: a comparison of HLM and Multilevel IRT. Amin Mousavi
Analyzing data from educational surveys: a comparison of HLM and Multilevel IRT Amin Mousavi Centre for Research in Applied Measurement and Evaluation University of Alberta Paper Presented at the 2013
More informationThe effects of ordinal data on coefficient alpha
James Madison University JMU Scholarly Commons Masters Theses The Graduate School Spring 2015 The effects of ordinal data on coefficient alpha Kathryn E. Pinder James Madison University Follow this and
More informationA structural equation modeling approach for examining position effects in large scale assessments
DOI 10.1186/s40536-017-0042-x METHODOLOGY Open Access A structural equation modeling approach for examining position effects in large scale assessments Okan Bulut *, Qi Quo and Mark J. Gierl *Correspondence:
More informationFactors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model
Journal of Educational Measurement Summer 2010, Vol. 47, No. 2, pp. 227 249 Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Jimmy de la Torre and Yuan Hong
More informationLinking Errors in Trend Estimation in Large-Scale Surveys: A Case Study
Research Report Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Xueli Xu Matthias von Davier April 2010 ETS RR-10-10 Listening. Learning. Leading. Linking Errors in Trend Estimation
More informationaccuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian
Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation
More informationNonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia
Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla
More information11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES
Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are
More informationPsychological testing
Psychological testing Lecture 12 Mikołaj Winiewski, PhD Test Construction Strategies Content validation Empirical Criterion Factor Analysis Mixed approach (all of the above) Content Validation Defining
More informationREMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT. Qi Chen
REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT By Qi Chen A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and
More informationResearch and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida
Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality
More informationTHE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN. Moatasim A. Barri
THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN By Moatasim A. Barri B.S., King Abdul Aziz University M.S.Ed., The University of Kansas Ph.D.,
More informationExamining Factors Affecting Language Performance: A Comparison of Three Measurement Approaches
Pertanika J. Soc. Sci. & Hum. 21 (3): 1149-1162 (2013) SOCIAL SCIENCES & HUMANITIES Journal homepage: http://www.pertanika.upm.edu.my/ Examining Factors Affecting Language Performance: A Comparison of
More informationAn Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests
University of Massachusetts - Amherst ScholarWorks@UMass Amherst Dissertations 2-2012 An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests Wai Yan
More informationYou must answer question 1.
Research Methods and Statistics Specialty Area Exam October 28, 2015 Part I: Statistics Committee: Richard Williams (Chair), Elizabeth McClintock, Sarah Mustillo You must answer question 1. 1. Suppose
More informationThe Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times.
The Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times By Suk Keun Im Submitted to the graduate degree program in Department of Educational
More informationBuilding Evaluation Scales for NLP using Item Response Theory
Building Evaluation Scales for NLP using Item Response Theory John Lalor CICS, UMass Amherst Joint work with Hao Wu (BC) and Hong Yu (UMMS) Motivation Evaluation metrics for NLP have been mostly unchanged
More informationA Comparison of Several Goodness-of-Fit Statistics
A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures
More informationTHE NATURE OF OBJECTIVITY WITH THE RASCH MODEL
JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that
More informationOn the purpose of testing:
Why Evaluation & Assessment is Important Feedback to students Feedback to teachers Information to parents Information for selection and certification Information for accountability Incentives to increase
More informationA COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL
International Journal of Innovative Management, Information & Production ISME Internationalc2010 ISSN 2185-5439 Volume 1, Number 1, December 2010 PP. 81-89 A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM
More informationA Comparison of Four Test Equating Methods
A Comparison of Four Test Equating Methods Report Prepared for the Education Quality and Accountability Office (EQAO) by Xiao Pang, Ph.D. Psychometrician, EQAO Ebby Madera, Ph.D. Psychometrician, EQAO
More informationItem Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century
International Journal of Scientific Research in Education, SEPTEMBER 2018, Vol. 11(3B), 627-635. Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century
More informationComprehensive Statistical Analysis of a Mathematics Placement Test
Comprehensive Statistical Analysis of a Mathematics Placement Test Robert J. Hall Department of Educational Psychology Texas A&M University, USA (bobhall@tamu.edu) Eunju Jung Department of Educational
More informationLUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp.
LUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp. Traditional test development focused on one purpose of the test, either ranking test-takers
More informationAssessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures. Dubravka Svetina
Assessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures by Dubravka Svetina A Dissertation Presented in Partial Fulfillment of the Requirements for
More informationHaving your cake and eating it too: multiple dimensions and a composite
Having your cake and eating it too: multiple dimensions and a composite Perman Gochyyev and Mark Wilson UC Berkeley BEAR Seminar October, 2018 outline Motivating example Different modeling approaches Composite
More informationChapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.
Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human
More informationAn Investigation of vertical scaling with item response theory using a multistage testing framework
University of Iowa Iowa Research Online Theses and Dissertations 2008 An Investigation of vertical scaling with item response theory using a multistage testing framework Jonathan James Beard University
More informationSelection of Linking Items
Selection of Linking Items Subset of items that maximally reflect the scale information function Denote the scale information as Linear programming solver (in R, lp_solve 5.5) min(y) Subject to θ, θs,
More informationStatistical Methods and Reasoning for the Clinical Sciences
Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries
More informationItem Response Theory: Methods for the Analysis of Discrete Survey Response Data
Item Response Theory: Methods for the Analysis of Discrete Survey Response Data ICPSR Summer Workshop at the University of Michigan June 29, 2015 July 3, 2015 Presented by: Dr. Jonathan Templin Department
More informationEvaluating the quality of analytic ratings with Mokken scaling
Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch
More informationMantel-Haenszel Procedures for Detecting Differential Item Functioning
A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of
More information