Decision consistency and accuracy indices for the bifactor and testlet response theory models
University of Iowa
Iowa Research Online
Theses and Dissertations
Summer 2014

Decision consistency and accuracy indices for the bifactor and testlet response theory models

Lee James LaFond
University of Iowa

Copyright 2014 Lee LaFond

This dissertation is available at Iowa Research Online.

Recommended Citation: LaFond, Lee James. "Decision consistency and accuracy indices for the bifactor and testlet response theory models." PhD (Doctor of Philosophy) thesis, University of Iowa, 2014. Part of the Educational Psychology Commons.
DECISION CONSISTENCY AND ACCURACY INDICES FOR THE BIFACTOR AND TESTLET RESPONSE THEORY MODELS

by Lee James LaFond

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations in the Graduate College of The University of Iowa

August 2014

Thesis Supervisor: Associate Professor Won-Chan Lee
Copyright by LEE JAMES LAFOND 2014 All Rights Reserved
Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Lee James LaFond has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations at the August 2014 graduation.

Thesis Committee:
Won-Chan Lee, Thesis Supervisor
Robert Brennan
Kate Cowles
Deborah Harris
Michael Kolen
Donald Yarbrough
ACKNOWLEDGMENTS

I would like to thank several people for all of their valuable help in both writing this thesis and helping guide my experience in graduate school. First, I would like to thank Dr. Won-Chan Lee, who went above the call of duty in serving as my dissertation chair. In addition to the thoughtful and thorough feedback he provided for this study, I am also extremely grateful for all of the programming work that went into developing mirt-class. Next, I would like to thank Dr. Donald Yarbrough, who in addition to serving on my committee, also served as my academic advisor and supervisor at the Center for Evaluation and Assessment. I cannot think of a better person to have served as a mentor and guide to the field of program evaluation. I also would like to thank all of the other members of my committee: Dr. Robert Brennan, Dr. Michael Kolen, Dr. Deborah Harris, and Dr. Kate Cowles. I am deeply honored to have such key leaders in the field guide me through this process. Finally, I would like to thank Dr. Anna Topczewski for her friendship as a fellow graduate student in the program, and for all of the assistance she provided for the simulation study.
ABSTRACT

The primary goal of this study was to develop a new procedure for estimating decision consistency and accuracy indices using the bifactor and testlet response theory (TRT) models. This study is the first to investigate decision consistency and accuracy from a multidimensional perspective, and the results have shown that the bifactor model at least behaved in a way that met the author's expectations and represents a potentially useful procedure. The TRT model, on the other hand, did not meet the author's expectations and generally showed poor model performance. The multidimensional decision consistency and accuracy indices proposed in this study appear to provide good performance, at least for the bifactor model, in the case of a testlet effect of large magnitude. For practitioners examining a test containing testlets for decision consistency and accuracy, a recommended first step is to check for dimensionality. If the testlets show a significant degree of multidimensionality, then the multidimensional indices proposed here can be recommended, as the simulation study showed an improved level of performance over unidimensional IRT models. However, if there is not a significant degree of multidimensionality, then the unidimensional IRT models and indices would perform as well as, or even better than, the multidimensional models.

Another goal of this study was to compare methods for numerical integration used in the calculation of decision consistency and accuracy indices. This study investigated a new method (M method) that sampled ability estimates through a Monte Carlo approach. In summary, the M method seems to be just as accurate as the other commonly used methods for numerical integration, but it has some practical advantages over the D and P methods. It is not nearly as computationally intensive as the D method, and it does not require the large sample sizes of the P method. In addition, the P method has a conceptual disadvantage in that the conditioning variable, in theory, should be the true theta, not an estimated theta. The M method avoids both of these issues and seems to provide equally accurate estimates of decision consistency and accuracy indices, which makes it a strong option, particularly in multidimensional cases.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF SYMBOLS
CHAPTER 1: INTRODUCTION
    Decision Consistency
    Decision Accuracy
    Measurement Models
        Unidimensional 3PL Item Response Theory Model
        Bifactor IRT Model
        Testlet Response Theory Model
    Decision Consistency Indices
    Decision Accuracy Indices
    Purpose of the Study
CHAPTER 2: LITERATURE REVIEW
    Introduction
    The Development of Decision Consistency and Accuracy Indices
        Carver Method
        Swaminathan-Hambleton-Algina Method
        Strong True Score Models
        Subkoviak and Compound Multinomial Models
        Livingston-Lewis Procedure
        Item Response Theory Methods
    Item Response Theory
    Bifactor Model
    Testlet Response Theory
    Model Fit
    Summary
CHAPTER 3: METHODOLOGY
    Data
    Establishing Cut-Scores
    Decision Consistency and Accuracy Under the Bifactor and TRT Models
    Model Specification
    Estimation
    Testlet Response Model
    Analysis
    Summary
CHAPTER 4: RESULTS
    Simulation Study Results
        Population Values of Phi and Gamma
        Estimated Values of Phi and Gamma
        Bias
        Standard Errors
        RMSE
        Summary
    D Method versus M Method Comparison
        Unidimensional Models
        Multidimensional Models
        Summary
    Real Data Results
        Unidimensionality, Model Fit, and Local Independence
        Marginal Classification Indices
        Summary
CHAPTER 5: DISCUSSION
    Purpose 1
    Purposes 2 and 3
    Purpose 4
    Limitations and Future Research
    Summary and Conclusion
REFERENCES
APPENDIX A: SAMPLE FLEXMIRT BIFACTOR SYNTAX
APPENDIX B: SAMPLE FLEXMIRT TRT SYNTAX
LIST OF TABLES

Table 1.1. Mastery and Non-Mastery Outcomes on Two Administrations of the Same Test
Descriptive Statistics for Summed-Scores of Tests and Testlets for Form I
Descriptive Statistics for Summed-Scores of Tests and Testlets for Form II
Descriptive Statistics for Summed-Scores of Tests and Testlets for Form III
Summary of Models Employed in the Study
Summary of Model Priors for Estimation
Summary of Model Fit Statistics Employed in the Study
Simulation Phi and Gamma Results Based on Population Item Statistics
Simulation Study Decision Consistency Index (Phi) Means Under Various Models
Simulation Study Decision Accuracy Index (Gamma) Means Under Various Models
Simulation Study Decision Consistency Index Bias Under Various Models
Simulation Study Decision Accuracy Index Bias Under Various Models
Simulation Study Decision Consistency Index Absolute Value Bias Under Various Models
Simulation Study Decision Accuracy Index Absolute Value Bias Under Various Models
Simulation Study Decision Consistency Index SEs Under Various Models
Simulation Study Decision Accuracy Index SEs Under Various Models
Simulation Study Decision Consistency Index RMSEs Under Various Models
Simulation Study Decision Accuracy Index RMSEs Under Various Models
D Method vs. M Method Decision Consistency Index Comparison for the UIRT Model
D Method vs. M Method Decision Accuracy Index Comparison for the UIRT Model
D Method vs. M Method Decision Consistency Index Comparison for the GRM Model
D Method vs. M Method Decision Accuracy Index Comparison for the GRM Model
D Method vs. M Method Decision Consistency Index Comparison for the Bifactor Model
D Method vs. M Method Decision Accuracy Index Comparison for the Bifactor Model
D Method vs. M Method Decision Consistency Index Comparison for the TRT Model
D Method vs. M Method Decision Accuracy Index Comparison for the TRT Model
Summary Statistics for Test A Forms and Testlets
Summary Statistics for Test B Forms and Testlets
DIMTEST p-value Results for All Forms of Test A and Test B
Fit and LD Statistics for Test A
Table 4.24. Fit and LD Statistics for Test B
Actual and Estimated Proportions for Test A with 50% Cut-score
Actual and Estimated Proportions for Test A with 80% Cut-score
Actual and Estimated Proportions for Test A with Both 50% and 80% Cut-scores
Actual and Estimated Proportions for Test B with 50% Cut-score
Actual and Estimated Proportions for Test B with 80% Cut-score
Actual and Estimated Proportions for Test B with Both 50% and 80% Cut-scores
Decision Consistency Scores (Phi) for Test A
Decision Accuracy Scores (Gamma) for Test A
Decision Consistency Scores (Phi) for Test B
Decision Accuracy Scores (Gamma) for Test B
Comparison of Estimated A Parameters for IRT Models in Test B Form I
LIST OF SYMBOLS

a_i : IRT discrimination/slope parameter
b_i : IRT difficulty/location parameter
c_i : IRT pseudo-chance parameter
d_i : IRT multidimensional intercept parameter
g(i) : Testlet with nested item i
g(θ) : Distribution of ability
g(τ) : True score density function
h : Index denoting a specific classification category
i : Index denoting a specific item
j : Number of score points on a polytomous item
K : Total number of classification categories
n : Number of items in item set
N : Number of examinees
p : Probability of earning a particular score
Q : Number of quadrature points
S : Number of specific ability factors
S-X² : Item fit statistic
u : An item score
U_i : A random variable representing responses for item i
x : Test summed-score
x_k : Cut-score in raw score metric (k = 1, 2, …, K − 1)
y : Vector of item responses
AIC : Akaike Information Criterion
BIC : Bayesian Information Criterion
CTT : Classical Test Theory
IRT : Item Response Theory
MIRT : Multidimensional Item Response Theory
TRT : Testlet Response Theory
UIRT : Unidimensional Item Response Theory
1PL : One-parameter logistic IRT model
2PL : Two-parameter logistic IRT model
3PL : Three-parameter logistic IRT model
β : Vector of item parameters
φ : Marginal decision consistency or agreement index
φ_c : Chance agreement
φ_θ : Conditional classification consistency index
γ : Marginal accuracy index
γ_θ : Conditional classification accuracy index
γ_θ⁺ : Conditional false positive error rate
γ_θ⁻ : Conditional false negative error rate
γ⁺ : Marginal false positive error rate
γ⁻ : Marginal false negative error rate
η : True classification
η̂ : Accurate decision category
θ : Ability
θ (vector) : Vector of abilities
κ : Coefficient kappa
τ : True score
π : True proportion-correct score
Γ_g(i) : Testlet effect for testlet with nested item i
σ²_g(i) : Variance of testlet effect
Χ² : Local dependence index
CHAPTER 1
INTRODUCTION

A common application of test scores is to determine levels of examinee performance relative to specified cut-scores. Correspondingly, it is useful to know to what degree the classification is both consistent and accurate. Decision consistency describes the degree to which test takers are re-classified into the same category over parallel replications. Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores (Lee, 2010). Decision consistency and accuracy related to test scores are of great practical value to psychometricians, practitioners, and test takers. The consequences of NCLB and the role of high-stakes testing make proper determination of proficiency critical. For example, a high school student who is erroneously deemed non-proficient may be denied graduation in some states. As a result, there is a general expectation and need for test classifications to be both stable and accurate. In order to determine proficiency levels, tests use items of various formats that have estimable parameters based on a chosen measurement model. One common type of item employed is the dichotomously scored multiple-choice item. This type of item is appealing because such items are generally regarded as reliable and efficient and can be objectively scored. One common usage of the multiple-choice item is in conjunction with passages. With this type of format, several items can draw from a common stimulus, thereby forming a testlet. Estimating parameters of these types of items with item response theory must be done with care due to the assumption of local independence. Under local independence, item responses are conditionally independent of one another given ability. More specifically, the likelihood of a particular test taker getting a certain item correct does not depend on how they performed on other items (Lord, 1980). However,
the existence of a common stimulus or passage can violate this assumption. For example, a student who is struggling to comprehend a particular reading passage is likely to find all of the related items more difficult in a correlated fashion.

Decision Consistency

As noted previously, decision consistency is the degree to which a test consistently classifies members of the same group into the same category over replication. For a straightforward example, see Table 1.1. An examination of the results of the two administrations of this particular test reveals that out of the 100 test takers: 50 demonstrated mastery on both administrations, 15 demonstrated non-mastery on both administrations, 10 demonstrated mastery on the first but not the second administration, and 25 demonstrated mastery on the second but not the first. The 50 test takers who demonstrated mastery both times and the 15 who demonstrated non-mastery both times show decision consistency; those 65 test takers had the same classification over both replications of the test. The remaining 35 test takers had inconsistent results over replication, demonstrating mastery on one administration but non-mastery on the other. Thus, in this particular example only 65 out of 100, or 65%, display decision consistency between the two replications. This degree of consistency is likely undesirable for those involved with the usage of this particular test. For example, the results of the 35% who had inconsistent results would be inconclusive and not of any particular use. In addition, such a lack of consistency might suggest a lack of reliability of the test scores and the possibility that many of the classification results may be due only to chance. While 100% consistency may not be attainable, it is in the best interest of test makers to ensure that the classification of test results is as stable as possible.
Decision Accuracy

While decision consistency is desirable, it should be noted that a high level of consistency does not automatically imply that the results necessarily reflect the test takers' true ability or nature (Huynh, 1990). The accuracy or validity of these decisions is commonly assessed through content or validity studies, but indices for decision accuracy have been developed as well. Again, decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores. More specifically, decision accuracy describes how closely a test taker's true score classification aligns with their observed score classification (Haertel, 2006). In decision accuracy, the concepts of false positives and false negatives are useful to consider. For example, a false positive would be a result of mastery when in reality the true classification of the student should be non-mastery. Similarly, a false negative would be a result of non-mastery when the true classification is mastery. Both of these results represent errors in classification and a lack of decision accuracy. More generally, a false positive occurs when a test taker is classified at a higher category than the true score indicates, and a false negative occurs when a test taker is classified at a lower category than the true score indicates.

Measurement Models

A primary consideration in determining decision consistency and accuracy is that the true ability and true scores of test takers are not known and must be estimated. In addition, item parameters such as difficulty, discrimination, and the pseudo-guessing parameter need to be estimated as well. Measurement models have been developed according to a set of assumptions for the purposes of calculating these estimates.
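The accurate/false-positive/false-negative trichotomy described above can be sketched directly. The following is a minimal illustration; the function name and the integer category codes are assumptions for the example, not notation from this study:

```python
def classification_errors(true_cats, observed_cats):
    """Label each decision by comparing observed and true category indices."""
    labels = []
    for t, o in zip(true_cats, observed_cats):
        if o > t:
            labels.append("false positive")   # classified above the true category
        elif o < t:
            labels.append("false negative")   # classified below the true category
        else:
            labels.append("accurate")
    return labels
```

For example, with true categories [1, 0, 1] and observed categories [1, 1, 0], the three decisions are labeled accurate, false positive, and false negative, respectively.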
In this study, the item response theory (IRT) framework is discussed with three specific measurement models: the unidimensional three-parameter logistic IRT (UIRT) model, the testlet response theory model, and the bifactor IRT model.
Unidimensional 3PL Item Response Theory Model

Unidimensional item response theory is a key measurement model in psychometrics and plays a foundational role in this study. Lord (1980) frames item response theory as a theory of statistical estimation that uses latent characterizations of individuals as predictors of observed responses. Hambleton (1993) describes several benefits and characteristics of item response theory in comparison to classical test theory. For example, one defining characteristic of IRT is that the item-ability relationship is clearly defined, whereas in classical test theory it is not specified. For the UIRT model, the probability that a person of a given ability gives a correct response can be represented by:

P(θ) = c_i + (1 − c_i) · exp{1.7[a_i(θ − b_i)]} / (1 + exp{1.7[a_i(θ − b_i)]}),   (1.1)

where θ is the latent trait (ability, skill, etc.), P(θ) is the probability of a correct response given a particular ability level, parameter a_i is the discriminating power of the item, parameter b_i is the difficulty of the item, and parameter c_i is the pseudo-guessing parameter. However, while UIRT is quite flexible and powerful, it is based on the key assumptions of unidimensionality and local independence. The unidimensionality assumption states that the observations on the items are solely a function of a single continuous latent person variable (de Ayala, 2009). For example, to meet this assumption on a mathematics test, there is assumed to be a single latent mathematics proficiency variable that underlies the test taker's performance. This assumption can be violated from a number of perspectives. First, items can contain multiple content strands, which can lead to multidimensionality. For example, the performance of a test taker on a math problem which contains a significant amount of text in the stimulus may also partially depend on their level of reading ability. Multidimensionality can also exist from a format
perspective. Students may perform differentially well on, for example, multiple-choice versus constructed-response items. As a result, it is common practice for practitioners to conduct dimensionality assessments to see if the assumption is violated to the extent that the model no longer fits the data. If multidimensionality exists, the degree of the violation should be examined to see if the model can still provide a reasonable representation. The second assumption, local independence, states that the probability of correctly responding to any particular item is independent of the performance on any other items on the test. Lord (1980) defines this relationship as:

P(U_i = 1 | θ) = P(U_i = 1 | θ, u_j, u_k, …), (i ≠ j, k, …),   (1.2)

where θ denotes ability and u is the response for items i, j, k, etc. This assumption can be violated in a number of ways. For example, one item response may clue the response of another, thereby creating a dependency. Also, it is a common practice in assessment to tie a number of items to a particular stimulus or passage. As a result, dependency may exist between the items due to the common stimulus, resulting in a violation of local independence. This particular effect is of primary concern to this study and leads to the choice of the other two models to be examined.

Bifactor IRT Model

Full-information item bifactor analysis (Gibbons et al., 2007; Gibbons & Hedeker, 1992) allows for multidimensionality between item types. A sample item bifactor measurement structure could have the following factor pattern:

[ a_10  a_11  0    ]
[ a_20  a_21  0    ]
[ a_30  0     a_32 ]
[ a_40  0     a_42 ]
[ a_50  0     a_52 ]

The first subscript denotes the item and the second corresponds to a dimension. For example, the above pattern could represent the factor structure for 5 reading items with 2 nested within one passage and 3 in another. The first column represents a general reading
dimension, while the second and third are dimensions specific to each of the passages. Thus, this model controls for the passage effect that potentially violates the unidimensionality assumption in the UIRT model. This study will focus on the bifactor IRT model (Cai, Yang, & Hansen, 2011), which is an extension of the standard UIRT model. For dichotomous items, the bifactor model with general factor θ_0 and one specific factor θ_s is:

P(θ_0, θ_s) = c_i + (1 − c_i) · exp[1.7(a_0i θ_0 + a_si θ_s + d_i)] / (1 + exp[1.7(a_0i θ_0 + a_si θ_s + d_i)]).   (1.3)

Above, c_i is the pseudo-guessing parameter, d_i is the item intercept, a_0i is the item slope on the general factor, and a_si is the item slope on the specific factor s. Note that the item slopes are similar in interpretation to discrimination in the UIRT model, but are specific to each factor.

Testlet Response Theory Model

Testlets are a popular way of structuring items in which multiple items are attached to a single stimulus or passage. One common usage is with reading passages from which several items draw inferences. Wainer, Bradlow, and Wang (2007) state that, for a typical reading passage followed by four to six associated items, the local independence assumption does not hold. Reducing the length of the passage was found to reduce this effect, but it was discovered that the construct measured was not the same. Attaching only one item to the passage also eliminates the problem, but at the cost of being very inefficient in terms of the time needed to read the passage relative to the amount of information gained. Thus, there is a need for a model where the unit of test construction is smaller than the whole test but larger than a single item. Wainer and Kiely (1987) argued for such a model and proposed the testlet as a unit of construction. Of particular interest to this study is testlet response theory (TRT), which was developed by Wainer et al. (2007). Essentially it expands the 3PL UIRT model to:
P(θ) = c_i + (1 − c_i) · exp{1.7[a_i(θ − b_i − Γ_g(i))]} / (1 + exp{1.7[a_i(θ − b_i − Γ_g(i))]}),   (1.4)

where Γ_g(i) is the testlet effect of a particular respondent for item i nested within testlet g(i). Note that if Γ_g(i) = 0, there is no testlet effect and the model simplifies to the UIRT model. However, as described in DeMars (2006), the testlet response model is simply a constrained bifactor model. In the bifactor model, testlet slopes are independent of general slopes; under the testlet response model, the testlet slopes are proportional to the general slopes. Thus, while the estimation of ability and item parameters is impacted, the same general model could be used to estimate decision consistency and accuracy indices.

Decision Consistency Indices

Classification accuracy and consistency indices can be estimated using a wide variety of models. In the summed-score metric, indices exist both for item response theory (Huynh, 1990) and classical test theory (Livingston & Lewis, 1995). More recently, theta-metric indices have been developed by Rudner (2001) and Guo (2006). Specifically, this study will focus on the IRT methods described in Schulz et al. (1999), Wang et al. (2000), and Lee et al. (2002), more recently generalized in Lee (2010). Note, however, that these methods were developed with the UIRT model in mind, and the primary purpose of this study is to develop a procedure for the TRT and bifactor models. Assume that a test score is found by summing all of the item scores on a particular form. Also, let x_1, x_2, …, x_{K−1} denote a set of cut-scores that are used to classify examinees into K mutually exclusive categories. So, a score less than x_1 would be placed in the first category, a score equal to or greater than x_1 but less than x_2 would be placed in the second category, and so on.
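DeMars's (2006) observation that TRT is a constrained bifactor model can be made concrete with a small sketch. The two response functions below follow Equations 1.3 and 1.4; the function names are illustrative, and the parameterization used to reproduce a TRT item with the bifactor form (testlet slope equal to the general slope, intercept d = −a·b, specific factor set to −Γ) is one convenient choice, not the only one:

```python
import math

def p_bifactor(theta0, theta_s, a0, a_s, d, c, D=1.7):
    """Bifactor model (Eq. 1.3): separate slopes on the general and specific factors."""
    z = D * (a0 * theta0 + a_s * theta_s + d)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

def p_trt(theta, gamma_g, a, b, c, D=1.7):
    """TRT model (Eq. 1.4): the testlet effect shifts the effective ability."""
    z = D * a * (theta - b - gamma_g)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# With gamma_g = 0 the TRT model reduces to the 3PL UIRT model.
# With theta_s = -gamma_g, a_s = a, and d = -a * b, the bifactor form
# reproduces the TRT probability exactly, illustrating the constraint.
```

For instance, p_trt(θ, Γ, a, b, c) and p_bifactor(θ, −Γ, a, a, −a·b, c) give identical probabilities for any θ and Γ.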
The conditional category probability can be computed by summing conditional summed-score probabilities for all scores that belong to category h:

p_θ(h) = Σ_{x = x_{h−1}}^{x_h − 1} Pr(X = x | θ),   (1.5)
where h = 1, 2, …, K. Once the above probability is calculated, it is useful to know the probability of an individual with a given ability level being placed twice in the same category on two separate parallel administrations of a test. This is the conditional classification consistency index (φ_θ), which can be calculated as:

φ_θ = Σ_{h=1}^{K} [p_θ(h)]².   (1.6)

Finally, given the above, classification consistency across all levels of ability can be calculated. Given the distribution of ability, g(θ), the marginal classification consistency index φ is computed as:

φ = ∫ φ_θ g(θ) dθ.   (1.7)

Decision Accuracy Indices

Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores. However, true scores, true ability levels, and true classifications are not known, and statistical techniques and assumptions need to be employed in order for an estimate to be made. First, suppose that the expected summed-score of a test taker is their true score τ. Next, suppose a set of true cut-scores on the summed-score metric, τ_1, τ_2, …, τ_{K−1}, determines the true categorization of each test taker with θ or τ. Also, assume that the conditional probabilities, p_θ(h), from Equation 1.5 are known. Finally, the true categorical status, η (= 1, 2, …, K), can be determined by comparing the expected summed-score for θ with the true cut-scores. Following from the above, the conditional classification accuracy index is:

γ_θ = p_θ(η),   (1.8)

where η is the true category for ability θ. Next, by integrating over all ability levels, the marginal classification accuracy index is:

γ = ∫ γ_θ g(θ) dθ.   (1.9)
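Equations 1.5 through 1.9 can be sketched computationally for the UIRT (3PL) case. The sketch below uses the well-known Lord-Wingersky recursion to obtain the conditional summed-score distribution; all function names are illustrative, and the category index here is zero-based rather than the 1, …, K indexing used in the text:

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (Eq. 1.1)."""
    z = D * a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

def summed_score_dist(theta, items):
    """Lord-Wingersky recursion: Pr(X = x | theta) for every summed score x."""
    dist = [1.0]
    for (a, b, c) in items:
        p = p_3pl(theta, a, b, c)
        new = [0.0] * (len(dist) + 1)
        for x, pr in enumerate(dist):
            new[x] += pr * (1.0 - p)   # item answered incorrectly
            new[x + 1] += pr * p       # item answered correctly
        dist = new
    return dist

def category_probs(dist, cuts):
    """Eq. 1.5: collapse the score distribution into K category probabilities."""
    bounds = [0] + list(cuts) + [len(dist)]
    return [sum(dist[bounds[h]:bounds[h + 1]]) for h in range(len(bounds) - 1)]

def phi_theta(p_h):
    """Eq. 1.6: conditional classification consistency."""
    return sum(p * p for p in p_h)

def gamma_theta(p_h, eta):
    """Eq. 1.8: conditional accuracy, the probability of the true category eta."""
    return p_h[eta]
```

For a two-item test with a = 1, b = 0, c = 0 and a cut-score of 1, an examinee at θ = 0 has category probabilities (0.25, 0.75), so φ_θ = 0.625 and, if the true category is the upper one, γ_θ = 0.75.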
Note that in order for the indices in Equations 1.7 and 1.9 to be estimated, it is necessary to approximate the integral associated with the θ distribution. There are two different approaches typically employed: the D method and the P method (Lee, 2010). The D method is a distributional approach using estimated quadrature points and weights, and it replaces the integral with summations. The P method uses individual θ estimates to calculate the individual conditional classification indices, which are then averaged over all examinees. Lee (2010) found that the D and P methods produced similar results, but suggested that the D method be used when the focus of investigation is at the group level, and the P method when the focus is on the individual.

Purpose of the Study

The goal of this study is to investigate decision consistency and accuracy indices based on the UIRT, TRT, and bifactor models. The UIRT model has been thoroughly investigated in past research and will be the most straightforward of the models. The bifactor model, however, is multidimensional and will be the primary focus of this study. The presence of multiple thetas does not fit with the current methods for estimating indices, and as a result a new procedure needs to be developed. This new procedure is the first to address multidimensionality in estimating decision consistency and accuracy indices, and as such represents a meaningful contribution to the literature. The TRT model, which is a constrained version of the bifactor model, will use the same general procedures as the bifactor model. Specifically, the purposes of this study are:

1. Develop a new procedure for estimating decision consistency and accuracy indices using the bifactor and TRT models.

2. Compare decision consistency and accuracy indices between the UIRT, TRT, and bifactor models using simulated and real testlet data from various sources. In addition, use the four-parameter beta-binomial (4P BB) and graded response (GRM) models as a baseline comparison.
3. Investigate how the placement of cut-scores and the degree of multidimensionality affect the estimates of decision consistency and accuracy indices between the UIRT, TRT, and bifactor models.

4. Compare the different numerical integration methods used to calculate the indices.

Ideally, this study hopes to offer practitioners another credible option for measuring the degree of decision consistency and accuracy in assessments. At present, no other decision consistency or accuracy index explicitly accounts for multidimensionality. In particular, the indices presented in this study directly address the multidimensionality caused by the presence of testlets. In tests that employ testlets, these new indices have the potential to provide a more accurate picture of decision consistency and accuracy, making them a potentially useful new procedure for practitioners.
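Purpose 4 concerns the numerical integration needed for the marginal indices in Equations 1.7 and 1.9. The sketch below contrasts the D method (quadrature points and weights) with the Monte Carlo sampling idea behind the M method investigated in this study; it assumes a standard normal g(θ), and the function names are illustrative:

```python
import random

def marginal_index_D(cond_fn, quad_points, quad_weights):
    """D method: replace the integral with a weighted sum over quadrature points."""
    return sum(cond_fn(q) * w for q, w in zip(quad_points, quad_weights))

def marginal_index_M(cond_fn, n_draws=20000, seed=7):
    """M method: average the conditional index over Monte Carlo draws from g(theta)."""
    rng = random.Random(seed)
    return sum(cond_fn(rng.gauss(0.0, 1.0)) for _ in range(n_draws)) / n_draws
```

Either function can be handed any conditional index, such as the φ_θ of Equation 1.6. If the conditional index is constant over θ, both methods return that constant exactly (provided the quadrature weights sum to one), which is a convenient sanity check.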
Table 1.1. Mastery and Non-Mastery Outcomes on Two Administrations of the Same Test

                          Administration 2
Administration 1    Mastery    Non-Mastery    Total
Mastery                50          10           60
Non-Mastery            25          15           40
Total                  75          25          100
CHAPTER 2
LITERATURE REVIEW

This chapter reviews the literature on decision consistency and accuracy indices in order to provide a history and broader context that will serve as a foundation for this study. First, this chapter provides background and motivation for criterion-referenced testing. Next, a history of the development and evolution of consistency and accuracy indices is provided. In addition, the background is explored for the three models employed in this study: the unidimensional item response theory model, the testlet response theory model, and the bifactor model. Finally, there is a discussion of how indices will be estimated for the different models and a brief overview of assessing model fit.

Introduction

In order for a test score to be meaningfully interpreted, a test needs to be referenced in some way. In other words, the score needs to be compared to something external to the test as a point of reference for comparisons. For example, one possible way to do this is through norm-referenced testing. In norm-referenced testing, derived scores (e.g., percentile ranks, grade-equivalent scores) are constructed in a way that conveys information about the relative standing of the test taker to others in the defined group. This defined group is referred to as the norm group, and the derived scores are known as norm-referenced scores (Nitko, 1980). However, for some uses norm-referenced scores can be insufficient. For example, you may want to know if a particular student has mastered the prerequisite skills necessary to be successful in a math course. Here a norm-referenced interpretation is not particularly useful: you know the position of the student within the norm group, but you do not know if the student has sufficient mastery. What is needed here is some criterion by which mastery is gauged, and this is the motivation for criterion-referenced testing.
A common application of criterion-referenced test scores is to determine levels of test taker performance relative to specified cut-scores. With regard to this, it is useful to know to what degree the classification is both consistent and accurate. Decision consistency describes the degree to which test takers are re-classified into the same category over parallel replication. Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores (Lee, 2010). Decision consistency and accuracy related to test scores are of great practical value to psychometricians, practitioners, and test takers. The consequences of NCLB and the role of high-stakes testing make proper determination of proficiency critical. For example, a high school student who is erroneously deemed non-proficient may be denied graduation in some states. Similarly, a nurse who is mistakenly deemed proficient despite deficiencies may not be capable of providing adequate care. In particular, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) state, "when a test or combination of measures is used to make categorical decisions, estimates should be provided of the percentage of examinees who would be classified in the same way on two applications of the procedure, using the same form or alternate forms of the instrument" (p. 35). As a result, there is a general expectation and need for test classifications to be examined as part of the validation process. In order to determine proficiency levels, tests use items of various formats that have estimable parameters based on a chosen measurement model. One common type of item employed is the dichotomously scored multiple-choice item. This type of item is appealing because such items are generally regarded as reliable and efficient and can be objectively scored. One common usage of the multiple-choice item is in association with passages.
With this type of format, several items can draw from a common stimulus, thereby forming a testlet. Estimating parameters for these types of items with item response theory must be done with care due to the assumption of local independence. With assumed local
independence, items are conditionally independent of one another given ability. In other words, the likelihood of a particular test taker getting a certain item correct does not depend on how they performed on other items (Lord, 1980). However, the existence of a common stimulus or passage can violate this assumption.

The Development of Decision Consistency and Accuracy Indices

Carver Method

The earliest formal method for determining classification consistency appears in Carver (1970). Essentially, this method compared the percentage of mastery on two parallel administrations of a test. If the two percentages were equal, then the test was considered reliable. The clear weakness of this method is that even if the two percentages are identical, the test could still be unreliable with regard to the performance of individual test takers. For example, half of the test takers could be considered masters on the first administration of the test and the other half could be masters on the second. The results are reversed but yield the same percentage of mastery, which is an unstable outcome.

Swaminathan-Hambleton-Algina Method

Following Carver's method is the Swaminathan-Hambleton-Algina method developed in Hambleton and Novick (1973) and Swaminathan et al. (1974). This method suggested that the proportion of individuals consistently classified serve as the measure of decision consistency. Referring back to Table 1.1, fifty out of one hundred students demonstrated mastery on both administrations and fifteen demonstrated non-mastery on both administrations. Therefore, the proportion of consistent classification can be calculated as:

φ = Σ_{k=1}^{m} φ_{kk},  (2.1)

where φ_{kk} is the proportion of individuals consistently classified in the kth category on both administrations. Thus, for Table 1.1 this is calculated as:

φ = .50 + .15 = .65.
The upper limit of φ is 1.00, which is perfect consistency, and the lower limit is generally the proportion of consistent decisions that could be expected by chance. This is defined as:

φ_c = Σ_{k=1}^{m} p_{k·} p_{·k},  (2.2)

where p_{k·} and p_{·k} are the proportions assigned to category k on the first and second forms, respectively. For example, for Table 1.1 this is calculated as:

φ_c = (p_{1·})(p_{·1}) + (p_{2·})(p_{·2}) = .51.

Note that while the proportion of consistent decisions in the example is .65, the proportion expected by chance is quite high at .51. This suggests that a sizeable proportion of the consistent decisions is due to chance at this particular cut-score, and not due to the reliability of the test itself. Swaminathan et al. (1974) suggested using Cohen's (1960) kappa coefficient to remove the chance proportion and determine the proportion of consistent decisions that can be expected beyond chance. The kappa coefficient is calculated by:

κ = (φ − φ_c) / (1 − φ_c) = (.65 − .51) / (1 − .51) = .29.  (2.3)

Here, the upper limit is again 1.00, suggesting perfect consistency, while the lower limit is theoretically lower than zero. In this particular example, the low kappa value suggests that once the proportion of chance agreement is removed, the remaining degree of classification consistency is quite low.

Strong True Score Models

Note that the Swaminathan-Hambleton-Algina method still requires the administration of two parallel forms of the test. However, it is often not practical or possible to give two parallel administrations of the same test in an achievement testing context. Thus, it is useful to have an index that can be calculated from a single administration of a test. The beta-binomial model developed by Huynh (1976) is an example of one such index.
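The φ, φ_c, and κ computations above can be carried out directly from a classification table. The following is a minimal sketch; the table values are hypothetical and are not the Table 1.1 data:

```python
# Hypothetical 2x2 classification table (proportions): rows are
# categories on form 1, columns are categories on form 2.
table = [
    [0.15, 0.10],   # non-master on form 1: (non-master, master) on form 2
    [0.25, 0.50],   # master on form 1
]

def consistency_indices(table):
    """Return (phi, phi_c, kappa) per Equations 2.1-2.3."""
    m = len(table)
    # phi: sum of the diagonal (same category on both administrations)
    phi = sum(table[k][k] for k in range(m))
    # Marginal proportions for each category on the two forms.
    row = [sum(table[k]) for k in range(m)]
    col = [sum(table[i][k] for i in range(m)) for k in range(m)]
    # phi_c: agreement expected by chance from the marginals.
    phi_c = sum(row[k] * col[k] for k in range(m))
    # kappa: agreement beyond chance.
    kappa = (phi - phi_c) / (1 - phi_c)
    return phi, phi_c, kappa

phi, phi_c, kappa = consistency_indices(table)
print(round(phi, 2), round(phi_c, 2), round(kappa, 2))  # 0.65 0.55 0.22
```

The same function extends unchanged to K > 2 categories, since Equations 2.1 and 2.2 sum over all categories.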
Hanson and Brennan (1990) compared three different strong true score beta-binomial models (two-parameter, four-parameter, and four-parameter compound binomial) with regard to estimating classification indices. Strong true score models consider the probability that the summed-score random variable X of a test (with n dichotomously scored items) equals i (i = 0, …, n) as:

Pr(X = i) = ∫_0^1 Pr(X = i | π) g(π) dπ,  (2.4)

where π is the proportion-correct true score, g(π) is the true score density function, and Pr(X = i | π) is the conditional error distribution. Here, g(π) is assumed to belong to a certain parametric class, and Pr(X = i | π) is assumed to be either binomial or an approximation of a compound binomial distribution. Each of the three models Hanson and Brennan examined has its own set of assumptions. For the two-parameter beta-binomial model, the true score distribution is beta and the conditional error distribution is binomial. For the four-parameter beta-binomial model, the true score distribution is four-parameter beta (Lord, 1965) and the conditional error distribution is binomial. For the four-parameter beta compound binomial model, the true score distribution is four-parameter beta and the conditional error distribution is a two-term approximation to the compound binomial distribution. Here, classification consistency is defined as the consistency with which examinees are categorized on the basis of two independent administrations. Independence is defined so that the summed-scores on the two administrations (X_1 and X_2) are conditionally independent and identically distributed. Assuming two categories of classification, the bivariate distribution of X_1 and X_2 is:

Pr(X_1 = i, X_2 = j) = ∫_0^1 Pr(X_1 = i | π) Pr(X_2 = j | π) g(π) dπ.  (2.5)

Again assuming two categories of classification, from Equation 2.5 the classification index φ is defined as:

φ = Σ_{i=0}^{x_0−1} Σ_{j=0}^{x_0−1} Pr(X_1 = i, X_2 = j) + Σ_{i=x_0}^{n} Σ_{j=x_0}^{n} Pr(X_1 = i, X_2 = j).  (2.6)
The classification index φ is the probability that two summed-scores from parallel independent administrations are either both less than the cut-score x_0 (non-mastery) or both greater than or equal to x_0 (mastery). As given in Equation 2.3, the coefficient κ = (φ − φ_c) / (1 − φ_c). Here, the probability of chance agreement φ_c is given by:

φ_c = [Σ_{i=0}^{x_0−1} Pr(X_1 = i)] [Σ_{j=0}^{x_0−1} Pr(X_2 = j)] + [Σ_{i=x_0}^{n} Pr(X_1 = i)] [Σ_{j=x_0}^{n} Pr(X_2 = j)].  (2.7)

Note that since X_1 and X_2 are independent and identically distributed,

Σ_{i=0}^{x_0−1} Pr(X_1 = i) = Σ_{j=0}^{x_0−1} Pr(X_2 = j) = p_0,  (2.8)

where p_0 is the marginal probability that a test taker scores below the cut-score x_0. Also,

Σ_{i=x_0}^{n} Pr(X_1 = i) = Σ_{j=x_0}^{n} Pr(X_2 = j) = p_1,  (2.9)

where p_1 is the marginal probability that a test taker scores equal to or greater than the cut-score x_0. As a result, φ_c = p_0² + p_1², and φ_c does not depend on the actual pair of test scores for a test taker. Thus, given the assumptions made, only a single administration is necessary to calculate the indices for classification consistency. In their comparison of the three beta-binomial models, Hanson and Brennan found that the two-parameter model often demonstrates lack of fit. They recommended that before the two-parameter model is used, the adequacy of the model in fitting the raw observed scores be evaluated. In cases where the two-parameter model does not fit, the four-parameter models may provide better fit. According to their results, the four-parameter beta-binomial and the four-parameter beta compound binomial models provided very similar results. If neither the two-parameter nor four-parameter models fit, they suggest using a more complex model similar to those discussed by Wilcox (1981).

Subkoviak and Compound Multinomial Models

Lee (2005) notes that the strong true score models use a distributional approach, where assumptions are made concerning the distributional form of the true scores.
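Under the two-parameter beta-binomial model, the integral in Equation 2.5 has a closed form: the product of two binomial likelihoods against a beta(a, b) density integrates to a ratio of beta functions, so φ in Equation 2.6 can be computed without numerical integration. A sketch, with a hypothetical item count, beta parameters, and cut-score:

```python
from math import comb, lgamma, exp

def log_beta(x, y):
    """Natural log of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def joint_prob(i, j, n, a, b):
    """Pr(X1 = i, X2 = j) under the two-parameter beta-binomial model
    (beta(a, b) true scores, binomial conditional errors): the closed
    form of Equation 2.5."""
    return comb(n, i) * comb(n, j) * exp(
        log_beta(a + i + j, b + 2 * n - i - j) - log_beta(a, b))

def phi_beta_binomial(n, a, b, cut):
    """Equation 2.6: probability that two parallel administrations
    classify a test taker consistently (both below or both at/above
    the cut-score)."""
    below = sum(joint_prob(i, j, n, a, b)
                for i in range(cut) for j in range(cut))
    above = sum(joint_prob(i, j, n, a, b)
                for i in range(cut, n + 1) for j in range(cut, n + 1))
    return below + above

# Illustrative (hypothetical) values: 20 items, beta(8, 4) true scores,
# and a cut-score of 14 correct.
print(round(phi_beta_binomial(20, 8.0, 4.0, 14), 3))
```

The four-parameter models follow the same pattern but require the linearly transformed beta density of Lord (1965) in place of beta(a, b).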
Subkoviak (1976) employed an individual approach where no such assumptions are made. The Subkoviak procedure estimates classification consistency one examinee at a time, and then averages over examinees to create an overall consistency index for the entire sample group. Lee (2005) extended Subkoviak's work using the compound multinomial procedure (see also Lee, Brennan, & Wan, 2009). Lee proposed a multinomial error model for a test with undifferentiated polytomous items, and a compound multinomial model for a test containing a mixture of item types. The multinomial procedure reduces to Subkoviak's procedure when items are dichotomously scored. Assume there is a test that contains n polytomous items, each with j score points, c_1 < c_2 < ⋯ < c_j. Also, assume that X_1, X_2, …, X_j are the random variables representing the number of items scored at each of the possible score points. Under this procedure, each examinee's response pattern follows a multinomial distribution:

Pr(X_1 = x_1, X_2 = x_2, …, X_j = x_j | π) = [n! / (x_1! x_2! ⋯ x_j!)] π_1^{x_1} π_2^{x_2} ⋯ π_j^{x_j},  (2.10)

where π = {π_1, π_2, …, π_j} can be estimated by the observed proportions of items scored at the corresponding points. From here the probability density function of the total score Y can be determined by summing over all sets of x_1, x_2, …, x_j that yield a total score of y:

Pr(Y = y | π) = Σ Pr(X_1 = x_1, X_2 = x_2, …, X_j = x_j | π),  (2.11)

where the sum is over all sets with y = c_1 x_1 + c_2 x_2 + ⋯ + c_j x_j as the sum of the item scores. Once this density function is determined, the φ and κ indices can be calculated.

Livingston-Lewis Procedure

Livingston and Lewis (1995) described a procedure for estimating the accuracy and consistency of classifications based on test scores using the notion of effective test length. Effective test length refers to the number of discrete, dichotomously scored, locally independent test items needed to produce total scores having the same precision as the scores actually being used.
The original test score is transformed onto a new scale with a
maximum equal to the effective test length. The true score distribution on the new scale is then estimated by fitting a 2- or 4-parameter beta model. Also, the conditional distribution of scores on the new scale, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Once the parameters of both distributions are known, classification consistency is estimated in the same way as demonstrated in Hanson and Brennan (1990).

Item Response Theory Methods

All of the previously discussed models are based on either summed-scores or scale-scores and fall under the umbrella of classical test theory. Huynh (1990) first explored procedures for consistency indices based on latent trait models. Huynh demonstrated how the Rasch model could be used to project the bivariate frequency distribution of scores on two equivalent tests. This distribution is then used to estimate consistency indices such as φ and κ. Estimating these indices using IRT requires determining the marginal distribution of summed-scores through integration over the distribution of the latent trait θ. There are two possible approaches: the D method and the P method (Lee, 2010). The D method is a distributional approach using estimated quadrature points and weights, and it replaces the integral with summations. The P method uses individual θ estimates to calculate individual conditional classification indices, which are then averaged over all examinees. Lee (2010) found that the D and P methods produced similar results, but suggested that the D method be used when the focus of investigation is at the group level and the P method when the focus is on the individual. Huynh's work has been further developed by Schulz et al. (1999), Wang et al. (2000), and Lee et al. (2002). More recently, theta-metric indices have been developed by Rudner (2001) and Guo (2006).
Lee (2010) generalized their work and the following procedures reflect that form.
Given θ and g(θ), the latent trait being measured and its distribution respectively, the marginal probability of the total summed-score X is given by:

Pr(X = x) = ∫ Pr(X = x | θ) g(θ) dθ.  (2.12)

Note that Pr(X = x | θ) is the conditional summed-score distribution. Also, due to the IRT assumption of conditional independence, the probability of a response pattern given θ is the product of the probabilities of the item responses. Typically, a recursive formula, such as the Lord-Wingersky algorithm (Lord & Wingersky, 1984), is employed to calculate the conditional summed-score distribution for dichotomous items. In addition, when all items are dichotomous, a compound binomial model is used for modeling conditional number-correct score distributions. Assume that a test score is found by summing all of the item scores on a particular form. Also, let x_1, x_2, …, x_{K−1} denote a set of cut-scores that are used to classify examinees into K mutually exclusive categories. A score less than x_1 is placed in the first category, a score equal to or greater than x_1 but less than x_2 is placed in the second category, and so on. The conditional category probability can be computed by summing conditional summed-score probabilities for all scores that belong to category h:

p_θ(h) = Σ_{x=x_{h−1}}^{x_h − 1} Pr(X = x | θ),  (2.13)

where h = 1, 2, …, K. Once this probability is calculated, it is useful to know the probability of an individual with a given ability level being placed twice in the same category on two separate parallel administrations of a test. This is the conditional classification consistency index, φ_θ, which can be calculated as:

φ_θ = Σ_{h=1}^{K} [p_θ(h)]².  (2.14)

Finally, given the above, classification consistency across all levels of ability can be calculated. Given the distribution of ability, g(θ), the marginal classification consistency index φ is computed as:

φ = ∫ φ_θ g(θ) dθ.  (2.15)
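The chain from Equation 2.12 through Equation 2.15 can be sketched numerically. The following assumes a 2PL item response function and approximates a standard normal ability distribution with equally spaced quadrature points, in the spirit of the D method; the item parameters and cut-score are hypothetical:

```python
import math

def lord_wingersky(p):
    """Conditional summed-score distribution for dichotomous items,
    where p[k] = Pr(item k correct | theta); returns Pr(X = x | theta)
    for x = 0, ..., n."""
    dist = [1.0]
    for pk in p:
        new = [0.0] * (len(dist) + 1)
        for x, q in enumerate(dist):
            new[x] += q * (1.0 - pk)   # item answered incorrectly
            new[x + 1] += q * pk       # item answered correctly
        dist = new
    return dist

def category_probs(dist, cuts):
    """Equation 2.13: conditional category probabilities from the
    summed-score distribution and cut-scores x_1 < ... < x_{K-1}."""
    bounds = [0] + list(cuts) + [len(dist)]
    return [sum(dist[bounds[h]:bounds[h + 1]])
            for h in range(len(bounds) - 1)]

def two_pl(a, b, theta):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical 2PL parameters (a, b) for a short test; mastery cut at 3/5.
items = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0), (1.5, 0.5), (1.1, 1.0)]
cuts = [3]

# Crude equally spaced quadrature over a standard normal g(theta).
nodes = [x / 10.0 for x in range(-40, 41)]
weights = [math.exp(-t * t / 2.0) for t in nodes]
total = sum(weights)
weights = [w / total for w in weights]

# Equations 2.14 and 2.15: conditional phi_theta averaged over g(theta).
phi = 0.0
for theta, w in zip(nodes, weights):
    dist = lord_wingersky([two_pl(a, b, theta) for a, b in items])
    p_cat = category_probs(dist, cuts)
    phi += w * sum(p * p for p in p_cat)
print(round(phi, 3))
```

Replacing the quadrature loop with a sum over individual θ estimates gives the P method instead.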
Note that in this context, the coefficient κ can be expressed as κ = (φ − φ_c) / (1 − φ_c). In addition, the chance probability φ_c can be expressed as φ_c = Σ_{h=1}^{K} [p(h)]², where p(h) is the marginal category probability obtained by integrating p_θ(h) over θ. Decision accuracy describes the degree to which actual classifications using observed cut-scores agree with true classifications based on known true cut-scores. However, true scores, true ability levels, and true classifications are not known, and statistical techniques and assumptions need to be employed in order for an informed estimate to be made. First, suppose that the expected summed-score of the test taker is their true score τ. Next, suppose a set of true cut-scores on the summed-score metric, τ_1, τ_2, …, τ_{K−1}, determines the true categorization of each test taker with θ or τ. Also, assume that the conditional probabilities p_θ(h) from Equation 2.13 are known. Finally, the true categorical status, η (= 1, 2, …, K), can be determined by comparing the expected summed-score for θ with the true cut-scores. Following from the above, the conditional classification accuracy index is:

γ_θ = p_θ(η),  (2.16)

where η is the true category for a test taker with ability θ. Next, by integrating over all ability levels, the marginal classification accuracy index is:

γ = ∫ γ_θ g(θ) dθ.  (2.17)

Classification accuracy can also be framed in terms of false positive and false negative error rates. A conditional false positive error rate is the probability that a test taker of a given ability is classified into a category higher than the test taker's true category. This can be expressed as:

γ_θ^+ = Σ_{h=η+1}^{K} p_θ(h),  (2.18)

where η is the accurate decision category. In contrast, the conditional false negative error rate, or the probability that a test taker of a given ability is classified into a category lower than the test taker's true category, is given by:

γ_θ^− = Σ_{h=1}^{η−1} p_θ(h).  (2.19)
Correspondingly, the marginal false positive and false negative error rates are given by:

γ^+ = ∫ γ_θ^+ g(θ) dθ  (2.20)

and

γ^− = ∫ γ_θ^− g(θ) dθ.  (2.21)

Item Response Theory

Item response theory (IRT) is a key measurement model in psychometrics. Lord (1980) frames item response theory as a theory of statistical estimation that uses latent characterizations of individuals as predictors of observed responses. While IRT is quite flexible and powerful, it is based on the key assumptions of unidimensionality and local independence. The unidimensionality assumption states that the observations on the items are solely a function of a single continuous latent person variable (de Ayala, 2009). For example, on a mathematics test there is assumed to be a single latent mathematics proficiency variable that underlies the test taker's performance. This assumption can be violated from a number of perspectives. First, items can contain multiple content strands, which can lead to multidimensionality. For example, the performance of a test taker on a math problem that contains a significant amount of text in the stimulus may also partially depend on their level of reading ability. Multidimensionality can also arise from a format perspective. Students may perform differentially well on, for example, multiple-choice versus constructed-response items. It should be noted, however, that no test is likely to be perfectly unidimensional. The second assumption, local independence, states that the probability of correctly responding to any particular item is independent of the performance on any other items on the test, conditional on ability. As noted in Chapter 1, Lord (1980) defines this relationship as:

P(u_i = 1 | θ) = P(u_i = 1 | θ, u_j, u_k, …),  (i ≠ j, k, …),  (2.22)

where θ denotes ability and u_i is the response to item i. This assumption can be violated in a number of ways, including through the usage of passages.
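Returning to the accuracy indices above: given the conditional category probabilities p_θ(h) and the true category η, Equations 2.16, 2.18, and 2.19 simply partition that probability vector into mass at, above, and below the true category. A minimal sketch with hypothetical probabilities:

```python
def accuracy_indices(p_cat, eta):
    """Conditional accuracy (Eq. 2.16), false positive rate (Eq. 2.18),
    and false negative rate (Eq. 2.19); p_cat holds p_theta(h) for
    h = 1, ..., K, and eta is the (1-indexed) true category."""
    gamma = p_cat[eta - 1]            # mass in the true category
    false_pos = sum(p_cat[eta:])      # mass above the true category
    false_neg = sum(p_cat[:eta - 1])  # mass below the true category
    return gamma, false_pos, false_neg

# Hypothetical conditional category probabilities for K = 3 categories,
# with the middle category as the true one.
gamma, fp, fn = accuracy_indices([0.10, 0.70, 0.20], eta=2)
print(gamma, fp, fn)  # the three rates sum to 1
```

The marginal indices in Equations 2.17, 2.20, and 2.21 then follow by averaging these conditional values over g(θ) with quadrature weights.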
More informationBlending Psychometrics with Bayesian Inference Networks: Measuring Hundreds of Latent Variables Simultaneously
Blending Psychometrics with Bayesian Inference Networks: Measuring Hundreds of Latent Variables Simultaneously Jonathan Templin Department of Educational Psychology Achievement and Assessment Institute
More informationScoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods
James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical
More informationCOMBINING SCALING AND CLASSIFICATION: A PSYCHOMETRIC MODEL FOR SCALING ABILITY AND DIAGNOSING MISCONCEPTIONS LAINE P. BRADSHAW
COMBINING SCALING AND CLASSIFICATION: A PSYCHOMETRIC MODEL FOR SCALING ABILITY AND DIAGNOSING MISCONCEPTIONS by LAINE P. BRADSHAW (Under the Direction of Jonathan Templin and Karen Samuelsen) ABSTRACT
More informationComparing DIF methods for data with dual dependency
DOI 10.1186/s40536-016-0033-3 METHODOLOGY Open Access Comparing DIF methods for data with dual dependency Ying Jin 1* and Minsoo Kang 2 *Correspondence: ying.jin@mtsu.edu 1 Department of Psychology, Middle
More informationUNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT. Debra White Moore
UNIDIMENSIONAL VERTICAL SCALING OF MIXED FORMAT TESTS IN THE PRESENCE OF ITEM FORMAT EFFECT by Debra White Moore B.M.Ed., University of North Carolina, Greensboro, 1989 M.A., University of Pittsburgh,
More informationUsing Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items
University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations May 215 Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items Tamara Beth
More informationEstimating the Validity of a
Estimating the Validity of a Multiple-Choice Test Item Having k Correct Alternatives Rand R. Wilcox University of Southern California and University of Califarnia, Los Angeles In various situations, a
More informationDifferential Item Functioning
Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item
More informationParameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods
Journal of Modern Applied Statistical Methods Volume 11 Issue 1 Article 14 5-1-2012 Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian
More informationRunning head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note
Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,
More informationCopyright. Kelly Diane Brune
Copyright by Kelly Diane Brune 2011 The Dissertation Committee for Kelly Diane Brune Certifies that this is the approved version of the following dissertation: An Evaluation of Item Difficulty and Person
More informationAn exploration of decision consistency indices for one form tests
Retrospective Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 1983 An exploration of decision consistency indices for one form tests Randi Louise Hagen Iowa State University
More informationAnalyzing data from educational surveys: a comparison of HLM and Multilevel IRT. Amin Mousavi
Analyzing data from educational surveys: a comparison of HLM and Multilevel IRT Amin Mousavi Centre for Research in Applied Measurement and Evaluation University of Alberta Paper Presented at the 2013
More informationThe effects of ordinal data on coefficient alpha
James Madison University JMU Scholarly Commons Masters Theses The Graduate School Spring 2015 The effects of ordinal data on coefficient alpha Kathryn E. Pinder James Madison University Follow this and
More informationA structural equation modeling approach for examining position effects in large scale assessments
DOI 10.1186/s40536-017-0042-x METHODOLOGY Open Access A structural equation modeling approach for examining position effects in large scale assessments Okan Bulut *, Qi Quo and Mark J. Gierl *Correspondence:
More informationFactors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model
Journal of Educational Measurement Summer 2010, Vol. 47, No. 2, pp. 227 249 Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Jimmy de la Torre and Yuan Hong
More informationLinking Errors in Trend Estimation in Large-Scale Surveys: A Case Study
Research Report Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Xueli Xu Matthias von Davier April 2010 ETS RR-10-10 Listening. Learning. Leading. Linking Errors in Trend Estimation
More informationaccuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian
Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation
More informationNonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia
Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla
More information11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES
Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are
More informationPsychological testing
Psychological testing Lecture 12 Mikołaj Winiewski, PhD Test Construction Strategies Content validation Empirical Criterion Factor Analysis Mixed approach (all of the above) Content Validation Defining
More informationREMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT. Qi Chen
REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT By Qi Chen A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and
More informationResearch and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida
Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality
More informationTHE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN. Moatasim A. Barri
THE IMPACT OF ANCHOR ITEM EXPOSURE ON MEAN/SIGMA LINKING AND IRT TRUE SCORE EQUATING UNDER THE NEAT DESIGN By Moatasim A. Barri B.S., King Abdul Aziz University M.S.Ed., The University of Kansas Ph.D.,
More informationExamining Factors Affecting Language Performance: A Comparison of Three Measurement Approaches
Pertanika J. Soc. Sci. & Hum. 21 (3): 1149-1162 (2013) SOCIAL SCIENCES & HUMANITIES Journal homepage: http://www.pertanika.upm.edu.my/ Examining Factors Affecting Language Performance: A Comparison of
More informationAn Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests
University of Massachusetts - Amherst ScholarWorks@UMass Amherst Dissertations 2-2012 An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests Wai Yan
More informationYou must answer question 1.
Research Methods and Statistics Specialty Area Exam October 28, 2015 Part I: Statistics Committee: Richard Williams (Chair), Elizabeth McClintock, Sarah Mustillo You must answer question 1. 1. Suppose
More informationThe Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times.
The Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times By Suk Keun Im Submitted to the graduate degree program in Department of Educational
More informationBuilding Evaluation Scales for NLP using Item Response Theory
Building Evaluation Scales for NLP using Item Response Theory John Lalor CICS, UMass Amherst Joint work with Hao Wu (BC) and Hong Yu (UMMS) Motivation Evaluation metrics for NLP have been mostly unchanged
More informationA Comparison of Several Goodness-of-Fit Statistics
A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures
More informationTHE NATURE OF OBJECTIVITY WITH THE RASCH MODEL
JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that
More informationOn the purpose of testing:
Why Evaluation & Assessment is Important Feedback to students Feedback to teachers Information to parents Information for selection and certification Information for accountability Incentives to increase
More informationA COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL
International Journal of Innovative Management, Information & Production ISME Internationalc2010 ISSN 2185-5439 Volume 1, Number 1, December 2010 PP. 81-89 A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM
More informationA Comparison of Four Test Equating Methods
A Comparison of Four Test Equating Methods Report Prepared for the Education Quality and Accountability Office (EQAO) by Xiao Pang, Ph.D. Psychometrician, EQAO Ebby Madera, Ph.D. Psychometrician, EQAO
More informationItem Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century
International Journal of Scientific Research in Education, SEPTEMBER 2018, Vol. 11(3B), 627-635. Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century
More informationComprehensive Statistical Analysis of a Mathematics Placement Test
Comprehensive Statistical Analysis of a Mathematics Placement Test Robert J. Hall Department of Educational Psychology Texas A&M University, USA (bobhall@tamu.edu) Eunju Jung Department of Educational
More informationLUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp.
LUO, XIAO, Ph.D. The Optimal Design of the Dual-purpose Test. (2013) Directed by Dr. Richard M. Luecht. 155 pp. Traditional test development focused on one purpose of the test, either ranking test-takers
More informationAssessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures. Dubravka Svetina
Assessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures by Dubravka Svetina A Dissertation Presented in Partial Fulfillment of the Requirements for
More informationHaving your cake and eating it too: multiple dimensions and a composite
Having your cake and eating it too: multiple dimensions and a composite Perman Gochyyev and Mark Wilson UC Berkeley BEAR Seminar October, 2018 outline Motivating example Different modeling approaches Composite
More informationChapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.
Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human
More informationAn Investigation of vertical scaling with item response theory using a multistage testing framework
University of Iowa Iowa Research Online Theses and Dissertations 2008 An Investigation of vertical scaling with item response theory using a multistage testing framework Jonathan James Beard University
More informationSelection of Linking Items
Selection of Linking Items Subset of items that maximally reflect the scale information function Denote the scale information as Linear programming solver (in R, lp_solve 5.5) min(y) Subject to θ, θs,
More informationStatistical Methods and Reasoning for the Clinical Sciences
Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries
More informationItem Response Theory: Methods for the Analysis of Discrete Survey Response Data
Item Response Theory: Methods for the Analysis of Discrete Survey Response Data ICPSR Summer Workshop at the University of Michigan June 29, 2015 July 3, 2015 Presented by: Dr. Jonathan Templin Department
More informationEvaluating the quality of analytic ratings with Mokken scaling
Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch
More informationMantel-Haenszel Procedures for Detecting Differential Item Functioning
A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of
More information