REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT

By

Qi Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Measurement and Quantitative Methods - Doctor of Philosophy

2013

ABSTRACT

REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT

By Qi Chen

IRT-based procedures using common items are widely used in test score linking. A critical assumption of these linking methods is the invariance property of IRT: item parameters remain the same on different testing occasions when they are reported on the same θ-scale. In practice, however, there are occasions when an item parameter drifts from its original value. This study investigated the impact of keeping or removing linking items that were showing item parameter drift. Simulated data were generated under a modified three-parameter logistic model with a common item non-equivalent group linking design. The factors manipulated were the percentage of drifting items, the type of drift, the magnitude of drift, group achievement differences and the choice of linking method. The effect of item drift was studied by examining the mean difference between true θs and θ-estimates, and the accuracy of the classification of examinees into proper performance categories. Results indicated that the characteristics of the drift had little impact on the performance of the Fixed Common Item Parameter (FCIP) method or Stocking & Lord's Test Characteristic Curve (TCC-ST) method, and that their influence on the performance of the Concurrent method varied depending on whether the drifting items were removed from the linking. In addition, better estimation was achieved when the drifting items were removed from the linking under the Concurrent method.

ACKNOWLEDGEMENTS

There are many people who have been supportive and helpful along the road to the completion of my dissertation. I am grateful for their assistance, inspiration and encouragement. First of all, my deepest gratitude goes to Dr. Mark Reckase, my academic advisor and chairperson of my dissertation committee, for his scholarly guidance and constant support. He has shown substantial support, discussing the project with me, reviewing the research drafts and providing critical feedback. I have benefited greatly from his valuable insights and assistance during the writing process and throughout the whole journey of my Ph.D. study. I would like to express my sincere appreciation to the other members of my dissertation committee, Dr. Tenko Raykov, Dr. Edward Roeber, and Dr. Ann Marie Ryan, for their excellent insights, suggestions, and assistance. They made time in their busy schedules to attend the meetings and provided me with their deep insights, constructive feedback and thoughtful suggestions. My appreciation also extends to Dr. Sharif Shakrani and Dr. Cassandra Guarino for their valuable comments and insightful suggestions on my dissertation proposal. Their valuable comments on the design and analyses of the dissertation study are highly appreciated. I would like to express my thanks to Dr. Michael Kozlow and Dr. Xiao Pang for sharing their insights and allowing me access to the data. I feel blessed with the opportunities to work with them. Their assistance and enthusiasm motivated me to pursue this research topic.

I would also like to express my gratitude to Dr. Yong Zhao for providing me with assistantship opportunities during my graduate study. I appreciate the opportunities he has given me to participate in many research projects, which have helped my development as a professional researcher. I appreciate the great support of my friends Brad and Tinker for editing and proofreading the proposal and the drafts. I am deeply grateful and indebted to my family for their love and support. I could not have completed the journey without the encouragement and patience of my dear husband Haonan, the inspiration and love of my wonderful daughter Karen and the support of my loving parents.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter I: Introduction
    Research Background
Chapter II: Background and Objective of Study
    Common-Item Linking Design
    Linking Methods
    Item Parameter Drift
Chapter III: Methods and Research Design
    Data Generation
        Generating Parameters
        Simulation of Item Parameter Drift
        Group Achievement Differences
    Calibration and Linking Procedures
    Handling of Drifting Items
    Evaluation Criteria
Chapter IV: Results and Discussion
    Drift on Discriminating Parameter a
        Correlation between θ Estimate and True θ
        Accuracy of θ Estimates
            Bias and RMSE in Four a-drift Situations
            Effect of Percentage of Items Showing a-parameter Drift
            Effect of the Direction of a-parameter Drift
            Effect of the Linking Method
            Effect of Group Difference
            Effect of Drifted Items Handling
            Effect of Drifted Items Handling at Different θ Levels
        Accuracy of Performance Level Classification
    Drift on Difficulty Parameter b
        Correlation between θ Estimate and True θ
        Accuracy of θ Estimates
            Bias and RMSE in Eight b-drift Situations
            Effect of Percentage of Items Showing b-parameter Drift
            Effect of the Direction of b-parameter Drift
            Effect of the Magnitude of b-parameter Drift
            Effect of the Linking Method
            Effect of Group Difference
            Effect of Drifted Items Handling
            Effect of Drifted Items Handling at Different θ Levels
        Accuracy of Performance Level Classification
Chapter V: Conclusions, Implications and Future Research
    Conclusions
    Implications
    Limitations and Future Directions
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 Descriptive Statistics of the Item Parameters
Table 3.2 Summary of Simulated Conditions
Table 4a.1 Average Correlation Coefficients between θ Estimates and True θs when a-parameter Drifting (with SDs in Parentheses)
Table 4a.2 Bias for θ Estimates when a-parameter Drifting
Table 4a.3 RMSE for θ Estimates when a-parameter Drifting
Table 4a.4 Change in bias and RMSE as More Items Drifting in a-parameter
Table 4a.5 Change in bias and RMSE with a-parameter Drifting in Different Directions
Table 4a.6 Percentage in Each Performance Level Classification (N=3, Drift=a+0.4)
Table 4a.7 Percentage in Each Performance Level Classification (N=3, Drift=a-0.4)
Table 4a.8 Percentage in Each Performance Level Classification (N=8, Drift=a+0.4)
Table 4a.9 Percentage in Each Performance Level Classification (N=8, Drift=a-0.4)
Table 4b.1 Average Correlation Coefficients between θ Estimates and True θs when b-parameter Drifting (with SDs in Parentheses)
Table 4b.2 Bias for θ Estimates when b-parameter Drifting
Table 4b.3 RMSE for θ Estimates when b-parameter Drifting
Table 4b.4 Changes in bias and RMSE with More Items Drifting in b-parameter (3 items to 8 items)
Table 4b.5 Changes in bias and RMSE with b-parameter Drifting in Different Directions (Positive Drift to Negative Drift)
Table 4b.6 Changes in bias and RMSE as the Size of b-parameter Drift Increases
Table 4b.7 Percentage in Each Performance Level Classification (N=3, Drift=b+0.2)
Table 4b.8 Percentage in Each Performance Level Classification (N=3, Drift=b-0.2)
Table 4b.9 Percentage in Each Performance Level Classification (N=3, Drift=b+0.4)
Table 4b.10 Percentage in Each Performance Level Classification (N=3, Drift=b-0.4)
Table 4b.11 Percentage in Each Performance Level Classification (N=8, Drift=b+0.2)
Table 4b.12 Percentage in Each Performance Level Classification (N=8, Drift=b-0.2)
Table 4b.13 Percentage in Each Performance Level Classification (N=8, Drift=b+0.4)
Table 4b.14 Percentage in Each Performance Level Classification (N=8, Drift=b-0.4)
Table 6 Population Item Parameters Used for Simulations

LIST OF FIGURES

Figure 3.1 Design of Dataset Simulation
Figure 4a.0.1 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; All Items Included)
Figure 4a.0.2 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; All Items Included)
Figure 4a.0.3 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; All Items Included)
Figure 4a.0.4 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; Drifted Items Dropped)
Figure 4a.0.5 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; Drifted Items Dropped)
Figure 4a.0.6 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; Drifted Items Dropped)
Figure 4a.1.1 Effect of Group Difference (Concurrent; All Items Included)
Figure 4a.1.2 Effect of Group Difference (FCIP; All Items Included)
Figure 4a.1.3 Effect of Group Difference (TCC-ST; All Items Included)
Figure 4a.1.4 Effect of Group Difference (Concurrent; Drifted Items Dropped)
Figure 4a.1.5 Effect of Group Difference (FCIP; Drifted Items Dropped)
Figure 4a.1.6 Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring)
Figure 4a.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped)
Figure 4a.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2)
Figure 4a.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2)
Figure 4a.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2)
Figure 4a.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2)
Figure 4a.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2)
Figure 4a.2.6 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2)
Figure 4a.2.7 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2)
Figure 4a.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2)
Figure 4a.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2)
Figure 4a.3.1 Mean bias at θ Intervals (3 Items a-drift +0.4)
Figure 4a.3.2 Mean bias at θ Intervals (3 Items a-drift -0.4)
Figure 4a.3.3 Mean bias at θ Intervals (8 Items a-drift +0.4)
Figure 4a.3.4 Mean bias at θ Intervals (8 Items a-drift -0.4)
Figure 4b.0.1 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; All Items Included)
Figure 4b.0.2 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; Drifted Items Dropped)
Figure 4b.0.3 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; All Items Included)
Figure 4b.0.4 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; Drifted Items Dropped)
Figure 4b.0.5 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; All Items Included)
Figure 4b.0.6 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; Drifted Items Dropped)
Figure 4b.1.1 Effect of Group Difference (Concurrent; All Items Included)
Figure 4b.1.2 Effect of Group Difference (Concurrent; Drifted Items Dropped)
Figure 4b.1.3 Effect of Group Difference (FCIP; All Items Included)
Figure 4b.1.4 Effect of Group Difference (FCIP; Drifted Items Dropped)
Figure 4b.1.5 Effect of Group Difference (TCC-ST; All Items Included)
Figure 4b.1.6 Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring)
Figure 4b.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped)
Figure 4b.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2)
Figure 4b.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2)
Figure 4b.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2)
Figure 4b.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2)
Figure 4b.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2)
Figure 4b.2.6 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2)
Figure 4b.2.7 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2)
Figure 4b.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2)
Figure 4b.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2)
Figure 4b.3.1 Mean bias at θ Intervals (3 Items b-drift +0.2)
Figure 4b.3.2 Mean bias at θ Intervals (3 Items b-drift -0.2)
Figure 4b.3.3 Mean bias at θ Intervals (3 Items b-drift +0.4)
Figure 4b.3.4 Mean bias at θ Intervals (3 Items b-drift -0.4)
Figure 4b.3.5 Mean bias at θ Intervals (8 Items b-drift +0.2)
Figure 4b.3.6 Mean bias at θ Intervals (8 Items b-drift -0.2)
Figure 4b.3.7 Mean bias at θ Intervals (8 Items b-drift +0.4)
Figure 4b.3.8 Mean bias at θ Intervals (8 Items b-drift -0.4)

Chapter I: Introduction

1.1 Research Background

Scores from large scale assessments are commonly used as indicators of student performance. Educational policy makers, administrators and educators hope to compare how students are doing from year to year. However, because the test is administered each year, different test forms need to be used to ensure test security. Practically, no test developer can guarantee the equivalency of different test forms, despite vigorous efforts to ensure that equivalency. Hence, to achieve comparability, practitioners need to link the test scores. One way to achieve comparability is through a common item linking design. There are various approaches to common item linking -- some based on Item Response Theory (IRT) models. In Item Response Theory, item parameters are estimated and are assumed to be invariant under linear transformation (Lord, 1980). Several methods have been developed to place the item parameters on a common metric, including linear transformation of separate calibrations, fixed common item parameter (FCIP) calibration and concurrent calibration. The invariance property of item response theory assumes that the item characteristic curves for a test item should be the same when estimated from data from two different populations. The linear relationship between θ-estimates and item parameter estimates indicates that a difference in the scaled scores is the result of a difference in θs across groups or over time, since the parameters would remain unchanged if they were on the same scale. However, in practice, this assumption of invariance does not always hold.

When an item in the same test functions differently for different subgroups with the same degree of proficiency, it is called differential item functioning (DIF; Holland & Wainer, 1993). When the statistical properties of the same items change on different testing occasions, it is called item parameter drift (IPD; Goldstein, 1983; Bock, Muraki, & Pfeiffenberger, 1988). Item parameter drift has the potential to negatively affect the validity of the score scale conversion. Although there has been some research on item parameter drift, it is not as extensive as research on DIF. In practice, items flagged for IPD are often removed from the linking items in the estimation of linking coefficients (Cook & Eignor, 1991). However, since research comparing IPD detection methods has found that the effectiveness of these methods depends on the testing situation (Donoghue & Isham, 1998; DeMars, 2004), it is likely that some item parameter drift goes undetected while some items are improperly flagged as drifting. Research has found some possible sources of drift, such as change in curriculum (Goldstein, 1983; Bock, Muraki, & Pfeiffenberger, 1988), context effects (Eignor, 1985), sample statistics (Cook, Eignor, & Taft, 1988), content of items (Chan, Drasgow, & Sawin, 1999), and item over-exposure (Veerkamp & Glas, 2000). These reasons for item parameter drift may or may not be related to the construct being measured. Keeping items that are drifting due to construct-irrelevant factors is one source of linking error. However, removing items whose drift is closely related to the construct being measured creates another source of linking error (Miller & Fitzpatrick, 2009). The research on the impact of IPD on θ-estimates has produced mixed results. Some research studies found that IPD had little effect on θ-estimates (Wells, Subkoviak, & Serlin, 2002; Witt, Stahl, & Bergstrom, 2003; Rupp & Zumbo, 2003).

On the other hand, other research has found that IPD could compound over multiple testing occasions and that the choice of linking model could have a large effect on the θ-estimates (Wollack, Sung & Kang, 2005, 2006). In most of the research on the effect of IPD, items exhibiting IPD were removed from the linking set of items and the test characteristic curve (TCC) method was often chosen as the linking method (Stocking & Lord, 1983). However, drifted items may remain among the linking items because of the ineffective detection of IPD. When the drift is related to the construct being measured, the items should not be removed. This study intends to compare the effects of IPD on θ-estimates when the drifted items are either kept in or removed from the linking set. As well, the interaction between the handling of the drifted items and the linking methods will be examined. The linking methods used in this study are Stocking and Lord's test characteristic curve method (Stocking & Lord, 1983), fixed common item parameter (FCIP) calibration, and concurrent calibration (Hambleton, Swaminathan & Rogers, 1991; Kolen & Brennan, 1995).

Chapter II: Background and Objective of Study

2.1 Common-Item Linking Design

Essentially, the process of linking is to transform θ-estimates from one test onto the equivalent trait scale of another test (Holland & Dorans, 2006). In many large-scale testing programs, the common item non-equivalent group linking design is widely used. In this design, two or more test forms are created with a set of items in common, and these test forms are used in different test administrations (Kolen & Brennan, 1995). The item parameters obtained from the different test forms are placed on the same scale by using the common items as linking items.

2.2 Linking Methods

Item parameters obtained from different test forms need to be aligned on the same scale. There are a variety of approaches to achieve this purpose: linear procedures based on θ-estimates, fixed common item parameters, the mean-mean method, the mean-sigma method, the test characteristic curve method, and concurrent calibration (Kolen & Brennan, 1995; Yen & Fitzpatrick, 2006; Holland & Dorans, 2006). Because of the importance of choosing an appropriate linking method to ensure the accuracy of the linking results, there has been research comparing the merits of different linking methods. Some of these studies have investigated the strengths and weaknesses of IRT-model-based linking methods (Baker & Al-Karni, 1991; Kim & Cohen, 1998; Hanson & Beguin, 1999, 2002; Li, Griffith & Tam, 1997; Jodoin, Keller & Swaminathan, 2003; Li, Tam & Tompkins, 2004). One kind of linking method is to transform parameter estimates obtained from two separate calibrations onto a common scale through a linear scale transformation.

The Stocking & Lord (1983) characteristic curve method was found to yield more stable results than the moment methods for data sets that are typically troublesome to calibrate (Baker, 1991). The better performance of the Stocking-Lord method over the moment methods has been documented in the literature (Hanson & Beguin, 2002). Another commonly used linking method is the fixed common item parameter method. The pre-calibrated item parameter estimates for the common items are fixed while the non-common items are calibrated, and the item parameter estimates for the non-common items are thereby placed on the same scale as the fixed parameters. This method does not require the computation of scale transformation coefficients. Li, Griffith, and Tam (1997) found that both the FCIP linking method and the characteristic curve linking method provided stable θ estimates, except for students with low θ values, and that the item parameter estimates calibrated with these two methods were consistent except for the estimation of the guessing parameter under the characteristic curve method. Concurrent calibration is also a widely used linking method. Parameters for items from multiple test forms are estimated in a single calibration run. The simulation study by Kim and Cohen (1998) found that separate and concurrent calibration provided similar results when the number of common items was large, but that separate calibration provided more accurate results when the number of common items was small. Hanson and Beguin (2002) found that concurrent calibration generally yielded more accurate results, although the results were not sufficient to support a total preference for concurrent estimation. Although there has been research comparing these IRT linking methods, there has not been sufficient evidence as to which one is the best. Each method has its own merit.

Keller, Jodoin, Rogers and Swaminathan (2003) compared linear, FCIP and concurrent linking procedures in detecting academic growth and found that the type of linking method used resulted in differences in mean growth and classification. Lee and Ban (2007) compared concurrent calibration, the Stocking-Lord method and the fixed item parameter linking procedures under the random group linking design. They found that the relative performance of the different linking procedures varies with the measurement conditions. So no conclusion can be drawn about one preferred procedure for all occasions.

2.3 Item Parameter Drift

In Item Response Theory, the IRT estimates are assumed to be invariant up to a linear transformation (Lord, 1980). For example, under a 3PL model, the probability of a correct response to the ith item is given by

P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-1.7 a_i (\theta - b_i)}}    (2.1)

where a_i is the item discrimination, b_i is the item difficulty and c_i is the pseudo-guessing parameter for item i. A linear transformation of the parameters will produce the same probability of a correct response. For example, let

\theta_{Jk} = A\theta_{Ik} + B,    (2.2)
a_{Ji} = a_{Ii} / A,    (2.3)
b_{Ji} = Ab_{Ii} + B,    (2.4)
c_{Ji} = c_{Ii}    (2.5)

where A and B are constants, \theta_{Jk} and \theta_{Ik} are the values of θ for individual k on Scale J and Scale I, a_{Ji}, b_{Ji} and c_{Ji} are the item parameters for item i on Scale J, and a_{Ii}, b_{Ii} and c_{Ii} are the item parameters for item i on Scale I. The c-parameters do not change with the linear transformation of scale. The probability of correctly answering item i for an examinee with \theta_{Jk} (equation 2.1) is

c_{Ji} + \frac{1 - c_{Ji}}{1 + e^{-1.7 a_{Ji} (\theta_{Jk} - b_{Ji})}},

which equals (with the expressions from equations (2.2)-(2.5))

c_{Ii} + \frac{1 - c_{Ii}}{1 + e^{-1.7 (a_{Ii}/A)\left[(A\theta_{Ik} + B) - (Ab_{Ii} + B)\right]}} = c_{Ii} + \frac{1 - c_{Ii}}{1 + e^{-1.7 a_{Ii} (\theta_{Ik} - b_{Ii})}},

which is exactly the probability of correctly answering item i for the same examinee with \theta_{Ik} on the alternative scale (Hambleton & Swaminathan, 1985; Kolen & Brennan, 1995).
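As a concrete illustration (not part of the original derivation), the following minimal Python sketch evaluates equation (2.1) and verifies numerically that the transformation in equations (2.2)-(2.5) leaves the response probability unchanged; the parameter values are arbitrary examples.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model, equation (2.1)."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Arbitrary example values for one item and one examinee.
theta, a, b, c = 0.8, 1.1, 0.3, 0.2
A, B = 1.2, -0.5   # constants of the linear scale transformation

p_scale_I = p_3pl(theta, a, b, c)
# Transformed person and item parameters, equations (2.2)-(2.5).
p_scale_J = p_3pl(A * theta + B, a / A, A * b + B, c)

assert np.isclose(p_scale_I, p_scale_J)   # identical probability on either scale
```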

However, in practice, item parameters do not always remain unchanged. When an item performs differently for examinees of comparable proficiency, it is defined as differential item functioning (DIF; Holland & Wainer, 1993). Goldstein (1983) developed a general framework for the change of item characteristics or parameter values over time. Research has found a number of possible sources of item parameter drift. Goldstein (1983) suggested some possible reasons, such as changing curriculum content, different social demands for knowledge and skills, and so on. Bock, Muraki and Pfeiffenberger (1988) analyzed item response data from 10 years of administrations of the College Board Physics Achievement Test. They found differential drift occurred with the change in curricular emphasis. When teachers began to focus more on basic topics of mechanics rather than advanced topics, the difficulty of the mechanics questions increased. As the English units of measurement were phased out of the physics curriculum, the slopes for items using English units and metric units were changing in different directions. Chan, Drasgow and Sawin (1999) observed item parameter drift on the Armed Services Vocational Aptitude Battery over a 16-year period and found that the drift was related to changing demands for knowledge. They found that tests with more semantic/knowledge content seemed to have higher rates of item drift. Eignor (1985) found that for reading tests, item drift could be explained by the location of reading passages and the position of items in the test. Sykes and Fitzpatrick (1992) tried to explain the drift in item difficulty across consecutive administrations of a professional licensure examination and found that the change in the difficulty parameter was related neither to changes in the booklet or test position of the items, nor to the item type. Cook, Eignor and Taft (1988) investigated curriculum-related achievement tests given in spring or fall and concluded that tests taken at different times during the school year might have measured different attributes. Veerkamp and Glas (2002) investigated item drift in adaptive testing due to previous exposure of the item. Giordano, Subhiyah, and Hess (2005) analyzed item exposure on take-home examinations in medicine and its influence on the difficulty of the exam. There has been research on how to detect item parameter drift.

21 Kim, Cohen, & Park s (1995) X 2 test for multiple-group differential DIF; analysis of covariance models (Sykes & Fitzpatrick, 1992); restricted item response models (Stone & Lane, 1991); the cumulative sum (CUSUM) chart a statistical quality control used in production processes (Veerkamp & Glas, 2002); and the procedure in BILOG-MG for estimating linear trends in item difficulty. In their study comparing the procedures for detecting IPD, Donoghue and Isham (1998) found that Lord s measure was the most effective in detecting drift on the condition that the item s guessing parameter was controlled to be equal across calibrations. Their findings suggested that the effective functioning of the detection method depended on the specific testing situation. In a study by DeMars (2004), the linear drift procedure in BILOG-MG and the modified-kpc were found effective in identifying drift, similar to those represented in this study, but these procedures falsely identified non-drift items. There has been research on the effect of item parameter drift on the estimation of an examinee s θ, but the research is not extensive and the conclusions are not conclusive. Wells, Subkoviak, and Serlin (2002) simulated item response data using the two-parameter logistic model for two testing occasions. The factors they manipulated in the study included the percentage of items that exhibited IPD, types of drift, sample size and test length. Drift was simulated by increasing the difficulty parameters by 0.4 and the discriminating parameters by 0.5. Their results suggested that item parameter drift as simulated in their study had a small effect on θ-estimates. The study also illustrated the robustness of the 2PL model despite the violation of the invariance property. Witt, Stahl, and Bergstrom (2003) investigated the effects of IPD on the 9

Witt, Stahl, and Bergstrom (2003) investigated the effects of IPD on the stability of test-taker θ-estimates and pass/fail status under the Rasch model. The researchers used a real, non-normal distribution of examinee θ values. Six levels of shift in the difficulty parameter were simulated. The results of the study illustrated the robustness of the Rasch model in spite of item drift, even when the true θs were not normally distributed. The θ-estimation was stable under moderate drift in item difficulties. Similarly, Rupp and Zumbo (2003) concluded from their study that IRT θ-estimates were relatively robust with moderate amounts of item parameter drift. In Wollack, Sung and Kang's (2005) study of longitudinal item parameter drift, over seven years' worth of test forms from a German placement test were linked. The results showed that the choice of linking/IPD model could have a large impact on the resulting θ-estimates and passing rates. The simulation and real data studies by Wollack, Sung and Kang (2006) further supported this conclusion. They found that direct linking of each new form to the base form was slightly better than indirect linking. Models with TCC linking were compared with models that used the fixed parameter linking method, and the TCC linking process was found to perform better. The inconsistency in the findings about the effect of item parameter drift indicates that further studies need to be conducted to explore the effect of IPD from the perspective of its interaction with factors such as linking procedures and the treatment of the drifting items. First, for the common-item linking design, no conclusion has been reached as to what is the most effective linking procedure. Therefore, it might provide some interesting insights if the comparison of linking procedures could be combined with the study of the effect of IPD.

Second, there has been very limited research on how to handle the drifting items. In practice, items flagged for IPD are often removed from the set of items linking two or more test forms (Cook & Eignor, 1991). However, that is not always a proper way of treating item parameter drift. Before removing the drifting items from the linking set, the nature of the drift should be examined to see whether it is related to the construct being measured. Some possible sources of item parameter drift can be irrelevant to the construct being measured, e.g. drift due to over-exposure of an item, or a change in an item parameter because of a change of item position. If item drift occurs as a result of construct-irrelevant factors, then keeping these drifting items in the linking set is an incorrect way of handling IPD, resulting in linking errors. However, if the item parameter drift cannot be explained by construct-irrelevant factors, it is likely that the drift is related to the construct being measured. In this case, dropping the drifting items from the linking set becomes another source of linking error (Miller & Fitzpatrick, 2009). Moreover, in a real testing situation, the items treated as drifting are only those flagged by one or more methods of detecting IPD, so there can also be linking errors from the false detection of IPD. The objective of this study is to investigate the effects of item parameter drift on θ-estimates when the items exhibiting drift are treated in different ways: either kept in or removed from the set of linking items. The study also explores, in the presence of item parameter drift, the performance of three commonly used linking procedures: the Stocking and Lord TCC method, the fixed common item parameter method and concurrent calibration.

Chapter III: Methods and Research Design

3.1 Data Generation

In the study, data were generated to simulate a large scale assessment of mathematics. To focus on the effect of item parameter drift under the 3PL model, only multiple-choice items were considered in this study. Item response data were generated to simulate two test administrations one year apart. The test form included 30 operational items and 30 field test items. The items that appeared as field test items in one testing year became operational items in the following testing year, and they served as the common items linking the two testing occasions one year apart. Item responses were simulated for 3000 examinees taking the test each year. The study tried to model a real testing situation in which the test form consisted of operational items and field test items. The operational items in one testing year were field test items in the previous year, so the number of operational items was essentially the same as the number of common items. However, in the real situation, a matrix-sample design was used for field testing items. The field test items were divided into subsets and placed into several booklets; each examinee worked on all the operational items plus a subset of field test items. In this study, to minimize sampling errors, the matrix sampling design was not simulated. Instead, item responses were generated for all examinees answering all the operational and field test items. Figure 3.1 shows the design of the simulated datasets.

Figure 3.1 Design of Dataset Simulation

                   30 Unique Items                               30 Common Items
Year One Group     Operational Year One                          Field test Year One
Year Two Group     (considered missing responses in linking)     Operational Year Two
(Field test items for Year Two were not simulated.)

3.11 Generating Parameters

A set of 60 item parameter estimates from the 2006/2007 Canadian provincial mathematics assessment was used as true parameters for generating baseline data. A modified three-parameter logistic model was used in estimating the item parameters, with the guessing parameter fixed at 0.2. Modifications were made to randomly selected a- or b-parameters to reflect item parameter drift, while the c-parameter was set at 0.2. Table 3.1 describes the distribution (including the mean difficulty and mean slope) of the true item parameters.

Table 3.1 Descriptive Statistics of the Item Parameters (columns: Parameter, N, Minimum, Maximum, Mean, Std. Deviation; rows for the a-, b-, and c-parameters)
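As an illustrative sketch (not the code used in the study), baseline data generation under the modified 3PL with the guessing parameter fixed at 0.2 can be written as follows. The arrays a_true and b_true are hypothetical stand-ins for the 60 generating parameters, and the θ distribution is standard normal here; the group-specific means are described in the following sections.

```python
import numpy as np

rng = np.random.default_rng(20130501)

def simulate_responses(n_examinees, a, b, c=0.2, theta_mean=0.0, theta_sd=1.0, D=1.7):
    """Draw theta ~ N(theta_mean, theta_sd) and simulate 0/1 responses
    under the modified 3PL model with a fixed guessing parameter."""
    theta = rng.normal(theta_mean, theta_sd, size=n_examinees)
    # Probability matrix: examinees in rows, items in columns.
    z = D * a[None, :] * (theta[:, None] - b[None, :])
    p = c + (1.0 - c) / (1.0 + np.exp(-z))
    responses = (rng.random(p.shape) < p).astype(int)
    return theta, responses

# Example (hypothetical generating parameter arrays a_true, b_true):
# theta_y1, x_y1 = simulate_responses(3000, a_true, b_true, theta_mean=0.0)
```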

3.12 Simulation of Item Parameter Drift

To ascertain whether the effect of item parameter drift on θ-estimates differed when more items were showing drift or when items drifted further away from their original values, the number of drifting items and the level of drift were manipulated. In one condition about 10% and in another condition 25% of the items were randomly selected to exhibit item parameter drift. When 10% of the items were drifting, this represented a scenario in which the number of drifting items was not large and the drift might go undetected. When 25% of the items were drifting, the drift was not likely to be ignored, and whether to keep or remove those drifting items could have a larger effect on θ-estimates. Data were also generated under a no-drift condition to serve as a baseline. Two types of drift were simulated: drift on the discrimination parameter a and drift on the difficulty parameter b. The a-parameter drift was simulated by manually increasing or decreasing the a-parameter by 0.4. In previous research, a similar magnitude of a-drift was adopted, e.g. a-drift of 0.3 (Donoghue & Isham, 1998) or 0.5 (Wells et al., 2002). Two levels of b-parameter drift were simulated by increasing or decreasing the parameter by 0.2 or 0.4. A drift of 0.2 simulated a moderate amount of item parameter drift, while a drift of 0.4 simulated a large amount of drift. The same magnitude of drift (0.4) was used in other studies of IPD, such as Wells et al. (2002), Donoghue and Isham (1998) and Wollack (2006). The changes in the p-values of the items under each amount of drift were also examined. To study the effects of a-drift and b-drift separately, the two types of drift were not mixed in any one condition. Meanwhile, the simulated drift was restricted to one direction. In practice, drift is likely to go in either direction; however, mixing positive and negative drift within a test can result in cancellation of drift and give less information on the effects of IPD. Thus, under each of the conditions studied, there was one parameter drifting, with the drift always increasing or always decreasing.
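For illustration only (not the study's code), a drift condition of this kind can be imposed on the generating parameters roughly as follows; a_true and b_true again denote hypothetical arrays of generating parameters for the common items.

```python
import numpy as np

rng = np.random.default_rng(7)

def apply_drift(a, b, n_drift, parameter="b", shift=0.4):
    """Return copies of the generating parameters in which `n_drift` randomly
    chosen items have `shift` added to the a- or b-parameter."""
    a_new, b_new = a.copy(), b.copy()
    drifted = rng.choice(len(a), size=n_drift, replace=False)
    if parameter == "a":
        a_new[drifted] += shift
    else:
        b_new[drifted] += shift
    return a_new, b_new, drifted

# Example: 10% of 30 common items (3 items) with the a-parameter drifting by +0.4.
# a_drifted, b_drifted, drifted_items = apply_drift(a_true, b_true, n_drift=3,
#                                                   parameter="a", shift=0.4)
```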

Though this unidirectional drift design was an oversimplification of how drift actually occurs in real testing, it represented a worst-case scenario in which the effect of drift was unlikely to be ignored.

3.13 Group Achievement Differences

When item parameters drift, this sometimes reflects a change in the achievement of the group taking the test, due for example to changes in curriculum or in policy. To investigate how drifting items would help identify changes in group achievement, data were simulated for both equivalent and non-equivalent groups. For examinees taking the test in YEAR1, the set of item responses was generated by sampling the latent trait (θ) from a normal independent distribution (NID) with mean 0 and standard deviation 1 (NID(0,1)). For examinees taking the test in YEAR2, three sets of item responses were generated by sampling θ from an NID(0,1) distribution, an NID(0.2,1) distribution and an NID(-0.2,1) distribution. The 0.2 shift represented a moderate increase in θ from one year to another. A similar magnitude of increase in achievement was used in other research: Donoghue & Isham (1998) used 0.1 and Wollack et al. (2006) used 0.15 as a yearly increase in student achievement. The group differences were designed so that item parameter drift could be examined in different situations: 1) when there was no remarkable policy change between the two testing years and the populations were assumed to be equivalent in achievement; and 2) when there was a noticeable policy implementation that might affect student learning and student achievement was expected to fluctuate.

Under a modified 3PL model, with the guessing parameter fixed at 0.2, item response data were generated for the above 36 conditions: percentage of drift (2) × type and level of drift (6) × group achievement difference (3).

3.2 Calibration and Linking Procedures

When different methods are used for linking tests, the effect of keeping or removing the drifting items on θ-estimates might differ. Three commonly used IRT linking methods were examined in the study: 1) concurrent calibration, 2) fixed common item parameter calibration, and 3) Stocking & Lord's test characteristic curve method.

Table 3.2 Summary of Simulated Conditions
Drift: 10% (3 items) drifting or 25% (8 items) drifting; a-parameter or b-parameter
Group Difference: Year One [NID(0,1)] and Year Two [NID(0,1)]; Year One [NID(0,1)] and Year Two [NID(0.2,1)]; Year One [NID(0,1)] and Year Two [NID(-0.2,1)]
Linking Method: Concurrent calibration (Concurrent); Fixed Common Item Parameter calibration (FCIP); Stocking & Lord's test characteristic curve method (TCC-ST)

The computer program PARSCALE 4 (Muraki & Bock, 2003) was used to calibrate the parameters. In concurrent calibration, responses from the YEAR1 test and the YEAR2 test were combined in one concurrent run to estimate the parameters. When the fixed common item parameter method was used, item parameters were first estimated from a separate calibration of the YEAR2 test. Then the item parameters for the common items were fixed while a calibration of the YEAR1 test was done. Thus, the item parameters were placed on the YEAR2 scale.

29 With Stocking & Lord s test characteristic curve method, item parameters were first estimated through two separate calibrations for YEAR1 test items and YEAR2 test items. The linking coefficients were then obtained from the common items using the test characteristic curve method of Stocking & Lord. Using the linking coefficients, item parameter estimates from the YEAR1 test were placed on the scale of the YEAR2 test. For Stocking-Lord transformation, the computer program ST (Hanson, Zeng & Cui, 2004) was used. 3.3 Handling of Drifting Items Two ways of handling the drifting items were compared: either to treat them or to ignore them. Those items whose parameters had been manually altered during data generation were considered drifting items. The drift was considered as treated when the drifting items were dropped from the linking items. The drift was considered as ignored when the drifting items were kept in the linking items, which was to simulate a scenario where item parameter drift was either undetected or drift was construct-relevant. To compare the two ways of handling drifting items, when item parameter drift was present, each of the linking methods was applied twice, once with drifting items included in the common items and later with the drifting items removed from the linking items. For the Stocking & Lord s method, it was applied a third time when the drifting items were removed from the linking items but they were included in the scoring. 3.4 Evaluation Criteria Several indices were used to evaluate the effect of the choice of treating item parameter drift and linking method on θ-estimates. Correlation between true θs and the θ-estimates was one index. Bias and the root mean square error (RMSE) were used to assess the accuracy of 17

3.3 Handling of Drifting Items

Two ways of handling the drifting items were compared: treating them or ignoring them. Those items whose parameters had been manually altered during data generation were considered drifting items. The drift was treated when the drifting items were dropped from the linking items. The drift was ignored when the drifting items were kept among the linking items, which simulated a scenario in which the item parameter drift was either undetected or construct-relevant. To compare the two ways of handling drifting items, when item parameter drift was present each of the linking methods was applied twice, once with the drifting items included in the common items and once with the drifting items removed from the linking items. The Stocking & Lord method was applied a third time, with the drifting items removed from the linking items but included in the scoring.

3.4 Evaluation Criteria

Several indices were used to evaluate the effect of the treatment of item parameter drift and the choice of linking method on θ-estimates. The correlation between the true θs and the θ-estimates was one index. Bias and the root mean square error (RMSE) were used to assess the accuracy of the θ-estimates. One benefit of a simulation study is that the bias and RMSE between the estimates and the true θs can be obtained. These indices indicate the accuracy of the θ-estimates: if the bias is negative, θ is under-estimated; otherwise it is over-estimated. The smaller the RMSE, the better the estimation method. The bias and RMSE were calculated as follows (as in Li, Tam & Tompkins, 2004):

bias(\hat{\theta}) = \frac{\sum_{i=1}^{p} (\hat{\theta}_i - \theta_i)}{p}    (3.1)

RMSE(\hat{\theta}) = \sqrt{\frac{\sum_{i=1}^{p} (\hat{\theta}_i - \theta_i)^2}{p}}    (3.2)

where \theta_i is the true θ for examinee i, \hat{\theta}_i is the corresponding estimate, and p is the total number of examinees.
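In code, these two indices amount to the following (an illustrative Python sketch; the variable names are hypothetical):

```python
import numpy as np

def bias(theta_hat, theta_true):
    """Equation (3.1): mean signed difference between estimates and true thetas."""
    return np.mean(theta_hat - theta_true)

def rmse(theta_hat, theta_true):
    """Equation (3.2): root mean square error of the theta estimates."""
    return np.sqrt(np.mean((theta_hat - theta_true) ** 2))

# Hypothetical usage with arrays of estimates and (rescaled) generating thetas:
# print(bias(theta_hat_y1, theta_true_scaled), rmse(theta_hat_y1, theta_true_scaled))
```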

In calculating RMSE and bias, the θs that were used to generate the item response data were transformed to match the scales of the estimates. With the Stocking-Lord TCC method and the fixed common item parameter method used in this study, θ-estimates for the Year One students were placed on the scale of the Year Two estimates after linking. With the concurrent calibration linking method, θ-estimates for the Year One students were placed on the scale of the combined Year One and Year Two estimates after linking. When computing RMSE and bias, the θ-values used to generate data for the Year One students were first transformed onto the Year Two scale (the combined Year One and Year Two scale for the concurrent calibration linking method), and these transformed θ-values were then used as the true θ values to be compared with the θ-estimates of the Year One students. The θ values used in generating the data were transformed to scaled θ-values through the linear transformation (Kolen, 2006):

s(x) = \frac{\sigma_s}{\sigma_x} x + \mu_s - \frac{\sigma_s}{\sigma_x} \mu_x    (3.3)

where x is the raw score (the generating θ-value), s(x) is the scale score (the scaled θ-value), \mu_x and \sigma_x are the mean and standard deviation of the raw values, and \mu_s and \sigma_s are the mean and standard deviation of the scaled values.

Another index is the percentage of examinees classified into the appropriate performance levels, especially examinees in levels meeting or above the proficient level. Many large-scale assessments report the performance level of an examinee in addition to, or instead of, an individual score. Therefore, the proportion of correct classification is one indication of the quality of the estimates. To compute this index, the θ cut score for each performance level was set following the guidelines used in the real assessment. In the 2006/2007 Canadian provincial mathematics assessment, there are four performance levels, with Level 3 being the provincial target level. Examinees in Level 3 or 4 are considered to meet or surpass the provincial target level. After θ-estimates from the previous year are placed on the same scale as the current testing year, the percentage of students in each level in the previous year is used to find the θ thresholds for the performance levels in the current year. For this simulation study, the cumulative percentages 17.5%, 57.3% and 94.5% were used to find the θ cuts for each level.
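A sketch of this classification step (illustrative only; the variable names are hypothetical) finds the θ cuts from the previous year's estimates at the stated cumulative percentages and then assigns levels 1 through 4.

```python
import numpy as np

CUMULATIVE_PCT = [17.5, 57.3, 94.5]   # percent of examinees below each cut

def level_cuts(theta_previous_year):
    """Theta cut scores reproducing the target cumulative percentages."""
    return np.percentile(theta_previous_year, CUMULATIVE_PCT)

def classify(theta, cuts):
    """Assign performance levels 1-4 given the three cut scores."""
    return np.searchsorted(cuts, theta) + 1

# Hypothetical usage:
# cuts = level_cuts(theta_hat_previous)
# true_levels = classify(theta_true_scaled, cuts)
# est_levels = classify(theta_hat_current, cuts)
# agreement = np.mean(true_levels == est_levels)   # proportion correctly classified
```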

The performance level classified according to the examinee's true θ was considered his or her true performance level, while the level classified from the estimated θ was the estimated performance level. To examine the different ways of treating drifting items, the proportions of correct classification for each level were compared, and special attention was paid to the pass/fail status at the provincial target level, as the percentage of students at or above this level is the important index of how schools are progressing.

Chapter IV: Results and Discussion

The results are presented and discussed for the different types of parameter drift simulated in this study. The conditions that were manipulated included the number of drifted items, the level of drift and the direction of drift. The effects of the item parameter drift on θ-estimates are compared when the drifted items are handled in different ways. The first part focuses on the effect of a-parameter drift on θ-estimates and the second part examines the effect of b-parameter drift. In all the tables and figures presented in the results, "Group Difference" refers to three types of group ability change: 1) examinees taking the exams in the two years were equivalent groups in ability ("Year One (0,1), Year Two (0,1)"), 2) examinees were non-equivalent in ability, with higher ability in the following year ("Year One (0,1), Year Two (0.2,1)"), and 3) examinees were non-equivalent in ability, with lower ability in the following year ("Year One (0,1), Year Two (-0.2,1)"). "Linking Method" refers to the three methods used to link the Year One and Year Two scores: 1) concurrent calibration ("Concurrent"), 2) the fixed common item parameter method ("FCIP"), and 3) Stocking & Lord's test characteristic curve method ("TCC-ST"). "Drifted Item Handling" refers to the way item parameter drift was treated: 1) keeping the drifted items in both calibration and scoring ("keep/keep"), 2) dropping the items in both calibration and scoring ("drop/drop"), and 3) dropping the items in calibration while keeping them in scoring ("drop/keep"), which was only applied with the TCC-ST method.

4.1 Drift on Discriminating Parameter a

The effects of different ways of handling the drifted items were studied in four kinds of a-parameter drift:

1) three items drifting with the a-parameter increasing by 0.4; 2) three items drifting with the a-parameter decreasing by 0.4; 3) eight items drifting with the a-parameter increasing by 0.4; and 4) eight items drifting with the a-parameter decreasing by 0.4.

4.11 Correlation between θ Estimate and True θ

Table 4a.1 lists the correlation coefficients between the θ estimates and the true θs when the a-parameter drifts. The correlations are high, ranging from 0.912 to 0.927. This indicates that the θ estimates have a very strong and positive association with the true θs. Compared with the no-drift baseline condition, where the average correlation is 0.927, the strong and positive relationship between the θ estimates and the true θs is consistent across the four conditions of a-parameter drift. The correlations tend to drop slightly when more items are showing drift and when the drifted items are dropped from the linking and scoring, but the drop is quite small. These consistently high correlations are good indications that the way the drifting items are handled has a negligible effect on the relationship between the θ estimates and the true θs, regardless of the group abilities and linking methods.

Table 4a.1 Average Correlation Coefficients between θ Estimates and True θs when a-parameter Drifting (with SDs in Parentheses)

Group Difference: Year One (0,1), Year Two (0,1)
Linking Method | Items Handling (link/score) | No drift | 3 items a-drift +0.4 | 3 items a-drift -0.4 | 8 items a-drift +0.4 | 8 items a-drift -0.4
Concurrent | keep/keep | 0.927(0.002) | 0.927(0.002) | 0.926(0.002) | 0.926(0.002) | 0.926(0.002)
Concurrent | drop/drop | - | 0.924(0.002) | 0.925(0.002) | 0.924(0.002) | 0.924(0.002)
FCIP | keep/keep | 0.927(0.002) | 0.926(0.002) | 0.925(0.002) | 0.925(0.002) | 0.923(0.003)
FCIP | drop/drop | - | 0.923(0.002) | 0.922(0.002) | 0.919(0.002) | 0.912(0.002)
TCC-ST | keep/keep | 0.927(0.002) | 0.927(0.002) | 0.927(0.002) | 0.927(0.002) | 0.927(0.002)
TCC-ST | drop/keep | - | 0.927(0.002) | 0.927(0.002) | 0.927(0.002) | 0.927(0.002)
TCC-ST | drop/drop | - | 0.922(0.005) | 0.922(0.002) | 0.919(0.002) | 0.913(0.002)

Group Difference: Year One (0,1), Year Two (0.2,1)
Linking Method | Items Handling (link/score) | No drift | 3 items a-drift +0.4 | 3 items a-drift -0.4 | 8 items a-drift +0.4 | 8 items a-drift -0.4
Concurrent | keep/keep | 0.927(0.002) | 0.927(0.002) | 0.926(0.002) | 0.927(0.002) | 0.926(0.002)
Concurrent | drop/drop | - | 0.923(0.002) | 0.925(0.002) | 0.923(0.002) | 0.923(0.002)
FCIP | keep/keep | 0.927(0.002) | 0.926(0.002) | 0.925(0.002) | 0.925(0.002) | 0.923(0.003)
FCIP | drop/drop | - | 0.923(0.002) | 0.922(0.002) | 0.919(0.002) | 0.912(0.002)
TCC-ST | keep/keep | 0.927(0.002) | 0.927(0.002) | 0.927(0.002) | 0.927(0.002) | 0.927(0.002)
TCC-ST | drop/keep | - | 0.927(0.002) | 0.927(0.002) | 0.927(0.002) | 0.927(0.002)
TCC-ST | drop/drop | - | 0.924(0.002) | 0.922(0.002) | 0.919(0.002) | 0.913(0.002)

Group Difference: Year One (0,1), Year Two (-0.2,1)
Linking Method | Items Handling (link/score) | No drift | 3 items a-drift +0.4 | 3 items a-drift -0.4 | 8 items a-drift +0.4 | 8 items a-drift -0.4
Concurrent | keep/keep | 0.927(0.002) | 0.927(0.002) | 0.926(0.002) | 0.927(0.002) | 0.926(0.002)
Concurrent | drop/drop | - | 0.924(0.002) | 0.926(0.002) | 0.924(0.002) | 0.924(0.002)
FCIP | keep/keep | 0.927(0.002) | 0.926(0.002) | 0.925(0.002) | 0.925(0.002) | 0.923(0.002)
FCIP | drop/drop | - | 0.922(0.005) | 0.922(0.003) | 0.919(0.002) | 0.912(0.002)
TCC-ST | keep/keep | 0.927(0.002) | 0.927(0.002) | 0.927(0.002) | 0.927(0.002) | 0.927(0.002)
TCC-ST | drop/keep | - | 0.927(0.002) | 0.927(0.002) | 0.927(0.002) | 0.927(0.002)
TCC-ST | drop/drop | - | 0.924(0.002) | 0.922(0.002) | 0.919(0.002) | 0.913(0.002)

4.12 Accuracy of θ Estimates

The correlation table (Table 4a.1) showed a strong relationship between the θ estimates and the true θs, and this relationship was not affected by the type of a-drift, the group difference, the linking method or the way of handling the drifted items. However, these factors may have an effect on the accuracy of θ estimation. To further examine the accuracy of θ estimation, the bias and RMSE between the θ estimates and the true θs were calculated. Tables 4a.2 and 4a.3 give the bias and RMSE values for the θ estimates when a-parameter drift is present.

4.121 Bias and RMSE in Four a-drift Situations

Four situations of a-drift were examined in the study: 3 items (10%) showing an increase of 0.4 in the a-parameter, 3 items (10%) showing a decrease of 0.4 in the a-parameter, 8 items (25%) showing the same +0.4 increase, and 8 items (25%) showing the same -0.4 decrease. When 10% of the items were showing a 0.4 increase in the a-parameter, most of the bias values were negative, indicating that θ was under-estimated, except when datasets of non-equivalent groups were linked by concurrent calibration with all items included. The under-estimation was most obvious when concurrent calibration was applied with the drifting items dropped. RMSE was relatively smaller when concurrent calibration was used with all the linking items included; however, the largest RMSE occurred when concurrent calibration was used with the drifting items dropped. When the same drift occurred to 25% of the items, the bias values were negative for most of the datasets, except when the TCC-ST method with all linking items included was applied with equivalent groups.


More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data Item Response Theory: Methods for the Analysis of Discrete Survey Response Data ICPSR Summer Workshop at the University of Michigan June 29, 2015 July 3, 2015 Presented by: Dr. Jonathan Templin Department

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

Evaluating the Impact of Construct Shift on Item Parameter Invariance, Test Equating and Proficiency Estimates

Evaluating the Impact of Construct Shift on Item Parameter Invariance, Test Equating and Proficiency Estimates University of Massachusetts Amherst ScholarWorks@UMass Amherst Doctoral Dissertations Dissertations and Theses 2015 Evaluating the Impact of Construct Shift on Item Parameter Invariance, Test Equating

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS Michael J. Kolen The University of Iowa March 2011 Commissioned by the Center for K 12 Assessment & Performance Management at

More information

Item Position and Item Difficulty Change in an IRT-Based Common Item Equating Design

Item Position and Item Difficulty Change in an IRT-Based Common Item Equating Design APPLIED MEASUREMENT IN EDUCATION, 22: 38 6, 29 Copyright Taylor & Francis Group, LLC ISSN: 895-7347 print / 1532-4818 online DOI: 1.18/89573482558342 Item Position and Item Difficulty Change in an IRT-Based

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 4-2014 Modeling DIF with the Rasch Model: The Unfortunate Combination

More information

The Use of Item Statistics in the Calibration of an Item Bank

The Use of Item Statistics in the Calibration of an Item Bank ~ - -., The Use of Item Statistics in the Calibration of an Item Bank Dato N. M. de Gruijter University of Leyden An IRT analysis based on p (proportion correct) and r (item-test correlation) is proposed

More information

The Impact of Test Dimensionality, Common-Item Set Format, and Scale Linking Methods on Mixed-Format Test Equating *

The Impact of Test Dimensionality, Common-Item Set Format, and Scale Linking Methods on Mixed-Format Test Equating * KURAM VE UYGULAMADA EĞİTİM BİLİMLERİ EDUCATIONAL SCIENCES: THEORY & PRACTICE Received: November 26, 2015 Revision received: January 6, 2015 Accepted: March 18, 2016 OnlineFirst: April 20, 2016 Copyright

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 26 for Mixed Format Tests Kyong Hee Chon Won-Chan Lee Timothy N. Ansley November 2007 The authors are grateful to

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017) DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;

More information

Copyright. Kelly Diane Brune

Copyright. Kelly Diane Brune Copyright by Kelly Diane Brune 2011 The Dissertation Committee for Kelly Diane Brune Certifies that this is the approved version of the following dissertation: An Evaluation of Item Difficulty and Person

More information

Multidimensionality and Item Bias

Multidimensionality and Item Bias Multidimensionality and Item Bias in Item Response Theory T. C. Oshima, Georgia State University M. David Miller, University of Florida This paper demonstrates empirically how item bias indexes based on

More information

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting

More information

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

Linking Mixed-Format Tests Using Multiple Choice Anchors. Michael E. Walker. Sooyeon Kim. ETS, Princeton, NJ

Linking Mixed-Format Tests Using Multiple Choice Anchors. Michael E. Walker. Sooyeon Kim. ETS, Princeton, NJ Linking Mixed-Format Tests Using Multiple Choice Anchors Michael E. Walker Sooyeon Kim ETS, Princeton, NJ Paper presented at the annual meeting of the American Educational Research Association (AERA) and

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

Scaling TOWES and Linking to IALS

Scaling TOWES and Linking to IALS Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy

More information

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION

More information

Standard Errors of Correlations Adjusted for Incidental Selection

Standard Errors of Correlations Adjusted for Incidental Selection Standard Errors of Correlations Adjusted for Incidental Selection Nancy L. Allen Educational Testing Service Stephen B. Dunbar University of Iowa The standard error of correlations that have been adjusted

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education

Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores Shudong Wang NWEA Liru Zhang Delaware Department of Education Paper to be presented at the annual meeting of the National

More information

Differential Item Functioning Amplification and Cancellation in a Reading Test

Differential Item Functioning Amplification and Cancellation in a Reading Test A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker

Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis Russell W. Smith Susan L. Davis-Becker Alpine Testing Solutions Paper presented at the annual conference of the National

More information

Rasch Versus Birnbaum: New Arguments in an Old Debate

Rasch Versus Birnbaum: New Arguments in an Old Debate White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo

More information

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements

More information

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla

More information

Modeling the Effect of Differential Motivation on Linking Educational Tests

Modeling the Effect of Differential Motivation on Linking Educational Tests Modeling the Effect of Differential Motivation on Linking Educational Tests Marie-Anne Keizer-Mittelhaëuser MODELING THE EFFECT OF DIFFERENTIAL MOTIVATION ON LINKING EDUCATIONAL TESTS PROEFSCHRIFT TER

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

Comparing DIF methods for data with dual dependency

Comparing DIF methods for data with dual dependency DOI 10.1186/s40536-016-0033-3 METHODOLOGY Open Access Comparing DIF methods for data with dual dependency Ying Jin 1* and Minsoo Kang 2 *Correspondence: ying.jin@mtsu.edu 1 Department of Psychology, Middle

More information

TECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock

TECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock 1 TECHNICAL REPORT The Added Value of Multidimensional IRT Models Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock Center for Health Statistics, University of Illinois at Chicago Corresponding

More information

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS

POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS By OU ZHANG A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

More information

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek.

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek. An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests Xuan Tan Sooyeon Kim Insu Paek Bihua Xiang ETS, Princeton, NJ Paper presented at the annual meeting of the

More information

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Jin Gong University of Iowa June, 2012 1 Background The Medical Council of

More information

An Introduction to Missing Data in the Context of Differential Item Functioning

An Introduction to Missing Data in the Context of Differential Item Functioning A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, April 23-25, 2003 The Classification Accuracy of Measurement Decision Theory Lawrence Rudner University

More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Teodora M. Salubayba St. Scholastica s College-Manila dory41@yahoo.com Abstract Mathematics word-problem

More information

Differential Item Functioning from a Compensatory-Noncompensatory Perspective

Differential Item Functioning from a Compensatory-Noncompensatory Perspective Differential Item Functioning from a Compensatory-Noncompensatory Perspective Terry Ackerman, Bruce McCollaum, Gilbert Ngerano University of North Carolina at Greensboro Motivation for my Presentation

More information

A Bayesian Nonparametric Model Fit statistic of Item Response Models

A Bayesian Nonparametric Model Fit statistic of Item Response Models A Bayesian Nonparametric Model Fit statistic of Item Response Models Purpose As more and more states move to use the computer adaptive test for their assessments, item response theory (IRT) has been widely

More information

Known-Groups Validity 2017 FSSE Measurement Invariance

Known-Groups Validity 2017 FSSE Measurement Invariance Known-Groups Validity 2017 FSSE Measurement Invariance A key assumption of any latent measure (any questionnaire trying to assess an unobservable construct) is that it functions equally across all different

More information

Differential item functioning procedures for polytomous items when examinee sample sizes are small

Differential item functioning procedures for polytomous items when examinee sample sizes are small University of Iowa Iowa Research Online Theses and Dissertations Spring 2011 Differential item functioning procedures for polytomous items when examinee sample sizes are small Scott William Wood University

More information

Using the Score-based Testlet Method to Handle Local Item Dependence

Using the Score-based Testlet Method to Handle Local Item Dependence Using the Score-based Testlet Method to Handle Local Item Dependence Author: Wei Tao Persistent link: http://hdl.handle.net/2345/1363 This work is posted on escholarship@bc, Boston College University Libraries.

More information

A Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho

A Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho ADAPTIVE TESTLETS 1 Running head: ADAPTIVE TESTLETS A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing Leslie Keng Pearson Tsung-Han Ho The University of Texas at Austin

More information

Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A.

Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A. Measurement and Research Department Reports 2001-2 Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A. Hanson Measurement

More information

Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and

Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere

More information

Comprehensive Statistical Analysis of a Mathematics Placement Test

Comprehensive Statistical Analysis of a Mathematics Placement Test Comprehensive Statistical Analysis of a Mathematics Placement Test Robert J. Hall Department of Educational Psychology Texas A&M University, USA (bobhall@tamu.edu) Eunju Jung Department of Educational

More information

A comparability analysis of the National Nurse Aide Assessment Program

A comparability analysis of the National Nurse Aide Assessment Program University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 2006 A comparability analysis of the National Nurse Aide Assessment Program Peggy K. Jones University of South

More information

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX Paper 1766-2014 Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX ABSTRACT Chunhua Cao, Yan Wang, Yi-Hsin Chen, Isaac Y. Li University

More information

Research Report No Using DIF Dissection Method to Assess Effects of Item Deletion

Research Report No Using DIF Dissection Method to Assess Effects of Item Deletion Research Report No. 2005-10 Using DIF Dissection Method to Assess Effects of Item Deletion Yanling Zhang, Neil J. Dorans, and Joy L. Matthews-López www.collegeboard.com College Board Research Report No.

More information

MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS

MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS The purpose of this study was to create an instrument that measures middle grades

More information

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data Meade, A. W. & Lautenschlager, G. J. (2005, April). Sensitivity of DFIT Tests of Measurement Invariance for Likert Data. Paper presented at the 20 th Annual Conference of the Society for Industrial and

More information

Statistics for Social and Behavioral Sciences

Statistics for Social and Behavioral Sciences Statistics for Social and Behavioral Sciences Advisors: S.E. Fienberg W.J. van der Linden For other titles published in this series, go to http://www.springer.com/series/3463 Jean-Paul Fox Bayesian Item

More information

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

ANNEX A5 CHANGES IN THE ADMINISTRATION AND SCALING OF PISA 2015 AND IMPLICATIONS FOR TRENDS ANALYSES

ANNEX A5 CHANGES IN THE ADMINISTRATION AND SCALING OF PISA 2015 AND IMPLICATIONS FOR TRENDS ANALYSES ANNEX A5 CHANGES IN THE ADMINISTRATION AND SCALING OF PISA 2015 AND IMPLICATIONS FOR TRENDS ANALYSES Comparing science, reading and mathematics performance across PISA cycles The PISA 2006, 2009, 2012

More information

A Monte Carlo Study Investigating Missing Data, Differential Item Functioning, and Effect Size

A Monte Carlo Study Investigating Missing Data, Differential Item Functioning, and Effect Size Georgia State University ScholarWorks @ Georgia State University Educational Policy Studies Dissertations Department of Educational Policy Studies 8-12-2009 A Monte Carlo Study Investigating Missing Data,

More information

Follow this and additional works at:

Follow this and additional works at: University of Miami Scholarly Repository Open Access Dissertations Electronic Theses and Dissertations 2013-06-06 Complex versus Simple Modeling for Differential Item functioning: When the Intraclass Correlation

More information

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models. Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human

More information

During the past century, mathematics

During the past century, mathematics An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first

More information

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Dato N. M. de Gruijter University of Leiden John H. A. L. de Jong Dutch Institute for Educational Measurement (CITO)

More information

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational

More information

Linking Assessments: Concept and History

Linking Assessments: Concept and History Linking Assessments: Concept and History Michael J. Kolen, University of Iowa In this article, the history of linking is summarized, and current linking frameworks that have been proposed are considered.

More information