REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT. Qi Chen
REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT

By Qi Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods - Doctor of Philosophy

2013
ABSTRACT

REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT

By Qi Chen

IRT-based procedures using common items are widely used in test score linking. A critical assumption of these linking methods is the invariance property of IRT: item parameters remain the same on different testing occasions when they are reported on the same θ-scale. In practice, however, there are occasions when an item parameter drifts from its original value. This study investigated the impact of keeping or removing linking items that showed item parameter drift. Simulated data were generated under a modified three-parameter logistic model with a common item non-equivalent group linking design. The factors manipulated were the percentage of drifting items, the type of drift, the magnitude of drift, group achievement differences, and the choice of linking method. The effect of item drift was studied by examining the mean difference between true θs and θ-estimates, and the accuracy of the classification of examinees into proper performance categories. Results indicated that the characteristics of the drift had little impact on the performance of the fixed common item parameter (FCIP) method or Stocking and Lord's test characteristic curve (TCC-ST) method, while its influence on the performance of the concurrent method varied depending on whether the drifting items were removed from the linking. In addition, better estimation was achieved with the concurrent method when the drifting items were removed from the linking.
ACKNOWLEDGEMENTS

Many people have been supportive and helpful along the road to the completion of my dissertation, and I am grateful for their assistance, inspiration and encouragement. First of all, my deepest gratitude goes to Dr. Mark Reckase, my academic advisor and chairperson of my dissertation committee, for his scholarly guidance and constant support. He discussed the project with me, reviewed the research drafts and provided critical feedback. I have benefited greatly from his valuable insights and assistance during the writing process and throughout the whole journey of my Ph.D. study. I would like to express my sincere appreciation to the other members of my dissertation committee, Dr. Tenko Raykov, Dr. Edward Roeber, and Dr. Ann Marie Ryan, for their excellent insights, suggestions, and assistance. They made time in their busy schedules to attend the meetings and provided deep insights, constructive feedback and thoughtful suggestions. My appreciation also extends to Dr. Sharif Shakrani and Dr. Cassandra Guarino for their valuable comments and insightful suggestions on my dissertation proposal, particularly on the design and analyses of the study. I would like to thank Dr. Michael Kozlow and Dr. Xiao Pang for sharing their insights and allowing me access to the data. I feel blessed to have had the opportunity to work with them; their assistance and enthusiasm motivated me to pursue this research topic. I would also like to express my gratitude to Dr. Yong Zhao for providing me with assistantship opportunities during my graduate study. I appreciate the opportunities he has given me to participate in many research projects, which have helped my development as a professional researcher. I appreciate the great support of my friends Brad and Tinker in editing and proofreading the proposal and the drafts. I am deeply grateful and indebted to my family for their love and support. I could not have completed the journey without the encouragement and patience of my dear husband Haonan, the inspiration and love of my wonderful daughter Karen and the support of my loving parents.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter I: Introduction
  Research Background

Chapter II: Background and Objective of Study
  Common-Item Linking Design
  Linking Methods
  Item Parameter Drift

Chapter III: Methods and Research Design
  Data Generation
    Generating Parameters
    Simulation of Item Parameter Drift
    Group Achievement Differences
  Calibration and Linking Procedures
  Handling of Drifting Items
  Evaluation Criteria

Chapter IV: Results and Discussion
  Drift on Discrimination Parameter a
    Correlation between θ Estimate and True θ
    Accuracy of θ Estimates
      Bias and RMSE in Four a-drift Situations
      Effect of Percentage of Items Showing a-parameter Drift
      Effect of the Direction of a-parameter Drift
      Effect of the Linking Method
      Effect of Group Difference
      Effect of Drifted Items Handling
      Effect of Drifted Items Handling at Different θ Levels
    Accuracy of Performance Level Classification
  Drift on Difficulty Parameter b
    Correlation between θ Estimate and True θ
    Accuracy of θ Estimates
      Bias and RMSE in Eight b-drift Situations
      Effect of Percentage of Items Showing b-parameter Drift
      Effect of the Direction of b-parameter Drift
      Effect of the Magnitude of b-parameter Drift
      Effect of the Linking Method
      Effect of Group Difference
      Effect of Drifted Items Handling
      Effect of Drifted Items Handling at Different θ Levels
    Accuracy of Performance Level Classification

Chapter V: Conclusions, Implications and Future Research
  Conclusions
  Implications
  Limitations and Future Directions

APPENDIX
BIBLIOGRAPHY
LIST OF TABLES

Table 3.1 Descriptive Statistics of the Item Parameters
Table 3.2 Summary of Simulated Conditions
Table 4a.1 Average Correlation Coefficients between θ Estimates and True θs when a-parameter Drifting (with SDs in Parentheses)
Table 4a.2 Bias for θ Estimates when a-parameter Drifting
Table 4a.3 RMSE for θ Estimates when a-parameter Drifting
Table 4a.4 Change in Bias and RMSE as More Items Drifting in a-parameter
Table 4a.5 Change in Bias and RMSE with a-parameter Drifting in Different Directions
Table 4a.6 Percentage in Each Performance Level Classification (N=3, Drift=a+0.4)
Table 4a.7 Percentage in Each Performance Level Classification (N=3, Drift=a-0.4)
Table 4a.8 Percentage in Each Performance Level Classification (N=8, Drift=a+0.4)
Table 4a.9 Percentage in Each Performance Level Classification (N=8, Drift=a-0.4)
Table 4b.1 Average Correlation Coefficients between θ Estimates and True θs when b-parameter Drifting (with SDs in Parentheses)
Table 4b.2 Bias for θ Estimates when b-parameter Drifting
Table 4b.3 RMSE for θ Estimates when b-parameter Drifting
Table 4b.4 Changes in Bias and RMSE with More Items Drifting in b-parameter (3 items vs. 8 items)
Table 4b.5 Changes in Bias and RMSE with b-parameter Drifting in Different Directions (Positive Drift vs. Negative Drift)
Table 4b.6 Changes in Bias and RMSE as the Size of b-parameter Drift Increases
Table 4b.7 Percentage in Each Performance Level Classification (N=3, Drift=b+0.2)
Table 4b.8 Percentage in Each Performance Level Classification (N=3, Drift=b-0.2)
Table 4b.9 Percentage in Each Performance Level Classification (N=3, Drift=b+0.4)
Table 4b.10 Percentage in Each Performance Level Classification (N=3, Drift=b-0.4)
Table 4b.11 Percentage in Each Performance Level Classification (N=8, Drift=b+0.2)
Table 4b.12 Percentage in Each Performance Level Classification (N=8, Drift=b-0.2)
Table 4b.13 Percentage in Each Performance Level Classification (N=8, Drift=b+0.4)
Table 4b.14 Percentage in Each Performance Level Classification (N=8, Drift=b-0.4)
Table 6 Population Item Parameters Used for Simulations
LIST OF FIGURES

Figure 3.1 Design of Dataset Simulation
Figure 4a.0.1 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; All Items Included)
Figure 4a.0.2 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; All Items Included)
Figure 4a.0.3 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; All Items Included)
Figure 4a.0.4 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; Drifted Items Dropped)
Figure 4a.0.5 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; Drifted Items Dropped)
Figure 4a.0.6 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; Drifted Items Dropped)
Figure 4a.1.1 Effect of Group Difference (Concurrent; All Items Included)
Figure 4a.1.2 Effect of Group Difference (FCIP; All Items Included)
Figure 4a.1.3 Effect of Group Difference (TCC-ST; All Items Included)
Figure 4a.1.4 Effect of Group Difference (Concurrent; Drifted Items Dropped)
Figure 4a.1.5 Effect of Group Difference (FCIP; Drifted Items Dropped)
Figure 4a.1.6 Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring)
Figure 4a.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped)
Figure 4a.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2)
Figure 4a.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2)
Figure 4a.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2)
Figure 4a.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2)
Figure 4a.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2)
Figure 4a.2.6 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2)
Figure 4a.2.7 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2)
Figure 4a.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2)
Figure 4a.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2)
Figure 4a.3.1 Mean Bias at θ Intervals (3 Items a-drift +0.4)
Figure 4a.3.2 Mean Bias at θ Intervals (3 Items a-drift -0.4)
Figure 4a.3.3 Mean Bias at θ Intervals (8 Items a-drift +0.4)
Figure 4a.3.4 Mean Bias at θ Intervals (8 Items a-drift -0.4)
Figure 4b.0.1 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; All Items Included)
Figure 4b.0.2 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; Drifted Items Dropped)
Figure 4b.0.3 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; All Items Included)
Figure 4b.0.4 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; Drifted Items Dropped)
Figure 4b.0.5 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; All Items Included)
Figure 4b.0.6 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; Drifted Items Dropped)
Figure 4b.1.1 Effect of Group Difference (Concurrent; All Items Included)
Figure 4b.1.2 Effect of Group Difference (Concurrent; Drifted Items Dropped)
Figure 4b.1.3 Effect of Group Difference (FCIP; All Items Included)
Figure 4b.1.4 Effect of Group Difference (FCIP; Drifted Items Dropped)
Figure 4b.1.5 Effect of Group Difference (TCC-ST; All Items Included)
Figure 4b.1.6 Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring)
Figure 4b.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped)
Figure 4b.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2)
Figure 4b.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2)
Figure 4b.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2)
Figure 4b.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2)
Figure 4b.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2)
Figure 4b.2.6 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2)
Figure 4b.2.7 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2)
Figure 4b.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2)
Figure 4b.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2)
Figure 4b.3.1 Mean Bias at θ Intervals (3 Items b-drift +0.2)
Figure 4b.3.2 Mean Bias at θ Intervals (3 Items b-drift -0.2)
Figure 4b.3.3 Mean Bias at θ Intervals (3 Items b-drift +0.4)
Figure 4b.3.4 Mean Bias at θ Intervals (3 Items b-drift -0.4)
Figure 4b.3.5 Mean Bias at θ Intervals (8 Items b-drift +0.2)
Figure 4b.3.6 Mean Bias at θ Intervals (8 Items b-drift -0.2)
Figure 4b.3.7 Mean Bias at θ Intervals (8 Items b-drift +0.4)
Figure 4b.3.8 Mean Bias at θ Intervals (8 Items b-drift -0.4)
Chapter I: Introduction

1.1 Research Background

Scores from large-scale assessments are commonly used as indicators of student performance. Educational policy makers, administrators and educators hope to compare how students are doing from year to year. However, because the test is administered each year, different test forms need to be used to ensure test security. Practically, no test developer can guarantee the equivalency of different test forms despite vigorous efforts to ensure that equivalency. Hence, to achieve comparability, practitioners need to link the test scores.

One way to achieve comparability is through a common item linking design. There are various approaches to common item linking, some based on Item Response Theory (IRT) models. In Item Response Theory, item parameters are estimated and are assumed to be invariant under linear transformation (Lord, 1980). Several methods have been developed to place the item parameters on a common metric, including linear transformation of separate calibrations, fixed common item parameter (FCIP) calibration and concurrent calibration.

The invariance property of item response theory assumes that the item characteristic curve for a test item should be the same when estimated from data from two different populations. The linear relationship between θ-estimates and item parameter estimates implies that a difference in the scaled scores is the result of a difference in θs across groups or over time, while the item parameters remain unchanged when they are placed on the same scale. However, in practice, this assumption of invariance does not always hold. When an item in the same test functions differently for different subgroups with the same degree of proficiency, it is called differential item functioning (DIF; Holland & Wainer, 1993). When the statistical properties of the same items change on different testing occasions, it is called item parameter drift (IPD; Goldstein, 1983; Bock, Muraki, & Pfeiffenberger, 1988). Item parameter drift has the potential to negatively affect the validity of the score scale conversion.

Although there has been some research on item parameter drift, it is not as extensive as research on DIF. In practice, items flagged for IPD are often removed from the linking items in the estimation of linking coefficients (Cook & Eignor, 1991). However, since research comparing IPD detection methods has found that the effectiveness of these methods depends on the testing situation (Donoghue & Isham, 1998; DeMars, 2004), it is likely that some item parameter drift goes undetected while some items are improperly flagged as drifting.

Research has identified some possible sources of drift, such as changes in curriculum (Goldstein, 1983; Bock, Muraki, & Pfeiffenberger, 1988), context effects (Eignor, 1985), sample statistics (Cook, Eignor, & Taft, 1988), content of items (Chan, Drasgow, & Sawin, 1999), and item over-exposure (Veerkamp & Glas, 2000). These sources of item parameter drift may or may not be related to the construct being measured. Keeping items that are drifting due to construct-irrelevant factors is one source of linking error; however, removing items whose drift is closely related to the construct being measured creates another source of linking error (Miller & Fitzpatrick, 2009).

Research on the impact of IPD on θ-estimates has produced mixed results. Some studies found that IPD had little effect on θ-estimates (Wells, Subkoviak, & Serlin, 2002; Witt, Stahl, & Bergstrom, 2003; Rupp & Zumbo, 2003). On the other hand, other research has found that IPD could compound over multiple testing occasions and that the choice of linking model could have a large effect on the θ-estimates (Wollack, Sung, & Kang, 2005, 2006). In most of the research on the effect of IPD, items exhibiting IPD were removed from the linking set of items, and the test characteristic curve (TCC) method was often chosen as the linking method (Stocking & Lord, 1983). However, drifted items may remain as linking items because of ineffective detection of IPD, and when the drift is related to the construct being measured, the items should not be removed.

This study compares the effects of IPD on θ-estimates when the drifted items are either kept in or removed from the linking set. In addition, the interaction between the handling of the drifted items and the linking methods is examined. The linking methods used in this study are Stocking and Lord's test characteristic curve method (Stocking & Lord, 1983), fixed common item parameter (FCIP) calibration, and concurrent calibration (Hambleton, Swaminathan & Rogers, 1991; Kolen & Brennan, 1995).
Chapter II: Background and Objective of Study

2.1 Common-Item Linking Design

Essentially, the process of linking is to transform θ-estimates from one test to θ-estimates on the equivalent trait scale of another test (Holland & Dorans, 2006). In many large-scale testing programs, the common item non-equivalent group linking design is widely used. In this design, two or more test forms are created with a set of items in common, and these test forms are used in different test administrations (Kolen & Brennan, 1995). The item parameters obtained from the different test forms are placed on the same scale by using the common items as linking items.

2.2 Linking Methods

Item parameters obtained from different test forms need to be aligned on the same scale. There are a variety of approaches to achieve this purpose: linear procedures based on θ-estimates, fixed common item parameters, the mean-mean method, the mean-sigma method, the test characteristic curve method, and concurrent calibration (Kolen & Brennan, 1995; Yen & Fitzpatrick, 2006; Holland & Dorans, 2006). Because choosing an appropriate linking method is important to the accuracy of the linking results, there has been research comparing the merits of different linking methods. Some of these studies have investigated the strengths and weaknesses of IRT-model-based linking methods (Baker & Al-Karni, 1991; Kim & Cohen, 1998; Hanson & Beguin, 1999, 2002; Li, Griffith & Tam, 1997; Jodoin, Keller & Swaminathan, 2003; Li, Tam & Tompkins, 2004).

One kind of linking method transforms parameter estimates obtained from two separate calibrations onto a common scale through a linear scale transformation. The Stocking and Lord (1983) test characteristic curve method was found to yield more stable results than moment methods when data sets are typically troublesome to calibrate (Baker, 1991). The better performance of the Stocking-Lord method over the moment methods has been documented in the literature (Hanson & Beguin, 2002).

Another commonly used linking method is the fixed common item parameter method. The pre-calibrated item parameter estimates for the common items are fixed while calibrating the non-common items, so that the item parameter estimates for the non-common items are placed on the same scale as the fixed parameters. This method does not require the computation of scale transformation coefficients. Li, Griffith, and Tam (1997) found that both the FCIP linking method and the characteristic curve linking method provided stable θ estimates, except for students with low θ values, and that the item parameter estimates calibrated with the two methods were consistent except for the estimation of the guessing parameter under the characteristic curve method.

Concurrent calibration is also a widely used linking method. Parameters for items from multiple test forms are estimated in a single calibration run. The simulation study by Kim and Cohen (1998) found that separate and concurrent calibration provided similar results when the number of common items was large, but separate calibration provided more accurate results when the number of common items was small. Hanson and Beguin (2002) found that concurrent calibration generally yielded more accurate results, although the results were not sufficient to support a total preference for concurrent estimation.

Although there has been research comparing these IRT linking methods, there is not sufficient evidence as to which one is best. Each method has its own merits. Keller, Jodoin,
Rogers and Swaminathan (2003) compared linear, FCIP and concurrent linking procedures in detecting academic growth and found that the type of linking method used resulted in differences in mean growth and classification. Lee and Ban (2007) compared concurrent calibration, the Stocking-Lord method and fixed item parameter linking procedures in the random group linking design. They found that the relative performance of the different linking procedures varies with the measurement conditions, so no conclusion can be drawn about one preferred procedure for all occasions.

2.3 Item Parameter Drift

In Item Response Theory, the IRT estimates are assumed to be invariant up to a linear transformation (Lord, 1980). For example, under a 3PL model, the probability of a correct response to the ith item is given by

    P_i(θ) = c_i + (1 - c_i) / (1 + exp[-1.7 a_i (θ - b_i)])        (2.1)

where a_i is the item discrimination, b_i is the item difficulty and c_i is the pseudo-guessing parameter for item i. A linear transformation of the parameters will produce the same probability of a correct response. For example, let

    θ_Jk = A θ_Ik + B,        (2.2)
    a_Ji = a_Ii / A,          (2.3)
    b_Ji = A b_Ii + B,        (2.4)
    c_Ji = c_Ii               (2.5)

where A and B are constants; θ_Jk and θ_Ik are the values of θ for individual k on Scale J and Scale I; a_Ji, b_Ji and c_Ji are the item parameters for item i on Scale J; and a_Ii, b_Ii and c_Ii are the item parameters for item i on Scale I. The c-parameters do not change with the linear transformation of scale. The probability of correctly answering item i for an examinee with θ_Jk (equation 2.1) is

    c_Ji + (1 - c_Ji) / (1 + exp[-1.7 a_Ji (θ_Jk - b_Ji)]),

which equals (substituting the expressions from equations (2.2)-(2.5))

    c_Ii + (1 - c_Ii) / (1 + exp[-1.7 (a_Ii / A)((A θ_Ik + B) - (A b_Ii + B))])
    = c_Ii + (1 - c_Ii) / (1 + exp[-1.7 a_Ii (θ_Ik - b_Ii)]),

which is exactly the probability of correctly answering item i for the same examinee with θ_Ik on the alternative scale (Hambleton & Swaminathan, 1985; Kolen & Brennan, 1995).

However, in practice, item parameters do not always remain unchanged. When an item performs differently for examinees of comparable proficiency, it is defined as differential item functioning (DIF; Holland & Wainer, 1993). Goldstein (1983) developed a general framework for the change of item characteristics or parameter values over time.

Research has found a number of possible sources of item parameter drift. Goldstein (1983) suggested possible reasons such as changing curriculum content, different social demands for knowledge and skills, and so on. Bock, Muraki and Pfeiffenberger (1988) analyzed item response data from 10 years of administrations of the College Board Physics Achievement Test.
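The invariance argument in equations (2.1)-(2.5) can be checked numerically. A minimal sketch in Python (the item parameters, the examinee θ, and the constants A and B below are illustrative values, not values from this study):

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (equation 2.1)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# An item and an examinee on Scale I (illustrative values)
a_i, b_i, c_i = 1.2, 0.5, 0.2
theta_i = 0.8

# An arbitrary linear transformation to Scale J (equations 2.2-2.5)
A, B = 1.3, -0.4
theta_j = A * theta_i + B
a_j, b_j, c_j = a_i / A, A * b_i + B, c_i

p_scale_i = p3pl(theta_i, a_i, b_i, c_i)
p_scale_j = p3pl(theta_j, a_j, b_j, c_j)
assert abs(p_scale_i - p_scale_j) < 1e-12  # invariance: same probability on both scales
```

The key step is that a_j * (theta_j - b_j) = (a_i / A) * A * (theta_i - b_i) reduces to a_i * (theta_i - b_i), so the exponent, and hence the probability, is unchanged.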
They found that differential drift occurred with changes in curricular emphasis. When teachers began to focus more on basic topics in mechanics rather than advanced topics, the difficulty of the mechanics questions increased. As the English units of measurement were phased out of the physics curriculum, the slopes for items using English units and metric units changed in opposite directions. Chan, Drasgow and Sawin (1999) observed item parameter drift on the Armed Services Vocational Aptitude Battery over a 16-year period and found that the drift was related to changing demands for knowledge: tests with more semantic/knowledge content seemed to have higher rates of item drift. Eignor (1985) found that for reading tests, item drift could be explained by the location of reading passages and the position of items in the test. Sykes and Fitzpatrick (1992) tried to explain the drift in item difficulty across consecutive administrations of a professional licensure examination and found that the change in the difficulty parameter was related neither to changes in the booklet or test position of the items, nor to the item type. Cook, Eignor and Taft (1988) investigated curriculum-related achievement tests given in spring or fall and concluded that tests taken at different times during the school year might have measured different attributes. Veerkamp and Glas (2002) investigated item drift in adaptive testing due to previous exposure of the item. Giordano, Subhiyah, and Hess (2005) analyzed item exposure on take-home examinations in medicine and its influence on the difficulty of the exam.

There has also been research on how to detect item parameter drift. Researchers have used DIF procedures such as the Mantel-Haenszel procedure, Lord's chi-square measure, Kim and Cohen's (1991) closed-interval measures, Raju's (1988) exact signed- and unsigned-integral measures and
Kim, Cohen, and Park's (1995) chi-square test for multiple-group DIF; analysis of covariance models (Sykes & Fitzpatrick, 1992); restricted item response models (Stone & Lane, 1991); the cumulative sum (CUSUM) chart, a statistical quality control technique used in production processes (Veerkamp & Glas, 2002); and the procedure in BILOG-MG for estimating linear trends in item difficulty. In their study comparing procedures for detecting IPD, Donoghue and Isham (1998) found that Lord's measure was the most effective in detecting drift on the condition that the item's guessing parameter was constrained to be equal across calibrations. Their findings suggested that the effectiveness of a detection method depends on the specific testing situation. In a study by DeMars (2004), the linear drift procedure in BILOG-MG and the modified-KPC were found to be effective in identifying drift similar to that represented in this study, but these procedures also falsely identified non-drifting items.

There has been research on the effect of item parameter drift on the estimation of an examinee's θ, but the research is not extensive and the conclusions are not consistent. Wells, Subkoviak, and Serlin (2002) simulated item response data using the two-parameter logistic model for two testing occasions. The factors they manipulated included the percentage of items that exhibited IPD, the type of drift, sample size and test length. Drift was simulated by increasing the difficulty parameters by 0.4 and the discrimination parameters by 0.5. Their results suggested that item parameter drift as simulated in their study had a small effect on θ-estimates. The study also illustrated the robustness of the 2PL model despite the violation of the invariance property. Witt, Stahl, and Bergstrom (2003) investigated the effects of IPD on the
stability of test-taker θ-estimates and pass/fail status under the Rasch model. The researchers used a real, non-normal distribution of examinee θ values. Six levels of shift in the difficulty parameter were simulated. The results illustrated the robustness of the Rasch model in spite of item drift, even when the true θs were not normally distributed; θ-estimation was stable under moderate drift in item difficulties. Similarly, Rupp and Zumbo (2003) concluded from their study that IRT θ-estimates were relatively robust under moderate amounts of item parameter drift.

In Wollack, Sung and Kang's (2005) study of longitudinal item parameter drift, seven years' worth of test forms from a German placement test were linked. The results showed that the choice of linking/IPD model could have a large impact on the resulting θ-estimates and passing rates. The simulation and real-data studies by Wollack, Sung and Kang (2006) further supported this conclusion. They found that direct linking of each new form to the base form was slightly better than indirect linking. Models with TCC linking were compared with models that used the fixed parameter linking method, and the TCC linking process was found to perform better.

The inconsistency in the findings about the effect of item parameter drift indicates that further studies need to be conducted to explore the effect of IPD from the perspective of its interaction with factors such as linking procedures and the treatment of the drifting items. First, for the common-item linking design, no conclusion has been reached as to which linking procedure is most effective. Therefore, it might provide some interesting insights if the comparison of linking procedures could be combined with the study of the effect of IPD.
Second, there has been very limited research on how to handle drifting items. In practice, items flagged for IPD are often removed from the set of items linking two or more test forms (Cook & Eignor, 1991). However, that is not always a proper way of treating item parameter drift. Before removing drifting items from the linking set, the nature of the drift should be examined to see whether it is related to the construct being measured. Some possible sources of item parameter drift can be irrelevant to the construct being measured, e.g., drift due to over-exposure of an item, or a change in an item parameter because of a change in item position. If item drift occurs as a result of construct-irrelevant factors, then keeping these drifting items in the linking set is an incorrect way of handling IPD, resulting in linking errors. However, if the item parameter drift cannot be explained by construct-irrelevant factors, it is likely that the drift is related to the construct being measured. In this case, if the drifting items are dropped from the linking set, they become another source of linking error (Miller & Fitzpatrick, 2009). Moreover, in a real testing situation, the items identified as drifting are only the items flagged by one or more methods of detecting IPD, so there can also be linking errors from the false detection of IPD.

The objective of the proposed study is to investigate the effects of item parameter drift on θ-estimates when the items exhibiting drift are treated in different ways: either kept in or removed from the set of linking items. The study also explores, in the presence of item parameter drift, the performance of three commonly used linking procedures: the Stocking and Lord TCC method, the fixed common item parameter method and concurrent calibration.
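As a point of reference for the first of these procedures, the Stocking-Lord method chooses the transformation constants A and B that minimize the squared distance between the common-item test characteristic curves of the two calibrations. A minimal sketch (the item parameter values, the θ quadrature grid, and the crude grid search are illustrative assumptions; operational programs minimize the criterion with a numerical optimizer):

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def sl_loss(A, B, old_items, new_items, thetas):
    """Stocking-Lord criterion: mean squared distance between the common-item
    test characteristic curves, after placing the new-form estimates on the
    old scale via a* = a / A, b* = A * b + B."""
    total = 0.0
    for t in thetas:
        tcc_old = sum(p3pl(t, a, b, c) for a, b, c in old_items)
        tcc_new = sum(p3pl(t, a / A, A * b + B, c) for a, b, c in new_items)
        total += (tcc_old - tcc_new) ** 2
    return total / len(thetas)

# Hypothetical common-item (a, b, c) estimates from two separate calibrations
old = [(1.0, 0.0, 0.2), (1.4, 0.6, 0.2), (0.8, -0.5, 0.2)]
new = [(0.9, 0.3, 0.2), (1.3, 0.9, 0.2), (0.7, -0.2, 0.2)]
grid = [t / 10 for t in range(-30, 31)]  # θ quadrature points on [-3, 3]

# Crude grid search over candidate (A, B) pairs for illustration
A_hat, B_hat = min(
    ((a / 100, b / 100) for a in range(80, 121) for b in range(-60, 1)),
    key=lambda ab: sl_loss(ab[0], ab[1], old, new, grid),
)
```

When the two sets of estimates are already on the same scale, the loss is minimized at the identity transformation A = 1, B = 0, which is a convenient sanity check for an implementation.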
Chapter III: Methods and Research Design

3.1 Data Generation

In this study, data were generated to simulate a large-scale assessment of mathematics. To focus on the effect of item parameter drift under the 3PL model, only multiple-choice items were considered. Item response data were generated to simulate two test administrations one year apart. The test form included 30 operational items and 30 field test items. The items that appeared as field test items in one testing year became operational items in the following testing year, and they served as the common items linking the two testing occasions. Item responses were simulated for 3000 examinees taking the test each year.

The study tried to model a real testing situation in which the test form consists of operational items and field test items. The operational items in one testing year were field test items in the previous year, so the number of common items was essentially the same as the number of operational items. In a real situation, however, a matrix-sampling design is used in field testing items: the field test items are divided into subsets and placed into several booklets, and each examinee works on all the operational items plus a subset of the field test items. In this study, to minimize sampling errors, the matrix-sampling design was not simulated. Instead, item responses were generated for all examinees answering all the operational and field test items. Figure 3.1 shows the design of the simulated datasets.
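The response-generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the item parameters and the N(0, 1) examinee distribution are placeholders, not the study's actual generating values (the study drew its true parameters from a Canadian provincial assessment).

```python
import math
import random

def p3pl(theta, a, b, c=0.2, D=1.7):
    """Modified 3PL with the guessing parameter fixed at 0.2."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def simulate_responses(thetas, items, rng):
    """0/1 response matrix: one row per examinee, one column per item.
    A response is correct when a uniform draw falls below the model probability."""
    return [[int(rng.random() < p3pl(t, a, b)) for a, b in items]
            for t in thetas]

rng = random.Random(2013)
# 60 placeholder (a, b) pairs: 30 operational + 30 field test items
items = [(rng.uniform(0.5, 2.0), rng.gauss(0.0, 1.0)) for _ in range(60)]
thetas = [rng.gauss(0.0, 1.0) for _ in range(3000)]   # one year's examinee group
data = simulate_responses(thetas, items, rng)         # 3000 x 60 response matrix
```

The same routine would be run once per testing year, with the drifted parameter set substituted for the common items in the second year.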
Figure 3.1 Design of Dataset Simulation

 Item set            Year One Group           Year Two Group
 30 Unique Items     Operational, Year One    (Considered missing responses in linking)
 30 Common Items     Field test, Year One     Operational, Year Two
 Field test items    --                       Year Two (not simulated)

3.1.1 Generating Parameters

A set of 60 item parameter estimates from the 2006/2007 Canadian provincial mathematics assessment was used as true parameters for generating baseline data. A modified three-parameter logistic model, with the guessing parameter fixed at 0.2, was used in estimating the item parameters. Modifications were made to randomly selected a- or b-parameters to reflect item parameter drift, while the c-parameter was set at 0.2. Table 3.1 describes the distribution of the true item parameters, including the mean difficulty and mean slope.

Table 3.1 Descriptive Statistics of the Item Parameters

 Parameter   N   Minimum   Maximum   Mean   Std. Deviation
 a
 b
 c

3.1.2 Simulation of Item Parameter Drift

To ascertain whether the effect of item parameter drift on θ-estimates differed when more items were showing drift or when items drifted further from their original values, the number of drifting items and the level of drift were manipulated.
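The baseline generation and this drift manipulation can be sketched in a few lines. This is a minimal illustration rather than the study's actual code: the item parameters below are random placeholders, not the 2006/2007 estimates, and the logistic is used without the D = 1.7 scaling constant.

```python
import math
import random

def p_3pl(theta, a, b, c=0.2):
    # modified 3PL response probability with the guessing parameter fixed at 0.2
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def simulate_responses(thetas, items, rng):
    # one dichotomous (0/1) response per examinee-item pair
    return [[1 if rng.random() < p_3pl(t, a, b) else 0 for a, b in items]
            for t in thetas]

def apply_drift(items, n_drift, which_param, delta, rng):
    # add `delta` to the a- or b-parameter of `n_drift` randomly chosen items
    # (unidirectional drift, as in the design described here)
    drifted = list(items)
    chosen = rng.sample(range(len(items)), n_drift)
    for i in chosen:
        a, b = drifted[i]
        drifted[i] = (a + delta, b) if which_param == "a" else (a, b + delta)
    return drifted, sorted(chosen)

rng = random.Random(2013)
items = [(rng.uniform(0.5, 1.5), rng.gauss(0.0, 1.0)) for _ in range(60)]  # placeholder (a, b)
thetas = [rng.gauss(0.0, 1.0) for _ in range(3000)]                        # YEAR1: NID(0, 1)
baseline = simulate_responses(thetas, items, rng)
drifted_items, drifted_idx = apply_drift(items, 3, "b", 0.4, rng)  # e.g. 10% of 30 common items
```

Drift in the other conditions would simply change `n_drift`, `which_param`, and `delta`.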
In one condition about 10% and in another condition 25% of the items were randomly selected to exhibit item parameter drift. When 10% of the items were drifting, this represented a scenario in which the number of drifting items was not large and the drift might go undetected. When 25% of the items were drifting, the drift was not likely to be ignored, and whether to keep or remove those drifting items could have a larger effect on θ-estimates. Data were also generated under a no-drift condition to serve as a baseline. Two types of drift were simulated: drift on the discrimination parameter a and drift on the difficulty parameter b. The a-parameter drift was simulated by manually increasing or decreasing the a-parameter by 0.4. Similar magnitudes of a-drift have been used in previous research, e.g., an a-drift of 0.3 (Donoghue & Isham, 1998) or 0.5 (Wells et al., 2002). Two levels of b-parameter drift were simulated by increasing or decreasing the parameter by 0.2 or 0.4: a drift of 0.2 simulated a moderate amount of item parameter drift, while a drift of 0.4 simulated a large amount. The same magnitude of drift (0.4) was used in other studies of IPD, such as Wells et al. (2002), Donoghue and Isham (1998), and Wollack (2006). The changes in the p-values of the items under each amount of drift were also examined. To study the effects of a-drift and b-drift separately, the two types of drift were not mixed in any one condition. In addition, the simulated drift was restricted to one direction. In practice, drift is likely to go in either direction; however, mixing positive and negative drift within a test can result in cancellation of drift and give less information on the effects of IPD. Thus, under each condition studied, one parameter drifted, with the drift always increasing or always decreasing. Though this unidirectional drift design was an oversimplification of how drift
actually occurred in real testing, it represented a worst-case scenario in which the effect of drift was unlikely to be ignored.

3.1.3 Group Achievement Differences

When item parameters drift, this sometimes indicates a change in the achievement of the group taking the test, due, for example, to changes in curriculum or in policy. To investigate how drifting items would help identify changes in group achievement, data were simulated for both equivalent and non-equivalent groups. For examinees taking the test in YEAR1, item responses were generated by sampling the latent trait θ from a normal independent distribution (NID) with mean 0 and standard deviation 1, NID(0,1). For examinees taking the test in YEAR2, three sets of item responses were generated by sampling θ from an NID(0,1) distribution, an NID(0.2,1) distribution, and an NID(-0.2,1) distribution. The 0.2 shift represented a moderate increase in θ from one year to the next; similar magnitudes of achievement growth have been used in other research, with Donoghue and Isham (1998) using 0.1 and Wollack et al. (2006) using 0.15 as a yearly increase in student achievement. The group differences were designed so that item parameter drift could be examined in different situations: 1) when there was no remarkable policy change between the two testing years and the populations were assumed to be equivalent in achievement; and 2) when there was a noticeable policy implementation that might affect student learning and student achievement was expected to change. Under a modified 3PL model, with the guessing parameter fixed at 0.2, item response data
were generated for the 36 conditions above: percentage of drift (2) × type and level of drift (6) × group achievement difference (3).

3.2 Calibration and Linking Procedures

When different methods are used for linking tests, the effect of keeping or removing the drifting items on θ-estimates might differ. Three commonly used IRT linking methods were examined in the study: 1) concurrent calibration, 2) fixed common item parameter calibration, and 3) Stocking & Lord's test characteristic curve method.

Table 3.2 Summary of Simulated Conditions

 Drift:            10% (3 items) drifting or 25% (8 items) drifting; a-parameter or b-parameter
 Group Difference: Year One [NID(0,1)] and Year Two [NID(0,1)];
                   Year One [NID(0,1)] and Year Two [NID(0.2,1)];
                   Year One [NID(0,1)] and Year Two [NID(-0.2,1)]
 Linking Method:   Concurrent calibration (Concurrent);
                   Fixed Common Item Parameter calibration (FCIP);
                   Stocking & Lord's test characteristic curve method (TCC-ST)

The computer program PARSCALE 4 (Muraki & Bock, 2003) was used to calibrate the parameters. In concurrent calibration, responses from the YEAR1 and YEAR2 tests were combined in one concurrent run to estimate the parameters. When the fixed common item parameter method was used, item parameters were first estimated from a separate calibration of the YEAR2 test; the item parameters for the common items were then fixed while the YEAR1 test was calibrated. Thus, the item parameters were placed on the YEAR2 scale.
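Of the three linking methods just listed, Stocking & Lord's TCC criterion is the least transparent, so a minimal sketch may help. The operational study used the ST program; the coarse grid search, search ranges, θ quadrature points, and 3PL without the 1.7 scaling constant below are illustrative assumptions only.

```python
import math

def p3(theta, a, b, c=0.2):
    # 3PL response probability with the guessing parameter fixed at 0.2
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    # test characteristic curve: expected number-correct score on the common items
    return sum(p3(theta, a, b) for a, b in items)

def stocking_lord(old_items, new_items, grid):
    """Find slope A and intercept B minimizing the squared distance between the
    new-form TCC and the TCC of the old-form common items rescaled by
    a/A, A*b + B (grid search stands in for a gradient optimizer)."""
    best = (float("inf"), 1.0, 0.0)
    for ai in range(81):
        A = 0.6 + 0.01 * ai
        for bi in range(121):
            B = -0.6 + 0.01 * bi
            rescaled = [(a / A, A * b + B) for a, b in old_items]
            loss = sum((tcc(t, rescaled) - tcc(t, new_items)) ** 2 for t in grid)
            if loss < best[0]:
                best = (loss, A, B)
    return best[1], best[2]

# toy check: Year Two estimates of the common items that differ from Year One
# by a known scale shift (A = 1.1, B = 0.2) should be recovered
year1 = [(0.8, -1.0), (1.0, 0.0), (1.2, 0.5), (0.9, 1.0)]
year2 = [(a / 1.1, 1.1 * b + 0.2) for a, b in year1]
grid = [-3 + 0.25 * k for k in range(25)]
A, B = stocking_lord(year1, year2, grid)
```

With the recovered A and B, the Year One item and θ estimates can be placed on the Year Two scale, as described next.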
With Stocking & Lord's test characteristic curve method, item parameters were first estimated through two separate calibrations of the YEAR1 and YEAR2 test items. The linking coefficients were then obtained from the common items using the test characteristic curve method of Stocking & Lord, and with these coefficients the item parameter estimates from the YEAR1 test were placed on the scale of the YEAR2 test. For the Stocking-Lord transformation, the computer program ST (Hanson, Zeng & Cui, 2004) was used.

3.3 Handling of Drifting Items

Two ways of handling the drifting items were compared: treating them or ignoring them. Items whose parameters had been manually altered during data generation were considered drifting items. The drift was treated when the drifting items were dropped from the linking items; it was ignored when the drifting items were kept in the linking items, simulating a scenario in which the item parameter drift was either undetected or construct-relevant. To compare the two ways of handling drifting items, when item parameter drift was present, each linking method was applied twice: once with the drifting items included in the common items and once with the drifting items removed from the linking items. Stocking & Lord's method was applied a third time, with the drifting items removed from the linking items but included in the scoring.

3.4 Evaluation Criteria

Several indices were used to evaluate the effect of the treatment of item parameter drift and the choice of linking method on θ-estimates. One index was the correlation between the true θs and the θ-estimates. Bias and the root mean square error (RMSE) were used to assess the accuracy of
θ-estimates. One benefit of a simulation study is that the bias and RMSE between the estimates and the true θs can be obtained; these indices indicate the accuracy of the θ-estimates. If the bias is negative, θ is under-estimated; if positive, θ is over-estimated. The smaller the RMSE, the better the estimation method. The bias and RMSE were calculated as follows (as in Li, Tam & Tompkins, 2004):

\[ \mathrm{bias}(\theta) = \frac{\sum_{i=1}^{p} (\hat{\theta}_i - \theta_i)}{p} \tag{3.1} \]

\[ \mathrm{RMSE}(\theta) = \sqrt{\frac{\sum_{i=1}^{p} (\hat{\theta}_i - \theta_i)^2}{p}} \tag{3.2} \]

where \( \theta_i \) is the true θ, \( \hat{\theta}_i \) is the corresponding estimate, and p is the total number of examinees. In calculating the RMSE and bias, the θs that were used to generate the item response data were transformed to match the scales of the estimates. With the Stocking-Lord TCC method and the fixed common item parameter method used in this study, θ-estimates for the Year One students were placed on the scale of the Year Two estimates after linking. With the concurrent calibration linking method, θ-estimates for the Year One students were placed on the scale of the combined Year One and Year Two estimates after linking. When computing the RMSE and bias, the θ-values generating data for Year One students were transformed onto the Year Two
scale (the combined Year One and Year Two scale for the concurrent calibration linking method), and these transformed θ-values were then used as the true θ values to be compared with the θ-estimates of the Year One students. The θ values used in generating data were transformed to scaled θ-values through linear transformations (Kolen, 2006):

\[ s(x) = \frac{\sigma_s}{\sigma_x}\, x + \mu_s - \frac{\sigma_s}{\sigma_x}\,\mu_x \tag{3.3} \]

where x is the raw score (the generating θ-value), s(x) is the scale score (scaled θ-value), \( \mu_x \) and \( \sigma_x \) are the mean and standard deviation of the raw values, and \( \mu_s \) and \( \sigma_s \) are the mean and standard deviation of the scaled values.

Another index is the percentage of examinees classified into appropriate performance levels, especially examinees at or above the proficient level. Many large-scale assessments report the performance level of an examinee in addition to, or instead of, an individual score; therefore, the proportion of correct classification is one indication of the quality of the estimates. To compute this index, the θ cut score for each performance level was set following the guidelines used in the real assessment. In the 2006/2007 Canadian provincial mathematics assessment there are four performance levels, with Level 3 being the provincial target level; examinees in Level 3 or 4 are considered to meet or surpass the provincial target. After the θ-estimates from the previous year were placed on the scale of the current testing year, the percentage of students in each level in the previous year was used to find the θ thresholds for the performance levels in the current year. For this simulation study, the cumulative percentages 17.5%, 57.3%, and 94.5% were used to find the θ cuts for each level. Performance level, classified
according to the examinee's true θ, was considered his or her true performance level, while the level classified through the estimated θ was the estimated performance level. To examine the different ways of treating drifting items, the proportion of correct classification at each level was compared, with special attention paid to the pass/fail status at the provincial target level, as the percentage of students at or above this level is an important index of how schools are progressing.
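The evaluation indices of Section 3.4 are straightforward to compute. A minimal sketch: the first three functions follow Equations 3.1-3.3 directly, while the quantile convention in `theta_cuts` is an assumption, since the study's exact rounding rule for the 17.5/57.3/94.5 cumulative percentages is not stated.

```python
import math

def bias(estimates, true_thetas):
    # Equation 3.1: mean signed difference; negative values mean under-estimation
    return sum(e - t for e, t in zip(estimates, true_thetas)) / len(true_thetas)

def rmse(estimates, true_thetas):
    # Equation 3.2: root mean squared error of the theta estimates
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, true_thetas))
                     / len(true_thetas))

def to_scale(x, mu_x, sd_x, mu_s, sd_s):
    # Equation 3.3: linear transformation of a generating theta onto the
    # scale of the estimates (Kolen, 2006)
    return (sd_s / sd_x) * x + mu_s - (sd_s / sd_x) * mu_x

def theta_cuts(thetas, cum_pcts=(17.5, 57.3, 94.5)):
    # theta thresholds below which the given cumulative percentages of the
    # reference-year examinees fall (quantile convention is an assumption)
    ordered = sorted(thetas)
    n = len(ordered)
    return [ordered[min(n - 1, int(p / 100.0 * n))] for p in cum_pcts]

def classify(theta, cuts):
    # performance level 1..4: one level higher for each cut at or below theta
    return 1 + sum(1 for c in cuts if theta >= c)

def agreement(true_thetas, est_thetas, cuts):
    # proportion of examinees placed in the same level by true and estimated theta
    same = sum(classify(t, cuts) == classify(e, cuts)
               for t, e in zip(true_thetas, est_thetas))
    return same / len(true_thetas)
```

With these, each condition's results reduce to a bias value, an RMSE value, and the classification agreement between true and estimated levels.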
Chapter IV: Results and Discussion

The results are presented and discussed for the different types of parameter drift simulated in this study. The manipulated conditions included the number of drifted items, the level of drift, and the direction of drift. The effects of item parameter drift on θ-estimates are compared when the drifted items are handled differently: the first part focuses on the effect of a-parameter drift on θ-estimates, and the second part examines the effect of b-parameter drift. In all the tables and figures presented in the results, Group Difference refers to three types of group ability change: 1) examinees taking the exams in both years were equivalent groups in ability ("Year One (0,1), Year Two (0,1)"); 2) examinees were non-equivalent in ability, with higher ability in the following year ("Year One (0,1), Year Two (0.2,1)"); and 3) examinees were non-equivalent in ability, with lower ability in the following year ("Year One (0,1), Year Two (-0.2,1)"). Linking Method refers to the three methods used to link the Year One and Year Two scores: 1) concurrent calibration ("Concurrent"), 2) the fixed common item parameter method ("FCIP"), and 3) Stocking & Lord's test characteristic curve method ("TCC-ST"). Drifted Item Handling refers to the way item parameter drift was treated: 1) keeping the drifted items in both calibration and scoring ("keep/keep"), 2) dropping the items in both calibration and scoring ("drop/drop"), and 3) dropping the items in calibration while keeping them in scoring ("drop/keep"), which was applied only with the TCC-ST method.

4.1 Drift on Discriminating Parameter a

The effect of different ways of handling the drifted items was studied in four kinds of a-parameter drift: 1) three items drifting with the a-parameter increasing by 0.4; 2) three items
drifting with the a-parameter decreasing by 0.4; 3) eight items drifting with the a-parameter increasing by 0.4; and 4) eight items drifting with the a-parameter decreasing by 0.4.

4.1.1 Correlation between θ Estimates and True θs

Table 4a.1 lists the correlation coefficients between the θ estimates and the true θs when the a-parameter drifts. The correlations are high, ranging from 0.912 to 0.927, indicating that the θ estimates have a very strong, positive association with the true θs. Compared with the no-drift baseline condition, where the average correlation is 0.927, the strong positive relationship between the θ estimates and the true θs is consistent across the four conditions of a-parameter drift. The correlations tend to drop slightly when more items are showing drift and when the drifted items are dropped from the linking and scoring, but the drop is quite small. These consistently high correlations are good indications that the different ways of handling the drifting items have a negligible effect on the relationship between the θ estimates and the true θs, regardless of the group abilities and linking methods.
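The correlation index in Table 4a.1 is, presumably, the ordinary Pearson product-moment correlation between the θ estimates and the true θs; for reference, a minimal pure-Python version:

```python
import math

def pearson_r(x, y):
    # product-moment correlation between theta estimates and true thetas
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

In the simulation, `x` would be a vector of estimated θs and `y` the corresponding true (generating) θs for one replication of a condition.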
Table 4a.1 Average Correlation Coefficients between θ Estimates and True θs when a-Parameter Drifting (with SDs in Parentheses)

Year One (0,1), Year Two (0,1)
 Linking     Handling       No drift       3 items        3 items        8 items        8 items
 Method      (link/score)                  a-drift +0.4   a-drift -0.4   a-drift +0.4   a-drift -0.4
 Concurrent  keep/keep      0.927(0.002)   0.927(0.002)   0.926(0.002)   0.926(0.002)   0.926(0.002)
 Concurrent  drop/drop      --             0.924(0.002)   0.925(0.002)   0.924(0.002)   0.924(0.002)
 FCIP        keep/keep      0.927(0.002)   0.926(0.002)   0.925(0.002)   0.925(0.002)   0.923(0.003)
 FCIP        drop/drop      --             0.923(0.002)   0.922(0.002)   0.919(0.002)   0.912(0.002)
 TCC-ST      keep/keep      0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/keep      --             0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/drop      --             0.922(0.005)   0.922(0.002)   0.919(0.002)   0.913(0.002)

Year One (0,1), Year Two (0.2,1)
 Concurrent  keep/keep      0.927(0.002)   0.927(0.002)   0.926(0.002)   0.927(0.002)   0.926(0.002)
 Concurrent  drop/drop      --             0.923(0.002)   0.925(0.002)   0.923(0.002)   0.923(0.002)
 FCIP        keep/keep      0.927(0.002)   0.926(0.002)   0.925(0.002)   0.925(0.002)   0.923(0.003)
 FCIP        drop/drop      --             0.923(0.002)   0.922(0.002)   0.919(0.002)   0.912(0.002)
 TCC-ST      keep/keep      0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/keep      --             0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/drop      --             0.924(0.002)   0.922(0.002)   0.919(0.002)   0.913(0.002)

Year One (0,1), Year Two (-0.2,1)
 Concurrent  keep/keep      0.927(0.002)   0.927(0.002)   0.926(0.002)   0.927(0.002)   0.926(0.002)
 Concurrent  drop/drop      --             0.924(0.002)   0.926(0.002)   0.924(0.002)   0.924(0.002)
 FCIP        keep/keep      0.927(0.002)   0.926(0.002)   0.925(0.002)   0.925(0.002)   0.923(0.002)
 FCIP        drop/drop      --             0.922(0.005)   0.922(0.003)   0.919(0.002)   0.912(0.002)
 TCC-ST      keep/keep      0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/keep      --             0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/drop      --             0.924(0.002)   0.922(0.002)   0.919(0.002)   0.913(0.002)
4.1.2 Accuracy of θ Estimates

The correlation results (Table 4a.1) showed a strong relationship between the θ estimates and the true θs, and this relationship was not affected by the type of a-drift, the group difference, the linking method, or the way of handling the drifted items. These factors may, however, affect the accuracy of θ estimation. To examine this further, the bias and RMSE between the θ estimates and the true θs were calculated; Tables 4a.2 and 4a.3 give the bias and RMSE values for the θ estimates when a-parameter drift is present.

Bias and RMSE in Four a-Drift Situations

Four situations of a-drift were examined in the study: 3 items (10%) showing an increase of 0.4 in the a-parameter, 3 items (10%) showing a decrease of 0.4 in the a-parameter, 8 items (25%) showing the same +0.4 increase, and 8 items (25%) showing the same -0.4 decrease. When 10% of the items showed a 0.4 increase in the a-parameter, most of the bias values were negative, indicating that θ was under-estimated, except when datasets of non-equal groups were linked by concurrent calibration with all items included. The under-estimation was most obvious when concurrent calibration was applied with the drifting items dropped. The RMSE was relatively smaller when concurrent calibration was used with all the linking items included; the largest RMSE, however, occurred when concurrent calibration was used with the drifting items dropped. When the same drift occurred in 25% of the items, bias values were again negative for most of the datasets, except when the TCC-ST method with all linking items included was applied to equivalent groups.
More informationScaling TOWES and Linking to IALS
Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy
More informationITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE
California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION
More informationStandard Errors of Correlations Adjusted for Incidental Selection
Standard Errors of Correlations Adjusted for Incidental Selection Nancy L. Allen Educational Testing Service Stephen B. Dunbar University of Iowa The standard error of correlations that have been adjusted
More informationScoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods
James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical
More informationEffects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education
Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores Shudong Wang NWEA Liru Zhang Delaware Department of Education Paper to be presented at the annual meeting of the National
More informationDifferential Item Functioning Amplification and Cancellation in a Reading Test
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to
More informationThe Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing
The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in
More informationDetecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker
Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis Russell W. Smith Susan L. Davis-Becker Alpine Testing Solutions Paper presented at the annual conference of the National
More informationRasch Versus Birnbaum: New Arguments in an Old Debate
White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo
More informationIDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS
IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements
More informationNonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia
Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla
More informationModeling the Effect of Differential Motivation on Linking Educational Tests
Modeling the Effect of Differential Motivation on Linking Educational Tests Marie-Anne Keizer-Mittelhaëuser MODELING THE EFFECT OF DIFFERENTIAL MOTIVATION ON LINKING EDUCATIONAL TESTS PROEFSCHRIFT TER
More informationMultilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison
Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting
More informationComparing DIF methods for data with dual dependency
DOI 10.1186/s40536-016-0033-3 METHODOLOGY Open Access Comparing DIF methods for data with dual dependency Ying Jin 1* and Minsoo Kang 2 *Correspondence: ying.jin@mtsu.edu 1 Department of Psychology, Middle
More informationTECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock
1 TECHNICAL REPORT The Added Value of Multidimensional IRT Models Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock Center for Health Statistics, University of Illinois at Chicago Corresponding
More informationPOLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS
POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS By OU ZHANG A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
More informationDesigning small-scale tests: A simulation study of parameter recovery with the 1-PL
Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy
More informationRunning head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note
Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,
More informationAn Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek.
An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests Xuan Tan Sooyeon Kim Insu Paek Bihua Xiang ETS, Princeton, NJ Paper presented at the annual meeting of the
More informationDetection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models
Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Jin Gong University of Iowa June, 2012 1 Background The Medical Council of
More informationAn Introduction to Missing Data in the Context of Differential Item Functioning
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute
More informationThe Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland
Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, April 23-25, 2003 The Classification Accuracy of Measurement Decision Theory Lawrence Rudner University
More informationDescription of components in tailored testing
Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of
More informationDetermining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory
Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Teodora M. Salubayba St. Scholastica s College-Manila dory41@yahoo.com Abstract Mathematics word-problem
More informationDifferential Item Functioning from a Compensatory-Noncompensatory Perspective
Differential Item Functioning from a Compensatory-Noncompensatory Perspective Terry Ackerman, Bruce McCollaum, Gilbert Ngerano University of North Carolina at Greensboro Motivation for my Presentation
More informationA Bayesian Nonparametric Model Fit statistic of Item Response Models
A Bayesian Nonparametric Model Fit statistic of Item Response Models Purpose As more and more states move to use the computer adaptive test for their assessments, item response theory (IRT) has been widely
More informationKnown-Groups Validity 2017 FSSE Measurement Invariance
Known-Groups Validity 2017 FSSE Measurement Invariance A key assumption of any latent measure (any questionnaire trying to assess an unobservable construct) is that it functions equally across all different
More informationDifferential item functioning procedures for polytomous items when examinee sample sizes are small
University of Iowa Iowa Research Online Theses and Dissertations Spring 2011 Differential item functioning procedures for polytomous items when examinee sample sizes are small Scott William Wood University
More informationUsing the Score-based Testlet Method to Handle Local Item Dependence
Using the Score-based Testlet Method to Handle Local Item Dependence Author: Wei Tao Persistent link: http://hdl.handle.net/2345/1363 This work is posted on escholarship@bc, Boston College University Libraries.
More informationA Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho
ADAPTIVE TESTLETS 1 Running head: ADAPTIVE TESTLETS A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing Leslie Keng Pearson Tsung-Han Ho The University of Texas at Austin
More informationEffect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A.
Measurement and Research Department Reports 2001-2 Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A. Hanson Measurement
More informationCopyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and
Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere
More informationComprehensive Statistical Analysis of a Mathematics Placement Test
Comprehensive Statistical Analysis of a Mathematics Placement Test Robert J. Hall Department of Educational Psychology Texas A&M University, USA (bobhall@tamu.edu) Eunju Jung Department of Educational
More informationA comparability analysis of the National Nurse Aide Assessment Program
University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 2006 A comparability analysis of the National Nurse Aide Assessment Program Peggy K. Jones University of South
More informationParameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX
Paper 1766-2014 Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX ABSTRACT Chunhua Cao, Yan Wang, Yi-Hsin Chen, Isaac Y. Li University
More informationResearch Report No Using DIF Dissection Method to Assess Effects of Item Deletion
Research Report No. 2005-10 Using DIF Dissection Method to Assess Effects of Item Deletion Yanling Zhang, Neil J. Dorans, and Joy L. Matthews-López www.collegeboard.com College Board Research Report No.
More informationMEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS
MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS The purpose of this study was to create an instrument that measures middle grades
More informationSensitivity of DFIT Tests of Measurement Invariance for Likert Data
Meade, A. W. & Lautenschlager, G. J. (2005, April). Sensitivity of DFIT Tests of Measurement Invariance for Likert Data. Paper presented at the 20 th Annual Conference of the Society for Industrial and
More informationStatistics for Social and Behavioral Sciences
Statistics for Social and Behavioral Sciences Advisors: S.E. Fienberg W.J. van der Linden For other titles published in this series, go to http://www.springer.com/series/3463 Jean-Paul Fox Bayesian Item
More informationConnexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan
Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation
More informationANNEX A5 CHANGES IN THE ADMINISTRATION AND SCALING OF PISA 2015 AND IMPLICATIONS FOR TRENDS ANALYSES
ANNEX A5 CHANGES IN THE ADMINISTRATION AND SCALING OF PISA 2015 AND IMPLICATIONS FOR TRENDS ANALYSES Comparing science, reading and mathematics performance across PISA cycles The PISA 2006, 2009, 2012
More informationA Monte Carlo Study Investigating Missing Data, Differential Item Functioning, and Effect Size
Georgia State University ScholarWorks @ Georgia State University Educational Policy Studies Dissertations Department of Educational Policy Studies 8-12-2009 A Monte Carlo Study Investigating Missing Data,
More informationFollow this and additional works at:
University of Miami Scholarly Repository Open Access Dissertations Electronic Theses and Dissertations 2013-06-06 Complex versus Simple Modeling for Differential Item functioning: When the Intraclass Correlation
More informationChapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.
Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human
More informationDuring the past century, mathematics
An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first
More informationItem-Rest Regressions, Item Response Functions, and the Relation Between Test Forms
Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Dato N. M. de Gruijter University of Leiden John H. A. L. de Jong Dutch Institute for Educational Measurement (CITO)
More informationA Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests
A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational
More informationLinking Assessments: Concept and History
Linking Assessments: Concept and History Michael J. Kolen, University of Iowa In this article, the history of linking is summarized, and current linking frameworks that have been proposed are considered.
More information