REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT. Qi Chen
REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT

By Qi Chen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods - Doctor of Philosophy

2013
ABSTRACT

REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT

By Qi Chen

IRT-based procedures using common items are widely used in test score linking. A critical assumption of these linking methods is the invariance property of IRT: item parameters remain the same on different testing occasions when they are reported on the same θ-scale. In practice, however, there are occasions when an item parameter drifts from its original value. This study investigated the impact of keeping or removing linking items that showed item parameter drift. Simulated data were generated under a modified three-parameter logistic model with a common item non-equivalent group linking design. The factors manipulated were the percentage of drifting items, the type of drift, the magnitude of drift, group achievement differences, and the choice of linking method. The effect of item drift was studied by examining the mean difference between true θs and θ-estimates, and the accuracy of the classification of examinees into proper performance categories. Results indicated that the characteristics of the drift had little impact on the performance of the fixed common item parameter (FCIP) method or Stocking and Lord's test characteristic curve (TCC-ST) method, while its influence on the performance of the concurrent method varied depending on whether the drifting items were removed from the linking. In addition, better estimation was achieved with the concurrent method when the drifting items were removed from the linking.
ACKNOWLEDGEMENTS

Many people have been supportive and helpful along the road to the completion of my dissertation, and I am grateful for their assistance, inspiration and encouragement. First of all, my deepest gratitude goes to Dr. Mark Reckase, my academic advisor and chairperson of my dissertation committee, for his scholarly guidance and constant support. He discussed the project with me, reviewed the research drafts and provided critical feedback. I have benefited greatly from his valuable insights and assistance during the writing process and throughout the whole journey of my Ph.D. study. I would like to express my sincere appreciation to the other members of my dissertation committee, Dr. Tenko Raykov, Dr. Edward Roeber, and Dr. Ann Marie Ryan, for their excellent insights, suggestions, and assistance. They made time in their busy schedules to attend the meetings and provided deep insights, constructive feedback and thoughtful suggestions. My appreciation also extends to Dr. Sharif Shakrani and Dr. Cassandra Guarino for their valuable comments and insightful suggestions on my dissertation proposal, particularly on the design and analyses of the study. I would like to thank Dr. Michael Kozlow and Dr. Xiao Pang for sharing their insights and allowing me access to the data. I feel blessed to have had the opportunity to work with them; their assistance and enthusiasm motivated me to pursue this research topic. I would also like to express my gratitude to Dr. Yong Zhao for providing me with assistantship opportunities during my graduate study. I appreciate the opportunities he has given me to participate in many research projects, which have helped my development as a professional researcher. I appreciate the great support of my friends Brad and Tinker in editing and proofreading the proposal and the drafts. I am deeply grateful and indebted to my family for their love and support. I could not have completed the journey without the encouragement and patience of my dear husband Haonan, the inspiration and love of my wonderful daughter Karen and the support of my loving parents.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter I: Introduction
  Research Background

Chapter II: Background and Objective of Study
  Common-Item Linking Design
  Linking Methods
  Item Parameter Drift

Chapter III: Methods and Research Design
  Data Generation
    Generating Parameters
    Simulation of Item Parameter Drift
    Group Achievement Differences
  Calibration and Linking Procedures
  Handling of Drifting Items
  Evaluation Criteria

Chapter IV: Results and Discussion
  Drift on Discrimination Parameter a
    Correlation between θ Estimate and True θ
    Accuracy of θ Estimates
      Bias and RMSE in Four a-drift Situations
      Effect of Percentage of Items Showing a-parameter Drift
      Effect of the Direction of a-parameter Drift
      Effect of the Linking Method
      Effect of Group Difference
      Effect of Drifted Items Handling
      Effect of Drifted Items Handling at Different θ Levels
    Accuracy of Performance Level Classification
  Drift on Difficulty Parameter b
    Correlation between θ Estimate and True θ
    Accuracy of θ Estimates
      Bias and RMSE in Eight b-drift Situations
      Effect of Percentage of Items Showing b-parameter Drift
      Effect of the Direction of b-parameter Drift
      Effect of the Magnitude of b-parameter Drift
      Effect of the Linking Method
      Effect of Group Difference
      Effect of Drifted Items Handling
      Effect of Drifted Items Handling at Different θ Levels
    Accuracy of Performance Level Classification

Chapter V: Conclusions, Implications and Future Research
  Conclusions
  Implications
  Limitations and Future Directions

APPENDIX
BIBLIOGRAPHY
LIST OF TABLES

Table 3.1 Descriptive Statistics of the Item Parameters
Table 3.2 Summary of Simulated Conditions
Table 4a.1 Average Correlation Coefficients between θ Estimates and True θs when a-parameter Drifting (with SDs in Parentheses)
Table 4a.2 Bias for θ Estimates when a-parameter Drifting
Table 4a.3 RMSE for θ Estimates when a-parameter Drifting
Table 4a.4 Change in Bias and RMSE as More Items Drifting in a-parameter
Table 4a.5 Change in Bias and RMSE with a-parameter Drifting in Different Directions
Table 4a.6 Percentage in Each Performance Level Classification (N=3, Drift=a+0.4)
Table 4a.7 Percentage in Each Performance Level Classification (N=3, Drift=a-0.4)
Table 4a.8 Percentage in Each Performance Level Classification (N=8, Drift=a+0.4)
Table 4a.9 Percentage in Each Performance Level Classification (N=8, Drift=a-0.4)
Table 4b.1 Average Correlation Coefficients between θ Estimates and True θs when b-parameter Drifting (with SDs in Parentheses)
Table 4b.2 Bias for θ Estimates when b-parameter Drifting
Table 4b.3 RMSE for θ Estimates when b-parameter Drifting
Table 4b.4 Changes in Bias and RMSE with More Items Drifting in b-parameter (3 items vs. 8 items)
Table 4b.5 Changes in Bias and RMSE with b-parameter Drifting in Different Directions (Positive Drift vs. Negative Drift)
Table 4b.6 Changes in Bias and RMSE as the Size of b-parameter Drift Increases
Table 4b.7 Percentage in Each Performance Level Classification (N=3, Drift=b+0.2)
Table 4b.8 Percentage in Each Performance Level Classification (N=3, Drift=b-0.2)
Table 4b.9 Percentage in Each Performance Level Classification (N=3, Drift=b+0.4)
Table 4b.10 Percentage in Each Performance Level Classification (N=3, Drift=b-0.4)
Table 4b.11 Percentage in Each Performance Level Classification (N=8, Drift=b+0.2)
Table 4b.12 Percentage in Each Performance Level Classification (N=8, Drift=b-0.2)
Table 4b.13 Percentage in Each Performance Level Classification (N=8, Drift=b+0.4)
Table 4b.14 Percentage in Each Performance Level Classification (N=8, Drift=b-0.4)
Table 6 Population Item Parameters Used for Simulations
LIST OF FIGURES

Figure 3.1 Design of Dataset Simulation
Figure 4a.0.1 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; All Items Included)
Figure 4a.0.2 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; All Items Included)
Figure 4a.0.3 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; All Items Included)
Figure 4a.0.4 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; Drifted Items Dropped)
Figure 4a.0.5 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; Drifted Items Dropped)
Figure 4a.0.6 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; Drifted Items Dropped)
Figure 4a.1.1 Effect of Group Difference (Concurrent; All Items Included)
Figure 4a.1.2 Effect of Group Difference (FCIP; All Items Included)
Figure 4a.1.3 Effect of Group Difference (TCC-ST; All Items Included)
Figure 4a.1.4 Effect of Group Difference (Concurrent; Drifted Items Dropped)
Figure 4a.1.5 Effect of Group Difference (FCIP; Drifted Items Dropped)
Figure 4a.1.6 Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring)
Figure 4a.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped)
Figure 4a.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2)
Figure 4a.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2)
Figure 4a.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2)
Figure 4a.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2)
Figure 4a.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2)
Figure 4a.2.6 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2)
Figure 4a.2.7 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2)
Figure 4a.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2)
Figure 4a.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2)
Figure 4a.3.1 Mean Bias at θ Intervals (3 Items a-drift +0.4)
Figure 4a.3.2 Mean Bias at θ Intervals (3 Items a-drift -0.4)
Figure 4a.3.3 Mean Bias at θ Intervals (8 Items a-drift +0.4)
Figure 4a.3.4 Mean Bias at θ Intervals (8 Items a-drift -0.4)
Figure 4b.0.1 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; All Items Included)
Figure 4b.0.2 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; Drifted Items Dropped)
Figure 4b.0.3 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; All Items Included)
Figure 4b.0.4 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; Drifted Items Dropped)
Figure 4b.0.5 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; All Items Included)
Figure 4b.0.6 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; Drifted Items Dropped)
Figure 4b.1.1 Effect of Group Difference (Concurrent; All Items Included)
Figure 4b.1.2 Effect of Group Difference (Concurrent; Drifted Items Dropped)
Figure 4b.1.3 Effect of Group Difference (FCIP; All Items Included)
Figure 4b.1.4 Effect of Group Difference (FCIP; Drifted Items Dropped)
Figure 4b.1.5 Effect of Group Difference (TCC-ST; All Items Included)
Figure 4b.1.6 Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring)
Figure 4b.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped)
Figure 4b.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2)
Figure 4b.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2)
Figure 4b.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2)
Figure 4b.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2)
Figure 4b.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2)
Figure 4b.2.6 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2)
Figure 4b.2.7 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2)
Figure 4b.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2)
Figure 4b.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2)
Figure 4b.3.1 Mean Bias at θ Intervals (3 Items b-drift +0.2)
Figure 4b.3.2 Mean Bias at θ Intervals (3 Items b-drift -0.2)
Figure 4b.3.3 Mean Bias at θ Intervals (3 Items b-drift +0.4)
Figure 4b.3.4 Mean Bias at θ Intervals (3 Items b-drift -0.4)
Figure 4b.3.5 Mean Bias at θ Intervals (8 Items b-drift +0.2)
Figure 4b.3.6 Mean Bias at θ Intervals (8 Items b-drift -0.2)
Figure 4b.3.7 Mean Bias at θ Intervals (8 Items b-drift +0.4)
Figure 4b.3.8 Mean Bias at θ Intervals (8 Items b-drift -0.4)
Chapter I: Introduction

1.1 Research Background

Scores from large-scale assessments are commonly used as indicators of student performance. Educational policy makers, administrators and educators hope to compare how students are doing from year to year. However, because the test is administered each year, different test forms need to be used to ensure test security. Practically, no test developer can guarantee the equivalency of different test forms despite vigorous efforts to ensure that equivalency. Hence, to achieve comparability, practitioners need to link the test scores.

One way to achieve comparability is through a common item linking design. There are various approaches to common item linking, some based on Item Response Theory (IRT) models. In Item Response Theory, item parameters are estimated and are assumed to be invariant under linear transformation (Lord, 1980). Several methods have been developed to place the item parameters on a common metric, including linear transformation of separate calibrations, fixed common item parameter (FCIP) calibration and concurrent calibration.

The invariance property of item response theory assumes that the item characteristic curve for a test item should be the same when estimated from data from two different populations. The linear relationship between θ-estimates and item parameter estimates implies that a difference in the scaled scores is the result of a difference in θs across groups or over time, while the item parameters remain unchanged when they are placed on the same scale. However, in practice, this assumption of invariance does not always hold. When an item in the same test functions differently for different subgroups with the same degree of proficiency, it is called differential item functioning (DIF; Holland & Wainer, 1993). When the statistical properties of the same items change on different testing occasions, it is called item parameter drift (IPD; Goldstein, 1983; Bock, Muraki, & Pfeiffenberger, 1988). Item parameter drift has the potential to negatively affect the validity of the score scale conversion.

Although there has been some research on item parameter drift, it is not as extensive as research on DIF. In practice, items flagged for IPD are often removed from the linking items in the estimation of linking coefficients (Cook & Eignor, 1991). However, since research comparing IPD detection methods has found that the effectiveness of these methods depends on the testing situation (Donoghue & Isham, 1998; DeMars, 2004), it is likely that some item parameter drift goes undetected while some items are improperly flagged as drifting.

Research has identified some possible sources of drift, such as changes in curriculum (Goldstein, 1983; Bock, Muraki, & Pfeiffenberger, 1988), context effects (Eignor, 1985), sample statistics (Cook, Eignor, & Taft, 1988), content of items (Chan, Drasgow, & Sawin, 1999), and item over-exposure (Veerkamp & Glas, 2000). These sources of item parameter drift may or may not be related to the construct being measured. Keeping items that are drifting due to construct-irrelevant factors is one source of linking error; however, removing items whose drift is closely related to the construct being measured creates another source of linking error (Miller & Fitzpatrick, 2009).

Research on the impact of IPD on θ-estimates has produced mixed results. Some studies found that IPD had little effect on θ-estimates (Wells, Subkoviak, & Serlin, 2002; Witt, Stahl, & Bergstrom, 2003; Rupp & Zumbo, 2003). On the other hand, other research has found that IPD could compound over multiple testing occasions and that the choice of linking model could have a large effect on the θ-estimates (Wollack, Sung, & Kang, 2005, 2006). In most of the research on the effect of IPD, items exhibiting IPD were removed from the linking set of items, and the test characteristic curve (TCC) method was often chosen as the linking method (Stocking & Lord, 1983). However, drifted items may remain as linking items because of ineffective detection of IPD, and when the drift is related to the construct being measured, the items should not be removed.

This study compares the effects of IPD on θ-estimates when the drifted items are either kept in or removed from the linking set. In addition, the interaction between the handling of the drifted items and the linking methods is examined. The linking methods used in this study are Stocking and Lord's test characteristic curve method (Stocking & Lord, 1983), fixed common item parameter (FCIP) calibration, and concurrent calibration (Hambleton, Swaminathan & Rogers, 1991; Kolen & Brennan, 1995).
Chapter II: Background and Objective of Study

2.1 Common-Item Linking Design

Essentially, the process of linking is to transform θ-estimates from one test to θ-estimates on the equivalent trait scale of another test (Holland & Dorans, 2006). In many large-scale testing programs, the common item non-equivalent group linking design is widely used. In this design, two or more test forms are created with a set of items in common, and these test forms are used in different test administrations (Kolen & Brennan, 1995). The item parameters obtained from the different test forms are placed on the same scale by using the common items as linking items.

2.2 Linking Methods

Item parameters obtained from different test forms need to be aligned on the same scale. There are a variety of approaches to achieve this purpose: linear procedures based on θ-estimates, fixed common item parameters, the mean-mean method, the mean-sigma method, the test characteristic curve method, and concurrent calibration (Kolen & Brennan, 1995; Yen & Fitzpatrick, 2006; Holland & Dorans, 2006). Because choosing an appropriate linking method is important to the accuracy of the linking results, there has been research comparing the merits of different linking methods. Some of these studies have investigated the strengths and weaknesses of IRT-model-based linking methods (Baker & Al-Karni, 1991; Kim & Cohen, 1998; Hanson & Beguin, 1999, 2002; Li, Griffith & Tam, 1997; Jodoin, Keller & Swaminathan, 2003; Li, Tam & Tompkins, 2004).

One kind of linking method transforms parameter estimates obtained from two separate calibrations onto a common scale through a linear scale transformation. The Stocking and Lord (1983) test characteristic curve method was found to yield more stable results than moment methods when data sets are typically troublesome to calibrate (Baker, 1991). The better performance of the Stocking-Lord method over the moment methods has been documented in the literature (Hanson & Beguin, 2002).

Another commonly used linking method is the fixed common item parameter method. The pre-calibrated item parameter estimates for the common items are fixed while calibrating the non-common items, so that the item parameter estimates for the non-common items are placed on the same scale as the fixed parameters. This method does not require the computation of scale transformation coefficients. Li, Griffith, and Tam (1997) found that both the FCIP linking method and the characteristic curve linking method provided stable θ estimates, except for students with low θ values, and that the item parameter estimates calibrated with the two methods were consistent except for the estimation of the guessing parameter under the characteristic curve method.

Concurrent calibration is also a widely used linking method. Parameters for items from multiple test forms are estimated in a single calibration run. The simulation study by Kim and Cohen (1998) found that separate and concurrent calibration provided similar results when the number of common items was large, but separate calibration provided more accurate results when the number of common items was small. Hanson and Beguin (2002) found that concurrent calibration generally yielded more accurate results, although the results were not sufficient to support a total preference for concurrent estimation.

Although there has been research comparing these IRT linking methods, there is not sufficient evidence as to which one is best. Each method has its own merits. Keller, Jodoin,
Rogers and Swaminathan (2003) compared linear, FCIP and concurrent linking procedures in detecting academic growth and found that the type of linking method used resulted in differences in mean growth and classification. Lee and Ban (2007) compared concurrent calibration, the Stocking-Lord method and fixed item parameter linking procedures in the random group linking design. They found that the relative performance of the different linking procedures varies with the measurement conditions, so no conclusion can be drawn about one preferred procedure for all occasions.

2.3 Item Parameter Drift

In Item Response Theory, the IRT estimates are assumed to be invariant up to a linear transformation (Lord, 1980). For example, under a 3PL model, the probability of a correct response to the ith item is given by

    P_i(θ) = c_i + (1 - c_i) / (1 + exp[-1.7 a_i (θ - b_i)])        (2.1)

where a_i is the item discrimination, b_i is the item difficulty and c_i is the pseudo-guessing parameter for item i. A linear transformation of the parameters will produce the same probability of a correct response. For example, let

    θ_Jk = A θ_Ik + B,        (2.2)
    a_Ji = a_Ii / A,          (2.3)
    b_Ji = A b_Ii + B,        (2.4)
    c_Ji = c_Ii               (2.5)

where A and B are constants; θ_Jk and θ_Ik are the values of θ for individual k on Scale J and Scale I; a_Ji, b_Ji and c_Ji are the item parameters for item i on Scale J; and a_Ii, b_Ii and c_Ii are the item parameters for item i on Scale I. The c-parameters do not change with the linear transformation of scale. The probability of correctly answering item i for an examinee with θ_Jk (equation 2.1) is

    c_Ji + (1 - c_Ji) / (1 + exp[-1.7 a_Ji (θ_Jk - b_Ji)]),

which equals (substituting the expressions from equations (2.2)-(2.5))

    c_Ii + (1 - c_Ii) / (1 + exp[-1.7 (a_Ii / A)((A θ_Ik + B) - (A b_Ii + B))])
    = c_Ii + (1 - c_Ii) / (1 + exp[-1.7 a_Ii (θ_Ik - b_Ii)]),

which is exactly the probability of correctly answering item i for the same examinee with θ_Ik on the alternative scale (Hambleton & Swaminathan, 1985; Kolen & Brennan, 1995).

However, in practice, item parameters do not always remain unchanged. When an item performs differently for examinees of comparable proficiency, it is defined as differential item functioning (DIF; Holland & Wainer, 1993). Goldstein (1983) developed a general framework for the change of item characteristics or parameter values over time.

Research has found a number of possible sources of item parameter drift. Goldstein (1983) suggested possible reasons such as changing curriculum content, different social demands for knowledge and skills, and so on. Bock, Muraki and Pfeiffenberger (1988) analyzed item response data from 10 years of administrations of the College Board Physics Achievement Test.
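The invariance argument in equations (2.1)-(2.5) can be checked numerically. A minimal sketch in Python (the item parameters, the examinee θ, and the constants A and B below are illustrative values, not values from this study):

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (equation 2.1)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# An item and an examinee on Scale I (illustrative values)
a_i, b_i, c_i = 1.2, 0.5, 0.2
theta_i = 0.8

# An arbitrary linear transformation to Scale J (equations 2.2-2.5)
A, B = 1.3, -0.4
theta_j = A * theta_i + B
a_j, b_j, c_j = a_i / A, A * b_i + B, c_i

p_scale_i = p3pl(theta_i, a_i, b_i, c_i)
p_scale_j = p3pl(theta_j, a_j, b_j, c_j)
assert abs(p_scale_i - p_scale_j) < 1e-12  # invariance: same probability on both scales
```

The key step is that a_j * (theta_j - b_j) = (a_i / A) * A * (theta_i - b_i) reduces to a_i * (theta_i - b_i), so the exponent, and hence the probability, is unchanged.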
They found that differential drift occurred with changes in curricular emphasis. When teachers began to focus more on basic topics in mechanics rather than advanced topics, the difficulty of the mechanics questions increased. As the English units of measurement were phased out of the physics curriculum, the slopes for items using English units and metric units changed in opposite directions. Chan, Drasgow and Sawin (1999) observed item parameter drift on the Armed Services Vocational Aptitude Battery over a 16-year period and found that the drift was related to changing demands for knowledge: tests with more semantic/knowledge content seemed to have higher rates of item drift. Eignor (1985) found that for reading tests, item drift could be explained by the location of reading passages and the position of items in the test. Sykes and Fitzpatrick (1992) tried to explain the drift in item difficulty across consecutive administrations of a professional licensure examination and found that the change in the difficulty parameter was related neither to changes in the booklet or test position of the items, nor to the item type. Cook, Eignor and Taft (1988) investigated curriculum-related achievement tests given in spring or fall and concluded that tests taken at different times during the school year might have measured different attributes. Veerkamp and Glas (2002) investigated item drift in adaptive testing due to previous exposure of the item. Giordano, Subhiyah, and Hess (2005) analyzed item exposure on take-home examinations in medicine and its influence on the difficulty of the exam.

There has also been research on how to detect item parameter drift. Researchers have used DIF procedures such as the Mantel-Haenszel procedure, Lord's chi-square measure, Kim and Cohen's (1991) closed-interval measures, Raju's (1988) exact signed- and unsigned-integral measures and
Kim, Cohen, and Park's (1995) chi-square test for multiple-group DIF; analysis of covariance models (Sykes & Fitzpatrick, 1992); restricted item response models (Stone & Lane, 1991); the cumulative sum (CUSUM) chart, a statistical quality control technique used in production processes (Veerkamp & Glas, 2002); and the procedure in BILOG-MG for estimating linear trends in item difficulty. In their study comparing procedures for detecting IPD, Donoghue and Isham (1998) found that Lord's measure was the most effective in detecting drift on the condition that the item's guessing parameter was constrained to be equal across calibrations. Their findings suggested that the effectiveness of a detection method depends on the specific testing situation. In a study by DeMars (2004), the linear drift procedure in BILOG-MG and the modified-KPC were found to be effective in identifying drift similar to that represented in this study, but these procedures also falsely identified non-drifting items.

There has been research on the effect of item parameter drift on the estimation of an examinee's θ, but the research is not extensive and the conclusions are not consistent. Wells, Subkoviak, and Serlin (2002) simulated item response data using the two-parameter logistic model for two testing occasions. The factors they manipulated included the percentage of items that exhibited IPD, the type of drift, sample size and test length. Drift was simulated by increasing the difficulty parameters by 0.4 and the discrimination parameters by 0.5. Their results suggested that item parameter drift as simulated in their study had a small effect on θ-estimates. The study also illustrated the robustness of the 2PL model despite the violation of the invariance property. Witt, Stahl, and Bergstrom (2003) investigated the effects of IPD on the
stability of test-taker θ-estimates and pass/fail status under the Rasch model. The researchers used a real, non-normal distribution of examinee θ values. Six levels of shift in the difficulty parameter were simulated. The results illustrated the robustness of the Rasch model in spite of item drift, even when the true θs were not normally distributed; θ-estimation was stable under moderate drift in item difficulties. Similarly, Rupp and Zumbo (2003) concluded from their study that IRT θ-estimates were relatively robust under moderate amounts of item parameter drift.

In Wollack, Sung and Kang's (2005) study of longitudinal item parameter drift, seven years' worth of test forms from a German placement test were linked. The results showed that the choice of linking/IPD model could have a large impact on the resulting θ-estimates and passing rates. The simulation and real-data studies by Wollack, Sung and Kang (2006) further supported this conclusion. They found that direct linking of each new form to the base form was slightly better than indirect linking. Models with TCC linking were compared with models that used the fixed parameter linking method, and the TCC linking process was found to perform better.

The inconsistency in the findings about the effect of item parameter drift indicates that further studies need to be conducted to explore the effect of IPD from the perspective of its interaction with factors such as linking procedures and the treatment of the drifting items. First, for the common-item linking design, no conclusion has been reached as to which linking procedure is most effective. Therefore, it might provide some interesting insights if the comparison of linking procedures could be combined with the study of the effect of IPD.
Second, there has been very limited research on how to handle drifting items. In practice, items flagged for IPD are often removed from the set of items linking two or more test forms (Cook & Eignor, 1991). However, that is not always a proper way of treating item parameter drift. Before removing drifting items from the linking set, the nature of the drift should be examined to see whether it is related to the construct being measured. Some possible sources of item parameter drift can be irrelevant to the construct being measured, e.g., drift due to over-exposure of an item, or a change in an item parameter because of a change in item position. If item drift occurs as a result of construct-irrelevant factors, then keeping these drifting items in the linking set is an incorrect way of handling IPD, resulting in linking errors. However, if the item parameter drift cannot be explained by construct-irrelevant factors, it is likely that the drift is related to the construct being measured. In this case, if the drifting items are dropped from the linking set, they become another source of linking error (Miller & Fitzpatrick, 2009). Moreover, in a real testing situation, the items identified as drifting are only the items flagged by one or more methods of detecting IPD, so there can also be linking errors from the false detection of IPD.

The objective of the proposed study is to investigate the effects of item parameter drift on θ-estimates when the items exhibiting drift are treated in different ways: either kept in or removed from the set of linking items. The study also explores, in the presence of item parameter drift, the performance of three commonly used linking procedures: the Stocking and Lord TCC method, the fixed common item parameter method and concurrent calibration.
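As a point of reference for the first of these procedures, the Stocking-Lord method chooses the transformation constants A and B that minimize the squared distance between the common-item test characteristic curves of the two calibrations. A minimal sketch (the item parameter values, the θ quadrature grid, and the crude grid search are illustrative assumptions; operational programs minimize the criterion with a numerical optimizer):

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def sl_loss(A, B, old_items, new_items, thetas):
    """Stocking-Lord criterion: mean squared distance between the common-item
    test characteristic curves, after placing the new-form estimates on the
    old scale via a* = a / A, b* = A * b + B."""
    total = 0.0
    for t in thetas:
        tcc_old = sum(p3pl(t, a, b, c) for a, b, c in old_items)
        tcc_new = sum(p3pl(t, a / A, A * b + B, c) for a, b, c in new_items)
        total += (tcc_old - tcc_new) ** 2
    return total / len(thetas)

# Hypothetical common-item (a, b, c) estimates from two separate calibrations
old = [(1.0, 0.0, 0.2), (1.4, 0.6, 0.2), (0.8, -0.5, 0.2)]
new = [(0.9, 0.3, 0.2), (1.3, 0.9, 0.2), (0.7, -0.2, 0.2)]
grid = [t / 10 for t in range(-30, 31)]  # θ quadrature points on [-3, 3]

# Crude grid search over candidate (A, B) pairs for illustration
A_hat, B_hat = min(
    ((a / 100, b / 100) for a in range(80, 121) for b in range(-60, 1)),
    key=lambda ab: sl_loss(ab[0], ab[1], old, new, grid),
)
```

When the two sets of estimates are already on the same scale, the loss is minimized at the identity transformation A = 1, B = 0, which is a convenient sanity check for an implementation.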
Chapter III: Methods and Research Design

3.1 Data Generation

In this study, data were generated to simulate a large-scale assessment of mathematics. To focus on the effect of item parameter drift under the 3PL model, only multiple-choice items were considered. Item response data were generated to simulate two test administrations one year apart. The test form included 30 operational items and 30 field test items. The items that appeared as field test items in one testing year became operational items in the following testing year, and they served as the common items linking the two testing occasions. Item responses were simulated for 3000 examinees taking the test each year.

The study tried to model a real testing situation in which the test form consists of operational items and field test items. The operational items in one testing year were field test items in the previous year, so the number of common items was essentially the same as the number of operational items. In a real situation, however, a matrix-sampling design is used in field testing items: the field test items are divided into subsets and placed into several booklets, and each examinee works on all the operational items plus a subset of the field test items. In this study, to minimize sampling errors, the matrix-sampling design was not simulated. Instead, item responses were generated for all examinees answering all the operational and field test items. Figure 3.1 shows the design of the simulated datasets.
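The response-generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the item parameters and the N(0, 1) examinee distribution are placeholders, not the study's actual generating values (the study drew its true parameters from a Canadian provincial assessment).

```python
import math
import random

def p3pl(theta, a, b, c=0.2, D=1.7):
    """Modified 3PL with the guessing parameter fixed at 0.2."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def simulate_responses(thetas, items, rng):
    """0/1 response matrix: one row per examinee, one column per item.
    A response is correct when a uniform draw falls below the model probability."""
    return [[int(rng.random() < p3pl(t, a, b)) for a, b in items]
            for t in thetas]

rng = random.Random(2013)
# 60 placeholder (a, b) pairs: 30 operational + 30 field test items
items = [(rng.uniform(0.5, 2.0), rng.gauss(0.0, 1.0)) for _ in range(60)]
thetas = [rng.gauss(0.0, 1.0) for _ in range(3000)]   # one year's examinee group
data = simulate_responses(thetas, items, rng)         # 3000 x 60 response matrix
```

The same routine would be run once per testing year, with the drifted parameter set substituted for the common items in the second year.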
Figure 3.1 Design of Dataset Simulation

 Item set            Year One Group           Year Two Group
 30 Unique Items     Operational, Year One    (Considered missing responses in linking)
 30 Common Items     Field test, Year One     Operational, Year Two
 Field test items    --                       Year Two (not simulated)

3.1.1 Generating Parameters

A set of 60 item parameter estimates from the 2006/2007 Canadian provincial mathematics assessment was used as true parameters for generating baseline data. A modified three-parameter logistic model, with the guessing parameter fixed at 0.2, was used in estimating the item parameters. Modifications were made to randomly selected a- or b-parameters to reflect item parameter drift, while the c-parameter was set at 0.2. Table 3.1 describes the distribution of the true item parameters, including the mean difficulty and mean slope.

Table 3.1 Descriptive Statistics of the Item Parameters

 Parameter   N   Minimum   Maximum   Mean   Std. Deviation
 a
 b
 c

3.1.2 Simulation of Item Parameter Drift

To ascertain whether the effect of item parameter drift on θ-estimates differed when more items were showing drift or when items drifted further from their original values, the number of drifting items and the level of drift were manipulated.
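The baseline generation and this drift manipulation can be sketched in a few lines. This is a minimal illustration rather than the study's actual code: the item parameters below are random placeholders, not the 2006/2007 estimates, and the logistic is used without the D = 1.7 scaling constant.

```python
import math
import random

def p_3pl(theta, a, b, c=0.2):
    # modified 3PL response probability with the guessing parameter fixed at 0.2
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def simulate_responses(thetas, items, rng):
    # one dichotomous (0/1) response per examinee-item pair
    return [[1 if rng.random() < p_3pl(t, a, b) else 0 for a, b in items]
            for t in thetas]

def apply_drift(items, n_drift, which_param, delta, rng):
    # add `delta` to the a- or b-parameter of `n_drift` randomly chosen items
    # (unidirectional drift, as in the design described here)
    drifted = list(items)
    chosen = rng.sample(range(len(items)), n_drift)
    for i in chosen:
        a, b = drifted[i]
        drifted[i] = (a + delta, b) if which_param == "a" else (a, b + delta)
    return drifted, sorted(chosen)

rng = random.Random(2013)
items = [(rng.uniform(0.5, 1.5), rng.gauss(0.0, 1.0)) for _ in range(60)]  # placeholder (a, b)
thetas = [rng.gauss(0.0, 1.0) for _ in range(3000)]                        # YEAR1: NID(0, 1)
baseline = simulate_responses(thetas, items, rng)
drifted_items, drifted_idx = apply_drift(items, 3, "b", 0.4, rng)  # e.g. 10% of 30 common items
```

Drift in the other conditions would simply change `n_drift`, `which_param`, and `delta`.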
In one condition about 10% and in another condition 25% of the items were randomly selected to exhibit item parameter drift. When 10% of the items were drifting, this represented a scenario in which the number of drifting items was not large and the drift might go undetected. When 25% of the items were drifting, the drift was not likely to be ignored, and whether to keep or remove those drifting items could have a larger effect on θ-estimates. Data were also generated under a no-drift condition to serve as a baseline. Two types of drift were simulated: drift on the discrimination parameter a and drift on the difficulty parameter b. The a-parameter drift was simulated by manually increasing or decreasing the a-parameter by 0.4. Similar magnitudes of a-drift have been used in previous research, e.g., an a-drift of 0.3 (Donoghue & Isham, 1998) or 0.5 (Wells et al., 2002). Two levels of b-parameter drift were simulated by increasing or decreasing the parameter by 0.2 or 0.4: a drift of 0.2 simulated a moderate amount of item parameter drift, while a drift of 0.4 simulated a large amount. The same magnitude of drift (0.4) was used in other studies of IPD, such as Wells et al. (2002), Donoghue and Isham (1998), and Wollack (2006). The changes in the p-values of the items under each amount of drift were also examined. To study the effects of a-drift and b-drift separately, the two types of drift were not mixed in any one condition. In addition, the simulated drift was restricted to one direction. In practice, drift is likely to go in either direction; however, mixing positive and negative drift within a test can result in cancellation of drift and give less information on the effects of IPD. Thus, under each condition studied, one parameter drifted, with the drift always increasing or always decreasing. Though this unidirectional drift design was an oversimplification of how drift
actually occurred in real testing, it represented a worst-case scenario in which the effect of drift was unlikely to be ignored.

3.1.3 Group Achievement Differences

When item parameters drift, this sometimes indicates a change in the achievement of the group taking the test, due, for example, to changes in curriculum or in policy. To investigate how drifting items would help identify changes in group achievement, data were simulated for both equivalent and non-equivalent groups. For examinees taking the test in YEAR1, item responses were generated by sampling the latent trait θ from a normal independent distribution (NID) with mean 0 and standard deviation 1, NID(0,1). For examinees taking the test in YEAR2, three sets of item responses were generated by sampling θ from an NID(0,1) distribution, an NID(0.2,1) distribution, and an NID(-0.2,1) distribution. The 0.2 shift represented a moderate increase in θ from one year to the next; similar magnitudes of achievement growth have been used in other research, with Donoghue and Isham (1998) using 0.1 and Wollack et al. (2006) using 0.15 as a yearly increase in student achievement. The group differences were designed so that item parameter drift could be examined in different situations: 1) when there was no remarkable policy change between the two testing years and the populations were assumed to be equivalent in achievement; and 2) when there was a noticeable policy implementation that might affect student learning and student achievement was expected to change. Under a modified 3PL model, with the guessing parameter fixed at 0.2, item response data
were generated for the 36 conditions above: percentage of drift (2) × type and level of drift (6) × group achievement difference (3).

3.2 Calibration and Linking Procedures

When different methods are used for linking tests, the effect of keeping or removing the drifting items on θ-estimates might differ. Three commonly used IRT linking methods were examined in the study: 1) concurrent calibration, 2) fixed common item parameter calibration, and 3) Stocking & Lord's test characteristic curve method.

Table 3.2 Summary of Simulated Conditions

 Drift:            10% (3 items) drifting or 25% (8 items) drifting; a-parameter or b-parameter
 Group Difference: Year One [NID(0,1)] and Year Two [NID(0,1)];
                   Year One [NID(0,1)] and Year Two [NID(0.2,1)];
                   Year One [NID(0,1)] and Year Two [NID(-0.2,1)]
 Linking Method:   Concurrent calibration (Concurrent);
                   Fixed Common Item Parameter calibration (FCIP);
                   Stocking & Lord's test characteristic curve method (TCC-ST)

The computer program PARSCALE 4 (Muraki & Bock, 2003) was used to calibrate the parameters. In concurrent calibration, responses from the YEAR1 and YEAR2 tests were combined in one concurrent run to estimate the parameters. When the fixed common item parameter method was used, item parameters were first estimated from a separate calibration of the YEAR2 test; the item parameters for the common items were then fixed while the YEAR1 test was calibrated. Thus, the item parameters were placed on the YEAR2 scale.
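Of the three linking methods just listed, Stocking & Lord's TCC criterion is the least transparent, so a minimal sketch may help. The operational study used the ST program; the coarse grid search, search ranges, θ quadrature points, and 3PL without the 1.7 scaling constant below are illustrative assumptions only.

```python
import math

def p3(theta, a, b, c=0.2):
    # 3PL response probability with the guessing parameter fixed at 0.2
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    # test characteristic curve: expected number-correct score on the common items
    return sum(p3(theta, a, b) for a, b in items)

def stocking_lord(old_items, new_items, grid):
    """Find slope A and intercept B minimizing the squared distance between the
    new-form TCC and the TCC of the old-form common items rescaled by
    a/A, A*b + B (grid search stands in for a gradient optimizer)."""
    best = (float("inf"), 1.0, 0.0)
    for ai in range(81):
        A = 0.6 + 0.01 * ai
        for bi in range(121):
            B = -0.6 + 0.01 * bi
            rescaled = [(a / A, A * b + B) for a, b in old_items]
            loss = sum((tcc(t, rescaled) - tcc(t, new_items)) ** 2 for t in grid)
            if loss < best[0]:
                best = (loss, A, B)
    return best[1], best[2]

# toy check: Year Two estimates of the common items that differ from Year One
# by a known scale shift (A = 1.1, B = 0.2) should be recovered
year1 = [(0.8, -1.0), (1.0, 0.0), (1.2, 0.5), (0.9, 1.0)]
year2 = [(a / 1.1, 1.1 * b + 0.2) for a, b in year1]
grid = [-3 + 0.25 * k for k in range(25)]
A, B = stocking_lord(year1, year2, grid)
```

With the recovered A and B, the Year One item and θ estimates can be placed on the Year Two scale, as described next.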
With Stocking & Lord's test characteristic curve method, item parameters were first estimated through two separate calibrations of the YEAR1 and YEAR2 test items. The linking coefficients were then obtained from the common items using the test characteristic curve method of Stocking & Lord, and with these coefficients the item parameter estimates from the YEAR1 test were placed on the scale of the YEAR2 test. For the Stocking-Lord transformation, the computer program ST (Hanson, Zeng & Cui, 2004) was used.

3.3 Handling of Drifting Items

Two ways of handling the drifting items were compared: treating them or ignoring them. Items whose parameters had been manually altered during data generation were considered drifting items. The drift was treated when the drifting items were dropped from the linking items; it was ignored when the drifting items were kept in the linking items, simulating a scenario in which the item parameter drift was either undetected or construct-relevant. To compare the two ways of handling drifting items, when item parameter drift was present, each linking method was applied twice: once with the drifting items included in the common items and once with the drifting items removed from the linking items. Stocking & Lord's method was applied a third time, with the drifting items removed from the linking items but included in the scoring.

3.4 Evaluation Criteria

Several indices were used to evaluate the effect of the treatment of item parameter drift and the choice of linking method on θ-estimates. One index was the correlation between the true θs and the θ-estimates. Bias and the root mean square error (RMSE) were used to assess the accuracy of
θ-estimates. One benefit of a simulation study is that the bias and RMSE between the estimates and the true θs can be obtained; these indices indicate the accuracy of the θ-estimates. If the bias is negative, θ is under-estimated; if positive, θ is over-estimated. The smaller the RMSE, the better the estimation method. The bias and RMSE were calculated as follows (as in Li, Tam & Tompkins, 2004):

\[ \mathrm{bias}(\theta) = \frac{\sum_{i=1}^{p} (\hat{\theta}_i - \theta_i)}{p} \tag{3.1} \]

\[ \mathrm{RMSE}(\theta) = \sqrt{\frac{\sum_{i=1}^{p} (\hat{\theta}_i - \theta_i)^2}{p}} \tag{3.2} \]

where \( \theta_i \) is the true θ, \( \hat{\theta}_i \) is the corresponding estimate, and p is the total number of examinees. In calculating the RMSE and bias, the θs that were used to generate the item response data were transformed to match the scales of the estimates. With the Stocking-Lord TCC method and the fixed common item parameter method used in this study, θ-estimates for the Year One students were placed on the scale of the Year Two estimates after linking. With the concurrent calibration linking method, θ-estimates for the Year One students were placed on the scale of the combined Year One and Year Two estimates after linking. When computing the RMSE and bias, the θ-values generating data for Year One students were transformed onto the Year Two
scale (the combined Year One and Year Two scale for the concurrent calibration linking method), and these transformed θ-values were then used as the true θ values to be compared with the θ-estimates of the Year One students. The θ values used in generating data were transformed to scaled θ-values through linear transformations (Kolen, 2006):

\[ s(x) = \frac{\sigma_s}{\sigma_x}\, x + \mu_s - \frac{\sigma_s}{\sigma_x}\,\mu_x \tag{3.3} \]

where x is the raw score (the generating θ-value), s(x) is the scale score (scaled θ-value), \( \mu_x \) and \( \sigma_x \) are the mean and standard deviation of the raw values, and \( \mu_s \) and \( \sigma_s \) are the mean and standard deviation of the scaled values.

Another index is the percentage of examinees classified into appropriate performance levels, especially examinees at or above the proficient level. Many large-scale assessments report the performance level of an examinee in addition to, or instead of, an individual score; therefore, the proportion of correct classification is one indication of the quality of the estimates. To compute this index, the θ cut score for each performance level was set following the guidelines used in the real assessment. In the 2006/2007 Canadian provincial mathematics assessment there are four performance levels, with Level 3 being the provincial target level; examinees in Level 3 or 4 are considered to meet or surpass the provincial target. After the θ-estimates from the previous year were placed on the scale of the current testing year, the percentage of students in each level in the previous year was used to find the θ thresholds for the performance levels in the current year. For this simulation study, the cumulative percentages 17.5%, 57.3%, and 94.5% were used to find the θ cuts for each level. Performance level, classified
according to the examinee's true θ, was considered his or her true performance level, while the level classified through the estimated θ was the estimated performance level. To examine the different ways of treating drifting items, the proportion of correct classification at each level was compared, with special attention paid to the pass/fail status at the provincial target level, as the percentage of students at or above this level is an important index of how schools are progressing.
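The evaluation indices of Section 3.4 are straightforward to compute. A minimal sketch: the first three functions follow Equations 3.1-3.3 directly, while the quantile convention in `theta_cuts` is an assumption, since the study's exact rounding rule for the 17.5/57.3/94.5 cumulative percentages is not stated.

```python
import math

def bias(estimates, true_thetas):
    # Equation 3.1: mean signed difference; negative values mean under-estimation
    return sum(e - t for e, t in zip(estimates, true_thetas)) / len(true_thetas)

def rmse(estimates, true_thetas):
    # Equation 3.2: root mean squared error of the theta estimates
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, true_thetas))
                     / len(true_thetas))

def to_scale(x, mu_x, sd_x, mu_s, sd_s):
    # Equation 3.3: linear transformation of a generating theta onto the
    # scale of the estimates (Kolen, 2006)
    return (sd_s / sd_x) * x + mu_s - (sd_s / sd_x) * mu_x

def theta_cuts(thetas, cum_pcts=(17.5, 57.3, 94.5)):
    # theta thresholds below which the given cumulative percentages of the
    # reference-year examinees fall (quantile convention is an assumption)
    ordered = sorted(thetas)
    n = len(ordered)
    return [ordered[min(n - 1, int(p / 100.0 * n))] for p in cum_pcts]

def classify(theta, cuts):
    # performance level 1..4: one level higher for each cut at or below theta
    return 1 + sum(1 for c in cuts if theta >= c)

def agreement(true_thetas, est_thetas, cuts):
    # proportion of examinees placed in the same level by true and estimated theta
    same = sum(classify(t, cuts) == classify(e, cuts)
               for t, e in zip(true_thetas, est_thetas))
    return same / len(true_thetas)
```

With these, each condition's results reduce to a bias value, an RMSE value, and the classification agreement between true and estimated levels.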
Chapter IV: Results and Discussion

The results are presented and discussed for the different types of parameter drift simulated in this study. The manipulated conditions included the number of drifted items, the level of drift, and the direction of drift. The effects of item parameter drift on θ-estimates are compared when the drifted items are handled differently: the first part focuses on the effect of a-parameter drift on θ-estimates, and the second part examines the effect of b-parameter drift. In all the tables and figures presented in the results, Group Difference refers to three types of group ability change: 1) examinees taking the exams in both years were equivalent groups in ability ("Year One (0,1), Year Two (0,1)"); 2) examinees were non-equivalent in ability, with higher ability in the following year ("Year One (0,1), Year Two (0.2,1)"); and 3) examinees were non-equivalent in ability, with lower ability in the following year ("Year One (0,1), Year Two (-0.2,1)"). Linking Method refers to the three methods used to link the Year One and Year Two scores: 1) concurrent calibration ("Concurrent"), 2) the fixed common item parameter method ("FCIP"), and 3) Stocking & Lord's test characteristic curve method ("TCC-ST"). Drifted Item Handling refers to the way item parameter drift was treated: 1) keeping the drifted items in both calibration and scoring ("keep/keep"), 2) dropping the items in both calibration and scoring ("drop/drop"), and 3) dropping the items in calibration while keeping them in scoring ("drop/keep"), which was applied only with the TCC-ST method.

4.1 Drift on Discriminating Parameter a

The effect of different ways of handling the drifted items was studied in four kinds of a-parameter drift: 1) three items drifting with the a-parameter increasing by 0.4; 2) three items
drifting with the a-parameter decreasing by 0.4; 3) eight items drifting with the a-parameter increasing by 0.4; and 4) eight items drifting with the a-parameter decreasing by 0.4.

4.1.1 Correlation between θ Estimates and True θs

Table 4a.1 lists the correlation coefficients between the θ estimates and the true θs when the a-parameter drifts. The correlations are high, ranging from 0.912 to 0.927, indicating that the θ estimates have a very strong, positive association with the true θs. Compared with the no-drift baseline condition, where the average correlation is 0.927, the strong positive relationship between the θ estimates and the true θs is consistent across the four conditions of a-parameter drift. The correlations tend to drop slightly when more items are showing drift and when the drifted items are dropped from the linking and scoring, but the drop is quite small. These consistently high correlations are good indications that the different ways of handling the drifting items have a negligible effect on the relationship between the θ estimates and the true θs, regardless of the group abilities and linking methods.
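The correlation index in Table 4a.1 is, presumably, the ordinary Pearson product-moment correlation between the θ estimates and the true θs; for reference, a minimal pure-Python version:

```python
import math

def pearson_r(x, y):
    # product-moment correlation between theta estimates and true thetas
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

In the simulation, `x` would be a vector of estimated θs and `y` the corresponding true (generating) θs for one replication of a condition.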
Table 4a.1 Average Correlation Coefficients between θ Estimates and True θs when a-Parameter Drifting (with SDs in Parentheses)

Year One (0,1), Year Two (0,1)
 Linking     Handling       No drift       3 items        3 items        8 items        8 items
 Method      (link/score)                  a-drift +0.4   a-drift -0.4   a-drift +0.4   a-drift -0.4
 Concurrent  keep/keep      0.927(0.002)   0.927(0.002)   0.926(0.002)   0.926(0.002)   0.926(0.002)
 Concurrent  drop/drop      --             0.924(0.002)   0.925(0.002)   0.924(0.002)   0.924(0.002)
 FCIP        keep/keep      0.927(0.002)   0.926(0.002)   0.925(0.002)   0.925(0.002)   0.923(0.003)
 FCIP        drop/drop      --             0.923(0.002)   0.922(0.002)   0.919(0.002)   0.912(0.002)
 TCC-ST      keep/keep      0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/keep      --             0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/drop      --             0.922(0.005)   0.922(0.002)   0.919(0.002)   0.913(0.002)

Year One (0,1), Year Two (0.2,1)
 Concurrent  keep/keep      0.927(0.002)   0.927(0.002)   0.926(0.002)   0.927(0.002)   0.926(0.002)
 Concurrent  drop/drop      --             0.923(0.002)   0.925(0.002)   0.923(0.002)   0.923(0.002)
 FCIP        keep/keep      0.927(0.002)   0.926(0.002)   0.925(0.002)   0.925(0.002)   0.923(0.003)
 FCIP        drop/drop      --             0.923(0.002)   0.922(0.002)   0.919(0.002)   0.912(0.002)
 TCC-ST      keep/keep      0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/keep      --             0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/drop      --             0.924(0.002)   0.922(0.002)   0.919(0.002)   0.913(0.002)

Year One (0,1), Year Two (-0.2,1)
 Concurrent  keep/keep      0.927(0.002)   0.927(0.002)   0.926(0.002)   0.927(0.002)   0.926(0.002)
 Concurrent  drop/drop      --             0.924(0.002)   0.926(0.002)   0.924(0.002)   0.924(0.002)
 FCIP        keep/keep      0.927(0.002)   0.926(0.002)   0.925(0.002)   0.925(0.002)   0.923(0.002)
 FCIP        drop/drop      --             0.922(0.005)   0.922(0.003)   0.919(0.002)   0.912(0.002)
 TCC-ST      keep/keep      0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/keep      --             0.927(0.002)   0.927(0.002)   0.927(0.002)   0.927(0.002)
 TCC-ST      drop/drop      --             0.924(0.002)   0.922(0.002)   0.919(0.002)   0.913(0.002)
4.1.2 Accuracy of θ Estimates

The correlation results (Table 4a.1) showed a strong relationship between the θ estimates and the true θs, and this relationship was not affected by the type of a-drift, the group difference, the linking method, or the way of handling the drifted items. These factors may, however, affect the accuracy of θ estimation. To examine this further, the bias and RMSE between the θ estimates and the true θs were calculated; Tables 4a.2 and 4a.3 give the bias and RMSE values for the θ estimates when a-parameter drift is present.

Bias and RMSE in Four a-Drift Situations

Four situations of a-drift were examined in the study: 3 items (10%) showing an increase of 0.4 in the a-parameter, 3 items (10%) showing a decrease of 0.4 in the a-parameter, 8 items (25%) showing the same +0.4 increase, and 8 items (25%) showing the same -0.4 decrease. When 10% of the items showed a 0.4 increase in the a-parameter, most of the bias values were negative, indicating that θ was under-estimated, except when datasets of non-equal groups were linked by concurrent calibration with all items included. The under-estimation was most obvious when concurrent calibration was applied with the drifting items dropped. The RMSE was relatively smaller when concurrent calibration was used with all the linking items included; the largest RMSE, however, occurred when concurrent calibration was used with the drifting items dropped. When the same drift occurred in 25% of the items, bias values were again negative for most of the datasets, except when the TCC-ST method with all linking items included was applied to equivalent groups.
More informationScaling TOWES and Linking to IALS
Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy
More informationITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE
California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION
More informationStandard Errors of Correlations Adjusted for Incidental Selection
Standard Errors of Correlations Adjusted for Incidental Selection Nancy L. Allen Educational Testing Service Stephen B. Dunbar University of Iowa The standard error of correlations that have been adjusted
More informationScoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods
James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical
More informationEffects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education
Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores Shudong Wang NWEA Liru Zhang Delaware Department of Education Paper to be presented at the annual meeting of the National
More informationDifferential Item Functioning Amplification and Cancellation in a Reading Test
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to
More informationThe Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing
The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in
More informationDetecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker
Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis Russell W. Smith Susan L. Davis-Becker Alpine Testing Solutions Paper presented at the annual conference of the National
More informationRasch Versus Birnbaum: New Arguments in an Old Debate
White Paper Rasch Versus Birnbaum: by John Richard Bergan, Ph.D. ATI TM 6700 E. Speedway Boulevard Tucson, Arizona 85710 Phone: 520.323.9033 Fax: 520.323.9139 Copyright 2013. All rights reserved. Galileo
More informationIDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS
IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements
More informationNonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia
Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla
More informationModeling the Effect of Differential Motivation on Linking Educational Tests
Modeling the Effect of Differential Motivation on Linking Educational Tests Marie-Anne Keizer-Mittelhaëuser MODELING THE EFFECT OF DIFFERENTIAL MOTIVATION ON LINKING EDUCATIONAL TESTS PROEFSCHRIFT TER
More informationMultilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison
Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting
More informationComparing DIF methods for data with dual dependency
DOI 10.1186/s40536-016-0033-3 METHODOLOGY Open Access Comparing DIF methods for data with dual dependency Ying Jin 1* and Minsoo Kang 2 *Correspondence: ying.jin@mtsu.edu 1 Department of Psychology, Middle
More informationTECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock
1 TECHNICAL REPORT The Added Value of Multidimensional IRT Models Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock Center for Health Statistics, University of Illinois at Chicago Corresponding
More informationPOLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS
POLYTOMOUS IRT OR TESTLET MODEL: AN EVALUATION OF SCORING MODELS IN SMALL TESTLET SIZE SITUATIONS By OU ZHANG A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
More informationDesigning small-scale tests: A simulation study of parameter recovery with the 1-PL
Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy
More informationRunning head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note
Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,
More informationAn Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek.
An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests Xuan Tan Sooyeon Kim Insu Paek Bihua Xiang ETS, Princeton, NJ Paper presented at the annual meeting of the
More informationDetection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models
Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Jin Gong University of Iowa June, 2012 1 Background The Medical Council of
More informationAn Introduction to Missing Data in the Context of Differential Item Functioning
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute
More informationThe Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland
Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, April 23-25, 2003 The Classification Accuracy of Measurement Decision Theory Lawrence Rudner University
More informationDescription of components in tailored testing
Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of
More informationDetermining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory
Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Teodora M. Salubayba St. Scholastica s College-Manila dory41@yahoo.com Abstract Mathematics word-problem
More informationDifferential Item Functioning from a Compensatory-Noncompensatory Perspective
Differential Item Functioning from a Compensatory-Noncompensatory Perspective Terry Ackerman, Bruce McCollaum, Gilbert Ngerano University of North Carolina at Greensboro Motivation for my Presentation
More informationA Bayesian Nonparametric Model Fit statistic of Item Response Models
A Bayesian Nonparametric Model Fit statistic of Item Response Models Purpose As more and more states move to use the computer adaptive test for their assessments, item response theory (IRT) has been widely
More informationKnown-Groups Validity 2017 FSSE Measurement Invariance
Known-Groups Validity 2017 FSSE Measurement Invariance A key assumption of any latent measure (any questionnaire trying to assess an unobservable construct) is that it functions equally across all different
More informationDifferential item functioning procedures for polytomous items when examinee sample sizes are small
University of Iowa Iowa Research Online Theses and Dissertations Spring 2011 Differential item functioning procedures for polytomous items when examinee sample sizes are small Scott William Wood University
More informationUsing the Score-based Testlet Method to Handle Local Item Dependence
Using the Score-based Testlet Method to Handle Local Item Dependence Author: Wei Tao Persistent link: http://hdl.handle.net/2345/1363 This work is posted on escholarship@bc, Boston College University Libraries.
More informationA Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho
ADAPTIVE TESTLETS 1 Running head: ADAPTIVE TESTLETS A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing Leslie Keng Pearson Tsung-Han Ho The University of Texas at Austin
More informationEffect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A.
Measurement and Research Department Reports 2001-2 Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A. Hanson Measurement
More informationCopyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and
Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere
More informationComprehensive Statistical Analysis of a Mathematics Placement Test
Comprehensive Statistical Analysis of a Mathematics Placement Test Robert J. Hall Department of Educational Psychology Texas A&M University, USA (bobhall@tamu.edu) Eunju Jung Department of Educational
More informationA comparability analysis of the National Nurse Aide Assessment Program
University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 2006 A comparability analysis of the National Nurse Aide Assessment Program Peggy K. Jones University of South
More informationParameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX
Paper 1766-2014 Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX ABSTRACT Chunhua Cao, Yan Wang, Yi-Hsin Chen, Isaac Y. Li University
More informationResearch Report No Using DIF Dissection Method to Assess Effects of Item Deletion
Research Report No. 2005-10 Using DIF Dissection Method to Assess Effects of Item Deletion Yanling Zhang, Neil J. Dorans, and Joy L. Matthews-López www.collegeboard.com College Board Research Report No.
More informationMEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS
MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS The purpose of this study was to create an instrument that measures middle grades
More informationSensitivity of DFIT Tests of Measurement Invariance for Likert Data
Meade, A. W. & Lautenschlager, G. J. (2005, April). Sensitivity of DFIT Tests of Measurement Invariance for Likert Data. Paper presented at the 20 th Annual Conference of the Society for Industrial and
More informationStatistics for Social and Behavioral Sciences
Statistics for Social and Behavioral Sciences Advisors: S.E. Fienberg W.J. van der Linden For other titles published in this series, go to http://www.springer.com/series/3463 Jean-Paul Fox Bayesian Item
More informationConnexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan
Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation
More informationANNEX A5 CHANGES IN THE ADMINISTRATION AND SCALING OF PISA 2015 AND IMPLICATIONS FOR TRENDS ANALYSES
ANNEX A5 CHANGES IN THE ADMINISTRATION AND SCALING OF PISA 2015 AND IMPLICATIONS FOR TRENDS ANALYSES Comparing science, reading and mathematics performance across PISA cycles The PISA 2006, 2009, 2012
More informationA Monte Carlo Study Investigating Missing Data, Differential Item Functioning, and Effect Size
Georgia State University ScholarWorks @ Georgia State University Educational Policy Studies Dissertations Department of Educational Policy Studies 8-12-2009 A Monte Carlo Study Investigating Missing Data,
More informationFollow this and additional works at:
University of Miami Scholarly Repository Open Access Dissertations Electronic Theses and Dissertations 2013-06-06 Complex versus Simple Modeling for Differential Item functioning: When the Intraclass Correlation
More informationChapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.
Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human
More informationDuring the past century, mathematics
An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first
More informationItem-Rest Regressions, Item Response Functions, and the Relation Between Test Forms
Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Dato N. M. de Gruijter University of Leiden John H. A. L. de Jong Dutch Institute for Educational Measurement (CITO)
More informationA Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests
A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests David Shin Pearson Educational Measurement May 007 rr0701 Using assessment and research to promote learning Pearson Educational
More informationLinking Assessments: Concept and History
Linking Assessments: Concept and History Michael J. Kolen, University of Iowa In this article, the history of linking is summarized, and current linking frameworks that have been proposed are considered.
More information