A Comparison of Four Test Equating Methods


Report Prepared for the Education Quality and Accountability Office (EQAO) by

Xiao Pang, Ph.D., Psychometrician, EQAO
Ebby Madera, Ph.D., Psychometrician, EQAO
Nizam Radwan, Ph.D., Psychometrician, EQAO
Su Zhang, Ph.D., Psychometrician, EQAO

APRIL 2010

About the Education Quality and Accountability Office

The Education Quality and Accountability Office (EQAO) is an independent provincial agency funded by the Government of Ontario. EQAO's mandate is to conduct province-wide tests at key points in every student's primary, junior and secondary education and report the results to educators, parents and the public. EQAO acts as a catalyst for increasing the success of Ontario students by measuring their achievement in reading, writing and mathematics in relation to Ontario Curriculum expectations. The resulting data provide a gauge of quality and accountability in the Ontario education system. The objective and reliable assessment results are evidence that adds to current knowledge about student learning and serves as an important tool for improvement at all levels: for individual students, schools, boards and the province.

About EQAO Research

EQAO undertakes research for two main purposes: to maintain best-of-class practices and ensure that the agency remains at the forefront of large-scale assessment, and to promote the use of EQAO data for improved student achievement through the investigation of means to inform policy directions and decisions made by educators, parents and the government. EQAO research projects delve into the factors that influence student achievement and education quality, and examine the statistical and psychometric processes that result in high-quality assessment data.

Education Quality and Accountability Office, 2 Carlton Street, Suite 1200, Toronto ON M5B 2M9. © Queen's Printer for Ontario

Acknowledgements

This research was conducted under the direction of Michael Kozlow and the EQAO scholars in residence, Todd Rogers and Mark Reckase, who provided guidance on the development of the proposal and the conduct of the study. They provided extensive and valuable advice on the research procedures, input at different stages of the analysis and review, and editorial comments on the final report. Qi Chen provided academic and technical assistance that expedited the analysis. Yunmei Xu provided timely assistance in completing the analysis. The authors are grateful to them for the significant contributions they made to improve the academic quality of this research.

Abstract

This research evaluated the effectiveness of identifying students' real gains through the application of four commonly used equating methods: concurrent calibration (CC) equating, fixed common item parameter (FCIP) equating, Stocking and Lord test characteristic curve (TCC) equating, and mean/sigma (M/S) equating. The performance of the four procedures was evaluated using simulated data for a test design with multiple item formats. Five gain conditions (-0.3, -0.1, 0.0, 0.1 and 0.3 on the θ-scale) were built into the simulation to mimic the Ontario Secondary School Literacy Test (OSSLT), the Test provincial de compétences linguistiques (TPCL), the Assessments of Reading, Writing and Mathematics, Primary and Junior Divisions, and the applied version of the English Grade 9 Assessment of Mathematics. Twenty replications were conducted. The estimated percentages at multiple achievement levels and in the successful and unsuccessful categories were compared with the respective true percentages obtained from the known θ-distributions. The results across seven assessments showed that the FCIP, TCC and M/S equating procedures, which are based on separate calibrations, performed equally well and much better than the CC procedure.

Introduction

One of the goals of the Education Quality and Accountability Office (EQAO) is to provide evidence concerning changes in student achievement from year to year in the province of Ontario.¹ Yearly assessments in both English and French are conducted at the primary (Grade 3) and junior (Grade 6) levels (reading, writing and mathematics) and in Grade 9 (academic and applied mathematics). The results for these assessments are reported in terms of the percentage of students at each of five achievement levels (Not Enough Evidence for Level 1 [NE1] or Below Level 1, and Levels 1, 2, 3 and 4). The provincial standard for acceptable performance is Level 3. In addition to these assessments, EQAO is responsible for two literacy tests: the Ontario Secondary School Literacy Test (OSSLT) in English and the Test provincial de compétences linguistiques (TPCL) in French, either of which is a required credential for graduation from high school.²

¹ EQAO is an arm's-length agency of the Ontario Ministry of Education that administers large-scale provincial assessments.
² Students who are unsuccessful on the OSSLT may take it again the next year or enrol in the Ontario Secondary School Literacy Course.

When reporting evidence of change in performance between two years, it is important to distinguish between differences in the difficulty of the test forms used to assess the students and real gains or losses in achievement between the two years. The purpose of equating is to adjust for differences in test difficulty so that only real differences in performance are reported. There are, however, different procedures for equating tests, some based on classical test score theory (CTST) and others on item response theory (IRT). Some research has shown that equating based on CTST and IRT provides similar results for horizontal equating. For example, Hills, Subhiyah and Hirsch (1988) found similar results with linear equating, concurrent calibration (CC) using the Rasch model and the three-parameter IRT model, and separate calibration using the three-parameter IRT model with fixed common item parameter (FCIP) equating and mean/sigma (M/S) equating (Marco, 1977). However, Kolen and Brennan (1995) pointed out that since many large assessment programs use IRT models to develop and calibrate tests, the use

of IRT-based equating methods is often the logical choice. Therefore, since EQAO uses procedures based on IRT to calibrate and equate the items in each of its assessments, the equating methods considered in the present study were restricted to IRT-based methods. The most commonly used IRT equating procedures are the CC procedure (Wingersky & Lord, 1984), which is based on a concurrent calibration of a sample consisting of the students assessed in each of the two years to be equated; the FCIP procedure; the test characteristic curve (TCC) procedure (Stocking & Lord, 1983) and the M/S procedure (Marco, 1977). The FCIP, TCC and M/S procedures are based on separate calibrations of the two samples. Unfortunately, these procedures do not always yield the same results. Therefore, understanding the behaviour of different equating methods is critical to ensuring that the interpretation of estimates of change is valid. EQAO currently uses separate IRT calibration and the FCIP equating procedure. However, no research has examined the effectiveness of this approach in recovering gains or differences between two years for the EQAO assessments, or whether one of the other IRT equating methods might better recover such changes.

Purpose of the Study

The purpose of the present study is to assess the effectiveness of the four equating procedures identified above (CC, FCIP, TCC and M/S) in identifying real changes in student performance across years. Specifically, the four procedures were compared in terms of how accurately the results they yielded represented known changes in the percentages of students at each achievement level for the primary (Grade 3), junior (Grade 6) and Grade 9 assessments and in the two achievement categories (successful and unsuccessful) for the OSSLT and TPCL.

Review of Equating Methods

When the common-item nonequivalent group design and IRT-based equating methods are used, one of two approaches can be taken: concurrent or separate calibration. With the concurrent calibration and equating approach (Wingersky & Lord, 1984), the students' responses from the two tests to be equated are combined into one data file

through the alignment of the common items. The tests are then calibrated simultaneously; as a result, the parameter estimates for the items in both tests are placed on a common scale. The students' ability scores for the two tests are estimated separately using the corresponding scaled item parameters, and the means for the two tests are then compared to determine the direction and magnitude of the change. Theoretically, CC is expected to yield more stable results than the separate-calibration methods that employ transformations, and it is also expected to minimize the impact of sampling fluctuations on the estimation of the pseudo-guessing parameter because of the increased number of low-ability examinees in the combined sample.

With separate calibration, the two tests are calibrated separately and the common items are used to put them on a common scale. The test used to set the common scale is referred to as the reference test, and the second test is referred to as the equated test. A linear transformation based on the items common to the two tests can then be used to place the item parameters from the equated test on the scale of the reference test. Equating procedures that use a linear transformation include the mean/mean (M/M) approach (Loyd & Hoover, 1980), the M/S method (Marco, 1977) and the TCC approach (Li, Lissitz & Yang, 1999; Stocking & Lord, 1983). While it is theoretically correct to use the M/M or M/S procedure, these methods use the common-item parameter estimates separately, moment by moment, to estimate the equating coefficients. In contrast, the TCC method estimates the coefficients simultaneously from the whole test characteristic curve and thus takes better account of the information provided (Li et al., 1999).

FCIP is an alternative two-step calibration and equating method. In it, the reference test is calibrated first. When the equated test is calibrated, the parameters of its common items are fixed at the estimates obtained from the calibration of the reference test. As a result, the equated test's score distribution is placed on the reference-test scale (for a technically detailed description of FCIP, refer to Kim, 2006). The FCIP procedure is expected to produce results superior to those produced by the M/M, M/S and TCC procedures because it avoids potentially incorrect transformation functions.
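To make these transformations concrete, the sketch below shows how the M/S coefficients can be computed from common-item difficulty estimates and how the Stocking and Lord TCC criterion can be minimized numerically. This is a minimal illustration under an assumed 2PL parameterization with hypothetical parameter values, not EQAO's operational code.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical common-item estimates from two separate calibrations
# (a = discrimination, b = difficulty; 2PL for simplicity).
a_ref, b_ref = np.array([1.1, 0.8, 1.3]), np.array([-0.5, 0.2, 1.0])   # reference test
a_eq,  b_eq  = np.array([1.0, 0.9, 1.2]), np.array([-0.8, -0.1, 0.7])  # equated test

# Mean/sigma (M/S): choose A, B so the transformed difficulties A*b + B
# have the same mean and SD as the reference difficulties.
A_ms = b_ref.std(ddof=0) / b_eq.std(ddof=0)
B_ms = b_ref.mean() - A_ms * b_eq.mean()

def tcc(theta, a, b):
    """Expected score on the common items at each ability in theta (2PL)."""
    return (1.0 / (1.0 + np.exp(-1.7 * a * (theta[:, None] - b)))).sum(axis=1)

# Stocking-Lord TCC: choose A, B minimizing the squared difference between
# the reference TCC and the transformed equated TCC over a theta grid.
theta_grid = np.linspace(-4, 4, 41)

def sl_loss(coef):
    A, B = coef
    return np.sum((tcc(theta_grid, a_ref, b_ref)
                   - tcc(theta_grid, a_eq / A, A * b_eq + B)) ** 2)

A_sl, B_sl = minimize(sl_loss, x0=[A_ms, B_ms]).x
print(f"M/S: A={A_ms:.3f}, B={B_ms:.3f}   TCC: A={A_sl:.3f}, B={B_sl:.3f}")
```

Applying the resulting transformation (b* = A·b + B, a* = a/A) to every item of the equated test places its parameters, and hence its score distribution, on the reference scale.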

While some research has been conducted to evaluate different IRT equating approaches (Hanson & Béguin, 2002; Hills, Subhiyah & Hirsch, 1988; Kim & Cohen, 1998; Kolen & Brennan, 2004; Petersen, Cook & Stocking, 1983; Prowker & Camilli, 2006; Bishop, Shari, Lei & Domaleski, 2006; Hu, Rogers & Vukmirovic, 2008; Kim, 2006; Lee & Ban, 2010; Wingersky, Cook & Eignor, 1987), only a limited number of studies have compared the behaviour of the concurrent and separate approaches.

Petersen et al. (1983) compared the CTST linear procedure, the 1PL and 3PL IRT concurrent methods and the 3PL linear transformation method. They found that the different methods produced similar results when the tests to be equated were parallel and the groups in the two years were equivalent. Wingersky et al. (1987) investigated the effects of the characteristics of the linking items on IRT true-score equating results and concluded that the TCC procedure was affected by the presence of linking items that function differently for the two groups used to provide the data. Hills et al. (1988) compared CTST linear equating, the IRT 1PL and 3PL concurrent methods, the 3PL FCIP method and the 3PL linear transformation method when the tests were parallel and the groups were equivalent; they found that the different methods produced similar results. When they equated six test forms using a single set of anchor items, Li, Griffith and Tam (1997) found that the FCIP and TCC approaches produced comparable equated ability estimates, except in the case of students with extreme ability under the TCC equating method. Lee and Ban (2010) compared four IRT equating procedures (CC, TCC, Haebara and proficiency transformation) and found that the separate-calibration procedures performed better than the CC and proficiency transformation procedures. Kim and Cohen (1998) compared the TCC and concurrent procedures under two estimation procedures (marginal maximum likelihood and marginal maximum a posteriori) using multiple-choice items. They found that the two approaches produced similar results except when the number of common items was small, in which case the separate approach provided more accurate results. Linn et al. (1980) compared the TCC, M/M, M/S and weighted M/S equating procedures. Their results indicated that the differences in equating coefficients among these methods were small across different sample sizes and numbers of common items. Hanson and Béguin (2002) compared the CC, M/M, M/S, TCC and Haebara TCC procedures using computer simulation. They found that, overall, the CC procedure

resulted in smaller bias and less random error across replications than the separate-calibration and transformation procedures. Keller et al. (2004) evaluated the ability of four equating methods (CC, M/S, TCC and FCIP) to recover changes in the examinee ability distribution using simulated data based on a standard normal ability distribution. They found that M/S performed the best while FCIP performed the worst. Hu et al. (2008) conducted a simulation study to investigate ten variations of four equating methods (CC, M/S, TCC and FCIP) in the absence and presence of outliers in the set of common items. They concluded that "TCC and M/S transformations performed the best... The CC and FCIP calibrations had a complex interaction with group equivalence and number/score-points of outliers" (p. 311). When there were no outliers in the set of common items, they found that the four methods were sensitive, but not equally so, to the presence of nonequivalent groups: when there was no difference between the mean abilities of the two groups, the four procedures were equivalent in bias and random error, but when the mean abilities differed by one standard deviation, the M/S and TCC procedures produced small bias and small random error, FCIP had moderate bias and random error, and CC yielded the largest bias and the largest random error.

The results of the studies reviewed were inconsistent. Further, no comprehensive research has compared the four equating methods using a) data that consist of dichotomously and polytomously scored items, b) data that are not necessarily normally distributed, c) nonequivalent groups and d) assessments of multiple subjects. As indicated earlier, the EQAO assessments consist of a combination of dichotomously and polytomously scored items, the distributions of scores are not normal in shape and the populations from year to year are not equivalent. Currently, separate calibration followed by FCIP equating is used to link two consecutive years of each assessment. The CC and FCIP procedures were previously compared using two consecutive years of the primary, junior and Grade 9 EQAO assessments, and some differences in the change in the percentages of students classified into the four achievement levels (Levels 1, 2, 3 and 4) in adjacent years were observed for the two equating procedures. Consequently,

the intent of the present study is to examine comprehensively the degree to which these two equating procedures recover true change in the percentage of students in each achievement category between two consecutive years. At the same time, the TCC and M/S approaches were examined to determine which of the four procedures best recovers the true change.

Method

This study addressed the question of how well equated IRT proficiency estimates yielded by the CC, M/S, TCC and FCIP equating methods recover true change in student performance between two consecutive school years, herein referred to as Year 1 and Year 2, respectively. In real test situations, it is impossible to know the true changes, because students' true ability is not known. When the required information cannot otherwise be reasonably obtained analytically, simulation studies should be conducted (Psychometric Society, 1979; Lehman & Bailey, 1968). Therefore, computer simulation procedures were employed in the current study, in which estimated changes in student performance were compared to known true changes.

Data Simulations

Equating Design. To reflect realistic cases, simulated data were generated to mimic EQAO's common-item non-equivalent group matrix design. With this design, sets of different field-test items are placed in multiple versions of the operational tests for Year 1 using a matrix. The field-test items that have good psychometric properties are then used as the Year 2 operational items and serve as the linking items between the Year 1 and Year 2 assessments. A visual display of this equating design is presented in Figure 1. The upper left block contains the operational form for Year 1, and the upper right block contains different sets of embedded field-test items (M_1, M_2 and M_3). The field-test items that match the test blueprints, possess good psychometric characteristics and produce the desired test information function are brought forward to construct the operational test form for Year 2, shown in the lower right block. Consequently, all the operational items for Year 2 have been previously field tested, which results in a much stronger link between the two tests than those established by other equating designs, in which a limited number of common items is used to equate forms (commonly 20

items or at least 20% of the total number of items) (Angoff, 1984; Kolen & Brennan, 2004).³

[Figure 1. Equating design for EQAO assessments: the Year 1 test form (operational items plus matrix field-test items M_1, M_2 and M_3) and the Year 2 test form built from the field-tested items.]

There are additional advantages to this design. More items can be field tested within the normal test administration period. Further, since the field-test items are embedded among the Year 1 operational items and the students do not know which items are being field tested and which are operational, the students are equally motivated on all items. The effect of fatigue, which would influence the responses to field-test items if they were placed at the end of the test, is avoided. Finally, the risk to test security is greatly reduced. These advantages make the common-item non-equivalent group matrix design, which has been employed in many large-scale assessment programs, appealing (Hu et al., 2008).

Sample Size. Since the population of students for each assessment and the number of matrix field-test booklets vary, the sample size for each matrix booklet also varies. Generally, the English primary and junior assessments and the OSSLT have the largest matrix sample sizes (e.g., n ≈ 5000 for the OSSLT), while the French primary and junior assessments and the TPCL have the smallest (e.g., n ≈ 526 for the TPCL). English Grade 9 applied mathematics has medium matrix sample sizes (n ≈ 1500). To determine the possible effect of subject area on the recovery rates, different tests and subtests were selected for the simulation: English primary writing, French primary writing, English junior reading, French junior mathematics, English Grade 9 applied mathematics, and the OSSLT and TPCL.

³ The one exception is the long-writing items, which require too much time to be included as field-test items in an operational form.

Test Characteristics. The psychometric characteristics of the tests and subtests used in this study are presented in Table 1. As shown, the assessments have different numbers of multiple-choice and open-response items and different numbers of score points for the open-response items. For example, for the OSSLT and TPCL, the score categories range from three to 10 for the open-response reading, short-writing and long-writing items. For English junior reading and English Grade 9 applied mathematics, the number of items was reduced in Year 2; however, the basic test structure and the content to be measured were similar between the two years (refer to the frameworks on the EQAO Web site). For all assessments, the total number of points for the multiple-choice items is less than the total number of points for the open-response items. The mean of each θ-distribution is negative but close to zero, which is likely due to the negative skewness of the distributions (Lord, 1980, pp. 49–50). The standard deviations of the θ-distributions are slightly less than one. Some distributions were slightly leptokurtic (e.g., English junior reading), others were essentially mesokurtic (e.g., the OSSLT and English primary writing), while others were somewhat platykurtic (e.g., French primary writing).
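The data simulation described in the next section draws θ values from non-normal distributions matched to the four moments in Table 1 using the Pearson type-four family. As a rough illustration of the general moment-matching idea, the sketch below uses Fleishman's (1978) power method instead, a different but widely used simulation technique; the target moments are hypothetical, not taken from Table 1.

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_coefficients(skew, ekurt):
    """Solve for b, c, d in Y = -c + b*Z + c*Z**2 + d*Z**3 so that Y has
    mean 0, variance 1, the given skewness and the given excess kurtosis."""
    def equations(p):
        b, c, d = p
        return [
            b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1,
            2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew,
            24*(b*d + c**2*(1 + b**2 + 28*b*d)
                + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - ekurt,
        ]
    return fsolve(equations, x0=[1.0, 0.0, 0.0])

# Hypothetical moments for one assessment's theta distribution
# (kurt is ordinary kurtosis, so excess kurtosis = kurt - 3).
mean, sd, skew, kurt = -0.05, 0.95, -0.4, 3.2
b, c, d = fleishman_coefficients(skew, kurt - 3.0)

rng = np.random.default_rng(2010)
z = rng.standard_normal(5000)
theta = mean + sd * (-c + b*z + c*z**2 + d*z**3)  # moment-matched theta sample
```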

Table 1
Psychometric Characteristics of the Tests and Subtests in the Study

[The numeric entries of Table 1 were not preserved. For each assessment (OSSLT, TPCL, English primary writing, French primary writing, English junior reading, French junior mathematics and English Grade 9 applied mathematics) and each year, the table reports N, the numbers of multiple-choice (MC) and open-response (OR) items, and the mean, SD, skewness and kurtosis of the θ-distribution.]
a x (y): number of open-response items and the total number of possible points for these items.
b Combined winter and spring samples.
c The descriptive statistics were based on the θ-scale from the operational calibrations.

IRT Models. The IRT model used to generate the item responses for the OSSLT and TPCL was a modified Rasch model with guessing fixed to 0.20 for the multiple-choice items and the a-parameter fixed to 1/1.7 (approximately 0.588). This value effectively sets the discrimination to 1.0, because the a-parameter is multiplied by the constant 1.7 in the model. For the primary, junior and Grade 9 subtests, a two-parameter model with a fixed guessing parameter added was used for the multiple-choice items. For all tests and subtests, the

generalized partial credit model was used for the open-response items. These IRT models appear to be the most appropriate for EQAO's assessments (Xie, 2006).

Steps for Data Simulation. The following two questions guided the development of the computer simulation for each assessment:
a. What are the true changes (in percentage) at each achievement level?
b. What would the gains be at each achievement level in a real testing situation after the four equating processes of interest are applied, and how close are these estimated changes to the true changes?
The following data simulation steps were carried out to help answer these questions.

1. True Percentages

To identify the true percentages in each achievement category, the known θ-distributions for Year 1 and Year 2 were simulated using the Pearson type-four family, with the mean, standard deviation, skewness and kurtosis of the θ-distributions taken from the Year 1 and Year 2 operational tests, respectively (see Table 1). Since the true changes between the two years are not known, five possible true changes (-0.3, -0.1, 0.0, 0.1 and 0.3 units on the θ-scale) were modelled in the data simulations to reflect different performance changes. These values span the range of changes in performance that might be seen in realistic educational settings, although the ±0.3 conditions represent changes larger than those generally observed in the EQAO assessments. To create the five gain conditions for Year 2, each of the five gains was added to the mean of the Year 1 θ-distribution, and the known θ-distribution for Year 2 was then simulated for each gain condition for each selected test or subtest. The sample sizes used in the simulations were chosen to be close to the equating samples used in practice for each assessment. In the equating samples, the students who were accommodated with special versions and the students who did not respond were excluded; in the case of the OSSLT, the students who had previously been eligible to write were also excluded.

Cut scores were determined on the known Year 1 θ-distribution using the EQAO-reported percentages for each achievement level. These cut scores were then applied to the five simulated known Year 2 θ-distributions to identify the true percentage for each achievement level.
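A minimal sketch of this step: given a simulated Year 1 θ sample and hypothetical Year 1 reported percentages, the cut scores are the corresponding quantiles, and applying them to each Year 2 θ sample yields the true percentages. For brevity, the Year 2 sample is obtained here by shifting the Year 1 sample, whereas the study simulated a new Pearson type-four sample with the shifted mean; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
theta_y1 = rng.standard_normal(30000)          # stand-in for the Pearson type-IV sample

# Hypothetical reported Year 1 percentages below each level boundary
# (Level 1/2, Level 2/3 and Level 3/4 cuts).
pct_below_cut = [12.0, 35.0, 88.0]
cuts = np.percentile(theta_y1, pct_below_cut)  # cut scores on the Year 1 theta scale

for gain in [-0.3, -0.1, 0.0, 0.1, 0.3]:
    theta_y2 = theta_y1 + gain                 # known Year 2 distribution, this condition
    counts = np.histogram(theta_y2, bins=[-np.inf, *cuts, np.inf])[0]
    true_pct = 100 * counts / counts.sum()     # true percentage at each achievement level
    print(gain, np.round(true_pct, 2))
```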

2. Empirical Percentages

To obtain the empirical percentages, matrix data that mimic the EQAO assessments have to be simulated. The data simulation includes two stages: a) simulate the full data set for the Year 1 and Year 2 students and b) use the full data set to generate the matrix data set for calibration. To simulate the full data set, the operational-item parameters from Years 1 and 2 were combined into one file. The known Year 1 θ-distribution was also combined with each of the five known Year 2 θ-distributions. The item-response vectors for the students were then generated for the Year 1 and Year 2 test forms for each gain condition, based on the combined parameter file and the true θ-distributions (see Figure 2).

[Figure 2. Full data structure: Year 1 and Year 2 samples crossed with the Year 1 and Year 2 test forms.]

The vertical axis of the diagram represents students: those above the mid-point are from Year 1 and those below are from Year 2. The horizontal axis represents items: items to the left of the mid-point are included in the form administered in Year 1 and items to the right are included in the form administered in Year 2. To create the Year 1 matrix equating sample and the Year 2 operational equating sample, the light grey parts of the diagram are removed from the full data set. It is believed that the best way to get good information about changes in students' performance would be to have both cohorts of students take both tests; therefore, creating the equating samples from the ideal full data set seemed reasonable.
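The response generation can be sketched as follows, using the models named above: a three-parameter logistic (3PL) model with fixed guessing (c = 0.20) for multiple-choice items and the generalized partial credit model (GPCM) for open-response items. This is a stand-in for the Matlab programs described under Computer Programs below; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1.7  # scaling constant used in the IRT models

def sim_3pl(theta, a, b, c=0.20):
    """Dichotomous responses: P = c + (1 - c) / (1 + exp(-D a (theta - b)))."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta[:, None] - b)))
    return (rng.random(p.shape) < p).astype(int)

def sim_gpcm(theta, a, d):
    """Polytomous responses from the GPCM for one item; d holds the step
    parameters d_1..d_m, giving m + 1 score categories (0..m)."""
    # Running sums of D*a*(theta - d_k); category 0 corresponds to exp(0).
    z = np.cumsum(D * a * (theta[:, None] - d), axis=1)
    num = np.exp(np.hstack([np.zeros((len(theta), 1)), z]))
    p = num / num.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(row), p=row) for row in p])

theta = rng.standard_normal(1000)
mc = sim_3pl(theta, a=np.array([0.9, 1.2]), b=np.array([-0.3, 0.6]))
op = sim_gpcm(theta, a=1.0, d=np.array([-1.0, 0.0, 1.2]))  # one 4-category item
```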

After the usual equating data sets were created, equating was conducted using the CC, FCIP, TCC and M/S methods to obtain the empirical percentages. For the CC procedure, the Year 1 and Year 2 data sets were combined and calibrated together. For the TCC and M/S procedures, the Year 1 and Year 2 data sets were first calibrated separately; the TCC and M/S procedures were then applied to obtain the linear transformation coefficients used to scale the Year 1 test (equated) to the Year 2 test (reference). With the FCIP procedure, the two tests were calibrated separately, with the matrix item parameters of the Year 1 test fixed at the values of the Year 2 operational-item parameters to place the Year 1 test on the Year 2 scale. Procedures similar to those used in Step 1 were applied to identify a cut score and obtain an empirical percentage for each achievement level and each gain condition.

Computer Programs. The examinees' item responses were simulated using the Matlab programs Datagengpcmv and Datagen3plt: Datagengpcmv was used to simulate responses for the open-response items and Datagen3plt for the multiple-choice items. The simulated item-response distributions were compared with the actual item-response distributions and showed very similar patterns for each of the selected assessments. PARSCALE was chosen to conduct the calibrations because EQAO uses it for operational IRT calibration and scoring. MULTILOG and PARSCALE generate similar parameter estimates (Childs & Chen, 1999; Hanson & Béguin, 2002). However, PARSCALE produces an overall item location parameter and reproduces the category parameters by centring them to zero (Childs & Chen, 1999). Further, the number of examinees PARSCALE can handle is much greater.

Evaluation of the FCIP, CC, TCC and M/S Equating Methods. The performance of the FCIP, CC, TCC and M/S equating methods was evaluated by comparing the estimated percentage with the corresponding true percentage at each of the four achievement levels for the primary, junior and Grade 9 assessments and for the successful and unsuccessful categories for the OSSLT and TPCL. Twenty replications of each simulation were carried out. The inclusion of a wide variety of assessments was also considered very important for this study. Descriptive statistics of the empirical percentages across the 20 replications were computed for each achievement level, equating method and change condition. Each

average estimated percentage was compared to the corresponding true percentage to determine the bias in the empirical estimate:

$$\mathrm{Bias}_l = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{\Delta}_{il} - \Delta_l\right), \qquad (1)$$

where $\Delta_l$ is the true value for achievement level $l$, $\hat{\Delta}_{il}$ is the estimated value for the $i$th replication at achievement level $l$, and $n = 20$ is the number of replications (Sinharay & Holland, 2007). If the bias is negative, the true percentage is underestimated; if the bias is positive, the true percentage is overestimated. The stability of the empirical percentages across replications was assessed using the root mean square error (RMSE):

$$\mathrm{RMSE}_l = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\Delta}_{il} - \Delta_l\right)^2}. \qquad (2)$$

The smaller the RMSE, the closer the estimated values are to the true values. For the purposes of this study, bias and RMSE values smaller than or equal to 1% were considered negligible, and differences in bias between two methods had to exceed 0.50% to be claimed as meaningful. Many large-scale assessment programs consider a change of 1% from one year to the next to be meaningful.
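In code, the two criteria are straightforward. A small sketch, assuming est holds one method's estimated percentages across the 20 replications at a single achievement level (names and values hypothetical):

```python
import numpy as np

def bias_and_rmse(est, true_pct):
    """Equations (1) and (2): bias and RMSE of estimated percentages
    across replications, relative to the true percentage."""
    est = np.asarray(est, dtype=float)
    bias = np.mean(est - true_pct)                   # Equation (1)
    rmse = np.sqrt(np.mean((est - true_pct) ** 2))   # Equation (2)
    return bias, rmse

# Example: 20 replications of one method at one achievement level.
rng = np.random.default_rng(0)
est = 15.0 + rng.normal(0.2, 0.5, size=20)  # fabricated estimates for illustration
print(bias_and_rmse(est, true_pct=15.0))
```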

Results

The results for the OSSLT and TPCL are presented first, followed by the results for the primary, junior and Grade 9 subtests selected for this study.

OSSLT and TPCL

OSSLT. The results for the OSSLT are reported in the top panel of Table 2. The pattern of bias for the four equating methods is complex. For example, while the CC method recovered the true change of zero, it did not recover the other changes as well. While the FCIP, TCC and M/S procedures recovered changes in the percentages of unsuccessful students equally well across all change conditions, the TCC and M/S procedures recovered the -0.1 and 0.3 changes much better than the FCIP and CC procedures did. Overestimates of the percentage of unsuccessful students were observed for the positive changes, with the bias of the CC procedure more pronounced than that of the FCIP, TCC and M/S procedures (e.g., 3.64% vs. 0.72%, 0.38% and 0.37% for a true gain of 0.3 on the θ-scale). Underestimates were observed for the negative changes, with the bias of the CC procedure larger than that of the other three procedures for a true change of -0.3 (e.g., -5.76% for CC vs. -1.05% for M/S). Overall, the TCC and M/S methods fared slightly better than the FCIP method, and these three performed much better than the CC method. Expressed in terms of average RMSE across the five gain conditions, the four equating methods rank as follows: TCC (0.45%), M/S (0.49%), FCIP (0.59%) and CC (2.52%).

TPCL. As shown in the lower panel of Table 2, the performance of the CC method was again the poorest for the changes in both directions, with large overestimates for the positive changes (2.48% for a change of 0.1 and 4.29% for a change of 0.3) and underestimates for the negative changes (-0.89% for a change of -0.1 and -4.57% for a change of -0.3). The FCIP, TCC and M/S procedures also overestimated the zero gain. Interestingly, except for the -0.3 condition, the FCIP, TCC and M/S procedures overestimated the remaining changes, with the overestimation more pronounced for the -0.1 and 0.1 conditions. Again, the FCIP, TCC and M/S procedures ranked first, each with an average RMSE of around 1% across the five gain conditions, while the CC procedure showed a substantially larger average RMSE of 2.66%. Lastly, the magnitude of the RMSE for the TPCL was generally larger than that for the OSSLT.

Table 2
Equating Results for the OSSLT and TPCL: Percentage Unsuccessful for Year 2 Theta Distribution

OSSLT
Change:      -0.3      -0.1      0.0       0.1      0.3
True         23.76%    17.72%    15.04%    12.70%   8.65%
CC   Bias    -5.76%    -1.69%    -0.03%    1.39%    3.64%
     RMSE     5.76%     1.70%     0.11%    1.39%    3.64%
FCIP Bias    —         -0.58%    -0.11%    0.12%    0.72%
     RMSE     1.00%     0.78%     0.18%    0.14%    0.83%
TCC  Bias    —         -0.37%    -0.15%    0.06%    0.38%
     RMSE     1.07%     0.41%     0.22%    0.11%    0.44%
M/S  Bias    -1.05%    -0.34%    -0.20%    0.02%    0.37%
     RMSE     1.09%     0.41%     0.31%    0.22%    0.41%

TPCL (N = 5260)
Change:      -0.3      -0.1      0.0       0.1      0.3
True         24.43%    17.62%    14.33%    11.43%   7.64%
CC   Bias    -4.57%    -0.89%    —         2.48%    4.29%
     RMSE     4.59%     0.97%     0.76%    2.52%    4.48%
FCIP Bias    -0.20%     0.45%     0.66%    1.35%    0.99%
     RMSE     0.71%     0.76%     0.75%    1.44%    1.09%
TCC  Bias    -0.17%     0.64%     0.90%    1.34%    0.95%
     RMSE     0.72%     1.12%     1.05%    1.46%    1.05%
M/S  Bias     0.00%     0.00%     0.01%    0.01%    0.01%
     RMSE     0.76%     1.09%     0.99%    1.12%    0.84%

Note. Dashes indicate values not available.

Primary, Junior and Grade 9 Assessments

There are four achievement categories for the primary, junior and Grade 9 assessment programs. EQAO also reports on an achievement category below Level 1, but very few students are in this category, so it was combined with Level 1 in this study. The estimated and true percentages of students in each achievement category were compared for the CC, FCIP, TCC and M/S procedures. Thus, there are 20 bias estimates (4 levels × 5 gain conditions) for each test or subtest and equating procedure. The results for the five subtests (English and French primary writing, English junior reading, French junior mathematics and English Grade 9 applied mathematics) are presented below.

English Primary Writing. The results for English primary writing are provided in Table 3. First, the bias for the zero-change condition was negligible at each achievement level for each of the CC, FCIP, TCC and M/S equating methods. Likewise, the bias was negligible for all of the remaining change conditions at achievement Level 1 for all four methods. Differences among the equating methods appeared for the remaining conditions and achievement levels. The bias for the CC method was greater than that for the other procedures at achievement Levels 2, 3 and 4 for changes of -0.3, -0.1, 0.1 and 0.3; for example, for a change of -0.3 at achievement Level 2, the bias was -8.50% for the CC procedure, -2.47% for the FCIP procedure, -2.37% for the TCC procedure and -2.34% for the M/S procedure. The results for the FCIP, TCC and M/S procedures were comparable: each method essentially recovered the gain of 0.1 at all achievement levels, the changes of -0.3 and -0.1 at achievement Level 4 and the change of 0.3 at achievement Level 3. It is interesting to note that these three procedures were positively biased at achievement Level 3 and negatively biased at Level 2 for the gains of -0.3 and -0.1, but negatively biased at Level 4 for a gain of 0.3. Overall, the results for English primary writing reveal the poorer performance of the CC equating method, which had the largest bias and RMSE, and the comparable performance of the FCIP, TCC and M/S procedures across the five change conditions.

French Primary Writing. The equating results for French primary writing are presented in Table 4. A comparison of these results with the results for English primary writing reveals differences between the two subtests. For example, while the bias was again small for all four equating procedures at achievement Level 1 for all change conditions, the bias was relatively large for each procedure at Levels 2, 3 and 4. Although the direction of the bias was noticeable for all the changes but 0.0, it did not follow a clear pattern.

Table 3
Equating Results for English Primary Writing: Percentage at Each Achievement Level for Year 2 Theta Distributions

[The numeric entries of Table 3 were not preserved. For each change condition (-0.3, -0.1, 0.0, 0.1 and 0.3) and achievement level (1 to 4), the table reports the true percentage and the bias and RMSE for each of the CC, FCIP, TCC and M/S procedures.]

Table 4
Equating Results for French Primary Writing: Percentage at Each Achievement Level for Year 2 Theta Distributions (N = 6000)

[The numeric entries of Table 4 were not preserved. For each change condition (-0.3, -0.1, 0.0, 0.1 and 0.3) and achievement level (1 to 4), the table reports the true percentage and the bias and RMSE for each of the CC, FCIP, TCC and M/S procedures.]

With the CC procedure, negative estimates were observed for changes of -0.1 and -0.3 at Levels 1 and 2, while positive estimates were shown for Levels 3 and 4; for the positive changes, positive estimates were observed for Levels 1, 2 and 3 and negative estimates at Level 4. For the FCIP and TCC procedures, except for a change of 0.0 at Level 4, the bias was mostly positive. For the M/S procedure, positive biases were observed for Levels 1 and 2 and negative biases for Levels 3 and 4. The CC procedure showed a markedly larger bias and RMSE for all conditions but a gain of 0.0. For all procedures, the farther the gain condition departed from 0.0, the greater the observed bias and RMSE. Over all five change conditions, the bias and RMSE tended to be greater than those for English primary writing. This may be attributable to the difference in sample size, although the same finding was not observed for the OSSLT and TPCL (where the relative difference between the numbers of French and English students was approximately the same).

French Junior Mathematics. The results for French junior mathematics are reported in Table 5. The CC procedure again resulted in a much larger bias and RMSE at Levels 2 and 4 for changes of -0.1 and -0.3 and at Levels 2 to 4 for a change of 0.3. In contrast, FCIP and TCC performed similarly and yielded negligible bias and much smaller RMSE at all achievement levels across all gain conditions. Compared with the FCIP and TCC procedures, the M/S procedure did not perform as well: in quite a few cases (for example, a change of 0.3 at Level 3), the bias was negligible but the RMSE was fairly large, which indicates that the behaviour of the M/S procedure was not as stable as that of the FCIP and TCC procedures. Both positive and negative estimates were observed, with no clear pattern. The magnitude of the bias and RMSE tended to be smaller than that observed for English and French primary writing in most cases. Across these three assessments, the performance of the CC procedure was the poorest. The FCIP, TCC and M/S procedures demonstrated very similar performance; however, the M/S procedure yielded a larger RMSE in some cases, indicating less stable performance than the other two procedures.

Table 5
Equating Results for French Junior Mathematics: Percentage at Each Achievement Level for Year 2 Theta Distributions (N = 5600)

[The numeric entries of Table 5 were not preserved. For each change condition (-0.3, -0.1, 0.0, 0.1 and 0.3) and achievement level (1 to 4), the table reports the true percentage and the bias and RMSE for each of the CC, FCIP, TCC and M/S procedures.]

English Junior Reading. While the patterns of bias were not identical, the size of the biases for English junior reading (Table 6) tended to be more similar to the values observed for the OSSLT than to those for the other tests and subtests, a finding likely attributable to the larger sample sizes. Although the CC procedure recovered the true change of 0.0, it did not recover the other changes as well. For the negative changes, underestimates were observed at achievement Levels 1 and 2 and overestimates at achievement Levels 3 and 4, with the bias more pronounced for the -0.3 condition; for the positive changes, the pattern was reversed. Although the CC procedure performed better with English junior reading than with the other subtests and tests, much larger bias and RMSE than those of the other three procedures were observed for the change of -0.3 at all achievement levels, as well as for the change of 0.3 at Levels 2 and 4. In contrast, the other three procedures performed almost equally well, with bias and RMSE less than 1%.

English Grade 9 Applied Mathematics. The results for this subtest were very similar to those for English junior reading. Overall, the CC method performed the worst (see Table 7). Substantial overestimates and underestimates were shown for the two extreme change conditions: for a change of -0.3, overestimates were yielded for Levels 1 and 2 and underestimates for Levels 3 and 4; for a change of 0.3, the pattern was reversed. The largest bias was greater than 5%. The other three methods performed much better than the CC procedure, and equally well relative to one another, with FCIP and TCC showing slightly larger biases and RMSE than M/S at Level 2 for a change of 0.3. Otherwise, the magnitude of the bias was smaller than 1% across all conditions and achievement levels.

Table 6
Equating Results for English Junior Reading: Percentage at Each Achievement Level for Year 2 Theta Distribution

[The numeric entries of Table 6 were not preserved. For each change condition (-0.3, -0.1, 0.0, 0.1 and 0.3) and achievement level (1 to 4), the table reports the true percentage and the bias and RMSE for each of the CC, FCIP, TCC and M/S procedures.]

Table 7
Equating Results for English Grade 9 Applied Mathematics: Percentage at Each Achievement Level for Year 2 Theta Distribution

[The numeric entries of Table 7 were not preserved. For each change condition (-0.3, -0.1, 0.0, 0.1 and 0.3) and achievement level (1 to 4), the table reports the true percentage and the bias and RMSE for each of the CC, FCIP, TCC and M/S procedures.]

Discussion

The present study examined the performance of the CC, FCIP, TCC and M/S equating procedures in recovering the change in student performance between two successive years, using bias and RMSE as the criteria. Five change conditions (-0.3, -0.1, 0.0, 0.1 and 0.3) on the θ-scale were considered. Seven of EQAO's tests or subtests with different complexity and student sample sizes were selected: the OSSLT and TPCL, English and French primary writing, French junior mathematics, English junior reading and English Grade 9 applied mathematics.

The results revealed that bias and RMSE followed a complex pattern across the five change conditions and seven assessments. The CC procedure yielded a substantially larger bias than the other three equating procedures, except in the case of the zero-gain condition; the magnitude of the bias was as large as 8.50% in absolute value. For all equating procedures, a larger bias was observed for the two extreme change conditions (±0.3), although not for every assessment, and this effect was more pronounced with the CC method. In contrast, the FCIP, TCC and M/S procedures produced much smaller biases, with a maximum of approximately 2% in absolute value. Smaller biases were found for the English assessments than for the French assessments, likely due to the differences in sample sizes. For the OSSLT and TPCL, underestimates were shown for the negative gains and overestimates for the positive gains; among the other assessments, which have multiple achievement categories, underestimates and overestimates varied.

The substantially large bias and RMSE revealed by the CC procedure under change in group performance indicate that it generally failed to handle non-equivalent groups. This finding is consistent with those of previous studies involving both equivalent and non-equivalent groups (Petersen et al., 1983; Hills et al., 1988; Hu et al., 2008).

Of the three separate-calibration and -equating procedures, the M/S procedure seemed to be more sensitive to sample size: greater variation in the bias results across replications was observed for the M/S procedure for the French tests and subtests. The M/S procedure uses the means and standard deviations of the item parameters to determine the linear transformation coefficients. According to Yen and Fitzpatrick

(2006), there are two major limitations to this method: a) item parameters are treated as independent entities, so outliers can significantly affect the results, and b) if one set of item-parameter estimates is based on a smaller sample of examinees than the other, the error variances of the two sets of item parameters will differ. In that case, if the variances of the two sets of estimates are set equal (as in the M/S procedure), a systematic bias will occur: the set of results with the greater error variance is shrunk to match the set with the smaller error variance. That is, the variances of the estimates will be equal, but the variances of the true values will not be (Yen & Fitzpatrick, 2006). In the case of the French assessments, the sample sizes were small because of the matrix design used to field-test items in Year 1. For example, for French junior mathematics there were 10 matrix blocks, resulting in a sample size of approximately 560 students per block. This may have led to unstable item-parameter estimates, which would adversely affect the means and standard deviations.

In contrast to the M/S procedure, the TCC equating procedure obtains the linking coefficients by minimizing the difference between the test characteristic curves for the two test administrations. According to Yen and Fitzpatrick (2006), one of the most important advantages of the TCC method is that it minimizes differences in expected scores rather than in observed scores or parameters. In doing so, the parameters or thetas are aligned, thereby reducing the variance across replications.

For the assessments with multiple performance levels, the largest bias was shown for English and French primary writing. These two subtests had only 14 items and a maximum possible score of 38. These results suggest that a short test may affect, to varying degrees, the performance of the equating methods, with the CC method substantially affected, followed by the M/S procedure. In addition, the French primary writing results likely reflect the confounded effects of a short test and a small sample on the equating methods (see Table 4).

Although the FCIP procedure outperformed the other equating methods in this study, one has to be careful in its application. Yen and Fitzpatrick (2006) pointed out that one of the important features of the FCIP procedure is that, by holding the common-item parameters fixed, it forces the parameter estimation program to accommodate the


More information

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Educational Psychology Papers and Publications Educational Psychology, Department of 1-2016 The Matching Criterion Purification

More information

Evaluating the Impact of Construct Shift on Item Parameter Invariance, Test Equating and Proficiency Estimates

Evaluating the Impact of Construct Shift on Item Parameter Invariance, Test Equating and Proficiency Estimates University of Massachusetts Amherst ScholarWorks@UMass Amherst Doctoral Dissertations Dissertations and Theses 2015 Evaluating the Impact of Construct Shift on Item Parameter Invariance, Test Equating

More information

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek.

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek. An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests Xuan Tan Sooyeon Kim Insu Paek Bihua Xiang ETS, Princeton, NJ Paper presented at the annual meeting of the

More information

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017) DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;

More information

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

Multidimensional Modeling of Learning Progression-based Vertical Scales 1

Multidimensional Modeling of Learning Progression-based Vertical Scales 1 Multidimensional Modeling of Learning Progression-based Vertical Scales 1 Nina Deng deng.nina@measuredprogress.org Louis Roussos roussos.louis@measuredprogress.org Lee LaFond leelafond74@gmail.com 1 This

More information

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 4-2014 Modeling DIF with the Rasch Model: The Unfortunate Combination

More information

PROMIS DEPRESSION AND CES-D

PROMIS DEPRESSION AND CES-D PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS DEPRESSION AND CES-D SEUNG W. CHOI, TRACY PODRABSKY, NATALIE MCKINNEY, BENJAMIN D. SCHALET, KARON F. COOK & DAVID CELLA

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

A Bayesian Nonparametric Model Fit statistic of Item Response Models

A Bayesian Nonparametric Model Fit statistic of Item Response Models A Bayesian Nonparametric Model Fit statistic of Item Response Models Purpose As more and more states move to use the computer adaptive test for their assessments, item response theory (IRT) has been widely

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

Effects of Local Item Dependence

Effects of Local Item Dependence Effects of Local Item Dependence on the Fit and Equating Performance of the Three-Parameter Logistic Model Wendy M. Yen CTB/McGraw-Hill Unidimensional item response theory (IRT) has become widely used

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla

More information

Regression Discontinuity Analysis

Regression Discontinuity Analysis Regression Discontinuity Analysis A researcher wants to determine whether tutoring underachieving middle school students improves their math grades. Another wonders whether providing financial aid to low-income

More information

Impact of Methods of Scoring Omitted Responses on Achievement Gaps

Impact of Methods of Scoring Omitted Responses on Achievement Gaps Impact of Methods of Scoring Omitted Responses on Achievement Gaps Dr. Nathaniel J. S. Brown (nathaniel.js.brown@bc.edu)! Educational Research, Evaluation, and Measurement, Boston College! Dr. Dubravka

More information

Selection and Combination of Markers for Prediction

Selection and Combination of Markers for Prediction Selection and Combination of Markers for Prediction NACC Data and Methods Meeting September, 2010 Baojiang Chen, PhD Sarah Monsell, MS Xiao-Hua Andrew Zhou, PhD Overview 1. Research motivation 2. Describe

More information

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Dato N. M. de Gruijter University of Leiden John H. A. L. de Jong Dutch Institute for Educational Measurement (CITO)

More information

PROMIS ANXIETY AND MOOD AND ANXIETY SYMPTOM QUESTIONNAIRE PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES

PROMIS ANXIETY AND MOOD AND ANXIETY SYMPTOM QUESTIONNAIRE PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS ANXIETY AND MOOD AND ANXIETY SYMPTOM QUESTIONNAIRE SEUNG W. CHOI, TRACY PODRABSKY, NATALIE MCKINNEY, BENJAMIN D. SCHALET,

More information

IRT Parameter Estimates

IRT Parameter Estimates An Examination of the Characteristics of Unidimensional IRT Parameter Estimates Derived From Two-Dimensional Data Timothy N. Ansley and Robert A. Forsyth The University of Iowa The purpose of this investigation

More information

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data Item Response Theory: Methods for the Analysis of Discrete Survey Response Data ICPSR Summer Workshop at the University of Michigan June 29, 2015 July 3, 2015 Presented by: Dr. Jonathan Templin Department

More information

Maike Krannich, Odin Jost, Theresa Rohm, Ingrid Koller, Steffi Pohl, Kerstin Haberkorn, Claus H. Carstensen, Luise Fischer, and Timo Gnambs

Maike Krannich, Odin Jost, Theresa Rohm, Ingrid Koller, Steffi Pohl, Kerstin Haberkorn, Claus H. Carstensen, Luise Fischer, and Timo Gnambs neps Survey papers Maike Krannich, Odin Jost, Theresa Rohm, Ingrid Koller, Steffi Pohl, Kerstin Haberkorn, Claus H. Carstensen, Luise Fischer, and Timo Gnambs NEPS Technical Report for reading: Scaling

More information

Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A.

Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A. Measurement and Research Department Reports 2001-2 Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A. Hanson Measurement

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Comprehensive Statistical Analysis of a Mathematics Placement Test

Comprehensive Statistical Analysis of a Mathematics Placement Test Comprehensive Statistical Analysis of a Mathematics Placement Test Robert J. Hall Department of Educational Psychology Texas A&M University, USA (bobhall@tamu.edu) Eunju Jung Department of Educational

More information

Decision consistency and accuracy indices for the bifactor and testlet response theory models

Decision consistency and accuracy indices for the bifactor and testlet response theory models University of Iowa Iowa Research Online Theses and Dissertations Summer 2014 Decision consistency and accuracy indices for the bifactor and testlet response theory models Lee James LaFond University of

More information

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Jonny B. Pornel, Vicente T. Balinas and Giabelle A. Saldaña University of the Philippines Visayas This paper proposes that

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

context effects under the nonequivalent anchor test (NEAT) design. In addition, the issue

context effects under the nonequivalent anchor test (NEAT) design. In addition, the issue STORE, DAVIE, Ph.D. Item Parameter Changes and Equating: An Examination of the Effects of Lack of Item Parameter Invariance on Equating and Score Accuracy for Different Proficiency Levels. (2013) Directed

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

A Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho

A Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho ADAPTIVE TESTLETS 1 Running head: ADAPTIVE TESTLETS A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing Leslie Keng Pearson Tsung-Han Ho The University of Texas at Austin

More information

The Impact of Statistically Adjusting for Rater Effects on Conditional Standard Errors for Performance Ratings

The Impact of Statistically Adjusting for Rater Effects on Conditional Standard Errors for Performance Ratings 0 The Impact of Statistically Adjusting for Rater Effects on Conditional Standard Errors for Performance Ratings Mark R. Raymond, Polina Harik and Brain E. Clauser National Board of Medical Examiners 1

More information

First Year Paper. July Lieke Voncken ANR: Tilburg University. School of Social and Behavioral Sciences

First Year Paper. July Lieke Voncken ANR: Tilburg University. School of Social and Behavioral Sciences Running&head:&COMPARISON&OF&THE&LZ*&PERSON;FIT&INDEX&& Comparison of the L z * Person-Fit Index and ω Copying-Index in Copying Detection First Year Paper July 2014 Lieke Voncken ANR: 163620 Tilburg University

More information

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH By JANN MARIE WISE MACINNES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

PROMIS DEPRESSION AND NEURO-QOL DEPRESSION

PROMIS DEPRESSION AND NEURO-QOL DEPRESSION PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS DEPRESSION AND NEURO-QOL DEPRESSION SEUNG W. CHOI, TRACY PODRABSKY, NATALIE MCKINNEY, BENJAMIN D. SCHALET, KARON F. COOK

More information

PROMIS ANXIETY AND KESSLER 6 MENTAL HEALTH SCALE PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES

PROMIS ANXIETY AND KESSLER 6 MENTAL HEALTH SCALE PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS ANXIETY AND KESSLER 6 MENTAL HEALTH SCALE SEUNG W. CHOI, TRACY PODRABSKY, NATALIE MCKINNEY, BENJAMIN D. SCHALET, KARON

More information

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review Results & Statistics: Description and Correlation The description and presentation of results involves a number of topics. These include scales of measurement, descriptive statistics used to summarize

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 26 for Mixed Format Tests Kyong Hee Chon Won-Chan Lee Timothy N. Ansley November 2007 The authors are grateful to

More information

Standard Errors of Correlations Adjusted for Incidental Selection

Standard Errors of Correlations Adjusted for Incidental Selection Standard Errors of Correlations Adjusted for Incidental Selection Nancy L. Allen Educational Testing Service Stephen B. Dunbar University of Iowa The standard error of correlations that have been adjusted

More information

USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS. Jing-Ru Xu

USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS. Jing-Ru Xu USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS By Jing-Ru Xu A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements

More information

PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS SLEEP DISTURBANCE AND NEURO-QOL SLEEP DISTURBANCE

PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS SLEEP DISTURBANCE AND NEURO-QOL SLEEP DISTURBANCE PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS SLEEP DISTURBANCE AND NEURO-QOL SLEEP DISTURBANCE DAVID CELLA, BENJAMIN D. SCHALET, MICHAEL KALLEN, JIN-SHEI LAI, KARON

More information

During the past century, mathematics

During the past century, mathematics An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first

More information

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015 This report describes the procedures used in obtaining parameter estimates for items appearing on the 2014-2015 Smarter Balanced Assessment Consortium (SBAC) summative paper-pencil forms. Among the items

More information

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data A C T Research Report Series 87-12 A Comparison Study of the Unidimensional IRT Estimation of Compensatory and Noncompensatory Multidimensional Item Response Data Terry Ackerman September 1987 For additional

More information

VIEW: An Assessment of Problem Solving Style

VIEW: An Assessment of Problem Solving Style VIEW: An Assessment of Problem Solving Style 2009 Technical Update Donald J. Treffinger Center for Creative Learning This update reports the results of additional data collection and analyses for the VIEW

More information

PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS PEDIATRIC ANXIETY AND NEURO-QOL PEDIATRIC ANXIETY

PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS PEDIATRIC ANXIETY AND NEURO-QOL PEDIATRIC ANXIETY PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS PEDIATRIC ANXIETY AND NEURO-QOL PEDIATRIC ANXIETY DAVID CELLA, BENJAMIN D. SCHALET, MICHAEL A. KALLEN, JIN-SHEI LAI,

More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

Item Analysis: Classical and Beyond

Item Analysis: Classical and Beyond Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013 Why is item analysis relevant? Item analysis provides

More information

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow?

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Journal of Educational and Behavioral Statistics Fall 2006, Vol. 31, No. 3, pp. 241 259 An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Michael C. Edwards The Ohio

More information

PROMIS PAIN INTERFERENCE AND BRIEF PAIN INVENTORY INTERFERENCE

PROMIS PAIN INTERFERENCE AND BRIEF PAIN INVENTORY INTERFERENCE PROSETTA STONE ANALYSIS REPORT A ROSETTA STONE FOR PATIENT REPORTED OUTCOMES PROMIS PAIN INTERFERENCE AND BRIEF PAIN INVENTORY INTERFERENCE SEUNG W. CHOI, TRACY PODRABSKY, NATALIE MCKINNEY, BENJAMIN D.

More information

Modeling the Effect of Differential Motivation on Linking Educational Tests

Modeling the Effect of Differential Motivation on Linking Educational Tests Modeling the Effect of Differential Motivation on Linking Educational Tests Marie-Anne Keizer-Mittelhaëuser MODELING THE EFFECT OF DIFFERENTIAL MOTIVATION ON LINKING EDUCATIONAL TESTS PROEFSCHRIFT TER

More information

Estimating the number of components with defects post-release that showed no defects in testing

Estimating the number of components with defects post-release that showed no defects in testing SOFTWARE TESTING, VERIFICATION AND RELIABILITY Softw. Test. Verif. Reliab. 2002; 12:93 122 (DOI: 10.1002/stvr.235) Estimating the number of components with defects post-release that showed no defects in

More information

Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods

Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods Journal of Modern Applied Statistical Methods Volume 11 Issue 1 Article 14 5-1-2012 Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian

More information