
An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests

Xuan Tan
Sooyeon Kim
Insu Paek
Bihua Xiang

ETS, Princeton, NJ

Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), April 13-17, 2009, San Diego, CA.

Unpublished Work Copyright 2009 by Educational Testing Service. All Rights Reserved. These materials are an unpublished, proprietary work of ETS. Any limited distribution shall not constitute publication. This work may not be reproduced or distributed to third parties without ETS's prior written consent. Submit all requests through Educational Testing Service. ETS, the ETS logo, and Listening. Learning. Leading. are registered trademarks of Educational Testing Service (ETS).

Abstract

This study evaluated the comparability of the multiple-choice (MC) item anchor to the MC plus trend-scored constructed-response (CR) item anchor for adjusting scoring shift in CR items for reprint forms of mixed-format tests, using linear and nonlinear equating methods. Two factors, the MC to CR ratio and the correlation between MC/CR, were manipulated to evaluate their impact on the effectiveness of the MC item anchor. Simulation based on real test data was done by selecting subsets of MC items with differential correlations with the CR section to create three levels of MC to CR ratio and two levels of correlation between MC/CR. The results showed that the MC item anchor design (the MC anchor NEAT design method) worked comparably to the trend scoring method when the ratio of MC to CR was high and the correlation between MC/CR was moderately high.

Acknowledgement

The authors would like to thank Stephanie Fournier-Zajac, Thomas Kaufmann, and Danielle Siwek for extracting and cleaning up the data for the analyses. Many thanks to Longjuan Liang for providing the code for running the chained equipercentile equatings.

Introduction

Many large-scale testing programs increasingly use constructed-response (CR) items in their assessments, often together with multiple-choice (MC) items. CR items tend to more closely resemble the real-world tasks associated with the construct to be measured. Along with their benefits, CR items bring certain complications that must be addressed to assure quality and equity in the assessments (Walker, 2007). For example, because human raters score these items, scoring standards can be applied more leniently or harshly than planned, and thus CR items involve more scoring error than MC items. Even when the same form (called a reprint) is reused at a different time, form difficulty may not be constant because the raters change: the common CR items are not really common, because different raters were employed to score them. In this case, the original test score conversion may no longer apply to the reprint forms.

Tate (1999, 2000) articulated a solution to the problem of scoring standards changing over time, called the trend scoring method, in the context of the non-equivalent groups anchor test (NEAT) design. A representative sample of examinees from Administration 1 (the reference administration) is inserted into the rating process for Administration 2 (the new administration). The responses obtained from the reference group of examinees are rescored by the raters scoring responses on the same items for the new group of examinees. Thus, these trend papers have two sets of scores associated with them: one from the old set of raters and one from the new rater group. Equating can then be used to adjust for the scoring shift. In short, trend scoring is rescoring the same examinee papers across scoring sessions, which removes group ability differences, so that any scoring shift due to human raters over time can be detected and controlled for. By its

nature, the trend scoring method has statistical strengths in detecting and adjusting for a scoring shift. However, this method is expensive and difficult to implement in practice. The retention and retrieval of trend papers to be interspersed with papers in the later administrations require extra care in test planning and data handling, and the rescoring of trend papers adds significantly to operational cost and scoring time.

Paek and Kim (2007) suggested alternatives for monitoring CR scoring shift in reprint mixed-format tests without trend scoring. Two differential bundle functioning (DBF) based methods were used, examining the conditional CR score distributions across the two administrations after matching on the MC items. When there is a notable scoring shift, the DBF analysis will show a shift away from zero favoring one administration over the other. Using real data, they showed that the DBF methods were comparable to the trend scoring method in detecting scoring shift. When a scoring shift is detected, however, it is unclear whether using the MC items as an anchor to adjust for the difference in difficulty due to scoring shift will be as effective as the trend scoring method. The success of the MC anchor may depend on factors such as the magnitude of the correlation between MC and CR and the MC to CR ratio (i.e., the proportion of MC points to the total points available on the test). As the correlation between MC and CR items and the MC to CR ratio decrease, the quality of the MC items as a representative anchor of the total test deteriorates, and with it the MC anchor equating method's ability to accurately adjust for scoring shift.

The Trend Scoring Method

Figure 1 illustrates the trend score equating design used for equating mixed-format tests under the NEAT design framework. This equating design can be used for

both new form and reprint form equating. For the new form case, the anchor is comprised of a set of common MC and CR items between the new and reference forms. For the reprint form case, the anchor is comprised of all MC and CR items, since the same form is used again. For illustration purposes, Form X (black letters) denotes the new or reprint form given in Administration 2, and Form Y (green letters) denotes the reference form given in Administration 1. The anchor for each form is denoted by XA for Form X and YA for Form Y. Since two sets of raters are involved for the CR section, the rater group in Administration 1 is denoted by blue colored boxes and the rater group in Administration 2 is denoted by red colored boxes. A subset of examinee CR responses from Form Y (trend papers) is selected and rated again in Administration 2 along with the Form X responses. This trend sample is used as the reference sample. As illustrated in Figure 1, for Form X the total score is calculated by combining scores on all MC items and ratings on all CR items rated by the Red team, and the anchor score, XA, is calculated by combining scores on the common MC items and the first ratings on the common CR items, also rated by the Red team. In real operational settings, the trend papers are usually scored by only one rater, rather than two raters as the operational papers are, to save cost. Thus, XA is usually calculated using only the first ratings of the common CR items to keep a consistent scoring schema between the new form and reference form anchors. For Form Y, the total score is calculated by combining scores on all MC items and ratings on all CR items rated by the Blue team, and the anchor score, YA, is calculated by combining scores on the common MC items and the single ratings on the common CR items given by the Red team when the trend papers were rescored along with the CR responses on Form X.
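As a concrete illustration of the score assembly just described, here is a minimal NumPy sketch for the reprint case. The array names and the synthetic placeholder data are hypothetical, the operational CR weighting is ignored, and the sketch assumes (as one possibility) that the two operational ratings are summed into the total.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y, n_mc, n_cr = 1011, 461, 90, 3  # sample and section sizes from the paper

# Synthetic placeholder data; in practice these would be the observed scores.
mc_x = rng.integers(0, 2, (n_x, n_mc))          # Form X MC item scores (0/1)
cr_x_r1 = rng.integers(0, 4, (n_x, n_cr))       # Form X CR, Red team, first rating (0-3)
cr_x_r2 = rng.integers(0, 4, (n_x, n_cr))       # Form X CR, Red team, second rating
mc_y = rng.integers(0, 2, (n_y, n_mc))          # Form Y (trend sample) MC item scores
cr_y_r1 = rng.integers(0, 4, (n_y, n_cr))       # Form Y CR, Blue team, first rating
cr_y_r2 = rng.integers(0, 4, (n_y, n_cr))       # Form Y CR, Blue team, second rating
cr_trend_red = rng.integers(0, 4, (n_y, n_cr))  # trend papers rescored once by the Red team

# Form X: total combines all MC items with the Red team's CR ratings
# (this sketch sums the two operational ratings; the actual combination rule may differ);
# the anchor XA uses the common MC items plus only the first CR ratings.
total_x = mc_x.sum(1) + cr_x_r1.sum(1) + cr_x_r2.sum(1)
anchor_xa = mc_x.sum(1) + cr_x_r1.sum(1)        # reprint case: every item is a common item

# Form Y: total combines all MC items with the Blue team's CR ratings;
# the anchor YA swaps in the single Red-team rescored (trend) CR ratings,
# keeping the anchor scoring schema consistent with XA.
total_y = mc_y.sum(1) + cr_y_r1.sum(1) + cr_y_r2.sum(1)
anchor_ya = mc_y.sum(1) + cr_trend_red.sum(1)
```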

The only difference between trend score equating and a regular NEAT design equating is the way the anchor score is constructed for the reference form: the CR ratings on the trend papers given by raters from Administration 2 are used as the CR anchor instead of the original ratings by raters from Administration 1. We can also call this design the MC plus trend CR anchor NEAT design. This usually reduces the correlation between the total and anchor scores for the reference form.

Figure 1. The trend score equating design (MC plus trend CR anchor NEAT design).

The Non-trend Method: MC Anchor NEAT Design

Figure 2 illustrates the MC anchor NEAT design method, which is a commonly used design. The total scores on Form X and Form Y are calculated the same way as in the trend scoring method. The anchor scores on Forms X and Y are calculated based only

on common MC items. For the new form case, the common MC items are a subset of the MC items shared between the new and reference forms. For the reprint case, the common MC items are all the MC items, since the same form is used again.

Figure 2. The MC anchor NEAT design method.

Purpose

The purpose of this study is to discover under what conditions the MC anchor NEAT design method will be as effective as the trend scoring method in adjusting reprint mixed-format tests for differential difficulty caused by scoring shift for the CR common items. The correlation between MC/CR and the MC to CR ratio were manipulated to evaluate the extent to which the MC anchor NEAT design method would work as well as the trend scoring method.

Method

Simulation was conducted based on real data using a bootstrap technique, to alleviate the problem of unrealistic data limiting the generalizability of results. Different test structures were created by selecting subsets of items from real tests to create tests with different MC to CR ratios. Within each test structure, different subsets of items were chosen to form tests with different correlations between MC and CR items.

Data

A test that assesses the knowledge and competence of beginning teachers of middle school science was selected because it had a long MC section (90 MC items) along with three CR items. The MC section was rights-scored, and the CR section was scored by two raters on a rating scale of 0-3 and weighted so that the total test score is 120 points (90 MC + 30 CR points). One reprint form administration where the trend scoring method was used for equating was used for the simulation. The reprint form sample consisted of 1,011 examinees; the reference form sample (trend sample) consisted of 461 examinees. The two sets of ratings on the trend papers (original ratings in the reference form administration and rescored ratings in the reprint administration) indicated a shift toward more lenient scoring by the raters in the reprint form administration.

Methods and Procedures

Tests with higher MC to CR ratios are expected to work better with the MC anchor NEAT design method since, with more MC items as an anchor, the reliability of the anchor and the correlation between anchor and total would be higher. Different numbers of MC items were chosen to create three levels of MC to CR ratio: 30 MC

items, 45 MC items, and 60 MC items. The three CR items were kept, resulting in total score points of 60, 75, and 90 for the three levels and thus three different MC to CR ratios: 1:1, 1.5:1, and 2:1. There are six sub content areas in the MC section, and the subsets of items selected for each level maintained the proportional structure of the categories.

The second factor is the correlation between MC/CR. Tests with higher correlations between MC/CR are expected to work better with the MC anchor NEAT design method, since the MC items would then work as an adequate anchor representing a mini-test, with higher correlations between anchor and total. The correlation between MC/CR is affected by the reliability of each section as well. However, the focus of this study is on different levels of correlation between MC/CR caused by the inherent structure of the construct: for tests with similar reliabilities, the correlation between MC/CR could differ depending on how closely the constructs measured by the MC and CR items are related. In order to create different levels of correlation between MC/CR within each level of MC to CR ratio, different subsets of MC items were chosen based on their correlations with the CR section. Two levels were created: a low correlation condition, by choosing items with the lowest correlations, and a high correlation condition, by choosing items with the highest correlations, while preserving the proportional structure of the categories. Because the correlation between MC/CR for the total test was relatively low, 0.64, it was impossible to create substantially higher correlations. Table 1 contains the final MC to CR ratios and the correlations between MC/CR created from the total test.

Table 1
Summary of Simulation Conditions

Correlation between MC/CR    Test Length (MC:CR Ratio)
                             1:1    1.5:1    2:1
r = L
r = H
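The construction of the low and high correlation conditions described above could be sketched as follows. This is an illustrative outline under assumed inputs (mc_items, cr_total, category, and n_per_category are hypothetical names), not the authors' actual selection code.

```python
import numpy as np

def select_mc_subset(mc_items, cr_total, category, n_per_category, condition="high"):
    """Pick MC items by their correlation with the CR section score.

    mc_items       : (n_examinees, n_mc) array of scored MC responses
    cr_total       : (n_examinees,) array of CR section scores
    category       : length-n_mc array of content-area labels
    n_per_category : dict mapping each content area to the number of items to keep,
                     chosen so the six sub content areas stay proportionally represented
    condition      : "high" keeps the items most correlated with CR, "low" the least
    """
    category = np.asarray(category)
    # Correlation of each MC item with the CR section score.
    item_cr_corr = np.array(
        [np.corrcoef(mc_items[:, j], cr_total)[0, 1] for j in range(mc_items.shape[1])]
    )
    keep = []
    for area, k in n_per_category.items():
        idx = np.flatnonzero(category == area)
        order = idx[np.argsort(item_cr_corr[idx])]      # ascending by correlation with CR
        keep.extend(order[-k:] if condition == "high" else order[:k])
    return np.sort(np.array(keep))
```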

For each of these six simulation conditions, 500 random samples were created by bootstrapping (random sampling with replacement, keeping the sample sizes the same as in the original samples). Trend score equating through the MC plus trend CR anchor and MC anchor NEAT equating were used to obtain raw-to-raw conversion lines for all replicated conditions. Chained linear and chained equipercentile equating methods were used in both the trend and the MC anchor NEAT equating designs. The efficiency of the MC anchor NEAT design method was evaluated using the trend score equating as the baseline. The differences between the raw-to-equated-raw conversion lines obtained in the MC anchor NEAT design method and in the trend scoring method were evaluated using average weighted differences (AvgDIF) and root mean squared differences (RMSD). The AvgDIF was calculated as a weighted sum of the average differences across the 500 replications across all score points, with the following formula:

$$\mathrm{AvgDIF} = \sum_{j} p_j \left[ \frac{1}{I} \sum_{i=1}^{I} \bigl( e_y(x_{ij}) - \tilde{e}_y(x_{ij}) \bigr) \right],$$

where $p_j$ is the raw proportion of examinees at score $x_j$ in the total population data, $I$ is the number of replications in the simulation ($i = 1$ to $I$, and $I = 500$), $e_y(x_{ij})$ is the trend scoring method equated score at score $x_j$ for the $i$th replication, and $\tilde{e}_y(x_{ij})$ is the MC anchor NEAT design method equated score at score $x_j$ for the $i$th replication.
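To make the resampling and equating step above concrete, the following is a rough Python sketch of one bootstrap replication under the MC anchor NEAT design using the chained linear method, written in the standard textbook form (link X to the anchor in the new-form group, then the anchor to Y in the reference group). The function and variable names are hypothetical, and this is not the code used in the study; the trend scoring replication would look the same with the MC plus trend CR anchor scores substituted for the MC-only anchors.

```python
import numpy as np

def chained_linear(total_new, anchor_new, total_ref, anchor_ref):
    """Return a function converting new-form raw scores to reference-form raw scores
    via chained linear equating: X -> anchor (new group), then anchor -> Y (ref group)."""
    def to_anchor(x):   # linear link estimated in the new-form (reprint) group
        return anchor_new.mean() + anchor_new.std() / total_new.std() * (x - total_new.mean())
    def to_y(a):        # linear link estimated in the reference (trend) group
        return total_ref.mean() + total_ref.std() / anchor_ref.std() * (a - anchor_ref.mean())
    return lambda x: to_y(to_anchor(x))

def one_replication(rng, total_x, anchor_mc_x, total_y, anchor_mc_y, score_points):
    """Resample both groups with replacement (same sizes as the original samples)
    and return the raw-to-raw conversion for the MC anchor NEAT design."""
    bx = rng.choice(len(total_x), size=len(total_x), replace=True)
    by = rng.choice(len(total_y), size=len(total_y), replace=True)
    convert = chained_linear(total_x[bx], anchor_mc_x[bx], total_y[by], anchor_mc_y[by])
    return convert(score_points)

# Example driver for 500 replications over all raw score points (0..60 for the 1:1 test):
# rng = np.random.default_rng(2009)
# score_points = np.arange(0, 61)
# conversions = np.array([one_replication(rng, total_x, anchor_mc_x,
#                                         total_y, anchor_mc_y, score_points)
#                         for _ in range(500)])
```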

The RMSD was calculated using the following formula:

$$\mathrm{RMSD} = \sqrt{\sum_{j} p_j \left[ \frac{1}{I} \sum_{i=1}^{I} \bigl( e_y(x_{ij}) - \tilde{e}_y(x_{ij}) \bigr)^2 \right]}.$$

The AvgDIF and RMSD should be close to zero if the MC anchor NEAT design method works as well as the trend scoring method. Besides these summary indices, the conditional RMSD (CRMSD) at each score level was also calculated and plotted to compare results across the different conditions. The CRMSD at each score level was calculated as follows:

$$\mathrm{CRMSD}_j = \sqrt{\frac{1}{I} \sum_{i=1}^{I} \bigl( e_y(x_{ij}) - \tilde{e}_y(x_{ij}) \bigr)^2}.$$

The AvgDIF, RMSD, and CRMSD were evaluated based on a Difference That Matters (DTM), which is half of a reporting unit (i.e., 0.5).

Results

First, the summary statistics of the reprint and reference form samples, including means and SDs of total and anchor scores and correlations between total and anchor, are presented across the simulated conditions. The summary results for AvgDIF and RMSD are then presented. The plotted CRMSDs for the score range from the 5th to the 95th percentile, where most examinees were located, are presented last, to examine the differences in equating between the MC anchor NEAT design and trend scoring methods.
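Expressed in code, the three indices above reduce to a few array operations. The sketch below assumes hypothetical (I, n_scores) arrays trend_conv and mc_conv holding the equated scores from the two designs for each replication, and a weights vector of the raw proportions p_j; the names and the packaging into one function are illustrative only.

```python
import numpy as np

def comparison_indices(trend_conv, mc_conv, weights, dtm=0.5):
    """Compare MC anchor NEAT conversions against trend scoring conversions.

    trend_conv, mc_conv : (I, n_scores) equated scores, one row per bootstrap replication
    weights             : (n_scores,) raw proportions p_j of examinees at each score point
    dtm                 : Difference That Matters, half a reporting unit
    """
    diff = trend_conv - mc_conv                       # e_y(x_ij) - e~_y(x_ij)
    avg_dif = np.sum(weights * diff.mean(axis=0))     # weighted average difference (AvgDIF)
    rmsd = np.sqrt(np.sum(weights * (diff ** 2).mean(axis=0)))
    crmsd = np.sqrt((diff ** 2).mean(axis=0))         # conditional RMSD at each score point
    return {
        "AvgDIF": avg_dif,
        "RMSD": rmsd,
        "CRMSD": crmsd,
        "flagged_scores": np.flatnonzero(crmsd > dtm),  # score points exceeding the DTM
    }
```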

Summary Statistics for Equating Samples

Tables 2 and 3 contain the summary statistics for the equating samples for the MC anchor NEAT design method and the trend scoring method, respectively. The summary statistics include means and SDs of total and anchor scores for the reprint and reference form samples, correlations between total and anchor scores, and correlations between MC/CR. Since, for both the MC anchor NEAT design and trend scoring methods, the operational tests remained the same, the correlations between MC/CR are only included in Table 2, for the MC anchor NEAT design method.

Table 2
Summary Statistics across Simulated Conditions for the MC Anchor NEAT Design Method

                                   Test Length (MC:CR Ratio)
                                   1:1                  1.5:1                2:1
                                   Reprint  Reference   Reprint  Reference   Reprint  Reference
r = L  Sample Size (N)
       Total Score    Mean
                      SD
       Anchor Score   Mean
                      SD
       Correlation    Total/Anchor
                      MC/CR
r = H  Sample Size (N)
       Total Score    Mean
                      SD
       Anchor Score   Mean
                      SD
       Correlation    Total/Anchor
                      MC/CR

As shown in Table 2, as the MC to CR ratio increased, the correlation between the total and anchor scores also increased. The increase was more apparent for the low correlation condition than for the high correlation condition. Similarly, as the correlation between MC/CR increased, the correlation between the total and anchor scores increased. As the two factors increased, the quality of the MC section as a representative anchor improved.

Across simulated conditions, evaluation of the total versus anchor means and SDs would reach the same conclusion about the direction of the scoring shift. Take, for example, the condition with an MC to CR ratio of 1:1 and low correlations between MC/CR. The SDs of the total and anchor scores for the reprint and reference forms were close to each other. The anchor means showed that the reprint form sample was weaker than the reference form sample (18.76 < 19.15). However, the total means showed that the reprint form had a higher mean than the reference sample (30.86 > 30.20). Since the same form was given again, the higher mean for the weaker reprint form sample indicated that the raters in the reprint form administration were more lenient.

Similar trends in the correlation between total and anchor scores were found for the trend scoring method. As the MC to CR ratio and the correlation between MC/CR increased, the correlation between total and anchor increased. However, since all MC and trend CR items were used as the anchor, the increase was minimal. There was always a weaker relationship between total and anchor scores for the reference form. This, as noted earlier, is because the reference form anchor scores were calculated using the rescored CR ratings from the reprint administration.

Table 3
Summary Statistics across Simulated Conditions for the Trend Scoring Method

                                   Test Length (MC:CR Ratio)
                                   1:1                  1.5:1                2:1
                                   Reprint  Reference   Reprint  Reference   Reprint  Reference
r = L  Sample Size (N)
       Total Score    Mean
                      SD
       Anchor Score   Mean
                      SD
       r(Total/Anchor)
r = H  Sample Size (N)
       Total Score    Mean
                      SD
       Anchor Score   Mean
                      SD
       r(Total/Anchor)

Average Weighted Differences (AvgDIF)

Table 4 contains the AvgDIF results. Across all three factors (MC to CR ratio, correlation between MC/CR, and equating method), the AvgDIFs were consistently smaller than the DTM. As the MC to CR ratio increased, the AvgDIFs became closer to zero. As the correlation between MC/CR increased, the AvgDIF became closer to zero only for the 1:1 condition. For the other MC to CR ratio conditions, and across the different equating methods, the AvgDIFs were close to each other in magnitude, not considering their direction (differences of 0.07 or smaller). This is partly because the AvgDIF was affected by

cancellation of differences in different directions. Thus, the AvgDIF was not as informative as the RMSD, which eliminated the cancellation effect by squaring the differences.

Table 4
Average Weighted Differences (AvgDIF) across Simulated Conditions

                                    Test Length (MC:CR Ratio)
                                    1:1    1.5:1    2:1
Chained Equipercentile    r = L
                          r = H
Chained Linear            r = L
                          r = H

Root Mean Squared Differences (RMSD)

Table 5 contains the RMSD results for the different simulated conditions. For the chained equipercentile equating method, the RMSD was smaller than the DTM only when there was a large MC section (MC to CR ratio of 2:1) and the correlation between MC/CR was high. For the chained linear method, the RMSDs were smaller than the DTM for all conditions. The same pattern of results was found as for the AvgDIF: generally, as the MC to CR ratio and the correlation between MC/CR increased, the RMSD decreased. The improvement was trivial to none when the correlation between MC/CR increased for the chained linear method with the 1.5:1 and 2:1 MC to CR ratios. This is mainly because the differences between the two correlation conditions were rather small, a limitation of the tests being generated from a test with a relatively moderate correlation between MC/CR.

Table 5
Root Mean Squared Differences (RMSD) across Simulated Conditions

                                    Test Length (MC:CR Ratio)
                                    1:1    1.5:1    2:1
Chained Equipercentile    r = L
                          r = H
Chained Linear            r = L
                          r = H

Conditional Root Mean Squared Differences (CRMSD)

Figure 3 presents the plotted CRMSDs for the first test structure, with a 1:1 MC to CR ratio. Only CRMSDs for the score range from the 5th to the 95th percentile were plotted, since beyond this range the data were too sparse to obtain stable CRMSD estimates. For the chained equipercentile method, the CRMSDs were smaller than the DTM for only a small range of scores in the middle, where the scores clustered. For the chained linear method, the CRMSDs were smaller than the DTM for most of the score scale, except for the extreme ends where there was little data. For both methods, as the correlation between MC/CR increased, the CRMSDs decreased, with the plotted line falling further below the DTM line.

Figure 3. CRMSDs for tests with an MC to CR ratio of 1:1 (panels for r = L and r = H; CRMSD plotted against raw scores for the chained equipercentile and chained linear methods).

Figure 4 presents the plotted CRMSDs for tests with an MC to CR ratio of 1.5:1. The same patterns of results were found as in the first set of CRMSD results; however, the CRMSDs were consistently smaller than those found in the first set. As the MC to CR ratio increased, the CRMSDs decreased. The chained linear method consistently produced CRMSDs smaller than the DTM for all except three to four score points at the lower end of the score scale. The chained equipercentile method did not do as well, especially for the low correlation condition. For the high correlation condition, the CRMSDs were smaller than the DTM for the middle of the score scale. Investigation of the new form score distribution revealed that for scores below 21 and above 43 the frequencies dropped to below 20, which explains the large CRMSDs obtained there.

Figure 4. CRMSDs for tests with an MC to CR ratio of 1.5:1 (panels for r = L and r = H; CRMSD plotted against raw scores for the chained equipercentile and chained linear methods).

Figure 5 presents the plotted CRMSDs for tests with an MC to CR ratio of 2:1. The same patterns of results were again found as in the previous sets of CRMSD results, and the CRMSDs became even smaller than in the two previous sets. The chained linear method consistently produced CRMSDs smaller than the DTM for all except four score points at the lower end of the score scale. For the chained equipercentile method with the high correlation condition, the CRMSDs were smaller than the DTM for a larger

range of the score scale in the middle. Investigation of the new form score distribution again revealed that at both ends of the score scale, where the CRMSDs were larger than the DTM, the frequencies of scores dropped to below 20.

Figure 5. CRMSDs for tests with an MC to CR ratio of 2:1 (panels for r = L and r = H; CRMSD plotted against raw scores for the chained equipercentile and chained linear methods).

Conclusions and Future Directions

Results from the simulation indicated that the MC anchor NEAT design method worked comparably to the trend scoring method in terms of the average differences summarized across the score scale (AvgDIF). However, because of the cancellation effect, the AvgDIF tends to come out small and can disguise large differences between conversion lines. The RMSD eliminates the cancellation problem of the AvgDIF and evaluates the degree of deviation between the conversion lines obtained from the MC anchor NEAT design method versus the trend scoring method. The RMSD and CRMSD results showed that the MC to CR ratio and the correlation between MC/CR had an impact on the MC anchor NEAT design method's ability to produce results comparable to the trend scoring method. As these two factors increased, the RMSDs and CRMSDs decreased. When the chained linear method was used, the MC anchor NEAT design method consistently produced RMSDs and CRMSDs smaller than the DTM across simulated

conditions. When the chained equipercentile method was used, the RMSD was acceptable (smaller than the DTM) only with the higher MC to CR ratio of 2:1 and the higher correlation between MC/CR. When a mixed-format test is equated with CR items evaluated by different sets of raters, the equating relationship is often nonlinear due to the variable nature of the ratings across the two sets of raters. Thus, it is recommended that the MC anchor NEAT design method be used for tests with a relatively large proportion of MC items (constituting at least 67% of the score, with 60 items or more) and a moderate correlation between MC/CR (0.64 or higher).

Limitations and Future Directions

The characteristics of the total test from which the subtests were created limited the factors that could be created in this study. Because the correlation between the MC and CR sections for the total test was relatively moderate, 0.64, it was impossible to create conditions with substantially higher correlations between MC/CR. Because the inter-item correlations among the MC items were relatively high, it was also rather difficult to create different levels of correlation between MC/CR while maintaining the proportional structure of the different categories on the test.

Another factor that limited the generalizability of this study is the fact that the rescoring of the trend papers used a different scoring strategy than the operational papers: the trend papers were only rescored once, while the operational papers were always scored twice by two different raters. The different scoring strategies made it necessary to use the NEAT design with the MC plus trend CR anchor for equating. When the trend papers are scored with the same scoring strategy as the operational papers, a single group design can be used, equating the two sets of scores obtained on the trend sample. The final

conversion line can then be applied to the reprint form, which contains the same items scored by the same raters who rescored the trend papers. The single group trend scoring design provides a better criterion for evaluation, since the single group design has been shown to have much less error than the NEAT design (see Thorndike, 1982; Kolen & Brennan, 2004).

The authors of this study are currently looking into other tests with different characteristics (MC to CR ratio, correlation between MC/CR, inherent multidimensionality of the construct, etc.) to supplement this study. The advantage of simulation from real data is appreciated by many researchers. Thus, finding other tests with different structures will help answer what kind of mixed-format test can do well with the more cost-effective method and, hopefully, provide guidance for future mixed-format test development.

References

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.

Paek, I., & Kim, S. (2007, April). Empirical investigation of alternatives for assessing scoring consistency on constructed response items in mixed format tests. Paper presented at the 2007 annual meeting of the American Educational Research Association, Chicago, IL.

Tate, R. L. (1999). A cautionary note on IRT-based linking of tests with polytomous items. Journal of Educational Measurement, 36(4).

Tate, R. L. (2000). Performance of a proposed method for the linking of mixed format tests with constructed response and multiple choice items. Journal of Educational Measurement, 37.

Thorndike, R. L. (1982). Applied psychometrics. Boston: Houghton.

Walker, M. (2007, April). Criteria to consider when reporting constructed-response scores. Paper presented at the 2007 annual meeting of the National Council on Measurement in Education, Chicago, IL.


PAIN INTERFERENCE. ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.1 Pain Interference PROMIS-Ca Bank v1.0 Pain Interference* PROMIS Item Bank v1.1 Pain Interference PROMIS Item Bank v1.0 Pain Interference* PROMIS Short Form v1.0 Pain Interference 4a PROMIS Short Form v1.0 Pain Interference 6a PROMIS Short Form v1.0 Pain Interference

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

An Introduction to Missing Data in the Context of Differential Item Functioning

An Introduction to Missing Data in the Context of Differential Item Functioning A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

IAPT: Regression. Regression analyses

IAPT: Regression. Regression analyses Regression analyses IAPT: Regression Regression is the rather strange name given to a set of methods for predicting one variable from another. The data shown in Table 1 and come from a student project

More information

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments:

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments: PROMIS Bank v1.0 - Physical Function* PROMIS Short Form v1.0 Physical Function 4a* PROMIS Short Form v1.0-physical Function 6a* PROMIS Short Form v1.0-physical Function 8a* PROMIS Short Form v1.0 Physical

More information

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

Sum of Neurally Distinct Stimulus- and Task-Related Components.

Sum of Neurally Distinct Stimulus- and Task-Related Components. SUPPLEMENTARY MATERIAL for Cardoso et al. 22 The Neuroimaging Signal is a Linear Sum of Neurally Distinct Stimulus- and Task-Related Components. : Appendix: Homogeneous Linear ( Null ) and Modified Linear

More information

MEASUREMENT OF SKILLED PERFORMANCE

MEASUREMENT OF SKILLED PERFORMANCE MEASUREMENT OF SKILLED PERFORMANCE Name: Score: Part I: Introduction The most common tasks used for evaluating performance in motor behavior may be placed into three categories: time, response magnitude,

More information

9 research designs likely for PSYC 2100

9 research designs likely for PSYC 2100 9 research designs likely for PSYC 2100 1) 1 factor, 2 levels, 1 group (one group gets both treatment levels) related samples t-test (compare means of 2 levels only) 2) 1 factor, 2 levels, 2 groups (one

More information

Classical Psychophysical Methods (cont.)

Classical Psychophysical Methods (cont.) Classical Psychophysical Methods (cont.) 1 Outline Method of Adjustment Method of Limits Method of Constant Stimuli Probit Analysis 2 Method of Constant Stimuli A set of equally spaced levels of the stimulus

More information

Simple Linear Regression the model, estimation and testing

Simple Linear Regression the model, estimation and testing Simple Linear Regression the model, estimation and testing Lecture No. 05 Example 1 A production manager has compared the dexterity test scores of five assembly-line employees with their hourly productivity.

More information

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS 3 CONCEPTUAL FOUNDATIONS OF STATISTICS In this chapter, we examine the conceptual foundations of statistics. The goal is to give you an appreciation and conceptual understanding of some basic statistical

More information

PSYCHOLOGICAL STRESS EXPERIENCES

PSYCHOLOGICAL STRESS EXPERIENCES PSYCHOLOGICAL STRESS EXPERIENCES A brief guide to the PROMIS Pediatric and Parent Proxy Report Psychological Stress Experiences instruments: PEDIATRIC PROMIS Pediatric Item Bank v1.0 Psychological Stress

More information

INTRODUCTION TO ASSESSMENT OPTIONS

INTRODUCTION TO ASSESSMENT OPTIONS DEPRESSION A brief guide to the PROMIS Depression instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.0 Depression PROMIS Pediatric Item Bank v2.0 Depressive Symptoms PROMIS Pediatric

More information

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, April 23-25, 2003 The Classification Accuracy of Measurement Decision Theory Lawrence Rudner University

More information