

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests

Xuan Tan
Sooyeon Kim
Insu Paek
Bihua Xiang

ETS, Princeton, NJ

Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), April 13-17, 2009, San Diego, CA.

Unpublished Work Copyright 2009 by Educational Testing Service. All Rights Reserved. These materials are an unpublished, proprietary work of ETS. Any limited distribution shall not constitute publication. This work may not be reproduced or distributed to third parties without ETS's prior written consent. Submit all requests through www.ets.org/legal/index.html. Educational Testing Service, ETS, the ETS logo, and Listening. Learning. Leading. are registered trademarks of Educational Testing Service (ETS).

Abstract

This study evaluated the comparability of the multiple-choice (MC) item anchor to the MC plus trend-scored constructed-response (CR) item anchor for adjusting scoring shifts in CR items on reprint forms of mixed-format tests, using linear and nonlinear equating methods. Two factors, the MC to CR ratio and the correlation between MC/CR, were manipulated and evaluated to examine their impact on the effectiveness of the MC item anchor. Simulation based on real test data was conducted by selecting subsets of MC items with differing correlations with the CR section to create three levels of MC to CR ratio and two levels of correlation between MC/CR. The results showed that the MC anchor NEAT design method worked comparably to the trend scoring method when the ratio of MC to CR was high and the correlation between MC/CR was moderately high.

Acknowledgement

The authors would like to thank Stephanie Fournier-Zajac, Thomas Kaufmann, and Danielle Siwek for extracting and cleaning up the data for the analyses. Many thanks to Longjuan Liang for providing the code for running the chained equipercentile equatings.

Introduction

Many large-scale testing programs increasingly use constructed-response (CR) items in their assessments, often together with multiple-choice (MC) items. CR items tend to more closely resemble the real-world tasks associated with the construct to be measured. Along with their benefits, CR items bring certain complications that must be addressed to assure quality and equity in the assessments (Walker, 2007). For example, because human raters score these items, scoring standards can be applied more leniently or harshly than planned, and thus CR items involve more scoring error than MC items. Even when the same form (called a reprint) is reused at a different time, form difficulty may not be constant because the raters change: the common CR items are not really common, because different raters scored them. In this case, the original test score conversion may no longer apply to the reprint form.

Tate (1999, 2000) articulated a solution, called the trend scoring method, to the problem of scoring standards that change over time in the context of the nonequivalent groups anchor test (NEAT) design. A representative sample of examinee responses from Administration 1 (the reference administration) is inserted into the rating process for Administration 2 (the new administration). These responses, obtained from the reference group of examinees, are rescored by the raters scoring responses on the same items for the new group of examinees. Thus, these trend papers have two sets of scores associated with them: one from the old set of raters and one from the new rater group. Equating can then be used to adjust for the scoring shift. In short, trend scoring rescores the same examinee papers across scoring sessions, which removes group ability differences, in order to detect and control for any scoring shift due to human raters over time. By its

nature, the trend scoring method has statistical strengths in detecting and adjusting for a scoring shift. However, this method is expensive and difficult to implement in practice. The retention and retrieval of trend papers to be interspersed with papers in later administrations require extra care in test planning and data handling, and the rescoring of trend papers adds significantly to operational cost and scoring time.

Paek and Kim (2007) suggested alternatives for monitoring CR scoring shift in reprint mixed-format tests without trend scoring. Two differential bundle functioning (DBF) based methods were used, examining the conditional CR score distributions across the two administrations after matching on the MC items. When there is a notable scoring shift, the DBF analysis will show a shift from zero favoring one administration over the other. Using real data, they showed that the DBF methods were comparable to the trend scoring method in detecting scoring shift. When a scoring shift is detected, however, it is unclear whether using the MC items as an anchor to adjust for the difference in difficulty due to the scoring shift will be as effective as the trend scoring method. The success of the MC anchor may depend on factors such as the magnitude of the correlation between MC and CR and the MC to CR ratio (i.e., the proportion of MC points to the total points available on the test). As the correlation between MC and CR items and the MC to CR ratio decrease, the quality of the MC items as a representative anchor of the total test declines, and with it the MC anchor equating method's ability to accurately adjust for scoring shift.

The Trend Scoring Method

Figure 1 illustrates the trend score equating design used for equating mixed-format tests under the NEAT design framework. This equating design can be used for

both new form and reprint form equating. For the new form case, the anchor is comprised of a set of MC and CR items common to the new and reference forms. For the reprint form case, the anchor is comprised of all MC and CR items, since the same form is used again. For illustration purposes, Form X (black letters) denotes the new or reprint form given in Administration 2, and Form Y (green letters) denotes the reference form given in Administration 1. The anchor for each form is denoted by XA for Form X and YA for Form Y. Since two sets of raters are involved for the CR section, the rater group in Administration 1 is denoted by blue boxes and the rater group in Administration 2 by red boxes. A subset of examinee CR responses from Form Y (the trend papers) is selected and rated again in Administration 2 along with the Form X responses. This trend sample is used as the reference sample.

As illustrated by Figure 1, for Form X the total score is calculated by combining scores on all MC items and ratings on all CR items rated by the Red team, and the anchor score, XA, is calculated by combining scores on the common MC items and the first ratings on the common CR items, also rated by the Red team. In real operational settings, the trend papers are usually scored by only one rater (instead of two raters, as the operational papers are) to save cost. Thus, XA is usually calculated using only the first ratings of the common CR items, to keep a consistent scoring scheme between the new form and reference form anchors. For Form Y, the total score is calculated by combining scores on all MC items and ratings on all CR items rated by the Blue team, and the anchor score, YA, is calculated by combining scores on the common MC items and the single ratings on the common CR items given by the Red team, who rated these trend papers along with the CR responses on Form X.
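The score construction can be summarized in a short sketch. This is a minimal illustration of the design described above, not the authors' code: the array names (x_mc, x_cr_red_r1, y_cr_trend_red, and so on), the toy values, and the weighting variable w are hypothetical, and the operational data layout may differ.

```python
import numpy as np

# Hypothetical toy data (one row per examinee, one column per item); names and
# values are illustrative only -- the paper does not provide code.

# Form X (new/reprint form, Administration 2; CR items scored twice by the Red team)
x_mc = np.array([[1, 0, 1], [1, 1, 1]])           # rights-scored MC items
x_cr_red_r1 = np.array([[2, 3, 1], [3, 3, 2]])    # first Red-team rating per CR item
x_cr_red_r2 = np.array([[2, 2, 1], [3, 3, 3]])    # second Red-team rating per CR item

# Form Y (reference form, Administration 1; CR items scored twice by the Blue team)
y_mc = np.array([[1, 1, 0], [0, 1, 1]])
y_cr_blue_r1 = np.array([[1, 2, 2], [2, 2, 3]])
y_cr_blue_r2 = np.array([[1, 2, 3], [2, 3, 3]])
# Trend papers: the same Form Y responses rescored once by the Red team in Administration 2
y_cr_trend_red = np.array([[2, 2, 3], [2, 3, 3]])

w = 1.0  # CR weighting factor (hypothetical here; 1.6667 in the test described later)

# Form X: the total uses all MC items and both Red-team ratings; the anchor XA uses
# the common MC items and only the first Red-team rating (all items are common for a reprint).
total_x = x_mc.sum(axis=1) + w * (x_cr_red_r1 + x_cr_red_r2).sum(axis=1)
anchor_xa = x_mc.sum(axis=1) + w * x_cr_red_r1.sum(axis=1)

# Form Y: the total keeps the original Administration 1 (Blue-team) scoring, but the
# anchor YA replaces the CR part with the Red-team trend rescores, so that XA and YA
# reflect the same (Administration 2) scoring standard.
total_y = y_mc.sum(axis=1) + w * (y_cr_blue_r1 + y_cr_blue_r2).sum(axis=1)
anchor_ya = y_mc.sum(axis=1) + w * y_cr_trend_red.sum(axis=1)
```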

The only difference between trend score equating and a regular NEAT design equating is the way the anchor score is constructed for the reference form. In this case, CR ratings on the trend papers by raters from Administration 2 are used as the CR anchor instead of the original ratings by raters from Administration 1. We can also call this design the MC plus trend CR anchor NEAT design. This construction usually reduces the correlation between the total and the anchor for the reference form.

Figure 1. The trend score equating design (MC plus trend CR anchor NEAT design), showing Total X, Anchor XA, Anchor YA, and Total Y assembled from the MC and CR sections of Forms X and Y.

The Non-trend Method: MC Anchor NEAT Design

Figure 2 illustrates the MC anchor NEAT design method, which is a commonly used design. The total scores on Form X and Form Y are calculated the same way as in the trend scoring method. The anchor scores on Forms X and Y are calculated based only

on the common MC items. For the new form case, the common MC items are a subset of the MC items shared between the new and reference forms. For the reprint case, the common MC items are all the MC items, since the same form is used again.

Figure 2. The MC anchor NEAT design method, showing Total X, Anchor XA, Anchor YA, and Total Y, with the anchors containing only the MC sections of Forms X and Y.

Purpose

The purpose of this study is to determine under what conditions the MC anchor NEAT design method is as effective as the trend scoring method in adjusting reprint mixed-format tests for differential difficulty caused by scoring shifts in the CR common items. The correlation between MC/CR and the MC to CR ratio were manipulated to evaluate the extent to which the MC anchor NEAT design method would work as well as the trend scoring method.

Method

Simulation was conducted based on real data, using a bootstrap technique, to alleviate the problem of unrealistic data limiting the generalizability of results. Different test structures were created by selecting subsets of items from a real test to create tests with different MC to CR ratios. Within each test structure, different subsets of items were chosen to form tests with different correlations between the MC and CR items.

Data

A test that assesses the knowledge and competence of beginning teachers of middle school science was selected because it had a long MC section (90 MC items) along with three CR items. The MC section was rights-scored, and the CR section was scored by two raters on a rating scale of 0-3 and weighted by a factor of 1.6667, so that the CR section contributes a maximum of 30 points (3 items × 2 ratings × 3 points × 1.6667). The total test score is, therefore, 120 points (90 MC + 30 CR points). One reprint form administration in which the trend scoring method was used for equating was selected for the simulation. The reprint form sample consisted of 1,011 examinees; the reference form sample (trend sample) consisted of 461 examinees. The two sets of ratings on the trend papers (the original ratings from the reference form administration and the rescored ratings from the reprint administration) indicated a scoring shift toward more lenient scoring by the reprint form administration raters.

Methods and Procedures

Tests with higher MC to CR ratios are expected to work better with the MC anchor NEAT design method since, with more MC items in the anchor, the reliability of the anchor and the correlation between anchor and total are higher. Different numbers of MC items were chosen to create three levels of MC to CR ratio: 30 MC

items, 45 MC items, and 60 MC items. The three CR items were kept, resulting in total score points of 60, 75, and 90 for the three levels and in three different MC to CR ratios: 1:1, 1.5:1, and 2:1. There are six content subareas in the MC section, and the subsets of items selected for each level maintained the proportional structure of these categories.

The second factor is the correlation between MC/CR. Tests with higher correlations between MC/CR are expected to work better with the MC anchor NEAT design method, since the MC items would then serve as an adequate anchor, representing a mini-test with higher correlations between anchor and total. The correlation between MC/CR is affected by the reliability of each section as well. However, the focus of this study is on different levels of correlation between MC/CR caused by the inherent structure of the construct: for tests with similar reliabilities, the correlation between MC/CR can differ depending on how closely the constructs measured by the MC and CR items are related. In order to create different levels of correlation between MC/CR within each level of MC to CR ratio, different subsets of MC items were chosen based on their correlations with the CR section. Two levels were created: a low correlation condition, by choosing items with the lowest correlations, and a high correlation condition, by choosing items with the highest correlations, while preserving the proportional structure of the categories. Because the correlation between MC/CR for the total test was relatively low, 0.64, it was impossible to create higher correlations such as 0.80. Table 1 contains the final MC to CR ratios and the correlations between MC/CR created from the total test.

Table 1
Summary of Simulation Conditions

Correlation          Test Length (MC:CR Ratio)
between MC/CR        1:1      1.5:1    2:1
r = L                0.43     0.52     0.59
r = H                0.64     0.65     0.65

For each of these six simulation conditions, 500 random samples were created by bootstrapping (random sampling with replacement, keeping sample sizes the same as in the original samples). Trend score equating through the MC plus trend CR anchor and MC anchor NEAT equating were used to obtain raw-to-raw conversion lines for all replicated conditions. Chained linear and chained equipercentile equating methods were used in both the trend and the MC anchor NEAT equating designs. The efficiency of the MC anchor NEAT design method was evaluated using the trend score equating as the baseline. The differences between the raw-to-equated-raw conversion lines obtained from the MC anchor NEAT design method and from the trend scoring method were evaluated using average weighted differences (AvgDIF) and root mean squared differences (RMSD).
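The following sketch shows one way such a replication loop could be implemented: bootstrap both samples, then obtain the two raw-to-raw conversion lines with chained linear equating. It is an illustration under assumptions, not the authors' code; the dictionary keys ("total", "mc_anchor", "trend_anchor"), the names new_group, ref_group, and x_grid, and the random seed are hypothetical, and the chained equipercentile version used in the study would replace the two linear links with percentile-rank-based links.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility


def chained_linear(x_new, a_new, a_ref, y_ref, x_grid):
    """Chained linear equating under the NEAT design: link X to the anchor in the
    new-form (reprint) group, link the anchor to Y in the reference group, and
    compose the two linear functions."""
    a_of_x = a_new.mean() + a_new.std(ddof=1) / x_new.std(ddof=1) * (x_grid - x_new.mean())
    return y_ref.mean() + y_ref.std(ddof=1) / a_ref.std(ddof=1) * (a_of_x - a_ref.mean())


def resample(group):
    """Bootstrap examinees within a group, keeping total and anchor scores paired
    and the sample size equal to the original."""
    n = len(group["total"])
    idx = rng.integers(0, n, size=n)
    return {k: v[idx] for k, v in group.items()}


def one_replication(new, ref, x_grid):
    """One bootstrap replication: returns the trend-scoring and MC-anchor
    raw-to-raw conversion lines evaluated on x_grid."""
    bn, br = resample(new), resample(ref)
    # Trend scoring method: anchor includes CR scored under the Administration 2
    # standard (XA and YA in the design description above)
    e_trend = chained_linear(bn["total"], bn["trend_anchor"],
                             br["trend_anchor"], br["total"], x_grid)
    # Non-trend method: anchor is the common MC items only
    e_mc = chained_linear(bn["total"], bn["mc_anchor"],
                          br["mc_anchor"], br["total"], x_grid)
    return e_trend, e_mc


# Hypothetical usage: new_group and ref_group are dicts of paired score arrays
# (keys "total", "mc_anchor", "trend_anchor"); x_grid covers the raw score range.
# lines = [one_replication(new_group, ref_group, x_grid) for _ in range(500)]
```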

The AvgDIF was calculated as a weighted sum of the average differences across the 500 replications and across all score points:

\mathrm{AvgDIF} = \frac{1}{I}\sum_{i}\sum_{j} p_j \left[ e_y(x_{ij}) - \hat{e}_y(x_{ij}) \right],

where p_j is the raw proportion of examinees at score x_j in the total population data, I is the number of replications in the simulation (i = 1 to I, and I = 500), e_y(x_ij) is the trend scoring method equated score at score x_j for the ith replication, and \hat{e}_y(x_ij) is the MC anchor NEAT design method equated score at score x_j for the ith replication. The RMSD was calculated as

\mathrm{RMSD} = \sqrt{ \frac{1}{I}\sum_{i}\sum_{j} p_j \left[ e_y(x_{ij}) - \hat{e}_y(x_{ij}) \right]^2 }.

The AvgDIF and RMSD should be close to zero if the MC anchor NEAT design method works as well as the trend scoring method. Besides these summary indices, the conditional RMSD (CRMSD) at each score level was also calculated and plotted to compare results across conditions:

\mathrm{CRMSD}_j = \sqrt{ \frac{1}{I}\sum_{i} \left[ e_y(x_{ij}) - \hat{e}_y(x_{ij}) \right]^2 }.

The AvgDIF, RMSD, and CRMSD were evaluated against a Difference That Matters (DTM) of half a reporting unit (i.e., 0.5).
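As a concrete restatement of these formulas, the sketch below computes AvgDIF, RMSD, and CRMSD from stacked conversion lines and compares the result with the DTM. The array names and the I × J layout are assumptions for illustration (they match the hypothetical replication output in the earlier sketch); only the DTM value of 0.5 comes from the text.

```python
import numpy as np

DTM = 0.5  # Difference That Matters: half of a reporting unit


def summarize(e_trend, e_mc, p_j):
    """Compare two sets of conversion lines over I replications and J score points.

    e_trend, e_mc : arrays of shape (I, J) holding the trend-scoring and MC-anchor
                    equated scores at each raw score point x_j for each replication i.
    p_j           : length-J array of raw proportions of examinees at each score point.
    """
    diff = e_trend - e_mc                                # differences, shape (I, J)
    n_rep = diff.shape[0]
    avg_dif = np.sum(p_j * diff) / n_rep                 # weighted average difference
    rmsd = np.sqrt(np.sum(p_j * diff ** 2) / n_rep)      # weighted root mean squared difference
    crmsd = np.sqrt(np.mean(diff ** 2, axis=0))          # conditional RMSD at each score point
    return avg_dif, rmsd, crmsd


# Hypothetical usage with the replication output `lines` from the earlier sketch:
# e_trend, e_mc = (np.stack(a) for a in zip(*lines))
# avg_dif, rmsd, crmsd = summarize(e_trend, e_mc, p_j)
# comparable = rmsd < DTM
```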

Results

First, the summary statistics of the reprint and reference form samples, including means and SDs of total and anchor scores and correlations between total and anchor, are presented across the simulated conditions. The summary results for AvgDIF and RMSD are presented next. The plotted CRMSDs for the score range from the 5th to the 95th percentile, where most examinees were located, are presented last to examine the differences in equating between the MC anchor NEAT design and trend scoring methods.

Summary Statistics for Equating Samples

Tables 2 and 3 contain the summary statistics for the equating samples for the MC anchor NEAT design method and the trend scoring method, respectively. The summary statistics include means and SDs of total and anchor scores for the reprint and reference form samples, correlations between total and anchor scores, and correlations between MC/CR. Since the operational tests remained the same for both the MC anchor NEAT design and trend scoring methods, the correlations between MC/CR are reported only in Table 2 for the MC anchor NEAT design method.

Table 2
Summary Statistics across Simulated Conditions for the MC Anchor NEAT Design Method

                                        Test Length (MC:CR Ratio)
                                1:1                  1.5:1                2:1
                                Reprint  Reference   Reprint  Reference   Reprint  Reference
Sample Size (N)                 1011     461         1011     461         1011     461
r = L
  Total Score     Mean          30.86    30.20       40.52    40.11       48.53    48.57
                  SD            7.43     8.07        8.75     9.51        10.70    11.53
  Anchor Score    Mean          18.76    19.15       28.42    29.07       36.43    37.52
                  SD            3.21     3.44        4.56     4.94        6.56     6.99
  Correlation     Total/Anchor  0.75     0.80        0.85     0.87        0.91     0.92
                  MC/CR         0.43     0.52        0.52     0.59        0.59     0.64
r = H
  Total Score     Mean          28.45    28.52       36.53    37.06       45.99    46.77
                  SD            10.14    10.62       12.21    12.69       13.72    14.29
  Anchor Score    Mean          16.35    17.47       24.43    26.02       33.90    35.73
                  SD            5.75     5.88        7.95     8.09        9.53     9.72
  Correlation     Total/Anchor  0.91     0.92        0.94     0.94        0.95     0.96
                  MC/CR         0.64     0.68        0.65     0.68        0.65     0.69

As shown in Table 2, as the MC to CR ratio increased, the correlation between the total and anchor scores also increased. The increase was more apparent for the low correlation condition than for the high correlation condition. Similarly, as the correlation between MC/CR increased, the correlation between the total and anchor scores increased. As the two factors increased, the quality of the MC section as a representative anchor improved. Across the simulated conditions, evaluation of the total versus anchor means and SDs would lead to the same conclusion about the direction of the scoring shift. Take, for example, the condition with an MC to CR ratio of 1:1 and a low correlation between MC/CR. The SDs of the total and anchor scores for the reprint and reference forms were close to each other. The anchor means showed that the reprint form sample was weaker than the reference form sample (18.76 < 19.15). However, the total means showed that the reprint form sample had a higher mean than the reference form sample (30.86 > 30.20). Since the same form was given again, the higher mean for the weaker reprint form sample indicates that the raters in the reprint form administration were more lenient.

Similar trends in the correlation between total and anchor scores were found for the trend scoring method (Table 3). As the MC to CR ratio and the correlation between MC/CR increased, the correlation between total and anchor increased. However, since all MC and trend CR items were used as the anchor, the increase was minimal. There was always a weaker relationship between total and anchor scores for the reference form. This, as mentioned in the introduction, was because the reference form anchor scores were calculated using the CR ratings rescored in the reprint administration.

Table 3
Summary Statistics across Simulated Conditions for the Trend Scoring Method

                                        Test Length (MC:CR Ratio)
                                1:1                  1.5:1                2:1
                                Reprint  Reference   Reprint  Reference   Reprint  Reference
Sample Size (N)                 1011     461         1011     461         1011     461
r = L
  Total Score     Mean          30.86    30.20       40.52    40.11       48.53    48.57
                  SD            7.43     8.07        8.75     9.51        10.70    11.53
  Anchor Score    Mean          31.00    32.16       40.66    42.08       48.67    50.53
                  SD            7.72     8.25        9.02     9.70        10.95    11.67
  r(Total/Anchor)               0.97     0.88        0.98     0.92        0.99     0.94
r = H
  Total Score     Mean          28.45    28.52       36.53    37.06       45.99    46.77
                  SD            10.14    10.62       12.21    12.69       13.72    14.29
  Anchor Score    Mean          28.59    30.48       36.67    39.03       46.14    48.73
                  SD            10.40    10.72       12.43    12.77       13.93    14.37
  r(Total/Anchor)               0.99     0.93        0.99     0.95        0.99     0.96

Average Weighted Differences (AvgDIF)

Table 4 contains the AvgDIF results. Across all three factors (MC to CR ratio, correlation between MC/CR, and equating method), the AvgDIFs were consistently smaller than the DTM. As the MC to CR ratio increased, the AvgDIFs moved closer to zero. As the correlation between MC/CR increased, the AvgDIF moved closer to zero only for the 1:1 condition. For the other MC to CR ratio conditions, and across the different equating methods, the AvgDIFs were close to each other in magnitude, ignoring direction (differences of 0.07 or smaller). This is partly because the AvgDIF was affected by

cancellation of differences in different directions. Thus, the AvgDIF was not as informative as the RMSD, which eliminates the cancellation effect by squaring the differences.

Table 4
Average Weighted Differences (AvgDIF) across Simulated Conditions

                                   Test Length (MC:CR Ratio)
                                   1:1      1.5:1    2:1
Chained Equipercentile   r = L      0.17     0.10     0.02
                         r = H     -0.12    -0.11    -0.08
Chained Linear           r = L      0.22     0.15     0.04
                         r = H     -0.16    -0.15    -0.11

Root Mean Squared Differences (RMSD)

Table 5 contains the RMSD results for the different simulated conditions. For the chained equipercentile equating method, the RMSD was smaller than the DTM only when there was a large MC section (an MC to CR ratio of 2:1) and the correlation between MC/CR was high. For the chained linear method, the RMSDs were smaller than the DTM for all conditions. The same pattern of results was found as for the AvgDIF: generally, as the MC to CR ratio and the correlation between MC/CR increased, the RMSD decreased. The improvement was trivial to nonexistent when the correlation between MC/CR increased for the chained linear method with the 1.5:1 and 2:1 MC to CR ratios. This is mainly because the differences between the two correlation conditions were rather small, a consequence of generating the tests from a total test with only a moderate correlation between MC/CR.

Table 5
Root Mean Squared Differences (RMSD) across Simulated Conditions

                                   Test Length (MC:CR Ratio)
                                   1:1      1.5:1    2:1
Chained Equipercentile   r = L      0.72     0.59     0.57
                         r = H      0.57     0.51     0.46
Chained Linear           r = L      0.44     0.39     0.34
                         r = H      0.37     0.37     0.36

Conditional Root Mean Squared Differences (CRMSD)

Figure 3 presents the plotted CRMSDs for the first test structure, with a 1:1 MC to CR ratio. Only CRMSDs for the score range from the 5th to the 95th percentile were plotted, since beyond this range the data were too sparse to obtain stable CRMSD estimates. For the chained equipercentile method, the CRMSDs were smaller than the DTM only for a small range of scores in the middle, where the scores clustered. For the chained linear method, the CRMSDs were smaller than the DTM for most of the score scale, except at the extreme ends where there was little data. For both methods, as the correlation between MC/CR increased, the CRMSDs decreased, with the plotted line moving further below the DTM line.

Figure 3. CRMSDs for tests with an MC to CR ratio of 1:1 (panels: r = L and r = H; lines: chained equipercentile and chained linear; plotted against raw scores).
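A minimal sketch of the restriction just described, assuming the conditional differences have already been computed on a raw score grid: it trims the comparison to the 5th-95th percentile range of the new-form total scores and flags score points whose CRMSD exceeds the DTM. The function and argument names are hypothetical.

```python
import numpy as np

DTM = 0.5  # half of a reporting unit


def flag_score_points(new_form_total, x_grid, crmsd, dtm=DTM):
    """Keep the comparison inside the 5th-95th percentile range of the new-form
    total scores (outside it the data are too sparse) and return the raw score
    points where the conditional RMSD exceeds the DTM."""
    lo, hi = np.percentile(new_form_total, [5, 95])
    in_range = (x_grid >= lo) & (x_grid <= hi)
    return x_grid[in_range & (crmsd > dtm)]
```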

Figure 4 presents the plotted CRMSDs for tests with an MC to CR ratio of 1.5:1. The same patterns of results were found as in the first set of CRMSD results; however, the CRMSDs were consistently smaller than those in the first set. As the MC to CR ratio increased, the CRMSDs decreased. The chained linear method consistently produced CRMSDs smaller than the DTM for all except three to four score points at the lower end of the score scale. The chained equipercentile method did not do as well, especially for the low correlation condition; for the high correlation condition, the CRMSDs were smaller than the DTM in the middle of the score scale. Investigation of the new form score distribution revealed that for scores below 21 and above 43 the frequencies dropped below 20, which explains the large CRMSDs obtained there.

Figure 4. CRMSDs for tests with an MC to CR ratio of 1.5:1 (panels: r = L and r = H; lines: chained equipercentile and chained linear; plotted against raw scores).

Figure 5 presents the plotted CRMSDs for tests with an MC to CR ratio of 2:1. The same patterns of results were again found as in the previous sets of CRMSD results, and the CRMSDs became even smaller. The chained linear method consistently produced CRMSDs smaller than the DTM for all except four score points at the lower end of the score scale. For the chained equipercentile method with the high correlation condition, the CRMSDs were smaller than the DTM for a larger

range of the score scale in the middle. Investigation of the new form score distribution again revealed that at both ends of the score scale, where the CRMSDs were larger than the DTM, the frequencies of scores dropped below 20.

Figure 5. CRMSDs for tests with an MC to CR ratio of 2:1 (panels: r = L and r = H; lines: chained equipercentile and chained linear; plotted against raw scores).

Conclusions and Future Directions

Results from the simulation indicated that the MC anchor NEAT design method worked comparably to the trend scoring method in terms of the average differences summarized across the score scale (AvgDIF). However, because of the cancellation effect, the AvgDIF tends to come out small and can disguise large differences between conversion lines. The RMSD eliminates the cancellation problem of the AvgDIF and evaluates the degree of deviation between the conversion lines obtained from the MC anchor NEAT design method versus the trend scoring method. The RMSD and CRMSD results showed that the MC to CR ratio and the correlation between MC/CR affected the MC anchor NEAT design method's ability to produce results comparable to the trend scoring method. As these two factors increased, the RMSDs and CRMSDs decreased. When the chained linear method was used, the MC anchor NEAT design method consistently produced RMSDs and CRMSDs smaller than the DTM across simulated

conditions. When the chained equipercentile method was used, the RMSD was acceptable (smaller than the DTM) only with the higher MC to CR ratio of 2:1 and the higher correlation between MC/CR. When a mixed-format test is equated with CR items evaluated by different sets of raters, the equating relationship is often nonlinear due to the variable nature of the ratings across the two sets of raters. Thus, it is recommended that the MC anchor NEAT design method be used for tests with a relatively large proportion of MC items (constituting at least 67% of the score points, with 60 MC items or more) and at least a moderate correlation between MC/CR (0.64 or higher).

Limitations and Future Directions

The characteristics of the total test from which the subtests were created limited the factors that could be created in this study. Because the correlation between the MC and CR sections for the total test was only moderate, 0.64, it was impossible to create conditions with correlations between MC/CR higher than 0.65. Because the inter-item correlations among the MC items were relatively high, it was rather difficult to create different levels of correlation between MC/CR while maintaining the proportional structure of the different categories on the test.

Another factor that limited the generalizability of this study is the fact that the rescoring of the trend papers used a different scoring strategy than the operational papers: the trend papers were rescored only once, while the operational papers were always scored twice by two different raters. The different scoring strategies made it necessary to use the NEAT design with the MC plus trend CR anchor for equating. When the trend papers are scored with the same scoring strategy as the operational papers, a single group design can be used, equating the two sets of scores obtained on the trend sample. The final

conversion line can then be applied to the reprint form, which contains the same items scored by the same raters who rescored the trend papers. The single group trend scoring design provides a better criterion for evaluation, since the single group design has been shown to have much less error than the NEAT design (see Thorndike, 1982; Kolen & Brennan, 2004).

The authors of this study are currently looking into other tests with different characteristics (MC to CR ratio, correlation between MC/CR, inherent multidimensionality of the construct, etc.) to supplement this study. The advantage of simulation based on real data is appreciated by many researchers. Thus, finding other tests with different structures will help answer what kinds of mixed-format tests can do well with the more cost-effective method and, hopefully, provide guidance for future mixed-format test development.

References

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.

Paek, I., & Kim, S. (2007, April). Empirical investigation of alternatives for assessing scoring consistency on constructed response items in mixed format tests. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Tate, R. L. (1999). A cautionary note on IRT-based linking of tests with polytomous items. Journal of Educational Measurement, 36(4), 336-346.

Tate, R. L. (2000). Performance of a proposed method for the linking of mixed format tests with constructed response and multiple choice items. Journal of Educational Measurement, 37, 329-346.

Thorndike, R. L. (1982). Applied psychometrics. Boston: Houghton.

Walker, M. (2007, April). Criteria to consider when reporting constructed-response scores. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.