An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow?


Journal of Educational and Behavioral Statistics
Fall 2006, Vol. 31, No. 3

Michael C. Edwards, The Ohio State University
Jack L. Vevea, The University of California at Santa Cruz

This article examines a subscore augmentation procedure. The approach uses empirical Bayes adjustments and is intended to improve the overall accuracy of measurement when information is scant. Simulations examined the impact of the method on subscale scores under a variety of realistic conditions. The authors focused on two popular scoring methods: summed scores and item response theory scale scores for summed scores. Simulation conditions included the number of subscales, the length (and hence reliability) of the subscales, and the underlying correlations between scales. To examine the relative performance of the augmented scales, the authors computed root mean square error, reliability, the percentage of simulees correctly identified as falling within specific proficiency ranges, and the percentage of simulated individuals for whom the augmented score was closer to the true score than was the nonaugmented score. The general findings and limitations of the study are discussed, and areas for future research are suggested.

Keywords: ability estimation, empirical Bayes, item response theory, subscore augmentation

As the popularity of testing grows in the realm of education, tests that were originally designed for one purpose are frequently pressed into service for others. For example, tests designed to produce reliable scores for ranking individuals may also be expected to provide diagnostic information to interested parties (e.g., the test taker, parents, or teachers). It is difficult to construct a test that can achieve both of these goals adequately. A potential solution to this challenge involves the reporting of subscores: while the overall scores are used to rank individuals, scores on smaller subsections of the test are used to provide feedback specific to a narrow content area. This approach, although intuitively appealing, has a drawback. A test may be constructed so that the overall score is reliable, but such a test may not provide reliable subscores, which are based on less information than the overall score.

This work was supported by the North Carolina Department of Public Instruction. The authors thank Howard Wainer and David Thissen for their comments on earlier drafts of this article.

There are three ways we could address this problem. First, we could increase the length of each subscale until it provides reliable scores. This solution is unrealistic given the time and length constraints often present in large-scale educational testing. Another option is to use adaptive technologies, in the form of computerized adaptive testing (CAT). Such a strategy enables tailored testing to achieve reliable measurement on a subject-by-subject basis, but the realities of educational testing make this solution impractical under many circumstances (see Wainer, 2000). The third possibility is to increase the reliability of diagnostic subscores by incorporating information from the rest of the test. This final procedure is the focus of the current article.

Multivariate Empirical Bayes Estimation

Nearly 50 years after Kelley's (1927) seminal work, what are now called empirical Bayes (EB) methods resurfaced in psychology in the form of multivariate generalizability theory in Cronbach, Gleser, Nanda, and Rajaratnam's The Dependability of Behavioral Measurements (1972). Equation 10.3 in chapter 10 of that book is a multivariate form of Kelley's regressed estimates. Although the language is slightly different (Cronbach et al. discuss universe scores instead of true scores), the underlying theory is the same. The authors note that using all the information in the profile will produce a superior estimate of the subscale true score whenever the subscales are correlated. They also point out that the new, augmented estimates will always have smaller error variance than the nonaugmented estimates.

Kelley's (1927) univariate regression of true score on observed score can be rewritten for the multivariate case as

$$\hat{\tau} = \bar{x} + B(x - \bar{x}),$$

where $\bar{x}$ is a vector of subscale means, $x$ is a vector of subscale scores, and $B$ is a matrix of reliability-based weights. The least-squares estimate of $B$ is the product of the true score covariance matrix and the inverse of the observed score covariance matrix,

$$B = \Sigma_t \Sigma_x^{-1},$$

where $\Sigma_t$ is the true score covariance matrix and $\Sigma_x$ is the observed score covariance matrix. Given data, both of these covariance matrices may be estimated: $\Sigma_t$ by $S_t$, and $\Sigma_x$ by $S_x$. $S_x$ is an estimate of the observed score covariance matrix and is easily obtained from the data. As shown by Wainer et al. (2001), $S_t$ can be computed from the estimated covariance matrix of the observed scores using

$$s^t_{vv'} = s^x_{vv'} \quad \text{for } v \neq v',$$

and

$$s^t_{vv} = \rho_v \, s^x_{vv} \quad \text{for } v = v',$$

where $s^t_{vv}$ is the $vv$ element in the true score covariance matrix $S_t$, $s^x_{vv}$ is the corresponding element in the observed covariance matrix $S_x$, and $\rho_v$ is the reliability of subscore $v$.

Using Subscore Augmentation

The multivariate augmentation procedure can be used with observed scores (Vevea, Billeaud, & Nelson, 1998), item response theory (IRT) scale scores for response patterns, and IRT scale scores for summed scores. Differences in the procedure for each are largely computational; the interested reader is directed to Wainer et al.'s (2001) treatment of this issue. The current article focuses on the two summed-score-based procedures: observed summed scores and IRT scale scores for summed scores. Although IRT scale scores for summed scores do not incorporate as much information as scale scores for response patterns, they preserve IRT's nonlinear relationship between number correct and proficiency while eliminating the awkwardness of a test on which equal summed scores may map to different proficiency estimates. In addition, IRT scale scores are comparable across different forms of a test, so they are widely used in large-scale testing with multiple forms.

The need for subscale augmentation arose when the North Carolina Department of Public Instruction (NCDPI) decided to use its end-of-grade (EOG) tests to provide diagnostic subscores in curricular areas. The EOG tests are primarily designed to allow reliable rankings of individuals; they were not designed to provide diagnostic subscores. When the decision was made to use them for diagnosis, problems arose because of the relatively small size, and hence low reliability, of some of the subscales. The goal of the current article is to demonstrate the utility of subscore augmentation under such circumstances. The simulation study described below examines the method implemented with the summed-score IRT approach used in the NCDPI context. In addition, we present parallel simulations using multivariate EB methods for simple summed scores. These latter results may be of more interest to persons considering the use of EB methods in small-scale contexts where the number of respondents cannot support reliable IRT estimates of item characteristics.
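As a compact illustration of the machinery above, the following Python/NumPy sketch builds $S_t$ from $S_x$ and the subscale reliabilities and then forms the augmented scores $\hat\tau = \bar{x} + B(x - \bar{x})$ with $B = S_t S_x^{-1}$. It is a minimal sketch of the multivariate Kelley estimator, not the AUGMENT program; the reliabilities and example data are hypothetical.

```python
import numpy as np

def augment_scores(X, reliabilities):
    """Multivariate Kelley (empirical Bayes) subscore augmentation.

    X : (n_persons, n_subscales) observed subscale scores.
    reliabilities : subscore reliabilities, one per subscale.
    Returns tau_hat = xbar + B(x - xbar) with B = S_t S_x^{-1}.
    """
    X = np.asarray(X, dtype=float)
    xbar = X.mean(axis=0)                 # vector of subscale means
    S_x = np.cov(X, rowvar=False)         # observed-score covariance S_x
    S_t = S_x.copy()                      # S_t keeps the off-diagonals of S_x;
    np.fill_diagonal(S_t, np.asarray(reliabilities) * np.diag(S_x))
    B = S_t @ np.linalg.inv(S_x)          # reliability-based weight matrix
    return xbar + (X - xbar) @ B.T        # augmented scores, one row per person

# Hypothetical example: two subscales with reliabilities 0.55 and 0.80.
rng = np.random.default_rng(1)
scores = rng.multivariate_normal([3.0, 12.0], [[2.0, 1.5], [1.5, 6.0]], size=500)
print(augment_scores(scores, [0.55, 0.80])[:3])
```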

Method

We conducted simulations to examine the effect of augmenting summed scores and IRT scale scores for summed scores. The general simulation design elements are the same for the two types of scores.

Simulation Design

The goal of the current study is to assess the performance of subscore augmentation at various levels of subscale number, subscale length (and hence reliability), and subscale correlation. Although it is known in advance that, with an appropriate weighting scheme, the empirical Bayes estimates will outperform scores that have not been adjusted, the extent and practical significance of the difference has yet to be determined.

The design includes tests with two or four subscales. Items were simulated using the same item parameter distributions as Wainer, Bradlow, and Du (2000), which are based on marginal distributions from SAT data. Using standard notation for the 3PL IRT model, the distributions are $a \sim N(0.8, \sigma_a^2)$, $b \sim N(0, 1)$, and $c \sim N(0.2, \sigma_c^2)$. We truncated the distributions at three standard deviations above and below the mean to avoid the unusual parameter values discussed by Harwell, Stone, Hsu, and Kirisci (1996). We sampled individual ability parameters (thetas) from a standard normal distribution.

One of the most important aspects of the empirical Bayes estimation method is the correlation among the subscales. We chose three levels of correlation to reflect relatively uncorrelated subscales (r = 0.3), moderately correlated subscales (r = 0.6), and highly correlated subscales (r = 0.9). In this simulation, average subscale reliability is a function of the number of items on a subscale. We chose four subscale lengths so that two of them (5 and 10 items) are relatively unreliable (0.43 and 0.59, respectively) and two of them (20 and 40 items) are relatively reliable (0.75 and 0.85, respectively). These reliability estimates are the average observed squared correlations between scale scores and generating ability parameters (thetas) over 100 replications. The 5-, 10-, and 20-item subscales have been seen in actual work with the NCDPI, which is a principal reason for their inclusion. We included the 40-item subscale as a ceiling condition.

The levels chosen for the two-subscale case result in a total of 30 cells. Crossing each combination of subscale item counts (5 × 5, 5 × 10, etc.) with each correlation yields 48 possible combinations; 18 of those are redundant (e.g., for two 5-item subscales, Subscale 1 correlated 0.3 with Subscale 2 is the same condition as Subscale 2 correlated 0.3 with Subscale 1), so only 30 of the 48 possible cells provide unique information. In the four-subscale condition, nine different combinations of subscale length were examined at the three correlation levels, resulting in a total of 27 cells. Table 1 shows the details of how these design factors are crossed.

TABLE 1
Simulation Design

Two-Subscale Design

Length of        Length of Subscale B
Subscale A       5              10             20             40
5                0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9
10                              0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9
20                                             0.3, 0.6, 0.9  0.3, 0.6, 0.9
40                                                            0.3, 0.6, 0.9

Four-Subscale Design

Length of        Length of Subscales B, C, & D
Subscale A       5, 5, 5        10, 10, 10     20, 20, 20
5                0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9
10               0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9
20               0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9

Note: Entries in the table indicate the levels of correlation between subscales at which the simulation was performed in each cell.

One hundred replications were used in every cell throughout the simulation because that represents a good balance between data yield and computational time. The sample size for each cell was 2,000, a number large enough to yield stable IRT parameter estimates and small enough to converge relatively quickly.
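The data-generating step just described can be sketched as follows. This is our own illustration, not the study code: the generating standard deviations for a and c are placeholder assumptions, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2006)

def truncated_normal(mean, sd, size):
    """Normal draws truncated at +/-3 SD by redrawing out-of-range values."""
    x = rng.normal(mean, sd, size)
    while True:
        bad = np.abs(x - mean) > 3 * sd
        if not bad.any():
            return x
        x[bad] = rng.normal(mean, sd, bad.sum())

def simulate_responses(n_persons, n_items_per_scale, r, sd_a=0.2, sd_c=0.03):
    """Simulate 3PL item responses for correlated subscales.

    Abilities are standard normal with correlation r between subscales;
    a ~ N(0.8, .), b ~ N(0, 1), c ~ N(0.2, .), each truncated at 3 SD.
    sd_a and sd_c are illustrative stand-ins for the generating SDs.
    """
    k = len(n_items_per_scale)
    cov = np.full((k, k), r)
    np.fill_diagonal(cov, 1.0)
    theta = rng.multivariate_normal(np.zeros(k), cov, size=n_persons)
    responses = []
    for s, n_items in enumerate(n_items_per_scale):
        a = truncated_normal(0.8, sd_a, n_items)
        b = truncated_normal(0.0, 1.0, n_items)
        c = truncated_normal(0.2, sd_c, n_items)
        # 3PL: P(X = 1) = c + (1 - c) / (1 + exp(-1.7 a (theta - b)))
        p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, [s]] - b)))
        responses.append((rng.random(p.shape) < p).astype(int))
    return theta, responses

theta, (scale_a, scale_b) = simulate_responses(2000, (5, 40), r=0.9)
print(scale_a.sum(axis=1)[:5], scale_b.sum(axis=1)[:5])  # summed subscores
```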

Analytic Strategy

To evaluate the relative performance of augmented versus nonaugmented scores we focused on four measures. Root mean squared error (RMSE) and reliability are reasonable ways to compare the relative performance of scores. RMSE is the square root of the average squared difference between estimated scores and true scores. In the case of the IRT simulations, true scores are defined as the simulated individual thetas. In the summed-score analyses, true scores are the expected summed scores, given the simulated individual thetas and the generating values of the IRT parameters. That is, for an individual with ability equal to $\theta_j$, the expected summed score is

$$E(X_j) = \sum_{i=1}^{\text{items}} \left[ c_i + \frac{1 - c_i}{1 + \exp\{-1.7\, a_i(\theta_j - b_i)\}} \right].$$

We computed reliability as the square of the correlation between true and estimated scores. Users of these methods will not know the true scores and, hence, will be unable to compute reliability in this manner, so we also examined estimates of reliability for each type of score.
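The expected-summed-score definition translates directly into code; a minimal sketch (function name ours, item parameters hypothetical):

```python
import numpy as np

def expected_summed_score(theta, a, b, c):
    """True score under the 3PL model: E(X_j) for each ability theta_j."""
    theta = np.asarray(theta, dtype=float)[:, None]  # column for broadcasting
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    return p.sum(axis=1)  # sum of item probabilities

# Hypothetical three-item example:
a = np.array([0.8, 1.0, 0.6])
b = np.array([-0.5, 0.0, 0.5])
c = np.array([0.2, 0.2, 0.2])
print(expected_summed_score([-1.0, 0.0, 1.0], a, b, c))
```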

In the case of the EB augmented scores, we used a new procedure to produce a reliability estimate. Wainer et al. (2001) computed reliability for the augmented scores as

$$r_v^2 = \frac{a_{vv}}{c_{vv}},$$

where the numerator is an approximation of the $v$th subscale's unconditional true score variance and the denominator is an estimate of the unconditional score variance of the estimates for the $v$th subscale. The numerator is taken from the diagonal of the matrix

$$A = S_t (S_x)^{-1} S_t (S_x)^{-1} S_t,$$

and the denominator is taken from the diagonal of the matrix

$$C = S_t (S_x)^{-1} S_t.$$

Research conducted since the publication of Test Scoring has shown this estimate of reliability for augmented scores to be positively biased (Edwards, 2002). When estimating reliability for augmented scores we used (and recommend)

$$r_v^{2*} = \frac{a^*_{vv}}{s^t_{vv}},$$

taking the numerator of the fraction from the diagonal of

$$A^* = S_t (S_x)^{-1} S_t$$

and the denominator from the estimated true score matrix $S_t$.

In addition to RMSE and reliability, we computed the percentage of simulated individuals for whom the augmented score was closer to the true score than was the nonaugmented score. This percentage adds to our knowledge of the relative accuracy of the various scores, especially for situations that involve classification.

One final measure of performance involved accuracy in assigning individuals to groups based on their scores. Many tests in education and certification contexts make use of cut scores to determine levels of proficiency. In some cases this leads to pass-fail decisions for schoolchildren; in other cases it leads to placement in some achievement level or identifies a need for remediation. In all cases, it is of interest to compare the performance of augmented subscores in these decisions with the performance of their nonaugmented counterparts.

The analyses of percentage correctly identified focused on a smaller subset of the cells for the two-subscale portion of the simulation. The notation A ⊕ B denotes the augmentation of a subscale with A items using information from a subscale with B items; for example, 5 ⊕ 40 denotes the case where a 5-item subscale has been improved by EB augmentation using a 40-item subscale. Four different combinations of subscales (5 ⊕ 40, 5 ⊕ 5, 20 ⊕ 20, and 40 ⊕ 5) were examined at the three correlation levels (0.3, 0.6, and 0.9) used throughout the study. For the four-subscale case, we examined three different combinations of subscale lengths at the three correlation levels mentioned above.

We conducted this analysis by dividing the simulees into four ability groupings based on their true and estimated scale scores (expected a posteriori scale scores, or EAPs, were used as IRT scale score estimates throughout this study). We selected three cutscores (−1.96, 0, and 1.036) to place 2.5% of simulees in the lowest category, 47.5% in the second category, 35% in the third category, and 15% in the highest category. For analyses of this issue in the context of traditional summed scores (rather than summed-score EAPs), we set cutscores by computing the expected summed scores of individuals whose thetas were −1.96, 0, or 1.036, using the simulated IRT parameters to compute the expectation.
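The two reliability estimates for augmented scores can be computed directly from $S_t$ and $S_x$; the following minimal sketch (our own function and example values, matching the matrix formulas as reconstructed above) implements both the Wainer et al. (2001) estimate and the recommended alternative.

```python
import numpy as np

def augmented_reliabilities(S_t, S_x):
    """Reliability estimates for EB-augmented subscores, one per subscale.

    Returns (Wainer et al. estimate, recommended estimate).
    """
    S_x_inv = np.linalg.inv(S_x)
    C = S_t @ S_x_inv @ S_t                   # variance of the augmented estimates
    A = S_t @ S_x_inv @ S_t @ S_x_inv @ S_t   # approximate true-score variance
    r2_wainer = np.diag(A) / np.diag(C)       # positively biased (Edwards, 2002)
    r2_star = np.diag(C) / np.diag(S_t)       # A* = S_t S_x^{-1} S_t over s^t_vv
    return r2_wainer, r2_star

# Hypothetical two-subscale example built from S_x and subscale reliabilities.
S_x = np.array([[4.0, 2.4],
                [2.4, 9.0]])
S_t = S_x.copy()
np.fill_diagonal(S_t, np.array([0.55, 0.80]) * np.diag(S_x))
print(augmented_reliabilities(S_t, S_x))
```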

Results

IRT Scale Scores for Summed Scores

In the course of analyzing the results it became apparent that the two-subscale simulations would be sufficient to convey the general trends observed throughout the simulation. In the interest of brevity we do not present the four-subscale results here, although they are available from the authors upon request.

Item Parameter Estimation

All of the augmentation procedures discussed in this article were implemented using the AUGMENT software (Vevea et al., 2002; see Note 1). In addition to performing the various augmentation procedures, AUGMENT estimates item parameters using the Bock and Aitkin (1981) EM algorithm.

RMSE

As expected, in all cases the augmented EAPs had lower RMSE than the nonaugmented EAPs. Figure 1 shows the overall trends in the difference in RMSE between the nonaugmented and augmented EAPs. The four plots in Figure 1 highlight the effect of augmentation on a subscale as a function of the correlation between subscales and the length of the subscale contributing ancillary information. The top left panel and lower right panel of Figure 1 deserve special scrutiny because these are the conditions under which we expect to see the most and the least improvement, respectively, from the augmentation procedure.

We see from the top left panel of Figure 1 that when the subscale being augmented is small (5 items), the number of items in the second subscale is large (40 items), and the correlation between the two subscales is high (r = 0.9), the RMSE of the augmented EAPs (0.507) is approximately 33% smaller than the RMSE of the nonaugmented EAPs (0.757). The lower right panel of Figure 1 shows that when the subscale being augmented is large (40 items), the number of items in the second subscale is small (5 items), and the correlation between the two subscales is small (r = 0.3), the RMSE of the augmented EAPs (0.386) is not appreciably lower than the RMSE of the nonaugmented EAPs (0.387). This is not a surprising finding, but it highlights the fact that even in the case in which we expect the least improvement from the augmentation procedure, we still observe some improvement.

It is also interesting to assess the benefit of employing the augmentation procedure in less extreme cases. Two such less extreme cases are those in which the correlation between the subscales is 0.6. In the first of these, we see an approximate 5% decrease in RMSE when using the augmented subscores; in the second, the RMSE of the augmented subscores is roughly 4.5% smaller. The practical significance of such differences depends largely on the use of the subscores.

[FIGURE 1. Difference in RMSE between nonaugmented and augmented IRT scale scores for summed scores as a function of correlation between subscales and number of ancillary items; separate panels for 5-, 10-, 20-, and 40-item subscales, with lines for r = 0.3, 0.6, and 0.9.]

Reliability

For reliability, we were able to examine the true reliability (the squared correlation between a theta estimate and the generating theta) as well as the two estimates of reliability employed in this portion of the study. The two estimates of reliability, IRT marginal reliability (Green, Bock, Humphreys, Linn, & Reckase, 1984) and the augmented score reliability, both performed very well in this simulation. On average, over the 100 replications in each cell, the estimated reliability was never more than 0.01 away from the true reliability. Given the accuracy of reliability estimation for both augmented and nonaugmented scores, we focus on the differences in true reliability between the scores.

The augmented scores exhibited higher true reliability in all cases, but these differences were not always of practical significance. Figure 2 displays the reliabilities for the various configurations examined in the two-subscale portion of this simulation. An examination of Figure 2 reveals several general trends. The size of the difference in reliability between the augmented and nonaugmented scores depends on the size of the subscale being augmented, the size of the subscale used to provide ancillary information, and the correlation between the two subscales. As with RMSE, the greatest advantage of the augmented scores was seen when the subscale being augmented was small (5 items), the subscale providing ancillary information was large (40 items), and the correlation between the two subscales was high (r = 0.9). In this extreme case, the reliability of the augmented scores was more than 1.5 times that of the nonaugmented scores.

Percentage of Simulees for Whom Augmented Scores Were More Accurate

To provide additional information about the performance of the augmented subscores, we computed the percentage of simulees for whom augmented subscores were more accurate (i.e., the distance between the augmented estimate and the generating value was smaller than the distance between the nonaugmented estimate and the generating value). The results, summarized in Table 2, show that, in all cases examined in the current study, the augmented scores were more accurate for a larger percentage of the simulees than the nonaugmented scores. The size of this difference ranges from 1% to 16%, with an average over all conditions of about 5%. These findings provide evidence that the gains in RMSE previously discussed are not attributable to a few extreme scores, but reflect a general pattern of improvement when using augmented subscores.

Percentage Correctly Identified

The percentage correctly identified analyses focused on a smaller subset of the cells run for the two-subscale portion of the simulation. Four different combinations of subscales (40 ⊕ 5, 20 ⊕ 20, 5 ⊕ 5, and 5 ⊕ 40) were examined at the three correlation levels (0.3, 0.6, and 0.9) used throughout the study. The results are summarized in Table 3. As expected, the augmented subscores placed a higher percentage of simulees in the correct ability group in all cases analyzed. The magnitude of this difference ranges from small (0.03%) to quite large (13.43%). Although it is extremely unlikely that anyone would ever try to place individuals into four ability groups based on five items, it is interesting to see how the augmentation procedure performs in the worst- and best-case scenarios.

[FIGURE 2. Reliability of IRT scale scores for summed scores as a function of correlation between subscales and number of ancillary items; separate panels for 5-, 10-, 20-, and 40-item subscales, with lines for r = 0.3, 0.6, and 0.9 and for the nonaugmented scores.]

TABLE 2
Two-Subscale Percentages of Simulees for Whom Augmented Subscores Were More Accurate

# Items in    # Items in    IRT Method                      Summed Score Method
Subscale A    Subscale B    r = 0.3   r = 0.6   r = 0.9     r = 0.3   r = 0.6   r = 0.9
[table entries not preserved in this copy]

Note: r = correlation. All statistics are based on 100 replications per cell, with 2,000 simulees per replication.

TABLE 3
Two-Subscale Percentage Correctly Identified Results

                                           IRT Method                   Summed Score Method
Correlation   # Items in A   # Items in B  Augmented   Nonaugmented     Augmented   Nonaugmented
[table entries not preserved in this copy]

Note: # Items in A = number of items in Subscale A; # Items in B = number of items in Subscale B; Augmented = percentage of simulees correctly placed in ability group using augmented scores; Nonaugmented = percentage of simulees correctly placed in ability group using nonaugmented scores.
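The percentage-correctly-identified tally summarized in Table 3 amounts to comparing group assignments under the cutscores given earlier; a minimal sketch (function name and example data ours):

```python
import numpy as np

def percent_correctly_identified(true_scores, estimates,
                                 cuts=(-1.96, 0.0, 1.036)):
    """Percentage of simulees whose estimated ability group matches the
    group implied by their true score, using fixed cutscores."""
    true_group = np.digitize(true_scores, cuts)  # groups 0..3
    est_group = np.digitize(estimates, cuts)
    return 100.0 * np.mean(true_group == est_group)

# Hypothetical example: noisy estimates of standard-normal abilities.
rng = np.random.default_rng(0)
theta = rng.normal(size=2000)
eap = 0.8 * theta + rng.normal(scale=0.4, size=2000)  # stand-in estimates
print(percent_correctly_identified(theta, eap))
```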

A more realistic subscale configuration shows improvement ranging from 0.15% to 3.82%. While these numbers are small in absolute magnitude, it is important to remember that they are based on aggregation across replications, which essentially results in a sample size of 200,000 simulees. In this case, an improvement of 0.15% means that 300 more simulees were correctly placed in an ability group. The improvement of 3.82%, when the correlation between subscores is 0.9, translates into an additional 7,640 simulees being correctly placed into an ability group when using the augmented subscores. In North Carolina in the 2000–2001 school year there were approximately 1 million children in Grades 3–12 (Statistical Research Section, NCDPI, 2001). An improvement of 3.82% in classification decisions across tests would result in roughly 38,200 more children being correctly placed in an ability group (if such a classification system were in use).

Summed Scores

The simulations for summed scores employed random number seeds for the generation of item parameters, thetas, and item responses that were exactly the same as those used in the IRT simulations. Thus, the results are based on simulated test takers and simulated item responses that are identical to the data from the IRT simulations; the only difference is that the analyses proceed with summed scores rather than with summed-score IRT proficiency estimates.

RMSE

The RMSE was uniformly lower for augmented summed scores than for nonaugmented scores. Figure 3 shows the patterns of differences graphically. For summed scores, the magnitude of the RMSE is affected by the number of items on the scale. Accordingly, the RMSE in each cell was divided by the number of items on the scale, so that values are more directly comparable across cells. These renormalized RMSEs may be interpreted as average error expressed as a proportion of subscale length. In every case, the difference between the RMSE of the nonaugmented and augmented subscores increases as the number of items in the augmenting subscale increases, and this effect becomes more dramatic as the correlation between the scales increases. In general, shorter subscales show greater reductions in RMSE than longer subscales, higher correlations with the ancillary scale produce more improvement in RMSE, and longer (i.e., more reliable) ancillary scales result in greater reductions in RMSE.

Reliability

We defined true reliability as the squared correlation between a summed score (augmented or nonaugmented) and the expected scores of individuals given their thetas and the generating values of the item parameters. Conventional calculations of Cronbach's alpha for the nonaugmented scores and estimated reliability for the augmented scores compared favorably with these true values. In general, estimates for the shortest subscales were somewhat negatively biased, and estimates for the longer subscales were, on average, almost identical to the true values. We focus here on the true reliabilities.
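Both evaluation measures for summed scores, the renormalized RMSE and true reliability, reduce to a few lines; a minimal sketch (function names and example data are ours):

```python
import numpy as np

def renormalized_rmse(estimates, true_scores, n_items):
    """RMSE divided by subscale length: average error as a proportion
    of the number of items on the subscale."""
    rmse = np.sqrt(np.mean((np.asarray(estimates) - true_scores) ** 2))
    return rmse / n_items

def true_reliability(scores, true_scores):
    """Squared correlation between scores and the expected summed scores
    implied by the generating model."""
    return np.corrcoef(scores, true_scores)[0, 1] ** 2

# Hypothetical 10-item subscale with noisy observed summed scores.
rng = np.random.default_rng(3)
true = rng.uniform(2.0, 9.0, size=2000)   # stand-in expected summed scores
observed = true + rng.normal(scale=1.5, size=2000)
print(renormalized_rmse(observed, true, n_items=10))
print(true_reliability(observed, true))
```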

[FIGURE 3. Difference in RMSE between augmented and nonaugmented summed scores as a function of correlation between subscales and number of ancillary items; separate panels for 5-, 10-, 20-, and 40-item subscales, with lines for r = 0.3, 0.6, and 0.9.]

The reliabilities of the augmented scores were always higher than those of the nonaugmented scores, although the differences were sometimes trivial. Figure 4 shows the reliabilities for the same configurations depicted for the IRT results in Figure 2. In general, the pattern is that higher correlations and longer ancillary subscales produce greater gains in reliability for the augmented subscores. These gains are most dramatic when the subscale that is augmented has low reliability (i.e., in the first panel of Figure 4). If the augmented subscale is long, and hence already has high reliability, the gains are minimal (see the fourth panel of Figure 4). For the subscales most affected, the gains in reliability are dramatic.

Percentage of Simulees for Whom Augmented Scores Were More Accurate

Table 2 reports the percentages of simulated respondents for whom augmented scores were more accurate than nonaugmented scores under the various simulation conditions. The tabulated values represent the percentage of simulees whose augmented summed scores were closer than nonaugmented scores to their true scores. The results seem more favorable to the augmentation procedure than was the case for IRT summed scores. In no case is the percentage below 55, and in the best case (the 5-item subscale augmented by a 40-item scale with a correlation of 0.9) 76% of the augmented subscores are closer to the true scores (compared with 66% for summed-score IRT).

This seemingly better performance, however, may not reflect the way augmented summed scores would be used in practice. The augmentation procedure, which involves a weighted combination of integer-valued summed scores, results in noninteger-valued augmented subscores. In practice, it is likely that these fractional summed scores would be rounded to the nearest integer; in many cases, that nearest integer would be identical to the value of the nonaugmented score. Table 4 repeats the summed score portion of Table 2 using rounded augmented scores. The first number in each cell represents the percentage of rounded augmented scores that are less accurate than nonaugmented scores; the second number is the percentage of rounded augmented scores that possess accuracy equal to that of the nonaugmented scores; the third number is the percentage of rounded augmented scores that are more accurate. For example, in the first cell, 17% of the augmented scores were less accurate than the conventional scores, 46% were equally accurate, and 37% of the rounded augmented scores were more accurate than the nonaugmented scores. The ratio of improvement to degradation associated with the rounded augmentation procedure is better than 2:1 for that cell. For all cells this ratio is greater than 1.0, and the only cells in which it approaches 1.0 are those for which we would expect augmentation to make little difference: the augmented scale has 40 items and the correlation is low.
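The rounded less/equal/more comparison reported in Table 4 can be sketched as follows (function name and example data ours):

```python
import numpy as np

def rounded_accuracy_split(augmented, nonaugmented, true_scores):
    """Percentages of simulees for whom rounded augmented summed scores are
    less, equally, or more accurate than nonaugmented scores (Table 4 layout)."""
    err_aug = np.abs(np.rint(augmented) - true_scores)
    err_non = np.abs(np.asarray(nonaugmented) - true_scores)
    less = 100.0 * np.mean(err_aug > err_non)
    equal = 100.0 * np.mean(err_aug == err_non)
    more = 100.0 * np.mean(err_aug < err_non)
    return less, equal, more

# Hypothetical example with stand-in score vectors.
rng = np.random.default_rng(4)
true = rng.uniform(1.0, 4.0, size=2000)                   # expected summed scores
non = np.rint(true + rng.normal(scale=1.0, size=2000))    # integer summed scores
aug = 0.6 * non + 0.4 * non.mean() + rng.normal(scale=0.3, size=2000)
print(rounded_accuracy_split(aug, non, true))
```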

[FIGURE 4. Reliability of summed scores as a function of correlation between subscales and number of ancillary items; separate panels for 5-, 10-, 20-, and 40-item subscales, with lines for r = 0.3, 0.6, and 0.9 and for the nonaugmented scores.]

TABLE 4
Two-Subscale Percentages of Simulees for Whom Augmented Subscores Were Less Accurate, Equally Accurate, or More Accurate (Using Integer Summed Score Method)

# Items in    # Items in
Subscale A    Subscale B    r = 0.3        r = 0.6        r = 0.9
5             5             17 / 46 / 37   [remaining cell entries not preserved in this copy]

Note: Cell entries are percentage less accurate / equally accurate / more accurate. r = correlation. All statistics are based on 100 replications per cell, with 2,000 simulees per replication.

Percentage Correctly Identified

Table 3 displays the percentage correctly identified for the same design cells that were described for the IRT simulations. This statistic seems to be relatively unaffected by the rounding described in the previous section; accordingly, we report only the results for the augmented scores that are not rounded. The percentages correctly identified for augmented summed scores are strikingly similar to the corresponding results for the augmented summed-score IRT scores. The performance of the nonaugmented scores, however, is substantially worse for summed scores, so the augmentation procedure produces greater improvement on this criterion than was the case for IRT scores. This is not surprising. Summed-score IRT scores are, by virtue of being based on statistical expectations, already regressed estimates to a certain degree. This is reflected in the fact that the nonaugmented summed-score IRT estimates perform better on this criterion than the nonaugmented summed scores. Hence, for simple summed scores, there is more room for improvement as a result of the further regression that results from the augmentation procedure.

Discussion

General Findings

The results of the current simulation indicate that the subscores produced by the EB augmentation procedure represent a global improvement over nonaugmented subscores. For both the IRT-based EAPs and the summed scores, the augmented scores exhibit smaller RMSE, correctly place a higher percentage of simulees in appropriate ability groups, and are more reliable. As expected, the magnitude of the improvement gained through the use of the augmentation procedure is a function of

the correlation between subscales and subscale length (reliability). The largest gains are seen in cases where the correlations between subscales are high, the reliability of the subscale being augmented is low, and the reliability of the subscale providing ancillary information is high.

With respect to RMSE, there is no way to state absolutely that any particular difference in RMSE is of practical significance. Ultimately, the meaning of RMSE depends on the intended use of the ability estimates. The issue of the granularity of reported scores also comes into play. For summed scores, RMSE gains are somewhat attenuated if augmented scores are rounded to a small number of integer values. Similarly, any use of IRT-based theta estimates with many reported score levels would benefit more from the lower RMSE provided by the augmentation procedure.

A problem similar to that of interpreting the practical significance of RMSE differences occurs when assessing the practical significance of a difference in reliability. In the most extreme case, the augmented scores were as much as 31% more reliable than the nonaugmented scores. The magnitude of improvement in reliability seen across a wide variety of conditions suggests that, given suitable uses of the scores, the use of the empirical Bayes procedure is justified.

Tests with subscales are often used to assign individuals to some ability group. Whether that group is indicative of severity of depression, quality of life, or proficiency in mathematics, this study suggests that using the empirical Bayes augmentation procedure will increase the number of individuals who are correctly classified. Again, the real-world implications of that increase depend on the number of individuals being assessed. In some contexts, a 1% improvement in the number of individuals correctly classified would not warrant the use of this procedure. However, in many cases where large numbers of individuals are assessed, or where placement into a group carries important consequences, there are clear advantages to using the empirical Bayes procedure. The implementation of the empirical Bayes augmentation procedure (once the software is written) requires no more time than the process of providing subscores in general.

The results of this study show that, in the best case, the empirical Bayes augmentation procedure greatly increases reliability, lowers RMSE, and correctly classifies a higher percentage of individuals. The results also show that, under the conditions used in this study, even in the worst case, the empirical Bayes augmentation procedure does no harm.

Validity

Given these findings, should we use this procedure all the time? The answer to that question is a resounding no. An example used in Wainer et al. (2001) shows why. Consider the 100-m sprint in the Olympics: whoever runs the fastest on that particular day wins. It does not matter whether another person has run faster every day before that, or every day after that. All that matters in the Olympics is how fast you run that day. This is inherent in most competitions; the best person that day wins.

Although augmentation is not advocated in a competitive setting, there are many other areas where it would be appropriate. One such example is the impetus for this study: using empirical Bayes augmentation to report subscores is beneficial because the subscores are more reliable than they would be otherwise. Subscore augmentation provides a statistically and ethically viable solution to the problem of generating reliable subscores from small (and hence unreliable) subscales.

Limitations

As with any simulation study, this one includes only a small subset of all possible interesting conditions. Issues of sample size, which play an important role in item parameter and ability estimation, were not addressed (large-scale educational testing, where this procedure is most likely to see use, does not generally have problems related to small sample sizes).

Conclusion

The present simulation has shown that, in many circumstances, empirical Bayes subscore augmentation of IRT scale scores and traditional summed scores is a useful procedure for increasing the reliability of subscores. In all cases, the augmented subscores outperformed their nonaugmented counterparts. Although the magnitude of this difference depended on subscale correlation and length, in the best case augmentation showed marked improvement, and in the worst case it did no harm. The results of this study suggest that the empirical Bayes methods employed here are an acceptable solution for increasing the reliability of subscales without changing the overall makeup of the test.

Note

1. This software is free and can be downloaded, along with documentation, at dthissen/dl.html

References

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability of scores and profiles. New York: Wiley.

Edwards, M. C. (2002). Augmenting IRT scale scores: A Monte Carlo study to evaluate an empirical Bayes approach to subscore augmentation. Unpublished master's thesis, University of North Carolina at Chapel Hill.

Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21.

Harwell, M., Stone, C. A., Hsu, T.-C., & Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20.

Kelley, T. L. (1927). The interpretation of educational measurements. New York: World Book.

Statistical Research Section, Financial & Business Services, N.C. Department of Public Instruction, Raleigh, NC. (2001). Retrieved March 13, 2002, from fbs/factsfigs00_01.htm#students

Vevea, J. L., Billeaud, K., & Nelson, L. (1998, June). An empirical Bayes approach to subscore augmentation. Paper presented at the annual meeting of the Psychometric Society, Champaign-Urbana, IL.

Vevea, J. L., Edwards, M. C., Thissen, D., Reeve, B. B., Flora, D. B., Sathy, V., et al. (2002). User's guide for AUGMENT v.2: Empirical Bayes subscore augmentation software (Electronic Research Memorandum). Chapel Hill: University of North Carolina, L. L. Thurstone Psychometric Laboratory.

Wainer, H. (Ed.). (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Erlbaum.

Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice. The Netherlands: Kluwer.

Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., III, Rosa, K., Nelson, L., et al. (2001). Augmented scores: Borrowing strength to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Erlbaum.

Authors

MICHAEL C. EDWARDS is Assistant Professor, Department of Psychology, The Ohio State University, Columbus, OH 43210; edwards.134@osu.edu. His areas of specialization include item response theory, factor analysis, and computerized adaptive testing.

JACK L. VEVEA is Associate Professor, Department of Psychology, The University of California at Santa Cruz, Santa Cruz, CA 95064; jvevea@ucsc.edu. His areas of specialization include statistical methods for meta-analysis, mathematical models for bias in memory, and applied statistics.

Manuscript received June 24, 2004
Revision received September 30, 2004
Accepted October 6


More information

Measurement Efficiency for Fixed-Precision Multidimensional Computerized Adaptive Tests Paap, Muirne C. S.; Born, Sebastian; Braeken, Johan

Measurement Efficiency for Fixed-Precision Multidimensional Computerized Adaptive Tests Paap, Muirne C. S.; Born, Sebastian; Braeken, Johan University of Groningen Measurement Efficiency for Fixed-Precision Multidimensional Computerized Adaptive Tests Paap, Muirne C. S.; Born, Sebastian; Braeken, Johan Published in: Applied Psychological Measurement

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education

Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores Shudong Wang NWEA Liru Zhang Delaware Department of Education Paper to be presented at the annual meeting of the National

More information

Meta-Analysis of Correlation Coefficients: A Monte Carlo Comparison of Fixed- and Random-Effects Methods

Meta-Analysis of Correlation Coefficients: A Monte Carlo Comparison of Fixed- and Random-Effects Methods Psychological Methods 01, Vol. 6, No. 2, 161-1 Copyright 01 by the American Psychological Association, Inc. 82-989X/01/S.00 DOI:.37//82-989X.6.2.161 Meta-Analysis of Correlation Coefficients: A Monte Carlo

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA STRUCTURAL EQUATION MODELING, 13(2), 186 203 Copyright 2006, Lawrence Erlbaum Associates, Inc. On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation

More information

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA The uncertain nature of property casualty loss reserves Property Casualty loss reserves are inherently uncertain.

More information

Fixed-Effect Versus Random-Effects Models

Fixed-Effect Versus Random-Effects Models PART 3 Fixed-Effect Versus Random-Effects Models Introduction to Meta-Analysis. Michael Borenstein, L. V. Hedges, J. P. T. Higgins and H. R. Rothstein 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-05724-7

More information

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Educational Psychology Papers and Publications Educational Psychology, Department of 7-1-2001 The Relative Performance of

More information

Comparison of Computerized Adaptive Testing and Classical Methods for Measuring Individual Change

Comparison of Computerized Adaptive Testing and Classical Methods for Measuring Individual Change Comparison of Computerized Adaptive Testing and Classical Methods for Measuring Individual Change Gyenam Kim Kang Korea Nazarene University David J. Weiss University of Minnesota Presented at the Item

More information

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy

More information

for Scaling Ability and Diagnosing Misconceptions Laine P. Bradshaw James Madison University Jonathan Templin University of Georgia Author Note

for Scaling Ability and Diagnosing Misconceptions Laine P. Bradshaw James Madison University Jonathan Templin University of Georgia Author Note Combing Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions Laine P. Bradshaw James Madison University Jonathan Templin University

More information

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX Paper 1766-2014 Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX ABSTRACT Chunhua Cao, Yan Wang, Yi-Hsin Chen, Isaac Y. Li University

More information

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS) Chapter : Advanced Remedial Measures Weighted Least Squares (WLS) When the error variance appears nonconstant, a transformation (of Y and/or X) is a quick remedy. But it may not solve the problem, or it

More information

UvA-DARE (Digital Academic Repository)

UvA-DARE (Digital Academic Repository) UvA-DARE (Digital Academic Repository) Standaarden voor kerndoelen basisonderwijs : de ontwikkeling van standaarden voor kerndoelen basisonderwijs op basis van resultaten uit peilingsonderzoek van der

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering Meta-Analysis Zifei Liu What is a meta-analysis; why perform a metaanalysis? How a meta-analysis work some basic concepts and principles Steps of Meta-analysis Cautions on meta-analysis 2 What is Meta-analysis

More information

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Marianne (Marnie) Bertolet Department of Statistics Carnegie Mellon University Abstract Linear mixed-effects (LME)

More information

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve

More information

Computerized Adaptive Testing With the Bifactor Model

Computerized Adaptive Testing With the Bifactor Model Computerized Adaptive Testing With the Bifactor Model David J. Weiss University of Minnesota and Robert D. Gibbons Center for Health Statistics University of Illinois at Chicago Presented at the New CAT

More information

Effects of Local Item Dependence

Effects of Local Item Dependence Effects of Local Item Dependence on the Fit and Equating Performance of the Three-Parameter Logistic Model Wendy M. Yen CTB/McGraw-Hill Unidimensional item response theory (IRT) has become widely used

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

Robert Pearson Daniel Mundfrom Adam Piccone

Robert Pearson Daniel Mundfrom Adam Piccone Number of Factors A Comparison of Ten Methods for Determining the Number of Factors in Exploratory Factor Analysis Robert Pearson Daniel Mundfrom Adam Piccone University of Northern Colorado Eastern Kentucky

More information

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

was also my mentor, teacher, colleague, and friend. It is tempting to review John Horn s main contributions to the field of intelligence by

was also my mentor, teacher, colleague, and friend. It is tempting to review John Horn s main contributions to the field of intelligence by Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179 185. (3362 citations in Google Scholar as of 4/1/2016) Who would have thought that a paper

More information

Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work?

Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work? Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work? Nazila Alinaghi W. Robert Reed Department of Economics and Finance, University of Canterbury Abstract: This study uses

More information

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales University of Iowa Iowa Research Online Theses and Dissertations Summer 2013 Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales Anna Marie

More information

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) *

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) * A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) * by J. RICHARD LANDIS** and GARY G. KOCH** 4 Methods proposed for nominal and ordinal data Many

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Author's response to reviews

Author's response to reviews Author's response to reviews Title: Comparison of two Bayesian methods to detect mode effects between paper-based and computerized adaptive assessments: A preliminary Monte Carlo study Authors: Barth B.

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

Section on Survey Research Methods JSM 2009

Section on Survey Research Methods JSM 2009 Missing Data and Complex Samples: The Impact of Listwise Deletion vs. Subpopulation Analysis on Statistical Bias and Hypothesis Test Results when Data are MCAR and MAR Bethany A. Bell, Jeffrey D. Kromrey

More information

A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA

A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA Elizabeth Martin Fischer, University of North Carolina Introduction Researchers and social scientists frequently confront

More information

OLS Regression with Clustered Data

OLS Regression with Clustered Data OLS Regression with Clustered Data Analyzing Clustered Data with OLS Regression: The Effect of a Hierarchical Data Structure Daniel M. McNeish University of Maryland, College Park A previous study by Mundfrom

More information

LEDYARD R TUCKER AND CHARLES LEWIS

LEDYARD R TUCKER AND CHARLES LEWIS PSYCHOMETRIKA--VOL. ~ NO. 1 MARCH, 1973 A RELIABILITY COEFFICIENT FOR MAXIMUM LIKELIHOOD FACTOR ANALYSIS* LEDYARD R TUCKER AND CHARLES LEWIS UNIVERSITY OF ILLINOIS Maximum likelihood factor analysis provides

More information

RESEARCH ARTICLES. Brian E. Clauser, Polina Harik, and Melissa J. Margolis National Board of Medical Examiners

RESEARCH ARTICLES. Brian E. Clauser, Polina Harik, and Melissa J. Margolis National Board of Medical Examiners APPLIED MEASUREMENT IN EDUCATION, 22: 1 21, 2009 Copyright Taylor & Francis Group, LLC ISSN: 0895-7347 print / 1532-4818 online DOI: 10.1080/08957340802558318 HAME 0895-7347 1532-4818 Applied Measurement

More information

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data Meade, A. W. & Lautenschlager, G. J. (2005, April). Sensitivity of DFIT Tests of Measurement Invariance for Likert Data. Paper presented at the 20 th Annual Conference of the Society for Industrial and

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

A Race Model of Perceptual Forced Choice Reaction Time

A Race Model of Perceptual Forced Choice Reaction Time A Race Model of Perceptual Forced Choice Reaction Time David E. Huber (dhuber@psyc.umd.edu) Department of Psychology, 1147 Biology/Psychology Building College Park, MD 2742 USA Denis Cousineau (Denis.Cousineau@UMontreal.CA)

More information

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL International Journal of Innovative Management, Information & Production ISME Internationalc2010 ISSN 2185-5439 Volume 1, Number 1, December 2010 PP. 81-89 A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM

More information

Does momentary accessibility influence metacomprehension judgments? The influence of study judgment lags on accessibility effects

Does momentary accessibility influence metacomprehension judgments? The influence of study judgment lags on accessibility effects Psychonomic Bulletin & Review 26, 13 (1), 6-65 Does momentary accessibility influence metacomprehension judgments? The influence of study judgment lags on accessibility effects JULIE M. C. BAKER and JOHN

More information