An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow?


Journal of Educational and Behavioral Statistics
Fall 2006, Vol. 31, No. 3

Michael C. Edwards, The Ohio State University
Jack L. Vevea, The University of California at Santa Cruz

This article examines a subscore augmentation procedure. The approach uses empirical Bayes adjustments and is intended to improve the overall accuracy of measurement when information is scant. Simulations examined the impact of the method on subscale scores under a variety of realistic conditions. The authors focused on two popular scoring methods: summed scores and item response theory scale scores for summed scores. Simulation conditions included the number of subscales, the length (and hence reliability) of the subscales, and the underlying correlations between scales. To examine the relative performance of the augmented scales, the authors computed root mean square error, reliability, the percentage of simulees correctly identified as falling within specific proficiency ranges, and the percentage of simulated individuals for whom the augmented score was closer to the true score than was the nonaugmented score. The general findings and limitations of the study are discussed, and areas for future research are suggested.

Keywords: ability estimation, empirical Bayes, item response theory, subscore augmentation

As the popularity of testing grows in the realm of education, tests that were originally designed for one purpose are frequently pressed into service for others. For example, tests designed to produce reliable scores for ranking individuals may also be expected to provide diagnostic information to interested parties (e.g., the test taker, parents, or teachers). It is difficult to construct a test that can achieve both of these goals adequately. A potential solution to this challenge involves the reporting of subscores: while the overall scores are used to rank individuals, scores on smaller subsections of the test are used to provide feedback specific to a narrow content area. This approach, although intuitively appealing, has a drawback. A test may be constructed so that the overall score is reliable, but such a test may not provide reliable subscores, which are based on less information than the overall score.

This work was supported by the North Carolina Department of Public Instruction. The authors thank Howard Wainer and David Thissen for their comments on earlier drafts of this article.

There are three ways we could address this problem. First, we could increase the length of each subscale until it provides reliable scores. This solution is unrealistic given the time and length constraints often present in large-scale educational testing. Another option is to use adaptive technologies, in the form of computerized adaptive testing (CAT). Such a strategy enables tailored testing to achieve reliable measurement on a subject-by-subject basis, but the realities of educational testing make this solution impractical under many circumstances (see Wainer, 2000). The third possibility is to increase the reliability of diagnostic subscores by incorporating information from the rest of the test. This final procedure is the focus of the current article.

Multivariate Empirical Bayes Estimation

Nearly 50 years after Kelley's (1927) seminal work, what are now called empirical Bayes (EB) methods resurfaced in psychology in the form of multivariate generalizability theory in Cronbach, Gleser, Nanda, and Rajaratnam's The Dependability of Behavioral Measurements (1972). Equation 10.3 in chapter 10 of that book is a multivariate form of Kelley's regressed estimates. Although the language is slightly different (Cronbach et al. discuss universe scores instead of true scores), the underlying theory is the same. The authors note that using all the information in the profile will produce a superior estimate of the subscale true score whenever the subscales are correlated. They also point out that the new, augmented estimates will always have smaller error variance than the nonaugmented estimates.

Kelley's (1927) univariate regression of true score on observed score can be rewritten for the multivariate case as

$$\hat{\tau} = \bar{x} + B(x - \bar{x}),$$

where $\bar{x}$ is a vector of subscale means, $x$ is a vector of subscale scores, and $B$ is a matrix of reliability-based weights. The least-squares estimate of $B$ is the product of the true score covariance matrix and the inverse of the observed score covariance matrix,

$$B = \Sigma_t \Sigma_x^{-1},$$

where $\Sigma_t$ is the true score covariance matrix and $\Sigma_x$ is the observed score covariance matrix. Given data, both of these covariance matrices may be estimated: $\Sigma_t$ by $S_t$, and $\Sigma_x$ by $S_x$. $S_x$ is an estimate of the observed score covariance matrix and is easily obtained from the data. As shown by Wainer et al. (2001), $S_t$ can be computed from the estimated covariance matrix of the observed scores using

$$s^t_{vv'} = s^x_{vv'} \quad \text{for } v \neq v',$$

and

$$s^t_{vv} = \rho_v \, s^x_{vv} \quad \text{for } v = v',$$

where $s^t_{vv}$ is the $vv$ element in the true score covariance matrix $S_t$, $s^x_{vv}$ is the corresponding element in the observed covariance matrix $S_x$, and $\rho_v$ is the reliability of subscore $v$.

Using Subscore Augmentation

The multivariate augmentation procedure can be used with observed scores (Vevea, Billeaud, & Nelson, 1998), item response theory (IRT) scale scores for response patterns, and IRT scale scores for summed scores. Differences in the procedure for each are largely computational; the interested reader is directed to Wainer et al.'s (2001) treatment of this issue. The current article focuses on the two summed-score-based procedures: observed summed scores and IRT scale scores for summed scores. Although IRT scale scores for summed scores do not incorporate as much information as scale scores for response patterns, they preserve IRT's nonlinear relationship between number correct and proficiency while eliminating the awkwardness of a test on which equal summed scores may map to different proficiency estimates. In addition, IRT scale scores are comparable across different forms of a test, so they are widely used in large-scale testing with multiple forms.

The need for subscale augmentation arose when the North Carolina Department of Public Instruction (NCDPI) decided to use its end-of-grade (EOG) tests to provide diagnostic subscores in curricular areas. The EOG tests are primarily designed to allow reliable rankings of individuals; they were not designed to provide diagnostic subscores. When the decision was made to use them for diagnosis, problems arose because of the relatively small size, and hence low reliability, of some of the subscales. The goal of the current article is to demonstrate the utility of subscore augmentation under such circumstances. The simulation study described below examines the method implemented with the summed-score IRT approach used in the NCDPI context. In addition, we present parallel simulations using multivariate EB methods for simple summed scores. These latter results may be of more interest to persons considering the use of EB methods in small-scale contexts where the number of respondents cannot support reliable IRT estimates of item characteristics.
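As a compact illustration of the machinery above, the following Python/NumPy sketch builds $S_t$ from $S_x$ and the subscale reliabilities and then forms the augmented scores $\hat\tau = \bar{x} + B(x - \bar{x})$ with $B = S_t S_x^{-1}$. It is a minimal sketch of the multivariate Kelley estimator, not the AUGMENT program; the reliabilities and example data are hypothetical.

```python
import numpy as np

def augment_scores(X, reliabilities):
    """Multivariate Kelley (empirical Bayes) subscore augmentation.

    X : (n_persons, n_subscales) observed subscale scores.
    reliabilities : subscore reliabilities, one per subscale.
    Returns tau_hat = xbar + B(x - xbar) with B = S_t S_x^{-1}.
    """
    X = np.asarray(X, dtype=float)
    xbar = X.mean(axis=0)                 # vector of subscale means
    S_x = np.cov(X, rowvar=False)         # observed-score covariance S_x
    S_t = S_x.copy()                      # S_t keeps the off-diagonals of S_x;
    np.fill_diagonal(S_t, np.asarray(reliabilities) * np.diag(S_x))
    B = S_t @ np.linalg.inv(S_x)          # reliability-based weight matrix
    return xbar + (X - xbar) @ B.T        # augmented scores, one row per person

# Hypothetical example: two subscales with reliabilities 0.55 and 0.80.
rng = np.random.default_rng(1)
scores = rng.multivariate_normal([3.0, 12.0], [[2.0, 1.5], [1.5, 6.0]], size=500)
print(augment_scores(scores, [0.55, 0.80])[:3])
```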

Method

We conducted simulations to examine the effect of augmenting summed scores and IRT scale scores for summed scores. The general simulation design elements are the same for the two types of scores.

Simulation Design

The goal of the current study is to assess the performance of subscore augmentation at various levels of subscale number, subscale length (and hence reliability), and subscale correlation. Although it is known in advance that, with an appropriate weighting scheme, the empirical Bayes estimates will outperform scores that have not been adjusted, the extent and practical significance of the difference has yet to be determined.

The design includes tests with two or four subscales. Items were simulated using the same item parameter distributions as Wainer, Bradlow, and Du (2000), which are based on marginal distributions from SAT data. Using standard notation for the 3PL IRT model, the distributions are $a \sim N(0.8, \sigma_a^2)$, $b \sim N(0, 1)$, and $c \sim N(0.2, \sigma_c^2)$. We truncated the distributions at three standard deviations above and below the mean to avoid the unusual parameter values discussed by Harwell, Stone, Hsu, and Kirisci (1996). We sampled individual ability parameters (thetas) from a standard normal distribution.

One of the most important aspects of the empirical Bayes estimation method is the correlation among the subscales. We chose three levels of correlation to reflect relatively uncorrelated subscales (r = 0.3), moderately correlated subscales (r = 0.6), and highly correlated subscales (r = 0.9). In this simulation, average subscale reliability is a function of the number of items on a subscale. We chose four subscale lengths so that two of them (5 and 10 items) are relatively unreliable (0.43 and 0.59, respectively) and two of them (20 and 40 items) are relatively reliable (0.75 and 0.85, respectively). These reliability estimates are the average observed squared correlations between scale scores and generating ability parameters (thetas) over 100 replications. The 5-, 10-, and 20-item subscales have been seen in actual work with the NCDPI, which is a principal reason for their inclusion. We included the 40-item subscale as a ceiling condition.

The levels chosen for the two-subscale case result in a total of 30 cells. Crossing each combination of subscale item counts (5 × 5, 5 × 10, etc.) with each correlation yields 48 possible combinations; 18 of those are redundant (e.g., for two 5-item subscales, Subscale 1 correlated 0.3 with Subscale 2 is the same condition as Subscale 2 correlated 0.3 with Subscale 1), so only 30 of the 48 possible cells provide unique information. In the four-subscale condition, nine different combinations of subscale length were examined at the three correlation levels, resulting in a total of 27 cells. Table 1 shows the details of how these design factors are crossed.

TABLE 1
Simulation Design

Two-Subscale Design

Length of        Length of Subscale B
Subscale A       5              10             20             40
5                0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9
10                              0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9
20                                             0.3, 0.6, 0.9  0.3, 0.6, 0.9
40                                                            0.3, 0.6, 0.9

Four-Subscale Design

Length of        Length of Subscales B, C, & D
Subscale A       5, 5, 5        10, 10, 10     20, 20, 20
5                0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9
10               0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9
20               0.3, 0.6, 0.9  0.3, 0.6, 0.9  0.3, 0.6, 0.9

Note: Entries in the table indicate the levels of correlation between subscales at which the simulation was performed in each cell.

One hundred replications were used in every cell throughout the simulation because that represents a good balance between data yield and computational time. The sample size for each cell was 2,000, a number large enough to yield stable IRT parameter estimates and small enough to converge relatively quickly.
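The data-generating step just described can be sketched as follows. This is our own illustration, not the study code: the generating standard deviations for a and c are placeholder assumptions, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2006)

def truncated_normal(mean, sd, size):
    """Normal draws truncated at +/-3 SD by redrawing out-of-range values."""
    x = rng.normal(mean, sd, size)
    while True:
        bad = np.abs(x - mean) > 3 * sd
        if not bad.any():
            return x
        x[bad] = rng.normal(mean, sd, bad.sum())

def simulate_responses(n_persons, n_items_per_scale, r, sd_a=0.2, sd_c=0.03):
    """Simulate 3PL item responses for correlated subscales.

    Abilities are standard normal with correlation r between subscales;
    a ~ N(0.8, .), b ~ N(0, 1), c ~ N(0.2, .), each truncated at 3 SD.
    sd_a and sd_c are illustrative stand-ins for the generating SDs.
    """
    k = len(n_items_per_scale)
    cov = np.full((k, k), r)
    np.fill_diagonal(cov, 1.0)
    theta = rng.multivariate_normal(np.zeros(k), cov, size=n_persons)
    responses = []
    for s, n_items in enumerate(n_items_per_scale):
        a = truncated_normal(0.8, sd_a, n_items)
        b = truncated_normal(0.0, 1.0, n_items)
        c = truncated_normal(0.2, sd_c, n_items)
        # 3PL: P(X = 1) = c + (1 - c) / (1 + exp(-1.7 a (theta - b)))
        p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, [s]] - b)))
        responses.append((rng.random(p.shape) < p).astype(int))
    return theta, responses

theta, (scale_a, scale_b) = simulate_responses(2000, (5, 40), r=0.9)
print(scale_a.sum(axis=1)[:5], scale_b.sum(axis=1)[:5])  # summed subscores
```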

Analytic Strategy

To evaluate the relative performance of augmented versus nonaugmented scores we focused on four measures. Root mean squared error (RMSE) and reliability are reasonable ways to compare the relative performance of scores. RMSE is the square root of the average squared difference between estimated scores and true scores. In the case of the IRT simulations, true scores are defined as the simulated individual thetas. In the summed-score analyses, true scores are the expected summed scores, given the simulated individual thetas and the generating values of the IRT parameters. That is, for an individual with ability equal to $\theta_j$, the expected summed score is

$$E(X_j) = \sum_{i=1}^{\text{items}} \left[ c_i + \frac{1 - c_i}{1 + \exp\{-1.7\, a_i(\theta_j - b_i)\}} \right].$$

We computed reliability as the square of the correlation between true and estimated scores. Users of these methods will not know the true scores and, hence, will be unable to compute reliability in this manner, so we also examined estimates of reliability for each type of score.
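The expected-summed-score definition translates directly into code; a minimal sketch (function name ours, item parameters hypothetical):

```python
import numpy as np

def expected_summed_score(theta, a, b, c):
    """True score under the 3PL model: E(X_j) for each ability theta_j."""
    theta = np.asarray(theta, dtype=float)[:, None]  # column for broadcasting
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    return p.sum(axis=1)  # sum of item probabilities

# Hypothetical three-item example:
a = np.array([0.8, 1.0, 0.6])
b = np.array([-0.5, 0.0, 0.5])
c = np.array([0.2, 0.2, 0.2])
print(expected_summed_score([-1.0, 0.0, 1.0], a, b, c))
```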

In the case of the EB augmented scores, we used a new procedure to produce a reliability estimate. Wainer et al. (2001) computed reliability for the augmented scores as

$$r_v^2 = \frac{a_{vv}}{c_{vv}},$$

where the numerator is an approximation of the $v$th subscale's unconditional true score variance and the denominator is an estimate of the unconditional score variance of the estimates for the $v$th subscale. The numerator is taken from the diagonal of the matrix

$$A = S_t (S_x)^{-1} S_t (S_x)^{-1} S_t,$$

and the denominator is taken from the diagonal of the matrix

$$C = S_t (S_x)^{-1} S_t.$$

Research conducted since the publication of Test Scoring has shown this estimate of reliability for augmented scores to be positively biased (Edwards, 2002). When estimating reliability for augmented scores we used (and recommend)

$$r_v^{2*} = \frac{a^*_{vv}}{s^t_{vv}},$$

taking the numerator of the fraction from the diagonal of

$$A^* = S_t (S_x)^{-1} S_t$$

and the denominator from the estimated true score matrix $S_t$.

In addition to RMSE and reliability, we computed the percentage of simulated individuals for whom the augmented score was closer to the true score than was the nonaugmented score. This percentage adds to our knowledge of the relative accuracy of the various scores, especially for situations that involve classification.

One final measure of performance involved accuracy in assigning individuals to groups based on their scores. Many tests in education and certification contexts make use of cut scores to determine levels of proficiency. In some cases this leads to pass-fail decisions for schoolchildren; in other cases it leads to placement in some achievement level or identifies a need for remediation. In all cases, it is of interest to compare the performance of augmented subscores in these decisions with the performance of their nonaugmented counterparts.

The analyses of percentage correctly identified focused on a smaller subset of the cells for the two-subscale portion of the simulation. The notation A ⊕ B denotes the augmentation of a subscale with A items using information from a subscale with B items; for example, 5 ⊕ 40 denotes the case where a 5-item subscale has been improved by EB augmentation using a 40-item subscale. Four different combinations of subscales (5 ⊕ 40, 5 ⊕ 5, 20 ⊕ 20, and 40 ⊕ 5) were examined at the three correlation levels (0.3, 0.6, and 0.9) used throughout the study. For the four-subscale case, we examined three different combinations of subscale lengths at the three correlation levels mentioned above.

We conducted this analysis by dividing the simulees into four ability groupings based on their true and estimated scale scores (expected a posteriori scale scores, or EAPs, were used as IRT scale score estimates throughout this study). We selected three cutscores (−1.96, 0, and 1.036) to place 2.5% of simulees in the lowest category, 47.5% in the second category, 35% in the third category, and 15% in the highest category. For analyses of this issue in the context of traditional summed scores (rather than summed-score EAPs), we set cutscores by computing the expected summed scores of individuals whose thetas were −1.96, 0, or 1.036, using the simulated IRT parameters to compute the expectation.
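The two reliability estimates for augmented scores can be computed directly from $S_t$ and $S_x$; the following minimal sketch (our own function and example values, matching the matrix formulas as reconstructed above) implements both the Wainer et al. (2001) estimate and the recommended alternative.

```python
import numpy as np

def augmented_reliabilities(S_t, S_x):
    """Reliability estimates for EB-augmented subscores, one per subscale.

    Returns (Wainer et al. estimate, recommended estimate).
    """
    S_x_inv = np.linalg.inv(S_x)
    C = S_t @ S_x_inv @ S_t                   # variance of the augmented estimates
    A = S_t @ S_x_inv @ S_t @ S_x_inv @ S_t   # approximate true-score variance
    r2_wainer = np.diag(A) / np.diag(C)       # positively biased (Edwards, 2002)
    r2_star = np.diag(C) / np.diag(S_t)       # A* = S_t S_x^{-1} S_t over s^t_vv
    return r2_wainer, r2_star

# Hypothetical two-subscale example built from S_x and subscale reliabilities.
S_x = np.array([[4.0, 2.4],
                [2.4, 9.0]])
S_t = S_x.copy()
np.fill_diagonal(S_t, np.array([0.55, 0.80]) * np.diag(S_x))
print(augmented_reliabilities(S_t, S_x))
```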

Results

IRT Scale Scores for Summed Scores

In the course of analyzing the results it became apparent that the two-subscale simulations would be sufficient to convey the general trends observed throughout the simulation. In the interest of brevity we do not present the four-subscale results here, although they are available from the authors upon request.

Item Parameter Estimation

All of the augmentation procedures discussed in this article were implemented using the AUGMENT software (Vevea et al., 2002; see Note 1). In addition to performing the various augmentation procedures, AUGMENT estimates item parameters using the Bock and Aitkin (1981) EM algorithm.

RMSE

As expected, in all cases the augmented EAPs had lower RMSE than the nonaugmented EAPs. Figure 1 shows the overall trends in the difference in RMSE between the nonaugmented and augmented EAPs. The four plots in Figure 1 highlight the effect of augmentation on a subscale as a function of the correlation between subscales and the length of the subscale contributing ancillary information. The top left panel and lower right panel of Figure 1 deserve special scrutiny because these are the conditions under which we expect to see the most and the least improvement, respectively, from the augmentation procedure.

We see from the top left panel of Figure 1 that when the subscale being augmented is small (5 items), the number of items in the second subscale is large (40 items), and the correlation between the two subscales is high (r = 0.9), the RMSE of the augmented EAPs (0.507) is approximately 33% smaller than the RMSE of the nonaugmented EAPs (0.757). The lower right panel of Figure 1 shows that when the subscale being augmented is large (40 items), the number of items in the second subscale is small (5 items), and the correlation between the two subscales is small (r = 0.3), the RMSE of the augmented EAPs (0.386) is not appreciably lower than the RMSE of the nonaugmented EAPs (0.387). This is not a surprising finding, but it highlights the fact that even in the case in which we expect the least improvement from the augmentation procedure, we still observe some improvement.

It is also interesting to assess the benefit of employing the augmentation procedure in less extreme cases. Two such less extreme cases are those in which the correlation between the subscales is 0.6. In the first of these, we see an approximate 5% decrease in RMSE when using the augmented subscores; in the second, the RMSE of the augmented subscores is roughly 4.5% smaller. The practical significance of such differences depends largely on the use of the subscores.

[FIGURE 1. Difference in RMSE between nonaugmented and augmented IRT scale scores for summed scores as a function of correlation between subscales and number of ancillary items; separate panels for 5-, 10-, 20-, and 40-item subscales, with lines for r = 0.3, 0.6, and 0.9.]

Reliability

For reliability, we were able to examine the true reliability (the squared correlation between a theta estimate and the generating theta) as well as the two estimates of reliability employed in this portion of the study. The two estimates of reliability, IRT marginal reliability (Green, Bock, Humphreys, Linn, & Reckase, 1984) and the augmented score reliability, both performed very well in this simulation. On average, over the 100 replications in each cell, the estimated reliability was never more than 0.01 away from the true reliability. Given the accuracy of reliability estimation for both augmented and nonaugmented scores, we focus on the differences in true reliability between the scores.

The augmented scores exhibited higher true reliability in all cases, but these differences were not always of practical significance. Figure 2 displays the reliabilities for the various configurations examined in the two-subscale portion of this simulation. An examination of Figure 2 reveals several general trends. The size of the difference in reliability between the augmented and nonaugmented scores depends on the size of the subscale being augmented, the size of the subscale used to provide ancillary information, and the correlation between the two subscales. As with RMSE, the greatest advantage of the augmented scores was seen when the subscale being augmented was small (5 items), the subscale providing ancillary information was large (40 items), and the correlation between the two subscales was high (r = 0.9). In this extreme case, the reliability of the augmented scores was more than 1.5 times that of the nonaugmented scores.

Percentage of Simulees for Whom Augmented Scores Were More Accurate

To provide additional information about the performance of the augmented subscores, we computed the percentage of simulees for whom augmented subscores were more accurate (i.e., the distance between the augmented estimate and the generating value was smaller than the distance between the nonaugmented estimate and the generating value). The results, summarized in Table 2, show that, in all cases examined in the current study, the augmented scores were more accurate for a larger percentage of the simulees than the nonaugmented scores. The size of this difference ranges from 1% to 16%, with an average over all conditions of about 5%. These findings provide evidence that the gains in RMSE previously discussed are not attributable to a few extreme scores, but reflect a general pattern of improvement when using augmented subscores.

Percentage Correctly Identified

The percentage correctly identified analyses focused on a smaller subset of the cells run for the two-subscale portion of the simulation. Four different combinations of subscales (40 ⊕ 5, 20 ⊕ 20, 5 ⊕ 5, and 5 ⊕ 40) were examined at the three correlation levels (0.3, 0.6, and 0.9) used throughout the study. The results are summarized in Table 3. As expected, the augmented subscores placed a higher percentage of simulees in the correct ability group in all cases analyzed. The magnitude of this difference ranges from small (0.03%) to quite large (13.43%). Although it is extremely unlikely that anyone would ever try to place individuals into four ability groups based on five items, it is interesting to see how the augmentation procedure performs in the worst- and best-case scenarios.

[FIGURE 2. Reliability of IRT scale scores for summed scores as a function of correlation between subscales and number of ancillary items; separate panels for 5-, 10-, 20-, and 40-item subscales, with lines for r = 0.3, 0.6, and 0.9 and for the nonaugmented scores.]

TABLE 2
Two-Subscale Percentages of Simulees for Whom Augmented Subscores Were More Accurate

# Items in    # Items in    IRT Method                      Summed Score Method
Subscale A    Subscale B    r = 0.3   r = 0.6   r = 0.9     r = 0.3   r = 0.6   r = 0.9
[table entries not preserved in this copy]

Note: r = correlation. All statistics are based on 100 replications per cell, with 2,000 simulees per replication.

TABLE 3
Two-Subscale Percentage Correctly Identified Results

                                           IRT Method                   Summed Score Method
Correlation   # Items in A   # Items in B  Augmented   Nonaugmented     Augmented   Nonaugmented
[table entries not preserved in this copy]

Note: # Items in A = number of items in Subscale A; # Items in B = number of items in Subscale B; Augmented = percentage of simulees correctly placed in ability group using augmented scores; Nonaugmented = percentage of simulees correctly placed in ability group using nonaugmented scores.
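The percentage-correctly-identified tally summarized in Table 3 amounts to comparing group assignments under the cutscores given earlier; a minimal sketch (function name and example data ours):

```python
import numpy as np

def percent_correctly_identified(true_scores, estimates,
                                 cuts=(-1.96, 0.0, 1.036)):
    """Percentage of simulees whose estimated ability group matches the
    group implied by their true score, using fixed cutscores."""
    true_group = np.digitize(true_scores, cuts)  # groups 0..3
    est_group = np.digitize(estimates, cuts)
    return 100.0 * np.mean(true_group == est_group)

# Hypothetical example: noisy estimates of standard-normal abilities.
rng = np.random.default_rng(0)
theta = rng.normal(size=2000)
eap = 0.8 * theta + rng.normal(scale=0.4, size=2000)  # stand-in estimates
print(percent_correctly_identified(theta, eap))
```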

A more realistic subscale configuration shows improvement ranging from 0.15% to 3.82%. While these numbers are small in absolute magnitude, it is important to remember that they are based on aggregation across replications, which essentially results in a sample size of 200,000 simulees. In this case, an improvement of 0.15% means that 300 more simulees were correctly placed in an ability group. The improvement of 3.82%, when the correlation between subscores is 0.9, translates into an additional 7,640 simulees being correctly placed into an ability group when using the augmented subscores. In North Carolina in the 2000–2001 school year there were approximately 1 million children in Grades 3–12 (Statistical Research Section, NCDPI, 2001). An improvement of 3.82% in classification decisions across tests would result in roughly 38,200 more children being correctly placed in an ability group (if such a classification system were in use).

Summed Scores

The simulations for summed scores employed random number seeds for the generation of item parameters, thetas, and item responses that were exactly the same as those used in the IRT simulations. Thus, the results are based on simulated test takers and simulated item responses that are identical to the data from the IRT simulations; the only difference is that the analyses proceed with summed scores rather than with summed-score IRT proficiency estimates.

RMSE

The RMSE was uniformly lower for augmented summed scores than for nonaugmented scores. Figure 3 shows the patterns of differences graphically. For summed scores, the magnitude of the RMSE is affected by the number of items on the scale. Accordingly, the RMSE in each cell was divided by the number of items on the scale, so that values are more directly comparable across cells. These renormalized RMSEs may be interpreted as average error expressed as a proportion of subscale length. In every case, the difference between the RMSE of the nonaugmented and augmented subscores increases as the number of items in the augmenting subscale increases, and this effect becomes more dramatic as the correlation between the scales increases. In general, shorter subscales show greater reductions in RMSE than longer subscales, higher correlations with the ancillary scale produce more improvement in RMSE, and longer (i.e., more reliable) ancillary scales result in greater reductions in RMSE.

Reliability

We defined true reliability as the squared correlation between a summed score (augmented or nonaugmented) and the expected scores of individuals given their thetas and the generating values of the item parameters. Conventional calculations of Cronbach's alpha for the nonaugmented scores and estimated reliability for the augmented scores compared favorably with these true values. In general, estimates for the shortest subscales were somewhat negatively biased, and estimates for the longer subscales were, on average, almost identical to the true values. We focus here on the true reliabilities.
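Both evaluation measures for summed scores, the renormalized RMSE and true reliability, reduce to a few lines; a minimal sketch (function names and example data are ours):

```python
import numpy as np

def renormalized_rmse(estimates, true_scores, n_items):
    """RMSE divided by subscale length: average error as a proportion
    of the number of items on the subscale."""
    rmse = np.sqrt(np.mean((np.asarray(estimates) - true_scores) ** 2))
    return rmse / n_items

def true_reliability(scores, true_scores):
    """Squared correlation between scores and the expected summed scores
    implied by the generating model."""
    return np.corrcoef(scores, true_scores)[0, 1] ** 2

# Hypothetical 10-item subscale with noisy observed summed scores.
rng = np.random.default_rng(3)
true = rng.uniform(2.0, 9.0, size=2000)   # stand-in expected summed scores
observed = true + rng.normal(scale=1.5, size=2000)
print(renormalized_rmse(observed, true, n_items=10))
print(true_reliability(observed, true))
```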

[FIGURE 3. Difference in RMSE between augmented and nonaugmented summed scores as a function of correlation between subscales and number of ancillary items; separate panels for 5-, 10-, 20-, and 40-item subscales, with lines for r = 0.3, 0.6, and 0.9.]

The reliabilities of the augmented scores were always higher than those of the nonaugmented scores, although the differences were sometimes trivial. Figure 4 shows the reliabilities for the same configurations depicted for the IRT results in Figure 2. In general, the pattern is that higher correlations and longer ancillary subscales produce greater gains in reliability for the augmented subscores. These gains are most dramatic when the subscale that is augmented has low reliability (i.e., in the first panel of Figure 4). If the augmented subscale is long, and hence already has high reliability, the gains are minimal (see the fourth panel of Figure 4). For the subscales most affected, the gains in reliability are dramatic.

Percentage of Simulees for Whom Augmented Scores Were More Accurate

Table 2 reports the percentages of simulated respondents for whom augmented scores were more accurate than nonaugmented scores under the various simulation conditions. The tabulated values represent the percentage of simulees whose augmented summed scores were closer than nonaugmented scores to their true scores. The results seem more favorable to the augmentation procedure than was the case for IRT summed scores. In no case is the percentage below 55, and in the best case (the 5-item subscale augmented by a 40-item scale with a correlation of 0.9) 76% of the augmented subscores are closer to the true scores (compared with 66% for summed-score IRT).

This seemingly better performance, however, may not reflect the way augmented summed scores would be used in practice. The augmentation procedure, which involves a weighted combination of integer-valued summed scores, results in noninteger-valued augmented subscores. In practice, it is likely that these fractional summed scores would be rounded to the nearest integer; in many cases, that nearest integer would be identical to the value of the nonaugmented score. Table 4 repeats the summed score portion of Table 2 using rounded augmented scores. The first number in each cell represents the percentage of rounded augmented scores that are less accurate than nonaugmented scores; the second number is the percentage of rounded augmented scores that possess accuracy equal to that of the nonaugmented scores; the third number is the percentage of rounded augmented scores that are more accurate. For example, in the first cell, 17% of the augmented scores were less accurate than the conventional scores, 46% were equally accurate, and 37% of the rounded augmented scores were more accurate than the nonaugmented scores. The ratio of improvement to degradation associated with the rounded augmentation procedure is better than 2:1 for that cell. For all cells this ratio is greater than 1.0, and the only cells in which it approaches 1.0 are those for which we would expect augmentation to make little difference: the augmented scale has 40 items and the correlation is low.
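The rounded less/equal/more comparison reported in Table 4 can be sketched as follows (function name and example data ours):

```python
import numpy as np

def rounded_accuracy_split(augmented, nonaugmented, true_scores):
    """Percentages of simulees for whom rounded augmented summed scores are
    less, equally, or more accurate than nonaugmented scores (Table 4 layout)."""
    err_aug = np.abs(np.rint(augmented) - true_scores)
    err_non = np.abs(np.asarray(nonaugmented) - true_scores)
    less = 100.0 * np.mean(err_aug > err_non)
    equal = 100.0 * np.mean(err_aug == err_non)
    more = 100.0 * np.mean(err_aug < err_non)
    return less, equal, more

# Hypothetical example with stand-in score vectors.
rng = np.random.default_rng(4)
true = rng.uniform(1.0, 4.0, size=2000)                   # expected summed scores
non = np.rint(true + rng.normal(scale=1.0, size=2000))    # integer summed scores
aug = 0.6 * non + 0.4 * non.mean() + rng.normal(scale=0.3, size=2000)
print(rounded_accuracy_split(aug, non, true))
```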

[FIGURE 4. Reliability of summed scores as a function of correlation between subscales and number of ancillary items; separate panels for 5-, 10-, 20-, and 40-item subscales, with lines for r = 0.3, 0.6, and 0.9 and for the nonaugmented scores.]

TABLE 4
Two-Subscale Percentages of Simulees for Whom Augmented Subscores Were Less Accurate, Equally Accurate, or More Accurate (Using Integer Summed Score Method)

# Items in    # Items in
Subscale A    Subscale B    r = 0.3        r = 0.6        r = 0.9
5             5             17 / 46 / 37   [remaining cell entries not preserved in this copy]

Note: Cell entries are percentage less accurate / equally accurate / more accurate. r = correlation. All statistics are based on 100 replications per cell, with 2,000 simulees per replication.

Percentage Correctly Identified

Table 3 displays the percentage correctly identified for the same design cells that were described for the IRT simulations. This statistic seems to be relatively unaffected by the rounding described in the previous section; accordingly, we report only the results for the augmented scores that are not rounded. The percentages correctly identified for augmented summed scores are strikingly similar to the corresponding results for the augmented summed-score IRT scores. The performance of the nonaugmented scores, however, is substantially worse for summed scores, so the augmentation procedure produces greater improvement on this criterion than was the case for IRT scores. This is not surprising. Summed-score IRT scores are, by virtue of being based on statistical expectations, already regressed estimates to a certain degree. This is reflected in the fact that the nonaugmented summed-score IRT estimates perform better on this criterion than the nonaugmented summed scores. Hence, for simple summed scores, there is more room for improvement as a result of the further regression that results from the augmentation procedure.

Discussion

General Findings

The results of the current simulation indicate that the subscores produced by the EB augmentation procedure represent a global improvement over nonaugmented subscores. For both the IRT-based EAPs and the summed scores, the augmented scores exhibit smaller RMSE, correctly place a higher percentage of simulees in appropriate ability groups, and are more reliable. As expected, the magnitude of the improvement gained through the use of the augmentation procedure is a function of

the correlation between subscales and subscale length (reliability). The largest gains are seen in cases where the correlations between subscales are high, the reliability of the subscale being augmented is low, and the reliability of the subscale providing ancillary information is high.

With respect to RMSE, there is no way to state absolutely that any particular difference in RMSE is of practical significance. Ultimately, the meaning of RMSE depends on the intended use of the ability estimates. The issue of the granularity of reported scores also comes into play. For summed scores, RMSE gains are somewhat attenuated if augmented scores are rounded to a small number of integer values. Similarly, any use of IRT-based theta estimates with many reported score levels would benefit more from the lower RMSE provided by the augmentation procedure.

A problem similar to that of interpreting the practical significance of RMSE differences occurs when assessing the practical significance of a difference in reliability. In the most extreme case, the augmented scores were as much as 31% more reliable than the nonaugmented scores. The magnitude of improvement in reliability seen across a wide variety of conditions suggests that, given suitable uses of the scores, the use of the empirical Bayes procedure is justified.

Tests with subscales are often used to assign individuals to some ability group. Whether that group is indicative of severity of depression, quality of life, or proficiency in mathematics, this study suggests that using the empirical Bayes augmentation procedure will increase the number of individuals who are correctly classified. Again, the real-world implications of that increase depend on the number of individuals being assessed. In some contexts, a 1% improvement in the number of individuals correctly classified would not warrant the use of this procedure. However, in many cases where large numbers of individuals are assessed, or where placement into a group carries important consequences, there are clear advantages to using the empirical Bayes procedure. The implementation of the empirical Bayes augmentation procedure (once the software is written) requires no more time than the process of providing subscores in general.

The results of this study show that, in the best case, the empirical Bayes augmentation procedure greatly increases reliability, lowers RMSE, and correctly classifies a higher percentage of individuals. The results also show that, under the conditions used in this study, even in the worst case, the empirical Bayes augmentation procedure does no harm.

Validity

Given these findings, should we use this procedure all the time? The answer to that question is a resounding no. An example used in Wainer et al. (2001) shows why. Consider the 100-m sprint in the Olympics: whoever runs the fastest on that particular day wins. It does not matter whether another person has run faster every day before that, or every day after that. All that matters in the Olympics is how fast you run that day. This is inherent in most competitions; the best person that day wins.

Although augmentation is not advocated in a competitive setting, there are many other areas where it would be appropriate. One such example is the impetus for this study: using empirical Bayes augmentation to report subscores is beneficial because the subscores are more reliable than they would be otherwise. Subscore augmentation provides a statistically and ethically viable solution to the problem of generating reliable subscores from small (and hence unreliable) subscales.

Limitations

As with any simulation study, this one includes only a small subset of all possible interesting conditions. Issues of sample size, which play an important role in item parameter and ability estimation, were not addressed (large-scale educational testing, where this procedure is most likely to see use, does not generally have problems related to small sample sizes).

Conclusion

The present simulation has shown that, in many circumstances, empirical Bayes subscore augmentation of IRT scale scores and traditional summed scores is a useful procedure for increasing the reliability of subscores. In all cases, the augmented subscores outperformed their nonaugmented counterparts. Although the magnitude of this difference depended on subscale correlation and length, in the best case augmentation showed marked improvement, and in the worst case it did no harm. The results of this study suggest that the empirical Bayes methods employed here are an acceptable solution for increasing the reliability of subscales without changing the overall makeup of the test.

Note

1. This software is free and can be downloaded, along with documentation, at dthissen/dl.html

References

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability of scores and profiles. New York: Wiley.

Edwards, M. C. (2002). Augmenting IRT scale scores: A Monte Carlo study to evaluate an empirical Bayes approach to subscore augmentation. Unpublished master's thesis, University of North Carolina at Chapel Hill.

Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21.

Harwell, M., Stone, C. A., Hsu, T.-C., & Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20.

Kelley, T. L. (1927). The interpretation of educational measurements. New York: World Book.

Statistical Research Section, Financial & Business Services, N.C. Department of Public Instruction, Raleigh, NC. (2001). Retrieved March 13, 2002, from fbs/factsfigs00_01.htm#students

Vevea, J. L., Billeaud, K., & Nelson, L. (1998, June). An empirical Bayes approach to subscore augmentation. Paper presented at the annual meeting of the Psychometric Society, Champaign-Urbana, IL.

Vevea, J. L., Edwards, M. C., Thissen, D., Reeve, B. B., Flora, D. B., Sathy, V., et al. (2002). User's guide for AUGMENT v.2: Empirical Bayes subscore augmentation software (Electronic Research Memorandum). Chapel Hill: University of North Carolina, L. L. Thurstone Psychometric Laboratory.

Wainer, H. (Ed.). (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Erlbaum.

Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice. The Netherlands: Kluwer.

Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., III, Rosa, K., Nelson, L., et al. (2001). Augmented scores: Borrowing strength to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Erlbaum.

Authors

MICHAEL C. EDWARDS is Assistant Professor, Department of Psychology, The Ohio State University, Columbus, OH 43210; edwards.134@osu.edu. His areas of specialization include item response theory, factor analysis, and computerized adaptive testing.

JACK L. VEVEA is Associate Professor, Department of Psychology, The University of California at Santa Cruz, Santa Cruz, CA 95064; jvevea@ucsc.edu. His areas of specialization include statistical methods for meta-analysis, mathematical models for bias in memory, and applied statistics.

Manuscript received June 24, 2004
Revision received September 30, 2004
Accepted October 6


More information

Measurement Efficiency for Fixed-Precision Multidimensional Computerized Adaptive Tests Paap, Muirne C. S.; Born, Sebastian; Braeken, Johan

Measurement Efficiency for Fixed-Precision Multidimensional Computerized Adaptive Tests Paap, Muirne C. S.; Born, Sebastian; Braeken, Johan University of Groningen Measurement Efficiency for Fixed-Precision Multidimensional Computerized Adaptive Tests Paap, Muirne C. S.; Born, Sebastian; Braeken, Johan Published in: Applied Psychological Measurement

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education

Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores. Shudong Wang NWEA. Liru Zhang Delaware Department of Education Effects of Ignoring Discrimination Parameter in CAT Item Selection on Student Scores Shudong Wang NWEA Liru Zhang Delaware Department of Education Paper to be presented at the annual meeting of the National

More information

Meta-Analysis of Correlation Coefficients: A Monte Carlo Comparison of Fixed- and Random-Effects Methods

Meta-Analysis of Correlation Coefficients: A Monte Carlo Comparison of Fixed- and Random-Effects Methods Psychological Methods 01, Vol. 6, No. 2, 161-1 Copyright 01 by the American Psychological Association, Inc. 82-989X/01/S.00 DOI:.37//82-989X.6.2.161 Meta-Analysis of Correlation Coefficients: A Monte Carlo

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA STRUCTURAL EQUATION MODELING, 13(2), 186 203 Copyright 2006, Lawrence Erlbaum Associates, Inc. On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation

More information

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA

Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA Minimizing Uncertainty in Property Casualty Loss Reserve Estimates Chris G. Gross, ACAS, MAAA The uncertain nature of property casualty loss reserves Property Casualty loss reserves are inherently uncertain.

More information

Fixed-Effect Versus Random-Effects Models

Fixed-Effect Versus Random-Effects Models PART 3 Fixed-Effect Versus Random-Effects Models Introduction to Meta-Analysis. Michael Borenstein, L. V. Hedges, J. P. T. Higgins and H. R. Rothstein 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-05724-7

More information

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Educational Psychology Papers and Publications Educational Psychology, Department of 7-1-2001 The Relative Performance of

More information

Comparison of Computerized Adaptive Testing and Classical Methods for Measuring Individual Change

Comparison of Computerized Adaptive Testing and Classical Methods for Measuring Individual Change Comparison of Computerized Adaptive Testing and Classical Methods for Measuring Individual Change Gyenam Kim Kang Korea Nazarene University David J. Weiss University of Minnesota Presented at the Item

More information

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy

More information

for Scaling Ability and Diagnosing Misconceptions Laine P. Bradshaw James Madison University Jonathan Templin University of Georgia Author Note

for Scaling Ability and Diagnosing Misconceptions Laine P. Bradshaw James Madison University Jonathan Templin University of Georgia Author Note Combing Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions Laine P. Bradshaw James Madison University Jonathan Templin University

More information

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX Paper 1766-2014 Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX ABSTRACT Chunhua Cao, Yan Wang, Yi-Hsin Chen, Isaac Y. Li University

More information

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS) Chapter : Advanced Remedial Measures Weighted Least Squares (WLS) When the error variance appears nonconstant, a transformation (of Y and/or X) is a quick remedy. But it may not solve the problem, or it

More information

UvA-DARE (Digital Academic Repository)

UvA-DARE (Digital Academic Repository) UvA-DARE (Digital Academic Repository) Standaarden voor kerndoelen basisonderwijs : de ontwikkeling van standaarden voor kerndoelen basisonderwijs op basis van resultaten uit peilingsonderzoek van der

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering Meta-Analysis Zifei Liu What is a meta-analysis; why perform a metaanalysis? How a meta-analysis work some basic concepts and principles Steps of Meta-analysis Cautions on meta-analysis 2 What is Meta-analysis

More information

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Marianne (Marnie) Bertolet Department of Statistics Carnegie Mellon University Abstract Linear mixed-effects (LME)

More information

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve

More information

Computerized Adaptive Testing With the Bifactor Model

Computerized Adaptive Testing With the Bifactor Model Computerized Adaptive Testing With the Bifactor Model David J. Weiss University of Minnesota and Robert D. Gibbons Center for Health Statistics University of Illinois at Chicago Presented at the New CAT

More information

Effects of Local Item Dependence

Effects of Local Item Dependence Effects of Local Item Dependence on the Fit and Equating Performance of the Three-Parameter Logistic Model Wendy M. Yen CTB/McGraw-Hill Unidimensional item response theory (IRT) has become widely used

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

Robert Pearson Daniel Mundfrom Adam Piccone

Robert Pearson Daniel Mundfrom Adam Piccone Number of Factors A Comparison of Ten Methods for Determining the Number of Factors in Exploratory Factor Analysis Robert Pearson Daniel Mundfrom Adam Piccone University of Northern Colorado Eastern Kentucky

More information

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

was also my mentor, teacher, colleague, and friend. It is tempting to review John Horn s main contributions to the field of intelligence by

was also my mentor, teacher, colleague, and friend. It is tempting to review John Horn s main contributions to the field of intelligence by Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179 185. (3362 citations in Google Scholar as of 4/1/2016) Who would have thought that a paper

More information

Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work?

Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work? Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work? Nazila Alinaghi W. Robert Reed Department of Economics and Finance, University of Canterbury Abstract: This study uses

More information

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales University of Iowa Iowa Research Online Theses and Dissertations Summer 2013 Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales Anna Marie

More information

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) *

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) * A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) * by J. RICHARD LANDIS** and GARY G. KOCH** 4 Methods proposed for nominal and ordinal data Many

More information

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES Sawtooth Software RESEARCH PAPER SERIES The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? Dick Wittink, Yale University Joel Huber, Duke University Peter Zandan,

More information

Author's response to reviews

Author's response to reviews Author's response to reviews Title: Comparison of two Bayesian methods to detect mode effects between paper-based and computerized adaptive assessments: A preliminary Monte Carlo study Authors: Barth B.

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

Section on Survey Research Methods JSM 2009

Section on Survey Research Methods JSM 2009 Missing Data and Complex Samples: The Impact of Listwise Deletion vs. Subpopulation Analysis on Statistical Bias and Hypothesis Test Results when Data are MCAR and MAR Bethany A. Bell, Jeffrey D. Kromrey

More information

A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA

A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA Elizabeth Martin Fischer, University of North Carolina Introduction Researchers and social scientists frequently confront

More information

OLS Regression with Clustered Data

OLS Regression with Clustered Data OLS Regression with Clustered Data Analyzing Clustered Data with OLS Regression: The Effect of a Hierarchical Data Structure Daniel M. McNeish University of Maryland, College Park A previous study by Mundfrom

More information

LEDYARD R TUCKER AND CHARLES LEWIS

LEDYARD R TUCKER AND CHARLES LEWIS PSYCHOMETRIKA--VOL. ~ NO. 1 MARCH, 1973 A RELIABILITY COEFFICIENT FOR MAXIMUM LIKELIHOOD FACTOR ANALYSIS* LEDYARD R TUCKER AND CHARLES LEWIS UNIVERSITY OF ILLINOIS Maximum likelihood factor analysis provides

More information

RESEARCH ARTICLES. Brian E. Clauser, Polina Harik, and Melissa J. Margolis National Board of Medical Examiners

RESEARCH ARTICLES. Brian E. Clauser, Polina Harik, and Melissa J. Margolis National Board of Medical Examiners APPLIED MEASUREMENT IN EDUCATION, 22: 1 21, 2009 Copyright Taylor & Francis Group, LLC ISSN: 0895-7347 print / 1532-4818 online DOI: 10.1080/08957340802558318 HAME 0895-7347 1532-4818 Applied Measurement

More information

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data Meade, A. W. & Lautenschlager, G. J. (2005, April). Sensitivity of DFIT Tests of Measurement Invariance for Likert Data. Paper presented at the 20 th Annual Conference of the Society for Industrial and

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

A Race Model of Perceptual Forced Choice Reaction Time

A Race Model of Perceptual Forced Choice Reaction Time A Race Model of Perceptual Forced Choice Reaction Time David E. Huber (dhuber@psyc.umd.edu) Department of Psychology, 1147 Biology/Psychology Building College Park, MD 2742 USA Denis Cousineau (Denis.Cousineau@UMontreal.CA)

More information

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL International Journal of Innovative Management, Information & Production ISME Internationalc2010 ISSN 2185-5439 Volume 1, Number 1, December 2010 PP. 81-89 A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM

More information

Does momentary accessibility influence metacomprehension judgments? The influence of study judgment lags on accessibility effects

Does momentary accessibility influence metacomprehension judgments? The influence of study judgment lags on accessibility effects Psychonomic Bulletin & Review 26, 13 (1), 6-65 Does momentary accessibility influence metacomprehension judgments? The influence of study judgment lags on accessibility effects JULIE M. C. BAKER and JOHN

More information