A Comparison of Several Goodness-of-Fit Statistics

Robert L. McKinley, The University of Toledo
Craig N. Mills, Educational Testing Service

A study was conducted to evaluate four goodness-of-fit procedures using data simulation techniques. The procedures were evaluated using data generated according to three different item response theory models and a factor analytic model. Three different distributions of ability were used, as were three different sample sizes. It was concluded that the likelihood ratio chi-square procedure yielded the fewest erroneous rejections of the hypothesis of fit, whereas Bock's chi-square procedure yielded the fewest erroneous acceptances of fit. It was found that sample sizes somewhere between 500 and 1,000 were best. Shifts in the mean of the ability distribution were found to cause minor fluctuations, but they did not appear to be a major issue.

Item response theory (IRT) is becoming a widely used psychometric tool, with applications ranging from item banking to equating to adaptive testing. IRT models offer many advantages over more traditional test analysis procedures. However, these advantages are gained at the expense of making rather strong assumptions about the nature of the data. It is widely recognized that these assumptions are unlikely to be fully met in practice. Although in some respects IRT models appear to be robust with respect to the violation of these assumptions, it is clear that in many instances the violation of these assumptions has profound implications for the application of IRT methodology. Because of the strong assumptions required for the use of IRT, and because the advantages associated with IRT may not be fully realized if these assumptions are not met, it is important that prospective users of IRT methodology assess its appropriateness for intended applications. One way in which this can be done is by conducting a goodness-of-fit study.
Broadly defined, a goodness-of-fit study is the evaluation of the similarity between observed and expected (predicted) outcomes. Within the context of IRT, this typically involves (1) estimating the parameters of an IRT model, (2) using those parameter estimates to predict, by way of the IRT model, examinee response patterns, and (3) comparing the predicted response patterns to actual observed examinee response patterns. A number of procedures have been proposed in the literature for assessing the goodness of fit of IRT models to data. Unfortunately, there is little information available to assist in the selection or evaluation of such procedures. Data are not generally available regarding the performance of the various procedures under different conditions, nor are criteria available for selecting among the competing alternative goodness-of-fit procedures. The purpose of this research was to investigate a number of goodness-of-fit procedures to assess their adequacy for assessing the degree to which the more popular IRT models fit data. This was accomplished by generating simulated test data with known properties. The parameters of the three most popular IRT models (the one-parameter logistic, the two-parameter logistic, and the three-parameter logistic models) were estimated, and several goodness-of-fit procedures were applied to the results. The accuracy with which the procedures identified known fit and misfit was then evaluated.

Goodness-of-Fit Statistics

The focus of this research was on the goodness of fit of specific IRT models, as opposed to specific aspects of fit, such as local independence. Although some procedures have been proposed for assessing local independence (e.g., Rosenbaum, 1984), these procedures have largely focused on assessing whether any IRT model is appropriate, rather than on whether a specific model fits the data. Note, for instance, that the Rosenbaum procedure not only fails to utilize IRT model parameters, but precludes their use. Therefore, these procedures were not included in this study. Rather, only statistics that could be used to evaluate the fit of a specific model, or to compare the fit of two models, were included. After a review of the literature, four goodness-of-fit procedures were selected for this research. All of the procedures selected lend themselves to chi-square analyses, and allow the statistical testing of fit for individual items and for a test as a whole.

Bock's Chi-Square

Bock's chi-square (BCHI) procedure (Bock, 1972) involves computing a chi-square statistic for each item in the following manner. First, the ability scale is divided into J intervals so that roughly equal numbers of examinees are placed into each interval. Each examinee is then assigned to one of the 2 x J cells on the basis of the examinee's ability estimate and whether the examinee answered the item of interest correctly or incorrectly.
For each interval the observed and predicted proportion-correct and proportion-incorrect scores are computed and used to form a chi-square statistic. The predicted value for an interval for a given item i is computed from the IRT model, using the median of the ability estimates falling within the interval and the item parameter estimates for that item. The BCHI statistic for item i is given by

χ²_i = Σ_{j=1}^{J} N_j (O_ij − E_ij)² / [E_ij (1 − E_ij)] ,    (1)

where O_ij is the observed proportion correct on item i for interval j, E_ij is the predicted proportion correct, and N_j is the number of examinees with ability estimates falling within interval j. To test the significance of an item's lack of fit, J − m degrees of freedom are used, where m is the number of item parameters estimated.

Yen's Chi-Square

Yen's chi-square (YCHI) procedure (Yen, 1981) is the same as the BCHI procedure with two exceptions. First, the YCHI procedure uses 10 intervals, whereas the BCHI procedure does not specify a specific number of intervals. Second, the predicted score E_ij is computed as the mean of the predicted probabilities of a correct response for the examinees within an interval. The YCHI statistic is given by Equation 1 with J = 10. The degrees of freedom are 10 − m.

Wright and Mead Chi-Square

The Wright and Mead chi-square (WCHI) procedure was first proposed by Wright and Mead (1977). It is identical to the YCHI procedure with three exceptions. First, the procedure is based on number-correct score groups (as is appropriate for the one-parameter logistic model) rather than on intervals of the ability scale. Second, rather than using 10 intervals, the WCHI procedure requires that six or fewer score groups be used. This is accomplished by collapsing adjacent number-correct score groups until there are six or fewer groups, while maintaining a roughly uniform number of examinees across groups. Third, the chi-square statistic that is computed is modified to correct for the theoretical variance of the predicted probabilities of a correct response within a score group (due to examinees of different abilities being in the same interval). To use this procedure with IRT models other than the one-parameter logistic model, Yen (1981) modified it by substituting the grouping method of the YCHI procedure for the number-correct grouping approach. This resulted in within-group ability estimate variances similar in magnitude to those produced by the number-correct score group method using the one-parameter data. The WCHI statistic is given by

χ²_i = Σ_{k=1}^{K} N_k (O_ik − E_ik)² / { (1/N_k) Σ_{j∈k} P_i(θ_j) [1 − P_i(θ_j)] } ,    (2)

where O_ik and E_ik are the observed and mean predicted proportion-correct scores for item i in score group k, P_i(θ_j) is the predicted probability that examinee j in group k correctly answers item i, K is the number of score groups, and the other terms are as previously defined. The degrees of freedom are K − m.

Likelihood Ratio Chi-Square

The likelihood ratio chi-square (LCHI) procedure follows much the same pattern as the YCHI procedure. The ability scale is divided into 10 intervals in such a way as to result in roughly equal numbers of examinees having ability estimates falling within the intervals. The examinees are sorted into 1 of 20 cells based on their ability estimates and their item responses. A 10 x 2 contingency table is formed, and a likelihood ratio chi-square statistic (Bishop, Fienberg, & Holland, 1975) is computed. The LCHI statistic is given by

G²_i = 2 Σ_{j=1}^{10} N_j { O_ij ln(O_ij / E_ij) + (1 − O_ij) ln[(1 − O_ij) / (1 − E_ij)] } ,    (3)

where ln(x) is the logarithm to the base e of x, and the remaining terms are as previously defined. The degrees of freedom are 10 − m.

Method

The Simulation of Test Data

In all, 36 tests were simulated, each composed of 75 items. Nine tests were simulated to fit each of four models: (1) the one-parameter logistic (1PL) model, (2) the two-parameter logistic (2PL) model, (3) the three-parameter logistic (3PL) model, and (4) a two-factor linear (LIN) model. The nine tests simulated for each model were composed of three tests at each of three sample sizes: 500, 1,000, and 2,000 cases. The three tests with a given sample size varied on the mean ability of the simulated examinees.
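This data-generation step can be sketched as follows. The sketch is illustrative only (it is not the study's code): the uniform parameter ranges are placeholders for the Table 1 values, which did not survive transcription, and all function names are hypothetical.

```python
import math
import random

random.seed(0)

def p3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model.

    With c = 0 this reduces to the 2PL curve; fixing a as well gives the 1PL curve.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate_responses(thetas, items):
    """Return an examinee-by-item matrix of 0/1 responses."""
    return [[1 if random.random() < p3pl(t, a, b, c) else 0
             for (a, b, c) in items]
            for t in thetas]

# 75 items; these uniform ranges stand in for the (lost) Table 1 ranges.
items = [(random.uniform(0.5, 2.0),    # a: discrimination
          random.uniform(-2.0, 2.0),   # b: difficulty
          random.uniform(0.0, 0.25))   # c: pseudo-guessing
         for _ in range(75)]

# Base N = 500 sample of abilities; the low- and high-ability groups are
# formed by shifting every theta by -1.0 or +1.0, as in the study.
thetas = [random.gauss(0.0, 1.0) for _ in range(500)]
low_group = simulate_responses([t - 1.0 for t in thetas], items)
```

Larger samples follow the study's nesting scheme by appending additional θ draws to the same base list rather than redrawing from scratch.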
There was a low ability group (ability centered about one standard deviation below the mean item difficulty), a centered ability group (ability centered at the mean of the item difficulties), and a high ability group (ability centered about one standard deviation above the mean item difficulty). The item parameters used to simulate the one-, two-, and three-parameter data were selected as follows. All of the item parameters were selected from uniform distributions having the ranges shown in Table 1. The same b values were used for all datasets. The same a values were used for all two- and three-parameter datasets. For all one-parameter datasets a value of 1.0 was used for all a values. For all one- and two-parameter datasets a value of 0.0 was used for all c values.

The θ parameters used for the one-, two-, and three-parameter data were selected as follows. All θs were randomly selected from a standard normal distribution. First, 500 θs were selected and used for the N = 500 datasets. For the N = 1,000 datasets an additional 500 θs were selected and combined with the 500 θs previously selected. Likewise, for the N = 2,000 datasets an additional 1,000 θs were selected and combined with those already selected. Note that the low ability groups were simulated by subtracting 1.0 from all of the θs, whereas the high ability groups were simulated by adding 1.0 to all of the θs.

The LIN data were generated using the procedure described by Wherry, Naylor, Wherry, and Fallis (1965), which is based on the linear factor analysis model. The procedure forms a multidimensional variable as a weighted sum of independent, normally distributed random variables, and then dichotomizes the variable to give the desired proportion correct. Groups with different mean abilities were simulated by shifting the mean of the target proportion-correct scores of the items. The target mean total test proportion-correct scores for the three ability groups were p = .375, p = .500, and p = .625. Items 1-37 had factor loadings of .70 on the first factor and .20 on the second factor, whereas items 38-75 had loadings of .20 on the first factor and .70 on the second factor.

Table 1
Ranges of Item Parameter Distributions

Calibration

The data for all conditions were calibrated for the one-, two-, and three-parameter models using LOGIST (Wingersky, Barton, & Lord, 1982). For the one- and two-parameter models, all c values were held constant at 0.0. For the one-parameter model the a values were held constant at .588 (1.0/1.7). However, this value was modified due to the rescaling, and as a result, different values of a were obtained for each dataset.

Analyses

The four goodness-of-fit procedures were applied to each of the simulation datasets. The results were then inspected to determine whether the procedures performed satisfactorily. That is, it was determined whether the procedures could be used to discriminate cases of fit (such as the three-parameter calibration of the one-parameter data) from cases of misfit (such as the one-parameter calibration of multidimensional data). Two types of errors were investigated in this study. The first type of error involved the erroneous conclusion that the model did not fit the data when, in fact, the data were generated with that model or a model subsuming the calibration model (e.g., the two-parameter calibration of one-parameter data). The second type of error was the erroneous conclusion of fit when the calibration model did not subsume the generation model (e.g., the one-parameter calibration of three-parameter data).

Results

Table 2 reports summaries of the results obtained for the goodness-of-fit procedures for the one-, two-, and three-parameter data. Table 3 contains a summary of the results obtained for the multidimensional data.
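As an illustration of how two of the statistics described earlier might be computed, here is a minimal sketch of the YCHI and LCHI computations for a single item. All names are hypothetical, and a real analysis would use ability estimates from the calibration rather than the generating θs.

```python
import math
import random

def interval_fit_stats(theta_hats, responses, item_curve, J=10, m=3):
    """Yen-style (YCHI) and likelihood ratio (LCHI) chi-squares for one item.

    theta_hats : ability estimates, one per examinee
    responses  : 0/1 answers to the item, in the same order
    item_curve : function theta -> predicted P(correct) under the fitted model
    J, m       : number of intervals and number of estimated item parameters
    Returns (ychi, lchi, degrees_of_freedom).
    """
    # Sort examinees by ability estimate and cut into J near-equal intervals.
    order = sorted(range(len(theta_hats)), key=lambda i: theta_hats[i])
    size = len(order) // J
    ychi = lchi = 0.0
    for j in range(J):
        cell = order[j * size:] if j == J - 1 else order[j * size:(j + 1) * size]
        N = len(cell)
        O = sum(responses[i] for i in cell) / N               # observed proportion
        E = sum(item_curve(theta_hats[i]) for i in cell) / N  # mean predicted prob.
        ychi += N * (O - E) ** 2 / (E * (1.0 - E))            # Pearson form
        for obs, exp in ((O, E), (1.0 - O, 1.0 - E)):         # correct / incorrect
            if obs > 0.0:                                     # 0 * ln(0) -> 0
                lchi += 2.0 * N * obs * math.log(obs / exp)
    return ychi, lchi, J - m

# Example: a 2PL item scored against its own generating curve, so fit is expected.
random.seed(1)
curve = lambda t: 1.0 / (1.0 + math.exp(-1.7 * 1.2 * (t - 0.3)))
thetas = [random.gauss(0.0, 1.0) for _ in range(1000)]
resp = [1 if random.random() < curve(t) else 0 for t in thetas]
ychi, lchi, df = interval_fit_stats(thetas, resp, curve, m=2)
```

The resulting statistic would be referred to a chi-square distribution with df degrees of freedom, mirroring the item-level significance tests used in the study.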
The values reported in the tables are the proportions of items for which there was significant misfit of the model to the data. Significance levels of .01 and .05 were used for testing the significance of the chi-squares for individual items. Because the two sets of analyses yielded quite similar patterns of results, only the results for the .01 analyses are reported. Under the hypothesis of fit, the proportion of items for which there was misfit should have been approximately .01.

Table 2
Proportion of Items Identified as Misfitting 1PL, 2PL, and 3PL Models for Four Generation Models, Three Ability Distributions, Three Sample Sizes, and Four Fit Statistics
Note. Decimal points omitted.

Table 2 summarizes the results for the one-parameter data. Since these data were generated to fit the one-parameter model, it would be expected that all three calibration models would yield fit. As can be seen from Table 2, however, misfit was shown by all of the chi-square procedures. The most misfit was shown for the centered ability distribution and, to some extent, for the largest sample size. It seems clear from an examination of Table 2 that the values are consistently lower for the LCHI procedure than for the other procedures, though the level of significance of the differences is unclear.

Table 2 also summarizes the results obtained for the chi-square procedures for the two-parameter data. For these data, fit was expected for the two- and three-parameter models, but not for the one-parameter model. As can be seen from Table 2, all four procedures showed clear differences between the one-parameter calibrations and the two- and three-parameter calibrations. There is some lack of fit for the two- and three-parameter models, especially for the three-parameter model, but the proportions of items for which there was misfit are dramatically smaller than for the one-parameter model, regardless of which procedure is considered. In the cases when fit was expected, the LCHI procedure once again showed consistently lower values than the other procedures. In the cases when misfit was expected, the LCHI procedure performed as well as or better than the other procedures for the 500 sample size case, whereas it performed about as well as the others for the larger sample size cases.

The results obtained for the three-parameter data are also shown in Table 2. For these data, only the three-parameter calibration model was expected to yield fit. It was expected that the fit for the two-parameter model would be worse than for the three-parameter model, but not as bad as for the one-parameter model. This is the pattern obtained for all four procedures. There was some misfit for the three-parameter model, but at relatively low levels. The least misfit was indicated by the LCHI procedure.
The LCHI procedure also tended to show less misfit for the two-parameter model than did the other procedures. There were no clear patterns for the one-parameter calibrations.

Table 3
Proportions of Items Generated by the Multidimensional Model Identified as Misfitting 1PL, 2PL, and 3PL Models for Three Ability Levels, Three Sample Sizes, and Four Fit Statistics
Note. Decimal points omitted.

Table 3 shows a summary of the results obtained for the chi-square procedures for the multidimensional data. For these data, misfit was expected for all three calibration models. This was the obtained pattern, though the level of misfit (the proportion of items for which there was misfit) was surprisingly low for the 500 and 1,000 sample size cases for the centered ability distribution. This result was fairly consistent across the four procedures. The only consistent difference among the fit procedures for these data was the tendency of the WCHI procedure to indicate less misfit than the other procedures, especially for the two- and three-parameter calibration models. Nonetheless, in no case would these results be interpreted as indicating that the unidimensional models yield adequate fit.

Discussion and Conclusions

Before discussing the results of this study, it will be helpful to consider goodness of fit as a general issue. This will provide a rationale for the selection of the statistics for this study and a context for the discussion of the results. Goodness of fit is generally recognized as an important aspect of any model-based psychological measurement. Certainly, if goodness of fit of a model to data is not established, the validity of the use of the model comes into question. Within the context of IRT, goodness of fit is a crucial ingredient in establishing the appropriateness of any particular model, and of the use of IRT methodology itself. Despite claims for the robustness of IRT models, it is clear that misfit can seriously detract from the validity of IRT-based measurements. Robustness cannot simply be assumed. Rather, it must be established in each new application. Lack of goodness of fit can occur for several reasons. For example, the assumptions of the model may not be met in the data. IRT, like any other model-based methodology, involves the use of some rather strong assumptions.
Most commonly used models assume that the complete latent space is unidimensional, and that local independence therefore holds when unidimensional IRT models are employed. If the data are multidimensional, which is often the case, local independence may not hold. IRT models also involve strong assumptions about the shape of the function relating performance to latent ability. If the assumed shape of the item response function (or item characteristic curve) is incorrect, as when the one-parameter model is used although guessing is a factor in responses, misfit results. Even if the assumptions of the model are met, misfit can result from inadequacies in the estimation process. This can be the result of sample sizes that are too small, poor estimation algorithms, or a variety of other problems. Different procedures for assessing goodness of fit tend to focus on different aspects of misfit. Some procedures, for instance, are designed for detecting violations of local independence and monotonicity assumptions (e.g., Holland, 1981; Rosenbaum, 1984). Others are designed for and limited to particular models or estimation algorithms (e.g., Andersen, 1973; Bock & Aitkin, 1981; Wright & Mead, 1977; Wright & Stone, 1979). In this study, only statistics that could be applied with any IRT model or estimation algorithm were employed.

For the one-parameter data of this study, all three models were expected to fit the data. That is, the proportions of misfitting items were expected to all be approximately .01 for the three models. As can be seen in Table 2, to a great extent this was the observed outcome. However, there were a number of instances in which the proportions deviated from the .01 region. Overall, it appeared as though the LCHI procedure yielded values consistently closer to .01 than did the other procedures.
This would seem to indicate that when the null hypothesis (fit) is true, the LCHI procedure is least likely to result in erroneous rejections of the null hypothesis. This seemed to be true regardless of sample size, distribution of ability, or calibration model. For the two-parameter data, lack of fit was expected for the one-parameter model calibrations, but not for the two- and three-parameter calibrations. Again, this was the general observed outcome, though there were a few cases when proportions diverged somewhat from the expected value of .01. As was the case for the one-parameter data, the LCHI procedure appeared to result in the fewest erroneous rejections of the hypothesis that the model fit the data. In terms of correct rejections of the hypothesis of fit, the LCHI and BCHI procedures appeared to yield marginally higher proportions than the other two procedures. However, it would be very difficult, on the basis of these data, to select one fit procedure over the others.

For the three-parameter data, misfit was expected for both the one- and two-parameter models, whereas fit was expected for the three-parameter model. The fit of the two-parameter model was expected to be somewhat better than the fit of the one-parameter model. Once again, the observed outcome closely paralleled the expected outcome. As was the case with the one- and two-parameter data, the LCHI procedure yielded the fewest erroneous rejections of fit when the three-parameter data were calibrated using the three-parameter model. In terms of correct rejections of fit, it would be very difficult to identify one procedure as performing better than the others.

All three models were expected to yield lack of fit to the multidimensional data, and that was, in fact, the observed outcome. Although there was no clear pattern in these data, it did appear as though the BCHI procedure might have yielded, on the average, slightly higher proportions of correct rejections of fit than did the other procedures. Overall, it appeared as though the selection of a goodness-of-fit procedure would differ depending on the type of error considered more serious. The LCHI procedure appeared to be least likely to result in erroneous rejections of fit, whereas the BCHI procedure appeared to yield marginally fewer erroneous acceptances of fit. The following cautions in the interpretation of these results should be noted.
The study addressed only the issue of fit with nonskewed, normal distributions of ability. The results do not generalize beyond this limitation. Nor does this study address the question of fit for tests shorter than 75 items, though the results probably do generalize to longer tests. It should also be noted that the results for the multidimensional data are based on a linear model involving two roughly equal factors. These results may not generalize to multidimensional data in which the factors are not equal, or in which linearity does not contribute to misfit. These results must be interpreted in the light of these limitations, in which case they appear fairly clear-cut.

Based on the results of this study, the following conclusions seem appropriate about the use of these goodness-of-fit procedures. First, sample sizes of 500 to 1,000 seemed to yield the best results. A sample of 2,000 seemed to make the fit procedures too sensitive. Second, shifts in the mean of the ability distribution caused minor fluctuations, but did not seem to be a major issue. This does not, however, address the issue of distribution skewness or non-normality. Third, the likelihood ratio chi-square procedure appeared to yield the fewest erroneous rejections of the hypothesis of fit, whereas Bock's chi-square procedure yielded the fewest erroneous conclusions of fit. The procedure of choice apparently depends on which type of error is considered to be the more serious.

References

Andersen, E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: The MIT Press.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46.
Holland, P. W. (1981). When are item response models consistent with observed data? Psychometrika, 46.
Rosenbaum, P. R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika, 49.
Wherry, R. J., Sr., Naylor, J. C., Wherry, R. J., Jr., & Fallis, R. F. (1965). Generating multiple samples of multivariate data with arbitrary population parameters. Psychometrika, 30.
Wingersky, M. S., Barton, M. A., & Lord, F. M. (1982). LOGIST user's guide. Princeton, NJ: Educational Testing Service.
Wright, B. D., & Mead, R. J. (1977). BICAL: Calibrating items and scales with the Rasch model (Research Memorandum No. 23). Chicago, IL: University of Chicago, Statistical Laboratory, Department of Education.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: Mesa Press.
Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5.

Acknowledgment

This research was funded by the Department of Education of the State of Louisiana, U.S.A.

Author's Address

Send requests for reprints or further information to Robert L. McKinley, College of Education, The University of Toledo, 2801 W. Bancroft St., Toledo OH 43606, U.S.A.


More information

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting

More information

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms Dato N. M. de Gruijter University of Leiden John H. A. L. de Jong Dutch Institute for Educational Measurement (CITO)

More information

A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA

A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA Elizabeth Martin Fischer, University of North Carolina Introduction Researchers and social scientists frequently confront

More information

Bayesian Tailored Testing and the Influence

Bayesian Tailored Testing and the Influence Bayesian Tailored Testing and the Influence of Item Bank Characteristics Carl J. Jensema Gallaudet College Owen s (1969) Bayesian tailored testing method is introduced along with a brief review of its

More information

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns The Influence of Test Characteristics on the Detection of Aberrant Response Patterns Steven P. Reise University of California, Riverside Allan M. Due University of Minnesota Statistical methods to assess

More information

Differential Item Functioning

Differential Item Functioning Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 26 for Mixed Format Tests Kyong Hee Chon Won-Chan Lee Timothy N. Ansley November 2007 The authors are grateful to

More information

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017) DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;

More information

properties of these alternative techniques have not

properties of these alternative techniques have not Methodology Review: Item Parameter Estimation Under the One-, Two-, and Three-Parameter Logistic Models Frank B. Baker University of Wisconsin This paper surveys the techniques used in item response theory

More information

Multidimensionality and Item Bias

Multidimensionality and Item Bias Multidimensionality and Item Bias in Item Response Theory T. C. Oshima, Georgia State University M. David Miller, University of Florida This paper demonstrates empirically how item bias indexes based on

More information

IRT Parameter Estimates

IRT Parameter Estimates An Examination of the Characteristics of Unidimensional IRT Parameter Estimates Derived From Two-Dimensional Data Timothy N. Ansley and Robert A. Forsyth The University of Iowa The purpose of this investigation

More information

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,

More information

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

A Broad-Range Tailored Test of Verbal Ability

A Broad-Range Tailored Test of Verbal Ability A Broad-Range Tailored Test of Verbal Ability Frederic M. Lord Educational Testing Service Two parallel forms of a broad-range tailored test of verbal ability have been built. The test is appropriate from

More information

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015 This report describes the procedures used in obtaining parameter estimates for items appearing on the 2014-2015 Smarter Balanced Assessment Consortium (SBAC) summative paper-pencil forms. Among the items

More information

Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies. Xiaowen Zhu. Xi an Jiaotong University.

Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies. Xiaowen Zhu. Xi an Jiaotong University. Running head: ASSESS MEASUREMENT INVARIANCE Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies Xiaowen Zhu Xi an Jiaotong University Yanjie Bian Xi an Jiaotong

More information

An Introduction to Missing Data in the Context of Differential Item Functioning

An Introduction to Missing Data in the Context of Differential Item Functioning A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data A C T Research Report Series 87-12 A Comparison Study of the Unidimensional IRT Estimation of Compensatory and Noncompensatory Multidimensional Item Response Data Terry Ackerman September 1987 For additional

More information

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) *

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) * A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) * by J. RICHARD LANDIS** and GARY G. KOCH** 4 Methods proposed for nominal and ordinal data Many

More information

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model Chapter 7 Latent Trait Standardization of the Benzodiazepine Dependence Self-Report Questionnaire using the Rasch Scaling Model C.C. Kan 1, A.H.G.S. van der Ven 2, M.H.M. Breteler 3 and F.G. Zitman 1 1

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Gender-Based Differential Item Performance in English Usage Items

Gender-Based Differential Item Performance in English Usage Items A C T Research Report Series 89-6 Gender-Based Differential Item Performance in English Usage Items Catherine J. Welch Allen E. Doolittle August 1989 For additional copies write: ACT Research Report Series

More information

The Influence of Conditioning Scores In Performing DIF Analyses

The Influence of Conditioning Scores In Performing DIF Analyses The Influence of Conditioning Scores In Performing DIF Analyses Terry A. Ackerman and John A. Evans University of Illinois The effect of the conditioning score on the results of differential item functioning

More information

Lessons in biostatistics

Lessons in biostatistics Lessons in biostatistics The test of independence Mary L. McHugh Department of Nursing, School of Health and Human Services, National University, Aero Court, San Diego, California, USA Corresponding author:

More information

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Jill F. Kilanowski, PhD, APRN,CPNP Associate Professor Alpha Zeta & Mu Chi Acknowledgements Dr. Li Lin,

More information

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION

More information

On indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state

On indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state On indirect measurement of health based on survey data Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state A scaling model: P(Y 1,..,Y k ;α, ) α = item difficulties

More information

Information Structure for Geometric Analogies: A Test Theory Approach

Information Structure for Geometric Analogies: A Test Theory Approach Information Structure for Geometric Analogies: A Test Theory Approach Susan E. Whitely and Lisa M. Schneider University of Kansas Although geometric analogies are popular items for measuring intelligence,

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS Michael J. Kolen The University of Iowa March 2011 Commissioned by the Center for K 12 Assessment & Performance Management at

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 39 Evaluation of Comparability of Scores and Passing Decisions for Different Item Pools of Computerized Adaptive Examinations

More information

Chapter 11. Experimental Design: One-Way Independent Samples Design

Chapter 11. Experimental Design: One-Way Independent Samples Design 11-1 Chapter 11. Experimental Design: One-Way Independent Samples Design Advantages and Limitations Comparing Two Groups Comparing t Test to ANOVA Independent Samples t Test Independent Samples ANOVA Comparing

More information

A simulation study of person-fit in the Rasch model

A simulation study of person-fit in the Rasch model Psychological Test and Assessment Modeling, Volume 58, 2016 (3), 531-563 A simulation study of person-fit in the Rasch model Richard Artner 1 Abstract The validation of individual test scores in the Rasch

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Conditional Independence in a Clustered Item Test

Conditional Independence in a Clustered Item Test Conditional Independence in a Clustered Item Test Richard C. Bell and Philippa E. Pattison University of Melbourne Graeme P. Withers Australian Council for Educational Research Although the assumption

More information

Estimating Item and Ability Parameters

Estimating Item and Ability Parameters Estimating Item and Ability Parameters in Homogeneous Tests With the Person Characteristic Function John B. Carroll University of North Carolina, Chapel Hill On the basis of monte carlo runs, in which

More information

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Educational Psychology Papers and Publications Educational Psychology, Department of 7-1-2001 The Relative Performance of

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Model fit and robustness? - A critical look at the foundation of the PISA project

Model fit and robustness? - A critical look at the foundation of the PISA project Model fit and robustness? - A critical look at the foundation of the PISA project Svend Kreiner, Dept. of Biostatistics, Univ. of Copenhagen TOC The PISA project and PISA data PISA methodology Rasch item

More information

Development of a descriptive fit statistic for the Rasch model

Development of a descriptive fit statistic for the Rasch model Development of a descriptive fit statistic for the Rasch model Purya Baghaei, Takuya Yanagida, Moritz Heene To cite this version: Purya Baghaei, Takuya Yanagida, Moritz Heene. Development of a descriptive

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales University of Iowa Iowa Research Online Theses and Dissertations Summer 2013 Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales Anna Marie

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

MEASUREMENT DISTURBANCE EFFECTS ON RASCH FIT STATISTICS AND THE LOGIT RESIDUAL INDEX DISSERTATION. Presented to the Graduate Council of the

MEASUREMENT DISTURBANCE EFFECTS ON RASCH FIT STATISTICS AND THE LOGIT RESIDUAL INDEX DISSERTATION. Presented to the Graduate Council of the Vrv^ MEASUREMENT DISTURBANCE EFFECTS ON RASCH FIT STATISTICS AND THE LOGIT RESIDUAL INDEX DISSERTATION Presented to the Graduate Council of the University of North Texas in Partial Fulfillment of the Requirements

More information

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj Statistical Techniques Masoud Mansoury and Anas Abulfaraj What is Statistics? https://www.youtube.com/watch?v=lmmzj7599pw The definition of Statistics The practice or science of collecting and analyzing

More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL JOURNAL OF EDUCATIONAL MEASUREMENT VOL. II, NO, 2 FALL 1974 THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL SUSAN E. WHITELY' AND RENE V. DAWIS 2 University of Minnesota Although it has been claimed that

More information

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

During the past century, mathematics

During the past century, mathematics An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first

More information

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Jin Gong University of Iowa June, 2012 1 Background The Medical Council of

More information

Analysis and Interpretation of Data Part 1

Analysis and Interpretation of Data Part 1 Analysis and Interpretation of Data Part 1 DATA ANALYSIS: PRELIMINARY STEPS 1. Editing Field Edit Completeness Legibility Comprehensibility Consistency Uniformity Central Office Edit 2. Coding Specifying

More information

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug?

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug? MMI 409 Spring 2009 Final Examination Gordon Bleil Table of Contents Research Scenario and General Assumptions Questions for Dataset (Questions are hyperlinked to detailed answers) 1. Is there a difference

More information

Using Bayesian Decision Theory to

Using Bayesian Decision Theory to Using Bayesian Decision Theory to Design a Computerized Mastery Test Charles Lewis and Kathleen Sheehan Educational Testing Service A theoretical framework for mastery testing based on item response theory

More information

Estimating the Validity of a

Estimating the Validity of a Estimating the Validity of a Multiple-Choice Test Item Having k Correct Alternatives Rand R. Wilcox University of Southern California and University of Califarnia, Los Angeles In various situations, a

More information

An Assessment of The Nonparametric Approach for Evaluating The Fit of Item Response Models

An Assessment of The Nonparametric Approach for Evaluating The Fit of Item Response Models University of Massachusetts Amherst ScholarWorks@UMass Amherst Open Access Dissertations 2-2010 An Assessment of The Nonparametric Approach for Evaluating The Fit of Item Response Models Tie Liang University

More information

Inferential Statistics

Inferential Statistics Inferential Statistics and t - tests ScWk 242 Session 9 Slides Inferential Statistics Ø Inferential statistics are used to test hypotheses about the relationship between the independent and the dependent

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective

The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective Vol. 9, Issue 5, 2016 The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective Kenneth D. Royal 1 Survey Practice 10.29115/SP-2016-0027 Sep 01, 2016 Tags: bias, item

More information

Section 5. Field Test Analyses

Section 5. Field Test Analyses Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken

More information

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, The Western Aphasia Battery (WAB) (Kertesz, 1982) is used to classify aphasia by classical type, measure overall severity, and measure change over time. Despite its near-ubiquitousness, it has significant

More information

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data Meade, A. W. & Lautenschlager, G. J. (2005, April). Sensitivity of DFIT Tests of Measurement Invariance for Likert Data. Paper presented at the 20 th Annual Conference of the Society for Industrial and

More information

Simultaneous Equation and Instrumental Variable Models for Sexiness and Power/Status

Simultaneous Equation and Instrumental Variable Models for Sexiness and Power/Status Simultaneous Equation and Instrumental Variable Models for Seiness and Power/Status We would like ideally to determine whether power is indeed sey, or whether seiness is powerful. We here describe the

More information

Building Evaluation Scales for NLP using Item Response Theory

Building Evaluation Scales for NLP using Item Response Theory Building Evaluation Scales for NLP using Item Response Theory John Lalor CICS, UMass Amherst Joint work with Hao Wu (BC) and Hong Yu (UMMS) Motivation Evaluation metrics for NLP have been mostly unchanged

More information

ESTIMATING PISA STUDENTS ON THE IALS PROSE LITERACY SCALE

ESTIMATING PISA STUDENTS ON THE IALS PROSE LITERACY SCALE ESTIMATING PISA STUDENTS ON THE IALS PROSE LITERACY SCALE Kentaro Yamamoto, Educational Testing Service (Summer, 2002) Before PISA, the International Adult Literacy Survey (IALS) was conducted multiple

More information

Confidence Intervals On Subsets May Be Misleading

Confidence Intervals On Subsets May Be Misleading Journal of Modern Applied Statistical Methods Volume 3 Issue 2 Article 2 11-1-2004 Confidence Intervals On Subsets May Be Misleading Juliet Popper Shaffer University of California, Berkeley, shaffer@stat.berkeley.edu

More information

OBSERVATIONAL MEDICAL OUTCOMES PARTNERSHIP

OBSERVATIONAL MEDICAL OUTCOMES PARTNERSHIP An empirical approach to measuring and calibrating for error in observational analyses Patrick Ryan on behalf of the OMOP research team 25 April 2013 Consider a typical observational database study: Exploring

More information