A Bayesian Nonparametric Model Fit Statistic for Item Response Models

Purpose

As more and more states move to computer adaptive testing for their assessments, item response theory (IRT) has been widely applied. Investigating the fit of a parametric model therefore becomes an important measurement step before building the item pool. If a misfitting item is placed in the item pool, it will cause errors in item selection and ability estimation and thereby affect the validity of the test. Several researchers have developed tests for assessing IRT model fit (Douglas & Cohen, 2001; Orlando & Thissen, 2000; Yen, 1981). However, these tests have drawbacks that may lead to over-identifying misfitting items, and they offer no help in investigating the reasons for item misfit. Therefore, this study proposes a new Bayesian nonparametric item fit statistic and compares it with more traditional item fit statistics to demonstrate the advantages of the method.

Theoretical Framework

Although many studies have developed and compared approaches to model fit assessment, the topic remains a major hurdle for the effective implementation of IRT models (Hambleton & Han, 2005). Most of these approaches are χ²-based item fit statistics: Yen's (1981) Q1 statistic, Orlando and Thissen's (2000) S-X² statistic, and G². These χ²-based item fit statistics have several drawbacks, however. First, they are sensitive to sample size (Hambleton & Swaminathan, 1985). When the sample size is large, the tests tend to over-reject models, because large samples supply enough statistical power to detect even very small discrepancies between the model and the data. Since state tests typically have large sample sizes, almost all items yield a significant χ² statistic and appear not to fit the model. Second, several popular fit statistics do not actually follow a χ² distribution, because of ability estimation error and the treatment of estimated parameters as true values (Stone & Zhang, 2003), and their degrees of freedom are in question (Orlando & Thissen, 2000). These drawbacks lead to falsely identifying valid items as misfitting. Third, these fit statistics cannot indicate the location and magnitude of misfit for a misfitting item. As a result, content experts cannot explain why an item misfits or give suggestions for changing the model.

Because of these limitations of χ²-based fit statistics, Douglas and Cohen (2001) developed a nonparametric approach for assessing the fit of dichotomous IRT models, hereafter referred to as the kernel smoothing method. This method uses kernel smoothing to estimate a nonparametric item response function (IRF) and compares it with the parametric IRF. Later, Liang and Wells (2009) generalized it to assess polytomous IRT model fit. Liang, Wells, and Hambleton (2014) compared this approach with traditional χ²-based item fit statistics (S-X² and G²) under different conditions by manipulating test length, sample size, IRT model, and ability distribution. Their results indicated that the kernel smoothing method exhibits controlled Type I error rates and adequate power. Moreover, the method provides a clear graphical representation of model misfit.

However, in all of these studies applying the kernel smoothing approach, a bootstrapping procedure was used to construct an empirical distribution for determining the significance level of the fit statistic. This bootstrapping procedure did not account for the uncertainty of parameter estimation, because it reused the item parameter estimates previously obtained. The Bayesian statistic used in this study applies the posterior predictive model checking (PPMC) method (Gelman, Meng, & Stern, 1996; Guttman, 1967; Rubin, 1984), which takes parameter estimation uncertainty into account by integrating over the parameters. Thus, this study uses kernel smoothing and the PPMC method to develop a Bayesian nonparametric model fit statistic and compares it with S-X² and the bootstrapping statistic using simulated and empirical data.
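To make the contrast concrete, here is a minimal Python sketch (not the authors' code) of the parametric bootstrap used in the earlier kernel smoothing studies: the null distribution of a fit statistic is built by resimulating data from the fitted model while holding the point estimates fixed, which is precisely the estimation uncertainty that PPMC avoids ignoring. The names p_2pl, fit_stat, and bootstrap_null are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_2pl(theta, a, b):
    """Two-parameter logistic (2PL) item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def bootstrap_null(fit_stat, a_hat, b_hat, n_examinees, n_boot=1000):
    """Empirical null distribution of an item fit statistic under the
    fitted 2PL, treating the point estimates (a_hat, b_hat) as the true
    parameters -- i.e., ignoring their estimation uncertainty."""
    stats = np.empty(n_boot)
    for r in range(n_boot):
        theta = rng.standard_normal(n_examinees)  # N(0, 1) abilities
        y = (rng.random(n_examinees) < p_2pl(theta, a_hat, b_hat)).astype(int)
        stats[r] = fit_stat(y, theta)
    return stats  # compare the observed statistic to, e.g., its 95th percentile
```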

Method

A Monte Carlo simulation study was performed to examine the Type I error rate and power of the proposed statistic for detecting misfitting items in a mixed-format test under conditions differing in significance level. In addition, the proposed Bayesian nonparametric statistic was compared with (1) S-X², as provided by the computer software IRTPRO (Cai, Thissen, & du Toit, 2011), and (2) the bootstrapping kernel smoothing method. An empirical data set from a mixed-format test was also analyzed to explore the use of the Bayesian nonparametric approach for assessing model fit.

Kernel smoothing method. To apply the kernel smoothing method to educational assessment data, let Y_i be a binary random variable denoting whether the score on item i is obtained, and let Θ be the latent trait variable. Because Θ cannot be observed directly, an estimator θ̂ replaces it. The nonparametric IRF estimated by the kernel smoothing method is

\hat{P}_i^{non}(\theta) = \frac{\sum_{j=1}^{N} K\left((\hat{\theta}_j - \theta)/h\right) y_{ij}}{\sum_{j=1}^{N} K\left((\hat{\theta}_j - \theta)/h\right)}. \qquad (1)

To estimate the IRF, the responses y_{ij} are averaged over a small range around each evaluation point θ according to their weights. The weights are given by K(u), a nonnegative symmetric kernel function with its mode at 0 that decreases monotonically in the absolute value of u (Copas, 1983; Douglas, 1997). In this research, the Gaussian function is used as the kernel in equation (1) because it is a typical choice (Douglas, 1997):

K(u) = \exp\left(-\frac{u^2}{2}\right). \qquad (2)

In equation (1), h is a parameter called the bandwidth, which controls the amount of smoothing. The choice of h is a trade-off between the fluctuation of the regression function and the bias of its estimate (Copas, 1983; Douglas, 1997). Ramsay (1991) suggested that an optimal value of h for binary psychometric data depends on the sample size N: h = 1.1 N^{-0.2}. The present research uses this value because Ramsay (1991) also showed that it functions well with the Gaussian kernel.

Plotting the IRF estimated from equation (1) requires an estimate of each examinee's proficiency. The estimate θ̂_j of the j-th examinee's proficiency is obtained by ordinal ability estimation (Douglas, 1997): the examinee's empirical percentile in the latent trait distribution is determined from the sum of the item scores, and the inverse function G^{-1} of the latent trait distribution G is applied to that percentile to obtain the proficiency. In this research, the standard normal distribution is chosen as the latent trait distribution.
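As an illustration, the following Python sketch implements the ordinal ability estimation and the kernel-smoothed IRF of equations (1) and (2) with Ramsay's bandwidth rule. The function names are illustrative, and the mid-rank treatment of tied sum scores is an assumption the text does not specify.

```python
import numpy as np
from scipy.stats import norm, rankdata

def ordinal_ability(sum_scores):
    """Ordinal ability estimates (Douglas, 1997): each examinee's
    empirical percentile (by total score) is pushed through G^{-1},
    here the inverse standard normal CDF."""
    n = len(sum_scores)
    ranks = rankdata(sum_scores, method="average")  # 1..n, ties mid-ranked
    percentiles = (ranks - 0.5) / n                 # keep strictly in (0, 1)
    return norm.ppf(percentiles)

def kernel_irf(theta_hat, y_i, eval_points, h=None):
    """Nonparametric IRF of equation (1) with the Gaussian kernel of
    equation (2) and Ramsay's (1991) bandwidth h = 1.1 * N^(-0.2)."""
    n = len(theta_hat)
    if h is None:
        h = 1.1 * n ** (-0.2)
    u = (eval_points[:, None] - theta_hat[None, :]) / h  # Q x N grid
    w = np.exp(-0.5 * u ** 2)                            # K(u), equation (2)
    return (w * y_i).sum(axis=1) / w.sum(axis=1)         # weighted averages
```

For a 3,000-examinee, 60-item test, kernel_irf(ordinal_ability(scores.sum(axis=1)), scores[:, i], np.linspace(-3, 3, 100)) would trace item i's nonparametric IRF over Q = 100 evaluation points.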

PPMC. Suppose ω denotes the unknown parameters of the assumed model H, y the observed data, p(ω) the prior distribution of the unknown parameters, and p(y | ω) the likelihood of the observed data under the assumption that model H is true (Sinharay, 2005). The posterior distribution of the unknown parameters is then p(ω | y) ∝ p(y | ω) p(ω). The replicated data y^{rep} can be interpreted as data that would be observed in the future, or predicted, if responses were replicated from the same model H with parameters drawn from p(ω | y). The posterior predictive distribution of y^{rep} is

p(y^{rep} \mid y) = \int p(y^{rep} \mid \omega)\, p(\omega \mid y)\, d\omega. \qquad (3)

This distribution is formed from a Bayesian perspective to eliminate the nuisance parameters by integrating them out (Bayarri & Berger, 2000). To check model fit, a statistic measuring model fit is calculated from the observed data and compared with the distribution of the same statistic calculated from replicated data drawn from this distribution, yielding the posterior predictive p-value (PPP-value) (Gelman, Meng, & Stern, 1996). The PPP-value provides a quantitative measure of the degree to which the model captures the features of the observed data, in other words, the fit of the model to the observed data. It is defined as (Bayarri & Berger, 2000)

p = P\left(T(y^{rep}) \ge T(y) \mid y\right) = \int_{T(y^{rep}) \ge T(y)} p(y^{rep} \mid y)\, dy^{rep}, \qquad (4)

where T(y) is a discrepancy measure, the statistic used to measure model fit. Extreme PPP-values, those close to 0, 1, or both (depending on the nature of the discrepancy measure), indicate that the model does not fit the data (Sinharay et al., 2006).

Procedure of the Bayesian nonparametric fit statistic. The model H in this study is an IRT model, the parameters ω are the item parameters and the proficiency parameters θ, and the observed data y are the responses of the examinees to all items. The discrepancy measure T(y) for the i-th item is

T(y) = \frac{1}{Q} \sum_{q=1}^{Q} \left[ \frac{\sum_{k=1}^{K-1} \left(\hat{P}_{qk} - \hat{P}_{qk}^{non}\right)^2}{K - 1} \right], \qquad (5)

where \hat{P}_{qk} and \hat{P}_{qk}^{non} are the probabilities estimated by the parametric IRT model and the nonparametric model, respectively, at evaluation point q for score category k; Q is the number of evaluation points (e.g., Q = 100); and K is the total number of score categories. The Bayesian nonparametric fit statistic was calculated by the following procedure (a code sketch follows the list):

1. The MCMC algorithm was used to simulate the posterior distributions of the item parameters and proficiency parameters from the observed data and the IRT model.
2. A total of N proficiencies θ, one per examinee, was drawn from the corresponding posterior distributions; for example, θ_j for the j-th examinee was drawn from the posterior distribution p(θ_j | y).
3. The item parameters of the n items were drawn from their posterior distributions.
4. A data set was generated from the IRT model using the drawn item and proficiency parameters.
5. The discrepancy measure was calculated for this data set and compared with the same discrepancy measure calculated from the observed data.
6. Steps 2 through 5 were repeated M times (e.g., M = 100) to compute the PPP-value for every item.
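Assuming posterior draws are available from an MCMC run, steps 2 through 6 might look like the sketch below for a dichotomous item, reusing the hypothetical p_2pl and kernel_irf helpers from the earlier sketches. For brevity it uses the drawn θs directly as the proficiency estimates for the replicated data rather than recomputing ordinal ability estimates, a simplification of the procedure above.

```python
import numpy as np

def discrepancy(y_i, theta_hat, a, b, eval_points):
    """T(y) of equation (5) for a dichotomous item (K = 2, so the inner
    sum has a single term): mean squared gap between the parametric and
    kernel-smoothed IRFs over the Q evaluation points."""
    p_par = p_2pl(eval_points, a, b)                 # parametric IRF
    p_non = kernel_irf(theta_hat, y_i, eval_points)  # nonparametric IRF
    return np.mean((p_par - p_non) ** 2)

def ppp_value(y_obs_i, theta_hat_obs, post_a, post_b, post_theta,
              eval_points, m=100, seed=0):
    """PPP-value for one item.  post_a and post_b hold M posterior draws
    of the item's parameters; post_theta is an M x N array of posterior
    draws of the examinee proficiencies (steps 2-3)."""
    rng = np.random.default_rng(seed)
    exceed = 0
    for t in range(m):
        a, b, theta = post_a[t], post_b[t], post_theta[t]
        # Step 4: replicate the item's responses from the drawn parameters.
        y_rep = (rng.random(theta.size) < p_2pl(theta, a, b)).astype(int)
        # Step 5: compare replicated and observed discrepancies.
        t_rep = discrepancy(y_rep, theta, a, b, eval_points)
        t_obs = discrepancy(y_obs_i, theta_hat_obs, a, b, eval_points)
        exceed += int(t_rep >= t_obs)
    return exceed / m  # Step 6: PPP-values near 0 or 1 flag misfit.
```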

Data

Simulation data. The simulation data were generated using item parameters drawn from the distributions a ~ log-N(0, 0.4) and b ~ N(0, 1). Both the two-parameter logistic model (2PLM) and the graded response model (GRM) were used: 80% of the items were generated from the 2PLM and 20% from the GRM. In each simulated test, 20% of the items were generated as misfitting, half of them dichotomous and half polytomous. The sample size was 3,000 and the test length was 60 items. Two significance levels, α = 0.05 and α = 0.01, were examined, with 100 replications at each level.

Empirical data. The data came from a large-scale assessment with 53 items and a sample size of 3,804. Ten partial-credit polytomous items were fitted with the GRM, and the remaining dichotomous items were fitted with the 2PLM. Fit was tested using S-X², the bootstrapping kernel smoothing method, and the Bayesian nonparametric method. The three statistics were compared for the items flagged as misfitting, and these items were further explored through graphical representations.

Results

For the simulation data, Tables 1 and 2 summarize the Type I error rate and power of the three fit statistics being compared. Table 1 reports the empirical Type I error rates of PPMC, bootstrapping, and S-X² for the mixed-format test, the 2PLM items, and the GRM items.

Table 1
Empirical Type I error rates of PPMC, bootstrapping, and S-X²

Significance level   Model          PPMC    Bootstrapping   S-X²
0.05                 Mixed format   0.088   0.095           0.063
                     2PLM           0.090   0.109           0.062
                     GRM            0.072   0.000           0.070
0.01                 Mixed format   0.020   0.029           0.015
                     2PLM           0.020   0.033           0.015
                     GRM            0.020   0.000           0.020

At both the 0.05 and 0.01 significance levels, S-X² has the lowest Type I error rate for the mixed-format test and the 2PLM items, while the bootstrapping method has the lowest Type I error rate for the GRM items. In general, aside from the zero Type I error rate of the bootstrapping method for GRM items, the Type I error rates of the three methods at the two significance levels are very close. Table 2 reports the empirical detection rates of PPMC, bootstrapping, and S-X² for the mixed-format test, the 2PLM items, and the GRM items.

Table 2
Empirical detection rates of PPMC, bootstrapping, and S-X²

Significance level   Model          PPMC    Bootstrapping   S-X²
0.05                 Mixed format   0.898   0.498           0.983
                     2PLM           0.888   0.995           0.987
                     GRM            0.908   0.000           0.978
0.01                 Mixed format   0.765   0.439           0.932
                     2PLM           0.823   0.878           0.923
                     GRM            0.707   0.000           0.940

At both significance levels, the S-X² method has the highest empirical detection rate for the mixed-format test, the 2PLM items, and the GRM items. Notably, the detection rate of the bootstrapping method is very low for GRM items. At the 0.05 significance level, the detection rates of the bootstrapping and S-X² methods are very close (for the 2PLM items), but at the 0.01 level S-X² has a much higher detection rate. Based on the simulation results, all three methods check model fit accurately for the 2PLM, as indicated by high detection rates and low Type I error rates at both significance levels. For the GRM, however, the bootstrapping method cannot identify misfitting items, whereas the other two methods identify misfitting GRM items accurately.

For the empirical data, of the 53 items, the S-X² method identified 22 items as misfitting, the bootstrapping method identified 14, and the PPMC method identified 13. Only four items were identified as misfitting by all three methods. The following graphs include the nonparametric and parametric IRFs for these four misfitting items.

Figure 1. Nonparametric and parametric IRFs for the four misfitting items.

The nonparametric IRFs indicate that the misfit is located in the middle and high theta range for the first item, in the low theta range for the second item, in the low and high theta ranges for the third item, and in the low theta range for the fourth item.
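As a pointer for reproducing such plots, the overlay below is a sketch using matplotlib and the hypothetical helpers from the Method sketches; a_hat, b_hat, theta_hat, and y_i are assumed to be available. Divergence between the two curves localizes the misfit along the theta scale.

```python
import numpy as np
import matplotlib.pyplot as plt

grid = np.linspace(-3, 3, 100)  # evaluation points
plt.plot(grid, p_2pl(grid, a_hat, b_hat), label="Parametric IRF (2PLM)")
plt.plot(grid, kernel_irf(theta_hat, y_i, grid), label="Nonparametric IRF")
plt.xlabel("theta")
plt.ylabel("Probability of correct response")
plt.legend()
plt.show()
```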

Significance

The proposed Bayesian nonparametric model fit assessment method can easily be generalized to any IRT model and accounts for the uncertainty of parameter estimation. The method also provides a graphical representation of misfitting items for investigating the location and magnitude of misfit. The simulation results indicate that the Bayesian nonparametric model fit statistic has a low Type I error rate and a high detection rate, and in the empirical study the Bayesian nonparametric method detected a reasonable number of misfitting items. These results indicate that the Bayesian nonparametric model fit statistic can be used as a model fit assessment method.

When the Bayesian nonparametric method was compared with the other model fit methods in the simulation study, its detection rates were very similar to those of the S-X² method. This result indicates that the two methods check model fit equally well, so the choice between them depends on the estimation method used for item calibration, since each method draws on different calibration results for model fit checking. For example, if the MCMC method is used for estimation, the Bayesian nonparametric method should be used for model fit checking; if the maximum likelihood method is used, the S-X² method should be used.

References

Bayarri, M. J., & Berger, J. O. (2000). P values for composite null models. Journal of the American Statistical Association, 95(452), 1127-1142.

Cai, L., Thissen, D., & du Toit, S. (2011). IRTPRO for Windows [Computer program]. Chicago, IL: Scientific Software International.

Copas, J. (1983). Plotting p against x. Journal of the Royal Statistical Society, Series C (Applied Statistics), 32(1), 25-31.

Douglas, J. (1997). Joint consistency of nonparametric item characteristic curve and ability estimation. Psychometrika, 62(1), 7-28.

Douglas, J., & Cohen, A. S. (2001). Nonparametric item response function estimation for assessing parametric model fit. Applied Psychological Measurement, 25, 234-243.

Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733-759.

Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society, Series B (Methodological), 29(1), 83-100.

Hambleton, R. K., & Han, N. (2005). Assessing the fit of IRT models to educational and psychological test data: A five-step plan and several graphical displays. In W. R. Lenderking & D. Revicki (Eds.), Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications (pp. 57-78). Washington, DC: Degnon Associates.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.

Liang, T., & Wells, C. S. (2009). A model fit statistic for the generalized partial credit model. Educational and Psychological Measurement, 69, 913-928.

Liang, T., Wells, C. S., & Hambleton, R. K. (2014). An assessment of the nonparametric approach for evaluating the fit of item response models. Journal of Educational Measurement, 51(1), 1-17.

Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50-64.

Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56(4), 611-630.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151-1172.

Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42(4), 375-394.

Sinharay, S., Johnson, M. S., & Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Applied Psychological Measurement, 30(4), 298-321.

Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit in item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40, 331-352.

Yen, W. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.