A Multilevel Testlet Model for Dual Local Dependence
Journal of Educational Measurement, Spring 2012, Vol. 49, No. 1

A Multilevel Testlet Model for Dual Local Dependence

Hong Jiao, University of Maryland
Akihito Kamata, University of Oregon
Shudong Wang, Northwest Evaluation Association
Ying Jin, United BioSource Corporation

The applications of item response theory (IRT) models assume local item independence and that examinees are independent of each other. When a representative sample for psychometric analysis is selected using a cluster sampling method in a testlet-based assessment, both local item dependence and local person dependence are likely to be induced. This study proposed a four-level IRT model to simultaneously account for dual local dependence due to item clustering and person clustering. Model parameter estimation was explored using the Markov Chain Monte Carlo method. Model parameter recovery was evaluated in a simulation study in comparison with three other related models: the Rasch model, the Rasch testlet model, and the three-level Rasch model for person clustering. In general, the proposed model recovered the item difficulty and person ability parameters with the least total error. The bias in both item and person parameter estimation was not affected, but the standard error (SE) was. In some simulation conditions, the difference in classification accuracy between models was as large as 11%. The illustration using real data generally supported the model performance observed in the simulation study.

Local independence is one of the central assumptions that underlie item response theory (IRT) models (e.g., Hambleton & Swaminathan, 1985). As Embretson and Reise (2000, p. 48) have stated, "Essentially, local independence is obtained when the relationship among items (or persons) is fully characterized by the IRT model." According to Reckase (2009, p. 13), the local independence assumption has two parts: local item independence and local person independence.
Copyright © 2012 by the National Council on Measurement in Education

Local item independence implies that a person's response to an item will not affect the probability of the person's response to another item and can be represented mathematically as

p(u = u | θ) = ∏_{i=1}^{I} p(u_i | θ) = p(u_1 | θ) p(u_2 | θ) ... p(u_I | θ). (1)

This equation indicates that the probability of a response pattern, u, to multiple items for a person with latent trait level θ, p(u = u | θ), equals the product of the probabilities of the individual's responses, u_i, to the ith item on a test, p(u_i | θ). Local person independence implies that a person's response to a specific item is
independent of any other person's response to that item and can be mathematically represented as

p(u_i = u_i | θ) = ∏_{j=1}^{n} p(u_ij | θ_j) = p(u_i1 | θ_1) p(u_i2 | θ_2) ... p(u_in | θ_n). (2)

This equation indicates that the probability of a set of responses to any item i by n persons with abilities θ_j in the vector θ is the product of the probabilities of each individual person's response to that item. When the local independence assumption is evaluated, both facets of independence should be considered, since the assumption is violated when either local item dependence or local person dependence is present. Further, when local independence does not hold, IRT model parameter estimation, which is based on the likelihood function, will be affected (Baker & Kim, 2004). However, the research that has investigated the effects of these violations typically has addressed the effects of local item dependence and local person dependence on IRT model parameter estimation separately (e.g., Bradlow, Wainer, & Wang, 1999; Fox & Glas, 2001; Jiao, Wang, & Kamata, 2005; Kamata, 1998, 2001; Wang & Wilson, 2005; Yen, 1984).

Local Item Dependence

IRT models are not robust to violation of the local item independence assumption. It has been demonstrated that local item dependence affects model parameter estimation, equating, and estimation of test reliability (e.g., Ackerman, 1987; Bradlow et al., 1999; Chen & Thissen, 1997; Jiao et al., 2005; Wainer, Bradlow, & Wang, 2007; Wang & Wilson, 2005; Yen, 1984). Possible causes of local item dependence include passage dependence; item chaining; explanation of previous answers, such as clueing; item or response format (multiple-choice vs. constructed-response items); scoring rubrics; fatigue; speededness; and practice effects (e.g., Yen, 1984). Passage dependence is a commonly studied cause of local item dependence in educational assessments.
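Both factorizations can be illustrated numerically under a Rasch parameterization; a minimal sketch (the abilities and difficulties below are illustrative values, not from the article):

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pattern_prob(theta, responses, difficulties):
    """Equation (1): probability of one person's response pattern,
    assuming local item independence (product over items)."""
    prob = 1.0
    for u, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        prob *= p if u == 1 else 1.0 - p
    return prob

def item_prob(responses, thetas, b):
    """Equation (2): probability of n persons' responses to one item,
    assuming local person independence (product over persons)."""
    prob = 1.0
    for u, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        prob *= p if u == 1 else 1.0 - p
    return prob
```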
A passage does not necessarily mean a reading passage. It could alternatively refer to such contextual effects as a scenario in a science assessment or a graph or table in a mathematics test. When items are constructed around such a common stimulus, items associated with the same stimulus are connected by the common context. Thus, item connection or clustering may affect an examinee's performance on those items due to the common contextual effects. As a result, local item dependence or testlet effects may be induced (Wainer & Kiely, 1987). Recent studies have proposed models to deal with local item dependence due to testlets. For example, Bradlow et al. (1999) proposed a two-parameter Bayesian random-effects testlet model capturing the interaction between person and item cluster by incorporating a random-effect parameter into the unidimensional two-parameter item response model. Extensions of this model have been made to a three-parameter IRT model as well as to the graded-response model (Du, 1998; Wang, Bradlow, & Wainer, 2002). In another attempt to model testlet effects, Wang and Wilson (2005) proposed the Rasch testlet model as a special case of the multidimensional random
coefficients multinomial logit model. Similarly, Jiao et al. (2005) developed a three-level one-parameter testlet model from the hierarchical generalized linear modeling framework for item analysis (Kamata, 2001). These testlet models account for local item dependence due to item clustering; however, they address only one facet of the local independence assumption, and none of these papers takes into account local person dependence due to person clustering.

Local Person Dependence

As pointed out earlier, local person independence is an important facet of the local independence assumption for IRT models. Violations of this assumption are likely to occur as a result of person clustering due to factors such as cluster sampling, external assistance or interference, differential opportunity to learn, or different problem-solving strategies. A failure to incorporate person clustering effects in model parameter estimation may jeopardize measurement precision and lead to biased parameter estimates, as the dependence among individuals within clusters reduces the effective sample size (Cochran, 1977; Cyr & Davies, 2005; Kish, 1965). In the educational assessment setting, it is typical that students are nested within classes and classes are nested within schools. Students in a given classroom or school may respond to items in a more consistent manner than students from different classrooms or schools as a result of having received more similar instruction than what would be expected across classrooms or schools. Therefore, when the sampling unit is classrooms or schools, a person clustering effect is more likely to occur. For standardized achievement tests, such as the Stanford Achievement Tests, psychometric analyses often are based on a representative sample that is selected using a cluster sampling method.
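The effective-sample-size reduction referenced here follows the standard design-effect formula for equal-sized clusters (Kish, 1965); a sketch, in which the intraclass correlation rho is an illustrative value of our choosing:

```python
def design_effect(cluster_size, rho):
    """Kish design effect for equal-sized clusters: deff = 1 + (m - 1) * rho,
    where m is the cluster size and rho is the intraclass correlation."""
    return 1.0 + (cluster_size - 1) * rho

def effective_sample_size(n_total, cluster_size, rho):
    """Effective sample size after accounting for within-cluster dependence."""
    return n_total / design_effect(cluster_size, rho)

# With 1,000 students in classes of 20 and an assumed rho of .20,
# the effective sample size drops to roughly 208.
n_eff = effective_sample_size(1000, 20, 0.20)
```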
Another example of this kind of clustering is the K-12 state tests required by the No Child Left Behind Act of 2001. States with a population of more than 100,000 students per grade may select a representative sample for analyses using a cluster sampling method. A third example is the Race to the Top assessment. It is likely that the common core assessments will use a cluster sampling method to select a representative sample per grade from all participating states. A multilevel modeling framework can be used to account for dependence among the clustered individuals in psychometric analysis (e.g., Bryk & Raudenbush, 1992; de Leeuw & Kreft, 1986; Goldstein, 1995). Some researchers have reformulated IRT models into multilevel models in which items are nested within persons, and extended these IRT models to reflect the fact that persons are nested within higher contextual levels (e.g., Adams, Wilson, & Wu, 1997; Fox & Glas, 2001; Kamata, 1998, 2001). Other researchers have included person-level covariates to improve the estimation of item and person parameters (e.g., Mislevy, 1987; Mislevy & Bock, 1989). Another attempt was made by Cohen, Chan, Jiang, and Seburn (2008) to use a nonparametric method to study Rasch modeling under the complex sample designs typically found in state testing programs. In addition, multilevel IRT models (Johnson & Jenkins, 2005; Li, Oranje, & Jiang, 2009) have been applied in large-scale survey assessments such as the National Assessment of Educational Progress. However, these multilevel IRT models do not account for the effects due to item clustering.
[Figure 1. The hierarchy of the four-level model for item clustering and person clustering.]

A Four-Level IRT Model for Dual Local Dependence

Currently available IRT models do not simultaneously account for both local item dependence due to item clustering and local person dependence due to person clustering. Testlet models, including random-effects testlet models, multidimensional testlet models, and multilevel testlet models, ignore the effects of person clustering; multilevel IRT models, which account for person clustering, ignore item clustering effects. Thus, the local independence assumption is not fully addressed under any of the above-mentioned modeling approaches. To simultaneously model both item and person dependence, this study proposes a four-level IRT model from the multilevel measurement modeling framework (Jiao, Kamata, Wang, & Jin, 2010b). By extending the hierarchy of the three-level one-parameter testlet model by Jiao et al. (2005), the four-level model can be schematically represented as depicted in Figure 1. The first level models item effects and the second level models testlet effects. The items are nested within the testlets. Level three models the effects of persons, who are fully crossed with testlets and ultimately with items. This indicates that every person answers every item but each item is associated with only one testlet. The fourth level models the examinee group effects. Persons are nested within groups such as classes, schools, or school districts. Each person is associated with only one level-four unit. This hierarchy represents testing situations where testlets are used for assessment purposes and a cluster sampling method is used for selecting a representative sample for psychometric analysis.
The four-level IRT model originally was constructed within a hierarchical generalized linear modeling framework (Jiao, Kamata, Wang, & Jin, 2010b). (Information about the model setup is available to the reader upon request.) By following the convention used in the IRT framework, the combined four-level IRT model for dual dependence can be expressed as

p_jdig = 1 / (1 + exp[−(θ_j + θ_g − b_i + γ_jd(i))]), (3)
where θ_j is the person-specific ability for person j, θ_g is the group-specific ability for group g, b_i is the item difficulty for item i, and γ_jd(i) is the testlet effect for person j on testlet d. The interpretations of θ_j and b_i follow those in the conventional unidimensional IRT models. The group ability represented by θ_g is the same for all individuals in the group but different for individuals from different groups. The variability of θ_g indicates the extent of group effects. The item-clustering effects, the person-clustering effects, and the person ability are assumed to be mutually exclusive and independent. It also is assumed that the residuals are uncorrelated after controlling for the variances from the above-mentioned three sources.

Model Identification

Like the conventional Rasch model, the proposed four-level model for dual dependence cannot be identified without imposing some constraints. The common practice used to identify the Rasch model is to constrain either the mean ability or the mean of the item difficulty parameters to be 0. When considering the Rasch model as a multilevel model assuming that the ability and item difficulty parameters follow a normal distribution, Gelman and Hill (2007) suggested fixing the mean of θ or fixing the mean of the item difficulty parameters to be 0, but not doing both at the same time. This restriction approach for producing model identification is essentially the same as the current practice used in Rasch model applications. Another method to identify the Rasch model scale is to allow the item and person ability parameters to float but to adjust the floating parameters to a newly defined value (Bafumi, Gelman, Park, & Kaplan, 2005; Gelman & Hill, 2007). The new adjusted quantities replace the model parameters to make the model well identified while preserving the logit scale of the model. The adjusted quantities can be defined as

θ_j^adj = θ_j − θ̄, for j = 1, 2, ..., J, and b_i^adj = b_i − θ̄, for i = 1, 2, ..., I, (4)

where θ̄ denotes the mean person ability. As discussed in Fox and Glas (2001), in multilevel IRT models with the scale of the latent dimension made up of several variance components, one often fits a hierarchical set of models with various decompositions of the ability variance. Fixing one of these variance components is not practical. An alternative is to impose the identifying restrictions on the item parameters. For the two-parameter normal ogive multilevel model, they fixed one discrimination parameter to 1 and one difficulty to 0. In this study, the mean item difficulty is constrained to be 0 to produce identifiability for the proposed model, with no further adjustment needed. This approach to scale identification is in fact consistent with one of the approaches provided in popularly utilized software for identifying the Rasch model (such as WINSTEPS).

Model Estimation

To our knowledge, none of the currently available IRT or multilevel software packages can analyze data using the proposed four-level IRT model for dual local dependence. (Note that it is possible to take a two-level bi-factor modeling approach with a complex sampling design in software like Mplus.) Therefore, this study explored model parameter estimation using the Markov Chain Monte Carlo (MCMC) method
implemented in WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000). The advantages of using the MCMC method for multilevel modeling have been clearly illustrated in Draper (2008, p. 91). In general, the MCMC procedure makes use of the prior information about model parameters and the data to obtain the posterior estimates of the parameters. The proposed four-level IRT model as presented in (3) can be expressed as follows:

y_jdig ~ Bernoulli(p_jdig),
logit(p_jdig) = θ_j + θ_g − b_i + γ_jd(i),
i = 1, 2, ..., I; d = 1, 2, ..., T; j = 1, 2, ..., J; g = 1, 2, ..., G,
b_i ~ N(0, 1), θ_j ~ N(0, σ²_θ), θ_g ~ N(0, σ²_θg), and γ_jd ~ N(0, σ²_γd). (5)

These specify the distributions for the item difficulty parameters, person ability, group ability, and testlet effect parameters. The item difficulty parameters are assumed to have a standard normal distribution. The random effects are assumed to be normally distributed with means of 0 and respective variances to be estimated. Gibbs sampling generates random variables from a joint distribution indirectly, without calculating the density (Casella & George, 1992, p. 167). MCMC with Gibbs sampling starts with the initialization of the model parameters (Draper, 2008). In this study, the model parameters to be estimated are the item difficulty parameters, the person ability parameters, the ability variance, the group ability variance, and the testlet variance for each testlet. The initial values for these parameters are generated from the prespecified prior distributions. After initialization of the model parameters, given the observed item response data, testlet indicators, and group indicators, the MCMC data set is filled by sampling each parameter to be estimated from its conditional distribution given the latest estimates of the other parameters. When the Markov chains reach equilibrium, the burn-in process ends and the monitoring process starts.
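As one ingredient that any sampler for (5) must evaluate, the Bernoulli log-likelihood can be sketched directly; the index arrays `testlet_of` and `group_of` are our own bookkeeping for d(i) and g(j), not part of the original specification:

```python
import math

def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))

def log_likelihood(y, theta, theta_g, b, gamma, testlet_of, group_of):
    """Bernoulli log-likelihood of the four-level model:
    logit(p_jdig) = theta_j + theta_g(j) - b_i + gamma_{j, d(i)}.
    y[j][i] is person j's 0/1 response to item i; gamma[j][d] is
    person j's effect on testlet d."""
    ll = 0.0
    for j, row in enumerate(y):
        for i, u in enumerate(row):
            eta = theta[j] + theta_g[group_of[j]] - b[i] + gamma[j][testlet_of[i]]
            p = inv_logit(eta)
            ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll
```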
The results from each iteration after the burn-in phase are used for model inferences and summarized to obtain the estimates of the model parameters. Gibbs sampling requires that samples be drawn from the full conditional distributions derived from the target distribution. However, the full conditional distributions for multilevel logistic regression models do not have an analytic closed form (Chaimongkol, Huffer, & Kamata, 2006). Gilks and Wild (1992) suggested using adaptive rejection sampling for log-concave full conditional distributions. This approach is implemented in the WinBUGS 1.4 software and is used to fit the models considered in this study.

Method

Simulation Study

A simulation study was carried out to examine the performance of the proposed four-level IRT model for dual dependence. Item response data that mimicked a large-scale passage-based reading comprehension test were simulated by assuming four
passages with nine items associated with each passage. A cluster sampling method using class as the sampling unit was used to select simulated examinees. The true values of the person ability and the item difficulty parameters were both randomly generated from a standard normal distribution. The cluster size was assumed to be an average class size of 20, which reflected a typical class sample size in the National Assessment of Educational Progress state assessments (Li et al., 2009). The number of clusters was set to be 50. According to Binici (2007), a cluster number of 50 with a cluster size of 20 produced a reasonable and acceptable level of estimation error for a multilevel IRT model. This combination of the cluster number and the cluster size resulted in a total of 1,000 students. For the simulated study conditions, the true ability and item difficulty parameters were kept the same across simulation conditions, while the magnitudes of local item dependence and local person dependence were manipulated. Local item dependence parameter values were simulated from a normal distribution N(0, σ²_γ) by specifying different magnitudes of the variance, σ²_γ. The magnitude of local item dependence was set at two levels by specifying σ²_γ as .25 or 1.00. Two levels of the person clustering variance, σ²_θg, were set at .25 or 1.00 to represent low or moderate person clustering effects, respectively. The group-specific effect parameters (θ_g) then were simulated from a normal distribution N(0, σ²_θg). For each joint level of local item dependence and local person dependence, item responses were generated by incorporating the true ability, item difficulty, group-specific ability, and local item dependence parameters in (3).
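Under the stated design (four testlets of nine items; 50 clusters of 20 examinees; variance levels of .25 or 1.00), one replication's generating process can be sketched as follows; the function and variable names are ours, not the authors':

```python
import math
import random

def simulate_responses(n_clusters=50, cluster_size=20, n_testlets=4,
                       items_per_testlet=9, var_group=0.25,
                       var_testlet=0.25, seed=1):
    """Generate 0/1 responses from the four-level model in (3)."""
    rng = random.Random(seed)
    n_items = n_testlets * items_per_testlet
    n_persons = n_clusters * cluster_size
    b = [rng.gauss(0, 1) for _ in range(n_items)]        # item difficulties
    theta = [rng.gauss(0, 1) for _ in range(n_persons)]  # person abilities
    theta_g = [rng.gauss(0, math.sqrt(var_group)) for _ in range(n_clusters)]
    testlet_of = [i // items_per_testlet for i in range(n_items)]  # d(i)
    group_of = [j // cluster_size for j in range(n_persons)]       # g(j)
    y = []
    for j in range(n_persons):
        # person-specific testlet effects, one per testlet
        gamma = [rng.gauss(0, math.sqrt(var_testlet)) for _ in range(n_testlets)]
        row = []
        for i in range(n_items):
            eta = theta[j] + theta_g[group_of[j]] - b[i] + gamma[testlet_of[i]]
            p = 1.0 / (1.0 + math.exp(-eta))
            row.append(1 if rng.random() < p else 0)
        y.append(row)
    return y, b, theta, theta_g
```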
Once item responses were generated, the MCMC method in WinBUGS 1.4 was used to estimate parameters for the proposed four-level IRT model for dual dependence as well as for the Rasch model, the Rasch testlet model, and the three-level Rasch model (modeling person clustering effects) for comparison purposes. The combination of the different levels of item dependence (testlet variance), person dependence (group variance), and the estimation model resulted in 16 study conditions. Twenty-five replications were run, and parameter recovery was evaluated in terms of bias, standard error (SE), and root mean squared error (RMSE) for the item and person parameters and the variances of the three random effects. The bias, SE, and RMSE were computed by

Bias(β̂) = (1/N) Σ_{r=1}^{N} (β̂_r − β), (6)

SE(β̂) = √[ (1/N) Σ_{r=1}^{N} (β̂_r − (1/N) Σ_{t=1}^{N} β̂_t)² ], (7)

RMSE(β̂) = √[ (1/N) Σ_{r=1}^{N} (β̂_r − β)² ], (8)
where β is the true model parameter, β̂_r is the estimated model parameter for the rth replication, and N is the number of replications. Classification accuracy, computed as the percentage of correct classifications of examinees into pass and fail groups, was compared across the four calibration models under each simulation condition. The binary classification decisions were made by comparing the relative position of a person's θ estimate and a θ cut score obtained from standard setting conducted for a state reading comprehension test for high school graduation.

Prior Setting

Within the Bayesian framework, a prior distribution needs to be specified for each of the model parameters. When the prior distribution is noninformative, the range of the uncertainty should be wider than the range of reasonable values of the parameters (Gelman & Hill, 2007). When the prior knowledge is reliable, the use of an informative prior with smaller uncertainty may facilitate the estimation of the posterior. When a noninformative proper prior is supplied, the posterior is estimated by relying heavily on the data. Various noninformative prior distributions have been suggested for the variance parameters in the literature. For example, Gelman and Hill (2007) found that some noninformative prior distributions may unduly affect inferences when the number of groups is small and the group-level variability is close to zero in multilevel modeling. They demonstrated three different types of priors for variance estimation. Their results showed that the inverse-gamma (α = .001, β = .001) prior distribution distorted the posterior distribution. On the other hand, the inverse-gamma (α = 1, β = 1) prior distribution generally concentrated the prior mass in the range (.5, 5), and the posterior closely matched the prior distribution.
In addition, they found that the uniform prior distribution with a wide range functioned close to a noninformative prior distribution and did not appear to constrain the posterior inferences. In general, Gelman and Hill (2007) recommend the uniform prior for the standard deviation with a wide range such as (0, 100). They do not recommend using the inverse-gamma distribution as a noninformative prior but recommend the use of the inverse-gamma as a proper prior distribution when the group sample size is small and the group variance is near zero. It is noted, however, that the inverse-gamma prior family may be preferred by some other researchers due to its conditional conjugacy, which provides cleaner mathematical properties. In fact, we experimented with all three options suggested by Gelman and Hill (2007) as the prior distributions for the variance parameters, including the inverse-gamma with α = .001 and β = .001, the inverse-gamma with α = 1 and β = 1, and the uniform prior with a range of (0, 100). However, only the inverse-gamma (α = 1, β = 1) prior distribution worked for our model. In addition, for our simulation data, the group variance was substantially different from zero and our group sample size was not small. Therefore, the prior distributions for the ability variance, the group variance, and the testlet variances all were set to an inverse-gamma (α = 1, β = 1) distribution. In other applications, different priors may function differently, so other choices should be explored and inferences compared.
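The behavior of the inverse-gamma(1, 1) prior is easy to verify: if X ~ inverse-gamma(1, 1), then 1/X is exponential with rate 1, so P(X ≤ x) = exp(−1/x), and about 68% of the prior mass falls in (.5, 5). A sketch:

```python
import math

def invgamma_1_1_cdf(x):
    """CDF of an inverse-gamma(alpha = 1, beta = 1) variable:
    P(X <= x) = exp(-1/x) for x > 0, since 1/X ~ Exponential(1)."""
    return math.exp(-1.0 / x) if x > 0 else 0.0

# prior mass in the interval (.5, 5): about .68
mass = invgamma_1_1_cdf(5.0) - invgamma_1_1_cdf(0.5)
```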
Convergence Check

The MCMC procedure is an iterative algorithm. Several Markov chains run in parallel, each starting from different initial values. The algorithm is often run until the simulations from the different initial values converge to a common distribution. Multiple chains usually are recommended for checking the proper mixing of the chains (Gelman & Hill, 2007). Thus, in this study, four chains were used for the MCMC runs. Convergence was checked against multiple criteria to make sure that convergence was achieved before the model parameter estimates were monitored. First, the Gelman-Rubin statistic (R), as modified by Brooks and Gelman (1998), was used. Convergence was assessed by comparing the within-chain (W) and between-chain (B) variability over the second half of the chains. The ratio R of the pooled to the within-chain variability was expected to be greater than 1 if the starting values were sufficiently different, and to approach 1 as convergence was reached. For practical purposes, convergence can be assumed if R < 1.05 (Lunn et al., 2000). A sample check over replications under the simulation conditions indicated that R generally was close to 1 and smaller than 1.05. Brooks and Gelman (1998) emphasized the importance of ensuring not only that R has converged to 1 but also that B and W have converged to stability. The Brooks-Gelman Ratio (BGR) diagnostic plots indicated that stability and convergence usually occurred between iteration 1,000 and iteration 2,000. The trace plots also indicated that each parameter of interest became stationary after about 700 iterations, which further supported that convergence was reached before 2,000 iterations. The quantile plots showed the running mean with 95% confidence intervals against iteration numbers. The running means and the 95% confidence intervals from the four chains mixed very well and reached equilibrium before 2,000 iterations.
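A textbook form of the Gelman-Rubin potential scale reduction factor described above can be sketched as follows (the WinBUGS/Brooks-Gelman implementation differs in details such as its interval-based widths):

```python
def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter.
    chains: list of m equal-length lists of posterior draws."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # between-chain variance B and average within-chain variance W
    B = n * sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    # pooled estimate of the posterior variance
    var_hat = (n - 1) / n * W + B / n
    return (var_hat / W) ** 0.5
```

For well-mixed chains the statistic is close to 1; chains stuck in different regions of the parameter space give values well above the 1.05 rule of thumb.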
Other plots, including history and density plots, all indicated that the four chains mixed well before 2,000 iterations and reached equilibrium by then. Based on these observations, we decided to discard the first 2,000 iterations as burn-in iterations. After the burn-in phase, an additional 3,000 iterations were monitored for each chain. The model parameter inferences were made based on these 3,000 monitoring iterations from each of the four chains, which yielded a total of 12,000 samples.

Real Data Analysis

The proposed model was further fit to data from a reading comprehension test for high school graduation in a mid-south state in the United States. The test contained four passages with 8, 8, 9, and 7 items, respectively. School districts were used as the group-level units. The original data set contained a total of 1,644 students nested within 424 school districts, with the number of students in each school district ranging from 1 to 397. Those school districts with sample sizes smaller than 15 students were excluded, resulting in a sample of 1,534 students in 20 school districts with district sizes ranging from 16 to 397. The same prior distributions as those in the simulation study were used for the four models, and the same procedure was followed to check model parameter convergence in the WinBUGS runs. As in the simulation study, 2,000 burn-in iterations and 3,000 monitoring iterations for four chains were used to evaluate the models.
[Table 1. Standard Error in Item Difficulty Parameter Estimation. For each simulation condition (S1-S4) and each calibration model (Dual, Multilevel, Rasch, Testlet), the table reports N, minimum, maximum, mean, and SD of the SEs; the numeric entries are not reproduced here. Note. S1 = small local item dependence and small local person dependence; S2 = moderate local item dependence and small local person dependence; S3 = small local item dependence and moderate local person dependence; S4 = moderate local item dependence and moderate local person dependence.]

Results

Simulation Study

The Deviance Information Criterion (DIC) produced by WinBUGS was used to assess model fit. The model with the smallest DIC was the best fitting model. The proposed model accounting for dual dependence and the testlet model performed similarly with better fit, while the multilevel model and the Rasch model performed similarly with worse fit. Based on the DIC, the four models were ranked consistently across simulation conditions and replications, from best to worst fitting, as follows: the proposed model, the testlet model, the multilevel model, and the Rasch model. Item difficulty recovery was evaluated and compared in terms of bias, SE, and RMSE. A univariate three-way analysis of variance was conducted by specifying each of the error indexes as the dependent variable and local item dependence (2 levels), local person dependence (2 levels), and model (4 levels) as the three factors. Based on the analysis of variance results, none of the factors significantly affected bias. This is due to the fact that all models were identified by constraining the mean item difficulty to zero. Further, no interaction effect had a statistically significant impact on bias. However, the SE (see Table 1) was significantly affected by the local person dependence and model factors, with small (f = .16) and moderate (f = .26) effect sizes, respectively.
The interaction between local item dependence and model was significant with a small effect size (f = .13). The average SEs were about the same for the multilevel model and the Rasch model, while those for the
proposed model and the testlet model were close, with the latter two having higher average SEs. The higher average SEs in the proposed model and the testlet model might have been due to the increased number of parameters estimated in these models. Similarly, local item dependence, the estimation model, and their interaction all had significant effects on the RMSE. All three significant effects occurred with moderate effect sizes (f = .32, f = .33, and f = .27, respectively). The average RMSEs were smaller for the proposed dual model and the testlet model than those for the multilevel model and the Rasch model (see Table 2).

[Table 2. Root Mean Squared Error in Item Difficulty Parameter Estimation. Minimum, maximum, mean, and SD of the RMSEs for the four models under conditions S1-S4; the numeric entries are not reproduced here.]

The differences in the RMSE between the two pairs of models were smaller for small local item dependence than for moderate local item dependence. The lower RMSE in the better fitting models was consistent with the expectation that a better fitting model usually has less total estimation error. The impact of the calibration model on bias in the ability parameter estimation was statistically significant but with a negligible effect size (f = .08). No other factors had a significant impact on bias in the ability parameter estimation. All factors significantly influenced the SE in the ability parameter estimation. However, only the model factor was of moderate effect size (f = .35). The local person dependence factor, the local item dependence factor, and the interaction between the local person dependence and model factors were of small effect sizes (f = .12, f = .19, and f = .20, respectively). All other effects were negligible. No consistent patterns in the SE were observed for the four models when the magnitudes of local item and person dependence changed (see Table 3).
In general, the mean SEs in the ability parameter estimates for the proposed model and the multilevel model were slightly higher than those for the Rasch and testlet models. A possible explanation is that the proposed model and the multilevel model estimated both individual ability and group ability parameters simultaneously, which increased the difficulty of separating the effects
from individual persons and the group. In addition, the number of clusters (50) and the cluster size (20), which are not very large, may contribute to the difficulty of separating the individual ability and group ability effects, thus increasing the random error in the ability parameter estimation.

[Table 3. Standard Error in Ability Parameter Estimation. Minimum, maximum, mean, and SD of the SEs for the four models under conditions S1-S4; the numeric entries are not reproduced here.]

Regarding the RMSE, all the effects except the three-way interaction were statistically significant at the α level of .05. The three studied factors (model, local item dependence, and local person dependence) and the interaction between the model and local person dependence factors had small effect sizes (f = .20, f = .10, f = .13, and f = .10, respectively), while the other effects were negligible. The average RMSEs were smaller for the proposed dual model and the multilevel model than those for the testlet model and the Rasch model, regardless of the local item dependence and local person dependence magnitudes (see Table 4). This can be explained by the proper modeling of the person and group effects separately in the two models. The local item and person dependence factors had no significant impact on the ability variance recovery, while the effects of the model factor on the three types of error were significant with large effect sizes. The proposed model and the multilevel model were effective in recovering the true ability variance, while the testlet model and the Rasch model overestimated the true values. No factors other than the local person dependence factor had an impact on the group variance recovery: As the magnitude of local person dependence increased, the SE increased.
On the other hand, the local item dependence factor significantly affected the SE and the RMSE in the testlet variance estimation, with large effect sizes: an increase in the magnitude of local item dependence increased both the SE and the RMSE in testlet variance estimation. The local person dependence factor also affected the SE in testlet variance recovery with a large effect size; an increase in the magnitude of local person dependence increased the SE in testlet variance estimation.
Table 4. Root Mean Squared Error in Ability Parameter Estimation: minimum, maximum, mean, and SD for the Dual, Multilevel, Rasch, and Testlet models under simulation conditions S1 through S4 (numeric entries not preserved in this transcription).

In general, the simulation results demonstrated that for both item and person parameter estimation, the bias was not affected but the SE was. These results are consistent with the general theory of the mixed-effects model, in which the fixed-effect parameter estimates are the best linear unbiased estimators and the random-effect parameter estimates are the best linear unbiased predictors. Therefore, ignoring the dependence will not introduce bias but will reduce efficiency.

The effects of the three studied factors on person parameter estimation were further investigated by examining the accuracy of classifying students into passing or failing categories, based on comparing their estimated abilities with a θ cut score obtained from a state reading comprehension test, under the simulation conditions. All three factors had significant effects on classification accuracy, with large effect sizes. Whatever the magnitude of local item dependence or local person dependence, the proposed dual model always led to the most accurate classification of examinees, while the Rasch model showed the worst performance. The mean classification accuracy difference between the proposed model and the Rasch model was about 5.5%, ranging from 1.8% to 11%.

In sum, this simulation study demonstrated that the proposed four-level IRT model accounting for dual local dependence outperformed the three alternative models in both model parameter recovery and classification accuracy.

Real Data Analysis

The proposed model and the three other comparison models were fit to a real data set.
Model fit assessed by DIC indicated that the proposed model was the best fitting model (DIC = 48,374.6), followed by the multilevel model (DIC = 48,512.0), the testlet model (DIC = 48,636.9), and the Rasch model (DIC = 48,780.6).
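The DIC values reported here follow the standard formulation produced by WinBUGS: the posterior mean deviance plus the effective number of parameters, pD = D̄ − D(θ̄). A hedged sketch of the computation from MCMC output (function and argument names are ours):

```python
import numpy as np

def dic(deviance_samples, deviance_at_posterior_mean):
    """Deviance information criterion from MCMC output.

    deviance_samples: deviance D(theta) evaluated at each posterior draw
    deviance_at_posterior_mean: D(theta_bar), the deviance evaluated at
        the posterior means of the parameters
    """
    d_bar = np.mean(deviance_samples)           # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean    # effective number of parameters
    return d_bar + p_d                          # equivalently 2*d_bar - D(theta_bar)
```

Because pD, not the nominal parameter count, enters the penalty, a model with more nominal parameters does not automatically win the DIC comparison, a point taken up in the Discussion.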
The estimated testlet effects were relatively small, but the person clustering effects were relatively large; it therefore makes sense that the multilevel model fit better than the testlet model. The estimated testlet variances for the four testlets were .1557, .1836, .2656, and .1884 under the proposed model, and .2125, .176, .1894, and .1831 under the testlet model. In general, the testlet variances were small (even smaller than the small values in the simulation study). The estimated group variances for the proposed model and the multilevel (three-level) model were close, with values of 1.41 and 1.332, respectively, and the ability variance estimates based on these models were .545 and .567. The estimated ability variances for the testlet model and the Rasch model were much larger; a possible explanation is that ignoring a large person clustering effect may have increased the errors in ability parameter estimation. All four models were identified by constraining the mean item difficulty to be zero. The mean item parameter estimates were not significantly different across models. The estimated person parameters were highly correlated between the proposed model and the multilevel model, with a Pearson product-moment correlation coefficient near 1.0; the same was true for the correlation between the Rasch model and the testlet model estimates. The correlations between models across these two pairs were about .9. The means of the ability estimates based on the two models that incorporated the local person dependence effect were close to zero, while the two models that ignored it had means of around .7. The classification consistency between the two models that accounted for local person dependence was 99.3%, while that between the two models that ignored local person dependence was 99.7%.
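The classification accuracy (agreement of a model's pass/fail decisions with decisions based on true ability) and classification consistency (agreement between two models' decisions) reported throughout reduce to simple comparisons against the θ cut score. A minimal sketch (function names are ours, not from the article):

```python
import numpy as np

def classification_accuracy(theta_hat, theta_true, cut):
    """Proportion of examinees given the same pass/fail decision by the
    estimated ability and the true ability relative to a cut score."""
    passed_hat = np.asarray(theta_hat) >= cut
    passed_true = np.asarray(theta_true) >= cut
    return float(np.mean(passed_hat == passed_true))

def classification_consistency(theta_hat_a, theta_hat_b, cut):
    """Proportion of examinees given the same pass/fail decision by two
    different models' ability estimates."""
    return float(np.mean((np.asarray(theta_hat_a) >= cut)
                         == (np.asarray(theta_hat_b) >= cut)))
```

Consistency can be computed from real data (no true θ is needed), which is why the real data analysis reports consistency between model pairs while accuracy is only available in the simulation.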
The classification consistency between the proposed model and the two models that ignored local person dependence was clearly lower, at around 79%. To better understand the real data analysis results, one replication from the simulation condition with small local item dependence and moderate local person dependence effects, which was close to the characteristics of the real data, was examined in detail. The findings related to item parameter estimates and to the correlations between the estimated person parameters among models were nearly identical to those in the real data analyses. The means of the ability estimates based on the two models that incorporated the local person dependence effect were close to zero, while the means based on the two models that ignored it were around .13. Classification accuracy was higher for the two models that incorporated local person dependence (87% correct classification) than for the two models that ignored it (78% correct classification). The classification consistency between the two models that incorporated local person dependence was 98%, while that between the two models that ignored it was 99.5%. The classification consistency between the proposed model and each of the two models that ignored local person dependence was around 82%.

In general, the real data analysis revealed results similar to those found in the simulation study. The item difficulty parameter estimates were not greatly affected by the estimation model. Classification consistency between the proposed model and the multilevel model was very high, while consistency between the proposed model and the two models that ignored local person dependence was lower. The differences in correlations between ability estimates for different models, and the reported classification consistency based on those estimates, provide evidence that the differences in ability parameter estimates among the models are practically important. In the real data set, the local item dependence effects were relatively small but the local person dependence effects were large.

Discussion

Local independence is one of the important assumptions of traditional IRT models. This assumption implies both local item independence and local person independence. Various indexes and models have been proposed for detecting and modeling local dependence due to item or person clustering; however, little research has addressed local item dependence and local person dependence concurrently. This study proposed a four-level IRT model for dual local dependence that incorporates both simultaneously. Model parameter estimation was explored using the MCMC algorithm in WinBUGS. The proposed four-level IRT model can be extended to two- and three-parameter IRT models (Jiao, Kamata, & Binici, 2010a), as well as to polytomous item response data (Jiao, Mislevy, & Zhang, 2011), when the test form is built from testlets and a representative sample is selected from a large student population by a cluster sampling method. It is relevant to K-12 state assessment programs, such as the common core assessments, and to large-scale national and international assessment programs. When matrix sampling is used in large-scale assessments, it is not clear whether the findings of this study extend to missing data cases; further exploration could address this scenario.
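Under our reading of the model description, data under the proposed four-level Rasch model arise from an ability that is the sum of a group effect and an individual deviation (local person dependence), plus a person-specific testlet effect (local item dependence), minus the item difficulty. A hedged data-generating sketch, with the cluster design matching the simulation study (50 clusters of size 20, four testlets) but with items per testlet and all variance values as illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Design: 50 clusters of 20 examinees (as in the simulation study);
# four testlets of 5 items each and the variances below are assumptions.
n_groups, group_size = 50, 20
n_testlets, items_per_testlet = 4, 5
n_persons = n_groups * group_size
n_items = n_testlets * items_per_testlet
var_group, var_person, var_testlet = 0.3, 0.7, 0.2

b = rng.normal(0.0, 1.0, n_items)
b -= b.mean()                                   # identification: mean difficulty zero
group_of = np.repeat(np.arange(n_groups), group_size)             # person -> group
testlet_of = np.repeat(np.arange(n_testlets), items_per_testlet)  # item -> testlet

# Ability = group effect + individual deviation (induces person clustering)
group_eff = rng.normal(0.0, np.sqrt(var_group), n_groups)
theta = group_eff[group_of] + rng.normal(0.0, np.sqrt(var_person), n_persons)

# Person-specific testlet effects (induce dependence among items in a testlet)
gamma = rng.normal(0.0, np.sqrt(var_testlet), (n_persons, n_testlets))

# Rasch-type response probabilities and simulated 0/1 responses
logit = theta[:, None] + gamma[:, testlet_of] - b[None, :]
p = 1.0 / (1.0 + np.exp(-logit))
y = (rng.random((n_persons, n_items)) < p).astype(int)
```

Fitting this structure with MCMC then amounts to placing priors on the item difficulties and on the three variance components and sampling the person, group, and testlet effects jointly, which is what the WinBUGS implementation does.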
A possible criticism of the four-level IRT model is that it provides better fit simply because it has more parameters than the other models; this issue was addressed in three ways. First, the proposed model is intended for the more complicated testing situation in which both item clustering due to testlets and person clustering due to cluster sampling are present; if a test has no such structure, the proposed model need not be considered. Second, DIC as a model fit index does not always favor the model with more parameters. In general, an information-criterion-based model selection strategy does not always select the model with the larger number of parameters, since model complexity is penalized in the index. DIC contains two terms: the expectation of the deviance and the effective number of parameters in the model. As the number of model parameters increases, the expected deviance decreases and the model fit improves; however, the effective-number-of-parameters term compensates for this effect by favoring models with fewer parameters, so DIC does not always favor a model with more parameters. This point was in fact supported in the real data analysis, where the multilevel IRT model was favored over the testlet model, which had a larger number of parameters. Third, our simulation results indicated that model choice had a moderate effect on ability parameter estimation error and a large effect on classification decisions, implying that inferences from different models bear practical significance. If the clustering effects, especially the person clustering effect, are not properly modeled, both the model parameters and the classifications will be affected significantly.

Although not reported in this article, two other information criterion indexes, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), were explored in our simulation study. Neither index could properly identify the simulated true model; they could not effectively distinguish between the proposed model and the testlet model. Only DIC picked the true model in all simulation conditions. We determined that DIC, an information criterion extracted directly from WinBUGS output, was sufficient to choose the best fitting model under the various simulation conditions, so we did not report model fit results based on AIC and BIC. Nevertheless, a comparison of information indexes for choosing the best fitting model is an important issue worthy of future exploration.

For any given issue in psychometrics, multiple perspectives or approaches are always possible. For example, as one reviewer pointed out, one alternative way to deal with clustering effects would be to correct the SEs using replication methods such as jackknife repeated replication or balanced repeated replication. Alternatively, the quasi-pseudo maximum likelihood approach (Asparouhov, 2004), which can be implemented in some off-the-shelf software packages such as Mplus, could be considered. All of these options are possible solutions to clustering effects in IRT applications. However, since no previous study had explicitly modeled dual clustering effects, this study focused on describing one possible solution rather than comparing multiple approaches.

This study explored the MCMC algorithm for model parameter estimation.
This algorithm is simple to implement, but it is computationally intensive and running times can be very long (as the computational power of personal computers continues to increase, this may not remain a concern for long). Other estimation methods could be explored and compared with the results obtained with the MCMC algorithm. For example, Mplus allows specifying a testlet model while simultaneously correcting for cluster sampling using a quasi-pseudo maximum likelihood method; as one reviewer pointed out, this combination would also account for dual dependence, and the MCMC estimation presented in this article could be compared with that maximum likelihood approach. In addition, comparisons should be conducted with other estimation methods, including marginal maximum likelihood estimation with an expectation-maximization algorithm, the sixth-order Laplace method, and the Gauss-Hermite quadrature method (Rabe-Hesketh, Skrondal, & Pickles, 2005).

Gelman and Hill (2007) stated that inferences from any statistical theory should include the factors used in the design of data collection; multilevel modeling is a direct way to include indicators of clusters at all levels of a design. This study presented only the simplest model for studying the effects of item and person clustering. Future work could expand the model to include covariates at all four levels (item characteristics, testlet characteristics, person characteristics, and group characteristics) to better explain variance in the observed data and improve model parameter estimation (Mislevy, 1987).
Based on the findings of this study, it is recommended that the magnitude of person clustering effects be evaluated, in addition to item clustering effects, whenever a cluster sampling method is used for psychometric analyses in testlet-based assessments. The analysis of real data clearly revealed that the mean ability could differ by as much as .7 logit units when both item and person clustering effects are ignored. Ignoring these effects could lead to misleading inferences about group-level change from year to year, in addition to inflated error in model parameter estimation and decreased classification accuracy. It is further suggested that the effects of item clustering and person clustering on equating, vertical scaling, and standard setting be investigated in future studies.

Acknowledgments

The authors would like to thank the editor, Dr. Brian Clauser, and the reviewers for their valuable advice and suggestions, which greatly improved the manuscript. The authors are indebted to Dr. Robert Mislevy and Dr. Robert Lissitz for insightful discussions. Thanks are also due to Dr. George Macready for editing suggestions. All errors remain the responsibility of the authors.

References

Ackerman, T. (1987). The robustness of LOGIST and BILOG IRT estimation programs to violations of local independence (ACT Research Report Series). Iowa City, IA: American College Testing.

Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22.

Asparouhov, T. (2004). Stratification in multivariate modeling (Web Notes: No. 9). (Accessed April 2010).

Bafumi, J., Gelman, A., Park, D. K., & Kaplan, N. (2005). Practical issues in implementing and understanding Bayesian ideal point estimation. Political Analysis Advance Access, 13.

Baker, F. B., & Kim, S. (2004). Item response theory: Parameter estimation techniques. New York, NY: Marcel Dekker.

Binici, S. (2007). Random-effects differential item functioning via hierarchical generalized linear model and generalized linear latent mixed model: A comparison of estimation methods. Unpublished doctoral dissertation, Florida State University.

Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64.

Brooks, S. P., & Gelman, A. (1998). Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7.

Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.

Casella, G., & George, E. (1992). Explaining the Gibbs sampler. American Statistician, 46.

Chaimongkol, S., Huffer, F., & Kamata, A. (2006). A Bayesian approach for fitting a random effect differential item functioning across group units. Thailand Statistician, 4.

Chen, W., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22.
THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION Timothy Olsen HLM II Dr. Gagne ABSTRACT Recent advances
More informationOLS Regression with Clustered Data
OLS Regression with Clustered Data Analyzing Clustered Data with OLS Regression: The Effect of a Hierarchical Data Structure Daniel M. McNeish University of Maryland, College Park A previous study by Mundfrom
More informationSelection of Linking Items
Selection of Linking Items Subset of items that maximally reflect the scale information function Denote the scale information as Linear programming solver (in R, lp_solve 5.5) min(y) Subject to θ, θs,
More informationDimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models. Xiaowen Liu Eric Loken University of Connecticut
Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models Xiaowen Liu Eric Loken University of Connecticut 1 Overview Force Concept Inventory Bayesian implementation of one-
More informationScaling TOWES and Linking to IALS
Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy
More informationInvestigating the robustness of the nonparametric Levene test with more than two groups
Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing
More informationAndré Cyr and Alexander Davies
Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander
More informationUnderstanding Uncertainty in School League Tables*
FISCAL STUDIES, vol. 32, no. 2, pp. 207 224 (2011) 0143-5671 Understanding Uncertainty in School League Tables* GEORGE LECKIE and HARVEY GOLDSTEIN Centre for Multilevel Modelling, University of Bristol
More informationGENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS
GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS Michael J. Kolen The University of Iowa March 2011 Commissioned by the Center for K 12 Assessment & Performance Management at
More informationComputerized Mastery Testing
Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating
More informationlinking in educational measurement: Taking differential motivation into account 1
Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to
More informationThe Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing
The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in
More informationBayesian Model Averaging for Propensity Score Analysis
Multivariate Behavioral Research, 49:505 517, 2014 Copyright C Taylor & Francis Group, LLC ISSN: 0027-3171 print / 1532-7906 online DOI: 10.1080/00273171.2014.928492 Bayesian Model Averaging for Propensity
More informationA Brief Introduction to Bayesian Statistics
A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon
More informationModelling Spatially Correlated Survival Data for Individuals with Multiple Cancers
Modelling Spatially Correlated Survival Data for Individuals with Multiple Cancers Dipak K. Dey, Ulysses Diva and Sudipto Banerjee Department of Statistics University of Connecticut, Storrs. March 16,
More informationMCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2
MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts
More informationMethods Research Report. An Empirical Assessment of Bivariate Methods for Meta-Analysis of Test Accuracy
Methods Research Report An Empirical Assessment of Bivariate Methods for Meta-Analysis of Test Accuracy Methods Research Report An Empirical Assessment of Bivariate Methods for Meta-Analysis of Test Accuracy
More informationImpact and adjustment of selection bias. in the assessment of measurement equivalence
Impact and adjustment of selection bias in the assessment of measurement equivalence Thomas Klausch, Joop Hox,& Barry Schouten Working Paper, Utrecht, December 2012 Corresponding author: Thomas Klausch,
More informationThe matching effect of intra-class correlation (ICC) on the estimation of contextual effect: A Bayesian approach of multilevel modeling
MODERN MODELING METHODS 2016, 2016/05/23-26 University of Connecticut, Storrs CT, USA The matching effect of intra-class correlation (ICC) on the estimation of contextual effect: A Bayesian approach of
More informationDay Hospital versus Ordinary Hospitalization: factors in treatment discrimination
Working Paper Series, N. 7, July 2004 Day Hospital versus Ordinary Hospitalization: factors in treatment discrimination Luca Grassetti Department of Statistical Sciences University of Padua Italy Michela
More informationComparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria
Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill
More informationAdaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida
Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models
More informationFor general queries, contact
Much of the work in Bayesian econometrics has focused on showing the value of Bayesian methods for parametric models (see, for example, Geweke (2005), Koop (2003), Li and Tobias (2011), and Rossi, Allenby,
More informationImpact of Violation of the Missing-at-Random Assumption on Full-Information Maximum Likelihood Method in Multidimensional Adaptive Testing
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute
More informationItem Analysis: Classical and Beyond
Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013 Why is item analysis relevant? Item analysis provides
More information11/24/2017. Do not imply a cause-and-effect relationship
Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection
More informationA Comparison of Several Goodness-of-Fit Statistics
A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures
More informationThe Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times.
The Hierarchical Testlet Response Time Model: Bayesian analysis of a testlet model for item responses and response times By Suk Keun Im Submitted to the graduate degree program in Department of Educational
More informationData Analysis in Practice-Based Research. Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine
Data Analysis in Practice-Based Research Stephen Zyzanski, PhD Department of Family Medicine Case Western Reserve University School of Medicine Multilevel Data Statistical analyses that fail to recognize
More informationA structural equation modeling approach for examining position effects in large scale assessments
DOI 10.1186/s40536-017-0042-x METHODOLOGY Open Access A structural equation modeling approach for examining position effects in large scale assessments Okan Bulut *, Qi Quo and Mark J. Gierl *Correspondence:
More informationDesigning small-scale tests: A simulation study of parameter recovery with the 1-PL
Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy
More informationMantel-Haenszel Procedures for Detecting Differential Item Functioning
A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of
More informationIndividual Differences in Attention During Category Learning
Individual Differences in Attention During Category Learning Michael D. Lee (mdlee@uci.edu) Department of Cognitive Sciences, 35 Social Sciences Plaza A University of California, Irvine, CA 92697-5 USA
More informationEcological Statistics
A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents
More informationAdjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data
Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National
More informationType I Error Rates and Power Estimates for Several Item Response Theory Fit Indices
Wright State University CORE Scholar Browse all Theses and Dissertations Theses and Dissertations 2009 Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices Bradley R. Schlessman
More informationUnderstanding and Applying Multilevel Models in Maternal and Child Health Epidemiology and Public Health
Understanding and Applying Multilevel Models in Maternal and Child Health Epidemiology and Public Health Adam C. Carle, M.A., Ph.D. adam.carle@cchmc.org Division of Health Policy and Clinical Effectiveness
More informationAn Empirical Assessment of Bivariate Methods for Meta-analysis of Test Accuracy
Number XX An Empirical Assessment of Bivariate Methods for Meta-analysis of Test Accuracy Prepared for: Agency for Healthcare Research and Quality U.S. Department of Health and Human Services 54 Gaither
More informationMEA DISCUSSION PAPERS
Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de
More informationBias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study
STATISTICAL METHODS Epidemiology Biostatistics and Public Health - 2016, Volume 13, Number 1 Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation
More informationFactors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model
Journal of Educational Measurement Summer 2010, Vol. 47, No. 2, pp. 227 249 Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Jimmy de la Torre and Yuan Hong
More informationEffect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A.
Measurement and Research Department Reports 2001-2 Effect of Noncompensatory Multidimensionality on Separate and Concurrent estimation in IRT Observed Score Equating A. A. Béguin B. A. Hanson Measurement
More informationHierarchical Bayesian Modeling of Individual Differences in Texture Discrimination
Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive
More informationBayesian Bi-Cluster Change-Point Model for Exploring Functional Brain Dynamics
Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'18 85 Bayesian Bi-Cluster Change-Point Model for Exploring Functional Brain Dynamics Bing Liu 1*, Xuan Guo 2, and Jing Zhang 1** 1 Department
More information