
Combining Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions

Laine P. Bradshaw, James Madison University
Jonathan Templin, University of Georgia

Author Note
Correspondence concerning this article should be addressed to Laine Bradshaw, Department of Graduate Psychology, James Madison University, MSC 6806, 821 S. Main St., Harrisonburg, VA; laineb@uga.edu. This research was funded by National Science Foundation grants DRL, SES, and SES.

Abstract

The Scaling Individuals and Classifying Misconceptions (SICM) model is presented as a combination of an item response theory (IRT) model and a diagnostic classification model (DCM). Common modeling and testing procedures utilize unidimensional IRT to provide an estimate of a student's overall ability. Recent advances in psychometrics have focused on measuring multiple dimensions to provide more detailed feedback for students, teachers, and other stakeholders. DCMs provide multidimensional feedback by using multiple categorical variables that represent skills underlying a test that students may or may not have mastered. The SICM model combines an IRT model with a DCM that uses categorical variables representing misconceptions instead of skills. In addition to the type of information common testing procedures provide about an examinee, namely an overall continuous ability, the SICM model is also able to provide multidimensional, diagnostic feedback in the form of statistical estimates of misconceptions. This additional feedback can be used by stakeholders to tailor instruction to students' needs. Results of a simulation study demonstrate that the SICM MCMC estimation algorithm yields reasonably accurate estimates under large-scale testing conditions. Results of an empirical data analysis highlight the need to address statistical considerations of the model from the onset of the assessment development process.

Combining Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions

The need for more fine-grained feedback from assessments that can be used to understand students' strengths and weaknesses has been emphasized at all levels of education. Educational policy (No Child Left Behind, 2001), modern curriculum standards (e.g., the Common Core Standards; National Research Council, 2010), and classroom teachers (Huff & Goodman, 2007) have described multidimensional, diagnostic feedback as essential for tailoring instruction to students' specific needs and making educational progress. In spite of this need, most state-level educational tests have been, and continue to be, designed from a unidimensional Item Response Theory (IRT; e.g., Hambleton, Swaminathan, & Rogers, 1991) perspective. This perspective optimizes the statistical estimation of a single continuous variable representing an overall ability and provides a single, composite score to describe a student's performance with respect to an entire academic course. A common solution to providing diagnostic feedback from IRT-designed state-level tests has been to report summed scores on sub-sections of a test. These subscores, because they are based on a small number of items, often lack reliability. Decisions based on unreliable subscores may counterproductively misguide instructional strategies and resources (Wainer, Vevea, Camacho, Reeve, Rosa, Nelson, Swygert, & Thissen, 2001). These types of subscores are computed from items selected for the test because of their correlation with the other items on the test; as expected, they are therefore highly related to the total score and do not provide information that is distinct from, or additional to, the total score (Haberman, 2005; Harris & Hanson, 1991; Haberman et al., 2009; Sinharay & Haberman, 2007).

The new psychometric model presented in this paper offers a means of providing reliable multidimensional feedback within the framework of prevailing unidimensional IRT methods. The model capitalizes on advances in multidimensional measurement models, which recently have been at the forefront of psychometric research because they promise detailed feedback for students, teachers, and other stakeholders. Diagnostic classification models (DCMs; e.g., Rupp, Templin, & Henson, 2010) provide one approach to the measurement of multiple dimensions. DCMs use categorical latent attributes to represent skills or content components underlying a test that students may or may not have mastered. Using DCMs, the focus of the assessment results shifts to identifying which components each student has mastered. Attributes students have mastered can be viewed as areas in which students do not need further instruction. Similarly, attributes that students have not mastered indicate areas in which instruction or remediation should be focused. Thus, the attribute pattern can provide feedback to students and teachers with respect to more fine-grained components of a content area, which can be used to tailor instruction to students' specific needs.

DCMs sacrifice fine-grained measurement along a continuum in order to provide multidimensional measurement. Instead of finely locating each examinee along a set of continuous traits as a multidimensional IRT model does, DCMs coarsely classify each examinee with respect to each trait (i.e., as a master or non-master of the trait). This trade-off enables DCMs to provide diagnostic, multidimensional feedback with reasonable data demands (i.e., relatively few items and examinees; Bradshaw & Cohen, 2010). DCMs' low cost in terms of data demands and high benefit in terms of diagnostic information make them attractive models in educational settings where time for testing is limited but multidimensional feedback is needed to reflect the multifaceted nature of course objectives. However, given the

current reliance of testing on measuring an overall ability, DCMs, while efficient for providing detailed feedback, may not fulfill all the needs of policy-driven assessment systems centered on scaling examinee ability.

In this paper, we propose a new nominal response psychometric model, the Scaling Individuals and Classifying Misconceptions (SICM) model, that blends the IRT and DCM frameworks. The SICM model alters traditional DCM practices by defining the attributes for a nominal response DCM as misconceptions that students have instead of abilities (skills) that students have. The SICM model alters traditional nominal response IRT (NR IRT; Bock, 1972) practices by having these categorical misconceptions predict the incorrect item responses while a continuous ability predicts the correct response. When coupled together through the SICM model, the IRT and DCM components provide a more thorough description of the traits possessed by students. The SICM model both provides a measure of composite ability, measured by the correctness of the responses, and identifies distinct errors in understanding, manifested through specific incorrect responses. The model therefore serves the dual purposes of scaling examinee ability for comparative and accountability purposes and diagnosing misconceptions for remediation purposes. The following section overviews the measurement of misconceptions through previous assessment development projects. The next section provides the statistical specifications of the SICM model and is followed by an illustration of the model through a contrast with two more familiar models. Then results of a simulation study are provided to establish the efficacy of the model, and a real data analysis is given to illustrate the use of the model in practice.

Measuring Misconceptions

A key feature needed by the SICM model is that the incorrect alternatives for

multiple-choice test items are crafted to reflect common misconceptions students have or typical errors that students lacking complete understanding systematically make. Previous assessments have been developed in this way, evidencing that empirical theories about misconceptions in students' understandings exist, along with the desire to capture them through an assessment. Assessments like these have been referred to as distractor-driven assessments (Sadler, 1998). Examples in science assessment include the Force Concept Inventory (FCI; Hestenes, Wells, & Swackhamer, 1992), the Astronomy Concept Inventory (ACI; Sadler, 1998), and the Astronomy and Space Science Concepts Inventory (ASSCI; Sadler, Coyle, Miller, Cook-Smith, Dussault, & Gould, 2010). A similar approach was used in two assessments that measured concepts and misconceptions in statistics: the Statistical Reasoning Assessment (SRA; Garfield, 1998) and the Probability Reasoning Questionnaire (Khazanov, 2009). For each of these assessments, misconceptions were theorized on the basis of extensive qualitative research that studied incorrect conceptions through student interviews.

Although providing information about misconceptions was a goal of these assessments, the psychometric methods employed focused on measuring a single continuous ability, using either a total score as in classical test theory (CTT; e.g., Crocker & Algina, 1986) or an ability estimate from IRT (as was done for the ACI). Using these methods, the misconceptions could only be assessed or diagnosed by tallying the number of times an alternative that measures a given misconception was selected by a student or, when IRT was used, by studying the trace lines corresponding to measured misconceptions on an individual item and student basis. Tallies of misconceptions suffer from the same issues as subscores: they are coarse and unreliable measures, as a misconception may be measured by very few items. In IRT, post-hoc analysis of trace curves for each student and item is tedious, particularly when the task

falls upon a teacher who may teach a large number of students.

The Scaling Individuals and Classifying Misconceptions Model

The SICM model is a psychometric model that seeks to statistically diagnose students' possession of misconceptions. By using the SICM model instead of classical sum-score approaches, misconceptions can be measured more reliably. Also, by using the SICM model, existing empirical theories can be modeled and evaluated quantitatively, allowing theories to be strengthened (e.g., by verifying that misconceptions exist and describing the structural relationships among misconceptions) and assessments to be improved (e.g., alternatives need not contribute equally to the measure of a misconception, and alternatives weakly related to the misconception can be revised). The SICM model acknowledges that there is a larger construct to be measured on a continuum that exists in addition to misconceptions that are either present or not. Thus, unlike any existing psychometric model, the model utilizes an examinee's continuous ability, as in IRT, in addition to an examinee's pattern of categorical misconceptions as latent predictors of the item response.

The model uniquely treats misconceptions as categorical latent variables. Misconceptions, individually denoted by $\alpha$, are assumed to be dichotomous latent variables: for examinee $e$, misconception $a$ is either present ($\alpha_{ea} = 1$) or absent ($\alpha_{ea} = 0$), sometimes referred to as possession or lack of possession of a misconception. Marginally, each misconception $\alpha_{ea}$ has a Bernoulli distribution with probability $p_a$ that an examinee possesses the misconception. The misconception pattern $\boldsymbol{\alpha}_e = [\alpha_{e1}, \ldots, \alpha_{eA}]$ is a vector of $A$ binary indicators describing the presence or absence of each misconception. As such, $\boldsymbol{\alpha}_e$ has a multivariate Bernoulli distribution (MVB; e.g., Maydeu-Olivares & Joe, 2005) with a mean vector representing the pattern probabilities.
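To make this structural setup concrete, the short Python sketch below (with hypothetical pattern probabilities and variable names of our choosing, not values from the article) enumerates the $2^A$ misconception patterns, draws examinee patterns from them, and recovers the implied marginal possession probabilities $p_a$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

A = 3  # number of misconceptions
patterns = np.array(list(itertools.product([0, 1], repeat=A)))  # all 2^A patterns

# Hypothetical pattern probabilities nu_c (must sum to 1); in the SICM model these
# are parameterized through a log-linear structural model rather than fixed directly.
nu = np.full(len(patterns), 1 / len(patterns))

# Draw misconception patterns alpha_e for five examinees.
alpha = patterns[rng.choice(len(patterns), size=5, p=nu)]

# Marginal possession probabilities p_a implied by the pattern probabilities.
p_a = nu @ patterns
print(alpha)
print(p_a)
```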

Functional Form of the SICM Model

To measure both $\theta_e$ and $\boldsymbol{\alpha}_e$, the SICM model is defined for nominal response (multiple-choice) data. Often in IRT, responses are dichotomized into two categories, correct or incorrect, collapsing all of the incorrect alternatives into one category and failing to preserve the uniqueness of each incorrect alternative. Such a dichotomization can be viewed as an incomplete theory of modeling the item response (Thissen & Steinberg, 1984). If characteristics of the incorrect alternatives produce variation in the item responses, then those characteristics can be modeled in the item response function (van der Linden & Hambleton, 1997). Modeling responses to the alternatives directly can also provide a means of evaluating item alternatives in the test-development process.

Given a set of $J_i$ response categories or possible alternatives for item $i$, the SICM model utilizes a nominal response mixture item response model that defines the probability of observing examinee $e$'s nominal response pattern $\mathbf{x}_e$ to $I$ items as

$$P(\mathbf{X}_e = \mathbf{x}_e) = \sum_{c=1}^{2^A} \nu_c \int_{-\infty}^{\infty} f(\theta) \prod_{i=1}^{I} \prod_{j=1}^{J_i} \pi_{eij}^{[x_{ei} = j]} \, d\theta. \quad (1)$$

The terms $\nu_c$ and $f(\theta)$ are the structural components of the model, describing the distributions of, and relationships among, the latent variables in the model, with $\boldsymbol{\alpha}_e$ and $\theta_e$ held independent. The term $\nu_c$ describes the proportion of examinees in each latent class. Each latent class represents a unique misconception (attribute) pattern, such that given $A$ misconceptions, there exist $2^A$ unique patterns. The term $\nu_c$ is parameterized as a function of the individual misconceptions by a log-linear model (e.g., Henson & Templin, 2005; Rupp, Templin, & Henson, 2010). The term $f(\theta)$ is the density function of ability, standardized for identifiability. The parameter $\pi_{eij}$ denotes the conditional probability that examinee $e$'s response to item $i$ will be the selection of alternative $j$ from the set of $J_i$ alternatives for item $i$,

given examinee $e$'s attribute pattern $\boldsymbol{\alpha}_e$ and continuous ability $\theta_e$. The brackets $[\,\cdot\,]$ are Iverson brackets, indicating that $[x_{ei} = j] = 1$ if $x_{ei} = j$ and $[x_{ei} = j] = 0$ otherwise. This represents the measurement component of the SICM model in that it quantifies how the latent variables (misconceptions and ability) are related to the observed item responses.

For the SICM model, ability is measured by the correct alternative on each item, and the misconceptions are measured by the incorrect alternatives. Not every incorrect alternative measures each misconception, so an indicator variable is used to specify when a misconception is measured by an item alternative. Mimicking DCM practices, these specifications are set a priori and are described in an item-by-alternative-by-misconception Q-matrix (e.g., Tatsuoka, 1990). The entries in the Q-matrix are indicators denoted by $q_{ija}$, where $q_{ija} = 1$ if alternative $j$ for item $i$ measures misconception $a$ and $q_{ija} = 0$ otherwise.

The SICM model parameterizes $\pi_{eij}$ in Equation (1) by utilizing a multicategory logistic regression model (e.g., Agresti, 2002) that models the non-redundant logits, with alternative $J_i$ as the baseline category, as

$$\log\!\left(\frac{\pi_{eij}}{\pi_{eiJ_i}}\right) = \left(\lambda_{ij,0} + \lambda_{ij,\theta}\,\theta_e + \boldsymbol{\lambda}_{ij}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij})\right) - \left(\lambda_{i,\theta}\,\theta_e + \boldsymbol{\lambda}_{iJ_i}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{iJ_i})\right) \quad (2)$$

for every $j$ such that $j \neq J_i$, where

$$\lambda_{ij,\theta} = 0 \text{ for all } j \neq J_i \quad (3)$$

and

$$\boldsymbol{\lambda}_{iJ_i}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{iJ_i}) = 0. \quad (4)$$

The correct alternative is specified as the baseline category, denoted $J_i$, to simplify Equation (2). In Equation (3), $\lambda_{ij,\theta}$ equals zero for every alternative where $j \neq J_i$ because the incorrect alternatives do not measure $\theta$. In Equation (4), $\boldsymbol{\lambda}_{iJ_i}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{iJ_i})$ always equals zero because the correct

alternative does not measure any misconceptions. Therefore, the equations specifying the log-odds of selecting an incorrect alternative over the correct alternative in the SICM model can be equivalently formulated as

$$\log\!\left(\frac{\pi_{eij}}{\pi_{eiJ_i}}\right) = \lambda_{ij,0} + \boldsymbol{\lambda}_{ij}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij}) - \lambda_{i,\theta}\,\theta_e \quad (5)$$

for every $j$ such that $j \neq J_i$. The conditional probability that alternative $j$ will be selected is expressed as

$$\pi_{eij} = \frac{\exp\!\left(\lambda_{ij,0} + \boldsymbol{\lambda}_{ij}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij}) - \lambda_{i,\theta}\,\theta_e\right)}{1 + \sum_{j' \neq J_i} \exp\!\left(\lambda_{ij',0} + \boldsymbol{\lambda}_{ij'}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij'}) - \lambda_{i,\theta}\,\theta_e\right)}. \quad (6)$$

The intercept $\lambda_{ij,0}$ is the logit of selecting incorrect alternative $j$ over the correct alternative for an examinee with an ability of zero who possesses none of the misconceptions measured by alternative $j$. The more difficult the alternative is, the larger the intercept will be. The term $\lambda_{i,\theta}$ is the loading for ability and is the discrimination parameter for ability, as in IRT. Using notation consistent with the Log-linear Cognitive Diagnosis Model (LCDM; Henson, Templin, & Willse, 2009) and the nominal response DCM (NR DCM; Templin & Bradshaw, under review), the term $\boldsymbol{\lambda}_{ij}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij})$ is a linear combination of the main and interaction effects of the model. The vector $\mathbf{q}_{ij}$ denotes the Q-matrix entries for alternative $j$ of item $i$, and $\boldsymbol{\alpha}_e$ is the misconception pattern for examinee $e$. The term $\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij})$ is a column vector of indicators with elements that equal one if and only if (a) the item measures the misconception or set of misconceptions corresponding to the parameter and (b) the examinee possesses the misconception or set of misconceptions corresponding to the parameter. Specifically, $\boldsymbol{\lambda}_{ij}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij})$ equals

$$\boldsymbol{\lambda}_{ij}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij}) = \sum_{a=1}^{A} \lambda_{ij,1,(a)}\,\alpha_{ea} q_{ija} + \sum_{a=1}^{A-1}\sum_{a'>a} \lambda_{ij,2,(a,a')}\,\alpha_{ea}\alpha_{ea'} q_{ija} q_{ija'} + \cdots, \quad (7)$$

where $\lambda_{ij,1,(a)}$ is the main effect for misconception $a$ for alternative $j$ of item $i$, $\lambda_{ij,2,(a,a')}$ is the interaction effect between attributes $a$ and $a'$ for alternative $j$ of item $i$ (if alternative $j$ of item $i$ measures two or more misconceptions), and the ellipsis denotes the third- through higher-order interactions for alternatives on items that measure more than two misconceptions, up to the $A$-way interaction effect among all attributes.
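To make the measurement model concrete, the Python sketch below evaluates Equations (5) through (7) for a single hypothetical four-alternative item whose three incorrect alternatives measure two misconceptions; all parameter values and variable names are illustrative assumptions, not estimates from the article.

```python
import numpy as np

# Hypothetical item: alternatives 0-2 are incorrect; alternative 3 is correct (baseline).
# Q-matrix rows (one per incorrect alternative) over A = 2 misconceptions.
q = np.array([[1, 0],     # alternative 0 measures misconception 1
              [0, 1],     # alternative 1 measures misconception 2
              [1, 1]])    # alternative 2 measures both
lam0      = np.array([-1.5, -1.0, -2.0])   # intercepts, Eq. (5)
lam_main  = np.array([[1.2, 0.0],          # main effects lambda_{ij,1,(a)}, Eq. (7)
                      [0.0, 1.0],
                      [0.8, 0.9]])
lam_int   = np.array([0.0, 0.0, 0.5])      # two-way interactions lambda_{ij,2,(1,2)}
lam_theta = 0.7                            # ability loading for the correct alternative

def sicm_item_probs(theta, alpha):
    """Category probabilities for one item under Eq. (5)-(7)."""
    alpha = np.asarray(alpha, dtype=float)
    # lambda_ij' h(alpha_e, q_ij): main effects plus the two-way interaction.
    h_part = (lam_main * q) @ alpha + lam_int * alpha.prod() * (q.sum(axis=1) == 2)
    kernels = np.append(lam0 + h_part - lam_theta * theta, 0.0)  # correct kernel = 0
    ex = np.exp(kernels)
    return ex / ex.sum()

# A high-ability examinee with no misconceptions versus a low-ability examinee with both.
print(sicm_item_probs(theta=1.5, alpha=[0, 0]))
print(sicm_item_probs(theta=-1.0, alpha=[1, 1]))
```

With these illustrative values, the first examinee selects the correct alternative with high probability, while the second is drawn toward the distractors matching the misconceptions he or she possesses.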

Main effects and interactions are discrimination parameters with respect to misconception patterns. To identify the model, as is usual for a baseline-category logit model, an arbitrary category is treated as the baseline category, and all parameters for the baseline category are set equal to zero. Additionally, the main effect parameters are constrained to ensure monotonicity for attributes and for ability, meaning that (a) the possession of a misconception never leads to a decrease in the probability of selecting an alternative measuring that misconception, and (b) an increase in ability never results in a decrease in the probability of answering the item correctly.

Lower Asymptote for the SICM Model

The specification of the SICM model in Equation (2) does not provide a lower asymptote for the probability of a correct response to account for guessing on a multiple-choice test. An alternative formulation of the SICM model was developed and will be used to provide this lower asymptote without adding an additional parameter to the model. The new formulation is

$$\log\!\left(\frac{\pi_{eij}}{\pi_{eiJ_i}}\right) = \lambda_{ij,0} + \boldsymbol{\lambda}_{ij}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij}) - \exp(\lambda_{i,\theta}\,\theta_e). \quad (8)$$

The difference between Equations (5) and (8) is that the ability portion of the model is now exponentiated. The intercept of the model is now interpreted as the logit that an examinee with an extremely low ability who possesses no misconceptions will choose alternative $j$. Holding other parameters constant, as ability decreases, the value of $\exp(\lambda_{i,\theta}\,\theta_e)$ decreases, and the logit of selecting the correct answer decreases, satisfying the monotonicity assumption for the model. As ability approaches negative infinity, $\exp(\lambda_{i,\theta}\,\theta_e)$ approaches 0, meaning the logit in Equation (8) approaches $\lambda_{ij,0} + \boldsymbol{\lambda}_{ij}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij})$, yielding a lower asymptote for the probability of selecting the correct response of

$$P(X_{ei} = J_i \mid \theta_e \rightarrow -\infty, \boldsymbol{\alpha}_e) = \frac{1}{1 + \sum_{j \neq J_i} \exp\!\left(\lambda_{ij,0} + \boldsymbol{\lambda}_{ij}^{T}\mathbf{h}(\boldsymbol{\alpha}_e, \mathbf{q}_{ij})\right)}. \quad (9)$$

This correction results in a more realistic model of the item response without an increased difficulty in estimation due to an increased number of item parameters to be estimated, as is commonly encountered when using the 3-PL IRT model.
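The effect of exponentiating the ability term can be checked numerically. The short sketch below (hypothetical parameter values of our choosing) evaluates the probability of a correct response under Equation (8) across a range of abilities for an examinee with no misconceptions and compares the low-ability limit with the asymptote given by Equation (9).

```python
import numpy as np

lam0 = np.array([-1.5, -1.0, -2.0])  # intercepts for three incorrect alternatives
misconception_part = np.zeros(3)     # examinee possesses none of the measured misconceptions
lam_theta = 0.7                      # ability loading

def p_correct(theta):
    """Probability of the correct response under the SICM* kernel of Eq. (8)."""
    kernel_incorrect = lam0 + misconception_part - np.exp(lam_theta * theta)
    return 1.0 / (1.0 + np.exp(kernel_incorrect).sum())

for theta in (-10.0, -3.0, 0.0, 3.0):
    print(f"theta = {theta:5.1f}  P(correct) = {p_correct(theta):.3f}")

# Lower asymptote from Eq. (9): the limit of P(correct) as ability goes to negative infinity.
print("asymptote  =", round(1.0 / (1.0 + np.exp(lam0 + misconception_part).sum()), 3))
```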

The SICM Model Illustrated as a Combination of an IRT Model and a DCM

The SICM model posits that there is a continuous trait being measured by an assessment that largely explains the covariance among the selections of the correct alternatives for a set of items. It additionally assumes that there exists a set of categorical misconceptions, each of which a student does or does not possess, that systematically account for the variation in the selections among the incorrect alternatives for a set of items. To further illustrate the differences between the NR IRT model, the NR DCM, and the SICM model, consider the example item in Figure 1.

From an NR IRT perspective, the level of a person's overall math ability explains the variation in the item responses and is the only latent variable measured by the alternatives. Using the SICM model in Equation (2), if $\theta$ is measured by every alternative (i.e., the constraint in Equation (3) is removed) and $\mathbf{q}_{ij}$ is fixed to be a $1 \times A$ vector of zeros for every alternative (i.e., no misconceptions are measured), then the item response function is specified as Bock's (1972) NR IRT model.

From an NR DCM perspective, two categorical abilities are needed to answer this item correctly: the ability to find the area of a rectangle (Attribute 1) and the ability to make conversions among units within a metric system (Attribute 2).

Consider Alternative B, which measures only Attribute 1. An examinee who selects Alternative B incorrectly converts 3 feet to 1/4 inches (does not possess Attribute 2) but demonstrates an ability to find the area of a rectangle by applying the operation of multiplication to the dimensions given (possesses Attribute 1). Thus, a response of B indicates the absence of Attribute 2, yet the presence of Attribute 1. When sample sizes are large, the NR DCM has been found to be able to capitalize on information in the incorrect alternatives, as demonstrated by greater classification accuracy when compared to the LCDM for dichotomous responses (Templin & Bradshaw, under review). Using the SICM model in Equation (2), if the ability slopes are all fixed to zero (i.e., $\theta$ is measured by no alternative), then the item response function is specified as the NR DCM, where $\boldsymbol{\alpha}$ is defined as a pattern of attributes or skills instead of a pattern of misconceptions as in the SICM model.

For the example item to be modeled from the SICM model perspective, the attributes are defined as misconceptions or errors. Attribute 1 will be redefined as the inability to find the area of a rectangle and Attribute 2 as the inability to make conversions among units within a metric system. (We acknowledge that math educators likely would not consider the lack of a skill to be a misconception. This example is an oversimplification meant to convey the statistical properties of the model with an example accessible to a wide range of researchers. Reviewing the misconception literature is beyond the scope of this article, but quality research can be found on the nature of misconceptions (e.g., Smith, diSessa, & Roschelle, 1993) and on documented misconceptions for specific fields (e.g., in math and science, Confrey, 1990).) An examinee is expected to answer the item correctly (to select Alternative A) if he or she possesses neither of these attributes and has a modest level of overall ability.

Figure 2 provides hypothetical item response probabilities for the NR IRT model, the NR DCM, and the SICM model to compare the type of information each model provides. The Q-matrix entries for each model are given in the legend of the first graph corresponding to that model. For the NR IRT model, in the top left graph, the item response probability is solely a function of ability ($\theta$). The NR DCM, shown in the next four graphs, provides the item response

probability by the examinee's class, which is defined by the attribute pattern ($\boldsymbol{\alpha}$) an examinee has. When an examinee's attribute pattern corresponds to the attribute pattern measured by an alternative, the examinee is most likely to select that alternative. The SICM model, shown in the last four graphs, provides the item response probability not only as a function of ability, as the NR IRT model does, but also as a function of the class an examinee is in (the misconception pattern that an examinee has), as the NR DCM does. For the SICM model, each class has a different set of trace lines. For the NR IRT model, all trace lines intersect, meaning that at different ability levels, different incorrect alternatives are more likely to be selected. In the SICM model, unlike the NR IRT model, the trace lines for the incorrect alternatives each have an upper asymptote and are monotonically decreasing and thus never intersect. The order of the probabilities of selecting the incorrect alternatives depends upon the misconceptions and is invariant with respect to ability. Put differently, the order varies across classes and is invariant within a class. For example, as seen in Figure 2, an examinee with misconception pattern [01] who misses the item is most likely to select Alternative B regardless of his or her ability level. Similarly, if examinees with misconception patterns [10] and [11] miss the item, they are most likely to select Alternatives C and D, respectively.

Estimation of the SICM Model

The SICM model was estimated using a Markov chain Monte Carlo (MCMC) estimation algorithm that uses Metropolis-Hastings sampling and was written in Fortran. Appendix A contains the specific steps of the algorithm. Using the specification of the model by Equation (5), the SICM model is estimable using Mplus Version 6.1 (Muthén & Muthén). However, Mplus cannot estimate the SICM* model with the exponentiation, so writing a unique estimation algorithm in Fortran was necessary.
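For readers unfamiliar with Metropolis-Hastings sampling in this setting, the sketch below illustrates one generic sampling cycle for a single examinee's parameters under a toy SICM*-style model. It is a minimal illustration with hypothetical values and simplifications (main effects only, a uniform prior over misconception patterns), not the algorithm documented in Appendix A.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy setup: I items with J alternatives each; the last alternative is correct.
I, J, A = 5, 4, 2
q = rng.integers(0, 2, size=(I, J - 1, A)).astype(float)    # Q-matrix, incorrect alternatives
lam0 = rng.normal(-1.0, 0.3, size=(I, J - 1))                # intercepts
lam_main = np.abs(rng.normal(1.0, 0.3, size=(I, J - 1, A)))  # misconception main effects
lam_theta = np.abs(rng.normal(0.6, 0.1, size=I))             # ability loadings
x = rng.integers(0, J, size=I)                               # observed responses (toy data)

def item_probs(theta, alpha, i):
    """SICM*-style category probabilities for item i (main effects only)."""
    kernel = lam0[i] + (lam_main[i] * q[i]) @ alpha - np.exp(lam_theta[i] * theta)
    ex = np.append(np.exp(kernel), 1.0)                      # correct alternative's kernel is 0
    return ex / ex.sum()

def log_lik(theta, alpha):
    return sum(np.log(item_probs(theta, alpha, i)[x[i]]) for i in range(I))

theta, alpha = 0.0, np.zeros(A)

# (1) Random-walk Metropolis-Hastings update for theta under a standard normal prior.
theta_prop = theta + rng.normal(0.0, 0.5)
log_ratio = (log_lik(theta_prop, alpha) - 0.5 * theta_prop ** 2) \
          - (log_lik(theta, alpha) - 0.5 * theta ** 2)
if np.log(rng.uniform()) < log_ratio:
    theta = theta_prop

# (2) Draw the misconception pattern from its full conditional over all 2^A patterns
#     (the SICM model would use its log-linear structural model as the prior here).
patterns = np.array([[(b >> a) & 1 for a in range(A)] for b in range(2 ** A)], dtype=float)
weights = np.array([log_lik(theta, p) for p in patterns])
weights = np.exp(weights - weights.max())
alpha = patterns[rng.choice(2 ** A, p=weights / weights.sum())]

print(theta, alpha)
```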

To evaluate the performance of the SICM model and algorithm, a simulation study was conducted and is discussed next. An empirical data analysis follows.

Simulation Study

The SICM model is complex due to the large number of parameters that need to be estimated and the different types (i.e., continuous and categorical) of parameters being estimated within the model. The simulation study provides information about (a) the performance of the model under realistic testing situations and (b) the interplay of a continuous ability and a set of categorical misconceptions within a single model. Specifying continuous and categorical variables in the measurement model of a psychometric model at the item-alternative level has not been tried before; it is of interest how the one type of variable will affect the other (e.g., whether the effect of one type of variable will dominate or mask the other's effect).

Simulation Study Design

The study had four manipulated factors that were fully crossed: sample size (3,000 and 10,000), test length (30 and 60 items), number of misconceptions (3 and 6), and the size of the main effects for ability and attributes. Average low main effects were .4 for ability and 1 for misconceptions; average high main effects were .6 for ability and 2 for misconceptions. To investigate how estimation was affected when either the continuous or categorical variables are more dominant, the relative and absolute magnitudes of the main effects for these latent variables were manipulated by crossing the high and low effect conditions. The tetrachoric correlation between attributes was set to .50. Fifty replications were estimated for each of the 32 conditions. The MCMC estimation algorithm was set to iterate for 10,000 stages beyond a burn-in period of 50,000 stages. Expected a posteriori (EAP) estimates computed from the post-burn-in stages were used as estimates of model parameters.
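To convey the flavor of the generating design, the sketch below simulates nominal responses from an SICM*-style model for a small set of examinees. The Q-matrix, parameter values, and sizes here are deliberately tiny, illustrative assumptions rather than the actual generating values of the study.

```python
import numpy as np

rng = np.random.default_rng(2024)

N, I, J, A = 500, 30, 4, 3      # examinees, items, alternatives per item, misconceptions
correct = J - 1                 # index of the correct alternative on every item

# Person parameters: standard-normal abilities and correlated dichotomous misconceptions
# induced by thresholding correlated normals (tetrachoric correlation of .50, as in the study).
theta = rng.standard_normal(N)
cov = np.full((A, A), 0.5) + 0.5 * np.eye(A)
alpha = (rng.multivariate_normal(np.zeros(A), cov, size=N) > 0).astype(float)

# Item parameters: each incorrect alternative measures one randomly chosen misconception.
q = np.zeros((I, correct, A))
for i in range(I):
    for j in range(correct):
        q[i, j, rng.integers(A)] = 1.0
lam0 = rng.normal(-1.0, 0.3, size=(I, correct))
lam_main = 1.0 + 0.2 * rng.random((I, correct))   # roughly "low" misconception main effects
lam_theta = 0.4 + 0.1 * rng.random(I)             # roughly "low" ability main effects

responses = np.zeros((N, I), dtype=int)
for e in range(N):
    for i in range(I):
        kernel = lam0[i] + lam_main[i] * (q[i] @ alpha[e]) - np.exp(lam_theta[i] * theta[e])
        ex = np.append(np.exp(kernel), 1.0)       # the correct alternative's kernel is 0
        responses[e, i] = rng.choice(J, p=ex / ex.sum())

print(responses[:3])
```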

Each simulated item had four alternatives. Each alternative was specified to measure one or two attributes. A balanced Q-matrix was used for the simulation, with 2.1 misconceptions measured per item and 1.13 misconceptions measured per alternative. As a baseline for comparison, in the 3-misconception/30-item conditions, each misconception was measured by 34 alternatives across 21 items.

Simulation Results and Conclusions

Results are provided in tables where values were (a) averaged across the magnitude-of-main-effects factor and/or (b) averaged across all other factors and given by the magnitude-of-main-effects factor. Results indicate that the item, structural, and examinee parameters were accurately estimated with the MCMC algorithm. Generally, item and examinee parameter estimates were most accurate in conditions with more examinees, more items, and fewer misconceptions. These trends are consistent with the psychometric literature at large; estimation is improved when there are fewer parameters to estimate and when the model has more information with which to determine the parameters. More specifically, the results reported in this section are compatible with other simulation studies in the DCM literature (e.g., Choi, 2010; Henson, Templin, & Willse, 2009). Results from varying the magnitude of the main effects uncovered no barriers to estimating both the categorical and continuous latent predictors and also shed some light on which conditions yielded more accurately estimated parameters. Results for item, structural, and examinee parameters will be discussed in turn.

Accuracy of Model Parameter Estimates

Table 1 gives the average bias, root mean squared error (RMSE), and Pearson correlations of true and estimated parameters. The RMSEs for item parameters were less than .05 for all conditions, so additional improvement as the test length increased,

the number of misconceptions decreased, and the sample size increased was negligible. The estimation of the structural parameters was most significantly affected by the number of misconceptions, which is to be expected because the complexity of the structural model grows quickly as the dimensionality of the assessment increases. RMSEs for structural parameters were less than .10 when 3 misconceptions were measured. Improvement in structural parameter estimation is seen for the 6-misconception conditions as the number of items or examinees increases.

Accuracy of Examinee Parameter Estimates

Consistent with psychometric model research, the accuracy of the examinee estimates (Table 1) and classifications (Table 2) was less affected by the number of examinees responding to the assessment and more affected by the length of the test and the number of misconceptions. Accuracy of classifications is measured by the correct classification rate (CCR). The greatest improvement in estimation was seen with an increase in the length of the test. For the 60-item conditions, the RMSE for ability estimates ranged from .588 to .599, and the CCR for individual attributes ranged from .922 to .958. In comparison, for the 30-item conditions, the RMSE ranged from .708 to .725 and the CCR ranged from .863 to .918.

Reliability of Examinee Estimates

The reliability of examinee ability estimates and classifications was evaluated by the comparable reliability measure developed by Templin and Bradshaw (in press). For the SICM model, reliabilities for classifications were uniformly greater than reliabilities for abilities regardless of the characteristics of the conditions under which the estimates were obtained. The top portion of Table 3 gives results averaged across the magnitude-of-main-effects conditions, where reliability ranged from .541 to .675 for ability and, on average, from .908 to .988 for misconceptions. This finding echoes the results in Templin and Bradshaw (in press), which found that,

across a set of models, DCM classifications (with 2, 3, 4, and 5 categories) were consistently more reliable than IRT ability estimates. The reliability of the misconceptions is very high, but the reliability of the ability estimates falls short of the values of .70 or above that one might strive for in achievement testing.

Interplay of Continuous and Categorical Latent Predictors

The examinee estimates were also impacted by the magnitude-of-main-effects factor, which offered some insights into the interplay of continuous and categorical variables being estimated within the same model. The accuracy and reliability of the estimated abilities were greatest when ability had a high main effect in an absolute sense; estimation improved only slightly when ability also had a high main effect in a relative sense (i.e., when misconceptions had a low main effect). Similarly, the accuracy and reliability of the classifications were greatest when misconceptions had a high main effect in an absolute sense, and estimation improved only slightly when this main effect was higher than the main effect for ability in a relative sense. These results indicate that strong main effects for ability improve estimation for ability without significantly hurting estimation of the misconceptions, and strong main effects for misconceptions improve estimation for misconceptions without significantly hurting estimation of ability. Thus, when estimating the SICM model in practice, the larger concern regarding main effects is the strength of the main effect in an absolute sense. Given strong main effects for each type of variable, the different types of variables can co-exist within the same model without one dominating the other.

Limitations of the Simulation Study

Although the results of the simulation study provide some insights for using the SICM model, they were obtained under conditions where the estimation model was correct. In practice, a host of

factors may impact the accuracy of an analysis with the SICM model. For example, Q-matrices may have different levels of complexity, or Q-matrices may have different levels of accuracy. Fairly complex Q-matrices were used for this simulation study, but perfect accuracy was assumed, so model misspecification was not examined. Model misspecification is an important topic in psychometrics because misspecification of the model has expected negative consequences. Other situations in practice may involve a different number of alternatives or items, and main effects for misconceptions and ability may be mixed within a test instead of having designated absolute and relative magnitudes across the test.

The SICM Model Illustrated with an Empirical Data Analysis

To demonstrate the SICM model's use in a practical setting, data from a reading comprehension assessment constructed and administered by a large-scale testing company were analyzed. Presently modeled with total scores for ability and subscores for misconceptions, the goal of the reading comprehension assessment is to measure an overall literacy level to determine whether or not an examinee would benefit from additional instruction via instructional modules, in addition to determining what weaknesses should be targeted within the modules. Thus, the SICM model was well aligned with the purpose of this assessment.

For this 28-item multiple-choice assessment, each of the incorrect alternatives corresponded to one of three types of errors that students make when responding to reading comprehension items, as pre-determined and specified by content experts and item writers. The three types of errors that were modeled as categorical attributes or misconceptions were a non-text-based response, a text-based misinterpretation of the passage, and a text-based misinterpretation of the question. The first error reflects that the passage was not read (perhaps from lack of effort or time); the second reflects that the passage was read but misinterpreted

(a comprehension error); and the third misconception reflects that the question was misinterpreted (a different type of comprehension error). On average, each item measured 1.93 errors. Six items measured all three types of errors. Every incorrect alternative measured exactly one error. Respectively, the errors were measured by 29, 32, and 22 alternatives and by 21, 19, and 14 items.

To estimate the SICM model, the MCMC algorithm was run for 100,000 steps with 50,000 burn-in steps. Convergence was assessed using a variation of Gelman and Rubin's (1992) statistic. Convergence was reached for less than 50% of the structural parameters (very poor convergence), but 95% of all other parameters converged (acceptable convergence). Unlike for the simulated data, a more informative prior distribution (lognormal(0, 0.5)) was used to estimate the main effect for ability in the model due to difficulty in estimation, likely caused by a sample size too small (i.e., 1,097 students) to estimate these parameters.
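As background for readers, the generic form of Gelman and Rubin's potential scale reduction factor can be computed from parallel chains as in the sketch below; this is the textbook version of the statistic, not necessarily the exact variation used in the analysis.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one parameter.

    `chains` is an (m, n) array: m parallel chains of n post-burn-in draws.
    Values near 1.0 indicate convergence."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled posterior variance estimate
    return np.sqrt(var_hat / W)

# Example with simulated draws: chains sampling the same target give R-hat near 1,
# while chains stuck in different regions give R-hat well above 1.
rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, size=(2, 5000))
bad = np.stack([rng.normal(0.0, 1.0, 5000), rng.normal(3.0, 1.0, 5000)])
print(gelman_rubin(good), gelman_rubin(bad))
```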

To provide a thorough evaluation of the SICM model as compared with other potential psychometric models, two other psychometric models were used to evaluate the assessment. We first present the results of the model comparison and then describe the SICM model estimates. Although the results paint a picture of an assessment with limited dimensionality, we use the comparison of models to help depict how estimates from the SICM model are differentiated from those of other psychometric models.

Comparison of Three Psychometric Models

The SICM model with a lower asymptote, the NR IRT model with a lower asymptote (formed by exponentiation of the ability portion of the model, as used in the SICM model), and the NR DCM were used to analyze the reading comprehension data. The SICM model scaled examinees according to their ability and classified examinees according to their errors. The NR IRT model provided only an estimate of ability. The NR DCM provided only classifications of examinees according to the three types of errors on the assessment. To distinguish between the models with and without a lower asymptote, the models with the lower asymptote are denoted with an asterisk (e.g., SICM*). Comparisons of these results shed light on results found for the SICM* model and are presented first.

Comparison of Examinee Estimates

Ability estimates from the SICM* model were strongly correlated with the NR IRT* model estimates (.731), and classifications of examinees were similar for the SICM* model and the NR DCM. The SICM* model classified examinees into two of the eight possible error patterns. Examinees either possessed all errors (Pattern 8, [111]) or no errors (Pattern 1, [000]), meaning the tetrachoric correlation among the three misconceptions was one. The SICM* and NR DCM models were in agreement with respect to individual misconception and whole-pattern classification for approximately 84% of the examinees. The NR DCM classified all but eight examinees into Pattern 1 or 8. These findings suggest that high multidimensionality with respect to the errors does not exist in the assessment. This may be due to a theoretical issue of these errors not being stable traits that produce systematic responses to items, a test development issue of the items and alternatives simply not eliciting the errors, or an estimation issue of the sample size being too small to contain a substantial set of examinees having each pattern. The effect of this result permeates the remaining analyses. The classification of examinees into categories with all or no errors empirically suggests that the structural model of the SICM* model and the NR DCM is incorrect. The models are over-fitted; many estimated parameters would have a value of 0. When the errors are highly correlated, they are no longer practically distinct and cannot be treated as separate categorical variables. As a result, the structural parameters cannot converge because there is no information about examinees in the other six classes posited to exist.

For the SICM* model and the NR DCM, respectively, only 42.9% and a similarly small percentage of the structural parameters converged. The goal of these models was to model the variation of item responses according to predetermined patterns, which was not feasible because there was no observed variation across those patterns.

Relative Model Fit

Akaike's information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978) were used to make a relative comparison of model-data fit. Results given in Table 5 show that both of these indices preferred the fit of the SICM* model over the NR IRT* model and the fit of the NR IRT* model over the NR DCM. The better fit of the two models that estimated a continuous trait was not surprising because the results of the examinees' all-or-none error patterns indicated a lack of dimensionality. The reason why the SICM* model was preferred over the NR IRT* model is more subtle. The errors did demonstrate some dimensionality; examinees were in one of two classes, not just one class. If no dimensionality with respect to the errors were present, the NR IRT* model should be preferred to the SICM* model because it estimates far fewer parameters. These results suggest the test may be measuring something more than a single continuous trait, but it is not measuring all three distinct errors in the Q-matrix. Perhaps a single error that placed examinees into two classes would be preferred to three errors that failed to place examinees into eight classes.
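Both indices are simple functions of the maximized log-likelihood, the number of estimated parameters, and, for the BIC, the sample size; lower values indicate better relative fit. The helper below illustrates the computation with made-up numbers, not the values behind Table 5.

```python
import math

def aic(log_lik, n_params):
    """Akaike's information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln(N) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical comparison of two fitted models for N = 1,097 examinees.
candidates = {"SICM*": (-24150.0, 250), "NR IRT*": (-24400.0, 180)}
for name, (ll, k) in candidates.items():
    print(name, round(aic(ll, k), 1), round(bic(ll, k, 1097), 1))
```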

SICM* Model Results for an Example Item

Figure 3 shows the SICM* model estimated nominal response probabilities for an example item on the reading comprehension assessment as a function of ability and misconception pattern. Q-matrix entries are given in the legend of the figure, showing that for this item each incorrect alternative measured one of the three errors. The order of the intercepts for incorrect alternatives B, C, and D can be deduced from the response probabilities in the first graph, which corresponds to Pattern [000] (i.e., examinees with no errors). The most likely incorrect alternative is D, with the largest intercept of .666, and the least likely is B, with the smallest intercept. Interpretations of trace lines for the next six graphs are difficult, as no examinees actually have these patterns. In the last graph, the trace lines of the incorrect alternatives suggest that similar probabilities exist of making any one of these errors when examinees have all of the misconceptions. These graphs illustrate how the SICM model can provide each NR DCM-like class a unique set of NR IRT-like response curves, which, with better model-data fit, would have more meaningful interpretations.

SICM* Model Example Examinee Results

In this section, we present results from two examinees with similar response patterns. Examinees 199 and 403 had answered the first 22 items correctly and had answered two of the last six items correctly, giving each examinee a total correct score of 24. However, the two final items they answered correctly were different, resulting in slightly different ability estimates. Additionally, for the items they answered incorrectly, they selected different incorrect answers. Thus, their different incorrect answers on the last six items led to the students being classified as having drastically different error patterns. From the SICM model estimates, we can conclude that both of these students have an above-average ability, yet Examinee 403 (with the slightly higher ability estimate) needs instruction relevant to all three errors, while Examinee 199 does not.

These results reflect the potential utility that the SICM* model estimates add beyond IRT model estimates for examinees. For an IRT model with an estimated discrimination parameter, it matters which items an examinee answers correctly. The same total score can yield different ability estimates because items are differentially related to the target ability being measured and

thus count differentially towards the estimated ability. The SICM* model goes a step further and uses information not only about which items an examinee answers incorrectly, but also about why the examinee answered the item incorrectly. As a result, two examinees can have the exact same scored response pattern and be classified as possessing very different sets of misconceptions.

Data Analysis Discussion

Although our initial thought upon analyzing this assessment data was to turn to another data set that might bear better results with the SICM model, ultimately we found value in sharing the story this analysis tells. As noted in Templin and Henson (2006), observing a large number of examinees in patterns for which all or none of the attributes are possessed may indicate that the construct being measured is truly unidimensional. Several reasons may explain this finding. The lack of multidimensionality may be a result of cognitive theory; these errors may not actually exist as stable latent traits of the examinees, but rather may be types of errors that students inconsistently, and thus unpredictably, make. Alternately, the multidimensionality that actually exists may not be captured by the assessment due to a lack of validity, or enough information may not be available to estimate the model due to a smaller sample of examinees than in the simulation study. This analysis shows that even in a scenario like this one, where the purpose of the assessment was aligned with the purpose of the model, limitations exist and issues arise when retrofitting an assessment to a model.

Model-data fit is expected to improve in a test-construction scenario where a test is developed from the onset to be estimated with the SICM model, and this approach is thus recommended. Developing the assessment from the SICM framework would help identify sources of misfit. Validity studies can verify whether alternatives on an assessment are eliciting the misconceptions that they purport to measure, and pilot studies can statistically flag items

that exhibit model-data misfit and need to be revised or culled. The test development process can also attend to other statistical considerations, which may include investigating (a) whether each misconception is measured enough times (in enough alternatives and items) to yield a reliable classification, (b) whether enough examinees are selecting each alternative to provide enough information to yield accurate item parameter estimates for that alternative, and (c) whether enough examinees have responded to the item to yield accurate model parameters.

Concluding Remarks

The SICM model is presented as a psychometric solution to a realistic need in educational assessment: to gain more feedback from assessments about what students do not understand. The efficacy of the SICM model under various testing conditions was demonstrated through a simulation study, suggesting that, when coupled with careful test design, the SICM model can enable diagnostic score reports that reflect statistical estimates of student misconceptions in addition to the type of information about student ability that is typically provided to stakeholders by current modeling and testing procedures. These simulation results provide guidelines for test and sampling conditions, but not for creating the test itself. As seen in the empirical data analysis, developing the assessment from the SICM framework a priori is very important.

Although some general test-development considerations can be applied in developing an assessment for the SICM model, open questions still exist as to how to create an assessment that can utilize the statistical features of the SICM model. Assessments, previously mentioned, exist that can provide insights for writing items to measure misconceptions with incorrect answers. However, when the SICM model is employed to model these types of items, unique statistical considerations arise. For example, a continuous ability is estimated in the SICM model, as in a

unidimensional IRT model. For a unidimensional IRT model, items that exhibit multidimensionality are often screened and revised or deleted from the assessment; for the SICM model, items that measure a single continuous trait and a set of multidimensional categorical traits are desired, so items are expected to show multidimensionality and will thus have to be screened differently. We have provided information explaining how the SICM model can be estimated and applied. We hope future assessment development projects can build upon this information to leverage the model in practical settings to provide actionable information about where students' misunderstandings lie.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6).
Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: Wiley.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37.
Confrey, J. (1990). A review of the research on student conceptions in mathematics, science, and programming. In C. Cazden (Ed.), Review of research in education (Vol. 16, pp. 3-56). Washington, DC: American Educational Research Association.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Wadsworth Group/Thomson Learning.
Garfield, J. (1998, April). Challenges in assessing statistical reasoning. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Garfield, J., & Chance, B. (2000). Assessment in statistics education: Issues and challenges. Mathematical Thinking and Learning, 2(1&2).
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7.
Halloun, A., & Hestenes, D. (1985). The initial knowledge state of college physics students. American Journal of Physics, 53(11).

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Henson, R. A., & Templin, J. L. (2003). The moving window family of proposal distributions. Unpublished technical report, Educational Testing Service, External Diagnostic Research Group.
Henson, R., & Templin, J. (2005). Hierarchical log-linear modeling of the joint skill distribution. Unpublished manuscript.
Henson, R., Templin, J., & Willse, J. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74.
Henson, R., Templin, J., Willse, J., & Irwin, P. (2009, April). Ancillary random effects: A way to obtain diagnostic information from existing large-scale tests. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The Physics Teacher, 30.
Huff, K., & Goodman, D. P. (2007). The demand for cognitive diagnostic assessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications. London: Cambridge University Press.
Khazanov, L. (2009, February). A diagnostic assessment for misconceptions in probability. Paper presented at the Georgia Perimeter College Mathematics Conference, Clarkston, GA.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100.


Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods Journal of Modern Applied Statistical Methods Volume 11 Issue 1 Article 14 5-1-2012 Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian

More information

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,

More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns The Influence of Test Characteristics on the Detection of Aberrant Response Patterns Steven P. Reise University of California, Riverside Allan M. Due University of Minnesota Statistical methods to assess

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

During the past century, mathematics

During the past century, mathematics An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models. Xiaowen Liu Eric Loken University of Connecticut

Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models. Xiaowen Liu Eric Loken University of Connecticut Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models Xiaowen Liu Eric Loken University of Connecticut 1 Overview Force Concept Inventory Bayesian implementation of one-

More information

Parallel Forms for Diagnostic Purpose

Parallel Forms for Diagnostic Purpose Paper presented at AERA, 2010 Parallel Forms for Diagnostic Purpose Fang Chen Xinrui Wang UNCG, USA May, 2010 INTRODUCTION With the advancement of validity discussions, the measurement field is pushing

More information

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Journal of Social and Development Sciences Vol. 4, No. 4, pp. 93-97, Apr 203 (ISSN 222-52) Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Henry De-Graft Acquah University

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 26 for Mixed Format Tests Kyong Hee Chon Won-Chan Lee Timothy N. Ansley November 2007 The authors are grateful to

More information

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari *

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari * Electronic Journal of Applied Statistical Analysis EJASA (2012), Electron. J. App. Stat. Anal., Vol. 5, Issue 3, 431 437 e-issn 2070-5948, DOI 10.1285/i20705948v5n3p431 2012 Università del Salento http://siba-ese.unile.it/index.php/ejasa/index

More information

JONATHAN TEMPLIN LAINE BRADSHAW THE USE AND MISUSE OF PSYCHOMETRIC MODELS

JONATHAN TEMPLIN LAINE BRADSHAW THE USE AND MISUSE OF PSYCHOMETRIC MODELS PSYCHOMETRIKA VOL. 79, NO. 2, 347 354 APRIL 2014 DOI: 10.1007/S11336-013-9364-Y THE USE AND MISUSE OF PSYCHOMETRIC MODELS JONATHAN TEMPLIN UNIVERSITY OF KANSAS LAINE BRADSHAW THE UNIVERSITY OF GEORGIA

More information

Scaling TOWES and Linking to IALS

Scaling TOWES and Linking to IALS Scaling TOWES and Linking to IALS Kentaro Yamamoto and Irwin Kirsch March, 2002 In 2000, the Organization for Economic Cooperation and Development (OECD) along with Statistics Canada released Literacy

More information

Centre for Education Research and Policy

Centre for Education Research and Policy THE EFFECT OF SAMPLE SIZE ON ITEM PARAMETER ESTIMATION FOR THE PARTIAL CREDIT MODEL ABSTRACT Item Response Theory (IRT) models have been widely used to analyse test data and develop IRT-based tests. An

More information

Comprehensive Statistical Analysis of a Mathematics Placement Test

Comprehensive Statistical Analysis of a Mathematics Placement Test Comprehensive Statistical Analysis of a Mathematics Placement Test Robert J. Hall Department of Educational Psychology Texas A&M University, USA (bobhall@tamu.edu) Eunju Jung Department of Educational

More information

Decision consistency and accuracy indices for the bifactor and testlet response theory models

Decision consistency and accuracy indices for the bifactor and testlet response theory models University of Iowa Iowa Research Online Theses and Dissertations Summer 2014 Decision consistency and accuracy indices for the bifactor and testlet response theory models Lee James LaFond University of

More information

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow?

An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Journal of Educational and Behavioral Statistics Fall 2006, Vol. 31, No. 3, pp. 241 259 An Empirical Bayes Approach to Subscore Augmentation: How Much Strength Can We Borrow? Michael C. Edwards The Ohio

More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

The Effect of Guessing on Item Reliability

The Effect of Guessing on Item Reliability The Effect of Guessing on Item Reliability under Answer-Until-Correct Scoring Michael Kane National League for Nursing, Inc. James Moloney State University of New York at Brockport The answer-until-correct

More information

A Multilevel Testlet Model for Dual Local Dependence

A Multilevel Testlet Model for Dual Local Dependence Journal of Educational Measurement Spring 2012, Vol. 49, No. 1, pp. 82 100 A Multilevel Testlet Model for Dual Local Dependence Hong Jiao University of Maryland Akihito Kamata University of Oregon Shudong

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN By FRANCISCO ANDRES JIMENEZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

A Broad-Range Tailored Test of Verbal Ability

A Broad-Range Tailored Test of Verbal Ability A Broad-Range Tailored Test of Verbal Ability Frederic M. Lord Educational Testing Service Two parallel forms of a broad-range tailored test of verbal ability have been built. The test is appropriate from

More information

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS Michael J. Kolen The University of Iowa March 2011 Commissioned by the Center for K 12 Assessment & Performance Management at

More information

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION

More information

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National

More information

Workshop Overview. Diagnostic Measurement. Theory, Methods, and Applications. Session Overview. Conceptual Foundations of. Workshop Sessions:

Workshop Overview. Diagnostic Measurement. Theory, Methods, and Applications. Session Overview. Conceptual Foundations of. Workshop Sessions: Workshop Overview Workshop Sessions: Diagnostic Measurement: Theory, Methods, and Applications Jonathan Templin The University of Georgia Session 1 Conceptual Foundations of Diagnostic Measurement Session

More information

Differential Item Functioning

Differential Item Functioning Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item

More information

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH By JANN MARIE WISE MACINNES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF

More information

Using Bayesian Decision Theory to

Using Bayesian Decision Theory to Using Bayesian Decision Theory to Design a Computerized Mastery Test Charles Lewis and Kathleen Sheehan Educational Testing Service A theoretical framework for mastery testing based on item response theory

More information

Michael Hallquist, Thomas M. Olino, Paul A. Pilkonis University of Pittsburgh

Michael Hallquist, Thomas M. Olino, Paul A. Pilkonis University of Pittsburgh Comparing the evidence for categorical versus dimensional representations of psychiatric disorders in the presence of noisy observations: a Monte Carlo study of the Bayesian Information Criterion and Akaike

More information

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,

More information

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract

Item Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract Item Parameter Recovery for the Two-Parameter Testlet Model with Different Estimation Methods Yong Luo National Center for Assessment in Saudi Arabia Abstract The testlet model is a popular statistical

More information

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models. Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017) DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;

More information

Impact and adjustment of selection bias. in the assessment of measurement equivalence

Impact and adjustment of selection bias. in the assessment of measurement equivalence Impact and adjustment of selection bias in the assessment of measurement equivalence Thomas Klausch, Joop Hox,& Barry Schouten Working Paper, Utrecht, December 2012 Corresponding author: Thomas Klausch,

More information

Ordinal Data Modeling

Ordinal Data Modeling Valen E. Johnson James H. Albert Ordinal Data Modeling With 73 illustrations I ". Springer Contents Preface v 1 Review of Classical and Bayesian Inference 1 1.1 Learning about a binomial proportion 1 1.1.1

More information

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting

More information

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION Timothy Olsen HLM II Dr. Gagne ABSTRACT Recent advances

More information

Bayesian Tailored Testing and the Influence

Bayesian Tailored Testing and the Influence Bayesian Tailored Testing and the Influence of Item Bank Characteristics Carl J. Jensema Gallaudet College Owen s (1969) Bayesian tailored testing method is introduced along with a brief review of its

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

International Journal of Education and Research Vol. 5 No. 5 May 2017

International Journal of Education and Research Vol. 5 No. 5 May 2017 International Journal of Education and Research Vol. 5 No. 5 May 2017 EFFECT OF SAMPLE SIZE, ABILITY DISTRIBUTION AND TEST LENGTH ON DETECTION OF DIFFERENTIAL ITEM FUNCTIONING USING MANTEL-HAENSZEL STATISTIC

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Differential Item Functioning Amplification and Cancellation in a Reading Test

Differential Item Functioning Amplification and Cancellation in a Reading Test A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

On indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state

On indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state On indirect measurement of health based on survey data Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state A scaling model: P(Y 1,..,Y k ;α, ) α = item difficulties

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Structural Equation Modeling (SEM)

Structural Equation Modeling (SEM) Structural Equation Modeling (SEM) Today s topics The Big Picture of SEM What to do (and what NOT to do) when SEM breaks for you Single indicator (ASU) models Parceling indicators Using single factor scores

More information

Selection of Linking Items

Selection of Linking Items Selection of Linking Items Subset of items that maximally reflect the scale information function Denote the scale information as Linear programming solver (in R, lp_solve 5.5) min(y) Subject to θ, θs,

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

Answers to end of chapter questions

Answers to end of chapter questions Answers to end of chapter questions Chapter 1 What are the three most important characteristics of QCA as a method of data analysis? QCA is (1) systematic, (2) flexible, and (3) it reduces data. What are

More information

Advanced Bayesian Models for the Social Sciences

Advanced Bayesian Models for the Social Sciences Advanced Bayesian Models for the Social Sciences Jeff Harden Department of Political Science, University of Colorado Boulder jeffrey.harden@colorado.edu Daniel Stegmueller Department of Government, University

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

Section 5. Field Test Analyses

Section 5. Field Test Analyses Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken

More information

Multidimensional Modeling of Learning Progression-based Vertical Scales 1

Multidimensional Modeling of Learning Progression-based Vertical Scales 1 Multidimensional Modeling of Learning Progression-based Vertical Scales 1 Nina Deng deng.nina@measuredprogress.org Louis Roussos roussos.louis@measuredprogress.org Lee LaFond leelafond74@gmail.com 1 This

More information

Running head: ATTRIBUTE CODING FOR RETROFITTING MODELS. Comparison of Attribute Coding Procedures for Retrofitting Cognitive Diagnostic Models

Running head: ATTRIBUTE CODING FOR RETROFITTING MODELS. Comparison of Attribute Coding Procedures for Retrofitting Cognitive Diagnostic Models Running head: ATTRIBUTE CODING FOR RETROFITTING MODELS Comparison of Attribute Coding Procedures for Retrofitting Cognitive Diagnostic Models Amy Clark Neal Kingston University of Kansas Corresponding

More information

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS

IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements

More information

Statistical Methods and Reasoning for the Clinical Sciences

Statistical Methods and Reasoning for the Clinical Sciences Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries

More information

Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century

Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century International Journal of Scientific Research in Education, SEPTEMBER 2018, Vol. 11(3B), 627-635. Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century

More information

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL) EVALUATION OF MATHEMATICS ACHIEVEMENT TEST: A COMPARISON BETWEEN CLASSICAL TEST THEORY (CTT)AND ITEM RESPONSE THEORY (IRT) Eluwa, O. Idowu 1, Akubuike N. Eluwa 2 and Bekom K. Abang 3 1& 3 Dept of Educational

More information

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Jill F. Kilanowski, PhD, APRN,CPNP Associate Professor Alpha Zeta & Mu Chi Acknowledgements Dr. Li Lin,

More information

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative

More information

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill) Advanced Bayesian Models for the Social Sciences Instructors: Week 1&2: Skyler J. Cranmer Department of Political Science University of North Carolina, Chapel Hill skyler@unc.edu Week 3&4: Daniel Stegmueller

More information

Differential Item Functioning from a Compensatory-Noncompensatory Perspective

Differential Item Functioning from a Compensatory-Noncompensatory Perspective Differential Item Functioning from a Compensatory-Noncompensatory Perspective Terry Ackerman, Bruce McCollaum, Gilbert Ngerano University of North Carolina at Greensboro Motivation for my Presentation

More information

TECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock

TECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock 1 TECHNICAL REPORT The Added Value of Multidimensional IRT Models Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock Center for Health Statistics, University of Illinois at Chicago Corresponding

More information

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky Validating Measures of Self Control via Rasch Measurement Jonathan Hasford Department of Marketing, University of Kentucky Kelly D. Bradley Department of Educational Policy Studies & Evaluation, University

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory Kate DeRoche, M.A. Mental Health Center of Denver Antonio Olmos, Ph.D. Mental Health

More information

An Introduction to Missing Data in the Context of Differential Item Functioning

An Introduction to Missing Data in the Context of Differential Item Functioning A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

A Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho

A Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho ADAPTIVE TESTLETS 1 Running head: ADAPTIVE TESTLETS A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing Leslie Keng Pearson Tsung-Han Ho The University of Texas at Austin

More information