Combining Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions

Laine P. Bradshaw, James Madison University
Jonathan Templin, University of Georgia

Author Note: Correspondence concerning this article should be addressed to Laine Bradshaw, Department of Graduate Psychology, James Madison University, MSC 6806, 821 S. Main St., Harrisonburg, VA; laineb@uga.edu. This research was funded by National Science Foundation grants DRL ; SES ; and SES .
Abstract

The Scaling Individuals and Classifying Misconceptions (SICM) model is presented as a combination of an item response theory (IRT) model and a diagnostic classification model (DCM). Common modeling and testing procedures use unidimensional IRT to provide an estimate of a student's overall ability. Recent advances in psychometrics have focused on measuring multiple dimensions to provide more detailed feedback for students, teachers, and other stakeholders. DCMs provide multidimensional feedback by using multiple categorical latent variables that represent skills underlying a test that students may or may not have mastered. The SICM model combines an IRT model with a DCM whose categorical latent variables represent misconceptions instead of skills. In addition to the type of information common testing procedures provide about an examinee (an overall continuous ability), the SICM model provides multidimensional, diagnostic feedback in the form of statistical estimates of misconceptions. This additional feedback can be used by stakeholders to tailor instruction to students' needs. Results of a simulation study demonstrate that the SICM MCMC estimation algorithm yields reasonably accurate estimates under large-scale testing conditions. Results of an empirical data analysis highlight the need to address statistical considerations of the model from the onset of the assessment development process.
Running head: Scaling Ability and Diagnosing Misconceptions

Combining Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions

The need for more fine-grained feedback from assessments that can be used to understand students' strengths and weaknesses has been emphasized at all levels of education. Educational policy (No Child Left Behind, 2001), modern curriculum standards (e.g., the Common Core Standards; National Research Council, 2010), and classroom teachers (Huff & Goodman, 2007) have described multidimensional, diagnostic feedback as essential for tailoring instruction to students' specific needs and making educational progress. In spite of this need, most state-level educational tests have been, and continue to be, designed from a unidimensional item response theory (IRT; e.g., Hambleton, Swaminathan, & Rogers, 1991) perspective. This perspective optimizes the statistical estimation of a single continuous variable representing an overall ability and provides a single, composite score to describe a student's performance with respect to an entire academic course. A common solution for providing diagnostic feedback from IRT-designed state-level tests has been to report summed scores on sub-sections of a test. Because they are based on a small number of items, these subscores often lack reliability. Decisions based on unreliable subscores may counterproductively misguide instructional strategies and resources (Wainer, Vevea, Camacho, Reeve, Rosa, Nelson, Swygert, & Thissen, 2001). These subscores are computed from items selected for the test because of their correlation with the other items on the test; as expected, they are highly related to the total score and thus do not provide information distinct from, or beyond, the total score (Haberman, 2005; Harris & Hanson, 1991; Haberman et al., 2009; Sinharay & Haberman, 2007).
The new psychometric model presented in this paper provides a means of delivering reliable multidimensional feedback within the framework of prevailing unidimensional IRT methods. The model capitalizes on advances in multidimensional measurement models, which have recently been at the forefront of psychometric research because they promise detailed feedback for students, teachers, and other stakeholders. Diagnostic classification models (DCMs; e.g., Rupp, Templin, & Henson, 2010) provide one approach to measuring multiple dimensions. DCMs use categorical latent attributes to represent skills or content components underlying a test that students may or may not have mastered. With DCMs, the focus of assessment results shifts to identifying which components each student has mastered. Attributes students have mastered can be viewed as areas in which students do not need further instruction; attributes students have not mastered indicate areas in which instruction or remediation should be focused. Thus, the attribute pattern provides students and teachers feedback on fine-grained components of a content area, which can be used to tailor instruction to students' specific needs. DCMs sacrifice fine-grained measurement to provide multidimensional measurement: instead of finely locating each examinee along a set of continuous traits as a multidimensional IRT model does, DCMs coarsely classify each examinee with respect to each trait (i.e., as a master or non-master of the trait). This trade-off enables DCMs to provide diagnostic, multidimensional feedback with reasonable data demands (i.e., relatively few items and examinees; Bradshaw & Cohen, 2010).
DCMs' low cost in terms of data demands and high benefits in terms of diagnostic information make them attractive models in educational settings where testing time is limited but multidimensional feedback is needed to reflect the multifaceted objectives of educational courses. However, given the
current reliance of testing on measuring an overall ability, DCMs, while efficient for providing detailed feedback, may not fulfill all the needs of policy-driven assessment systems centered on scaling examinee ability. In this paper, we propose a new nominal response psychometric model, the Scaling Individuals and Classifying Misconceptions (SICM) model, that blends the IRT and DCM frameworks. The SICM model alters traditional DCM practice by defining the attributes for a nominal response DCM as misconceptions that students have instead of as abilities (skills) that students have. It alters traditional nominal response IRT (NR IRT; Bock, 1972) practice by having these categorical misconceptions predict the incorrect item responses while a continuous ability predicts the correct response. Coupled together through the SICM model, the IRT and DCM components provide a more thorough description of the traits students possess: the model both scales a composite ability measured by the correctness of responses and identifies distinct errors in understanding manifested through specific incorrect responses. The model therefore serves the dual purposes of scaling examinee ability for comparative and accountability purposes and diagnosing misconceptions for remediation purposes.

The following section reviews the measurement of misconceptions in previous assessment development projects. The next section provides the statistical specification of the SICM model, followed by an illustration of the model through a contrast with two more familiar models. Then, results of a simulation study are provided to establish the efficacy of the model, and a real data analysis illustrates the use of the model in practice.

Measuring Misconceptions

A key feature of the SICM model is that the incorrect alternatives for multiple
choice test items are crafted to reflect common misconceptions students hold or typical errors that students with incomplete understandings systematically make. Previous assessments have been developed in this way, evidencing both that empirical theories about misconceptions in students' understandings exist and that there is a desire to capture them through assessment. Assessments like these have been called distractor-driven assessments (Sadler, 1998). Examples in science assessment include the Force Concept Inventory (FCI; Hestenes, Wells, & Swackhamer, 1992), the Astronomy Concept Inventory (ACI; Sadler, 1998), and the Astronomy and Space Science Concepts Inventory (ASSCI; Sadler, Coyle, Miller, Cook-Smith, Dussault, & Gould, 2010). A similar approach was used in two assessments measuring concepts and misconceptions in statistics: the Statistical Reasoning Assessment (SRA; Garfield, 1998) and the Probability Reasoning Questionnaire (Khazanov, 2009). For each of these assessments, misconceptions were theorized through extensive qualitative research that studied incorrect conceptions via student interviews. Although providing information about misconceptions was a goal of these assessments, the psychometric methods employed focused on measuring a single continuous ability, using either a total score as in classical test theory (CTT; e.g., see Crocker & Algina, 1986) or an ability estimate from IRT (used for the ACI). With these methods, misconceptions could be assessed or diagnosed only by tallying the number of times a student selected an alternative that measures a given misconception or, when IRT was used, by studying the trace lines corresponding to measured misconceptions on an item-by-item and student-by-student basis.
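The tally-based diagnosis described above can be sketched in a few lines. The item key and responses below are hypothetical, as are the misconception labels; the sketch only illustrates the bookkeeping involved.

```python
from collections import Counter

def tally_misconceptions(responses, alt_to_misconception):
    """Count how often a student selected an alternative keyed to each
    misconception. `responses` maps item -> selected alternative;
    `alt_to_misconception` maps (item, alternative) -> misconception label
    for the incorrect alternatives. Both mappings are hypothetical."""
    counts = Counter()
    for item, alt in responses.items():
        label = alt_to_misconception.get((item, alt))
        if label is not None:  # correct answers map to no misconception
            counts[label] += 1
    return counts

# Hypothetical 3-item key: each incorrect alternative reflects one misconception
key = {(1, "B"): "equiprobability", (1, "C"): "representativeness",
       (2, "D"): "equiprobability", (3, "B"): "representativeness"}
tallies = tally_misconceptions({1: "B", 2: "D", 3: "A"}, key)
```

Each tally is bounded by the handful of alternatives keyed to that misconception, which is precisely what makes such counts coarse.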
Tallies of misconceptions suffer from the same issues as subscores: they are coarse and unreliable measures, as a misconception may be measured by very few items. In IRT, post-hoc analysis of trace lines for each student and item is tedious, particularly when the task
falls upon a teacher who may teach a large number of students.

The Scaling Individuals and Classifying Misconceptions Model

The SICM model is a psychometric model that seeks to statistically diagnose students' possession of misconceptions. By using the SICM model instead of classical sum-score approaches, misconceptions can be measured more reliably. The SICM model also allows existing empirical theories to be modeled and evaluated quantitatively, allowing theories to be strengthened (e.g., by verifying that misconceptions exist and describing the structural relationships among them) and assessments to be improved (e.g., alternatives need not contribute equally to the measure of a misconception, and alternatives weakly related to a misconception can be revised). The SICM model acknowledges that there is a larger construct, measured on a continuum, that exists in addition to misconceptions that are either present or absent. Thus, unlike any existing psychometric model, the model uses an examinee's continuous ability, as in IRT, together with the examinee's pattern of categorical misconceptions as latent predictors of the item response.

The model uniquely treats misconceptions as categorical latent variables. Misconceptions, individually denoted by α, are assumed to be dichotomous latent variables: for examinee e, misconception a is either present (α_ea = 1) or absent (α_ea = 0), sometimes referred to as possession or lack of possession of the misconception. Marginally, each misconception has a Bernoulli distribution with probability p_a that an examinee possesses the misconception (i.e., P(α_ea = 1) = p_a). The misconception pattern α_e is a vector of A binary indicators describing the presence or absence of each misconception. As such, α_e has a multivariate Bernoulli distribution (MVB; e.g., Maydeu-Olivares & Joe, 2005) with a mean vector representing the pattern probabilities.
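The multivariate Bernoulli over misconception patterns can be sketched as a categorical distribution over the 2^A patterns. The pattern probabilities below are hypothetical.

```python
import itertools
import random

def enumerate_patterns(A):
    """All 2^A possible misconception patterns for A dichotomous attributes."""
    return list(itertools.product([0, 1], repeat=A))

def marginal_prob(nu, a):
    """Marginal probability p_a that misconception a is present, given the
    full set of pattern probabilities nu (dict: pattern tuple -> nu_c)."""
    return sum(p for pattern, p in nu.items() if pattern[a] == 1)

# Hypothetical pattern probabilities nu_c for A = 2 misconceptions
nu = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.2}
patterns = enumerate_patterns(2)
# One draw of alpha_e from the multivariate Bernoulli, viewed as a
# categorical distribution over the 2^A patterns
alpha_e = random.choices(patterns, weights=[nu[p] for p in patterns])[0]
```

The marginals p_a fall out of the pattern probabilities by summation, which is why the pattern-level parameterization is the more general one.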
Functional Form of the SICM Model

To measure both θ and α, the SICM model is defined for nominal response (multiple-choice) data. Often in IRT, responses are dichotomized into two categories, correct or incorrect, collapsing all of the incorrect alternatives into one category and failing to preserve the uniqueness of each incorrect alternative. Such a dichotomization can be viewed as an incomplete theory of modeling the item response (Thissen & Steinberg, 1984). If characteristics of the incorrect alternatives explain variations in the item response, then those characteristics can be modeled in the item response function (van der Linden & Hambleton, 1997). Modeling responses to the alternatives directly also provides a means of evaluating item alternatives in the test-development process.

Given a set of response categories or possible alternatives for an item, the SICM model uses a nominal response mixture item response model that defines the probability of observing examinee e's nominal response pattern x_e to I items as

P(X_e = x_e) = Σ_c ν_c ∫_θ f(θ) Π_i Π_j π_ijc(θ)^[x_ei = j] dθ.   (1)

The terms ν_c and f(θ) are the structural components of the model, describing the distributions of and relationships among the latent variables, with α and θ held independent. The term ν_c is the proportion of examinees in latent class c. Each latent class represents a unique misconception (attribute) pattern, so given A misconceptions there are 2^A unique patterns. The term ν_c is parameterized as a function of the individual misconceptions by a log-linear model (e.g., Henson & Templin, 2005; Rupp, Templin, & Henson, 2010). The term f(θ) is the density function of ability, with θ ~ N(0, 1) for identifiability. The parameter π_ijc(θ) denotes the conditional probability that examinee e's response to item i will be the selection of alternative j from the set of alternatives for item i (i.e.,
, given examinee e's attribute pattern α_e and continuous ability θ_e. The brackets [·] are Iverson brackets: [x_ei = j] = 1 if x_ei = j, and 0 otherwise. The term π_ijc(θ) is the measurement component of the SICM model in that it quantifies how the latent variables (misconceptions and ability) are related to the observed item responses.

For the SICM model, ability is measured by the correct alternative on each item, and the misconceptions are measured by the incorrect alternatives. Not every incorrect alternative measures each misconception, so an indicator variable specifies when a misconception is measured by an item alternative. Mimicking DCM practices, these specifications are set a priori and are described in an item-by-alternative-by-misconception Q-matrix (e.g., Tatsuoka, 1990). The entries in the Q-matrix are indicators denoted by q_ija, where q_ija = 1 if alternative j of item i measures misconception a, and q_ija = 0 otherwise.

The SICM model parameterizes π_ijc(θ) in Equation (1) with a multicategory logistic regression model (e.g., Agresti, 2002) that models the J_i − 1 non-redundant logits, with the correct alternative as the baseline category, as

log( π_ij(θ_e, α_e) / π_ij*(θ_e, α_e) ) = z_ij(θ_e, α_e) − z_ij*(θ_e, α_e)   (2)

for every j ≠ j*, where

z_ij(θ_e, α_e) = λ_ij,0 + λ_ij,θ θ_e + λ_ij' h(α_e, q_ij)   (3)

and

z_ij*(θ_e, α_e) = λ_ij*,0 + λ_ij*,θ θ_e + λ_ij*' h(α_e, q_ij*).   (4)

The correct alternative is specified as the baseline category, denoted j*, to simplify Equation (2). In Equation (3), λ_ij,θ equals zero for every incorrect alternative j ≠ j* because the incorrect alternatives do not measure θ. In Equation (4), λ_ij*' h(α_e, q_ij*) always equals zero because the correct
alternative does not measure any misconceptions. Therefore, the log-odds of selecting an incorrect alternative over the correct alternative in the SICM model can be equivalently formulated as

log( π_ij(θ_e, α_e) / π_ij*(θ_e, α_e) ) = λ_ij,0 − λ_ij*,θ θ_e + λ_ij' h(α_e, q_ij)   (5)

for every j ≠ j*. The conditional probability that alternative j will be selected is

π_ij(θ_e, α_e) = exp(λ_ij,0 − λ_ij*,θ θ_e + λ_ij' h(α_e, q_ij)) / (1 + Σ_{m ≠ j*} exp(λ_im,0 − λ_ij*,θ θ_e + λ_im' h(α_e, q_im))).   (6)

The intercept λ_ij,0 is the logit of selecting incorrect alternative j over the correct alternative for an examinee with an ability of zero who possesses none of the misconceptions measured by alternative j. The more difficult the alternative is, the larger the intercept will be. The term λ_ij*,θ is the loading for ability and is the discrimination parameter for ability, as in IRT. Using notation consistent with the log-linear cognitive diagnosis model (LCDM; Henson, Templin, & Willse, 2009) and the nominal response DCM (NR DCM; Templin & Bradshaw, under review), the term λ_ij' h(α_e, q_ij) is a linear combination of main and interaction effects. The vector q_ij denotes the Q-matrix entries for alternative j of item i, and α_e is the misconception pattern for examinee e. The term h(α_e, q_ij) is a column vector of indicators whose elements equal one if and only if (a) the alternative measures the misconception or set of misconceptions corresponding to the parameter and (b) the examinee possesses the misconception or set of misconceptions corresponding to the parameter. Specifically, λ_ij' h(α_e, q_ij) equals

λ_ij' h(α_e, q_ij) = Σ_a λ_ij,1,(a) α_ea q_ija + Σ_a Σ_{b>a} λ_ij,2,(a,b) α_ea α_eb q_ija q_ijb + …   (7)
where λ_ij,1,(a) is the main effect for misconception a for alternative j of item i; λ_ij,2,(a,b) is the interaction effect between attributes a and b for alternative j of item i (present if the alternative measures two or more misconceptions); and the ellipsis denotes the third- through higher-order interactions for alternatives that measure more than two misconceptions, up to the A-way interaction effect among all attributes. Main effects and interactions are discrimination parameters with respect to misconception patterns. To identify the model, as is usual for a baseline-category logit model, an arbitrary category is treated as the baseline category, and all parameters for the baseline category are set equal to zero. Additionally, the main effect parameters are constrained to ensure monotonicity for attributes and for ability, meaning that (a) possessing a misconception never decreases the probability of selecting an alternative measuring that misconception, and (b) an increase in ability never decreases the probability of answering the item correctly.

Lower Asymptote for the SICM Model

The specification of the SICM model in Equation (2) does not provide a lower asymptote for the probability of a correct response to account for guessing on a multiple-choice test. An alternative formulation of the SICM model was developed to provide this lower asymptote without adding an additional parameter to the model. The new formulation is

log( π_ij(θ_e, α_e) / π_ij*(θ_e, α_e) ) = λ_ij,0 − exp(λ_ij*,θ θ_e) + λ_ij' h(α_e, q_ij).   (8)

The difference between Equations (5) and (8) is that the ability portion of the model is now exponentiated. The intercept is now interpreted as the logit that an examinee with an extremely low ability who possesses no misconceptions will choose alternative j. Holding other parameters constant, as ability decreases, the value of exp(λ_ij*,θ θ_e) decreases, and the logit of selecting the
correct answer decreases, satisfying the monotonicity assumption of the model. As ability approaches negative infinity, exp(λ_ij*,θ θ_e) approaches 0, meaning the logit for each incorrect alternative approaches λ_ij,0 + λ_ij' h(α_e, q_ij), yielding a lower asymptote for the probability of selecting the correct response of

P(X_ei = j* | θ_e → −∞, α_e) = 1 / (1 + Σ_{j ≠ j*} exp(λ_ij,0 + λ_ij' h(α_e, q_ij))).   (9)

This correction results in a more realistic model of the item response without the increased estimation difficulty that additional item parameters bring, as is commonly encountered when using the 3PL IRT model.

The SICM Model Illustrated as a Combination of an IRT Model and a DCM

The SICM model posits that there is a continuous trait measured by an assessment that largely explains the covariance among selections of the correct alternatives for a set of items. It additionally assumes that there exists a set of categorical misconceptions, each of which a student does or does not possess, that systematically accounts for the variation in selections among the incorrect alternatives. To further illustrate the differences among the NR IRT model, the NR DCM, and the SICM model, consider the example item in Figure 1. From an NR IRT perspective, the level of a person's overall math ability explains the variation in the item responses and is the only latent variable measured by the alternatives. Using the SICM model in Equation (2), if θ is measured by every alternative and q_ij is fixed to be a 1 × A vector of zeros for every alternative (i.e., no misconceptions are measured), then the item response function is Bock's (1972) NR IRT model. From an NR DCM perspective, two categorical abilities are needed to answer this item correctly: the ability to find the area of a rectangle (Attribute 1; α_1) and the ability to make
conversions among units within a metric system (Attribute 2; α_2). Consider Alternative B, which measures only Attribute 1. An examinee who selects Alternative B incorrectly converts 3 feet to 1/4 inches (does not possess Attribute 2) but demonstrates the ability to find the area of a rectangle by multiplying the given dimensions (possesses Attribute 1). Thus, a response of B indicates the absence of Attribute 2 yet the presence of Attribute 1. When sample sizes are large, the NR DCM has been found to capitalize on information in the incorrect alternatives, as demonstrated by greater classification accuracy compared to the LCDM for dichotomous responses (Templin & Bradshaw, under review). Using the SICM model in Equation (2), if θ is measured by no alternative, then the item response function is the NR DCM, where α_e is defined as a pattern of attributes or skills instead of a pattern of misconceptions as in the SICM model.

For the example item to be modeled from the SICM perspective, the attributes are defined as misconceptions or errors: Attribute 1 (α_1) is redefined as the inability to find the area of a rectangle and Attribute 2 (α_2) as the inability to make conversions among units within a metric system (see footnote 1). An examinee is expected to answer the item correctly (to select Alternative A) if he or she possesses neither of these attributes and has a modest level of overall ability. Figure 2 provides hypothetical item response probabilities for the NR IRT model, the NR DCM, and the SICM model to compare the type of information each model provides. The Q-matrix entries for each model are given in the legend of the first graph corresponding to that model. For the NR IRT model, in the top left graph, the item response probability is solely a function of ability (θ).
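The SICM response function for an item like this can be evaluated numerically. In the sketch below, all λ values are hypothetical, and the keying (B measures one misconception, C another, D both) merely mirrors the structure of the example.

```python
import math

def sicm_item_probs(theta, alpha, intercepts, kernels, lam_theta):
    """Category probabilities for one SICM item with correct alternative 'A'
    as the baseline (Equation 2 structure). intercepts[j] is the intercept for
    incorrect alternative j; kernels[j](alpha) returns that alternative's
    misconception kernel. All parameter values used here are hypothetical."""
    z = {j: intercepts[j] - lam_theta * theta + kernels[j](alpha)
         for j in intercepts}
    denom = 1.0 + sum(math.exp(v) for v in z.values())
    probs = {j: math.exp(v) / denom for j, v in z.items()}
    probs["A"] = 1.0 / denom  # baseline category: linear predictor 0
    return probs

# Hypothetical item: B keyed to misconception 2, C to misconception 1,
# D to both (with a two-way interaction term)
intercepts = {"B": -1.0, "C": -1.0, "D": -2.0}
kernels = {"B": lambda a: 2.0 * a[1],
           "C": lambda a: 2.0 * a[0],
           "D": lambda a: 1.5 * a[0] + 1.5 * a[1] + 1.0 * a[0] * a[1]}

probs = sicm_item_probs(0.0, (0, 1), intercepts, kernels, 1.0)
```

Because the ability term cancels when comparing two incorrect alternatives, which incorrect alternative is most likely for a given misconception pattern does not change with θ.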
Footnote 1: We acknowledge that math educators likely would not consider the lack of a skill to be a misconception. This example is an oversimplification meant to convey the statistical properties of the model with an example accessible to a wide range of researchers. Reviewing the misconception literature is beyond the scope of this article, but quality research can be found on the nature of misconceptions (e.g., Smith, diSessa, & Roschelle, 1993) and on documented misconceptions in specific fields (e.g., in math and science, Confrey, 1990).

The NR DCM, shown in the next four graphs, provides the item response
probability by the examinee's class, which is defined by the attribute pattern (α) the examinee has. When an examinee's attribute pattern corresponds to the attribute pattern measured by an alternative, the examinee is most likely to select that alternative. The SICM model, shown in the last four graphs, provides the item response probability not only as a function of ability, as the NR IRT model does, but also as a function of the examinee's class (the misconception pattern the examinee has), as the NR DCM does. For the SICM model, each class has a different set of trace lines. For the NR IRT model, all trace lines intersect, meaning that at different ability levels, different incorrect alternatives are more likely to be selected. In the SICM model, unlike the NR IRT model, the trace lines for the incorrect alternatives each have an upper asymptote, are monotonically decreasing, and thus never intersect. The ordering of the probabilities of selecting the incorrect alternatives depends on the misconceptions and is invariant with respect to ability; put differently, the order varies across classes but is invariant within a class. For example, as seen in Figure 2, an examinee with misconception pattern [01] who misses the item is most likely to select Alternative B regardless of ability level. Similarly, if examinees with misconception patterns [10] and [11] miss the item, they are most likely to select Alternatives C and D, respectively.

Estimation of the SICM Model

The SICM model was estimated using a Markov chain Monte Carlo (MCMC) estimation algorithm that uses Metropolis-Hastings sampling and was written in Fortran. Appendix A contains the specific steps of the algorithm. Using the specification of the model in Equation (5), the SICM model is estimable with Mplus Version 6.1 (Muthén & Muthén).
However, Mplus cannot estimate the SICM model with the exponentiated ability term in Equation (8), so writing a unique estimation algorithm in Fortran was necessary.
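The Metropolis-Hastings updates at the core of such an algorithm can be sketched generically. This is not the authors' Fortran implementation: the target below is a toy standard normal standing in for one parameter's conditional posterior, and the real sampler cycles updates like this over all item, structural, and examinee parameters.

```python
import math
import random

random.seed(1)  # reproducible toy run

def mh_step(current, log_post, proposal_sd=0.5):
    """One Metropolis-Hastings update with a symmetric normal proposal:
    accept the proposal with probability min(1, posterior ratio)."""
    proposal = random.gauss(current, proposal_sd)
    if math.log(random.random()) < log_post(proposal) - log_post(current):
        return proposal
    return current

# Toy target: standard normal log-density (a stand-in conditional posterior)
log_post = lambda x: -0.5 * x * x

draw, kept = 5.0, []
for t in range(20000):
    draw = mh_step(draw, log_post)
    if t >= 5000:                # discard burn-in stages
        kept.append(draw)
eap = sum(kept) / len(kept)      # EAP estimate: mean of the retained draws
```

The final line mirrors how EAP estimates are formed in the simulation study: averaging the post-burn-in draws of each parameter.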
To evaluate the performance of the SICM model and algorithm, a simulation study was conducted and is discussed next. An empirical data analysis follows.

Simulation Study

The SICM model is complex due to the large number of parameters to be estimated and the different types (i.e., continuous and categorical) of parameters estimated within the model. The simulation study provides information about (a) the performance of the model under realistic testing conditions and (b) the interplay of a continuous ability and a set of categorical misconceptions within a single model. Specifying continuous and categorical latent variables in the measurement model of a psychometric model at the item-alternative level has not been tried before, so it is of interest how one type of variable affects the other (e.g., whether the effect of one type will dominate or mask the other's effect).

Simulation Study Design

The study had four fully crossed manipulated factors: sample size (3,000 and 10,000), test length (30 and 60 items), number of misconceptions (3 and 6), and the size of the main effects for ability and attributes. Average low main effects were .4 for ability and 1 for misconceptions; average high main effects were .6 for ability and 2 for misconceptions. To investigate how estimation was affected when either the continuous or the categorical variables were more dominant, the relative and absolute magnitudes of the main effects for these latent variables were manipulated by crossing the high and low effect conditions. The tetrachoric correlation between attributes was set to .50. Fifty replications were estimated for each of the 32 conditions. The MCMC estimation algorithm was run for 10,000 stages following a burn-in period of 50,000 stages, and expected a posteriori (EAP) estimates computed from the post-burn-in stages were used as estimates of model parameters.
Each simulated item had four alternatives, and each alternative was specified to measure one or two attributes. A balanced Q-matrix was used, with 2.1 misconceptions measured per item and 1.13 misconceptions measured per alternative on average. As a baseline for comparison, in the 3-misconception/30-item conditions, each misconception was measured by 34 alternatives across 21 items.

Simulation Results and Conclusions

Results are provided in tables in which values were (a) averaged across the magnitude-of-main-effects factor and/or (b) averaged across all other factors and given by the magnitude-of-main-effects factor. Results indicate that the item, structural, and examinee parameters were accurately estimated with the MCMC algorithm. Generally, item and examinee parameter estimates were most accurate in conditions with more examinees, more items, and fewer misconceptions. These trends are consistent with the psychometric literature at large: estimation improves when there are fewer parameters to estimate and when the model has more information with which to determine the parameters. More specifically, the results reported in this section are compatible with other simulation studies in the DCM literature (e.g., Choi, 2010; Henson, Templin, & Willse, 2009). Varying the magnitude of the main effects uncovered no barriers to estimating both the categorical and continuous latent predictors and shed some light on which conditions yielded more accurately estimated parameters. Results for item, structural, and examinee parameters are discussed in turn.

Accuracy of Model Parameter Estimates

Table 1 gives the average bias, root mean squared error (RMSE), and Pearson correlations between true and estimated parameters. The RMSEs for item parameters were less than .05 for all conditions, so additional improvement as the test length increased,
misconceptions decreased, and the sample size increased was negligible. The estimation of the structural parameters was most affected by the number of misconceptions, which is to be expected because the complexity of the structural model grows quickly as the dimensionality of the assessment increases. RMSEs for structural parameters were less than .10 when 3 misconceptions were measured. For the 6-misconception conditions, structural parameter estimation improved as the number of items or examinees increased.

Accuracy of Examinee Parameter Estimates

Consistent with psychometric modeling research, the accuracy of the examinee estimates (Table 1) and classifications (Table 2) was less affected by the number of examinees responding to the assessment and more affected by the length of the test and the number of misconceptions. Accuracy of classification is measured by the correct classification rate (CCR). The greatest improvement in estimation came from increasing the length of the test. For the 60-item conditions, the RMSE for ability estimates ranged from .588 to .599 and the CCR for individual attributes ranged from .922 to .958. In comparison, for the 30-item conditions, the RMSE ranged from .708 to .725 and the CCR ranged from .863 to .918.

Reliability of Examinee Estimates

The reliability of examinee ability estimates and classifications was evaluated by the comparable reliability measure developed by Templin and Bradshaw (in press). For the SICM model, reliabilities for classifications were uniformly greater than reliabilities for abilities, regardless of the conditions under which the estimates were obtained. The top portion of Table 3 gives results averaged across the magnitude-of-main-effects conditions, where reliability ranged from .541 to .675 for ability and, on average, from .908 to .988 for misconceptions.
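The accuracy and classification indices reported in these subsections can be computed as follows; this is a generic sketch of the standard definitions, not the authors' evaluation code.

```python
import math

def bias(true, est):
    """Average signed error of a set of estimates."""
    return sum(e - t for t, e in zip(true, est)) / len(true)

def rmse(true, est):
    """Root mean squared error of a set of estimates."""
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true, est)) / len(true))

def ccr(true_alpha, est_alpha, a):
    """Correct classification rate for misconception a: the proportion of
    examinees whose estimated status matches their true status."""
    hits = sum(1 for t, e in zip(true_alpha, est_alpha) if t[a] == e[a])
    return hits / len(true_alpha)
```

In a simulation these are computed per condition and then averaged over replications, which is how the table values described above are formed.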
This finding echoes the results in Templin and Bradshaw (in press), which found
across a set of models, DCM classifications (with 2, 3, 4, and 5 categories) were consistently more reliable than IRT ability estimates. The reliability of the misconceptions is very high, but the reliability of the ability estimates falls short of the .70 or above one might strive for in achievement testing.

Interplay of Continuous and Categorical Latent Predictors

The examinee estimates were also affected by the magnitude-of-main-effects factor, which offered some insight into the interplay of continuous and categorical variables estimated within the same model. The accuracy and reliability of the estimated abilities were greatest when ability had a high main effect in an absolute sense; estimation improved only slightly when ability also had a high main effect in a relative sense (i.e., when misconceptions had a low main effect). Similarly, the accuracy and reliability of the classifications were greatest when misconceptions had a high main effect in an absolute sense, and estimation improved only slightly when that main effect was higher than the main effect for ability in a relative sense. These results indicate that strong main effects for ability improve estimation of ability without significantly hurting estimation of the misconceptions, and strong main effects for misconceptions improve estimation of misconceptions without significantly hurting estimation of ability. Thus, when estimating the SICM model in practice, the larger concern regarding main effects is their strength in an absolute sense. Given strong main effects for each type of variable, the different types of variables can coexist within the same model without one dominating the other.

Limitations of the Simulation Study

Although the results of the simulation study provide some insights for using the SICM model, they were obtained under conditions where the estimation model was correct.
In practice, a host of factors may impact the accuracy of an analysis with the SICM model. For example, Q-matrices may differ in their complexity and in their accuracy. Fairly complex Q-matrices were used for this simulation study, but perfect accuracy was assumed, so model misspecification was not examined. Model misspecification is an important topic in psychometrics because a misspecified model can be expected to degrade parameter estimates and classifications. Other situations in practice may involve a different number of alternatives or items, and main effects for misconceptions and ability may be mixed within a test instead of having designated absolute and relative magnitudes across the test.

The SICM Model Illustrated with an Empirical Data Analysis

To demonstrate the SICM model's use in a practical setting, data from a reading comprehension assessment constructed and administered by a large-scale testing company were analyzed. Currently modeled with total scores for ability and subscores for misconceptions, the reading comprehension assessment aims to measure an overall literacy level to determine whether an examinee would benefit from additional instruction via instructional modules, and to determine which weaknesses should be targeted within those modules. Thus, the SICM model was well aligned with the purpose of this assessment. For this 28-item multiple-choice assessment, each incorrect alternative corresponded to one of three types of errors that students make when responding to reading comprehension items, as predetermined and specified by content experts and item writers. The three types of errors modeled as categorical attributes, or misconceptions, were: a non-text-based response, a text-based misinterpretation of the passage, and a text-based misinterpretation of the question.
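A distractor-level Q-matrix of this kind can be represented as a mapping from each incorrect alternative to the single error it measures. The sketch below uses three hypothetical items (illustrative entries, not the actual assessment's Q-matrix) to show how summaries such as errors per item and alternatives per error would be computed:

```python
from collections import Counter

# Toy distractor-level Q-matrix (hypothetical items): each incorrect
# alternative maps to exactly one of the three error types.
q_matrix = {
    1: {"B": "non_text", "C": "misread_passage", "D": "misread_question"},
    2: {"B": "non_text", "C": "non_text", "D": "misread_passage"},
    3: {"B": "misread_question", "C": "misread_question", "D": "misread_question"},
}

# Distinct errors measured by each item, and the per-item average
errors_per_item = {item: len(set(alts.values())) for item, alts in q_matrix.items()}
avg_errors = sum(errors_per_item.values()) / len(errors_per_item)

# How many alternatives, and how many items, measure each error
alt_counts = Counter(err for alts in q_matrix.values() for err in alts.values())
item_counts = Counter(err for alts in q_matrix.values() for err in set(alts.values()))

print(errors_per_item)       # {1: 3, 2: 2, 3: 1}
print(round(avg_errors, 2))  # 2.0
```

Applied to the real assessment, the same summaries would yield the figures reported below (1.93 errors per item; 29, 32, and 22 alternatives per error).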
The first error reflects that the passage was not read (perhaps from lack of effort or time); the second reflects that the passage was read but misinterpreted (a comprehension error); and the third reflects that the question was misinterpreted (a different type of comprehension error). On average, each item measured 1.93 errors. Six items measured all three types of errors, and every incorrect alternative measured exactly one error. Respectively, the three errors were measured by 29, 32, and 22 alternatives across 21, 19, and 14 items.

To estimate the SICM model, the MCMC algorithm was run for 100,000 steps with a 50,000-step burn-in. Convergence was assessed using a variation of Gelman and Rubin's (1992) statistic. Convergence was reached for fewer than 50% of the structural parameters (very poor convergence), but 95% of all other parameters converged (acceptable convergence). Unlike in the simulated data analyses, a more informative prior distribution (lognormal(0, 0.5)) was used to estimate the main effect for ability, owing to estimation difficulty likely caused by a sample size (1,097 students) too small to estimate these parameters well.

To provide a thorough evaluation of the SICM model relative to other potential psychometric models, two other models were also used to analyze the assessment. We first present the results of the model comparison and then describe the SICM model estimates. Although the results paint a picture of an assessment with limited dimensionality, we use the comparison of models to help depict how estimates from the SICM model are differentiated from those of other psychometric models.

Comparison of Three Psychometric Models

The SICM model with a lower asymptote, the NR IRT model with a lower asymptote (formed by exponentiating the ability portion of the model, as in the SICM model), and the NR DCM were used to analyze the reading comprehension data. The SICM model scaled examinees according to their ability and classified examinees according to their errors. The NR IRT model provided only an estimate of ability.
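The convergence monitoring used in these analyses is a variant of Gelman and Rubin's (1992) statistic, which compares between-chain and within-chain variance. The sketch below computes the basic potential scale reduction factor on simulated chains rather than actual SICM draws (the chain values are illustrative only):

```python
import random
import statistics

def gelman_rubin(chains):
    """Basic potential scale reduction factor (R-hat) for m chains of length n."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    W = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain variance
    B = n * statistics.variance(means)                            # between-chain variance
    var_hat = (n - 1) / n * W + B / n                             # pooled variance estimate
    return (var_hat / W) ** 0.5

random.seed(1)
# Two chains sampling the same distribution: R-hat should be near 1
mixed = [[random.gauss(0, 1) for _ in range(5000)] for _ in range(2)]
# Two chains stuck at different locations: R-hat should be well above 1
stuck = [[random.gauss(0, 1) for _ in range(5000)],
         [random.gauss(3, 1) for _ in range(5000)]]
print(round(gelman_rubin(mixed), 2))  # close to 1.0
print(gelman_rubin(stuck) > 1.5)      # True
```

In practice this diagnostic is computed for every model parameter, and values near 1 are taken as evidence of convergence.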
The NR DCM provided only classifications of examinees according to the three types of errors on the assessment. To distinguish the models with a lower asymptote from those without one, models with the lower asymptote are marked with an asterisk (e.g., SICM*). Comparisons of examinee estimates shed light on the SICM* model results and are presented first.

Comparison of Examinee Estimates

Ability estimates from the SICM* model were strongly correlated with the NR IRT* model estimates (.731), and classifications of examinees were similar for the SICM* model and the NR DCM. The SICM* model classified examinees into only two of the eight possible error patterns: examinees either possessed all errors (Pattern 8, [111]) or no errors (Pattern 1, [000]), meaning the tetrachoric correlations among the three misconceptions were one. The SICM* and NR DCM models agreed with respect to both individual misconception and whole-pattern classification for approximately 84% of the examinees, and the NR DCM classified all but eight examinees into Pattern 1 or 8. These findings suggest that the assessment is not highly multidimensional with respect to the errors. This may be due to a theoretical issue (the errors may not be stable traits that produce systematic responses to items), a test-development issue (the items and alternatives may simply not elicit the errors), or an estimation issue (the sample may be too small to contain a substantial set of examinees with each pattern). The effect of this result permeates the remaining analyses.

The classification of examinees into categories with all or no errors empirically suggests that the structural model of the SICM* model and the NR DCM is incorrect. The models are over-parameterized; many estimated parameters would have a value of 0. When the errors are this highly correlated, they are no longer practically distinct and cannot be treated as separate categorical variables.
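The all-or-none collapse and the whole-pattern agreement rate just described can be checked directly from the estimated attribute profiles. A sketch with hypothetical classifications for six examinees (tuples of three 0/1 misconception indicators; the values are illustrative, not the actual data):

```python
from collections import Counter

# Hypothetical classifications from two models for six examinees:
# each profile is (error1, error2, error3).
sicm = [(0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1)]
nr_dcm = [(0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1)]

# Whole-pattern agreement rate between the two models
agreement = sum(a == b for a, b in zip(sicm, nr_dcm)) / len(sicm)

# How many of the 2^3 = 8 possible patterns are actually occupied
occupied = Counter(sicm)
print(agreement)      # 0.8333...
print(len(occupied))  # 2 -> all-or-none collapse
```

When only two of the eight patterns are occupied, the structural parameters governing the empty classes have no data to inform them, which is the convergence problem discussed next.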
As a result, the structural parameters cannot converge because there is no information about examinees in the other six classes posited to exist. For the SICM*, only 42.9% of the structural parameters converged, with similarly poor convergence for the NR DCM. The goal of these models was to model the variation of item responses across predetermined attribute patterns, which was not feasible because there was no observed variation across those patterns.

Relative Model Fit

Akaike's information criterion (AIC; Akaike, 1974) and Schwarz's Bayesian information criterion (BIC; Schwarz, 1978) were used to make a relative comparison of model-data fit. Results given in Table 5 show that both indices preferred the fit of the SICM* model over the NR IRT* model, and the fit of the NR IRT* model over the NR DCM. The better fit of the two models that estimated a continuous trait was not surprising because the examinees' all-or-none error patterns indicated a lack of dimensionality. The reason the SICM* model was preferred over the NR IRT* model is more subtle. The errors did demonstrate some dimensionality: examinees fell into one of two classes, not just one. Had the errors shown no dimensionality at all, the NR IRT* model should have been preferred to the SICM* model because it estimates far fewer parameters. These results suggest the test may be measuring something more than a single continuous trait, but it is not measuring all three distinct errors in the Q-matrix. Perhaps a single error that placed examinees into two classes would be preferable to three errors that failed to place examinees into eight classes.

SICM* Model Results for an Example Item

Figure 3 shows the SICM* model's estimated nominal response probabilities for an example item on the reading comprehension assessment as a function of ability and misconception pattern. Q-matrix entries are given in the legend of the figure, showing that for this item each incorrect alternative measured one of the three errors. The order of the intercepts for incorrect alternatives B, C, and D can be deduced from the response probabilities in the first
graph, which corresponds to Pattern [000] (i.e., examinees with no errors). The most likely incorrect alternative is D, with the largest intercept (.666), and the least likely is B, with the smallest intercept. Interpretations of the trace lines in the next six graphs are difficult because no examinees were actually classified into these patterns. In the last graph, the trace lines of the incorrect alternatives suggest that examinees who possess all of the misconceptions are about equally likely to make any one of these errors. These graphs illustrate how the SICM model can provide each NR DCM-like class with a unique set of NR IRT-like response curves, which, given better model-data fit, would have more meaningful interpretations.

SICM* Model Example Examinee Results

In this section, we present results for two examinees with similar response patterns. Examinees 199 and 403 each answered the first 22 items correctly and two of the last six items correctly, giving each a total correct score of 24. However, the two final items they answered correctly were different, so their ability estimates differed slightly. Additionally, for the items they answered incorrectly, they selected different incorrect answers. Their different incorrect answers on the last six items led to the two students being classified as having drastically different error patterns. From the SICM model estimates, we can conclude that both students have above-average ability, yet Examinee 403 (with the slightly higher ability estimate) needs instruction relevant to all three errors, while Examinee 199 does not. These results reflect the potential utility that SICM* model estimates add beyond IRT model estimates. For an IRT model with an estimated discrimination parameter, it matters which items an examinee answers correctly.
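This dependence on which items are answered correctly can be sketched with a toy 2PL example (all item parameters hypothetical): two examinees with the same total score receive different maximum-likelihood ability estimates because they miss items with different discriminations.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_theta(responses, items, grid=None):
    """Maximum-likelihood ability estimate by grid search over theta."""
    grid = grid or [g / 100 for g in range(-400, 401)]
    def loglik(theta):
        return sum(math.log(p_correct(theta, a, b)) if x == 1
                   else math.log(1 - p_correct(theta, a, b))
                   for x, (a, b) in zip(responses, items))
    return max(grid, key=loglik)

# Four hypothetical items: (discrimination, difficulty); item 4 is the
# most discriminating, so missing it costs more than missing item 1.
items = [(0.5, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.5)]
examinee_1 = [0, 1, 1, 1]  # missed the weakly discriminating item
examinee_2 = [1, 1, 1, 0]  # missed the highly discriminating item

# Same total score (3), different ML ability estimates
print(ml_theta(examinee_1, items) > ml_theta(examinee_2, items))  # True
```

The SICM model layers the misconception classification on top of this: the particular distractors chosen carry additional information beyond the right/wrong scoring.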
The same total score can yield different ability estimates because items are differentially related to the target ability being measured and thus count differentially toward the estimated ability. The SICM* model goes a step further and uses information not only about which items an examinee answers incorrectly, but also about why the examinee answered them incorrectly. As a result, two examinees can have the exact same scored response pattern yet be classified as possessing very different sets of misconceptions.

Data Analysis Discussion

Although our initial thought upon analyzing these assessment data was to turn to another data set that might bear better results with the SICM model, we ultimately found value in sharing the story this analysis tells. As noted in Templin and Henson (2006), observing a large number of examinees in patterns for which all or none of the attributes are possessed may indicate that the construct being measured is truly unidimensional. Several reasons may explain this finding. The lack of multidimensionality may follow from cognitive theory: these errors may not exist as stable latent traits of the examinee, but rather may be errors that students make inconsistently, and thus unpredictably. Alternatively, any multidimensionality that does exist may not be captured by the assessment because of a lack of validity, or not enough information may be available to estimate the model because the sample of examinees was smaller than in the simulation study. This analysis shows that even in a scenario like this one, where the purpose of the assessment was aligned with the purpose of the model, limitations exist and issues arise when retrofitting an assessment to a model. Model-data fit is expected to improve in a test-construction scenario where a test is developed from the onset to be estimated with the SICM model, and such a scenario is thus recommended. Developing the assessment from the SICM framework would help identify sources of misfit.
Validity studies can verify whether alternatives on an assessment elicit the misconceptions they purport to measure, and pilot studies can statistically flag items that exhibit model-data misfit and need to be revised or culled. The test development process can also attend to other statistical considerations, including (a) whether each misconception is measured enough times (in enough alternatives and items) to yield a reliable classification, (b) whether enough examinees select each alternative to yield accurate item parameter estimates for that alternative, and (c) whether enough examinees respond to each item to yield accurate model parameters.

Concluding Remarks

The SICM model is presented as a psychometric solution to a realistic need in educational assessment: gaining more feedback from assessments about what students do not understand. The efficacy of the SICM model under various testing conditions was demonstrated through a simulation study, suggesting that, when coupled with careful test design, the SICM model can support diagnostic score reports that provide statistical estimates of student misconceptions in addition to the type of information about student ability that current modeling and testing procedures typically provide to stakeholders. These simulation results provide guidelines for test and sampling conditions, but not for creating the test itself. As seen in the empirical data analysis, developing the assessment from the SICM framework a priori is very important. Although some general test-development considerations can be applied in developing an assessment for the SICM model, open questions remain as to how to create an assessment that can utilize the statistical features of the SICM model. The previously mentioned assessments can provide insights for writing items that measure misconceptions with incorrect answers. However, when the SICM model is used to model these types of items, unique statistical considerations arise.
For example, a continuous ability is estimated in the SICM model, as in a unidimensional IRT model. For a unidimensional IRT model, items that exhibit multidimensionality are often screened and then revised or deleted from the assessment; for the SICM model, items that measure a single continuous trait together with a set of multidimensional categorical traits are desired, so items are expected to show multidimensionality and must therefore be screened differently. We have provided information explaining how the SICM model can be estimated and applied. We hope future assessment development projects can build upon this information to leverage the model in practical settings and provide actionable information about where students' misunderstandings lie.
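As a closing illustration, the class-specific trace lines discussed for the example item can be mimicked with a simplified nominal-response kernel in which a distractor's log-odds receive a boost when the examinee's class possesses the error that distractor measures. This is a sketch, not the full SICM parameterization (which also includes a lower asymptote and ability effects on distractors), and all parameter values and names below are hypothetical:

```python
import math

def sicm_probs(theta, alpha, intercepts, slopes, error_of, kappa=1.2):
    """Simplified SICM-style nominal response probabilities.

    theta: continuous ability; alpha: dict mapping error -> 0/1 possession;
    error_of: maps each distractor to the error it measures (key omitted).
    kappa (hypothetical) boosts a distractor when its error is possessed.
    """
    logits = {}
    for alt in intercepts:
        z = intercepts[alt] + slopes[alt] * theta
        err = error_of.get(alt)
        if err is not None and alpha.get(err, 0) == 1:
            z += kappa
        logits[alt] = z
    denom = sum(math.exp(z) for z in logits.values())
    return {alt: math.exp(z) / denom for alt, z in logits.items()}

# Hypothetical 4-alternative item: A is keyed correct; B, C, D measure errors
intercepts = {"A": 0.0, "B": -0.8, "C": -0.3, "D": 0.666}
slopes = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}
error_of = {"B": "non_text", "C": "misread_passage", "D": "misread_question"}

no_errors = sicm_probs(0.0, {"non_text": 0, "misread_passage": 0, "misread_question": 0},
                       intercepts, slopes, error_of)
all_errors = sicm_probs(0.0, {"non_text": 1, "misread_passage": 1, "misread_question": 1},
                        intercepts, slopes, error_of)

# With no errors possessed, D (largest intercept) is the modal distractor;
# possessing all errors shifts probability away from the correct answer A.
print(no_errors["D"] > no_errors["B"])   # True
print(all_errors["A"] < no_errors["A"])  # True
```

Each misconception pattern thus induces its own set of response curves over ability, which is the mechanism behind the class-specific trace lines in Figure 3.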
References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6).

Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: Wiley.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37.

Confrey, J. (1990). A review of the research on student conceptions in mathematics, science, and programming. In C. Cazden (Ed.), Review of research in education (Vol. 16, pp. 3-56). Washington, DC: American Educational Research Association.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Wadsworth Group/Thomson Learning.

Garfield, J. (1998, April). Challenges in assessing statistical reasoning. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Garfield, J., & Chance, B. (2000). Assessment in statistics education: Issues and challenges. Mathematical Thinking and Learning, 2(1&2).

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7.

Halloun, I. A., & Hestenes, D. (1985). The initial knowledge state of college physics students. American Journal of Physics, 53(11).

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Henson, R. A., & Templin, J. L. (2003). The moving window family of proposal distributions (Unpublished technical report). Educational Testing Service, External Diagnostic Research Group.

Henson, R., & Templin, J. (2005). Hierarchical log-linear modeling of the joint skill distribution. Unpublished manuscript.

Henson, R., Templin, J., & Willse, J. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74.

Henson, R., Templin, J., Willse, J., & Irwin, P. (2009, April). Ancillary random effects: A way to obtain diagnostic information from existing large scale tests. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The Physics Teacher, 30.

Huff, K., & Goodman, D. P. (2007). The demand for cognitive diagnostic assessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications. London: Cambridge University Press.

Khazanov, L. (2009, February). A diagnostic assessment for misconceptions in probability. Paper presented at the Georgia Perimeter College Mathematics Conference, Clarkston, GA.

Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100.
More informationITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE
California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION
More informationUsing the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison
Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National
More informationWorkshop Overview. Diagnostic Measurement. Theory, Methods, and Applications. Session Overview. Conceptual Foundations of. Workshop Sessions:
Workshop Overview Workshop Sessions: Diagnostic Measurement: Theory, Methods, and Applications Jonathan Templin The University of Georgia Session 1 Conceptual Foundations of Diagnostic Measurement Session
More informationDifferential Item Functioning
Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item
More informationTHE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH
THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH By JANN MARIE WISE MACINNES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF
More informationUsing Bayesian Decision Theory to
Using Bayesian Decision Theory to Design a Computerized Mastery Test Charles Lewis and Kathleen Sheehan Educational Testing Service A theoretical framework for mastery testing based on item response theory
More informationMichael Hallquist, Thomas M. Olino, Paul A. Pilkonis University of Pittsburgh
Comparing the evidence for categorical versus dimensional representations of psychiatric disorders in the presence of noisy observations: a Monte Carlo study of the Bayesian Information Criterion and Akaike
More informationUSE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION
USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,
More informationItem Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract
Item Parameter Recovery for the Two-Parameter Testlet Model with Different Estimation Methods Yong Luo National Center for Assessment in Saudi Arabia Abstract The testlet model is a popular statistical
More informationChapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.
Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human
More informationGMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups
GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics
More informationCYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)
DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;
More informationImpact and adjustment of selection bias. in the assessment of measurement equivalence
Impact and adjustment of selection bias in the assessment of measurement equivalence Thomas Klausch, Joop Hox,& Barry Schouten Working Paper, Utrecht, December 2012 Corresponding author: Thomas Klausch,
More informationOrdinal Data Modeling
Valen E. Johnson James H. Albert Ordinal Data Modeling With 73 illustrations I ". Springer Contents Preface v 1 Review of Classical and Bayesian Inference 1 1.1 Learning about a binomial proportion 1 1.1.1
More informationJason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the
Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting
More informationTHE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION
THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION Timothy Olsen HLM II Dr. Gagne ABSTRACT Recent advances
More informationBayesian Tailored Testing and the Influence
Bayesian Tailored Testing and the Influence of Item Bank Characteristics Carl J. Jensema Gallaudet College Owen s (1969) Bayesian tailored testing method is introduced along with a brief review of its
More informationResearch and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida
Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality
More information11/24/2017. Do not imply a cause-and-effect relationship
Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection
More informationInternational Journal of Education and Research Vol. 5 No. 5 May 2017
International Journal of Education and Research Vol. 5 No. 5 May 2017 EFFECT OF SAMPLE SIZE, ABILITY DISTRIBUTION AND TEST LENGTH ON DETECTION OF DIFFERENTIAL ITEM FUNCTIONING USING MANTEL-HAENSZEL STATISTIC
More informationAdaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida
Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models
More informationDifferential Item Functioning Amplification and Cancellation in a Reading Test
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to
More informationA Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.
Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1
More informationOn indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state
On indirect measurement of health based on survey data Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state A scaling model: P(Y 1,..,Y k ;α, ) α = item difficulties
More informationEmpowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison
Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological
More informationStructural Equation Modeling (SEM)
Structural Equation Modeling (SEM) Today s topics The Big Picture of SEM What to do (and what NOT to do) when SEM breaks for you Single indicator (ASU) models Parceling indicators Using single factor scores
More informationSelection of Linking Items
Selection of Linking Items Subset of items that maximally reflect the scale information function Denote the scale information as Linear programming solver (in R, lp_solve 5.5) min(y) Subject to θ, θs,
More informationEcological Statistics
A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents
More informationAnswers to end of chapter questions
Answers to end of chapter questions Chapter 1 What are the three most important characteristics of QCA as a method of data analysis? QCA is (1) systematic, (2) flexible, and (3) it reduces data. What are
More informationAdvanced Bayesian Models for the Social Sciences
Advanced Bayesian Models for the Social Sciences Jeff Harden Department of Political Science, University of Colorado Boulder jeffrey.harden@colorado.edu Daniel Stegmueller Department of Government, University
More informationDevelopment, Standardization and Application of
American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,
More informationSection 5. Field Test Analyses
Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken
More informationMultidimensional Modeling of Learning Progression-based Vertical Scales 1
Multidimensional Modeling of Learning Progression-based Vertical Scales 1 Nina Deng deng.nina@measuredprogress.org Louis Roussos roussos.louis@measuredprogress.org Lee LaFond leelafond74@gmail.com 1 This
More informationRunning head: ATTRIBUTE CODING FOR RETROFITTING MODELS. Comparison of Attribute Coding Procedures for Retrofitting Cognitive Diagnostic Models
Running head: ATTRIBUTE CODING FOR RETROFITTING MODELS Comparison of Attribute Coding Procedures for Retrofitting Cognitive Diagnostic Models Amy Clark Neal Kingston University of Kansas Corresponding
More informationIDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS
IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements
More informationStatistical Methods and Reasoning for the Clinical Sciences
Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries
More informationItem Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century
International Journal of Scientific Research in Education, SEPTEMBER 2018, Vol. 11(3B), 627-635. Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century
More informationProceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)
EVALUATION OF MATHEMATICS ACHIEVEMENT TEST: A COMPARISON BETWEEN CLASSICAL TEST THEORY (CTT)AND ITEM RESPONSE THEORY (IRT) Eluwa, O. Idowu 1, Akubuike N. Eluwa 2 and Bekom K. Abang 3 1& 3 Dept of Educational
More informationUsing the Rasch Modeling for psychometrics examination of food security and acculturation surveys
Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Jill F. Kilanowski, PhD, APRN,CPNP Associate Professor Alpha Zeta & Mu Chi Acknowledgements Dr. Li Lin,
More informationAnalysis of the Reliability and Validity of an Edgenuity Algebra I Quiz
Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative
More informationAdvanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)
Advanced Bayesian Models for the Social Sciences Instructors: Week 1&2: Skyler J. Cranmer Department of Political Science University of North Carolina, Chapel Hill skyler@unc.edu Week 3&4: Daniel Stegmueller
More informationDifferential Item Functioning from a Compensatory-Noncompensatory Perspective
Differential Item Functioning from a Compensatory-Noncompensatory Perspective Terry Ackerman, Bruce McCollaum, Gilbert Ngerano University of North Carolina at Greensboro Motivation for my Presentation
More informationTECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock
1 TECHNICAL REPORT The Added Value of Multidimensional IRT Models Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock Center for Health Statistics, University of Illinois at Chicago Corresponding
More informationValidating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky
Validating Measures of Self Control via Rasch Measurement Jonathan Hasford Department of Marketing, University of Kentucky Kelly D. Bradley Department of Educational Policy Studies & Evaluation, University
More informationA Brief Introduction to Bayesian Statistics
A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon
More informationThe Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory
The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory Kate DeRoche, M.A. Mental Health Center of Denver Antonio Olmos, Ph.D. Mental Health
More informationAn Introduction to Missing Data in the Context of Differential Item Functioning
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute
More informationBruno D. Zumbo, Ph.D. University of Northern British Columbia
Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.
More informationA Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho
ADAPTIVE TESTLETS 1 Running head: ADAPTIVE TESTLETS A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing Leslie Keng Pearson Tsung-Han Ho The University of Texas at Austin
More information