Combining Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions

Laine P. Bradshaw, James Madison University
Jonathan Templin, University of Georgia

Author Note: Correspondence concerning this article should be addressed to Laine Bradshaw, Department of Graduate Psychology, James Madison University, MSC 6806, 821 S. Main St., Harrisonburg, VA; laineb@uga.edu. This research was funded by National Science Foundation grants DRL ; SES ; and SES .
Abstract

The Scaling Individuals and Classifying Misconceptions (SICM) model is presented as a combination of an item response theory (IRT) model and a diagnostic classification model (DCM). Common modeling and testing procedures use unidimensional IRT to provide an estimate of a student's overall ability. Recent advances in psychometrics have focused on measuring multiple dimensions to provide more detailed feedback for students, teachers, and other stakeholders. DCMs provide multidimensional feedback by using multiple categorical latent variables that represent skills underlying a test that students may or may not have mastered. The SICM model combines an IRT model with a DCM whose categorical latent variables represent misconceptions instead of skills. In addition to the type of information common testing procedures provide about an examinee (an overall continuous ability), the SICM model provides multidimensional, diagnostic feedback in the form of statistical estimates of misconceptions. This additional feedback can be used by stakeholders to tailor instruction to students' needs. Results of a simulation study demonstrate that the SICM MCMC estimation algorithm yields reasonably accurate estimates under large-scale testing conditions. Results of an empirical data analysis highlight the need to address statistical considerations of the model from the onset of the assessment development process.
Running head: Scaling Ability and Diagnosing Misconceptions

Combining Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions

The need for more fine-grained feedback from assessments that can be used to understand students' strengths and weaknesses has been emphasized at all levels of education. Educational policy (No Child Left Behind, 2001), modern curriculum standards (e.g., the Common Core Standards; National Research Council, 2010), and classroom teachers (Huff & Goodman, 2007) have described multidimensional, diagnostic feedback as essential for tailoring instruction to students' specific needs and making educational progress. In spite of this need, most state-level educational tests have been, and continue to be, designed from a unidimensional item response theory (IRT; e.g., Hambleton, Swaminathan, & Rogers, 1991) perspective. This perspective optimizes the statistical estimation of a single continuous variable representing an overall ability and provides a single, composite score to describe a student's performance with respect to an entire academic course. A common solution for providing diagnostic feedback from IRT-designed state-level tests has been to report summed scores on sub-sections of a test. Because they are based on a small number of items, these subscores often lack reliability. Decisions based on unreliable subscores may counterproductively misguide instructional strategies and resources (Wainer, Vevea, Camacho, Reeve, Rosa, Nelson, Swygert, & Thissen, 2001). These subscores are computed from items selected for the test because of their correlation with the other items on the test; as expected, they are highly related to the total score and thus do not provide information distinct from, or beyond, the total score (Haberman, 2005; Harris & Hanson, 1991; Haberman et al., 2009; Sinharay & Haberman, 2007).
The new psychometric model presented in this paper provides a means of delivering reliable multidimensional feedback within the framework of prevailing unidimensional IRT methods. The model capitalizes on advances in multidimensional measurement models, which have recently been at the forefront of psychometric research because they promise detailed feedback for students, teachers, and other stakeholders. Diagnostic classification models (DCMs; e.g., Rupp, Templin, & Henson, 2010) provide one approach to measuring multiple dimensions. DCMs use categorical latent attributes to represent skills or content components underlying a test that students may or may not have mastered. With DCMs, the focus of assessment results shifts to identifying which components each student has mastered. Attributes students have mastered can be viewed as areas in which students do not need further instruction; attributes students have not mastered indicate areas in which instruction or remediation should be focused. Thus, the attribute pattern provides students and teachers feedback on fine-grained components of a content area, which can be used to tailor instruction to students' specific needs. DCMs sacrifice fine-grained measurement to provide multidimensional measurement: instead of finely locating each examinee along a set of continuous traits as a multidimensional IRT model does, DCMs coarsely classify each examinee with respect to each trait (i.e., as a master or non-master of the trait). This trade-off enables DCMs to provide diagnostic, multidimensional feedback with reasonable data demands (i.e., relatively few items and examinees; Bradshaw & Cohen, 2010).
DCMs' low cost in terms of data demands and high benefits in terms of diagnostic information make them attractive models in educational settings where testing time is limited but multidimensional feedback is needed to reflect the multifaceted objectives of educational courses. However, given the
current reliance of testing on measuring an overall ability, DCMs, while efficient for providing detailed feedback, may not fulfill all the needs of policy-driven assessment systems centered on scaling examinee ability. In this paper, we propose a new nominal response psychometric model, the Scaling Individuals and Classifying Misconceptions (SICM) model, that blends the IRT and DCM frameworks. The SICM model alters traditional DCM practice by defining the attributes for a nominal response DCM as misconceptions that students have instead of as abilities (skills) that students have. It alters traditional nominal response IRT (NR IRT; Bock, 1972) practice by having these categorical misconceptions predict the incorrect item responses while a continuous ability predicts the correct response. Coupled together through the SICM model, the IRT and DCM components provide a more thorough description of the traits students possess: the model both scales a composite ability measured by the correctness of responses and identifies distinct errors in understanding manifested through specific incorrect responses. The model therefore serves the dual purposes of scaling examinee ability for comparative and accountability purposes and diagnosing misconceptions for remediation purposes.

The following section reviews the measurement of misconceptions in previous assessment development projects. The next section provides the statistical specification of the SICM model, followed by an illustration of the model through a contrast with two more familiar models. Then, results of a simulation study are provided to establish the efficacy of the model, and a real data analysis illustrates the use of the model in practice.

Measuring Misconceptions

A key feature of the SICM model is that the incorrect alternatives for multiple
choice test items are crafted to reflect common misconceptions students hold or typical errors that students with incomplete understandings systematically make. Previous assessments have been developed in this way, evidencing both that empirical theories about misconceptions in students' understandings exist and that there is a desire to capture them through assessment. Assessments like these have been called distractor-driven assessments (Sadler, 1998). Examples in science assessment include the Force Concept Inventory (FCI; Hestenes, Wells, & Swackhamer, 1992), the Astronomy Concept Inventory (ACI; Sadler, 1998), and the Astronomy and Space Science Concepts Inventory (ASSCI; Sadler, Coyle, Miller, Cook-Smith, Dussault, & Gould, 2010). A similar approach was used in two assessments measuring concepts and misconceptions in statistics: the Statistical Reasoning Assessment (SRA; Garfield, 1998) and the Probability Reasoning Questionnaire (Khazanov, 2009). For each of these assessments, misconceptions were theorized through extensive qualitative research that studied incorrect conceptions via student interviews. Although providing information about misconceptions was a goal of these assessments, the psychometric methods employed focused on measuring a single continuous ability, using either a total score as in classical test theory (CTT; e.g., see Crocker & Algina, 1986) or an ability estimate from IRT (used for the ACI). With these methods, misconceptions could be assessed or diagnosed only by tallying the number of times a student selected an alternative that measures a given misconception or, when IRT was used, by studying the trace lines corresponding to measured misconceptions on an item-by-item and student-by-student basis.
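The tally-based diagnosis described above can be sketched in a few lines. The item key and responses below are hypothetical, as are the misconception labels; the sketch only illustrates the bookkeeping involved.

```python
from collections import Counter

def tally_misconceptions(responses, alt_to_misconception):
    """Count how often a student selected an alternative keyed to each
    misconception. `responses` maps item -> selected alternative;
    `alt_to_misconception` maps (item, alternative) -> misconception label
    for the incorrect alternatives. Both mappings are hypothetical."""
    counts = Counter()
    for item, alt in responses.items():
        label = alt_to_misconception.get((item, alt))
        if label is not None:  # correct answers map to no misconception
            counts[label] += 1
    return counts

# Hypothetical 3-item key: each incorrect alternative reflects one misconception
key = {(1, "B"): "equiprobability", (1, "C"): "representativeness",
       (2, "D"): "equiprobability", (3, "B"): "representativeness"}
tallies = tally_misconceptions({1: "B", 2: "D", 3: "A"}, key)
```

Each tally is bounded by the handful of alternatives keyed to that misconception, which is precisely what makes such counts coarse.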
Tallies of misconceptions suffer from the same issues as subscores: they are coarse and unreliable measures, as a misconception may be measured by very few items. In IRT, post-hoc analysis of trace lines for each student and item is tedious, particularly when the task
falls upon a teacher who may teach a large number of students.

The Scaling Individuals and Classifying Misconceptions Model

The SICM model is a psychometric model that seeks to statistically diagnose students' possession of misconceptions. By using the SICM model instead of classical sum-score approaches, misconceptions can be measured more reliably. The SICM model also allows existing empirical theories to be modeled and evaluated quantitatively, allowing theories to be strengthened (e.g., by verifying that misconceptions exist and describing the structural relationships among them) and assessments to be improved (e.g., alternatives need not contribute equally to the measure of a misconception, and alternatives weakly related to a misconception can be revised). The SICM model acknowledges that there is a larger construct, measured on a continuum, that exists in addition to misconceptions that are either present or absent. Thus, unlike any existing psychometric model, the model uses an examinee's continuous ability, as in IRT, together with the examinee's pattern of categorical misconceptions as latent predictors of the item response.

The model uniquely treats misconceptions as categorical latent variables. Misconceptions, individually denoted by α, are assumed to be dichotomous latent variables: for examinee e, misconception a is either present (α_ea = 1) or absent (α_ea = 0), sometimes referred to as possession or lack of possession of the misconception. Marginally, each misconception has a Bernoulli distribution with probability p_a that an examinee possesses the misconception (i.e., P(α_ea = 1) = p_a). The misconception pattern α_e is a vector of A binary indicators describing the presence or absence of each misconception. As such, α_e has a multivariate Bernoulli distribution (MVB; e.g., Maydeu-Olivares & Joe, 2005) with a mean vector representing the pattern probabilities.
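The multivariate Bernoulli over misconception patterns can be sketched as a categorical distribution over the 2^A patterns. The pattern probabilities below are hypothetical.

```python
import itertools
import random

def enumerate_patterns(A):
    """All 2^A possible misconception patterns for A dichotomous attributes."""
    return list(itertools.product([0, 1], repeat=A))

def marginal_prob(nu, a):
    """Marginal probability p_a that misconception a is present, given the
    full set of pattern probabilities nu (dict: pattern tuple -> nu_c)."""
    return sum(p for pattern, p in nu.items() if pattern[a] == 1)

# Hypothetical pattern probabilities nu_c for A = 2 misconceptions
nu = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.2}
patterns = enumerate_patterns(2)
# One draw of alpha_e from the multivariate Bernoulli, viewed as a
# categorical distribution over the 2^A patterns
alpha_e = random.choices(patterns, weights=[nu[p] for p in patterns])[0]
```

The marginals p_a fall out of the pattern probabilities by summation, which is why the pattern-level parameterization is the more general one.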
Functional Form of the SICM Model

To measure both θ and α, the SICM model is defined for nominal response (multiple-choice) data. Often in IRT, responses are dichotomized into two categories, correct or incorrect, collapsing all of the incorrect alternatives into one category and failing to preserve the uniqueness of each incorrect alternative. Such a dichotomization can be viewed as an incomplete theory of modeling the item response (Thissen & Steinberg, 1984). If characteristics of the incorrect alternatives explain variations in the item response, then those characteristics can be modeled in the item response function (van der Linden & Hambleton, 1997). Modeling responses to the alternatives directly also provides a means of evaluating item alternatives in the test-development process.

Given a set of response categories or possible alternatives for an item, the SICM model uses a nominal response mixture item response model that defines the probability of observing examinee e's nominal response pattern x_e to I items as

P(X_e = x_e) = Σ_c ν_c ∫_θ f(θ) Π_i Π_j π_ijc(θ)^[x_ei = j] dθ.   (1)

The terms ν_c and f(θ) are the structural components of the model, describing the distributions of and relationships among the latent variables, with α and θ held independent. The term ν_c is the proportion of examinees in latent class c. Each latent class represents a unique misconception (attribute) pattern, so given A misconceptions there are 2^A unique patterns. The term ν_c is parameterized as a function of the individual misconceptions by a log-linear model (e.g., Henson & Templin, 2005; Rupp, Templin, & Henson, 2010). The term f(θ) is the density function of ability, with θ ~ N(0, 1) for identifiability. The parameter π_ijc(θ) denotes the conditional probability that examinee e's response to item i will be the selection of alternative j from the set of alternatives for item i (i.e.,
, given examinee e's attribute pattern α_e and continuous ability θ_e. The brackets [·] are Iverson brackets: [x_ei = j] = 1 if x_ei = j, and 0 otherwise. The term π_ijc(θ) is the measurement component of the SICM model in that it quantifies how the latent variables (misconceptions and ability) are related to the observed item responses.

For the SICM model, ability is measured by the correct alternative on each item, and the misconceptions are measured by the incorrect alternatives. Not every incorrect alternative measures each misconception, so an indicator variable specifies when a misconception is measured by an item alternative. Mimicking DCM practices, these specifications are set a priori and are described in an item-by-alternative-by-misconception Q-matrix (e.g., Tatsuoka, 1990). The entries in the Q-matrix are indicators denoted by q_ija, where q_ija = 1 if alternative j of item i measures misconception a, and q_ija = 0 otherwise.

The SICM model parameterizes π_ijc(θ) in Equation (1) with a multicategory logistic regression model (e.g., Agresti, 2002) that models the J_i − 1 non-redundant logits, with the correct alternative as the baseline category, as

log( π_ij(θ_e, α_e) / π_ij*(θ_e, α_e) ) = z_ij(θ_e, α_e) − z_ij*(θ_e, α_e)   (2)

for every j ≠ j*, where

z_ij(θ_e, α_e) = λ_ij,0 + λ_ij,θ θ_e + λ_ij' h(α_e, q_ij)   (3)

and

z_ij*(θ_e, α_e) = λ_ij*,0 + λ_ij*,θ θ_e + λ_ij*' h(α_e, q_ij*).   (4)

The correct alternative is specified as the baseline category, denoted j*, to simplify Equation (2). In Equation (3), λ_ij,θ equals zero for every incorrect alternative j ≠ j* because the incorrect alternatives do not measure θ. In Equation (4), λ_ij*' h(α_e, q_ij*) always equals zero because the correct
alternative does not measure any misconceptions. Therefore, the log-odds of selecting an incorrect alternative over the correct alternative in the SICM model can be equivalently formulated as

log( π_ij(θ_e, α_e) / π_ij*(θ_e, α_e) ) = λ_ij,0 − λ_ij*,θ θ_e + λ_ij' h(α_e, q_ij)   (5)

for every j ≠ j*. The conditional probability that alternative j will be selected is

π_ij(θ_e, α_e) = exp(λ_ij,0 − λ_ij*,θ θ_e + λ_ij' h(α_e, q_ij)) / (1 + Σ_{m ≠ j*} exp(λ_im,0 − λ_ij*,θ θ_e + λ_im' h(α_e, q_im))).   (6)

The intercept λ_ij,0 is the logit of selecting incorrect alternative j over the correct alternative for an examinee with an ability of zero who possesses none of the misconceptions measured by alternative j. The more difficult the alternative is, the larger the intercept will be. The term λ_ij*,θ is the loading for ability and is the discrimination parameter for ability, as in IRT. Using notation consistent with the log-linear cognitive diagnosis model (LCDM; Henson, Templin, & Willse, 2009) and the nominal response DCM (NR DCM; Templin & Bradshaw, under review), the term λ_ij' h(α_e, q_ij) is a linear combination of main and interaction effects. The vector q_ij denotes the Q-matrix entries for alternative j of item i, and α_e is the misconception pattern for examinee e. The term h(α_e, q_ij) is a column vector of indicators whose elements equal one if and only if (a) the alternative measures the misconception or set of misconceptions corresponding to the parameter and (b) the examinee possesses the misconception or set of misconceptions corresponding to the parameter. Specifically, λ_ij' h(α_e, q_ij) equals

λ_ij' h(α_e, q_ij) = Σ_a λ_ij,1,(a) α_ea q_ija + Σ_a Σ_{b>a} λ_ij,2,(a,b) α_ea α_eb q_ija q_ijb + …   (7)
where λ_ij,1,(a) is the main effect for misconception a for alternative j of item i; λ_ij,2,(a,b) is the interaction effect between attributes a and b for alternative j of item i (present if the alternative measures two or more misconceptions); and the ellipsis denotes the third- through higher-order interactions for alternatives that measure more than two misconceptions, up to the A-way interaction effect among all attributes. Main effects and interactions are discrimination parameters with respect to misconception patterns. To identify the model, as is usual for a baseline-category logit model, an arbitrary category is treated as the baseline category, and all parameters for the baseline category are set equal to zero. Additionally, the main effect parameters are constrained to ensure monotonicity for attributes and for ability, meaning that (a) possessing a misconception never decreases the probability of selecting an alternative measuring that misconception, and (b) an increase in ability never decreases the probability of answering the item correctly.

Lower Asymptote for the SICM Model

The specification of the SICM model in Equation (2) does not provide a lower asymptote for the probability of a correct response to account for guessing on a multiple-choice test. An alternative formulation of the SICM model was developed to provide this lower asymptote without adding an additional parameter to the model. The new formulation is

log( π_ij(θ_e, α_e) / π_ij*(θ_e, α_e) ) = λ_ij,0 − exp(λ_ij*,θ θ_e) + λ_ij' h(α_e, q_ij).   (8)

The difference between Equations (5) and (8) is that the ability portion of the model is now exponentiated. The intercept is now interpreted as the logit that an examinee with an extremely low ability who possesses no misconceptions will choose alternative j. Holding other parameters constant, as ability decreases, the value of exp(λ_ij*,θ θ_e) decreases, and the logit of selecting the
correct answer decreases, satisfying the monotonicity assumption of the model. As ability approaches negative infinity, exp(λ_ij*,θ θ_e) approaches 0, meaning the logit for each incorrect alternative approaches λ_ij,0 + λ_ij' h(α_e, q_ij), yielding a lower asymptote for the probability of selecting the correct response of

P(X_ei = j* | θ_e → −∞, α_e) = 1 / (1 + Σ_{j ≠ j*} exp(λ_ij,0 + λ_ij' h(α_e, q_ij))).   (9)

This correction results in a more realistic model of the item response without the increased estimation difficulty that additional item parameters bring, as is commonly encountered when using the 3PL IRT model.

The SICM Model Illustrated as a Combination of an IRT Model and a DCM

The SICM model posits that there is a continuous trait measured by an assessment that largely explains the covariance among selections of the correct alternatives for a set of items. It additionally assumes that there exists a set of categorical misconceptions, each of which a student does or does not possess, that systematically accounts for the variation in selections among the incorrect alternatives. To further illustrate the differences among the NR IRT model, the NR DCM, and the SICM model, consider the example item in Figure 1. From an NR IRT perspective, the level of a person's overall math ability explains the variation in the item responses and is the only latent variable measured by the alternatives. Using the SICM model in Equation (2), if θ is measured by every alternative and q_ij is fixed to be a 1 × A vector of zeros for every alternative (i.e., no misconceptions are measured), then the item response function is Bock's (1972) NR IRT model. From an NR DCM perspective, two categorical abilities are needed to answer this item correctly: the ability to find the area of a rectangle (Attribute 1; α_1) and the ability to make
conversions among units within a metric system (Attribute 2; α_2). Consider Alternative B, which measures only Attribute 1. An examinee who selects Alternative B incorrectly converts 3 feet to 1/4 inches (does not possess Attribute 2) but demonstrates the ability to find the area of a rectangle by multiplying the given dimensions (possesses Attribute 1). Thus, a response of B indicates the absence of Attribute 2 yet the presence of Attribute 1. When sample sizes are large, the NR DCM has been found to capitalize on information in the incorrect alternatives, as demonstrated by greater classification accuracy compared to the LCDM for dichotomous responses (Templin & Bradshaw, under review). Using the SICM model in Equation (2), if θ is measured by no alternative, then the item response function is the NR DCM, where α_e is defined as a pattern of attributes or skills instead of a pattern of misconceptions as in the SICM model.

For the example item to be modeled from the SICM perspective, the attributes are defined as misconceptions or errors: Attribute 1 (α_1) is redefined as the inability to find the area of a rectangle and Attribute 2 (α_2) as the inability to make conversions among units within a metric system (see footnote 1). An examinee is expected to answer the item correctly (to select Alternative A) if he or she possesses neither of these attributes and has a modest level of overall ability. Figure 2 provides hypothetical item response probabilities for the NR IRT model, the NR DCM, and the SICM model to compare the type of information each model provides. The Q-matrix entries for each model are given in the legend of the first graph corresponding to that model. For the NR IRT model, in the top left graph, the item response probability is solely a function of ability (θ).
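The SICM response function for an item like this can be evaluated numerically. In the sketch below, all λ values are hypothetical, and the keying (B measures one misconception, C another, D both) merely mirrors the structure of the example.

```python
import math

def sicm_item_probs(theta, alpha, intercepts, kernels, lam_theta):
    """Category probabilities for one SICM item with correct alternative 'A'
    as the baseline (Equation 2 structure). intercepts[j] is the intercept for
    incorrect alternative j; kernels[j](alpha) returns that alternative's
    misconception kernel. All parameter values used here are hypothetical."""
    z = {j: intercepts[j] - lam_theta * theta + kernels[j](alpha)
         for j in intercepts}
    denom = 1.0 + sum(math.exp(v) for v in z.values())
    probs = {j: math.exp(v) / denom for j, v in z.items()}
    probs["A"] = 1.0 / denom  # baseline category: linear predictor 0
    return probs

# Hypothetical item: B keyed to misconception 2, C to misconception 1,
# D to both (with a two-way interaction term)
intercepts = {"B": -1.0, "C": -1.0, "D": -2.0}
kernels = {"B": lambda a: 2.0 * a[1],
           "C": lambda a: 2.0 * a[0],
           "D": lambda a: 1.5 * a[0] + 1.5 * a[1] + 1.0 * a[0] * a[1]}

probs = sicm_item_probs(0.0, (0, 1), intercepts, kernels, 1.0)
```

Because the ability term cancels when comparing two incorrect alternatives, which incorrect alternative is most likely for a given misconception pattern does not change with θ.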
Footnote 1: We acknowledge that math educators likely would not consider the lack of a skill to be a misconception. This example is an oversimplification meant to convey the statistical properties of the model with an example accessible to a wide range of researchers. Reviewing the misconception literature is beyond the scope of this article, but quality research can be found on the nature of misconceptions (e.g., Smith, diSessa, & Roschelle, 1993) and on documented misconceptions in specific fields (e.g., in math and science, Confrey, 1990).

The NR DCM, shown in the next four graphs, provides the item response
probability by the examinee's class, which is defined by the attribute pattern (α) the examinee has. When an examinee's attribute pattern corresponds to the attribute pattern measured by an alternative, the examinee is most likely to select that alternative. The SICM model, shown in the last four graphs, provides the item response probability not only as a function of ability, as the NR IRT model does, but also as a function of the examinee's class (the misconception pattern the examinee has), as the NR DCM does. For the SICM model, each class has a different set of trace lines. For the NR IRT model, all trace lines intersect, meaning that at different ability levels, different incorrect alternatives are more likely to be selected. In the SICM model, unlike the NR IRT model, the trace lines for the incorrect alternatives each have an upper asymptote, are monotonically decreasing, and thus never intersect. The ordering of the probabilities of selecting the incorrect alternatives depends on the misconceptions and is invariant with respect to ability; put differently, the order varies across classes but is invariant within a class. For example, as seen in Figure 2, an examinee with misconception pattern [01] who misses the item is most likely to select Alternative B regardless of ability level. Similarly, if examinees with misconception patterns [10] and [11] miss the item, they are most likely to select Alternatives C and D, respectively.

Estimation of the SICM Model

The SICM model was estimated using a Markov chain Monte Carlo (MCMC) estimation algorithm that uses Metropolis-Hastings sampling and was written in Fortran. Appendix A contains the specific steps of the algorithm. Using the specification of the model in Equation (5), the SICM model is estimable with Mplus Version 6.1 (Muthén & Muthén).
However, Mplus cannot estimate the SICM model with the exponentiated ability term in Equation (8), so writing a unique estimation algorithm in Fortran was necessary.
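The Metropolis-Hastings updates at the core of such an algorithm can be sketched generically. This is not the authors' Fortran implementation: the target below is a toy standard normal standing in for one parameter's conditional posterior, and the real sampler cycles updates like this over all item, structural, and examinee parameters.

```python
import math
import random

random.seed(1)  # reproducible toy run

def mh_step(current, log_post, proposal_sd=0.5):
    """One Metropolis-Hastings update with a symmetric normal proposal:
    accept the proposal with probability min(1, posterior ratio)."""
    proposal = random.gauss(current, proposal_sd)
    if math.log(random.random()) < log_post(proposal) - log_post(current):
        return proposal
    return current

# Toy target: standard normal log-density (a stand-in conditional posterior)
log_post = lambda x: -0.5 * x * x

draw, kept = 5.0, []
for t in range(20000):
    draw = mh_step(draw, log_post)
    if t >= 5000:                # discard burn-in stages
        kept.append(draw)
eap = sum(kept) / len(kept)      # EAP estimate: mean of the retained draws
```

The final line mirrors how EAP estimates are formed in the simulation study: averaging the post-burn-in draws of each parameter.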
To evaluate the performance of the SICM model and algorithm, a simulation study was conducted and is discussed next. An empirical data analysis follows.

Simulation Study

The SICM model is complex due to the large number of parameters to be estimated and the different types (i.e., continuous and categorical) of parameters estimated within the model. The simulation study provides information about (a) the performance of the model under realistic testing conditions and (b) the interplay of a continuous ability and a set of categorical misconceptions within a single model. Specifying continuous and categorical latent variables in the measurement model of a psychometric model at the item-alternative level has not been tried before, so it is of interest how one type of variable affects the other (e.g., whether the effect of one type will dominate or mask the other's effect).

Simulation Study Design

The study had four fully crossed manipulated factors: sample size (3,000 and 10,000), test length (30 and 60 items), number of misconceptions (3 and 6), and the size of the main effects for ability and attributes. Average low main effects were .4 for ability and 1 for misconceptions; average high main effects were .6 for ability and 2 for misconceptions. To investigate how estimation was affected when either the continuous or the categorical variables were more dominant, the relative and absolute magnitudes of the main effects for these latent variables were manipulated by crossing the high and low effect conditions. The tetrachoric correlation between attributes was set to .50. Fifty replications were estimated for each of the 32 conditions. The MCMC estimation algorithm was run for 10,000 stages following a burn-in period of 50,000 stages, and expected a posteriori (EAP) estimates computed from the post-burn-in stages were used as estimates of model parameters.
Each simulated item had four alternatives, and each alternative was specified to measure one or two attributes. A balanced Q-matrix was used, with 2.1 misconceptions measured per item and 1.13 misconceptions measured per alternative on average. As a baseline for comparison, in the 3-misconception/30-item conditions, each misconception was measured by 34 alternatives across 21 items.

Simulation Results and Conclusions

Results are provided in tables in which values were (a) averaged across the magnitude-of-main-effects factor and/or (b) averaged across all other factors and given by the magnitude-of-main-effects factor. Results indicate that the item, structural, and examinee parameters were accurately estimated with the MCMC algorithm. Generally, item and examinee parameter estimates were most accurate in conditions with more examinees, more items, and fewer misconceptions. These trends are consistent with the psychometric literature at large: estimation improves when there are fewer parameters to estimate and when the model has more information with which to determine the parameters. More specifically, the results reported in this section are compatible with other simulation studies in the DCM literature (e.g., Choi, 2010; Henson, Templin, & Willse, 2009). Varying the magnitude of the main effects uncovered no barriers to estimating both the categorical and continuous latent predictors and shed some light on which conditions yielded more accurately estimated parameters. Results for item, structural, and examinee parameters are discussed in turn.

Accuracy of Model Parameter Estimates

Table 1 gives the average bias, root mean squared error (RMSE), and Pearson correlations between true and estimated parameters. The RMSEs for item parameters were less than .05 for all conditions, so additional improvement as the test length increased,
misconceptions decreased, and the sample size increased was negligible. The estimation of the structural parameters was most affected by the number of misconceptions, which is to be expected because the complexity of the structural model grows quickly as the dimensionality of the assessment increases. RMSEs for structural parameters were less than .10 when 3 misconceptions were measured. For the 6-misconception conditions, structural parameter estimation improved as the number of items or examinees increased.

Accuracy of Examinee Parameter Estimates

Consistent with psychometric modeling research, the accuracy of the examinee estimates (Table 1) and classifications (Table 2) was less affected by the number of examinees responding to the assessment and more affected by the length of the test and the number of misconceptions. Accuracy of classification is measured by the correct classification rate (CCR). The greatest improvement in estimation came from increasing the length of the test. For the 60-item conditions, the RMSE for ability estimates ranged from .588 to .599 and the CCR for individual attributes ranged from .922 to .958. In comparison, for the 30-item conditions, the RMSE ranged from .708 to .725 and the CCR ranged from .863 to .918.

Reliability of Examinee Estimates

The reliability of examinee ability estimates and classifications was evaluated by the comparable reliability measure developed by Templin and Bradshaw (in press). For the SICM model, reliabilities for classifications were uniformly greater than reliabilities for abilities, regardless of the conditions under which the estimates were obtained. The top portion of Table 3 gives results averaged across the magnitude-of-main-effects conditions, where reliability ranged from .541 to .675 for ability and, on average, from .908 to .988 for misconceptions.
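The accuracy and classification indices reported in these subsections can be computed as follows; this is a generic sketch of the standard definitions, not the authors' evaluation code.

```python
import math

def bias(true, est):
    """Average signed error of a set of estimates."""
    return sum(e - t for t, e in zip(true, est)) / len(true)

def rmse(true, est):
    """Root mean squared error of a set of estimates."""
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true, est)) / len(true))

def ccr(true_alpha, est_alpha, a):
    """Correct classification rate for misconception a: the proportion of
    examinees whose estimated status matches their true status."""
    hits = sum(1 for t, e in zip(true_alpha, est_alpha) if t[a] == e[a])
    return hits / len(true_alpha)
```

In a simulation these are computed per condition and then averaged over replications, which is how the table values described above are formed.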
This finding echoes the results in Templin and Bradshaw (in press), which found
across a set of models, DCM classifications (with 2, 3, 4, and 5 categories) were consistently more reliable than IRT ability estimates. The reliability of the misconceptions is very high, but the reliability of the ability estimates falls short of the .70 or above one might strive for in achievement testing.

Interplay of Continuous and Categorical Latent Predictors

The examinee estimates were also affected by the magnitude-of-main-effects factor, which offered some insight into the interplay of continuous and categorical variables estimated within the same model. The accuracy and reliability of the estimated abilities were greatest when ability had a high main effect in an absolute sense; estimation improved only slightly when ability also had a high main effect in a relative sense (i.e., when misconceptions had a low main effect). Similarly, the accuracy and reliability of the classifications were greatest when misconceptions had a high main effect in an absolute sense, and estimation improved only slightly when that main effect was higher than the main effect for ability in a relative sense. These results indicate that strong main effects for ability improve estimation of ability without significantly hurting estimation of the misconceptions, and strong main effects for misconceptions improve estimation of misconceptions without significantly hurting estimation of ability. Thus, when estimating the SICM model in practice, the larger concern regarding main effects is their strength in an absolute sense. Given strong main effects for each type of variable, the different types of variables can coexist within the same model without one dominating the other.

Limitations of the Simulation Study

Although the results of the simulation study provide some insights for using the SICM model, they were obtained under conditions where the estimation model was correct.
In practice, a host of factors may impact the accuracy of an analysis with the SICM model. For example, Q-matrices may differ in their complexity and in their accuracy. Fairly complex Q-matrices were used for this simulation study, but perfect accuracy was assumed, so model misspecification was not examined. Model misspecification is an important topic in psychometrics because a misspecified model can be expected to degrade parameter estimates and classifications. Other situations in practice may involve a different number of alternatives or items, and main effects for misconceptions and ability may be mixed within a test instead of having designated absolute and relative magnitudes across the test.

The SICM Model Illustrated with an Empirical Data Analysis

To demonstrate the SICM model's use in a practical setting, data from a reading comprehension assessment constructed and administered by a large-scale testing company were analyzed. Currently modeled with total scores for ability and subscores for misconceptions, the reading comprehension assessment aims to measure an overall literacy level to determine whether an examinee would benefit from additional instruction via instructional modules, and to determine which weaknesses should be targeted within those modules. Thus, the SICM model was well aligned with the purpose of this assessment. For this 28-item multiple-choice assessment, each incorrect alternative corresponded to one of three types of errors that students make when responding to reading comprehension items, as predetermined and specified by content experts and item writers. The three types of errors modeled as categorical attributes, or misconceptions, were: a non-text-based response, a text-based misinterpretation of the passage, and a text-based misinterpretation of the question.
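A distractor-level Q-matrix of this kind can be represented as a mapping from each incorrect alternative to the single error it measures. The sketch below uses three hypothetical items (illustrative entries, not the actual assessment's Q-matrix) to show how summaries such as errors per item and alternatives per error would be computed:

```python
from collections import Counter

# Toy distractor-level Q-matrix (hypothetical items): each incorrect
# alternative maps to exactly one of the three error types.
q_matrix = {
    1: {"B": "non_text", "C": "misread_passage", "D": "misread_question"},
    2: {"B": "non_text", "C": "non_text", "D": "misread_passage"},
    3: {"B": "misread_question", "C": "misread_question", "D": "misread_question"},
}

# Distinct errors measured by each item, and the per-item average
errors_per_item = {item: len(set(alts.values())) for item, alts in q_matrix.items()}
avg_errors = sum(errors_per_item.values()) / len(errors_per_item)

# How many alternatives, and how many items, measure each error
alt_counts = Counter(err for alts in q_matrix.values() for err in alts.values())
item_counts = Counter(err for alts in q_matrix.values() for err in set(alts.values()))

print(errors_per_item)       # {1: 3, 2: 2, 3: 1}
print(round(avg_errors, 2))  # 2.0
```

Applied to the real assessment, the same summaries would yield the figures reported below (1.93 errors per item; 29, 32, and 22 alternatives per error).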
The first error reflects that the passage was not read (perhaps from lack of effort or time); the second reflects that the passage was read but misinterpreted (a comprehension error); and the third reflects that the question was misinterpreted (a different type of comprehension error). On average, each item measured 1.93 errors. Six items measured all three types of errors, and every incorrect alternative measured exactly one error. Respectively, the three errors were measured by 29, 32, and 22 alternatives across 21, 19, and 14 items.

To estimate the SICM model, the MCMC algorithm was run for 100,000 steps with a 50,000-step burn-in. Convergence was assessed using a variation of Gelman and Rubin's (1992) statistic. Convergence was reached for fewer than 50% of the structural parameters (very poor convergence), but 95% of all other parameters converged (acceptable convergence). Unlike in the simulated data analyses, a more informative prior distribution (lognormal(0, 0.5)) was used to estimate the main effect for ability, owing to estimation difficulty likely caused by a sample size (1,097 students) too small to estimate these parameters well.

To provide a thorough evaluation of the SICM model relative to other potential psychometric models, two other models were also used to analyze the assessment. We first present the results of the model comparison and then describe the SICM model estimates. Although the results paint a picture of an assessment with limited dimensionality, we use the comparison of models to help depict how estimates from the SICM model are differentiated from those of other psychometric models.

Comparison of Three Psychometric Models

The SICM model with a lower asymptote, the NR IRT model with a lower asymptote (formed by exponentiating the ability portion of the model, as in the SICM model), and the NR DCM were used to analyze the reading comprehension data. The SICM model scaled examinees according to their ability and classified examinees according to their errors. The NR IRT model provided only an estimate of ability.
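The convergence monitoring used in these analyses is a variant of Gelman and Rubin's (1992) statistic, which compares between-chain and within-chain variance. The sketch below computes the basic potential scale reduction factor on simulated chains rather than actual SICM draws (the chain values are illustrative only):

```python
import random
import statistics

def gelman_rubin(chains):
    """Basic potential scale reduction factor (R-hat) for m chains of length n."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    W = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain variance
    B = n * statistics.variance(means)                            # between-chain variance
    var_hat = (n - 1) / n * W + B / n                             # pooled variance estimate
    return (var_hat / W) ** 0.5

random.seed(1)
# Two chains sampling the same distribution: R-hat should be near 1
mixed = [[random.gauss(0, 1) for _ in range(5000)] for _ in range(2)]
# Two chains stuck at different locations: R-hat should be well above 1
stuck = [[random.gauss(0, 1) for _ in range(5000)],
         [random.gauss(3, 1) for _ in range(5000)]]
print(round(gelman_rubin(mixed), 2))  # close to 1.0
print(gelman_rubin(stuck) > 1.5)      # True
```

In practice this diagnostic is computed for every model parameter, and values near 1 are taken as evidence of convergence.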
The NR DCM provided only classifications of examinees according to the three types of errors on the assessment. To distinguish the models with a lower asymptote from those without one, models with the lower asymptote are marked with an asterisk (e.g., SICM*). Comparisons of examinee estimates shed light on the SICM* model results and are presented first.

Comparison of Examinee Estimates

Ability estimates from the SICM* model were strongly correlated with the NR IRT* model estimates (.731), and classifications of examinees were similar for the SICM* model and the NR DCM. The SICM* model classified examinees into only two of the eight possible error patterns: examinees either possessed all errors (Pattern 8, [111]) or no errors (Pattern 1, [000]), meaning the tetrachoric correlations among the three misconceptions were one. The SICM* and NR DCM models agreed with respect to both individual misconception and whole-pattern classification for approximately 84% of the examinees, and the NR DCM classified all but eight examinees into Pattern 1 or 8. These findings suggest that the assessment is not highly multidimensional with respect to the errors. This may be due to a theoretical issue (the errors may not be stable traits that produce systematic responses to items), a test-development issue (the items and alternatives may simply not elicit the errors), or an estimation issue (the sample may be too small to contain a substantial set of examinees with each pattern). The effect of this result permeates the remaining analyses.

The classification of examinees into categories with all or no errors empirically suggests that the structural model of the SICM* model and the NR DCM is incorrect. The models are over-parameterized; many estimated parameters would have a value of 0. When the errors are this highly correlated, they are no longer practically distinct and cannot be treated as separate categorical variables.
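The all-or-none collapse and the whole-pattern agreement rate just described can be checked directly from the estimated attribute profiles. A sketch with hypothetical classifications for six examinees (tuples of three 0/1 misconception indicators; the values are illustrative, not the actual data):

```python
from collections import Counter

# Hypothetical classifications from two models for six examinees:
# each profile is (error1, error2, error3).
sicm = [(0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1)]
nr_dcm = [(0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1)]

# Whole-pattern agreement rate between the two models
agreement = sum(a == b for a, b in zip(sicm, nr_dcm)) / len(sicm)

# How many of the 2^3 = 8 possible patterns are actually occupied
occupied = Counter(sicm)
print(agreement)      # 0.8333...
print(len(occupied))  # 2 -> all-or-none collapse
```

When only two of the eight patterns are occupied, the structural parameters governing the empty classes have no data to inform them, which is the convergence problem discussed next.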
As a result, the structural parameters cannot converge because there is no information about examinees in the other six classes posited to exist. For the SICM*, only 42.9% of the structural parameters converged, with similarly poor convergence for the NR DCM. The goal of these models was to model the variation of item responses across predetermined attribute patterns, which was not feasible because there was no observed variation across those patterns.

Relative Model Fit

Akaike's information criterion (AIC; Akaike, 1974) and Schwarz's Bayesian information criterion (BIC; Schwarz, 1978) were used to make a relative comparison of model-data fit. Results given in Table 5 show that both indices preferred the fit of the SICM* model over the NR IRT* model, and the fit of the NR IRT* model over the NR DCM. The better fit of the two models that estimated a continuous trait was not surprising because the examinees' all-or-none error patterns indicated a lack of dimensionality. The reason the SICM* model was preferred over the NR IRT* model is more subtle. The errors did demonstrate some dimensionality: examinees fell into one of two classes, not just one. Had the errors shown no dimensionality at all, the NR IRT* model should have been preferred to the SICM* model because it estimates far fewer parameters. These results suggest the test may be measuring something more than a single continuous trait, but it is not measuring all three distinct errors in the Q-matrix. Perhaps a single error that placed examinees into two classes would be preferable to three errors that failed to place examinees into eight classes.

SICM* Model Results for an Example Item

Figure 3 shows the SICM* model's estimated nominal response probabilities for an example item on the reading comprehension assessment as a function of ability and misconception pattern. Q-matrix entries are given in the legend of the figure, showing that for this item each incorrect alternative measured one of the three errors. The order of the intercepts for incorrect alternatives B, C, and D can be deduced from the response probabilities in the first
graph, which corresponds to Pattern [000] (i.e., examinees with no errors). The most likely incorrect alternative is D, with the largest intercept (.666), and the least likely is B, with the smallest intercept. Interpretations of the trace lines in the next six graphs are difficult because no examinees were actually classified into these patterns. In the last graph, the trace lines of the incorrect alternatives suggest that examinees who possess all of the misconceptions are about equally likely to make any one of these errors. These graphs illustrate how the SICM model can provide each NR DCM-like class with a unique set of NR IRT-like response curves, which, given better model-data fit, would have more meaningful interpretations.

SICM* Model Example Examinee Results

In this section, we present results for two examinees with similar response patterns. Examinees 199 and 403 each answered the first 22 items correctly and two of the last six items correctly, giving each a total correct score of 24. However, the two final items they answered correctly were different, so their ability estimates differed slightly. Additionally, for the items they answered incorrectly, they selected different incorrect answers. Their different incorrect answers on the last six items led to the two students being classified as having drastically different error patterns. From the SICM model estimates, we can conclude that both students have above-average ability, yet Examinee 403 (with the slightly higher ability estimate) needs instruction relevant to all three errors, while Examinee 199 does not. These results reflect the potential utility that SICM* model estimates add beyond IRT model estimates. For an IRT model with an estimated discrimination parameter, it matters which items an examinee answers correctly.
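This dependence on which items are answered correctly can be sketched with a toy 2PL example (all item parameters hypothetical): two examinees with the same total score receive different maximum-likelihood ability estimates because they miss items with different discriminations.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_theta(responses, items, grid=None):
    """Maximum-likelihood ability estimate by grid search over theta."""
    grid = grid or [g / 100 for g in range(-400, 401)]
    def loglik(theta):
        return sum(math.log(p_correct(theta, a, b)) if x == 1
                   else math.log(1 - p_correct(theta, a, b))
                   for x, (a, b) in zip(responses, items))
    return max(grid, key=loglik)

# Four hypothetical items: (discrimination, difficulty); item 4 is the
# most discriminating, so missing it costs more than missing item 1.
items = [(0.5, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.5)]
examinee_1 = [0, 1, 1, 1]  # missed the weakly discriminating item
examinee_2 = [1, 1, 1, 0]  # missed the highly discriminating item

# Same total score (3), different ML ability estimates
print(ml_theta(examinee_1, items) > ml_theta(examinee_2, items))  # True
```

The SICM model layers the misconception classification on top of this: the particular distractors chosen carry additional information beyond the right/wrong scoring.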
The same total score can yield different ability estimates because items are differentially related to the target ability being measured and thus count differentially toward the estimated ability. The SICM* model goes a step further and uses information not only about which items an examinee answers incorrectly, but also about why the examinee answered them incorrectly. As a result, two examinees can have the exact same scored response pattern yet be classified as possessing very different sets of misconceptions.

Data Analysis Discussion

Although our initial thought upon analyzing these assessment data was to turn to another data set that might bear better results with the SICM model, we ultimately found value in sharing the story this analysis tells. As noted in Templin and Henson (2006), observing a large number of examinees in patterns for which all or none of the attributes are possessed may indicate that the construct being measured is truly unidimensional. Several reasons may explain this finding. The lack of multidimensionality may follow from cognitive theory: these errors may not exist as stable latent traits of the examinee, but rather may be errors that students make inconsistently, and thus unpredictably. Alternatively, any multidimensionality that does exist may not be captured by the assessment because of a lack of validity, or not enough information may be available to estimate the model because the sample of examinees was smaller than in the simulation study. This analysis shows that even in a scenario like this one, where the purpose of the assessment was aligned with the purpose of the model, limitations exist and issues arise when retrofitting an assessment to a model. Model-data fit is expected to improve in a test-construction scenario where a test is developed from the onset to be estimated with the SICM model, and such a scenario is thus recommended. Developing the assessment from the SICM framework would help identify sources of misfit.
Validity studies can verify whether alternatives on an assessment elicit the misconceptions they purport to measure, and pilot studies can statistically flag items that exhibit model-data misfit and need to be revised or culled. The test development process can also attend to other statistical considerations, including (a) whether each misconception is measured enough times (in enough alternatives and items) to yield a reliable classification, (b) whether enough examinees select each alternative to yield accurate item parameter estimates for that alternative, and (c) whether enough examinees respond to each item to yield accurate model parameters.

Concluding Remarks

The SICM model is presented as a psychometric solution to a realistic need in educational assessment: gaining more feedback from assessments about what students do not understand. The efficacy of the SICM model under various testing conditions was demonstrated through a simulation study, suggesting that, when coupled with careful test design, the SICM model can support diagnostic score reports that provide statistical estimates of student misconceptions in addition to the type of information about student ability that current modeling and testing procedures typically provide to stakeholders. These simulation results provide guidelines for test and sampling conditions, but not for creating the test itself. As seen in the empirical data analysis, developing the assessment from the SICM framework a priori is very important. Although some general test-development considerations can be applied in developing an assessment for the SICM model, open questions remain as to how to create an assessment that can utilize the statistical features of the SICM model. The previously mentioned assessments can provide insights for writing items that measure misconceptions with incorrect answers. However, when the SICM model is used to model these types of items, unique statistical considerations arise.
For example, a continuous ability is estimated in the SICM model, as in a unidimensional IRT model. For a unidimensional IRT model, items that exhibit multidimensionality are often screened and then revised or deleted from the assessment; for the SICM model, items that measure a single continuous trait together with a set of multidimensional categorical traits are desired, so items are expected to show multidimensionality and must therefore be screened differently. We have provided information explaining how the SICM model can be estimated and applied. We hope future assessment development projects can build upon this information to leverage the model in practical settings and provide actionable information about where students' misunderstandings lie.
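As a closing illustration, the class-specific trace lines discussed for the example item can be mimicked with a simplified nominal-response kernel in which a distractor's log-odds receive a boost when the examinee's class possesses the error that distractor measures. This is a sketch, not the full SICM parameterization (which also includes a lower asymptote and ability effects on distractors), and all parameter values and names below are hypothetical:

```python
import math

def sicm_probs(theta, alpha, intercepts, slopes, error_of, kappa=1.2):
    """Simplified SICM-style nominal response probabilities.

    theta: continuous ability; alpha: dict mapping error -> 0/1 possession;
    error_of: maps each distractor to the error it measures (key omitted).
    kappa (hypothetical) boosts a distractor when its error is possessed.
    """
    logits = {}
    for alt in intercepts:
        z = intercepts[alt] + slopes[alt] * theta
        err = error_of.get(alt)
        if err is not None and alpha.get(err, 0) == 1:
            z += kappa
        logits[alt] = z
    denom = sum(math.exp(z) for z in logits.values())
    return {alt: math.exp(z) / denom for alt, z in logits.items()}

# Hypothetical 4-alternative item: A is keyed correct; B, C, D measure errors
intercepts = {"A": 0.0, "B": -0.8, "C": -0.3, "D": 0.666}
slopes = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}
error_of = {"B": "non_text", "C": "misread_passage", "D": "misread_question"}

no_errors = sicm_probs(0.0, {"non_text": 0, "misread_passage": 0, "misread_question": 0},
                       intercepts, slopes, error_of)
all_errors = sicm_probs(0.0, {"non_text": 1, "misread_passage": 1, "misread_question": 1},
                        intercepts, slopes, error_of)

# With no errors possessed, D (largest intercept) is the modal distractor;
# possessing all errors shifts probability away from the correct answer A.
print(no_errors["D"] > no_errors["B"])   # True
print(all_errors["A"] < no_errors["A"])  # True
```

Each misconception pattern thus induces its own set of response curves over ability, which is the mechanism behind the class-specific trace lines in Figure 3.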
References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6).

Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: Wiley.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37.

Confrey, J. (1990). A review of the research on student conceptions in mathematics, science, and programming. In C. Cazden (Ed.), Review of research in education (Vol. 16, pp. 3-56). Washington, DC: American Educational Research Association.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Wadsworth Group/Thomson Learning.

Garfield, J. (1998, April). Challenges in assessing statistical reasoning. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Garfield, J., & Chance, B. (2000). Assessment in statistics education: Issues and challenges. Mathematical Thinking and Learning, 2(1&2).

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7.

Halloun, I. A., & Hestenes, D. (1985). The initial knowledge state of college physics students. American Journal of Physics, 53(11).

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Henson, R. A., & Templin, J. L. (2003). The moving window family of proposal distributions (Unpublished technical report). Educational Testing Service, External Diagnostic Research Group.

Henson, R., & Templin, J. (2005). Hierarchical log-linear modeling of the joint skill distribution. Unpublished manuscript.

Henson, R., Templin, J., & Willse, J. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74.

Henson, R., Templin, J., Willse, J., & Irwin, P. (2009, April). Ancillary random effects: A way to obtain diagnostic information from existing large scale tests. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The Physics Teacher, 30.

Huff, K., & Goodman, D. P. (2007). The demand for cognitive diagnostic assessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications. London: Cambridge University Press.

Khazanov, L. (2009, February). A diagnostic assessment for misconceptions in probability. Paper presented at the Georgia Perimeter College Mathematics Conference, Clarkston, GA.

Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100.
More informationITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE
California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION
More informationUsing the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison
Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National
More informationWorkshop Overview. Diagnostic Measurement. Theory, Methods, and Applications. Session Overview. Conceptual Foundations of. Workshop Sessions:
Workshop Overview Workshop Sessions: Diagnostic Measurement: Theory, Methods, and Applications Jonathan Templin The University of Georgia Session 1 Conceptual Foundations of Diagnostic Measurement Session
More informationDifferential Item Functioning
Differential Item Functioning Lecture #11 ICPSR Item Response Theory Workshop Lecture #11: 1of 62 Lecture Overview Detection of Differential Item Functioning (DIF) Distinguish Bias from DIF Test vs. Item
More informationTHE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH
THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH By JANN MARIE WISE MACINNES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF
More informationUsing Bayesian Decision Theory to
Using Bayesian Decision Theory to Design a Computerized Mastery Test Charles Lewis and Kathleen Sheehan Educational Testing Service A theoretical framework for mastery testing based on item response theory
More informationMichael Hallquist, Thomas M. Olino, Paul A. Pilkonis University of Pittsburgh
Comparing the evidence for categorical versus dimensional representations of psychiatric disorders in the presence of noisy observations: a Monte Carlo study of the Bayesian Information Criterion and Akaike
More informationUSE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION
USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,
More informationItem Parameter Recovery for the Two-Parameter Testlet Model with Different. Estimation Methods. Abstract
Item Parameter Recovery for the Two-Parameter Testlet Model with Different Estimation Methods Yong Luo National Center for Assessment in Saudi Arabia Abstract The testlet model is a popular statistical
More informationChapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.
Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human
More informationGMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups
GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics
More informationCYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)
DOI: http://dx.doi.org/10.4314/gjedr.v16i2.2 GLOBAL JOURNAL OF EDUCATIONAL RESEARCH VOL 16, 2017: 87-94 COPYRIGHT BACHUDO SCIENCE CO. LTD PRINTED IN NIGERIA. ISSN 1596-6224 www.globaljournalseries.com;
More informationImpact and adjustment of selection bias. in the assessment of measurement equivalence
Impact and adjustment of selection bias in the assessment of measurement equivalence Thomas Klausch, Joop Hox,& Barry Schouten Working Paper, Utrecht, December 2012 Corresponding author: Thomas Klausch,
More informationOrdinal Data Modeling
Valen E. Johnson James H. Albert Ordinal Data Modeling With 73 illustrations I ". Springer Contents Preface v 1 Review of Classical and Bayesian Inference 1 1.1 Learning about a binomial proportion 1 1.1.1
More informationJason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the
Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting
More informationTHE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION
THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION Timothy Olsen HLM II Dr. Gagne ABSTRACT Recent advances
More informationBayesian Tailored Testing and the Influence
Bayesian Tailored Testing and the Influence of Item Bank Characteristics Carl J. Jensema Gallaudet College Owen s (1969) Bayesian tailored testing method is introduced along with a brief review of its
More informationResearch and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida
Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality
More information11/24/2017. Do not imply a cause-and-effect relationship
Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection
More informationInternational Journal of Education and Research Vol. 5 No. 5 May 2017
International Journal of Education and Research Vol. 5 No. 5 May 2017 EFFECT OF SAMPLE SIZE, ABILITY DISTRIBUTION AND TEST LENGTH ON DETECTION OF DIFFERENTIAL ITEM FUNCTIONING USING MANTEL-HAENSZEL STATISTIC
More informationAdaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida
Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models
More informationDifferential Item Functioning Amplification and Cancellation in a Reading Test
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to
More informationA Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.
Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1
More informationOn indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state
On indirect measurement of health based on survey data Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state A scaling model: P(Y 1,..,Y k ;α, ) α = item difficulties
More informationEmpowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison
Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological
More informationStructural Equation Modeling (SEM)
Structural Equation Modeling (SEM) Today s topics The Big Picture of SEM What to do (and what NOT to do) when SEM breaks for you Single indicator (ASU) models Parceling indicators Using single factor scores
More informationSelection of Linking Items
Selection of Linking Items Subset of items that maximally reflect the scale information function Denote the scale information as Linear programming solver (in R, lp_solve 5.5) min(y) Subject to θ, θs,
More informationEcological Statistics
A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents
More informationAnswers to end of chapter questions
Answers to end of chapter questions Chapter 1 What are the three most important characteristics of QCA as a method of data analysis? QCA is (1) systematic, (2) flexible, and (3) it reduces data. What are
More informationAdvanced Bayesian Models for the Social Sciences
Advanced Bayesian Models for the Social Sciences Jeff Harden Department of Political Science, University of Colorado Boulder jeffrey.harden@colorado.edu Daniel Stegmueller Department of Government, University
More informationDevelopment, Standardization and Application of
American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,
More informationSection 5. Field Test Analyses
Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken
More informationMultidimensional Modeling of Learning Progression-based Vertical Scales 1
Multidimensional Modeling of Learning Progression-based Vertical Scales 1 Nina Deng deng.nina@measuredprogress.org Louis Roussos roussos.louis@measuredprogress.org Lee LaFond leelafond74@gmail.com 1 This
More informationRunning head: ATTRIBUTE CODING FOR RETROFITTING MODELS. Comparison of Attribute Coding Procedures for Retrofitting Cognitive Diagnostic Models
Running head: ATTRIBUTE CODING FOR RETROFITTING MODELS Comparison of Attribute Coding Procedures for Retrofitting Cognitive Diagnostic Models Amy Clark Neal Kingston University of Kansas Corresponding
More informationIDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS
IDENTIFYING DATA CONDITIONS TO ENHANCE SUBSCALE SCORE ACCURACY BASED ON VARIOUS PSYCHOMETRIC MODELS A Dissertation Presented to The Academic Faculty by HeaWon Jun In Partial Fulfillment of the Requirements
More informationStatistical Methods and Reasoning for the Clinical Sciences
Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries
More informationItem Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century
International Journal of Scientific Research in Education, SEPTEMBER 2018, Vol. 11(3B), 627-635. Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century
More informationProceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)
EVALUATION OF MATHEMATICS ACHIEVEMENT TEST: A COMPARISON BETWEEN CLASSICAL TEST THEORY (CTT)AND ITEM RESPONSE THEORY (IRT) Eluwa, O. Idowu 1, Akubuike N. Eluwa 2 and Bekom K. Abang 3 1& 3 Dept of Educational
More informationUsing the Rasch Modeling for psychometrics examination of food security and acculturation surveys
Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys Jill F. Kilanowski, PhD, APRN,CPNP Associate Professor Alpha Zeta & Mu Chi Acknowledgements Dr. Li Lin,
More informationAnalysis of the Reliability and Validity of an Edgenuity Algebra I Quiz
Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative
More informationAdvanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)
Advanced Bayesian Models for the Social Sciences Instructors: Week 1&2: Skyler J. Cranmer Department of Political Science University of North Carolina, Chapel Hill skyler@unc.edu Week 3&4: Daniel Stegmueller
More informationDifferential Item Functioning from a Compensatory-Noncompensatory Perspective
Differential Item Functioning from a Compensatory-Noncompensatory Perspective Terry Ackerman, Bruce McCollaum, Gilbert Ngerano University of North Carolina at Greensboro Motivation for my Presentation
More informationTECHNICAL REPORT. The Added Value of Multidimensional IRT Models. Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock
1 TECHNICAL REPORT The Added Value of Multidimensional IRT Models Robert D. Gibbons, Jason C. Immekus, and R. Darrell Bock Center for Health Statistics, University of Illinois at Chicago Corresponding
More informationValidating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky
Validating Measures of Self Control via Rasch Measurement Jonathan Hasford Department of Marketing, University of Kentucky Kelly D. Bradley Department of Educational Policy Studies & Evaluation, University
More informationA Brief Introduction to Bayesian Statistics
A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon
More informationThe Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory
The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory Kate DeRoche, M.A. Mental Health Center of Denver Antonio Olmos, Ph.D. Mental Health
More informationAn Introduction to Missing Data in the Context of Differential Item Functioning
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute
More informationBruno D. Zumbo, Ph.D. University of Northern British Columbia
Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.
More informationA Comparison of Item and Testlet Selection Procedures. in Computerized Adaptive Testing. Leslie Keng. Pearson. Tsung-Han Ho
ADAPTIVE TESTLETS 1 Running head: ADAPTIVE TESTLETS A Comparison of Item and Testlet Selection Procedures in Computerized Adaptive Testing Leslie Keng Pearson Tsung-Han Ho The University of Texas at Austin
More information