Investigating Psychometric Isomorphism for Traditional and Performance-Based Assessment


Journal of Educational Measurement
Spring 2018, Vol. 55, No. 1

Investigating Psychometric Isomorphism for Traditional and Performance-Based Assessment

Derek M. Fay, Independent Researcher
Roy Levy, Arizona State University
Vandhana Mehta, Independent Researcher

A common practice in educational assessment is to construct multiple forms of an assessment that consist of tasks with similar psychometric properties. This study utilizes a Bayesian multilevel item response model and descriptive graphical representations to evaluate the psychometric similarity of variations of the same task. These approaches for describing the psychometric similarity of task variants were applied to two different types of assessments (one traditional assessment and one performance-based assessment) with markedly different response formats. Due to the general nature of the multilevel item response model and graphical approaches that were utilized, the methods used for this work can readily be applied to many assessment contexts for the purposes of evaluating the psychometric similarity of tasks.

A common practice in educational assessment is to construct multiple forms of a test, such that the forms differ in that they contain at least some different tasks, but at the same time the resulting student performances are of evidentiary comparability. One aspect of this notion of evidentiary comparability is the difficulty of the forms. If one form is easier than another, the resulting scores may not be evidentiarily comparable. In such cases, equating processes may be needed to yield transformed scores that eradicate the effects of differences in difficulty (Kolen & Brennan, 2014). However, equal difficulty is not sufficient to ensure evidentiary comparability. In addition to other psychometric characteristics (e.g., discrimination), other considerations include the comparability of the content and cognitive demands, which are often operationalized via tables of specifications that specify the weights assigned to different content areas and cognitive skills relevant in the domain. By designing forms that adhere to these specifications, the forms will align with the purposes of the assessment and each other.

A burgeoning area of research involves the principled design of tasks that may be seen as targeting the goal of creating tasks that are different but evidentiarily comparable, which has emerged from a lineage of research on the intersection of task design, creation, and theories of cognition (Bejar, 1993; Bejar et al., 2003; Embretson, 1998, 1999; Gierl & Haladyna, 2013; Guttman, 1969; Hively, Patterson, & Page, 1968; Irvine & Kyllonen, 2002; LaDuca, Staples, Templeton, & Holtzman, 1986; Mislevy, Almond, & Steinberg, 2002). Such task generation may be automated (Gierl & Haladyna, 2013), though it need not be, as in the case of the contexts studied in this work.

As typically implemented, these approaches guide the construction of many instances of a task from a well-defined task template. The task template defines which structural features of a task remain unchanged as well as the structural features that are altered to produce instances of the task. Any two instances of a task derived from the same task template are referred to as structural isomorphs. A task is deemed psychometrically isomorphic if the various instances of the task do not meaningfully differ with respect to psychometric properties (e.g., difficulty, discrimination, guessing). Importantly, tasks that are structurally isomorphic need not be psychometrically isomorphic, and vice versa.

Table 1 lays out the possibilities. Starting with the upper left cell, tasks that are not structurally isomorphic might not be psychometrically isomorphic. This is the usual case where different items (i.e., those that are structurally different in terms of their design, such as the aspect of the content that they measure) are also psychometrically different. This is well understood in assessment and test assembly: different items may differ in terms of their difficulty (or discrimination, or how subject they are to guessing). Turning to the bottom left cell, items that are not structurally isomorphic may indeed be psychometrically isomorphic. This is the case where different items have the same psychometric properties. Turning to the bottom right cell, structurally isomorphic items may be psychometrically isomorphic. As a simple example, an item is structurally isomorphic with itself, and when placed on multiple forms at the same time is typically also psychometrically isomorphic with itself; such an assumption underlies common-item equating designs (Kolen & Brennan, 2014).

Table 1
Possible Combinations of Structurally and Psychometrically Isomorphic Tasks

Not structurally isomorphic, not psychometrically isomorphic. Example: different items with different psychometric properties.

Not structurally isomorphic, psychometrically isomorphic. Example: different items with the same psychometric properties.

Structurally isomorphic, not psychometrically isomorphic. Examples: instances from the same task template with different psychometric properties (e.g., a family of items created to assess learning on the same construct, but the items do not have similar item difficulty values; an item and itself on a later form if item drift has occurred).

Structurally isomorphic, psychometrically isomorphic. Examples: instances from the same task template with the same psychometric properties (e.g., a family of items created to assess learning on the same construct and the items also provide similar item difficulty values; an item and itself on another form).

Finally, turning to the upper right cell, structurally isomorphic items need not be psychometrically isomorphic. An extreme example of this is item drift, where the same item appears on multiple forms across time points, but its psychometric properties depend on when the item was administered. We note that though Table 1 lays out possibilities in terms of categories, we can conceive of isomorphism as a matter of degree. For example, two tasks may be highly psychometrically isomorphic if their psychometric properties, though not identical, are nearly the same.

Different instances of the same task template are structurally isomorphic, and prior to piloting it is an open question as to whether the instances are psychometrically isomorphic. In some cases, the desire for different instances to produce a psychometrically isomorphic task is a matter of design. Some situations, such as producing equated test forms, call for psychometrically isomorphic tasks. Other situations, such as testing cognitive theories (Gorin & Embretson, 2013), call for similarly structured tasks with different psychometric properties. A task feature is referred to as an incidental feature if, when varied, the resulting structural isomorphs have similar psychometric properties. In contrast, radicals refer to task features that, when varied, produce tasks with meaningful psychometric differences (Bejar, 1993; Gorin & Embretson, 2013), thereby producing tasks that are not psychometrically isomorphic. The term noninvariant will be used interchangeably to refer to such tasks.

As more fully described below, the current work is concerned with investigating whether structurally isomorphic tasks in fact exhibit psychometric isomorphism. The generation of tasks for this work was not computerized, but done by subject matter experts (SMEs) for two different types of assessments. The design process was aligned with Bejar et al.'s (2003) weak theory of design, in which an existing task served as the basis for a task model. We refer to instances of a task created from the same model as members of the same task family (e.g., Sinharay, Johnson, & Williamson, 2003). Owing to the design process, most of the generated tasks are best characterized as structurally isomorphic. Details on this are given in a later section that describes the assessments that form the context of our work. The hope is that all structurally isomorphic tasks are also highly psychometrically isomorphic. The goal of this work is to investigate whether this is indeed the case in two different assessment scenarios, and to advance methods for doing so. The next section reviews existing methods that have been proposed for investigating psychometric isomorphism and advances multiple alternative methods that have advantages over existing methods.

Data Analyses to Investigate Psychometric Isomorphism

We briefly review three approaches for modeling item families in the context of the one-parameter logistic (1PL) or Rasch item response theory (IRT) model (Sinharay et al., 2003). The unrelated siblings model assumes independent item response functions for all tasks, essentially ignoring family membership. The model may be expressed mathematically as

$$P(Y_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}, \qquad (1)$$

where $P(Y_{ij} = 1 \mid \theta_i, b_j)$ is the probability of success on task j for examinee i, $\theta_i$ is the latent variable for examinee i, and $b_j$ is the difficulty of task j. In this and the remaining models, the indeterminacy in the latent variable may be resolved by setting the mean of the $\theta$s to 0.
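To make Equation 1 concrete, here is a minimal sketch of the Rasch item response function in R, the software environment used for estimation later in this article; the function name and example values are illustrative only.

```r
# Rasch (1PL) item response function from Equation 1:
# probability of success given proficiency theta and difficulty b.
rasch_irf <- function(theta, b) {
  exp(theta - b) / (1 + exp(theta - b))
}

# Illustrative call: an examinee at theta = 0.5 facing a task with b = -0.25.
rasch_irf(theta = 0.5, b = -0.25)  # approximately 0.68
```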

Sinharay et al. (2003) noted that this model is limited in that it ignores the relationship among the members of the task family, requiring larger sample sizes for calibration and yielding inflated standard errors. To this we add that this model does not naturally support what we desire and pursue in this work, namely, ways to characterize the psychometric isomorphism among members of the same task family.

The identical siblings model assumes that all the tasks in the same family have the same item response function. This model may be expressed mathematically as

$$P(Y_{ij} = 1 \mid \theta_i, b_{jg}) = \frac{\exp(\theta_i - b_{jg})}{1 + \exp(\theta_i - b_{jg})}, \qquad (2)$$

where $b_{jg}$ is the difficulty of task j that is a member of task family g, and all such $b_{jg}$s for task family g are assumed equal. This model is one of total psychometric isomorphism among all members of the task family. Sinharay et al. (2003) noted that this model is limited in that it ignores the variability among the members of the task family, providing incorrect estimates of the task parameters in the presence of such variability. To this, we add that this model does not by itself provide a way to characterize the psychometric isomorphism among members of the same task family. As reviewed below, it may be used in conjunction with the model described next to assess psychometric isomorphism to some degree, though in ways that are somewhat limited.

The related siblings model departs from the identical siblings model by not assuming that all the tasks in the same family have the same item response function, allowing for departures from exact psychometric isomorphism. This model may be expressed as a hierarchical model, where the first component is given by Equation 2. The hierarchical component specifies a distribution to relate the parameters for tasks from the same family,

$$b_{jg} \sim N(\mu_{b_g}, \sigma^2_{b_g}). \qquad (3)$$

This expresses that each task-specific parameter $b_{jg}$ is modeled as varying around a family-specific mean ($\mu_{b_g}$). Note that the unrelated siblings and identical siblings models are limiting cases of the related siblings model. The unrelated siblings model results from $\sigma^2_{b_g}$ approaching infinity, and the identical siblings model results from $\sigma^2_{b_g}$ approaching 0.

These models for task families have been extended and employed in the contexts of dichotomous IRT models (Glas & van der Linden, 2003; Lathrop & Cheng, 2017; Sinharay et al., 2003), polytomous IRT models (Johnson & Sinharay, 2005), and models with covariates (Cho, de Boeck, Embretson, & Rabe-Hesketh, 2014; Geerlings, Glas, & van der Linden, 2011; Lathrop & Cheng, 2017). However, guidelines for evaluating the results of those models to characterize whether the tasks are psychometrically isomorphic are relatively underdeveloped.
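The hierarchical structure in Equations 2 and 3 can be illustrated with a short simulation in R; all names and parameter values here are illustrative assumptions, not quantities from the study.

```r
set.seed(123)

# Simulate one task family under the related siblings model (Equations 2-3).
n_members   <- 3      # versions of the task in the family
n_examinees <- 1500

mu_b    <- 0.4        # family-specific mean difficulty
sigma_b <- 0.2        # within-family SD; sigma_b -> 0 recovers the identical
                      # siblings model, sigma_b -> Inf the unrelated siblings model

b_jg  <- rnorm(n_members, mean = mu_b, sd = sigma_b)  # member difficulties
theta <- rnorm(n_examinees, mean = 0, sd = 1)         # latent proficiencies

# Random assignment of one member per examinee defines the examinee groups.
g <- sample(n_members, n_examinees, replace = TRUE)
y <- rbinom(n_examinees, size = 1, prob = plogis(theta - b_jg[g]))
```

Here plogis() is the logistic function appearing on the right-hand side of Equation 2.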

Johnson and Sinharay (2005) advocated computing Bayes factors (BFs) to compare the unrelated siblings and related siblings models, which characterizes the amount of evidence in favor of the related siblings model, which contains a family structure, as opposed to the unrelated siblings model, which does not. This addresses whether there is evidence of any psychometric isomorphism, as opposed to no psychometric isomorphism. The focus of the current work is on a different question, namely, to characterize the amount of the assumed-to-be-present psychometric isomorphism. One potential avenue would pursue a comparison of the related siblings and the identical siblings models through BFs or other approaches to model comparison, such as the use of information criteria (Geerlings et al., 2011). One drawback of these approaches is that they require the fitting of multiple models. Furthermore, if the related siblings model is supported over the identical siblings model, the question still remains as to whether a subset of the task families might exhibit a high level of, or even exact, psychometric isomorphism. To target each task family, we could also compare a related siblings model to a partially constrained version of the related siblings model that specifies one task family as following the identical siblings model and the remaining families as following the related siblings model. This would require fitting as many models as there are task families.

This work therefore develops and advances a series of statistical and graphical procedures aimed at characterizing the psychometric isomorphism among members of a task family. There are several benefits to our approach, which we illustrate through two different assessment contexts. First, our approach does not require multiple models to be fit; all of the inference is done within the context of fitting the related siblings model. Second, our approach allows for a separate characterization of the psychometric isomorphism for each of the task families. Third, as we illustrate with our examples, our approach can be instantiated differently depending on the number of members of the task family. Fourth, we employ graphical approaches using the results of the IRT model, as well as classical test theoretic approaches to investigate isomorphism with respect to distractor selection in multiple-choice tasks. As our examples illustrate, this can directly lead to deeper substantive understandings of the reasons behind a lack of psychometric isomorphism.

To pursue our ends of characterizing psychometric isomorphism based on fitting the related siblings model, we leverage statistical testing procedures that have been developed for related multiple-group models (Verhagen, Levy, Millsap, & Fox, 2016). In essence, we treat the examinees who see different members of the task family as defining different groups of examinees. In this context, we might rewrite the model in Equation 2 as

$$P(Y_{ij} = 1 \mid \theta_{ig}, b_{jg}) = \frac{\exp(\theta_{ig} - b_{jg})}{1 + \exp(\theta_{ig} - b_{jg})}, \qquad (4)$$

with the additional subscripting of $\theta$ by g to indicate that each examinee is now framed as a member of a group. In multiple-group models, care must be taken to link the group-specific metrics of the latent variable. This may be done by constraining the task parameters for a given task family to be equal across groups. Alternatively, the group-specific means and variances of the latent variable may be constrained to be equal. Note that the related siblings model as typically formulated (e.g., Sinharay et al., 2003) and described above takes this latter approach, in which case the additional subscripting in Equation 4 is not present in Equation 2.

We follow in this tradition and assume the group-specific means and variances of the latent variable are equal across groups. This is warranted by random assignment of examinees to groups, which in the current context is accomplished by randomly assigning which member of each task family is presented to each examinee.

The model affords two possibilities for investigating psychometric isomorphism. The first involves comparing the difficulty parameters for all the members of the same family (i.e., the $b_{jg}$s for task family g) (Verhagen et al., 2016). This approach is practical when the number of members in the family is small. In contrast, when the number of members is large, the number of direct comparisons becomes problematic. In this situation, we can turn to the variance of task difficulties across groups ($\sigma^2_{b_g}$) as a summary of the degree to which tasks in the family are psychometrically isomorphic (Verhagen & Fox, 2013). If $\sigma^2_{b_g} = 0$, the difficulties for the members of the task family are all identical. In contrast, large values of $\sigma^2_{b_g}$ reflect some differentiation in task difficulty between at least two members. We treat each of these situations in turn.

In the first situation, we seek to compare the difficulty parameters for all the members of the same family. Here, BFs (Kass & Raftery, 1995) can be computed using all of the information contained in the marginal posterior distributions for all versions of a task. Let $d = b_{jg} - b_{jg^*}$ represent the difference in difficulty for item j for any two groups g and g* (g ≠ g*). We can consider the posterior distribution of d to characterize the difference. Further, under the hypothesis that d = 0, the BF is given as the ratio of the density for d = 0 under the posterior, $P(d = 0 \mid H_1, Y)$, to that under the prior, $P(d = 0 \mid H_1)$, as follows (Verhagen et al., 2016):

$$BF = \frac{P(d = 0 \mid H_1, Y)}{P(d = 0 \mid H_1)}. \qquad (5)$$

As operationalized for this work, BFs in Equation 5 capture the amount of evidence in favor of the null hypothesis that the difficulty parameters are invariant (i.e., the tasks are psychometrically isomorphic). Drawing from Jeffreys's (1961) general recommendations, substantial support for the hypothesis of invariance of the difficulty parameters obtains for BFs of 3 or larger (Verhagen et al., 2016). Conversely, BFs of .33 or smaller constitute substantial support for the hypothesis of noninvariance (i.e., differences) among the difficulty parameters. BFs in between these values are interpreted as being inconclusive about the invariance among the difficulty parameters. These interpretations are also proposed for the BFs used in the second situation, defined next.
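Before turning to the second situation, here is a sketch in R of how the ratio in Equation 5 can be computed from MCMC output via the Savage-Dickey density ratio. The prior on d is assumed to be normal and centered at 0 purely for illustration; in practice its form follows from the priors placed on the difficulty parameters.

```r
# Savage-Dickey sketch for Equation 5: compare the posterior and prior
# density of the difficulty difference d = b_jg - b_jg* at d = 0.
savage_dickey_bf <- function(d_draws, prior_sd = 1) {
  dens       <- density(d_draws)                    # kernel density of posterior draws
  post_at_0  <- approx(dens$x, dens$y, xout = 0)$y  # posterior density at d = 0
  prior_at_0 <- dnorm(0, mean = 0, sd = prior_sd)   # assumed N(0, prior_sd^2) prior
  post_at_0 / prior_at_0                            # BF >= 3 supports invariance
}

# Usage with hypothetical draws of the two group-specific difficulties:
# d_draws <- b_draws_form1 - b_draws_form2
# savage_dickey_bf(d_draws)
```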

In the second situation, we seek to evaluate the variance of task difficulties across groups ($\sigma^2_{b_g}$) to summarize the degree to which the members exhibit psychometric isomorphism (i.e., $\sigma^2_{b_g} = 0$) (Verhagen & Fox, 2013). Unlike the difference among difficulty parameters across groups, zero is not in the interior of the parameter space for variance terms. Accordingly, the hypothesis of essential invariance was investigated by specifying a small prior variance (denoted $\sigma^{2*}$) against which to compare the posterior draws (Klugkist, 2008; Klugkist & Hoijtink, 2007). The BF to evaluate the hypothesis that the tasks in a family exhibit essential invariance is given by

$$BF = \frac{P(\sigma^2_{b_g} < \sigma^{2*} \mid H_1, Y)}{P(\sigma^2_{b_g} < \sigma^{2*} \mid H_1)}, \qquad (6)$$

where $\sigma^2_{b_g}$ is the variance of the $b_{jg}$s for the tasks in family g. The BFs in Equation 6 capture the amount of evidence in favor of the null hypothesis that the difficulty parameters do not vary (i.e., the tasks are psychometrically isomorphic), operationalized by having their variance be lower than some threshold $\sigma^{2*}$. When the BF in Equation 6 flags items in a family as being noninvariant, this supports the conclusion that some of the items' parameters differ, but does not speak to which ones. In this case, the BF effectively serves as an omnibus assessment that calls for further investigations to identify the likely culprits. Techniques for doing such follow-up investigations using item characteristic curves (ICCs) are described below.

Notably, the scale of $\sigma^{2*}$ renders it difficult to declare a value that constitutes meaningful variation among the difficulty parameters. Importantly, finding evidence of invariance (noninvariance) becomes easier with larger (smaller) values of $\sigma^{2*}$. Selecting a particular value of $\sigma^{2*}$ to serve as a threshold for declaring invariance requires more subjectivity than is warranted given the exploratory nature of this work. However, it is easy to compute the BF for different choices of $\sigma^{2*}$ (Verhagen & Fox, 2013), which we do in the current work to explore the magnitude of $\sigma^{2*}$ that was appropriate to find strong support for the invariance hypothesis.

To complement the results from the related siblings model and the computation of BFs, a number of graphical representations were pursued to directly compare the ICCs associated with tasks hypothesized to be isomorphic. The ICCs for variations on the same tasks were overlaid onto a single plot. Evidence that the tasks were psychometrically isomorphic obtained to the extent that the ICCs among variations of the task were indistinguishable. Evidence against the hypothesis that the tasks were psychometrically isomorphic obtained to the extent that the ICCs among variations were distinguishable.

In addition to the comparisons of ICCs, additional analyses for the traditional assessment comprised of multiple-choice tasks (described more fully below) included comparing the selection of response options. Using the total score across items as a proxy for examinee ability, three equally sized ability groups (low ability, average ability, high ability) were created. The proportion of examinees that selected each response option was computed for each ability group. Looking across ability groups for one member of a task family, it was expected that the proportion of examinees selecting the correct response would be higher for more proficient examinees. Looking within ability groups across different members of a task family, evidence of psychometric isomorphism obtained to the extent that the patterns of response selection were similar for examinees with similar levels of ability. Hence, when task families are classified into the three groups (i.e., Non-Invariant, Inconclusive, Invariant), further investigation is performed on items in the Inconclusive group by comparing the ICCs and the proportion of examinees that selected each response option, which helps to determine whether the tasks are psychometrically isomorphic.
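As a concrete illustration of Equation 6, and of computing the BF over a grid of thresholds as just described, the following R sketch estimates both probabilities by Monte Carlo. The Inv-Gamma(.01, .01) prior mirrors the hyperprior reported in the Methods section; all object names are illustrative.

```r
set.seed(42)

# Prior draws for sigma^2_bg: if tau ~ Gamma(shape, rate), then 1/tau is
# inverse-gamma with the same shape and scale parameters.
prior_draws <- 1 / rgamma(1e6, shape = 0.01, rate = 0.01)

# Equation 6: ratio of posterior to prior probability that the within-family
# variance falls below the threshold sigma2_star.
bf_variance <- function(sigma2_draws, sigma2_star, prior_draws) {
  mean(sigma2_draws < sigma2_star) / mean(prior_draws < sigma2_star)
}

# Sensitivity over a grid of thresholds (the grid used later in the article),
# given hypothetical posterior draws sigma2_draws for one task family:
# sapply(c(.05, .10, .15, .20, .25),
#        function(t) bf_variance(sigma2_draws, t, prior_draws))
```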

Methods

Description of the Assessments

In the following subsections, we briefly describe the two assessment contexts in which our work takes place, highlighting features that pertain to the desire for structurally and psychometrically isomorphic tasks.

Traditional assessment. In the first context, we investigated final exam forms from a summative, knowledge-based assessment for an introductory course in computer networking that is provided to the global audience of students participating in the Cisco Networking Academy Program. Established in 1997, the program consists of 10,000 participating academies in 165 countries. The final exam forms under study come from the first course in a sequence of four courses that serve as preparation for the Cisco Certified Network Associate Routing and Switching certification exam. As it pertains to this assessment, isomorphic tasks (which appear on different forms as described below) were created to improve test security for a globally distributed assessment and thereby facilitate the practical goal of enhancing fairness to students. The final exams are typically given at the end of each course and are provided online through the Cisco Networking Academy web site. The exam forms investigated in this study consist of 60 tasks; the tasks that appear on the final exam forms will herein be referred to as items. The majority of the items on the assessment are multiple-choice single-answer or multiple-answer items. In addition, the forms include several other item types, such as fill-in-the-blank items, drag-and-drop items, and items with exhibits that show a network topology, display device output, or utilize small simulated computer networks through Cisco's Packet Tracer tool that serve as interactive exhibits.

Within each form, there were three types of items with respect to isomorphism. The first type included 17 items that were identical with respect to the stimulus, the response options, and the order of response options across the three forms of the assessment (i.e., these were the same item, in the usual sense, across the three forms). This set of 17 items will herein be referred to as items with option randomization disabled (OR-D). The second type included 14 items that were invariant across the three forms with respect to the stimulus and the particular set of response options. Although the response options were identical across forms, the order of the response options was randomized on an examinee-to-examinee basis; these items will herein be referred to as items with option randomization enabled (OR-E). The third type included 22 item families with 3 members each that were judged to be to some degree isomorphic by SMEs; these are referred to as SME-I items. Among these 22 families, there was some variability with respect to the degree of isomorphism from the perspective of SMEs. Some items were developed from a template with variable features being substitutions that were not believed to have any bearing on the psychometric properties of the items; these items were indeed characterized as structural isomorphs. For other items, the instances differed with respect to the difficulty of the language used in the stem and/or set of response options. Owing to this research serving as the initial exploration of the psychometric comparability of these items, the analytical goal was to investigate which features resulted in items being psychometrically comparable or incomparable.

Performance-based assessment. The second assessment is a simulation-based task in which examinees must configure and troubleshoot a simulated network. These assessments are known as Packet Tracer Skills Assessments (PTSAs). Much of the assessment interaction with the devices is through a high-verisimilitude command line interface (see Rupp et al., 2012, for further descriptions of PTSAs). The work products produced by examinees include (a) a log of commands entered in the command line window to diagnose and configure the devices in the network and (b) the final state of the network. The work product features that were identified by SMEs as evidence of examinees' proficiency ranged from the inclusion of commands that represent best practices to observable features of the final state of the network and the devices within it. For the PTSA under consideration in this study, 57 work product features (referred to as primary observables herein) deemed by SMEs as evidence of examinees' proficiency were dichotomously scored such that 1 denotes success and 0 otherwise.

The versions of the PTSA considered in this work were designed to be structurally isomorphic. The features that were manipulated to produce 12 versions of the PTSA included four versions of a topological structure (see Figure 1) and three sets of device labels (see Table 2). For the purposes of this work, the topological structure refers to the logical interconnection among devices; the presentation of the topological structures may vary the spatial location among devices but not the logical interconnection among devices. The device labels simply corresponded to the names that were assigned to each device within the network.

Figure 1. Four topological structures of device networks that served as structurally isomorphic tasks on the Packet Tracer Skills Assessment.

Table 2
Device Labels for the Three Label Sets Used to Produce Structural Isomorphs for the PTSA

Device     Label Set 1             Label Set 2           Label Set 3
Router     Town Hall               Building 1            CS Department
Switch 1   IT Department Switch    First Floor Switch    LAB 124-C Switch
Switch 2   Administration Switch   Second Floor Switch   LAB 214-A Switch
LAN 1      IT Department LAN       First Floor LAN       LAB 124-C LAN
LAN 2      Administration LAN      Second Floor LAN      LAB 214-A LAN
PC 1       Reception Host          Host                  Host
PC 2       Operator Host           Host                  Host
PC 3       IT Host                 Host                  Host

Description of Data Used for Analysis

Traditional assessment. The data for the traditional assessment comprised N = 5,511 examinees who were randomly assigned to one of three forms. For all students, only their first attempt on the exam was retained. There were two types of missing data. The first type occurred if an examinee was not administered the item. These instances of missing data are missing at random by design, and accordingly could be ignored for subsequent data analyses (Enders, 2010). The second type occurred if an examinee was presented the item but did not provide a response. SMEs indicated that failures to provide a response might be due to any of a number of reasons, including those unrelated to proficiency, particularly for examinees with a large amount of missing data. Records for examinees that provided more than 50 responses (out of 60 possible) were employed for subsequent analyses. The final sample consisted of N = 5,425 examinees' responses, with $n_1$ = 1,779, $n_2$ = 1,851, and $n_3$ = 1,795 examinees receiving the first, second, and third form, respectively. For these examinees, all instances of missing data were scored as an unsuccessful attempt at the task.

PTSA assessment. Examinees were randomly assigned to 1 of the 12 versions of the PTSA task. The analytic sample for this work consists of scores on J = 57 primary observables from 801 examinees whose activity was deemed indicative of a motivated student. Table 3 shows the number of examinees that were retained for each structurally isomorphic version of the PTSA. Investigations of the distributions of time spent and the number of commands used (not shown due to space considerations) suggested that each of the 12 groups of examinees, as defined by which version of the PTSA task they received, were fairly similar in these metrics.

Table 3
Number of Examinees for Each Structural Isomorph of the PTSA Defined by the Combination of Label Set and Topological Structure

Analyses to Investigate Psychometric Isomorphism

Traditional assessment. We fit the related siblings model as given in Equations 4 and 3, where the three test forms define the three groups of examinees. The multilevel component of the model for the difficulty parameters is operationalized via diffuse hyperpriors on the family-specific parameters on the right-hand side of Equation 3 (Sinharay et al., 2003; see also Fox, 2010, and Levy & Mislevy, 2016, for Bayesian IRT modeling via multilevel structures): $\mu_{b_g} \sim N(0, 100)$ and $\sigma^2_{b_g} \sim \text{Inv-Gamma}(.01, .01)$. As discussed above, we assume the groups of examinees defined by which form they were presented are randomly equivalent, as warranted by the random assignment of forms to examinees. The implication is that, in the multiple-group formulation of the model, each group was modeled as having the same mean of the latent variable and the same variance of the latent variable. The distribution of the latent variables for examinees was $\theta_{ig} \sim N(0, \sigma^2_\theta)$, where $\sigma^2_\theta \sim \text{Inv-Gamma}(.01, .01)$.

As the task families are small, having three members each, we employed the approach to investigating psychometric isomorphism of examining the differences in difficulty parameters directly. For the three comparisons that were possible (i.e., Form 1 versus Form 2, Form 1 versus Form 3, and Form 2 versus Form 3), a BF was computed for each item using Equation 5. To complement the BFs, several additional analyses were also conducted. As a proxy for the difficulty of each version of a task, the arithmetic average of the MCMC draws for b used to construct the marginal posterior distribution was computed. Using this point estimate, the ICCs and 95% intervals for those ICCs for variations on the same tasks were overlaid onto a single plot and compared (Wickham, 2009). Finally, we employed classical test theoretic approaches to investigate isomorphism with respect to distractor selection in multiple-choice tasks, as described above.

PTSA assessment. For the PTSA, we fitted the related siblings model in the same fashion as just described for the traditional assessment. As the task families are large, having 12 members each, we employed the approach to investigating psychometric isomorphism of examining the variance in difficulty parameters, computing the BFs in Equation 6.
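For concreteness, the related siblings model with these hyperpriors might be specified in JAGS along the following lines. This is a sketch under assumed data structures (a complete N x J response matrix y and a group-membership vector group, with each item j treated as a family with G group-specific versions), not the authors' actual code.

```r
# JAGS sketch of the multiple-group related siblings model (Equations 3 and 4).
related_siblings_model <- "
model {
  for (i in 1:N) {
    for (j in 1:J) {
      # Rasch IRF; the difficulty used depends on the examinee's group.
      logit(p[i, j]) <- theta[i] - b[j, group[i]]
      y[i, j] ~ dbern(p[i, j])
    }
    theta[i] ~ dnorm(0, inv.sigma2.theta)       # common latent distribution
  }
  for (j in 1:J) {
    for (g in 1:G) {
      b[j, g] ~ dnorm(mu.b[j], inv.sigma2.b[j])  # Equation 3
    }
    mu.b[j] ~ dnorm(0, 0.01)                 # N(0, 100); JAGS uses precisions
    inv.sigma2.b[j] ~ dgamma(0.01, 0.01)     # so sigma2.b ~ Inv-Gamma(.01, .01)
    sigma2.b[j] <- 1 / inv.sigma2.b[j]
  }
  inv.sigma2.theta ~ dgamma(0.01, 0.01)      # sigma2.theta ~ Inv-Gamma(.01, .01)
}
"
```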

To explore the sensitivity to different threshold values of $\sigma^{2*}$, we computed BFs for five values of $\sigma^{2*}$ (.05, .10, .15, .20, .25) to explore the magnitude of $\sigma^{2*}$ that was required to find strong support for the invariance hypothesis across the majority of the primary observables. We then also plotted the ICCs in the same fashion as was done for the traditional assessment.

Model fitting. Using JAGS (Plummer, 2011) and the R package rjags (Plummer, 2013), a fully Bayesian approach to estimation was used to obtain the posterior distribution; all features of estimation described in this section apply to both assessments. After a burn-in period of 500 iterations, two Markov chains were run for 5,000 iterations with a thinning interval of 10 iterations, yielding 1,000 iterations to represent the posterior.

Fit of the model. For both the traditional assessment and the PTSA, we conducted posterior predictive checks of the related siblings model (Sinharay, 2005). As our focus is on the difficulty of the tasks, we pursued fit analyses that focus on the extent to which the related siblings model adequately models the difficulties of the items in the traditional assessment and the primary observables in the PTSA, using proportion correct as a test statistic. For both assessments, the results (not shown due to space considerations) indicated that the related siblings model does an adequate job of accounting for the observed proportions correct, lending support to the claim that it is modeling the difficulty of the tasks quite well. We also examined measures of local dependence (Levy, Mislevy, & Sinharay, 2009), and found the model suffered with respect to these properties in several places. As our focus was on psychometric isomorphism with respect to difficulty, and our checks indicated the model performs well with respect to accounting for the observed difficulty, we were comfortable in the use of the model for our current purposes. Future work could extend the related siblings model and our procedures to multidimensional models or other ways to model local dependence, as may be warranted by extending the examination of isomorphism to focus on other psychometric features.
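The estimation settings described under Model fitting might be implemented with rjags along these lines; a sketch that assumes the model string and data objects from the earlier sketches.

```r
library(rjags)

# Assumed data layout: response matrix y (N x J), group membership vector
# group, and G groups (the three forms, or the 12 PTSA versions).
jags_data <- list(y = y, group = group, N = nrow(y), J = ncol(y), G = 3)

model <- jags.model(textConnection(related_siblings_model),
                    data = jags_data, n.chains = 2)
update(model, n.iter = 500)               # 500 burn-in iterations
draws <- coda.samples(model,
                      variable.names = c("b", "mu.b", "sigma2.b"),
                      n.iter = 5000, thin = 10)
# Two chains x (5,000 / 10) = 1,000 retained draws represent the posterior.
```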

Results

Traditional Assessment

Table 4 tabulates the number of pairwise comparisons that resulted in each invariance designation on the basis of the values of the BFs (i.e., Non-Invariant, Inconclusive, Invariant) for the three item types (i.e., OR-D, OR-E, SME-I). The desired designation for all items is Invariant or, at the very least, Inconclusive. Among the 93 pairwise comparisons computed for the OR-D and OR-E items, only 3 were assigned a designation of Non-Invariant. Moreover, the distribution of designations was largely consistent for OR-D and OR-E items. This result suggests that randomization of response options had little bearing on the degree to which the OR-D and OR-E items were psychometrically isomorphic. Among the 66 pairwise comparisons computed for the SME-I items, 45 were designated as Non-Invariant, with the remaining 21 comparisons designated as either Inconclusive or Invariant.

Table 4
Summary of (Non-)Invariance Results for Each Item Type on the Traditional Assessment
(Columns: Item Type; Number of Items^a; Number of Comparisons; and the number of comparisons with BF ≤ .33 [Non-Invariance], .33 < BF < 3 [Inconclusive], and BF ≥ 3 [Invariance]. Rows: OR-D, OR-E, Isomorph, Total.)
^a The total number of items reflects the number of multiple choice-type items; the seven items that were not of a multiple-choice type are not included here. BF = Bayes factor; OR-D = option randomization disabled; OR-E = option randomization enabled.

Figure 2 illustrates selected results for the BFs and ICCs. The top, middle, and bottom rows show the results for two items that represent the ends of the continuum of invariance for the OR-D items, the OR-E items, and the items judged isomorphic by SMEs, respectively. Figure 3 illustrates the posterior densities for the d values for the items presented in Figure 2. Starting with the panels on the left sides of Figures 2 and 3, the BFs for each comparison, the inability to distinguish among the ICCs, and the high degree of overlap among the densities of d that are concentrated near 0 provide strong evidence for the invariance hypothesis. Among the least invariant of the OR-D and OR-E items (depicted in the panels on the right sides of Figures 2 and 3), some evidence of noninvariance was found via the BFs. However, the ICCs for these items (Item 51 and Item 1 for OR-D and OR-E, respectively) were reasonably close, and the overlap among the densities of d that are concentrated near 0 lends support to the invariance hypothesis. As for the least invariant among the isomorphic items (Item 18), all BFs were extremely small (all rounded to zero); the magnitude of the difference in difficulty as captured by the ICCs and the densities of d provides strong evidence that these items are psychometrically quite different.

Turning to the analyses of the response options, Figure 4 shows the proportion of examinees that selected each response option across three ability groups. For the purposes of accumulating evidence of invariance or noninvariance, the focus lies in the patterns of response selection rather than the conceptual underpinnings that gave rise to the similarities and differences among items hypothesized to be isomorphic. The top (Item 48), middle (Item 3), and bottom (Item 18) rows represent items for which there was strong evidence in favor of invariance, partial invariance, and noninvariance, respectively. For the item exhibiting a high degree of invariance (Item 48), the proportion of examinees that selected each response option was very similar across the three variations of the item. For the item that exhibited partial invariance (Item 3), the proportion of examinees selecting each response option was very similar for the versions of the item on Forms 1 and 3. However, the version on Form 2 was found to be markedly more difficult owing to the increased selection of all incorrect response options (B, C, D) across all ability groups. Finally, the pattern by which response options were selected for the noninvariant item (Item 18) exhibited a high degree of variability across all versions.
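The option-by-ability-group proportions underlying Figure 4 can be tabulated with a few lines of R; a sketch assuming a data frame responses of raw option choices (one column per item) and a vector total_score of examinee total scores, with all names illustrative.

```r
# Distractor analysis sketch: split examinees into three equal-sized ability
# groups by total score, then tabulate option-choice proportions by group.
option_table <- function(responses, total_score, item) {
  ability <- cut(total_score,
                 breaks = quantile(total_score, probs = c(0, 1/3, 2/3, 1)),
                 labels = c("Low", "Average", "High"),
                 include.lowest = TRUE)
  # Row-wise proportions: within each ability group, the share of examinees
  # selecting each response option (e.g., A, B, C, D).
  prop.table(table(ability, responses[[item]]), margin = 1)
}

# Hypothetical usage, repeated for each form's version of the same family:
# option_table(responses_form1, total_score_form1, "item18")
```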

Figure 2. Item characteristic curves for isomorphic items on different forms of the traditional assessment. OR = option randomization; BF12 = Bayes factor comparing the isomorphic item on Forms 1 and 2; BF13 = Bayes factor comparing the isomorphic item on Forms 1 and 3; BF23 = Bayes factor comparing the isomorphic item on Forms 2 and 3. (Panels: OR-Disabled Items 49 and 51; OR-Enabled Items 4 and 1; Isomorph Items 9 and 18. Vertical axis: P(Y = 1); horizontal axis: θ.)

Figure 3. Posterior densities for the difference (d) in location parameters between isomorphic items on different forms of the traditional assessment. OR = option randomization. Solid lines depict the comparison of Forms 1 and 2; dashed lines depict the comparison of Forms 1 and 3; dotted lines depict the comparison of Forms 2 and 3. (Panels: OR-Disabled Items 49 and 51; OR-Enabled Items 4 and 1; Isomorph Items 9 and 18.)

Figure 4. Percent of examinees selecting each response option for an isomorphic item that was invariant over forms (top row, Item 48), invariant for two versions but noninvariant for one (middle row, Item 3), and completely noninvariant over forms (bottom row, Item 18).

Packet Tracer Skills Assessment

Figure 5. Magnitude of Bayes factors (BFs) across different values of the between-group variance for the difficulty parameter for each primary observable on the Packet Tracer Skills Assessment. The dashed gray lines within panels represent Jeffreys's (1961) recommended cutoffs for substantial evidence for the hypothesis that difficulty parameters were not invariant (i.e., BF ≤ .33) and for the hypothesis of invariance among difficulty parameters (i.e., BF ≥ 3). (Color figure can be viewed at wileyonlinelibrary.com)

Figure 5 shows the values of the BFs across each of the values of $\sigma^{2*}$ for each of the primary observables. The results for primary observables appear within panels, with values of $\sigma^{2*}$ shown along the horizontal axis and the value of the BF shown on the vertical axis. The general upward trend within each panel reflects that allowing a larger amount of within-family variability ($\sigma^{2*}$) to still count as being isomorphic yields more task families deemed isomorphic. The results of the sensitivity analysis provide strong evidence that the particular value of $\sigma^{2*}$ has a meaningful impact on the conclusions to be drawn about the degree of invariance (or noninvariance) for primary observables. On the one hand, setting $\sigma^{2*} = .05$ would result in every primary observable being declared noninvariant. This suggests that $\sigma^{2*} = .05$ may be too strict a criterion, one that identifies primary observables with very similar levels of difficulty as noninvariant. On the other hand, setting $\sigma^{2*} = .15$ would result in most (but not all) of the primary observables being identified as invariant. This suggests that $\sigma^{2*} = .15$ may be too loose a criterion, one that identifies primary observables with markedly different levels of difficulty as invariant.

To avoid overidentifying primary observables as invariant or noninvariant, we selected $\sigma^{2*} = .10$ as the threshold for designating whether variations of each primary observable should be treated as invariant (i.e., BF ≥ 3), noninvariant (i.e., BF ≤ .33), or inconclusive (i.e., .33 < BF < 3). Based on the threshold value of $\sigma^{2*} = .10$, only one primary observable (i.e., PO1018) was found to be noninvariant; the remaining 56 (of the 57) primary observables were found to be inconclusive with respect to invariance or noninvariance.

Figure 6. Item characteristic curves for each of the primary observables identified as exhibiting noninvariance across structural isomorphs of the Packet Tracer Skills Assessment. The minimum and maximum values shown within plots represent the minimum and maximum means of the marginal posterior distributions associated with the respective variants of the primary observable.

Figure 6 shows the ICCs for the four primary observables that span the full range of psychometric isomorphism that was observed for the PTSA. The left panel in the top row (primary observable 1005) is one of the many instances in which there was inconclusive evidence in support of either hypothesis at the ends of the invariance continuum. For this primary observable (and many others on the PTSA), the ICCs for all 12 variations were largely indistinguishable.

In contrast, the right panel in the bottom row (primary observable 1018) shows the only primary observable with strong support for the hypothesis of noninvariance. For this primary observable, there was clearer variation in the ICCs across the 12 versions of the PTSA. The right panel in the top row (primary observable 1023) and the left panel in the bottom row (primary observable 1032) serve to fill in the gradient from invariance to noninvariance among the primary observables on the PTSA.

Discussion

As an ideal, the psychometric properties of structurally isomorphic versions of the same task would be identical. Under this ideal, examinees would not be differentially advantaged based on the particular version of the task they were exposed to. To the extent this ideal holds for the assessments considered in this work, the use of isomorphic tasks reduces task exposure and ensures fairness across multiple versions. Different versions of a task that are (a) developed from the same template, or (b) developed to measure the same construct with the same degree of cognitive complexity ought to reasonably approximate the ideal notion of psychometric isomorphism. Using data from a traditional assessment and a performance-based assessment, model-based and descriptive approaches were pursued to evaluate the degree to which isomorphs were discrepant from the notion of psychometric isomorphism. In what follows, the merits of these approaches are discussed in light of the results pertaining to both types of assessment pursued in this work.

The model-based approach involved estimating a related siblings model via a multilevel parameterization. Different versions of a task were deemed to be (or not to be) psychometrically isomorphic to the extent that the difficulties associated with the different versions were invariant (or not invariant) across the different forms of the assessment. Notably, this model-based approach was sufficiently general to handle data derived from two very different types of assessment. Moreover, trivial alterations to the setup of the model readily supported inferences for both pairwise group comparisons, which were pursued for the traditional assessment, and a between-group variance term, which was pursued for the PTSA. In essence, the model pursued in this work is readily equipped to investigate the hypothesis that different versions of a task are psychometrically invariant regardless of the number of groups. Notably, the Rasch version of the model was pursued on the basis of its simplicity and ease of exposition. The model can be readily expanded to investigate the degree to which other psychometric properties of task variants (e.g., discrimination, guessing) are similar or different (for examples, see Geerlings et al., 2011; Glas & van der Linden, 2003; Janssen, Tuerlinckx, Meulders, & De Boeck, 2000; Johnson & Sinharay, 2005; Sinharay et al., 2003). Still other possibilities for expanding the model may involve incorporating additional dimensions (see Raudenbush & Bryk, 2002).

As an empirical check, BFs were pursued as a method to flag variations of tasks as invariant, noninvariant, or inconclusive with respect to both invariance and noninvariance. For the traditional assessment, almost all of the OR-D and OR-E items either were identified as having strong support for invariance or yielded evidence that was not sufficiently strong to rule out invariance on the basis of the BFs associated with pairwise comparisons.

In contrast, the majority of the isomorphs were deemed as having strong support for noninvariance (i.e., 45 out of the 66 comparisons across the three groups). As for the PTSA, only one primary observable was found to be noninvariant; evidence in favor of either invariance or noninvariance was inconclusive for the remaining primary observables.

As noted, the BFs served the purpose of flagging tasks that may not be performing as expected. When the variable features used to create different instances of the tasks are anticipated to be incidental, the goal of the analysis is to flag tasks for which the variable features are behaving as radicals. This was the case for the PTSA, because the device networks were structurally isomorphic with respect to logical interconnections among network devices. In more exploratory settings, as was the case for the traditional assessment considered in this work, the goal was to investigate the types of changes that render tasks invariant or noninvariant. In the absence of previous research or strong theoretical support, our approach was to hypothesize invariance for all items on the traditional assessment. In doing so, the model-based evidence was used to guide SMEs to the relevant features that gave rise to invariance or noninvariance within item families.

For example, in one of the SME-I item families, we found that two of the items in the family were highly psychometrically isomorphic, while the third was quite different, being much more difficult. Follow-up discussions and investigations revealed that this was due to the third item involving computer IP addresses that end in the 70s, whereas for the other two items the IP addresses ended in the 60s. When these items were created, SMEs expected that this would not make a difference; that is, the magnitude of the IP address was thought to be an incidental feature. However, our follow-up investigations revealed that a key feature of the item involved having IP address numbers that were divisible by 4, and that it is harder to identify numbers in the 70s divisible by 4 (i.e., 72, 76) than it is to identify numbers in the 60s divisible by 4 (i.e., 60, 64, 68). Whether the IP addresses were in the 60s or the 70s was not an incidental feature, but a radical one. As a result of this work, SMEs gained a deeper understanding of the variable features that have bearing on item difficulty. Moreover, strengthening what SMEs know may lead to the generation of testable hypotheses or even motivate broader developments in theory about the cognitive underpinnings that yield differences (and similarities) in examinees' performance.

To complement the results of the BFs, graphical representations were constructed to visualize the extent to which tasks were (non-)invariant in a couple of ways. For most of the tasks evaluated in this work, the relative locations of task variants, as depicted via their ICCs, were consistent with the conclusions reached on the basis of the BFs. In this scenario, the amount of evidence in support of any one conclusion simply compounded. In other cases, the relative location of the ICCs supported the conclusion opposite to that suggested by the BF. One instance of this scenario occurred for the comparison between the first and second groups on Item 51 on the traditional assessment.
In this case, the BF = .30, but the ICCs may be deemed sufficiently close ($b_{51,\text{Group1}} = 1.21$, $b_{51,\text{Group2}} = 1.05$) to be viewed as psychometrically isomorphic, depending on the stakes of the assessment and the perspective of stakeholders.


More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Evaluating the quality of analytic ratings with Mokken scaling

Evaluating the quality of analytic ratings with Mokken scaling Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch

More information

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives DOI 10.1186/s12868-015-0228-5 BMC Neuroscience RESEARCH ARTICLE Open Access Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives Emmeke

More information

Turning Output of Item Response Theory Data Analysis into Graphs with R

Turning Output of Item Response Theory Data Analysis into Graphs with R Overview Turning Output of Item Response Theory Data Analysis into Graphs with R Motivation Importance of graphing data Graphical methods for item response theory Why R? Two examples Ching-Fan Sheu, Cheng-Te

More information

Convergence Principles: Information in the Answer

Convergence Principles: Information in the Answer Convergence Principles: Information in the Answer Sets of Some Multiple-Choice Intelligence Tests A. P. White and J. E. Zammarelli University of Durham It is hypothesized that some common multiplechoice

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

SUPPLEMENTAL MATERIAL

SUPPLEMENTAL MATERIAL 1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across

More information

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,

More information

Individual Differences in Attention During Category Learning

Individual Differences in Attention During Category Learning Individual Differences in Attention During Category Learning Michael D. Lee (mdlee@uci.edu) Department of Cognitive Sciences, 35 Social Sciences Plaza A University of California, Irvine, CA 92697-5 USA

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Journal of Educational Measurement Summer 2010, Vol. 47, No. 2, pp. 227 249 Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Jimmy de la Torre and Yuan Hong

More information

PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS

PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS Sankhyā : The Indian Journal of Statistics Special Issue on Biostatistics 2000, Volume 62, Series B, Pt. 1, pp. 149 161 PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS By DAVID T. MAUGER and VERNON M. CHINCHILLI

More information

Item Analysis: Classical and Beyond

Item Analysis: Classical and Beyond Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013 Why is item analysis relevant? Item analysis provides

More information

During the past century, mathematics

During the past century, mathematics An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first

More information

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN By FRANCISCO ANDRES JIMENEZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

A Multilevel Testlet Model for Dual Local Dependence

A Multilevel Testlet Model for Dual Local Dependence Journal of Educational Measurement Spring 2012, Vol. 49, No. 1, pp. 82 100 A Multilevel Testlet Model for Dual Local Dependence Hong Jiao University of Maryland Akihito Kamata University of Oregon Shudong

More information

MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS

MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS The purpose of this study was to create an instrument that measures middle grades

More information

Item Analysis Explanation

Item Analysis Explanation Item Analysis Explanation The item difficulty is the percentage of candidates who answered the question correctly. The recommended range for item difficulty set forth by CASTLE Worldwide, Inc., is between

More information

Group Assignment #1: Concept Explication. For each concept, ask and answer the questions before your literature search.

Group Assignment #1: Concept Explication. For each concept, ask and answer the questions before your literature search. Group Assignment #1: Concept Explication 1. Preliminary identification of the concept. Identify and name each concept your group is interested in examining. Questions to asked and answered: Is each concept

More information

Understanding Uncertainty in School League Tables*

Understanding Uncertainty in School League Tables* FISCAL STUDIES, vol. 32, no. 2, pp. 207 224 (2011) 0143-5671 Understanding Uncertainty in School League Tables* GEORGE LECKIE and HARVEY GOLDSTEIN Centre for Multilevel Modelling, University of Bristol

More information

Linking Assessments: Concept and History

Linking Assessments: Concept and History Linking Assessments: Concept and History Michael J. Kolen, University of Iowa In this article, the history of linking is summarized, and current linking frameworks that have been proposed are considered.

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy Industrial and Organizational Psychology, 3 (2010), 489 493. Copyright 2010 Society for Industrial and Organizational Psychology. 1754-9426/10 Issues That Should Not Be Overlooked in the Dominance Versus

More information

Exploring rater errors and systematic biases using adjacent-categories Mokken models

Exploring rater errors and systematic biases using adjacent-categories Mokken models Psychological Test and Assessment Modeling, Volume 59, 2017 (4), 493-515 Exploring rater errors and systematic biases using adjacent-categories Mokken models Stefanie A. Wind 1 & George Engelhard, Jr.

More information

Computer Adaptive-Attribute Testing

Computer Adaptive-Attribute Testing Zeitschrift M.J. für Psychologie Gierl& J. / Zhou: Journalof Computer Psychology 2008Adaptive-Attribute Hogrefe 2008; & Vol. Huber 216(1):29 39 Publishers Testing Computer Adaptive-Attribute Testing A

More information

Assessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures. Dubravka Svetina

Assessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures. Dubravka Svetina Assessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures by Dubravka Svetina A Dissertation Presented in Partial Fulfillment of the Requirements for

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

Having your cake and eating it too: multiple dimensions and a composite

Having your cake and eating it too: multiple dimensions and a composite Having your cake and eating it too: multiple dimensions and a composite Perman Gochyyev and Mark Wilson UC Berkeley BEAR Seminar October, 2018 outline Motivating example Different modeling approaches Composite

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

Re-Examining the Role of Individual Differences in Educational Assessment

Re-Examining the Role of Individual Differences in Educational Assessment Re-Examining the Role of Individual Differences in Educational Assesent Rebecca Kopriva David Wiley Phoebe Winter University of Maryland College Park Paper presented at the Annual Conference of the National

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models. Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human

More information

Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods

Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods Journal of Modern Applied Statistical Methods Volume 11 Issue 1 Article 14 5-1-2012 Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian

More information

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Research Report Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Xueli Xu Matthias von Davier April 2010 ETS RR-10-10 Listening. Learning. Leading. Linking Errors in Trend Estimation

More information

A Race Model of Perceptual Forced Choice Reaction Time

A Race Model of Perceptual Forced Choice Reaction Time A Race Model of Perceptual Forced Choice Reaction Time David E. Huber (dhuber@psyc.umd.edu) Department of Psychology, 1147 Biology/Psychology Building College Park, MD 2742 USA Denis Cousineau (Denis.Cousineau@UMontreal.CA)

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

RATER EFFECTS AND ALIGNMENT 1. Modeling Rater Effects in a Formative Mathematics Alignment Study

RATER EFFECTS AND ALIGNMENT 1. Modeling Rater Effects in a Formative Mathematics Alignment Study RATER EFFECTS AND ALIGNMENT 1 Modeling Rater Effects in a Formative Mathematics Alignment Study An integrated assessment system considers the alignment of both summative and formative assessments with

More information

Understanding and quantifying cognitive complexity level in mathematical problem solving items

Understanding and quantifying cognitive complexity level in mathematical problem solving items Psychology Science Quarterly, Volume 50, 2008 (3), pp. 328-344 Understanding and quantifying cognitive complexity level in mathematical problem solving items SUSN E. EMBRETSON 1 & ROBERT C. DNIEL bstract

More information

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning

Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning Joshua T. Abbott (joshua.abbott@berkeley.edu) Thomas L. Griffiths (tom griffiths@berkeley.edu) Department of Psychology,

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

Methodological Issues in Measuring the Development of Character

Methodological Issues in Measuring the Development of Character Methodological Issues in Measuring the Development of Character Noel A. Card Department of Human Development and Family Studies College of Liberal Arts and Sciences Supported by a grant from the John Templeton

More information

Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking

Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking Jee Seon Kim University of Wisconsin, Madison Paper presented at 2006 NCME Annual Meeting San Francisco, CA Correspondence

More information

Diagnostic Classification Models

Diagnostic Classification Models Diagnostic Classification Models Lecture #13 ICPSR Item Response Theory Workshop Lecture #13: 1of 86 Lecture Overview Key definitions Conceptual example Example uses of diagnostic models in education Classroom

More information

Cognitive Design Principles and the Successful Performer: A Study on Spatial Ability

Cognitive Design Principles and the Successful Performer: A Study on Spatial Ability Journal of Educational Measurement Spring 1996, Vol. 33, No. 1, pp. 29-39 Cognitive Design Principles and the Successful Performer: A Study on Spatial Ability Susan E. Embretson University of Kansas An

More information

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH By JANN MARIE WISE MACINNES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF

More information

Learning Deterministic Causal Networks from Observational Data

Learning Deterministic Causal Networks from Observational Data Carnegie Mellon University Research Showcase @ CMU Department of Psychology Dietrich College of Humanities and Social Sciences 8-22 Learning Deterministic Causal Networks from Observational Data Ben Deverett

More information

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data 1. Purpose of data collection...................................................... 2 2. Samples and populations.......................................................

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

Does factor indeterminacy matter in multi-dimensional item response theory?

Does factor indeterminacy matter in multi-dimensional item response theory? ABSTRACT Paper 957-2017 Does factor indeterminacy matter in multi-dimensional item response theory? Chong Ho Yu, Ph.D., Azusa Pacific University This paper aims to illustrate proper applications of multi-dimensional

More information

Selection of Linking Items

Selection of Linking Items Selection of Linking Items Subset of items that maximally reflect the scale information function Denote the scale information as Linear programming solver (in R, lp_solve 5.5) min(y) Subject to θ, θs,

More information

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08 Auditory Perception - Detection versus Discrimination - Localization versus Discrimination - - Electrophysiological Measurements Psychophysical Measurements Three Approaches to Researching Audition physiology

More information

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08 Auditory Perception - Detection versus Discrimination - Localization versus Discrimination - Electrophysiological Measurements - Psychophysical Measurements 1 Three Approaches to Researching Audition physiology

More information

Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models. Xiaowen Liu Eric Loken University of Connecticut

Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models. Xiaowen Liu Eric Loken University of Connecticut Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models Xiaowen Liu Eric Loken University of Connecticut 1 Overview Force Concept Inventory Bayesian implementation of one-

More information

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns The Influence of Test Characteristics on the Detection of Aberrant Response Patterns Steven P. Reise University of California, Riverside Allan M. Due University of Minnesota Statistical methods to assess

More information

Item Selection in Polytomous CAT

Item Selection in Polytomous CAT Item Selection in Polytomous CAT Bernard P. Veldkamp* Department of Educational Measurement and Data-Analysis, University of Twente, P.O.Box 217, 7500 AE Enschede, The etherlands 6XPPDU\,QSRO\WRPRXV&$7LWHPVFDQEHVHOHFWHGXVLQJ)LVKHU,QIRUPDWLRQ

More information

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015 This report describes the procedures used in obtaining parameter estimates for items appearing on the 2014-2015 Smarter Balanced Assessment Consortium (SBAC) summative paper-pencil forms. Among the items

More information

Building Evaluation Scales for NLP using Item Response Theory

Building Evaluation Scales for NLP using Item Response Theory Building Evaluation Scales for NLP using Item Response Theory John Lalor CICS, UMass Amherst Joint work with Hao Wu (BC) and Hong Yu (UMMS) Motivation Evaluation metrics for NLP have been mostly unchanged

More information

Using Bayesian Decision Theory to

Using Bayesian Decision Theory to Using Bayesian Decision Theory to Design a Computerized Mastery Test Charles Lewis and Kathleen Sheehan Educational Testing Service A theoretical framework for mastery testing based on item response theory

More information

Section 6: Analysing Relationships Between Variables

Section 6: Analysing Relationships Between Variables 6. 1 Analysing Relationships Between Variables Section 6: Analysing Relationships Between Variables Choosing a Technique The Crosstabs Procedure The Chi Square Test The Means Procedure The Correlations

More information

Measuring noncompliance in insurance benefit regulations with randomized response methods for multiple items

Measuring noncompliance in insurance benefit regulations with randomized response methods for multiple items Measuring noncompliance in insurance benefit regulations with randomized response methods for multiple items Ulf Böckenholt 1 and Peter G.M. van der Heijden 2 1 Faculty of Management, McGill University,

More information

CHAPTER 3 RESEARCH METHODOLOGY

CHAPTER 3 RESEARCH METHODOLOGY CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement

More information

VERDIN MANUSCRIPT REVIEW HISTORY REVISION NOTES FROM AUTHORS (ROUND 2)

VERDIN MANUSCRIPT REVIEW HISTORY REVISION NOTES FROM AUTHORS (ROUND 2) 1 VERDIN MANUSCRIPT REVIEW HISTORY REVISION NOTES FROM AUTHORS (ROUND 2) Thank you for providing us with the opportunity to revise our paper. We have revised the manuscript according to the editors and

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

A Memory Model for Decision Processes in Pigeons

A Memory Model for Decision Processes in Pigeons From M. L. Commons, R.J. Herrnstein, & A.R. Wagner (Eds.). 1983. Quantitative Analyses of Behavior: Discrimination Processes. Cambridge, MA: Ballinger (Vol. IV, Chapter 1, pages 3-19). A Memory Model for

More information