Investigating Psychometric Isomorphism for Traditional and Performance-Based Assessment


Journal of Educational Measurement
Spring 2018, Vol. 55, No. 1

Investigating Psychometric Isomorphism for Traditional and Performance-Based Assessment

Derek M. Fay, Independent Researcher
Roy Levy, Arizona State University
Vandhana Mehta, Independent Researcher

A common practice in educational assessment is to construct multiple forms of an assessment that consist of tasks with similar psychometric properties. This study utilizes a Bayesian multilevel item response model and descriptive graphical representations to evaluate the psychometric similarity of variations of the same task. These approaches for describing the psychometric similarity of task variants were applied to two different types of assessments (one traditional assessment and one performance-based assessment) with markedly different response formats. Due to the general nature of the multilevel item response model and graphical approaches that were utilized, the methods used for this work can readily be applied to many assessment contexts for the purposes of evaluating the psychometric similarity of tasks.

A common practice in educational assessment is to construct multiple forms of a test, such that the forms differ in that they contain at least some different tasks, but at the same time the resulting student performances are of evidentiary comparability. One aspect of this notion of evidentiary comparability is the difficulty of the forms. If one form is easier than another, the resulting scores may not be evidentiarily comparable. In such cases, equating processes may be needed to yield transformed scores that eradicate the effects of differences in difficulty (Kolen & Brennan, 2014). However, equal difficulty is not sufficient to ensure evidentiary comparability. In addition to other psychometric characteristics (e.g., discrimination), other considerations include the comparability of the content and cognitive demands, which are often operationalized via tables of specifications that specify the weights assigned to different content areas and cognitive skills relevant in the domain. By designing forms that adhere to these specifications, the forms will align with the purposes of the assessment and each other.

A burgeoning area of research involves the principled design of tasks that may be seen as targeting the goal of creating tasks that are different but evidentiarily comparable, which has emerged from a lineage of research on the intersection of task design, creation, and theories of cognition (Bejar, 1993; Bejar et al., 2003; Embretson, 1998, 1999; Gierl & Haladyna, 2013; Guttman, 1969; Hively, Patterson, & Page, 1968; Irvine & Kyllonen, 2002; LaDuca, Staples, Templeton, & Holtzman, 1986; Mislevy, Almond, & Steinberg, 2002). Such task generation may be automated (Gierl & Haladyna, 2013), though it need not be, as in the case of the contexts studied in this work.

As typically implemented, these approaches guide the construction of many instances of a task from a well-defined task template. The task template defines which structural features of a task remain unchanged as well as the structural features that are altered to produce instances of the task. Any two instances of a task derived from the same task template are referred to as structural isomorphs. A task is deemed psychometrically isomorphic if the various instances of the task do not meaningfully differ with respect to psychometric properties (e.g., difficulty, discrimination, guessing). Importantly, tasks that are structurally isomorphic need not be psychometrically isomorphic, and vice versa.

Table 1 lays out the possibilities. Starting with the upper left cell, tasks that are not structurally isomorphic might not be psychometrically isomorphic. This is the usual case where different items (i.e., those that are structurally different in terms of their design, such as the aspect of the content that they measure) are also psychometrically different. This is well understood in assessment and test assembly: different items may differ in terms of their difficulty (or discrimination, or how subject they are to guessing). Turning to the bottom left cell, items that are not structurally isomorphic may indeed be psychometrically isomorphic. This is the case where different items have the same psychometric properties. Turning to the bottom right cell, structurally isomorphic items may be psychometrically isomorphic. As a simple example, an item is structurally isomorphic with itself, and when placed on multiple forms at the same time is typically also psychometrically isomorphic with itself; such an assumption underlies common-item equating designs (Kolen & Brennan, 2014).

Table 1
Possible Combinations of Structurally and Psychometrically Isomorphic Tasks

Not structurally isomorphic, not psychometrically isomorphic. Example: different items with different psychometric properties.

Not structurally isomorphic, psychometrically isomorphic. Example: different items with the same psychometric properties.

Structurally isomorphic, not psychometrically isomorphic. Examples: instances from the same task template with different psychometric properties (e.g., a family of items created to assess learning on the same construct, but the items do not have similar item difficulty values; an item and itself on a later form if item drift has occurred).

Structurally isomorphic, psychometrically isomorphic. Examples: instances from the same task template with the same psychometric properties (e.g., a family of items created to assess learning on the same construct and the items also provide similar item difficulty values; an item and itself on another form).

Finally, turning to the upper right cell, structurally isomorphic items need not be psychometrically isomorphic. An extreme example of this is item drift, where the same item appears on multiple forms across time points, but its psychometric properties depend on when the item was administered. We note that though Table 1 lays out possibilities in terms of categories, we can conceive of isomorphism as a matter of degree. For example, two tasks may be highly psychometrically isomorphic if their psychometric properties, though not identical, are nearly the same.

Different instances of the same task template are structurally isomorphic, and prior to piloting it is an open question as to whether the instances are psychometrically isomorphic. In some cases, the desire for different instances to produce a psychometrically isomorphic task is a matter of design. Some situations, such as producing equated test forms, call for psychometrically isomorphic tasks. Other situations, such as testing cognitive theories (Gorin & Embretson, 2013), call for similarly structured tasks with different psychometric properties. A task feature is referred to as an incidental feature if, when varied, the resulting structural isomorphs have similar psychometric properties. In contrast, radicals refer to task features that, when varied, produce tasks with meaningful psychometric differences (Bejar, 1993; Gorin & Embretson, 2013), thereby producing tasks that are not psychometrically isomorphic. The term noninvariant will be used interchangeably to refer to such tasks.

As more fully described below, the current work is concerned with investigating whether structurally isomorphic tasks in fact exhibit psychometric isomorphism. The generation of tasks for this work was not computerized, but done by subject matter experts (SMEs) for two different types of assessments. The design process was aligned with Bejar et al.'s (2003) weak theory of design, in which an existing task served as the basis for a task model. We refer to instances of a task created from the same model as members of the same task family (e.g., Sinharay, Johnson, & Williamson, 2003). Owing to the design process, most of the generated tasks are best characterized as structurally isomorphic. Details on this are given in a later section that describes the assessments that form the context of our work. The hope is that all structurally isomorphic tasks are also highly psychometrically isomorphic. The goal of this work is to investigate whether this is indeed the case in two different assessment scenarios, and to advance methods for doing so. The next section reviews existing methods that have been proposed for investigating psychometric isomorphism and advances multiple alternative methods that have advantages over existing methods.

Data Analyses to Investigate Psychometric Isomorphism

We briefly review three approaches for modeling item families in the context of the one-parameter logistic (1PL) or Rasch item response theory (IRT) model (Sinharay et al., 2003). The unrelated siblings model assumes independent item response functions for all tasks, essentially ignoring family membership. The model may be expressed mathematically as

$$P(Y_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}, \qquad (1)$$

where $P(Y_{ij} = 1 \mid \theta_i, b_j)$ is the probability of success on task j for examinee i, $\theta_i$ is the latent variable for examinee i, and $b_j$ is the difficulty of task j. In this and the remaining models, the indeterminacy in the latent variable may be resolved by setting the mean of the $\theta$s to 0.
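To make Equation 1 concrete, here is a minimal sketch of the Rasch item response function in R, the software environment used for estimation later in this article; the function name and example values are illustrative only.

```r
# Rasch (1PL) item response function from Equation 1:
# probability of success given proficiency theta and difficulty b.
rasch_irf <- function(theta, b) {
  exp(theta - b) / (1 + exp(theta - b))
}

# Illustrative call: an examinee at theta = 0.5 facing a task with b = -0.25.
rasch_irf(theta = 0.5, b = -0.25)  # approximately 0.68
```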

Sinharay et al. (2003) noted that this model is limited in that it ignores the relationship among the members of the task family, requiring larger sample sizes for calibration and yielding inflated standard errors. To this we add that this model does not naturally support what we desire and pursue in this work, namely, ways to characterize the psychometric isomorphism among members of the same task family.

The identical siblings model assumes that all the tasks in the same family have the same item response function. This model may be expressed mathematically as

$$P(Y_{ij} = 1 \mid \theta_i, b_{jg}) = \frac{\exp(\theta_i - b_{jg})}{1 + \exp(\theta_i - b_{jg})}, \qquad (2)$$

where $b_{jg}$ is the difficulty of task j that is a member of task family g, and all such $b_{jg}$s for task family g are assumed equal. This model is one of total psychometric isomorphism among all members of the task family. Sinharay et al. (2003) noted that this model is limited in that it ignores the variability among the members of the task family, providing incorrect estimates of the task parameters in the presence of such variability. To this, we add that this model does not by itself provide a way to characterize the psychometric isomorphism among members of the same task family. As reviewed below, it may be used in conjunction with the model described next to assess psychometric isomorphism to some degree, though in ways that are somewhat limited.

The related siblings model departs from the identical siblings model by not assuming that all the tasks in the same family have the same item response function, allowing for departures from exact psychometric isomorphism. This model may be expressed as a hierarchical model, where the first component is given by Equation 2. The hierarchical component specifies a distribution to relate the parameters for tasks from the same family,

$$b_{jg} \sim N(\mu_{b_g}, \sigma^2_{b_g}). \qquad (3)$$

This expresses that each task-specific parameter $b_{jg}$ is modeled as varying around a family-specific mean ($\mu_{b_g}$). Note that the unrelated siblings and identical siblings models are limiting cases of the related siblings model. The unrelated siblings model results from $\sigma^2_{b_g}$ approaching infinity, and the identical siblings model results from $\sigma^2_{b_g}$ approaching 0.

These models for task families have been extended and employed in the contexts of dichotomous IRT models (Glas & van der Linden, 2003; Lathrop & Cheng, 2017; Sinharay et al., 2003), polytomous IRT models (Johnson & Sinharay, 2005), and models with covariates (Cho, de Boeck, Embretson, & Rabe-Hesketh, 2014; Geerlings, Glas, & van der Linden, 2011; Lathrop & Cheng, 2017). However, guidelines for evaluating the results of those models to characterize whether the tasks are psychometrically isomorphic are relatively underdeveloped.
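The hierarchical structure in Equations 2 and 3 can be illustrated with a short simulation in R; all names and parameter values here are illustrative assumptions, not quantities from the study.

```r
set.seed(123)

# Simulate one task family under the related siblings model (Equations 2-3).
n_members   <- 3      # versions of the task in the family
n_examinees <- 1500

mu_b    <- 0.4        # family-specific mean difficulty
sigma_b <- 0.2        # within-family SD; sigma_b -> 0 recovers the identical
                      # siblings model, sigma_b -> Inf the unrelated siblings model

b_jg  <- rnorm(n_members, mean = mu_b, sd = sigma_b)  # member difficulties
theta <- rnorm(n_examinees, mean = 0, sd = 1)         # latent proficiencies

# Random assignment of one member per examinee defines the examinee groups.
g <- sample(n_members, n_examinees, replace = TRUE)
y <- rbinom(n_examinees, size = 1, prob = plogis(theta - b_jg[g]))
```

Here plogis() is the logistic function appearing on the right-hand side of Equation 2.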

Johnson and Sinharay (2005) advocated computing Bayes factors (BFs) to compare the unrelated siblings and related siblings models, which characterizes the amount of evidence in favor of the related siblings model, which contains a family structure, as opposed to the unrelated siblings model, which does not. This addresses whether there is evidence of any psychometric isomorphism, as opposed to no psychometric isomorphism. The focus of the current work is on a different question, namely, to characterize the amount of the assumed-to-be-present psychometric isomorphism. One potential avenue would pursue a comparison of the related siblings and the identical siblings models through BFs or other approaches to model comparison, such as the use of information criteria (Geerlings et al., 2011). One drawback of these approaches is that they require the fitting of multiple models. Furthermore, if the related siblings model is supported over the identical siblings model, the question still remains as to whether a subset of the task families might exhibit a high level of, or even exact, psychometric isomorphism. To target each task family, we could also compare a related siblings model to a partially constrained version of the related siblings model that specifies one task family as following the identical siblings model and the remaining families as following the related siblings model. This would require fitting as many models as there are task families.

This work therefore develops and advances a series of statistical and graphical procedures aimed at characterizing the psychometric isomorphism among members of a task family. There are several benefits to our approach, which we illustrate through two different assessment contexts. First, our approach does not require multiple models to be fit; all of the inference is done within the context of fitting the related siblings model. Second, our approach allows for a separate characterization of the psychometric isomorphism for each of the task families. Third, as we illustrate with our examples, our approach can be instantiated differently depending on the number of members of the task family. Fourth, we employ graphical approaches using the results of the IRT model, as well as classical test theoretic approaches to investigate isomorphism with respect to distractor selection in multiple-choice tasks. As our examples illustrate, this can directly lead to deeper substantive understandings of the reasons behind a lack of psychometric isomorphism.

To pursue our ends of characterizing psychometric isomorphism based on fitting the related siblings model, we leverage statistical testing procedures that have been developed for related multiple-group models (Verhagen, Levy, Millsap, & Fox, 2016). In essence, we treat the examinees who see different members of the task family as defining different groups of examinees. In this context, we might rewrite the model in Equation 2 as

$$P(Y_{ij} = 1 \mid \theta_{ig}, b_{jg}) = \frac{\exp(\theta_{ig} - b_{jg})}{1 + \exp(\theta_{ig} - b_{jg})}, \qquad (4)$$

with the additional subscripting of $\theta$ by g to indicate that each examinee is now framed as a member of a group. In multiple-group models, care must be taken to link the group-specific metrics of the latent variable. This may be done by constraining the task parameters for a given task family to be equal across groups. Alternatively, the group-specific means and variances of the latent variable may be constrained to be equal. Note that the related siblings model as typically formulated (e.g., Sinharay et al., 2003) and described above takes this latter approach, in which case the additional subscripting in Equation 4 is not present in Equation 2.

We follow in this tradition and assume the group-specific means and variances of the latent variable are equal across groups. This is warranted by random assignment of examinees to groups, which in the current context is accomplished by randomly assigning which member of each task family is presented to each examinee.

The model affords two possibilities for investigating psychometric isomorphism. The first involves comparing the difficulty parameters for all the members of the same family (i.e., the $b_{jg}$s for task family g) (Verhagen et al., 2016). This approach is practical when the number of members in the family is small. In contrast, when the number of members is large, the number of direct comparisons becomes problematic. In this situation, we can turn to the variance of task difficulties across groups ($\sigma^2_{b_g}$) as a summary of the degree to which tasks in the family are psychometrically isomorphic (Verhagen & Fox, 2013). If $\sigma^2_{b_g} = 0$, the difficulties for the members of the task family are all identical. In contrast, large values of $\sigma^2_{b_g}$ reflect some differentiation in task difficulty between at least two members. We treat each of these situations in turn.

In the first situation, we seek to compare the difficulty parameters for all the members of the same family. Here, BFs (Kass & Raftery, 1995) can be computed using all of the information contained in the marginal posterior distributions for all versions of a task. Let $d = b_{jg} - b_{jg^*}$ represent the difference in difficulty for item j for any two groups g and g* (g ≠ g*). We can consider the posterior distribution of d to characterize the difference. Further, under the hypothesis that d = 0, the BF is given as the ratio of the density for d = 0 under the posterior, $P(d = 0 \mid H_1, Y)$, to that under the prior, $P(d = 0 \mid H_1)$, as follows (Verhagen et al., 2016):

$$BF = \frac{P(d = 0 \mid H_1, Y)}{P(d = 0 \mid H_1)}. \qquad (5)$$

As operationalized for this work, BFs in Equation 5 capture the amount of evidence in favor of the null hypothesis that the difficulty parameters are invariant (i.e., the tasks are psychometrically isomorphic). Drawing from Jeffreys's (1961) general recommendations, substantial support for the hypothesis of invariance of the difficulty parameters obtains for BFs of 3 or larger (Verhagen et al., 2016). Conversely, BFs of .33 or smaller constitute substantial support for the hypothesis of noninvariance (i.e., differences) among the difficulty parameters. BFs in between these values are interpreted as being inconclusive about the invariance among the difficulty parameters. These interpretations are also proposed for the BFs used in the second situation, defined next.
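Before turning to the second situation, here is a sketch in R of how the ratio in Equation 5 can be computed from MCMC output via the Savage-Dickey density ratio. The prior on d is assumed to be normal and centered at 0 purely for illustration; in practice its form follows from the priors placed on the difficulty parameters.

```r
# Savage-Dickey sketch for Equation 5: compare the posterior and prior
# density of the difficulty difference d = b_jg - b_jg* at d = 0.
savage_dickey_bf <- function(d_draws, prior_sd = 1) {
  dens       <- density(d_draws)                    # kernel density of posterior draws
  post_at_0  <- approx(dens$x, dens$y, xout = 0)$y  # posterior density at d = 0
  prior_at_0 <- dnorm(0, mean = 0, sd = prior_sd)   # assumed N(0, prior_sd^2) prior
  post_at_0 / prior_at_0                            # BF >= 3 supports invariance
}

# Usage with hypothetical draws of the two group-specific difficulties:
# d_draws <- b_draws_form1 - b_draws_form2
# savage_dickey_bf(d_draws)
```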

In the second situation, we seek to evaluate the variance of task difficulties across groups ($\sigma^2_{b_g}$) to summarize the degree to which the members exhibit psychometric isomorphism (i.e., $\sigma^2_{b_g} = 0$) (Verhagen & Fox, 2013). Unlike the difference among difficulty parameters across groups, zero is not in the interior of the parameter space for variance terms. Accordingly, the hypothesis of essential invariance was investigated by specifying a small prior variance (denoted $\sigma^{2*}$) against which to compare the posterior draws (Klugkist, 2008; Klugkist & Hoijtink, 2007). The BF to evaluate the hypothesis that the tasks in a family exhibit essential invariance is given by

$$BF = \frac{P(\sigma^2_{b_g} < \sigma^{2*} \mid H_1, Y)}{P(\sigma^2_{b_g} < \sigma^{2*} \mid H_1)}, \qquad (6)$$

where $\sigma^2_{b_g}$ is the variance of the $b_{jg}$s for the tasks in family g. The BFs in Equation 6 capture the amount of evidence in favor of the null hypothesis that the difficulty parameters do not vary (i.e., the tasks are psychometrically isomorphic), operationalized by having their variance be lower than some threshold $\sigma^{2*}$. When the BF in Equation 6 flags items in a family as being noninvariant, this supports the conclusion that some of the items' parameters differ, but does not speak to which ones. In this case, the BF effectively serves as an omnibus assessment that calls for further investigations to identify the likely culprits. Techniques for doing such follow-up investigations using item characteristic curves (ICCs) are described below.

Notably, the scale of $\sigma^{2*}$ renders it difficult to declare a value that constitutes meaningful variation among the difficulty parameters. Importantly, finding evidence of invariance (noninvariance) becomes easier with larger (smaller) values of $\sigma^{2*}$. Selecting a particular value of $\sigma^{2*}$ to serve as a threshold for declaring invariance requires more subjectivity than is warranted given the exploratory nature of this work. However, it is easy to compute the BF for different choices of $\sigma^{2*}$ (Verhagen & Fox, 2013), which we do in the current work to explore the magnitude of $\sigma^{2*}$ that was appropriate to find strong support for the invariance hypothesis.

To complement the results from the related siblings model and the computation of BFs, a number of graphical representations were pursued to directly compare the ICCs associated with tasks hypothesized to be isomorphic. The ICCs for variations on the same tasks were overlaid onto a single plot. Evidence that the tasks were psychometrically isomorphic obtained to the extent that the ICCs among variations of the task were indistinguishable. Evidence against the hypothesis that the tasks were psychometrically isomorphic obtained to the extent that the ICCs among variations were distinguishable.

In addition to the comparisons of ICCs, additional analyses for the traditional assessment comprised of multiple-choice tasks (described more fully below) included comparing the selection of response options. Using the total score across items as a proxy for examinee ability, three equally sized ability groups (low ability, average ability, high ability) were created. The proportion of examinees that selected each response option was computed for each ability group. Looking across ability groups for one member of a task family, it was expected that the proportion of examinees selecting the correct response would be higher for more proficient examinees. Looking within ability groups across different members of a task family, evidence of psychometric isomorphism obtained to the extent that the patterns of response selection were similar for examinees with similar levels of ability. Hence, when task families are classified into the three groups (i.e., Non-Invariant, Inconclusive, Invariant), further investigation is performed on items in the Inconclusive group by comparing the ICCs and the proportion of examinees that selected each response option, which helps to determine whether the tasks are psychometrically isomorphic.
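As a concrete illustration of Equation 6, and of computing the BF over a grid of thresholds as just described, the following R sketch estimates both probabilities by Monte Carlo. The Inv-Gamma(.01, .01) prior mirrors the hyperprior reported in the Methods section; all object names are illustrative.

```r
set.seed(42)

# Prior draws for sigma^2_bg: if tau ~ Gamma(shape, rate), then 1/tau is
# inverse-gamma with the same shape and scale parameters.
prior_draws <- 1 / rgamma(1e6, shape = 0.01, rate = 0.01)

# Equation 6: ratio of posterior to prior probability that the within-family
# variance falls below the threshold sigma2_star.
bf_variance <- function(sigma2_draws, sigma2_star, prior_draws) {
  mean(sigma2_draws < sigma2_star) / mean(prior_draws < sigma2_star)
}

# Sensitivity over a grid of thresholds (the grid used later in the article),
# given hypothetical posterior draws sigma2_draws for one task family:
# sapply(c(.05, .10, .15, .20, .25),
#        function(t) bf_variance(sigma2_draws, t, prior_draws))
```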

Methods

Description of the Assessments

In the following subsections, we briefly describe the two assessment contexts in which our work takes place, highlighting features that pertain to the desire for structurally and psychometrically isomorphic tasks.

Traditional assessment. In the first context, we investigated final exam forms from a summative, knowledge-based assessment for an introductory course in computer networking that is provided to the global audience of students participating in the Cisco Networking Academy Program. Established in 1997, the program consists of 10,000 participating academies in 165 countries. The final exam forms under study come from the first course in a sequence of four courses that serve as preparation for the Cisco Certified Network Associate Routing and Switching certification exam. As it pertains to this assessment, isomorphic tasks (which appear on different forms as described below) were created to improve test security for a globally distributed assessment and thereby facilitate the practical goal of enhancing fairness to students. The final exams are typically given at the end of each course and are provided online through the Cisco Networking Academy web site. The exam forms investigated in this study consist of 60 tasks; the tasks that appear on the final exam forms will herein be referred to as items. The majority of the items on the assessment are multiple-choice single-answer or multiple-answer items. In addition, the forms include several other item types, such as fill-in-the-blank items, drag-and-drop items, and items with exhibits that show a network topology, display device output, or utilize small simulated computer networks through Cisco's Packet Tracer tool that serve as interactive exhibits.

Within each form, there were three types of items with respect to isomorphism. The first type included 17 items that were identical with respect to the stimulus, the response options, and the order of response options across the three forms of the assessment (i.e., these were the same item, in the usual sense, across the three forms). This set of 17 items will herein be referred to as items with option randomization disabled (OR-D). The second type included 14 items that were invariant across the three forms with respect to the stimulus and the particular set of response options. Although the response options were identical across forms, the order of the response options was randomized on an examinee-to-examinee basis; these items will herein be referred to as items with option randomization enabled (OR-E). The third type included 22 item families with 3 members each that were judged to be to some degree isomorphic by SMEs; these are referred to as SME-I items. Among these 22 families, there was some variability with respect to the degree of isomorphism from the perspective of SMEs. Some items were developed from a template with variable features being substitutions that were not believed to have any bearing on the psychometric properties of the items; these items were indeed characterized as structural isomorphs. For other items, the instances differed with respect to the difficulty of the language used in the stem and/or set of response options. Owing to this research serving as the initial exploration of the psychometric comparability of these items, the analytical goal was to investigate which features resulted in items being psychometrically comparable or incomparable.

Performance-based assessment. The second assessment is a simulation-based task in which examinees must configure and troubleshoot a simulated network. These assessments are known as Packet Tracer Skills Assessments (PTSAs). Much of the assessment interaction with the devices is through a high-verisimilitude command line interface (see Rupp et al., 2012, for further descriptions of PTSAs). The work products produced by examinees include (a) a log of commands entered in the command line window to diagnose and configure the devices in the network and (b) the final state of the network. The work product features that were identified by SMEs as evidence of examinees' proficiency ranged from the inclusion of commands that represent best practices to observable features of the final state of the network and the devices within it. For the PTSA under consideration in this study, 57 work product features (referred to as primary observables herein) deemed by SMEs as evidence of examinees' proficiency were dichotomously scored such that 1 denotes success and 0 otherwise.

The versions of the PTSA considered in this work were designed to be structurally isomorphic. The features that were manipulated to produce 12 versions of the PTSA included four versions of a topological structure (see Figure 1) and three sets of device labels (see Table 2). For the purposes of this work, the topological structure refers to the logical interconnection among devices; the presentation of the topological structures may vary the spatial location among devices but not the logical interconnection among devices. The device labels simply corresponded to the names that were assigned to each device within the network.

Figure 1. Four topological structures of device networks that served as structurally isomorphic tasks on the Packet Tracer Skills Assessment.

Table 2
Device Labels for the Three Label Sets Used to Produce Structural Isomorphs for the PTSA

Device     Label Set 1             Label Set 2           Label Set 3
Router     Town Hall               Building 1            CS Department
Switch 1   IT Department Switch    First Floor Switch    LAB 124-C Switch
Switch 2   Administration Switch   Second Floor Switch   LAB 214-A Switch
LAN 1      IT Department LAN       First Floor LAN       LAB 124-C LAN
LAN 2      Administration LAN      Second Floor LAN      LAB 214-A LAN
PC 1       Reception Host          Host                  Host
PC 2       Operator Host           Host                  Host
PC 3       IT Host                 Host                  Host

Description of Data Used for Analysis

Traditional assessment. The data for the traditional assessment comprised N = 5,511 examinees who were randomly assigned to one of three forms. For all students, only their first attempt on the exam was retained. There were two types of missing data. The first type occurred if an examinee was not administered the item. These instances of missing data are missing at random by design, and accordingly could be ignored for subsequent data analyses (Enders, 2010). The second type occurred if an examinee was presented the item but did not provide a response. SMEs indicated that failures to provide a response might be due to any of a number of reasons, including those unrelated to proficiency, particularly for examinees with a large amount of missing data. Records for examinees that provided more than 50 responses (out of 60 possible) were employed for subsequent analyses. The final sample consisted of N = 5,425 examinees' responses, with $n_1$ = 1,779, $n_2$ = 1,851, and $n_3$ = 1,795 examinees receiving the first, second, and third form, respectively. For these examinees, all instances of missing data were scored as an unsuccessful attempt at the task.

PTSA assessment. Examinees were randomly assigned to 1 of the 12 versions of the PTSA task. The analytic sample for this work consists of scores on J = 57 primary observables from 801 examinees whose activity was deemed indicative of a motivated student. Table 3 shows the number of examinees that were retained for each structurally isomorphic version of the PTSA. Investigations of the distributions of time spent and the number of commands used (not shown due to space considerations) suggested that each of the 12 groups of examinees, as defined by which version of the PTSA task they received, were fairly similar in these metrics.

Table 3
Number of Examinees for Each Structural Isomorph of the PTSA Defined by the Combination of Label Set and Topological Structure

Analyses to Investigate Psychometric Isomorphism

Traditional assessment. We fit the related siblings model as given in Equations 4 and 3, where the three test forms define the three groups of examinees. The multilevel component of the model for the difficulty parameters is operationalized via diffuse hyperpriors on the family-specific parameters on the right-hand side of Equation 3 (Sinharay et al., 2003; see also Fox, 2010, and Levy & Mislevy, 2016, for Bayesian IRT modeling via multilevel structures): $\mu_{b_g} \sim N(0, 100)$ and $\sigma^2_{b_g} \sim \text{Inv-Gamma}(.01, .01)$. As discussed above, we assume the groups of examinees defined by which form they were presented are randomly equivalent, as warranted by the random assignment of forms to examinees. The implication is that, in the multiple-group formulation of the model, each group was modeled as having the same mean of the latent variable and the same variance of the latent variable. The distribution of the latent variables for examinees was $\theta_{ig} \sim N(0, \sigma^2_\theta)$, where $\sigma^2_\theta \sim \text{Inv-Gamma}(.01, .01)$.

As the task families are small, having three members each, we employed the approach to investigating psychometric isomorphism of examining the differences in difficulty parameters directly. For the three comparisons that were possible (i.e., Form 1 versus Form 2, Form 1 versus Form 3, and Form 2 versus Form 3), a BF was computed for each item using Equation 5. To complement the BFs, several additional analyses were also conducted. As a proxy for the difficulty of each version of a task, the arithmetic average of the MCMC draws for b used to construct the marginal posterior distribution was computed. Using this point estimate, the ICCs and 95% intervals for those ICCs for variations on the same tasks were overlaid onto a single plot and compared (Wickham, 2009). Finally, we employed classical test theoretic approaches to investigate isomorphism with respect to distractor selection in multiple-choice tasks, as described above.

PTSA assessment. For the PTSA, we fitted the related siblings model in the same fashion as just described for the traditional assessment. As the task families are large, having 12 members each, we employed the approach to investigating psychometric isomorphism of examining the variance in difficulty parameters, computing the BFs in Equation 6.
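For concreteness, the related siblings model with these hyperpriors might be specified in JAGS along the following lines. This is a sketch under assumed data structures (a complete N x J response matrix y and a group-membership vector group, with each item j treated as a family with G group-specific versions), not the authors' actual code.

```r
# JAGS sketch of the multiple-group related siblings model (Equations 3 and 4).
related_siblings_model <- "
model {
  for (i in 1:N) {
    for (j in 1:J) {
      # Rasch IRF; the difficulty used depends on the examinee's group.
      logit(p[i, j]) <- theta[i] - b[j, group[i]]
      y[i, j] ~ dbern(p[i, j])
    }
    theta[i] ~ dnorm(0, inv.sigma2.theta)       # common latent distribution
  }
  for (j in 1:J) {
    for (g in 1:G) {
      b[j, g] ~ dnorm(mu.b[j], inv.sigma2.b[j])  # Equation 3
    }
    mu.b[j] ~ dnorm(0, 0.01)                 # N(0, 100); JAGS uses precisions
    inv.sigma2.b[j] ~ dgamma(0.01, 0.01)     # so sigma2.b ~ Inv-Gamma(.01, .01)
    sigma2.b[j] <- 1 / inv.sigma2.b[j]
  }
  inv.sigma2.theta ~ dgamma(0.01, 0.01)      # sigma2.theta ~ Inv-Gamma(.01, .01)
}
"
```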

To explore the sensitivity to different threshold values of $\sigma^{2*}$, we computed BFs for five values of $\sigma^{2*}$ (.05, .10, .15, .20, .25) to explore the magnitude of $\sigma^{2*}$ that was required to find strong support for the invariance hypothesis across the majority of the primary observables. We then also plotted the ICCs in the same fashion as was done for the traditional assessment.

Model fitting. Using JAGS (Plummer, 2011) and the R package rjags (Plummer, 2013), a fully Bayesian approach to estimation was used to obtain the posterior distribution; all features of estimation described in this section apply to both assessments. After a burn-in period of 500 iterations, two Markov chains were run for 5,000 iterations with a thinning interval of 10 iterations, yielding 1,000 iterations to represent the posterior.

Fit of the model. For both the traditional assessment and the PTSA, we conducted posterior predictive checks of the related siblings model (Sinharay, 2005). As our focus is on the difficulty of the tasks, we pursued fit analyses that focus on the extent to which the related siblings model adequately models the difficulties of the items in the traditional assessment and the primary observables in the PTSA, using proportion correct as a test statistic. For both assessments, the results (not shown due to space considerations) indicated that the related siblings model does an adequate job of accounting for the observed proportions correct, lending support to the claim that it is modeling the difficulty of the tasks quite well. We also examined measures of local dependence (Levy, Mislevy, & Sinharay, 2009), and found the model suffered with respect to these properties in several places. As our focus was on psychometric isomorphism with respect to difficulty, and our checks indicated the model performs well with respect to accounting for the observed difficulty, we were comfortable in the use of the model for our current purposes. Future work could extend the related siblings model and our procedures to multidimensional models or other ways to model local dependence, as may be warranted by extending the examination of isomorphism to focus on other psychometric features.
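The estimation settings described under Model fitting might be implemented with rjags along these lines; a sketch that assumes the model string and data objects from the earlier sketches.

```r
library(rjags)

# Assumed data layout: response matrix y (N x J), group membership vector
# group, and G groups (the three forms, or the 12 PTSA versions).
jags_data <- list(y = y, group = group, N = nrow(y), J = ncol(y), G = 3)

model <- jags.model(textConnection(related_siblings_model),
                    data = jags_data, n.chains = 2)
update(model, n.iter = 500)               # 500 burn-in iterations
draws <- coda.samples(model,
                      variable.names = c("b", "mu.b", "sigma2.b"),
                      n.iter = 5000, thin = 10)
# Two chains x (5,000 / 10) = 1,000 retained draws represent the posterior.
```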

Results

Traditional Assessment

Table 4 tabulates the number of pairwise comparisons that resulted in each invariance designation on the basis of the values of the BFs (i.e., Non-Invariant, Inconclusive, Invariant) for the three item types (i.e., OR-D, OR-E, SME-I). The desired designation for all items is Invariant or, at the very least, Inconclusive. Among the 93 pairwise comparisons computed for the OR-D and OR-E items, only 3 were assigned a designation of Non-Invariant. Moreover, the distribution of designations was largely consistent for OR-D and OR-E items. This result suggests that randomization of response options had little bearing on the degree to which the OR-D and OR-E items were psychometrically isomorphic. Among the 66 pairwise comparisons computed for the SME-I items, 45 were designated as Non-Invariant, with the remaining 21 comparisons designated as either Inconclusive or Invariant.

Table 4
Summary of (Non-)Invariance Results for Each Item Type on the Traditional Assessment
(Columns: Item Type; Number of Items^a; Number of Comparisons; and the number of comparisons with BF ≤ .33 [Non-Invariance], .33 < BF < 3 [Inconclusive], and BF ≥ 3 [Invariance]. Rows: OR-D, OR-E, Isomorph, Total.)
^a The total number of items reflects the number of multiple choice-type items; the seven items that were not of a multiple-choice type are not included here. BF = Bayes factor; OR-D = option randomization disabled; OR-E = option randomization enabled.

Figure 2 illustrates selected results for the BFs and ICCs. The top, middle, and bottom rows show the results for two items that represent the ends of the continuum of invariance for the OR-D items, the OR-E items, and the items judged isomorphic by SMEs, respectively. Figure 3 illustrates the posterior densities for the d values for the items presented in Figure 2. Starting with the panels on the left sides of Figures 2 and 3, the BFs for each comparison, the inability to distinguish among the ICCs, and the high degree of overlap among the densities of d that are concentrated near 0 provide strong evidence for the invariance hypothesis. Among the least invariant of the OR-D and OR-E items (depicted in the panels on the right sides of Figures 2 and 3), some evidence of noninvariance was found via the BFs. However, the ICCs for these items (Item 51 and Item 1 for OR-D and OR-E, respectively) were reasonably close, and the overlap among the densities of d that are concentrated near 0 lends support to the invariance hypothesis. As for the least invariant among the isomorphic items (Item 18), all BFs were extremely small (all rounded to zero); the magnitude of the difference in difficulty as captured by the ICCs and the densities of d provides strong evidence that these items are psychometrically quite different.

Turning to the analyses of the response options, Figure 4 shows the proportion of examinees that selected each response option across three ability groups. For the purposes of accumulating evidence of invariance or noninvariance, the focus lies in the patterns of response selection rather than the conceptual underpinnings that gave rise to the similarities and differences among items hypothesized to be isomorphic. The top (Item 48), middle (Item 3), and bottom (Item 18) rows represent items for which there was strong evidence in favor of invariance, partial invariance, and noninvariance, respectively. For the item exhibiting a high degree of invariance (Item 48), the proportion of examinees that selected each response option was very similar across the three variations of the item. For the item that exhibited partial invariance (Item 3), the proportion of examinees selecting each response option was very similar for the versions of the item on Forms 1 and 3. However, the version on Form 2 was found to be markedly more difficult owing to the increased selection of all incorrect response options (B, C, D) across all ability groups. Finally, the pattern by which response options were selected for the noninvariant item (Item 18) exhibited a high degree of variability across all versions.
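The option-by-ability-group proportions underlying Figure 4 can be tabulated with a few lines of R; a sketch assuming a data frame responses of raw option choices (one column per item) and a vector total_score of examinee total scores, with all names illustrative.

```r
# Distractor analysis sketch: split examinees into three equal-sized ability
# groups by total score, then tabulate option-choice proportions by group.
option_table <- function(responses, total_score, item) {
  ability <- cut(total_score,
                 breaks = quantile(total_score, probs = c(0, 1/3, 2/3, 1)),
                 labels = c("Low", "Average", "High"),
                 include.lowest = TRUE)
  # Row-wise proportions: within each ability group, the share of examinees
  # selecting each response option (e.g., A, B, C, D).
  prop.table(table(ability, responses[[item]]), margin = 1)
}

# Hypothetical usage, repeated for each form's version of the same family:
# option_table(responses_form1, total_score_form1, "item18")
```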

Figure 2. Item characteristic curves for isomorphic items on different forms of the traditional assessment. OR = option randomization; BF12 = Bayes factor comparing the isomorphic item on Forms 1 and 2; BF13 = Bayes factor comparing the isomorphic item on Forms 1 and 3; BF23 = Bayes factor comparing the isomorphic item on Forms 2 and 3. (Panels: OR-Disabled Items 49 and 51; OR-Enabled Items 4 and 1; Isomorph Items 9 and 18. Vertical axis: P(Y = 1); horizontal axis: θ.)

Figure 3. Posterior densities for the difference (d) in location parameters between isomorphic items on different forms of the traditional assessment. OR = option randomization. Solid lines depict the comparison of Forms 1 and 2; dashed lines depict the comparison of Forms 1 and 3; dotted lines depict the comparison of Forms 2 and 3. (Panels: OR-Disabled Items 49 and 51; OR-Enabled Items 4 and 1; Isomorph Items 9 and 18.)

Figure 4. Percent of examinees selecting each response option for an isomorphic item that was invariant over forms (top row, Item 48), invariant for two versions but noninvariant for one (middle row, Item 3), and completely noninvariant over forms (bottom row, Item 18).

Packet Tracer Skills Assessment

Figure 5. Magnitude of Bayes factors (BFs) across different values of the between-group variance for the difficulty parameter for each primary observable on the Packet Tracer Skills Assessment. The dashed gray lines within panels represent Jeffreys's (1961) recommended cutoffs for substantial evidence for the hypothesis that difficulty parameters were not invariant (i.e., BF ≤ .33) and for the hypothesis of invariance among difficulty parameters (i.e., BF ≥ 3). (Color figure can be viewed at wileyonlinelibrary.com)

Figure 5 shows the values of the BFs across each of the values of $\sigma^{2*}$ for each of the primary observables. The results for primary observables appear within panels, with values of $\sigma^{2*}$ shown along the horizontal axis and the value of the BF shown on the vertical axis. The general upward trend within each panel reflects that allowing a larger amount of within-family variability ($\sigma^{2*}$) to still count as being isomorphic yields more task families deemed isomorphic. The results of the sensitivity analysis provide strong evidence that the particular value of $\sigma^{2*}$ has a meaningful impact on the conclusions to be drawn about the degree of invariance (or noninvariance) for primary observables. On the one hand, setting $\sigma^{2*} = .05$ would result in every primary observable being declared noninvariant. This suggests that $\sigma^{2*} = .05$ may be too strict a criterion, one that identifies primary observables with very similar levels of difficulty as noninvariant. On the other hand, setting $\sigma^{2*} = .15$ would result in most (but not all) of the primary observables being identified as invariant. This suggests that $\sigma^{2*} = .15$ may be too loose a criterion, one that identifies primary observables with markedly different levels of difficulty as invariant.

To avoid overidentifying primary observables as invariant or noninvariant, we selected $\sigma^{2*} = .10$ as the threshold for designating whether variations of each primary observable should be treated as invariant (i.e., BF ≥ 3), noninvariant (i.e., BF ≤ .33), or inconclusive (i.e., .33 < BF < 3). Based on the threshold value of $\sigma^{2*} = .10$, only one primary observable (i.e., PO1018) was found to be noninvariant; the remaining 56 (of the 57) primary observables were found to be inconclusive with respect to invariance or noninvariance.

Figure 6. Item characteristic curves for each of the primary observables identified as exhibiting noninvariance across structural isomorphs of the Packet Tracer Skills Assessment. The minimum and maximum values shown within plots represent the minimum and maximum means of the marginal posterior distributions associated with the respective variants of the primary observable.

Figure 6 shows the ICCs for the four primary observables that span the full range of psychometric isomorphism that was observed for the PTSA. The left panel in the top row (primary observable 1005) is one of the many instances in which there was inconclusive evidence in support of either hypothesis at the ends of the invariance continuum. For this primary observable (and many others on the PTSA), the ICCs for all 12 variations were largely indistinguishable.

In contrast, the right panel in the bottom row (primary observable 1018) shows the only primary observable with strong support for the hypothesis of noninvariance. For this primary observable, there was clearer variation in the ICCs across the 12 versions of the PTSA. The right panel in the top row (primary observable 1023) and the left panel in the bottom row (primary observable 1032) serve to fill in the gradient from invariance to noninvariance among the primary observables on the PTSA.

Discussion

As an ideal, the psychometric properties of structurally isomorphic versions of the same task would be identical. Under this ideal, examinees would not be differentially advantaged based on the particular version of the task they were exposed to. To the extent this ideal holds for the assessments considered in this work, the use of isomorphic tasks reduces task exposure and ensures fairness across multiple versions. Different versions of a task that are (a) developed from the same template, or (b) developed to measure the same construct with the same degree of cognitive complexity ought to reasonably approximate the ideal notion of psychometric isomorphism. Using data from a traditional assessment and a performance-based assessment, model-based and descriptive approaches were pursued to evaluate the degree to which isomorphs were discrepant from the notion of psychometric isomorphism. In what follows, the merits of these approaches are discussed in light of the results pertaining to both types of assessment pursued in this work.

The model-based approach involved estimating a related siblings model via a multilevel parameterization. Different versions of a task were deemed to be (or not to be) psychometrically isomorphic to the extent that the difficulties associated with the different versions were invariant (or not invariant) across the different forms of the assessment. Notably, this model-based approach was sufficiently general to handle data derived from two very different types of assessment. Moreover, trivial alterations to the setup of the model readily supported inferences for both pairwise group comparisons, which were pursued for the traditional assessment, and a between-group variance term, which was pursued for the PTSA. In essence, the model pursued in this work is readily equipped to investigate the hypothesis that different versions of a task are psychometrically invariant regardless of the number of groups. Notably, the Rasch version of the model was pursued on the basis of its simplicity and ease of exposition. The model can be readily expanded to investigate the degree to which other psychometric properties of task variants (e.g., discrimination, guessing) are similar or different (for examples, see Geerlings et al., 2011; Glas & van der Linden, 2003; Janssen, Tuerlinckx, Meulders, & De Boeck, 2000; Johnson & Sinharay, 2005; Sinharay et al., 2003). Still other possibilities for expanding the model may involve incorporating additional dimensions (see Raudenbush & Bryk, 2002).

As an empirical check, BFs were pursued as a method to flag variations of tasks as invariant, noninvariant, or inconclusive with respect to both invariance and noninvariance. For the traditional assessment, almost all of the OR-D and OR-E items either were identified as having strong support for invariance or yielded evidence that was not sufficiently strong to rule out invariance on the basis of the BFs associated with pairwise comparisons.

In contrast, the majority of the isomorphs were deemed as having strong support for noninvariance (i.e., 45 out of the 66 comparisons across the three groups). As for the PTSA, only one primary observable was found to be noninvariant; evidence in favor of either invariance or noninvariance was inconclusive for the remaining primary observables.

As noted, the BFs served the purpose of flagging tasks that may not be performing as expected. When the variable features used to create different instances of the tasks are anticipated to be incidental, the goal of the analysis is to flag tasks for which the variable features are behaving as radicals. This was the case for the PTSA, because the device networks were structurally isomorphic with respect to logical interconnections among network devices. In more exploratory settings, as was the case for the traditional assessment considered in this work, the goal was to investigate the types of changes that render tasks invariant or noninvariant. In the absence of previous research or strong theoretical support, our approach was to hypothesize invariance for all items on the traditional assessment. In doing so, the model-based evidence was used to guide SMEs to the relevant features that gave rise to invariance or noninvariance within item families.

For example, in one of the SME-I item families, we found that two of the items in the family were highly psychometrically isomorphic, while the third was quite different, being much more difficult. Follow-up discussions and investigations revealed that this was due to the third item involving computer IP addresses that end in the 70s, whereas for the other two items the IP addresses ended in the 60s. When these items were created, SMEs expected that this would not make a difference; that is, the magnitude of the IP address was thought to be an incidental feature. However, our follow-up investigations revealed that a key feature of the item involved having IP address numbers that were divisible by 4, and that it is harder to identify numbers in the 70s divisible by 4 (i.e., 72, 76) than it is to identify numbers in the 60s divisible by 4 (i.e., 60, 64, 68). Whether the IP addresses were in the 60s or the 70s was not an incidental feature, but a radical one. As a result of this work, SMEs gained a deeper understanding of the variable features that have bearing on item difficulty. Moreover, strengthening what SMEs know may lead to the generation of testable hypotheses or even motivate broader developments in theory about the cognitive underpinnings that yield differences (and similarities) in examinees' performance.

To complement the results of the BFs, graphical representations were constructed to visualize the extent to which tasks were (non-)invariant in a couple of ways. For most of the tasks evaluated in this work, the relative locations of task variants, as depicted via their ICCs, were consistent with the conclusions reached on the basis of the BFs. In this scenario, the amount of evidence in support of any one conclusion simply compounded. In other cases, the relative location of the ICCs supported the conclusion opposite to that suggested by the BF. One instance of this scenario occurred for the comparison between the first and second groups on Item 51 on the traditional assessment.
In this case, the BF = .30, but the ICCs may be deemed sufficiently close ($b_{51,\text{Group1}} = 1.21$, $b_{51,\text{Group2}} = 1.05$) to be viewed as psychometrically isomorphic, depending on the stakes of the assessment and the perspective of stakeholders.


More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Evaluating the quality of analytic ratings with Mokken scaling

Evaluating the quality of analytic ratings with Mokken scaling Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch

More information

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives DOI 10.1186/s12868-015-0228-5 BMC Neuroscience RESEARCH ARTICLE Open Access Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives Emmeke

More information

Turning Output of Item Response Theory Data Analysis into Graphs with R

Turning Output of Item Response Theory Data Analysis into Graphs with R Overview Turning Output of Item Response Theory Data Analysis into Graphs with R Motivation Importance of graphing data Graphical methods for item response theory Why R? Two examples Ching-Fan Sheu, Cheng-Te

More information

Convergence Principles: Information in the Answer

Convergence Principles: Information in the Answer Convergence Principles: Information in the Answer Sets of Some Multiple-Choice Intelligence Tests A. P. White and J. E. Zammarelli University of Durham It is hypothesized that some common multiplechoice

More information

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model Gary Skaggs Fairfax County, Virginia Public Schools José Stevenson

More information

SUPPLEMENTAL MATERIAL

SUPPLEMENTAL MATERIAL 1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across

More information

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION Iweka Fidelis (Ph.D) Department of Educational Psychology, Guidance and Counselling, University of Port Harcourt,

More information

Individual Differences in Attention During Category Learning

Individual Differences in Attention During Category Learning Individual Differences in Attention During Category Learning Michael D. Lee (mdlee@uci.edu) Department of Cognitive Sciences, 35 Social Sciences Plaza A University of California, Irvine, CA 92697-5 USA

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati. Likelihood Ratio Based Computerized Classification Testing Nathan A. Thompson Assessment Systems Corporation & University of Cincinnati Shungwon Ro Kenexa Abstract An efficient method for making decisions

More information

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Journal of Educational Measurement Summer 2010, Vol. 47, No. 2, pp. 227 249 Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model Jimmy de la Torre and Yuan Hong

More information

PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS

PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS Sankhyā : The Indian Journal of Statistics Special Issue on Biostatistics 2000, Volume 62, Series B, Pt. 1, pp. 149 161 PROFILE SIMILARITY IN BIOEQUIVALENCE TRIALS By DAVID T. MAUGER and VERNON M. CHINCHILLI

More information

Item Analysis: Classical and Beyond

Item Analysis: Classical and Beyond Item Analysis: Classical and Beyond SCROLLA Symposium Measurement Theory and Item Analysis Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013 Why is item analysis relevant? Item analysis provides

More information

During the past century, mathematics

During the past century, mathematics An Evaluation of Mathematics Competitions Using Item Response Theory Jim Gleason During the past century, mathematics competitions have become part of the landscape in mathematics education. The first

More information

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN By FRANCISCO ANDRES JIMENEZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

A Multilevel Testlet Model for Dual Local Dependence

A Multilevel Testlet Model for Dual Local Dependence Journal of Educational Measurement Spring 2012, Vol. 49, No. 1, pp. 82 100 A Multilevel Testlet Model for Dual Local Dependence Hong Jiao University of Maryland Akihito Kamata University of Oregon Shudong

More information

MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS

MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS The purpose of this study was to create an instrument that measures middle grades

More information

Item Analysis Explanation

Item Analysis Explanation Item Analysis Explanation The item difficulty is the percentage of candidates who answered the question correctly. The recommended range for item difficulty set forth by CASTLE Worldwide, Inc., is between

More information

Group Assignment #1: Concept Explication. For each concept, ask and answer the questions before your literature search.

Group Assignment #1: Concept Explication. For each concept, ask and answer the questions before your literature search. Group Assignment #1: Concept Explication 1. Preliminary identification of the concept. Identify and name each concept your group is interested in examining. Questions to asked and answered: Is each concept

More information

Understanding Uncertainty in School League Tables*

Understanding Uncertainty in School League Tables* FISCAL STUDIES, vol. 32, no. 2, pp. 207 224 (2011) 0143-5671 Understanding Uncertainty in School League Tables* GEORGE LECKIE and HARVEY GOLDSTEIN Centre for Multilevel Modelling, University of Bristol

More information

Linking Assessments: Concept and History

Linking Assessments: Concept and History Linking Assessments: Concept and History Michael J. Kolen, University of Iowa In this article, the history of linking is summarized, and current linking frameworks that have been proposed are considered.

More information

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1. Running Head: A MODIFIED CATSIB PROCEDURE FOR DETECTING DIF ITEMS 1 A Modified CATSIB Procedure for Detecting Differential Item Function on Computer-Based Tests Johnson Ching-hong Li 1 Mark J. Gierl 1

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy Industrial and Organizational Psychology, 3 (2010), 489 493. Copyright 2010 Society for Industrial and Organizational Psychology. 1754-9426/10 Issues That Should Not Be Overlooked in the Dominance Versus

More information

Exploring rater errors and systematic biases using adjacent-categories Mokken models

Exploring rater errors and systematic biases using adjacent-categories Mokken models Psychological Test and Assessment Modeling, Volume 59, 2017 (4), 493-515 Exploring rater errors and systematic biases using adjacent-categories Mokken models Stefanie A. Wind 1 & George Engelhard, Jr.

More information

Computer Adaptive-Attribute Testing

Computer Adaptive-Attribute Testing Zeitschrift M.J. für Psychologie Gierl& J. / Zhou: Journalof Computer Psychology 2008Adaptive-Attribute Hogrefe 2008; & Vol. Huber 216(1):29 39 Publishers Testing Computer Adaptive-Attribute Testing A

More information

Assessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures. Dubravka Svetina

Assessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures. Dubravka Svetina Assessing Dimensionality in Complex Data Structures: A Performance Comparison of DETECT and NOHARM Procedures by Dubravka Svetina A Dissertation Presented in Partial Fulfillment of the Requirements for

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

Having your cake and eating it too: multiple dimensions and a composite

Having your cake and eating it too: multiple dimensions and a composite Having your cake and eating it too: multiple dimensions and a composite Perman Gochyyev and Mark Wilson UC Berkeley BEAR Seminar October, 2018 outline Motivating example Different modeling approaches Composite

More information

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D. Psicológica (2009), 30, 343-370. SECCIÓN METODOLÓGICA Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data Zhen Li & Bruno D. Zumbo 1 University

More information

Re-Examining the Role of Individual Differences in Educational Assessment

Re-Examining the Role of Individual Differences in Educational Assessment Re-Examining the Role of Individual Differences in Educational Assesent Rebecca Kopriva David Wiley Phoebe Winter University of Maryland College Park Paper presented at the Annual Conference of the National

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models.

Chapter 1 Introduction. Measurement Theory. broadest sense and not, as it is sometimes used, as a proxy for deterministic models. Ostini & Nering - Chapter 1 - Page 1 POLYTOMOUS ITEM RESPONSE THEORY MODELS Chapter 1 Introduction Measurement Theory Mathematical models have been found to be very useful tools in the process of human

More information

Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods

Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian Methods Journal of Modern Applied Statistical Methods Volume 11 Issue 1 Article 14 5-1-2012 Parameter Estimation with Mixture Item Response Theory Models: A Monte Carlo Comparison of Maximum Likelihood and Bayesian

More information

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Research Report Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Xueli Xu Matthias von Davier April 2010 ETS RR-10-10 Listening. Learning. Leading. Linking Errors in Trend Estimation

More information

A Race Model of Perceptual Forced Choice Reaction Time

A Race Model of Perceptual Forced Choice Reaction Time A Race Model of Perceptual Forced Choice Reaction Time David E. Huber (dhuber@psyc.umd.edu) Department of Psychology, 1147 Biology/Psychology Building College Park, MD 2742 USA Denis Cousineau (Denis.Cousineau@UMontreal.CA)

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

RATER EFFECTS AND ALIGNMENT 1. Modeling Rater Effects in a Formative Mathematics Alignment Study

RATER EFFECTS AND ALIGNMENT 1. Modeling Rater Effects in a Formative Mathematics Alignment Study RATER EFFECTS AND ALIGNMENT 1 Modeling Rater Effects in a Formative Mathematics Alignment Study An integrated assessment system considers the alignment of both summative and formative assessments with

More information

Understanding and quantifying cognitive complexity level in mathematical problem solving items

Understanding and quantifying cognitive complexity level in mathematical problem solving items Psychology Science Quarterly, Volume 50, 2008 (3), pp. 328-344 Understanding and quantifying cognitive complexity level in mathematical problem solving items SUSN E. EMBRETSON 1 & ROBERT C. DNIEL bstract

More information

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Connexion of Item Response Theory to Decision Making in Chess Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan Acknowledgement A few Slides have been taken from the following presentation

More information

Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning

Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning Joshua T. Abbott (joshua.abbott@berkeley.edu) Thomas L. Griffiths (tom griffiths@berkeley.edu) Department of Psychology,

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

Methodological Issues in Measuring the Development of Character

Methodological Issues in Measuring the Development of Character Methodological Issues in Measuring the Development of Character Noel A. Card Department of Human Development and Family Studies College of Liberal Arts and Sciences Supported by a grant from the John Templeton

More information

Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking

Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking Jee Seon Kim University of Wisconsin, Madison Paper presented at 2006 NCME Annual Meeting San Francisco, CA Correspondence

More information

Diagnostic Classification Models

Diagnostic Classification Models Diagnostic Classification Models Lecture #13 ICPSR Item Response Theory Workshop Lecture #13: 1of 86 Lecture Overview Key definitions Conceptual example Example uses of diagnostic models in education Classroom

More information

Cognitive Design Principles and the Successful Performer: A Study on Spatial Ability

Cognitive Design Principles and the Successful Performer: A Study on Spatial Ability Journal of Educational Measurement Spring 1996, Vol. 33, No. 1, pp. 29-39 Cognitive Design Principles and the Successful Performer: A Study on Spatial Ability Susan E. Embretson University of Kansas An

More information

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH By JANN MARIE WISE MACINNES A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF

More information

Learning Deterministic Causal Networks from Observational Data

Learning Deterministic Causal Networks from Observational Data Carnegie Mellon University Research Showcase @ CMU Department of Psychology Dietrich College of Humanities and Social Sciences 8-22 Learning Deterministic Causal Networks from Observational Data Ben Deverett

More information

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data 1. Purpose of data collection...................................................... 2 2. Samples and populations.......................................................

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy

More information

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing Terry A. Ackerman University of Illinois This study investigated the effect of using multidimensional items in

More information

Does factor indeterminacy matter in multi-dimensional item response theory?

Does factor indeterminacy matter in multi-dimensional item response theory? ABSTRACT Paper 957-2017 Does factor indeterminacy matter in multi-dimensional item response theory? Chong Ho Yu, Ph.D., Azusa Pacific University This paper aims to illustrate proper applications of multi-dimensional

More information

Selection of Linking Items

Selection of Linking Items Selection of Linking Items Subset of items that maximally reflect the scale information function Denote the scale information as Linear programming solver (in R, lp_solve 5.5) min(y) Subject to θ, θs,

More information

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08 Auditory Perception - Detection versus Discrimination - Localization versus Discrimination - - Electrophysiological Measurements Psychophysical Measurements Three Approaches to Researching Audition physiology

More information

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08 Auditory Perception - Detection versus Discrimination - Localization versus Discrimination - Electrophysiological Measurements - Psychophysical Measurements 1 Three Approaches to Researching Audition physiology

More information

Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models. Xiaowen Liu Eric Loken University of Connecticut

Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models. Xiaowen Liu Eric Loken University of Connecticut Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models Xiaowen Liu Eric Loken University of Connecticut 1 Overview Force Concept Inventory Bayesian implementation of one-

More information

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns

The Influence of Test Characteristics on the Detection of Aberrant Response Patterns The Influence of Test Characteristics on the Detection of Aberrant Response Patterns Steven P. Reise University of California, Riverside Allan M. Due University of Minnesota Statistical methods to assess

More information

Item Selection in Polytomous CAT

Item Selection in Polytomous CAT Item Selection in Polytomous CAT Bernard P. Veldkamp* Department of Educational Measurement and Data-Analysis, University of Twente, P.O.Box 217, 7500 AE Enschede, The etherlands 6XPPDU\,QSRO\WRPRXV&$7LWHPVFDQEHVHOHFWHGXVLQJ)LVKHU,QIRUPDWLRQ

More information

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015 This report describes the procedures used in obtaining parameter estimates for items appearing on the 2014-2015 Smarter Balanced Assessment Consortium (SBAC) summative paper-pencil forms. Among the items

More information

Building Evaluation Scales for NLP using Item Response Theory

Building Evaluation Scales for NLP using Item Response Theory Building Evaluation Scales for NLP using Item Response Theory John Lalor CICS, UMass Amherst Joint work with Hao Wu (BC) and Hong Yu (UMMS) Motivation Evaluation metrics for NLP have been mostly unchanged

More information

Using Bayesian Decision Theory to

Using Bayesian Decision Theory to Using Bayesian Decision Theory to Design a Computerized Mastery Test Charles Lewis and Kathleen Sheehan Educational Testing Service A theoretical framework for mastery testing based on item response theory

More information

Section 6: Analysing Relationships Between Variables

Section 6: Analysing Relationships Between Variables 6. 1 Analysing Relationships Between Variables Section 6: Analysing Relationships Between Variables Choosing a Technique The Crosstabs Procedure The Chi Square Test The Means Procedure The Correlations

More information

Measuring noncompliance in insurance benefit regulations with randomized response methods for multiple items

Measuring noncompliance in insurance benefit regulations with randomized response methods for multiple items Measuring noncompliance in insurance benefit regulations with randomized response methods for multiple items Ulf Böckenholt 1 and Peter G.M. van der Heijden 2 1 Faculty of Management, McGill University,

More information

CHAPTER 3 RESEARCH METHODOLOGY

CHAPTER 3 RESEARCH METHODOLOGY CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement

More information

VERDIN MANUSCRIPT REVIEW HISTORY REVISION NOTES FROM AUTHORS (ROUND 2)

VERDIN MANUSCRIPT REVIEW HISTORY REVISION NOTES FROM AUTHORS (ROUND 2) 1 VERDIN MANUSCRIPT REVIEW HISTORY REVISION NOTES FROM AUTHORS (ROUND 2) Thank you for providing us with the opportunity to revise our paper. We have revised the manuscript according to the editors and

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

A Memory Model for Decision Processes in Pigeons

A Memory Model for Decision Processes in Pigeons From M. L. Commons, R.J. Herrnstein, & A.R. Wagner (Eds.). 1983. Quantitative Analyses of Behavior: Discrimination Processes. Cambridge, MA: Ballinger (Vol. IV, Chapter 1, pages 3-19). A Memory Model for

More information