Running head: RATER EFFECTS & ALIGNMENT

Modeling Rater Effects in a Formative Mathematics Alignment Study

Daniel Anderson, P. Shawn Irvin, Julie Alonzo, and Gerald Tindal

University of Oregon

Acknowledgements: Funds for the data set used to generate this report came from a federal grant awarded to the University of Oregon by the Institute of Education Sciences, U.S. Department of Education: Developing Middle School Mathematics Progress Monitoring Measures (R324A, funded through June 2014).

Abstract

The alignment of test items to content standards is critical to the validity of decisions made from standards-based tests. Generally, alignment is determined based on judgments made by a panel of content experts, with ratings either averaged or a consensus reached through discussion. When panels are used, the discussions are intended to increase the consistency of ratings and provide a final value for each item. However, the process is necessarily socially mediated, introducing a potential source of bias. To minimize this bias, we present a method for gauging item-standard alignment with an online independent review system, while controlling for theoretically important rater effects with a latent trait analysis. We apply this method to a large pool of formative mathematics test items to illustrate the technique and explore the degree of alignment between items and standards.

Modeling Rater Effects in a Formative Mathematics Alignment Study

Item alignment to standards is crucial to the validity of standards-based test inferences (La Marca, 2001). Test-based accountability policies, such as the No Child Left Behind Act (NCLB, 2001), depend upon the alignment of items to standards to draw inferences about the effectiveness of educational institutions within standards-based accountability systems. Alignment is also important for formative evaluation systems, as the validity of teachers' instructional decisions depends, in part, upon the degree of item alignment: Misaligned items may lead to misguided instructional decisions. During typical alignment studies, multiple raters judge each item on various dimensions (degree to which it is aligned with standards, depth of knowledge, etc.). The process either entails independent reviews (Porter, 2002, 2004) or panels of experts convening for collaborative/discussion-based reviews (Rothman, Slattery, Vranek, & Resnick, 2002; Webb, 1999, 2007). With each approach, the ratings from the multiple reviewers must somehow be combined. Porter (2002, 2004) and Webb (1999, 2007) each recommend averaging the ratings, while Rothman et al. (2002) recommend that the panel of experts continue discussion until a single consensus rating can be established. In this paper, we focus on the process of arriving at an alignment score while also attempting to maintain independence in ratings. In the system we used, ratings from multiple raters are combined through latent trait methods, as opposed to average or consensus methods, to more accurately represent the item location along the alignment continuum. All alignment data reflect independent reviews conducted online with a large pool (1,345) of interim-formative mathematics assessment items. We also use a latent trait model to statistically control for rater

4 RATER EFFECTS & ALIGNMENT 4 severity and better evaluate intra-rater reliability. Finally, we highlight practical benefits of an online review, particularly cost. The Panel Method When panel methods are used, groups of content-matter experts are recruited and convened to discuss and debate the alignment of test items to standards. Content-matter experts generally come from backgrounds such as classroom teachers and curriculum specialists (Rothman et al., 2002). In his influential report, Webb (1999) describes a panel methodology to establish the alignment of test items with content standards. He describes the process as a fourday Alignment Analysis Institute (p. 2), where raters judged the alignment of statewide math and science assessments from two or three grade levels. He also notes that the majority of raters attended a one-day meeting prior to the institute, where they were introduced to the study and the alignment process. In later reports, Webb (2007) notes that a three-day process is perhaps more typical, but that the length of the institute depends on the number of grades to be analyzed, the length of the standards, the length of the assessments, and the number of assessment forms under consideration (pp. 8-9). He recommends 5-8 raters for each panel, noting that an increase in raters will generally increase reliability. During the alignment process outlined by Webb, raters first independently rate the degree of item-standard alignment and then discuss their ratings with others on the panel. Following the discussion, raters are provided the opportunity to modify their ratings. Finally, ratings from the multiple reviewers are averaged to determine if alignment criteria are met, with standard deviations reported as an indicator of the variance between raters. Rothman et al. (2002) proposed the Achieve model, which shares some procedural similarities with Webb s model: Panels of experts are convened, and they work together to

establish alignment ratings for all items. However, while the ratings from the multiple reviewers are averaged under Webb's model (1999, 2007), the Achieve model stipulates that a single consensus rating must be reached through group discussion (Case, Jorgensen, & Zucker, 2004; Martone & Sireci, 2009). Rothman et al. note that the process can be quite expensive and that the pool of raters should represent "a diversity of viewpoints" (p. 8), to ensure alignment ratings are not biased toward certain groups.

Panel Methods and Threats to Validity

While the panel method has been well documented (Council of Chief State School Officers (CCSSO), 2002), it also may become impractical in a number of situations, particularly if the pool of items to be reviewed is large, or if the individuals with the needed expertise are broadly distributed geographically. For example, many interim-formative assessment systems contain multiple test forms within each grade that must be sensitive to a wide range of student abilities. The total item bank may thus contain several thousand items. In computer-adaptive testing situations, the total item bank may be similarly large to ensure that a sufficient number of items are calibrated at each respondent's skill level (Thompson & Weiss, 2011). In these contexts, the financial and practical costs of bringing expert reviewers together to judge the alignment of even a portion of the total item bank become, in most cases, unmanageably large. Additionally, in many rural states or regions, it may be difficult to bring together a sufficient number of content experts to form panels without considerable costs allocated to travel. Conducting reviews online alleviates many of the practical and logistical costs associated with convening experts in person. Additionally, experts from a broad geographical base (i.e., from across the country) can easily participate, which may increase the diversity of views (Rothman et al., 2002) and enhance the external validity of the ratings. For example, it is logical

to expect raters' previous experience instructing and/or assessing students to guide their interpretation of the standards. Raters' prior experiences are also contingent, in part, upon their instructional contexts (e.g., educators from rural versus urban settings). Obtaining participants from a wide range of contexts thus reduces the chances that the alignment ratings will contain contextual biases. For these reasons, an online system may be particularly useful for formative measurement systems with many items or in states or locations with a broad geographic base. An online format can accommodate both collaborative/discussion-based reviews (Webb, Alt, Ely, & Vesperman, 2005) and independent reviews (Behavioral Research and Teaching, 2013). Furthermore, the analysis of data collected through online designs need not differ from data collected from in-person reviews. For example, the online model proposed by Webb et al. (2005) allows reviewers to go through a process that mirrors the in-person institutes, including opportunities for discussion. Because the process is intended to represent an online version of the same model, the modes for determining item alignment (e.g., average ratings after group discussion) do not change. The Surveys of Enacted Curriculum is another model for gauging item alignment that is generally conducted online, with ratings from multiple independent reviewers averaged (Porter, 2004). The critical issue, therefore, is not whether the system is online but the independence of the ratings being made by reviewers. With independence of ratings, more advanced analyses may be used instead of raw averages. For example, in the panel model, ratings are not independent of one another; in fact, the opposite is deliberately the case, as independent ratings are modified through group discussion. The inherent non-independence of the data rules out the possibility of most parametric statistics, given that independence of observations (non-

correlated residuals) is a foundational assumption (see Raudenbush & Bryk, 2002). With independent ratings, this independence of observations assumption holds, opening the possibility for more advanced analyses. Furthermore, a number of other problems exist with the panel method. The panel method is thought to enhance inter-rater reliability through a process of raters defending their initial judgments (elaborating on their thought processes, and potentially modifying their ratings to reach a final judgment). Basically, group discussions are designed to facilitate convergence of ratings and counter the biases of any single rater. In other words, raters who are overly harsh or lenient are more likely to be influenced by the panel and subsequently allowed to modify their ratings (Kocher & Sutter, 2005). By design, however, all panel models provide opportunities for a single rater to affect the judgment of others, and the reverse outcome may occur, with the outlying rater influencing the panel. Persuasion is an essential component of the within-group dynamics present in panel settings (Brown, 1988), and an overbearing rater has the potential to sway others in a manner that over- or underrepresents the alignment of the item. Indeed, the process of persuasion and subsequent conformity occurs regularly. In a seminal study on conformity in groups, Asch (1956) found that the rate of conformity in small groups rose rapidly when individuals faced a group of two or three. More recently, in a meta-analysis of studies replicating Asch's experiments, Bond (2005) found largely the same pattern of small group conformity. If an overbearing individual in a small group setting, such as an alignment panel, convinces one or two individuals to move in their direction, the opinion of the group may follow, with conformity/consensus the end product.

Independent Reviews and Possibilities for Scaling

In contrast to the panel method, independent ratings are not subject to the group dynamics noted above that can potentially bias the results. Yet, independent reviews also come with considerable challenges. If raters participating in the study exhibit any of the common rater effects (see Engelhard, 1994; Myford & Wolfe, 2000; Saal, Downey, & Lahey, 1980), then items that are aligned may be misrepresented as misaligned (false negative), while items that are misaligned may be misrepresented as aligned (false positive). Although this is also true with panel designs, the concern may be more pronounced with independent reviews because the group cannot help temper extreme ratings/reviewers. Raters routinely exhibiting bias in one direction or another can have substantial practical repercussions (Wolfe, 2004). False negative items (those mistakenly rated as not aligned) can waste valuable test-developer resources, as the items may be revised needlessly or discarded, while false positive items (items mistakenly rated as aligned) threaten the validity of standards-based test inferences. However, because the observations are independent, latent trait methods such as the many-facets Rasch model (MFRM) can be applied to understand and control these rater effects. The MFRM provides a flexible framework from which data collected from independent alignment reviews can be analyzed. Using the MFRM, both the internal consistency and severity of each rater can be documented, with the latter statistically controlled. Under the MFRM, raters can be treated as a facet of the measurement process, with conditions of measurement equated provided the reviewers provide alignment ratings of common items. By equating the conditions of measurement, all item ratings are placed on a common scale, taking into account the specific rater who judged the item. Ratings provided by overly-severe raters would thus be adjusted toward the aligned end of the scale, while ratings provided by overly-lenient raters would be

adjusted toward the unaligned end of the scale. Concerns about inter-rater reliability are then mitigated, as the analysis statistically corrects for differences between raters in terms of severity. When applying the MFRM to alignment data, the item becomes the object of measurement. The multiple independent ratings for each item are then treated as indicators of the latent alignment trait, and the item location on the latent trait is calculated based on the raw sum of ratings. However, because the conditions of measurement are equated, the item location is adjusted based on the severity of each reviewer who rates the item. In the end, independent ratings from each reviewer uniquely contribute to the latent alignment trait for each item, eliminating the need for a discussion-based process to increase inter-rater reliability. The internal consistency of each rater also can be evaluated more precisely by comparing a model-based expected value with an observed value. The expected value is calculated based on (a) how the reviewer rated all other items [i.e., the severity of the rater], and (b) how all other reviewers rated the item in question [i.e., the difficulty of endorsing the item as aligned]. If the expected value closely matches the observed value, then the reviewer is operating consistently. Using the MFRM, one can also inspect standardized residuals to investigate outlier responses, which can be used to help inform ratings of specific items. Because the MFRM fits within the larger family of Rasch models, for which the raw score is a sufficient statistic for calculating latent trait locations (Wright, 1982), the values reported on the latent scale can be transposed back to the original rating scale. The MFRM thus provides both a raw average of ratings for each item, as is often applied (e.g., Porter, 2004), as well as an adjusted average (adjusted for the severity of raters judging the item), which represents the item location on the latent trait transposed back to the original rating scale. This property is quite helpful from a practical evaluative perspective because original rules for determining alignment

(e.g., on a 0-3 scale: less than 2 = not aligned, greater than 2 = aligned) can still be applied, and new rules do not have to be developed for the latent scale.

Situating the Current Study

In the current study, we apply the MFRM to data collected from an independent online alignment review. Before describing the methods, however, some unique characteristics of the study warrant note. First, the alignment review was conducted with an item bank from an interim-formative assessment system for which all items were written to measure the Common Core State Standards (CCSS) in mathematics. Evaluating alignment from an item bank, rather than intact test forms, prevented calculation of common alignment criteria described by Webb (e.g., categorical concurrence, balance of representation; 1999) or Rothman et al. (e.g., content centrality, performance centrality; 2002). Gauging alignment with interim-formative test items also presents certain challenges, given that the target population for the assessments is students performing below expectations (i.e., students whose progress is being monitored at relatively short intervals in concert with an intervention designed to increase achievement). Thus, perfect alignment to grade-level standards may result in an overly difficult test that the target population may have difficulty accessing (see Anderson, Lai, Alonzo, & Tindal, 2011). With interim-formative assessments, it may be desirable to have some items measure skills that are prerequisites to the target standard, but that (given their reduction in complexity) do not directly align with the standard. Therefore, in cases where items were not aligned with the corresponding standard, we were interested in whether the item measured a prerequisite to the standard. Prerequisite items were viewed as both desirable and necessary, and are perhaps one of the distinguishing characteristics of interim-formative assessment. Nevertheless, the

methodology likely generalizes to any item-to-standards alignment study. Below, we discuss the specific procedures used in our study.

Methods

Participant Recruitment and Item Assignment

Fifteen middle school math teachers from across the country were recruited to serve as our content expert raters. Raters were selected based on their knowledge of the CCSS and experience as middle school mathematics professionals. The full item bank used in this study included 900 items in each of grades 6-8, totaling 2,700 unique items. A pool of approximately 50% of the total item bank was selected for review. Items were selected based on a randomized matrix-sampling plan, by which items were stratified by the original item-writer and the CCSS the item was intended to measure. A total of 1,345 items were selected for review, with each rater asked to judge the alignment of 270 items through an online system. All items were written to align with only one standard, and all items were paired with the corresponding standard prior to dissemination to reviewers. Items were judged on a 4-point alignment scale, as follows: 0 = no alignment, 1 = vague alignment, 2 = somewhat aligned, 3 = directly aligned. Prior to participating in the study, all raters were trained on the alignment process and rating scale through an online webinar. Reviewers were informed that all items would be collapsed into dichotomous aligned (2 or 3) or not aligned (0 or 1) categories; however, the 4-point scale allowed for finer-grained judgments. A second conditional question was also asked of respondents when a given item was rated as not aligned (0-1), to assess whether the item

measured a prerequisite skill to the standard. This question was rated dichotomously: 0 (no) or 1 (yes). Prior to the review, all items were divided into 5 review sets within each grade (15 review sets total), each containing 90 items. Items selected for review were randomly assigned to one of the five review sets within each grade. Each rater then independently judged the alignment of three review sets assigned to him/her, based on a linking pattern that allowed for the calibration of all items and raters on a common scale. Three raters reviewed each review set. The rater sampling design, linking raters across review sets, is displayed in Table 1. All items were disseminated for review via a secure web-based system called the Distributed Item Review (DIR; Behavioral Research and Teaching, 2013), as displayed in Figure 1. Broadly, the DIR is a tool for presenting test items and test forms to reviewers in a user-friendly manner so they can be judged for dimensions of bias, sensitivity, and, in this case, alignment to standards. Test items are uploaded into the DIR and associated with researcher-defined study questions for presentation to reviewers.
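The structure of such a linked design can be sketched in a few lines of code. The snippet below is a minimal illustration, assuming a simple cyclic pattern in which each of 15 raters judges three review sets and each set is judged by three raters; it is not the exact plan used in the study (Table 1), where, for example, some raters judged two sets in one grade and one set in another.

```python
# A minimal sketch of a linked rater-by-set assignment: 15 raters, 15 review
# sets (5 per grade), every rater judging 3 sets and every set judged by 3
# raters. The cyclic pattern is illustrative only, not the study's actual plan.
import string

raters = list(string.ascii_lowercase[:15])                       # raters a..o
review_sets = [(grade, s) for grade in (6, 7, 8) for s in range(1, 6)]

assignment = {
    rater: [review_sets[(i + k * 5) % 15] for k in range(3)]
    for i, rater in enumerate(raters)
}

# Verify the two design constraints that allow raters and items to be
# calibrated on a common scale.
per_set_counts = {rs: sum(rs in sets for sets in assignment.values())
                  for rs in review_sets}
assert all(len(sets) == 3 for sets in assignment.values())
assert all(count == 3 for count in per_set_counts.values())

print(assignment["a"])   # e.g., [(6, 1), (7, 1), (8, 1)]
```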

Analysis

All data were analyzed with a many-facets Rasch model (MFRM) to document and control for rater effects. The MFRM is typically applied when the object of measurement is persons, and there are some combinatorial effects of an item and one of numerous facets (i.e., a rater) that determine the individual's score. Within an alignment study, however, the items themselves are the object of measurement. Nevertheless, the MFRM can still be applied because there are facets (i.e., the item and the judge) that determine the ultimate alignment rating of the item. The MFRM, developed by Linacre (1989), is an extension of the basic Rasch model. When data follow an ordinal scale, the MFRM can extend either Masters' (1982) partial credit model, in which each item follows its own unique scale, or Andrich's (1978) rating scale model, in which item thresholds are assumed equivalent across items. In the current study, all items were rated on the same 4-point scale and there was little theoretical reason to presume that thresholds would vary dramatically across items. Given the large number of items in the analysis (1,345), estimation of the partial credit model would also have required a substantially greater number of estimated parameters because thresholds would need to be estimated for each item individually. Further, for many items there were unobserved categories (e.g., all raters rated an item as 2 or 3, leading to empty 0 and 1 categories). The rating scale model is robust to unobserved categories; with the partial credit model, estimation is degraded (Linacre, 2000). We thus decided to apply Andrich's rating scale model, defined as

ln(P_njk / P_nj(k-1)) = B_n - C_j - F_k,     (1)

where P_njk is the probability that item n is rated into category k by rater j, B_n is the estimated location for item n on the latent alignment trait, C_j is the severity of rater j, and F_k is the Rasch-Andrich threshold between categories k - 1 and k. Because our analyses included only two facets, rater severity and item alignment, Equation 1 is actually equivalent to Andrich's (1978) rating scale model and does not require a many-facets extension. In Andrich's model, the C_j parameter typically represents item difficulty, while the B_n parameter typically represents the respondent location on the latent trait. In our model, these parameters represented rater severity and the item location, respectively. However, we discuss the model within the MFRM framework given that raters are generally referenced as a facet of the measurement process and, thus, the MFRM is typically applied when latent trait methods are used (e.g., Myford & Wolfe, 2000).
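To make Equation 1 concrete, the sketch below computes the category probabilities it implies for a single item-rater pairing, following the sign conventions of Equation 1 as written. The parameter values in the example are illustrative assumptions, not estimates from this study.

```python
import numpy as np

def rsm_category_probs(b_item, c_rater, thresholds):
    """Category probabilities implied by Equation 1 for one item-rater pair.

    b_item     : location of item n on the latent alignment trait (logits)
    c_rater    : severity of rater j (logits)
    thresholds : Rasch-Andrich thresholds F_1..F_K (logits)
    Returns probabilities for categories 0..K.
    """
    steps = b_item - c_rater - np.asarray(thresholds, dtype=float)
    # Cumulative sums give the log numerators; category 0 has numerator exp(0).
    log_numerators = np.concatenate(([0.0], np.cumsum(steps)))
    numerators = np.exp(log_numerators - log_numerators.max())  # stabilized
    return numerators / numerators.sum()

# Illustrative values: the same item judged by a lenient and a severe rater.
F = [-1.5, 0.0, 1.5]   # assumed thresholds for the 0-3 rating scale
print(rsm_category_probs(b_item=0.5, c_rater=-1.0, thresholds=F).round(3))
print(rsm_category_probs(b_item=0.5, c_rater=1.0, thresholds=F).round(3))
```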

Further, the model displayed in Equation 1 could easily generalize to include other facets of the measurement process (e.g., the domain or grade level the item was intended to measure). The reason our model simplified to the rating scale model was that there were no subjects in the analysis: the item was the subject. Our model thus represented a special case of the MFRM, in which only two latent parameters were estimated. This special case of the MFRM is analogous to Andrich's rating scale model. The design and analysis were used to control for rater severity and obtain more accurate representations of the alignment of each item. No analysis, beyond descriptive statistics, was conducted with the prerequisite skill data. All MFRM analyses were conducted with the FACETS software, version 3.70 (Linacre, 2012). As Equation 1 indicates, all items were placed on a logit scale, which is useful when evaluating the degree of alignment, as the logit transformation more closely represents a linear, equal-interval scale, provided the data fit the model (Salzberger, 2010). However, the FACETS program also reports an adjusted rating with the logit value transformed back to the original scale. As mentioned previously, these values can ease interpretation when categorical cut-points must be determined. In the present study, all items were tabulated into aligned and not aligned categories. Using the adjusted ratings transformed back to the original scale, we used the original scale cutoff value of 2.0 to determine alignment. That is, any items with a rater-adjusted value less than 2.0 were deemed not aligned, while any above this cutoff were deemed aligned. Items rated not aligned were then evaluated with respect to the prerequisite skill ratings to determine which items were truly misaligned as opposed to those that were aligned to a prerequisite skill of the standard, though not the standard itself.
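The following sketch illustrates the kind of transformation described above: an item's latent location is converted to an expected rating on the original 0-3 scale (evaluated at a rater of average, i.e., zero, severity) and then compared against the 2.0 cutoff. The thresholds and item locations are illustrative assumptions, and the sign orientation follows Equation 1 as written rather than the conventions of the FACETS output.

```python
import numpy as np

def adjusted_average(b_item, thresholds):
    """Expected rating on the original 0-3 scale for an item located at
    b_item logits, evaluated at a rater of average (zero) severity.
    Mirrors the idea of a rater-adjusted average; illustrative only."""
    steps = b_item - np.asarray(thresholds, dtype=float)  # severity fixed at 0
    log_numerators = np.concatenate(([0.0], np.cumsum(steps)))
    probs = np.exp(log_numerators - log_numerators.max())
    probs /= probs.sum()
    return float(np.dot(np.arange(probs.size), probs))

F = [-1.5, 0.0, 1.5]                       # assumed thresholds
for b_item in (-2.0, 0.5, 2.5):            # assumed item locations
    adj = adjusted_average(b_item, F)
    label = "aligned" if adj >= 2.0 else "not aligned"
    print(f"b = {b_item:+.1f}  adjusted average = {adj:.2f}  -> {label}")
```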

Results

For each item there were 6 total ratings: 3 alignment ratings, one from each of three raters, and 3 prerequisite skill ratings, one from each of the same three raters. Note that the prerequisite skill ratings were only provided when an item was rated as not aligned with the standard (0 or 1). Below, we first provide our item-standard alignment results, followed by a brief discussion of our prerequisite skills analysis. We then provide a summary of the observed rater effects.

Item-Standard Alignment

Results for 15 randomly selected items, from the 1,345 items reviewed in this study, are presented in Table 2. The table is intended to provide an example of the type of information obtained for the overall item pool. For each item, observed and rater-adjusted averages are reported. The rater-adjusted average was used in all cases to determine whether an item did or did not align with the corresponding standard, with values below 2.0 deemed not aligned. Following the adjusted average is the endorsability statistic, which indicates the ease with which raters endorsed the item as aligned with the standard (the B_n parameter in Equation 1), reported on the logit scale. Lower endorsability values indicate easier-to-endorse items. The rater-adjusted average is a direct translation of the endorsability statistic, placed back on the original rating scale. After the endorsability statistic is the estimated standard error of the endorsability statistic for the items, which provides an estimate of the precision of the rating. Note that items with more extreme ratings have higher standard errors. When examining the ratings of items, the observed average is quite different from the rater-adjusted average in some cases. For example, items 1, 2, and 7 all had the same observed average. When accounting for the raters who judged the items, however, we see differing degrees

of alignment. Both items 1 and 2 have an adjusted rating below 2.00, indicating the items are not aligned, but item 2 is rated as being more closely aligned (1.74) than item 1 (1.69). Further, we see that item 7 is adjusted (slightly) in the opposite direction (2.03). Thus, when adjusting for rater severity, our inferences of the alignment of items 1 and 2 are reversed (from aligned to not aligned), while our confidence that item 7 is aligned is strengthened. After the display of the standard error, a series of item-level fit statistics are also reported: the unstandardized and standardized infit and outfit. In their unstandardized form, both statistics center on 1.0, and provide an indication of how the item data fit the Rasch model expectations (Linacre, 2002). The infit statistic is weighted based on the patterns represented in the data. The outfit statistic is not weighted, but is more sensitive to outliers in the data. For both statistics, values above 1.0 represent items that underfit the model, which occurs with unexpected values. For example, if a severe rater judging a difficult-to-endorse item judges the item as aligned, then the item would likely appear underfit. Values below 1.0 represent items overfit to the model, which occurs with overly consistent/redundant values (Linacre, 2002). Generally, underfitting items are of greater concern than overfitting items because the estimated alignment rating is calculated from unexpected responses, and is thus more likely to be biased. Overall, 87.73% of all items had a rater-adjusted average rating above 2.0, suggesting they were aligned to their corresponding standard. Because only three reviewers judged the alignment of each item, the standard errors were quite large. Of the 1,180 items rated as aligned, the mean square outfit ranged from 0.05 to 7.79, with an average of 0.93, while the mean square infit ranged from 0.02 to 4.50. Using both the standard error and the fit statistics, we can begin to formulate degrees of confidence in the observed alignment rating.
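As a rough illustration of how these mean square fit statistics are formed, the sketch below computes them from an item's observed ratings and the model-expected ratings and variances: the outfit is the unweighted average of squared standardized residuals, while the infit weights each residual by its model variance. The numbers used are illustrative assumptions, not values from Table 2.

```python
import numpy as np

def fit_mean_squares(observed, expected, variances):
    """Unstandardized infit and outfit mean squares for a single item.

    observed  : ratings the item actually received from its raters
    expected  : model-expected rating for each item-rater pair
    variances : model variance of each rating under the model
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variances = np.asarray(variances, dtype=float)
    squared_residuals = (observed - expected) ** 2
    outfit = float(np.mean(squared_residuals / variances))    # outlier-sensitive
    infit = float(squared_residuals.sum() / variances.sum())  # variance-weighted
    return infit, outfit

# Illustrative values for one item judged by three linked raters.
infit, outfit = fit_mean_squares(observed=[3, 2, 1],
                                 expected=[2.6, 2.1, 1.8],
                                 variances=[0.45, 0.60, 0.55])
print(round(infit, 2), round(outfit, 2))   # values near 1.0 indicate good fit
```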

For items that do not fit the model well and have large standard errors, our confidence in the alignment estimate would be lower, relative to items that fit the model well and have small standard errors.

Prerequisite Skills Analysis

The prerequisite skills question was a conditional response, only asked of raters when they had judged the item as not aligned with the standard, and it thus created dependencies and inherent sparseness in the data. For example, raters who were more severe judged more items as not aligning with the standard, and thus judged more items with respect to prerequisite skills. Similarly, less-severe raters judged more items as aligned to standards, leading to fewer responses with respect to prerequisite skills for such raters. The missing data thus depended on which rater judged the item. The dependencies and sparseness of the data led us to examine prerequisite skills only through a descriptive lens. To determine whether an item addressed a prerequisite skill, we used a group majority rule, by which at least 2 of the 3 raters had to judge the item as either (a) aligned to the standard, or (b) addressing a prerequisite skill. Only 165 of the 1,345 items reviewed (12%) were rated as not aligned with the standard. Of these 165 items, 160 (97%) were rated as addressing a prerequisite skill to the standard under this rule. Thus, 99.6% of all items in this study were rated as either aligning with the corresponding standard or addressing a prerequisite skill to the standard.
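A minimal sketch of this majority rule is shown below. The data layout (a list of the three alignment ratings plus a mapping of the conditional prerequisite answers) is an assumption for illustration, not the format of the study data.

```python
def addresses_standard_or_prerequisite(alignment_ratings, prerequisite_answers):
    """Group majority rule used in the prerequisite-skill analysis (sketch).

    alignment_ratings    : the three 0-3 alignment ratings for an item
    prerequisite_answers : dict mapping a rater's position (0-2) to their 0/1
                           prerequisite answer; present only for raters who
                           rated the item as not aligned (0 or 1)
    Returns True when at least 2 of the 3 raters judged the item as aligned
    (a rating of 2 or 3) or as addressing a prerequisite skill.
    """
    votes = 0
    for position, rating in enumerate(alignment_ratings):
        if rating >= 2:
            votes += 1                                   # judged aligned
        elif prerequisite_answers.get(position, 0) == 1:
            votes += 1                                   # prerequisite skill
    return votes >= 2

# Illustrative item: two raters saw no alignment but flagged a prerequisite.
print(addresses_standard_or_prerequisite([1, 1, 3], {0: 1, 1: 1}))   # True
print(addresses_standard_or_prerequisite([0, 1, 3], {0: 0, 1: 0}))   # False
```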

Rater Effects

A summary of rater effects for all 15 raters included in the study is displayed in Table 3. Of particular note is the severity statistic, reported on the logit scale. The severity of the raters differed substantially. An examination of the average rating column indicated that the most severe rater (Rater 3) judged items, on average, nearly a full category below the most lenient rater (Rater 11). These differences in rater severity are shown visually in Figure 2. The left-most column of the figure displays logit values from the analysis, mapped vertically. To the right of these values is the distribution of rater severities, mapped against the logit values. To the right of the rater column is the distribution of items, mapped against the same logit scale. Finally, the furthest-right column of the figure displays the raw rating scale, ranging from 0-3. The figure thus displays how the distribution of rater severities compared to the distribution of item endorsabilities on a common scale. Note that the item endorsabilities are quite severely skewed, as would be expected given that the majority of items were rated as aligned. The fit statistics in Table 3 also are important because they provide an indication of the internal consistency of raters. That is, given the severity of the rater, as determined by his or her scoring of all other items, and the endorsability of the item, did he or she consistently rate items as would be expected? The outfit statistic is perhaps most useful for this purpose with, again, values above 1.0 indicating underfit to the model and values below 1.0 indicating overfit to the model. These fit statistics provide an indication of intra-rater reliability. A value of 1.0 is the ideal fit to the model, though mild deviations may occur with little cause for concern. We found that all raters in our study fit the model expectations quite well, with the lowest mean square outfit statistic at 0.76. These values are consistent with guidelines reported by Myford and Wolfe (2000).

Discussion

The primary purpose of this study was to propose and illustrate a methodology for conducting alignment reviews. To be clear, our purpose was not to suggest new methods for scoring dimensions of alignment, as has been proposed by others (e.g., Porter, 2002, 2004;

Rothman et al., 2002; Webb, 1999, 2007). Rather, we proposed a new methodology for arriving at a score (with latent trait methods) through more efficient means (online independent reviews). With such an approach, dimensions of alignment can then be calculated post-hoc. A secondary purpose of this study was to explore the alignment of the easycbm formative middle school mathematics test items with the Common Core State Standards. Overall, the alignment was quite high. Although only a portion of the total item bank was selected for review, the proportion of aligned items should readily generalize to the full item bank, given that the same procedures were used during the development of all items (see Anderson, Irvin, Patarapichayatham, Alonzo, & Tindal, 2012), and the matrix sampling plan likely resulted in a representative sample of the total item bank. The results obtained from this study also highlight that important differences in item alignment can be observed when statistically controlling for rater effects. While it is unknown what the alignment ratings of specific items would have been under discussion-based/panel designs, we observed in some cases quite substantial differences between the raw averages, collected from the independent review, and the averages adjusted for rater severity. It is possible that some of these differences would have been alleviated during group discussions, but independent reviews are also free from social biases that could result from the group dynamics, such as group conformity and persuasion (Asch, 1956; Bond, 2005; Brown, 1988; Kocher & Sutter, 2005). Independent reviews are also advantageous in that they are easy to conduct online, leading to numerous practical and financial benefits, although efforts have been made to move discussion-based reviews online (Webb et al., 2005). Finally, independent reviews open the possibility for more advanced statistical analyses, given that the independence of observations assumption is not violated. Previous research using independent reviews (e.g., Porter, 2004) has

reached conclusions about item alignment by averaging the multiple ratings. Applying more advanced analyses, such as the MFRM, allows the researcher to control for individual rater effects statistically, while still avoiding socially-mediated processes.

Limitations

Although the primary purpose of this study was to propose a new methodology for arriving at alignment scores, the actual design used was not without limitations. Perhaps most fundamentally, Webb (2007) suggested 5-7 reviewers judge the alignment of any one item, and the results of this study largely support those numbers. Only three judges were used in this study, and the resulting standard errors were larger than is perhaps desirable. Further, some may question the logic of having reviewers judge the alignment of items across grades, rather than just within a grade. That is, when examining Table 1, reviewer a judged the alignment of two item sets within grade 6, and one item set within grade 8. The expertise of this rater thus needed to span both grades 6 and 8. However, we felt that the vertical linking of raters was preferable given that the items themselves were developed to be on a common vertical scale (Anderson et al., 2012). Thus, in keeping with the scale of the items, we reviewed the alignment of all items with a vertical sampling plan. Further, if judges did not link across grades, then the total number of reviewers from which to equate the conditions of measurement would be much smaller, and the possibility of unaccounted-for rater effects unique to groups would be greater. For instance, in the current study, each grade would have had only 5 reviewers. With only 5 reviewers, it is quite possible that one grade could, by chance, get 5 overly-severe or overly-lenient raters. Like all item response models, the MFRM is scale free (Embretson & Reise, 2000), and all rater severity calibrations were calculated relative to the other raters in the analysis. Thus, if one grade happened to have 5 overly-severe or overly-lenient reviewers, the MFRM

would do little to correct the bias. Including all 15 reviewers in the same analysis reduced the likelihood of the above scenario. It is also worth noting, however, that an increase in the number of raters also would have reduced the chances of group rater effects occurring within a grade.

Conclusions and Future Research

This study differed from previous research in three important ways. First, all data were collected through an online design, which comes with substantial practical benefits. Conducting reviews online also potentially increases the external validity of the study by efficiently drawing reviewers together from broad and diverse contexts. Second, all alignment data were collected with fully independent reviews. While independent reviews are not unique (Porter, 2004), the purpose of the independent reviews was unique. There were two broad reasons for gathering reviews independently: (a) alleviating any potential socially mediated biases, and (b) opening the possibility for more advanced data analyses by not violating the independence of observations assumption. Third, all data were analyzed with a many-facets Rasch model to document and control for rater effects. This study represents the first application we have found of latent trait methods to alignment data, which likely leads to more accurate representations of the alignment of each item, while simultaneously providing modes of documenting within-rater consistency and controlling for between-rater severity. Future research should examine the effect of additional reviewers on the standard error of the estimate and empirically test how each additional reviewer increases precision. The standard errors will always be largest on the most extreme scores, and it is likely that adding reviewers reaches a point of diminishing returns. However, where this threshold lies remains an empirical question, and one that likely has to be balanced with the financial costs associated with adding

additional reviewers. This research should also examine how precision changes when adding raters who do or do not fit the model expectations. It would also be interesting to investigate the effect of different training procedures on the consistency of rater responses. While online webinars are a cost-effective and convenient method of delivering trainings, some may reasonably argue that webinar trainings lack the depth of in-person trainings. To evaluate the effect of training procedures, one could conduct an alignment review with raters randomly assigned to each training condition, and the consistency of raters could be evaluated by comparing the mean square outfit values of raters within each group. In our study, all participants were trained via a webinar, and all raters appeared to meet the Rasch model expectations quite well. However, the training component was not of primary interest in this study, and more research would be needed to address it specifically. It is also worth noting that if in-person trainings were to be conducted, then the geographical distribution of raters would, again, be of concern given the substantial financial cost associated with convening experts from distant locations. Evidence of item-to-standards alignment is a critical part of establishing the validity of large-scale tests used in standards-based educational contexts. If inferences of students' ability from tests are made relative to the standards the test items are intended to measure, then the validity of the inference hinges on adequate alignment. To date, few advanced statistical analyses have been applied to alignment data, perhaps due in part to the non-independence of observations. Yet, given the high-stakes nature of many tests, and the importance of alignment, it is critical that researchers continue to address the best methods of determining alignment, and not just which dimensions of alignment items should be scored. That is, are average ratings sufficient? Do panel methods, with experts discussing and debating item alignment, lead to more

accurate representations of alignment than independent reviews? And how do panel ratings compare (directly) to independent reviews that have been adjusted for rater severities? It is important that these questions are addressed with future research. Little consensus currently exists on the most appropriate mode of arriving at an alignment score. Many of the practical problems of previous alignment research could be addressed through the application of latent trait methods with independent online reviews: rater effects could be documented and controlled, reviewers could participate from diverse geographical areas, and a large pool of items could be efficiently reviewed. Yet, the method proposed here is not without its challenges and limitations, and future research should continue to address methods for arriving at alignment ratings.

References

Anderson, D., Irvin, P. S., Patarapichayatham, C., Alonzo, J., & Tindal, G. (2012). The development and scaling of the easycbm CCSS middle school mathematics measures (Technical Report No. 1207). Eugene, OR: Behavioral Research and Teaching, University of Oregon.

Anderson, D., Lai, C.-F., Alonzo, J., & Tindal, G. (2011). Examining a grade-level math CBM designed for persistently low-performing students. Educational Assessment, 16.

Andrich, D. A. (1978). A rating formulation for ordered response categories. Psychometrika, 43.

Asch, S. E. (1956). Studies of independence and conformity: A minority of one against a unanimous majority. Psychological Monographs: General and Applied, 70(9).

Behavioral Research and Teaching. (2013). Distributed Item Review. Retrieved March 27, 2013.

Bond, R. (2005). Group size and conformity. Group Processes & Intergroup Relations, 8(4).

Brown, R. (1988). Group processes: Dynamics within and between groups. Cambridge, MA: Basil Blackwell.

Case, B. J., Jorgensen, M. A., & Zucker, S. (2004). Alignment in educational assessment. Pearson.

Council of Chief State School Officers (CCSSO). (2002). Models for alignment analysis and assistance to states. Washington, DC: Author.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Florence, KY: Psychology Press.

Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31.

Kocher, M. G., & Sutter, M. (2005). The decision maker matters: Individual versus group behaviour in experimental beauty contest games. The Economic Journal, 115.

La Marca, P. M. (2001). Alignment of standards and assessments as an accountability criterion. Practical Assessment, Research & Evaluation, 7(21). Retrieved October 17, 2012.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.

Linacre, J. M. (2000). Comparing "partial credit" and "rating scale" models. Rasch Measurement Transactions. Retrieved July 5, 2012.

Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16, 878.

Linacre, J. M. (2012). Facets computer program for many-facet Rasch measurement (Version 3.70). Beaverton, OR: Winsteps.com.

Martone, A., & Sireci, S. G. (2009). Evaluating alignment between curriculum, assessment, and instruction. Review of Educational Research, 79.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2).

Myford, C. M., & Wolfe, E. W. (2000). Strengthening the ties that bind: Improving the linking network in sparsely connected rating designs (TOEFL Technical Report No. 15). Princeton, NJ: Educational Testing Service.

No Child Left Behind Act, Pub. L. No. 107-110 (2001).

Porter, A. C. (2002). Measuring the content of instruction: Uses in research and practice. Educational Researcher, 31.

Porter, A. C. (2004). Curriculum assessment. In J. Green, G. Camilli, & P. Elmore (Eds.), Complementary methods for research in education. Washington, DC: American Educational Research Association.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.

Rothman, R., Slattery, J. B., Vranek, J. L., & Resnick, L. B. (2002). Benchmarking and alignment of standards and testing (Technical Report 566). Washington, DC: Center for the Study of Evaluation.

Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88.

Salzberger, T. (2010). Does the Rasch model convert an ordinal scale into an interval scale? Rasch Measurement Transactions, 24.

Thompson, N. A., & Weiss, D. J. (2011). A framework for the development of computerized adaptive tests. Practical Assessment, Research & Evaluation, 16(1), 1-9.

Webb, N. L. (1999). Alignment of science and mathematics standards and assessments in four states (Research Monograph No. 18). Madison, WI: University of Wisconsin-Madison, National Institute for Science Education.

Webb, N. L. (2007). Issues related to judging the alignment of curriculum standards and assessments. Applied Measurement in Education, 20.

Webb, N. L., Alt, M., Ely, R., & Vesperman, B. (2005). Web Alignment Tool (WAT): Training manual (Version 1.1). Council of Chief State School Officers; Wisconsin Center for Education Research.

Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology Science, 46.

Wright, B. D. (1982). Dichotomous Rasch model derived from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3, 62.

Table 1

Rater Sampling Plan

Rows: raters a through o (one row per rater). Columns: Grade 6 (Sets 1-5), Grade 7 (Sets 1-5), and Grade 8 (Sets 1-5), with a mark in each of the three review sets assigned to that rater.

Note. Each letter, a-o, represents an individual reviewer. Each reviewer is further represented by one and only one row. The table displays the overlap of raters across items so that each item was rated by at least 3 reviewers, while the raters themselves linked across item sets, allowing for the calibration of raters and items on a common scale. Each set contained 90 items.

Table 2

Random Sample of 15 Items: Item-Standard Alignment Results

Columns: Item, Grade, Obs. Avg., Adj. Avg., Endorsability, S.E., and fit statistics (Infit, Infit Z, Outfit, Outfit Z).

Note. Endorsability = MFRM-scaled alignment rating, reported on the logit scale. Lower values indicate easier-to-endorse items, and thus items with a higher degree of alignment. Adj. Avg. = endorsability statistic rescaled to the original rating scale, adjusted for rater severity.

Table 3

Summary of Rater Effects

Columns: Rater, Count, Avg. Rtg., Severity, S.E., and fit statistics (Infit, Infit Z, Outfit, Outfit Z).

Note. Count = number of items judged by the rater. The severity statistic is reported on the logit scale, with higher values indicating a more severe rater.

Figure 1. Online web-based tool used for gauging item alignment. The Distributed Item Review (DIR) is a secure web-based system for presenting test items and test forms to experts across a broad geographic region for reviews of bias, sensitivity, and alignment to standards. In the current study, the DIR was used as the online framework for gauging item alignment to the Common Core State Standards.


More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

Construct Invariance of the Survey of Knowledge of Internet Risk and Internet Behavior Knowledge Scale

Construct Invariance of the Survey of Knowledge of Internet Risk and Internet Behavior Knowledge Scale University of Connecticut DigitalCommons@UConn NERA Conference Proceedings 2010 Northeastern Educational Research Association (NERA) Annual Conference Fall 10-20-2010 Construct Invariance of the Survey

More information

MEASURING AFFECTIVE RESPONSES TO CONFECTIONARIES USING PAIRED COMPARISONS

MEASURING AFFECTIVE RESPONSES TO CONFECTIONARIES USING PAIRED COMPARISONS MEASURING AFFECTIVE RESPONSES TO CONFECTIONARIES USING PAIRED COMPARISONS Farzilnizam AHMAD a, Raymond HOLT a and Brian HENSON a a Institute Design, Robotic & Optimizations (IDRO), School of Mechanical

More information

The Functional Outcome Questionnaire- Aphasia (FOQ-A) is a conceptually-driven

The Functional Outcome Questionnaire- Aphasia (FOQ-A) is a conceptually-driven Introduction The Functional Outcome Questionnaire- Aphasia (FOQ-A) is a conceptually-driven outcome measure that was developed to address the growing need for an ecologically valid functional communication

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Does factor indeterminacy matter in multi-dimensional item response theory?

Does factor indeterminacy matter in multi-dimensional item response theory? ABSTRACT Paper 957-2017 Does factor indeterminacy matter in multi-dimensional item response theory? Chong Ho Yu, Ph.D., Azusa Pacific University This paper aims to illustrate proper applications of multi-dimensional

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations May 215 Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items Tamara Beth

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

RESEARCH ARTICLES. Brian E. Clauser, Polina Harik, and Melissa J. Margolis National Board of Medical Examiners

RESEARCH ARTICLES. Brian E. Clauser, Polina Harik, and Melissa J. Margolis National Board of Medical Examiners APPLIED MEASUREMENT IN EDUCATION, 22: 1 21, 2009 Copyright Taylor & Francis Group, LLC ISSN: 0895-7347 print / 1532-4818 online DOI: 10.1080/08957340802558318 HAME 0895-7347 1532-4818 Applied Measurement

More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p ) Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement

More information

ISC- GRADE XI HUMANITIES ( ) PSYCHOLOGY. Chapter 2- Methods of Psychology

ISC- GRADE XI HUMANITIES ( ) PSYCHOLOGY. Chapter 2- Methods of Psychology ISC- GRADE XI HUMANITIES (2018-19) PSYCHOLOGY Chapter 2- Methods of Psychology OUTLINE OF THE CHAPTER (i) Scientific Methods in Psychology -observation, case study, surveys, psychological tests, experimentation

More information

Measuring change in training programs: An empirical illustration

Measuring change in training programs: An empirical illustration Psychology Science Quarterly, Volume 50, 2008 (3), pp. 433-447 Measuring change in training programs: An empirical illustration RENATO MICELI 1, MICHELE SETTANNI 1 & GIULIO VIDOTTO 2 Abstract The implementation

More information

Alignment of Mathematics State-level Standards and Assessments: The Role of Reviewer Agreement. CSE Report 685

Alignment of Mathematics State-level Standards and Assessments: The Role of Reviewer Agreement. CSE Report 685 Alignment of Mathematics State-level Standards and Assessments: The Role of Reviewer Agreement CSE Report 685 Noreen Webb and Joan Herman University of California Norman Webb University of Wisconsin, Madison

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

Lessons in biostatistics

Lessons in biostatistics Lessons in biostatistics The test of independence Mary L. McHugh Department of Nursing, School of Health and Human Services, National University, Aero Court, San Diego, California, USA Corresponding author:

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

Use of Structure Mapping Theory for Complex Systems

Use of Structure Mapping Theory for Complex Systems Gentner, D., & Schumacher, R. M. (1986). Use of structure-mapping theory for complex systems. Presented at the Panel on Mental Models and Complex Systems, IEEE International Conference on Systems, Man

More information

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses Journal of Modern Applied Statistical Methods Copyright 2005 JMASM, Inc. May, 2005, Vol. 4, No.1, 275-282 1538 9472/05/$95.00 Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement

More information

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy Industrial and Organizational Psychology, 3 (2010), 489 493. Copyright 2010 Society for Industrial and Organizational Psychology. 1754-9426/10 Issues That Should Not Be Overlooked in the Dominance Versus

More information

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 4-2014 Modeling DIF with the Rasch Model: The Unfortunate Combination

More information

Construct Validity of Mathematics Test Items Using the Rasch Model

Construct Validity of Mathematics Test Items Using the Rasch Model Construct Validity of Mathematics Test Items Using the Rasch Model ALIYU, R.TAIWO Department of Guidance and Counselling (Measurement and Evaluation Units) Faculty of Education, Delta State University,

More information

Combining Dual Scaling with Semi-Structured Interviews to Interpret Rating Differences

Combining Dual Scaling with Semi-Structured Interviews to Interpret Rating Differences A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

HPS301 Exam Notes- Contents

HPS301 Exam Notes- Contents HPS301 Exam Notes- Contents Week 1 Research Design: What characterises different approaches 1 Experimental Design 1 Key Features 1 Criteria for establishing causality 2 Validity Internal Validity 2 Threats

More information

Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis

Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis Project RESET, Boise State University Evelyn Johnson, Angie Crawford, Yuzhu Zheng, & Laura Moylan NCME Annual

More information

Author s response to reviews

Author s response to reviews Author s response to reviews Title: The validity of a professional competence tool for physiotherapy students in simulationbased clinical education: a Rasch analysis Authors: Belinda Judd (belinda.judd@sydney.edu.au)

More information

Critical Thinking Assessment at MCC. How are we doing?

Critical Thinking Assessment at MCC. How are we doing? Critical Thinking Assessment at MCC How are we doing? Prepared by Maura McCool, M.S. Office of Research, Evaluation and Assessment Metropolitan Community Colleges Fall 2003 1 General Education Assessment

More information

Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education.

Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education. The Reliability of PLATO Running Head: THE RELIABILTY OF PLATO Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO M. Ken Cor Stanford University School of Education April,

More information

2 Types of psychological tests and their validity, precision and standards

2 Types of psychological tests and their validity, precision and standards 2 Types of psychological tests and their validity, precision and standards Tests are usually classified in objective or projective, according to Pasquali (2008). In case of projective tests, a person is

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

Evaluating the quality of analytic ratings with Mokken scaling

Evaluating the quality of analytic ratings with Mokken scaling Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch

More information

Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English

Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English Yoshihito SUGITA Yamanashi Prefectural University Abstract This article examines the main data of

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

Diagnostic Opportunities Using Rasch Measurement in the Context of a Misconceptions-Based Physical Science Assessment

Diagnostic Opportunities Using Rasch Measurement in the Context of a Misconceptions-Based Physical Science Assessment Science Education Diagnostic Opportunities Using Rasch Measurement in the Context of a Misconceptions-Based Physical Science Assessment STEFANIE A. WIND, 1 JESSICA D. GALE 2 1 The University of Alabama,

More information

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT Evaluating and Restructuring Science Assessments: An Example Measuring Student s Conceptual Understanding of Heat Kelly D. Bradley, Jessica D. Cunningham

More information

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative

More information

Context of Best Subset Regression

Context of Best Subset Regression Estimation of the Squared Cross-Validity Coefficient in the Context of Best Subset Regression Eugene Kennedy South Carolina Department of Education A monte carlo study was conducted to examine the performance

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

Students' perceived understanding and competency in probability concepts in an e- learning environment: An Australian experience

Students' perceived understanding and competency in probability concepts in an e- learning environment: An Australian experience University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part A Faculty of Engineering and Information Sciences 2016 Students' perceived understanding and competency

More information

Having your cake and eating it too: multiple dimensions and a composite

Having your cake and eating it too: multiple dimensions and a composite Having your cake and eating it too: multiple dimensions and a composite Perman Gochyyev and Mark Wilson UC Berkeley BEAR Seminar October, 2018 outline Motivating example Different modeling approaches Composite

More information

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University

More information

Experimental Psychology

Experimental Psychology Title Experimental Psychology Type Individual Document Map Authors Aristea Theodoropoulos, Patricia Sikorski Subject Social Studies Course None Selected Grade(s) 11, 12 Location Roxbury High School Curriculum

More information

Establishing Interrater Agreement for Scoring Criterion-based Assessments: Application for the MEES

Establishing Interrater Agreement for Scoring Criterion-based Assessments: Application for the MEES Establishing Interrater Agreement for Scoring Criterion-based Assessments: Application for the MEES Mark C. Hogrebe Washington University in St. Louis MACTE Fall 2018 Conference October 23, 2018 Camden

More information

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH DANIEL Z. LEVIN Management and Global Business Dept. Rutgers Business School Newark and New Brunswick Rutgers

More information

Throwing the Baby Out With the Bathwater? The What Works Clearinghouse Criteria for Group Equivalence* A NIFDI White Paper.

Throwing the Baby Out With the Bathwater? The What Works Clearinghouse Criteria for Group Equivalence* A NIFDI White Paper. Throwing the Baby Out With the Bathwater? The What Works Clearinghouse Criteria for Group Equivalence* A NIFDI White Paper December 13, 2013 Jean Stockard Professor Emerita, University of Oregon Director

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

A Broad-Range Tailored Test of Verbal Ability

A Broad-Range Tailored Test of Verbal Ability A Broad-Range Tailored Test of Verbal Ability Frederic M. Lord Educational Testing Service Two parallel forms of a broad-range tailored test of verbal ability have been built. The test is appropriate from

More information

Evaluating the Consistency of Test Content across Two Successive Administrations of a State- Mandated Science and Technology Assessment 1,2

Evaluating the Consistency of Test Content across Two Successive Administrations of a State- Mandated Science and Technology Assessment 1,2 Evaluating the Consistency of Test Content across Two Successive Administrations of a State- Mandated Science and Technology Assessment 1,2 Timothy O Neil 3 and Stephen G. Sireci University of Massachusetts

More information

A framework for predicting item difficulty in reading tests

A framework for predicting item difficulty in reading tests Australian Council for Educational Research ACEReSearch OECD Programme for International Student Assessment (PISA) National and International Surveys 4-2012 A framework for predicting item difficulty in

More information

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, The Western Aphasia Battery (WAB) (Kertesz, 1982) is used to classify aphasia by classical type, measure overall severity, and measure change over time. Despite its near-ubiquitousness, it has significant

More information

CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY. This chapter addresses the research design and describes the research methodology

CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY. This chapter addresses the research design and describes the research methodology CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY 7.1 Introduction This chapter addresses the research design and describes the research methodology employed in this study. The sample and sampling procedure is

More information

PÄIVI KARHU THE THEORY OF MEASUREMENT

PÄIVI KARHU THE THEORY OF MEASUREMENT PÄIVI KARHU THE THEORY OF MEASUREMENT AGENDA 1. Quality of Measurement a) Validity Definition and Types of validity Assessment of validity Threats of Validity b) Reliability True Score Theory Definition

More information

Assignment Research Article (in Progress)

Assignment Research Article (in Progress) Political Science 225 UNIVERSITY OF CALIFORNIA, SAN DIEGO Assignment Research Article (in Progress) Philip G. Roeder Your major writing assignment for the quarter is to prepare a twelve-page research article

More information

Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College

Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College Background of problem in assessment for elderly Key feature of CCAS Structural Framework of CCAS Methodology Result

More information

Regression Discontinuity Analysis

Regression Discontinuity Analysis Regression Discontinuity Analysis A researcher wants to determine whether tutoring underachieving middle school students improves their math grades. Another wonders whether providing financial aid to low-income

More information

On the usefulness of the CEFR in the investigation of test versions content equivalence HULEŠOVÁ, MARTINA

On the usefulness of the CEFR in the investigation of test versions content equivalence HULEŠOVÁ, MARTINA On the usefulness of the CEFR in the investigation of test versions content equivalence HULEŠOVÁ, MARTINA MASARY K UNIVERSITY, CZECH REPUBLIC Overview Background and research aims Focus on RQ2 Introduction

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings

A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings Psychological Test and Assessment Modeling, Volume 60, 2018 (1), 33-52 A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings George Engelhard,

More information

so that a respondent may choose one of the categories to express a judgment about some characteristic of an object or of human behavior.

so that a respondent may choose one of the categories to express a judgment about some characteristic of an object or of human behavior. Effects of Verbally Labeled Anchor Points on the Distributional Parameters of Rating Measures Grace French-Lazovik and Curtis L. Gibson University of Pittsburgh The hypothesis was examined that the negative

More information

Item Analysis Explanation

Item Analysis Explanation Item Analysis Explanation The item difficulty is the percentage of candidates who answered the question correctly. The recommended range for item difficulty set forth by CASTLE Worldwide, Inc., is between

More information

Measurement Issues in Concussion Testing

Measurement Issues in Concussion Testing EVIDENCE-BASED MEDICINE Michael G. Dolan, MA, ATC, CSCS, Column Editor Measurement Issues in Concussion Testing Brian G. Ragan, PhD, ATC University of Northern Iowa Minsoo Kang, PhD Middle Tennessee State

More information

CHAPTER 3 RESEARCH METHODOLOGY

CHAPTER 3 RESEARCH METHODOLOGY CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement

More information

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus.

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. Department of Political Science UNIVERSITY OF CALIFORNIA, SAN DIEGO Philip G. Roeder Research Prospectus Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. A

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information