Running head: RATER EFFECTS & ALIGNMENT

Modeling Rater Effects in a Formative Mathematics Alignment Study

Daniel Anderson, P. Shawn Irvin, Julie Alonzo, and Gerald Tindal

University of Oregon

Acknowledgements: Funds for the data set used to generate this report came from a federal grant awarded to the University of Oregon by the Institute of Education Sciences, U.S. Department of Education: Developing Middle School Mathematics Progress Monitoring Measures (R324A, funded through June 2014).

Abstract

The alignment of test items to content standards is critical to the validity of decisions made from standards-based tests. Generally, alignment is determined based on judgments made by a panel of content experts, with ratings either averaged or a consensus reached through discussion. When panels are used, the discussions are intended to increase the consistency of ratings and provide a final value for each item. However, the process is necessarily socially mediated, introducing a potential source of bias. To minimize this bias, we present a method for gauging item-standard alignment with an online independent review system, while controlling for theoretically important rater effects with a latent trait analysis. We apply this method to a large pool of formative mathematics test items to illustrate the technique and explore the degree of alignment between items and standards.

Modeling Rater Effects in a Formative Mathematics Alignment Study

Item alignment to standards is crucial to the validity of standards-based test inferences (La Marca, 2001). Test-based accountability policies, such as the No Child Left Behind Act (NCLB, 2001), depend upon the alignment of items to standards to draw inferences about the effectiveness of educational institutions within standards-based accountability systems. Alignment is also important for formative evaluation systems, as the validity of teachers' instructional decisions depends, in part, upon the degree of item alignment: Misaligned items may lead to misguided instructional decisions. During typical alignment studies, multiple raters judge each item on various dimensions (degree to which it is aligned with standards, depth of knowledge, etc.). The process either entails independent reviews (Porter, 2002, 2004) or panels of experts convening for collaborative/discussion-based reviews (Rothman, Slattery, Vranek, & Resnick, 2002; Webb, 1999, 2007). With each approach, the ratings from the multiple reviewers must somehow be combined. Porter (2002, 2004) and Webb (1999, 2007) each recommend averaging the ratings, while Rothman et al. (2002) recommend that the panel of experts continue discussion until a single consensus rating can be established. In this paper, we focus on the process of arriving at an alignment score while also attempting to maintain independence in ratings. In the system we used, ratings from multiple raters are combined through latent trait methods, as opposed to average or consensus methods, to more accurately represent the item location along the alignment continuum. All alignment data reflect independent reviews conducted online with a large pool (1,345) of interim-formative mathematics assessment items. We also use a latent trait model to statistically control for rater

4 RATER EFFECTS & ALIGNMENT 4 severity and better evaluate intra-rater reliability. Finally, we highlight practical benefits of an online review, particularly cost. The Panel Method When panel methods are used, groups of content-matter experts are recruited and convened to discuss and debate the alignment of test items to standards. Content-matter experts generally come from backgrounds such as classroom teachers and curriculum specialists (Rothman et al., 2002). In his influential report, Webb (1999) describes a panel methodology to establish the alignment of test items with content standards. He describes the process as a fourday Alignment Analysis Institute (p. 2), where raters judged the alignment of statewide math and science assessments from two or three grade levels. He also notes that the majority of raters attended a one-day meeting prior to the institute, where they were introduced to the study and the alignment process. In later reports, Webb (2007) notes that a three-day process is perhaps more typical, but that the length of the institute depends on the number of grades to be analyzed, the length of the standards, the length of the assessments, and the number of assessment forms under consideration (pp. 8-9). He recommends 5-8 raters for each panel, noting that an increase in raters will generally increase reliability. During the alignment process outlined by Webb, raters first independently rate the degree of item-standard alignment and then discuss their ratings with others on the panel. Following the discussion, raters are provided the opportunity to modify their ratings. Finally, ratings from the multiple reviewers are averaged to determine if alignment criteria are met, with standard deviations reported as an indicator of the variance between raters. Rothman et al. (2002) proposed the Achieve model, which shares some procedural similarities with Webb s model: Panels of experts are convened, and they work together to

establish alignment ratings for all items. However, while the ratings from the multiple reviewers are averaged under Webb's model (1999, 2007), the Achieve model stipulates that a single consensus rating must be reached through group discussion (Case, Jorgensen, & Zucker, 2004; Martone & Sireci, 2009). Rothman et al. note that the process can be quite expensive and that the pool of raters should represent "a diversity of viewpoints" (p. 8), to ensure alignment ratings are not biased toward certain groups.

Panel Methods and Threats to Validity

While the panel method has been well documented (Council of Chief State School Officers (CCSSO), 2002), it also may become impractical in a number of situations, particularly if the pool of items to be reviewed is large, or if the individuals with the needed expertise are broadly distributed geographically. For example, many interim-formative assessment systems contain multiple test forms within each grade that must be sensitive to a wide range of student abilities. The total item bank may thus contain several thousand items. In computer-adaptive testing situations, the total item bank may be similarly large to ensure that a sufficient number of items are calibrated at each respondent's skill level (Thompson & Weiss, 2011). In these contexts, the financial and practical costs of bringing expert reviewers together to judge the alignment of even a portion of the total item bank become, in most cases, unmanageably large. Additionally, in many rural states or regions, it may be difficult to bring together a sufficient number of content experts to form panels without considerable costs allocated to travel. Conducting reviews online alleviates many of the practical and logistical costs associated with convening experts in person. Additionally, experts from a broad geographical base (i.e., from across the country) can easily participate, which may increase the diversity of views (Rothman et al., 2002) and enhance the external validity of the ratings. For example, it is logical

to expect raters' previous experience instructing and/or assessing students to guide their interpretation of the standards. Raters' prior experiences are also contingent, in part, upon their instructional contexts (e.g., educators from rural versus urban settings). Obtaining participants from a wide range of contexts thus reduces the chances that the alignment ratings will contain contextual biases. For these reasons, an online system may be particularly useful for formative measurement systems with many items or in states or locations with a broad geographic base. An online format can accommodate both collaborative/discussion-based reviews (Webb, Alt, Ely, & Vesperman, 2005) and independent reviews (Behavioral Research and Teaching, 2013). Furthermore, the analysis of data collected through online designs need not differ from data collected from in-person reviews. For example, the online model proposed by Webb et al. (2005) allows reviewers to go through a process that mirrors the in-person institutes, including opportunities for discussion. Because the process is intended to represent an online version of the same model, the modes for determining item alignment (e.g., average ratings after group discussion) do not change. The Surveys of Enacted Curriculum is another model for gauging item alignment that is generally conducted online, with ratings from multiple independent reviewers averaged (Porter, 2004). The critical issue, therefore, is not whether the system is online but the independence of the ratings being made by reviewers. With independence of ratings, more advanced analyses may be used instead of raw averages. For example, in the panel model, ratings are not independent of one another; in fact, the opposite is deliberately the case, as independent ratings are modified through group discussion. The inherent non-independence of the data rules out the possibility of most parametric statistics, given that independence of observations (non-

correlated residuals) is a foundational assumption (see Raudenbush & Bryk, 2002). With independent ratings, this independence of observations assumption holds, opening the possibility for more advanced analyses. Furthermore, a number of other problems exist with the panel method. The panel method is thought to enhance inter-rater reliability through a process of raters defending their initial judgments (elaborating on their thought processes, and potentially modifying their ratings to reach a final judgment). Basically, group discussions are designed to facilitate convergence of ratings and counter the biases of any single rater. In other words, raters who are overly harsh or lenient are more likely to be influenced by the panel and subsequently allowed to modify their ratings (Kocher & Sutter, 2005). By design, however, all panel models provide opportunities for a single rater to affect the judgment of others, and the reverse outcome may occur, with the outlying rater influencing the panel. Persuasion is an essential component of the within-group dynamics present in panel settings (Brown, 1988), and an overbearing rater has the potential to sway others in a manner that over- or underrepresents the alignment of the item. Indeed, the process of persuasion and subsequent conformity occurs regularly. In a seminal study on conformity in groups, Asch (1956) found that the rate of conformity in small groups rose rapidly when individuals faced a group of two or three. More recently, in a meta-analysis of studies replicating Asch's experiments, Bond (2005) found largely the same pattern of small group conformity. If an overbearing individual in a small group setting, such as an alignment panel, convinces one or two individuals to move in their direction, the opinion of the group may follow, with conformity/consensus the end product.

Independent Reviews and Possibilities for Scaling

In contrast to the panel method, independent ratings are not subject to the group dynamics noted above that can potentially bias the results. Yet, independent reviews also come with considerable challenges. If raters participating in the study exhibit any of the common rater effects (see Engelhard, 1994; Myford & Wolfe, 2000; Saal, Downey, & Lahey, 1980), then items that are aligned may be misrepresented as misaligned (false negative), while items that are misaligned may be misrepresented as aligned (false positive). Although this is also true with panel designs, the concern may be more pronounced with independent reviews because the group cannot help temper extreme ratings/reviewers. Raters routinely exhibiting bias in one direction or another can have substantial practical repercussions (Wolfe, 2004). False negative items (those mistakenly rated as not aligned) can waste valuable test-developer resources, as the items may be revised needlessly or discarded, while false positive items (items mistakenly rated as aligned) threaten the validity of standards-based test inferences. However, because the observations are independent, latent trait methods such as the many-facets Rasch model (MFRM) can be applied to understand and control these rater effects. The MFRM provides a flexible framework from which data collected from independent alignment reviews can be analyzed. Using the MFRM, both the internal consistency and severity of each rater can be documented, with the latter statistically controlled. Under the MFRM, raters can be treated as a facet of the measurement process, with conditions of measurement equated provided the reviewers provide alignment ratings of common items. By equating the conditions of measurement, all item ratings are placed on a common scale, taking into account the specific rater who judged the item. Ratings provided by overly-severe raters would thus be adjusted toward the aligned end of the scale, while ratings provided by overly-lenient raters would be

adjusted toward the unaligned end of the scale. Concerns about inter-rater reliability are then mitigated, as the analysis statistically corrects for differences between raters in terms of severity. When applying the MFRM to alignment data, the item becomes the object of measurement. The multiple independent ratings for each item are then treated as indicators of the latent alignment trait, and the item location on the latent trait is calculated based on the raw sum of ratings. However, because the conditions of measurement are equated, the item location is adjusted based on the severity of each reviewer who rates the item. In the end, independent ratings from each reviewer uniquely contribute to the latent alignment trait for each item, eliminating the need for a discussion-based process to increase inter-rater reliability. The internal consistency of each rater also can be evaluated more precisely by comparing a model-based expected value with an observed value. The expected value is calculated based on (a) how the reviewer rated all other items [i.e., the severity of the rater], and (b) how all other reviewers rated the item in question [i.e., the difficulty of endorsing the item as aligned]. If the expected value closely matches the observed value, then the reviewer is operating consistently. Using the MFRM, one can also inspect standardized residuals to investigate outlier responses, which can be used to help inform ratings of specific items. Because the MFRM fits within the larger family of Rasch models, for which the raw score is a sufficient statistic for calculating latent trait locations (Wright, 1982), the values reported on the latent scale can be transposed back to the original rating scale. The MFRM thus provides both a raw average of ratings for each item, as is often applied (e.g., Porter, 2004), as well as an adjusted average (adjusted for the severity of raters judging the item), which represents the item location on the latent trait transposed back to the original rating scale. This property is quite helpful from a practical evaluative perspective because original rules for determining alignment

(e.g., on a 0-3 scale: less than 2 = not aligned, greater than 2 = aligned) can still be applied, and new rules do not have to be developed for the latent scale.

Situating the Current Study

In the current study, we apply the MFRM to data collected from an independent online alignment review. Before describing the methods, however, some unique characteristics of the study warrant note. First, the alignment review was conducted with an item bank from an interim-formative assessment system for which all items were written to measure the Common Core State Standards (CCSS) in mathematics. Evaluating alignment from an item bank, rather than intact test forms, prevented calculation of common alignment criteria described by Webb (e.g., categorical concurrence, balance of representation; 1999) or Rothman et al. (e.g., content centrality, performance centrality; 2002). Gauging alignment with interim-formative test items also presents certain challenges, given that the target population for the assessments is students performing below expectations (i.e., students whose progress is being monitored at relatively short intervals in concert with an intervention designed to increase achievement). Thus, perfect alignment to grade-level standards may result in an overly difficult test that the target population may have difficulty accessing (see Anderson, Lai, Alonzo, & Tindal, 2011). With interim-formative assessments, it may be desirable to have some items measure skills that are prerequisites to the target standard, but that (given their reduction in complexity) do not directly align with the standard. Therefore, in cases where items were not aligned with the corresponding standard, we were interested in whether the item measured a prerequisite to the standard. Prerequisite items were viewed as both desirable and necessary, and are perhaps one of the distinguishing characteristics of interim-formative assessment. Nevertheless, the

methodology likely generalizes to any item-to-standards alignment study. Below, we discuss the specific procedures used in our study.

Methods

Participant Recruitment and Item Assignment

Fifteen middle school math teachers from across the country were recruited to serve as our content expert raters. Raters were selected based on their knowledge of the CCSS and experience as middle school mathematics professionals. The full item bank used in this study included 900 items in each of grades 6-8, totaling 2,700 unique items. A pool of approximately 50% of the total item bank was selected for review. Items were selected based on a randomized matrix-sampling plan, by which items were stratified by the original item-writer and the CCSS the item was intended to measure. A total of 1,345 items were selected for review, with each rater asked to judge the alignment of 270 items through an online system. All items were written to align with only one standard, and all items were paired with the corresponding standard prior to dissemination to reviewers. Items were judged on a 4-point alignment scale, as follows: 0 = no alignment, 1 = vague alignment, 2 = somewhat aligned, 3 = directly aligned. Prior to participating in the study, all raters were trained on the alignment process and rating scale through an online webinar. Reviewers were informed that all items would be collapsed into dichotomous aligned (2 or 3) or not aligned (0 or 1) categories; however, the 4-point scale allowed for finer-grained judgments. A second conditional question was also asked of respondents when a given item was rated as not aligned (0-1), to assess whether the item

measured a prerequisite skill to the standard. This question was rated dichotomously: 0 (no) or 1 (yes). Prior to the review, all items were divided into 5 review sets within each grade (15 review sets total), each containing 90 items. Items selected for review were randomly assigned to one of the five review sets within each grade. Each rater then independently judged the alignment of three review sets assigned to him/her, based on a linking pattern that allowed for the calibration of all items and raters on a common scale. Three raters reviewed each review set. The rater sampling design, linking raters across review sets, is displayed in Table 1. All items were disseminated for review via a secure web-based system called the Distributed Item Review (DIR; Behavioral Research and Teaching, 2013), as displayed in Figure 1. Broadly, the DIR is a tool for presenting test items and test forms to reviewers in a user-friendly manner so they can be judged for dimensions of bias, sensitivity, and, in this case, alignment to standards. Test items are uploaded into the DIR and associated with researcher-defined study questions for presentation to reviewers.
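The structure of such a linked design can be sketched in a few lines of code. The snippet below is a minimal illustration, assuming a simple cyclic pattern in which each of 15 raters judges three review sets and each set is judged by three raters; it is not the exact plan used in the study (Table 1), where, for example, some raters judged two sets in one grade and one set in another.

```python
# A minimal sketch of a linked rater-by-set assignment: 15 raters, 15 review
# sets (5 per grade), every rater judging 3 sets and every set judged by 3
# raters. The cyclic pattern is illustrative only, not the study's actual plan.
import string

raters = list(string.ascii_lowercase[:15])                       # raters a..o
review_sets = [(grade, s) for grade in (6, 7, 8) for s in range(1, 6)]

assignment = {
    rater: [review_sets[(i + k * 5) % 15] for k in range(3)]
    for i, rater in enumerate(raters)
}

# Verify the two design constraints that allow raters and items to be
# calibrated on a common scale.
per_set_counts = {rs: sum(rs in sets for sets in assignment.values())
                  for rs in review_sets}
assert all(len(sets) == 3 for sets in assignment.values())
assert all(count == 3 for count in per_set_counts.values())

print(assignment["a"])   # e.g., [(6, 1), (7, 1), (8, 1)]
```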

Analysis

All data were analyzed with a many-facets Rasch model (MFRM) to document and control for rater effects. The MFRM is typically applied when the object of measurement is persons, and there are some combinatorial effects of an item and one of numerous facets (i.e., a rater) that determine the individual's score. Within an alignment study, however, the items themselves are the object of measurement. Nevertheless, the MFRM can still be applied because there are facets (i.e., the item and the judge) that determine the ultimate alignment rating of the item. The MFRM, developed by Linacre (1989), is an extension of the basic Rasch model. When data follow an ordinal scale, the MFRM can extend either Masters' (1982) partial credit model, in which each item follows its own unique scale, or Andrich's (1978) rating scale model, in which item thresholds are assumed equivalent across items. In the current study, all items were rated on the same 4-point scale and there was little theoretical reason to presume that thresholds would vary dramatically across items. Given the large number of items in the analysis (1,345), estimation of the partial credit model would also have required a substantially greater number of estimated parameters because thresholds would need to be estimated for each item individually. Further, for many items there were unobserved categories (e.g., all raters rated an item as 2 or 3, leading to empty 0 and 1 categories). The rating scale model is robust to unobserved categories; with the partial credit model, estimation is degraded (Linacre, 2000). We thus decided to apply Andrich's rating scale model, defined as

ln(P_njk / P_nj(k-1)) = B_n - C_j - F_k,     (1)

where P_njk is the probability that item n is rated into category k by rater j, B_n is the estimated location for item n on the latent alignment trait, C_j is the severity of rater j, and F_k is the Rasch-Andrich threshold between categories k - 1 and k. Because our analyses included only two facets, rater severity and item alignment, Equation 1 is actually equivalent to Andrich's (1978) rating scale model and does not require a many-facets extension. In Andrich's model, the C_j parameter typically represents item difficulty, while the B_n parameter typically represents the respondent location on the latent trait. In our model, these parameters represented rater severity and the item location, respectively. However, we discuss the model within the MFRM framework given that raters are generally referenced as a facet of the measurement process and, thus, the MFRM is typically applied when latent trait methods are used (e.g., Myford & Wolfe, 2000).
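To make Equation 1 concrete, the sketch below computes the category probabilities it implies for a single item-rater pairing, following the sign conventions of Equation 1 as written. The parameter values in the example are illustrative assumptions, not estimates from this study.

```python
import numpy as np

def rsm_category_probs(b_item, c_rater, thresholds):
    """Category probabilities implied by Equation 1 for one item-rater pair.

    b_item     : location of item n on the latent alignment trait (logits)
    c_rater    : severity of rater j (logits)
    thresholds : Rasch-Andrich thresholds F_1..F_K (logits)
    Returns probabilities for categories 0..K.
    """
    steps = b_item - c_rater - np.asarray(thresholds, dtype=float)
    # Cumulative sums give the log numerators; category 0 has numerator exp(0).
    log_numerators = np.concatenate(([0.0], np.cumsum(steps)))
    numerators = np.exp(log_numerators - log_numerators.max())  # stabilized
    return numerators / numerators.sum()

# Illustrative values: the same item judged by a lenient and a severe rater.
F = [-1.5, 0.0, 1.5]   # assumed thresholds for the 0-3 rating scale
print(rsm_category_probs(b_item=0.5, c_rater=-1.0, thresholds=F).round(3))
print(rsm_category_probs(b_item=0.5, c_rater=1.0, thresholds=F).round(3))
```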

Further, the model displayed in Equation 1 could easily generalize to include other facets of the measurement process (e.g., the domain or grade level the item was intended to measure). The reason our model simplified to the rating scale model was that there were no subjects in the analysis: the item was the subject. Our model thus represented a special case of the MFRM, in which only two latent parameters were estimated. This special case of the MFRM is analogous to Andrich's rating scale model. The design and analysis were used to control for rater severity and obtain more accurate representations of the alignment of each item. No analysis, beyond descriptive statistics, was conducted with the prerequisite skill data. All MFRM analyses were conducted with the FACETS software, version 3.70 (Linacre, 2012). As Equation 1 indicates, all items were placed on a logit scale, which is useful when evaluating the degree of alignment, as the logit transformation more closely represents a linear, equal-interval scale, provided the data fit the model (Salzberger, 2010). However, the FACETS program also reports an adjusted rating with the logit value transformed back to the original scale. As mentioned previously, these values can ease interpretation when categorical cut-points must be determined. In the present study, all items were tabulated into aligned and not aligned categories. Using the adjusted ratings transformed back to the original scale, we used the original scale cutoff value of 2.0 to determine alignment. That is, any items with a rater-adjusted value less than 2.0 were deemed not aligned, while any above this cutoff were deemed aligned. Items rated not aligned were then evaluated with respect to the prerequisite skill ratings to determine which items were truly misaligned as opposed to those that were aligned to a prerequisite skill of the standard, though not the standard itself.
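The following sketch illustrates the kind of transformation described above: an item's latent location is converted to an expected rating on the original 0-3 scale (evaluated at a rater of average, i.e., zero, severity) and then compared against the 2.0 cutoff. The thresholds and item locations are illustrative assumptions, and the sign orientation follows Equation 1 as written rather than the conventions of the FACETS output.

```python
import numpy as np

def adjusted_average(b_item, thresholds):
    """Expected rating on the original 0-3 scale for an item located at
    b_item logits, evaluated at a rater of average (zero) severity.
    Mirrors the idea of a rater-adjusted average; illustrative only."""
    steps = b_item - np.asarray(thresholds, dtype=float)  # severity fixed at 0
    log_numerators = np.concatenate(([0.0], np.cumsum(steps)))
    probs = np.exp(log_numerators - log_numerators.max())
    probs /= probs.sum()
    return float(np.dot(np.arange(probs.size), probs))

F = [-1.5, 0.0, 1.5]                       # assumed thresholds
for b_item in (-2.0, 0.5, 2.5):            # assumed item locations
    adj = adjusted_average(b_item, F)
    label = "aligned" if adj >= 2.0 else "not aligned"
    print(f"b = {b_item:+.1f}  adjusted average = {adj:.2f}  -> {label}")
```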

Results

For each item there were 6 total ratings: 3 alignment ratings, one from each of three raters, and 3 prerequisite skill ratings, one from each of the same three raters. Note that the prerequisite skill ratings were only provided when an item was rated as not aligned with the standard (0 or 1). Below, we first provide our item-standard alignment results, followed by a brief discussion of our prerequisite skills analysis. We then provide a summary of the observed rater effects.

Item-Standard Alignment

Results for 15 randomly selected items, from the 1,345 items reviewed in this study, are presented in Table 2. The table is intended to provide an example of the type of information obtained for the overall item pool. For each item, observed and rater-adjusted averages are reported. The rater-adjusted average was used in all cases to determine whether an item did or did not align with the corresponding standard, with values below 2.0 deemed not aligned. Following the adjusted average is the endorsability statistic, which indicates the ease with which raters endorsed the item as aligned with the standard (the B_n parameter in Equation 1), reported on the logit scale. Lower endorsability values indicate easier-to-endorse items. The rater-adjusted average is a direct translation of the endorsability statistic, placed back on the original rating scale. After the endorsability statistic is the estimated standard error of the endorsability statistic for the items, which provides an estimate of the precision of the rating. Note that items with more extreme ratings have higher standard errors. When examining the ratings of items, the observed average is quite different from the rater-adjusted average in some cases. For example, items 1, 2, and 7 all had the same observed average. When accounting for the raters who judged the items, however, we see differing degrees

of alignment. Both items 1 and 2 have an adjusted rating below 2.00, indicating the items are not aligned, but item 2 is rated as being more closely aligned (1.74) than item 1 (1.69). Further, we see that item 7 is adjusted (slightly) in the opposite direction (2.03). Thus, when adjusting for rater severity, our inferences of the alignment of items 1 and 2 are reversed (from aligned to not aligned), while our confidence that item 7 is aligned is strengthened. After the display of the standard error, a series of item-level fit statistics are also reported: the unstandardized and standardized infit and outfit. In their unstandardized form, both statistics center on 1.0, and provide an indication of how the item data fit the Rasch model expectations (Linacre, 2002). The infit statistic is weighted based on the patterns represented in the data. The outfit statistic is not weighted, but is more sensitive to outliers in the data. For both statistics, values above 1.0 represent items that underfit the model, which occurs with unexpected values. For example, if a severe rater judging a difficult-to-endorse item judges the item as aligned, then the item would likely appear underfit. Values below 1.0 represent items overfit to the model, which occurs with overly consistent/redundant values (Linacre, 2002). Generally, underfitting items are of greater concern than overfitting items because the estimated alignment rating is calculated from unexpected responses, and is thus more likely to be biased. Overall, 87.73% of all items had a rater-adjusted average rating above 2.0, suggesting they were aligned to their corresponding standard. Because only three reviewers judged the alignment of each item, the standard errors were quite large. Of the 1,180 items rated as aligned, the mean square outfit ranged from 0.05 to 7.79, with an average of 0.93, while the mean square infit ranged from 0.02 to 4.50. Using both the standard error and the fit statistics, we can begin to formulate degrees of confidence in the observed alignment rating.
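As a rough illustration of how these mean square fit statistics are formed, the sketch below computes them from an item's observed ratings and the model-expected ratings and variances: the outfit is the unweighted average of squared standardized residuals, while the infit weights each residual by its model variance. The numbers used are illustrative assumptions, not values from Table 2.

```python
import numpy as np

def fit_mean_squares(observed, expected, variances):
    """Unstandardized infit and outfit mean squares for a single item.

    observed  : ratings the item actually received from its raters
    expected  : model-expected rating for each item-rater pair
    variances : model variance of each rating under the model
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variances = np.asarray(variances, dtype=float)
    squared_residuals = (observed - expected) ** 2
    outfit = float(np.mean(squared_residuals / variances))    # outlier-sensitive
    infit = float(squared_residuals.sum() / variances.sum())  # variance-weighted
    return infit, outfit

# Illustrative values for one item judged by three linked raters.
infit, outfit = fit_mean_squares(observed=[3, 2, 1],
                                 expected=[2.6, 2.1, 1.8],
                                 variances=[0.45, 0.60, 0.55])
print(round(infit, 2), round(outfit, 2))   # values near 1.0 indicate good fit
```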

For items that do not fit the model well and have large standard errors, our confidence in the alignment estimate would be lower, relative to items that fit the model well and have small standard errors.

Prerequisite Skills Analysis

The prerequisite skills question was a conditional response, only asked of raters when they had judged the item as not aligned with the standard, and it thus created dependencies and inherent sparseness in the data. For example, raters who were more severe judged more items as not aligning with the standard, and thus judged more items with respect to prerequisite skills. Similarly, less-severe raters judged more items as aligned to standards, leading to fewer responses with respect to prerequisite skills for such raters. The missing data thus depended on which rater judged the item. The dependencies and sparseness of the data led us to examine prerequisite skills only through a descriptive lens. To determine whether an item addressed a prerequisite skill, we used a group majority rule, by which at least 2 of the 3 raters had to judge the item as either (a) aligned to the standard, or (b) addressing a prerequisite skill. Only 165 of the 1,345 items reviewed (12%) were rated as not aligned with the standard. Of these 165 items, 160 (97%) were rated as addressing a prerequisite skill to the standard under this rule. Thus, 99.6% of all items in this study were rated as either aligning with the corresponding standard or addressing a prerequisite skill to the standard.
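A minimal sketch of this majority rule is shown below. The data layout (a list of the three alignment ratings plus a mapping of the conditional prerequisite answers) is an assumption for illustration, not the format of the study data.

```python
def addresses_standard_or_prerequisite(alignment_ratings, prerequisite_answers):
    """Group majority rule used in the prerequisite-skill analysis (sketch).

    alignment_ratings    : the three 0-3 alignment ratings for an item
    prerequisite_answers : dict mapping a rater's position (0-2) to their 0/1
                           prerequisite answer; present only for raters who
                           rated the item as not aligned (0 or 1)
    Returns True when at least 2 of the 3 raters judged the item as aligned
    (a rating of 2 or 3) or as addressing a prerequisite skill.
    """
    votes = 0
    for position, rating in enumerate(alignment_ratings):
        if rating >= 2:
            votes += 1                                   # judged aligned
        elif prerequisite_answers.get(position, 0) == 1:
            votes += 1                                   # prerequisite skill
    return votes >= 2

# Illustrative item: two raters saw no alignment but flagged a prerequisite.
print(addresses_standard_or_prerequisite([1, 1, 3], {0: 1, 1: 1}))   # True
print(addresses_standard_or_prerequisite([0, 1, 3], {0: 0, 1: 0}))   # False
```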

Rater Effects

A summary of rater effects for all 15 raters included in the study is displayed in Table 3. Of particular note is the severity statistic, reported on the logit scale. The severity of the raters differed substantially. An examination of the average rating column indicated that the most severe rater (Rater 3) judged items, on average, nearly a full category below the most lenient rater (Rater 11). These differences in rater severity are shown visually in Figure 2. The left-most column of the figure displays logit values from the analysis, mapped vertically. To the right of these values is the distribution of rater severities, mapped against the logit values. To the right of the rater column is the distribution of items, mapped against the same logit scale. Finally, the furthest-right column of the figure displays the raw rating scale, ranging from 0-3. The figure thus displays how the distribution of rater severities compared to the distribution of item endorsabilities on a common scale. Note that the item endorsabilities are quite severely skewed, as would be expected given that the majority of items were rated as aligned. The fit statistics in Table 3 also are important because they provide an indication of the internal consistency of raters. That is, given the severity of the rater, as determined by his or her scoring of all other items, and the endorsability of the item, did he or she consistently rate items as would be expected? The outfit statistic is perhaps most useful for this purpose with, again, values above 1.0 indicating underfit to the model and values below 1.0 indicating overfit to the model. These fit statistics provide an indication of intra-rater reliability. A value of 1.0 is the ideal fit to the model, though mild deviations may occur with little cause for concern. We found that all raters in our study fit the model expectations quite well, with the lowest mean square outfit statistic at 0.76. These values are consistent with guidelines reported by Myford and Wolfe (2000).

Discussion

The primary purpose of this study was to propose and illustrate a methodology for conducting alignment reviews. To be clear, our purpose was not to suggest new methods for scoring dimensions of alignment, as has been proposed by others (e.g., Porter, 2002, 2004;

Rothman et al., 2002; Webb, 1999, 2007). Rather, we proposed a new methodology for arriving at a score (with latent trait methods) through more efficient means (online independent reviews). With such an approach, dimensions of alignment can then be calculated post-hoc. A secondary purpose of this study was to explore the alignment of the easycbm formative middle school mathematics test items with the Common Core State Standards. Overall, the alignment was quite high. Although only a portion of the total item bank was selected for review, the proportion of aligned items should readily generalize to the full item bank, given that the same procedures were used during the development of all items (see Anderson, Irvin, Patarapichayatham, Alonzo, & Tindal, 2012), and the matrix sampling plan likely resulted in a representative sample of the total item bank. The results obtained from this study also highlight that important differences in item alignment can be observed when statistically controlling for rater effects. While it is unknown what the alignment ratings of specific items would have been under discussion-based/panel designs, we observed in some cases quite substantial differences between the raw averages, collected from the independent review, and the averages adjusted for rater severity. It is possible that some of these differences would have been alleviated during group discussions, but independent reviews are also free from social biases that could result from the group dynamics, such as group conformity and persuasion (Asch, 1956; Bond, 2005; Brown, 1988; Kocher & Sutter, 2005). Independent reviews are also advantageous in that they are easy to conduct online, leading to numerous practical and financial benefits, although efforts have been made to move discussion-based reviews online (Webb et al., 2005). Finally, independent reviews open the possibility for more advanced statistical analyses, given that the independence of observations assumption is not violated. Previous research using independent reviews (e.g., Porter, 2004) has

reached conclusions about item alignment by averaging the multiple ratings. Applying more advanced analyses, such as the MFRM, allows the researcher to control for individual rater effects statistically, while still avoiding socially-mediated processes.

Limitations

Although the primary purpose of this study was to propose a new methodology for arriving at alignment scores, the actual design used was not without limitations. Perhaps most fundamentally, Webb (2007) suggested 5-7 reviewers judge the alignment of any one item, and the results of this study largely support those numbers. Only three judges were used in this study, and the resulting standard errors were larger than is perhaps desirable. Further, some may question the logic of having reviewers judge the alignment of items across grades, rather than just within a grade. That is, when examining Table 1, reviewer a judged the alignment of two item sets within grade 6, and one item set within grade 8. The expertise of this rater thus needed to span both grades 6 and 8. However, we felt that the vertical linking of raters was preferable given that the items themselves were developed to be on a common vertical scale (Anderson et al., 2012). Thus, in keeping with the scale of the items, we reviewed the alignment of all items with a vertical sampling plan. Further, if judges did not link across grades, then the total number of reviewers from which to equate the conditions of measurement would be much smaller, and the possibility of unaccounted-for rater effects unique to groups would be greater. For instance, in the current study, each grade would have had only 5 reviewers. With only 5 reviewers, it is quite possible that one grade could, by chance, get 5 overly-severe or overly-lenient raters. Like all item response models, the MFRM is scale free (Embretson & Reise, 2000), and all rater severity calibrations were calculated relative to the other raters in the analysis. Thus, if one grade happened to have 5 overly-severe or overly-lenient reviewers, the MFRM

would do little to correct the bias. Including all 15 reviewers in the same analysis reduced the likelihood of the above scenario. It is also worth noting, however, that an increase in the number of raters also would have reduced the chances of group rater effects occurring within a grade.

Conclusions and Future Research

This study differed from previous research in three important ways. First, all data were collected through an online design, which comes with substantial practical benefits. Conducting reviews online also potentially increases the external validity of the study by efficiently drawing reviewers together from broad and diverse contexts. Second, all alignment data were collected with fully independent reviews. While independent reviews are not unique (Porter, 2004), the purpose of the independent reviews was unique. There were two broad reasons for gathering reviews independently: (a) alleviating any potential socially mediated biases, and (b) opening the possibility for more advanced data analyses by not violating the independence of observations assumption. Third, all data were analyzed with a many-facets Rasch model to document and control for rater effects. This study represents the first application we have found of latent trait methods to alignment data, which likely leads to more accurate representations of the alignment of each item, while simultaneously providing modes of documenting within-rater consistency and controlling for between-rater severity. Future research should examine the effect of additional reviewers on the standard error of the estimate and empirically test how each additional reviewer increases precision. The standard errors will always be largest on the most extreme scores, and it is likely that adding reviewers reaches a point of diminishing returns. However, where this threshold lies remains an empirical question, and one that likely has to be balanced with the financial costs associated with adding

additional reviewers. This research should also examine how precision changes when adding raters who do or do not fit the model expectations. It would also be interesting to investigate the effect of different training procedures on the consistency of rater responses. While online webinars are a cost-effective and convenient method of delivering trainings, some may reasonably argue that webinar trainings lack the depth of in-person trainings. To evaluate the effect of training procedures, one could conduct an alignment review with raters randomly assigned to each training condition, and the consistency of raters could be evaluated by comparing the mean square outfit values of raters within each group. In our study, all participants were trained via a webinar, and all raters appeared to meet the Rasch model expectations quite well. However, the training component was not of primary interest in this study, and more research would be needed to address it specifically. It is also worth noting that if in-person trainings were to be conducted, then the geographical distribution of raters would, again, be of concern given the substantial financial cost associated with convening experts from distant locations. Evidence of item-to-standards alignment is a critical part of establishing the validity of large-scale tests used in standards-based educational contexts. If inferences of students' ability from tests are made relative to the standards the test items are intended to measure, then the validity of the inference hinges on adequate alignment. To date, few advanced statistical analyses have been applied to alignment data, perhaps due in part to the non-independence of observations. Yet, given the high-stakes nature of many tests, and the importance of alignment, it is critical that researchers continue to address the best methods of determining alignment, and not just which dimensions of alignment items should be scored. That is, are average ratings sufficient? Do panel methods, with experts discussing and debating item alignment, lead to more

accurate representations of alignment than independent reviews? And how do panel ratings compare (directly) to independent reviews that have been adjusted for rater severities? It is important that these questions are addressed with future research. Little consensus currently exists on the most appropriate mode of arriving at an alignment score. Many of the practical problems of previous alignment research could be addressed through the application of latent trait methods with independent online reviews: rater effects could be documented and controlled, reviewers could participate from diverse geographical areas, and a large pool of items could be efficiently reviewed. Yet, the method proposed here is not without its challenges and limitations, and future research should continue to address methods for arriving at alignment ratings.

References

Anderson, D., Irvin, P. S., Patarapichayatham, C., Alonzo, J., & Tindal, G. (2012). The development and scaling of the easycbm CCSS middle school mathematics measures (Technical Report No. 1207). Eugene, OR: Behavioral Research and Teaching, University of Oregon.

Anderson, D., Lai, C.-F., Alonzo, J., & Tindal, G. (2011). Examining a grade-level math CBM designed for persistently low-performing students. Educational Assessment, 16.

Andrich, D. A. (1978). A rating formulation for ordered response categories. Psychometrika, 43.

Asch, S. E. (1956). Studies of independence and conformity: A minority of one against a unanimous majority. Psychological Monographs: General and Applied, 70(9).

Behavioral Research and Teaching. (2013). Distributed Item Review. Retrieved March 27, 2013.

Bond, R. (2005). Group size and conformity. Group Processes & Intergroup Relations, 8(4).

Brown, R. (1988). Group processes: Dynamics within and between groups. Cambridge, MA: Basil Blackwell.

Case, B. J., Jorgensen, M. A., & Zucker, S. (2004). Alignment in educational assessment. Pearson.

Council of Chief State School Officers (CCSSO). (2002). Models for alignment analysis and assistance to states. Washington, DC: Author.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Florence, KY: Psychology Press.

Engelhard, G. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31.

Kocher, M. G., & Sutter, M. (2005). The decision maker matters: Individual versus group behaviour in experimental beauty contest games. The Economic Journal, 115.

La Marca, P. M. (2001). Alignment of standards and assessments as an accountability criterion. Practical Assessment, Research & Evaluation, 7(21). Retrieved October 17, 2012.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.

Linacre, J. M. (2000). Comparing "partial credit" and "rating scale" models. Rasch Measurement Transactions. Retrieved July 5, 2012.

Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16, 878.

Linacre, J. M. (2012). Facets computer program for many-facet Rasch measurement (Version 3.70). Beaverton, OR: Winsteps.com.

Martone, A., & Sireci, S. G. (2009). Evaluating alignment between curriculum, assessment, and instruction. Review of Educational Research, 79.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2).

Myford, C. M., & Wolfe, E. W. (2000). Strengthening the ties that bind: Improving the linking network in sparsely connected rating designs (TOEFL Technical Report No. 15). Princeton, NJ: Educational Testing Service.

No Child Left Behind Act, Pub. L. No. 107-110 (2001).

Porter, A. C. (2002). Measuring the content of instruction: Uses in research and practice. Educational Researcher, 31.

Porter, A. C. (2004). Curriculum assessment. In J. Green, G. Camilli, & P. Elmore (Eds.), Complementary methods for research in education. Washington, DC: American Educational Research Association.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.

Rothman, R., Slattery, J. B., Vranek, J. L., & Resnick, L. B. (2002). Benchmarking and alignment of standards and testing (Technical Report 566). Washington, DC: Center for the Study of Evaluation.

Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88.

Salzberger, T. (2010). Does the Rasch model convert an ordinal scale into an interval scale? Rasch Measurement Transactions, 24.

Thompson, N. A., & Weiss, D. J. (2011). A framework for the development of computerized adaptive tests. Practical Assessment, Research & Evaluation, 16(1), 1-9.

Webb, N. L. (1999). Alignment of science and mathematics standards and assessments in four states (Research Monograph No. 18). Madison, WI: University of Wisconsin-Madison, National Institute for Science Education.

Webb, N. L. (2007). Issues related to judging the alignment of curriculum standards and assessments. Applied Measurement in Education, 20.

Webb, N. L., Alt, M., Ely, R., & Vesperman, B. (2005). Web Alignment Tool (WAT): Training manual (Version 1.1). Council of Chief State School Officers; Wisconsin Center for Education Research.

Wolfe, E. W. (2004). Identifying rater effects using latent trait models. Psychology Science, 46.

Wright, B. D. (1982). Dichotomous Rasch model derived from counting right answers: Raw scores as sufficient statistics. Rasch Measurement Transactions, 3, 62.

Table 1

Rater Sampling Plan

Rows: raters a through o (one row per rater). Columns: Grade 6 (Sets 1-5), Grade 7 (Sets 1-5), and Grade 8 (Sets 1-5), with a mark in each of the three review sets assigned to that rater.

Note. Each letter, a-o, represents an individual reviewer. Each reviewer is further represented by one and only one row. The table displays the overlap of raters across items so that each item was rated by at least 3 reviewers, while the raters themselves linked across item sets, allowing for the calibration of raters and items on a common scale. Each set contained 90 items.

Table 2

Random Sample of 15 Items: Item-Standard Alignment Results

Columns: Item, Grade, Obs. Avg., Adj. Avg., Endorsability, S.E., and fit statistics (Infit, Infit Z, Outfit, Outfit Z).

Note. Endorsability = MFRM-scaled alignment rating, reported on the logit scale. Lower values indicate easier-to-endorse items, and thus items with a higher degree of alignment. Adj. Avg. = endorsability statistic rescaled to the original rating scale, adjusted for rater severity.

Table 3

Summary of Rater Effects

Columns: Rater, Count, Avg. Rtg., Severity, S.E., and fit statistics (Infit, Infit Z, Outfit, Outfit Z).

Note. Count = number of items judged by the rater. The severity statistic is reported on the logit scale, with higher values indicating a more severe rater.

Figure 1. Online web-based tool used for gauging item alignment. The Distributed Item Review (DIR) is a secure web-based system for presenting test items and test forms to experts across a broad geographic region for reviews of bias, sensitivity, and alignment to standards. In the current study, the DIR was used as the online framework for gauging item alignment to the Common Core State Standards.


More information

Description of components in tailored testing

Description of components in tailored testing Behavior Research Methods & Instrumentation 1977. Vol. 9 (2).153-157 Description of components in tailored testing WAYNE M. PATIENCE University ofmissouri, Columbia, Missouri 65201 The major purpose of

More information

Construct Invariance of the Survey of Knowledge of Internet Risk and Internet Behavior Knowledge Scale

Construct Invariance of the Survey of Knowledge of Internet Risk and Internet Behavior Knowledge Scale University of Connecticut DigitalCommons@UConn NERA Conference Proceedings 2010 Northeastern Educational Research Association (NERA) Annual Conference Fall 10-20-2010 Construct Invariance of the Survey

More information

MEASURING AFFECTIVE RESPONSES TO CONFECTIONARIES USING PAIRED COMPARISONS

MEASURING AFFECTIVE RESPONSES TO CONFECTIONARIES USING PAIRED COMPARISONS MEASURING AFFECTIVE RESPONSES TO CONFECTIONARIES USING PAIRED COMPARISONS Farzilnizam AHMAD a, Raymond HOLT a and Brian HENSON a a Institute Design, Robotic & Optimizations (IDRO), School of Mechanical

More information

The Functional Outcome Questionnaire- Aphasia (FOQ-A) is a conceptually-driven

The Functional Outcome Questionnaire- Aphasia (FOQ-A) is a conceptually-driven Introduction The Functional Outcome Questionnaire- Aphasia (FOQ-A) is a conceptually-driven outcome measure that was developed to address the growing need for an ecologically valid functional communication

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Does factor indeterminacy matter in multi-dimensional item response theory?

Does factor indeterminacy matter in multi-dimensional item response theory? ABSTRACT Paper 957-2017 Does factor indeterminacy matter in multi-dimensional item response theory? Chong Ho Yu, Ph.D., Azusa Pacific University This paper aims to illustrate proper applications of multi-dimensional

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations May 215 Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items Tamara Beth

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

RESEARCH ARTICLES. Brian E. Clauser, Polina Harik, and Melissa J. Margolis National Board of Medical Examiners

RESEARCH ARTICLES. Brian E. Clauser, Polina Harik, and Melissa J. Margolis National Board of Medical Examiners APPLIED MEASUREMENT IN EDUCATION, 22: 1 21, 2009 Copyright Taylor & Francis Group, LLC ISSN: 0895-7347 print / 1532-4818 online DOI: 10.1080/08957340802558318 HAME 0895-7347 1532-4818 Applied Measurement

More information

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2 MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and Lord Equating Methods 1,2 Lisa A. Keller, Ronald K. Hambleton, Pauline Parker, Jenna Copella University of Massachusetts

More information

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p ) Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement

More information

ISC- GRADE XI HUMANITIES ( ) PSYCHOLOGY. Chapter 2- Methods of Psychology

ISC- GRADE XI HUMANITIES ( ) PSYCHOLOGY. Chapter 2- Methods of Psychology ISC- GRADE XI HUMANITIES (2018-19) PSYCHOLOGY Chapter 2- Methods of Psychology OUTLINE OF THE CHAPTER (i) Scientific Methods in Psychology -observation, case study, surveys, psychological tests, experimentation

More information

Measuring change in training programs: An empirical illustration

Measuring change in training programs: An empirical illustration Psychology Science Quarterly, Volume 50, 2008 (3), pp. 433-447 Measuring change in training programs: An empirical illustration RENATO MICELI 1, MICHELE SETTANNI 1 & GIULIO VIDOTTO 2 Abstract The implementation

More information

Alignment of Mathematics State-level Standards and Assessments: The Role of Reviewer Agreement. CSE Report 685

Alignment of Mathematics State-level Standards and Assessments: The Role of Reviewer Agreement. CSE Report 685 Alignment of Mathematics State-level Standards and Assessments: The Role of Reviewer Agreement CSE Report 685 Noreen Webb and Joan Herman University of California Norman Webb University of Wisconsin, Madison

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

Lessons in biostatistics

Lessons in biostatistics Lessons in biostatistics The test of independence Mary L. McHugh Department of Nursing, School of Health and Human Services, National University, Aero Court, San Diego, California, USA Corresponding author:

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

Use of Structure Mapping Theory for Complex Systems

Use of Structure Mapping Theory for Complex Systems Gentner, D., & Schumacher, R. M. (1986). Use of structure-mapping theory for complex systems. Presented at the Panel on Mental Models and Complex Systems, IEEE International Conference on Systems, Man

More information

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses Journal of Modern Applied Statistical Methods Copyright 2005 JMASM, Inc. May, 2005, Vol. 4, No.1, 275-282 1538 9472/05/$95.00 Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement

More information

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy Industrial and Organizational Psychology, 3 (2010), 489 493. Copyright 2010 Society for Industrial and Organizational Psychology. 1754-9426/10 Issues That Should Not Be Overlooked in the Dominance Versus

More information

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 4-2014 Modeling DIF with the Rasch Model: The Unfortunate Combination

More information

Construct Validity of Mathematics Test Items Using the Rasch Model

Construct Validity of Mathematics Test Items Using the Rasch Model Construct Validity of Mathematics Test Items Using the Rasch Model ALIYU, R.TAIWO Department of Guidance and Counselling (Measurement and Evaluation Units) Faculty of Education, Delta State University,

More information

Combining Dual Scaling with Semi-Structured Interviews to Interpret Rating Differences

Combining Dual Scaling with Semi-Structured Interviews to Interpret Rating Differences A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

HPS301 Exam Notes- Contents

HPS301 Exam Notes- Contents HPS301 Exam Notes- Contents Week 1 Research Design: What characterises different approaches 1 Experimental Design 1 Key Features 1 Criteria for establishing causality 2 Validity Internal Validity 2 Threats

More information

Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis

Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis Project RESET, Boise State University Evelyn Johnson, Angie Crawford, Yuzhu Zheng, & Laura Moylan NCME Annual

More information

Author s response to reviews

Author s response to reviews Author s response to reviews Title: The validity of a professional competence tool for physiotherapy students in simulationbased clinical education: a Rasch analysis Authors: Belinda Judd (belinda.judd@sydney.edu.au)

More information

Critical Thinking Assessment at MCC. How are we doing?

Critical Thinking Assessment at MCC. How are we doing? Critical Thinking Assessment at MCC How are we doing? Prepared by Maura McCool, M.S. Office of Research, Evaluation and Assessment Metropolitan Community Colleges Fall 2003 1 General Education Assessment

More information

Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education.

Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education. The Reliability of PLATO Running Head: THE RELIABILTY OF PLATO Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO M. Ken Cor Stanford University School of Education April,

More information

2 Types of psychological tests and their validity, precision and standards

2 Types of psychological tests and their validity, precision and standards 2 Types of psychological tests and their validity, precision and standards Tests are usually classified in objective or projective, according to Pasquali (2008). In case of projective tests, a person is

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

Evaluating the quality of analytic ratings with Mokken scaling

Evaluating the quality of analytic ratings with Mokken scaling Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch

More information

Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English

Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English Yoshihito SUGITA Yamanashi Prefectural University Abstract This article examines the main data of

More information

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Mantel-Haenszel Procedures for Detecting Differential Item Functioning A Comparison of Logistic Regression and Mantel-Haenszel Procedures for Detecting Differential Item Functioning H. Jane Rogers, Teachers College, Columbia University Hariharan Swaminathan, University of

More information

Diagnostic Opportunities Using Rasch Measurement in the Context of a Misconceptions-Based Physical Science Assessment

Diagnostic Opportunities Using Rasch Measurement in the Context of a Misconceptions-Based Physical Science Assessment Science Education Diagnostic Opportunities Using Rasch Measurement in the Context of a Misconceptions-Based Physical Science Assessment STEFANIE A. WIND, 1 JESSICA D. GALE 2 1 The University of Alabama,

More information

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT Evaluating and Restructuring Science Assessments: An Example Measuring Student s Conceptual Understanding of Heat Kelly D. Bradley, Jessica D. Cunningham

More information

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative

More information

Context of Best Subset Regression

Context of Best Subset Regression Estimation of the Squared Cross-Validity Coefficient in the Context of Best Subset Regression Eugene Kennedy South Carolina Department of Education A monte carlo study was conducted to examine the performance

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

Students' perceived understanding and competency in probability concepts in an e- learning environment: An Australian experience

Students' perceived understanding and competency in probability concepts in an e- learning environment: An Australian experience University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part A Faculty of Engineering and Information Sciences 2016 Students' perceived understanding and competency

More information

Having your cake and eating it too: multiple dimensions and a composite

Having your cake and eating it too: multiple dimensions and a composite Having your cake and eating it too: multiple dimensions and a composite Perman Gochyyev and Mark Wilson UC Berkeley BEAR Seminar October, 2018 outline Motivating example Different modeling approaches Composite

More information

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests

The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests Mary E. Lunz and Betty A. Bergstrom, American Society of Clinical Pathologists Benjamin D. Wright, University

More information

Experimental Psychology

Experimental Psychology Title Experimental Psychology Type Individual Document Map Authors Aristea Theodoropoulos, Patricia Sikorski Subject Social Studies Course None Selected Grade(s) 11, 12 Location Roxbury High School Curriculum

More information

Establishing Interrater Agreement for Scoring Criterion-based Assessments: Application for the MEES

Establishing Interrater Agreement for Scoring Criterion-based Assessments: Application for the MEES Establishing Interrater Agreement for Scoring Criterion-based Assessments: Application for the MEES Mark C. Hogrebe Washington University in St. Louis MACTE Fall 2018 Conference October 23, 2018 Camden

More information

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH

PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH PERCEIVED TRUSTWORTHINESS OF KNOWLEDGE SOURCES: THE MODERATING IMPACT OF RELATIONSHIP LENGTH DANIEL Z. LEVIN Management and Global Business Dept. Rutgers Business School Newark and New Brunswick Rutgers

More information

Throwing the Baby Out With the Bathwater? The What Works Clearinghouse Criteria for Group Equivalence* A NIFDI White Paper.

Throwing the Baby Out With the Bathwater? The What Works Clearinghouse Criteria for Group Equivalence* A NIFDI White Paper. Throwing the Baby Out With the Bathwater? The What Works Clearinghouse Criteria for Group Equivalence* A NIFDI White Paper December 13, 2013 Jean Stockard Professor Emerita, University of Oregon Director

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

A Broad-Range Tailored Test of Verbal Ability

A Broad-Range Tailored Test of Verbal Ability A Broad-Range Tailored Test of Verbal Ability Frederic M. Lord Educational Testing Service Two parallel forms of a broad-range tailored test of verbal ability have been built. The test is appropriate from

More information

Evaluating the Consistency of Test Content across Two Successive Administrations of a State- Mandated Science and Technology Assessment 1,2

Evaluating the Consistency of Test Content across Two Successive Administrations of a State- Mandated Science and Technology Assessment 1,2 Evaluating the Consistency of Test Content across Two Successive Administrations of a State- Mandated Science and Technology Assessment 1,2 Timothy O Neil 3 and Stephen G. Sireci University of Massachusetts

More information

A framework for predicting item difficulty in reading tests

A framework for predicting item difficulty in reading tests Australian Council for Educational Research ACEReSearch OECD Programme for International Student Assessment (PISA) National and International Surveys 4-2012 A framework for predicting item difficulty in

More information

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, The Western Aphasia Battery (WAB) (Kertesz, 1982) is used to classify aphasia by classical type, measure overall severity, and measure change over time. Despite its near-ubiquitousness, it has significant

More information

CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY. This chapter addresses the research design and describes the research methodology

CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY. This chapter addresses the research design and describes the research methodology CHAPTER 7 RESEARCH DESIGN AND METHODOLOGY 7.1 Introduction This chapter addresses the research design and describes the research methodology employed in this study. The sample and sampling procedure is

More information

PÄIVI KARHU THE THEORY OF MEASUREMENT

PÄIVI KARHU THE THEORY OF MEASUREMENT PÄIVI KARHU THE THEORY OF MEASUREMENT AGENDA 1. Quality of Measurement a) Validity Definition and Types of validity Assessment of validity Threats of Validity b) Reliability True Score Theory Definition

More information

Assignment Research Article (in Progress)

Assignment Research Article (in Progress) Political Science 225 UNIVERSITY OF CALIFORNIA, SAN DIEGO Assignment Research Article (in Progress) Philip G. Roeder Your major writing assignment for the quarter is to prepare a twelve-page research article

More information

Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College

Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College Background of problem in assessment for elderly Key feature of CCAS Structural Framework of CCAS Methodology Result

More information

Regression Discontinuity Analysis

Regression Discontinuity Analysis Regression Discontinuity Analysis A researcher wants to determine whether tutoring underachieving middle school students improves their math grades. Another wonders whether providing financial aid to low-income

More information

On the usefulness of the CEFR in the investigation of test versions content equivalence HULEŠOVÁ, MARTINA

On the usefulness of the CEFR in the investigation of test versions content equivalence HULEŠOVÁ, MARTINA On the usefulness of the CEFR in the investigation of test versions content equivalence HULEŠOVÁ, MARTINA MASARY K UNIVERSITY, CZECH REPUBLIC Overview Background and research aims Focus on RQ2 Introduction

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings

A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings Psychological Test and Assessment Modeling, Volume 60, 2018 (1), 33-52 A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings George Engelhard,

More information

so that a respondent may choose one of the categories to express a judgment about some characteristic of an object or of human behavior.

so that a respondent may choose one of the categories to express a judgment about some characteristic of an object or of human behavior. Effects of Verbally Labeled Anchor Points on the Distributional Parameters of Rating Measures Grace French-Lazovik and Curtis L. Gibson University of Pittsburgh The hypothesis was examined that the negative

More information

Item Analysis Explanation

Item Analysis Explanation Item Analysis Explanation The item difficulty is the percentage of candidates who answered the question correctly. The recommended range for item difficulty set forth by CASTLE Worldwide, Inc., is between

More information

Measurement Issues in Concussion Testing

Measurement Issues in Concussion Testing EVIDENCE-BASED MEDICINE Michael G. Dolan, MA, ATC, CSCS, Column Editor Measurement Issues in Concussion Testing Brian G. Ragan, PhD, ATC University of Northern Iowa Minsoo Kang, PhD Middle Tennessee State

More information

CHAPTER 3 RESEARCH METHODOLOGY

CHAPTER 3 RESEARCH METHODOLOGY CHAPTER 3 RESEARCH METHODOLOGY 3.1 Introduction 3.1 Methodology 3.1.1 Research Design 3.1. Research Framework Design 3.1.3 Research Instrument 3.1.4 Validity of Questionnaire 3.1.5 Statistical Measurement

More information

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus.

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. Department of Political Science UNIVERSITY OF CALIFORNIA, SAN DIEGO Philip G. Roeder Research Prospectus Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. A

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information