RESEARCH REPORT

INVESTIGATING ASSESSOR EFFECTS IN NATIONAL BOARD FOR PROFESSIONAL TEACHING STANDARDS ASSESSMENTS FOR EARLY CHILDHOOD/GENERALIST AND MIDDLE CHILDHOOD/GENERALIST CERTIFICATION

George Engelhard, Jr.
Carol M. Myford
Fred Cline

Princeton, New Jersey
September 2000

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:

Research Publications Office
Mail Stop 07-R
Educational Testing Service
Princeton, NJ 08541

INVESTIGATING ASSESSOR EFFECTS IN NATIONAL BOARD FOR PROFESSIONAL TEACHING STANDARDS ASSESSMENTS FOR EARLY CHILDHOOD/GENERALIST AND MIDDLE CHILDHOOD/GENERALIST CERTIFICATION

George Engelhard, Jr., Emory University
Carol M. Myford and Fred Cline, Educational Testing Service

Abstract

The purposes of this study were to (1) examine, describe, and evaluate the rating behavior of assessors scoring National Board for Professional Teaching Standards Early Childhood/Generalist or Middle Childhood/Generalist candidates, and (2) explore the effects of employing scoring designs using one assessor per exercise rather than two. Data from the Early Childhood/Generalist and Middle Childhood/Generalist assessments were analyzed using FACETS (Linacre, 1989) and SAS. While assessors differed somewhat in severity, most used the 12-point rating scale consistently. Residual severity effects that persisted after rater training tended to cancel out when ratings were averaged across the ten exercises to produce a scaled score for a candidate. The results of the decision consistency analyses suggest that having one assessor score each exercise would not result in a large increase in candidates incorrectly denied or awarded certification. However, the decision consistency analyses by exercise reveal the low reliability of the exercise banking decisions. Variation in assessor severity can affect the fairness of these decisions.

Key words: teacher assessment, portfolio assessment, assessment centers, performance assessment, Item Response Theory, Rasch measurement, FACETS, rater effects, rater monitoring, quality control

Acknowledgments

We would like to acknowledge the helpful advice of Mike Linacre regarding the use of the FACETS computer program to analyze these data. The material contained herein is based on work supported by the National Board for Professional Teaching Standards. Any opinions, findings, conclusions, and recommendations expressed herein are those of the authors and do not necessarily reflect the views of the National Board for Professional Teaching Standards, Emory University, or the Educational Testing Service.

Table of Contents

Introduction
A Many-Faceted Rasch Model for the Assessment of Accomplished Teaching
Method
    Participants
    NBPTS Assessment Process
    NBPTS Scoring System
    Procedure
Results
    I. FACETS Analyses of the Early Childhood/Generalist and Middle Childhood/Generalist Assessment Systems
        Variable Map
        Rating Scale
        Candidates
        Exercises
        Assessors
    II. Comparisons of Scoring Designs
    III. Summary of Results in Terms of the Research Questions
Discussion
References

Introduction

The National Board for Professional Teaching Standards (NBPTS) is committed to the development of a psychometrically sound assessment system designed to measure fairly and objectively high levels of teaching accomplishment in a variety of content areas (National Board for Professional Teaching Standards, 1999a). Many of the early psychometric analyses of the NBPTS assessments were conducted using traditional methods based on classical test theory and its extensions, such as Generalizability Theory (National Board for Professional Teaching Standards, 1999b). As the number of candidates for NBPTS certification has increased, it is now possible to use modern psychometric methods based on Item Response Theory (IRT) (Hambleton & Swaminathan, 1985) to evaluate the psychometric quality of the NBPTS assessments. The use of IRT to measure teaching has been discussed several times in the literature (Mislevy, 1987; Swaminathan, 1986); but as far as we know, this study reflects the first application of IRT to a large-scale performance assessment of teachers within the context of certification testing. In essence, we are modeling how trained assessors implement an approach to evaluating teaching.

This study focuses on the use of one class of IRT models based on the measurement theory of Rasch (1980) and extended by Wright and his colleagues (Wright & Stone, 1979; Wright & Masters, 1982). The extension of the Rasch model for analyses of assessor-mediated ratings is called the many-faceted Rasch model (FACETS model; Linacre, 1989). The FACETS model has been used to examine the psychometric quality of a variety of performance assessments based on assessor-mediated ratings (e.g., Engelhard, 1992, 1994, 1996; Heller, Sheingold, & Myford, 1998; Linacre, Engelhard, Tatum, & Myford, 1994; Lunz & Stahl, 1990; Lunz, Wright, & Linacre, 1990; Myford, Marr, & Linacre, 1996; Myford & Mislevy, 1995; Myford & Wolfe, 2000a; Paulukonis, Myford, & Heller, 2000; Wolfe, Chiu, & Myford, 1999). It should be stressed that the FACETS model provides additional information that supplements, rather than supplants, the inferences provided by more traditional methods that have been used previously to analyze the NBPTS assessments. The FACETS analyses presented in this study provide a different lens for exploring the quality of assessor-mediated ratings obtained within the context of the NBPTS assessment system.

There are several key ideas that undergird the analyses reported in this study. It is important to recognize that high levels of teaching accomplishment cannot be viewed directly; teaching accomplishment is a latent variable or construct that must be inferred from observed performances on selected assessment exercises. In essence, assessors observe a variety of performances that reflect a candidate's responses to different exercises (or tasks), and then the assessors make judgments based on these teacher performances to infer a level of teaching accomplishment. The conceptual framework is based on what Messick (1994) has called a construct-driven rather than a task-driven model. Performance assessments developed within this conceptual framework require the careful elicitation, analysis, and evaluation (both formative and summative) of assessor judgments in order to obtain fair inferences regarding a candidate's level of teaching accomplishment.

The conceptual model for the assessment of accomplished teaching used to guide this study is presented in Figure 1. This conceptual model hypothesizes that the observed ratings (i.e., the ratings the assessors give candidates on the exercises) are a function of a variety of factors. Ideally, the major determinant of the observed rating should be the construct or latent variable being measured: accomplished teaching. However, there are a number of intervening variables in the model that potentially could adversely affect the quality of those ratings by introducing unwanted sources of construct-irrelevant variance into the assessment process.

For a given teaching field, a set of standards of accomplished teaching is prepared that lays out the knowledge, skills, dispositions, and commitments defining the construct for that field. These standards reflect a professional consensus regarding the critical aspects of practice that characterize accomplished teachers in the field (see, for example, National Board for Professional Teaching Standards, 1998c, 1998d). A set of exercises and scoring rubrics for each exercise are then devised that are explicitly linked to the standards. The standards, the exercises, and their accompanying rubrics provide the operational definition of the accomplished teaching construct. There are several aspects of the assessment design process that, if not carried out appropriately, could introduce construct-irrelevant variance into NBPTS assessments: (1) the standards may not adequately lay out the knowledge, skills, dispositions, and commitments that define the accomplished teaching construct for a given field, (2) some standards may have been included that are not a part of the construct, (3) the linkage of the standards to the exercises may be loose, or (4) the linkage of the standards to the rubrics may be loose. In building a validity argument for the NBPTS assessments based on the new Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999), content validity evidence has been gathered to demonstrate that the processes of defining the accomplished teaching construct through standards and of linking the standards to the exercises and scoring rubrics were carried out in a credible, reproducible fashion (see, for example, Benson & Impara, 1996a, 1996b).

Candidates who wish to apply for NBPTS certification must clearly understand the standards for their field and respond to each exercise in an informed manner. The NBPTS provides candidates with specific approaches for studying the standards, detailed instructions for completing each exercise, and suggestions for reviewing the written work the candidates must submit. However, it is possible that some candidates may not achieve a sufficient level of understanding regarding the assessment process and the types of evidence the assessment requires the candidate to gather. If any of the exercises have been designed such that candidates do not clearly understand what is expected of them, then this could introduce construct-irrelevant variance into the assessment process.

Assessors, who are practicing teachers in the field, carefully study the standards for that field and then receive intensive training to become familiar with the demands of the particular exercise they are to score. As part of their assessor training, they learn to apply the rubric to score candidates' performances on that exercise. The assessors practice scoring pre-selected performances until they develop a shared understanding of how to apply and interpret the rubric.

Figure 1. Conceptual Model for the Assessment of Accomplished Teaching. [The figure shows the candidate's level of accomplished teaching producing the observed ratings through three intervening variables: assessor severity, exercise difficulty (I. Portfolio, II. Documented Accomplishments, III. Assessment Center), and the structure of the rating scale.]

If there were assessors who after training rated significantly more leniently (or, conversely, more harshly) than other assessors, that could introduce construct-irrelevant variance into the ratings. In this situation, whether a candidate obtained NBPTS certification could be heavily dependent upon the particular set of assessors who, by luck of the draw, happened to rate the candidate. Similarly, if there were assessors who, even after intensive training, were unable to use a scoring rubric consistently when evaluating candidates' performances, or assessors who allowed personal biases to cloud their judgments, such assessors would introduce additional construct-irrelevant variance into the assessment process. The quality of the ratings they provide would be highly suspect. In effect, the standards, the exercises, the rubrics, and the assessors function as intervening variables in the conceptual model, providing the lens and focus for viewing the accomplished teaching construct and obtaining the ratings.

We employed a psychometric model that enabled us to operationalize key aspects of the conceptual model. Our psychometric model incorporates some (but not all) of the possible intervening variables. By reviewing output from our analyses, we are able to monitor the extent to which the intervening variables (or facets, as we refer to them) included in our psychometric model are introducing unwanted construct-irrelevant sources of variance into the assessment process. These facets, if left unattended, have the potential to threaten the quality of the ratings (and, ultimately, the validity of the assessment enterprise).

Our psychometric model includes exercises as one facet in our analyses. Each separate exercise is an element of the exercise facet. From our analyses, we obtain an estimate of the difficulty of each exercise (i.e., how hard it is for a candidate to receive a high rating on the exercise).¹ We also obtain an estimate of the extent to which the exercise fits with the other exercises included in the model. That is, whether the exercises work together to define a single unidimensional construct, or whether there is evidence that the exercises are measuring multiple, independent dimensions of a construct (or, perhaps, multiple constructs). The fit statistics for an exercise signal the degree of correspondence between ratings of candidates' performances on that particular exercise when compared to ratings of candidates' performances on the other exercises. They provide an assessment of whether ratings on the exercises can be meaningfully combined to produce a single composite score, or whether there may be a need for separate scores to be reported rather than a single composite score.

A second facet included in our psychometric model is assessors. From our analyses, we obtain an estimate of the level of severity of each assessor when evaluating candidates' performances. This information can be used to judge the degree to which assessors are functioning interchangeably. It is also possible to obtain an estimate of the consistency with which an assessor applies a scoring rubric. Assessor fit statistics provide estimates of the degree to which an assessor is internally consistent when using the rubric to evaluate multiple candidate performances.

A third facet in our psychometric model is candidates. The output from our analyses provides an estimate of each candidate's level of accomplished teaching. In producing these estimates, the computer program adjusts the candidate's score for the level of severity that the particular assessors scoring that candidate's performance exercised. In effect, the program washes out these unwanted sources of construct-irrelevant variance. The resulting score reflects what the candidate would have received if assessors of average severity had rated the candidate. The program also produces candidate fit statistics that are indices of the degree of consistency shown in the evaluation of the candidate's level of accomplished teaching across exercises and across assessors. Through fit analyses, candidates who exhibit unusual profiles of ratings across exercises (i.e., candidates who appear to do well on some exercises but poorly on others) can be identified. Flagging the scores of these misfitting candidates allows for an important quality control check before score reports are issued: an independent review of each misfitting candidate's performance across exercises. One can determine whether the particular ratings that the computer program has identified as surprising or unexpected for that candidate are perhaps due to random (or systematic) assessor error and ought to be changed. Alternatively, through fit analysis, one might determine that, indeed, the candidate performed differentially across exercises, and the ratings should be left to stand as is.

¹ One approach to construct validation would be to define the accomplished teaching construct through a series of exercises ordered by difficulty. In essence, the assessment developers would construct a set of exercises to measure accomplished teaching. The exercise difficulty continuum would logically define the underlying construct. Exercises at the upper end of the continuum would be designed to be harder for candidates to get high ratings on than exercises at the lower end of the continuum. If exercises could be devised so that they sufficiently spread out along a difficulty continuum, that would allow for placement of candidates along the variable defined by the hierarchically ordered exercises. Candidates could then be ordered along the continuum by their levels of accomplished teaching. The joint ordering of exercises and candidates would provide a conceptual basis for explaining what it means to be more or less highly accomplished in the field. Specific statements could be prepared describing candidate performance at various levels along the continuum (i.e., what it is that a candidate at each level has demonstrated that he/she knows and can do). Establishing the construct validity of the assessment involves comparing the intended (i.e., theoretical) ordering of the exercises to the actual empirical ordering of the exercises. The critical construct validation question becomes this: to what extent does the assessment developers' initial theoretical understanding of the construct (i.e., their view of how the exercises should order hierarchically) map to the empirical evidence obtained when candidates actually take the assessment? This was not the validation approach adopted by those designing the NBPTS assessments, however. Rather, their approach was to use the standards for the certification field as the basis for exercise development (see the National Board for Professional Teaching Standards Early Childhood/Generalist Standards, 1998c, and Middle Childhood/Generalist Standards, 1998d). The standards describe what teachers applying for that particular certificate should know and be able to do in order to be certified as an accomplished teacher in their field. The exercises and their accompanying scoring rubrics were designed to reflect the standards for the certificate area. Each exercise provides evidence for one or more standards. No one exercise tries to provide evidence for all the standards. There was no intention to try to devise exercises that would differ in difficulty and could thus be hierarchically ordered. Nor was it the assessment developers' intention to design exercises so that they would all be of the same level of difficulty. In short, exercise difficulty was not a consideration in their design process. The validation approach was to establish content validity, not construct validity. The validation strategy involved having expert panelists make judgments about the following: (1) whether the content standards represented critical aspects of accomplished practice in the field, (2) whether the portfolio exercises and assessment center exercises were relevant and important to one or more content standards, (3) whether the skills, knowledges, and abilities necessary to carry out the exercises were relevant and important to the domain of accomplished teaching in that field, (4) whether the scoring rubrics were relevant and important to the content standards, (5) whether the scoring rubrics emphasized factors that were relevant, important, and necessary to the domain, and (6) whether the exercises considered as a set represented the skills and knowledge expected of an accomplished teacher in the field (Benson & Impara, 1996a, 1996b). For more information about the assessment development process, the interested reader is referred to the National Board for Professional Teaching Standards Technical Analysis Report (1999b).

The psychometric model also takes into consideration the structure of the scoring rubrics used in the assessment of accomplished teaching. These analyses provide useful information that enables the determination of whether or not the scoring rubrics are functioning as intended. For example, by examining the rating scale category calibrations that the computer program produces, it is possible to determine whether the categories in a given rubric are appropriately ordered and are clearly distinguishable.

The psychometric model provides useful information about two key intervening variables in our conceptual model (exercises and assessors), the candidates, and the scoring rubric. However, the psychometric model does not provide any information about the adequacy of the initial conception of the construct (i.e., how well the standards define the accomplished teaching construct in a given field; how well the knowledge, skills, dispositions, and commitments of the field map to the standards). Nor does the psychometric model address whether the exercises and scoring rubrics adequately reflect the standards. Our analyses do not provide any information about the strength of that critical linkage. Other approaches are needed to gather evidence to rule out the possibility that these critical aspects of assessment development are introducing construct-irrelevant sources of variance into the candidate evaluation process, threatening its validity.

This study focuses on issues related to the investigation of assessor effects in two NBPTS assessments: Early Childhood/Generalist and Middle Childhood/Generalist certification. Candidates who apply for Early Childhood/Generalist certification teach all subjects to students ages 3-8, while those who apply for Middle Childhood/Generalist certification teach all subjects to students ages 7-12. The major purposes of this study are to (1) examine, describe, and evaluate the rating behavior of individual assessors scoring Early Childhood/Generalist or Middle Childhood/Generalist candidates, and (2) explore the effects of employing scoring designs that use fewer than 20 ratings per candidate (10 exercises, each rated by two assessors). Specifically, this study addresses the following questions within each NBPTS assessment:

1. Do assessors differ in the severity with which they rate candidates' performances on the assessment exercises? Are differences in assessor severity more pronounced for some exercises than for others?

2. Do differences in assessor severity affect candidate scores on a given exercise?

3. Do differences in assessor severity affect the accuracy and/or consistency of the certification decision? How do differences in assessor severity impact decisions made regarding the banking of exercises? If the National Board for Professional Teaching Standards were to move to a scoring design that reduced the number of assessors to one per exercise, what would be the impact of that policy decision on certification rates?

4. Do assessors use the scoring rubrics consistently across candidates? Do they share a common understanding of the meaning of each scale point on a rubric? Are there any inconsistent assessors whose patterns of scores show little systematic relationship to the scores given to the same candidates by other assessors? (That is, are there assessors who are sometimes more lenient but at other times more severe than other assessors: assessors who are neither consistently lenient nor consistently severe but rather tend to vacillate erratically between these two general tendencies?)

5. Do some candidates exhibit unusual profiles of ratings across exercises, receiving unexpectedly high (or low) ratings on certain exercises, given the ratings the candidates received on the other exercises?

6. Is it harder for candidates to get high ratings on some exercises than others? To what extent do the exercises differ in difficulty? How well does the set of 10 exercises succeed in defining statistically distinct levels of accomplished teaching among the candidates?

7. Can ratings from all exercises be calibrated, or do ratings on some exercises frequently fail to correspond to ratings on other exercises? (That is, are there certain exercises that do not fit with the others?)

8. Can many-faceted Rasch measurement models be used to monitor the effectiveness of complex performance assessment systems like the National Board for Professional Teaching Standards assessment systems? What special challenges arise in analyzing rating data from NBPTS assessments using the FACETS computer program?

A Many-Faceted Rasch Model for the Assessment of Accomplished Teaching

The procedures described in this section for examining the quality of ratings obtained from assessors are based on a many-faceted version of the Rasch measurement (FACETS) model for ordered response categories developed by Linacre (1989). The FACETS model is an extended version of the Rasch measurement model (Andrich, 1988; Rasch, 1980; Wright & Masters, 1982). The FACETS model is essentially an additive linear model that is based on a logistic transformation of observed assessor ratings to a logit or log-odds scale. The logistic transformation of ratios of successive category probabilities (log odds) can be viewed as the dependent variable, with various facets, such as candidates, exercises, and assessors, conceptualized as independent variables that influence these log odds. In this study, the FACETS model takes the following form:

\ln(P_{nijk} / P_{nij(k-1)}) = \theta_n - \xi_i - \alpha_j - \tau_k \qquad (1)

where

P_{nijk} = the probability of candidate n being rated k on exercise i by assessor j,
P_{nij(k-1)} = the probability of candidate n being rated k - 1 on exercise i by assessor j,
\theta_n = the teaching accomplishment of candidate n,
\xi_i = the difficulty of exercise i,
\alpha_j = the severity of assessor j, and
\tau_k = the difficulty of category k relative to category k - 1.

The rating category coefficient, \tau_k, is not considered a facet in the model. Based on the FACETS model presented in Equation 1, the probability of candidate n with level of teaching accomplishment \theta_n obtaining a rating of k (k = 1, ..., m) on exercise i from assessor j is given as

P_{nijk} = \exp\Big[ k(\theta_n - \xi_i - \alpha_j) - \sum_{h=1}^{k} \tau_h \Big] \Big/ \gamma \qquad (2)

where \tau_1 is defined to be 0, and \gamma is a normalizing factor based on the sum of the numerators. As written in Equation 1, the category coefficients, \tau_k, represent a rating scale model (Andrich, 1978) with category coefficients fixed across exercises.

Once the parameters of the model are estimated using standard numerical methods, such as those implemented in the FACETS computer program (Linacre & Wright, 1992), model-data fit issues can be examined in a variety of ways. Useful indices of rating quality can be obtained by a detailed examination of the standardized residuals, calculated as

Z_{nij} = (x_{nij} - E_{nij}) \Big/ \Big[ \sum_{k=1}^{m} (k - E_{nij})^2 P_{nijk} \Big]^{1/2} \qquad (3)

where

E_{nij} = \sum_{k=1}^{m} k \, P_{nijk}. \qquad (4)

The standardized residuals, Z_{nij}, can be summarized over different facets and different elements within a facet in order to provide indices of model-data fit. These residuals are typically summarized as mean-square error statistics called OUTFIT and INFIT statistics. The OUTFIT statistics are unweighted mean-squared residual statistics that are particularly sensitive to outlying unexpected ratings. The INFIT statistics are based on weighted mean-squared residual statistics and are less sensitive to outlying unexpected ratings. Engelhard (1994) provides a description regarding the interpretation of these fit statistics within the context of assessor-mediated ratings.
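
Equations 2 through 4 are straightforward to evaluate once parameter estimates are available. The following Python sketch (ours, not part of the original report; FACETS itself performs the estimation) shows how the category probabilities, expected ratings, standardized residuals, and the INFIT/OUTFIT mean squares described above could be computed. The rubric and all parameter values are invented for illustration.

```python
import numpy as np

def category_probs(theta, xi, alpha, tau):
    """Eq. 2: probabilities of ratings k = 1..m for one candidate-exercise-assessor
    combination. tau[0] is the coefficient for category 1 and is fixed at 0."""
    k = np.arange(1, len(tau) + 1)
    log_num = k * (theta - xi - alpha) - np.cumsum(tau)
    num = np.exp(log_num - log_num.max())  # shift for numerical stability
    return num / num.sum()                 # division by gamma (sum of numerators)

def expected_and_residual(x, theta, xi, alpha, tau):
    """Eqs. 3-4: expected rating E_nij and standardized residual Z_nij
    for an observed rating x."""
    p = category_probs(theta, xi, alpha, tau)
    k = np.arange(1, len(tau) + 1)
    e = np.sum(k * p)                      # Eq. 4
    var = np.sum((k - e) ** 2 * p)         # model variance of the rating
    return e, (x - e) / np.sqrt(var)       # Eq. 3

# OUTFIT (unweighted) and INFIT (information-weighted) mean squares for one
# assessor, pooled over that assessor's ratings. All values here are invented.
tau = np.array([0.0, -0.5, 0.5])                   # hypothetical 3-category rubric
obs = [(3, 0.8, -0.2, 0.1), (2, -0.3, 0.4, 0.1)]   # (x, theta, xi, alpha) tuples
z2, sq_res, info = [], 0.0, 0.0
for x, th, xi, al in obs:
    e, z = expected_and_residual(x, th, xi, al, tau)
    p = category_probs(th, xi, al, tau)
    k = np.arange(1, len(tau) + 1)
    z2.append(z ** 2)
    sq_res += (x - e) ** 2
    info += np.sum((k - e) ** 2 * p)
outfit = float(np.mean(z2))   # sensitive to outlying unexpected ratings
infit = sq_res / info         # weighted; less sensitive to outliers
print(outfit, infit)
```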

To be useful, an assessment must be able to separate candidates by their performance (Stone & Wright, 1988). FACETS produces a candidate separation ratio, G_N, which is a measure of the spread of the candidate accomplished teaching measures relative to their precision. Separation is expressed as the ratio of the standard deviation of the candidate accomplished teaching measures, adjusted for measurement error, to the average candidate standard error (Equation 5):

G_N = \sqrt{SD^2 - \frac{1}{N}\sum_{n=1}^{N} SE_n^2} \; \Big/ \; \sqrt{\frac{1}{N}\sum_{n=1}^{N} SE_n^2} \qquad (5)

where SD^2 is the observed variance of the non-extreme candidate accomplished teaching measures, SE_n^2 is the squared standard error for candidate n, and N is the number of candidates.

Using the candidate separation ratio, one can then calculate the candidate separation index, which is the number of measurably different levels of candidate performance in the sample of candidates:

(4 \, G_N + 1) / 3 \qquad (6)

A candidate separation index of 2 would suggest that the assessment process is sensitive enough to be used to make certification decisions about candidate performance, since two statistically distinct candidate groups can be discerned (i.e., those who should be certified, those who should not). Similarly, FACETS produces an assessor separation ratio, which is a measure of the spread of the assessor severity measures relative to their precision. The assessor separation index, derived from that separation ratio, connotes the number of statistically distinct levels of assessor severity in the sample of assessors. An assessor separation index of 1 would suggest that all assessors were exercising a similar level of severity and could be considered as one interchangeable group. (We will be reporting candidate and assessor separation indices in our results, but not their associated separation ratios. The separation indices are more readily understood and have more practical utility, in our view.)

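As a concrete illustration of Equations 5 and 6, the separation ratio and separation index for any facet can be computed from its estimated measures and standard errors. This is our sketch, not FACETS output, and the measures and standard errors below are invented:

```python
import numpy as np

def separation_ratio(measures, ses):
    """Eq. 5: error-adjusted spread of the measures over their average error."""
    measures, ses = np.asarray(measures, float), np.asarray(ses, float)
    mse = np.mean(ses ** 2)                  # mean-square measurement error
    true_var = np.var(measures) - mse        # observed variance minus error variance
    return np.sqrt(max(true_var, 0.0) / mse)

def separation_index(g):
    """Eq. 6: number of measurably distinct levels (strata)."""
    return (4.0 * g + 1.0) / 3.0

# Invented candidate measures (in logits) and their standard errors:
g = separation_ratio([-1.2, -0.4, 0.3, 0.9, 1.8], [0.35, 0.30, 0.32, 0.31, 0.36])
print(separation_index(g))  # a ratio of 2.0, for example, would yield 3.0 strata
```
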
Another useful statistic is the reliability of separation index. This index provides information about how well the elements within a facet are separated in order to define the facet reliably. It is analogous to traditional indices of reliability, such as Cronbach's coefficient alpha and KR-20, in the sense that it reflects an estimate of the ratio of true score variance to observed score variance. The reliability of separation indices have slightly different substantive interpretations for the different facets in the model. For candidates, the reliability of separation index is comparable to coefficient alpha, indicating the reliability with which the assessment separates the sample of candidates (that is, the proportion of observed sample variance which is attributable to individual differences between candidates; Wright & Masters, 1982). Unlike interrater reliability, which is a measure of how similar the assessor measures are, the candidate separation reliability is a measure of how different the candidate accomplished teaching measures are (Linacre, 1994). By contrast, for assessors, the reliability of separation index reflects potentially unwanted variability in assessor severity. Separation reliability can be calculated as

R = (SD^2 - MSE) / SD^2 \qquad (7)

where SD^2 is the observed variance of element difficulties for a facet on the latent variable scale in logits, and MSE is the mean-square calibration error, estimated as the mean of the calibration error variances (the squares of the standard errors) for each element within a facet. (MSE is \sum SE_n^2 / N.) Andrich (1982) provides a detailed derivation of this reliability of separation index. Detailed general descriptions of the separation statistics are also provided in Wright and Masters (1982) and Fisher (1992).

Equation 2 can be used to generate a variety of expected scores under different conditions reflecting various assumptions regarding the assessment process. For example, it is possible to estimate an expected rating for a candidate on a particular exercise that would be obtained from an assessor who exercised a level of severity equal to zero (i.e., an assessor who was neither more lenient nor more severe than other assessors). In this case, the assessor, j, would be defined as \alpha_j = 0, and the adjusted probability, AP, and adjusted rating, AR, would be calculated as follows:

AP_{nijk} = \exp\Big[ k(\theta_n - \xi_i - 0) - \sum_{h=1}^{k} \tau_h \Big] \Big/ \gamma \qquad (8)

and

AR_{nij} = \sum_{k=1}^{m} k \, AP_{nijk}. \qquad (9)

Equation 9 can be interpreted as producing an expected adjusted rating for candidate n on exercise i from assessor j (with \alpha_j = 0).

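Equations 7 through 9 admit the same direct treatment. The sketch below (again ours, with invented values) computes the reliability of separation for a facet and an adjusted expected rating obtained by setting the assessor severity to zero in Equation 8:

```python
import numpy as np

def separation_reliability(measures, ses):
    """Eq. 7: R = (SD^2 - MSE) / SD^2 for the elements of one facet."""
    measures, ses = np.asarray(measures, float), np.asarray(ses, float)
    sd2 = np.var(measures)        # observed variance of element measures (logits)
    mse = np.mean(ses ** 2)       # mean of the squared standard errors
    return (sd2 - mse) / sd2

def adjusted_rating(theta, xi, tau):
    """Eqs. 8-9: expected rating from a hypothetical assessor of average
    severity (alpha_j = 0)."""
    k = np.arange(1, len(tau) + 1)
    log_num = k * (theta - xi) - np.cumsum(tau)  # Eq. 8 with alpha_j = 0
    p = np.exp(log_num - log_num.max())
    p /= p.sum()                                 # division by gamma
    return float(np.sum(k * p))                  # Eq. 9

# Invented assessor severities/standard errors and one candidate-exercise pair:
print(separation_reliability([-0.6, -0.1, 0.2, 0.5], [0.12, 0.10, 0.11, 0.13]))
print(adjusted_rating(theta=1.0, xi=0.4, tau=np.array([0.0, -0.5, 0.5])))
```
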
Method

Participants

Early Childhood/Generalist candidates. The total candidate pool for the Early Childhood/Generalist certification was 603 candidates. There were 596 candidates included in the analyses. Seven candidates were dropped from the analyses because they did not have ratings on all ten exercises. Within the total candidate pool, 98% were female, and 2% were male. In terms of race/ethnicity, 85% were White (non-Hispanic), 10% were African American, 1% were Hispanic, 1% were Asian American, 1% were Native American, and 1% of the candidates did not identify their race/ethnicity. When asked about their teaching setting, 34% indicated that they worked in urban schools, 27% worked in suburban settings, and 39% worked in rural settings (though there is some concern about whether candidates used these classification categories as intended). The range of years of the candidates' teaching experience was 3 to 38 years (mean = 14, median = 13). The mean age of candidates was 41 years (median = 43 years), while the range was 25 to 68 years. The highest degree attained was a bachelor's degree for 44% of the candidates, a master's degree for 56% of the candidates, and a doctorate for 1% of the candidates. Candidates from 33 states submitted their work for review: 45% of the candidates were from North Carolina, 16% were from Ohio, and the remaining 38% were from the other 31 states. (Less than 6% of the total candidate pool were from each of these 31 states.)² All candidates taught in public schools; there were no private school teachers. Finally, 92% of the candidates worked in elementary schools. The remaining 8% taught in preschools, except for one candidate who taught in a middle school.

² The demographics of the candidate pool for the Early Childhood/Generalist assessment have changed over the last several years. For example, in 1999, the total candidate pool was 1,441 candidates: 23.5% were from North Carolina, 23.2% were from Florida, 13% were from Mississippi, 7% were from California, and 6.2% were from Ohio. The remaining candidates were from 38 other states; fewer than 4% of the total candidate pool were from each of these states.

Early Childhood/Generalist assessors. There were 119 Early Childhood/Generalist assessors included in this study. All assessors were practicing Early Childhood/Generalist teachers. Seventy-eight assessors scored exercises related to the portfolio: 68% were White (non-Hispanic), 12% were African American, and 21% were Hispanic. Thirty-one assessors scored exercises administered at the assessment centers: 77% were White (non-Hispanic), 19% were African American, and 3% were Hispanic. Six assessors functioned as lead trainers for the portfolio assessment: two were White (non-Hispanic), two were African American, and two were Hispanic. Finally, four assessors functioned as assessment center trainers: three were White (non-Hispanic), and one was African American.

Middle Childhood/Generalist candidates. The total candidate pool for the Middle Childhood/Generalist certification was 523 candidates. There were 516 candidates included in the analyses. Seven candidates were dropped from the analyses because they did not have ratings on all ten exercises. Within the total candidate pool, 94% were female, and 7% were male. In terms of race/ethnicity, 89% were White (non-Hispanic), 8% were African American, 2% were Hispanic, 1% were Asian American, 1% were Native American, and 1% of the candidates did not identify their race/ethnicity. When asked about their teaching setting, 31% indicated that they taught in urban schools, 35% worked in suburban settings, and 34% worked in rural settings (though there is some concern about whether candidates used these classification categories as intended). The range of years of the candidates' teaching experience was 3 to 40 years (mean = 13, median = 12). The mean age of candidates was 41 years (median = 42 years), while the range was 26 to 63 years. The highest degree attained was a bachelor's degree for 35% of the candidates, a master's degree for 64% of the candidates, and a doctorate for 1% of the candidates. Candidates from 35 states submitted their work for review: 33% of the candidates were from North Carolina, 19% were from Ohio, and the remaining 47% were from the other 33 states. (Less than 6% of the total candidate pool was from each of these 33 states.)³ All candidates taught in public schools; there were no private school teachers. Finally, 96% of the candidates worked in elementary schools. The remaining 4% taught in middle schools.

³ The demographics of the candidate pool for the Middle Childhood/Generalist assessment have also changed over the last several years. For example, in 1999, the total candidate pool was 1,311 candidates: 30.1% were from Florida, 18.9% were from North Carolina, 8.6% were from Mississippi, 6.6% were from California, and 5.7% were from Ohio. The remaining candidates were from 41 other states; fewer than 5% of the total candidate pool were from each of these states.

Middle Childhood/Generalist assessors. There were 117 Middle Childhood/Generalist assessors included in this study. All assessors were practicing Middle Childhood/Generalist teachers. Seventy-five assessors scored exercises related to the portfolio: 80% were White (non-Hispanic), 8% were African American, and 12% identified themselves as having racial/ethnic backgrounds other than White (non-Hispanic). Forty-two assessors scored exercises administered at the assessment centers: 88% were White (non-Hispanic), 7% were African American, and 5% were from a racial/ethnic background other than White (non-Hispanic). Six assessors functioned as lead trainers for the portfolio assessment: four were White (non-Hispanic), one was African American, and one was from a racial/ethnic background other than White (non-Hispanic). Finally, four assessors functioned as assessment center trainers: one was White (non-Hispanic), and three were African American.

NBPTS Assessment Process

In , candidates who applied for the Early Childhood/Generalist certificate or the Middle Childhood/Generalist certificate participated in a two-part assessment process: (1) a portfolio assessment, and (2) a series of assessment center exercises. The portfolio was designed to showcase candidates' classroom practice as well as their work outside the classroom with families and the community at large, with their colleagues, and with their profession. The assessment center exercises were designed to evaluate the candidates' content and pedagogical content knowledge.

The portfolio has two distinct parts: four classroom-based entries, and two documented accomplishment entries. The classroom-based entries focus on the candidate's practice inside the classroom; by contrast, the documented accomplishment entries focus on the candidate's practice outside the classroom. Tables 1 and 2 describe the portfolio entries for Early Childhood/Generalist and Middle Childhood/Generalist certification. Candidates applying for certification received a large binder containing detailed instructions and requirements for preparing each portfolio entry, including a careful description of the kinds of evidence required so that the candidate's response will be scorable. The scoring criteria that assessors would use to evaluate the candidate's response to each portfolio entry were also included in the binder.

Candidates' portfolios included various sources of evidence of classroom practice. For two of the portfolio entries, candidates provided videotape of classroom interactions. The other two classroom-based portfolio entries required candidates to gather and respond to particular types of student work. In each case, the candidate produced a detailed written analysis of the teaching sequences shown in the videotape and of the student work. Through their written commentary, candidates were expected to provide a context for the work, specify the broad goals as well as the specific goals for the classroom segments shown on videotape, and provide critical reflection on their classroom practice. Assessors could then determine whether what they saw in the videotape segments and in the student work was mirrored in the written commentary.

All candidates traveled to an assessment center where, over the course of a day, they participated in a series of exercises that provided them with an opportunity to demonstrate knowledge and skills not addressed in the portfolio. The assessment center exercises are designed to supplement the portfolio, tapping the candidate's content knowledge and pedagogy through real-life scenarios that enable the candidate to confront critical instructional matters. Tables 3 and 4 describe the assessment center exercises for Early Childhood/Generalist and Middle Childhood/Generalist certification. At the assessment center, candidates responded to four ninety-minute exercise blocks. Some of the exercises were simulations of situations that teachers would typically encounter; other exercises posed questions related to pedagogical content topics and issues. For each exercise block, candidates were given a prompt or set of prompts on a computer. They could produce their responses on the computer or by hand. Some of the prompts were based on stimulus materials that candidates received prior to coming to the assessment center. Candidates were allowed to bring with them to the assessment center any supporting materials that they felt might be useful.⁴

NBPTS Scoring System

Trained assessors scored the candidates' portfolio entries and their performance on the four assessment center exercises. The six portfolio entries a candidate submits are scored separately. Each entry is considered to be a separate assessment exercise. An assessment consists, then, of ten exercises. Assessors who scored Middle Childhood/Generalist candidates' written responses to the assessment center exercises participated in an on-line computerized assessor training program. Assessors participated in either a four-day training program to prepare to score candidates' portfolios or a two-day training program to prepare to score the assessment center exercises. Each assessor was trained to score a single exercise.⁵ The training procedure included five components: (1) an introduction to the history and principles of the National Board for Professional Teaching Standards, (2) exposure to the various elements of the scoring system, including the standards, exercise directions, and scoring rubrics, (3) identification and recognition of the assessor's own personal biases, and targeted training to reduce the likelihood that those biases would enter into the scoring process, (4) exposure to benchmark performances that were designed to anchor the score scale and rubric, providing clear-cut examples of performances that served as operational definitions of each score point, and (5) extensive practice in applying the scoring system to evaluate candidates' pre-selected performances. Following training, assessors participated in a qualifying round to determine whether they were able to show adequate agreement with validated trainers' ratings on a number of pre-selected performances.

⁴ A new policy was instituted in . Under this policy, candidates are permitted to bring their assessment center orientation book, any stimulus materials that were mailed to them, the NBPTS standards, notes they wrote or typed themselves, and instructions for operating their calculator. They may not bring textbooks, encyclopedias, bound books, loose-leaf printed materials, or information they have downloaded from the Internet.

⁵ A few assessors in this study scored two exercises. In our analyses, we assigned them unique assessor IDs within exercise and then treated them as if they were nested within exercise. These few assessors received training to score both exercises.
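
To make the contrast between the current two-assessor scoring design and a one-assessor design concrete, the toy simulation below compares certification decision consistency under the two designs. Everything in it is hypothetical: integer scores 1 to 12 stand in for the 12-point scale, the ten exercises are weighted equally, and the cut score is arbitrary. The report's actual decision consistency analyses are based on the operational rating data and scoring rules.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_exercises = 1000, 10
cut = 7.0   # arbitrary cut score on the invented rating metric

# Invented ratings: candidates x exercises x 2 assessors, integers 1-12.
true_level = rng.normal(7.0, 1.0, size=(n_candidates, 1, 1))
noise = rng.normal(0.0, 1.5, size=(n_candidates, n_exercises, 2))
ratings = np.clip(np.rint(true_level + noise), 1, 12)

# Two-assessor design: average within exercise, then across the ten exercises.
score_two = ratings.mean(axis=2).mean(axis=1)

# One-assessor design: keep one randomly chosen rating per exercise.
pick = rng.integers(0, 2, size=(n_candidates, n_exercises))
score_one = np.take_along_axis(ratings, pick[..., None], axis=2)[..., 0].mean(axis=1)

# Decision consistency: same certify/deny decision under both designs.
consistency = np.mean((score_one >= cut) == (score_two >= cut))
print(round(float(consistency), 3))
```
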

Table 1
Description of the Early Childhood/Generalist Portfolio Entries⁶

1. REFLECTING ON A TEACHING AND LEARNING SEQUENCE. In this entry, candidates are asked to demonstrate how they nurture children's growth and learning through the experiences they provide and the ways in which they make adjustments to these experiences based on their ongoing assessments of children. Through a Written Commentary and four Supporting Artifacts, candidates describe a sequence of activities that reflects an extended exploration of some theme or topic in which they draw on and integrate concepts from social studies and from the arts.

2. EXAMINING A CHILD'S LITERACY DEVELOPMENT. In this entry, candidates are asked to demonstrate their skill in assessing and supporting children's literacy development. Through a Written Commentary with Supporting Artifacts, candidates provide evidence of the ways they foster literacy in their classroom. Candidates also analyze work samples from one child, discuss his/her development, and outline their approach to supporting his/her learning.

3. INTRODUCTION TO YOUR CLASSROOM COMMUNITY. In this entry, candidates provide evidence of how they create and sustain a climate that supports children's social and emotional development. In a Written Commentary, candidates are asked to introduce assessors to their learning community by describing ways they build appreciation of diversity and support children's development of social skills. Candidates are asked to illustrate their commentary with a Videotape that shows a class discussion in which they build a sense of community.

4. ENGAGING CHILDREN IN SCIENCE LEARNING. In this entry, candidates provide evidence of their skill in helping children acquire scientific knowledge and scientific ways of thinking, observing, and communicating. Through a Written Commentary, candidates are asked to discuss and analyze a sequence of learning experiences and how they reflect their general approach to science instruction. Candidates provide evidence from one learning experience in this sequence, explaining how it fits into the sequence. In a videotape of this learning experience, candidates provide evidence of how they engage children in science learning.

5. DOCUMENTED ACCOMPLISHMENTS: COLLABORATION IN THE PROFESSIONAL COMMUNITY. In this entry, candidates present evidence of sustained or significant contributions to the development and/or substantive review of instructional resources and/or practices, to educational policy and practices, and/or to collaborative work with colleagues in developing pedagogy. Through written description together with work products of the teacher and/or letters of verification from others, the candidate provides support for each accomplishment. Finally, the candidate writes an Interpretive Summary that synthesizes the accomplishments and evidence presented.

6. DOCUMENTED ACCOMPLISHMENTS: OUTREACH TO FAMILIES AND THE COMMUNITY. In this entry, candidates present evidence of how they create ongoing interactive communication with families and other adults interested in students' progress and learning. In addition, candidates may demonstrate evidence of efforts to understand parents' concerns about student learning, subject matter, and curriculum and/or contributions to connecting the school program to community needs, resources, and interests. Through written description together with work products and/or letters of verification from others, the candidate provides support for each accomplishment. Finally, the candidate writes an Interpretive Summary that synthesizes the accomplishments and evidence presented.

⁶ The descriptions of portfolio entries are taken from the National Board for Professional Teaching Standards Assessment Analysis Report, Early Childhood/Generalist, page 3.

Table 2
Description of the Middle Childhood/Generalist Portfolio Entries⁷

1. WRITING: THINKING THROUGH THE PROCESS. In this exercise, candidates demonstrate their use of writing to develop students' thinking and writing skills for different audiences and purposes. Through a Written Commentary, two writing Assignments/Prompts, and four Student Responses, candidates provide evidence of their planning and teaching and of their ability to describe, analyze, and evaluate student writing, to develop students' writing ability, and to use student work to reflect on their practice.

2. THEMATIC EXPLORATION: CONNECTION TO SCIENCE. In this exercise, candidates show how they help students to acquire important science knowledge as they strive to better understand a substantive interdisciplinary theme. Through a Written Commentary, three Instructional Artifacts, and six Student Responses, candidates present their ability to develop an interdisciplinary theme and to engage children in work that helps them acquire one or more big idea(s) from science in order to enrich their understanding of that theme (big ideas from science include, for example, systems, models, energy, evolution, scale, structure, constancy, and patterns of change).

3. BUILDING A CLASSROOM COMMUNITY. In this exercise, candidates display their ability to observe and analyze interactions in their classroom. Through a Written Commentary and Videotape, candidates describe and illustrate how they create a climate that supports students' emerging abilities to understand and consider perspectives other than their own, and to assume responsibility for their own actions.

4. BUILDING A MATHEMATICAL UNDERSTANDING. In this exercise, candidates demonstrate how they engage students in the discovery, exploration, and implementation of concepts, procedures, and processes to develop a deep understanding of an important mathematics content area over a period of time. Through a Written Commentary, Videotape, and an Assignment/Prompt with two Student Responses, candidates provide evidence of planning and teaching that help build students' mathematical understanding. Candidates also provide evidence of their ability to describe, analyze, and reflect on their teaching practice.

5. DOCUMENTED ACCOMPLISHMENTS: COLLABORATION IN THE PROFESSIONAL COMMUNITY. In this entry, candidates present evidence of sustained or significant contributions to the development and/or substantive review of instructional resources and/or practices, to educational policy and practices, and/or to collaborative work with colleagues in developing pedagogy. Through written description together with work products of the teacher and/or letters of verification from others, the candidate provides support for each accomplishment. Finally, the candidate writes an Interpretive Summary that synthesizes the accomplishments and evidence presented.

6. DOCUMENTED ACCOMPLISHMENTS: OUTREACH TO FAMILIES AND THE COMMUNITY. In this entry, candidates present evidence of how they create ongoing interactive communication with families and other adults interested in students' progress and learning. In addition, candidates may demonstrate evidence of efforts to understand parents' concerns about student learning, subject matter, and curriculum and/or contributions to connecting the school program to community needs, resources, and interests. Through written description together with work products and/or letters of verification from others, the candidate provides support for each accomplishment. Finally, the candidate writes an Interpretive Summary that synthesizes the accomplishments and evidence presented.

⁷ The descriptions of portfolio entries are taken from the National Board for Professional Teaching Standards Assessment Analysis Report, Middle Childhood/Generalist.


More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University

Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure. Rob Cavanagh Len Sparrow Curtin University Measuring mathematics anxiety: Paper 2 - Constructing and validating the measure Rob Cavanagh Len Sparrow Curtin University R.Cavanagh@curtin.edu.au Abstract The study sought to measure mathematics anxiety

More information

UvA-DARE (Digital Academic Repository)

UvA-DARE (Digital Academic Repository) UvA-DARE (Digital Academic Repository) Standaarden voor kerndoelen basisonderwijs : de ontwikkeling van standaarden voor kerndoelen basisonderwijs op basis van resultaten uit peilingsonderzoek van der

More information

Validity Arguments for Alternate Assessment Systems

Validity Arguments for Alternate Assessment Systems Validity Arguments for Alternate Assessment Systems Scott Marion, Center for Assessment Reidy Interactive Lecture Series Portsmouth, NH September 25-26, 26, 2008 Marion. Validity Arguments for AA-AAS.

More information

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL) EVALUATION OF MATHEMATICS ACHIEVEMENT TEST: A COMPARISON BETWEEN CLASSICAL TEST THEORY (CTT)AND ITEM RESPONSE THEORY (IRT) Eluwa, O. Idowu 1, Akubuike N. Eluwa 2 and Bekom K. Abang 3 1& 3 Dept of Educational

More information

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items

Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations May 215 Using Differential Item Functioning to Test for Inter-rater Reliability in Constructed Response Items Tamara Beth

More information

Having your cake and eating it too: multiple dimensions and a composite

Having your cake and eating it too: multiple dimensions and a composite Having your cake and eating it too: multiple dimensions and a composite Perman Gochyyev and Mark Wilson UC Berkeley BEAR Seminar October, 2018 outline Motivating example Different modeling approaches Composite

More information

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky Validating Measures of Self Control via Rasch Measurement Jonathan Hasford Department of Marketing, University of Kentucky Kelly D. Bradley Department of Educational Policy Studies & Evaluation, University

More information

Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis

Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis Evaluating a Special Education Teacher Observation Tool through G-theory and MFRM Analysis Project RESET, Boise State University Evelyn Johnson, Angie Crawford, Yuzhu Zheng, & Laura Moylan NCME Annual

More information

A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings

A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings Psychological Test and Assessment Modeling, Volume 60, 2018 (1), 33-52 A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings George Engelhard,

More information

Reliability. Internal Reliability

Reliability. Internal Reliability 32 Reliability T he reliability of assessments like the DECA-I/T is defined as, the consistency of scores obtained by the same person when reexamined with the same test on different occasions, or with

More information

A typology of polytomously scored mathematics items disclosed by the Rasch model: implications for constructing a continuum of achievement

A typology of polytomously scored mathematics items disclosed by the Rasch model: implications for constructing a continuum of achievement A typology of polytomously scored mathematics items 1 A typology of polytomously scored mathematics items disclosed by the Rasch model: implications for constructing a continuum of achievement John van

More information

Construct Invariance of the Survey of Knowledge of Internet Risk and Internet Behavior Knowledge Scale

Construct Invariance of the Survey of Knowledge of Internet Risk and Internet Behavior Knowledge Scale University of Connecticut DigitalCommons@UConn NERA Conference Proceedings 2010 Northeastern Educational Research Association (NERA) Annual Conference Fall 10-20-2010 Construct Invariance of the Survey

More information

The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective

The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective Vol. 9, Issue 5, 2016 The Impact of Item Sequence Order on Local Item Dependence: An Item Response Theory Perspective Kenneth D. Royal 1 Survey Practice 10.29115/SP-2016-0027 Sep 01, 2016 Tags: bias, item

More information

Combining Dual Scaling with Semi-Structured Interviews to Interpret Rating Differences

Combining Dual Scaling with Semi-Structured Interviews to Interpret Rating Differences A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to

More information

Chapter 3. Psychometric Properties

Chapter 3. Psychometric Properties Chapter 3 Psychometric Properties Reliability The reliability of an assessment tool like the DECA-C is defined as, the consistency of scores obtained by the same person when reexamined with the same test

More information

Evaluating the quality of analytic ratings with Mokken scaling

Evaluating the quality of analytic ratings with Mokken scaling Psychological Test and Assessment Modeling, Volume 57, 2015 (3), 423-444 Evaluating the quality of analytic ratings with Mokken scaling Stefanie A. Wind 1 Abstract Greatly influenced by the work of Rasch

More information

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form INVESTIGATING FIT WITH THE RASCH MODEL Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form of multidimensionality. The settings in which measurement

More information

Linking Assessments: Concept and History

Linking Assessments: Concept and History Linking Assessments: Concept and History Michael J. Kolen, University of Iowa In this article, the history of linking is summarized, and current linking frameworks that have been proposed are considered.

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Project Goal(s) and Objectives What is a goal? What is an objective? Measurable Objectives

Project Goal(s) and Objectives What is a goal? What is an objective? Measurable Objectives Project Goal(s) and Objectives What is a goal? A goal is a broad statement of what you wish to accomplish. Goals are broad, general, intangible, and abstract. A goal is really about the final impact or

More information

Research. Reports. Monitoring Sources of Variability Within the Test of Spoken English Assessment System. Carol M. Myford Edward W.

Research. Reports. Monitoring Sources of Variability Within the Test of Spoken English Assessment System. Carol M. Myford Edward W. Research TEST OF ENGLISH AS A FOREIGN LANGUAGE Reports REPORT 65 JUNE 2000 Monitoring Sources of Variability Within the Test of Spoken English Assessment System Carol M. Myford Edward W. Wolfe Monitoring

More information

By Hui Bian Office for Faculty Excellence

By Hui Bian Office for Faculty Excellence By Hui Bian Office for Faculty Excellence 1 Email: bianh@ecu.edu Phone: 328-5428 Location: 1001 Joyner Library, room 1006 Office hours: 8:00am-5:00pm, Monday-Friday 2 Educational tests and regular surveys

More information

Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education.

Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO. M. Ken Cor Stanford University School of Education. The Reliability of PLATO Running Head: THE RELIABILTY OF PLATO Investigating the Reliability of Classroom Observation Protocols: The Case of PLATO M. Ken Cor Stanford University School of Education April,

More information

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s

RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT. Evaluating and Restructuring Science Assessments: An Example Measuring Student s RUNNING HEAD: EVALUATING SCIENCE STUDENT ASSESSMENT Evaluating and Restructuring Science Assessments: An Example Measuring Student s Conceptual Understanding of Heat Kelly D. Bradley, Jessica D. Cunningham

More information

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p )

Shiken: JALT Testing & Evaluation SIG Newsletter. 12 (2). April 2008 (p ) Rasch Measurementt iin Language Educattiion Partt 2:: Measurementt Scalles and Invariiance by James Sick, Ed.D. (J. F. Oberlin University, Tokyo) Part 1 of this series presented an overview of Rasch measurement

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Research Report Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study Xueli Xu Matthias von Davier April 2010 ETS RR-10-10 Listening. Learning. Leading. Linking Errors in Trend Estimation

More information

Measuring change in training programs: An empirical illustration

Measuring change in training programs: An empirical illustration Psychology Science Quarterly, Volume 50, 2008 (3), pp. 433-447 Measuring change in training programs: An empirical illustration RENATO MICELI 1, MICHELE SETTANNI 1 & GIULIO VIDOTTO 2 Abstract The implementation

More information

Chapter 1 Introduction to Educational Research

Chapter 1 Introduction to Educational Research Chapter 1 Introduction to Educational Research The purpose of Chapter One is to provide an overview of educational research and introduce you to some important terms and concepts. My discussion in this

More information

Inferences: What inferences about the hypotheses and questions can be made based on the results?

Inferences: What inferences about the hypotheses and questions can be made based on the results? QALMRI INSTRUCTIONS QALMRI is an acronym that stands for: Question: (a) What was the broad question being asked by this research project? (b) What was the specific question being asked by this research

More information

Writing Reaction Papers Using the QuALMRI Framework

Writing Reaction Papers Using the QuALMRI Framework Writing Reaction Papers Using the QuALMRI Framework Modified from Organizing Scientific Thinking Using the QuALMRI Framework Written by Kevin Ochsner and modified by others. Based on a scheme devised by

More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison Empowered by Psychometrics The Fundamentals of Psychometrics Jim Wollack University of Wisconsin Madison Psycho-what? Psychometrics is the field of study concerned with the measurement of mental and psychological

More information

Teacher satisfaction: some practical implications for teacher professional development models

Teacher satisfaction: some practical implications for teacher professional development models Teacher satisfaction: some practical implications for teacher professional development models Graça Maria dos Santos Seco Lecturer in the Institute of Education, Leiria Polytechnic, Portugal. Email: gracaseco@netvisao.pt;

More information

Pediatrics Milestones and Meaningful Assessment Translating the Pediatrics Milestones into Assessment Items for use in the Clinical Setting

Pediatrics Milestones and Meaningful Assessment Translating the Pediatrics Milestones into Assessment Items for use in the Clinical Setting Pediatrics Milestones and Meaningful Assessment Translating the Pediatrics Milestones into Assessment Items for use in the Clinical Setting Ann Burke Susan Guralnick Patty Hicks Jeanine Ronan Dan Schumacher

More information

COMPUTING READER AGREEMENT FOR THE GRE

COMPUTING READER AGREEMENT FOR THE GRE RM-00-8 R E S E A R C H M E M O R A N D U M COMPUTING READER AGREEMENT FOR THE GRE WRITING ASSESSMENT Donald E. Powers Princeton, New Jersey 08541 October 2000 Computing Reader Agreement for the GRE Writing

More information

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories Kamla-Raj 010 Int J Edu Sci, (): 107-113 (010) Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories O.O. Adedoyin Department of Educational Foundations,

More information

Published online: 16 Nov 2012.

Published online: 16 Nov 2012. This article was downloaded by: [National Taiwan Normal University] On: 12 September 2014, At: 01:02 Publisher: Routledge Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered

More information

What Is the Fat Intake of a Typical Middle School Student?

What Is the Fat Intake of a Typical Middle School Student? What Is the Fat Intake of a Typical Middle School Student? Health professionals believe that the average American diet contains too much fat. Although there is no Recommended Daily Allowance for fat, it

More information

Brent Duckor Ph.D. (SJSU) Kip Tellez, Ph.D. (UCSC) BEAR Seminar April 22, 2014

Brent Duckor Ph.D. (SJSU) Kip Tellez, Ph.D. (UCSC) BEAR Seminar April 22, 2014 Brent Duckor Ph.D. (SJSU) Kip Tellez, Ph.D. (UCSC) BEAR Seminar April 22, 2014 Studies under review ELA event Mathematics event Duckor, B., Castellano, K., Téllez, K., & Wilson, M. (2013, April). Validating

More information

Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English

Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English Yoshihito SUGITA Yamanashi Prefectural University Abstract This article examines the main data of

More information

RATER EFFECTS & ALIGNMENT 1

RATER EFFECTS & ALIGNMENT 1 RATER EFFECTS & ALIGNMENT 1 Running head: RATER EFFECTS & ALIGNMENT Modeling Rater Effects in a Formative Mathematics Alignment Study Daniel Anderson, P. Shawn Irvin, Julie Alonzo, and Gerald Tindal University

More information

Exploring rater errors and systematic biases using adjacent-categories Mokken models

Exploring rater errors and systematic biases using adjacent-categories Mokken models Psychological Test and Assessment Modeling, Volume 59, 2017 (4), 493-515 Exploring rater errors and systematic biases using adjacent-categories Mokken models Stefanie A. Wind 1 & George Engelhard, Jr.

More information

School Annual Education Report (AER) Cover Letter

School Annual Education Report (AER) Cover Letter School (AER) Cover Letter May 11, 2018 Dear Parents and Community Members: We are pleased to present you with the (AER) which provides key information on the 2016-2017 educational progress for the Lynch

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

Measuring Higher Education Quality

Measuring Higher Education Quality Measuring Higher Education Quality Imagine a thickly overgrown garden where flowering plants are choked by rampant vegetation. The vision is one of confusion; order can be achieved only by sharp pruning.

More information

2013 Supervisor Survey Reliability Analysis

2013 Supervisor Survey Reliability Analysis 2013 Supervisor Survey Reliability Analysis In preparation for the submission of the Reliability Analysis for the 2013 Supervisor Survey, we wanted to revisit the purpose of this analysis. This analysis

More information

Running head: PRELIM KSVS SCALES 1

Running head: PRELIM KSVS SCALES 1 Running head: PRELIM KSVS SCALES 1 Psychometric Examination of a Risk Perception Scale for Evaluation Anthony P. Setari*, Kelly D. Bradley*, Marjorie L. Stanek**, & Shannon O. Sampson* *University of Kentucky

More information

A Comparison of Several Goodness-of-Fit Statistics

A Comparison of Several Goodness-of-Fit Statistics A Comparison of Several Goodness-of-Fit Statistics Robert L. McKinley The University of Toledo Craig N. Mills Educational Testing Service A study was conducted to evaluate four goodnessof-fit procedures

More information

Using the Partial Credit Model

Using the Partial Credit Model A Likert-type Data Analysis Using the Partial Credit Model Sun-Geun Baek Korean Educational Development Institute This study is about examining the possibility of using the partial credit model to solve

More information

The Effect of Option Homogeneity in Multiple- Choice Items

The Effect of Option Homogeneity in Multiple- Choice Items Manuscripts The Effect of Option Homogeneity in Multiple- Choice Items Applied Psychological Measurement 1 12 Ó The Author(s) 2018 Reprints and permissions: sagepub.com/journalspermissions.nav DOI: 10.1177/0146621618770803

More information

Teacher stress: A comparison between casual and permanent primary school teachers with a special focus on coping

Teacher stress: A comparison between casual and permanent primary school teachers with a special focus on coping Teacher stress: A comparison between casual and permanent primary school teachers with a special focus on coping Amanda Palmer, Ken Sinclair and Michael Bailey University of Sydney Paper prepared for presentation

More information

Grand Valley State University

Grand Valley State University Reports Grand Valley State University comparison group 1: comparison group 2: Public 4yr Colleges Public/Private Universities and Public 4yr Colleges 1.1 Reports Table of Contents Reports How to Read the

More information

SPECIAL EDUCATION (SED) DeGarmo Hall, (309) Website:Education.IllinoisState.edu Chairperson: Stacey R. Jones Bock.

SPECIAL EDUCATION (SED) DeGarmo Hall, (309) Website:Education.IllinoisState.edu Chairperson: Stacey R. Jones Bock. 368 SPECIAL EDUCATION (SED) 591 533 DeGarmo Hall, (309) 438-8980 Website:Education.IllinoisState.edu Chairperson: Stacey R. Jones Bock. General Department Information Program Admission Requirements for

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data TECHNICAL REPORT Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data CONTENTS Executive Summary...1 Introduction...2 Overview of Data Analysis Concepts...2

More information

Providing Highly-Valued Service Through Leadership, Innovation, and Collaboration

Providing Highly-Valued Service Through Leadership, Innovation, and Collaboration ~ivingston Providing Highly-Valued Service Through Leadership, Innovation, and Collaboration March 3, 27 Dear Parents and Community Members: We are pleased to present you with the Annual Education Report

More information

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida and Oleksandr S. Chernyshenko University of Canterbury Presented at the New CAT Models

More information

Psychometrics for Beginners. Lawrence J. Fabrey, PhD Applied Measurement Professionals

Psychometrics for Beginners. Lawrence J. Fabrey, PhD Applied Measurement Professionals Psychometrics for Beginners Lawrence J. Fabrey, PhD Applied Measurement Professionals Learning Objectives Identify key NCCA Accreditation requirements Identify two underlying models of measurement Describe

More information

Item Analysis Explanation

Item Analysis Explanation Item Analysis Explanation The item difficulty is the percentage of candidates who answered the question correctly. The recommended range for item difficulty set forth by CASTLE Worldwide, Inc., is between

More information

Denny Borsboom Jaap van Heerden Gideon J. Mellenbergh

Denny Borsboom Jaap van Heerden Gideon J. Mellenbergh Validity and Truth Denny Borsboom Jaap van Heerden Gideon J. Mellenbergh Department of Psychology, University of Amsterdam ml borsboom.d@macmail.psy.uva.nl Summary. This paper analyzes the semantics of

More information

Indiana University-Purdue University-Fort Wayne

Indiana University-Purdue University-Fort Wayne Reports Indiana University-Purdue University-Fort Wayne comparison group 1: comparison group 2: Public 4yr Colleges Public/Private Universities and Public 4yr Colleges 1.1 Reports Table of Contents Reports

More information

FRASER RIVER COUNSELLING Practicum Performance Evaluation Form

FRASER RIVER COUNSELLING Practicum Performance Evaluation Form FRASER RIVER COUNSELLING Practicum Performance Evaluation Form Semester 1 Semester 2 Other: Instructions: To be completed and reviewed in conjunction with the supervisor and the student, signed by both,

More information

André Cyr and Alexander Davies

André Cyr and Alexander Davies Item Response Theory and Latent variable modeling for surveys with complex sampling design The case of the National Longitudinal Survey of Children and Youth in Canada Background André Cyr and Alexander

More information

Saint Thomas University

Saint Thomas University Reports Saint Thomas University comparison group 1: comparison group 2: Private/Nonsectarian 4yr Colleges Nonsectarian, Catholic, Other Religious 4yr Colleges 1.1 Reports Table of Contents Reports How

More information

Illinois Wesleyan University

Illinois Wesleyan University Reports Illinois Wesleyan University comparison group 1: comparison group 2: Private/Nonsectarian 4yr Colleges Nonsectarian, Catholic, Other Religious 4yr Colleges 1.1 Reports Table of Contents Reports

More information

Students' perceived understanding and competency in probability concepts in an e- learning environment: An Australian experience

Students' perceived understanding and competency in probability concepts in an e- learning environment: An Australian experience University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part A Faculty of Engineering and Information Sciences 2016 Students' perceived understanding and competency

More information

Research Questions and Survey Development

Research Questions and Survey Development Research Questions and Survey Development R. Eric Heidel, PhD Associate Professor of Biostatistics Department of Surgery University of Tennessee Graduate School of Medicine Research Questions 1 Research

More information

PSYCHOLOGY (413) Chairperson: Sharon Claffey, Ph.D.

PSYCHOLOGY (413) Chairperson: Sharon Claffey, Ph.D. PSYCHOLOGY (413) 662-5453 Chairperson: Sharon Claffey, Ph.D. Email: S.Claffey@mcla.edu PROGRAMS AVAILABLE BACHELOR OF ARTS IN PSYCHOLOGY BEHAVIOR ANALYSIS MINOR PSYCHOLOGY MINOR TEACHER LICENSURE PSYCHOLOGY

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

A Case Study: Two-sample categorical data

A Case Study: Two-sample categorical data A Case Study: Two-sample categorical data Patrick Breheny January 31 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/43 Introduction Model specification Continuous vs. mixture priors Choice

More information

Regression CHAPTER SIXTEEN NOTE TO INSTRUCTORS OUTLINE OF RESOURCES

Regression CHAPTER SIXTEEN NOTE TO INSTRUCTORS OUTLINE OF RESOURCES CHAPTER SIXTEEN Regression NOTE TO INSTRUCTORS This chapter includes a number of complex concepts that may seem intimidating to students. Encourage students to focus on the big picture through some of

More information

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus.

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. Department of Political Science UNIVERSITY OF CALIFORNIA, SAN DIEGO Philip G. Roeder Research Prospectus Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. A

More information

Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College

Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College Presented By: Yip, C.K., OT, PhD. School of Medical and Health Sciences, Tung Wah College Background of problem in assessment for elderly Key feature of CCAS Structural Framework of CCAS Methodology Result

More information