
RR-00-13

RESEARCH REPORT

INVESTIGATING ASSESSOR EFFECTS IN NATIONAL BOARD FOR PROFESSIONAL TEACHING STANDARDS ASSESSMENTS FOR EARLY CHILDHOOD/GENERALIST AND MIDDLE CHILDHOOD/GENERALIST CERTIFICATION

George Engelhard, Jr.
Carol M. Myford
Fred Cline

Princeton, New Jersey 08541
September 2000

Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from the Research Publications Office, Mail Stop 07-R, Educational Testing Service, Princeton, NJ 08541.

INVESTIGATING ASSESSOR EFFECTS IN NATIONAL BOARD FOR PROFESSIONAL TEACHING STANDARDS ASSESSMENTS FOR EARLY CHILDHOOD/GENERALIST AND MIDDLE CHILDHOOD/GENERALIST CERTIFICATION

George Engelhard, Jr., Emory University
Carol M. Myford and Fred Cline, Educational Testing Service

Abstract

The purposes of this study were to (1) examine, describe, and evaluate the rating behavior of assessors scoring National Board for Professional Teaching Standards Early Childhood/Generalist or Middle Childhood/Generalist candidates, and (2) explore the effects of employing scoring designs using one assessor per exercise rather than two. Data from the 1997-98 Early Childhood/Generalist and Middle Childhood/Generalist assessments were analyzed using FACETS (Linacre, 1989) and SAS. While assessors differed somewhat in severity, most used the 12-point rating scale consistently. Residual severity effects that persisted after rater training tended to cancel out when ratings were averaged across the ten exercises to produce a scaled score for a candidate. The results of the decision consistency analyses suggest that having one assessor score each exercise would not result in a large increase in candidates incorrectly denied or awarded certification. However, the decision consistency analyses by exercise reveal the low reliability of the exercise banking decisions. Variation in assessor severity can affect the fairness of these decisions.

Key words: teacher assessment, portfolio assessment, assessment centers, performance assessment, Item Response Theory, Rasch measurement, FACETS, rater effects, rater monitoring, quality control

Acknowledgments

We would like to acknowledge the helpful advice of Mike Linacre regarding the use of the FACETS computer program to analyze these data. The material contained herein is based on work supported by the National Board for Professional Teaching Standards. Any opinions, findings, conclusions, and recommendations expressed herein are those of the authors and do not necessarily reflect the views of the National Board for Professional Teaching Standards, Emory University, or the Educational Testing Service.

Table of Contents

Introduction
A Many-Faceted Rasch Model for the Assessment of Accomplished Teaching
Method
    Participants
    NBPTS Assessment Process
    NBPTS Scoring System
    Procedure
Results
    I. FACETS Analyses of the Early Childhood/Generalist and Middle Childhood/Generalist Assessment Systems
        Variable Map
        Rating Scale
        Candidates
        Exercises
        Assessors
    II. Comparisons of Scoring Designs
    III. Summary of Results in Terms of the Research Questions
Discussion
References

Introduction

The National Board for Professional Teaching Standards (NBPTS) is committed to the development of a psychometrically sound assessment system designed to measure fairly and objectively high levels of teaching accomplishment in a variety of content areas (National Board for Professional Teaching Standards, 1999a). Many of the early psychometric analyses of the NBPTS assessments were conducted using traditional methods based on classical test theory and its extensions, such as Generalizability Theory (National Board for Professional Teaching Standards, 1999b). As the number of candidates for NBPTS certification has increased, it is now possible to use modern psychometric methods based on Item Response Theory (IRT) (Hambleton & Swaminathan, 1985) to evaluate the psychometric quality of the NBPTS assessments. The use of IRT to measure teaching has been discussed several times in the literature (Mislevy, 1987; Swaminathan, 1986); but as far as we know, this study reflects the first application of IRT to a large-scale performance assessment of teachers within the context of certification testing. In essence, we are modeling how trained assessors implement an approach to evaluating teaching.

This study focuses on the use of one class of IRT models based on the measurement theory of Rasch (1980) and extended by Wright and his colleagues (Wright & Stone, 1979; Wright & Masters, 1982). The extension of the Rasch model for analyses of assessor-mediated ratings is called the many-faceted Rasch model (the FACETS model; Linacre, 1989). The FACETS model has been used to examine the psychometric quality of a variety of performance assessments based on assessor-mediated ratings (e.g., Engelhard, 1992, 1994, 1996; Heller, Sheingold, & Myford, 1998; Linacre, Engelhard, Tatum, & Myford, 1994; Lunz & Stahl, 1990; Lunz, Wright, & Linacre, 1990; Myford, Marr, & Linacre, 1996; Myford & Mislevy, 1995; Myford & Wolfe, 2000a; Paulukonis, Myford, & Heller, 2000; Wolfe, Chiu, & Myford, 1999). It should be stressed that the FACETS model provides additional information that supplements, rather than supplants, the inferences provided by more traditional methods that have been used previously to analyze the NBPTS assessments. The FACETS analyses presented in this study provide a different lens for exploring the quality of assessor-mediated ratings obtained within the context of the NBPTS assessment system.

There are several key ideas that undergird the analyses reported in this study. It is important to recognize that high levels of teaching accomplishment cannot be viewed directly, and that teaching accomplishment is a latent variable or construct that must be inferred from observed performances on selected assessment exercises. In essence, assessors observe a variety of performances that reflect a candidate's responses to different exercises (or tasks), and then the assessors make judgments based on these teacher performances to infer level of teaching accomplishment. The conceptual framework is based on what Messick (1994) has called a construct-driven rather than a task-driven model. Performance assessments developed within this conceptual framework require the careful elicitation, analysis, and evaluation (both formative and summative) of assessor judgments in order to obtain fair inferences regarding a candidate's level of teaching accomplishment.

The conceptual model for the assessment of accomplished teaching used to guide this study is presented in Figure 1. This conceptual model hypothesizes that the observed ratings (i.e., the ratings the assessors give candidates on the exercises) are a function of a variety of factors. Ideally, the major determinant of the observed rating should be the construct or latent variable being measured: accomplished teaching. However, there are a number of intervening variables in the model that potentially could adversely affect the quality of those ratings by introducing unwanted sources of construct-irrelevant variance into the assessment process. For a given teaching field, a set of standards of accomplished teaching is prepared that lays out the knowledge, skills, dispositions, and commitments defining the construct for that field. These standards reflect a professional consensus regarding the critical aspects of practice that characterize accomplished teachers in the field (see, for example, National Board for Professional Teaching Standards, 1998c, 1998d). A set of exercises and scoring rubrics for each exercise are then devised that are explicitly linked to the standards. The standards, the exercises, and their accompanying rubrics provide the operational definition of the accomplished teaching construct. There are several aspects of the assessment design process that, if not carried out appropriately, could introduce construct-irrelevant variance into NBPTS assessments: (1) the standards may not adequately lay out the knowledge, skills, dispositions, and commitments that define the accomplished teaching construct for a given field, (2) some standards may have been included that were not a part of the construct, (3) the linkage of the standards to the exercises may be loose, or (4) the linkage of the standards to the rubrics may be loose. In building a validity argument for the NBPTS assessments based on the new Standards for Educational and Psychological Testing (American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, 1999), content validity evidence has been gathered to demonstrate that the processes of defining the accomplished teaching construct through standards and of linking the standards to the exercises and scoring rubrics were carried out in a credible, reproducible fashion (see, for example, Benson & Impara, 1996a, 1996b).

Candidates who wish to apply for NBPTS certification must clearly understand the standards for their field and respond to each exercise in an informed manner. The NBPTS provides candidates with specific approaches for studying the standards, detailed instructions for completing each exercise, and suggestions for reviewing the written work the candidates must submit. However, it is possible that some candidates may not achieve a sufficient level of understanding regarding the assessment process and the types of evidence the assessment requires the candidate to gather. If any of the exercises have been designed such that candidates do not clearly understand what is expected of them, then this could introduce construct-irrelevant variance into the assessment process.

Assessors, who are practicing teachers in the field, carefully study the standards for that field and then receive intensive training to become familiar with the demands of the particular exercise they are to score. As part of their assessor training, they learn to apply the rubric to score candidates' performances on that exercise. The assessors practice scoring pre-selected performances until they develop a shared understanding of how to apply and interpret the rubric.

Figure 1. Conceptual Model for the Assessment of Accomplished Teaching. [Figure: the latent variable, level of accomplished teaching, leads to the observed ratings through three intervening variables: assessor severity; exercise difficulty (I. Portfolio, II. Documented Accomplishments, III. Assessment Center); and the structure of the rating scale.]

If there were assessors who, after training, rated significantly more leniently (or, conversely, more harshly) than other assessors, that could introduce construct-irrelevant variance into the ratings. In this situation, whether a candidate obtained NBPTS certification could be heavily dependent upon the particular set of assessors who, by luck of the draw, happened to rate the candidate. Similarly, if there were assessors who, even after intensive training, were unable to use a scoring rubric consistently when evaluating candidates' performances, or assessors who allowed personal biases to cloud their judgments, such assessors would introduce additional construct-irrelevant variance into the assessment process. The quality of the ratings they provide would be highly suspect. In effect, the standards, the exercises, the rubrics, and the assessors function as intervening variables in the conceptual model, providing the lens and focus for viewing the accomplished teaching construct and obtaining the ratings.

We employed a psychometric model that enabled us to operationalize key aspects of the conceptual model. Our psychometric model incorporates some (but not all) of the possible intervening variables. By reviewing output from our analyses, we are able to monitor the extent to which the intervening variables (or facets, as we refer to them) included in our psychometric model are introducing unwanted construct-irrelevant sources of variance into the assessment process. These facets, if left unattended, have the potential to threaten the quality of the ratings (and, ultimately, the validity of the assessment enterprise).

Our psychometric model includes exercises as one facet in our analyses. Each separate exercise is an element of the exercise facet. From our analyses, we obtain an estimate of the difficulty of each exercise (i.e., how hard it is for a candidate to receive a high rating on the exercise).¹ We also obtain an estimate of the extent to which the exercise fits with the other exercises included in the model; that is, whether the exercises work together to define a single unidimensional construct, or whether there is evidence that the exercises are measuring multiple, independent dimensions of a construct (or, perhaps, multiple constructs). The fit statistics for an exercise signal the degree of correspondence between ratings of candidates' performances on that particular exercise and ratings of candidates' performances on the other exercises. They provide an assessment of whether ratings on the exercises can be meaningfully combined to produce a single composite score, or whether there may be a need for separate scores to be reported rather than a single composite score.

A second facet included in our psychometric model is assessors. From our analyses, we obtain an estimate of the level of severity of each assessor when evaluating candidates' performances. This information can be used to judge the degree to which assessors are functioning interchangeably. It is also possible to obtain an estimate of the consistency with which an assessor applies a scoring rubric. Assessor fit statistics provide estimates of the degree to which an assessor is internally consistent when using the rubric to evaluate multiple candidate performances.

A third facet in our psychometric model is candidates. The output from our analyses provides an estimate of each candidate's level of accomplished teaching. In producing these estimates, the computer program adjusts the candidate's score for the level of severity exercised by the particular assessors who scored that candidate's performance. In effect, the program washes out these unwanted sources of construct-irrelevant variance. The resulting score reflects what the candidate would have received if assessors of average severity had rated the candidate. The program also produces candidate fit statistics that are indices of the degree of consistency shown in the evaluation of the candidate's level of accomplished teaching across exercises and across assessors. Through fit analyses, candidates who exhibit unusual profiles of ratings across exercises (i.e., candidates who appear to do well on some exercises but poorly on others) can be identified. Flagging the scores of these misfitting candidates allows for an important quality control check before score reports are issued: an independent review of each misfitting candidate's performance across exercises. One can determine whether the particular ratings that the computer program has identified as surprising or unexpected for that candidate are perhaps due to random (or systematic) assessor error and ought to be changed. Alternatively, through fit analysis, one might determine that the candidate did indeed perform differentially across exercises, and that the ratings should be left to stand as is.
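As a concrete illustration of this quality control check, the short sketch below flags candidates whose fit statistics or individual standardized residuals look unusual so that their rating profiles can be routed for independent review. It is a minimal sketch in Python under assumed conventions: the data layout, the function name, and the cutoff values are illustrative, not the operational NBPTS flagging rules, and the fit mean squares (defined formally in the next section) would in practice come from the FACETS output.

from typing import Dict, List

# Per-candidate fit summaries and standardized residuals by exercise
# (hypothetical values standing in for FACETS output).
candidates: Dict[str, Dict] = {
    "C001": {"infit": 0.9, "outfit": 1.1, "residuals": {"EX01": 0.4, "EX02": -0.7}},
    "C002": {"infit": 1.8, "outfit": 2.3, "residuals": {"EX01": 2.9, "EX02": -0.2}},
}

def flag_for_review(stats: Dict, ms_cutoff: float = 1.5, z_cutoff: float = 2.0) -> List[str]:
    """Return the reasons, if any, that a candidate's profile should receive
    an independent review before score reports are issued."""
    reasons = []
    if stats["infit"] > ms_cutoff or stats["outfit"] > ms_cutoff:
        reasons.append("misfitting profile of ratings across exercises")
    for exercise, z in stats["residuals"].items():
        if abs(z) > z_cutoff:
            reasons.append(f"unexpected rating on {exercise} (|z| = {abs(z):.1f})")
    return reasons

for cid, stats in candidates.items():
    flagged = flag_for_review(stats)
    if flagged:
        print(cid, "->", "; ".join(flagged))

The cutoffs shown here are placeholders; in an operational program they would be set by the measurement team and paired with human review of the flagged rating profiles.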

¹ One approach to construct validation would be to define the accomplished teaching construct through a series of exercises ordered by difficulty. In essence, the assessment developers would construct a set of exercises to measure accomplished teaching. The exercise difficulty continuum would logically define the underlying construct. Exercises at the upper end of the continuum would be designed to be harder for candidates to get high ratings on than exercises at the lower end of the continuum. If exercises could be devised so that they sufficiently spread out along a difficulty continuum, that would allow for placement of candidates along the variable defined by the hierarchically ordered exercises. Candidates could then be ordered along the continuum by their levels of accomplished teaching. The joint ordering of exercises and candidates would provide a conceptual basis for explaining what it means to be more or less highly accomplished in the field. Specific statements could be prepared describing candidate performance at various levels along the continuum (i.e., what it is that a candidate at each level has demonstrated that he/she knows and can do). Establishing the construct validity of the assessment involves comparing the intended (i.e., theoretical) ordering of the exercises to the actual empirical ordering of the exercises. The critical construct validation question becomes this: to what extent does the assessment developers' initial theoretical understanding of the construct (i.e., their view of how the exercises should order hierarchically) map to the empirical evidence obtained when candidates actually take the assessment? This was not the validation approach adopted by those designing the NBPTS assessments, however. Rather, their approach was to use the standards for the certification field as the basis for exercise development (see the National Board for Professional Teaching Standards Early Childhood/Generalist Standards, 1998c, and Middle Childhood/Generalist Standards, 1998d). The standards describe what teachers applying for that particular certificate should know and be able to do in order to be certified as an accomplished teacher in their field. The exercises and their accompanying scoring rubrics were designed to reflect the standards for the certificate area. Each exercise provides evidence for one or more standards; no one exercise tries to provide evidence for all the standards. There was no intention to try to devise exercises that would differ in difficulty and could thus be hierarchically ordered, nor was it the assessment developers' intention to design exercises so that they would all be of the same level of difficulty. In short, exercise difficulty was not a consideration in their design process. The validation approach was to establish content validity, not construct validity. The validation strategy involved having expert panelists make judgments about the following: (1) whether the content standards represented critical aspects of accomplished practice in the field, (2) whether the portfolio exercises and assessment center exercises were relevant and important to one or more content standards, (3) whether the skills, knowledge, and abilities necessary to carry out the exercises were relevant and important to the domain of accomplished teaching in that field, (4) whether the scoring rubrics were relevant and important to the content standards, (5) whether the scoring rubrics emphasized factors that were relevant, important, and necessary to the domain, and (6) whether the exercises considered as a set represented the skills and knowledge expected of an accomplished teacher in the field (Benson & Impara, 1996a, 1996b). For more information about the assessment development process, the interested reader is referred to the National Board for Professional Teaching Standards Technical Analysis Report, 1999b, pages 5-13.

The psychometric model also takes into consideration the structure of the scoring rubrics used in the assessment of accomplished teaching. These analyses provide useful information that enables the determination of whether or not the scoring rubrics are functioning as intended. For example, by examining the rating scale category calibrations that the computer program produces, it is possible to determine whether the categories in a given rubric are appropriately ordered and are clearly distinguishable.

The psychometric model provides useful information about two key intervening variables in our conceptual model (exercises and assessors), the candidates, and the scoring rubric. However, the psychometric model does not provide any information about the adequacy of the initial conception of the construct (i.e., how well the standards define the accomplished teaching construct in a given field; how well the knowledge, skills, dispositions, and commitments of the field map to the standards). Nor does the psychometric model address whether the exercises and scoring rubrics adequately reflect the standards. Our analyses do not provide any information about the strength of that critical linkage. Other approaches are needed to gather evidence to rule out the possibility that these critical aspects of assessment development are introducing construct-irrelevant sources of variance into the candidate evaluation process, threatening its validity.

This study focuses on issues related to the investigation of assessor effects in two NBPTS assessments: Early Childhood/Generalist and Middle Childhood/Generalist certification. Candidates who apply for Early Childhood/Generalist certification teach all subjects to students ages 3-8, while those who apply for Middle Childhood/Generalist certification teach all subjects to students ages 7-12. The major purposes of this study are to (1) examine, describe, and evaluate the rating behavior of individual assessors scoring Early Childhood/Generalist or Middle Childhood/Generalist candidates, and (2) explore the effects of employing scoring designs that use fewer than 20 ratings per candidate (10 exercises, each rated by two assessors). Specifically, this study addresses the following questions within each NBPTS assessment:

1. Do assessors differ in the severity with which they rate candidates' performances on the assessment exercises? Are differences in assessor severity more pronounced for some exercises than for others?

2. Do differences in assessor severity affect candidate scores on a given exercise?

3. Do differences in assessor severity affect the accuracy and/or consistency of the certification decision? How do differences in assessor severity impact decisions made regarding the banking of exercises? If the National Board for Professional Teaching Standards were to move to a scoring design that reduced the number of assessors to one per exercise, what would be the impact of that policy decision on certification rates?

4. Do assessors use the scoring rubrics consistently across candidates? Do they share a common understanding of the meaning of each scale point on a rubric? Are there any inconsistent assessors whose patterns of scores show little systematic relationship to the scores given to the same candidates by other assessors? (That is, are there assessors who are sometimes more lenient but at other times more severe than other assessors: assessors who are neither consistently lenient nor consistently severe but rather tend to vacillate erratically between these two general tendencies?)
5. Do some candidates exhibit unusual profiles of ratings across exercises, receiving unexpectedly high (or low) ratings on certain exercises, given the ratings the candidates received on the other exercises?

6. Is it harder for candidates to get high ratings on some exercises than others? To what extent do the exercises differ in difficulty? How well does the set of 10 exercises succeed in defining statistically distinct levels of accomplished teaching among the candidates?

7. Can ratings from all exercises be calibrated, or do ratings on some exercises frequently fail to correspond to ratings on other exercises? (That is, are there certain exercises that do not fit with the others?)

8. Can many-faceted Rasch measurement models be used to monitor the effectiveness of complex performance assessment systems like the National Board for Professional Teaching Standards assessment systems? What special challenges arise in analyzing rating data from NBPTS assessments using the FACETS computer program?

A Many-Faceted Rasch Model for the Assessment of Accomplished Teaching

The procedures described in this section for examining the quality of ratings obtained from assessors are based on a many-faceted version of the Rasch measurement (FACETS) model for ordered response categories developed by Linacre (1989). The FACETS model is an extended version of the Rasch measurement model (Andrich, 1988; Rasch, 1980; Wright & Masters, 1982). The FACETS model is essentially an additive linear model that is based on a logistic transformation of observed assessor ratings to a logit, or log-odds, scale. The logistic transformation of ratios of successive category probabilities (log odds) can be viewed as the dependent variable, with the various facets, such as candidates, exercises, and assessors, conceptualized as independent variables that influence these log odds. In this study, the FACETS model takes the following form:

$$\ln\!\left[\frac{P_{nijk}}{P_{nijk-1}}\right] = \theta_n - \xi_i - \alpha_j - \tau_k, \qquad (1)$$

where

$P_{nijk}$ = the probability of candidate $n$ being rated $k$ on exercise $i$ by assessor $j$,
$P_{nijk-1}$ = the probability of candidate $n$ being rated $k-1$ on exercise $i$ by assessor $j$,
$\theta_n$ = the teaching accomplishment of candidate $n$,
$\xi_i$ = the difficulty of exercise $i$,
$\alpha_j$ = the severity of assessor $j$, and
$\tau_k$ = the difficulty of category $k$ relative to category $k-1$.
The rating category coefficient, $\tau_k$, is not considered a facet in the model. Based on the FACETS model presented in Equation 1, the probability of candidate $n$ with level of teaching accomplishment $\theta_n$ obtaining a rating of $k$ ($k = 1, \ldots, m$) on exercise $\xi_i$ from assessor $\alpha_j$ with a rating category coefficient of $\tau_k$ is given as

$$P_{nijk} = \exp\!\left[k\,(\theta_n - \xi_i - \alpha_j) - \sum_{h=1}^{k}\tau_h\right] \Big/\ \gamma, \qquad (2)$$

where $\tau_1$ is defined to be 0, and $\gamma$ is a normalizing factor based on the sum of the numerators. As written in Equation 1, the category coefficients, $\tau_k$, represent a rating scale model (Andrich, 1978) with category coefficients fixed across exercises.

Once the parameters of the model are estimated using standard numerical methods, such as those implemented in the FACETS computer program (Linacre & Wright, 1992), model-data fit issues can be examined in a variety of ways. Useful indices of rating quality can be obtained by a detailed examination of the standardized residuals, calculated as

$$Z_{nij} = \frac{x_{nij} - E_{nij}}{\left[\sum_{k=1}^{m}(k - E_{nij})^2\, P_{nijk}\right]^{1/2}}, \qquad (3)$$

where

$$E_{nij} = \sum_{k=1}^{m} k\, P_{nijk}. \qquad (4)$$

The standardized residuals, $Z_{nij}$, can be summarized over different facets and different elements within a facet in order to provide indices of model-data fit. These residuals are typically summarized as mean-square error statistics called OUTFIT and INFIT statistics. The OUTFIT statistics are unweighted mean-squared residual statistics that are particularly sensitive to outlying unexpected ratings. The INFIT statistics are based on weighted mean-squared residual statistics and are less sensitive to outlying unexpected ratings. Engelhard (1994) provides a description of the interpretation of these fit statistics within the context of assessor-mediated ratings.
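To make these quantities concrete, the sketch below computes the category probabilities of Equation 2, the expected rating and standardized residual of Equations 3 and 4, and the unweighted (OUTFIT) and information-weighted (INFIT) mean squares just described. It is a minimal illustration in Python rather than the FACETS program itself; the function names, the NumPy dependency, and all numeric values are assumptions made for the example.

import numpy as np

def rating_probs(theta, xi, alpha, tau):
    """Equation 2: probabilities of ratings k = 1..m for one candidate,
    exercise, and assessor. tau holds the category coefficients, with
    tau[0] (i.e., tau_1) defined to be 0."""
    k = np.arange(1, len(tau) + 1)
    logits = k * (theta - xi - alpha) - np.cumsum(tau)
    numerators = np.exp(logits)
    return numerators / numerators.sum()        # gamma normalizes the numerators

def expected_and_residual(x, theta, xi, alpha, tau):
    """Equations 3 and 4: expected rating, rating variance, and the
    standardized residual for an observed rating x."""
    p = rating_probs(theta, xi, alpha, tau)
    k = np.arange(1, len(tau) + 1)
    e = np.sum(k * p)                           # Equation 4
    var = np.sum((k - e) ** 2 * p)
    return e, var, (x - e) / np.sqrt(var)       # Equation 3

def fit_mean_squares(xs, thetas, xis, alphas, tau):
    """OUTFIT (unweighted) and INFIT (information-weighted) mean squares
    over the set of ratings belonging to one element of a facet."""
    zs, ws = [], []
    for x, th, xi, al in zip(xs, thetas, xis, alphas):
        _, v, z = expected_and_residual(x, th, xi, al, tau)
        zs.append(z)
        ws.append(v)
    zs, ws = np.array(zs), np.array(ws)
    outfit = np.mean(zs ** 2)
    infit = np.sum(zs ** 2 * ws) / np.sum(ws)
    return outfit, infit

# Illustrative 12-category scale with a flat category structure
tau = np.zeros(12)
obs = [10, 11, 9, 12]                 # ratings given by one assessor (illustrative)
thetas = [1.0, 1.4, 0.3, 2.0]         # candidate measures (logits)
xis = [0.2] * 4                       # all ratings on the same exercise
alphas = [-0.1] * 4                   # this assessor's estimated severity
print(fit_mean_squares(obs, thetas, xis, alphas, tau))

By convention in Rasch analyses, mean squares near 1.0 indicate ratings that vary about as much as the model expects, while values well above 1.0 flag noisy or unexpected ratings; the specific screening cutoffs are a matter of analytic judgment.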

To be useful, an assessment must be able to separate candidates by their performance (Stone & Wright, 1988). FACETS produces a candidate separation ratio, $G_N$, which is a measure of the spread of the candidate accomplished teaching measures relative to their precision. Separation is expressed as the ratio of the standard deviation of the candidate accomplished teaching measures, adjusted for measurement error, to the average candidate standard error (Equation 5):

$$G_N = \frac{\sqrt{SD^2 - \dfrac{1}{N}\sum_{n=1}^{N} SE_n^2}}{\sqrt{\dfrac{1}{N}\sum_{n=1}^{N} SE_n^2}}, \qquad (5)$$

where $SD^2$ is the observed variance of the non-extreme candidate accomplished teaching measures, $SE_n^2$ is the squared standard error for candidate $n$, and $N$ is the number of candidates.

Using the candidate separation ratio, one can then calculate the candidate separation index, which is the number of measurably different levels of candidate performance in the sample of candidates:

$$\text{candidate separation index} = (4\,G_N + 1)\,/\,3. \qquad (6)$$

A candidate separation index of 2 would suggest that the assessment process is sensitive enough to be used to make certification decisions about candidate performance, since two statistically distinct candidate groups can be discerned (i.e., those who should be certified and those who should not). Similarly, FACETS produces an assessor separation ratio, which is a measure of the spread of the assessor severity measures relative to their precision. The assessor separation index, derived from that separation ratio, connotes the number of statistically distinct levels of assessor severity in the sample of assessors. An assessor separation index of 1 would suggest that all assessors were exercising a similar level of severity and could be considered as one interchangeable group. (We will be reporting candidate and assessor separation indices in our results, but not their associated separation ratios. The separation indices are more readily understood and have more practical utility, in our view.)

Another useful statistic is the reliability of separation index. This index provides information about how well the elements within a facet are separated in order to define the facet reliably. It is analogous to traditional indices of reliability, such as Cronbach's coefficient alpha and KR-20, in the sense that it reflects an estimate of the ratio of true score variance to observed score variance. The reliability of separation indices have slightly different substantive interpretations for the different facets in the model. For candidates, the reliability of separation index is comparable to coefficient alpha, indicating the reliability with which the assessment separates the sample of candidates (that is, the proportion of observed sample variance that is attributable to individual differences between candidates; Wright & Masters, 1982). Unlike interrater reliability, which is a measure of how similar the assessor measures are, the candidate separation reliability is a measure of how different the candidate accomplished teaching measures are (Linacre, 1994). By contrast, for assessors, the reliability of separation index reflects potentially unwanted variability in assessor severity. Separation reliability can be calculated as

$$R = (SD^2 - MSE)\,/\,SD^2, \qquad (7)$$

where $SD^2$ is the observed variance of element difficulties for a facet on the latent variable scale in logits, and $MSE$ is the mean-square calibration error, estimated as the mean of the calibration error variances (the squares of the standard errors) for each element within a facet; that is, $MSE = \sum SE_n^2 / N$. Andrich (1982) provides a detailed derivation of this reliability of separation index. Detailed general descriptions of the separation statistics are also provided in Wright and Masters (1982) and Fisher (1992).

Equation 2 can be used to generate a variety of expected scores under different conditions reflecting various assumptions regarding the assessment process. For example, it is possible to estimate an expected rating for a candidate on a particular exercise that would be obtained from an assessor who exercised a level of severity equal to zero (i.e., an assessor who was neither more lenient nor more severe than other assessors). In this case, the assessor, $j$, would be defined as $\alpha_j = 0$, and the adjusted probability, $AP$, and adjusted rating, $AR$, would be calculated as follows:

$$AP_{nijk} = \exp\!\left[k\,(\theta_n - \xi_i - 0_j) - \sum_{h=1}^{k}\tau_h\right] \Big/\ \gamma, \qquad (8)$$

and

$$AR_{nij} = \sum_{k=1}^{m} k\, AP_{nijk}. \qquad (9)$$

Equation 9 can be interpreted as producing an expected adjusted rating for candidate $n$ on exercise $i$ from assessor $j$ (with $\alpha_j = 0$).
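The separation and reliability statistics in Equations 5 through 7 are straightforward functions of the estimated measures and their standard errors. The short sketch below, again a minimal Python illustration (the separation_stats function name, the NumPy dependency, and the example values are assumptions for this write-up, not FACETS output), computes the separation ratio, the separation index, and the reliability of separation for one facet. Applied to candidate measures it yields the candidate indices discussed above; applied to assessor severities it yields the assessor indices.

import numpy as np

def separation_stats(measures, std_errors):
    """Separation ratio (Eq. 5), separation index (Eq. 6), and
    reliability of separation (Eq. 7) for the elements of one facet."""
    measures = np.asarray(measures, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    sd2 = measures.var()                    # observed variance of the measures
    mse = np.mean(std_errors ** 2)          # mean-square calibration error
    true_sd = np.sqrt(max(sd2 - mse, 0.0))  # spread adjusted for measurement error
    rmse = np.sqrt(mse)                     # average standard error
    g = true_sd / rmse                      # Equation 5
    index = (4.0 * g + 1.0) / 3.0           # Equation 6
    reliability = (sd2 - mse) / sd2         # Equation 7
    return g, index, reliability

# Hypothetical candidate measures (in logits) and their standard errors
g, idx, rel = separation_stats([-0.8, -0.2, 0.1, 0.6, 1.3],
                               [0.30, 0.28, 0.27, 0.29, 0.33])
print(round(g, 2), round(idx, 2), round(rel, 2))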

Method

Participants

Early Childhood/Generalist candidates. The total candidate pool for the 1997-98 Early Childhood/Generalist certification was 603 candidates. There were 596 candidates included in the analyses. Seven candidates were dropped from the analyses because they did not have ratings on all ten exercises. Within the total candidate pool, 98% were female, and 2% were male. In terms of race/ethnicity, 85% were White (non-Hispanic), 10% were African American, 1% were Hispanic, 1% were Asian American, 1% were Native American, and 1% of the candidates did not identify their race/ethnicity. When asked about their teaching setting, 34% indicated that they worked in urban schools, 27% worked in suburban settings, and 39% worked in rural settings (though there is some concern about whether candidates used these classification categories as intended). The range of years of the candidates' teaching experience was 3 to 38 years (mean = 14, median = 13). The mean age of candidates was 41 years (median = 43 years), while the range was 25 to 68 years. The highest degree attained was a bachelor's degree for 44% of the candidates, a master's degree for 56% of the candidates, and a doctorate for 1% of the candidates. Candidates from 33 states submitted their work for review: 45% of the candidates were from North Carolina, 16% were from Ohio, and the remaining 38% were from the other 31 states. (Less than 6% of the total candidate pool was from each of these 31 states.)² All candidates taught in public schools; there were no private school teachers. Finally, 92% of the candidates worked in elementary schools. The remaining 8% taught in preschools, with the exception of one candidate who taught in a middle school.

Early Childhood/Generalist assessors. There were 119 Early Childhood/Generalist assessors included in this study. All assessors were practicing Early Childhood/Generalist teachers. Seventy-eight assessors scored exercises related to the portfolio: 68% were White (non-Hispanic), 12% were African American, and 21% were Hispanic. Thirty-one assessors scored exercises administered at the assessment centers: 77% were White (non-Hispanic), 19% were African American, and 3% were Hispanic. Six assessors functioned as lead trainers for the portfolio assessment: two were White (non-Hispanic), two were African American, and two were Hispanic. Finally, four assessors functioned as assessment center trainers: three were White (non-Hispanic), and one was African American.

Middle Childhood/Generalist candidates. The total candidate pool for the 1997-98 Middle Childhood/Generalist certification was 523 candidates. There were 516 candidates included in the analyses. Seven candidates were dropped from the analyses because they did not have ratings on all ten exercises. Within the total candidate pool, 94% were female, and 7% were male. In terms of race/ethnicity, 89% were White (non-Hispanic), 8% were African American, 2% were Hispanic, 1% were Asian American, 1% were Native American, and 1% of the candidates did not identify their race/ethnicity. When asked about their teaching setting, 31% indicated that they taught in urban schools, 35% worked in suburban settings, and 34% worked in rural settings (though there is some concern about whether candidates used these classification categories as intended). The range of years of the candidates' teaching experience was 3 to 40 years (mean = 13, median = 12). The mean age of candidates was 41 years (median = 42 years), while the range was 26 to 63 years. The highest degree attained was a bachelor's degree for 35% of the candidates, a master's degree for 64% of the candidates, and a doctorate for 1% of the candidates. Candidates from 35 states submitted their work for review: 33% of the candidates were from North Carolina, 19% were from Ohio, and the remaining 47% were from the other 33 states. (Less than 6% of the total candidate pool was from each of these 33 states.)³ All candidates taught in public schools; there were no private school teachers. Finally, 96% of the candidates worked in elementary schools. The remaining 4% taught in middle schools.

Middle Childhood/Generalist assessors. There were 117 Middle Childhood/Generalist assessors included in this study. All assessors were practicing Middle Childhood/Generalist teachers. Seventy-five assessors scored exercises related to the portfolio: 80% were White (non-Hispanic), 8% were African American, and 12% identified themselves as having racial/ethnic backgrounds other than White (non-Hispanic). Forty-two assessors scored exercises administered at the assessment centers: 88% were White (non-Hispanic), 7% were African American, and 5% were from a racial/ethnic background other than White (non-Hispanic). Six assessors functioned as lead trainers for the portfolio assessment: four were White (non-Hispanic), one was African American, and one was from a racial/ethnic background other than White (non-Hispanic). Finally, four assessors functioned as assessment center trainers: one was White (non-Hispanic), and three were African American.

² The demographics of the candidate pool for the Early Childhood/Generalist assessment have changed over the last several years. For example, in 1999, the total candidate pool was 1,441 candidates: 23.5% were from North Carolina, 23.2% were from Florida, 13% were from Mississippi, 7% were from California, and 6.2% were from Ohio. The remaining candidates were from 38 other states; fewer than 4% of the total candidate pool were from each of these states.

³ The demographics of the candidate pool for the Middle Childhood/Generalist assessment have also changed over the last several years. For example, in 1999, the total candidate pool was 1,311 candidates: 30.1% were from Florida, 18.9% were from North Carolina, 8.6% were from Mississippi, 6.6% were from California, and 5.7% were from Ohio. The remaining candidates were from 41 other states; fewer than 5% of the total candidate pool were from each of these states.

NBPTS Assessment Process

In 1997-98, candidates who applied for the Early Childhood/Generalist certificate or the Middle Childhood/Generalist certificate participated in a two-part assessment process: (1) a portfolio assessment, and (2) a series of assessment center exercises. The portfolio was designed to showcase candidates' classroom practice as well as their work outside the classroom with families and the community at large, with their colleagues, and with their profession. The assessment center exercises were designed to evaluate the candidates' content and pedagogical content knowledge. The portfolio has two distinct parts: four classroom-based entries and two documented accomplishment entries. The classroom-based entries focus on the candidate's practice inside the classroom; by contrast, the documented accomplishment entries focus on the candidate's practice outside the classroom. Tables 1 and 2 describe the portfolio entries for Early Childhood/Generalist and Middle Childhood/Generalist certification. Candidates applying for certification received a large binder containing detailed instructions and requirements for preparing each portfolio entry, including a careful description of the kinds of evidence required so that the candidate's response will be scorable. The scoring criteria that assessors would use to evaluate the candidate's response to each portfolio entry were also included in the binder.

Candidates' portfolios included various sources of evidence of classroom practice. For two of the portfolio entries, candidates provided videotape of classroom interactions. The other two classroom-based portfolio entries required candidates to gather and respond to particular types of student work. In each case, the candidate produced a detailed written analysis of the teaching sequences shown in the videotape and of the student work. Through their written commentary, candidates were expected to provide a context for the work, specify the broad goals as well as the specific goals for the classroom segments shown on videotape, and provide critical reflection on their classroom practice. Assessors could then determine whether what they saw in the videotape segments and in the student work was mirrored in the written commentary.

All candidates traveled to an assessment center where, over the course of a day, they participated in a series of exercises that provided them with an opportunity to demonstrate knowledge and skills not addressed in the portfolio. The assessment center exercises are designed to supplement the portfolio, tapping the candidate's content knowledge and pedagogy through real-life scenarios that enable the candidate to confront critical instructional matters. Tables 3 and 4 describe the assessment center exercises for Early Childhood/Generalist and Middle Childhood/Generalist certification. At the assessment center, candidates responded to four ninety-minute exercise blocks. Some of the exercises were simulations of situations that teachers would typically encounter; other exercises posed questions related to pedagogical content topics and issues. For each exercise block, candidates were given a prompt or set of prompts on a computer. They could produce their responses on the computer or by hand. Some of the prompts were based on stimulus materials that candidates received prior to coming to the assessment center. Candidates were allowed to bring with them to the assessment center any supporting materials that they felt might be useful.⁴

NBPTS Scoring System

Trained assessors scored the candidate's portfolio entries and their performance on the four assessment center exercises. The six portfolio entries a candidate submits are scored separately. Each entry is considered to be a separate assessment exercise. An assessment consists, then, of ten exercises. Assessors who scored Middle Childhood/Generalist candidates' written responses to the assessment center exercises participated in an on-line computerized assessor training program. Assessors participated in either a four-day training program to prepare to score candidates' portfolios or a two-day training program to prepare to score the assessment center exercises. Each assessor was trained to score a single exercise.⁵ The training procedure included five components: (1) an introduction to the history and principles of the National Board for Professional Teaching Standards, (2) exposure to the various elements of the scoring system, including the standards, exercise directions, and scoring rubrics, (3) identification and recognition of the assessor's own personal biases, and targeted training to reduce the likelihood that those biases would enter into the scoring process, (4) exposure to benchmark performances that were designed to anchor the score scale and rubric, providing clear-cut examples of performances that served as operational definitions of each score point, and (5) extensive practice in applying the scoring system to evaluate candidates' pre-selected performances. Following training, assessors participated in a qualifying round to determine whether they were able to show adequate agreement with validated trainers' ratings on a number of pre-selected

⁴ A new policy was instituted in 1998-99. Under this policy, candidates are permitted to bring their assessment center orientation book, any stimulus materials that were mailed to them, the NBPTS standards, notes they wrote or typed themselves, and instructions for operating their calculator. They may not bring textbooks, encyclopedias, bound books, loose-leaf printed materials, or information they have downloaded from the Internet.

⁵ A few assessors in this study scored two exercises. In our analyses, we assigned them unique assessor IDs within exercise and then treated them as if they were nested within exercise. These few assessors received training to score both exercises.
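Footnote 5 describes a small but consequential data-handling decision: an assessor who scored two exercises is given a separate ID within each exercise, so the analysis treats assessors as nested within exercises. The sketch below illustrates one way a ratings file could encode that convention before it is formatted for analysis; the record layout, field names, and example values are hypothetical and are not taken from the NBPTS data.

from dataclasses import dataclass

@dataclass
class Rating:
    candidate: str    # candidate identifier
    exercise: str     # one of the ten exercises
    assessor: str     # the assessor's own identifier
    score: float      # rating on the 12-point scale (values here are illustrative)

def nested_assessor_id(r: Rating) -> str:
    """Build an exercise-specific assessor ID so that a person who scores
    two different exercises is treated as two distinct assessor elements,
    i.e., assessors are nested within exercises."""
    return f"{r.exercise}:{r.assessor}"

ratings = [
    Rating("C001", "EX01", "A17", 3.25),
    Rating("C001", "EX02", "A17", 2.75),   # same person scoring a second exercise
    Rating("C002", "EX01", "A17", 3.75),
]

for r in ratings:
    print(r.candidate, nested_assessor_id(r), r.score)
# A17 contributes two nested elements: EX01:A17 and EX02:A17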

Table 1
Description of the Early Childhood/Generalist Portfolio Entries⁶

1. REFLECTING ON A TEACHING AND LEARNING SEQUENCE
In this entry, candidates are asked to demonstrate how they nurture children's growth and learning through the experiences they provide and the ways in which they make adjustments to these experiences based on their ongoing assessments of children. Through a Written Commentary and four Supporting Artifacts, candidates describe a sequence of activities that reflects an extended exploration of some theme or topic in which they draw on and integrate concepts from social studies and from the arts.

2. EXAMINING A CHILD'S LITERACY DEVELOPMENT
In this entry, candidates are asked to demonstrate their skill in assessing and supporting children's literacy development. Through a Written Commentary with Supporting Artifacts, candidates provide evidence of the ways they foster literacy in their classroom. Candidates also analyze work samples from one child, discuss his/her development, and outline their approach to supporting his/her learning.

3. INTRODUCTION TO YOUR CLASSROOM COMMUNITY
In this entry, candidates provide evidence of how they create and sustain a climate that supports children's social and emotional development. In a Written Commentary, candidates are asked to introduce assessors to their learning community by describing ways they build appreciation of diversity and support children's development of social skills. Candidates are asked to illustrate their commentary with a Videotape that shows a class discussion in which they build a sense of community.

4. ENGAGING CHILDREN IN SCIENCE LEARNING
In this entry, candidates provide evidence of their skill in helping children acquire scientific knowledge and scientific ways of thinking, observing, and communicating. Through a Written Commentary, candidates are asked to discuss and analyze a sequence of learning experiences and how they reflect their general approach to science instruction. Candidates provide evidence from one learning experience in this sequence, explaining how it fits into the sequence. In a Videotape of this learning experience, candidates provide evidence of how they engage children in science learning.

5. DOCUMENTED ACCOMPLISHMENTS: COLLABORATION IN THE PROFESSIONAL COMMUNITY
In this entry, candidates present evidence of sustained or significant contributions to the development and/or substantive review of instructional resources and/or practices, to educational policy and practices, and/or to collaborative work with colleagues in developing pedagogy. Through written description together with work products of the teacher and/or letters of verification from others, the candidate provides support for each accomplishment. Finally, the candidate writes an Interpretive Summary that synthesizes the accomplishments and evidence presented.

6. DOCUMENTED ACCOMPLISHMENTS: OUTREACH TO FAMILIES AND THE COMMUNITY
In this entry, candidates present evidence of how they create ongoing interactive communication with families and other adults interested in students' progress and learning. In addition, candidates may demonstrate evidence of efforts to understand parents' concerns about student learning, subject matter, and curriculum, and/or contributions to connecting the school program to community needs, resources, and interests. Through written description together with work products and/or letters of verification from others, the candidate provides support for each accomplishment. Finally, the candidate writes an Interpretive Summary that synthesizes the accomplishments and evidence presented.

⁶ The descriptions of portfolio entries are taken from the National Board for Professional Teaching Standards Assessment Analysis Report, Early Childhood/Generalist, 1997-1998, page 3.

Table 2
Description of the Middle Childhood/Generalist Portfolio Entries⁷

1. WRITING: THINKING THROUGH THE PROCESS
In this exercise, candidates demonstrate their use of writing to develop students' thinking and writing skills for different audiences and purposes. Through a Written Commentary, two writing Assignments/Prompts, and four Student Responses, candidates provide evidence of their planning and teaching and of their ability to describe, analyze, and evaluate student writing, to develop students' writing ability, and to use student work to reflect on their practice.

2. THEMATIC EXPLORATION: CONNECTION TO SCIENCE
In this exercise, candidates show how they help students to acquire important science knowledge as they strive to better understand a substantive interdisciplinary theme. Through a Written Commentary, three Instructional Artifacts, and six Student Responses, candidates present their ability to develop an interdisciplinary theme and to engage children in work that helps them acquire one or more big idea(s) from science in order to enrich their understanding of that theme (big ideas from science include, for example, systems, models, energy, evolution, scale, structure, constancy, and patterns of change).

3. BUILDING A CLASSROOM COMMUNITY
In this exercise, candidates display their ability to observe and analyze interactions in their classroom. Through a Written Commentary and Videotape, candidates describe and illustrate how they create a climate that supports students' emerging abilities to understand and consider perspectives other than their own, and to assume responsibility for their own actions.

4. BUILDING A MATHEMATICAL UNDERSTANDING
In this exercise, candidates demonstrate how they engage students in the discovery, exploration, and implementation of concepts, procedures, and processes to develop a deep understanding of an important mathematics content area over a period of time. Through a Written Commentary, Videotape, and an Assignment/Prompt with two Student Responses, candidates provide evidence of planning and teaching that help build students' mathematical understanding. Candidates also provide evidence of their ability to describe, analyze, and reflect on their teaching practice.

5. DOCUMENTED ACCOMPLISHMENTS: COLLABORATION IN THE PROFESSIONAL COMMUNITY
In this entry, candidates present evidence of sustained or significant contributions to the development and/or substantive review of instructional resources and/or practices, to educational policy and practices, and/or to collaborative work with colleagues in developing pedagogy. Through written description together with work products of the teacher and/or letters of verification from others, the candidate provides support for each accomplishment. Finally, the candidate writes an Interpretive Summary that synthesizes the accomplishments and evidence presented.

6. DOCUMENTED ACCOMPLISHMENTS: OUTREACH TO FAMILIES AND THE COMMUNITY
In this entry, candidates present evidence of how they create ongoing interactive communication with families and other adults interested in students' progress and learning. In addition, candidates may demonstrate evidence of efforts to understand parents' concerns about student learning, subject matter, and curriculum, and/or contributions to connecting the school program to community needs, resources, and interests. Through written description together with work products and/or letters of verification from others, the candidate provides support for each accomplishment. Finally, the candidate writes an Interpretive Summary that synthesizes the accomplishments and evidence presented.

⁷ The descriptions of portfolio entries are taken from the National Board for Professional Teaching Standards Assessment Analysis Report, Middle Childhood/Generalist, 1997-1998, pages 3-4.