Establishing Interrater Agreement for Scoring Criterion-based Assessments: Application for the MEES

Mark C. Hogrebe, Washington University in St. Louis. MACTE Fall 2018 Conference, October 23, 2018, Camden on the Lake, Lake of the Ozarks.

For a copy of the presentation and paper, send an email to mhogrebe.documents@gmail.com. The email does not need any text; an automatic response will send a link to download the files.

Purpose: Provide guidance on how to establish interrater agreement when scoring performance using a criterion-based assessment like the MEES. This applies to any EPP assessment that must comply with the CAEP Evaluation Framework.

Purpose (2/2): Describe the requirements for calculating absolute agreement using this type of assessment rubric. Show an example of how to calculate interrater agreement with a free, online calculator to use during MEES training sessions. Describe a procedure for how interrater agreement data can be collected, summarized, and presented for a MEES technical manual.

Difference between Interrater Agreement and Interrater Reliability Interrater agreement emphasizes the extent to which two or more raters give the same rating using the rubric (e.g., Rater1 scores 3 and Rater2 scores 3). When absolute interrater agreement is the goal, raters should give the exact same score in rating the same performance for an observation or response.

Difference between Interrater Agreement and Interrater Reliability (2/4) Absolute interrater agreement is the goal when the rubric is intended to be a criterion-based (competency-based, behavior-based) assessment like the MEES observation instrument. In contrast, interrater reliability refers to the consistency of the rank order of the scores between raters.

Difference between Interrater Agreement and Interrater Reliability (3/4) Consider, for example, two raters whose scores are perfectly related but offset, with one rater always scoring exactly one point higher than the other: interrater reliability using a Pearson correlation coefficient is 1.00 (perfect).

Difference between Interrater Agreement and Interrater Reliability (4/4) Interrater reliability using a Pearson correlation coefficient is 1.00; however, interrater agreement is equal to 0.00! Clearly, when using a criterion-based rubric, interrater reliability would not be appropriate.
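A minimal sketch of this contrast, using hypothetical ratings on the 0-4 scale (not MEES data; any constant offset produces the same pattern):

```python
# Hypothetical illustration: two raters whose scores differ by exactly one
# point on every observation. Pearson r is perfect, yet they never agree.
import numpy as np

rater_1 = np.array([0, 1, 2, 3])
rater_2 = np.array([1, 2, 3, 4])   # always exactly one point higher

pearson_r = np.corrcoef(rater_1, rater_2)[0, 1]
exact_agreement = np.mean(rater_1 == rater_2)

print(pearson_r)        # 1.0 -> "perfect" interrater reliability
print(exact_agreement)  # 0.0 -> zero interrater agreement
```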

What is an acceptable magnitude for interrater agreement coefficients? Although there is no ultimate standard for the magnitude of interrater agreement coefficients, the following guidelines for reliability coefficients are widely recognized by the assessment and testing professional community (Reynolds, Livingston, & Willson, 2006, p. 103).

What is an acceptable magnitude for interrater agreement coefficients? (2/3) If an instrument is being used to make important decisions that significantly impact individuals and are not easily reversed, it is reasonable to expect reliability/agreement coefficients of 0.90 or higher. Interrater agreement coefficients of 0.85 to 0.89 for a few individual items may be acceptable, but the interrater agreement for all items combined should be 0.90 or above for a high stakes assessment such as the MEES.

What is an acceptable magnitude for interrater agreement coefficients? (3/3) For both the teacher candidates and teacher education programs, the MEES observation instrument is a high stakes assessment that demands high interrater agreement aligned to the target ratings. Reliable MEES observations are critical for candidates, who depend upon accurate ratings to attain certification, and MEES ratings are given significant weight in the Annual Performance Report (APR) for teacher education programs.

Requirements and Procedures: Establishing absolute agreement using a criterion-based rubric

Example interrater agreement task for MEES Standard 1: content knowledge aligned with appropriate instruction. Rate five target videos, one representing each scoring category in the rubric, i.e., each candidate level:

Example videos for MEES Standard 1    Target Rating    Rater 1
0  standard not present               0                0
1  emerging candidate                 1                2
2  developing candidate               2                2
3  skilled candidate                  3                4
4  exceeding candidate                4                3

It is important to calculate the interrater agreement between each rater and the target ratings. Using the validated target scores as one of the two "raters" in each pair ensures that the agreement calculation is based on accurate representations of the 5 categories. This provides both a validation check (do raters associate observed behavior with the correct categories?) and an agreement index at the same time. Raters can miss the target: two raters can have high agreement with each other, yet their ratings may be inaccurate in assigning the correct score (accuracy as determined by the content experts who created the target ratings).

Selecting the Target Videos to Rate in Establishing Interrater Agreement In order to establish interrater agreement, the raters must score more than one target video. (Note: target video = CAEP master criteria) It is critical that multiple target videos be scored that represent the range of possible values in the rubric. For example, if a rubric has 5 categories, then raters have to score target videos representing each of the 5 categories in the rubric.

Why should a range of target videos be used for establishing interrater agreement? It is important that the raters view each type of target video that they are expected to encounter and score. This helps demonstrate the validity of the categories as being distinct and distinguishable from each other. Using all categories also avoids restriction of range, which limits interrater agreement coefficients.

Restriction of range in action. On a 2-point scale (1 = Not dependable, 2 = Dependable), a difference of one unit results in completely opposite ratings. On a 10-point scale (1 = Not dependable, 5 = Sometimes, 10 = Always), a difference of one unit results in very similar ratings.
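A small sketch of this idea with hypothetical numbers (not from the presentation): the same one-unit disagreement spans the entire 2-point scale but only a small fraction of a 10-point scale.

```python
# Hypothetical illustration of restriction of range: express a one-unit
# disagreement as a fraction of the scale it sits on.
def relative_disagreement(rating_a, rating_b, scale_min, scale_max):
    return abs(rating_a - rating_b) / (scale_max - scale_min)

print(relative_disagreement(1, 2, 1, 2))    # 1.00 -> opposite ends of a 2-point scale
print(relative_disagreement(5, 6, 1, 10))   # ~0.11 -> nearly identical on a 10-point scale
```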

Determining Interrater Agreement at a MEES Training Session. During the MEES training session, several of the key standards would be the focus (e.g., standards 1, 2, 3, and 5). After training and discussing how to rate Standard 1, the workgroup would watch a video and each member would assign a rating. Discussion and calibration follow the ratings.

Suggested practice during training sessions: after finishing training and calibration, estimate interrater agreement by having the trainees watch 5 short videos for Standard 1, one representing each of the 5 categories of the rubric. The trainees rate each of the 5 videos.

Compute the interrater agreement on Standard 1 for each trainee's ratings against the target ratings. Results can be viewed, and reasons for low agreement discussed and explored.

Example videos for MEES Standard 1    Target Rating    Trainee's Rating
0  standard not present               0                0
1  emerging candidate                 1                2
2  developing candidate               2                2
3  skilled candidate                  3                4
4  exceeding candidate                4                3
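As a quick supplement during calibration, the exact disagreements can be flagged for discussion before running the Krippendorff alpha calculation described later. The sketch below is only illustrative and uses the hypothetical trainee ratings from the table above.

```python
# Compare one trainee's ratings to the validated target ratings and flag
# the videos to revisit in calibration (simple exact-agreement check only;
# Krippendorff's alpha is computed separately, as described later).
targets = [0, 1, 2, 3, 4]   # target rating for each Standard 1 video
trainee = [0, 2, 2, 4, 3]   # hypothetical trainee ratings

matches = sum(t == r for t, r in zip(targets, trainee))
print(f"Exact agreement with targets: {matches / len(targets):.2f}")   # 0.40

for video, (t, r) in enumerate(zip(targets, trainee)):
    if t != r:
        print(f"Video {video}: target {t}, trainee {r} -> discuss in calibration")
```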

Establishing Interrater Agreement for the MEES as a High Stakes Assessment for the APR. Example for teacher education candidates, Standard 1: 20 cooperating teachers and 20 program supervisors rate 5 videos representing the 5 categories of the rubric. Calculate interrater agreement for each of the 40 pairs of ratings between the target ratings and the raters. Calculate the mean, standard deviation, standard error of the mean, and confidence interval for the 40 interrater agreement coefficients. Display the results to show the distribution and variance.

Determining Interrater Agreement for 40 Target and Rater Pairs on Standard 1: for presenting interrater agreement in a technical manual.

95% Confidence Interval for Standard 1 Interrater Agreement as Estimated by Krippendorff's alpha (Mean = 0.886):

C.I. = Mean ± (1.96 × std. error of the mean)
C.I. = .886 ± (1.96 × .008494)
C.I. = .886 ± .0167
C.I. = (.869, .903)

We can be 95% confident that the interval from .869 to .903 contains the population (true) interrater agreement coefficient (Krippendorff's alpha). There is a 95% chance that the true interrater agreement coefficient is contained in this confidence interval (.869, .903).
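The same summary statistics can be produced with a short script; the sketch below assumes the 40 alpha coefficients are available in an array (the values generated here are placeholders, not the presentation's data).

```python
# Summarize a set of interrater agreement coefficients for a technical manual:
# mean, standard deviation, standard error of the mean, and 95% CI.
import numpy as np

def summarize(coefficients):
    a = np.asarray(coefficients, dtype=float)
    mean = a.mean()
    sd = a.std(ddof=1)              # sample standard deviation
    se = sd / np.sqrt(len(a))       # standard error of the mean
    return mean, sd, se, (mean - 1.96 * se, mean + 1.96 * se)

# Placeholder data standing in for 40 Krippendorff alpha coefficients
rng = np.random.default_rng(2018)
alphas = np.clip(rng.normal(loc=0.886, scale=0.054, size=40), 0, 1)

mean, sd, se, (lo, hi) = summarize(alphas)
print(f"Mean = {mean:.3f}, SD = {sd:.3f}, SE = {se:.4f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```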

How to Calculate Interrater Agreement. When the rating scale is ordinal, the calculation of interrater agreement should account for intervals between categories that cannot be assumed to be equidistant (e.g., the MEES). A versatile interrater agreement measure designed to deal with ordinal data is Krippendorff's alpha (Hayes & Krippendorff, 2007; Krippendorff, 2011; Krippendorff, 2004a; Krippendorff, 2004b).

How to Calculate Interrater Agreement (2/2). The advantages of Krippendorff's alpha are that it can be used for all levels of measurement (nominal, ordinal, interval, ratio), data with missing values, and any number of raters or categories.

Krippendorff's alpha. Krippendorff's alpha can be calculated using SPSS in conjunction with a macro, in Matlab, and with the Stata module krippalpha. However, there is a free, simple-to-use online Krippendorff alpha calculator called ReCal that produces the same results as these other programs. The online ReCal calculator can be found here: http://dfreelon.org/utils/recalfront/recal-oir/#doc

Example using ReCal to calculate interrater agreement for Krippendorff's alpha. Enter the data in an Excel spreadsheet and save it as a .csv file: MEES example 1.csv. Note that there are no column or row headings. Columns represent the different raters and rows represent their ratings/scores for each observation. See Figure 1.

Example videos for MEES Standard 1    Target Rating    Rater 1
0  standard not present               0                0
1  emerging candidate                 1                2
2  developing candidate               2                2
3  skilled candidate                  3                4
4  exceeding candidate                4                3
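A short sketch of preparing that headerless file programmatically (the ratings are those shown in the table above, and the filename follows the example):

```python
# Write the ReCal input file: one column per rater (targets, then Rater 1),
# one row per observation/video, no headers.
import csv

targets = [0, 1, 2, 3, 4]
rater_1 = [0, 2, 2, 4, 3]

with open("MEES example 1.csv", "w", newline="") as f:
    csv.writer(f).writerows(zip(targets, rater_1))
```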

Computing Krippendorff's alpha. Go to the free, online Krippendorff alpha calculator called ReCal at: http://dfreelon.org/utils/recalfront/recal-oir/#doc Browse to find the MEES example 1.csv file, then click on the Calculate Reliability button. The results are displayed immediately as follows:

The interrater agreement as calculated by Krippendorff's alpha for Rater 1 against the target ratings is 0.86.
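For those who prefer to work offline, the same value can be reproduced with the third-party Python package krippendorff (pip install krippendorff); this is an assumption outside the presentation, which uses the online ReCal calculator.

```python
# Cross-check of the ReCal result with the `krippendorff` package (an
# assumption; not part of the MEES/ReCal workflow described above).
import numpy as np
import krippendorff

# Rows = raters (targets first, then Rater 1); columns = the five videos.
ratings = np.array([[0, 1, 2, 3, 4],
                    [0, 2, 2, 4, 3]])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(round(alpha, 2))   # ~0.86, matching the ReCal output
```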

Krippendorff's alpha with missing data. ReCal for Krippendorff alpha can handle a file with missing data. The data are entered in an Excel spreadsheet and saved as a .csv file: MEES example missing data.csv. Missing data are coded with a #.

The interrater agreement as calculated by Krippendorff's alpha for Rater 1 against the target ratings, with one missing rating, drops to 0.825.
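The same Python package handles missing ratings as well; ReCal codes a missing cell with "#", whereas this sketch (again an assumption outside the presentation) marks it with np.nan, and which rating is treated as missing here is only illustrative.

```python
# Krippendorff's alpha with a missing rating: mark the missing cell as np.nan.
import numpy as np
import krippendorff

ratings = np.array([[0, 1, 2,      3, 4],
                    [0, 2, np.nan, 4, 3]])   # Rater 1 missing one rating

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(alpha)   # alpha computed from the remaining pairable values
```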

Summary Recommendations for MEES: Identify five target videos that represent the rubric's range of 5 scores (0, 1, 2, 3, 4). Establish 100% interrater agreement for the five target videos using experts. Raters in training should score each of the 5 videos. Each rater's scores are then compared to the target video scores by calculating interrater agreement using Krippendorff's alpha.

Question for discussion: Can more than one standard be rated from the same video? Separate target videos for each standard are best practice. But if rating two standards from the same video, should the video be viewed once, or a second time for the other standard?

References
American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing. (2014). Standards for educational and psychological testing. Washington, DC: AERA. Relevant chapters: Chapter 2, Reliability/Precision and Errors of Measurement; Chapter 12, Educational Testing and Assessment.
Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. Chapter 9: Interrater agreement and reliability. New York: The Guilford Press.
Krippendorff, K. (2011). Computing Krippendorff's alpha-reliability. Retrieved from http://repository.upenn.edu/asc_papers/43
Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1), 23-34.

References (2/2)
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77-89.
Krippendorff, K. (2004a). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3), 411-433. https://doi.org/10.1111/j.1468-2958.2004.tb00738.x
Krippendorff, K. (2004b). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage.
LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11(4), 815-852.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30-46.
Reynolds, C. R., Livingston, R. B., & Willson, V. L. (2006). Measurement and assessment in education. Boston: Pearson/Allyn & Bacon.