Psychometrics for Beginners. Lawrence J. Fabrey, PhD Applied Measurement Professionals

Similar documents
Item Analysis Explanation

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Manfred M. Straehle, Ph.D. Prometric, Inc. Rory McCorkle, MBA Project Management Institute (PMI)

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

By Hui Bian Office for Faculty Excellence

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

On indirect measurement of health based on survey data. Responses to health-related questions (items) Y_1, ..., Y_k. A unidimensional latent health state

Survey Question. What are appropriate methods to reaffirm the fairness, validity reliability and general performance of examinations?

Influences of IRT Item Attributes on Angoff Rater Judgments

Comparing standard toughness through weighted and unweighted scores by three standard setting procedures

2016 Technical Report National Board Dental Hygiene Examination

Linking Assessments: Concept and History

RESEARCH ARTICLES. Brian E. Clauser, Polina Harik, and Melissa J. Margolis National Board of Medical Examiners

Development, Standardization and Application of

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

Driving Success: Setting Cut Scores for Examination Programs with Diverse Item Types. Beth Kalinowski, MBA, Prometric Manny Straehle, PhD, GIAC

Section 5. Field Test Analyses

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Differential Item Functioning

The Psychometric Principles Maximizing the quality of assessment

THE PROFESSIONAL BOARD FOR PSYCHOLOGY HEALTH PROFESSIONS COUNCIL OF SOUTH AFRICA TEST DEVELOPMENT / ADAPTATION PROPOSAL FORM

Item Writing Guide for the National Board for Certification of Hospice and Palliative Nurses

Introduction to Reliability

An Investigation of Two Criterion-Referencing Scoring Procedures for National Board Dental Examinations

Comparison of Direct and Indirect Methods

Chapter 6. Methods of Measuring Behavior Pearson Prentice Hall, Salkind. 1

Introduction to Item Response Theory

Introduction On Assessing Agreement With Continuous Measurement

Table of Contents. Preface to the third edition xiii. Preface to the second edition xv. Preface to the first edition xvii. List of abbreviations xix

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

Item Analysis: Classical and Beyond

Item Analysis for Beginners

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Measurement and Descriptive Statistics. Katie Rommel-Esham Education 604

Test Validity. What is validity? Types of validity IOP 301-T. Content validity. Content-description Criterion-description Construct-identification

THE ANGOFF METHOD OF STANDARD SETTING

Techniques for Explaining Item Response Theory to Stakeholder

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory

Introduction: Speaker. Introduction: Buros. Buros & Education. Introduction: Participants. Goal 10/5/2012

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

The Concept of Validity

On the usefulness of the CEFR in the investigation of test versions content equivalence HULEŠOVÁ, MARTINA

Supplementary Material*

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

Underlying Theory & Basic Issues

Psychometrics in context: Test Construction with IRT. Professor John Rust University of Cambridge

Published by European Centre for Research Training and Development UK (

Using Generalizability Theory to Investigate the Psychometric Property of an Assessment Center in Indonesia

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Quick start guide for using subscale reports produced by Integrity

Georgina Salas. Topics EDCI Intro to Research Dr. A.J. Herrera

Technical Specifications

ABERRANT RESPONSE PATTERNS AS A MULTIDIMENSIONAL PHENOMENON: USING FACTOR-ANALYTIC MODEL COMPARISON TO DETECT CHEATING. John Michael Clark III

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

Equating UDS Neuropsychological Tests: 3.0>2.0, 3.0=2.0, 3.0<2.0? Dan Mungas, Ph.D. University of California, Davis

Handout 5: Establishing the Validity of a Survey Instrument

Data harmonization tutorial:teaser for FH2019

Reliability and Validity

ITEM WRITERS' GUIDELINES. for the PET SPECIALTY EXAM. offered by the NUCLEAR MEDICINE TECHNOLOGY CERTIFICATION BOARD

Scoring Models for Innovative Item Types: A Case Study

Building Evaluation Scales for NLP using Item Response Theory

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL

TECHNICAL GUIDELINES. Castle's Technical Guidelines for the Construction of Multiple-Choice Questions. Updated January 2009

A National Study of the Certified Diabetes Educator: Report on a Job Analysis Conducted by the National Certification Board for Diabetes Educators

Reliability & Validity Dr. Sudip Chaudhuri

2015 Exam Committees Report for the National Physical Therapy Examination Program

Examination of the Application of Item Response Theory to the Angoff Standard Setting Procedure

Validity of measurement instruments used in PT research

On the purpose of testing:

The Concept of Validity

Validity, Reliability, and Fairness in Music Testing

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

Detecting Suspect Examinees: An Application of Differential Person Functioning Analysis. Russell W. Smith Susan L. Davis-Becker

Validity and reliability of measurements

Psychological testing

Does factor indeterminacy matter in multi-dimensional item response theory?

UvA-DARE (Digital Academic Repository)

Examining Construct Stability Across Career Stage Cohorts

Utilizing the NIH Patient-Reported Outcomes Measurement Information System

Summary of Independent Validation of the IDI Conducted by ACS Ventures:

Chapter 3 Psychometrics: Reliability and Validity

investigate. educate. inform.

CHAPTER 3. Research Methodology

AMERICAN BOARD OF SURGERY 2009 IN-TRAINING EXAMINATION EXPLANATION & INTERPRETATION OF SCORE REPORTS

André Cyr and Alexander Davies

Examining Factors Affecting Language Performance: A Comparison of Three Measurement Approaches

The Effect of Option Homogeneity in Multiple- Choice Items

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1

Registered Radiologist Assistant (R.R.A. ) 2016 Examination Statistics

A Comparison of the Angoff and Item Mapping Standard Setting Methods for a Certification Examination

Reliability Theory for Total Test Scores. Measurement Methods Lecture 7 2/27/2007

1. Evaluate the methodological quality of a study with the COSMIN checklist

26:010:557 / 26:620:557 Social Science Research Methods

Transcription:

Psychometrics for Beginners
Lawrence J. Fabrey, PhD, Applied Measurement Professionals

Learning Objectives
- Identify key NCCA Accreditation requirements
- Identify two underlying models of measurement
- Describe factors that contribute to enhanced score reliability
- Describe the importance of reliability in making pass/fail decisions
- Identify the links in the chain of evidence to support the validity of examination results
- Describe practical steps in creating a defensible examination

Outline
- Identifying the Intent of the Measurement Process
- NCCA Standards and Their Origin
- Theories of Measurement
- Reliability
- Validity
- Time for Q & A

Identifying the Intent of the Measurement Process
1. What do we want to measure?
2. How precise do we need to be?
Reliability and validity are often mentioned together, but they are separate concepts: question 1 addresses validity, and question 2 addresses reliability. Why is validity more important than reliability?

Can We Claim "This Is a Valid Examination"?
No -- but why not?
- The question suggests that validity can be evaluated with only one method,
- the inferences made about examination results are not addressed,
- a yes or no response would fail to account for other psychometric issues, and
- the purpose for which examination results are to be used must be described.

Validity Evidence for Certification
- What do we want to measure? Knowledge, skills, abilities, attitudes.
- Why do we want to measure? To protect the public or inform the public.
More on validity later.

Can We Claim a Test Is Reliable?
Maybe, but reliability is a matter of degree. More on reliability later.

NCCA Standards and Their Origin
Several standards documents apply:
- NCCA Standards for the Accreditation of Certification Programs (NOCA, 2004)
- Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association & National Council on Measurement in Education, 1999)
- and others

Highlights of NCCA Standards
- Standards 1-9 are considered Administrative
- Standards 10-18 are Psychometric
- Standards 19-20 cover Recertification, and are therefore a little of both
- Standard 21 requires ongoing accountability

Theories of Measurement
Two models: Classical Measurement Theory and Item Response Theory. The sections that follow cover the statistics used for each model and how to choose a measurement model.

Classical Measurement Theory Model
- Sometimes called CTT (Classical Test Theory)
- Items are sampled from a domain
- Each item is assumed to make an equal contribution to measurement of the domain
- Scores: counting the number of correct answers
- Scores are sometimes transformed to another scale, but the highest number correct will yield the best result

Classical Measurement Theory Model Statistics
Key item statistics (see the sketch below):
- Item difficulty, identified by the p value (proportion correct)
- Item discrimination, identified by a correlation (e.g., point-biserial)
Test-level statistics:
- Mean, standard deviation, etc.
- Measures of reliability, to be discussed later

Item Response Theory Model
- Commonly called IRT; sometimes called Latent Trait Theory
- The ability of an examinee is estimated based on the difficulty of the items answered correctly
- Items are calibrated
- The probability of a correct response is a function of the underlying ability of the examinee and the statistical characteristics of the item
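A minimal sketch (hypothetical response data, not from the presentation) of how the two key CTT item statistics are computed:

```python
# Hypothetical 0/1 scored responses: rows = examinees, columns = items.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
])

item = 0
item_scores = responses[:, item]

# Item difficulty: p value = proportion of examinees answering correctly.
p_value = item_scores.mean()

# Item discrimination: point-biserial correlation between the item score
# and the rest score (total score excluding this item, to avoid inflating
# the correlation with the item's own contribution).
rest_scores = responses.sum(axis=1) - item_scores
point_biserial = np.corrcoef(item_scores, rest_scores)[0, 1]

print(f"p = {p_value:.2f}, point-biserial = {point_biserial:.3f}")
```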

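The slides state that last relationship verbally; a standard formalization (not a formula shown in the presentation) is the three-parameter logistic (3PL) model, whose parameters the next section names:

\[
P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-a(\theta - b)}}
\]

where \(\theta\) is the examinee's ability, \(a\) is the item discrimination, \(b\) the item difficulty, and \(c\) the lower-asymptote (guessing) parameter. Plotting \(P(\theta)\) against \(\theta\) produces the item characteristic curve (ICC) depicted below.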
Item Response Theory Model Statistics
Key item statistics: up to three parameters
- Item discrimination = the a parameter
- Item difficulty = the b parameter
- Guessing = the c parameter
- Depicted with an ICC
Test-level statistics:
- Test Characteristic Curves
- Information functions

[Figure: Sample Item Characteristic Curve (ICC). y-axis: Probability of a Correct Response, from 0.0 to 1.0; x-axis: Ability (θ), from -3 to 3.]

Choosing a Measurement Model
- Both models provide valuable information
- Interpretations for either benefit from larger candidate volume
- Candidate volume is more critical when choosing the number of parameters for IRT

Reliability - Overview
The degree to which the results of testing are free from errors of measurement. This section covers:
- Methods of assessment
- Contributing factors
- How to promote higher reliability: item writing, statistical analysis of items, and scaling and equating

Reliability - Methods of Assessment
- CTT: internal consistency (coefficient α, KR-20), SEM, and generalizability (G) theory (based on ANOVA)
- IRT: SEM and fit
- Decision consistency, for both models
(A computational sketch of KR-20 and the SEM follows below.)

Reliability - Contributing Factors
1. Homogeneity of content
2. Heterogeneity of examinees
3. Number of items/responses

How reliable is reliable enough? NCCA Standard 14: "sufficiently reliable for the intended purposes."
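A minimal sketch (hypothetical data, not from the presentation) of two of the CTT indices above, KR-20 and the standard error of measurement:

```python
# Hypothetical 0/1 scored responses: rows = examinees, columns = items.
import numpy as np

responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
])

k = responses.shape[1]              # number of items
p = responses.mean(axis=0)          # per-item proportion correct
q = 1 - p
totals = responses.sum(axis=1)      # total scores
var_total = totals.var()            # total-score variance (ddof=0, to be
                                    # consistent with the p*q item variances)

# KR-20: internal consistency for dichotomously scored items.
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / var_total)

# SEM: expected spread of observed scores around an examinee's true score.
sem = totals.std() * np.sqrt(1 - kr20)

print(f"KR-20 = {kr20:.3f}, SEM = {sem:.2f} raw-score points")
```

With toy data like this the coefficient comes out low; the point is the computation, not the value.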

How to Promote Higher Reliability - Item Writing
Training item writers to ensure:
- Pertinence to the examination
- Clearly written stems
- Avoidance of extraneous clues
- Plausible, high-quality distractors
- Absence of bias
- Thorough review by SMEs

How to Promote Higher Reliability - Statistical Analysis of Items
- Pretest items before using them to compute scores
- The goal is generally higher positive discrimination and moderate difficulty

How to Promote Higher Reliability - Example of Item Analysis

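The two tables below show option-level analyses for two items (n = number of examinees choosing each option, p = proportion, Mean = mean total score of those examinees, Disc = point-biserial discrimination). As a minimal sketch with hypothetical data, this is how such option-level statistics can be computed:

```python
# Hypothetical raw picks for one item, plus each examinee's total score.
import numpy as np

choices = np.array(["A", "B", "A", "C", "A", "D", "A", "B", "A", "C"])
totals = np.array([82, 61, 78, 64, 90, 70, 85, 58, 74, 66], dtype=float)

print(f"{'Option':>6}{'n':>5}{'p':>7}{'Mean':>8}{'Disc':>8}")
for option in "ABCD":
    chose = (choices == option).astype(float)
    n = int(chose.sum())
    p = chose.mean()
    mean_total = totals[chose == 1].mean()   # mean score of those choosing it
    # Point-biserial between choosing this option (0/1) and total score:
    # positive for the key, negative for effective distractors.
    disc = np.corrcoef(chose, totals)[0, 1]
    print(f"{option:>6}{n:>5}{p:>7.2f}{mean_total:>8.2f}{disc:>8.3f}")
```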
An appropriately difficult item that discriminates well (it could be similar to the item depicted by the previous ICC):

        Overall     A        B        C        D
n       192         137      13       19       23
p       0.76        0.71     0.07     0.10     0.12
Mean    74.29       77.63    63.57    62.07    70.56
Disc    +0.412      +0.412   -0.156   -0.317   -0.231

An easy item that does not discriminate well:

        Overall     A        B        C        D
n       192         20       2        166      4
p       0.87        0.10     0.01     0.87     0.02
Mean    75.41       75.50    69.05    75.61    69.73
Disc    +0.049      +0.03    -0.059   +0.049   -0.086

How to Promote Higher Reliability - Example of Item in Bank

Scaling and Equating - Scaling
- A linear transformation of a number or score from one scale (usually a raw score, or number correct) to another
- Examples: temperature, currency
- The simplest is calculating a percentage
- Other scaling methods
- Advantages and disadvantages of scaling

Scaling and Equating - Equating
- Determining the comparability of score interpretations based on different examination forms
- CTT: usually through common items (an anchor test)
- IRT: usually through placing all parameter estimates from different samples of examinees on a common scale
(A small sketch of scaling and equating follows the validity overview below.)

Validity - Overview
- Evidence of validity based on content
- Criterion-related validation strategies
- Item and test bias
- Establishing cut scores
- Summary of validity as applied to credentialing examinations
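Before turning to validity in detail, a minimal sketch of the scaling and equating ideas above (assumed example values; a simple mean-equating, random-groups design stands in for the anchor-test and IRT methods the slides mention):

```python
# Linear scaling: map raw scores onto a reporting scale.
def linear_scale(raw, raw_max, scaled_min=200.0, scaled_max=800.0):
    """Linearly transform a raw score (0..raw_max) to the reporting scale."""
    return scaled_min + (scaled_max - scaled_min) * raw / raw_max

print(linear_scale(120, raw_max=150))   # 120/150 correct -> 680.0

# Mean equating (random-groups design): if randomly equivalent groups
# average 1.5 points lower on the new form, the new form is treated as
# 1.5 points harder, and new-form scores are adjusted upward onto the
# old form's scale before the linear scaling is applied.
mean_old_form, mean_new_form = 112.3, 110.8
new_form_raw = 118
equated_raw = new_form_raw + (mean_old_form - mean_new_form)
print(linear_scale(equated_raw, raw_max=150))
```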

Types of Validity Evidence
- Traditional: content, construct, criterion-related
- 1999 Standards: validity is a unitary concept, with sources of validity evidence based on:
  - Test content
  - Response processes
  - Internal structure
  - Relations to other variables (i.e., convergent, discriminant, and test-criterion relationships; generalization)
  - Consequences of testing

Evidence of Validity Based on Content
- Job analysis, also known as practice analysis, role delineation study, or other terms
- The goal: determine what a practitioner must know and do in the role/job
- Collection of data (e.g., a survey)
- Interpretation of data
- Leads to the development of specifications

Links in the Chain of Evidence - Examination Development Overview
- Identify the Intent of the Examination
- Conduct a Job Analysis
- Create Examination Specifications
- Item Writing & Review
- Create and Review Examination Form(s)
- Standard Setting Study
- Examination Scoring (IA)
- Equating of Future Forms
- Examination Administration

Criterion-Related Validation Strategies
- Less common for credentialing
- Relationships of scores with other measures, for example job performance ratings or other assessments
- Challenges exist

Item and Test Bias
- Item bias: differential item functioning (DIF)
- Detection at two points in the testing process: during examination development, or during the analysis of examination results (one common detection method is sketched below)
- Prevention is the key

Establishing Cut Scores
- A perfect test is of no value without an appropriate passing point
- Standard 12: "consistent with the purpose of the credential and the established standard of competence for the profession, occupation, role, or skill"
- Criterion-related (and not norm-referenced)
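The slides name DIF but no specific procedure; as one common option, here is a minimal sketch of the Mantel-Haenszel approach with hypothetical counts:

```python
# Examinees are first stratified by total score, so that reference- and
# focal-group members of comparable ability are compared. Within each
# stratum: (ref_correct, ref_wrong, focal_correct, focal_wrong).
strata = [
    (40, 10, 35, 15),
    (30, 20, 22, 28),
    (15, 35, 10, 40),
]

num = den = 0.0
for a, b, c, d in strata:
    n = a + b + c + d
    num += a * d / n          # reference correct, focal wrong
    den += b * c / n          # reference wrong, focal correct

# A common odds ratio near 1.0 suggests no uniform DIF; values far from
# 1.0 flag the item for review by subject-matter experts.
mh_odds_ratio = num / den
print(f"Mantel-Haenszel common odds ratio = {mh_odds_ratio:.2f}")
```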

Establishing Cut Scores
- Different methods may be considered; the principles can apply regardless of format
- Commonalities among the usual methods:
  - Selection of judges
  - Agreement on the definition of the MCP (minimally competent practitioner)
  - Judgments about items
  - Reasonability check

Establishing Cut Scores
Examples of different methods: Angoff, Ebel, Jaeger, Nedelsky, Bookmark. A demonstration of a modified-Angoff method follows.

Establishing a Cut Score - Angoff Method
Let's pretend:
- You are SMEs
- A four-item test has been approved*
- We just discussed and agreed on the definition of an MCP
- Let's set a cut score for the Certified Chocolate Consumer (CCC) examination

*Items courtesy of Hogan, Waters, Nettles, & Breyer (ATP, 2008)

Rate each item, then we'll check the key:

1. Which of the following foods would be the best to use as a palate cleanser before tasting chocolate?
   A. salted corn chips
   B. white bread
   C. sweet apple
   D. lettuce leaves

2. The process of pursing the lips and drawing air into the mouth to oxygenate chocolate is called
   A. clicking.
   B. sucking.
   C. guppying.
   D. winnowing.

3. Chocolate was most frequently consumed as a beverage until the
   A. Crusades.
   B. War of 1812.
   C. Industrial Revolution.
   D. early 1900's.

4. The process that is used to grade the quality of cocoa beans is called
   A. a cut test.
   B. a dry test.
   C. blanketing.
   D. kibbling.

Establishing the Cut Score
- Please announce your ratings
- Difficulty
- Small judge variability
- Suggested cut = 74% (or 3 out of 4)

Judge      Item 1   Item 2   Item 3   Item 4   Judge Mean
1          80       75       65       70       72.50
2          80       65       70       65       70.00
3          85       85       65       75       77.50
4          90       80       70       65       76.25
5          85       85       65       70       76.25
6          85       75       75       75       77.50
7          85       75       75       60       73.75
8          85       65       65       75       72.50
9          95       70       55       75       73.75
10         80       75       60       65       70.00
Mean       85.00    75.00    66.50    69.50    74.00
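A minimal sketch of the arithmetic behind the suggested cut, using the ratings table above (each rating is a judge's estimate of the percentage of MCPs who would answer the item correctly):

```python
import numpy as np

ratings = np.array([        # rows = judges 1-10, columns = items 1-4
    [80, 75, 65, 70],
    [80, 65, 70, 65],
    [85, 85, 65, 75],
    [90, 80, 70, 65],
    [85, 85, 65, 70],
    [85, 75, 75, 75],
    [85, 75, 75, 60],
    [85, 65, 65, 75],
    [95, 70, 55, 75],
    [80, 75, 60, 65],
])

judge_means = ratings.mean(axis=1)   # reasonability check on each judge
item_means = ratings.mean(axis=0)    # 85.00, 75.00, 66.50, 69.50
cut_percent = ratings.mean()         # grand mean: 74.00

n_items = ratings.shape[1]
raw_cut = round(n_items * cut_percent / 100)   # 74% of 4 items -> 3 correct

print(f"Suggested cut = {cut_percent:.0f}% (or {raw_cut} out of {n_items})")
```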

Summary of Validity as Applied to Credentialing Examinations
- Needed for any assessment method
- Links in the chain of evidence used to support the validity of examination results
- Documentation

Q & A
Lawrence J. Fabrey, PhD
Applied Measurement Professionals
913-895-4706
LFabrey@goAMP.com