Scoring Models for Innovative Item Types: A Case Study


By Dr. Chris Beauchamp, January 2016

As a way to capture the richness of job performance, many credentialing organizations are supplementing traditional multiple-choice questions (MCQs) with innovative item types. Although this view is not unanimous, the theory is that MCQs offer a somewhat artificial representation of job tasks, and that innovative items provide a more refined way to assess candidate competence. Innovative item types bring a number of benefits and challenges, and one of the biggest challenges is the selection of an item scoring methodology. The purpose of this backgrounder is to discuss a recent project in which four different scoring models were used to score candidate response data from a pilot test that included four item types.

The Pilot Test

This project involved the administration of a 70-item test to a group of 71 volunteers. The test involved four item types:

Traditional Multiple Choice
Multiple True-False
Drag and Drop to Classify
Drag and Drop to Order

Traditional Multiple Choice (47 items)

The candidate is asked to select the single correct or best answer. Every item has 4 answer options.

Who was Canada's first Prime Minister?
1. Sir John A. Macdonald (*)
2. William Lyon Mackenzie
3. George Washington
4. Mackenzie King

Multiple True-False (9 items)

The candidate is presented with a stem and 4-9 answer options. For each option, the candidate is asked to indicate whether the statement is true or false.

Which of the following teams were part of the Original Six teams in the National Hockey League (NHL)?
1. The California Golden Seals (False)
2. The Philadelphia Flyers (False)
3. The Boston Bruins (True)
4. The New York Rangers (True)
5. The New York Islanders (False)
6. The Detroit Red Wings (True)
7. The Colorado Rockies (False)

Drag and Drop to Classify (10 items)

The candidate is presented with instructions followed by a list of 4-7 options. For each option, the candidate is asked to indicate which category applies.

Based on the Canadian Constitution, identify if the following fall under federal or provincial jurisdiction:
1. Healthcare (provincial)
2. Military (federal)
3. Education (provincial)
4. Telecommunications (federal)
5. Criminal Code (federal)

Drag and Drop to Order (4 items)

The candidate is presented with instructions, followed by a list of 4 options. The candidate is asked to place the options in order.

Place the following dance movies in the order in which they were released:
1. Top Hat (1)
2. Step Up (4)
3. Dirty Dancing (2)
4. Bring it On (3)
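The scoring models described next all work from element-level correctness, so it may help to see one concrete way a multi-element response could be represented. Below is a minimal sketch in Python (the article itself contains no code); the dictionary layout and the helper name element_correctness are illustrative assumptions, and the responses match the worked example used throughout the scoring sections.

    # Hypothetical representation of one Drag and Drop to Classify item:
    # each option maps to (candidate response, correct response).
    classify_item = {
        "Healthcare":         ("Provincial", "Provincial"),
        "Military":           ("Federal",    "Federal"),
        "Education":          ("Provincial", "Provincial"),
        "Telecommunications": ("Federal",    "Federal"),
        "Criminal Code":      ("Provincial", "Federal"),  # the candidate's one error
    }

    def element_correctness(item):
        """Return one 1/0 flag per option, in option order."""
        return [1 if response == key else 0 for response, key in item.values()]

    print(element_correctness(classify_item))  # [1, 1, 1, 1, 0]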

The Scoring Models

Candidate responses were scored using four different scoring models:

Dichotomous Model
Modified Dichotomous Model
Partial Credit Model
Negative Scoring Model

Dichotomous Model

Candidates receive 1 point if they answer every element of a question correctly, and 0 points if they answer any element of the question incorrectly. As a result, candidates receive a score of either 1 or 0 on each item.

Stem: Based on the Canadian Constitution, identify if the following options fall under federal or provincial jurisdiction:

Option               Candidate Response   Correct Response
Healthcare           Provincial           Provincial
Military             Federal              Federal
Education            Provincial           Provincial
Telecommunications   Federal              Federal
Criminal Code        Provincial           Federal

Candidate Score: 0 points (one element was answered incorrectly)
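To make the rule explicit, here is a minimal sketch (not code from the study): the item score is 1 only if every element-level flag is correct. The flags correspond to the worked example above, where the candidate answered four of five elements correctly.

    def dichotomous_score(flags):
        """1 point only if every element of the item is answered correctly."""
        return 1 if all(flags) else 0

    flags = [1, 1, 1, 1, 0]          # worked example: 4 of 5 elements correct
    print(dichotomous_score(flags))  # 0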

Modified Dichotomous Model

As a variation on the dichotomous model, candidates receive 1 point if they achieve a score of 75% or more on an item, and 0 points if they score less than 75%. As a result, candidates still receive a score of either 1 or 0 on each item.

Stem: Based on the Canadian Constitution, identify if the following options fall under federal or provincial jurisdiction:

Option               Candidate Response   Correct Response
Healthcare           Provincial           Provincial
Military             Federal              Federal
Education            Provincial           Provincial
Telecommunications   Federal              Federal
Criminal Code        Provincial           Federal

Candidate Score: 1 point (the candidate answered 80% of the elements correctly, which meets the 75% threshold)
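A sketch of the same idea with the 75% threshold, again assuming element-level correctness flags; the threshold parameter is written out only to make the rule visible.

    def modified_dichotomous_score(flags, threshold=0.75):
        """1 point if the proportion of correct elements meets the threshold."""
        return 1 if sum(flags) / len(flags) >= threshold else 0

    flags = [1, 1, 1, 1, 0]                   # 80% of elements correct
    print(modified_dichotomous_score(flags))  # 1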

Partial Credit Model

Candidates receive partial credit for each option they answer correctly and a score of 0 for each option they answer incorrectly. As a result, candidates can receive a score anywhere between 0 and 1 depending on how well they performed on the item.

Stem: Based on the Canadian Constitution, identify if the following options fall under federal or provincial jurisdiction:

Option               Candidate Response   Correct Response
Healthcare           Provincial           Provincial
Military             Federal              Federal
Education            Provincial           Provincial
Telecommunications   Federal              Federal
Criminal Code        Provincial           Federal

Candidate Score: 0.80 points (4 of 5 options answered correctly)
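Under partial credit the item score is simply the proportion of elements answered correctly, as in this sketch (illustrative, not the study's scoring code).

    def partial_credit_score(flags):
        """Proportion of elements answered correctly, between 0 and 1."""
        return sum(flags) / len(flags)

    flags = [1, 1, 1, 1, 0]
    print(partial_credit_score(flags))  # 0.8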

Negative Scoring Model

Candidates receive partial credit for each option they answer correctly and negative credit for each option they answer incorrectly. As a scoring rule, with the exception of MCQs, candidates cannot receive a score of less than 0 on an item. As a result, candidates can receive a score anywhere between 0 and 1 based on how they responded to the item.

Stem: Based on the Canadian Constitution, identify if the following options fall under federal or provincial jurisdiction:

Option               Candidate Response   Correct Response
Healthcare           Provincial           Provincial
Military             Federal              Federal
Education            Provincial           Provincial
Telecommunications   Federal              Federal
Criminal Code        Provincial           Federal

Candidate Score: 0.80 points - 0.20 points = 0.60 points
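A sketch of negative scoring under the assumptions above: each correct element adds 1/n, each incorrect element subtracts 1/n, and the item score is floored at 0 (the article applies this floor to the multi-element item types).

    def negative_score(flags):
        """Credit per correct element minus credit per incorrect element, floored at 0."""
        n = len(flags)
        raw = sum(1 / n if f else -1 / n for f in flags)
        return max(raw, 0.0)

    flags = [1, 1, 1, 1, 0]
    print(round(negative_score(flags), 2))  # 0.6, i.e. 0.80 - 0.20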

Results

Candidate scores and test-level metrics were calculated by applying each of the four scoring models to the test.

                          Dichotomous   Modified Dichotomous   Partial Credit   Negative Scoring
Number of candidates      71            71                     71               71
Number of items           70            70                     70               70
Average score             38.15         42.89                  46.65            38.89
Standard deviation        7.16          7.61                   7.49             7.68
Minimum score             11.00         13.00                  13.23            10.00
Maximum score             50.00         55.00                  56.74            51.48
Average item difficulty   0.55          0.61                   0.67             0.56
Average discrimination    0.22          0.24                   0.29             0.20
Reliability               0.81          0.82                   0.86             0.79

Candidate scores

Candidates achieved the highest scores under the partial credit model; scores were lowest under the dichotomous and negative scoring models. Based on the intended use of this test, one could argue that the dichotomous and negative scoring models led to unexpectedly low scores that may not be a true reflection of a candidate's ability.

Test Reliability

Reliability for the test was calculated using Cronbach's alpha coefficient. This measure produces an index that ranges from 0 (no reliability) to 1 (perfect reliability). In credentialing examinations, the minimum acceptable reliability is generally 0.80 or above (Nunnally, 1978). The partial credit model produced the highest reliability (0.86). The other three models yielded lower reliability, with the negative scoring model producing the lowest (0.79).
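The study's own data are not reproduced here, but the two summary metrics in the table, reliability (Cronbach's alpha) and average discrimination (corrected point-biserial), follow standard formulas. The sketch below shows those formulas applied to a candidates-by-items matrix of item scores using NumPy; the toy matrix is invented purely so the calculations run.

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for a candidates-by-items matrix of item scores."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]
        item_variances = scores.var(axis=0, ddof=1)
        total_variance = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    def corrected_point_biserial(scores, item):
        """Correlation between an item's scores and the total score with that item removed."""
        scores = np.asarray(scores, dtype=float)
        rest_total = scores.sum(axis=1) - scores[:, item]
        return np.corrcoef(scores[:, item], rest_total)[0, 1]

    # Invented 4-candidate, 3-item matrix, purely for illustration.
    toy = [[1, 0.8, 1],
           [0, 0.4, 0],
           [1, 1.0, 1],
           [0, 0.2, 1]]
    print(cronbach_alpha(toy))
    print(corrected_point_biserial(toy, item=0))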

Item-level Performance

In addition to item difficulty (i.e., the percentage of candidates answering a question correctly), one important metric for assessing item performance is discrimination. Discrimination refers to an item's ability to differentiate between those who know the content and those who do not. Items with higher discrimination do a good job of separating competent candidates from not-yet-competent candidates. For this study, the corrected point-biserial correlation (CRPB) was used. This measure produces an index that ranges from -1.00 (perfect negative discrimination) to 1.00 (perfect positive discrimination), with an index of 0.00 indicating no discrimination. Based on average discrimination, the partial credit model was most effective in differentiating between competent and not-yet-competent candidates. The negative scoring model yielded the least discriminating items.

Conclusion

In many ways, the partial credit model is the simplest and most intuitive model. It yielded scores that best matched a true representation of a candidate's ability, displayed evidence of the highest test reliability, and produced the best average indices of item discrimination. In addition, the partial credit model, unlike the negative scoring model, does not penalize candidates for guessing. Within the context of this particular test and this particular candidate population, the evidence supports the use of the partial credit model over the other three models.

References

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

To read more articles related to eLearning, examination, and instructional design, go to www.yas.getyardstick.com and check out our blog.