Item Analysis: Classical and Beyond

Similar documents
Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Influences of IRT Item Attributes on Angoff Rater Judgments

Turning Output of Item Response Theory Data Analysis into Graphs with R

Development, Standardization and Application of

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

A Comparison of Several Goodness-of-Fit Statistics

Description of components in tailored testing

Introduction to Item Response Theory

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Psychometrics in context: Test Construction with IRT. Professor John Rust University of Cambridge

Building Evaluation Scales for NLP using Item Response Theory

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

Using the Rasch Modeling for psychometrics examination of food security and acculturation surveys

An application of the new irt command in Stata

Does factor indeterminacy matter in multi-dimensional item response theory?

Type I Error Rates and Power Estimates for Several Item Response Theory Fit Indices

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Likelihood Ratio Based Computerized Classification Testing. Nathan A. Thompson. Assessment Systems Corporation & University of Cincinnati.

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

ABERRANT RESPONSE PATTERNS AS A MULTIDIMENSIONAL PHENOMENON: USING FACTOR-ANALYTIC MODEL COMPARISON TO DETECT CHEATING. John Michael Clark III

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

Item Response Theory (IRT): A Modern Statistical Theory for Solving Measurement Problem in 21st Century

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015

Copyright. Kelly Diane Brune

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

Comprehensive Statistical Analysis of a Mathematics Placement Test

A simple guide to IRT and Rasch 2

A Comparison of Pseudo-Bayesian and Joint Maximum Likelihood Procedures for Estimating Item Parameters in the Three-Parameter IRT Model

MEANING AND PURPOSE. ADULT PEDIATRIC PARENT PROXY PROMIS Item Bank v1.0 Meaning and Purpose PROMIS Short Form v1.0 Meaning and Purpose 4a

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms

On indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state

CYRINUS B. ESSEN, IDAKA E. IDAKA AND MICHAEL A. METIBEMU. (Received 31, January 2017; Revision Accepted 13, April 2017)

André Cyr and Alexander Davies

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

Information Structure for Geometric Analogies: A Test Theory Approach

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests

4 Diagnostic Tests and Measures of Agreement

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

ROC Curve. Brawijaya Professional Statistical Analysis BPSA MALANG Jl. Kertoasri 66 Malang (0341)

A Broad-Range Tailored Test of Verbal Ability

Having your cake and eating it too: multiple dimensions and a composite

Centre for Education Research and Policy

INTRODUCTION TO ITEM RESPONSE THEORY APPLIED TO FOOD SECURITY MEASUREMENT. Basic Concepts, Parameters and Statistics

PSYCHOLOGICAL STRESS EXPERIENCES

During the past century, mathematics

Smoking Social Motivations

Issues That Should Not Be Overlooked in the Dominance Versus Ideal Point Controversy

THE NATURE OF OBJECTIVITY WITH THE RASCH MODEL

SLEEP DISTURBANCE ABOUT SLEEP DISTURBANCE INTRODUCTION TO ASSESSMENT OPTIONS. 6/27/2018 PROMIS Sleep Disturbance Page 1

Validating Measures of Self Control via Rasch Measurement. Jonathan Hasford Department of Marketing, University of Kentucky

Psychometrics for Beginners. Lawrence J. Fabrey, PhD Applied Measurement Professionals

Reanalysis of the 1980 AFQT Data from the NLSY79 1

Maximum Marginal Likelihood Bifactor Analysis with Estimation of the General Dimension as an Empirical Histogram

ABOUT PHYSICAL ACTIVITY

Introduction to Measurement

PHYSICAL STRESS EXPERIENCES

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

The Psychometric Development Process of Recovery Measures and Markers: Classical Test Theory and Item Response Theory

GENERAL SELF-EFFICACY AND SELF-EFFICACY FOR MANAGING CHRONIC CONDITIONS

Section 5. Field Test Analyses

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

ANXIETY A brief guide to the PROMIS Anxiety instruments:

Inferential Statistics

ABOUT SMOKING NEGATIVE PSYCHOSOCIAL EXPECTANCIES

Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies. Xiaowen Zhu. Xi an Jiaotong University.

Published by European Centre for Research Training and Development UK (

Item Response Theory. Robert J. Harvey. Virginia Polytechnic Institute & State University. Allen L. Hammer. Consulting Psychologists Press, Inc.

FATIGUE. A brief guide to the PROMIS Fatigue instruments:

A Bayesian Nonparametric Model Fit statistic of Item Response Models

Analyzing Teacher Professional Standards as Latent Factors of Assessment Data: The Case of Teacher Test-English in Saudi Arabia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Construct Validity of Mathematics Test Items Using the Rasch Model

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Dr. Kelly Bradley Final Exam Summer {2 points} Name

ABOUT SUBSTANCE USE INTRODUCTION TO ASSESSMENT OPTIONS SUBSTANCE USE. 2/26/2018 PROMIS Substance Use Page 1

Differential Item Functioning

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Dimensionality of the Force Concept Inventory: Comparing Bayesian Item Response Models. Xiaowen Liu Eric Loken University of Connecticut

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

Ordinal Data Modeling

Computerized Mastery Testing

Analysis test of understanding of vectors with the three-parameter logistic model of item response theory and item response curves technique

Identification of group differences using PISA scales - considering effects of inhomogeneous items

DEVELOPING IDEAL INTERMEDIATE ITEMS FOR THE IDEAL POINT MODEL MENGYANG CAO THESIS

The Effect of Guessing on Item Reliability

References. Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah,

Factors Affecting the Item Parameter Estimation and Classification Accuracy of the DINA Model

linking in educational measurement: Taking differential motivation into account 1

Decision consistency and accuracy indices for the bifactor and testlet response theory models

A simulation study of person-fit in the Rasch model

MEASURING MIDDLE GRADES STUDENTS UNDERSTANDING OF FORCE AND MOTION CONCEPTS: INSIGHTS INTO THE STRUCTURE OF STUDENT IDEAS

AN ABSTRACT OF THE THESIS OF

Utilizing the NIH Patient-Reported Outcomes Measurement Information System

Bayesian and Frequentist Approaches

Technical Specifications

Transcription:

Item Analysis: Classical and Beyond. SCROLLA Symposium: Measurement Theory and Item Analysis. Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013.

Why is item analysis relevant? Item analysis provides a way of measuring the quality of questions: seeing how appropriate they were for the respondents and how well they measured their ability. It also provides a way of re-using items in different instruments, with prior knowledge of how they are going to perform.

What kinds of item analysis are there? Item analysis divides into classical analysis and latent trait models, and the latent trait models in turn comprise Rasch and Item Response Theory (IRT1, IRT2, IRT3, IRT4).

Classical Test Theory Classical analysis is the easiest and most widely used form of analysis. The statistics can be computed by generic statistical packages (or, at a push, by hand) and need no specialist software. Classical analysis is performed on the survey or test instrument as a whole rather than on the item; although item statistics can be generated, they apply only to that group of students on that collection of items.

Classical Test Theory Assumptions Classical test theory assumes that any test score (or survey instrument sum) is comprised of a true value plus random error. Crucially, it assumes that this error is normally distributed, uncorrelated with the true score, and has a mean of zero: x_obs = x_true + ε, where ε ~ N(0, σ_err).
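As a minimal sketch of this assumption (all numbers hypothetical), the model can be simulated directly: observed scores are true scores plus independent, zero-mean normal error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true scores and normally distributed error with mean zero
true_score = rng.normal(loc=50, scale=10, size=1000)
error = rng.normal(loc=0, scale=4, size=1000)  # N(0, sigma_err), drawn independently of true_score

observed = true_score + error  # x_obs = x_true + error

print(round(float(error.mean()), 3))                          # close to 0
print(round(float(np.corrcoef(true_score, error)[0, 1]), 3))  # close to 0
```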

Classical Analysis Statistics Difficulty (item-level statistic), Discrimination (item-level statistic), Reliability (instrument-level statistic).

Classical Test Theory Difficulty The difficulty of a (single response selection) question in classical analysis is simply the proportion of people who answered the question incorrectly. For multiple-mark questions, it is the average mark expressed as a proportion of the maximum available. Given on a scale of 0-1, the higher the proportion, the greater the difficulty.
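As an illustration of the dichotomous case, here is a minimal sketch with a made-up 0/1 response matrix (rows are respondents, columns are items); the data are purely hypothetical.

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows are respondents, columns are items
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])

# Difficulty as defined above: the proportion answering incorrectly,
# so values near 1 indicate hard items
difficulty = 1 - responses.mean(axis=0)
print(difficulty)  # [0.2 0.4 0.6 0.8]: items get harder left to right
```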

Classical Test Theory Discrimination The discrimination of an item is the (Pearson) correlation between the item mark and the total test mark. Being a correlation, it can vary from −1 to +1, with higher values indicating (desirable) high discrimination.
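Continuing the toy matrix above, a sketch of this calculation: the Pearson correlation of each item's marks with the total test mark (for dichotomous items this is the point-biserial correlation). Note that some texts correlate the item with the rest-of-test score instead, so the item does not inflate its own total.

```python
import numpy as np

# Same hypothetical 0/1 response matrix as in the difficulty sketch
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])

total = responses.sum(axis=1)  # each respondent's total test mark

# Pearson correlation of each item mark with the total mark
discrimination = np.array([
    np.corrcoef(responses[:, j], total)[0, 1]
    for j in range(responses.shape[1])
])
print(np.round(discrimination, 2))  # all positive: high scorers tend to get each item right
```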

Classical Test Theory Reliability Reliability is a measure of how well the test or survey holds together. For practical reasons, internal consistency estimates are the easiest to obtain; these indicate the extent to which each item correlates with every other item. Reliability is measured on a scale of 0-1: the greater the number, the higher the reliability.
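One common internal consistency estimate is Cronbach's alpha; here is a minimal sketch of the standard formula, alpha = k/(k−1) × (1 − sum of item variances / variance of total scores), applied to the same toy matrix.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Internal-consistency reliability for an items-in-columns score matrix."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Same hypothetical response matrix as in the earlier sketches
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
print(round(cronbach_alpha(responses), 2))  # about 0.8 for this toy matrix
```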

Classical Analysis versus Latent Trait Models Classical analysis has the survey, or test, (not the item) as its basis. Although the statistics generated are often generalized to similar populations completing a similar survey or taking a similar test, they only really apply to those students taking that test. Latent trait models aim to look beyond that, at the underlying traits which are producing the test performance. They are measured at item level and provide sample-free measurement.

Latent Trait Models Latent trait models have been around since the 1940s, but were not widely used until the 1960s. Although theoretically possible, it is practically infeasible to use them without specialist software. They aim to measure the underlying ability (or trait) which is producing the test performance, rather than measuring performance per se. This leads to them being sample-free: as the statistics are not dependent on the test situation which generated them, they can be used more flexibly.

Rasch versus Item Response Theory Mathematically, Rasch is identical to the most basic IRT model (IRT1); however, there are some important differences which make it a more viable proposition for practical testing. For instance, in Rasch the model is superior: data which do not fit the model are discarded (carefully, not simply dumped). Rasch also does not permit abilities to be estimated for extreme items and persons.

IRT - the generalized model The three-parameter model gives the probability of a person with ability θ correctly answering (or endorsing) item g as P_g(θ) = c_g + (1 − c_g) / (1 + exp(−a_g(θ − b_g))), where a_g = gradient of the ICC at the point θ = b_g (item discrimination), b_g = the ability level at which that gradient is maximized (item difficulty), and c_g = the probability of very low ability persons correctly answering (or endorsing) question g.

IRT - Item Characteristic Curves An ICC is a plot of the probability of a respondent correctly answering the question (endorsing) against their ability (likeliness to endorse). The higher the ability, the higher the chance that they will respond correctly. On the curve, c is the lower intercept, b the ability at which the gradient is at its maximum, and a that maximum gradient.
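A brief sketch of such a curve under the three-parameter model above; the item parameters used here (a = 1.2, b = 0.0, c = 0.2) are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 200)          # ability range
p = icc_3pl(theta, a=1.2, b=0.0, c=0.2)  # hypothetical item parameters

plt.plot(theta, p)
plt.axhline(0.2, linestyle="--")  # c: the lower intercept
plt.axvline(0.0, linestyle="--")  # b: ability at the maximum gradient
plt.xlabel("Ability (theta)")
plt.ylabel("P(correct response)")
plt.show()
```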

IRT - About the Parameters Difficulty Although there is no correct difficulty for any one item, it is clearly desirable that the difficulty of the test (or survey instrument) is centred around the average ability of the respondents. The higher the b parameter, the more difficult the question: b is inversely related to the probability of the question being answered correctly.

IRT - About the Parameters Discrimination In IRT (unlike Rasch), maximal discrimination is sought; thus the higher the a parameter, the more desirable the question. Differences in the discrimination of questions can lead to differences in the relative difficulties of questions across the ability range, because their ICCs cross.

IRT - About the Parameters Guessing A high c parameter suggests that candidates with very little ability may choose the correct answer. This parameter is rarely valid outwith multiple choice testing, and its value should not vary excessively from the reciprocal of the number of choices (e.g., about 0.25 for a four-option item).

IRT - Parameter Estimation Before being used (in an item bank or for measurement), items must first be calibrated; that is, their parameters must be estimated. There are two main procedures: Joint Maximum Likelihood (JML) and Marginal Maximum Likelihood (MML). JML is most common for IRT1 and IRT2, while MML is used more frequently for IRT3. Bayesian estimation and estimated bounds may be imposed on the data to avoid high-discrimination items being overvalued.
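To make the JML idea concrete, here is a minimal, illustrative sketch for the simplest case (Rasch/IRT1) on simulated data: person and item parameters are updated in alternation, each maximizing the joint likelihood with the other held fixed. All numbers are made up, and production calibration software adds convergence checks, bias corrections, and the priors or bounds mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate hypothetical Rasch (IRT1) data: 500 persons, 10 items
true_theta = rng.normal(size=500)
true_b = np.linspace(-1.5, 1.5, 10)
p_true = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
x = (rng.random(p_true.shape) < p_true).astype(float)

# Drop perfect and zero scores: JML cannot estimate these extreme persons
scores = x.sum(axis=1)
x = x[(scores > 0) & (scores < x.shape[1])]

theta = np.zeros(x.shape[0])  # person ability estimates
b = np.zeros(x.shape[1])      # item difficulty estimates

for _ in range(100):  # alternate damped Newton-Raphson updates (0.5 guards against overshooting)
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    info = p * (1 - p)
    theta += 0.5 * (x - p).sum(axis=1) / info.sum(axis=1)  # update persons, items fixed

    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    info = p * (1 - p)
    b += 0.5 * (p - x).sum(axis=0) / info.sum(axis=0)      # update items, persons fixed
    b -= b.mean()                                          # fix the scale's origin for identifiability

print(np.round(b, 2))  # should sit near true_b, up to sampling noise and JML bias
```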