Differential Item Functioning from a Compensatory-Noncompensatory Perspective

Similar documents
The Influence of Conditioning Scores In Performing DIF Analyses

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

Differential Item Functioning

Graphical Representation of Multidimensional

Section 5. Field Test Analyses

International Journal of Education and Research Vol. 5 No. 5 May 2017

When can Multidimensional Item Response Theory (MIRT) Models be a Solution for Differential Item Functioning (DIF)? A Monte Carlo Simulation Study

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

A Comparison Study of the Unidimensional IRT Estimation of Compensatory and Noncompensatory Multidimensional Item Response Data

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Improvements for Differential Functioning of Items and Tests (DFIT): Investigating the Addition of Reporting an Effect Size Measure and Power

Keywords: Dichotomous test, ordinal test, differential item functioning (DIF), magnitude of DIF, and test-takers. Introduction

Technical Specifications

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

Linking Mixed-Format Tests Using Multiple Choice Anchors. Michael E. Walker. Sooyeon Kim. ETS, Princeton, NJ

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

IRT Parameter Estimates

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

A Monte Carlo Study Investigating Missing Data, Differential Item Functioning, and Effect Size

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

A DIFFERENTIAL RESPONSE FUNCTIONING FRAMEWORK FOR UNDERSTANDING ITEM, BUNDLE, AND TEST BIAS ROBERT PHILIP SIDNEY CHALMERS

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data

Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment

Building Evaluation Scales for NLP using Item Response Theory

Diagnostic Classification Models

Published by European Centre for Research Training and Development UK

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

Gender-Based Differential Item Performance in English Usage Items

THE STRENGTH OF MULTIDIMENSIONAL ITEM RESPONSE THEORY IN EXPLORING CONSTRUCT SPACE THAT IS MULTIDIMENSIONAL AND CORRELATED. Steven G.

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

Examining the Validity and Fairness of a State Standards-Based Assessment of English-Language Arts for Deaf or Hard of Hearing Students

Multidimensionality and Item Bias

Three Generations of DIF Analyses: Considering Where It Has Been, Where It Is Now, and Where It Is Going

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

André Cyr and Alexander Davies

The Effects Of Differential Item Functioning On Predictive Bias

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data

An Introduction to Missing Data in the Context of Differential Item Functioning

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

A Modified CATSIB Procedure for Detecting Differential Item Functioning on Computer-Based Tests. Johnson Ching-hong Li, Mark J. Gierl

A Comparison of Traditional and IRT based Item Quality Criteria

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

A Bayesian Nonparametric Model Fit statistic of Item Response Models

Computerized Mastery Testing

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Maike Krannich, Odin Jost, Theresa Rohm, Ingrid Koller, Steffi Pohl, Kerstin Haberkorn, Claus H. Carstensen, Luise Fischer, and Timo Gnambs

A Comparison of Methods of Estimating Subscale Scores for Mixed-Format Tests

Decision consistency and accuracy indices for the bifactor and testlet response theory models

On indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state

The Effects of Controlling for Distributional Differences on the Mantel-Haenszel Procedure. Daniel F. Bowen. Chapel Hill 2011

Multidimensional Modeling of Learning Progression-based Vertical Scales 1

Linking Errors in Trend Estimation in Large-Scale Surveys: A Case Study

Differential Performance of Test Items by Geographical Regions. Konstantin E. Augemberg Fordham University. Deanna L. Morgan The College Board

María Verónica Santelices 1 and Mark Wilson 2

Re-Examining the Role of Individual Differences in Educational Assessment

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

USING MULTIDIMENSIONAL ITEM RESPONSE THEORY TO REPORT SUBSCORES ACROSS MULTIPLE TEST FORMS. Jing-Ru Xu

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL

Chapter 11 Multiple Regression

LOGISTIC APPROXIMATIONS OF MARGINAL TRACE LINES FOR BIFACTOR ITEM RESPONSE THEORY MODELS. Brian Dale Stucky

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

Initial Report on the Calibration of Paper and Pencil Forms UCLA/CRESST August 2015

Development, Standardization and Application of

Adaptive Testing With the Multi-Unidimensional Pairwise Preference Model Stephen Stark University of South Florida

COMBINING SCALING AND CLASSIFICATION: A PSYCHOMETRIC MODEL FOR SCALING ABILITY AND DIAGNOSING MISCONCEPTIONS LAINE P. BRADSHAW

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

ABERRANT RESPONSE PATTERNS AS A MULTIDIMENSIONAL PHENOMENON: USING FACTOR-ANALYTIC MODEL COMPARISON TO DETECT CHEATING. John Michael Clark III

Academic Discipline DIF in an English Language Proficiency Test

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Analyzing Teacher Professional Standards as Latent Factors of Assessment Data: The Case of Teacher Test-English in Saudi Arabia

Comparing DIF methods for data with dual dependency

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms

linking in educational measurement: Taking differential motivation into account 1

SESUG '98 Proceedings

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests. Xuan Tan, Sooyeon Kim, Insu Paek

Thank You Acknowledgments

Math 124: Module 2, Part II

Fighting Bias with Statistics: Detecting Gender Differences in Responses on Items on a Preschool Science Assessment

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

Blending Psychometrics with Bayesian Inference Networks: Measuring Hundreds of Latent Variables Simultaneously

Does factor indeterminacy matter in multi-dimensional item response theory?

A Comparison of Several Goodness-of-Fit Statistics

Assessing the item response theory with covariate (IRT-C) procedure for ascertaining. differential item functioning. Louis Tay

Scaling TOWES and Linking to IALS

UCLA UCLA Electronic Theses and Dissertations

Bayesian Tailored Testing and the Influence

Constrained Multidimensional Adaptive Testing without intermixing items from different dimensions

On Test Scores (Part 2) How to Properly Use Test Scores in Secondary Analyses. Structural Equation Modeling Lecture #12 April 29, 2015

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

THE DEVELOPMENT AND VALIDATION OF EFFECT SIZE MEASURES FOR IRT AND CFA STUDIES OF MEASUREMENT EQUIVALENCE CHRISTOPHER DAVID NYE DISSERTATION

Brent Duckor Ph.D. (SJSU) Kip Tellez, Ph.D. (UCSC) BEAR Seminar April 22, 2014

Transcription:

Differential Item Functioning from a Compensatory-Noncompensatory Perspective. Terry Ackerman, Bruce McCollaum, Gilbert Ngerano, University of North Carolina at Greensboro

Motivation for my Presentation Differential item functioning (DIF) has become a standard analysis in achievement testing. Its purpose is to ensure test impartiality and to identify items that are unfair, favoring one group of examinees over another. Given the high stakes surrounding many educational tests today, DIF analyses have increased in importance. DIF has the greatest potential to occur when a test is multidimensional, that is, when it contains items that measure, to varying degrees, superfluous, invalid skills that differ from the purported purpose of the test.

Motivation for my Presentation If items measure invalid skills and groups of examinees differ on those skills, DIF is likely to result. Some examinees will end up getting those items right not because they are competent in the skill or composite of skills the test purports to measure, but because they are more able on an unessential skill measured by a DIF item.

Motivation for my Presentation A slightly strange example: me taking an algebra test written in Turkish. While I should be pretty good at algebra, I know close to nothing about the Turkish language. Thus, no matter how good I am at algebra, it cannot compensate for my not understanding Turkish, and my probability of a correct response on these items would be very low. I would be stuck in a noncompensatory situation.

Motivation for my Presentation Note that DIF cannot occur if a test is strictly unidimensional, measuring only one skill, factor, or trait. DIF is thought to occur when a test measures invalid skills and groups of examinees differ in their underlying ability distributions on those skills. Detecting DIF is a relatively easy process; the true challenge is determining what caused it!

Goal of my Presentation The purpose of my talk today is to explain a new cause of DIF. It is a situation in which a test measures both valid and invalid skills, and groups of examinees have identical underlying ability distributions but use the information presented in an item differently. That is, for some examinees an item is compensatory, while for other examinees (perhaps because of greater exposure to the requisite information, or because of instructional or pedagogical differences) the item is noncompensatory. To explain this new source of DIF, I'm going to back up a bit and give a brief background on multidimensional IRT modeling and DIF analyses.

Talk Outline: 1. Multidimensional Item Response Theory Models: a) Compensatory, b) Noncompensatory. 2. Item Representation in Multidimensional IRT. 3. Differential Item Functioning (DIF) from a Multidimensional Perspective. 4. DIF: Compensatory versus Noncompensatory Processing. 5. Example Applications. 6. Conclusion and Future Directions. 7. Pop Quiz.

1. Multidimensional IRT Models: Compensatory vs. Noncompensatory

The Two-Dimensional Compensatory Model The probability of examinee j correctly responding to item i can be expressed as:

$$P_{ij} = \frac{1.0}{1.0 + e^{-1.7\,(a_{1i}\theta_{1j} + a_{2i}\theta_{2j} + d_i)}}$$

with two discrimination parameters ($a_{1i}$, $a_{2i}$), two latent abilities ($\theta_{1j}$, $\theta_{2j}$), and one difficulty parameter ($d_i$).

The Two-Dimensional Noncompensatory Model The probability of examinee j correctly responding to item i can be expressed as:

$$P_{ij} = \left(\frac{1.0}{1.0 + e^{-1.7\,a_{1i}(\theta_{1j} - b_{1i})}}\right)\left(\frac{1.0}{1.0 + e^{-1.7\,a_{2i}(\theta_{2j} - b_{2i})}}\right)$$

with two discrimination parameters ($a_{1i}$, $a_{2i}$), two latent abilities ($\theta_{1j}$, $\theta_{2j}$), and two difficulty parameters ($b_{1i}$, $b_{2i}$).
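To make the contrast concrete, here is a minimal numeric sketch of the two models just defined (Python/NumPy; my own illustration, not from the slides, with illustrative parameter values):

```python
import numpy as np

def p_compensatory(theta1, theta2, a1, a2, d):
    """Two-dimensional compensatory 2PL: abilities combine additively,
    so a surplus on one dimension can offset a deficit on the other."""
    return 1.0 / (1.0 + np.exp(-1.7 * (a1 * theta1 + a2 * theta2 + d)))

def p_noncompensatory(theta1, theta2, a1, a2, b1, b2):
    """Two-dimensional noncompensatory model: a product of per-dimension
    logistic terms, so weakness on either dimension caps the probability."""
    term1 = 1.0 / (1.0 + np.exp(-1.7 * a1 * (theta1 - b1)))
    term2 = 1.0 / (1.0 + np.exp(-1.7 * a2 * (theta2 - b2)))
    return term1 * term2

# Two examinees with opposite ability profiles (cf. points A and B on the
# contour-plot slides that follow):
print(p_compensatory(-1.5, 1.5, 1.0, 1.0, 0.0))          # ~0.50
print(p_compensatory(1.5, -1.5, 1.0, 1.0, 0.0))          # ~0.50: full compensation
print(p_noncompensatory(-1.5, 1.5, 1.0, 1.0, 0.0, 0.0))  # ~0.07: no compensation
print(p_noncompensatory(1.5, -1.5, 1.0, 1.0, 0.0, 0.0))  # ~0.07: no compensation
```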

2. Multidimensional IRT: item representation

Mathematica Representations

Contour Plot of Item Response Surface (a1 = 1.50, a2 = 0.0, d = 0.3). This item discriminates only between levels of θ1. The steeper the surface, the greater the discrimination and the closer together the contours. [Points A, B, and C are marked on the plot.]

Contour Plot of Item Response Surface (a1 = 0.0, a2 = 0.8, d = 0.3). This item discriminates only between levels of θ2. The flatter the surface, the less the discrimination and the farther apart the contours. [Points A, B, and C are marked on the plot.]

Contour Plot of Item Response Surface (a1 = 1.0, a2 = 1.0, d = 0.3). This item discriminates between an equal composite of θ1 and θ2. Notice that examinees with opposite ability profiles (A: low θ1, high θ2; B: high θ1, low θ2) have the same probability of a correct answer (i.e., compensation).

Noncompensatory Model Contour Plot of Item Response Surface (a1 = 1.0, a2 = 1.0, b1 = 0.0, b2 = 0.0). No compensation occurs for being high on only one ability: examinees at A (low θ1, high θ2), B (high θ1, low θ2), and C (low θ1, low θ2) all have low probabilities of a correct response.

Perhaps the best representation of two-dimensional items is the vector method. Each item is represented in the latent ability plane as a vector. All vectors lie on lines that pass through the origin. Vectors can lie only in the first and third quadrants because the estimated a-parameters are constrained to be positive. Vectors representing easy items lie in the third quadrant; those representing difficult items lie in the first quadrant.

The length of the vector indicates how well an item can discriminate between levels of skill. This value is called MDISC:

$$\mathrm{MDISC} = \sqrt{a_1^2 + a_2^2}$$

The tail of the vector lies on the p = .5 equiprobability contour. The signed distance from the origin to this contour is denoted as D:

$$D = \frac{-d}{\mathrm{MDISC}}$$

The angular direction, α, indicates the composite of ability that the item is best measuring:

$$\alpha = \cos^{-1}\!\left(\frac{a_1}{\mathrm{MDISC}}\right)$$
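A small sketch of these three vector quantities (Python; my own illustration, not from the slides), applied to the item shown on the contour-plot slide below (a1 = 1.8, a2 = 1.0, d = 0.8):

```python
import numpy as np

def item_vector(a1, a2, d):
    """Vector summary of a two-dimensional compensatory item."""
    mdisc = np.sqrt(a1**2 + a2**2)             # multidimensional discrimination
    D = -d / mdisc                             # signed distance to the p=.5 contour
    alpha = np.degrees(np.arccos(a1 / mdisc))  # angle with the theta1 axis
    return mdisc, D, alpha

mdisc, D, alpha = item_vector(1.8, 1.0, 0.8)
print(mdisc, D, alpha)  # ~2.06, ~-0.39 (an easy item), ~29.1 degrees
```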

Vectors are actually projections of the direction of maximum discrimination, or slope (i.e., the θ1, θ2 composite), onto the latent ability plane. [Figure: the response surface, the direction of maximum slope, and the projected item vector.]

Contour Plot of Item Response Surface (a1 = 1.8, a2 = 1.0, d = 0.8). [Figure: the item response vector with its tail on the p = .5 equiprobability contour.]

By color coding the vectors to match different content areas, we can address questions such as: Are items from a certain content area more discriminating or more difficult? Do items from different content areas measure different ability composites? How similar are the vector profiles for different yet parallel forms?

Example of item vectors for the 101-item LSAT. [Figure: difficult items lie toward the first quadrant; easy items toward the third.]

3. Differential Item Functioning from a multidimensional perspective

DIF Analyses DIF is examined in terms of differential performance between two identified groups, usually denoted the Reference Group and the Focal Group. DIF analyses typically focus on one item at a time using conditional analyses, in which intermediate statistics are calculated for each raw score category and then summed. Although there are many types of DIF analyses, today I will focus on two dichotomously scored approaches: SIBTEST and Mantel-Haenszel. At the heart of each conditional analysis is a 2 × 2 contingency table.

Mantel-Haenszel DIF Statistic

$$\mathrm{MH} = \frac{\left(\left|\sum_j A_j - \sum_j E(A_j)\right| - \tfrac{1}{2}\right)^2}{\sum_j \mathrm{Var}(A_j)}$$

where the expected value of the cell A frequency is

$$E(A_j) = \frac{N_{Rj}\, N_{1.j}}{N_{..j}}$$

and the variance of the cell A frequencies equals

$$\mathrm{Var}(A_j) = \frac{N_{Rj}\, N_{Fj}\, N_{1.j}\, N_{0.j}}{N_{..j}^{2}\,(N_{..j} - 1)}$$

2 × 2 Contingency Table for the jth Score Category:

Group          Item Score = 1   Item Score = 0   Total
Reference (R)  A_j              B_j              N_Rj
Focal (F)      C_j              D_j              N_Fj
Total          N_1.j            N_0.j            N_..j
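A compact sketch of this computation (Python; my own illustration with made-up counts), taking one 2 × 2 table per raw-score category:

```python
def mantel_haenszel(tables):
    """Continuity-corrected MH chi-square from a list of 2x2 tables,
    one per raw-score category j. Each table is (A, B, C, D):
    Reference right/wrong counts, then Focal right/wrong counts."""
    sum_A = sum_E = sum_V = 0.0
    for A, B, C, D in tables:
        n_R, n_F = A + B, C + D            # group totals at this score
        n_1, n_0 = A + C, B + D            # right/wrong totals
        n = n_R + n_F                      # all examinees at this score
        sum_A += A
        sum_E += n_R * n_1 / n             # E(A_j)
        sum_V += n_R * n_F * n_1 * n_0 / (n**2 * (n - 1))  # Var(A_j)
    return (abs(sum_A - sum_E) - 0.5)**2 / sum_V

# Hypothetical counts for three score categories:
print(mantel_haenszel([(30, 10, 20, 20), (25, 15, 15, 25), (35, 5, 30, 10)]))
```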

SIBTEST DIF Statistic The SIBTEST test statistic is calculated as

$$B_U = \frac{\hat{\beta}_U}{\hat{\sigma}(\hat{\beta}_U)}$$

where

$$\hat{\sigma}(\hat{\beta}_U) = \left[\sum_{h=0}^{n} \hat{p}_h^{\,2}\left(\frac{1}{G_{Rh}}\,\hat{\sigma}^2(Y \mid R, h) + \frac{1}{G_{Fh}}\,\hat{\sigma}^2(Y \mid F, h)\right)\right]^{1/2}$$

An estimate of the numerator of the SIBTEST test statistic is

$$\hat{\beta}_U = \sum_{h=0}^{n} \hat{p}_h\left(\bar{Y}^{*}_{Rh} - \bar{Y}^{*}_{Fh}\right)$$

where

$$\hat{p}_h = \frac{G_{Rh} + G_{Fh}}{\sum_{j=0}^{n}\left(G_{Rj} + G_{Fj}\right)}$$

and $G_{Rh}$ and $G_{Fh}$ are the numbers of examinees in the reference and focal groups at valid score X = h.
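A minimal sketch of the final aggregation step (Python; my own illustration). Note that it takes the adjusted score-group means Ȳ* as given and omits the regression correction SIBTEST applies to produce them:

```python
import numpy as np

def sibtest_b_u(p_hat, ybar_star_R, ybar_star_F, var_R, var_F, G_R, G_F):
    """B_U = beta_hat / sigma_hat, from arrays indexed by valid score h:
    p_hat        proportion of examinees at each valid score h
    ybar_star_*  regression-adjusted item means by group (assumed precomputed)
    var_*        item-score variances by group; G_* group counts at each h."""
    beta_hat = np.sum(p_hat * (ybar_star_R - ybar_star_F))
    sigma_hat = np.sqrt(np.sum(p_hat**2 * (var_R / G_R + var_F / G_F)))
    return beta_hat / sigma_hat
```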


Key Ingredient in DIF Analyses: the Conditioning Variable. DIF occurs because the conditioning variable does not capture all of the skills (the complete latent space) that the groups of examinees used in responding to the test items. Several studies have examined conditioning scores and how to account for all the skills examinees used in responding to items on a test.

Shin (1992); Zwick & Ercikan (1989). [Figures: results of conditioning on Skill 1 versus conditioning on Skill 2.]

Ackerman & Evans 1994 DIF Study. [Figures: the generated ability distributions and the generated items.]

Conditioning on θ2. [Figure: item vectors showing the valid skill, the valid composite direction, and DIF items along the invalid skill.]

Conditioning on θ1. [Figure: item vectors showing the valid skill, the valid composite direction, and DIF items along the invalid skill.]

Conditioning on raw score. [Figure: two sets of DIF items along the invalid skill directions.]

Conditioning on θ1 and θ2. [Figure: all items (composites) are valid; both valid skills are marked.]

4. DIF: compensatory processing versus noncompensatory processing

Identical Generating Distributions (N = 1000 per group)

Group  Ability   Mean   Std Dev   r(θ1, θ2)
REF    Theta 1   .08    1.00      .35
       Theta 2   .01    .98
FOC    Theta 1   .06    1.00      .39
       Theta 2   .02    1.01
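A sketch of how abilities like these could be generated (Python/NumPy; my own illustration of the setup described in the table, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2024)

def gen_abilities(n, mean1, mean2, sd1, sd2, rho):
    """Draw n correlated (theta1, theta2) pairs from a bivariate normal."""
    cov = rho * sd1 * sd2
    return rng.multivariate_normal(
        [mean1, mean2], [[sd1**2, cov], [cov, sd2**2]], size=n)

ref = gen_abilities(1000, .08, .01, 1.00, .98, .35)   # reference group
foc = gen_abilities(1000, .06, .02, 1.00, 1.01, .39)  # focal group
```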

Vectors of Generated Items n = 30

Compensatory Item 13: a1 = .4, a2 = .4, d = .0. Noncompensatory Item 13: a1 = 1.2, a2 = 1.2, b1 = .0, b2 = .0.

Compensatory Item 14: a1 = .8, a2 = .8, d = .0. Noncompensatory Item 14: a1 = 0.8, a2 = 0.8, b1 = .0, b2 = .0.

Compensatory Item 15: a1 = 1.2, a2 = 1.2, d = .0. Noncompensatory Item 15: a1 = 1.2, a2 = 1.2, b1 = .0, b2 = .0.

Compensatory Item 16: a1 = 1.6, a2 = 1.6, d = .0. Noncompensatory Item 16: a1 = 1.6, a2 = 1.6, b1 = .0, b2 = .0.

[Figure: back-to-back raw score frequency distributions (scores 0–30) for the Reference and Focal groups.]

Classical item statistics by group (the Reference group responded under the compensatory model; the Focal group under the noncompensatory model):

Item   Reference (Compensatory)    Focal (Noncompensatory)
       p-value    biserial         p-value    biserial
13     .49        .55              .26        .44
14     .46        .83              .27        .71
15     .48        .94              .30        .79
16     .47        .98              .30        .84

[Figures: plots for Item 2 and Item 15.]

ETS DIF Classification Categories

Category   MH D-DIF value             During Test Assembly                                           Action Before Score Reporting
A          |MH D-DIF| < 1.0           Select freely                                                  —
B          1.0 ≤ |MH D-DIF| < 1.5     If possible, select an equivalent item with smaller MH D-DIF   —
C          |MH D-DIF| ≥ 1.5           Select ONLY if essential; independent reviewer required        Independent reviewer required
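A sketch of this classification rule (Python; simplified — the operational ETS rule also incorporates statistical significance tests, which are omitted here):

```python
def ets_category(mh_d_dif):
    """Classify an item by the magnitude of its MH D-DIF value
    (the MH common odds ratio mapped onto the ETS delta scale,
    D-DIF = -2.35 * ln(alpha_MH))."""
    magnitude = abs(mh_d_dif)
    if magnitude < 1.0:
        return "A"   # negligible DIF: select freely
    if magnitude < 1.5:
        return "B"   # moderate DIF: prefer an equivalent item with smaller D-DIF
    return "C"       # large DIF: select only if essential; independent review

print([ets_category(x) for x in (0.3, -1.2, 1.8)])  # ['A', 'B', 'C']
```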


5. Example applications

Situations in which compensation differences between subgroups could occur: 1. Teaching Literacy: phonemic awareness; phonics; reading fluency, including oral reading skills; vocabulary development; reading comprehension strategies. 2. English Language Learners: students whose first language is not English. 3. Teacher Training: content knowledge vs. pedagogical knowledge (Praxis II).

6. Conclusion and future directions

Conclusions DIF is a very perplexing analysis to perform. Quite often, when we identify items that favor one group or another, we still cannot determine what caused the DIF. Hopefully, by applying multidimensional modeling we can expand on why groups of students perform differentially. Such analyses, especially those involving compensation and the lack of it, could be very instructive and prescriptive for teachers and help inform pedagogical practice.

Future Work More work needs to be done on how best to represent items in a noncompensatory framework. I am working closely with my doctoral students to examine the ways we feel DIF can occur through lack of compensation, including developing items whose distractors represent varying degrees of compensation. One of my students is also using latent class mixture models with the compensatory and noncompensatory MIRT models to identify classes of students who lack requisite skills and thus face noncompensatory testing scenarios.

Pop Quiz

Being the great psychometricians that you are, which group do you think this ACT item favored: Whites? Blacks? Males? Females? No DIF? Answer: BLACK EXAMINEES.

Which group do you think this ACT item favored: Whites? Blacks? Males? Females? No DIF? A rectangular 8-inch by 10-inch picture is to be framed with a 3-inch border all the way around it. How many more square inches of wall space will be covered by the framed picture than by the picture alone? a) 24 b) 48 c) 54 d) 108 e) 144. Answer: WHITE EXAMINEES.
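For reference, the keyed answer works out as follows (a worked check, not part of the original slide): the 3-inch border adds 6 inches to each dimension, so

$$(8 + 6)(10 + 6) - 8 \cdot 10 = 14 \cdot 16 - 80 = 224 - 80 = 144,$$

which is option (e).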

For questions or comments please email me at taackerm@uncg.edu

References

Ackerman, T. A. (1989). Unidimensional IRT calibration of compensatory and noncompensatory multidimensional items. Applied Psychological Measurement, 13, 113-127.

Ackerman, T. A. (1992). An explanation of differential item functioning from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91.

Ackerman, T. A. (1994). The influence of conditioning scores in performing DIF analyses. Applied Psychological Measurement, 18(4), 329-342.

Ackerman, T. A., & Evans, J. A. (1992, April). An investigation of the relationship between reliability, power, and the Type I error rate of the Mantel-Haenszel and simultaneous item bias detection procedures. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco.

Ackerman, T. A., & Henson, R. A. (2014). Graphical representations of items and tests that are measuring multiple abilities. Proceedings of the Psychometric Society, IMPS 2013.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Erlbaum.

Shin, S. (1992). An empirical investigation of the robustness of the Mantel-Haenszel procedure and sources of differential item functioning. Dissertation Abstracts International, 53A, 3504.

Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement, 26, 55-66.

"Teşekkürler"