Differential Item Functioning

Similar documents
Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Mantel-Haenszel Procedures for Detecting Differential Item Functioning

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

Section 5. Field Test Analyses

Assessing Measurement Invariance in the Attitude to Marriage Scale across East Asian Societies. Xiaowen Zhu. Xi an Jiaotong University.

Differential Item Functioning from a Compensatory-Noncompensatory Perspective

THE MANTEL-HAENSZEL METHOD FOR DETECTING DIFFERENTIAL ITEM FUNCTIONING IN DICHOTOMOUSLY SCORED ITEMS: A MULTILEVEL APPROACH

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Published by European Centre for Research Training and Development UK (

USE OF DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS FOR BIAS ANALYSIS IN TEST CONSTRUCTION

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment

Modeling DIF with the Rasch Model: The Unfortunate Combination of Mean Ability Differences and Guessing

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

10. LINEAR REGRESSION AND CORRELATION

Technical Specifications

Influences of IRT Item Attributes on Angoff Rater Judgments

A Comparison of Several Goodness-of-Fit Statistics

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Empowered by Psychometrics The Fundamentals of Psychometrics. Jim Wollack University of Wisconsin Madison

THE APPLICATION OF ORDINAL LOGISTIC HEIRARCHICAL LINEAR MODELING IN ITEM RESPONSE THEORY FOR THE PURPOSES OF DIFFERENTIAL ITEM FUNCTIONING DETECTION

Simple Linear Regression the model, estimation and testing

Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking

An Introduction to Missing Data in the Context of Differential Item Functioning

Gender-Based Differential Item Performance in English Usage Items

Structural Equation Modeling (SEM)

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

INVESTIGATING FIT WITH THE RASCH MODEL. Benjamin Wright and Ronald Mead (1979?) Most disturbances in the measurement process can be considered a form

Sensitivity of DFIT Tests of Measurement Invariance for Likert Data

Thank You Acknowledgments

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Introduction to Multilevel Models for Longitudinal and Repeated Measures Data

Impact of Differential Item Functioning on Subsequent Statistical Conclusions Based on Observed Test Score Data. Zhen Li & Bruno D.

GENERALIZABILITY AND RELIABILITY: APPROACHES FOR THROUGH-COURSE ASSESSMENTS

Academic Discipline DIF in an English Language Proficiency Test

International Journal of Education and Research Vol. 5 No. 5 May 2017

On indirect measurement of health based on survey data. Responses to health related questions (items) Y 1,..,Y k A unidimensional latent health state

Biostatistics II

The Use of Unidimensional Parameter Estimates of Multidimensional Items in Adaptive Testing

Introduction to Multilevel Models for Longitudinal and Repeated Measures Data

Manifestation Of Differences In Item-Level Characteristics In Scale-Level Measurement Invariance Tests Of Multi-Group Confirmatory Factor Analyses

Measuring and Assessing Study Quality

The Influence of Conditioning Scores In Performing DIF Analyses

Differential Item Functioning Amplification and Cancellation in a Reading Test

Proceedings of the 2011 International Conference on Teaching, Learning and Change (c) International Association for Teaching and Learning (IATEL)

A Differential Item Functioning (DIF) Analysis of the Self-Report Psychopathy Scale. Craig Nathanson and Delroy L. Paulhus

Instrument equivalence across ethnic groups. Antonio Olmos (MHCD) Susan R. Hutchinson (UNC)

Development, Standardization and Application of

Methodology for Non-Randomized Clinical Trials: Propensity Score Analysis Dan Conroy, Ph.D., inventiv Health, Burlington, MA

Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales

Cross-cultural DIF; China is group labelled 1 (N=537), and USA is group labelled 2 (N=438). Satisfaction with Life Scale

A Modified CATSIB Procedure for Detecting Differential Item Function. on Computer-Based Tests. Johnson Ching-hong Li 1. Mark J. Gierl 1.

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model

Comparing DIF methods for data with dual dependency

André Cyr and Alexander Davies

Linking Mixed-Format Tests Using Multiple Choice Anchors. Michael E. Walker. Sooyeon Kim. ETS, Princeton, NJ

Table of Contents. Preface to the third edition xiii. Preface to the second edition xv. Preface to the fi rst edition xvii. List of abbreviations xix

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

AP Stats Chap 27 Inferences for Regression

Research Report No Using DIF Dissection Method to Assess Effects of Item Deletion

A Monte Carlo Study Investigating Missing Data, Differential Item Functioning, and Effect Size

Psychometrics for Beginners. Lawrence J. Fabrey, PhD Applied Measurement Professionals

Addendum: Multiple Regression Analysis (DRAFT 8/2/07)

Small Group Presentations

Scale Building with Confirmatory Factor Analysis

María Verónica Santelices 1 and Mark Wilson 2

Examining the Validity and Fairness of a State Standards-Based Assessment of English-Language Arts for Deaf or Hard of Hearing Students

Chapter 11 Multiple Regression

Known-Groups Validity 2017 FSSE Measurement Invariance

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Getting a DIF Breakdown with Lertap

Centre for Education Research and Policy

6. Unusual and Influential Data

linking in educational measurement: Taking differential motivation into account 1

Business Statistics Probability

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

Investigating the Invariance of Person Parameter Estimates Based on Classical Test and Item Response Theories

Computerized Mastery Testing

A MONTE CARLO STUDY OF MODEL SELECTION PROCEDURES FOR THE ANALYSIS OF CATEGORICAL DATA

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

12/30/2017. PSY 5102: Advanced Statistics for Psychological and Behavioral Research 2

q3_2 MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Overview of Non-Parametric Statistics

Unit 1 Exploring and Understanding Data

Connexion of Item Response Theory to Decision Making in Chess. Presented by Tamal Biswas Research Advised by Dr. Kenneth Regan

The Pretest! Pretest! Pretest! Assignment (Example 2)

Regression Discontinuity Analysis

Item-Rest Regressions, Item Response Functions, and the Relation Between Test Forms

Multifactor Confirmatory Factor Analysis

Item Position and Item Difficulty Change in an IRT-Based Common Item Equating Design

The Effects of Controlling for Distributional Differences on the Mantel-Haenszel Procedure. Daniel F. Bowen. Chapel Hill 2011

Lecture 12: more Chapter 5, Section 3 Relationships between Two Quantitative Variables; Regression

Basic concepts and principles of classical test theory

Comprehensive Statistical Analysis of a Mathematics Placement Test

THE DEVELOPMENT AND VALIDATION OF EFFECT SIZE MEASURES FOR IRT AND CFA STUDIES OF MEASUREMENT EQUIVALENCE CHRISTOPHER DAVID NYE DISSERTATION

Effects of Local Item Dependence

Transcription:

Differential Item Functioning
Lecture #11, ICPSR Item Response Theory Workshop

Lecture Overview
- Detection of differential item functioning (DIF)
- Distinguishing bias from DIF
- Test-level versus item-level bias
- Revisiting some test score equating issues

DIFFERENTIAL ITEM FUNCTIONING

Outline
- Review relevant assumptions
- The concept of potentially biased test items (or DIF, as we'll call it)
- IRT-based methods of detecting DIF

Some Notes
We will focus on:
- DIF with 0/1 (dichotomously scored) test items; DIF with polytomous items is more complicated, though similar approaches are used
- IRT methods only
- DIF as a statistical issue; interpreting "why" an item shows DIF can be quite a bit trickier

Why Study DIF?
- We're often interested in comparing cultural, ethnic, or gender groups
- Meaningful comparisons require that measurement equivalence holds
- Classical test theory methods confound bias with true mean differences; IRT methods do not
- In IRT terminology, item/test bias is referred to as DIF/DTF

Definition of DIF
- A test item is labeled as DIF when people with equal ability, but from different groups, have an unequal probability of item success
- A test item is labeled as non-DIF if examinees of the same ability have an equal probability of getting the item correct, regardless of group membership

DIF versus DTF
- DIF is found by examining differences in item characteristic curves (ICCs) across groups
- Differential test functioning (DTF) is the analogous procedure for determining differences in test characteristic curves (TCCs)
- DTF is arguably more important because it speaks to impact: DIF in one item may be statistically significant but have little practical impact on test results
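To make the definition concrete, here is a minimal sketch in Python (all item parameters are hypothetical): a non-DIF item gives matched-ability examinees identical success probabilities, while shifting b for one group produces DIF.

```python
import numpy as np

def icc(theta, a, b, c=0.0):
    """Three-parameter logistic ICC: P(correct | theta)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 121)            # matched ability levels

# No DIF: both groups share the same item parameters
p_ref = icc(theta, a=1.2, b=0.0)
p_foc = icc(theta, a=1.2, b=0.0)
print(np.max(np.abs(p_ref - p_foc)))       # 0.0 -- identical ICCs

# Uniform DIF: the item is harder for the focal group (b shifted up)
p_foc_dif = icc(theta, a=1.2, b=0.5)
print(np.max(np.abs(p_ref - p_foc_dif)))   # > 0 -- ICCs differ at matched theta
```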

Classical Approach to DIF
- Compare item p-values (proportions correct) in the two groups
- Criticism: a p-value difference may be due to real and important group differences
- One remedy is to compare the p-value difference for one item against the differences for the other items; this provides a baseline for interpreting the difference (a toy version is sketched below)
- But item p-value differences will still be confounded by discrimination differences
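A toy version of that baseline idea, with invented p-values; the point is only that one item's gap is judged against the gaps of the other items, not that this is a rigorous test.

```python
import numpy as np

# Hypothetical proportion-correct (p) values for 8 items in each group
p_ref = np.array([0.82, 0.74, 0.66, 0.59, 0.55, 0.48, 0.40, 0.31])
p_foc = np.array([0.78, 0.70, 0.61, 0.55, 0.50, 0.44, 0.20, 0.27])

gaps = p_ref - p_foc
baseline = np.median(gaps)   # typical gap, reflecting any true group difference
flagged = np.where(gaps - baseline > 2 * gaps.std())[0]
print(flagged)               # item 6 stands out against the baseline
```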

Important Assumptions
- Unidimensionality
- Parameter invariance
Violations of these assumptions across identifiable groups essentially provide us with a definition of DIF.

Unidimensionality
- The test measures ONE construct (e.g., math proficiency, verbal ability, a trait)
- Group membership (e.g., gender, ethnicity, high/low ability) should not differentially affect success on the test (or on an item)
- If group membership affects or explains performance, then the assumptions of the model are not being met
- DIF = unintended dimensionality

Parameter Invariance
- Invariance of the item parameters (a, b, c)
- Compare item statistics obtained in two or more groups (e.g., high- and low-performing groups, ethnicity, gender)
- If the model fits, different samples should produce close to the same item parameter estimates
- Scatterplots of b vs. b, a vs. a, and c vs. c from one sample versus the other should be strongly linearly related (see the sketch below)
- The relationship won't be perfect: sampling error enters the picture
- Estimates far from the best-fit line represent a violation of invariance

[Figure: b-parameter scatterplot illustrating parameter invariance]
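A rough sketch of the scatterplot check, using made-up b estimates for ten items calibrated separately in two groups; the item far from the best-fit line is the candidate invariance violation.

```python
import numpy as np

# Hypothetical b estimates for the same 10 items, calibrated separately by group
b_g1 = np.array([-1.8, -1.2, -0.7, -0.3, 0.0, 0.4, 0.8, 1.1, 1.6, 2.0])
b_g2 = np.array([-1.7, -1.1, -0.6, -0.4, 0.1, 0.5, 1.5, 1.2, 1.5, 2.1])

# Best-fit line of group-2 estimates on group-1 estimates
slope, intercept = np.polyfit(b_g1, b_g2, 1)
residuals = b_g2 - (slope * b_g1 + intercept)

# Items far from the line (> 2 SDs) are candidate invariance violations
print(np.where(np.abs(residuals) > 2 * residuals.std())[0])   # flags item 6
```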

Check Item Invariance
If:
- the test is unidimensional, and
- the parameters are invariant across groups,
then, even though there may be differences in the distributions of ability, similar (linearly related) item parameter estimates will be obtained.

Two groups may have different distributions for the trait being measured, but the same model should fit.

Frequencies may differ, but matched-ability groups should have the same probability of success on the item.

Check Item Invariance: Differential Item Functioning (DIF)
- DIF violates parameter invariance because the items aren't unidimensional across groups

DIF in Practice
- Reference versus focal groups, e.g.:
  - Males vs. females
  - Majority vs. minority
  - English vs. ESL groups
- One example of DIF is item drift due to over- or under-emphasis of the measured content over time

[Figure: parameter drift; items showing drift may have to be removed]

DIF in Practice
- Differential item functioning, NOT "item bias," is the term used
- DIF is a statistical property: matched-ability groups have differential probabilities of success on an item
- Bias is a substantive interpretation of any DIF that may be observed

DIF in Practice
- DIF studies are absolutely essential in high-stakes testing programs
- Potential gender and/or ethnicity bias could negatively affect one or more groups in a construct-irrelevant way
- We're only interested in theta!

Identifying DIF
- Uniform DIF: the item is systematically more difficult for members of one group, even after matching examinees on ability (theta)
- Cause: a shift in the b parameter

Identifying DIF
- Non-uniform DIF: the shift in item difficulty is not consistent across the ability continuum
- An increase (or decrease) in the probability of success for low-ability examinees is offset by the converse for high-ability examinees
- Cause: a shift in the a parameter (and possibly b)

Steps in a DIF Study
- Identify the reference and focal groups of interest (usually two at a time)
- Design the DIF study to have samples as large as possible
- Choose DIF statistics appropriate for the data
- Carry out the statistical analyses
- Interpret the DIF statistics; delete or revise items as necessary

Sample Size vs. Effect Size
- CAUTION: don't assume that DIF found with statistical tests necessarily signals a problem; with big samples, there is power to detect even small conditional differences
- The flip side: with small samples, there may not be sufficient power to identify truly problematic items

IRT DIF Analyses
- Combine data from both groups and obtain c parameter estimates
- Fix the c parameters and obtain a and b parameter estimates in each group from separate calibrations
- If a multivariate DIF test is done, two completely separate calibrations may be conducted (though not always)

IRT DIF Analyses
- Before the ICCs in the two groups can be compared, they must be placed onto the same scale (equated)
- Typically, item parameter estimates from the focal group (minority) are placed onto the scale of the reference group (majority)

Mean and Sigma Equating Method
After separate calibrations, determine the linear transformation that matches the mean and SD of the anchor items' b values across administrations:

    x = s_A / s_B
    y = m_A - x * m_B
    b_new = x * b_old + y

This transformation places the item parameters from Form B onto the scale of the item parameters from Form A.

Mean & Sigma Example (25 linking items)
- Form A: mean b = -0.06, SD of b = 1.03
- Form B: mean b = -0.25, SD of b = 1.12
- x = 1.03 / 1.12 = 0.92
- y = -0.06 - (0.92 * -0.25) = 0.17
- b_new = 0.92 * b_old + 0.17
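A minimal numeric check of the mean-and-sigma example above:

```python
# Mean & sigma linking, mirroring the 25-anchor-item example above
m_A, s_A = -0.06, 1.03    # Form A anchor b's: mean, SD
m_B, s_B = -0.25, 1.12    # Form B anchor b's: mean, SD

x = s_A / s_B             # slope     ~ 0.92
y = m_A - x * m_B         # intercept ~ 0.17

b_old = -0.25             # a Form B b-parameter (here, the anchor mean)
b_new = x * b_old + y     # placed onto the Form A scale
print(round(x, 2), round(y, 2), round(b_new, 2))   # 0.92 0.17 -0.06
```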

M & S Transformation of Form B Linking b-Parameters
- The intercept (y) is above zero (i.e., the common items were easier for Group B); this indicates that Group B had higher ability than Group A
- After the item parameters are adjusted, the same transformation is applied throughout; Group B will now look more able (as they should)

Sample Size Issue
- We may not have enough data to run two calibrations (e.g., the focal-group N may be small)
- One solution: estimate the ICCs with just the majority group, score the minority group with those parameters, and examine the minority group's residuals against the model-predicted ICC (see the sketch below)

[Figure: residuals showing a pattern of non-uniform DIF for the minority group]
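A sketch of that residual idea under assumed values: responses for the (small) focal group are generated from a flatter ICC than the reference-group calibration, so the binned observed-minus-predicted residuals should flip sign across the ability range, the signature of non-uniform DIF.

```python
import numpy as np

rng = np.random.default_rng(7)

def icc(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

a_ref, b_ref = 1.1, 0.2                    # reference-group estimates (hypothetical)
theta_foc = rng.normal(-0.3, 1.0, 150)     # small focal group, scored with ref params
p_model = icc(theta_foc, a_ref, b_ref)     # model-predicted probabilities
y_obs = rng.binomial(1, icc(theta_foc, 0.6, 0.2))  # data follow a flatter ICC

# Observed-minus-predicted residuals within ability intervals
bins = np.linspace(-3, 3, 7)
idx = np.digitize(theta_foc, bins)
for k in range(1, len(bins)):
    mask = idx == k
    if mask.sum() >= 10:                   # skip sparsely populated intervals
        print(f"theta in [{bins[k-1]:+.1f}, {bins[k]:+.1f}): "
              f"residual = {y_obs[mask].mean() - p_model[mask].mean():+.3f}")
```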

Important Note
- Group differences do NOT (necessarily) mean DIF (or bias, for that matter)
- Two groups can differ in their means, and if the model fits, the same ICCs will be estimated

Important Note
- The point: if, after being matched on ability, examinees produce different ICCs, the item displays DIF
- If matched-ability examinees have the same probability of success on the item, there is no DIF, even if one group is more able than the other

[Figure: No DIF: despite group differences, one ICC fits both groups]
[Figure: DIF: the item performs differentially even after matching on ability]

IRT DIF Analyses
- After equating the parameters, compare the items across groups using one of the methods we'll discuss below
- Optional: delete items with large DIF statistics, rerun the analyses, and re-estimate the equating constants
- Carefully review any flagged items for substantive interpretation

Two-Stage Methods
- Conduct a DIF analysis
- Flag DIF items and temporarily remove them from the matching criterion (raw score or theta)
- Re-evaluate DIF with the new, purified criterion

DIF Detection Methods
- Mantel-Haenszel: condition on raw score; statistical test of contingency tables (sketched below)
- Logistic regression: condition on raw score; model the group-by-response relationship
- IRT methods: condition on ability (theta); compare item parameters or ICCs

IRT DIF Detection: Comparing Item Parameter Estimates
- Multivariate test (b, a, and c)
- t-tests on b values
- Area methods:
  - Total area (e.g., Raju, 1988, 1990)
  - Squared differences
  - Weighted areas and differences
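For contrast with the IRT methods, a bare-bones Mantel-Haenszel common odds ratio over invented score strata, with the ETS delta transform shown for orientation (thresholds such as |delta| >= 1.5 for large DIF are conventions, not part of this lecture):

```python
import numpy as np

# Hypothetical 2x2 tables, one per matched total-score stratum:
# rows = [reference, focal]; columns = [correct, incorrect]
tables = [
    np.array([[30, 20], [22, 28]]),
    np.array([[45, 15], [35, 25]]),
    np.array([[55,  5], [48, 12]]),
]

num = sum(t[0, 0] * t[1, 1] / t.sum() for t in tables)   # A*D / T per stratum
den = sum(t[0, 1] * t[1, 0] / t.sum() for t in tables)   # B*C / T per stratum
alpha_mh = num / den                   # common odds ratio across strata
delta_mh = -2.35 * np.log(alpha_mh)    # ETS delta metric; sign shows favored group
print(round(alpha_mh, 2), round(delta_mh, 2))
```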

Parameter Comparisons
- Null hypothesis H0: b1 = b2, a1 = a2, c1 = c2
- Alternative H1: b1 ≠ b2, a1 ≠ a2, or c1 ≠ c2

Parameter Comparisons
- Chi-squared test of the parameter differences, after the parameters have been equated (see the sketch below), with df = p, the number of parameters compared; it is sometimes done with a common c, which is left out (df = 2)
- Criticism: the asymptotic properties are known, but it is not clear how big a sample is needed in practice
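A sketch of one such chi-squared comparison (often attributed to Lord), with invented estimates and diagonal sampling covariances; in practice the covariance matrices come from the calibration output.

```python
import numpy as np
from scipy import stats

# Equated (a, b) estimates for one item in each group (hypothetical; c held common)
est_ref = np.array([1.10, 0.20])
est_foc = np.array([0.95, 0.55])
cov_ref = np.diag([0.010, 0.015])    # sampling covariance of the reference estimates
cov_foc = np.diag([0.012, 0.020])    # sampling covariance of the focal estimates

v = est_ref - est_foc                # vector of parameter differences
chi2 = v @ np.linalg.inv(cov_ref + cov_foc) @ v
df = len(v)                          # df = number of parameters compared (2 here)
print(round(chi2, 2), round(stats.chi2.sf(chi2, df), 4))
```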

Parameter Comparisons
- t-test of the difference between b parameters, after the parameters have been equated
- This very simple procedure may be informative for identifying items that call for a closer look, but it is uncommon to rely on it alone
- It doesn't account for the a and c parameters, which may vary even for a fixed b value

IRT Area Methods
- These methods are more common: they compare an ICC from one group against the ICC from the other and measure how much area lies between the two
- They don't depend on differences in parameters, but on actual differences in conditional probability
- The basic approach is to evaluate the amount of space between the two ICCs: the smaller, the better
- Recall that if the item parameters were invariant across groups, the two ICCs would coincide


Calculating Area
The unsigned area between the ICCs is defined as

    Area = Σ_θ | P_ref(θ) − P_foc(θ) | · Δθ

where the user must choose the range of θ (e.g., [−3, 3]) and the size of the interval Δθ (e.g., 0.01).
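A direct numeric version of this formula, using the range and interval size from the slide and hypothetical parameters for the two groups:

```python
import numpy as np

def icc(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

step = 0.01
theta = np.arange(-3.0, 3.0 + step, step)        # user-chosen range [-3, 3]

p_ref = icc(theta, a=1.2, b=0.0)
p_foc = icc(theta, a=0.8, b=0.3)                 # hypothetical non-uniform DIF

area = np.sum(np.abs(p_ref - p_foc)) * step      # unsigned (absolute) area
print(round(area, 3))
```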

DIF, DTF Interpretation
- DTF can be examined in the same way, using a comparison of TCCs instead of ICCs
- How much is too much? There is no significance test, so one common approach is to determine the amount of DIF present in simulated non-DIF data (sketched below)

Simulated Data for Comparison
- Combine the groups and estimate item and ability parameters
- Use the item and person parameter estimates to simulate item response data
- Any DIF present in the simulated data is due purely to statistical artifact
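A self-contained sketch of the simulation idea: repeatedly generate no-DIF data, compute the area statistic each time, and take an upper percentile as the flagging cutoff. For simplicity this matches on true theta within intervals; a full implementation would re-calibrate in each replication.

```python
import numpy as np

rng = np.random.default_rng(0)

def icc(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

bins = np.linspace(-3, 3, 13)            # 12 matching intervals on theta
mids = (bins[:-1] + bins[1:]) / 2
width = bins[1] - bins[0]
n = 100                                  # examinees per group per interval

areas = []
for _ in range(500):                     # 500 simulated no-DIF replications
    p = icc(mids, a=1.0, b=0.0)          # both groups share the same item
    p_ref_hat = rng.binomial(n, p) / n   # empirical proportions correct
    p_foc_hat = rng.binomial(n, p) / n
    areas.append(np.sum(np.abs(p_ref_hat - p_foc_hat)) * width)

cutoff = np.percentile(areas, 95)        # any real item above this is flagged
print(round(cutoff, 3))
```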

Simulated Data for Comparison
- For each item, determine the amount of DIF present, which can only be a result of sampling error
- This amount of DIF becomes the cutoff: any real item with greater DIF is flagged

Other Area Methods
- Taking the absolute value of each difference is the most common approach
- Less common:
  - Signed DIF: positives and negatives can cancel out (with non-uniform DIF)
  - Squared DIF: less intuitive to interpret
  - Weighted DIF: the magnitudes of the differences are weighted by conditional sample sizes

Practical Concerns
- When all is said and done, after an item has been identified as showing DIF, the question becomes: why?
- In the absence of an interpretation, it is difficult to justify deleting an item

Example: DIF against females (from Holland & Wainer, 1993)
  Decoy : Duck ::
  (A) Net : Butterfly
  (B) Web : Spider
  (C) Lure : Fish
  (D) Lasso : Rope
  (E) Detour : Shortcut

Practical Concerns
- Some essential items (due to content or statistical properties) may be left on the test, even with DIF
- This can be ameliorated by constructing forms balanced to include one or more other items that favor the focal group, making the TCCs match

CONCLUDING REMARKS

Concluding Remarks
- DIF is a heavily researched topic: there is no shortage of articles comparing methods or creating new ones
- The methods presented today are general but very useful
- The nature and spirit of DIF analysis is sometimes called invariance testing, more from a factor-analytic perspective; it pursues the same goal with a different set of parameters (i.e., intercepts and slopes)

Up Next: Equating Example