Differential Item Functioning from a Compensatory-Noncompensatory Perspective Terry Ackerman, Bruce McCollaum, Gilbert Ngerano University of North Carolina at Greensboro
Motivation for my Presentation Differential item functioning (DIF) has become a standard analysis in achievement testing. Its purpose is to ensure test impartiality and to identify unfair items that favor one group of examinees over another. Given the high stakes surrounding many educational tests today, DIF analyses have increased in importance. DIF has the greatest potential to occur when a test is multidimensional, that is, when it contains items that measure, to varying degrees, superfluous, invalid skills different from the purported purpose of the test.
Motivation for my Presentation If items measure invalid skills, and examinees differ on those skills, DIF is likely to result. Some examinees will end up getting those items right not because they are competent in the skill or composite of skills the test is meant to measure, but because they are more able on an unessential skill being measured by a DIF item.
Motivation for my Presentation A slightly strange example: me taking an algebra test written in Turkish. While I should be pretty good at algebra, I know close to nothing about the Turkish language. Thus, no matter how good I am at algebra, it cannot compensate for my not understanding Turkish, and my probability of a correct response on these items would be very low. I would be stuck in a noncompensatory situation.
Motivation for my Presentation Note that DIF cannot occur if a test is strictly unidimensional, measuring only one skill, factor, or trait. DIF is thought to occur when a test measures invalid skills and groups of examinees differ in their underlying distributions of abilities on those skills. Detecting DIF is a relatively easy process; the true challenge is determining what caused it!
Goal of my Presentation The purpose of my talk today is to explain a new cause of DIF. It is a situation in which a test measures both valid and invalid skills, and groups of examinees have identical underlying ability distributions but use the information presented in an item differently. That is, for some examinees an item is compensatory, while for other examinees (perhaps because of greater exposure to the requisite information, or because of instructional or pedagogical differences) the item is noncompensatory. To explain this new source of DIF, I'm going to back up a bit and give a brief background in multidimensional IRT modeling and DIF analyses.
Talk Outline (Konuşma Akışı)
1. Multidimensional Item Response Theory Models: a) Compensatory b) Noncompensatory
2. Item Representation in Multidimensional IRT
3. Differential Item Functioning (DIF) from a Multidimensional Perspective
4. DIF: Compensatory vs. Noncompensatory Processing
5. Example Applications
6. Conclusion and Future Directions
7. A short quiz ("Küçük sınav")
1. Multidimensional IRT Models: Compensatory vs. Noncompensatory
The Two-dimensional Compensatory Model The probability of examinee j correctly responding to item i can be expressed as:

P_{ij} = \frac{1}{1 + e^{-1.7\,(a_{1i}\theta_{1j} + a_{2i}\theta_{2j} + d_i)}}

Two discrimination parameters (a_{1i}, a_{2i}), two latent abilities (\theta_{1j}, \theta_{2j}), and one difficulty parameter (d_i).
The Two-dimensional Noncompensatory Model The probability of examinee j correctly responding to item i can be expressed as:

P_{ij} = \frac{1}{1 + e^{-1.7\,a_{1i}(\theta_{1j} - b_{1i})}} \cdot \frac{1}{1 + e^{-1.7\,a_{2i}(\theta_{2j} - b_{2i})}}

Two discrimination parameters (a_{1i}, a_{2i}), two latent abilities (\theta_{1j}, \theta_{2j}), and two difficulty parameters (b_{1i}, b_{2i}).
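The two models above can be sketched in a few lines of Python. This is a minimal illustration, not estimation code, and the parameter and ability values below are arbitrary:

```python
import math

def p_compensatory(theta1, theta2, a1, a2, d):
    """Two-dimensional compensatory model: abilities enter one linear
    composite, so a high theta1 can offset a low theta2."""
    z = 1.7 * (a1 * theta1 + a2 * theta2 + d)
    return 1.0 / (1.0 + math.exp(-z))

def p_noncompensatory(theta1, theta2, a1, a2, b1, b2):
    """Two-dimensional noncompensatory model: the probability is a
    product of per-dimension terms, so a deficit on either skill
    caps the overall probability."""
    p1 = 1.0 / (1.0 + math.exp(-1.7 * a1 * (theta1 - b1)))
    p2 = 1.0 / (1.0 + math.exp(-1.7 * a2 * (theta2 - b2)))
    return p1 * p2

# An examinee high on skill 1 but low on skill 2: compensation
# rescues the response probability only under the compensatory model.
print(p_compensatory(2.0, -2.0, 1.0, 1.0, 0.0))      # composite is 0 -> 0.5
print(p_noncompensatory(2.0, -2.0, 1.0, 1.0, 0.0, 0.0))  # near zero
```

This mirrors the Turkish-algebra example: in the product form, the low ability on one dimension drags the whole probability down no matter how high the other ability is.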
2. Multidimensional IRT: item representation
Mathematica Representations
Contour Plot of Item Response Surface (a1 = 1.50, a2 = 0.0, d = 0.3). This item discriminates only between levels of θ1. The steeper the surface, the more discriminating the item and the closer together the contours.
Contour Plot of Item Response Surface (a1 = 0.0, a2 = 0.8, d = 0.3). This item discriminates only between levels of θ2. The flatter the surface, the less discriminating the item and the farther apart the contours.
Contour Plot of Item Response Surface (a1 = 1.0, a2 = 1.0, d = 0.3). This item discriminates between an equal composite of θ1 and θ2. Notice that examinees with opposite ability profiles (low θ1/high θ2 versus high θ1/low θ2) have the same probability of a correct answer, i.e., compensation.
Noncompensatory Model Contour Plot of Item Response Surface (a1 = 1.0, a2 = 1.0, b1 = 0.0, b2 = 0.0). No compensation occurs for being high on only one ability: examinees who are low on either θ1 or θ2 have a low probability of a correct response.
Perhaps the best representation of two-dimensional items is the vector method. Each item is represented in the latent ability plane as a vector. All vectors lie on lines that pass through the origin. Vectors can lie only in the first and third quadrants, because the estimated a-parameters are constrained to be positive. Vectors representing easy items lie in the third quadrant; those representing difficult items lie in the first quadrant.
The length of the vector indicates how well an item discriminates between levels of skill. This value is called MDISC:

MDISC = \sqrt{a_1^2 + a_2^2}

The tail of the vector lies on the p = .5 equiprobability contour. The signed distance from the origin to this contour is denoted D:

D = \frac{-d}{\mathrm{MDISC}}

The angular direction, \alpha, indicates the composite of ability that the item best measures:

\alpha = \cos^{-1}\!\left(\frac{a_1}{\mathrm{MDISC}}\right)
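The three vector quantities are straightforward to compute. A minimal sketch, using the usual MIRT sign convention D = -d/MDISC (so positive D corresponds to a more difficult item):

```python
import math

def item_vector(a1, a2, d):
    """Vector summary of a two-dimensional compensatory item:
    MDISC (vector length), D (signed distance from the origin to the
    p = .5 contour), and alpha (angular direction, in degrees)."""
    mdisc = math.hypot(a1, a2)        # sqrt(a1^2 + a2^2)
    D = -d / mdisc                    # signed distance to the p = .5 contour
    alpha = math.degrees(math.acos(a1 / mdisc))
    return mdisc, D, alpha

# An item measuring an equal composite of the two abilities:
mdisc, D, alpha = item_vector(1.0, 1.0, 0.3)
print(mdisc, D, alpha)   # equal a-parameters -> alpha = 45 degrees
```

Items with alpha near 0 degrees measure mostly θ1; items near 90 degrees measure mostly θ2.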
Vectors are actually projections of the direction (i.e., the θ1, θ2 composite) of maximum discrimination, or slope, onto the latent ability plane. [Figure: item response surface with the direction of maximum slope and the projected item vector]
Contour Plot of Item Response Surface (a1 = 1.8, a2 = 1.0, d = 0.8), showing the item response vector and the p = .5 equiprobability contour.
By color coding the vectors to match different content areas, we can answer questions such as: Are items from a certain content area more discriminating or more difficult? Do items from different content areas measure different ability composites? How similar are the vector profiles for different yet parallel forms?
Example of item vectors for the 101-item LSAT. [Figure: difficult items lie in the first quadrant; easy items lie in the third]
3. Differential Item Functioning from a multidimensional perspective
DIF Analyses DIF is examined in terms of differential performance between two identified groups, usually denoted the Reference Group and the Focal Group. DIF analyses usually focus on one item at a time using conditional analyses, in which intermediate statistics are calculated for each raw score category and then summed. Although there are many types of DIF analyses, today I will focus on two dichotomously scored approaches: SIBTEST and Mantel-Haenszel. At the heart of each conditional analysis is a 2 x 2 contingency table.
Mantel-Haenszel DIF Statistic

MH = \frac{\left( \left| \sum_j A_j - \sum_j E(A_j) \right| - \tfrac{1}{2} \right)^2}{\sum_j \mathrm{Var}(A_j)}

where the expected value of the cell A frequency is

E(A_j) = \frac{N_{Rj} N_{1.j}}{N_{..j}}

and the variance of the cell A frequencies is

\mathrm{Var}(A_j) = \frac{N_{Rj} N_{Fj} N_{1.j} N_{0.j}}{N_{..j}^2 (N_{..j} - 1)}

2 x 2 Contingency Table for the jth Score Category

                   Item Score
Group              1         0         Total
Reference (R)      A_j       B_j       N_Rj
Focal (F)          C_j       D_j       N_Fj
Total              N_1.j     N_0.j     N_..j
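The conditional computation sums A_j, E(A_j), and Var(A_j) over score categories and then forms the chi-square. A minimal sketch, where the 2 x 2 counts below are hypothetical:

```python
def mantel_haenszel(tables):
    """Mantel-Haenszel chi-square from a list of 2x2 tables, one per
    raw-score category j: (A, B, C, D), where A, B are the reference
    group's correct/incorrect counts and C, D are the focal group's."""
    sum_A = sum_E = sum_var = 0.0
    for A, B, C, D in tables:
        n_R, n_F = A + B, C + D      # group totals N_Rj, N_Fj
        n_1, n_0 = A + C, B + D      # item-score totals N_1.j, N_0.j
        n = n_R + n_F                # N_..j
        if n < 2:
            continue                 # degenerate stratum contributes nothing
        sum_A += A
        sum_E += n_R * n_1 / n                               # E(A_j)
        sum_var += n_R * n_F * n_1 * n_0 / (n * n * (n - 1)) # Var(A_j)
    return (abs(sum_A - sum_E) - 0.5) ** 2 / sum_var

# Hypothetical counts for three score categories:
tables = [(20, 30, 10, 40), (35, 15, 25, 25), (45, 5, 38, 12)]
print(round(mantel_haenszel(tables), 3))
```

The resulting statistic is referred to a chi-square distribution with one degree of freedom; the 0.5 term is the continuity correction.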
SIBTEST DIF Statistic The SIBTEST test statistic is calculated as

B(U) = \frac{\hat{\beta}_U}{\hat{\sigma}(\hat{\beta}_U)}

An estimate of the numerator of the SIBTEST test statistic is

\hat{\beta}_U = \sum_{h=0}^{n} \hat{p}_h \left( \bar{Y}^*_{Rh} - \bar{Y}^*_{Fh} \right)

where

\hat{p}_h = \frac{G_{Rh} + G_{Fh}}{\sum_{j=0}^{n} (G_{Rj} + G_{Fj})}

and G_{Rh} and G_{Fh} are the numbers of examinees in the reference and focal groups at valid score X = h, and

\hat{\sigma}^2(\hat{\beta}_U) = \sum_{h=0}^{n} \hat{p}_h^2 \left[ \frac{1}{G_{Rh}} \hat{\sigma}^2(Y \mid R, h) + \frac{1}{G_{Fh}} \hat{\sigma}^2(Y \mid F, h) \right]
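The numerator of the SIBTEST statistic is a weighted difference, across valid-subtest score levels, of the groups' mean studied-item scores. A simplified sketch: operational SIBTEST applies a regression correction to the conditional means (the asterisked terms); this version uses the uncorrected group means, and the input data are hypothetical:

```python
def sibtest_beta(ref_scores_by_h, foc_scores_by_h):
    """Simplified SIBTEST beta-hat: each dict maps a valid-subtest score
    h to the list of 0/1 studied-item scores observed at that level for
    the reference (or focal) group. No regression correction is applied."""
    levels = sorted(set(ref_scores_by_h) & set(foc_scores_by_h))
    total = sum(len(ref_scores_by_h[h]) + len(foc_scores_by_h[h])
                for h in levels)
    beta = 0.0
    for h in levels:
        G_R, G_F = ref_scores_by_h[h], foc_scores_by_h[h]
        p_h = (len(G_R) + len(G_F)) / total   # weight for score level h
        beta += p_h * (sum(G_R) / len(G_R) - sum(G_F) / len(G_F))
    return beta

# Hypothetical studied-item scores at valid scores h = 0, 1, 2:
ref = {0: [0, 0, 1], 1: [1, 0, 1, 1], 2: [1, 1]}
foc = {0: [0, 0], 1: [0, 1, 0], 2: [1, 0]}
print(round(sibtest_beta(ref, foc), 4))
```

A positive beta-hat indicates the studied item favors the reference group after conditioning on the valid subtest score.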
Key Ingredient in DIF Analyses: the Conditioning Variable DIF occurs because the conditioning variable does not capture all of the skills (the complete latent space) that the groups of examinees used in responding to the test items. Several studies have examined conditioning scores and how to account for all of the skills examinees use in responding to items on a test.
Shin (1992); Zwick & Ercikan (1989). [Figure: panels contrasting conditioning on Skill 1 with conditioning on Skill 2]
Ackerman & Evans (1994) DIF Study. [Figures: generated ability distributions; generated items]
Conditioning on θ2. [Figure: item vectors with the valid skill and valid composite direction marked; the flagged DIF items measure the invalid skill]
Conditioning on θ1. [Figure: item vectors with the valid skill and valid composite direction marked; the flagged DIF items measure the invalid skill]
Conditioning on raw score. [Figure: two sets of flagged DIF items, each measuring an invalid skill]
Conditioning on θ1 and θ2: all items (composites) are valid. [Figure: both axes marked as valid skills]
4. DIF: compensatory processing versus noncompensatory processing
Identical Generating Distributions (N = 1000)

Group   Ability   Mean   Std Dev   r(θ1, θ2)
REF     θ1        .08    1.00      .35
        θ2        .01     .98
FOC     θ1        .06    1.00      .39
        θ2        .02    1.01
Vectors of Generated Items (n = 30)
Generated item parameters, compensatory versus noncompensatory versions:

Item   Compensatory (a1, a2, d)   Noncompensatory (a1, a2, b1, b2)
13     .4, .4, .0                 1.2, 1.2, .0, .0
14     .8, .8, .0                  .8,  .8, .0, .0
15     1.2, 1.2, .0               1.2, 1.2, .0, .0
16     1.6, 1.6, .0               1.6, 1.6, .0, .0
[Figure: raw score frequency distributions for the Reference Group and the Focal Group]
Item statistics by group and item type:

        Reference Group        Focal Group
        (Compensatory)         (Noncompensatory)
Item    p-value   biserial     p-value   biserial
13      .49       .55          .26       .44
14      .46       .83          .27       .71
15      .48       .94          .30       .79
16      .47       .98          .30       .84
[Figure: panels for Item 2 and Item 15]
ETS DIF Classification Categories

Category A: |MH D-DIF| < 1.0. During test assembly: select freely.
Category B: 1.0 <= |MH D-DIF| < 1.5. During test assembly: if possible, select an equivalent item with smaller MH D-DIF.
Category C: |MH D-DIF| >= 1.5. During test assembly: select ONLY if essential; independent reviewer required. Before score reporting: independent reviewer required.
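The category boundaries can be expressed as a small lookup. Note this is a simplification: the operational ETS rules also involve statistical significance criteria, while this sketch uses only the magnitude thresholds shown above:

```python
def ets_dif_category(mh_d_dif):
    """Map an MH D-DIF value to the ETS A/B/C classification,
    using only the absolute-magnitude thresholds."""
    size = abs(mh_d_dif)
    if size < 1.0:
        return "A"   # negligible DIF: select freely
    elif size < 1.5:
        return "B"   # moderate DIF: prefer an equivalent item with smaller DIF
    return "C"       # large DIF: select only if essential; reviewer required

print([ets_dif_category(v) for v in (0.3, -1.2, 1.8)])   # ['A', 'B', 'C']
```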
[Figure: regions labeled by ETS DIF category A, B, and C]
5. Example applications
Situations in which compensation differences between subgroups could occur:
1. Teaching literacy: phonemic awareness, phonics, reading fluency (including oral reading skills), vocabulary development, and reading comprehension strategies
2. English language learners: students whose first language is not English
3. Teacher training: content knowledge vs. pedagogical knowledge (Praxis II)
6. Conclusion and future directions
Conclusions DIF is a very perplexing analysis to perform. Quite often, when we identify items that favor one group or another, we still cannot determine what caused the DIF. Hopefully, by applying multidimensional modeling we can expand on why groups of students perform differentially. Such analyses, especially those involving compensation and lack of compensation, could be very instructive and prescriptive for teachers and help inform pedagogical practice.
Future Work More work needs to be done on how best to represent items in a noncompensatory framework. I am working closely with my doctoral students to examine the ways we believe DIF can occur through lack of compensation. This includes developing items whose distractors represent varying degrees of compensation. One of my students is also using latent class mixture models with the compensatory and noncompensatory MIRT models to identify classes of students who lack requisite skills and thus face noncompensatory testing scenarios.
A short quiz ("Küçük sınav")
Being the great psychometricians that you are, which group do you think this ACT item favored: Whites? Blacks? Males? Females? No DIF? Answer: Black examinees.
Which group do you think this ACT item favored: Whites? Blacks? Males? Females? No DIF? A rectangular 8-inch by 10-inch picture is to be framed with a 3-inch border all the way around it. How many more square inches of wall space will be covered by the framed picture than by the picture alone? a) 24 b) 48 c) 54 d) 108 e) 144 Answer: White examinees.
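For anyone checking the arithmetic behind the keyed answer:

```python
# A 3-inch border on every side grows each dimension by 6 inches.
picture = 8 * 10                       # 80 square inches
framed = (8 + 2 * 3) * (10 + 2 * 3)    # 14 x 16 = 224 square inches
print(framed - picture)                # 144 -> answer (e)
```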
For questions or comments please email me at taackerm@uncg.edu
References
Ackerman, T. A. (1989). Unidimensional IRT calibration of compensatory and noncompensatory multidimensional items. Applied Psychological Measurement, 13, 113-127.
Ackerman, T. A. (1992). An explanation of differential item functioning from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91.
Ackerman, T. A. (1994). The influence of conditioning scores in performing DIF analyses. Applied Psychological Measurement, 18(4), 329-342.
Ackerman, T. A., & Evans, J. A. (1992, April). An investigation of the relationship between reliability, power, and the Type I error rate of the Mantel-Haenszel and simultaneous item bias detection procedures. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco.
Ackerman, T. A., & Henson, R. A. (2014). Graphical representations of items and tests that are measuring multiple abilities. Proceedings of the Psychometric Society, IMPS 2013.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Erlbaum.
Shin, S. (1992). An empirical investigation of the robustness of the Mantel-Haenszel procedure and sources of differential item functioning. Dissertation Abstracts International, 53A, 3504.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history assessment. Journal of Educational Measurement, 26, 55-66.
"Teşekkürler" ("Thank you")