A Differential Item Functioning (DIF) Analysis of the Self-Report Psychopathy Scale. Craig Nathanson and Delroy L. Paulhus

University of British Columbia. Poster presented at the 1st biannual meeting of the Society for the Scientific Study of Psychopathy, Vancouver, BC, Canada. Please do not cite without prior permission.

ABSTRACT

Differential item functioning (DIF) occurs when an item is influenced by a variable irrelevant to the construct of interest. Recent investigations of DIF in the Psychopathy Checklist-Revised (PCL-R) indicated that several PCL-R items demonstrated significant DIF. These findings spurred the current investigation of DIF in a measure of subclinical psychopathy closely tied to the PCL-R, namely a 40-item version of the Self-Report Psychopathy Scale (SRP). Participants (N = 383) completed the 40-item measure, and we investigated these items for DIF across genders and ethnicities. Results indicated that the majority of SRP items were free of DIF. Only two items demonstrated significant DIF, one between genders and the other between ethnicities. Given the paucity of SRP items with DIF (only 5%), such a finding may have been obtained by chance. In turn, we do not recommend dropping or modifying these items. Rather, our results suggest that users of the SRP should feel confident that its items function equivalently across genders and ethnicities. Future studies of differential test functioning (DTF) are also discussed.

Introduction

Differential item functioning (DIF), a concept based in item response theory, occurs when a given item is influenced by an irrelevant variable (Zumbo, 1999). One common form of DIF, akin to a main effect in ANOVA, is overestimation (often called uniform DIF): After being matched on trait level, individuals from one group are on average more likely to endorse a particular item than are those in another group. More complex and less common is so-called non-uniform DIF: Akin to an interaction in ANOVA, non-uniform DIF is said to occur when the between-group difference in the likelihood of endorsing an item is inconsistent across levels of the trait.

To date, several studies have observed DIF in the items of the Psychopathy Checklist-Revised (PCL-R; Hare, 2003). For example, items on the Social Deviance factor (Factor 2) tended to display DIF more than items on the Affective/Interpersonal factor (Factor 1). Specifically, Factor 2 items tended to overestimate psychopathy in male offenders compared to female offenders (Bolt, Hare, Vitale, & Newman, 2004) and in African American offenders compared to Caucasian offenders (Cooke, Kosson, & Michie, 2001).

These findings spurred us to investigate DIF in a measure of subclinical psychopathy. We felt that the most appropriate measure to investigate was the recently developed 40-item version of the Self-Report Psychopathy Scale (SRP; Williams, Nathanson, & Paulhus, 2003), for two reasons. First, the SRP is historically linked to the PCL-R. Second, the SRP has been explicitly modeled after the factor structure of psychopathy (Williams et al., 2003). The current study explored the extent to which the items on the 40-item Self-Report Psychopathy Scale demonstrated differential item functioning. In particular, we compared the genders and the two major ethnic groups at our university, namely students of European vs. East Asian heritage.

Method

Participants

Participants were 383 undergraduates from a large Western Canadian university. Sixty-five percent of participants were women. Fifty-two percent of participants were of East Asian heritage, 32% were of European heritage, and the remainder came from other ethnic heritages. All participants received course credit for participation.

Materials

As part of a larger laboratory study, participants completed a 40-item version of the Self-Report Psychopathy Scale (Williams et al., 2003). This measure requires participants to rate their agreement with each item on a 5-point Likert scale (1 = Strongly disagree to 5 = Strongly agree). Sample items include "I don't think of myself as tricky or sly" (reverse coded) and "I have stolen a motor vehicle." Total score was computed by averaging across all 40 items. Alpha reliability of SRP scores in this sample was .86.

Measuring DIF

To measure DIF in the SRP's items, we used the methods and criteria advocated by Bruno Zumbo and colleagues (Gelin & Zumbo, 2003; Slocum, Gelin, & Zumbo, 2003; Zumbo, 1999; see also Hidalgo & Lopez-Pina, 2004). For an item to demonstrate DIF, two criteria must be met: (1) the item must be significantly affected by external, irrelevant variables, and (2) this effect must be at least moderate in size.

We conducted a series of ordinal logistic regressions on each item, where item scores are regressed first on total score (Step 1), then on total score plus the demographic or group variable of interest (e.g., gender) (Step 2), and finally on those terms plus their interaction (Step 3). To determine whether an item demonstrates significant DIF, the difference in chi-squares (χ²) of Step 3 - Step 1 is compared against the 2 df, p < .01 cutoff of 9.21. This omnibus χ² functions like an omnibus F-test in ANOVA, given that the χ² simultaneously tests for an interaction, namely non-uniform DIF, and a main effect, namely overestimation.

If the omnibus χ² is significant, Zumbo and colleagues recommend conducting analyses similar to those described above by computing χ² and R² differences for (A) Step 3 - Step 2 and (B) Step 2 - Step 1 (Slocum et al., 2003). If test (A) meets Zumbo and colleagues' criteria, the item is said to demonstrate non-uniform DIF (i.e., a significant interaction). If test (B) meets the criteria, the item is said to overestimate the trait for a particular group (i.e., a significant main effect).

Although statistical significance is necessary for an item to demonstrate DIF, it is not sufficient. Recall that Zumbo and colleagues (Gelin & Zumbo, 2003; Zumbo, 1999) indicate that an item only demonstrates DIF if the significant χ² has at least a moderate effect size. The omnibus measure of effect size is obtained by computing the difference in R² of Step 3 - Step 1, with R² values of .035 to .070 considered moderate (Jodoin & Gierl, 2001). In cases where both criteria are met, it is useful to graph the item characteristic curves, which show the relationship between level of the trait (i.e., total score) and responses on a given item (i.e., item score) for each group (e.g., men and women).
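The two-criterion decision rule described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' analysis code: the function names and the example fit statistics are hypothetical, and it assumes that the follow-up tests (A) and (B) use the same p < .01 level (with 1 df) and the same .035 effect-size threshold as the omnibus test. The regressions themselves are not fitted here; the sketch starts from the model χ² and R² of each step.

```python
import math

ALPHA = 0.01         # significance level used in the text
MODERATE_R2 = 0.035  # lower bound of a moderate effect (Jodoin & Gierl, 2001)


def chi2_upper_tail(x, df):
    """Upper-tail p-value of a chi-square statistic (closed forms for df 1 and 2)."""
    x = max(0.0, x)  # nested models should never give a negative difference
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 2 are handled in this sketch")


def classify_dif(model_chi2, model_r2):
    """Classify one item from the model chi-square and R-squared of Steps 1-3."""

    def meets_criteria(later, earlier, df):
        delta_chi2 = model_chi2[later] - model_chi2[earlier]
        delta_r2 = model_r2[later] - model_r2[earlier]
        significant = chi2_upper_tail(delta_chi2, df) < ALPHA
        return significant and delta_r2 >= MODERATE_R2

    if not meets_criteria(3, 1, df=2):  # omnibus test, Step 3 - Step 1
        return "no DIF"
    if meets_criteria(3, 2, df=1):      # test (A): interaction term
        return "non-uniform DIF"
    if meets_criteria(2, 1, df=1):      # test (B): group main effect
        return "overestimation (uniform DIF)"
    return "DIF, type unresolved"


# The 2 df cutoff at p = .01 follows from solving exp(-x / 2) = .01.
omnibus_cutoff = -2.0 * math.log(ALPHA)

# Hypothetical fit statistics for a single item, keyed by step number.
example_chi2 = {1: 210.0, 2: 226.0, 3: 228.5}
example_r2 = {1: 0.400, 2: 0.442, 3: 0.447}
print(round(omnibus_cutoff, 2), classify_dif(example_chi2, example_r2))
```

Checking test (A) before test (B) is a design choice of this sketch: when an interaction is present, the group difference varies across trait levels, so the item is labeled non-uniform regardless of the main effect.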

Results

Gender

The vast majority of the 40 items on the SRP did not meet Zumbo and colleagues' criteria and, in turn, did not demonstrate DIF; an example is shown in Figure 1. Note that the two item characteristic curves in Figure 1 for the item "Not hurting others' feelings is important to me" (reverse coded) overlap very closely, suggesting no DIF (Slocum et al., 2003). Only one item demonstrated DIF, namely "I am usually very careful about what I say to people" (reverse coded), with χ²(2) = 18.05, p < .01, and R² = .046. Subsequent analyses revealed that this item overestimated psychopathy and that there was no interaction: Difference scores for Step 2 - Step 1 were χ²(1) = 15.77, p < .01, R² = .041, compared with χ²(1) = 2.28, R² = .005 for Step 3 - Step 2. As seen in Figure 2, this item consistently overestimated psychopathy in women.

Ethnicity

As with gender, the majority of the SRP items did not demonstrate DIF; four items were exceptions. We suspected that the DIF observed here might be attributable to the large proportion of English-as-a-second-language students in our sample. In turn, we excluded those students who indicated that English was not their first language, resulting in a subsample of n = 145 (70% European heritage). After re-testing the four DIF items with this subsample, three of the four items no longer demonstrated DIF. However, one item, "I find it easy to manipulate people," continued to demonstrate DIF, with χ²(2) = 9.97, p < .01, and R² = .053. Unlike the gender DIF item, this item demonstrated non-uniform DIF, χ²(1) = 7.62, p < .01, R² = .04, but did not overestimate psychopathy, χ²(1) = 2.35, R² = .013. As indicated in Figure 3, although this item discriminated among European heritage participants across levels of psychopathy, it did not discriminate among East Asian heritage participants.
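As a quick consistency check on the values reported above, the 1 df components (Step 2 - Step 1 and Step 3 - Step 2) should sum to the 2 df omnibus difference (Step 3 - Step 1), for both χ² and R². The short Python sketch below verifies this arithmetic; the variable and function names are illustrative, and the tolerances are assumptions that allow for rounding in the reported figures.

```python
# Differences reported in the Results, as (omnibus, main effect, interaction).
reported = {
    "gender item": {"chi2": (18.05, 15.77, 2.28), "r2": (0.046, 0.041, 0.005)},
    "ethnicity item": {"chi2": (9.97, 2.35, 7.62), "r2": (0.053, 0.013, 0.040)},
}


def adds_up(omnibus, main_effect, interaction, tol):
    """True when the two 1 df components sum to the 2 df omnibus value."""
    return abs(omnibus - (main_effect + interaction)) <= tol


# Tolerances are half a unit in the last reported decimal place.
checks = {
    name: (adds_up(*vals["chi2"], tol=0.005), adds_up(*vals["r2"], tol=0.0005))
    for name, vals in reported.items()
}
for name, (chi2_ok, r2_ok) in checks.items():
    print(f"{name}: chi-square additive = {chi2_ok}, R-squared additive = {r2_ok}")
```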

Figure 1. Item characteristic curves for men and women for the SRP item "Not hurting others' feelings is important to me." Axes: total score (x-axis, 1.00 to 4.00) by item score (y-axis, 1 to 5), with separate curves for men and women.

Figure 2. Item characteristic curves for men and women for the SRP item "I am usually very careful about what I say to people." Axes: total score (x-axis, 1.00 to 4.00) by item score (y-axis, 1 to 5), with separate curves for men and women.

Figure 3. Item characteristic curves for participants of East Asian heritage and participants of European heritage for the SRP item "I find it easy to manipulate people." Axes: total score (x-axis, 1.00 to 4.00) by item score (y-axis, 1 to 5), with separate curves for each heritage group.

Discussion

Taken together, our results suggest that these 40 Self-Report Psychopathy Scale items are not significantly influenced by external, irrelevant variables. The vast majority of these items (38 out of 40, or 95%) were found to be free of DIF. We found only one item that demonstrated gender DIF and another that demonstrated ethnicity DIF. However, given how few SRP items demonstrated DIF, we do not feel that researchers or other users of the SRP should be particularly concerned. That is, the paucity of items with DIF on the SRP (2 out of 40, or 5%) suggests that such a finding may be attributable to chance. In turn, we do not believe that these two items should be modified or discarded from the SRP. Moreover, users of the SRP may rightly feel confident that the items are valid indicators of subclinical psychopathy across genders and ethnic groups.

The next step in this investigation is to test for differential test functioning (DTF), which extends the principle of DIF from the individual item to the whole test. It is especially noteworthy that, despite the presence of several items with DIF, subsequent DTF analyses conducted on the PCL-R suggested that these items were not significantly harming the usefulness of the PCL-R's total score. That is, the PCL-R's ability to classify individuals as psychopaths or non-psychopaths using extant scoring procedures remained intact (e.g., Bolt et al., 2004; Cooke et al., 2001; for an exception, see Cooke, Michie, Hart, & Clark, 2005). Although our concerns with the SRP are different from those for the PCL-R (we do not use the SRP to categorize individuals), the more general principle of using a measure that functions equally for all groups is, of course, still highly relevant. Given that we observed far less DIF among the SRP items than has been observed among the PCL-R items, we expect that DTF analyses should unequivocally highlight the SRP's unbiased measurement of subclinical psychopathy.
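The chance argument above can be given a rough numerical footing. The Python sketch below is a back-of-the-envelope illustration rather than part of the reported analyses: it assumes that the 40 item-level tests within one comparison are independent and considers only the p < .01 significance criterion. Because an item must also show at least a moderate effect size to be flagged, the figures it produces are best read as upper bounds. The function name is hypothetical.

```python
ALPHA = 0.01   # per-item significance level from the Method
N_ITEMS = 40   # SRP items tested within each group comparison


def chance_flags(n_tests, alpha):
    """Expected number of chance flags and the probability of at least one."""
    expected = n_tests * alpha
    p_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
    return expected, p_at_least_one


expected, p_any = chance_flags(N_ITEMS, ALPHA)
print(f"expected chance flags per comparison: {expected:.2f}")
print(f"probability of at least one chance flag: {p_any:.3f}")
```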

References

Bolt, D. M., Hare, R. D., Vitale, J. E., & Newman, J. P. (2004). A multigroup item response theory analysis of the Psychopathy Checklist-Revised. Psychological Assessment, 16, 155-168.

Cooke, D. J., Kosson, D. S., & Michie, C. (2001). Psychopathy and ethnicity: Structural, item, and test generalizability of the Psychopathy Checklist-Revised (PCL-R) in Caucasian and African-American participants. Psychological Assessment, 13, 531-542.

Cooke, D. J., Michie, C., Hart, S. D., & Clark, D. (2005). Assessing psychopathy in the UK: Concerns about cross-cultural generalisability. British Journal of Psychiatry, 186(4), 335-341.

Gelin, M. N., & Zumbo, B. D. (2003). Differential item functioning results may change depending on how an item is scored: An illustration with the Center for Epidemiologic Studies Depression Scale. Educational and Psychological Measurement, 63, 65-74.

Hare, R. D. (2003). Manual for the Psychopathy Checklist-Revised (2nd ed.). Toronto/Buffalo: Multi-Health Systems.

Hidalgo, M. D., & Lopez-Pina, J. A. (2004). Differential item functioning detection and effect size: A comparison between logistic regression and Mantel-Haenszel procedures. Educational and Psychological Measurement, 64, 903-915.

Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329-349.

Slocum, S. L., Gelin, M. N., & Zumbo, B. D. (in press). Statistical and graphical modeling to investigate differential item functioning for rating scale and Likert item formats. In B. D. Zumbo (Ed.), Developments in the Theories and Applications of Measurement, Evaluation, and Research Methodology Across the Disciplines, Vol. 1. Vancouver: Edgeworth Laboratory, University of British Columbia.

Williams, K. M., Nathanson, C., & Paulhus, D. L. (2003, August). Structure and validity of the Self-Report Psychopathy scale in normal populations. Poster presented at the 111th annual meeting of the American Psychological Association, Toronto.

Zumbo, B. D. (1999). A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-Type (Ordinal) Item Scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.