Journal of Applied Psychology


Effect Size Indices for Analyses of Measurement Equivalence: Understanding the Practical Importance of Differences Between Groups

Christopher D. Nye and Fritz Drasgow

Online First Publication, April 4, 2011. doi: /a

CITATION: Nye, C. D., & Drasgow, F. (2011, April 4). Effect Size Indices for Analyses of Measurement Equivalence: Understanding the Practical Importance of Differences Between Groups. Journal of Applied Psychology. Advance online publication. doi: /a

Journal of Applied Psychology
© 2011 American Psychological Association
2011, Vol., No., /11/$12.00 DOI: /a

Effect Size Indices for Analyses of Measurement Equivalence: Understanding the Practical Importance of Differences Between Groups

Christopher D. Nye and Fritz Drasgow
University of Illinois at Urbana-Champaign

Because of the practical, theoretical, and legal implications of differential item functioning (DIF) for organizational assessments, studies of measurement equivalence are a necessary first step before scores can be compared across individuals from different groups. However, commonly recommended criteria for evaluating results from these analyses have several important limitations. The present study proposes an effect size index for confirmatory factor analytic (CFA) studies of measurement equivalence to address 1 of these limitations. The application of this index is illustrated with personality data from American English, Greek, and Chinese samples. Results showed a range of nonequivalence across these samples, and these differences were linked to the observed effects of DIF on the outcomes of the assessment (i.e., group-level mean differences and adverse impact).

Keywords: differential item functioning, measurement equivalence, effect size, employee selection

Practitioners and organizational researchers confront a vast number of questions that involve comparing scores on assessment instruments across groups. Are workers more satisfied in organizations with empowerment programs? Are successful salespersons more extraverted? Are employees in a multinational organization more satisfied in one country than employees in another? Moreover, because of the legal and practical implications of using selection assessments that advantage one group over another, group comparisons may be particularly salient during the hiring process.
For all of these comparisons to be meaningful, it is essential that the tests and scales provide equivalent measurement across groups. Equivalent measurement is obtained when individuals with the same standing on the trait assessed by the test or scale, but sampled from different groups, have equal expected observed scores (Drasgow, 1984). As such, measurement invariance can be examined by a differential item functioning (DIF) analysis using item-response theory (IRT) or with confirmatory factor analytic (CFA) mean and covariance structure (MACS) analysis. The latter method is the focus of this article. Although several articles have proposed various decision rules for determining if measurement nonequivalence exists with MACS analysis (Cheung & Rensvold, 2002; Hu & Bentler, 1999; Meade, Johnson, & Braddy, 2008), these rules generally involve empirically derived cutoffs or statistical significance tests. As such, the analysis does not address the practical importance of observed differences between groups and does not provide users with information about the effects of nonequivalence on the organizational outcomes of an assessment. In the broader psychological literature, effect size statistics have been proposed to overcome this limitation (Cohen, 1990, 1994; Kirk, 2006; Schmidt, 1996). However, effect size indices for CFA evaluations of measurement equivalence have not yet been developed.

Author note: Christopher D. Nye and Fritz Drasgow, Department of Psychology, University of Illinois at Urbana-Champaign. An earlier version of this article was presented at the annual meeting of the Academy of Management, Montréal, Quebec, Canada, in August. We would like to thank Brent W. Roberts and Gerard Saucier for the use of their data in the empirical example. Correspondence concerning this article should be addressed to Christopher D. Nye, Department of Psychology, University of Illinois at Urbana-Champaign, 603 East Daniel Street, Champaign, IL. E-mail: cnye2@cyrus.psych.illinois.edu
In the present study, we propose such an index and examine its application to real-world data. To illustrate its practical importance, we also demonstrate the effects of measurement nonequivalence on the observed outcomes (e.g., means, adverse impact) of group-level comparisons. This information will enable researchers and practitioners to further evaluate the theoretical and practical importance of observed differences.

The Importance of Measurement Equivalence

Measurement invariance techniques can and should be applied prior to testing between-groups differences. For example, these methods are commonly used to examine the equivalence of tests and assessments across cultures (e.g., Wasti, Bergman, Glomb, & Drasgow, 2000), races (e.g., D. Chan, 1997), sexes (e.g., Parker, Baltes, & Christiansen, 1997), and other demographic groups or over time (e.g., K.-Y. Chan, Drasgow, & Sawin, 1999; Ryan, West, & Carr, 2003). Therefore, these techniques have been used to address questions in a number of substantive organizational research areas such as job attitudes (Ryan et al., 2003), employee selection (Stark, Chernyshenko, Chan, Lee, & Drasgow, 2001), organizational citizenship behaviors (Lam, Hui, & Law, 1999), motivation (Sagie, Elizur, & Yamauchi, 1996), performance ratings (Woehr, Sheehan, & Bennett, 2005), leadership (M. S. Cole, Bedeian, & Field, 2006), and sexual harassment (Wasti et al., 2000), among others. However, studies of measurement equivalence have garnered the most attention in areas where consistent group-level differences are observed.

Within the cognitive ability domain, some content has been found to function differently across groups (Kuncel & Hezlett, 2007). Specifically, men tend to perform better than women on science-related questions (Lawrence, Curley, & McHale, 1988), and some verbal stimuli tend to favor European American over Hispanic individuals (Schmitt & Dorans, 1988). Other research has demonstrated that the psychometric properties of these measures can vary over time. For example, Lievens, Reeve, and Heggestad (2007) showed that allowing job applicants to retake a cognitive ability test could affect the functioning of the test in a selection context. In other words, results from their study showed that retesting resulted in nonequivalent test scores and biased prediction. Another phenomenon, known as item drift, may also occur as items on a selection test become obsolete or outdated as a result of educational, technological, and/or cultural changes (Bock, Muraki, & Pfeiffenberger, 1988). For example, K.-Y. Chan et al. (1999) examined the Armed Services Vocational Aptitude Battery (ASVAB) and found that items requiring greater technical knowledge (e.g., Electrical Information or General Sciences tests) exhibited the most DIF over time. In addition, Drasgow, Nye, and Guo (2008) found significant levels of item drift on the National Council Licensure Examination for Registered Nurses (NCLEX-RN). Interestingly, in this study, DIF canceled out at the test level, suggesting that nonequivalence would not have a substantial impact on overall test scores. Because of their predictive validity and smaller subgroup differences, some have recommended using personality measures to supplement cognitive ability tests in selection settings (Cascio, Jacobs, & Silva, 2009; Maxwell & Arvey, 1993).
In addition, personality variables play a key role in a variety of theories and models of organizational behavior, including, but not limited to, leadership (Judge, Bono, Ilies, & Gerhardt, 2002), motivation (Kanfer & Heggestad, 1997), organizational justice (Colquitt, Scott, Judge, & Shaw, 2006), job satisfaction (Judge, Heller, & Mount, 2002), and turnover (Salgado, 2000). Although personality measures may exhibit smaller subgroup differences within a single culture, recent research suggests that problems may occur in multicultural contexts (Ghorpade, Hattrup, & Lackritz, 1999; Oishi, 2006). For example, Nye, Roberts, Saucier, and Zhou (2008) found a lack of invariance for the majority of items on a common measure of personality when compared across three cultures. Because of these differences, scores may not be comparable across cultural groups. Given the increasing importance of cross-cultural psychology and international organizations, this issue has growing significance.

Measurement Equivalence

A variety of methods have been developed for examining measurement equivalence. Some have suggested using t tests, analysis of variance (ANOVA), or other statistical tests of observed score differences to evaluate DIF. However, these methods are inappropriate for this purpose because they confound DIF with true differences (referred to as impact; Stark, Chernyshenko, & Drasgow, 2004) between groups. Stated differently, these methods require the assumption of equal latent trait distributions across groups, which is unlikely to be true in practice (Hulin, Drasgow, & Parsons, 1983). Showing the inadequacy of these tests, Camilli and Shepard (1987) demonstrated that when true mean differences exist between groups, ANOVA comparisons were incapable of detecting DIF. In fact, even when measurement nonequivalence accounted for 35% of the observed mean difference, the effects suggested by ANOVA were negligible.
More important, these authors found that the presence of true group differences can result in high Type I error rates. Mean and covariance structure (MACS) analysis is a more appropriate method for examining measurement equivalence (Cheung & Rensvold, 2000; Little, 1997; Stark, Chernyshenko, & Drasgow, 2006; Vandenberg & Lance, 2000) because it has important advantages over the alternative methods described above. First, it does not assume equal distributions of the latent trait across groups (Drasgow & Kanfer, 1985). Thus, a MACS analysis allows researchers to differentiate DIF from impact. Second, the adequacy of models can be evaluated using several well-established indices of fit. Therefore, this approach applies a more comprehensive definition of nonequivalence to the data.

Mean and Covariance Structure (MACS) Analysis

Although the number and order of the steps in MACS analysis can vary across studies (Vandenberg & Lance, 2000), researchers are generally interested in tests of configural, metric, and scalar invariance. However, a number of additional tests for invariance are available, and the exact forms of invariance that are assessed should be linked to the purposes of the study (Steenkamp & Baumgartner, 1998). Nevertheless, these additional tests should be preceded by a confirmation of configural, metric, and scalar equivalence. Configural invariance is the first step to assessing measurement equivalence (Vandenberg & Lance, 2000). Here, the pattern of fixed (at zero) and free loadings is compared across groups. Essentially, this test assesses the extent to which items in the scale or test load on the same factors in both samples and determines whether individuals in these samples employ the same conceptualization of the focal constructs.
Therefore, this type of invariance has been particularly important in the personality literature, where there have been substantial debates over the latent structure of individual differences (see Saucier & Goldberg, 2001). If the pattern of zero and nonzero loadings differs across groups, no further tests will be justified; constructs that are conceptualized differently are not comparable across groups. In contrast, if configural invariance is confirmed, assessments of metric invariance should proceed. Metric invariance is tested by constraining the factor loadings to be equivalent across groups. In LISREL notation, metric equivalence tests the hypothesis that Λ_xg = Λ_xg′, where Λ_xg is the loading matrix for the gth group. Items that are found invariant at the metric level can then be assessed for scalar equivalence. In this step, the model for the data is

X_g = τ_xg + Λ_xg ξ_g + δ_g, (1)

where X_g is the vector of observed variables, τ_xg is the vector of intercepts of the regressions of the observed variables on the latent factor, ξ_g is the vector of latent factors, and δ_g is the vector of measurement errors. To test for scalar invariance, the item intercepts are constrained in addition to the factor loadings (i.e., Λ_xg = Λ_xg′ and τ_xg = τ_xg′). As such, this test assesses the comparability of scores across groups. A failure to support the null hypothesis suggests that the scores, and hence the

group means, are not directly comparable. Therefore, tests of scalar invariance have critical importance for drawing conclusions about group differences. Although metric and scalar equivalence are generally assessed sequentially, Stark et al. (2006) suggested that it may be useful to assess these forms of invariance simultaneously. Examining metric and scalar equivalence separately increases the number of comparisons and, therefore, also increases the risk of Type I errors. Moreover, the sequential process may propagate errors from one step (e.g., metric invariance) to another (i.e., scalar invariance).

Interpreting Results of CFA Studies of Measurement Equivalence

To test measurement equivalence, it is common to use statistical significance tests based on a chi-square distribution. In addition to the traditional chi-square difference tests, some authors (D. Chan, 2000; González-Romá, Hernández, & Gómez-Benito, 2006) have recommended examining modification indices that represent a chi-square estimate of the improvement in fit when the corresponding parameter is freed (Bollen, 1989). However, it is well known that chi-square significance tests are affected by sample size (Meade et al., 2008). Thus, in large samples, even small differences will be identified as statistically significant. González-Romá et al. (2006) illustrated this problem with modification indices. Under some conditions, these authors showed that power was only .29 when N = 100 but increased to 1.00 when N = 800. In other conditions, Type I error rates for the modification indices increased by .15 when samples of 100 and 800 were compared. Because of the limitations of chi-square tests, other indices (i.e., changes in fit statistics) have been suggested for evaluating equivalence in the CFA framework. For example, Cheung and Rensvold (2002) suggested that a change in the comparative fit index (ΔCFI) greater than .01 be used as a cutoff for identifying nonequivalence.
More recently, Meade et al. (2008) showed that this cutoff was too liberal and did not detect some forms of nonequivalence. Instead, these authors recommended that a ΔCFI of .002 be used. This emphasis on statistical significance tests and empirically derived cutoff values has been criticized for several reasons (Cohen, 1990, 1994; Harlow, Mulaik, & Steiger, 1997; Kirk, 1996; Schmidt, 1996). Kirk (2006) critiqued these criteria because they force researchers to turn a decision continuum into a dichotomous reject/do not reject decision. As Kirk pointed out, this practice treats a p value that is only slightly larger than the cutoff (e.g., p = .055) the same as a much larger value. Another important criticism is that these statistical tests do not reflect the practical significance of a difference. Meade et al. (2008) differentiated between detectable and practically significant DIF, noting that they were independent issues. These authors stated that the development of conservative cutoffs is primarily focused on the detection of DIF and not its practical significance. Thus, observing ΔCFI ≥ .002 indicates that DIF exists but does not reflect the importance of the difference. As a result of these criticisms (and others), it is widely believed that the interpretation of empirical results should be based on an evaluation of effect sizes rather than tests of statistical significance. Therefore, a number of effect size statistics have been developed for ANOVAs (e.g., Hays, 1963), t tests (e.g., Cohen, 1988), and other traditional statistical tests. However, no such indices exist for CFA analyses. Stark et al. (2004) proposed an effect size for IRT analyses of differential test functioning (DTF) that they referred to as d_DTF. Despite the usefulness of this measure, it is not applicable to CFA methodology, where no viable alternatives exist for these techniques. In addition, it does not address nonequivalence at the item level, which can be informative for test development.
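To make the two conventional decision rules concrete, the following is a minimal Python sketch (our own illustration, not code from the article): a 1-df chi-square difference test and the ΔCFI cutoff of Meade et al. (2008). The function names and the example fit values are hypothetical.

```python
import math

def chi2_diff_p(delta_chi2):
    # p value of a chi-square difference test with 1 df:
    # P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(delta_chi2 / 2.0))

def flags_dif(cfi_baseline, cfi_constrained, cutoff=0.002):
    # Meade et al. (2008): flag nonequivalence when constraining an item's
    # parameters drops the CFI by more than the cutoff
    return (cfi_baseline - cfi_constrained) > cutoff

# A chi-square difference of 3.84 on 1 df sits right at the p = .05 boundary
print(round(chi2_diff_p(3.84), 3))
print(flags_dif(0.950, 0.945))  # CFI drop of .005 exceeds .002
print(flags_dif(0.950, 0.949))  # CFI drop of .001 does not
```

Note how both rules reduce a continuum to a dichotomy, which is exactly the criticism raised above: a drop of .0021 and a drop of .10 receive the same "flag."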
By assessing the magnitude of DIF, one can more easily detect items and/or specific content that may be problematic for a specific group. For this reason, single-item tests of measurement equivalence are frequently recommended to diagnose the source of any nonequivalence at the scale level (Vandenberg & Lance, 2000). Moreover, although CFA analyses have traditionally been conducted by placing simultaneous constraints on all of the items in a scale (i.e., the constrained-baseline approach), single-item constraints (i.e., the free-baseline approach) provide lower Type I and II error rates (Stark et al., 2006). Consequently, we propose an item-level effect size measure for CFA analyses of measurement equivalence.

An Effect Size Index for MACS Analyses

As suggested by Stark et al. (2004) for their IRT effect size, an index of practically important nonequivalence can be defined as the contribution that DIF makes to expected score differences for each item. In CFA methodology, the mean predicted response X̂_iR to item i for an individual with a score of θ on the latent variable in the reference group (Group R) is given by

X̂_iR = τ_iR + λ_iR θ, (2)

where τ_iR is the intercept and λ_iR is the item's loading. Here, we use the language of IRT to differentiate the groups being compared. In this terminology, the reference group is the majority or baseline group, and the focal group (Group F) is the sample we are comparing to it. Thus, DIF is reflected in the area between the regression lines for the reference and focal groups (see Figures 1 and 2 for an illustration). Consequently, an effect size for MACS analysis can be defined as

d_MACS = (1 / SD_ip) √[ ∫ (X̂_iR − X̂_iF)² f_F(θ) dθ ], (3)

Figure 1. Comparing the adjective bold (d_MACS = 0.26) across the American English and Greek samples.
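Equations 2 and 3 (and the pooled item standard deviation defined in Equation 4) can be sketched numerically. The fragment below is our own minimal Python illustration, not the authors' program; it approximates the integral in Equation 3 by summing over a grid of nodes, and it assumes the parameter estimates (tau, lam, kappa_f, phi_f) have already been obtained from SEM software.

```python
import math

def pooled_sd(sd_r, n_r, sd_f, n_f):
    # Equation 4: pooled within-group standard deviation of item i
    return math.sqrt(((n_r - 1) * sd_r ** 2 + (n_f - 1) * sd_f ** 2)
                     / ((n_r - 1) + (n_f - 1)))

def d_macs(tau_r, lam_r, tau_f, lam_f, sd_ip, kappa_f, phi_f, nodes=2001):
    """Equation 3: standardized expected-score difference for one item,
    integrating over the focal group's (assumed normal) latent distribution."""
    sigma = math.sqrt(phi_f)
    lo, hi = kappa_f - 6 * sigma, kappa_f + 6 * sigma
    step = (hi - lo) / (nodes - 1)
    total = 0.0
    for k in range(nodes):
        theta = lo + k * step
        # Equation 2: mean predicted responses in the reference and focal groups
        diff = (tau_r + lam_r * theta) - (tau_f + lam_f * theta)
        density = math.exp(-0.5 * ((theta - kappa_f) / sigma) ** 2) \
                  / (sigma * math.sqrt(2 * math.pi))
        total += diff ** 2 * density * step
    return math.sqrt(total) / sd_ip

# Uniform DIF of half a raw-score point in the intercept, pooled item SD of 1.0
print(round(d_macs(0.5, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0), 2))
```

With equal loadings, the squared difference is constant, so the index reduces to the intercept difference in pooled-SD units; with unequal loadings, the integral also picks up DIF that grows or shrinks across the trait range.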

where SD_ip is the pooled within-group standard deviation of item i across Groups R and F given by

SD_ip = √{ [(N_R − 1)SD_R² + (N_F − 1)SD_F²] / [(N_R − 1) + (N_F − 1)] }. (4)

In addition, f_F(θ) is the distribution of the latent trait in the focal group,¹ which is assumed to have a normal distribution with a mean and variance estimated from the latent factor in the focal group (i.e., κ_F and φ_F, respectively, in LISREL notation). It is also worth noting that the integral in Equation 3 can be approximated by summing across quadrature nodes. However, for practical use, a computer program is available from the first author for calculating this index using parameter estimates from commonly used structural equation modeling (SEM) software (e.g., LISREL, MPlus, AMOS).

Dividing by the pooled standard deviation puts this measure in a standardized metric similar to other effect size indices. Although some indices use the pooled standard deviation from the reference and focal groups to standardize the raw differences between them (e.g., Cohen, 1988), others have suggested pooling the standard deviations across all g groups for increased comparability and more precise estimates of the population standard deviation (e.g., Hedges, 1981). If the magnitude of measurement nonequivalence is compared across three or more groups, pooling the standard deviations for the reference group and a single focal group (cf. Cohen, 1988) would result in effect sizes from the same study that are in different metrics and are not comparable to tests with other focal groups. However, if the pooled standard deviation for all groups is used as the denominator for Equation 3, effect size indices can be compared to evaluate the relative size of nonequivalence in each of the focal groups. Therefore, this approach is suggested here.

Although the present study focuses on the pooled standard deviation across all groups, we also note that alternative denominators can be used. For example, Glass (1976) recommended using the standard deviation in the reference group² as the denominator for his effect size. This approach has a similar advantage to the pooled standard deviation in that effect sizes will be in the same metric when a single reference group is compared with multiple focal groups. In his taxonomy of IRT effect sizes, Meade (2010) also suggested using the average effect across respondents in the focal group or the raw differences between groups to provide additional information about the magnitude of an effect. Of course, if different denominators are used in the literature, effect size estimates will need to be adjusted for meta-analytic comparisons (cf. Morris & DeShon, 2002).

Figure 2. Comparing the adjective bold (d_MACS = 1.11) across the American English and Chinese samples.

Practical Consequences of Measurement Nonequivalence

Although this effect size measure can be used to describe the magnitude of an effect, it still provides little information about the observed consequences of measurement nonequivalence. For example, what effects does item-level nonequivalence have on the mean and the variance of the scale? Or, how will nonequivalence affect the outcomes of the selection process? To address these issues, equations were derived to calculate the effects of DIF on the mean and variance of a measure. These equations will help researchers and practitioners to further understand the effects of nonequivalence. In group-level comparisons, observed mean differences can be defined as

Observed differences = DIF + impact. (5)

To quantify the effects of DIF on the mean of a scale, one can calculate

Δmean(X_S) = Σ_{i=1}^{n} ∫ (X̂_iR − X̂_iF) f_F(θ) dθ, (6)

where X_S is the scale score. Notice that the integral in this equation is similar to that in Equation 3 except that the differences between the mean predicted responses in Groups R and F at each ability level are not squared so that DIF in opposite directions can cancel. In addition, because we are interested in the change in the mean of the scale, item-level differences are summed across all n items to obtain the overall mean difference in raw score points. In sum, Δmean(X_S) refers to the amount of the observed difference that can be attributed to DIF; impact is not a factor in this calculation.

Differences between the variances of a scale in the reference and focal groups due to DIF can also be calculated. Using the item-level parameters from the CFA model, these effects are defined as

Δvar(x_i) = 2C_i λ_iR φ_F − C_i² φ_F, (7)

¹ The distribution of the latent factor in the focal group was used in Equation 3 because we are interested in DIF relative to this group. In other words, analyses of measurement equivalence are designed to detect DIF across groups, and we are interested in determining the magnitude of these differences across the range of the latent trait displayed by the focal group's members. This approach is consistent with other similar indices (Stark et al., 2004).

² Glass discussed his effect size measure in the context of experimental manipulations where experimental and control groups are being compared. Thus, he suggested using the standard deviation of the control group as the denominator. In the language of DIF, the control group is most analogous to the reference group.

where λ_iR is the factor loading of item i in the reference group, φ_F is the variance of the latent factor in the focal group, and C_i is the difference between the factor loadings for item i in the reference and focal groups. As illustrated in the Appendix, two key assumptions were made in this derivation. First, because we are interested only in identifying differences due to DIF, we assumed φ_R = φ_F. When this is the case, Δvar is not influenced by true group-level differences in the latent construct. Instead, only metric nonequivalence (i.e., differences in the factor loadings) can result in Δvar ≠ 0. The second simplifying assumption is var(δ_iR) = var(δ_iF). Several authors have suggested that requiring equivalent error variances is the least important hypothesis to test and is generally unnecessary for analyses of measurement equivalence (Bentler, 1995; Byrne, 1998; Jöreskog, 1971). Speaking about constraining the error variances and covariances to equality in multigroup tests, Byrne (1998) noted that it is now widely accepted that to do so "represents an overly restrictive test of the data" (p. 261). Therefore, we do not include differences in δ when evaluating the effects of DIF.

To estimate the total effect of DIF on the variance of a scale, the Δvar(x_i) can be aggregated across all n items in the scale using the formula for calculating the variance of a composite. In other words,

Δvar(X_S) = Δvar(x_1) + Δvar(x_2) + ... + Δvar(x_n) + 2[Δcov(x_1, x_2) + ... + Δcov(x_{n−1}, x_n)], (8)

where

Δcov(x_i, x_j) = λ_jR C_i φ_F + C_j λ_iR φ_F − C_i C_j φ_F (9)

is the covariance of items i and j (see Appendix for derivations of Δvar and Δcov).

Adverse Impact

Because DIF can result in mean differences between groups (see Equation 6), selection decisions may be affected. Indeed, mean differences between groups are the primary source of the differential selection outcomes experienced by members of various groups (Newman, Jacobs, & Bartram, 2007).
Therefore, it is important to examine the consequences of DIF for adverse impact (AI). Although AI has been defined in a number of ways (Gatewood, Field, & Barrick, 2007), courts generally recognize two forms of evidence that it exists: statistical significance tests and the four-fifths rule (Bobko & Roth, 2009). Because of the problems with significance tests noted above, the present study focuses solely on the four-fifths rule. Using the four-fifths rule, AI is identified by the ratio of selection ratios for the majority and minority groups. If the value of this ratio is less than .80, AI is said to occur. Although selection ratios are typically calculated using the results of the selection process, a prospective ratio can be estimated from the CFA model by assuming that the latent trait is normally distributed. Here, the AI ratio is defined as

AI ratio = P_F(Z_XF > Z_Cut) / P_R(Z_XR > Z_Cut), (10)

where Z_XF and Z_XR are the standardized scores on a selection measure in the focal and reference groups, Z_Cut is the standardized cut score used to select employees, and P_F(Z_XF > Z_Cut) is the probability of an individual in Group F obtaining an observed score on the assessment that is greater than the cut score. The denominator in this equation is the same probability for an individual in the reference group. Note that the probabilities in the denominator and numerator are calculated using the model-based means and standard deviations from the reference and focal groups, respectively, but assuming equivalent distributions for the latent trait. In other words, differences between the numerator and denominator are entirely the result of DIF in the measure because differences due to impact (i.e., differences in the latent trait distribution) are not incorporated into these calculations. Thus, if this ratio is less than .80, AI will occur solely because of DIF in the measure.
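As a worked sketch of the consequence formulas (Equations 5-10), the fragment below is our own illustration with hypothetical names; it assumes item parameters and latent-factor estimates are already available. Because the predicted responses are linear in θ, the integral in Equation 6 has the closed form Δτ + Δλ·κ_F, which the code uses directly; the probabilities in Equation 10 use the standard normal survival function.

```python
import math

def delta_mean_scale(items, kappa_f):
    # Equation 6: DIF contribution to the scale mean. Each item is
    # (tau_r, lam_r, tau_f, lam_f); integrating the unsquared difference in
    # predicted responses over f_F reduces to dtau + dlam * kappa_f per item.
    return sum((tr - tf) + (lr - lf) * kappa_f for tr, lr, tf, lf in items)

def delta_var_item(lam_r, lam_f, phi_f):
    c = lam_r - lam_f                               # C_i
    return 2 * c * lam_r * phi_f - c ** 2 * phi_f   # Equation 7

def delta_cov(lam_ir, lam_if, lam_jr, lam_jf, phi_f):
    ci, cj = lam_ir - lam_if, lam_jr - lam_jf
    return lam_jr * ci * phi_f + cj * lam_ir * phi_f - ci * cj * phi_f  # Eq. 9

def delta_var_scale(loadings, phi_f):
    # Equation 8: composite-variance aggregation across all n items
    total = sum(delta_var_item(lr, lf, phi_f) for lr, lf in loadings)
    for i in range(len(loadings)):
        for j in range(i + 1, len(loadings)):
            total += 2 * delta_cov(loadings[i][0], loadings[i][1],
                                   loadings[j][0], loadings[j][1], phi_f)
    return total

def ai_ratio(mean_f, sd_f, mean_r, sd_r, z_cut):
    # Equation 10: ratio of model-implied selection rates above the cut score,
    # holding the latent distribution equal so any shortfall reflects DIF alone
    sf = lambda z: 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z)
    return sf((z_cut - mean_f) / sd_f) / sf((z_cut - mean_r) / sd_r)
```

For example, with identical model-implied distributions the AI ratio is 1.0, whereas a half-SD deficit in the focal group produced entirely by DIF drives the ratio below the four-fifths threshold at a mean-level cut score.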
The Current Study

The primary goal of the present study was to provide an empirical illustration of the effect size index described above and to examine the magnitude of DIF in a measure of the Big Five personality traits. In a reanalysis of data from Nye et al. (2008), the Mini-Markers Scale (Saucier, 1994) was used to determine the extent of CFA nonequivalence across American English, Greek, and Chinese cultures. Next, Equations 6-9 were used to show the effects of DIF on the observed scale-level properties of this measure. Finally, the AI ratio shown in Equation 10 was calculated and compared with the four-fifths rule to illustrate the consequences of DIF for employee selection.

An Empirical Example

Samples

Because SEM models generally require large samples for accurate parameter estimates, even small effects may be statistically significant. Therefore, we chose samples for our empirical example that were large enough to obtain accurate parameters and to illustrate the advantages of effect size indices for MACS analysis. In the American English sample, responses were provided by 727 undergraduate students from two large Midwestern universities. The sample contained 388 women and 339 men, and the mean age was years. Because the measure examined in this study was developed in the United States and has been researched extensively in this country (Saucier, 1994), this sample was used as the reference group, and all analyses were conducted as two-group comparisons between the U.S. sample and a focal group. The Greek sample was composed of 991 undergraduate students from several Greek universities. In this sample, there were substantially more women (N = 751) than men (N = 224; 16 did not report their gender). The Chinese sample consisted of 433 undergraduate students from a large university in Shanghai. The sample contained approximately 49% women and 51% men.

Measure

Participants in all three samples responded to Saucier's (1994) Mini-Marker Scale.
This scale contains 40 adjective items assessing the five-factor personality structure. In their original study, Nye et al. (2008) found that a majority of the items were nonequivalent across all three cultures. However, the magnitudes of the ΔCFI and chi-square difference tests suggested that effect sizes may vary across items in the scale.

Analyses

MACS analysis was used to examine the equivalence of the personality scales across the three samples. Specifically, configural, metric, and scalar invariance were assessed using the maximum-likelihood estimator and the multigroup function in LISREL 8.7. Interestingly, Nye et al. (2008) found that all five of the personality scales were multidimensional, with the two latent factors represented by positive and negative items, respectively. Thus, the four positive items in the Extraversion and Agreeableness scales composed one factor, and the four negative items defined the second factor. Although the two latent factors in the Conscientiousness scale were largely defined by positive and negative items, the adjective inefficient was the only item that did not conform to this structure. Despite the negative wording, inefficient loaded negatively on the positive factor. Given these results, we tested both one- and two-factor models in the American English sample and used the best fitting models to assess configural invariance. Metric and scalar equivalence were assessed simultaneously, and parameters were constrained using the free-baseline approach (i.e., a single item was constrained at a time). In MACS analysis, the referent item plays an important role in statistically identifying the latent trait scale. Because the latent factors being modeled are unobservable, they do not have an inherent scale and must be given one for the model to be identified. The most common approach to doing this is to constrain the loading of a single item to 1.00 for each factor. These items are referred to as the referent items.
When the mean structure is estimated, as it is for tests of scalar equivalence, a scale must also be given to the mean of the latent factor. In the present study, we did this by constraining the intercept of the referent item to zero as suggested by Bollen (1989). Thus, the latent factor will have a mean and variance equal to that of the referent item. An alternative approach to scaling the mean of the latent factor is to constrain this parameter to zero (i.e., κ = 0) in one of the groups, typically the reference group. With this approach, the means of the latent factors in the unconstrained groups represent their deviation from the reference group's mean. Either of these methods of scaling the latent factor can be used when calculating effect size indices. In studies of measurement equivalence, it is essential that the referent items are equivalent across groups. A nonequivalent referent will confound results and render the analysis meaningless. For example, Johnson, Meade, and DuVernet (2009) showed that a referent item that functions differently across groups can either mask or exacerbate nonequivalence in other items, resulting in low power or high Type I error rates, respectively. Thus, for each of the two factors in the scales examined here, only items found to be equivalent by Nye et al. (2008) were used as referent items. However, all of the items in the Neuroticism and Openness scales exhibited nonequivalence and, therefore, were excluded from the present analyses. In addition, Nye et al. were able to identify an equivalent referent item for only one of the latent factors in the Conscientiousness scale. Specifically, none of the negatively worded items were equivalent across cultures. Therefore, effect size estimates will be accurate for the positive Conscientiousness items but not for the negatively worded items.
Nevertheless, we calculate effect sizes for the positive items and use the negative items to illustrate the effects of a nonequivalent referent item on the magnitude of an effect. Reliabilities for Extraversion ranged from .59 (Greece) to .84 (U.S.), Agreeableness ranged from .65 (Greece) to .78 (U.S.), and Conscientiousness ranged from .59 (U.S.) to .74 (China).

Results

To facilitate the use of the indices we propose here, we first provide step-by-step instructions for calculating these measures with examples from the Extraversion scale. The purpose of providing this level of detail is to help readers understand how to calculate these indices and to illustrate their application to complex survey data. Following this discussion, we present the results for all three personality scales.

Calculating Effect Size Indices

Step 1. The first step in the process of calculating effect sizes for MACS analyses is to identify the factor structure that will be tested across groups. On the basis of the results presented by Nye et al. (2008), we tested two-factor models for each of the scales, with every item loading on a single latent factor. However, we also hypothesized additional relationships between several of the items that were not tested by Nye et al. In all three scales, some items were included with their antonyms and/or synonyms. For example, the antonyms sympathetic and unsympathetic were both included in the Agreeableness scale. Similarly, the antonyms talkative and quiet were included in the Extraversion scale. Consequently, we also modeled correlated uniqueness terms for the antonyms and synonyms in each scale. Because of the inherent methodological relationship between these types of items, freeing these parameters seems justified (D. A. Cole, Ciesla, & Steiger, 2007) and is a common practice in personality research (Hopwood & Donnellan, 2010). Table 1 shows the fit indices for both the one- and two-factor models in the American English sample.
Table 1
Fit Statistics for the One- and Two-Factor Models of the Extraversion, Agreeableness, and Conscientiousness Scales
[The χ², df, RMSEA, NNFI, CFI, and SRMR values for the one- and two-factor models of each scale were lost in transcription.]
Note. RMSEA = root mean square error of approximation; NNFI = nonnormed fit index; CFI = comparative fit index; SRMR = standardized root mean square residual.

As shown here, the Conscientiousness scale was clearly not unidimensional, but the two-factor model fit well. In addition, although the single-factor models fit moderately well in the Extraversion and Agreeableness scales, the two-factor models fit better in both cases. Thus, the two-factor models were used to test for measurement equivalence.

Step 2. After identifying a factor structure, multigroup analyses were applied to test for measurement equivalence. For the Extraversion scale, separate referent items were identified by Nye et al. (2008) for each of the two latent factors using the method suggested by Cheung and Rensvold (1999). With this approach, n − 1 tests of equivalence are conducted for each of the items in the scale, with a different item serving as the referent in each test. An appropriate referent item is identified if an item is equivalent across each of the n − 1 tests. Using this approach, the adjective extraverted was selected as the referent for the positively worded items, and the adjective shy was used to scale the negative items. After a referent item has been identified and the mean and variance of the latent factor set, tests for configural, metric, and scalar invariance can proceed as normal (see Vandenberg & Lance, 2000, for a comprehensive review of these analyses). In this process, effect size indices will be calculated using the unconstrained parameters that are estimated in each of the groups.

Step 3. Next, the item-level pooled standard deviations were estimated. To calculate these for the Extraversion scale, the observed standard deviations for each item in the American English, Greek, and Chinese samples were pooled using Equation 4. As described above, the advantage of calculating the pooled standard deviation across all groups is that effect sizes will be on the same metric for comparisons of different groups. Thus, the magnitude of nonequivalence can be compared across groups.
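Equation 4 itself is not reproduced in this excerpt, but a degrees-of-freedom-weighted pooled standard deviation of this kind can be sketched as follows (the item SDs and sample sizes below are hypothetical):

```python
import numpy as np

def pooled_sd(sds, ns):
    """Pool one item's standard deviations across k groups, weighting each
    group's variance by its degrees of freedom (n_g - 1)."""
    sds = np.asarray(sds, dtype=float)
    ns = np.asarray(ns, dtype=float)
    return float(np.sqrt(np.sum((ns - 1) * sds**2) / np.sum(ns - 1)))

# Hypothetical SDs for one Extraversion item in the U.S., Greek, and
# Chinese samples, with hypothetical sample sizes
print(pooled_sd([1.10, 0.95, 1.02], [300, 250, 280]))
```

Because the pooling is done over all three groups at once, the resulting denominator puts the Greek and Chinese effect sizes on a common metric, as noted in Step 3.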
Step 4. After obtaining the item parameters and the pooled standard deviations, these values can be input to the dMACS computer program developed by and freely available from the first author. With this software, item parameters estimated in LISREL, Mplus, or any other statistical program can be used to estimate d_MACS, ΔMean, and ΔVar. LISREL estimates of the item parameters that were used to calculate effect sizes for the Extraversion scale are provided in Table 2. To illustrate the link between parameter differences and the magnitude of an effect, Figures 1 and 2 plot the mean predicted responses for the adjective bold in the Greek and Chinese comparisons, respectively. Note that when a scale is multidimensional, as is the case in the present study, the effect size indices must be calculated separately for each latent factor. Because group-level differences are integrated over the assumed normal distribution of the latent trait in the focal group (i.e., with a mean of μ_F and a variance of σ_F²), the distributions will not necessarily be the same for different dimensions. Thus, the parameters used to estimate the effect size will not be the same for each latent factor, and effect sizes must be estimated separately for items loading on different factors.

Table 2
Item Parameters for the Extraversion Scale
[Factor loadings, intercepts, latent means, latent variances, and d_MACS values for the American English, Greek, and Chinese samples were lost in transcription. The items were extraverted (referent), talkative, bold, and energetic on the positive factor and shy (referent), quiet, bashful, and withdrawn on the negative factor.]
Note. MACS = mean and covariance structure.

Tables 3, 4, and 5 show the results for the Extraversion, Agreeableness, and Conscientiousness scales, respectively. Although there were eight items in each scale, the referent items are excluded from the tables because the parameters for these items are fixed and, therefore, identical across groups. The column headed Δχ² contains the increase in overall chi-square obtained when item parameters were constrained to be equal across the reference and focal groups. The columns headed ΔRMSEA and ΔCFI show the corresponding increases in the root mean square error of approximation (RMSEA) and CFI. Significant chi-square tests are in bold, and changes in CFI greater than Meade et al.'s (2008) recommended cutoff (i.e., .002) are marked by an asterisk. Modification indices (MIs) are also provided for comparison, and indices greater than 3.84 are significant. The values reported for the MIs depend on the presence of metric or scalar equivalence. If significant, the MI for the factor loading of the item is reported. If not, then the MI for the intercept is provided.

As shown, nearly all of the items in these scales were nonequivalent in one or both of the focal groups. Moreover, the chi-square difference tests and the changes in CFI agreed in most cases. Although the MIs were generally consistent with the chi-square difference test and the change in CFI, this was not always the case. For example, the adjective talkative in the Extraversion scale was flagged as nonequivalent by both the chi-square difference test and the change in CFI. However, the MI did not identify significant DIF for this item at the metric or scalar level. Despite these discrepancies, nonequivalence appears to be pervasive in this measure of personality when used cross-culturally.

Table 3
Measurement Equivalence of Extraversion Across Cultures
[Δχ², ΔRMSEA, ΔCFI, MI, and d_MACS values for the items talkative, bold, energetic, quiet, bashful, and withdrawn in the Greek (G) and Chinese (C) samples were lost in transcription.]
Note. Bold values represent significant chi-square differences. All modification indices greater than 3.84 are significant and suggest that differential item functioning (DIF) is present. Indices are not presented for the referent items because these items are constrained across groups and, therefore, the values in each column are zero. RMSEA = root mean square error of approximation; CFI = comparative fit index; MI = modification index; MACS = mean and covariance structure; G = Greek sample; C = Chinese sample.
a Modification index identifying metric nonequivalence. All significant modification indices suggested metric nonequivalence, and, therefore, scalar nonequivalence is not identified here. b Nonsignificant modification indices are represented by the values for the intercept of the item. * ΔCFI > .002.

The final two columns in Tables 3–5 show the effect sizes of nonequivalence for each of the items. In both the Greek and Chinese samples, a range of nonequivalence was found. Within a single scale, the broadest range of effect sizes was observed for the Extraversion scale in Table 3, where the magnitude of effects ranged from 0.26 for the adjective bold in the Greek sample to 1.11 for the same adjective in the Chinese sample. If Cohen's (1988) guidelines (i.e., values greater than 0.20 are considered small, 0.50 medium, and 0.80 or greater large) are used, nonequivalence on the Extraversion scale ranged from small to large. The effects for the adjective bold are also graphed in Figures 1 and 2.
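Effect sizes like these can be approximated numerically: as described above, d_MACS is the root-mean-square difference between the two groups' predicted item responses, weighted by the focal group's (assumed normal) latent distribution and scaled by the pooled item standard deviation. A sketch with hypothetical parameters follows (the authors' dMACS program should be used for real analyses):

```python
import numpy as np
from scipy.stats import norm

def d_macs(lam_r, nu_r, lam_f, nu_f, mu_f, sd_f, sd_pooled, n=20001):
    """Grid approximation of d_MACS: integrate the squared difference between
    the reference (nu_r + lam_r*eta) and focal (nu_f + lam_f*eta) predicted
    responses over the focal group's normal latent distribution."""
    eta = np.linspace(mu_f - 6 * sd_f, mu_f + 6 * sd_f, n)
    w = norm.pdf(eta, loc=mu_f, scale=sd_f)
    w /= w.sum()  # normalize the grid weights so they sum to 1
    diff_sq = ((nu_r + lam_r * eta) - (nu_f + lam_f * eta)) ** 2
    return float(np.sqrt(np.sum(diff_sq * w)) / sd_pooled)

# Equal parameters -> no DIF -> effect size of zero
print(d_macs(1.0, 2.5, 1.0, 2.5, mu_f=0.0, sd_f=1.0, sd_pooled=1.0))
# Intercept shifted by 0.5 -> uniform DIF of 0.5 pooled-SD units
print(d_macs(1.0, 2.5, 1.0, 3.0, mu_f=0.0, sd_f=1.0, sd_pooled=1.0))
```

Because the differences are squared before integrating, d_MACS is nonnegative and captures loading (nonuniform) as well as intercept (uniform) differences, which is why it can be read against Cohen's (1988) small/medium/large benchmarks.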
Figure 1 shows that the regression lines for the American English and Greek samples do not differ much at any point on the latent trait continuum. Thus, although this item may display significant nonequivalence, the magnitude of the effect appears small. In contrast, large differences are evident in the Chinese sample, particularly at the upper end of the latent trait continuum. In other words, highly extraverted respondents in the American English and Chinese samples are likely to respond differently to this item.

Tables 4 and 5 present the results for the Agreeableness and Conscientiousness scales, respectively. Overall, results were generally consistent with those for the Extraversion scale. For both Agreeableness and positive Conscientiousness, a range of nonequivalence was identified. The smallest effects were observed on the Agreeableness scale. Here, the effect sizes for four of the six items were 0.20 or below in the Greek sample. However, substantially larger effects were observed in the Chinese sample and on several of the positively worded Conscientiousness items.

Note that d_MACS provides useful information for interpreting the magnitudes of the chi-square test and the ΔCFI. In particular, using d_MACS to evaluate results does not force a dichotomous interpretation that an effect either exists or does not exist (cf. Kirk, 2006). Instead, this index can be used to evaluate the magnitude of the differences between groups on a continuum of nonequivalence. The adjective unsympathetic in the Agreeableness scale (presented in Table 4) provides a compelling example of the information that can be gained from using d_MACS. Both the chi-square index and the ΔCFI indicate that this item displays significant nonequivalence. However, the effect size presented for the Greek sample suggests that the magnitude of the difference between American English and Greek respondents is small. Similar results were obtained for the adjectives rude and harsh.

Table 4
Measurement Equivalence of Agreeableness Across Cultures
[Δχ², ΔRMSEA, ΔCFI, MI, and d_MACS values for the items warm, kind, unsympathetic, cooperative, rude, and harsh in the Greek (G) and Chinese (C) samples were lost in transcription.]
Note. Bold values represent significant chi-square differences. All modification indices greater than 3.84 are significant and suggest that differential item functioning (DIF) is present. Indices are not presented for the referent items because these items are constrained across groups and, therefore, the values in each column are zero. RMSEA = root mean square error of approximation; CFI = comparative fit index; MI = modification index; MACS = mean and covariance structure; G = Greek sample; C = Chinese sample.
a Modification index identifying metric nonequivalence. All significant modification indices suggested metric nonequivalence, and, therefore, scalar nonequivalence is not identified here. b Nonsignificant modification indices are represented by the values for the intercept of the item. * ΔCFI > .002.

Table 5
Measurement Equivalence of Conscientiousness Across Cultures
[Δχ², ΔRMSEA, ΔCFI, MI, and d_MACS values for the items organized, efficient, systematic, inefficient, sloppy, and careless in the Greek (G) and Chinese (C) samples were lost in transcription.]
Note. Bold values represent significant chi-square differences. All modification indices greater than 3.84 are significant and suggest that differential item functioning (DIF) is present. Indices are not presented for the referent items because these items are constrained across groups and, therefore, the values in each column are zero. RMSEA = root mean square error of approximation; CFI = comparative fit index; MI = modification index; MACS = mean and covariance structure; G = Greek sample; C = Chinese sample.
a Modification index identifying metric nonequivalence. All significant modification indices suggested metric nonequivalence, and, therefore, scalar nonequivalence is not identified here. b Nonsignificant modification indices are represented by the values for the intercept of the item. * ΔCFI > .002.

Because an equivalent referent item was not available for the negatively worded factor in the Conscientiousness scale, accurate estimates of the effect sizes for these items are not possible. Therefore, we use these items to illustrate the influence of the referent item on the magnitude of the differences between groups. Table 6 provides effect sizes for the three negatively worded items using each of the other items on this factor as the referent.
The bolded values represent the largest differences between effect sizes for the same item. For example, with the adjective disorganized as the referent item, the effect size for sloppy was 0.85 in the Chinese sample. However, when the adjective careless was used as the referent, the effect size was markedly different. Similar results were obtained for the other items as well. Thus, using a nonequivalent referent item can have a substantial effect on the magnitude of differences between groups.

Table 6
Demonstrating the Importance of an Equivalent Referent Item Using the Conscientiousness Scale
[d_MACS values for disorganized, sloppy, and careless in the Greek (G) and Chinese (C) samples, computed with each of the other negatively worded items serving as the referent, were lost in transcription.]
Note. Bolded values identify the item-level analyses that were affected most by the choice of the referent indicator. MACS = mean and covariance structure; G = Greek sample; C = Chinese sample.

The Effects of Nonequivalence on Scale Characteristics

Although the item-level effect size indices were calculated separately for items loading on a single latent factor, ΔMean and ΔVar were aggregated across the two latent factors in the present study. Although this practice may not always be appropriate, we used this approach for the present study because the two factors represented methodological rather than substantive dimensions. As a consequence, research typically reports results at the scale level rather than for each methodological subfactor. Therefore, the effect of nonequivalence at the scale level is arguably more important than differences in the method factors.

The effects of DIF on the scales' properties are shown in Table 7. The first column shows the differences between the means of the reference and focal groups due to DIF. As shown in Equation 6, the focal group's mean was subtracted from the reference group's to obtain ΔMean(X_S). Thus, negative values in this column suggest that DIF will result in a higher mean for the focal group.
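Equation 6 itself is not reproduced in this excerpt, but under the linear MACS model the signed item-level DIF effect it describes reduces to a closed form: the difference in intercepts plus the difference in loadings times the focal group's latent mean, summed over items to give the scale-level figure. A sketch with hypothetical parameter values:

```python
def delta_mean_item(lam_r, nu_r, lam_f, nu_f, mu_f):
    """Signed DIF effect on one item's expected score: the reference group's
    predicted response minus the focal group's, averaged over a focal latent
    distribution with mean mu_f."""
    return (nu_r - nu_f) + (lam_r - lam_f) * mu_f

def scale_delta_mean(items):
    """Scale-level DIF effect: the sum of the item-level effects."""
    return sum(delta_mean_item(**item) for item in items)

# Two hypothetical items; a negative total means DIF inflates the focal mean
items = [
    dict(lam_r=1.0, nu_r=2.5, lam_f=0.8, nu_f=2.9, mu_f=-0.3),
    dict(lam_r=1.1, nu_r=3.0, lam_f=1.1, nu_f=2.8, mu_f=-0.3),
]
dif = scale_delta_mean(items)
observed = 2.0  # hypothetical observed mean difference (DIF + impact)
print(dif, 100 * dif / observed)  # raw-score DIF and its share of the gap
```

Dividing the DIF component by the total observed mean difference gives the percentage attributable to DIF, with the remainder reflecting true impact, as in Table 7.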
The next column shows the total observed differences between groups (i.e., DIF + impact), and the following column provides the percentage of the observed difference that is accounted for by DIF; the remaining difference can be attributed to impact. The fourth column gives the differences between the variances of the scales in the reference and focal groups due to DIF, and the next two columns give the corresponding observed differences and the percentage of this difference attributable to DIF, respectively. The final column shows the range of d_MACS for each focal group. As shown in Table 7, there was a range of differences between the means of the reference and focal groups due to DIF. The largest difference was 2.08 points for the Extraversion scale in the Chinese sample, and the smallest was 0.26 points for the Agreeableness scale in the Greek sample. These differences are in raw score points and indicate that the U.S. group would be expected to have a mean that was 2.08 points higher (or 0.26 points lower in the smallest case) than the Chinese group because of DIF. For the Extraversion and Agreeableness scales, these differences are at the scale level and, therefore, should be interpreted relative to a 40-point maximum score (i.e., eight items with five response options). In contrast, the effects of DIF on the Conscientiousness


More information

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz

Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz Analysis of the Reliability and Validity of an Edgenuity Algebra I Quiz This study presents the steps Edgenuity uses to evaluate the reliability and validity of its quizzes, topic tests, and cumulative

More information

Journal of Applied Developmental Psychology

Journal of Applied Developmental Psychology Journal of Applied Developmental Psychology 35 (2014) 294 303 Contents lists available at ScienceDirect Journal of Applied Developmental Psychology Capturing age-group differences and developmental change

More information

Assessing Measurement Invariance of the Teachers Perceptions of Grading Practices Scale across Cultures

Assessing Measurement Invariance of the Teachers Perceptions of Grading Practices Scale across Cultures Assessing Measurement Invariance of the Teachers Perceptions of Grading Practices Scale across Cultures Xing Liu Assistant Professor Education Department Eastern Connecticut State University 83 Windham

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Using a multilevel structural equation modeling approach to explain cross-cultural measurement noninvariance

Using a multilevel structural equation modeling approach to explain cross-cultural measurement noninvariance Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 2012 Using a multilevel structural equation modeling approach to explain cross-cultural

More information

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models

Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Detection of Differential Test Functioning (DTF) and Differential Item Functioning (DIF) in MCCQE Part II Using Logistic Models Jin Gong University of Iowa June, 2012 1 Background The Medical Council of

More information

The Bilevel Structure of the Outcome Questionnaire 45

The Bilevel Structure of the Outcome Questionnaire 45 Psychological Assessment 2010 American Psychological Association 2010, Vol. 22, No. 2, 350 355 1040-3590/10/$12.00 DOI: 10.1037/a0019187 The Bilevel Structure of the Outcome Questionnaire 45 Jamie L. Bludworth,

More information

REVERSED ITEM BIAS: AN INTEGRATIVE MODEL

REVERSED ITEM BIAS: AN INTEGRATIVE MODEL REVERSED ITEM BIAS: AN INTEGRATIVE MODEL Bert Weijters, Hans Baumgartner, and Niels Schillewaert - Accepted for publication in Psychological Methods - Bert Weijters, Vlerick Business School and Ghent University,

More information

Multiple Act criterion:

Multiple Act criterion: Common Features of Trait Theories Generality and Stability of Traits: Trait theorists all use consistencies in an individual s behavior and explain why persons respond in different ways to the same stimulus

More information

The Nature and Structure of Correlations Among Big Five Ratings: The Halo-Alpha-Beta Model

The Nature and Structure of Correlations Among Big Five Ratings: The Halo-Alpha-Beta Model See discussions, stats, and author profiles for this publication at: http://www.researchgate.net/publication/40455341 The Nature and Structure of Correlations Among Big Five Ratings: The Halo-Alpha-Beta

More information

The Problem of Measurement Model Misspecification in Behavioral and Organizational Research and Some Recommended Solutions

The Problem of Measurement Model Misspecification in Behavioral and Organizational Research and Some Recommended Solutions Journal of Applied Psychology Copyright 2005 by the American Psychological Association 2005, Vol. 90, No. 4, 710 730 0021-9010/05/$12.00 DOI: 10.1037/0021-9010.90.4.710 The Problem of Measurement Model

More information

Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology*

Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology* Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology* Timothy Teo & Chwee Beng Lee Nanyang Technology University Singapore This

More information

Running head: CFA OF TDI AND STICSA 1. p Factor or Negative Emotionality? Joint CFA of Internalizing Symptomology

Running head: CFA OF TDI AND STICSA 1. p Factor or Negative Emotionality? Joint CFA of Internalizing Symptomology Running head: CFA OF TDI AND STICSA 1 p Factor or Negative Emotionality? Joint CFA of Internalizing Symptomology Caspi et al. (2014) reported that CFA results supported a general psychopathology factor,

More information

The Factor Structure and Factorial Invariance for the Decisional Balance Scale for Adolescent Smoking

The Factor Structure and Factorial Invariance for the Decisional Balance Scale for Adolescent Smoking Int. J. Behav. Med. (2009) 16:158 163 DOI 10.1007/s12529-008-9021-5 The Factor Structure and Factorial Invariance for the Decisional Balance Scale for Adolescent Smoking Boliang Guo & Paul Aveyard & Antony

More information

Assessing e-banking Adopters: an Invariance Approach

Assessing e-banking Adopters: an Invariance Approach Assessing e-banking Adopters: an Invariance Approach Vincent S. Lai 1), Honglei Li 2) 1) The Chinese University of Hong Kong (vslai@cuhk.edu.hk) 2) The Chinese University of Hong Kong (honglei@baf.msmail.cuhk.edu.hk)

More information

The Youth Experience Survey 2.0: Instrument Revisions and Validity Testing* David M. Hansen 1 University of Illinois, Urbana-Champaign

The Youth Experience Survey 2.0: Instrument Revisions and Validity Testing* David M. Hansen 1 University of Illinois, Urbana-Champaign The Youth Experience Survey 2.0: Instrument Revisions and Validity Testing* David M. Hansen 1 University of Illinois, Urbana-Champaign Reed Larson 2 University of Illinois, Urbana-Champaign February 28,

More information

The Multidimensionality of Revised Developmental Work Personality Scale

The Multidimensionality of Revised Developmental Work Personality Scale The Multidimensionality of Revised Developmental Work Personality Scale Work personality has been found to be vital in developing the foundation for effective vocational and career behavior (Bolton, 1992;

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

Methodological Issues in Measuring the Development of Character

Methodological Issues in Measuring the Development of Character Methodological Issues in Measuring the Development of Character Noel A. Card Department of Human Development and Family Studies College of Liberal Arts and Sciences Supported by a grant from the John Templeton

More information

Measurement Equivalence of Ordinal Items: A Comparison of Factor. Analytic, Item Response Theory, and Latent Class Approaches.

Measurement Equivalence of Ordinal Items: A Comparison of Factor. Analytic, Item Response Theory, and Latent Class Approaches. Measurement Equivalence of Ordinal Items: A Comparison of Factor Analytic, Item Response Theory, and Latent Class Approaches Miloš Kankaraš *, Jeroen K. Vermunt* and Guy Moors* Abstract Three distinctive

More information

Analysis of single gene effects 1. Quantitative analysis of single gene effects. Gregory Carey, Barbara J. Bowers, Jeanne M.

Analysis of single gene effects 1. Quantitative analysis of single gene effects. Gregory Carey, Barbara J. Bowers, Jeanne M. Analysis of single gene effects 1 Quantitative analysis of single gene effects Gregory Carey, Barbara J. Bowers, Jeanne M. Wehner From the Department of Psychology (GC, JMW) and Institute for Behavioral

More information

Simple Linear Regression the model, estimation and testing

Simple Linear Regression the model, estimation and testing Simple Linear Regression the model, estimation and testing Lecture No. 05 Example 1 A production manager has compared the dexterity test scores of five assembly-line employees with their hourly productivity.

More information

Running head: CFA OF STICSA 1. Model-Based Factor Reliability and Replicability of the STICSA

Running head: CFA OF STICSA 1. Model-Based Factor Reliability and Replicability of the STICSA Running head: CFA OF STICSA 1 Model-Based Factor Reliability and Replicability of the STICSA The State-Trait Inventory of Cognitive and Somatic Anxiety (STICSA; Ree et al., 2008) is a new measure of anxiety

More information

Factor structure and measurement invariance of a 10-item decisional balance scale:

Factor structure and measurement invariance of a 10-item decisional balance scale: Decisional Balance Measurement Invariance - 1 Factor structure and measurement invariance of a 10-item decisional balance scale: Longitudinal and subgroup examination within an adult diabetic sample. Michael

More information

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model

Latent Trait Standardization of the Benzodiazepine Dependence. Self-Report Questionnaire using the Rasch Scaling Model Chapter 7 Latent Trait Standardization of the Benzodiazepine Dependence Self-Report Questionnaire using the Rasch Scaling Model C.C. Kan 1, A.H.G.S. van der Ven 2, M.H.M. Breteler 3 and F.G. Zitman 1 1

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

A methodological perspective on the analysis of clinical and personality questionnaires Smits, Iris Anna Marije

A methodological perspective on the analysis of clinical and personality questionnaires Smits, Iris Anna Marije University of Groningen A methodological perspective on the analysis of clinical and personality questionnaires Smits, Iris Anna Mare IMPORTANT NOTE: You are advised to consult the publisher's version

More information

ON-LINE TECHNICAL APPENDIX

ON-LINE TECHNICAL APPENDIX ON-LINE TECHNICAL APPENDIX Not another safety culture survey : Using the Canadian Patient Safety Climate Survey (Can-PSCS) to measure provider perceptions of PSC across health settings Authors: Ginsburg,

More information

Assessing the item response theory with covariate (IRT-C) procedure for ascertaining. differential item functioning. Louis Tay

Assessing the item response theory with covariate (IRT-C) procedure for ascertaining. differential item functioning. Louis Tay ASSESSING DIF WITH IRT-C 1 Running head: ASSESSING DIF WITH IRT-C Assessing the item response theory with covariate (IRT-C) procedure for ascertaining differential item functioning Louis Tay University

More information

International Conference on Humanities and Social Science (HSS 2016)

International Conference on Humanities and Social Science (HSS 2016) International Conference on Humanities and Social Science (HSS 2016) The Chinese Version of WOrk-reLated Flow Inventory (WOLF): An Examination of Reliability and Validity Yi-yu CHEN1, a, Xiao-tong YU2,

More information

The CSGU: A Measure of Controllability, Stability, Globality, and Universality Attributions

The CSGU: A Measure of Controllability, Stability, Globality, and Universality Attributions Journal of Sport & Exercise Psychology, 2008, 30, 611-641 2008 Human Kinetics, Inc. The CSGU: A Measure of Controllability, Stability, Globality, and Universality Attributions Pete Coffee and Tim Rees

More information

TLQ Reliability, Validity and Norms

TLQ Reliability, Validity and Norms MSP Research Note TLQ Reliability, Validity and Norms Introduction This research note describes the reliability and validity of the TLQ. Evidence for the reliability and validity of is presented against

More information

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE

ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION SCALE California State University, San Bernardino CSUSB ScholarWorks Electronic Theses, Projects, and Dissertations Office of Graduate Studies 6-2016 ITEM RESPONSE THEORY ANALYSIS OF THE TOP LEADERSHIP DIRECTION

More information

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups

GMAC. Scaling Item Difficulty Estimates from Nonequivalent Groups GMAC Scaling Item Difficulty Estimates from Nonequivalent Groups Fanmin Guo, Lawrence Rudner, and Eileen Talento-Miller GMAC Research Reports RR-09-03 April 3, 2009 Abstract By placing item statistics

More information

The Modification of Dichotomous and Polytomous Item Response Theory to Structural Equation Modeling Analysis

The Modification of Dichotomous and Polytomous Item Response Theory to Structural Equation Modeling Analysis Canadian Social Science Vol. 8, No. 5, 2012, pp. 71-78 DOI:10.3968/j.css.1923669720120805.1148 ISSN 1712-8056[Print] ISSN 1923-6697[Online] www.cscanada.net www.cscanada.org The Modification of Dichotomous

More information

CONFIRMATORY ANALYSIS OF EXPLORATIVELY OBTAINED FACTOR STRUCTURES

CONFIRMATORY ANALYSIS OF EXPLORATIVELY OBTAINED FACTOR STRUCTURES EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT VAN PROOIJEN AND VAN DER KLOOT CONFIRMATORY ANALYSIS OF EXPLORATIVELY OBTAINED FACTOR STRUCTURES JAN-WILLEM VAN PROOIJEN AND WILLEM A. VAN DER KLOOT Leiden University

More information

An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use

An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use An Empirical Study on Causal Relationships between Perceived Enjoyment and Perceived Ease of Use Heshan Sun Syracuse University hesun@syr.edu Ping Zhang Syracuse University pzhang@syr.edu ABSTRACT Causality

More information

Aesthetic Response to Color Combinations: Preference, Harmony, and Similarity. Supplementary Material. Karen B. Schloss and Stephen E.

Aesthetic Response to Color Combinations: Preference, Harmony, and Similarity. Supplementary Material. Karen B. Schloss and Stephen E. Aesthetic Response to Color Combinations: Preference, Harmony, and Similarity Supplementary Material Karen B. Schloss and Stephen E. Palmer University of California, Berkeley Effects of Cut on Pair Preference,

More information

Psychometric Details of the 20-Item UFFM-I Conscientiousness Scale

Psychometric Details of the 20-Item UFFM-I Conscientiousness Scale Psychometric Details of the 20-Item UFFM-I Conscientiousness Scale Documentation Prepared By: Nathan T. Carter & Rachel L. Williamson Applied Psychometric Laboratory at The University of Georgia Last Updated:

More information

VALIDATION OF TWO BODY IMAGE MEASURES FOR MEN AND WOMEN. Shayna A. Rusticus Anita M. Hubley University of British Columbia, Vancouver, BC, Canada

VALIDATION OF TWO BODY IMAGE MEASURES FOR MEN AND WOMEN. Shayna A. Rusticus Anita M. Hubley University of British Columbia, Vancouver, BC, Canada The University of British Columbia VALIDATION OF TWO BODY IMAGE MEASURES FOR MEN AND WOMEN Shayna A. Rusticus Anita M. Hubley University of British Columbia, Vancouver, BC, Canada Presented at the Annual

More information

Anumber of studies have shown that ignorance regarding fundamental measurement

Anumber of studies have shown that ignorance regarding fundamental measurement 10.1177/0013164406288165 Educational Graham / Congeneric and Psychological Reliability Measurement Congeneric and (Essentially) Tau-Equivalent Estimates of Score Reliability What They Are and How to Use

More information

Understanding University Students Implicit Theories of Willpower for Strenuous Mental Activities

Understanding University Students Implicit Theories of Willpower for Strenuous Mental Activities Understanding University Students Implicit Theories of Willpower for Strenuous Mental Activities Success in college is largely dependent on students ability to regulate themselves independently (Duckworth

More information

The Association Design and a Continuous Phenotype

The Association Design and a Continuous Phenotype PSYC 5102: Association Design & Continuous Phenotypes (4/4/07) 1 The Association Design and a Continuous Phenotype The purpose of this note is to demonstrate how to perform a population-based association

More information

Extraversion. The Extraversion factor reliability is 0.90 and the trait scale reliabilities range from 0.70 to 0.81.

Extraversion. The Extraversion factor reliability is 0.90 and the trait scale reliabilities range from 0.70 to 0.81. MSP RESEARCH NOTE B5PQ Reliability and Validity This research note describes the reliability and validity of the B5PQ. Evidence for the reliability and validity of is presented against some of the key

More information

Using contextual analysis to investigate the nature of spatial memory

Using contextual analysis to investigate the nature of spatial memory Psychon Bull Rev (2014) 21:721 727 DOI 10.3758/s13423-013-0523-z BRIEF REPORT Using contextual analysis to investigate the nature of spatial memory Karen L. Siedlecki & Timothy A. Salthouse Published online:

More information

Investigating the robustness of the nonparametric Levene test with more than two groups

Investigating the robustness of the nonparametric Levene test with more than two groups Psicológica (2014), 35, 361-383. Investigating the robustness of the nonparametric Levene test with more than two groups David W. Nordstokke * and S. Mitchell Colp University of Calgary, Canada Testing

More information

Linking Assessments: Concept and History

Linking Assessments: Concept and History Linking Assessments: Concept and History Michael J. Kolen, University of Iowa In this article, the history of linking is summarized, and current linking frameworks that have been proposed are considered.

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

Connectedness DEOCS 4.1 Construct Validity Summary

Connectedness DEOCS 4.1 Construct Validity Summary Connectedness DEOCS 4.1 Construct Validity Summary DEFENSE EQUAL OPPORTUNITY MANAGEMENT INSTITUTE DIRECTORATE OF RESEARCH DEVELOPMENT AND STRATEGIC INITIATIVES Directed by Dr. Daniel P. McDonald, Executive

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

The Influence of Psychological Empowerment on Innovative Work Behavior among Academia in Malaysian Research Universities

The Influence of Psychological Empowerment on Innovative Work Behavior among Academia in Malaysian Research Universities DOI: 10.7763/IPEDR. 2014. V 78. 21 The Influence of Psychological Empowerment on Innovative Work Behavior among Academia in Malaysian Research Universities Azra Ayue Abdul Rahman 1, Siti Aisyah Panatik

More information

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment

Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Marshall University Marshall Digital Scholar Management Faculty Research Management, Marketing and MIS Fall 11-14-2009 Personality Traits Effects on Job Satisfaction: The Role of Goal Commitment Wai Kwan

More information

A Hierarchical Comparison on Influence Paths from Cognitive & Emotional Trust to Proactive Behavior Between China and Japan

A Hierarchical Comparison on Influence Paths from Cognitive & Emotional Trust to Proactive Behavior Between China and Japan A Hierarchical Comparison on Influence Paths from Cognitive & Emotional Trust to Proactive Behavior Between China and Japan Pei Liu School of Management and Economics, North China Zhen Li Data Science

More information

Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking

Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking Using the Distractor Categories of Multiple-Choice Items to Improve IRT Linking Jee Seon Kim University of Wisconsin, Madison Paper presented at 2006 NCME Annual Meeting San Francisco, CA Correspondence

More information

Self-Oriented and Socially Prescribed Perfectionism in the Eating Disorder Inventory Perfectionism Subscale

Self-Oriented and Socially Prescribed Perfectionism in the Eating Disorder Inventory Perfectionism Subscale Self-Oriented and Socially Prescribed Perfectionism in the Eating Disorder Inventory Perfectionism Subscale Simon B. Sherry, 1 Paul L. Hewitt, 1 * Avi Besser, 2 Brandy J. McGee, 1 and Gordon L. Flett 3

More information

Isabel Castillo, Inés Tomás, and Isabel Balaguer University of Valencia, Valencia, Spain

Isabel Castillo, Inés Tomás, and Isabel Balaguer University of Valencia, Valencia, Spain International Journal of Testing, : 21 32, 20 0 Copyright C Taylor & Francis Group, LLC ISSN: 1530-5058 print / 1532-7574 online DOI: 10.1080/15305050903352107 The Task and Ego Orientation in Sport Questionnaire:

More information

Psychometric Validation of the Four Factor Situational Temptations for Smoking Inventory in Adult Smokers

Psychometric Validation of the Four Factor Situational Temptations for Smoking Inventory in Adult Smokers University of Rhode Island DigitalCommons@URI Open Access Master's Theses 2013 Psychometric Validation of the Four Factor Situational Temptations for Smoking Inventory in Adult Smokers Hui-Qing Yin University

More information

Evaluating Factor Structures of Measures in Group Research: Looking Between and Within

Evaluating Factor Structures of Measures in Group Research: Looking Between and Within Group Dynamics: Theory, Research, and Practice 2016 American Psychological Association 2016, Vol. 20, No. 3, 165 180 1089-2699/16/$12.00 http://dx.doi.org/10.1037/gdn0000043 Evaluating Factor Structures

More information