Instrument equivalence across ethnic groups Antonio Olmos (MHCD) Susan R. Hutchinson (UNC)
Overview: Instrument Equivalence; Measurement Invariance; Invariance in Reliability Scores; Factorial Invariance; Item Response Theory; The Study (Sample, Instrument, Methods, Results); Future Directions
Importance of Cross-cultural Measurement Equivalence Standardized psychological measurement instruments must provide equivalent measurement across subpopulations if comparative statements are to have substantive import. Without equivalent measurement, observed scores are not directly comparable (Drasgow & Kanfer, 1985)
How is cross-cultural equivalence manifested? Test scores from a given measure should be equally accurate for different groups; this means that reliability coefficients should be the same for all groups. The factor structure of a given instrument should be the same for all relevant groups (e.g., if depression has 2 subdomains, it should have 2 subdomains for all cultural subgroups)
Equivalence (cont'd) Each item on a particular instrument should mean the same thing to people from different cultural groups; i.e., a psychological test that lacks item equivalence is in essence two different tests, one for each cultural group
When do we achieve Instrument Equivalence? Meaningful comparisons across different cultures cannot be guaranteed unless we have: Construct congruence: the factor structure of the instruments is the same across different cultures Scalar equivalence: the numbers assigned to the items have the same meaning in the different cultures
What if we are not using questionnaires? Historically, the belief was that etiology, expression, course, and outcome of psychopathology were universal and independent of cultural factors. Now it is assumed that culture can play a role in psychopathology by: Determining standards of normality Creating personality configurations that may look pathological in one culture but not in another
Even when using behavioral scores, we could face cross-cultural biases Normal behaviors in one culture can be classified as pathological in another Dependency is valued in Japan, whereas in America it has negative connotations Culturally different individuals are not adequately represented in the norm groups Classifying individuals of different cultures as pathological, based on inappropriate norms, can have tragic consequences
Instrument Equivalence in Mental Health Use of behavior rating scales where rater and ratee come from different cultures If cross-cultural differences result in biased ratings, then scales are not diagnostically valid for those groups May suggest the need to generate norms for different cultures Development of models to study the factor structure of the measure across different cultures
Measurement Invariance Whether or not, under different conditions of observing a phenomenon, measurement operations yield measurement of the same attribute If there is evidence of variability, findings of differences between individuals and groups cannot be unambiguously interpreted Differences in means can just as easily be interpreted as indicating that different things were measured Associations between variables will involve different attributes for different groups
Reliability Estimates Reflect consistency of responses For clinical instruments, internal consistency reliability, as measured by Cronbach's alpha, reflects the consistency of ratings provided by clinicians for a given client, assuming all items on the measure are measuring the same trait
Invariance in reliability scores Technically, a reliability coefficient is an r-squared; therefore, we can treat reliabilities as correlations and use a Fisher z-test of the difference between two groups (e.g., White vs. African American):

Zr1_white = (1/2) [ ln(1 + corr1_wh) - ln(1 - corr1_wh) ]

se_wh_aa = sqrt( 1/(n_wh - 3) + 1/(n_aa - 3) )

Z_test_wh_aa_1 = (Zr1_white - Zr1_aa) / se_wh_aa
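This z-test can be sketched in Python (a minimal sketch; the example plugs in the Factor 1 reliabilities reported later for the White/Caucasian group, n = 134, and the African American group, n = 99; `fisher_z_test` is an illustrative helper name):

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Compare two reliability coefficients treated as correlations:
    apply Fisher's r-to-z transform, then form a z statistic."""
    z1 = 0.5 * math.log((1 + r1) / (1 - r1))
    z2 = 0.5 * math.log((1 + r2) / (1 - r2))
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# Factor 1: White/Caucasian 0.8157 (n = 134) vs African American 0.8105 (n = 99)
z = fisher_z_test(0.8157, 134, 0.8105, 99)  # ~0.11, well below 1.96
```

A |z| below 1.96 indicates no significant difference in reliability at the .05 level.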
Factorial Invariance Another way to test measurement invariance is FACTORIAL INVARIANCE The main question it addresses: Do the items making up a particular measuring instrument operate equivalently across different populations (e.g., Whites and Hispanics)? The measurement model is group-invariant Tests for Factorial Invariance (in order of difficulty):
Steps in Factor Invariance testing (1) 1 Equivalent Factor structure Same number of factors, items associated with the same factors (Structural model invariance) 2 Equivalent Factor loading paths Factor loadings are identical for every item and every factor
Steps in Factor Invariance testing (2) 3 Equivalent Factor variance/covariance Variances and Covariances (correlations) among factors are the same across populations 4 Equivalent Item reliabilities Residuals for every item are the same across populations
Testing item equivalence Although there are a number of possible methods for testing item equivalence across groups, in our study, we used Rasch modeling to assess differences in item parameters Rasch modeling is one of many types of Item Response Theory models
What is Item Response Theory? Item Response Theory (IRT) is a method for providing information about items on a particular instrument as well as scores for persons responding to the items Historically, IRT has been used primarily by testing companies on aptitude and achievement tests to determine how difficult the items are to determine how well the items discriminate between people of high versus low ability
IRT cont'd More recently, IRT has been applied to rating scales, with much of the rating-scale analysis conducted using the Rasch model; the Rasch model has a number of favorable characteristics, including requiring fewer subjects
What does the Rasch analysis tell us about rating scales? It provides difficulty indices for each item that indicate how relatively easy or difficult it is for a given individual to receive a high rating. On a clinical scale measuring some type of dysfunction, for example, an easy item suggests that even clients with low levels of dysfunction have a high probability of receiving a high rating, whereas a difficult item would require much higher levels of dysfunction in order to receive a high rating
Rasch cont d Rasch analysis also provides ability scores for the people responding to the items high scores indicate higher levels of the trait being measured on a clinical scale measuring levels of dysfunction, high scores would indicate greater levels of dysfunction Rasch analysis also provides a map comparing the item difficulties versus person abilities for a given group of items and persons this information is useful in determining how appropriate the test is for that group of individuals a good instrument has items that match the level of dysfunction for that group
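The difficulty/ability logic above can be illustrated with the dichotomous Rasch model (the study analyzed rating-scale data, so this one-parameter form with hypothetical logit values is only a sketch):

```python
import math

def rasch_prob(theta, b):
    """Probability of a positive response (e.g., receiving a high rating)
    under the dichotomous Rasch model, for person measure theta and
    item difficulty b, both on the logit scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An easy item (b = -1): even a client with modest dysfunction
# (theta = 0) is likely to receive a high rating.
easy = rasch_prob(0.0, -1.0)  # ~0.73
# A difficult item (b = +2) requires much higher dysfunction.
hard = rasch_prob(0.0, 2.0)   # ~0.12
```

When theta equals b the probability is exactly 0.5, which is how item difficulties are located on the same logit scale as person measures in the person-item maps described later.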
How is Rasch analysis used in assessing item equivalence? The difficulty indices for different groups can be compared to determine if ratings are being applied in the same way across groups The approach we selected was to compute standardized differences (z-tests) between items, with a cutoff value of ±1.96 suggesting a substantial difference Large differences suggest that rating values on that item tend to be higher for clients in one cultural group than in the other, despite their having equal levels of dysfunction; such a difference suggests potential bias in ratings
The Study Instrument equivalence in a sample of children with mental illness Evaluated using the Colorado Client Assessment Record (CCAR) Three ethnic groups: White/Caucasian, African American, Hispanic
The sample 317 children enrolled at the Mental Health Corporation of Denver between 2000 and 2001 (134 White/Caucasian, 99 African American, 84 Hispanic) Ages between 4 and 15 years old (M = 13.7, SD = 2.9) 96.7% had English as primary language; 3.3% had Spanish as primary language Primary diagnosis: Major Depression
The Instrument Colorado Client Assessment Record, Problem Severity Scales only Items: Thought Process, Manic Issues, Suicide-Danger Self, Cognitive Problems, Anxiety, Attention Problems, Depressive issues, Emotional Withdrawal, Family issues & problems, Role, Resistiveness, Legal, Aggressive-Danger others, Socialization Issues, Security-Management issues, Interpersonal, Medical-physical, Self Care-Basic needs
Method Exploratory Factor Analysis (EFA) All 317 kids PC and PFA, with Promax Rotation to allow correlation among factors Reliability analysis to explore fit (by ethnicity) Factorial Invariance using Confirmatory Factor Analysis based on EFA (entire sample & by ethnicity) Item Response Theory (IRT, or Rasch) Model (by ethnicity)
Exploratory Factor Analysis

Loadings across the four factors for each group (EVERYBODY, WHITE-CAUCASIAN, HISPANIC):

Thought Process             0.883   0.924   0.573
Manic Issues                0.751   0.571   0.772
Suicide-Danger Self         0.728   0.760   0.449  -0.469
Cognitive Problems          0.602   0.697   0.403
Anxiety                     0.517   0.426   0.667   0.759
Attention Problems          0.421   0.548   0.740
Depressive issues           0.831   0.553   0.936
Emotional Withdrawal        0.824   0.725   0.698
Family issues & problems    0.646   0.609   0.459  -0.575
Role                        0.571   0.711   0.438
Resistiveness               0.447   0.795
Legal                       0.865   0.926  -0.444   0.745
Aggressive-Danger others    0.623   0.505   0.683
Socialization Issues        0.566   0.717   0.792
Security-Management issues  0.551   0.492   0.939
Interpersonal               0.455   0.582
Medical-physical            0.878   0.863  -0.636
Self Care-Basic needs       0.569   0.531   0.516
Confirmatory Factor Analysis After requesting advice from clinicians, we reduced the original model from 18 items to 9 Items left out did not contribute to the diagnosis, especially in children The fit of the CFA with the full item set was consistently unacceptable More information about this when we talk about Item Response Theory
Model used for Factor Invariance
Factor 1: Socialization, Viol-other, Interpersonal, Cognitive, Security-mgmt
Factor 2: Emotional wthd, Anxiety, Depression, Suicide
Results Reliability Analysis (Cronbach's alpha by ethnic group):

           White/Caucasian   African American   Hispanic
Factor 1   0.8157            0.8105             0.7530
Factor 2   0.8028            0.7723             0.7488

Z-tests of reliability differences:

           White vs African American   White vs Hispanic   African American vs Hispanic
Factor 1   0.115                       1.169               0.993
Factor 2   0.603                       0.971               0.373
Results Factorial Invariance

Model                                              χ²      df   Δχ²     Δdf  RMSEA  GFI   CFI
1) Number of factors invariant                     143.09  75   --      --   0.090  0.90  0.93
2) Model (1) with pattern of factor loadings
   held invariant (Lambda-X invariant)             175.42  89   32.33#  14   0.094  0.86  0.92
3) Model (2) with factor variances and
   covariances held invariant (Phi invariant)      187.95  95   12.53   6    0.092  0.85  0.91
4) Model (3) with invariance of item
   reliabilities (Theta-Delta invariant)           220.35  115  32.40*  20   0.091  0.80  0.90

* p < 0.05   # p < 0.01
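The Δχ² entries and their significance marks can be checked from the reported model fits (a sketch using SciPy's chi-square survival function; `delta_chi2_p` is an illustrative helper name):

```python
from scipy.stats import chi2

def delta_chi2_p(chi2_restricted, df_restricted, chi2_free, df_free):
    """p-value of the chi-square difference test between two nested models."""
    return chi2.sf(chi2_restricted - chi2_free, df_restricted - df_free)

# Model 2 vs Model 1: delta chi2 = 32.33 on 14 df
p12 = delta_chi2_p(175.42, 89, 143.09, 75)   # p < .01
# Model 3 vs Model 2: delta chi2 = 12.53 on 6 df
p23 = delta_chi2_p(187.95, 95, 175.42, 89)   # not significant at .05
# Model 4 vs Model 3: delta chi2 = 32.40 on 20 df
p34 = delta_chi2_p(220.35, 115, 187.95, 95)  # p < .05
```

A significant Δχ² means the added equality constraints worsen fit, i.e., that level of invariance does not hold exactly.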
χ² (Chi-square): in this context, tests the closeness of fit between the unrestricted sample covariance matrix and the restricted (model-implied) covariance matrix. Very sensitive to sample size: the statistic can be significant, even when the model fits approximately in the population, if the sample size is large.

RMSEA (Root Mean Square Error of Approximation): analyzes the discrepancies between observed and implied covariance matrices. A lower bound of zero indicates perfect fit, with values increasing as fit deteriorates. Values below 0.1 are suggested to indicate good fit and values below 0.05 very good fit; it is recommended not to use models with RMSEA values larger than 0.1.

GFI (Goodness of Fit Index): analogous to R² in that it indicates the proportion of variance explained by the model. Ranges between 0 and 1, with values exceeding 0.9 indicating good fit to the data.

CFI (Comparative Fit Index): indicates the proportion of improvement in overall fit relative to a null (independence) model. Relatively independent of sample size, and penalizes model complexity. Uses a 0-1 norm, with 1 indicating perfect fit; values of about 0.9 or higher reflect good fit.
Tables Z-test comparison The formula used to obtain the z scores is:

z = (d1 - d2) / sqrt(s1² + s2²)

where d1 and d2 are the two groups' difficulty estimates and s1 and s2 their standard errors. Note that zwb = difference between White and Black estimates, zwh = difference between White and Hispanic estimates, and zbh = difference between Black and Hispanic estimates.
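A minimal sketch of this formula (the difficulty estimates and standard errors below are hypothetical; `std_diff` is an illustrative helper name):

```python
import math

def std_diff(d1, s1, d2, s2):
    """Standardized difference between two item difficulty estimates,
    given each estimate's standard error."""
    return (d1 - d2) / math.sqrt(s1 ** 2 + s2 ** 2)

# Hypothetical difficulties (logits), each with a standard error of 0.2:
z = std_diff(1.0, 0.2, 0.5, 0.2)  # ~1.77, just below the 1.96 cutoff
```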
Standardized differences between difficulty estimates for each pair of ethnic groups across all 18 items

Item                      ZWB     ZWH     ZBH
MEDICAL-PHYSICAL          0.33   -2.29   -2.46
SELF-CARE BASIC NEEDS     0.33   -0.28   -0.53
THOUGHT                   0.94   -0.61   -1.32
LEGAL                    -1.80   -0.74    0.83
MANIC ISSUES              1.11   -1.31   -2.21
SUICIDE-DANGER SELF       0.11   -0.55   -0.61
VIOLENCE-DANGER OTHERS    1.31    0.00   -1.14
COGNITIVE                -0.11   -1.11   -0.96
SECURITY MANAGEMENT       1.06   -0.50   -1.40
SOCIALIZATION ISSUES     -0.26    0.81    0.98
ANXIETY                  -0.26    1.86    1.95
ROLE                      0.64    0.93    0.33
INTERPERSONAL            -0.13    0.00    0.11
EMOTIONAL WITHDRAWAL     -1.41    2.21    3.25
DEPRESSION               -1.79    1.98    3.36
FAMILY ISSUES            -2.69    2.32    4.45
RESISTIVENESS             0.35    1.74    1.41
ATTENTION PROBLEMS        1.15   -1.41   -0.46
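Applying the ±1.96 cutoff can be sketched against one column of the 18-item table above (the Black-Hispanic standardized differences, ZBH):

```python
# ZBH standardized differences from the 18-item table;
# |z| >= 1.96 flags a potentially biased item for that group pair.
zbh = {
    "MEDICAL-PHYSICAL": -2.46, "SELF-CARE BASIC NEEDS": -0.53,
    "THOUGHT": -1.32, "LEGAL": 0.83, "MANIC ISSUES": -2.21,
    "SUICIDE-DANGER SELF": -0.61, "VIOLENCE-DANGER OTHERS": -1.14,
    "COGNITIVE": -0.96, "SECURITY MANAGEMENT": -1.40,
    "SOCIALIZATION ISSUES": 0.98, "ANXIETY": 1.95, "ROLE": 0.33,
    "INTERPERSONAL": 0.11, "EMOTIONAL WITHDRAWAL": 3.25,
    "DEPRESSION": 3.36, "FAMILY ISSUES": 4.45,
    "RESISTIVENESS": 1.41, "ATTENTION PROBLEMS": -0.46,
}
flagged = sorted(item for item, z in zbh.items() if abs(z) >= 1.96)
# Five items exceed the cutoff; note Anxiety (1.95) falls just short.
```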
Standardized mean differences for the final 9 items

Item                      ZWB     ZWH     ZBH
SUICIDE-DANGER SELF       0.09   -0.82   -0.86
VIOLENCE-DANGER OTHERS    1.41   -0.18   -1.41
COGNITIVE                -0.09   -1.39   -1.25
SECURITY MANAGEMENT       1.19   -1.02   -1.93
SOCIALIZATION ISSUES      0.00    0.30    0.28
ANXIETY                  -0.11    1.40    1.41
INTERPERSONAL             0.22   -0.60   -0.75
EMOTIONAL WITHDRAWAL     -1.30    1.60    2.63
DEPRESSION               -1.52    1.30    2.54
WHITE/CAUCASIAN KIDS, ALL 18 ITEMS
INPUT: 134 persons, 18 items. ANALYZED: 132 persons, 18 items, 9 categories (v2.93)
[Persons map of items: a vertical logit scale comparing the distribution of person measures with item difficulty locations. Medical-Physical, Self-Care, and Thought Process appear at one extreme and Family Issues at the other, with most persons located between 0 and -2 logits.]
Conclusions The CCAR appears to have comparable reliability across ethnic groups based on the test we used (although a more appropriate test of reliability differences exists) The results also indicate an equal factor structure for depression However, at the item level there is evidence to suggest that some items are operating differently across ethnic groups, based both on different loadings and on different difficulty indices
Future directions Redo reliability comparisons using a more appropriate test Increase N to see if the model holds Replicate the study using a different primary diagnosis Study potential effects of secondary diagnosis Compare findings for depressed children with depressed adults Examine potential gender differences