Instrument equivalence across ethnic groups. Antonio Olmos (MHCD) Susan R. Hutchinson (UNC)


Overview: Instrument Equivalence; Measurement Invariance; Invariance in Reliability Scores; Factorial Invariance; Item Response Theory; The Study (Sample, Instrument, Methods, Results); Future Directions

Importance of Cross-cultural Measurement Equivalence: Standardized psychological measurement instruments must provide equivalent measurement across subpopulations if comparative statements are to have substantive import. Without equivalent measurement, observed scores are not directly comparable (Drasgow & Kanfer, 1985).

How is cross-cultural equivalence manifested? Test scores from a given measure should be equally accurate for different groups; this means that reliability coefficients should be the same for all groups. The factor structure of a given instrument should also be the same for all relevant groups; e.g., if depression has two subdomains, it should have two subdomains for all cultural subgroups.

Equivalence (cont'd): Each item on a particular instrument should mean the same thing to people from different cultural groups; i.e., a psychological test that lacks item equivalence is in essence two different tests, one for each cultural group.

When do we achieve instrument equivalence? Meaningful comparisons across different cultures cannot be guaranteed unless we have construct congruence (the factor structure of the instruments is the same across different cultures) and scalar equivalence (the numbers assigned to the items have the same meaning in the different cultures).

What if we are not using questionnaires? Historically, the etiology, expression, course, and outcome of psychopathology were believed to be universal and independent of cultural factors. Now it is assumed that culture can play a role in psychopathology by determining standards of normality, and by creating personality configurations that may look pathological in one culture but not in another.

Even when using behavioral scores, we can face cross-cultural biases. Normal behaviors in one culture can be classified as pathological in another: dependency in Japan is valued, whereas in America it has negative connotations. Culturally different individuals are also not adequately represented in the norm groups, and classifying individuals of different cultures as pathological based on those norms may lead to tragedy.

Instrument Equivalence in Mental Health: Behavior rating scales are often used where rater and ratee come from different cultures. If cross-cultural differences result in biased ratings, then the scales are not diagnostically valid for those groups. This may suggest the need to generate norms for different cultures and to develop models to study the factor structure of the measure across different cultures.

Measurement Invariance: whether or not, under different conditions of observing a phenomenon, measurement operations yield measures of the same attribute. If there is evidence of variability, findings of differences between individuals and groups cannot be unambiguously interpreted: differences in means can just as easily be interpreted as indicating that different things were measured, and associations between variables will reflect different attributes for different groups.

Reliability Estimates reflect consistency of responses. For clinical instruments, internal consistency reliability, as measured by Cronbach's α, reflects the consistency of ratings provided by clinicians for a given client, assuming all items on the measure are measuring the same trait.
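As a concrete illustration, Cronbach's α can be computed directly from item scores. A minimal sketch in pure Python; the ratings below are invented for illustration and are not CCAR data:

```python
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).
# Hypothetical data: 4 items rated for 5 clients (illustrative only).

def cronbach_alpha(items):
    """items: one list of scores per item, all of equal length."""
    k = len(items)
    n = len(items[0])

    def pvar(xs):  # population variance (consistent use cancels out in the ratio)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var_sum = sum(pvar(col) for col in items)
    totals = [sum(col[i] for col in items) for i in range(n)]
    return (k / (k - 1)) * (1 - item_var_sum / pvar(totals))

ratings = [
    [2, 3, 4, 4, 5],
    [1, 3, 4, 5, 5],
    [2, 2, 4, 4, 4],
    [1, 3, 3, 5, 4],
]
alpha = cronbach_alpha(ratings)  # high alpha: items rank clients consistently
```

High values (here about 0.95) indicate that the items order clients consistently, which is the sense of "consistency of ratings" used on this slide.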

Invariance in reliability scores: Technically, a reliability coefficient behaves like a squared correlation (r²), so we can treat reliabilities as correlations, apply Fisher's r-to-z transformation, and use a z-score to test differences between groups:

Zr1_white = (1/2) [ln(1 + corr1_wh) - ln(1 - corr1_wh)]
se_wh_aa = sqrt(1/(n_wh - 3) + 1/(n_aa - 3))
Z_test_wh_aa_1 = (Zr1_white - Zr1_aa) / se_wh_aa
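A minimal sketch of this z-test in Python, applied to the Factor 1 reliabilities reported later in the results (White/Caucasian 0.8157 with n = 134, African American 0.8105 with n = 99):

```python
import math

def fisher_z(r):
    """Fisher's r-to-z transformation."""
    return 0.5 * (math.log(1 + r) - math.log(1 - r))

def reliability_z(r1, n1, r2, n2):
    """z-test for the difference between two correlation-like coefficients."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# Factor 1, White/Caucasian vs African American (values from the results tables)
z = reliability_z(0.8157, 134, 0.8105, 99)
# close to the 0.115 reported in the results; |z| < 1.96, so no significant difference
```

The same call reproduces the other cells of the z-test table when given the corresponding α and n values.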

Factorial Invariance: Another way to test measurement invariance is factorial invariance. The main question it addresses: do the items making up a particular measuring instrument operate equivalently across different populations (e.g., Whites and Hispanics)? In other words, is the measurement model group-invariant? The tests for factorial invariance, in order of difficulty, follow.

Steps in factor invariance testing (1): 1. Equivalent factor structure: same number of factors, with items associated with the same factors (structural model invariance). 2. Equivalent factor loading paths: factor loadings are identical for every item and every factor.

Steps in factor invariance testing (2): 3. Equivalent factor variances/covariances: variances and covariances (correlations) among factors are the same across populations. 4. Equivalent item reliabilities: residuals for every item are the same across populations.

Testing item equivalence: Although there are a number of possible methods for testing item equivalence across groups, in our study we used Rasch modeling to assess differences in item parameters. Rasch modeling is one of many types of Item Response Theory models.

What is Item Response Theory? Item Response Theory (IRT) is a method for providing information about the items on a particular instrument as well as scores for the persons responding to those items. Historically, IRT has been used primarily by testing companies on aptitude and achievement tests, to determine how difficult the items are and how well they discriminate between people of high versus low ability.

IRT (cont'd): More recently, IRT has been applied to rating scales, with much of the rating-scale analysis conducted using the Rasch model. The Rasch model has a number of favorable characteristics, including requiring fewer subjects.

What does Rasch analysis tell us about rating scales? It provides a difficulty index for each item that indicates how relatively easy or difficult it is for a given individual to receive a high rating. On a clinical scale measuring some type of dysfunction, for example, an easy item suggests that even clients with low levels of dysfunction have a high probability of receiving a high rating, whereas a difficult item requires much higher levels of dysfunction in order to receive a high rating.
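A sketch of what a difficulty index implies under the dichotomous Rasch model: the probability of endorsement depends only on the gap between the person level (theta) and the item difficulty (b). The difficulty values and client level below are hypothetical, chosen only to illustrate the easy-versus-hard contrast described above:

```python
import math

def rasch_prob(theta, b):
    """P(high rating) under the dichotomous Rasch model."""
    return 1 / (1 + math.exp(-(theta - b)))

easy, hard = -1.0, 2.0   # hypothetical difficulty indices (logits)
client = 0.0             # client with a moderate level of dysfunction

p_easy = rasch_prob(client, easy)  # high probability even at moderate dysfunction
p_hard = rasch_prob(client, hard)  # low probability unless dysfunction is severe
```

When theta equals b the probability is exactly 0.5, which is what it means for an item's difficulty to "match" a person's level on the person-item map discussed below.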

Rasch (cont'd): Rasch analysis also provides ability scores for the people responding to the items; high scores indicate higher levels of the trait being measured, so on a clinical scale measuring dysfunction, high scores indicate greater dysfunction. Rasch analysis also provides a map comparing the item difficulties with the person abilities for a given set of items and persons. This information is useful in determining how appropriate the test is for that group of individuals: a good instrument has items that match the group's level of dysfunction.

How is Rasch analysis used in assessing item equivalence? The difficulty indices for different groups can be compared to determine whether ratings are being applied in the same way across groups. The approach we selected was to compute standardized differences (a z-test) between item estimates, with values beyond ±1.96 suggesting a substantial difference. A large difference suggests that rating values on that item tend to be higher for clients in one cultural group than in the other, despite their having equal levels of dysfunction; such a difference suggests potential bias in ratings.

The Study: Instrument equivalence in a sample of children with mental illness, evaluated using the Colorado Client Assessment Record (CCAR), across three ethnic groups: White/Caucasian, African American, and Hispanic.

The sample: 317 children enrolled at the Mental Health Corporation of Denver between 2000 and 2001 (134 White/Caucasian, 99 African American, 84 Hispanic). Ages between 4 and 15 years old (M = 13.7, SD = 2.9). 96.7% had English as their primary language; 3.3% had Spanish. Primary diagnosis: major depression.

The Instrument: Colorado Client Assessment Record, Problem Severity scales only. Items: Thought Process, Manic Issues, Suicide-Danger Self, Cognitive Problems, Anxiety, Attention Problems, Depressive Issues, Emotional Withdrawal, Family Issues & Problems, Role, Resistiveness, Legal, Aggressive-Danger Others, Socialization Issues, Security-Management Issues, Interpersonal, Medical-Physical, Self Care-Basic Needs.

Method: Exploratory factor analysis (EFA) on all 317 children, using PC and PFA (principal components and principal factor analysis) with promax rotation to allow correlation among factors. Reliability analysis to explore fit, by ethnicity. Factorial invariance using confirmatory factor analysis based on the EFA (entire sample and by ethnicity). Item Response Theory (Rasch) model, by ethnicity.

Exploratory Factor Analysis (salient pattern loadings; four factors each for the full sample, White/Caucasian, and Hispanic)
Thought Process: 0.883, 0.924, 0.573
Manic Issues: 0.751, 0.571, 0.772
Suicide-Danger Self: 0.728, 0.760, 0.449, -0.469
Cognitive Problems: 0.602, 0.697, 0.403
Anxiety: 0.517, 0.426, 0.667, 0.759
Attention Problems: 0.421, 0.548, 0.740
Depressive Issues: 0.831, 0.553, 0.936
Emotional Withdrawal: 0.824, 0.725, 0.698
Family Issues & Problems: 0.646, 0.609, 0.459, -0.575
Role: 0.571, 0.711, 0.438
Resistiveness: 0.447, 0.795
Legal: 0.865, 0.926, -0.444, 0.745
Aggressive-Danger Others: 0.623, 0.505, 0.683
Socialization Issues: 0.566, 0.717, 0.792
Security-Management Issues: 0.551, 0.492, 0.939
Interpersonal: 0.455, 0.582
Medical-Physical: 0.878, 0.863, -0.636
Self Care-Basic Needs: 0.569, 0.531, 0.516

Confirmatory Factor Analysis: After requesting advice from clinicians, we reduced the original model from 18 items to 9. The items left out did not contribute to the diagnosis, especially in children, and the fit of the 18-item CFA kept blowing up. More on this when we talk about Item Response Theory.

Model used for Factor Invariance (two factors)
Factor 1: Socialization, Violence-Danger Others, Interpersonal, Cognitive, Security-Management
Factor 2: Emotional Withdrawal, Anxiety, Depression, Suicide

Results: Reliability Analysis

Reliability     White/Caucasian   African American   Hispanic
Factor 1        0.8157            0.8105             0.7530
Factor 2        0.8028            0.7723             0.7488

Z-test of reliabilities   White-African American   White-Hispanic   African American-Hispanic
Factor 1                  0.115                    1.169            0.993
Factor 2                  0.603                    0.971            0.373

Results: Factorial Invariance

Model                                                chi2     df    delta-chi2   delta-df   RMSEA   GFI    CFI
1) Number of factors invariant                       143.09   75    --           --         0.090   0.90   0.93
2) Model 1 with pattern of factor loadings
   held invariant (Lambda-X invariant)               175.42   89    32.33 #      14         0.094   0.86   0.92
3) Model 2 with factor variances and covariances
   held invariant (Phi invariant)                    187.95   95    12.53        6          0.092   0.85   0.91
4) Model 3 with item reliabilities held invariant
   (Theta-Delta invariant)                           220.35   115   32.4 *       20         0.091   0.80   0.90

* p < 0.05   # p < 0.01

χ² (chi-square): in this context, it tests the closeness of fit between the unrestricted sample covariance matrix and the restricted (model-implied) covariance matrix. It is very sensitive to sample size: the statistic will be significant when the sample is large, even if the model fits approximately in the population.
RMSEA (Root Mean Square Error of Approximation): analyzes the discrepancy between the observed and implied covariance matrices. Its lower bound of zero indicates perfect fit, with values increasing as fit deteriorates. Values below 0.1 are suggested to indicate good fit, and values below 0.05 very good fit; it is recommended not to use models with RMSEA values larger than 0.1.
GFI (Goodness of Fit Index): analogous to R² in that it indicates the proportion of variance explained by the model. It ranges between 0 and 1, with values exceeding 0.9 indicating good fit.
CFI (Comparative Fit Index): indicates the proportional improvement in overall fit relative to a null (independence) model. It is relatively independent of sample size and penalizes model complexity. It uses a 0-1 norm, with 1 indicating perfect fit; values of about 0.9 or higher reflect good fit.

Z-test comparison for the tables: the formula used to obtain the z scores is

z = (d1 - d2) / sqrt(s1^2 + s2^2)

where d is an item's difficulty estimate and s its standard error in each group. Note that ZWB = difference between White and Black estimates, ZWH = difference between White and Hispanic estimates, and ZBH = difference between Black and Hispanic estimates.
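A minimal sketch of this computation with the ±1.96 cutoff described earlier; the difficulty estimates and standard errors below are hypothetical, not CCAR values (the tables that follow report only the resulting z scores, not the underlying estimates):

```python
import math

def std_diff(d1, s1, d2, s2):
    """Standardized difference between two item difficulty estimates."""
    return (d1 - d2) / math.sqrt(s1 ** 2 + s2 ** 2)

# Hypothetical estimates: difficulty 1.0 (SE 0.3) in one group,
# 0.0 (SE 0.4) in the other.
z = std_diff(1.0, 0.3, 0.0, 0.4)
flagged = abs(z) > 1.96  # cutoff used in the study
```

Here z comes out to 2.0, so this hypothetical item would be flagged as functioning differently across the two groups.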

Standardized differences between difficulty estimates for each pair of ethnic groups across all 18 items

Item                      ZWB     ZWH     ZBH
MEDICAL-PHYSICAL          0.33   -2.29   -2.46
SELF-CARE BASIC NEEDS     0.33   -0.28   -0.53
THOUGHT                   0.94   -0.61   -1.32
LEGAL                    -1.80   -0.74    0.83
MANIC ISSUES              1.11   -1.31   -2.21
SUICIDE-DANGER SELF       0.11   -0.55   -0.61
VIOLENCE-DANGER OTHERS    1.31    0.00   -1.14
COGNITIVE                -0.11   -1.11   -0.96
SECURITY MANAGEMENT       1.06   -0.50   -1.40
SOCIALIZATION ISSUES     -0.26    0.81    0.98
ANXIETY                  -0.26    1.86    1.95
ROLE                      0.64    0.93    0.33
INTERPERSONAL            -0.13    0.00    0.11
EMOTIONAL WITHDRAWAL     -1.41    2.21    3.25
DEPRESSION               -1.79    1.98    3.36
FAMILY ISSUES            -2.69    2.32    4.45
RESISTIVENESS             0.35    1.74    1.41
ATTENTION PROBLEMS        1.15   -1.41   -0.46

Standardized mean differences for the final 9 items

Item                      ZWB     ZWH     ZBH
SUICIDE-DANGER SELF       0.09   -0.82   -0.86
VIOLENCE-DANGER OTHERS    1.41   -0.18   -1.41
COGNITIVE                -0.09   -1.39   -1.25
SECURITY MANAGEMENT       1.19   -1.02   -1.93
SOCIALIZATION ISSUES      0.00    0.30    0.28
ANXIETY                  -0.11    1.40    1.41
INTERPERSONAL             0.22   -0.60   -0.75
EMOTIONAL WITHDRAWAL     -1.30    1.60    2.63
DEPRESSION               -1.52    1.30    2.54
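Applying the ±1.96 cutoff to the 9-item standardized differences above (values transcribed from the table) isolates the flagged comparisons:

```python
# z scores (ZWB, ZWH, ZBH) for the final 9 items, as reported above.
z9 = {
    "SUICIDE-DANGER SELF":    (0.09, -0.82, -0.86),
    "VIOLENCE-DANGER OTHERS": (1.41, -0.18, -1.41),
    "COGNITIVE":              (-0.09, -1.39, -1.25),
    "SECURITY MANAGEMENT":    (1.19, -1.02, -1.93),
    "SOCIALIZATION ISSUES":   (0.00, 0.30, 0.28),
    "ANXIETY":                (-0.11, 1.40, 1.41),
    "INTERPERSONAL":          (0.22, -0.60, -0.75),
    "EMOTIONAL WITHDRAWAL":   (-1.30, 1.60, 2.63),
    "DEPRESSION":             (-1.52, 1.30, 2.54),
}

flagged = sorted(item for item, zs in z9.items()
                 if any(abs(z) > 1.96 for z in zs))
# Only emotional withdrawal and depression exceed the cutoff,
# both on the Black-Hispanic (ZBH) contrast
```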

WHITE/CAUCASIAN KIDS, ALL 18 ITEMS. INPUT: 134 persons, 18 items. ANALYZED: 132 persons, 18 items, 9 categories (v2.93). [Person-item map omitted: from the rare/difficult end down to the frequent/easy end, the items order as Medical-Physical, Self-Care, Thought Process; Manic Issues, Legal; Suicide, Violence-Danger Others, Cognitive; Security-Management; Attention Problems, Resistiveness; Socialization; Anxiety, Role; Emotional Withdrawal, Interpersonal; Depression; Family Issues, with the person distribution centered below most of the item difficulties.]

Conclusions: The CCAR appears to have comparable reliability across ethnic groups based on the test we used (although there is a more optimal test of reliability differences). The results also indicate an equal factor structure for depression. However, at the item level there is evidence to suggest that some items operate differently across ethnic groups, based on different loadings and different difficulty indices.

Future directions: Redo the reliability comparisons using the more optimal test. Increase N to see if the model holds. Replicate the study using a different primary diagnosis. Study potential effects of secondary diagnoses. Compare findings for depressed children with depressed adults. Examine potential gender differences.