The MHSIP: A Tale of Three Centers


The MHSIP: A Tale of Three Centers P. Antonio Olmos-Gallo, Ph.D. Kathryn DeRoche, M.A. Mental Health Center of Denver Richard Swanson, Ph.D., J.D. Aurora Research Institute John Mahalik, Ph.D., M.P.A. Jefferson Center for Mental Health Presented at the Organization for Program Evaluation in Colorado Annual Meeting, May 15, 2008 1

Presentation Overview Accountability in mental health Description and intended use of the MHSIP Review of constructs of measurement Purpose and Methods Results of the psychometric investigation Reliabilities Measurement invariance Differential item functioning Discussion of results Future directions for accountability in mental health 2

Accountability in Mental Health 3

Accountability in Mental Health Accountability is more and more prevalent in the MH field. This is a good thing if it helps centers improve their services. But for accountability to be used to improve quality, the approach needs to meet two criteria: 1. The feedback provided to the centers must be useful, and 2. The yardstick must be the same for all centers. 4

How does accountability work in MH? Accountability has shifted from formative- to more summative-oriented. Grant funding (federal, private) requires that outcomes be demonstrated (NOMS, GPRA). State-based requirements (CCAR, MHSIP, YSSF). Stakeholders are more in tune with accountability.

Description and Intended Uses of the MHSIP What is the MHSIP? What is it used for? 6

What is the MHSIP? Mental Health Statistics Improvement Program (MHSIP) consumer survey. Annual administration to a stratified random sample of consumers. Colorado version: 28 items designed to measure 5 constructs. Annual administration of the MHSIP is a result of the President's New Freedom Commission on Mental Health (2003), which focused on consumer-informed performance measurement. 7

What is the MHSIP Used For? Designed to assess the performance of mental health services within and across individual centers. Mental health centers are compared according to their MHSIP scores, which assess each center's performance for the current year. MHSIP results are reported and can be viewed by all mental health centers and their stakeholders. 8

Domains of the MHSIP Overall Performance of Mental Health Centers: Access (4 items), Quality/Appropriateness (8 items), Participation in Service/Treatment Planning (2 items), Consumer Perception of Outcomes (7 items), General Satisfaction (3 items). Note: Participation is not considered a domain for many centers at the national or state level, but it is used in Colorado; therefore, it was included in the analysis. 9

How can Mental Health Centers Use the MHSIP Results? Excerpt from the MHSIP Consumer Survey Technical Report 2007 (DMH): This information [MHSIP numerical scores] can be used to inform future change within individual centers and can provide a catalyst for more in-depth study of particular domains at the center level. (page 1) In 2006, due to low scores, we conducted a study to see how we could improve Participation in services. The results did not match what the MHSIP found, nor did they provide any explanation for why our scores were low. 10

So we pondered, how can this happen? We wondered how the MHSIP is measuring centers: is it possible that measurement artifacts are influencing the quantitative results? The remainder of the presentation describes our investigation into the psychometrics of the MHSIP for its intended use. 11

Measurement Constructs 12

Psychometric Properties The focus of the analysis is based around two questions: 1) What do scores from the MHSIP mean? and 2) How should we interpret the scores produced by the MHSIP to assist us in quality improvement? To investigate the meaning of a numeric score, we need to review the psychometric properties of the survey: Reliability: Does the survey produce the same score for consumers with the same true opinions about their mental health center (as defined by the MHSIP)? Validity: Does the survey measure the performance of a mental health center as it is intended to, or does it measure some other trait as well as performance? The underlying premise of psychometrics is to examine how well a numeric score from the MHSIP survey captures consumers' true opinions about their satisfaction. 13

Reliability of the MHSIP According to Classical Test Theory, the score produced by the MHSIP contains two critical components: the true score (the consumer's true opinion of the center's performance) and error (anything besides the true score). True score of consumer satisfaction + Error (anything besides the true score) = Numeric score produced by the MHSIP. Reliability is the ratio of true-score variance to the total observed variance (true score plus error). 14
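In CTT notation (added here for reference, not on the original slide), the decomposition and the reliability ratio can be written as:

```latex
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```

where X is the observed MHSIP score, T the true score, E the error, and the sigma-squared terms are the variances of each component.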

What are we comparing? To be able to compare centers regarding their MHSIP scores, all centers should have similar reliabilities (or a similar percentage of error). This means that each center in the analysis should have similar ratios of true score vs. error in their MHSIP scores (i.e., invariance across centers). Mental Health Center A: True score (80%) + Error (20%) = Numeric score; Reliability = .80. Mental Health Center B: True score (50%) + Error (50%) = Numeric score; Reliability = .50. In this example, the numeric scores from Center A cannot be compared to Center B, because the two scores have different meanings. 15

Rasch Modeling Perspective In terms of Rasch modeling (a type of IRT model), the numeric score also reflects item difficulty, or how hard or easy it is to agree with an item. For example, Q16 was one of the more difficult-to-endorse items. Q16: Staff respected my wishes about who is, and is not, to be given information about my treatment. True score of consumer satisfaction + Error (anything besides the true score) + Item difficulty (how hard or easy it is to agree with the item) = Numeric score produced by the MHSIP. 16
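For reference (not part of the original slide), the dichotomous Rasch model writes the probability that consumer n endorses item i as a function of the person's location theta_n and the item's difficulty b_i; the MHSIP's Likert-type items would in practice use a polytomous extension (e.g., the rating scale model), but this simpler form shows the role of item difficulty:

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

A harder-to-endorse item such as Q16 has a larger b_i, so the same consumer is less likely to agree with it.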

Purpose and Methods Participants, Procedures, and Data Analysis 17

Purpose of the Investigation Are the MHSIP numeric scores measuring the same construct across centers? We address this through three different statistical measurement techniques: 1. Do the reliabilities among sub-domains of the MHSIP vary across mental health centers? (CTT) 2. Are the structure, and error, of the MHSIP the same across centers (invariance testing via structural equation modeling)? (CTT) [Diagram: True score + Error = Numeric score; Reliability] 3. Are the characteristics of the items similar across mental health centers? Do consumers interpret the items in the same manner? (Rasch: Differential Item Functioning, DIF) [Diagram: True score of consumer satisfaction + Error (anything besides the true score) + Item difficulty (how hard or easy it is to agree with the item) = Numeric score produced by the MHSIP] 18

Participants MHSIP surveys collected during State Fiscal Year 2006 (July 1, 2005 - June 30, 2006). Three mental health centers from the State of Colorado (in alphabetical order: Aurora, Jefferson, MHCD)*. Center 1: n = 137; Center 2: n = 101; Center 3: n = 148. * For this presentation, centers will not be identified 19

Procedures The MHSIP is administered annually to a stratified (Medicaid/non-Medicaid) random sample. Consumers were sampled from an unduplicated file of the FY 2005-2006 Colorado Client Assessment Record (CCAR), narrowed to those who had a recorded encounter with the mental health system in the latter half of FY 2005-2006. 20

Psychometric Examination of the MHSIP Reliability, Measurement Invariance, and Differential Item Functioning 21

Comparing Subscales Initially, we analyzed the reliability estimates of the five MHSIP subscales within the three centers during 2007. Reliability estimates range from 0 to 1, with 1 representing no error (perfect measurement of a center's performance) and 0 representing all error. Estimates of 0.70 or higher are considered acceptable. 22
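The reliability estimates summarized on the next slide are coefficient (Cronbach's) alpha values. A minimal Python sketch of how alpha and alpha-if-item-deleted can be computed from respondent-by-item data is shown below; the data file is hypothetical, and the Access items Q4-Q7 follow the item numbering used later in the DIF plots:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Coefficient alpha for a respondents-by-items DataFrame (listwise complete cases)."""
    items = items.dropna()
    k = items.shape[1]                          # number of items in the subscale
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the subscale total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

df = pd.read_csv("mhsip_center1.csv")           # hypothetical file of one center's item responses
access = df[["Q4", "Q5", "Q6", "Q7"]]           # the four Access items

print("alpha:", round(cronbach_alpha(access), 2))

# Alpha-if-item-deleted: recompute alpha with each item removed in turn
for col in access.columns:
    print(col, "deleted:", round(cronbach_alpha(access.drop(columns=col)), 2))
```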

Reliability Estimates in 2007 among Subscales and Centers

            Access            Quality           Outcomes          Satisfaction   Participation
Center 1    0.77              0.84              0.85 (Q26: 0.89)  0.84           0.47
Center 2    0.65              0.80              0.85 (Q26: 0.88)  0.79           0.50
Center 3    0.73 (Q4: 0.75)   0.80 (Q18: 0.84)  0.75 (Q26: 0.85)  0.88           0.28

*Note: A Q# in parentheses marks an alpha-if-item-deleted value higher than the scale's actual reliability, meaning that deleting that question would increase the reliability of the scale. 23

Reliability Summary All subscales produced acceptable reliability, except for Participation, which contains only 2 items (reliability increases as the number of items in a scale increases). We cannot infer meaning from the scores for the Participation domain. In the Outcomes domain, reliability estimates would have increased in all centers with the removal of Q26 ("I do better in school and/or work"). All other items deal with concepts associated with mental health treatment (i.e., decreasing symptom interference, relationships, control of one's life). Notice that many consumers have good outcomes without participating in school or work (resiliency factor). 24

Invariance Testing Across Centers 25

Confirmatory Factor Analysis A model with all five domains could not be fit: some of the parameters could not be estimated (the variance-covariance matrix may not be identified). Exploratory analyses using only Outcomes and Participation showed that Outcomes was the major culprit.

Outcomes/Participation: χ²(26) = 160.32, RMSEA = 0.13, GFI = 0.79, CFI = 0.85
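As an illustration (not the presenters' actual code) of how a two-factor Outcomes/Participation CFA could be specified and its fit indices obtained in Python, here is a minimal sketch using the semopy package; the item-to-domain assignments, column names, and data file are assumptions:

```python
import pandas as pd
from semopy import Model, calc_stats

# Hypothetical item assignments: 7 Outcomes items and 2 Participation items
model_desc = """
Outcomes      =~ Q21 + Q22 + Q23 + Q24 + Q25 + Q26 + Q27
Participation =~ Q11 + Q12
"""

data = pd.read_csv("mhsip_center1.csv")   # hypothetical file of item responses

model = Model(model_desc)
model.fit(data)

# calc_stats returns a table of fit measures (chi-square, df, RMSEA, CFI, and related indices)
print(calc_stats(model))
```

The same kind of specification, extended to Satisfaction, Access, and Quality and fit separately per center, corresponds to the single-center models reported on the next slides.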

Invariance with 3 domains We tested invariance on three domains only: Satisfaction, Access, and Quality. We ran separate models for every center to get an up-front idea of their similarities/differences. Trouble can be expected based on the fit: Center 2 had the worst fit, Center 3 had a not-so-bad fit, and Center 1 was in between the other two centers.

Center 1 (standardized scores, n = 137): χ²(87) = 236.97, RMSEA = 0.11, GFI = 0.82, CFI = 0.87

Center 2 (standardized scores, n = 101): χ²(87) = 298.47, RMSEA = 0.15, GFI = 0.73, CFI = 0.68

Center 3 (standardized scores, n = 148): χ²(87) = 199.92, RMSEA = 0.09, GFI = 0.85, CFI = 0.90

Measurement Invariance Whether or not we can assert that we measured the same attribute under different conditions. If there is evidence of variability, any findings reporting differences between individuals and groups cannot be interpreted: differences in average scores can just as easily be interpreted as indicating that different things were measured, and correlations between variables will refer to different attributes for different groups.

Factorial Invariance One way to test measurement invariance is FACTORIAL INVARIANCE. The main question it addresses: Do the items making up a particular measuring instrument work the same way across different populations (e.g., males and females)? That is, is the measurement model group-invariant? Tests for factorial invariance (in order of difficulty):

Steps in Factorial Invariance Testing 1. Equivalent factor structure: same number of factors, with items associated with the same factors (structural model invariance). 2. Equivalent factor loading paths: factor loadings are identical for every item and every factor.

Steps in Factorial Invariance Testing (cont.) 3. Equivalent factor variances/covariances: variances and covariances (correlations) among factors are the same across populations. 4. Equivalent item reliabilities: residuals for every item are the same across populations.

Results: Factorial Invariance

Model                                                                χ²        df    Δχ²        Δdf   RMSEA   GFI    CFI
1) Number of factors invariant                                       734.67    261   --         --    0.11    0.93   0.85
2) Model (1) with pattern of factor loadings held invariant
   (Lambda-X invariant)                                               786.05    285   51.38*     24    0.11    0.85   0.93
3) Model (2) with factor variances and covariances held invariant
   (PHI invariant)                                                    1765.42   291   1030.75#   30    0.30    0.84   0.91
4) Model (3) with invariance of item reliabilities
   (Theta-Delta invariant)                                            Not run

* p < 0.05   # p < 0.01
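The Δχ² column reports chi-square difference tests. Judging from the numbers in the table, each constrained model is being compared against Model 1 (the configural, number-of-factors-invariant model):

```latex
\Delta\chi^2 = \chi^2_{\text{constrained}} - \chi^2_{\text{model 1}}, \qquad
\Delta df = df_{\text{constrained}} - df_{\text{model 1}}
```

For example, Model 2 vs. Model 1: Δχ² = 786.05 − 734.67 = 51.38 with Δdf = 285 − 261 = 24 (significant at p < 0.05); Model 3 vs. Model 1: Δχ² = 1765.42 − 734.67 = 1030.75 with Δdf = 30. A significant Δχ² means the added equality constraints worsen fit, i.e., evidence against that level of invariance.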

Conclusions: Factorial Invariance The model does not provide a good fit for the different centers. Most of the discrepancy centers on the loadings and on how the domains interact with each other (variances-covariances). Since the tests are incremental (later tests are more stringent than earlier ones), we did not run the equivalent-item-reliabilities model (the most stringent test).

Differential Item Functioning (DIF) 38

Differential Item Functioning Rasch analysis separates the item characteristics from participants' scores. It assumes that some items can be more difficult to agree with than others. DIF examines and tests (statistically) whether the item characteristics (difficulty scores) vary across centers. Since difficulty is an item characteristic, if difficulty scores vary among mental health centers, then the items measure the centers differently (as opposed to there being a true difference in their scores). 39
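The presenters used Rasch-based DIF. As a rough sketch of the same idea with widely available tools (a substitute technique, not the analysis reported here), a logistic-regression DIF screen regresses each item, dichotomized for illustration, on the subscale total score and the center, and flags items where center adds predictive power beyond the total. Column names, the pooled data file, and the agreement cutoff are assumptions:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mhsip_all_centers.csv")           # hypothetical pooled file with a 'center' column

access_items = ["Q4", "Q5", "Q6", "Q7"]
df["access_total"] = df[access_items].sum(axis=1)   # anchor/matching score for the Access subscale

for item in access_items:
    # Dichotomize the Likert response; assumes higher values = stronger agreement
    # (flip the cutoff if the survey codes 1 = Strongly Agree)
    df["endorse"] = (df[item] >= 4).astype(int)

    # Uniform DIF check: does 'center' predict endorsement after controlling for the total score?
    base = smf.logit("endorse ~ access_total", data=df).fit(disp=False)
    dif = smf.logit("endorse ~ access_total + C(center)", data=df).fit(disp=False)

    # Likelihood-ratio statistic comparing the two nested models (larger = more evidence of DIF)
    lr = 2 * (dif.llf - base.llf)
    print(f"{item}: LR = {lr:.2f}")
```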

[Access DIF plot: item difficulty scores for Q4-Q7, plotted separately for the three centers] 40


[General Satisfaction DIF plot: item difficulty scores for Q1-Q3, plotted separately for the three centers] 44

Summary of DIF Analysis The Quality/Appropriateness, Participation, and General Satisfaction subscales measure equally across mental health centers. In the Access and Outcomes subscales, 6 questions produced significant DIF, meaning that the characteristics of the measurement change across centers:
Q4 - The location of services was convenient
Q6 - Staff returned my phone calls within 24 hours
Q22 - I am better able to control my life
Q23 - I am better able to deal with crisis
Q24 - I am getting along better with my family
Q26 - I do better in school and/or work
For these 6 questions, variations across centers may be due to differences in measurement, as opposed to true differences in consumer satisfaction. 45

Discussion 46

What did we learn about the MHSIP? Some items and subscales (domains) do not seem to measure equally across centers. Therefore, comparing centers using these items/domains may not reflect true differences in performance; it is more likely that they reflect differences in measurement (including error, difficulty, and reliability). 47

Some domains are reliable, some are not Satisfaction was OK from all 3 perspectives. Quality had some good characteristics, but some items were bad. Participation is not very reliable (only two items, but the items were good). Outcomes is, overall, a bad domain (bad items, lots of cross-loading, correlated errors); employment/education may not be a desired outcome for all consumers.

Discussion Despite the fact that the samples may not be ideal (biases, sampling frameworks that can be improved), the data at hand suggest that there are some intrinsic problems with the MHSIP. But the analyses also suggest some very specific ways to improve it. 49

Suggestions Revise the Outcomes scale (differentiate between recovery/resiliency). Add items to the Participation scale. Some items in Access need to be reviewed (Q4 and Q6). How do we deal with all these cross-loading factors? Is it one domain (satisfaction) that we artificially broke into many domains (outcomes, access, ...)? How does the factor structure for the entire sample (the EFA included in the annual report) hold up for individual centers? More research is needed in this area.

More suggestions Sampling suggestions: Attempt to stratify the sample by consumers' needs level; at MHCD, we have developed a measure of consumers' recovery needs level (RNL). Equating suggestions: Use some form of equating procedure to equate scores across centers. Using Item Response Theory techniques: IRT could help us learn more about how the MHSIP measures satisfaction/performance within and among mental health centers.

More suggestions Mixed-method design: Conducting focus groups at each center would provide cross-validation of the quantitative measurement and would also enhance the utilization of the results for quality improvement. Include in the annual reports the psychometrics (reliability) for every center; this helps us know how much confidence we should have in the scores.

Questions??? Contact Information: Antonio Olmos, Antonio.Olmos@MHCD.org Kate DeRoche, Kathryn.Deroche@MHCD.org Richard Swanson, Richard.Swanson@aumhc.org John Mahalik, JohnMa@jcmh.org 53

χ² (Chi-Square): In this context, it tests the closeness of fit between the unrestricted sample covariance matrix and the restricted (model-implied) covariance matrix. Very sensitive to sample size: with a large sample, the statistic will be significant even when the model fits approximately in the population.

RMSEA (Root Mean Square Error of Approximation): Analyzes the discrepancies between the observed and implied covariance matrices. Its lower bound of zero indicates perfect fit, with values increasing as fit deteriorates. Values below 0.1 are suggested to indicate a good fit to the data, and values below 0.05 a very good fit; it is recommended not to use models with RMSEA values larger than 0.1.

GFI (Goodness of Fit Index): Analogous to R² in that it indicates the proportion of variance explained by the model. Ranges between 0 and 1, with values exceeding 0.9 indicating a good fit to the data.

CFI (Comparative Fit Index): Indicates the proportion of improvement in overall fit compared to a null (independence) model. Relatively independent of sample size and penalizes for model complexity. Uses a 0-1 norm, with 1 indicating perfect fit; values of about 0.9 or higher reflect a good fit.
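For reference (not part of the original glossary), commonly used definitions of RMSEA and CFI are shown below; exact formulas vary slightly across software (e.g., N vs. N − 1 in the RMSEA denominator), where N is the sample size and the null model is the independence model:

```latex
\mathrm{RMSEA} = \sqrt{\max\!\left(\frac{\chi^2 - df}{df\,(N-1)},\ 0\right)},
\qquad
\mathrm{CFI} = 1 - \frac{\max\left(\chi^2_{\text{model}} - df_{\text{model}},\ 0\right)}{\max\left(\chi^2_{\text{null}} - df_{\text{null}},\ 0\right)}
```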