The MHSIP: A Tale of Three Centers P. Antonio Olmos-Gallo, Ph.D. Kathryn DeRoche, M.A. Mental Health Center of Denver Richard Swanson, Ph.D., J.D. Aurora Research Institute John Mahalik, Ph.D., M.P.A. Jefferson Center for Mental Health Presented at the Organization for Program Evaluation in Colorado Annual Meeting, May 15, 2008 1
Presentation Overview Accountability in mental health Description and intended use of the MHSIP Review of constructs of measurement Purpose and Methods Results of the psychometric investigation Reliabilities Measurement invariance Differential item functioning Discussion of results Future directions for accountability in mental health 2
Accountability in Mental Health 3
Accountability in Mental Health Accountability is more and more prevalent in the MH field. This is a good thing if it helps centers improve their services. But for accountability to be used to improve quality, the approach needs to meet two criteria: 1. The feedback provided to the centers must be useful, and 2. The yardstick must be the same for all centers. 4
How does accountability work in MH? Accountability has changed from formative- to more summative-oriented. Grant funding (federal, private) requires that outcomes be demonstrated (NOMS, GPRA). There are state-based requirements (CCAR, MHSIP, YSSF). Stakeholders are more in tune with accountability.
Description and Intended Uses of the MHSIP What is the MHSIP? What is it used for? 6
What is the MHSIP? Mental Health Statistics Improvement Program (MHSIP): a consumer survey administered annually to a stratified random sample of consumers. Colorado version: 28 items designed to measure 5 constructs. Annual administration of the MHSIP is a result of the President's New Freedom Commission on Mental Health (2003), which focused on consumer-informed performance measurement. 7
What is the MHSIP Used For? Designed to assess the performance of mental health services within and across individual centers. Mental health centers are compared according to their MHSIP scores, which assess each center's performance for the current year. MHSIP results are reported and can be viewed by all mental health centers and their stakeholders. 8
Domains of the MHSIP Overall Performance of Mental Health Centers: Access (4 items), Quality/Appropriateness (8 items), Participation in Service/Treatment Planning (2 items), Consumer Perception of Outcomes (7 items), General Satisfaction (3 items). Note: Participation is not considered a domain by many centers at the national or state level, but it is used in Colorado; therefore it was included in the analysis. 9
How can Mental Health Centers Use the MHSIP results? Excerpt from the MHSIP Consumer Survey Technical Report 2007 (DMH): "This information [MHSIP numerical scores] can be used to inform future change within individual centers and can provide a catalyst for more in-depth study of particular domains at the center level." (page 1) In 2006, due to low scores, we conducted a study to see how we could improve Participation in services. The results did not match what the MHSIP found, nor did they provide any explanation for why our scores were low. 10
So we pondered, how can this happen? We wondered how the MHSIP is measuring centers: is it possible that measurement artifacts are influencing the quantitative results? The remainder of the presentation describes our investigation into the psychometrics of the MHSIP for its intended use. 11
Measurement Constructs 12
Psychometric Properties The focus of the analysis is based around two questions: 1) What do scores from the MHSIP mean? and 2) How should we interpret the scores produced from the MHSIP to assist us in quality improvement? To investigate the meaning of a numeric score, we need to review the psychometric properties of the survey: Reliability - does the survey produce the same score for consumers with the same true opinions about their mental health center (as defined by the MHSIP)? Validity - does the survey measure the performance of a mental health center as it is intended to do, or does it measure some other trait as well as performance? The underlying premise of psychometrics is to examine how well a numeric score from the MHSIP survey captures consumers' true opinions about their satisfaction. 13
Reliability of the MHSIP According to Classical Test Theory, a reliability estimate involves two critical components: the true score, or true opinions of performance, and error, or anything besides the true score. The numeric score produced by the MHSIP combines the true score of consumer satisfaction with error (anything besides the true score); reliability is the proportion of the observed score variance that is true-score variance rather than error. 14
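In standard Classical Test Theory notation (a textbook formulation, not shown on the original slide), the observed score and reliability can be written as:

$$X = T + E, \qquad \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$$

So a reliability of .80 means 80% of the variance in observed MHSIP scores is attributable to consumers' true opinions and 20% to error.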
What are we comparing? To be able to compare centers on their MHSIP scores, all centers should have similar reliabilities (i.e., a similar percentage of error). This means that each center in the analysis should have a similar ratio of true score vs. error in its MHSIP scores (i.e., invariance across centers). Mental Health Center A: true score (80%) + error (20%) = numeric score, reliability = .80. Mental Health Center B: true score (50%) + error (50%) = numeric score, reliability = .50. In this example, the numeric scores from Center A cannot be compared to those from Center B, because the two scores have different meanings. 15
Rasch Modeling Perspective In terms of Rasch modeling (a type of IRT model), the numeric score also reflects item difficulty, or how hard or easy it is to agree with an item. For example, Q16 was one of the more difficult-to-endorse items: "Staff respected my wishes about who is, and is not, to be given information about my treatment." The numeric score produced by the MHSIP therefore combines the true score of consumer satisfaction, error (anything besides the true score), and item difficulty (how hard or easy it is to agree with the item). 16
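For reference, the dichotomous Rasch model (a standard formulation; the MHSIP's Likert-type responses would actually call for a rating-scale extension) expresses the probability that respondent n endorses item i as a function of the person measure and the item difficulty:

$$P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}$$

The harder an item is to endorse (the larger $b_i$), the lower the probability of agreement at any given level of satisfaction.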
Purpose and Methods Participants, Procedures, and Data Analysis 17
Purpose of the Investigation Are the MHSIP numeric scores measuring the same construct across centers? We address this question through three different statistical measurement techniques: 1. Do the reliabilities among sub-domains of the MHSIP vary across mental health centers? (CTT) 2. Are the structure and error of the MHSIP the same across centers (invariance testing via structural equation modeling)? (CTT) 3. Are the characteristics of the items similar across mental health centers? Do consumers interpret the items in the same manner? (Rasch differential item functioning, DIF) 18
Participants MHSIP surveys collected during the State Fiscal Year 2006 (July 1, 2005-June 30, 2006) Three mental health centers from the State of Colorado (in alphabetical order: Aurora, Jefferson, MHCD)* Center 1 n=137 Center 2 n= 101 Center 3 n= 148 * For this presentation, centers will not be identified 19
Procedures The MHSIP is administered annually to a stratified (Medicaid/non-Medicaid) random sample. Consumers were sampled from an unduplicated file of FY 2005-2006 Colorado Client Assessment Record (CCAR) records, narrowed to those who had a recorded encounter with the mental health system in the latter half of FY 2005-2006. 20
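A minimal sketch of what such a stratified draw could look like in pandas; the file name, column names, and sampling fraction are assumptions for illustration, not the DMH procedure.

```python
import pandas as pd

# Hypothetical unduplicated CCAR consumer file; column names are assumed.
ccar = pd.read_csv("ccar_fy2006_unduplicated.csv")

# Keep consumers with a recorded encounter in the latter half of FY 2005-2006
# (assumes ISO-formatted date strings in the file).
eligible = ccar[ccar["last_encounter_date"] >= "2006-01-01"]

# Draw the same fraction within each Medicaid stratum.
sample = (
    eligible.groupby("medicaid_status", group_keys=False)
    .apply(lambda g: g.sample(frac=0.10, random_state=42))
)
```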
Psychometric Examination of the MHSIP Reliability, Measurement Invariance, and Differential item Functioning 21
Comparing Subscales Initially, we analyzed the reliability estimates of the five MHSIP subscales within the three centers during 2007. Reliability estimates range from 0 to 1, with 1 representing no error (perfect measurement of a center's performance) and 0 representing all error. Estimates of 0.70 or higher are considered acceptable. 22
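The reliability estimates on the next slide are Cronbach's alpha values; below is a minimal sketch of how alpha and the alpha-if-item-deleted diagnostic could be computed (generic NumPy code, not the authors' analysis scripts).

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x n_items) response matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def alpha_if_item_deleted(items: np.ndarray) -> list[float]:
    """Alpha recomputed with each item removed in turn (flags items such as Q26)."""
    return [cronbach_alpha(np.delete(items, j, axis=1)) for j in range(items.shape[1])]
```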
Reliability Estimates in 2007 among Subscales and Centers
Subscale        Center 1            Center 2            Center 3
Access          0.77                0.65                0.73 (Q4: 0.75)
Quality         0.84                0.80                0.80 (Q18: 0.84)
Outcomes        0.85 (Q26: 0.89)    0.85 (Q26: 0.88)    0.75 (Q26: 0.85)
Satisfaction    0.84                0.79                0.88
Participation   0.47                0.50                0.28
Note: A Q# in parentheses indicates that the alpha-if-item-deleted value was higher than the actual reliability, suggesting that deleting that question would increase the reliability of the scale. 23
Reliability Summary All subscales produced acceptable reliability, except for Participation, which only contains 2 items (reliability increases as the number of items in a scale increases). We cannot infer meaning from the scores for the Participation domain. In the Outcomes domain, reliability estimates would have increased for all centers with the removal of Q26 ("I do better in school and/or work"). All other items deal with concepts associated with mental health treatment (i.e., decreasing symptom interference, relationships, control of one's life). Notice that many consumers have good outcomes without participating in school or work (a resiliency factor). 24
Invariance Testing Across Centers 25
Confirmatory Factor Analysis A model with all five domains could not be fit: some of the parameters could not be estimated (the variance-covariance matrix may not be identified). Exploratory analyses using only Outcomes and Participation showed that Outcomes was the major culprit.
Outcomes/Participation model fit: χ²(26) = 160.32, RMSEA = 0.13, GFI = 0.79, CFI = 0.85
Invariance with 3 domains We tested invariance on three domains only: Satisfaction, Access, and Quality. We ran separate models for every center to get an up-front idea of their similarities and differences. Trouble can be expected based on the fit: Center 2 had the worst fit, Center 3 had a not-so-bad fit, and Center 1 was in between the other two centers.
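As an illustration of a per-center CFA on the three retained domains, the sketch below assumes the semopy package and an assumed item-to-domain mapping for the Quality items; it is not the authors' actual model setup or software.

```python
import pandas as pd
import semopy

# Satisfaction (3 items) and Access (4 items) follow the item numbering shown in
# the DIF plots; the Quality item numbers (Q8-Q15) are an assumption.
MODEL_DESC = """
Satisfaction =~ Q1 + Q2 + Q3
Access       =~ Q4 + Q5 + Q6 + Q7
Quality      =~ Q8 + Q9 + Q10 + Q11 + Q12 + Q13 + Q14 + Q15
"""

def fit_center(items: pd.DataFrame) -> pd.DataFrame:
    """Fit the 3-factor CFA for one center and return its fit indices."""
    model = semopy.Model(MODEL_DESC)
    model.fit(items)
    return semopy.calc_stats(model)  # reports chi2, RMSEA, GFI, CFI, among others

# Example: stats_center1 = fit_center(center1_item_responses)
```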
Center 1 (standardized scores, n = 137): χ²(87) = 236.97, RMSEA = 0.11, GFI = 0.82, CFI = 0.87
Center 2 (standardized scores, n = 101): χ²(87) = 298.47, RMSEA = 0.15, GFI = 0.73, CFI = 0.68
Center 3 (standardized scores, n = 148): χ²(87) = 199.92, RMSEA = 0.09, GFI = 0.85, CFI = 0.90
Measurement Invariance Measurement invariance concerns whether we can assert that we measured the same attribute under different conditions. If there is evidence of variability, findings reporting differences between individuals or groups cannot be interpreted: differences in average scores can just as easily be interpreted as indicating that different things were measured, and correlations between variables will reflect different attributes for different groups.
Factorial Invariance One way to test measurement invariance is FACTORIAL INVARIANCE. The main question it addresses: do the items making up a particular measuring instrument work the same way across different populations (e.g., males and females)? That is, is the measurement model group-invariant? Tests for factorial invariance (in order of difficulty):
Steps in Factor Invariance Testing 1. Equivalent factor structure: same number of factors, items associated with the same factors (structural model invariance). 2. Equivalent factor loading paths: factor loadings are identical for every item and every factor.
Steps in Factor Invariance Testing (cont.) 3. Equivalent factor variance/covariance: variances and covariances (correlations) among factors are the same across populations. 4. Equivalent item reliabilities: residuals for every item are the same across populations.
Results Factorial Invariance
Model                                                                              χ²        df    Δχ²       Δdf   RMSEA   GFI    CFI
1) Number of factors invariant                                                     734.67    261   --        --    0.11    0.93   0.85
2) Model (1) with pattern of factor loadings held invariant (Lambda-X invariant)   786.05    285   51.38*    24    0.11    0.85   0.93
3) Model (2) with factor variances and covariances held invariant (PHI invariant)  1765.42   291   1030.75#  30    0.30    0.84   0.91
4) Model (3) with invariance of item-pair reliabilities (Theta-Delta invariant)    Not run
* p < 0.05   # p < 0.01
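The Δχ² column reports nested-model difference tests; a minimal sketch of how such a test is computed from the figures in the table (SciPy assumed):

```python
from scipy.stats import chi2

# Model (1) vs. Model (2): do equal factor loadings significantly worsen fit?
delta_chi2 = 786.05 - 734.67   # = 51.38
delta_df = 285 - 261           # = 24
p_value = chi2.sf(delta_chi2, delta_df)
print(f"delta chi2 = {delta_chi2:.2f}, delta df = {delta_df}, p = {p_value:.4f}")

# A significant result means the added equality constraints worsen fit, i.e.,
# that level of invariance does not hold across the three centers.
```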
Conclusions Factorial Invariance The model does not provide a good fit across the different centers. Most of the discrepancy is centered on the loadings and on how the domains interact with each other (variance-covariance). Since the testing is incremental (later tests are more challenging than earlier ones), we did not run the equivalent item reliabilities test (the most stringent test).
Differential Item Functioning (DIF) 38
Differential Item Functioning Rasch analysis separates the item characteristics from participants' scores. It assumes that some items can be more difficult to agree with than others. DIF analysis examines and tests (statistically) whether the item characteristics (difficulty scores) vary across centers. Since difficulty is an item characteristic, if difficulty scores vary among mental health centers, then the items measure the centers differently (as opposed to reflecting a true difference in their scores). 39
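To give a rough flavor of a DIF check, the sketch below estimates an item's difficulty separately for two centers and compares the estimates; this is a simplified logit-based contrast for illustration, not the full Rasch calibration used in the study.

```python
import numpy as np

def item_difficulty(endorsed: np.ndarray) -> tuple[float, float]:
    """Crude logit-scale difficulty (higher = harder to endorse) and its approximate SE."""
    n = len(endorsed)
    p = endorsed.mean()                    # proportion agreeing with the item
    difficulty = -np.log(p / (1 - p))      # low endorsement -> high difficulty
    se = 1.0 / np.sqrt(n * p * (1 - p))    # standard error of the logit
    return difficulty, se

def dif_contrast(center_a: np.ndarray, center_b: np.ndarray) -> float:
    """z-statistic for the difficulty difference between two centers."""
    d_a, se_a = item_difficulty(center_a)
    d_b, se_b = item_difficulty(center_b)
    return (d_a - d_b) / np.hypot(se_a, se_b)

# Example: z = dif_contrast(q4_center1_agree, q4_center2_agree)  # |z| > 2 suggests DIF
```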
Access DIF Plot [Figure: item difficulty scores for Q4-Q7, plotted separately for Centers 1, 2, and 3] 40
General Satisfaction DIF Plot [Figure: item difficulty scores for Q1-Q3, plotted separately for Centers 1, 2, and 3] 44
Summary of DIF Analysis The Quality/Appropriateness, Participation, and General Satisfaction subscales measure equally across mental health centers. In the Access and Outcomes subscales, 6 questions produced significant DIF, meaning that the characteristics of the measurement change across centers: Q4 (The location of services was convenient), Q6 (Staff returned my phone calls within 24 hours), Q22 (I am better able to control my life), Q23 (I am better able to deal with crisis), Q24 (I am getting along better with my family), and Q26 (I do better in school and/or work). For these 6 questions, variations across centers may be due to differences in measurement, as opposed to true differences in consumer satisfaction. 45
Discussion 46
What did we learn about the MHSIP? Some items and subscales (domains) do not seem to measure equally across centers. Therefore, comparing centers using these items/domains may not reflect true differences in performance; it is more likely that such comparisons reflect differences in measurement (including error, difficulty, and reliability). 47
Some domains are reliable, some are not Satisfaction was OK from all 3 perspectives. Quality had some good characteristics, but some items were bad. Participation is not very reliable (only two items, though the items were good). Outcomes is, overall, a very weak domain (bad items, lots of cross-loading, correlated errors); employment/education may not be a desired outcome for all consumers.
Discussion Despite the fact that the samples may not be ideal (biases, sampling frameworks that can be improved), the data at hand suggest that there are some intrinsic problems with the MHSIP. But the analyses also suggest some very specific ways to improve it. 49
Suggestions Revise the Outcomes scale (differentiate between recovery and resiliency). Add items to the Participation scale. Some items in Access need to be reviewed (Q4 and Q6). How do we deal with all these cross-loading factors? Is it one domain (satisfaction) that we artificially broke into many domains (outcomes, access, ...)? How does the factor structure for the entire sample (the EFA included in the annual report) hold up for individual centers? More research is needed in this area.
More suggestions Sampling: attempt to stratify the sample by consumers' needs level; at MHCD, we have developed a measure of consumers' recovery needs level (RNL). Equating: use some form of equating procedure to equate scores across centers. Item Response Theory techniques: IRT could help us learn more about how the MHSIP measures satisfaction/performance within and among mental health centers.
More suggestions Mixed-method design: conducting focus groups at each center would provide a cross-validation of the quantitative measurement and would also enhance the utilization of the results for quality improvement. Include the psychometrics (reliability) for every center in the annual reports; this helps us know how much confidence we should have in the scores.
Questions??? Contact Information: Antonio Olmos, Antonio.Olmos@MHCD.org Kate DeRoche, Kathryn.Deroche@MHCD.org Richard Swanson, Richard.Swanson@aumhc.org John Mahalik, JohnMa@jcmh.org 53
χ² (Chi-Square): in this context, it tests the closeness of fit between the unrestricted sample covariance matrix and the restricted (model) covariance matrix. Very sensitive to sample size: the statistic will be significant when the model fits only approximately in the population and the sample size is large.
RMSEA (Root Mean Square Error of Approximation): analyzes the discrepancies between observed and implied covariance matrices. A lower bound of zero indicates perfect fit, with values increasing as the fit deteriorates. It is suggested that values below 0.1 indicate a good fit to the data, and values below 0.05 a very good fit. It is recommended not to use models with RMSEA values larger than 0.1.
GFI (Goodness of Fit Index): analogous to R² in that it indicates the proportion of variance explained by the model. Ranges between 0 and 1, with values exceeding 0.9 indicating a good fit to the data.
CFI (Comparative Fit Index): indicates the proportion of improvement of the overall fit compared to a null (independence) model. Sample-size independent, and penalizes for model complexity. It uses a 0-1 norm, with 1 indicating perfect fit; values of about 0.9 or higher reflect a good fit.
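For reference, standard textbook formulas for two of these indices (general definitions, not specific to this analysis), where M denotes the fitted model, B the baseline (null) model, and N the sample size:

$$\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\, 0)}{df_M\,(N-1)}}, \qquad \mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\, 0)}{\max(\chi^2_B - df_B,\; \chi^2_M - df_M,\; 0)}$$

Both adjust the raw χ² by its degrees of freedom, which is why they are less sensitive to sample size than the χ² test itself.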