The Bilevel Structure of the Outcome Questionnaire 45

Psychological Assessment, 2010, Vol. 22, No. 2, 350–355
© 2010 American Psychological Association 1040-3590/10/$12.00 DOI: 10.1037/a0019187

Jamie L. Bludworth, Terence J. G. Tracey, and Cynthia Glidden-Tracey
Arizona State University

The structure of the Outcome Questionnaire 45 (Lambert et al., 2001) was examined in a sample of 1,100 university counseling center clients using confirmatory factor analysis. Specifically, the relative fit of 1-factor, 3-factor orthogonal, 3-factor oblique, 4-factor hierarchical, and 4-factor bilevel models was examined. Although the 3-factor oblique, 4-factor hierarchical, and 4-factor bilevel models fit the data well, the 4-factor bilevel model fit the data best. The results provided support for the fit of the 4-factor bilevel model where each item loads on 1 of the 3 independent scales of Symptom Distress, Social Role Performance, and Interpersonal Relations, in addition to a nonoverlapping general distress factor.

Keywords: Outcome Questionnaire 45 structure, bilevel structure, confirmatory factor analysis

The Outcome Questionnaire 45 (OQ-45; Lambert et al., 2001) was designed as a concise outcome and screening instrument with three subscales and has become a widely used measure in research and clinical applications.¹ The Symptom Distress (SD) subscale has a primary emphasis on symptoms that are characteristic of the most commonly diagnosed mental disorders (i.e., depression and anxiety). The Interpersonal Relations (IR) subscale is designed to assess difficulties with family, friends, and marital relationships. The Social Role Performance (SR) subscale is designed to assess problems or conflicts with one's employment, education, or leisure pursuits. Lambert et al. (2001) cautioned against using subscale scores as indicators of client change pending empirical validation.
However, they also suggested that subscale scores can be consulted to identify specific areas of difficulty (Lambert et al., 2001, p. 16).

To examine the construct validity of the three subscales on the OQ-45, Mueller, Lambert, and Burlingame (1998) conducted a confirmatory factor analysis to determine the likelihood that the subscales measured the three domains independently. They found that the one-factor solution fit the data as well as either the two- or three-factor solutions. With such inconclusive results, investigators have called for more research to determine the validity and applicability of the separate subscales of the OQ-45 (Lambert et al., 2001; Mueller et al., 1998).

Given that Mueller et al. (1998) found that the one-factor, two-factor orthogonal, and three-factor orthogonal solutions fit the data equally and also because the unexplained overlap between the three subscales obfuscates the dimensional interpretability of the OQ-45, we tested several alternative models of the OQ-45 structure. The small range of models tested in previous studies may account for the inconclusive results, and examining a broader range of viable models might provide means by which OQ-45 results may be more clearly interpreted. Specifically, we focused on the main theoretically driven structures for the OQ: one-factor, three-factor orthogonal, and three-factor oblique models. However, the OQ-45 manual (Lambert et al., 2001) described a general distress score in addition to subscale scores, suggesting alternative models of OQ factor structure that include both a general distress factor and multiple unique subscales.

A four-factor hierarchical model was posited with three orthogonal first-order factors and a second-order general distress factor (see Figure 1). In this model, the three first-order factors are the three factors posited by Lambert et al. (2001): SD, IR, and SR. The second-order factor is a general factor that represents one's overall maladjustment (OM) across the domains. As can be seen in Figure 1, the general factor is hypothesized to have direct effects on each of the first-order factors and indirect effects on the actual items of the OQ-45.

Empirically identical to the model in Figure 1 is the three-factor oblique solution depicted in Figure 2. Although conceptually distinguishable, these two models are treated here as a single hypothesized alternative given their empirical equivalence. This is a special case that applies only when models with three oblique factors are compared with hierarchical models with three first-order factors and one higher order factor. The models are empirically equivalent because they contain an equal number of free parameters to be estimated, yielding fit statistics that are identical. This becomes apparent when Figure 1 is compared with Figure 2. Both models contain exactly 27 free parameters to be estimated. Consequently, the four-factor hierarchical model in Figure 1 was not the primary focus of evaluation. It is included in the current study because, in the literature to date, there has been no discussion regarding the possible existence of a general factor within the conceptual structure of the OQ-45. However, given the scoring of both a total score and subscale scores on the OQ-45, such a factor makes conceptual sense.

Jamie L. Bludworth, Counseling and Consultation, Arizona State University; Terence J. G. Tracey and Cynthia Glidden-Tracey, Department of Psychology in Education, Arizona State University. We thank Michael Lambert and the Brigham Young University counseling center for providing access to archival data for the current study. Correspondence concerning this article should be addressed to Jamie L. Bludworth, Counseling and Consultation, Arizona State University, P.O. Box 871012, Tempe, AZ 85287-1012. E-mail: james.bludworth@asu.edu

¹ The OQ-45 is a proprietary, trademarked Outcome Questionnaire, used with the permission of OQ Measures, LLC (http://www.oqmeasures.com).
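The empirical equivalence of the three-factor oblique and four-factor hierarchical models can be sketched algebraically. The sketch below assumes standardized factors and first-order disturbances that are uncorrelated with OM and with one another; the symbols (second-order loadings gamma, disturbances zeta, factor correlations phi) are introduced here for illustration and are not notation used in the article.

```latex
\begin{align}
F_j &= \gamma_j\,\mathrm{OM} + \zeta_j, \qquad j \in \{\mathrm{SD}, \mathrm{IR}, \mathrm{SR}\} \\
\phi_{jk} &= \operatorname{Corr}(F_j, F_k) = \gamma_j\,\gamma_k, \qquad j \neq k \\
\gamma_{\mathrm{SD}} &= \sqrt{\frac{\phi_{\mathrm{SD,IR}}\,\phi_{\mathrm{SD,SR}}}{\phi_{\mathrm{IR,SR}}}}, \qquad
\gamma_{\mathrm{IR}} = \sqrt{\frac{\phi_{\mathrm{SD,IR}}\,\phi_{\mathrm{IR,SR}}}{\phi_{\mathrm{SD,SR}}}}, \qquad
\gamma_{\mathrm{SR}} = \sqrt{\frac{\phi_{\mathrm{SD,SR}}\,\phi_{\mathrm{IR,SR}}}{\phi_{\mathrm{SD,IR}}}}
\end{align}
```

Under these assumptions, three second-order loadings replace three factor correlations one for one. Whenever the correlations are positive and the resulting loadings do not exceed 1, the two parameterizations imply the same covariance matrix and therefore the same fit, which is why the equivalence is specific to the case of exactly three first-order factors.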

Figure 1. Four-factor hierarchical model. IR = Interpersonal Relations; OM = Overall Maladjustment; SD = Symptom Distress; SR = Social Role Performance. Error terms have been omitted for clarity. [Path diagram not reproduced. OM sits above the first-order factors SD (Items 2, 3, 5, 6, 8, 9, 10, 11, 13, 15, 22, 23, 24, 25, 27, 29, 31, 33, 34, 35, 36, 40, 41, 42, and 45), SR (Items 4, 12, 14, 21, 28, 32, 38, 39, and 44), and IR (Items 1, 7, 16, 17, 18, 19, 20, 26, 30, 37, and 43).]

Another model that includes both a general factor and specific factors is the bilevel model. Following logic similar to that applied by Tracey and Kokotovic (1989) in a study of the structure of the Working Alliance Inventory, we hypothesized each item on the OQ-45 to be indicative of two factors at different logical levels: a common general factor and a more specific factor. Each item would load on one of the three specific OQ factors (i.e., SD, IR, or SR). The variance of each item is partitioned into two nonoverlapping components, one tapping general maladjustment and another tapping specific distress. Responses thus would be determined by overall level of distress as well as specific concerns.

This model is distinct from the hierarchical model described above in that it proposes that each item not only contains unique variance in one of the three content-specific factors but also contains common variance as represented by the general factor. The typical hierarchical model described above views the general factor as being a linear combination of the specific underlying factors, where general distress is simply the sum of the separate underlying scales. Moreover, as can be seen by comparing Figures 1 and 3, the bilevel factor model in Figure 3 is different from the hierarchical model depicted in Figure 1 because the general factor has direct effects on the measured variables rather than indirect effects as in a higher order model. In this way, the variance that went unaccounted for in previous studies is explained by the separate, nonoverlapping general factor of OM.
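The contrast between direct and indirect effects of the general factor can be written out as measurement equations. This is a sketch in standardized form, not notation taken from the article: x_i denotes item i, s(i) the specific factor to which item i is assigned, and the lambda, gamma, zeta, and epsilon symbols are assumptions introduced for illustration.

```latex
\begin{align}
\text{Bilevel:}\quad
x_i &= \lambda^{G}_{i}\,\mathrm{OM} + \lambda^{S}_{i}\,F_{s(i)} + \varepsilon_i,
\qquad \operatorname{Cov}(\mathrm{OM}, F_s) = 0,\quad \operatorname{Cov}(F_s, F_t) = 0 \ (s \neq t) \\
1 &= \bigl(\lambda^{G}_{i}\bigr)^{2} + \bigl(\lambda^{S}_{i}\bigr)^{2} + \operatorname{Var}(\varepsilon_i) \\
\text{Hierarchical:}\quad
x_i &= \lambda_i\,F_{s(i)} + \varepsilon_i, \qquad F_s = \gamma_s\,\mathrm{OM} + \zeta_s \\
\Rightarrow\quad x_i &= \lambda_i\,\gamma_{s(i)}\,\mathrm{OM} + \lambda_i\,\zeta_{s(i)} + \varepsilon_i
\end{align}
```

In the bilevel equations each item has its own freely estimated loading on OM, and the second line shows the partition of a standardized item's variance into general, specific, and residual parts. In the hierarchical equations OM reaches an item only through the product of the item's first-order loading and its factor's second-order loading, which is the indirect effect described above.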
It was hypothesized that the OQ-45 is a multidimensional instrument that contains one general factor and multiple unique subscale factors. This hypothesis is reflected in the bilevel factor and hierarchical models and is similar in pattern to g in models of intelligence. It was hypothesized that the four-factor bilevel model explicated above would provide an empirically adequate and clinically useful description of the factor structure of the OQ-45 that would be consistent with the scoring and interpretation system proposed by the authors of the instrument, providing a framework within which to better understand and apply results obtained from the OQ-45. Thus, the purpose of the present study was to compare several different models of the structure of the OQ-45, with the hypothesis that the four-factor bilevel model would fit best.

Figure 2. Three-factor oblique model. IR = Interpersonal Relations; SD = Symptom Distress; SR = Social Role Performance. Error terms have been omitted for clarity. [Path diagram not reproduced. Items are assigned to SD, SR, and IR as in Figure 1.]

Figure 3. Four-factor bilevel model. IR = Interpersonal Relations; OM = Overall Maladjustment; SD = Symptom Distress; SR = Social Role Performance. Error terms have been omitted for clarity. [Path diagram not reproduced. Items are assigned to SD, SR, and IR as in Figure 1, and OM is drawn with direct paths to the items.]

Method

Participants

Data were generously provided (with permission) from an archival database maintained at the Brigham Young University Counseling and Career Center. Data had been collected from 1,100 participants from 1996 to 2003. Ages of participants ranged from 18 to 59 years, with a mean age of 23 years. The sample included 649 women (59.0%) and 451 men (41.0%), with 55.9% being single, 41.7% married, 1.3% divorced, and 1.1% declining to state their relationship status. Approximately 89.8% of the participants identified themselves as White, 4.7% were Hispanic/Latino, 1.5% were Pacific Islander, 1.1% were Asian, 0.9% were American Indian, 0.4% were Black, and 1.5% endorsed "other."

Instrument

The OQ-45 (Lambert et al., 2001) consists of 45 items that use a five-point Likert-type response format, with anchors of 0 (never) and 4 (almost always). Individuals are instructed to respond to the items in terms of how they felt in the past week. The SD subscale contains 25 items, the majority of which are designed to measure anxiety and depression. Sample items include "I tire quickly" and "I feel nervous." The IR subscale contains items that assess difficulties with family, friends, and marital relationships; for example, "I have frequent arguments." The SR subscale is designed to assess problems with one's employment, education, or leisure pursuits. This subscale contains nine items such as "I find my work/school satisfying" (reverse scored).
Scores from the three subscales are summed to create a total score that represents one's overall level of psychological dysfunction or general distress. Apart from the mixed structural findings noted earlier, past research has provided good evidence of reliability and validity (Lambert et al., 2001).

Analysis

We used confirmatory factor analysis (CFA) to evaluate each of the models proposed in the current study. The CFA procedure

applied within this study was conducted using EQS 6.1 (Bentler, 1995) on the variance-covariance matrices using maximum likelihood estimation. There were no cases of missing data. Initial examination revealed that the data were not multivariate normal, as Mardia's coefficient of multivariate kurtosis (Bentler, 1995) was 663. Given the nonnormal distribution, all analyses were conducted using the Satorra-Bentler robust method of parameter estimation (Satorra & Bentler, 1999).

Although there are many indices of model fit, we used the root mean square error of approximation (RMSEA) and the standardized mean square residual (SMSR) as our primary indicators of fit, both of which are recommended by Hu and Bentler (1998, 1999). The RMSEA is an especially useful indicator of fit because it takes account of model complexity by penalizing more complex models and allows a confidence interval to be specified. Values for RMSEA of .05 or lower suggest a closely fitting model, values of .06 to .08 indicate good fit, and values of .10 and above are indicative of poorly fitting models (Browne & Cudeck, 1993). SMSR values below .08 are indicative of a good fit (Hu & Bentler, 1999). We also reported the Satorra-Bentler scaled chi-square (SB χ²) and the comparative fit index (CFI), but these are less useful.

Table 1
Goodness-of-Fit Indices for the Item-Level Models of the Outcome Questionnaire 45

Model                                 df     SB χ²      CFI    RMSEA   90% CI          SMSR
One-factor model                      945    6,125.23   .69    .080    [.078, .082]    .09
Three-factor model (orthogonal)       945    6,801.73   .64    .086    [.084, .088]    .09
Three-factor model (oblique)          942    5,427.20   .73    .073    [.072, .075]    .07
Four-factor model (hierarchical)      942    5,427.20   .73    .073    [.072, .075]    .07
Four-factor model (bilevel factor)    900    4,201.60   .80    .065    [.063, .066]    .06

Note. SB χ² = Satorra-Bentler scaled chi-square; CFI = comparative fit index; CI = confidence interval; RMSEA = root mean square error of approximation; SMSR = standardized mean square residual.
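The degrees of freedom reported in Table 1 can be checked by counting parameters. The following is a minimal sketch under stated assumptions: every loading and every unique variance is freely estimated, factor variances are fixed for scaling, and the oblique and hierarchical models each add three parameters (three factor correlations or three second-order paths). This parameterization is an assumption chosen for illustration; the authors' exact EQS specification is not given in the text, and the function and variable names are hypothetical.

```python
def cfa_degrees_of_freedom(n_items: int, n_free_parameters: int) -> int:
    """Degrees of freedom = unique variances/covariances minus free parameters."""
    n_moments = n_items * (n_items + 1) // 2  # 45 items -> 1,035 unique moments
    return n_moments - n_free_parameters


N_ITEMS = 45

# Assumed free-parameter counts (loadings + unique variances + extras).
free_parameters = {
    "one-factor": 45 + 45,
    "three-factor orthogonal": 45 + 45,
    "three-factor oblique": 45 + 45 + 3,       # plus 3 factor correlations
    "four-factor hierarchical": 45 + 45 + 3,   # plus 3 second-order paths
    "four-factor bilevel": 45 + 45 + 45,       # specific + general loadings
}

df_by_model = {
    name: cfa_degrees_of_freedom(N_ITEMS, count)
    for name, count in free_parameters.items()
}

for name, df in df_by_model.items():
    print(f"{name}: df = {df}")
```

Under these assumptions the counts give 945, 945, 942, 942, and 900 degrees of freedom, matching the df column of Table 1, and the bilevel model uses 42 more parameters than the oblique or hierarchical model.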
The chi-square is highly affected by sample size and model complexity (Marsh, Balla, & McDonald, 1988; Vandenberg & Lance, 2000), both of which are salient here. Given that we were examining an item-level CFA model, we anticipated overall poor fit on many indices, especially the CFI, as documented by Marsh, Hau, and Grayson (2005), even with accurately specified models.²

Results

The goodness-of-fit indices for each of the models tested at the item level are presented in Table 1. The CFI values for each model were low (ranging from .69 to .80) and well below standard cutoffs (e.g., Hu & Bentler, 1998). However, such low CFI values are typical when measures are examined in CFA at the item level (Marsh et al., 2005), even with accurately specified models. Given this, we focused on the RMSEA and SMSR indices in evaluating fit. Hu and Bentler (1999) have supported the use of the RMSEA and the SMSR as two of the preferred indicators of model fit. Using these indices, the three-factor oblique, four-factor hierarchical, and four-factor bilevel models were below standard cutoffs of .08 for RMSEA and .08 for SMSR.

The relations among the factors in the three-factor oblique model were very high, with standardized parameter estimates of rs = .89 between SD and IR, .78 between SD and SR, and .73 between IR and SR. These strong relations could indicate a general factor of the kind incorporated into the bilevel model. The parameter estimates from the four-factor hierarchical model that is equivalent to the three-factor oblique model were rs = .84, .81, and .76 from the general factor to the three subscales, respectively.

The three plausible models (i.e., three-factor oblique, four-factor hierarchical, and four-factor bilevel models) were not nested, precluding scaled chi-square difference tests. However, the RMSEA and SMSR values of the bilevel model were the lowest overall. The SMSR was equal to .06 for the bilevel model versus .07 for the other two models.
The RMSEA was equal to .065 for the bilevel model versus .073 for the other models. The lower levels of the bilevel RMSEA estimates are noteworthy because the RMSEA index penalizes less parsimonious models (Browne & Cudeck, 1993). The bilevel model is much less parsimonious than either the three-factor oblique model or the four-factor hierarchical model in that it has 42 more parameters. In addition, use of the RMSEA enables the specification of a confidence interval for the parameter estimate. The 90% confidence interval for the RMSEA for the four-factor bilevel model was [.063, .066], and this band did not include the RMSEA values for either the three-factor oblique or the four-factor hierarchical models, supporting the superiority of the bilevel model. So in terms of model fit, the four-factor bilevel model fit best.

Levy and Hancock (2007) presented an approach for evaluating the relative fit of nonnested models within structural equation modeling. They proposed the use of a T statistic to evaluate the similarity in fit of two fitted model covariance matrices. We compared the four-factor bilevel model with the other two plausible models. The T statistic for this comparison was T = 11.48, p < .001, indicating that the four-factor bilevel model fit the data significantly better than did the three-factor oblique or the four-factor hierarchical models.

The standardized parameter estimates for the four-factor bilevel model are presented in Table 2. An examination of the item loadings reveals that most items load at least moderately on the intended scale. Also, the loadings on the general OM factor were all high and fairly uniform in magnitude. All of the items had large, fairly equal parameter estimates on the general OM factor and then moderate parameter estimates for the specific factors. All items had higher parameter estimates on the general factor than on the specific factors.

Table 2
Standardized Parameter Estimates of the Four-Factor Bilevel Model

Item  SD    IR    SR    OM
1           0.55        0.49
2     0.56              0.55
3     0.44              0.73
4                 0.32  0.54
5     0.32              0.61
6     0.56              0.61
7           0.65        0.33
8     0.34              0.66
9     0.29              0.70
10    0.38              0.62
11    0.41              0.15
12                0.32  0.64
13    0.22              0.79
14                0.12  0.05
15    0.42              0.83
16          0.77        0.24
17          0.66        0.17
18          0.27        0.74
19          0.44        0.31
20          0.26        0.65
21                0.22  0.63
22    0.33              0.61
23    0.19              0.80
24    0.24              0.81
25    0.47              0.48
26          0.05        0.35
27    0.50              0.41
28                0.25  0.55
29    0.28              0.49
30          0.27        0.53
31    0.31              0.84
32                0.37  0.33
33    0.28              0.52
34    0.39              0.34
35    0.44              0.34
36    0.22              0.57
37          0.48        0.53
38                0.34  0.47
39                0.26  0.43
40    0.30              0.67
41    0.14              0.55
42    0.30              0.84
43          0.41        0.66
44                0.20  0.51
45    0.26              0.39

Note. SD = Symptom Distress; IR = Interpersonal Relations; SR = Social Role Performance; OM = Overall Maladjustment.

² We also examined the models using item parcels whereby items were combined into one of three parcels for each subscale, for a total of nine parcels. These parcels were then tested using the same models. The results were virtually identical to those obtained using the item-level analyses. We present the item-level analyses because these are more straightforward and less reliant on the specific parcels generated.

Discussion

The proposed structure of the OQ-45 (Lambert, 1983) is one of three separate subscales (i.e., SD, IR, and SR) that can be aggregated into one general distress total score. However, empirical support for this proposed structure has not previously been demonstrated (e.g., Mueller et al., 1998). The results of this study provide some support for this structure. The most plausible models for the OQ-45 in this study were the three-factor oblique model, the four-factor hierarchical model, and the four-factor bilevel model. All of these models represent the composition of the OQ-45 in terms of specific subscale variance as well as high relations across the scales. All models justify using the OQ-45 as originally proposed: as a representation of three subscales as well as a more general distress factor. This study has provided support for the proposed two-level structure of the OQ-45.

However, the four-factor bilevel model received the most support.
This model is different from the three-factor oblique or hierarchical models in that it separates out overall distress at the item level. Each item is composed of both specific problem variance and general distress variance. From an examination of the parameter estimates for the bilevel model, it appears that the general factor loads highly and fairly evenly across all items and appears to be capturing a general distress dimension. In addition, each item loaded on one of the three specific factors. The bilevel model indicates that individuals are responding to items both with respect to the overall level of general distress experienced, regardless of specific item qualities, and in response to the unique aspects of that specific item. The hierarchical model views the general distress as simply a summation of the separate scales. The bilevel model demonstrates that a separate overall distress is being captured at the item level. This four-factor bilevel structure supports the current applications of the OQ-45 where individuals are typically provided with their three subscale scores as well as their overall score. This reporting system thus represents the overall, general distress as indicated in the total score and the unique variance associated with each of the individual subscale scores. The relative similarity of loadings for the general distress factor across items means that the OQ can be unit weighted to yield both subscale and general distress scores without any undesirable consequences. The finding that three separate subscales conform to the proposed structure is the first such structural support in the literature. The presence of these three subscale factors was only revealed when general distress was separately taken into account. The lack of incorporation of general distress into previous models is proposed as the main reason for past equivocal results (e.g., Mueller et al., 1998). 
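The idea that each item combines general distress variance with specific problem variance can be illustrated with a few of the standardized estimates printed in Table 2. The sketch below assumes the printed values are standardized loadings with the signs as shown, that the general and specific factors are mutually uncorrelated as the bilevel model specifies, and that items belong to the factors indicated in Figures 1 to 3. The function names and the choice of four example items are hypothetical and made only for illustration.

```python
# item -> (specific factor, loading on specific factor, loading on OM),
# copied from Table 2 for four example items.
LOADINGS = {
    1: ("IR", 0.55, 0.49),
    7: ("IR", 0.65, 0.33),
    2: ("SD", 0.56, 0.55),
    4: ("SR", 0.32, 0.54),
}


def variance_components(item: int) -> tuple[float, float, float]:
    """Split a standardized item's variance into specific, general, and residual parts."""
    _, specific, general = LOADINGS[item]
    specific_var = specific ** 2
    general_var = general ** 2
    return specific_var, general_var, 1.0 - specific_var - general_var


def implied_correlation(item_a: int, item_b: int) -> float:
    """Model-implied correlation between two items under orthogonal factors."""
    factor_a, spec_a, gen_a = LOADINGS[item_a]
    factor_b, spec_b, gen_b = LOADINGS[item_b]
    # Items always share OM; they share a specific factor only if on the same subscale.
    shared_specific = spec_a * spec_b if factor_a == factor_b else 0.0
    return gen_a * gen_b + shared_specific


specific_var, general_var, residual_var = variance_components(1)
print(f"Item 1: specific={specific_var:.4f}, general={general_var:.4f}, residual={residual_var:.4f}")
print(f"Items 1 and 7 (both IR): r = {implied_correlation(1, 7):.4f}")
print(f"Items 1 and 2 (IR vs. SD): r = {implied_correlation(1, 2):.4f}")
```

Under these assumptions, two items from the same subscale are expected to correlate through both the general factor and their shared specific factor (about .52 for Items 1 and 7), whereas items from different subscales correlate through the general factor alone (about .27 for Items 1 and 2).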
The results of this study provide support for the structural presence of the three subscales, but they provide no information on other aspects of validity. Lambert et al. (2001) noted that there was not a clear pattern of correlations of the subscales with the Inventory of Interpersonal Problems (Horowitz, Rosenberg, Baer, Ureno, & Villasenor, 1988) and the Social Adjustment Scale (Weissman & Bothwell, 1976). The OQ-45 subscales did not relate differentially to these external scales. So although there is support for the unique variation associated with the three subscales, it is less clear that they are accurately labeled. Subsequent research needs to focus on understanding what each subscale is uniquely representing via relations with external scales. Given the presence of the general distress factor, future research needs to take account of this factor first before looking for unique relations with other scales; otherwise, a lack of clear results may occur. Given that the OQ-45 was designed to measure client change over multiple administrations, it would also be important to conduct further structural evaluations of the OQ-45 that examine the degree to which the structure remains consistent over multiple

administrations. Nevertheless, the current findings provide support for the multidimensionality of this measure and call into question previous admonitions to disregard the subscales as being anything more than theoretically interesting. Additionally, these findings suggest that future construct validity studies of the OQ-45 would do well to evaluate the instrument from a bilevel perspective.

References

Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate Software.

Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.

Horowitz, L. M., Rosenberg, S. E., Baer, B. A., Ureno, G., & Villasenor, V. S. (1988). Inventory of Interpersonal Problems: Psychometric properties and clinical applications. Journal of Consulting and Clinical Psychology, 56, 885–892.

Hu, L., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424–453.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.

Lambert, M. J. (1983). Introduction to assessment of psychotherapy outcome: Historical perspective and current issues. In M. J. Lambert, E. R. Christensen, & S. S. DeJulio (Eds.), The assessment of psychotherapy outcome (pp. 3–32). New York, NY: Wiley.

Lambert, M. J., Hansen, N. B., Umpress, V., Lunnen, K., Okiishi, J., Burlingame, G. M., & Reisinger, C. W. (2001). Administration and scoring manual for the OQ-45. Orem, UT: American Professional Credentialing Services.

Levy, R., & Hancock, G. R. (2007). A framework of statistical tests for comparing mean and covariance structure models. Multivariate Behavioral Research, 42, 33–66.

Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: The effects of sample size. Psychological Bulletin, 103, 391–410.

Marsh, H. W., Hau, K., & Grayson, D. (2005). Goodness of fit in structural equation models. In A. Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary psychometrics: A festschrift for Roderick P. McDonald (pp. 275–340). Mahwah, NJ: Erlbaum.

Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248–262.

Satorra, A., & Bentler, P. M. (1999). A scaled difference chi-square statistic for moment structure analysis (UCLA Statistics Series No. 260). University of California, Los Angeles.

Tracey, T. J., & Kokotovic, A. M. (1989). Factor structure of the Working Alliance Inventory. Psychological Assessment, 1, 207–210.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70.

Weissman, M. M., & Bothwell, S. (1976). Assessment of social adjustment by patient self-report. Archives of General Psychiatry, 33, 1111–1115.

Received June 6, 2009
Revision received January 14, 2010
Accepted January 18, 2010