Chapter 9 Measurement Models for Marketing Constructs


Hans Baumgartner and Bert Weijters

H. Baumgartner, Smeal College of Business, Penn State University, University Park, USA, e-mail: HansBaumgartner@psu.edu
B. Weijters, Ghent University, Ghent, Belgium

Springer International Publishing AG 2017. B. Wierenga and R. van der Lans (eds.), Handbook of Marketing Decision Models, International Series in Operations Research & Management Science 254, DOI / _9

9.1 Introduction

Researchers who seek to understand marketing phenomena frequently need to measure the phenomena studied. However, for a variety of reasons, measuring marketing constructs is generally not a straightforward task and often sophisticated measurement models are needed to fully capture relevant marketing constructs. For instance, consider brand love, a construct that has become a common theme in advertising practice and academic marketing research, but which has proven hard to measure adequately. In an effort to assess brand love, Batra et al. (2012) develop a measurement model that comprises no fewer than seven dimensions of the construct (passion-driven behaviors, self-brand integration, positive emotional connection, long-term relationship, anticipated separation distress, overall attitude valence, and attitude strength in terms of certainty/confidence). The hierarchical model they propose can assist marketing executives in showing how to influence a consumer's feeling of brand love by targeting the lower-level, concrete subcomponents through product and service design and/or marketing communications (e.g., by providing trusted expert advice on a website, a company can leverage a feeling of anticipated separation distress).

But even for marketing constructs that seem more concrete and for which well-established measures are readily available, researchers face important challenges in terms of measurement modeling. For instance, validly measuring satisfaction is often more challenging than it may seem. Consider a researcher who is interested in consumers' satisfaction with a firm's offering and the determinants of

their satisfaction (e.g., both proximal determinants such as their satisfaction with particular aspects of the product and more distal determinants such as prior expectations). It is well-known that the responses provided by consumers may not reflect their true satisfaction with the product or other product-related characteristics because of various extraneous influences, both random and systematic, related to the respondent (e.g., acquiescence, social desirability), the survey instrument (e.g., wording of the items, response format), and situational factors (e.g., distractions present in the survey setting). Furthermore, it may not be valid to assume that the responses provided by consumers can be taken at face value and used as interval scales in analyses that require such an assumption. Because of these problems, both in the way respondents provide their ratings and in how researchers treat these ratings, comparisons across individuals or groups of individuals may be compromised. As a case in point, Rossi et al. (2001) demonstrate that a model in which scale usage differences and the discrete nature of the rating scale are taken into account explicitly leads to very different findings about the relationship between overall satisfaction and various dimensions of product performance compared to a model in which no corrections are applied to the data. In another study using satisfaction survey data, Andreassen et al. (2006) illustrate how alternative estimation methods to account for non-normality may lead to different results in terms of model fit, model selection, and parameter estimates and, as a consequence, managerial priorities in the marketing domain.

Measurement is the process of quantifying a stimulus (the object of measurement) on some dimension of judgment (the attribute to be measured).
Often, a measurement task involves a rater assigning a position on a rating scale to an object based on the object's perceived standing on the attribute of interest. In a marketing context, objects of measurement may be individuals (consumers, salespeople, etc.), firms, or advertisements, to mention just a few examples, which are rated on various individual differences, firm characteristics, and other properties of interest. For example, raters may be asked to assess the service quality of a particular firm (specifically, the reliability dimension of service quality) by indicating their agreement or disagreement with the following item: "XYZ [as a provider of a certain type of service] is dependable" (Parasuraman et al. 1988). Although the use of raters is common, the quantification could also be based on secondary data and other sources.

Three important concepts in the measurement process have to be distinguished (Groves et al. 2004): construct, measure, and response. A construct is the conceptual entity that the researcher is interested in. Before empirical measurements can be collected, it is necessary that the construct in question be defined carefully. This requires an explication of the essential meaning of the construct and its differentiation from related constructs. In particular, the researcher has to specify the domain of the construct in terms of both the attributes (properties, characteristics) of the intended conceptual entity and the objects to which these attributes extend (MacKenzie et al. 2011; Rossiter 2002). For example, if the construct is service

quality, the attribute is quality (or specific aspects of quality such as reliability) and the object is firm XYZ's service. Both the object and its attributes can vary in how concrete or abstract they are and, generally, measurement becomes more difficult as the abstractness of the object and/or attributes increases (both for the researcher trying to construct appropriate measures of the construct of interest and the respondent completing a measurement exercise). In particular, more abstract constructs generally require multiple measures.

One important question to be answered during the construct specification task is whether the object and/or the attributes of the object should be conceptualized as uni- or multidimensional. This question is distinct from whether the items that are used to empirically measure a construct are uni- or multidimensional. If an object is complex because it is an aggregate of sub-objects or an abstraction of more basic ideas, or if the meaning of the attribute is differentiated and not uniformly comprehended, it may be preferable to conceptualize the construct as multidimensional. For multidimensional objects and attributes, the sub-objects and sub-attributes have to be specified and measured separately. For example, if firm XYZ has several divisions or provides different types of services so that it is difficult for respondents to integrate these sub-objects into an overall judgment, it may be preferable to assess reactions to each sub-object separately. Similarly, since the quality of a service is not easily assessed in an overall sense (and an overall rating may lack diagnosticity at any rate), service quality has been conceptualized in terms of five distinct dimensions (tangibles, reliability, responsiveness, assurance, and empathy) (Parasuraman et al. 1988).

Once the construct has been defined, measures of the construct have to be developed.
Under special circumstances, a single measure of a construct may be sufficient. Specifically, Bergkvist and Rossiter (2007) and Rossiter (2002) argue (and present some supporting evidence) that a construct can be measured with a single item if in the minds of raters (1) the object of the construct is concrete singular, meaning that it consists of one object that is easily and uniformly imagined, and (2) the attribute of the construct is concrete, again meaning that it is easily and uniformly imagined (Bergkvist and Rossiter 2007, p. 176). However, since the object, the attribute, or both tend to be fairly abstract, multiple measures are usually required to adequately capture a construct.

An important consideration when developing measures and specifying a measurement model is whether the measures are best thought of as manifestations of the underlying construct or defining characteristics of it (MacKenzie et al. 2005). In the former case, the indicators are specified as effects of the construct (so-called reflective indicators), whereas in the latter case they are hypothesized as causes of the construct (formative indicators). MacKenzie et al. (2005) provide four criteria that can be used to decide whether particular items are reflective or formative measures of a construct. If an indicator is (a) a manifestation (rather than a defining characteristic) of the underlying construct, (b) conceptually interchangeable with

the other indicators of the construct, (c) expected to covary with the other indicators, and (d) hypothesized to have the same antecedents and consequences as the other indicators, then the indicator is best thought of as reflective. Otherwise, the indicator is formative.

Based on the measures chosen to represent the intended conceptual entity, observed responses of the hypothesized construct can be obtained. For constructs for which both the object and the attribute are relatively concrete (e.g., a person's chronological age, a firm's advertising spending), few questions about the reliability and validity of measurement may be raised. However, as constructs become more abstract, reliability and validity assessments become more important. Depending on how one specifies the relationship between indicators and the underlying construct (i.e., reflective vs. formative), different procedures for assessing reliability and validity have to be used (MacKenzie et al. 2011).

Constructing reliable and valid measures of constructs is a nontrivial task involving issues related to construct definition and development of items that fully capture the intended construct. Since several elaborate discussions of construct measurement and scale development have appeared in the recent literature (MacKenzie et al. 2011; Rossiter 2002), we will not discuss these topics in the present chapter. Instead, we will focus on models that can be used to assess the quality of measurement for responses that are already available. We will start with a discussion of the congeneric measurement model in which continuous observed indicators are seen as reflections of an underlying latent variable, each observed variable loads on a single latent variable (provided multiple latent variables are included in the model), and no correlations among the unique factors (measurement errors) are allowed.
We will also contrast the congeneric measurement model with a formative measurement model, consider measurement models that incorporate a mean structure (in addition to a covariance structure), and present an extension of the single-group model to multiple groups. We will then discuss three limitations of the congeneric model. First, it may be unrealistic to assume that each item loads on a single latent variable and that the loadings on non-target factors are zero (provided the measurement model contains multiple latent variables). Second, often the observed variables are not only correlated because they load on the same factor or because the factors on which they load are correlated. There may be other sources of covariation (due to various method factors) that require the specification of correlations among the unique factors or the introduction of method factors. Third and finally, although the assumption of continuous, normally distributed indicators, which is probably never strictly satisfied, may often be adequate, sometimes it is so grossly violated that alternative models have to be entertained. Below we will discuss the three limitations in greater detail and consider ways of overcoming these shortcomings. Throughout the chapter, illustrative examples of the various models are presented to help the reader follow the discussion more easily.

9.2 The Congeneric Measurement Model

9.2.1 Conceptual Development

The so-called congeneric measurement model is a confirmatory factor model in which $I$ observed or manifest variables $x_i$ (also called indicators), contained in an $I \times 1$ vector $x$, are a function of (i.e., reflections of) $J$ latent variables (or common factors) $\xi_j$ (included in a $J \times 1$ vector $\xi$) and $I$ unique factors $\delta_i$ (summarized in an $I \times 1$ vector $\delta$). The strength of the relationship between the $x_i$ and $\xi_j$ is expressed by an $I \times J$ matrix of factor loadings $\Lambda$ with typical elements $\lambda_{ij}$. In matrix form, the model can be written as follows:

$$x = \Lambda \xi + \delta \tag{9.1}$$

For now, we assume that $x$ and $\xi$ are in deviation form (i.e., mean-centered), although this assumption will be relaxed later. Assuming that $E(\delta) = 0$ and $\mathrm{Cov}(\xi, \delta') = 0$, this specification of the model implies the following structure for the variance-covariance matrix of $x$, which is called $\Sigma$:

$$\Sigma = \Sigma(\Lambda, \Phi, \Theta) = \Lambda \Phi \Lambda' + \Theta \tag{9.2}$$

where $\Phi$ and $\Theta$ are the variance-covariance matrices of $\xi$ and $\delta$, respectively (with typical elements $\varphi_{ij}$ and $\theta_{ij}$), and the symbol $'$ is the transpose operator.

In a congeneric measurement model, each observed variable is hypothesized to load on a single factor (i.e., $\Lambda$ contains only one nonzero entry per row) and the unique factors are uncorrelated (i.e., $\Theta$ is diagonal). For identification, either one loading per factor has to be fixed at one, or the factor variances have to be standardized to one. If there are at least three indicators per factor, a congeneric factor model is identified, even if there is only a single factor and regardless of whether multiple factors are correlated or uncorrelated (orthogonal). If there are only two indicators per factor, a single-factor model is not identified (unless additional restrictions are imposed), and multiple factors have to be correlated for the model to be identified.
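As a rough illustration of Eq. (9.2), the following sketch builds the model-implied covariance matrix for a two-factor, six-indicator congeneric model and counts its degrees of freedom. All numerical values and the helper names (`matmul`, `transpose`, `implied_covariance`) are made up for illustration and are not estimates from the chapter; identification by fixing one loading per factor at one is assumed.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def implied_covariance(lam, phi, theta_diag):
    """Sigma = Lambda Phi Lambda' + Theta, with Theta diagonal (Eq. 9.2)."""
    sigma = matmul(matmul(lam, phi), transpose(lam))
    for i, t in enumerate(theta_diag):
        sigma[i][i] += t
    return sigma

# Hypothetical parameter values (assumptions, not results from the chapter).
# Each row of lam has a single nonzero entry; x1 and x4 are marker variables.
lam = [[1.0, 0.0], [0.8, 0.0], [0.9, 0.0],
       [0.0, 1.0], [0.0, 0.7], [0.0, 1.1]]
phi = [[0.9, 0.3],
       [0.3, 0.7]]
theta_diag = [0.5, 0.6, 0.4, 0.5, 0.7, 0.3]

sigma = implied_covariance(lam, phi, theta_diag)

# Degrees of freedom: distinct variances/covariances minus free parameters.
n_obs = len(lam)
distinct = n_obs * (n_obs + 1) // 2          # 21 distinct elements of Sigma
free = (n_obs - 2) + 3 + n_obs               # 4 loadings + 3 in Phi + 6 in Theta
df = distinct - free
print(df)
```

Under these assumptions, two indicators of different factors covary only through the factor covariance (for example, the implied covariance of x1 and x4 is simply the off-diagonal element of `phi`), which is what the congeneric restrictions on Λ and Θ imply.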
If there is only a single indicator per factor, the associated unique factor variance cannot be freely estimated (i.e., has to be set to zero or another assumed value). A graphical representation of a specific congeneric measurement model with 6 observed measures and 2 factors is shown in Fig. 9.1.

Factor models are usually estimated based on maximum likelihood (which assumes multivariate normality of the observed variables and requires a relatively large sample size), although other estimation procedures are available. To evaluate the fit of the overall model, one can use a likelihood ratio test in which the fit of the specified model is compared to the fit of a model with perfect fit. A nonsignificant $\chi^2$ value indicates that the specified model is acceptable, but often the hypothesized

model is found to be inconsistent with the data. Since models are usually not meant to be literally true, and since at relatively large sample sizes the $\chi^2$ test will be powerful enough to detect even relatively minor misspecifications, researchers frequently use alternative fit indices to evaluate whether the fit of the model is good enough from a practical perspective. Among the more established alternative fit indices are the root mean square error of approximation (RMSEA), the standardized root mean square residual (SRMR), the comparative fit index (CFI), and the Tucker-Lewis fit index (TLI). For certain purposes, information theory-based fit indices such as BIC may also be useful. Definitions of these fit indices, brief explanations, and commonly used cutoff values are provided in Table 9.1.

Fig. 9.1 A congeneric measurement model (path diagram: two factors $\xi_1$ and $\xi_2$ with covariance $\varphi$; indicators $x_1$ to $x_3$ load on $\xi_1$ with loadings $\lambda_{11}$, $\lambda_{21}$, $\lambda_{31}$; indicators $x_4$ to $x_6$ load on $\xi_2$ with loadings $\lambda_{42}$, $\lambda_{52}$, $\lambda_{62}$; unique factors $\delta_1$ to $\delta_6$ have variances $\theta_{11}$ to $\theta_{66}$). Note for Fig. 9.1: In the illustrative example of Sect. 9.2.2, $x_1$ to $x_3$ are regularly worded environmental concern items; $x_4$ to $x_6$ are regularly worded health concern items; $\xi_1$ refers to environmental concern, and $\xi_2$ refers to health concern (see Table 9.2 for the items).

In our experience, researchers are often too quick to dismiss a significant $\chi^2$ value based on the presumed limitations of this test (i.e., a significant $\chi^2$ test does show that there are problems with the specified model and the researcher should investigate potential sources of this lack of fit), but it is possible that relatively minor misspecifications lead to a significant $\chi^2$ value, in which case a reliance on satisfactory alternative fit indices may be justified. If a model is deemed to be seriously inconsistent with the data, it has to be respecified. This is usually done with the help of modification indices, although

Table 9.1 A summary of commonly used overall fit indices

| Index | Definition of the index | Interpretation and use of the index |
|---|---|---|
| Minimum fit function chi-square ($\chi^2$) | $(N-1)f$ | Tests the hypothesis that the specified model fits perfectly (within the limits of sampling error); the obtained $\chi^2$ value should be smaller than $\chi^2_{crit}$; note that the minimum fit function $\chi^2$ is only one possible chi-square statistic and that different discrepancy functions will yield different $\chi^2$ values |
| Root mean square error of approximation (RMSEA) | $\sqrt{\dfrac{\chi^2 - df}{(N-1)\,df}}$ | Estimates how well the fitted model approximates the population covariance matrix per df; Browne and Cudeck (1992) suggest that a value of 0.05 indicates a close fit and that values up to 0.08 are reasonable; Hu and Bentler (1999) recommend a cutoff value of 0.06; a p-value for testing the hypothesis that the discrepancy is smaller than 0.05 may be calculated (so-called test of close fit) |
| (Standardized) Root mean squared residual ((S)RMR) | $\sqrt{\dfrac{2\sum_{i}\sum_{j \le i}(s_{ij} - \hat{\sigma}_{ij})^2}{p(p+1)}}$ | Measures the average size of residuals between the fitted and sample covariance matrices; if a correlation matrix is analyzed, RMR is standardized to fall within the [0, 1] interval (SRMR), otherwise it is only bounded from below; a cutoff of 0.05 is often used for SRMR; Hu and Bentler (1999) recommend a cutoff value close to 0.08 |
| Bayesian information criterion (BIC) | $[\chi^2 + r \ln N]$ or $[\chi^2 - df \ln N]$ | Based on statistical information theory and used for testing competing (possibly non-nested) models; the model with the smallest BIC is selected |
| Comparative Fit Index (CFI) | $1 - \dfrac{\max(\chi^2_t - df_t,\, 0)}{\max(\chi^2_n - df_n,\, \chi^2_t - df_t,\, 0)}$ | Measures the proportionate improvement in fit (defined in terms of noncentrality, i.e., $\chi^2 - df$) as one moves from the baseline to the target model; originally, values greater than 0.90 were deemed acceptable, but Hu and Bentler (1999) recommend a cutoff value of 0.95 |
| Tucker-Lewis nonnormed fit index (TLI, NNFI) | $\dfrac{\dfrac{\chi^2_n - df_n}{df_n} - \dfrac{\chi^2_t - df_t}{df_t}}{\dfrac{\chi^2_n - df_n}{df_n}}$ | Measures the proportionate improvement in fit (defined in terms of noncentrality) as one moves from the baseline to the target model, per df; originally, values greater than 0.90 were deemed acceptable, but Hu and Bentler (1999) recommend a cutoff value of 0.95 |

Notes: $N$ = sample size; $f$ = minimum of the fitting function; $df$ = degrees of freedom; $p$ = number of observed variables; $r$ = number of estimated parameters; $\chi^2_{crit}$ = critical value of the $\chi^2$ distribution with the appropriate number of degrees of freedom and for a given significance level; the subscripts $n$ and $t$ refer to the null (or baseline) and target models, respectively. The baseline model is usually the model of complete independence of all observed variables.
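The definitions in Table 9.1 can be turned into a short calculation. The sketch below is only an illustration: the chi-square values, degrees of freedom, and sample size are invented, the function names are not taken from any SEM package, and the truncation of RMSEA at zero is a common convention that is assumed here rather than stated in the table.

```python
import math

def rmsea(chi2, df, n):
    """RMSEA per Table 9.1; truncated at 0 when chi2 < df (assumed convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / ((n - 1) * df))

def cfi(chi2_t, df_t, chi2_n, df_n):
    """Comparative fit index: improvement in noncentrality over the baseline."""
    num = max(chi2_t - df_t, 0.0)
    den = max(chi2_n - df_n, chi2_t - df_t, 0.0)
    return 1.0 - num / den if den > 0 else 1.0

def tli(chi2_t, df_t, chi2_n, df_n):
    """Tucker-Lewis index: improvement in noncentrality per df."""
    base = (chi2_n - df_n) / df_n
    target = (chi2_t - df_t) / df_t
    return (base - target) / base

def bic(chi2, df, n):
    """BIC in the chi2 - df * ln(N) form listed in Table 9.1."""
    return chi2 - df * math.log(n)

# Hypothetical inputs (assumptions, not results reported in the chapter):
# target model chi2 = 24 on 8 df, independence model chi2 = 1500 on 15 df, N = 500.
chi2_t, df_t = 24.0, 8
chi2_n, df_n = 1500.0, 15
n = 500

fit = {
    "RMSEA": rmsea(chi2_t, df_t, n),
    "CFI": cfi(chi2_t, df_t, chi2_n, df_n),
    "TLI": tli(chi2_t, df_t, chi2_n, df_n),
    "BIC": bic(chi2_t, df_t, n),
}
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

With these made-up inputs, CFI and TLI both exceed the 0.95 cutoff, while RMSEA falls between 0.06 and 0.08, so the verdict depends on which RMSEA guideline from the table is applied.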

other tools are also available (such as an analysis of the residuals between the observed and model-implied covariance matrices). A modification index is the predicted decrease in the $\chi^2$ statistic if a fixed parameter is freely estimated or an equality constraint is relaxed. For example, a significant modification index for a factor loading that is restricted to be zero suggests that the indicator in question may have a non-negligible loading on a non-target factor, or maybe that the observed variable was incorrectly assumed to be an indicator of a certain construct (particularly when the loading on the presumed target factor is small). Associated with each modification index is an expected parameter change (EPC), which shows the predicted estimate of the parameter when the parameter is freely estimated. Although models can be brought into closer correspondence with the data by repeatedly freeing parameters based on significant modification indices, there is no guarantee that the respecified model will be closer to reality, or hold up well with new data (MacCallum 1986).

Once a researcher is satisfied that the (respecified) model is reasonably consistent with the data, a detailed investigation of the local fit of the model can be conducted. From a measurement perspective, three issues are of paramount importance. First, the items hypothesized to measure a given construct have to be substantially related to the underlying construct, both individually and collectively.
If one assumes that the observed variance in a measure consists of only two sources, substantive variance (variance due to the underlying construct) and random error variance, then convergent validity is similar to reliability (some authors argue that reliability refers to the convergence of measures based on the same method, whereas different methods are necessary for convergent validity); henceforth, we will therefore use the term reliability to refer to the relationship between measures and constructs (even though the unique factor variance usually does not only contain random error variance). Individually, an item should load significantly on its target factor, and each item's observed variance should contain a substantial amount of substantive variance. One index, called individual-item reliability (IIR) or individual-item convergent validity (IICV), is defined as the squared correlation between a measure $x_i$ and its underlying construct $\xi_j$ (i.e., the proportion of the total variance in $x_i$ that is substantive variance), which can be computed as follows:

$$IIR_{x_i} = \frac{\lambda_{ij}^2 \varphi_{jj}}{\lambda_{ij}^2 \varphi_{jj} + \theta_{ii}} \tag{9.3}$$

Ideally, at least half of the total variance should be substantive variance (i.e., IIR ≥ 0.5), but this is often not the case. One can also summarize the reliability of all indicators of a given construct by computing the average of the individual-item reliabilities. This is usually called average variance extracted (AVE), that is,

$$AVE = \frac{\sum_i IIR_{x_i}}{K} \tag{9.4}$$

where $K$ is the number of indicators for the construct in question. A common rule of thumb is that AVE should be at least 0.5. Collectively, all measures of a given construct combined should be strongly related to the underlying construct. One common index is composite reliability (CR), which is defined as the squared correlation between an unweighted sum (or average) of the measures of a construct and the construct itself. CR is a generalization of coefficient alpha to a situation in which items can have different loadings on the underlying factor and it can be computed as follows:

$$CR_{\sum x_i} = \frac{\left(\sum_i \lambda_{ij}\right)^2 \varphi_{jj}}{\left(\sum_i \lambda_{ij}\right)^2 \varphi_{jj} + \sum_i \theta_{ii}} \tag{9.5}$$

CR should be at least 0.7 and preferably higher.

Second, items should be primarily related to their underlying construct and not to other constructs. In a congeneric model, loadings on non-target factors are set to zero a priori, but the researcher has to evaluate whether this assumption is justified by looking at the relevant modification indices and expected parameter changes. This criterion can be thought of as an assessment of discriminant validity at the item level.

Third, the constructs themselves should not be too highly correlated if they are to be distinct. This is called discriminant validity at the construct level. One way to test discriminant validity is to construct a confidence interval around each construct correlation in the $\Phi$ matrix (the covariances are correlations if the variances on the diagonal have been standardized to one) and to check whether the confidence interval includes one (in which case a perfect correlation cannot be dismissed). However, this is a weak criterion of discriminant validity because with a large sample and precise estimates of the factor correlations, the factor correlations will usually be distinct from one, even if the correlations are quite high.
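The reliability indices in Eqs. (9.3) to (9.5) are simple functions of the estimated loadings, factor variance, and unique variances. The following sketch computes them for one construct; the standardized estimates are hypothetical and the function names are chosen only for this illustration.

```python
def item_reliability(lam, phi, theta):
    """Eq. (9.3): share of an item's variance that is substantive variance."""
    return lam ** 2 * phi / (lam ** 2 * phi + theta)

def average_variance_extracted(lams, phi, thetas):
    """Eq. (9.4): mean of the individual-item reliabilities."""
    iirs = [item_reliability(l, phi, t) for l, t in zip(lams, thetas)]
    return sum(iirs) / len(iirs)

def composite_reliability(lams, phi, thetas):
    """Eq. (9.5): reliability of the unweighted sum of the indicators."""
    substantive = sum(lams) ** 2 * phi
    return substantive / (substantive + sum(thetas))

# Hypothetical standardized solution for one three-item construct
# (factor variance fixed at 1, so each theta equals 1 - lambda^2).
lams = [0.8, 0.7, 0.6]
phi = 1.0
thetas = [0.36, 0.51, 0.64]

iirs = [item_reliability(l, phi, t) for l, t in zip(lams, thetas)]
ave = average_variance_extracted(lams, phi, thetas)
cr = composite_reliability(lams, phi, thetas)

print([round(v, 2) for v in iirs])
print(round(ave, 3), round(cr, 3))
```

In this made-up case the AVE falls just short of 0.5 while the CR exceeds 0.7, which shows that the two rules of thumb need not agree: CR rewards the pooling of several moderately reliable items, whereas AVE averages the item-level reliabilities.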
A stronger test of discriminant validity is the criterion proposed by Fornell and Larcker (1981). This criterion says that each squared factor correlation should be smaller than the AVE for the two constructs involved in the correlation. Intuitively, this rule means that a construct should be more strongly related to its own indicators than to another construct from which it is supposedly distinct. It is easy to test alternative assumptions about how the measures of a given latent variable relate to the underlying latent construct. In the congeneric model, the observed measures of a construct have a single latent variable in common, but both the factor loadings and unique factor variances are allowed to differ across the indicators. In an essentially tau-equivalent measurement model, all the factor

loadings for a given construct are specified to be the same (Traub 1994). This means that the scale metrics of the observed variables are identical. In a parallel measurement model, the unique factor variances are also specified to be the same. This means that the observed variables are fully exchangeable. A $\chi^2$ difference test can be used to test the relative fit of alternative models. If, say, the model positing equality of factor loadings does not show a significant deterioration in fit relative to a model in which the factor loadings are freely estimated, the hypothesis of tau-equivalence is consistent with the empirical evidence.

In measurement analyses the focus is generally on the interrelationships of the observed variables. However, it is possible to incorporate the means into the model. When data are available for a single group only, little additional information is gained by estimating means. If the intercepts of all the observed variables are freely estimated, the means of the latent constructs have to be restricted to zero in order to identify the model and the estimated intercepts are simply the observed means. Alternatively, if the latent means are to be estimated, one intercept per factor has to be restricted to zero. If this is done for the indicator whose loading on the underlying factor is set to one (the so-called marker variable or reference indicator), the latent factor mean is simply the observed mean of the reference indicator. One can also test more specific hypotheses about the means. For example, if it is hypothesized that the observed measures of a given latent construct all have the same relationship to the underlying latent variable, one could test whether the measurement intercepts are all the same, which implies that the means of the construct indicators are identical.
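The $\chi^2$ difference test for nested measurement models mentioned above can be sketched as follows. The chi-square values are hypothetical, the helper names are invented for this example, and the tail probability is computed with a closed-form expression that is valid for integer degrees of freedom (in practice a statistics library would normally be used).

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        return math.exp(-y) * sum(y ** j / math.factorial(j)
                                  for j in range(df // 2))
    # Odd df: start from the df = 1 tail and add the remaining terms.
    extra = sum(y ** (j - 0.5) / math.gamma(j + 0.5)
                for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * extra

def chi2_difference_test(chi2_restricted, df_restricted, chi2_free, df_free):
    """Compare a restricted model (e.g., equal loadings) with a freer one."""
    d_chi2 = chi2_restricted - chi2_free
    d_df = df_restricted - df_free
    return d_chi2, d_df, chi2_sf(d_chi2, d_df)

# Hypothetical example: a two-factor congeneric model (8 df) versus an
# essentially tau-equivalent model in which the 4 free loadings are
# constrained to equal the marker loadings (12 df).
d_chi2, d_df, p_value = chi2_difference_test(46.0, 12, 40.0, 8)
print(d_chi2, d_df, round(p_value, 3))
```

In this invented example the difference of 6 on 4 degrees of freedom is not significant at the 0.05 level, so the equality restrictions on the loadings would be regarded as consistent with the data.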
Of course, one can also compare the means of observed variables across constructs, although this is not very meaningful unless the scale metrics are comparable.

9.2.2 Empirical Example

Congeneric measurement models are very common in marketing research. For example, Walsh and Beatty (2007) identify dimensions of customer-based corporate reputation and develop scales to measure these dimensions (Customer Orientation, Good Employer, Reliable and Financially Strong Company, Product and Service Quality, and Social and Environmental Responsibility). An important advantage of confirmatory factor analysis is that hierarchical factor models can be specified, where a second-order factor has first-order factors as its indicators (more than two levels are possible, but two levels are most common). For instance, Yi and Gong (2013) develop and validate a scale for customer value co-creation behavior. The scale comprises two second-order factors (customer participation behavior and customer citizenship behavior), each of which consists of four first-order factors:

information seeking, information sharing, responsible behavior, and personal interaction for customer participation behavior; and feedback, advocacy, helping, and tolerance for customer citizenship behavior.

To illustrate the concepts discussed so far, we present an empirical example using data from N = 740 Belgian consumers, all with primary responsibility for purchases in their household, who responded to a health consciousness scale and an environmental concern scale. The scale items were adapted from Chen (2009) and translated into Dutch. For now, we will only use the first three items from each scale (see Table 9.2). Later, we will analyze the complete scales, including the reversed items. In Mplus 7.3, we specify a congeneric two-factor model where each factor has three reflective indicators. The $\chi^2$ test is significant ($\chi^2(8)$ = , p = ), but the indices of local misfit do not indicate a particular misspecification (the modification indices point toward some negligible cross-loadings and residual correlations, although the modification indices are all smaller than 10).

Table 9.2 Empirical illustration: a congeneric factor model for environmental and health concern

| Construct | Item | Wording | Standardized loading | IIR | AVE | CR |
|---|---|---|---|---|---|---|
| Environmental concern | item 1 | I would describe myself as environmentally conscious | | | | |
| | item 2 | I take into account the environmental impact in many of my decisions | | | | |
| | item 3 | My buying habits are influenced by my environmental concern | | | | |
| Health concern | item 1 | I consider myself as health conscious | | | | |
| | item 2 | I think that I take health into account a lot in my life | | | | |
| | item 3 | My health is so valuable to me that I am prepared to sacrifice many things for it | | | | |

Note: IIR = individual-item reliability; AVE = average variance extracted; CR = composite reliability.
Based on the alternative fit indices, the model shows acceptable fit to the data: RMSEA = (with a 90% confidence interval [CI] ranging from to 0.089), SRMR = 0.031, CFI = 0.986, and TLI = . Table 9.2 reports the IIRs for all items, and the AVE and CR for both factors. One of the health concern items shows an unsatisfactory IIR (below 0.50), probably because it is worded more extremely and somewhat more verbosely than the other two health concern items. Nevertheless, the AVE for both factors is above 0.50 and the CR for both factors is

above 0.70, which indicates acceptable reliability. The correlation between health concern and environmental concern is 0.38 (95% CI from 0.31 to 0.46). Since the shared variance of 0.15 is smaller than the AVE of either construct, the factors show discriminant validity.

9.3 Multi-sample Congeneric Measurement Models with Mean Structures

9.3.1 Conceptual Development

One advantage of using structural equation modeling techniques for measurement analysis is that they enable sophisticated assessments of the measurement properties of scales across different populations of respondents. This is particularly useful in a cross-cultural context, where researchers are often interested in either assessing the invariance of findings across countries or establishing nomological differences between countries. If cross-cultural comparisons are to be meaningful, it is first necessary to ascertain that the constructs and measures are comparable.

A congeneric measurement model containing a mean structure for group $g$ can be specified as follows:

$$x^g = \tau^g + \Lambda^g \xi^g + \delta^g \tag{9.6}$$

where $\tau$ is an $I \times 1$ vector of equation intercepts, the other terms were defined earlier, and the superscript $g$ refers to group $g$. Under the assumptions mentioned earlier (although $x$ and $\xi$ are not mean-centered in the present case), the corresponding mean and covariance structures are:

$$\mu^g = \tau^g + \Lambda^g \kappa^g \tag{9.7}$$

$$\Sigma^g = \Lambda^g \Phi^g \Lambda^{g\prime} + \Theta^g \tag{9.8}$$

where $\mu$ is the expected value of $x$ and $\kappa$ is the expected value of $\xi$ (i.e., the vector of latent means of the constructs). To identify the covariance part, one loading per factor should be set to one (the corresponding indicator is called the marker variable or reference indicator); the factor variances should not be standardized at one since this would impose the assumption that the factor variances are equal across groups, which is not required and which need not be the case.
To identify the means part, the intercepts of the marker variables have to be set to zero, in which case all the latent means can be freely estimated, or one latent mean (the latent mean of the reference group) has to be set to zero and the intercepts of the marker variables are specified to be invariant across groups. In the latter case, the latent means in the remaining groups express the difference in latent means between the reference group and the other groups.
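As a minimal sketch of the two ways of identifying the means part, the following Python snippet evaluates Eq. (9.7) for a single-factor model under both identification choices. The helper name `implied_means` and all numerical values are hypothetical and chosen only for illustration. Under these assumptions, the two parameterizations imply the same indicator means, and the latent mean of the second group in the second parameterization equals its difference from the reference group.

```python
import math

def implied_means(tau, lam, kappa):
    """Eq. (9.7) for one factor: mu_i = tau_i + lambda_i * kappa."""
    return [t + l * kappa for t, l in zip(tau, lam)]

# Hypothetical loadings (marker loading fixed at one), invariant across groups.
lam = [1.0, 0.9, 1.1, 0.8]

# Option 1: marker intercept fixed at zero, latent means free in both groups.
tau_1 = [0.0, 0.5, -0.3, 1.0]
kappa_1 = {"reference": 5.0, "other": 5.4}

# Option 2: latent mean of the reference group fixed at zero,
# intercepts (including the marker's) held invariant across groups.
tau_2 = [t + l * kappa_1["reference"] for t, l in zip(tau_1, lam)]
kappa_2 = {g: k - kappa_1["reference"] for g, k in kappa_1.items()}

for group in kappa_1:
    mu_1 = implied_means(tau_1, lam, kappa_1[group])
    mu_2 = implied_means(tau_2, lam, kappa_2[group])
    # Both identifications reproduce the same indicator means.
    assert all(math.isclose(a, b) for a, b in zip(mu_1, mu_2))
    print(group, [round(m, 2) for m in mu_1])

print("latent mean difference:", round(kappa_2["other"], 2))
```

The choice between the two options is therefore a matter of convenience: the second makes the latent mean difference directly readable from the estimates.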

In the model of Eqs. (9.7) and (9.8), five different types of parameters can be tested for invariance. Two of these are of substantive interest ($\kappa^g$, $\Phi^g$); the remaining ones are measurement parameters ($\tau^g$, $\Lambda^g$, $\Theta^g$). In order for the comparisons of substantive interest to be meaningful, certain conditions of measurement invariance have to be satisfied. To begin with, the same congeneric factor model has to hold in each of the groups; this is called configural invariance, and it is a minimum condition of comparability (e.g., if a construct is unidimensional in one group and multi-dimensional in another, meaningful comparisons are difficult if not impossible). Usually, more specific comparisons are of interest and in this case more stringent forms of invariance have to be satisfied. Specifically, Steenkamp and Baumgartner (1998) show that if relationships between constructs are to be compared across groups, metric invariance (equality of factor loadings) has to hold, and if latent construct means are to be compared across groups, scalar invariance (invariance of measurement intercepts) has to hold as well. For example, consider the relationship between the mean of variable $x_i$ and the mean on the underlying construct $\xi_j$, that is, $\mu_i^g = \tau_i^g + \lambda_{ij}^g \kappa_j^g$. The goal is to compare $\kappa_j^g$ across groups based on $x_i^g$. Unfortunately, the comparison on $\mu_i^g$ depends on $\tau_i^g$, $\lambda_{ij}^g$, and $\kappa_j^g$. Inferences about the latent means will only be unambiguous if $\tau_i^g$ and $\lambda_{ij}^g$ are the same across groups. For certain purposes, one may also want to test for the invariance of unique factor variances across groups, but usually this comparison is less relevant. The hypothesis of full metric invariance can be tested by comparing the model with invariant loadings across groups to the model in which the loadings are freely estimated in each group.
If the deterioration in fit is nonsignificant, metric invariance is satisfied. Similarly, the hypothesis of full scalar invariance can be tested by comparing the model with invariant loadings and intercepts to the model with invariant loadings. Metric invariance should be established before scalar invariance is assessed. In practice, full metric and scalar invariance are frequently violated (especially the latter). The question then arises whether partial measurement invariance is sufficient to conduct meaningful across-group comparisons. Note that in the specification of the model in Eqs. (9.7) and (9.8), one variable per factor was already assumed to have invariant loadings and intercepts (because one loading per factor was set to one and the corresponding intercept was fixed at zero). However, these restrictions are necessary to identify the model and do not impose binding constraints on the model (this can be seen from the fact that regardless of which variable is chosen as the marker variable, the fit of the model will always be the same). In order to be able to test (partial) metric or scalar invariance, at least two items per factor have to have invariant loadings or intercepts (see Steenkamp and Baumgartner 1998). This is a minimum requirement; ideally, more indicators per factor will display metric and scalar invariance. Asparouhov and Muthén (2014) have recently proposed a new procedure called the alignment method, in which these strict requirements of measurement invariance are relaxed, but their method is beyond the scope of this chapter.
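The χ² difference test used for these nested comparisons can be sketched in a few lines of Python. The functions `chi2_sf` and `chi2_difference_test` are illustrative helpers, not part of any SEM package, and the fit statistics passed to `chi2_difference_test` below are hypothetical placeholders rather than results from this chapter. The sketch assumes ordinary maximum likelihood χ² statistics; scaled (robust) statistics require a corrected difference test.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    t = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)             # Q(1, t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))  # Q(1/2, t)
    # Recurrence Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / Gamma(a + 1).
    while a < df / 2.0:
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi2_difference_test(chi2_restricted, df_restricted, chi2_free, df_free,
                         alpha=0.05):
    """Compare a restricted (invariant) model with the freer model it is nested in."""
    d_chi2 = chi2_restricted - chi2_free
    d_df = df_restricted - df_free
    if d_df <= 0:
        raise ValueError("the restricted model must have more degrees of freedom")
    p = chi2_sf(max(d_chi2, 0.0), d_df)
    # Invariance is retained when the deterioration in fit is nonsignificant.
    return d_chi2, d_df, p, p >= alpha

# Hypothetical fit statistics for a metric versus a configural model.
d_chi2, d_df, p, retained = chi2_difference_test(30.0, 7, 25.0, 4)
print(f"delta chi2({d_df}) = {d_chi2:.2f}, p = {p:.3f}, invariance retained: {retained}")

# Check against the latent mean comparison reported in the empirical example.
print(round(chi2_sf(0.313, 1), 3))
```

As a check on the helper, `chi2_sf(0.313, 1)` reproduces, to three decimals, the p-value of 0.576 that the empirical example reports for the test of latent mean equality.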

Although the tests of (partial) metric and scalar invariance described previously are essential, one word of caution is necessary. These tests assume that any biases in $\tau$ and $\Lambda$ that may distort comparisons across groups are nonuniform across items. If the bias is the same across items (e.g., $\tau$ is biased upward or downward by the same amount across items), the researcher may wrongly conclude that measurement invariance is satisfied and mistakenly attribute a difference in intercepts to a difference in latent means (Little 2000).

Empirical Example

Multi-sample congeneric measurement models (with or without mean structures) are commonly applied in cross-national marketing research. Some examples are the following. Strizhakova et al. (2008) compare branded product meanings (quality, values, personal identity, and traditions) across four countries based on newly developed measures for which they demonstrate cross-national measurement invariance. Their results show that identity-related and traditions-related meanings are more important in the U.S. than in three emerging markets (Romania, Ukraine, and Russia). Singh et al. (2007) test models involving moral philosophies, moral intensity, and ethical decision making across two samples of marketing practitioners from the United States and China. Their measurement models show partial metric and scalar invariance. In a similar way, using multi-group confirmatory factor analysis, Schertzer et al. (2008) establish configural, metric and partial scalar invariance for a gender identity scale across samples from the U.S., Mexico, and Norway.

To further illustrate measurement invariance testing, we analyze data from an online panel of respondents in two countries, Slovakia (N = 1063) and Romania (N = 970), using four bipolar items to measure attitude toward the brand Coca-Cola.
The four items (unpleasant-pleasant, negative-positive, unattractive-attractive, and low quality-high quality) were translated (and back-translated) by professional translators into respondents' native languages. Respondents provided their ratings on seven-point scales. The samples from the two countries were comparable in sociodemographic makeup. The four items are modeled as reflective indicators of one latent factor, using a two-group congeneric model with a mean structure in Mplus 7.3. We test a sequence of nested models, gradually imposing constraints that reflect configural, metric, and scalar invariance. The fit indices are reported in Table 9.3.

Table 9.3 Model fit indices for measurement invariance tests of brand attitude in two countries
Columns: absolute fit (χ², df, p); difference test (reference model, Δχ², df, p); alternative fit indices (RMSEA with 90% CI, CFI, TLI, SRMR, BIC)
A. Configural invariance: absolute-fit p < ; RMSEA 90% CI (0.058, 0.111)
B. Metric invariance: absolute-fit p < 0.001; reference model A; RMSEA 90% CI (0.043, 0.084)
C. Scalar invariance: absolute-fit p < 0.001; reference model B; RMSEA 90% CI (0.048, 0.083)
D. Partial scalar invariance: absolute-fit p < 0.001; reference model B; RMSEA 90% CI (0.038, 0.075)
Note: See Table 9.1 for an explanation of these fit indices.

In the configural model (Model A), the same congeneric model is estimated in the two groups, but the loadings and intercepts are estimated freely in each group. The exception is the marker item, which is specified to have a loading of one and an intercept of zero in both groups. The model shows acceptable fit to the data. With sample sizes around 1000, the χ² test is sensitive to even minor misspecifications, and the modification indices do not indicate any specific misspecification that is serious. The RMSEA will move toward more acceptable levels when more constraints are added, because this fit index imposes a substantial penalty for the number of freely estimated parameters.

Model B specifies metric invariance by restricting the factor loadings of all items (not only the marker item) to equality across the two groups (since there are three non-marker items, metric invariance is tested based on a χ² difference test with three degrees of freedom). In support of metric invariance, the χ² difference test is nonsignificant. Moreover, the alternative fit indices show acceptable fit, and the RMSEA and BIC values (which impose a penalty for estimating many free parameters) even show a clear improvement in fit.

Model C imposes scalar invariance by additionally restricting all item intercepts to be equal across groups (scalar invariance is tested with a χ² difference test with three degrees of freedom as well). The evidence for scalar invariance is somewhat mixed. The BIC improves, while the RMSEA and the other alternative fit indices remain almost stable. However, the χ² difference test is statistically significant (p < 0.001), so the hypothesis of scalar invariance is rejected. Closer inspection of the modification indices (focusing on MIs > 10, given the large sample size) indicates that the intercept for item 4 is non-invariant.

We therefore estimate an additional model, Model D, to test for partial scalar invariance, in which scalar invariance for item 4 is relaxed (i.e., the intercepts for all items except item 4 are set to equality across groups), and test the deterioration in fit relative to the metric invariance model (Model D is nested in Model B). The χ² difference test is statistically nonsignificant, in support of partial scalar invariance. Since there are still three items that are scalar invariant, the latent means can now be compared across the two groups.
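The degrees of freedom in this sequence can be checked by counting sample moments and free parameters. The sketch below does this for a single-factor model in which the marker item's loading is fixed at one and its intercept at zero in every group, as in the example. The function `model_df` and its arguments are hypothetical names introduced for illustration, and the two degrees of freedom for comparing Model D with Model B are derived from this count rather than taken from Table 9.3.

```python
def model_df(n_items, n_groups, equal_loadings=False,
             equal_intercepts=False, freed_intercepts=0):
    """Degrees of freedom of a multi-group single-factor model with mean structure.

    Assumes the marker loading is fixed at one and the marker intercept at zero
    in every group, with factor variances and latent means free in every group.
    """
    # Sample moments per group: means plus unique variances and covariances.
    moments = n_groups * (n_items + n_items * (n_items + 1) // 2)
    non_marker = n_items - 1
    loadings = non_marker if equal_loadings else non_marker * n_groups
    if equal_intercepts:
        # Invariant intercepts are counted once, freed ones once per group.
        intercepts = (non_marker - freed_intercepts) + freed_intercepts * n_groups
    else:
        intercepts = non_marker * n_groups
    # Per group: one factor variance, one latent mean, one unique variance per item.
    other = n_groups * (1 + 1 + n_items)
    return moments - (loadings + intercepts + other)

df_a = model_df(4, 2)                                             # configural
df_b = model_df(4, 2, equal_loadings=True)                        # metric
df_c = model_df(4, 2, equal_loadings=True, equal_intercepts=True) # scalar
df_d = model_df(4, 2, equal_loadings=True, equal_intercepts=True,
                freed_intercepts=1)                               # partial scalar

print("B vs A:", df_b - df_a)
print("C vs B:", df_c - df_b)
print("D vs B:", df_d - df_b)
```

The first two differences agree with the three degrees of freedom stated in the text for the metric and scalar invariance tests.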
The means are (standard error = 0.047) for the Slovakian sample and (standard error = 0.061) for the Romanian sample. A χ² difference test for latent mean equality shows that the difference is nonsignificant (Δχ²(1) = 0.313, p = 0.576).

9.4 The Formative Measurement Model

Conceptual Development

Sometimes it is not meaningful to assume that an observed measure is a reflection of the operation of an underlying latent variable. For example, assume that job satisfaction is measured with items assessing satisfaction with various aspects of the job, such as satisfaction with one's supervisor, co-workers, pay, etc. Satisfaction with each facet of the job is presumably a contributing factor to overall job satisfaction, not a reflection of it. Jarvis et al. (2003) reviewed the measurement of 1,192 constructs in 178 articles published in four leading marketing journals and found that 29% of constructs were modeled incorrectly; the vast majority of measurement model misspecifications were due to formative indicators being modeled as reflective (see also MacKenzie et al. 2005). This practice is problematic because simulations have demonstrated that if the measurement model is misspecified, this will bias estimates of structural paths (see Diamantopoulos et al. for a review of the evidence).

In a formative measurement model, the direction of causality goes from the indicator to the construct, so the observed measures are also called cause indicators (rather than effect indicators). This reversal of causality has several implications.

First, error does not reside in the indicators, but in the construct. Since the variance of the construct is a function of the variances and covariances of the formative indicators, plus error, the construct is not a traditional latent variable and should be more accurately referred to as a composite variable (MacCallum and Browne 1993).

Second, in general the variance of the error term of a construct that is a function of its indicators is not identified. In order for the model to be identified, directed paths have to go from the formative construct to at least two other variables or constructs. Often, two global reflective indicators are used for this purpose (MacKenzie et al. 2011), but in our experience these reflective indicators are usually not very sophisticated and well-developed measures of the underlying construct. Furthermore, this type of model is empirically indistinguishable from a model in which a reflectively measured construct is related to various antecedents, so the question arises whether the formative measurement model is a measurement model at all.

Third, formative measures need not be positively correlated (e.g., dissatisfaction with one's supervisor does not mean that one is also dissatisfied with one's co-workers), so that conventional convergent validity and reliability assessment based on internal consistency is not applicable.
Instead, formative measures should have a significant effect on the construct, and collectively the formative measures should account for a large portion of the construct's variance (e.g., at least 50%). Unfortunately, formative measures are frequently quite highly correlated, which leads to multicollinearity problems and the likely non-significance of some of the relationships between the indicators and the construct. This then raises the thorny question of whether an item that may be conceptually important but happens to be empirically superfluous should be retained in the model. An additional difficulty is that because of the correlations among the formative indicators, no firm conclusions about the measurement quality of individual indicators can be drawn.

Fourth, formative models assume that the formative measures are error-free contributors to the formative construct, which seems unrealistic. To circumvent this problem, multiple reflective measures can be used to correct for measurement error, which makes the formative measures first-order factors and the formative construct a second-order construct. Unfortunately, such models are quite complex.

In sum, while it is certainly true that formative measures should not be specified as reflective indicators, formative measurement faces a range of formidable difficulties, and several authors have recommended that formative measurement be abandoned altogether (Edwards 2011). Formative measurement is sometimes equated with the partial least squares (PLS) approach, which has also seen increased criticism in recent years (McIntosh et al. 2014), but formative models can be estimated using traditional structural equation modeling techniques, as shown below. Still, the meaningfulness of formative measurement is a topic of active debate and it remains to be seen whether better alternatives can be formulated.

Empirical Example

The marketing research literature offers several recent examples of formative measurement models. Coltman et al. (2008) propose that market orientation viewed from a behavioral perspective (where market orientation is the result of allocating resources to a set of specific activities) can be conceptualized as a formative construct with a reactive and a proactive component. Dagger et al. (2007) develop a multidimensional hierarchical scale for measuring health service quality and validate it in three field studies. In their formative model, nine subdimensions (interaction, relationship, outcome, expertise, atmosphere, tangibles, timeliness, operation, and support) drive four primary dimensions (interpersonal quality, technical quality, environment quality, and administrative quality), which in turn drive service quality perceptions.

To further illustrate the formative measurement model, we will use data from 497 respondents who indicated their attitude toward self-service technologies (SSTs), that is, self-scanning in grocery stores. Specifically, respondents rated the perceived usefulness, perceived ease of use, reliability, perceived fun, and newness of the technology on three items each (e.g., "Self-scanning will allow me to shop faster" for perceived usefulness, and "Self-scanning will be enjoyable" for fun, rated on 5-point agree-disagree scales). The items within each of the five factors were averaged, and the five averages will be used as formative indicators of attitude toward SST.
Three overall measures of attitude are also available (i.e., "How would you describe your feelings toward using self-scanning technology in this store?", rated on 5-point favorable-unfavorable, I like it-I dislike it, and good-bad scales), which will be used as reflective indicators to identify the model. The model is shown graphically in Fig. 9.2.

Fig. 9.2 Formative measurement model for attitude towards self-scanning

The estimated model fit the data well: χ²(10) = , p = 0.03; SRMR = 0.01; RMSEA = (with a 90% CI ranging from to 0.073); CFI = 0.994; TLI = . There were no significant modification indices exceeding . Perceived usefulness (estimate of 0.294, 95% CI = to 0.376), ease of use (0.374, 95% CI = to 0.469), reliability (0.128, 95% CI = to 0.227) and fun (0.237, 95% CI = to 0.310) all had a significant effect on attitude toward SSTs, but the effect of newness (0.047, 95% CI = to 0.149) was nonsignificant. The explained variance in the three reflective attitude measures was 0.81, 0.90, and 0.81, respectively, and the five formative measures accounted for 51% of the variance in the formative construct. By conventional standards, these results indicate an acceptable measurement model.

It is possible to take into account measurement error in the five formative measures by specifying a first-order reflective measurement model for them. The resulting model fits the data somewhat less well, but the fit is still adequate: χ²(120) = , p = 0.000; SRMR = 0.037; RMSEA = (with a 90% CI ranging from to 0.057); CFI = 0.976; TLI = . Only perceived usefulness, perceived ease of use, and fun are significant determinants of attitude toward SSTs; reliability is borderline significant (z = 1.87). Together, the five formative measures account for 55% of the variance in the construct. The results for the two models are similar, but taking into account measurement error does change the findings somewhat. The major advantage of the second approach is that a more explicit measurement analysis of the independent variables or formative indicators is possible.

9.5 Extension 1: Relaxing the Assumption of Zero Non-target Loadings

Conceptual Development

The congeneric measurement model assumes that the loadings of indicators on factors other than the target factor are zero (which is sometimes referred to as an independent cluster confirmatory factor analysis).
This is a strong assumption and even mild violations of the assumption of zero cross-loadings may decrease the overall fit of a model substantially, especially when the sample size is reasonably large. Furthermore, forcing zero cross-loadings when they are in fact non-zero may have other undesirable effects, such as inflated factor correlations and misleading evidence about (lack of) discriminant validity. One approach to relaxing the assumption of zero cross-loadings is exploratory structural equation modeling (ESEM) (Marsh et al. 2014). ESEM basically replaces the confirmatory factor model used in traditional SEM with an exploratory factor model (or a combination of exploratory and confirmatory factor models), although the exploratory factor analysis is used in a more confirmatory fashion because the researcher usually posits a certain number of factors and expects a certain pattern of factor loadings. In early applications of this approach geomin rotation was used to
