DIFFERENTIAL EFFECTS OF OMITTING FORMATIVE INDICATORS: A COMPARISON OF TECHNIQUES


Completed Research Paper

Miguel I. Aguirre-Urreta
DePaul University
1 E. Jackson Blvd., Chicago, IL, USA
maguirr6@depaul.edu

George M. Marakas
Florida International University
S.W. 8th Street RB250, Miami, FL, USA
gmarakas@fiu.edu

Abstract

Research examining the formative specification of constructs has highlighted the need for researchers to capture all relevant causes of a construct of interest. However, the consequences of omitting a formative indicator have not been thoroughly examined. Given that one of the commonly employed techniques for modeling formatively specified constructs, Partial Least Squares, implicitly assumes that all relevant causes of a construct have been modeled, the consequences of omitting one of those are of prime importance. In this research we compare latent variable and PLS techniques on this issue based on theoretical arguments and results from Monte Carlo simulations. In particular, we focus on the presence or absence of estimation bias in the relationships between formative indicators and the formatively specified construct, and between the latter and other constructs in the research model. Our results highlight differences in how these two techniques cope with the omission of formative indicators, and discuss why those differences occur.

Keywords: partial least squares, formative specification, latent variables, simulation, omitted variable bias

Introduction

A significant amount of attention has been given to issues of instrument validity in the past, including content, construct, internal and statistical conclusion validity, as well as reliability (Straub 1989). Only recently, however, have researchers begun to focus on the underlying relationship between constructs and their empirical indicators, prompted by the seminal work of Diamantopoulos and Winklhofer (2001), although much of the theoretical development dates from earlier (Bollen 1984; Bollen and Lennox 1991; Cohen et al. 1990; Curtis and Jackson 1962). This relationship can be either reflective or formative. While the former is quite well understood, recent efforts have been made to better understand the alternative, formative specification, and its implications for research. Exemplars of this work include Jarvis, Mackenzie and Podsakoff (2003), Mackenzie, Podsakoff and Jarvis (2005), and, recently in the information systems literature, Petter, Straub and Rai (2007) and Marakas, Johnson and Clay (2007).

While most of the focus in the theory building and testing process is generally placed on the substantive relationships between constructs of interest, more than forty years ago Costner (1969) noted the need to include what he termed auxiliary theories, those relating abstract dimensions and their empirical indicators, as an integral part of scientific theories. In addition, he argued that these should be treated as any other theoretical proposition. Empirical testing of auxiliary theories, then, would serve to tentatively establish the adequacy of particular sets of indicators for testing the implications of their respective abstract formulations. In more modern terms, researchers should establish whether validity is adequate, although the original emphasis was solely on the issue of measurement error and its implications for theory testing. Having found that the indicators are adequate for this purpose, only then should researchers attempt to ascertain whether the relationships between constructs are themselves tenable. This logic is consistent with the work of Straub (1989) on instrument validity.

To better establish the extent to which formatively specified constructs are actively being employed in mainstream IS research, as well as which statistical techniques are employed in the estimation of research models involving them, journal issues of MIS Quarterly, Information Systems Research, the Journal of Management Information Systems, and the Journal of the Association for Information Systems for the period January 1998 through December 2011 were examined [1]. Forty-six empirical studies that included one or more formatively specified first-order constructs were found. Although the research included in this review represents a relatively small fraction of all research published in these outlets during this period, a large proportion of these studies (38 out of 46, or 83%) have appeared since 2006 alone, which we believe reflects researchers' newfound interest in this topic.
Examples of formatively specified constructs include Virtual Copresence (a subjective feeling of being together with others in a virtual environment; sample items: "I find that people respond to my posts quickly," "I am usually aware of who are logged on online") (Ma and Agarwal 2007) and Technology Interaction (IT interactions undertaken with the purpose of accomplishing an individual or organizational task; sample items: "I use this system (or application) to solve various problems," "I use this system (or application) to justify my decisions") (Barki et al. 2007). Defining for what reasons and under which circumstances researchers would want to specify the relationship between constructs and their indicators as formative or reflective is beyond the scope of this work, and others (Jarvis et al. 2003; MacKenzie et al. 2005; Petter et al. 2007) have provided quite extensive treatments of the issue. That said, it must be understood that once a decision has been made to specify a construct as formative, the researcher must choose between one of two families of statistical techniques, component- or covariance-based SEM, for the subsequent data analysis and model estimation.

As expected, PLS was the most popular procedure used for this purpose in the reviewed research (37 out of 46 studies, or 80%), with only five studies using a latent variable technique (LISREL or AMOS) to analyze the research model, and another four studies employing OLS regression. Further examination of these studies indicates that IS researchers testing models which include formatively specified constructs operate largely on two main assumptions: (a) covariance-based techniques (of which LISREL is an example) cannot handle models postulating first-order formative relationships (Choudhury & Karahanna, 2008; Liang, Saraf, Hu, & Xue, 2007; Limayem, Hirt, & Cheung, 2007; Ma & Agarwal, 2007), and (b) PLS is a viable alternative for doing so (Chin, Marcolin, & Newsted, 2003; Gefen et al., 2000; Petter et al., 2007).

[1] This review of extant research was confined only to first-order formative constructs, which are the main focus of interest in this paper. The list of articles is not included here due to space limitations but is available from the first author upon request.

However, despite its widespread use in this regard, an extensive review of the literature on these methods and associated simulations was unable to uncover any in-depth examination of the ability of either alternative to analyze research models including formatively specified constructs when some of those formative indicators are omitted, what the effects of these omissions are on other parameters in the model, and whether those effects vary by technique. In general, most literature on the specification of formative constructs has been limited to latent variable techniques, such as LISREL, whereas PLS remains underexamined in this area, despite being the most popular alternative for modeling these scenarios. Given that most research in this area highlights the need for researchers to specify all causes of a formatively specified construct (e.g., Bollen and Lennox, 1991), and that PLS implicitly assumes all such causes have been included in the model by virtue of not including a residual disturbance term that captures omitted causes, the consequences of violating this assumption seem worthy of detailed examination.

In the rest of this article we first discuss the specific research models employed here to ground our discussion of these issues. Then, the specification and analysis of models containing a formatively specified construct are examined from the perspective of latent variable techniques. Problems resulting from the omission of a relevant formative indicator are discussed conceptually and then validated with Monte Carlo simulations. Next, we examine the same issues as they relate to PLS analyses, and further extend the concept of reliability as shared variance between a composite and the latent variable it represents to include components representing formatively specified constructs. We conclude our work with a comparison of our results for each technique, and some of the limitations of this research.

Research Model and Simulation Parameters

Throughout this research we employ a set of models to ground our discussion of the many issues associated with formatively specified constructs, omitted indicators, and the choice of statistical analysis technique. These models also provide the population parameters used in our simulations. Given that these apply to both techniques under examination here, it is worth briefly reviewing them at this time. The basic structure of the model is shown in Figure 1.

Figure 1. Population Model

This model has been adapted from previous work investigating related issues by Jarvis, Mackenzie and Podsakoff (2003); see also Petter, Straub and Rai (2007) and Aguirre-Urreta and Marakas (2012). Population covariance matrices for all our models are available from the first author upon request. All values shown in Figure 1 for the parameters of interest are expressed in standardized metric. The model shown in Figure 1 has been modified from earlier uses to allow us to examine the effects of omitted formative indicators of varying importance. In particular, modifications have been made to ensure that the variance of the formatively specified latent variable remains constant across all three scenarios (i.e., Models A, B, and C), whereas the paths from the formative indicators to the latent variable vary to accomplish this, given that each model features different correlation strengths amongst these indicators. In all three cases the path coefficients from the formative indicators have been set to represent varying degrees of relative importance, as follows: the path from x1 is four times as large as the path from x4, that from x2 three times as large, and that from x3 two times as large. In all three models the residual variance of the formatively specified variable has been set at five percent of the total variance. We take this residual variance to represent the random shocks (minor, unstable influences on a variable) discussed by James, Mulaik and Brett (1983) (see our discussion below when dealing with identification issues). All variables in the model, both latent and manifest, follow a multivariate normal distribution.

Data were simulated and subject to analysis with both of the statistical techniques under examination. All data generation was performed with EQS 6.1 (Bentler and Wu 1995). Statistical analyses within the latent variable framework were also conducted with EQS 6.1, and those for PLS with PLS-Graph 3.0. All analyses were conducted on standardized estimates, which are directly comparable across techniques and with those included in the models shown in Figure 1. Aside from the omission of a specific formative indicator, all other aspects of the models employed in the analyses were correctly specified. Data were generated for each different model (Models A, B, and C, varying in the strength of correlation amongst formative indicators) for N = 200, N = 350 and N = 500, with one thousand replications in each condition. Within each one of these scenarios, the datasets were alternatively analyzed including all formative indicators, or missing one formative indicator at a time. Simulated data for all conditions are available from the first author upon request.
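To make this setup concrete, the sketch below builds a set of formative paths in the 4:3:2:1 ratio described above, scales them so that the latent variable has unit variance with a 5% residual, and generates one replication of raw data. It is only an illustration under assumed values: the inter-indicator correlation of 0.30, the downstream structural path, and the reflective loadings are not the population parameters of Figure 1, and the study's actual data generation was carried out in EQS.

```python
import numpy as np

rng = np.random.default_rng(42)

def formative_paths(r, ratio=(4.0, 3.0, 2.0, 1.0), residual_var=0.05):
    """Scale the 4:3:2:1 ratio so that Var(eta) = gamma' R gamma + residual_var = 1."""
    ratio = np.asarray(ratio)
    R = np.full((4, 4), r) + (1.0 - r) * np.eye(4)   # equicorrelated formative indicators
    k = np.sqrt((1.0 - residual_var) / (ratio @ R @ ratio))
    return k * ratio, R

def simulate(n, r):
    """One replication: formative indicators, the latent variable, and two
    reflective indicators of one downstream construct (illustrative values)."""
    gamma, R = formative_paths(r)
    x = rng.multivariate_normal(np.zeros(4), R, size=n)        # x1..x4
    eta = x @ gamma + rng.normal(scale=np.sqrt(0.05), size=n)  # 5% random-shock residual
    eta3 = 0.5 * eta + rng.normal(scale=np.sqrt(0.75), size=n) # assumed structural path of 0.5
    y = np.column_stack([0.8 * eta3 + rng.normal(scale=0.6, size=n) for _ in range(2)])
    return x, eta, y

gamma, _ = formative_paths(r=0.30)
print("formative paths for x1..x4:", np.round(gamma, 3))
x, eta, y = simulate(n=500, r=0.30)
print("sample Var(eta):", round(eta.var(ddof=1), 3))           # close to 1 by construction
```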
Formative Modeling with Latent Variables

The first of the two alternative approaches for modeling formatively specified constructs discussed here is structural equation modeling with latent variables (SEM-LV), also commonly referred to as covariance-based structural equation modeling. Estimation of both measurement and structural parameters included in the model is achieved by iterative minimization of a fit function that compares the observed covariance matrix with that implied by the research model. Maximum likelihood (ML) is the most commonly employed estimator, although others are available. When ML is employed, the following assumptions are made: sample observations are independent and identically distributed following a multivariate normal distribution, the hypothesized model is approximately correct, a sample covariance matrix is analyzed, and the sample size is large (Boomsma and Hoogland 2000).

The research model employed here and shown in Figure 1 has been identified as follows. In all reflectively specified latent variables the loading of the first indicator (that is, the loadings for y1, y5, y9 and y13) has been fixed to one. For the identification of the formatively specified latent variable, three alternatives were considered. First, the disturbance term for the formatively specified construct could be set to zero. This is equivalent to assuming that the latent variable is perfectly determined by the formative indicators. This appears undesirable for a number of reasons. First, the definition of a formatively specified latent variable as (Bollen 2007; Bollen and Lennox 1991):

η = γ1 x1 + γ2 x2 + ... + γq xq + ζ

recognizes that there is something more to the latent variable of interest than what is captured by its set of formative indicators. Fixing the disturbance term to zero for identification purposes would transform the latent variable into a weighted composite of its formative indicators, which creates the additional conundrum of positing that a set of manifest variables cause their own weighted sum. Bollen and Davis (2009) take a similar position.

Second, seminal work in causal analysis (James et al. 1983) (see, in particular, their discussion of the self-containment of functional equations) has defined the variance of a variable as the aggregate of a set of stable, non-minor and direct causes that are related to each other (i.e., correlated amongst themselves), if any; a set of stable, non-minor and direct causes that are uncorrelated with each other, if any; and random shocks, which are minor and unstable causes of a variable. In order to obtain unbiased estimates of path coefficients (see our discussion of omitted variable bias below), only the first set needs to be completely specified in the functional equation for a latent variable, as the disturbance term will capture any unmeasured but uncorrelated causes, and any random shocks that affect the latent variable at any given time. Fixing the residual variance to zero ignores the fact that, even if all relevant causes of a latent variable are accounted for, there is still some portion of the variance that is due to minor variations (i.e., random shocks, as put by James et al., 1983). Third, fixing the residual term to zero deprives researchers of an important diagnostic tool that can be used to assess the extent to which all causes of the formatively specified latent variable have indeed been incorporated in the model (Diamantopoulos 2006). Finally, recognizing that present knowledge about these relationships is generally incomplete, one would be hard pressed to provide complete and unequivocal assurance that all possible causes of a latent variable have been included. As noted by James (1980, p. 415): "The operative question is not whether one has an unmeasured variables problem but rather the degree to which the unavoidable unmeasured variables problem biases the estimates of path coefficients and provides a basis for alternative explanations of results" (see below for a more comprehensive discussion of the omitted variables problem). Given these considerations, we cannot recommend this approach to the identification of formatively specified latent variables, though at first sight it would appear to be the simplest alternative.

The two other approaches considered, which are also discussed by Bollen and Davis (2009), require constraining a path coefficient to a non-zero value. This could be done either by fixing one of the path coefficients emitted by the formatively specified latent variable to another latent variable to one, or by fixing the path from a formative indicator to the formatively specified latent variable to one. Either of these alternatives will result in identical model fit, since they are merely setting the scale of the formatively specified latent variable; whereas unstandardized coefficients vary depending on the scaling approach and the particular non-zero value chosen for identification, standardized coefficients will be equal across both approaches. The downside of either of these approaches is that standard errors cannot be estimated for the parameter on which the constraint is placed. This occurs as well when constraining a loading to a non-zero value in order to identify a reflectively specified variable, but in those cases the significance of a particular loading is likely not of theoretical importance, and assessing its magnitude, as well as the magnitude and significance of other items loading on the same variable, is generally deemed sufficient to establish its validity.
In the case of formatively specified latent variables, however, the significance of either of these paths (from one latent variable to another, or from a formative indicator to its latent variable) is surely of interest. Given that researchers are generally more interested in assessing parameters relating latent variables to each other, we believe they would be better served by fixing the path from a formative indicator to its latent variable for the purpose of establishing identification [2]. In all our analyses and simulations, the formatively specified construct in the model shown in Figure 1 was identified by constraining the path from x1 to one when x1 was included in the model, and the path from x2 in those cases where x1 was omitted.

[2] When the path from a formative indicator to the latent variable is fixed, the standardized solution still reflects the relative contribution of all formative indicators, including the one with the fixed path, to the latent variable. However, the significance of a fixed path cannot be ascertained. Researchers should choose which paths to fix for identification purposes based on the goals of their research. In this case we have chosen to focus our attention on the structural paths relating latent variables to each other.
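To illustrate why the two scaling choices are equivalent, the sketch below computes the covariance matrix implied by a small formative model under arbitrary assumed values (these are not the parameters of Figure 1) and shows that rescaling the latent variable so that either a formative path or an emitted path equals one leaves the implied covariances, and therefore model fit, unchanged.

```python
import numpy as np

def implied_cov(gamma, Phi, lam, theta, psi):
    """Implied covariance of (x1, x2, y1, y2) for eta = gamma'x + zeta, y_j = lam_j*eta + eps_j."""
    var_eta = gamma @ Phi @ gamma + psi
    Sxy = np.outer(Phi @ gamma, lam)                  # Cov(x_i, y_j) = lam_j * (Phi gamma)_i
    Syy = np.outer(lam, lam) * var_eta + np.diag(theta)
    return np.block([[Phi, Sxy], [Sxy.T, Syy]])

# Arbitrary illustrative population values (not the paper's).
gamma = np.array([0.6, 0.4]); Phi = np.array([[1.0, 0.3], [0.3, 1.0]])
lam = np.array([0.8, 0.7]);   theta = np.array([0.36, 0.51]); psi = 0.05

# Scaling 1: fix the path from x1 at one. Scaling 2: fix the first emitted path at one.
s_formative = implied_cov(gamma / gamma[0], Phi, lam * gamma[0], theta, psi / gamma[0] ** 2)
s_emitted   = implied_cov(gamma * lam[0], Phi, lam / lam[0], theta, psi * lam[0] ** 2)

print(np.allclose(s_formative, s_emitted))            # True: same implied covariances, same fit
```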

The Problem of Omitted Variables and its Consequences

When analyzing models using SEM-LV, the coefficient linking the formative indicators to their respective latent variable takes the form of a regression estimate. The omission of one of those indicators from the model therefore represents a special case of a more general issue of great importance for empirical research: that of omitted (or left-out, or unmeasured) variables. The issue has been examined in great detail by James (1980), James, Mulaik and Brett (1983), Hosman, Hansen and Holland (2010), Mauro (1990), Cellini (2008) and Meade, Behrend and Lance (2009), to name a few. Here we provide a summary of the problem and how it is expected to affect research models with omitted formative indicators.

The problem occurs when a research model omits a variable that (a) has a substantial effect on the dependent variable of interest, (b) is correlated with another predictor that is included in the model, and (c) makes a unique contribution to the prediction of the dependent variable, that is, is not itself linearly dependent on other predictors included in the model (James et al. 1983; Mauro 1990). When such a variable exists in the population but is not included in the model tested by a researcher, bias will occur in the regression coefficients of the predictors that are included in the model. Such bias occurs because the omission of a relevant predictor results in covariation between predictors that are explicitly included in the model and the disturbance term. More precisely, it violates the requirement that the functional equations representing the causal model be self-contained (James et al. 1983).

A simple example helps illustrate the problem (the principle is the same for more complex models, but the demonstration is more involved; see Mauro, 1990, for an example with three predictors). Consider the following linear model:

y = b1 x1 + b2 x2 + e

where y is the dependent variable, x1 and x2 are the only two predictors, correlated at r12, and b1 and b2 are standardized regression coefficients. The closed-form solutions for the coefficients are (where ry1 is the correlation of x1 and y, and ry2 is the correlation between x2 and y):

b1 = (ry1 − r12 ry2) / (1 − r12²)   and   b2 = (ry2 − r12 ry1) / (1 − r12²)

If a researcher omits one of the predictors, say x2, then the formula above for b1 simplifies to b1 = ry1; that is, the standardized regression coefficient equals its correlation with the dependent variable. If the two predictors were not correlated, which is one of the required conditions for bias to occur, the standardized coefficient b1 would always equal its correlation with the dependent variable (as r12 would equal zero in the formula above). If the two predictors are correlated, however, this is no longer the case, as:

ry1 = b1 + b2 r12

so that the coefficient obtained when x2 is omitted differs from its population value by b2 r12. Both the direction and degree of bias will vary as a function of the sign and magnitude of the correlations involved. For models involving multiple predictor variables, the net effect of omitting one or more of them on the coefficients of the included variables is more complex to determine, given the multiple correlations, in both strength and sign, that play a role. However, unless all correlations involved are very small or close to zero (and leaving aside suppressor effects), omitting a relevant predictor variable will cause some degree of bias in the remaining ones in the model. See also Cenfetelli and Bassellier (2009) for some discussion of the possibility of suppressor effects within formatively specified latent variables.
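The bias term can be verified directly from the closed-form expressions above; the correlations in the snippet below are arbitrary illustrative values, not parameters from the study.

```python
# Closed-form illustration of omitted variable bias with two standardized predictors.
r12, b1, b2 = 0.40, 0.50, 0.30
ry1 = b1 + b2 * r12                           # implied correlation of y with x1
ry2 = b2 + b1 * r12                           # implied correlation of y with x2

b1_full = (ry1 - r12 * ry2) / (1 - r12 ** 2)  # both predictors included
b1_omitted = ry1                              # x2 omitted

print(round(b1_full, 3))     # 0.5: recovers the population value
print(round(b1_omitted, 3))  # 0.62: biased upward by exactly b2 * r12 = 0.12
```
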
First, as expected from our discussion of the omitted variables problem above, the omission of a relevant formative indicator results in a significant upward bias in the path coefficients relating the included formative indicators to the latent variable, as compared to the population values obtained when the full model is estimated. The magnitude of this bias is itself a function of three different variables: the relative importance of the missing indicator, the strength of the correlation between indicators, and the relative importance of each of the included indicators for which bias occurs. First, bias increases as the strength of the correlation between formative indicators increases. Second, the degree of bias is a function of the relative importance of the missing indicator: when that indicator contributes the most toward the variance of the latent variable (e.g., x1 in these models), bias will be higher than when the missing indicator is of more limited importance to the definition of the latent variable. Finally, relatively less important indicators, that is, those with smaller path coefficients in the population model, will exhibit more bias than relatively more important ones. For example, in one extreme case (Table 1, estimated path for x4 when x1 is missing from the model), the resulting estimate is 90% larger than its population counterpart.

The second major result of importance is that the omission of formative indicators from the model has no consequence for the estimates of the structural parameters of interest (i.e., those path coefficients that relate latent variables to each other). As can be seen in Table 1, all structural coefficients are identical to their population values, regardless of which indicators are missing and how strongly those are correlated amongst themselves. As a result, though most discussions of the formative specification of latent variables emphasize the need to obtain a complete set of formative indicators (e.g., a census, according to Bollen and Lennox, 1991), our results indicate that violating this requirement will not create bias in the path coefficients between latent variables [3].

[3] In order to validate this result we ran various other analyses including only one or two formative indicators in the analyzed models. In all cases the standardized structural coefficients were identical to their population values.

To see how this is the case, consider the following simple MIMIC example (which abstracts away the complexities associated with other latent variables in the model, but in no way affects our results) with two formative indicators and two reflective indicators [4]. The implied population covariance matrix is shown in Figure 2 below, with a graphical representation of this example shown in Figure 3. In this model there are six equations (for the variances of the two reflective indicators and the four covariances among the reflective and formative indicators) but only four unknowns [5] (the two loadings and the two regression paths from the formative indicators), which means that the system of equations is overidentified and that two of the covariances can be expressed as a function of the others. In this sense, one of the formative indicators is redundant, and the model could still be solved if only one of them were present; indeed, missing a formative indicator in this example would result in an equation system with three equations (for the variances of the two reflective indicators and their covariance) and three unknowns (the two loadings and the regression path from the single remaining indicator), which can be solved. Another way to consider this is that the proportionality constraints imposed in the model imply redundancy between the formative indicators.

Figure 2. Population Covariance Matrix for the MIMIC Example (symmetric upper triangle omitted). Its non-redundant elements are:

Var(y1) = λ1² Var(η) + Var(ε1)
Var(y2) = λ2² Var(η) + Var(ε2)
Cov(y1, y2) = λ1 λ2 Var(η)
Cov(x1, y1) = λ1 [γ1 Var(x1) + γ2 φ]
Cov(x2, y1) = λ1 [γ2 Var(x2) + γ1 φ]
Cov(x1, y2) = λ2 [γ1 Var(x1) + γ2 φ]
Cov(x2, y2) = λ2 [γ2 Var(x2) + γ1 φ]
Cov(x1, x2) = φ
where Var(η) = γ1² Var(x1) + γ2² Var(x2) + 2 γ1 γ2 φ + Var(ζ)

Figure 3. MIMIC Example Graphical Representation: x1 and x2, correlated at φ, determine η through γ1 and γ2 (with disturbance ζ); η is reflected by y1 and y2 through λ1 and λ2, with measurement errors ε1 and ε2.

[4] We thank Mikko Rönkkö for this insight.
[5] Setting the scale of the latent variable with some of the available identification approaches removes one unknown. In this case we have omitted discussion of the residual for the latent variable.
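The redundancy argument can be checked numerically. The sketch below builds the implied covariance matrix for the MIMIC example under assumed parameter values (not the paper's), drops x2, and shows that the covariances involving x1 alone still pin down the ratio of the two loadings and how Cov(y1, y2) is partitioned between them.

```python
import numpy as np

def implied_cov(gamma, Phi, lam, theta, psi):
    var_eta = gamma @ Phi @ gamma + psi
    Sxy = np.outer(Phi @ gamma, lam)
    return np.block([[Phi, Sxy], [Sxy.T, np.outer(lam, lam) * var_eta + np.diag(theta)]])

gamma = np.array([0.6, 0.4]); Phi = np.array([[1.0, 0.3], [0.3, 1.0]])
lam = np.array([0.8, 0.7]);   theta = np.array([0.36, 0.51]); psi = 0.05
sigma = implied_cov(gamma, Phi, lam, theta, psi)        # order: x1, x2, y1, y2

# Use only the moments that remain when x2 is dropped.
c_x1y1, c_x1y2, c_y1y2 = sigma[0, 2], sigma[0, 3], sigma[2, 3]

ratio = c_x1y2 / c_x1y1                 # equals lam2 / lam1
lam1_sq_var_eta = c_y1y2 / ratio        # equals lam1^2 * Var(eta): the partition is solved

var_eta = gamma @ Phi @ gamma + psi
print(round(ratio, 4), round(lam[1] / lam[0], 4))                  # identical
print(round(lam1_sq_var_eta, 4), round(lam[0] ** 2 * var_eta, 4))  # identical
```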

Our analysis shows that, in a simple MIMIC model, the formative indicators are redundant when estimating the path coefficients from the latent variable toward the indicators. The next question is how well this analysis generalizes to more complex models. Adding more formative indicators makes the equations more complex, but it does not change the basic principle: the covariance between the two reflective indicators determines the product of the two paths λ1 and λ2, and one formative indicator is sufficient to solve how the covariance between the reflective indicators is partitioned between the two parameters. Similarly, replacing the reflective indicators with two latent variables maintains the basic principle. The only difference is that we would not be using an observed covariance between two reflective indicators, but rather a model-implied covariance between two latent variables, when solving for the values of the paths emitted by the formatively measured latent variable. However, our results hold here as well, as the issue is not the nature of the dependent variables, but the fact that there are more equations than unknowns when solving the system of equations implied by the models examined here. Our analysis indicates that, for a correctly specified formative model, omitting a formative indicator in latent variable estimation is not as severe as earlier thought. As is the case in reflective measurement models, where multiple indicators of the same latent variable add a layer of redundancy to the model-implied equations, formative indicators exhibit similar characteristics in this regard.

Given the likely controversial nature of these results, we also conducted a population analysis to validate our Monte Carlo simulations. Our approach here is similar to that for conducting power analyses developed by Satorra and Saris (1985). In this approach, the covariance matrix implied by the known population model (derived from Figure 1) is subjected to analysis using alternative models which contain some misspecified component. In this case, the misspecification is the omission of a formative indicator from the alternative model. While some empirical issues cannot be examined with this approach (for example, convergence and solution propriety), results thus obtained will hold when data are sampled from the population and subjected to analysis using the same alternative models. In essence, this approach allows a researcher to develop expectations about the consequences of various misspecifications if an infinitely large number of samples from the original population were collected and analyzed. In the case of Satorra and Saris (1985), misspecification due to constraining a particular parameter to zero allows researchers to estimate the a priori power of a variety of sample sizes to detect that parameter, based on the non-centrality resulting from the analysis, which follows a non-central chi-square distribution (for an extended discussion and annotated example see Brown, 2006, Chapter 10). Results from these additional analyses are identical to those obtained from the Monte Carlo simulations previously discussed, lending additional credence to our findings.

Third, our results in Table 1 show that all models fit the population covariance matrix equally well. That is, neither absolute (i.e., chi-square) nor relative (e.g., RMSEA, CFI) fit indexes will alert researchers that a relevant formative indicator has been omitted from the model.
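This perfect-fit result can also be illustrated at the population level. Using the same assumed MIMIC values as in the earlier sketches (again, not the paper's parameters), the snippet below drops x2, solves the reduced, just-identified model by hand, and shows that it reproduces the remaining population covariances exactly, leaving no discrepancy for a fit index to detect. This is only a toy analogue of the Satorra and Saris (1985) style analysis reported in the text.

```python
import numpy as np

def implied_cov(gamma, Phi, lam, theta, psi):
    var_eta = gamma @ Phi @ gamma + psi
    Sxy = np.outer(Phi @ gamma, lam)
    return np.block([[Phi, Sxy], [Sxy.T, np.outer(lam, lam) * var_eta + np.diag(theta)]])

gamma = np.array([0.6, 0.4]); Phi = np.array([[1.0, 0.3], [0.3, 1.0]])
lam = np.array([0.8, 0.7]);   theta = np.array([0.36, 0.51]); psi = 0.05

keep = [0, 2, 3]                                            # x1, y1, y2 (x2 dropped)
S = implied_cov(gamma, Phi, lam, theta, psi)[np.ix_(keep, keep)]

# Solve the reduced, just-identified model, scaling the latent variable to unit variance.
l1 = np.sqrt(S[1, 2] * S[0, 1] / S[0, 2])
l2 = S[1, 2] / l1
g1 = S[0, 1] / l1
S_reduced = implied_cov(np.array([g1]), np.array([[1.0]]), np.array([l1, l2]),
                        np.array([S[1, 1] - l1 ** 2, S[2, 2] - l2 ** 2]), 1.0 - g1 ** 2)

print(np.allclose(S, S_reduced))     # True: the reduced model leaves no detectable ill fit
```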
This constitutes a special case of the phenomena described by Tomarken and Waller (2003), who discuss how otherwise misspecified models can fit as well as correctly specified ones when the restrictions on the covariance matrix implied by the misspecified model are a subset of those imposed by the correctly specified one. As a result, a lack of significant ill fit cannot be taken as validation that all relevant formative indicators have been included in the research model.

Finally, we consider the often-raised possibility (Cenfetelli and Bassellier 2009; Diamantopoulos 2006; Diamantopoulos et al. 2008) of using the degree of variance explained (or unexplained) in the formatively specified construct by its indicators as a measure of whether all causes of the construct have been accounted for, that is, of the completeness of the specification. While the underlying logic is sound, researchers should be mindful that the bias in the path coefficients relating formative indicators to their construct that occurs when a relevant indicator is omitted will also have an upward biasing effect on the reported variance explained in the formatively specified latent variable. As a result, though the proportion of variance explained may seem high, that is not a clear-cut indication that there are no missing indicators. As discussed before, however, omitted indicators do not appear to bias the structural coefficients relating latent variables to one another.
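A small regression illustration of this point, under assumed population values rather than those of Figure 1: even with a relevant formative indicator left out, the variance explained in the latent variable by the remaining indicators stays high, while their weights are biased upward.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 100_000, 0.30
R = np.full((4, 4), r) + (1 - r) * np.eye(4)
gamma = np.array([0.4, 0.3, 0.2, 0.1]) * 1.35                    # roughly unit Var(eta)
x = rng.multivariate_normal(np.zeros(4), R, size=n)
eta = x @ gamma + rng.normal(scale=np.sqrt(0.05), size=n)

def ols(X, y):
    Xc = np.column_stack([np.ones(len(X)), X])
    b, res, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return b[1:], 1 - res[0] / ((y - y.mean()) @ (y - y.mean()))

for cols, label in [([0, 1, 2, 3], "all indicators"),
                    ([0, 1, 2], "x4 omitted"),
                    ([1, 2, 3], "x1 omitted")]:
    w, r2 = ols(x[:, cols], eta)
    print(label, np.round(w, 2), round(r2, 2))
# all indicators: weights near their population values, R2 about 0.95
# x4 omitted: weights slightly inflated, R2 essentially unchanged
# x1 omitted: weights clearly inflated, R2 still sizable (about 0.7)
```
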

TABLE 1. Latent Variable Simulation Results, Model B, N = 500

Parameter | Full Model | X1 Missing | X2 Missing | X3 Missing | X4 Missing
x1 → Ksi | -1% | -- | +16% | +11% | +5%
x2 → Ksi | 0% | +30% | -- | +15% | +8%
x3 → Ksi | 0% | +45% | +34% | -- | +11%
x4 → Ksi | 0% | +90% | +68% | +46% | --
Ksi → Eta | 0% | 0% | 0% | 0% | 0%
Ksi → Eta | 0% | 0% | 0% | 0% | 0%
Eta1 → Eta | 0% | 0% | 0% | 0% | 0%
Eta1 → Eta | 0% | 0% | 0% | 0% | 0%
Χ² (d.f.) | (160) | (145) | (145) | (145) | (145)

Note: Cell values show the average percentage bias compared to the known population value for each parameter over 1,000 replications, calculated as (average estimate − population value) / population value; a double dash marks the weight of the indicator omitted in that condition. Degrees of freedom for the chi-square test are shown in parentheses.

Formative Modeling with Partial Least Squares

In PLS the latent variables of interest are represented as weighted composites of the observed variables that are directly related to them. The practice of substituting composites of observed variables as proxies for those of theoretical interest is certainly not new, and has been discussed before by many authors (e.g., McDonald, 1996) and, specifically in the context of formative specification of latent variables, by Bollen and Lennox (1991). In the particular case of the composites employed by PLS, those are weighted combinations of observed variables, whereas most of the literature discussing the use of composites refers to the unweighted case. As is well known, estimates of the relationship between latent variables obtained from the relationship between composites that substitute for them will be biased unless those composites are perfectly reliable. Whether that bias will be upward or downward compared to the population value of the relationship under examination is a function of the complexity of the functional equation relating the composites and of which composites are less than perfectly reliable. In the case of only one predictor and one dependent variable, the obtained estimate will be biased downward when either of those composites is less than perfectly reliable. For a more general discussion of the effects of lack of reliability on various statistical procedures see, for example, Ree and Carretta (2006), and Bollen (1989) specifically dealing with latent variables.

As is also well known, estimates of the relationships between composites in PLS will be biased for any finite number of indicators related to each of those composites (i.e., the consistency-at-large requirement). In the case of reflectively specified latent variables this occurs because the composites representing them in PLS analyses, being themselves a weighted aggregate of individual indicators containing both true score and error components, are not perfect representations of the latent variables under examination, i.e., they are not perfectly reliable. To the extent that the number of indicators is large and those indicators are of high quality, the reliability of the composites will be relatively high and bias more limited. However, as noted by Bollen and Lennox (1991), "this is not the same as saying that it [the composite] has perfect reliability" (p. 310). The issue is generally overlooked by researchers who, after showing that the composites exhibit reliability above a certain threshold (typically 0.70 or 0.80), proceed as if estimates of the relationships of interest obtained from those composites accurately reflected those present in the population.
Unless the reliabilities of the involved composites are quite high, however, there will be substantial bias in the estimates. Following Raykov (1997), and extending work in this regard by Bollen and Lennox (1991), the reliability of a composite can be understood as the shared variance between the composite and the latent variable it represents in the analysis; that is, the squared correlation between composite and latent variable. This is consistent with the commonly used measure of composite reliability by Werts, Linn and Jöreskog (1974) that is employed in PLS analyses; see also Fornell and Larcker (1981a; 1981b).
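For reference, a minimal sketch of that composite reliability statistic for standardized reflective indicators with uncorrelated errors; the loadings used here are illustrative, not values from the study.

```python
def composite_reliability(loadings):
    """Composite reliability of the Werts-Linn-Joreskog form for standardized indicators:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    num = sum(loadings) ** 2
    return num / (num + sum(1 - l ** 2 for l in loadings))

print(round(composite_reliability([0.80, 0.75, 0.70]), 3))   # 0.795 for these illustrative loadings
```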

More generally, reliability as the squared correlation between a composite and the corresponding latent variable can be expressed as follows (Bollen and Lennox 1991; Raykov 1997), where c is the composite, η the corresponding latent variable, wi the weights employed in the formation of the composite, and λi the loadings relating the reflective items to their latent variable:

Rel(c) = Corr(c, η)² = (Σ wi λi)² Var(η) / [ (Σ wi λi)² Var(η) + Σ wi² Var(εi) ]

In this formulation, the numerator represents the variance in the composite that is due to the latent variable, whereas the denominator captures the total variance of the composite. The ratio between the two is the reliability of said composite. In the case of an equally weighted composite it simplifies to the well-known statistic by Werts et al. (1974). Although the notion of reliability is usually associated with reflective specifications (indeed, various authors have noted that the idea of reliability as internal consistency does not apply to formative indicators), we argue here that the idea of reliability as shared variance between composite and latent variable is applicable, at least in principle, to composites representing formatively specified latent variables, as follows. In the case of PLS, composites representing the latent variables are weighted aggregates of the formative indicators included in the model without, as noted above, a disturbance term. To the extent that these composites are not perfectly correlated with the theoretical latent variable (i.e., are not perfectly reliable proxies for the latent variable of interest), any relationship between these composites and other variables in the model will be biased with respect to its population-level counterpart (Bollen and Lennox 1991).

As a result, lack of perfect composite reliability, for either formative or reflective composites, explains the presence of bias in the estimates obtained from our PLS simulations. Even when all formative indicators are included in the models, the lack of a disturbance term slightly reduces the correlation of the formative composite with the latent variable it represents, resulting in a small amount of bias in the relationships between this composite and others in the model (lack of perfect reliability in those other composites is also responsible for this bias). More problematic, however, are those cases where a formative indicator is omitted from the model. In those scenarios, the reliability of the formative composite greatly suffers, as the absence of the indicator reduces the theoretical correlation with the latent variable it represents, since the variance due to the omitted indicator is no longer included. This is the reason why estimates obtained from models with omitted indicators exhibit substantial bias in the relationships between composites representing latent variables, and a major downside of employing PLS for analyzing these models.

To better show how lack of reliability lies behind these occurrences, we conducted the following exercise. Adapting a procedure outlined by Raykov (1997), we estimated the reliability of the formative composite and that of one of the reflective composites it is related to (in this exercise, Eta3) for one scenario (Model A, X2 missing, N = 500) and show how, after correcting for unreliability, the relationship between the two composites is no longer biased. We proceeded as follows. First, we extracted the weights used to form both composites from the results of our PLS analyses.
Next, using EQS 6.1 and the procedure by Raykov (1997), we created weighted composites (using the weights determined by PLS) and included them in the latent variable analysis of the same data, omitting the same indicator as well. As an additional parameter, the software estimates the correlation between these composites and the latent variables included in the model. Using those correlations, and the correction procedures described by Ree and Carretta (2006), we then calculated a corrected value for the relationship between Ksi and Eta3 in Figure 1. The process was repeated for each replication, and results averaged over all replications are discussed next. To highlight that the inclusion of these composites has no bearing on the latent variable analysis itself, we compared the chi-square statistic and its p-value for models with and without the composites, with identical results. For the particular scenario discussed here, the average path coefficient between Ksi and Eta3 obtained from the PLS analyses was . The average reliability of the formative composite was , and that of the reflective composite . Correcting the path estimates obtained from PLS for lack of reliability in the composites [6], the average path is 0.713, which compares well with a population value of . (Note that we discuss these results here in terms of averages for ease of exposition, but the procedures outlined above were conducted for each replication separately and only then averaged for presentation.) These results underscore the fact that the presence of bias in these estimates is a result of employing weighted composites as (imperfect) substitutes for the latent variables of interest. When correcting for the lack of complete reliability introduced by that substitution, the estimated parameters are in agreement with the expected values based on the population settings.
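A sketch of the arithmetic behind this correction, with assumed weights, loadings, and an assumed observed path (none of these are values from the study): reliability is computed as the squared correlation between a weighted composite and the latent variable it stands for, and the observed correlation is then disattenuated by the product of the two reliabilities, in the spirit of Ree and Carretta (2006).

```python
import numpy as np

def composite_lv_reliability(w, lam, var_lv, err_var):
    """Squared correlation between c = sum(w_i * y_i) and the latent variable,
    for indicators y_i = lam_i * LV + e_i with uncorrelated errors."""
    w, lam, err_var = map(np.asarray, (w, lam, err_var))
    true_part = (w @ lam) ** 2 * var_lv
    return true_part / (true_part + w @ (err_var * w))

# Illustrative weights, loadings, and error variances (not the values from the study).
rel_ksi  = composite_lv_reliability([0.45, 0.30, 0.20], [0.7, 0.6, 0.5], 1.0, [0.51, 0.64, 0.75])
rel_eta3 = composite_lv_reliability([0.35, 0.33, 0.32], [0.8, 0.8, 0.8], 1.0, [0.36, 0.36, 0.36])

r_observed = 0.55                                    # assumed PLS path (a correlation here)
r_corrected = r_observed / np.sqrt(rel_ksi * rel_eta3)
print(round(rel_ksi, 3), round(rel_eta3, 3), round(r_corrected, 3))
```
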

Bias as a Result of Omitted Indicators

We are now able to explain the pattern of results reported next with regard to the presence of bias in the relationships between the various composites representing the latent variables of interest, and between those composites and their formative indicators. Given that all these composites are always imperfect representations of the ideal latent variables, any relationships between them will exhibit bias when compared to their population values. This is a well-known feature of PLS. For the specific case of interest here, formative specifications, unreliability in the formative composite will be limited when all measurable formative indicators are included in the model, and thus limited bias is likely to occur (subject to high reliability of the other composites related to the focal one). This occurs because the only omitted portion of the latent variable in the weighted composite is the residual disturbance due to random shocks, which should be a small portion of the overall variance and, further, uncorrelated with the formative indicators. Therefore, its omission does not represent a major problem. As shown in our results, this is the case when all formative indicators are included in the model. Table 2 shows results for Model B where N = 500, which can be directly compared with those in Table 1; results for other cases are available from the first author upon request.

TABLE 2. PLS Simulation Results, Model B, N = 500

Parameter | Full Model | X1 Missing | X2 Missing | X3 Missing | X4 Missing
x1 → Ksi | +2% | -- | +26% | +16% | +9%
x2 → Ksi | +3% | +48% | -- | +21% | +11%
x3 → Ksi | +2% | +65% | +45% | -- | +14%
x4 → Ksi | +4% | +117% | +83% | +53% | --
Ksi → Eta | -6% | -16% | -12% | -8% | -7%
Ksi → Eta | -6% | -16% | -11% | -8% | -7%
Eta1 → Eta | -8% | -8% | -8% | -8% | -8%
Eta1 → Eta | -8% | -8% | -8% | -8% | -8%

Note: Cell values show the average percentage bias compared to the known population value for each parameter over 1,000 replications, calculated as (average estimate − population value) / population value; a double dash marks the weight of the indicator omitted in that condition.

When a formative indicator is omitted, however, the shared variance between the theoretical latent variable and the composite (i.e., its reliability) decreases as a function of how important the omitted indicator was to the determination of the formatively specified latent variable. When a major indicator is omitted, this drop in reliability has major consequences for the estimates obtained. As can be seen in our results, omitting an indicator leads to significant bias in the structural parameters of the model.
The situation is most problematic when the formative indicators are not highly correlated amongst themselves, as the omission of one indicator results in a larger portion of the variance of the composite that is not shared with the latent variable, resulting in lower reliability of the composite. For example, the estimated relationship between the formatively specified construct in Figure 1 and its direct consequent constructs exhibits a 28% downward bias when x1 is missing and the formative indicators are not highly correlated, which improves to 16% and 10% downward bias as the correlations between those indicators increase (Table 2 shows the 16% case). When the least important formative indicator in our models was omitted, the resulting bias in the structural relationships was no different than when all indicators were included (compare, for example, the X4 Missing column with the Full Model column in Table 2 for any one of our model variations with regard to the estimates of the structural relationships).

[6] Since this is a direct path with only one predictor, the estimate represents the correlation between the two composites. The formula for correcting correlations for unreliability is rxy / √(rxx ryy), where rxy is the attenuated correlation and rxx and ryy the reliabilities of x and y, respectively.
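The sketch below illustrates this mechanism under assumed values (the 4:3:2:1 ratio from the text, with everything else illustrative): the squared correlation between the formative composite and the latent variable it represents drops far more when x1 is omitted under a low inter-indicator correlation than under a high one.

```python
import numpy as np

def formative_composite_reliability(keep, r, gamma=np.array([0.4, 0.3, 0.2, 0.1]) * 1.35, psi=0.05):
    """Squared correlation between a composite of the included indicators (weighted by
    their population paths) and the latent variable eta = gamma'x + zeta."""
    R = np.full((4, 4), r) + (1 - r) * np.eye(4)
    w = gamma[keep]
    cov_c_lv = w @ R[keep] @ gamma
    var_c = w @ R[keep][:, keep] @ w
    var_lv = gamma @ R @ gamma + psi
    return cov_c_lv ** 2 / (var_c * var_lv)

for r in (0.1, 0.5):
    full = formative_composite_reliability([0, 1, 2, 3], r)
    no_x1 = formative_composite_reliability([1, 2, 3], r)
    print(r, round(full, 2), round(no_x1, 2))
# r = 0.1: roughly 0.93 with all indicators, about 0.54 without x1
# r = 0.5: roughly 0.96 with all indicators, about 0.81 without x1
```
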

Whereas the degree of bias in structural relationships due to an omitted formative indicator decreases as a function of how correlated the formative indicators are amongst themselves, bias in the measurement relationships between indicators and composite increases with those correlations when an indicator is omitted, as can be seen in Table 2. There are two related explanations for the occurrence of this bias. First, similar to the discussion above with respect to latent variables, when a relevant cause of a variable is omitted from a model, this leads to bias in the parameters linking the included causes to the dependent variable; in this case, the weights linking formative indicators to the composite. When all correlations amongst the predictors, as well as the population parameters between those and the dependent variable, are positive, upward bias will occur. The issue, however, is compounded by the fact that omitting a relevant formative indicator from the model omits it from the composite as well. Whereas in the case of latent variables the residual disturbance term captured the effects of omitted variables (and the violation of the self-containment assumption, which made the residual term correlated with the included indicators, was responsible for the bias in those estimates), in the case of PLS the weighted composite is formed only by those indicators explicitly included in the model. This results in a further biasing of the relationships between indicators and composite, as each indicator ends up representing an even larger portion of the composite than would have been the case had the omitted indicator been included in the model.

These effects can best be seen by comparing the bias in the relationships between indicators and composite shown in Table 2. First, the biasing effect of omitting a relevant formative indicator decreases as the omitted indicator becomes a relatively less important part of the definition of the construct. For example, bias when x4 is omitted is an order of magnitude smaller than when x1 is omitted. Second, the degree of bias itself is also a function of the relative importance of the indicator for which it is quantified. As a percentage of their population value, more important indicators (e.g., x1 or x2) will be less severely biased when an indicator is omitted than less important ones (such as x3 or x4). Bias for the latter in some combinations can be, in fact, quite extreme (consider, for instance, that the average estimated coefficient for x4 when x1 is omitted in Table 2 is more than twice as large as its population value). Finally, irrespective of which indicator is omitted, the general level of bias increases as the magnitude of the correlations amongst the formative indicators increases. For example, bias in the relationship between x3 and the formatively specified composite when x2 is omitted equals 48% in Model A, 83% in Model B and 97% (essentially twice as large as its population value) in Model C (Table 2 shows the 83% case).

Discussion and Limitations

Comparison Between Approaches

All results and conclusions discussed in this research are specific to the particular set of research models and conditions previously discussed, and are subject to the other limitations noted in more detail below. There are, however, a number of interesting results arising from our work, which we discuss in detail next. See Table 3.
First, although the performance of the two examined techniques for those cases where models are correctly specified in all aspects was not our main focus of interest, the work conducted here sheds some light on that as well. While most research in this area (e.g., Jarvis et al., 2003; Petter et al., 2007) has focused on latent variables, there is limited evidence on the performance of PLS, even for correctly specified models, when those include a formatively specified construct. Results shown in Table 3 indicate that when models are correctly specified, the latent variable technique (LV-SEM) exhibits no bias in the estimation of either structural parameters (those relating constructs to each other) or measurement parameters (those relating formative indicators to their corresponding construct), whereas PLS does exhibit some bias. Bias in the estimation of measurement parameters in this case is quite small, and is due to the lack of a disturbance term in the composites modeled by PLS, leading to the paths from the formative indicators appearing slightly biased upward. Bias for the structural parameters is small but not negligible, due to the lack of perfect reliability in the composites that substitute for the latent variables of interest; this bias is a function of the reliabilities of both composites involved in the estimates. Second, the omission of a formative indicator results in no bias for structural parameters under LV-SEM.


More information

Appendix B Construct Reliability and Validity Analysis. Initial assessment of convergent and discriminate validity was conducted using factor

Appendix B Construct Reliability and Validity Analysis. Initial assessment of convergent and discriminate validity was conducted using factor Appendix B Construct Reliability and Validity Analysis Reflective Construct Reliability and Validity Analysis Initial assessment of convergent and discriminate validity was conducted using factor analysis

More information

Use of Structural Equation Modeling in Social Science Research

Use of Structural Equation Modeling in Social Science Research Asian Social Science; Vol. 11, No. 4; 2015 ISSN 1911-2017 E-ISSN 1911-2025 Published by Canadian Center of Science and Education Use of Structural Equation Modeling in Social Science Research Wali Rahman

More information

Running head: INDIVIDUAL DIFFERENCES 1. Why to treat subjects as fixed effects. James S. Adelman. University of Warwick.

Running head: INDIVIDUAL DIFFERENCES 1. Why to treat subjects as fixed effects. James S. Adelman. University of Warwick. Running head: INDIVIDUAL DIFFERENCES 1 Why to treat subjects as fixed effects James S. Adelman University of Warwick Zachary Estes Bocconi University Corresponding Author: James S. Adelman Department of

More information

Modeling the Influential Factors of 8 th Grades Student s Mathematics Achievement in Malaysia by Using Structural Equation Modeling (SEM)

Modeling the Influential Factors of 8 th Grades Student s Mathematics Achievement in Malaysia by Using Structural Equation Modeling (SEM) International Journal of Advances in Applied Sciences (IJAAS) Vol. 3, No. 4, December 2014, pp. 172~177 ISSN: 2252-8814 172 Modeling the Influential Factors of 8 th Grades Student s Mathematics Achievement

More information

Applications of Structural Equation Modeling (SEM) in Humanities and Science Researches

Applications of Structural Equation Modeling (SEM) in Humanities and Science Researches Applications of Structural Equation Modeling (SEM) in Humanities and Science Researches Dr. Ayed Al Muala Department of Marketing, Applied Science University aied_muala@yahoo.com Dr. Mamdouh AL Ziadat

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study

Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Sampling Weights, Model Misspecification and Informative Sampling: A Simulation Study Marianne (Marnie) Bertolet Department of Statistics Carnegie Mellon University Abstract Linear mixed-effects (LME)

More information

CLUSTER-LEVEL CORRELATED ERROR VARIANCE AND THE ESTIMATION OF PARAMETERS IN LINEAR MIXED MODELS

CLUSTER-LEVEL CORRELATED ERROR VARIANCE AND THE ESTIMATION OF PARAMETERS IN LINEAR MIXED MODELS CLUSTER-LEVEL CORRELATED ERROR VARIANCE AND THE ESTIMATION OF PARAMETERS IN LINEAR MIXED MODELS by Joseph N. Luchman A Dissertation Submitted to the Graduate Faculty of George Mason University in Partial

More information

ASSESSING THE UNIDIMENSIONALITY, RELIABILITY, VALIDITY AND FITNESS OF INFLUENTIAL FACTORS OF 8 TH GRADES STUDENT S MATHEMATICS ACHIEVEMENT IN MALAYSIA

ASSESSING THE UNIDIMENSIONALITY, RELIABILITY, VALIDITY AND FITNESS OF INFLUENTIAL FACTORS OF 8 TH GRADES STUDENT S MATHEMATICS ACHIEVEMENT IN MALAYSIA 1 International Journal of Advance Research, IJOAR.org Volume 1, Issue 2, MAY 2013, Online: ASSESSING THE UNIDIMENSIONALITY, RELIABILITY, VALIDITY AND FITNESS OF INFLUENTIAL FACTORS OF 8 TH GRADES STUDENT

More information

STRUCTURAL EQUATION MODELING AND REGRESSION: GUIDELINES FOR RESEARCH PRACTICE

STRUCTURAL EQUATION MODELING AND REGRESSION: GUIDELINES FOR RESEARCH PRACTICE Volume 4, Article 7 October 2000 STRUCTURAL EQUATION MODELING AND REGRESSION: GUIDELINES FOR RESEARCH PRACTICE David Gefen Management Department LeBow College of Business Drexel University gefend@drexel.edu

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

Convergence Principles: Information in the Answer

Convergence Principles: Information in the Answer Convergence Principles: Information in the Answer Sets of Some Multiple-Choice Intelligence Tests A. P. White and J. E. Zammarelli University of Durham It is hypothesized that some common multiplechoice

More information

Multivariable Systems. Lawrence Hubert. July 31, 2011

Multivariable Systems. Lawrence Hubert. July 31, 2011 Multivariable July 31, 2011 Whenever results are presented within a multivariate context, it is important to remember that there is a system present among the variables, and this has a number of implications

More information

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) *

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) * A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) * by J. RICHARD LANDIS** and GARY G. KOCH** 4 Methods proposed for nominal and ordinal data Many

More information

Assessing Unidimensionality Through LISREL: An Explanation and an Example

Assessing Unidimensionality Through LISREL: An Explanation and an Example Communications of the Association for Information Systems Volume 12 Article 2 July 2003 Assessing Unidimensionality Through LISREL: An Explanation and an Example David Gefen Drexel University, gefend@drexel.edu

More information

Session 1: Dealing with Endogeneity

Session 1: Dealing with Endogeneity Niehaus Center, Princeton University GEM, Sciences Po ARTNeT Capacity Building Workshop for Trade Research: Behind the Border Gravity Modeling Thursday, December 18, 2008 Outline Introduction 1 Introduction

More information

Preliminary Report on Simple Statistical Tests (t-tests and bivariate correlations)

Preliminary Report on Simple Statistical Tests (t-tests and bivariate correlations) Preliminary Report on Simple Statistical Tests (t-tests and bivariate correlations) After receiving my comments on the preliminary reports of your datasets, the next step for the groups is to complete

More information

Chapter 1: Explaining Behavior

Chapter 1: Explaining Behavior Chapter 1: Explaining Behavior GOAL OF SCIENCE is to generate explanations for various puzzling natural phenomenon. - Generate general laws of behavior (psychology) RESEARCH: principle method for acquiring

More information

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS) Chapter : Advanced Remedial Measures Weighted Least Squares (WLS) When the error variance appears nonconstant, a transformation (of Y and/or X) is a quick remedy. But it may not solve the problem, or it

More information

Ambiguous Data Result in Ambiguous Conclusions: A Reply to Charles T. Tart

Ambiguous Data Result in Ambiguous Conclusions: A Reply to Charles T. Tart Other Methodology Articles Ambiguous Data Result in Ambiguous Conclusions: A Reply to Charles T. Tart J. E. KENNEDY 1 (Original publication and copyright: Journal of the American Society for Psychical

More information

12/31/2016. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

12/31/2016. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

Inclusive Strategy with Confirmatory Factor Analysis, Multiple Imputation, and. All Incomplete Variables. Jin Eun Yoo, Brian French, Susan Maller

Inclusive Strategy with Confirmatory Factor Analysis, Multiple Imputation, and. All Incomplete Variables. Jin Eun Yoo, Brian French, Susan Maller Inclusive strategy with CFA/MI 1 Running head: CFA AND MULTIPLE IMPUTATION Inclusive Strategy with Confirmatory Factor Analysis, Multiple Imputation, and All Incomplete Variables Jin Eun Yoo, Brian French,

More information

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15)

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15) ON THE COMPARISON OF BAYESIAN INFORMATION CRITERION AND DRAPER S INFORMATION CRITERION IN SELECTION OF AN ASYMMETRIC PRICE RELATIONSHIP: BOOTSTRAP SIMULATION RESULTS Henry de-graft Acquah, Senior Lecturer

More information

HIGH-ORDER CONSTRUCTS FOR THE STRUCTURAL EQUATION MODEL

HIGH-ORDER CONSTRUCTS FOR THE STRUCTURAL EQUATION MODEL HIGH-ORDER CONSTRUCTS FOR THE STRUCTURAL EQUATION MODEL Enrico Ciavolino (1) * Mariangela Nitti (2) (1) Dipartimento di Filosofia e Scienze Sociali, Università del Salento (2) Dipartimento di Scienze Pedagogiche,

More information

Propensity Score Analysis Shenyang Guo, Ph.D.

Propensity Score Analysis Shenyang Guo, Ph.D. Propensity Score Analysis Shenyang Guo, Ph.D. Upcoming Seminar: April 7-8, 2017, Philadelphia, Pennsylvania Propensity Score Analysis 1. Overview 1.1 Observational studies and challenges 1.2 Why and when

More information

An Empirical Study of the Roles of Affective Variables in User Adoption of Search Engines

An Empirical Study of the Roles of Affective Variables in User Adoption of Search Engines An Empirical Study of the Roles of Affective Variables in User Adoption of Search Engines ABSTRACT Heshan Sun Syracuse University hesun@syr.edu The current study is built upon prior research and is an

More information

Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work?

Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work? Meta-Analysis and Publication Bias: How Well Does the FAT-PET-PEESE Procedure Work? Nazila Alinaghi W. Robert Reed Department of Economics and Finance, University of Canterbury Abstract: This study uses

More information

A Comparison of First and Second Generation Multivariate Analyses: Canonical Correlation Analysis and Structural Equation Modeling 1

A Comparison of First and Second Generation Multivariate Analyses: Canonical Correlation Analysis and Structural Equation Modeling 1 Florida Journal of Educational Research, 2004, Vol. 42, pp. 22-40 A Comparison of First and Second Generation Multivariate Analyses: Canonical Correlation Analysis and Structural Equation Modeling 1 A.

More information

Session 3: Dealing with Reverse Causality

Session 3: Dealing with Reverse Causality Principal, Developing Trade Consultants Ltd. ARTNeT Capacity Building Workshop for Trade Research: Gravity Modeling Thursday, August 26, 2010 Outline Introduction 1 Introduction Overview Endogeneity and

More information

Context of Best Subset Regression

Context of Best Subset Regression Estimation of the Squared Cross-Validity Coefficient in the Context of Best Subset Regression Eugene Kennedy South Carolina Department of Education A monte carlo study was conducted to examine the performance

More information

Chapter 9. Youth Counseling Impact Scale (YCIS)

Chapter 9. Youth Counseling Impact Scale (YCIS) Chapter 9 Youth Counseling Impact Scale (YCIS) Background Purpose The Youth Counseling Impact Scale (YCIS) is a measure of perceived effectiveness of a specific counseling session. In general, measures

More information

Scale Building with Confirmatory Factor Analysis

Scale Building with Confirmatory Factor Analysis Scale Building with Confirmatory Factor Analysis Latent Trait Measurement and Structural Equation Models Lecture #7 February 27, 2013 PSYC 948: Lecture #7 Today s Class Scale building with confirmatory

More information

The Bilevel Structure of the Outcome Questionnaire 45

The Bilevel Structure of the Outcome Questionnaire 45 Psychological Assessment 2010 American Psychological Association 2010, Vol. 22, No. 2, 350 355 1040-3590/10/$12.00 DOI: 10.1037/a0019187 The Bilevel Structure of the Outcome Questionnaire 45 Jamie L. Bludworth,

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1 Welch et al. BMC Medical Research Methodology (2018) 18:89 https://doi.org/10.1186/s12874-018-0548-0 RESEARCH ARTICLE Open Access Does pattern mixture modelling reduce bias due to informative attrition

More information

AIS Electronic Library (AISeL) Association for Information Systems. Wynne Chin University of Calgary. Barbara Marcolin University of Calgary

AIS Electronic Library (AISeL) Association for Information Systems. Wynne Chin University of Calgary. Barbara Marcolin University of Calgary Association for Information Systems AIS Electronic Library (AISeL) ICIS 1996 Proceedings International Conference on Information Systems (ICIS) December 1996 A Partial Least Squares Latent Variable Modeling

More information

Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology*

Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology* Examining the efficacy of the Theory of Planned Behavior (TPB) to understand pre-service teachers intention to use technology* Timothy Teo & Chwee Beng Lee Nanyang Technology University Singapore This

More information

Methods for Addressing Selection Bias in Observational Studies

Methods for Addressing Selection Bias in Observational Studies Methods for Addressing Selection Bias in Observational Studies Susan L. Ettner, Ph.D. Professor Division of General Internal Medicine and Health Services Research, UCLA What is Selection Bias? In the regression

More information

Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies

Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies Arun Advani and Tymon Sªoczy«ski 13 November 2013 Background When interested in small-sample properties of estimators,

More information

HPS301 Exam Notes- Contents

HPS301 Exam Notes- Contents HPS301 Exam Notes- Contents Week 1 Research Design: What characterises different approaches 1 Experimental Design 1 Key Features 1 Criteria for establishing causality 2 Validity Internal Validity 2 Threats

More information

Preliminary Conclusion

Preliminary Conclusion 1 Exploring the Genetic Component of Political Participation Brad Verhulst Virginia Institute for Psychiatric and Behavioral Genetics Virginia Commonwealth University Theories of political participation,

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

A reply to Rose, Livengood, Sytsma, and Machery

A reply to Rose, Livengood, Sytsma, and Machery A reply to Rose, Livengood, Sytsma, and Machery Chandra Sekhar Sripada 1,2, Richard Gonzalez 3,4,5, Daniel Kessler 2, Eric Laber 6, Sara Konrath 7,8, Vijay Nair 4 1 Department of Philosophy, University

More information

Module 14: Missing Data Concepts

Module 14: Missing Data Concepts Module 14: Missing Data Concepts Jonathan Bartlett & James Carpenter London School of Hygiene & Tropical Medicine Supported by ESRC grant RES 189-25-0103 and MRC grant G0900724 Pre-requisites Module 3

More information

EXPERIMENTAL RESEARCH DESIGNS

EXPERIMENTAL RESEARCH DESIGNS ARTHUR PSYC 204 (EXPERIMENTAL PSYCHOLOGY) 14A LECTURE NOTES [02/28/14] EXPERIMENTAL RESEARCH DESIGNS PAGE 1 Topic #5 EXPERIMENTAL RESEARCH DESIGNS As a strict technical definition, an experiment is a study

More information

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA

On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation in CFA STRUCTURAL EQUATION MODELING, 13(2), 186 203 Copyright 2006, Lawrence Erlbaum Associates, Inc. On the Performance of Maximum Likelihood Versus Means and Variance Adjusted Weighted Least Squares Estimation

More information

A critical look at the use of SEM in international business research

A critical look at the use of SEM in international business research sdss A critical look at the use of SEM in international business research Nicole F. Richter University of Southern Denmark Rudolf R. Sinkovics The University of Manchester Christian M. Ringle Hamburg University

More information

Some General Guidelines for Choosing Missing Data Handling Methods in Educational Research

Some General Guidelines for Choosing Missing Data Handling Methods in Educational Research Journal of Modern Applied Statistical Methods Volume 13 Issue 2 Article 3 11-2014 Some General Guidelines for Choosing Missing Data Handling Methods in Educational Research Jehanzeb R. Cheema University

More information

Ec331: Research in Applied Economics Spring term, Panel Data: brief outlines

Ec331: Research in Applied Economics Spring term, Panel Data: brief outlines Ec331: Research in Applied Economics Spring term, 2014 Panel Data: brief outlines Remaining structure Final Presentations (5%) Fridays, 9-10 in H3.45. 15 mins, 8 slides maximum Wk.6 Labour Supply - Wilfred

More information

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to CHAPTER - 6 STATISTICAL ANALYSIS 6.1 Introduction This chapter discusses inferential statistics, which use sample data to make decisions or inferences about population. Populations are group of interest

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Studying the effect of change on change : a different viewpoint

Studying the effect of change on change : a different viewpoint Studying the effect of change on change : a different viewpoint Eyal Shahar Professor, Division of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona

More information

Comparing Direct and Indirect Measures of Just Rewards: What Have We Learned?

Comparing Direct and Indirect Measures of Just Rewards: What Have We Learned? Comparing Direct and Indirect Measures of Just Rewards: What Have We Learned? BARRY MARKOVSKY University of South Carolina KIMMO ERIKSSON Mälardalen University We appreciate the opportunity to comment

More information

A NON-TECHNICAL INTRODUCTION TO REGRESSIONS. David Romer. University of California, Berkeley. January Copyright 2018 by David Romer

A NON-TECHNICAL INTRODUCTION TO REGRESSIONS. David Romer. University of California, Berkeley. January Copyright 2018 by David Romer A NON-TECHNICAL INTRODUCTION TO REGRESSIONS David Romer University of California, Berkeley January 2018 Copyright 2018 by David Romer CONTENTS Preface ii I Introduction 1 II Ordinary Least Squares Regression

More information

existing statistical techniques. However, even with some statistical background, reading and

existing statistical techniques. However, even with some statistical background, reading and STRUCTURAL EQUATION MODELING (SEM): A STEP BY STEP APPROACH (PART 1) By: Zuraidah Zainol (PhD) Faculty of Management & Economics, Universiti Pendidikan Sultan Idris zuraidah@fpe.upsi.edu.my 2016 INTRODUCTION

More information

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives DOI 10.1186/s12868-015-0228-5 BMC Neuroscience RESEARCH ARTICLE Open Access Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives Emmeke

More information

baseline comparisons in RCTs

baseline comparisons in RCTs Stefan L. K. Gruijters Maastricht University Introduction Checks on baseline differences in randomized controlled trials (RCTs) are often done using nullhypothesis significance tests (NHSTs). In a quick

More information

George B. Ploubidis. The role of sensitivity analysis in the estimation of causal pathways from observational data. Improving health worldwide

George B. Ploubidis. The role of sensitivity analysis in the estimation of causal pathways from observational data. Improving health worldwide George B. Ploubidis The role of sensitivity analysis in the estimation of causal pathways from observational data Improving health worldwide www.lshtm.ac.uk Outline Sensitivity analysis Causal Mediation

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

Sample Size Determination and Statistical Power Analysis in PLS Using R: An Annotated Tutorial

Sample Size Determination and Statistical Power Analysis in PLS Using R: An Annotated Tutorial Communications of the Association for Information Systems 1-2015 Sample Size Determination and Statistical Power Analysis in PLS Using R: An Annotated Tutorial Miguel Aguirre-Urreta Information Systems

More information

Sample Sizes for Predictive Regression Models and Their Relationship to Correlation Coefficients

Sample Sizes for Predictive Regression Models and Their Relationship to Correlation Coefficients Sample Sizes for Predictive Regression Models and Their Relationship to Correlation Coefficients Gregory T. Knofczynski Abstract This article provides recommended minimum sample sizes for multiple linear

More information

UMbRELLA interim report Preparatory work

UMbRELLA interim report Preparatory work UMbRELLA interim report Preparatory work This document is intended to supplement the UMbRELLA Interim Report 2 (January 2016) by providing a summary of the preliminary analyses which influenced the decision

More information

Basic concepts and principles of classical test theory

Basic concepts and principles of classical test theory Basic concepts and principles of classical test theory Jan-Eric Gustafsson What is measurement? Assignment of numbers to aspects of individuals according to some rule. The aspect which is measured must

More information

Complex modeling in marketing using component based SEM

Complex modeling in marketing using component based SEM University of Wollongong Research Online Faculty of Commerce - Papers (Archive) Faculty of Business 2011 Complex modeling in marketing using component based SEM Shahriar Akter University of Wollongong,

More information

Throughout this book, we have emphasized the fact that psychological measurement

Throughout this book, we have emphasized the fact that psychological measurement CHAPTER 7 The Importance of Reliability Throughout this book, we have emphasized the fact that psychological measurement is crucial for research in behavioral science and for the application of behavioral

More information

WELCOME! Lecture 11 Thommy Perlinger

WELCOME! Lecture 11 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 11 Thommy Perlinger Regression based on violated assumptions If any of the assumptions are violated, potential inaccuracies may be present in the estimated regression

More information

Durham Research Online

Durham Research Online Durham Research Online Deposited in DRO: 15 April 2015 Version of attached le: Accepted Version Peer-review status of attached le: Peer-reviewed Citation for published item: Wood, R.E. and Goodman, J.S.

More information

Measuring and Assessing Study Quality

Measuring and Assessing Study Quality Measuring and Assessing Study Quality Jeff Valentine, PhD Co-Chair, Campbell Collaboration Training Group & Associate Professor, College of Education and Human Development, University of Louisville Why

More information

You must answer question 1.

You must answer question 1. Research Methods and Statistics Specialty Area Exam October 28, 2015 Part I: Statistics Committee: Richard Williams (Chair), Elizabeth McClintock, Sarah Mustillo You must answer question 1. 1. Suppose

More information

Structural Equation Modeling (SEM)

Structural Equation Modeling (SEM) Structural Equation Modeling (SEM) Today s topics The Big Picture of SEM What to do (and what NOT to do) when SEM breaks for you Single indicator (ASU) models Parceling indicators Using single factor scores

More information

Detection of Unknown Confounders. by Bayesian Confirmatory Factor Analysis

Detection of Unknown Confounders. by Bayesian Confirmatory Factor Analysis Advanced Studies in Medical Sciences, Vol. 1, 2013, no. 3, 143-156 HIKARI Ltd, www.m-hikari.com Detection of Unknown Confounders by Bayesian Confirmatory Factor Analysis Emil Kupek Department of Public

More information

Impact of an equality constraint on the class-specific residual variances in regression mixtures: A Monte Carlo simulation study

Impact of an equality constraint on the class-specific residual variances in regression mixtures: A Monte Carlo simulation study Behav Res (16) 8:813 86 DOI 1.3758/s138-15-618-8 Impact of an equality constraint on the class-specific residual variances in regression mixtures: A Monte Carlo simulation study Minjung Kim 1 & Andrea

More information

EPSE 594: Meta-Analysis: Quantitative Research Synthesis

EPSE 594: Meta-Analysis: Quantitative Research Synthesis EPSE 594: Meta-Analysis: Quantitative Research Synthesis Ed Kroc University of British Columbia ed.kroc@ubc.ca March 28, 2019 Ed Kroc (UBC) EPSE 594 March 28, 2019 1 / 32 Last Time Publication bias Funnel

More information

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison Group-Level Diagnosis 1 N.B. Please do not cite or distribute. Multilevel IRT for group-level diagnosis Chanho Park Daniel M. Bolt University of Wisconsin-Madison Paper presented at the annual meeting

More information

Choose an approach for your research problem

Choose an approach for your research problem Choose an approach for your research problem This course is about doing empirical research with experiments, so your general approach to research has already been chosen by your professor. It s important

More information