An Alternative Way of Establishing Measurement in Marketing Research Its Implications for Scale Development and Validity


Thomas Salzberger
University of Economics and Business Administration, Vienna (WU-Wien)

Abstract

Quantitative consumer and marketing research is looking back on an era of construct operationalization predominantly based on classical test theory as a technical framework of scale development. Rasch measurement theory provides an alternative framework of measurement. Previous studies demonstrated the potential of Rasch measurement for marketing research from a theoretical viewpoint, and reported applications of Rasch measurement models to existing marketing scales. This paper focuses on the fact that the Rasch model explicitly accounts for the different amounts of the construct that respondents need in order to agree with different items. Each item is characterized by an item parameter, i.e. the item location, that expresses the amount of the property to be measured that the item stands for. Whereas the foundation of the Rasch model, i.e. specific objectivity, provides evidence of construct validity of a scale that fits the model, the range of item locations spans the latent dimension and gives insight into the meaning of different levels of the construct and thereby adds to content validity. An empirical example shows that applying Rasch models to existing scales does not reveal the full potential of the model, even if a comprehensive, albeit classical, item pool is referred to. Consequently, only newly generated items are likely to span a wide range of item locations and thus provide content validity.

Introduction

When developing instruments to measure latent constructs in marketing and consumer behaviour research, the procedure suggested by Churchill (1979) has been routinely applied, and it has been adopted by most of the textbooks in marketing research. This procedure rests on the basics of classical test theory (CTT; Lord and Novick 1968).
During the last decade, an alternative framework of measurement has been introduced to marketing research (e.g. Soutar et al. 1990, Soutar and Cornish-Ward 1997, Soutar and Ryan 1999, Salzberger et al. 1999, Balasubramanian and Kamakura 1989, Singh et al. 1990, Singh 1996). From a general perspective, this alternative measurement theory may be referred to as item response theory (IRT; Lord 1980) or latent trait theory (LTT). A special class of models, the family of Rasch models, stands out, though, by featuring special properties which more general models do not share. This paper does not offer a comprehensive introduction to LTT models in general or Rasch models in particular (see, e.g., Andrich 1988a) but focuses on specific consequences for the process of scale development in line with Rasch measurement theory (RMT).

The Principles of Latent Trait Theory

Since RMT is beyond the mainstream paradigm of construct operationalization in marketing research and, consequently, most marketing scholars are not fully familiar with it, a short introduction is provided highlighting the most fundamental differences between RMT and classical approaches. The classical approach is based essentially on the principle of correlation. From a comprehensive pool of items, those are retained that show high loadings in factor analysis and contribute to reliability, i.e. whose exclusion would decrease reliability. Both criteria require high item inter-correlations. This approach entails some theoretical drawbacks. First of all, it is not explained how an item score actually comes about. Rather, the item score is treated immediately as an error-contaminated measurement. Secondly, to be meaningful, a correlation coefficient requires scale properties of the item scores, i.e. interval scale level in most cases, that are more than questionable and not testable in practice. Thirdly, correlations are affected by the distribution of the respondents.
Consequently, a different sample of respondents is very likely to yield a different picture. Finally, the limited range of actually possible item scores, e.g. 1 to 5, has an important impact on the correlation of two items: due to floor and ceiling effects, only those items that have similar means may show high item inter-correlations.

LTT models proceed on a totally different rationale. Rather than correlating manifest item scores, LTT models attempt to explain how a particular item score comes about. While classical approaches focus on summary statistics, i.e. variances and correlations, LTT refers primarily to individual responses. Which parameters are conceptualized in order to explain the respondents' answer behaviour depends on the particular type of LTT model.

However, there are two parameters all models have in common. The first one refers to the respondent's amount of the property to be measured, which is the ultimate goal of measuring. The second one parallels the first one but stands for the item's amount of the property. These parameters, along with others depending on the model, govern the answer behaviour. Classical approaches simply treat manifest item scores as meaningful data provided the scores from items measuring the same dimension show substantial correlations. LTT models examine whether empirical response patterns make sense and whether these patterns may be explained by item and person parameters, in other words, whether these patterns constitute measurement. To this end, items have to vary in the amount of the property so that likely and unlikely patterns of response can be determined. Furthermore, a wide range of items provides insight into what various levels of the property actually mean.

Foundations of Rasch Measurement Theory (RMT)

Although the Rasch model (Rasch 1960/1980, see figure 1) shares some features with other LTT models, it has some unique properties, i.e. specific objectivity and raw score sufficiency and their consequences, that other models do not have, e.g. Birnbaum's logistic models for dichotomous items (Birnbaum 1968) and generalizations for polytomous items like the graded response model (Samejima 1969). It is these unique features that make up the measurement theory underlying the Rasch model, and all models adhering to these principles are termed Rasch models.
Figure 1: The Dichotomous Rasch Model (Rasch 1960/80, p.187): Parametrization and Graphical Depiction

$$P(a_{vi} = 1) = \frac{e^{\beta_v - \delta_i}}{1 + e^{\beta_v - \delta_i}}, \qquad P(a_{vi} = 0) = \frac{1}{1 + e^{\beta_v - \delta_i}}$$

with:
$\beta_v$ ... person location parameter
$\delta_i$ ... item location parameter
$a_{vi}$ ... answer of person v to item i (0 = disagree, 1 = agree)

[Graph: item characteristic curve (ICC), plotting $P(a_{vi} = 1 \mid \delta_i = 0, \beta_v)$ against the person location $\beta_v$.]

All LTT models follow the concept of a latent dimension of the respondents' degree of the property to be measured. Respondents are scaled onto this dimension in terms of their attitude, satisfaction, propensity to buy, or whatever the property may be. The items are scaled onto this dimension as well, i.e. a common dimension of respondents and items is established. The parameter characterizing the item's location on the latent dimension expresses the amount of the property the item stands for. The Rasch measurement model then defines the probability that a given respondent agrees with a given item characterized by its location. Each item may be represented graphically by a curve, the item characteristic curve (ICC), depicting the probability of agreement depending on the respondent's location (see figure 1). In contrast to CTT based parameters of item location (the simple proportion of people agreeing with an item, or the more sophisticated item intercept in factor analysis), the item location within the Rasch model is sample independent provided the data fit the model.

The basic principle of the Rasch model is the principle of objectivity. Objectivity in this context means that a respondent's location must not depend on the specific items answered and, vice versa, an item's location must not depend on the specific respondents. Rasch (1960/1980) called this principle specific objectivity and deduced the model that follows necessarily (see also Fischer 1995). Only under the Rasch model is the unweighted raw score for respondents and items a sufficient statistic, i.e. the specific response patterns do not provide additional information.
Consequently, maximum likelihood estimation of the parameters may be conditioned on these scores, and any assumptions concerning the distribution of the respondents are no longer necessary (see, e.g., Molenaar 1995, for parameter estimation techniques). The person and item parameters have interval scale properties. The unit of the scale is defined by the common item discrimination implicitly set to one, while the origin of the scale is usually defined by constraining the mean of the item parameters to zero.
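The sufficiency of the raw score can be illustrated numerically. The following is a minimal sketch, not part of the original study: the item locations and the response pattern are hypothetical, and the function names are chosen for illustration only. It computes the probability of a response pattern given its raw score and shows that this conditional probability does not change with the person location, which is what allows estimation to be conditioned on the raw scores.

```python
import math
from itertools import product


def rasch_prob(beta: float, delta: float) -> float:
    """Probability of agreement (a_vi = 1) under the dichotomous Rasch model."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))


def pattern_prob(pattern, beta, deltas):
    """Probability of a complete 0/1 response pattern, assuming local independence."""
    prob = 1.0
    for answer, delta in zip(pattern, deltas):
        p = rasch_prob(beta, delta)
        prob *= p if answer == 1 else 1.0 - p
    return prob


def conditional_pattern_prob(pattern, beta, deltas):
    """Probability of a pattern given its raw score (the number of agreements)."""
    score = sum(pattern)
    # All patterns over the same items that yield the same raw score.
    same_score = [q for q in product((0, 1), repeat=len(deltas)) if sum(q) == score]
    total = sum(pattern_prob(q, beta, deltas) for q in same_score)
    return pattern_prob(pattern, beta, deltas) / total


deltas = [-1.0, 0.0, 1.5]  # hypothetical item locations in log-units
pattern = (1, 1, 0)        # agreement with the two easier items only

for beta in (-2.0, 0.0, 2.0):
    unconditional = pattern_prob(pattern, beta, deltas)
    conditional = conditional_pattern_prob(pattern, beta, deltas)
    # The unconditional probability varies with beta, the conditional one does not.
    print(f"beta={beta:+.1f}  P(pattern)={unconditional:.4f}  P(pattern | score)={conditional:.4f}")
```

The unconditional probability of the pattern depends strongly on the person location, whereas the conditional probability depends on the item locations only.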

As marketing research mostly employs multicategorical item scales (i.e., widely applied rating scales), the dichotomous Rasch model (as in figure 1) is not applicable. However, the Rasch model may be generalized for polytomous items in a straightforward way without losing its key property, i.e. specific objectivity. The two most important models are the rating scale model (Andrich 1978) and the partial credit model (Masters 1982, Andrich 1988b, see figure 2). In the numerator, there is, as in the dichotomous model, the difference between the person location $\beta_v$ and the item mean location $\delta_i$. A positive difference contributes to a higher probability of agreement. The difference is multiplied by the score of the category because, e.g., choosing category 3 requires passing threshold 1 and threshold 2, which are theoretically independent. Furthermore, there is the negative of the sum of thresholds $\tau_{ij}$ in the numerator. Thus, the higher the thresholds, the lower the numerator and, consequently, the lower the probability of choosing an affirmative category. The denominator is simply the sum of all numerators, i.e. the numerators of all category probabilities, to ensure that all probabilities add up to one.

Both the partial credit model and the rating scale model may be derived by applying the dichotomous Rasch model repeatedly to adjacent categories of polytomous items. Between any pair of adjacent categories a threshold parameter is modelled. Consequently, k answer categories call for k-1 threshold parameters. In the following we will concentrate on the rating scale model, which assumes a uniform scale across items, i.e. equal threshold distances across items but not necessarily within items.
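The verbal description of the numerator and the denominator can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the estimation routine used in the study: the function name `category_probs` and the threshold values are hypothetical, and the thresholds are assumed to be centred on the item mean location. Under the rating scale model the same threshold list would be used for every item.

```python
import math


def category_probs(beta, delta, thresholds):
    """Category probabilities for scores 0..m of one polytomous Rasch item.

    beta       : person location
    delta      : item mean location
    thresholds : tau_1 .. tau_m, one threshold between each pair of adjacent categories
    """
    m = len(thresholds)
    numerators = [1.0]  # score 0: no threshold passed, exponent is zero
    for x in range(1, m + 1):
        # x times the person-item difference minus the sum of the first x thresholds
        exponent = x * (beta - delta) - sum(thresholds[:x])
        numerators.append(math.exp(exponent))
    denominator = sum(numerators)  # ensures that the probabilities add up to one
    return [n / denominator for n in numerators]


# Hypothetical thresholds for a five-category rating scale (four thresholds).
thresholds = [-1.5, -0.5, 0.5, 1.5]

for beta in (-1.0, 0.0, 1.0):
    probs = category_probs(beta, delta=0.0, thresholds=thresholds)
    print(beta, [round(p, 3) for p in probs])
```

With a single threshold of zero the function reduces to the dichotomous model, which reflects the derivation of the polytomous models from repeated application of the dichotomous Rasch model to adjacent categories.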
Figure 2: General Polytomous Rasch Model (Andrich, 1988b, p.366)

$$P(a_{vi} = x \mid \beta_v, \tau_{ij}, j = 1 \ldots m, 0 < x \le m) = \frac{e^{-\sum_{j=1}^{x} \tau_{ij} + x(\beta_v - \delta_i)}}{\Upsilon}$$

with:

$$\Upsilon = 1 + \sum_{k=1}^{m} e^{-\sum_{j=1}^{k} \tau_{ij} + k(\beta_v - \delta_i)}$$

$\beta_v$ ... person v location parameter
$\delta_i$ ... item i location parameter
$\tau_{ij}$ ... threshold j of item i parameter
$m$ ... maximum score, number of categories - 1
$a_{vi}$ ... answer of person v to item i (item score)

Andrich (1995a, 1995b) pointed out that, because polytomous Rasch models estimate the threshold parameters independently of each other, the empirical threshold parameters may or may not reflect the order that is hypothesized when setting up a polytomous answer scale. If the empirical threshold estimates are not properly ordered, i.e. they are reversed, the scale does not really work as intended and, in fact, lacks ordinal properties. In this case, adjacent categories should be collapsed, i.e. the scoring function assigns the same numbers to adjacent categories. However, further data have to be collected in order to cross-validate the new scale format.

The most important features of the Rasch model may be summarized as follows:
- the model provides a theory of how measurement is accomplished based on the principle of specific objectivity, namely by a comparison of an item and a person in the empirical domain, thereby establishing an interval scale for item and person parameters;
- the model may be falsified empirically, assessed by various tests of fit (which go beyond the scope of this paper);
- the model defines only one dimension, i.e. it rests on the prerequisite of unidimensionality; this prerequisite is subject to empirical falsification, however; the principle of local stochastic independence, i.e. the answer to one item is independent of the answer to a different item given the person parameter, is closely related to unidimensionality (see, e.g., Gustafsson 1980);
- the answer scale of a polytomous item is hypothesized to have ordinal properties, which are subject to empirical falsification (reversed thresholds);
- for any person a specific answer pattern is expected to occur most likely (i.e. agreement with all items standing for less of the property than the person has, disagreement with all other items), offering opportunities to test for person fit.
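The check for reversed thresholds and the subsequent collapsing of adjacent categories can be sketched as follows. The threshold estimates and the scoring function below are hypothetical and serve only to show the mechanics; which categories should be collapsed in a real application is a substantive decision that the sketch does not make.

```python
def thresholds_ordered(thresholds):
    """Return True if every threshold estimate exceeds the preceding one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


def rescore(responses, scoring):
    """Apply a scoring function that maps raw category indices to item scores."""
    return [scoring[r] for r in responses]


# Hypothetical threshold estimates for a seven-category item (six thresholds).
# The second/third and the fourth/fifth estimates are reversed.
estimates = [-1.8, -0.4, -0.9, 0.3, 0.2, 1.6]
print("properly ordered:", thresholds_ordered(estimates))

# Hypothetical scoring function: two pairs of adjacent categories receive the
# same score, which turns the seven-category format into five score levels.
scoring = [0, 1, 1, 2, 3, 3, 4]

raw_responses = [0, 2, 3, 5, 6]  # category indices 0..6 as originally recorded
print(rescore(raw_responses, scoring))
```

After rescoring, the model would have to be re-estimated and the new thresholds checked again, ideally with fresh data as noted above.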

The Application of the Rasch Model and Its Consequences For The Scale Development Process

The question arises whether the application of the Rasch model may simply be seen as a technique of analysis to be carried out instead of or parallel to classical techniques of scale analysis. From the perspective of item generation, the Rasch model and classical analysis differ substantially. While the classical approach requires covering as many facets of the construct as possible, the Rasch model additionally requires considering different levels of the construct to be measured. Consequently, the classical approach to scale development usually does not account for varying degrees of the property and is not very likely to provide a foundation for establishing a useful Rasch scale. Thus, the application of the Rasch model is more than an alternative way of mere data analysis. Rasch measurement represents a different philosophy of construct operationalization. It aims at developing a kind of ruler, with the items representing the marks against which the respondents are checked. It provides a superior foundation for assessing content validity as well as construct validity since it gives insight into what various levels of the construct actually mean. A mere re-analysis of an existing scale is a priori not very likely to yield a Rasch scale with a wide range of item locations. The empirical example examines whether, in order to establish a Rasch scale, it is sufficient to go back to the original comprehensive item pool underlying the development of a widely used marketing scale, the CETSCALE, and, if it is not, how the content of the scale may be extended by including additional items.

Empirical Example

The CETSCALE (Shimp and Sharma 1987) has gained much popularity in consumer research since its introduction, as is demonstrated by the multiplicity of applications in national as well as in cross-national marketing research (e.g. Herche 1992, Netemeyer et al.
1991, Durvasula et al. 1997, Good and Huddleston 1995, Steenkamp and Baumgartner 1998). The idea of consumer ethnocentric tendencies transfers the sociological concept of ethnocentrism to marketing and consumer research in that it focuses on the attitude towards foreign economies and their products as opposed to one's own domestic economy. Both the general level of a nation's consumer ethnocentric tendencies and the level within segments of consumers relevant to a company are obviously important for corporate location policy, product mix decisions, and corporate communication strategy.

The CETSCALE Data Set

Data were collected in Austria (n=974 listwise non-missing respondents, self-administered interviews) based on a translated version of the whole set of 100 items that remained in the item pool after a judgmental panel screening of the 180 items originally generated to develop the CETSCALE (Shimp and Sharma 1987, Sinkovics 1999). The items' seven-point scale provides categories labelled as follows: fully disagree, partly disagree, somewhat disagree, neither disagree nor agree, somewhat agree, partly agree, and fully agree.

Rasch Based Analysis

Using a data set restricted to the original 17 CETSCALE items, Salzberger (1999) showed that six of these items may indeed be scaled successfully applying the Rasch model for polytomous data (rating scale model). Some of the thresholds were reversed, however, so two pairs of adjacent categories had to be collapsed, leading to a five-point rating scale. The range of item parameters amounted to a mere [...] log-units, with five items within approximately 0.2 log-units. Consequently, these items do not yield a profound understanding of the latent construct that goes beyond the expectation that the higher the ethnocentric tendencies, the higher the probability of agreement with nearly all items in the same way. The current analysis built upon these results.
It started with a conventional factor analysis (principal axis factoring) in order to ensure unidimensionality. As a cutoff criterion a factor loading of .3 was chosen, which is rather small compared to CTT standards. The reason is that the correlation of the item and the factor may be reduced due to scale bounding effects, especially if the item is extraordinarily easy or hard to endorse. The remaining 65 items were analysed using the partial credit model implemented in RUMM 2.7 (Sheridan et al. 1997) as a rough screening of suitability; 25 items were retained. Subsequently, these items were analysed using the rating scale model. In line with the results of Salzberger (1999), the original seven-point Likert scale had to be transformed to a five-point rating scale in order to achieve a proper order of thresholds. At each step of parameter estimation the worst significantly misfitting item (alpha = .001) in terms of a chi-square test of fit provided by RUMM 2.7, which compares model predicted probability and actual response

behaviour, has been deleted. Ultimately, a scale was derived containing ten items fitting the model. The established rating scale is very similar to that reported by Salzberger (1999), i.e. the threshold distances are almost identical. The same applies to the item location parameters as the mean of the thresholds of each item (detailed results are available upon request from the author).

The striking outcome of the current analysis, however, is the fact that widening the base of the analysis from 17 to 100 items resulted in a mere four additional items fitting the model, yielding only a small increase in the range of item locations from [...] to [...] log-units. While ten items might in principle suffice for most applications, the small range of item locations leads to two different problems. First, it increases the measurement error for respondents who do not fall into this small area, i.e. the area of the item locations considering the thresholds. (The measurement error for a specific person depends on the item information, which reaches a maximum when person and item location coincide.) Second, content validity is limited to the number of facets of the construct covered by the items. It remains unclear, however, what a certain degree of ethnocentric tendencies actually means. If there were a broader range of item locations, any non-extreme area on the scale would be associated with specific items agreed with and others disagreed with.

It should be noted that from the viewpoint of CTT the small range of item locations does not represent a severe problem at all. In fact, the whole item pool has proved to be designed for CTT based analyses. Consequently, following the LTT approach of measurement means more than (re-)analysing a data set whose creation has been guided by a different measurement paradigm, i.e. CTT. The measurement theory adhered to has a significant impact on the items generated and, eventually, on the data collected.
In other words, data are, at least in part, determined by the measurement paradigm chosen.

Extending the CETSCALE

Given the small range of item locations, a preliminary follow-up study aimed at widening the range of items in terms of the amount of the property they stand for. To this end, 14 additional items were generated as an extension of the CETSCALE to cover the positive and negative extremes of the construct. The answer scale was confined to five categories. Based on a small convenience sample (n=80), these items were analysed together with 19 items stemming from the CETSCALE item pool to evaluate their locations. 26 items proved to fit the model. 16 items came from the CETSCALE item pool, ensuring that the basic concept to be measured stays the same. The other ten items, which were newly generated, successfully widened the range of item locations. The items of the extended CETSCALE differ by as much as [...] log-units in their locations (detailed results and a list of the items are available upon request from the author).

A Classical Re-analysis of the Extended Scale

In order to contrast the results with those based on the classical paradigm of scale development, a re-analysis using principal components analysis was carried out. The 26 items of the final Rasch scale yield a unidimensional set of indicators with a mean loading of .687 (ranging from .55 to .86) and a scale reliability of .96 (Cronbach's alpha). Thus, the scale developed by Rasch modelling proves tenable from the classical perspective of scale development. Certainly, the number of items has to be reduced for practical purposes. However, the approaches differ significantly in the way the number of items would be reduced. The Rasch approach would drop items based on their locations, i.e. for any region of the latent dimension at least one item has to be retained. Thereby the range of item locations would not be reduced since the extreme items would certainly be kept in the instrument.
In contrast, the classical approach would discard items showing loadings below average. Not surprisingly, almost all of the extreme items show loadings below average. Consequently, the classical approach of item selection would lead to a narrower instrument in terms of item locations.

Implications and Conclusions

From a theoretical viewpoint, the Rasch measurement approach has the potential to lift measurement in consumer and marketing research to a higher level and provide a better foundation for managerial decision making. It provides a powerful foundation for assessing content and construct validity. The application of Rasch models to existing marketing scales is a good starting point for further dissemination. However, in the long run Rasch measurement should guide us from the beginning of scale development. The first step of the scale development process as outlined by Churchill (1979) should not be restricted to domain specification in terms of aspects to be considered but should also aim at covering a range of the construct as wide as possible.
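The effect of the range of item locations on measurement precision can be made concrete with a small numerical sketch. It is a simplification under stated assumptions and not a re-analysis of the CETSCALE data: it uses dichotomous items (for which the item information is p(1-p)), the two sets of item locations are hypothetical, and the standard error of a person estimate is approximated by the inverse square root of the summed item information.

```python
import math


def rasch_prob(beta, delta):
    """Probability of agreement under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))


def item_information(beta, delta):
    """Information of a dichotomous Rasch item; largest (0.25) when beta equals delta."""
    p = rasch_prob(beta, delta)
    return p * (1.0 - p)


def standard_error(beta, deltas):
    """Approximate standard error of the person estimate at location beta."""
    information = sum(item_information(beta, d) for d in deltas)
    return 1.0 / math.sqrt(information)


# Hypothetical instruments with five items each (locations in log-units).
narrow = [-0.2, -0.1, 0.0, 0.1, 0.2]  # items clustered around the scale origin
wide = [-3.0, -1.5, 0.0, 1.5, 3.0]    # items spread along the latent dimension

for beta in (0.0, 1.5, 3.0):
    print(f"beta={beta:.1f}  SE narrow={standard_error(beta, narrow):.2f}  "
          f"SE wide={standard_error(beta, wide):.2f}")
```

In this sketch the narrow instrument is more precise for respondents located near the cluster of items, whereas the wide instrument is more precise for respondents far from the origin, which is the trade-off behind retaining extreme items when shortening a scale.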

A Rasch scale with widely varying item locations provides a deeper understanding of the construct, i.e. what it means for respondents to be located at a certain position on the latent dimension. Moreover, it is the prerequisite of precise measurement over a wide range, since measurement error increases strongly when items are off-target for the respondents.

References

Andrich, David (1978), A Rating Formulation for Ordered Response Categories, Psychometrika, 43 (4).
Andrich, David (1988a), Rasch Models for Measurement, Sage University Paper Series on Quantitative Applications in the Social Sciences 68, Beverly Hills: Sage.
Andrich, David (1988b), A General Form of Rasch's Extended Logistic Model for Partial Credit Scoring, Applied Measurement in Education, 1 (4).
Andrich, David (1995a), Models for Measurement, Precision and the Non-Dichotomization of Graded Responses, Psychometrika, 60 (1).
Andrich, David (1995b), Further Remarks on the Non-Dichotomization of Graded Responses, Psychometrika, 60 (1).
Balasubramanian, Siva K. and Wagner A. Kamakura (1989), Measuring Consumer Attitudes Toward the Marketplace With Tailored Interviews, Journal of Marketing Research, 26 (3).
Birnbaum, Allan (1968), Some Latent Trait Models and Their Use in Inferring an Examinee's Ability, in: Statistical Theories of Mental Test Scores, Chapters 17-20, Eds. Frederic Lord and Melvin R. Novick, Reading (Mass.): Addison-Wesley.
Churchill, Gilbert A. (1979), A Paradigm for Developing Better Measures of Marketing Constructs, Journal of Marketing Research, 16.
Durvasula, Srinivas, Craig J. Andrews and Richard G. Netemeyer (1997), A Cross-Cultural Comparison of Consumer Ethnocentrism in the United States and Russia, Journal of International Consumer Marketing, 9 (4).
Fischer, Gerhard H. (1995), Derivations of the Rasch Model, in: Rasch Models, Foundations, Recent Developments, and Applications, Eds. Gerhard H. Fischer and Ivo W. Molenaar, New York: Springer.
Good, Linda and Patricia Huddleston (1995), Ethnocentrism of Polish and Russian Consumers: Are Feelings and Intentions Related?, International Marketing Review, 12 (5).
Gustafsson, Jan-Eric (1980), Testing and Obtaining Fit of Data to the Rasch Model, British Journal of Mathematical and Statistical Psychology, 32.
Herche, Joel (1992), A Note on the Predictive Validity of the CETSCALE, Journal of the Academy of Marketing Science, 20 (3).
Lord, Frederic M. (1980), Applications of Item Response Theory to Practical Testing Problems, Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Lord, Frederic M. and Melvin R. Novick (1968), Statistical Theories of Mental Test Scores, Reading (Mass.): Addison-Wesley.
Masters, Geoffrey N. (1982), A Rasch Model for Partial Credit Scoring, Psychometrika, 47 (2).
Molenaar, Ivo W. (1995), Estimation of Item Parameters, in: Rasch Models, Foundations, Recent Developments, and Applications, Eds. Gerhard H. Fischer and Ivo W. Molenaar, New York: Springer.
Netemeyer, Richard G., Srinivas Durvasula and Donald R. Lichtenstein (1991), A Cross-National Assessment of the Reliability and Validity of the CETSCALE, Journal of Marketing Research, 28 (3).
Rasch, Georg (1960/1980), Probabilistic Models for Some Intelligence and Attainment Tests, Chicago: MESA Press. Reprint of the original publication in 1960 by the Danish Institute for Educational Research.
Salzberger, Thomas (1999), How the Rasch Model May Shift Our Perspective of Measurement in Marketing Research, Paper presented at the 1999 Australia and New Zealand Marketing Academy Conference (ANZMAC), Sydney.
Salzberger, Thomas, Rudolf Sinkovics and Bodo B. Schlegelmilch (1999), Data Equivalence in Cross-Cultural Research: A Comparison of Classical Test Theory and Latent Trait Theory Based Approaches, Australasian Marketing Journal, 7 (2).
Samejima, Fumiko (1969), Estimation of Latent Ability Using a Response Pattern of Graded Responses, Psychometric Monograph, 17, Iowa City (IA): Psychometric Society.
Sheridan, Barry, David Andrich and Guanzhong Luo (1997), User's Guide to RUMM Rasch Unidimensional Measurement Models, Perth: RUMM Laboratory.
Shimp, Terence A. and Subhash Sharma (1987), Consumer Ethnocentrism: Construction and Validation of the CETSCALE, Journal of Marketing Research, 24 (3).
Singh, Jagdip (1996), A Latent Trait Theory Approach to Measurement Issues in Marketing Research: Principles, Relevance and Application, Proceedings of the EMAC Annual Conference, Budapest University of Economic Sciences, Vol. 1, Eds. József Berács, András Bauer and Judith Simon.
Singh, Jagdip, Roy D. Howell and Gary K. Rhoads (1990), Adaptive Designs for Likert-Type Data: An Approach for Implementing Marketing Surveys, Journal of Marketing Research, 27 (3).
Sinkovics, Rudolf R. (1999), Ethnozentrismus und Konsumentenverhalten [Ethnocentrism and Consumer Behaviour], Wiesbaden: Deutscher Universitätsverlag.
Soutar, Geoffrey N., Richard Bell and Yvonne Wallis (1990), Consumer Acquisition Patterns for Durable Goods: A Rasch Analysis, Asia Pacific International Journal of Marketing, 2 (1).
Soutar, Geoffrey N. and Steven P. Cornish-Ward (1997), Ownership Patterns for Durable Goods and Financial Assets: A Rasch Analysis, Applied Economics, 29.
Soutar, Geoffrey N. and Maria M. Ryan (1999), People's Leisure Activities: A Logistic Modelling Approach, Paper presented at the 1999 Australia and New Zealand Marketing Academy Conference (ANZMAC), Sydney.
Steenkamp, Jan-Benedict E.M. and Hans Baumgartner (1998), Assessing Measurement Invariance in Cross-National Consumer Research, Journal of Consumer Research, 25.
