In search of abstraction: The varying abstraction model of categorization

Psychonomic Bulletin & Review, 2008, 15 (4), 732-749. doi:10.3758/PBR.15.4.732

Wolf Vanpaemel and Gert Storms
University of Leuven, Leuven, Belgium

A longstanding debate in the categorization literature concerns representational abstraction. Generally, when exemplar models, which assume no abstraction, have been contrasted with prototype models, which assume total abstraction, the former models have been found to be superior to the latter. Although these findings may rule out the idea that total abstraction takes place during category learning and instead suggest that no abstraction is involved, the idea of abstraction retains considerable intuitive appeal. In this article, we propose the varying abstraction model of categorization (VAM), which investigates the possibility that partial abstraction may play a role in category learning. We apply the VAM to four previously published data sets that have been used to argue that no abstraction is involved. Contrary to the previous findings, our results provide support for the idea that some form of partial abstraction can be used in people's category representations.

A classic question in cognitive psychology concerns what is stored as a consequence of learning a category, and hence what information people rely on when they make a categorization decision. It is generally assumed that learning a category involves the generation of a category representation and that assigning a novel object to a category involves the comparison of the object to that category representation. However, one of the most fundamental and unresolved issues in the categorization literature concerns the exact nature of this category representation. Although the debate on category learning and category representation has a very long history, in the past few decades it has centered on the question of whether people represent a category in terms of an abstracted summary or a set of specific examples. Early work argued for the prototype view of category learning. Under this view, on the basis of experience with the category examples, people abstract out the central tendency of a category. In other words, a category representation consists of a summary of all of the examples of the category, called the prototype (see, e.g., Posner & Keele, 1968; Reed, 1972; Smith & Minda, 2002). The initial success of this view has gradually declined in favor of the exemplar view, in which experience with examples of a category does not lead to the development of an abstracted prototype; instead, people simply store all of the examples they encounter. In other words, a category representation consists of all of the individual examples of the category, called the exemplars (Brooks, 1978; Estes, 1986; Medin & Schaffer, 1978; Nosofsky, 1986).

The shift from the prototype to the exemplar view was motivated by several arguments. A first empirical argument for this shift involved the demonstration that exemplar models can account for empirical phenomena that were believed to provide evidence for the prototype view (e.g., the prototype enhancement effect; Busemeyer, Dewey, & Medin, 1984). A second empirical argument involved overwhelming evidence that exemplar models yield fits superior to those of prototype models in a wide variety of experimental settings (see Nosofsky, 1992, for a review). The major argument against the prototype view, however, is that it fails to account for important aspects of human concept learning.
In particular, a prototype does not seem to retain enough information about the examples encountered in learning. For example, prototypes discard information on correlations among features (e.g., large spoons tend to be made of wood, and small spoons are likely to be made of steel) and on the variability among the examples (e.g., U.S. quarters display very little variability in their diameters, whereas pizzas can vary greatly in size). This feature of prototypes is inconsistent with experimental studies that have suggested that people are sensitive to such information and store more about a category than just its central tendency (e.g., Fried & Holyoak, 1984; Medin, Altom, Edelson, & Freko, 1982; Rips, 1989). As a consequence, the exemplar view is generally considered superior to the prototype view.

However, the exemplar view also has not gone unchallenged. The main concern raised against this approach is its lack of any cognitive economy (Rosch, 1978). Under the exemplar view, people are assumed to store every training example and retrieve every exemplar from memory every time an item is classified. Both of these claims seem counterintuitive and excessive. For example, when people decide that a dog is a mammal, it seems unlikely that they compare the dog to every single mammal they have ever encountered. The intuition that some cognitive economy is involved in category representations is confirmed by experimental findings suggesting that people store less information about a category than all of its members (Feldman, 2003).

In sum, the current theorizing on category representation involves a tension between informativeness and economy (Komatsu, 1992). A prototype representation has appealing economy but fails to provide the information people actually use. In contrast, an exemplar representation provides detailed information but is not economical. A natural way to resolve this tension would be to propose a representation that combines the benefits of both economy and informativeness. Such a representation would provide just enough representational information to describe the category structure in a sufficiently complete way.

Motivated by the appeal of such an intermediate representation, we propose the varying abstraction model (VAM; Vanpaemel, Storms, & Ons, 2005). It starts from the observation that the debate between the exemplar and prototype views can be usefully regarded as a debate on the use of abstraction in category representations. At the heart of the VAM is the idea that the category representations hypothesized by the exemplar and prototype views do not represent alternatives constituting a dichotomy, but rather correspond to the endpoints of a continuum: The exemplar representation corresponds to minimal abstraction, and the prototype representation corresponds to maximal abstraction. Between these endpoints, various new possible representations can be developed, balancing the opposing pressures of economy and informativeness. Such an intermediate representation would not consist of all exemplars, but neither would it consist of one single prototype. Instead, it would consist of a set of subprototypes formed by category members merging together. The intermediate representation would be less detailed and more economical than the exemplar representation, but more detailed and less economical than the prototype representation, corresponding to partial abstraction. On the basis of this extended class of representations, numerous categorization models can be developed, including the exemplar and prototype models. Crucially, all models of the VAM contrast only in their representational assumptions. Consequently, the VAM provides a simple framework for evaluating the idea that abstraction takes place during category learning.

The currently dominant practice when inferring the use of abstraction in category representations is to restrict the analysis to the extreme levels of abstraction, that is, to compare the prototype and exemplar representations only. This means that the wealth of intermediate representations corresponding to partial abstraction is overlooked. In light of the limitations of the extreme levels of abstraction, partial abstraction has considerable intuitive appeal. By formalizing the idea of partial abstraction, the VAM provides an alternative to the focus on exemplar and prototype representations only. It is important to highlight from the outset that the VAM is not intended as an improvement of the exemplar or of the prototype model. Rather, our intended contribution is to provide an improvement to the debate on abstraction in category representations, by providing a principled way to explore the use of abstraction in people's category representations.
Recently, a number of other authors have also proposed computational models that aim to go beyond the exemplar and prototype models. In particular, the rational model of categorization (RMC; Anderson, 1991), SUSTAIN (Love, Medin, & Gureckis, 2004), and the mixture models of categorization (MMC; Rosseel, 2002) share a starting point similar to that of the VAM. As will become clear in our General Discussion, the approach taken in the VAM differs in important ways from these earlier approaches. The main difference is that, unlike the other models, the VAM makes no strong assumptions about how representations arise, and therefore allows for a more general exploration of partial abstraction.

We organize our article by first reviewing briefly the best-known exemplar and prototype models. Next, we explain how the VAM positions these models as extremes on a continuum and formalizes models between the extremes. The VAM is then applied to four previously published data sets in order to evaluate the level of abstraction of people's category representations. Earlier analyses of these data sets failed to provide evidence in favor of abstraction. In contrast, the present VAM analysis shows that, for three of the four data sets, some form of abstraction took place during category learning. We also demonstrate that, for three data sets, our results are not caused by chance, because the different models encompassed by the VAM can be distinguished in a satisfactory way. Finally, we compare the VAM with related models and discuss some limitations and some possibilities for future research.

Review of the Exemplar and Prototype Models

Both the exemplar and the prototype models assume that an object is classified as a member of a category if it is judged to be sufficiently similar to that category. The distinguishing assumption between the models is the exact nature of the category representation. Prototype models assume that a category is represented abstractly by the central tendency of the known category members (i.e., the prototype). Categorization of an object depends on the relative similarity of the object to the prototypes of the relevant categories. By contrast, exemplar models assume that no abstraction is involved in category learning, but instead that a category is represented as the collection of its category members (i.e., the exemplars). Categorization of an object thus depends on the relative similarity of the object to all of the members of the relevant categories. In what follows, the formal descriptions of a widely tested exemplar model, the generalized context model (GCM; Nosofsky, 1986), and of its abstract counterpart, the multidimensional-scaling-based prototype model (MPM; Nosofsky, 1987; Reed, 1972), are reviewed.

Experimental Procedure

Category-learning tasks present people with stimuli and their accompanying category labels and require label prediction for novel stimuli. A typical artificial category-learning task involves learning a two-category structure over a small number of stimuli. A subset of the stimuli are assigned to Categories A and B, and the remaining stimuli are left unassigned. Most experiments consist of a training (or category-learning) phase followed by a test phase. During the training phase, only the assigned stimuli are presented. The participant classifies each presented stimulus into either Category A or B and receives corrective feedback following each response. During the test phase, both the assigned and unassigned stimuli are presented. The unassigned stimuli are not seen in training, so they are novel to the participant. Because the assigned stimuli are used as the training stimuli, they are the basis for the category representation.

Stimulus Representation

Both the GCM and the MPM assume that stimuli are represented as points in a multidimensional psychological space. Such a multidimensional representation is typically derived from identification confusion data (see, e.g., Nosofsky, 1987) or from similarity ratings (see, e.g., Shin & Nosofsky, 1992) using multidimensional scaling (MDS; Borg & Groenen, 1997; Lee, 2001). Once the stimuli are represented in a multidimensional space, the distances between the stimuli can be computed. There are several ways to compute the distance between a pair of stimuli (Ashby & Maddox, 1993). When $x_i = (x_{i1}, \ldots, x_{iD})$ denotes the coordinates of stimulus $x_i$ in a D-dimensional space, the most common expression for the distance between the stimuli $x_i$ and $x_j$ is

$$d(x_i, x_j) = \left[ \sum_{k=1}^{D} w_k \left| x_{ik} - x_{jk} \right|^{r} \right]^{1/r}. \quad (1)$$

Of crucial importance are the free parameters $w_k$, which model the psychological process of selective attention. The underlying motivation for this parameter is the assumption that when people are faced with a categorization task, they are inclined to focus on the dimensions that are relevant for the categorization task at hand and to ignore the ones that are irrelevant. In geometric terms, this mechanism of selective attention is represented in terms of stretching the space along the attended, relevant dimensions and shrinking the space along the unattended, irrelevant ones. As such, the parameters $w_k$ can modify the structure of the psychological space. Since the parameters are constrained by $0 \le w_k \le 1$ and $\sum_{k=1}^{D} w_k = 1$, they can be interpreted as the proportion of attention allocated to dimension k and are often termed the attention weights. The differential weighting of dimensions has been a critical component of the GCM (and the MPM) and has enabled it to account for human categorization behavior.

The metric r is not a free parameter, but rather depends on the type of dimensions that compose the stimuli. In particular, previous investigations have supported the use of the city-block metric (r = 1) when stimuli vary on separable dimensions and the Euclidean metric (r = 2) when they vary on integral dimensions (see Shepard, 1991, for a review).
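
As an illustration (ours, not code from the original article), Equation 1 amounts to a few lines of Python; numpy and all names below are our own choices:

```python
import numpy as np

def distance(x_i, x_j, w, r=2):
    """Attention-weighted Minkowski distance of Equation 1.

    x_i, x_j: coordinates of two stimuli in the D-dimensional
              psychological space (e.g., from an MDS solution).
    w:        attention weights, assumed non-negative and summing to 1.
    r:        r=1 gives the city-block metric (separable dimensions);
              r=2 gives the Euclidean metric (integral dimensions).
    """
    x_i, x_j, w = (np.asarray(a, dtype=float) for a in (x_i, x_j, w))
    return np.sum(w * np.abs(x_i - x_j) ** r) ** (1.0 / r)

# Attending to dimension 1 stretches the space along it: the same
# physical difference looks larger when that dimension is weighted.
print(distance([0.0, 0.0], [1.0, 0.0], w=[0.9, 0.1]))  # ~0.95
print(distance([0.0, 0.0], [1.0, 0.0], w=[0.1, 0.9]))  # ~0.32
```
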
Stimulus-to-Stimulus Similarity

Both the GCM and the MPM assume that similarity is a decreasing function of distance in the psychological space, implying that similar stimuli lie close together, whereas dissimilar stimuli lie far apart (see, e.g., Nosofsky, 1984). In particular, the similarity between the stimuli $x_i$ and $x_j$ is given by

$$s(x_i, x_j) = e^{-c \, d(x_i, x_j)^{\alpha}}. \quad (2)$$

In this equation, c is a free parameter called the sensitivity parameter. It runs from 0 to ∞ and determines the rate at which similarity declines with distance. A high value of c implies that only stimuli that lie very close to each other are considered similar, whereas a low value of c implies that all stimuli are at least somewhat similar to each other. Much as with the metric parameter r, the value of α depends on the nature of the stimuli and is not considered a free parameter. Two settings of the α parameter are prominent: α = 1, resulting in an exponential decay function, and α = 2, resulting in a Gaussian function. When the stimuli are readily discriminable, the exponential decay function seems to be the appropriate choice, whereas the Gaussian function is typically preferred when the stimuli are highly confusable (Shepard, 1987).

Stimulus-to-Category Similarity

Equation 2 can be used to compute the similarity of a stimulus to a certain category member. However, both models assume that a stimulus is classified according to its similarity to a category, not just to a category member. To go from stimulus-to-stimulus to stimulus-to-category similarity, a definition of a category is required. It is this assumption that distinguishes the GCM and the MPM. In the GCM, a category is represented by all of its members, so the similarity of stimulus $x_i$ to Category $C_J$ is computed by summing the similarity of $x_i$ to all of the category members of $C_J$:

$$\eta_{iJ} = \sum_{x_j \in C_J} s(x_i, x_j). \quad (3)$$

In contrast, in the MPM, a category is represented by the category prototype, denoted as $p_J$. As such, the similarity of $x_i$ to $C_J$ equals the similarity of $x_i$ to $p_J$:

$$\eta_{iJ} = s(x_i, p_J). \quad (4)$$

Although the prototype generally does not match a stimulus, it is treated formally as a stimulus; thus, $s(x_i, p_J)$ can be computed using Equations 2 and 1, given the coordinates of $p_J$. Since the prototype is the central tendency of all of the category members, the coordinates of $p_J$ are simply the averaged coordinates of all of the $n_J$ members of $C_J$:

$$p_{Jk} = \frac{1}{n_J} \sum_{x_j \in C_J} x_{jk}. \quad (5)$$
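
Continuing the sketch (again ours, not the authors' code), Equations 2-5 translate directly; the only difference between the two models is what the stimulus is compared against:

```python
import numpy as np

def similarity(x_i, x_j, w, c, r=2, alpha=1):
    """Equation 2: s(x_i, x_j) = exp(-c * d(x_i, x_j)**alpha),
    with d() the attention-weighted distance of Equation 1."""
    x_i, x_j, w = (np.asarray(a, dtype=float) for a in (x_i, x_j, w))
    d = np.sum(w * np.abs(x_i - x_j) ** r) ** (1.0 / r)
    return np.exp(-c * d ** alpha)

def eta_gcm(x_i, members, w, c):
    """Equation 3: summed similarity of x_i to every stored exemplar."""
    return sum(similarity(x_i, x_j, w, c) for x_j in members)

def eta_mpm(x_i, members, w, c):
    """Equations 4 and 5: similarity of x_i to the averaged prototype."""
    prototype = np.mean(np.asarray(members, dtype=float), axis=0)  # Equation 5
    return similarity(x_i, prototype, w, c)
```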

Response Rule

Both the GCM and the MPM assume that a categorization decision is governed by the Luce choice rule. Given M relevant categories, the probability of categorizing stimulus $x_i$ in Category $C_J$ is then

$$p_{iJ} = \frac{\beta_J \eta_{iJ}}{\sum_{K=1}^{M} \beta_K \eta_{iK}}. \quad (6)$$

In this equation, every $\beta_K$ is a free parameter, ranging from 0 to 1 and satisfying the constraint $\sum_{K=1}^{M} \beta_K = 1$. It is interpreted as the response bias toward Category $C_K$. The response rule of Equation 6 is the one proposed in Nosofsky's (1986) original formalization of the GCM. Ashby and Maddox (1993) later generalized the response rule into

$$p_{iJ} = \frac{(\beta_J \eta_{iJ})^{\gamma}}{\sum_{K=1}^{M} (\beta_K \eta_{iK})^{\gamma}}. \quad (7)$$

This generalization involves the inclusion of an additional free parameter, γ, termed the response-scaling parameter. It runs from 0 to ∞ and reflects the amount of determinism in responding. Values of γ larger than 1 reflect greater levels of determinism than are produced by Equation 6, and values of γ less than 1 reflect less determinism. Obviously, the generalized response rule reduces to the original response rule when γ = 1. The version of the GCM using this modified response rule is commonly referred to as the GCM-γ.

The VAM

The GCM and the MPM are identical to each other in terms of their assumptions about stimulus representation, selective attention, similarity, and response rule. The assumption that distinguishes the two models is the category representation. Clearly, other representational possibilities can be hypothesized besides those considered by the GCM and the MPM. In this section, we develop a formal model of categorization that shares all of the common assumptions of the GCM and MPM but goes beyond these models in terms of the category representation.

Beyond the Exemplar and Prototype Representations

At the heart of the VAM is the idea that the exemplar and the prototype representations are the extreme endpoints on a continuum of abstraction. Along this continuum, positions between the extremes are held by representations in which an intermediate level of abstraction is assumed. Such an intermediate representation consists of a set of subprototypes formed by abstracting across a subset of category members. In particular, a category representation arises by partitioning the category members into clusters and then averaging across the instances in each cluster. Crucially, in this way, the exemplar representation, the prototype representation, and a wealth of intermediate representations can be constructed. This is illustrated in Figure 1, which shows, for a category with five members represented in a two-dimensional space, the prototype representation (panel A), the exemplar representation (panel B), and an intermediate representation consisting of two subprototypes (panel C).

Figure 1. The two-step procedure to construct a category representation: (1) partition the category into clusters, and (2) construct the centroid for each cluster. In this way, it is possible to construct the prototype representation (panel A), the exemplar representation (panel B), and a set of intermediate representations, one of which is illustrated in panel C.

Using this procedure of partitioning and averaging, a large set of representations can be created. The exhaustive set of possible representations for a category with four members is illustrated in Figure 2, represented in a two-dimensional space. The subprototypes representing the category are shown in black circles and are connected by lines to the original category members, shown in white circles. The number of subprototypes in a representation can be interpreted as the level of abstraction of the representation, so that lower numbers of subprototypes correspond to more abstraction. As such, the 15 representations displayed in Figure 2 involve four different levels of abstraction.

Figure 2. The 15 possible representations of the VAM for a category with four members in a two-dimensional space. The subprototypes are shown as black circles and are connected by lines to the original category members, shown as white circles. Panel A shows the exemplar representation, in which no category members are merged, and panel O shows the prototype representation, in which all four category members are merged into one single item. The remaining panels show all of the possible intermediate representations allowed by the VAM, in which a category is represented by three (panels B-G) or two (panels H-N) subprototypes.
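
The two-step construction is straightforward to sketch in code (our illustration, using what the article later calls membership vectors): step 1 enumerates the partitions, step 2 averages within clusters. For four members the enumeration yields Bell(4) = 15 representations, matching the panels of Figure 2.

```python
import numpy as np

def membership_vectors(n):
    """Yield every partition of n category members as a membership
    vector in restricted-growth form: the first entry is 1 and each
    later entry is at most one larger than the maximum so far, so
    each partition is produced exactly once."""
    def extend(v):
        if len(v) == n:
            yield tuple(v)
        else:
            for label in range(1, max(v) + 2):
                yield from extend(v + [label])
    yield from extend([1])

def subprototypes(members, v):
    """Step 2 of Figure 1: average the members within each cluster."""
    members = np.asarray(members, dtype=float)
    return [members[[i for i, vi in enumerate(v) if vi == q]].mean(axis=0)
            for q in sorted(set(v))]

print(len(list(membership_vectors(4))))  # 15, the fourth Bell number
```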

However, the representations do not only differ in their level of abstraction. For all but the extreme levels of abstraction, the VAM proposes different representations at one single level of abstraction. In particular, in Figure 2, 6 of the representations are at a level of abstraction of three (panels B-G), and 7 are at a level of abstraction of two (panels H-N). Representations with the same level of abstraction share the number of subprototypes but differ in the category members that are merged. In particular, they can differ in the degree of similarity that is involved in the merging of the category members. For example, in panel E, the category members being merged are much more similar to each other than the ones merged in panel C.

Formal Description of the VAM

Formally, a categorization model arises when, for every relevant category, a representation is combined with the assumptions shared by the GCM and the MPM. In particular, let $Q_J = \{Q_1, Q_2, \ldots, Q_{q_J}\}$ denote a partition of Category $C_J$, and let $n_i$ denote the number of category members in cluster $Q_i$. Further, let $\mu_i$ denote the centroid of $Q_i$, and let $S_J = \{\mu_1, \mu_2, \ldots, \mu_{q_J}\}$ denote the set of all of the $q_J$ centroids. These centroids are the subprototypes making up the category representation. The similarity of stimulus $x_i$ to Category $C_J$ is computed by summing the similarity of $x_i$ to all $q_J$ subprototypes representing $C_J$:

$$\eta_{iJ} = \sum_{\mu_j \in S_J} s(x_i, \mu_j), \quad (8)$$

where $s(x_i, \mu_j)$ is the similarity of $x_i$ to $\mu_j$. Like the category prototype, the subprototypes can be treated formally as stimuli; thus, $s(x_i, \mu_j)$ can be computed by Equations 2 and 1, if the coordinates of the subprototypes are known. These are defined as the averaged coordinates of all of the $n_i$ category members within the cluster $Q_i$:

$$\mu_{ik} = \frac{1}{n_i} \sum_{x_j \in Q_i} x_{jk}. \quad (9)$$

Note that at the extreme values of $q_J$, the GCM and the MPM arise (i.e., Equations 3 and 4).

The VAM encompasses all of the categorization models that can be constructed by combining all of the possible representations of all of the relevant categories. In general, the number of possible partitions of a set of n elements is given by a number known as the nth Bell number, denoted Bell(n). This implies that, in a categorization task with two Categories A and B (i.e., M = 2) with $n_A$ and $n_B$ category members, respectively, the VAM generates Bell($n_A$) × Bell($n_B$) different categorization models. A critical aspect of the VAM is that all models are matched to each other in every respect and contrast only in their representational assumptions. As such, all models have identical free parameters: M − 1 response biases $\beta_J$ (because of the constraint $\sum_{K=1}^{M} \beta_K = 1$), one sensitivity parameter c, one response-scaling parameter γ, and D − 1 attention weights $w_k$ (because of the constraint $\sum_{k=1}^{D} w_k = 1$), all summarized in the parameter vector $\theta = (w_1, w_2, \ldots, w_{D-1}, c, \gamma, \beta_1, \beta_2, \ldots, \beta_{M-1})$. Balancing the models in terms of their parameters assures the fairest comparison between the different models.
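
As a sketch (ours, under the same assumptions as the earlier snippets), one VAM model is just a pair of partitions plugged into Equations 9, 8, and 6; with two categories and equal response biases, the biases cancel out of Equation 6:

```python
import numpy as np

def vam_prob_A(stimulus, cat_A, v_A, cat_B, v_B, w, c, r=2, alpha=1):
    """P(Category A response | stimulus) for one VAM model.

    cat_A, cat_B: MDS coordinates of the training stimuli per category.
    v_A, v_B:     membership vectors defining the two partitions.
    With equal response biases, the betas cancel out of Equation 6.
    """
    w = np.asarray(w, dtype=float)

    def sim(x, y):                                # Equation 2
        d = np.sum(w * np.abs(x - y) ** r) ** (1.0 / r)
        return np.exp(-c * d ** alpha)

    def eta(x, members, v):                       # Equations 9 and 8
        members = np.asarray(members, dtype=float)
        protos = [members[[i for i, vi in enumerate(v) if vi == q]].mean(axis=0)
                  for q in sorted(set(v))]
        return sum(sim(x, p) for p in protos)

    x = np.asarray(stimulus, dtype=float)
    eta_A, eta_B = eta(x, cat_A, v_A), eta(x, cat_B, v_B)
    return eta_A / (eta_A + eta_B)                # Equation 6, equal biases
```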

A VAM Analysis of Empirical Data

Progress in understanding abstraction in category representation has often been sought by a systematic quantitative comparison of the GCM and the MPM. A review of these comparisons, as well as a theoretical treatment of the relations between prototype and exemplar models, is provided by Nosofsky (1992). His Table 8.5 summarizes fits of both the GCM and the MPM across 19 previously published data sets, involving a variety of category structures, experimental conditions, and types of stimuli. He concludes that "it is easily seen that the evidence is overwhelmingly in favor of exemplar-based category representations compared to prototype-based category representations, with the nature of the similarity rule held constant" (p. 163). Indeed, the MPM performed rather poorly relative to the GCM and provided a better fit than the GCM for only 1 data set. In sum, Nosofsky's (1992) review provides compelling evidence that a model that assumes no abstraction, like the GCM, generally fits empirical data better than a model that assumes total abstraction, like the MPM. Crucially, these findings rule out the use of total abstraction in category representations, but they cannot rule out the use of all forms of abstraction.

The studies reviewed by Nosofsky (1992) have been of considerable importance in the debate about the role of abstraction in categorization, so they seemed particularly attractive to be reanalyzed with the VAM. Therefore, in this section, the VAM is applied to 4 of the 19 data sets from Nosofsky's (1992) Table 8.5. The 4 data sets that were most appropriate for an initial VAM analysis were those with the smallest number of models implied by the design of the experiment, resulting in the selection of the data from Shin and Nosofsky's (1992) Experiment 3, Size 3, equal-frequency (E3S3EF) condition and from Nosofsky's (1987) saturation (A), saturation (B), and criss-cross conditions. All four conditions involved two categories to be learned, with a deterministic assignment of the category members to the categories.

In a VAM analysis of empirical categorization data, all models implied by the VAM are fit separately to the data. The most common method to fit a model to empirical data is maximum likelihood estimation (MLE; see, e.g., Myung, 2003). The idea behind MLE is to search for values of the free parameters that maximize the likelihood of observing the data. The model yielding the smallest $-\ln L(\theta)$ is selected as the best-fitting model.

In the VAM analysis of the four data sets, we tried to follow the original analyses as closely as possible. Obviously, the crucial difference between the analyses in the original studies and the VAM analysis was that, in the present study, the full set of representations as formalized by the VAM was considered, whereas in the original studies, generally only two representations were considered. Apart from this difference, the original analyses were followed in all major respects, in order to increase comparability. One minor difference between the present analysis and the original analyses was that we assumed nondifferential category bias (i.e., $\beta_K = 1/2$ for every K). The reason for this choice was that we wished to use as few free parameters as needed, and the analyses in both Shin and Nosofsky (1992) and Nosofsky (1987) revealed that response bias did not play a significant role. A second minor difference between the present analysis and the original analyses concerns the use of stimulus biases. Nosofsky (1987) fitted a version of the GCM that made use of stimulus biases, which were estimated from the data of an identification experiment. Since this version of the GCM is not commonly used, we did not include the stimulus biases in the reanalysis of the data. In all other respects, we followed the original analyses: We used the categorization proportions from the original studies; we used the MDS solutions derived in the original studies; we assumed the Euclidean distance metric and the exponential decay similarity function (i.e., r = 2 and α = 1), as in the original studies; and we did not include the response-scaling parameter γ in the analyses (i.e., γ = 1), as in the original studies.

The Shin and Nosofsky (1992) Data

The first set of data that we reanalyzed was from a series of experiments conducted by Shin and Nosofsky (1992) using the prototype-distortion paradigm (see, e.g., Posner & Keele, 1968). In this paradigm, generally, a category is defined by first creating a category prototype and then constructing the category members by randomly distorting these prototypes. Generalization is then tested by presenting the prototypes, the old distortions of them, and various new distortions.

Data and Results. In Experiment 3 of Shin and Nosofsky (1992), the stimuli used were random polygons. Two categories of polygons were created by first defining two prototypes and then generating 10 distortions of each one. In the Size 3 condition, for each category, 3 of these distortions were randomly selected as the training stimuli. An additional set of 5 transfer stimuli were created for each category, as follows. First, the prototype was redefined by calculating the average position of all 10 generated stimuli. Second, 2 new distortions were generated from each redefined prototype. Third, 2 new distortions were generated from a stimulus that was randomly selected from the training set. For the main experimental manipulation, in a baseline condition all training stimuli were presented with the same frequency, whereas in a high-frequency condition, some of the training stimuli were presented more often than the others. However, because our main interest was in the category representation rather than in the effect of presentation frequency, we only analyzed the data from the baseline condition. In sum, we analyzed the data from the E3S3EF condition.

Thirty participants learned to classify the polygons into the two categories. Following a training phase in which feedback was provided after classification, a test phase was conducted during which all 6 training stimuli and all 10 transfer stimuli were presented. Classification feedback was provided only for the training stimuli. There were three blocks of test trials, with each stimulus presented once in each block, resulting in 90 classification decisions for every stimulus. The observed proportion of Category A responses for each stimulus, averaged across participants, is reported in Table 10 of Shin and Nosofsky (1992). Following the test phase, all participants judged the degree of similarity between all pairs of the 16 polygons. Thirty other participants, who had not taken part in the categorization experiment, did the same. From the similarity judgment data, Shin and Nosofsky (1992) derived a four-dimensional MDS solution. This solution, reported in their Table A3, was taken as the underlying stimulus representation both for the theoretical analyses of Shin and Nosofsky and for the present VAM analysis.
In their theoretical analyses, Shin and Nosofsky (1992) found that the MPM fared poorly relative to the GCM, as reported in their Tables 11 and 13. In addition, they fitted a combined model, in which the relative contributions of the prototype and the exemplar representation could be estimated, and found that the parameter weighting the use of the prototype representation was 0 (see their Table 14). In sum, their analysis of the data in the E3S3EF condition did not provide any evidence for the operation of an abstraction process.

VAM analysis. Since the third Bell number (i.e., three members per category) is 5, the VAM encompasses 25 possible models. Table 1 shows the details of the VAM analysis for all 25 models. Each model is described by two membership vectors, one for each category. In general, the representation of a category with n members using q subprototypes can be described in a convenient way by the membership vector $v = (v_1, v_2, \ldots, v_n)$, where $v_i \in \{1, 2, \ldots, q\}$ indexes the cluster membership of stimulus $x_i$. For example, for a category with five members, the exemplar representation is described by v = (1, 2, 3, 4, 5), the prototype representation by v = (1, 1, 1, 1, 1), and the intermediate representation of panel C in Figure 1 by v = (1, 1, 1, 2, 2).

Table 1. Summary Fits and Maximum Likelihood Parameters for All 25 Models Fitted to Shin and Nosofsky's (1992) E3S3EF Condition Data, As Ordered by Fit

v_A        v_B        -ln L    pvaf    w_1    w_2    w_3    c
1, 2, 3    1, 2, 3     63.45   91.57   0.48   0.02   0.11   1.50
1, 2, 2    1, 2, 2     66.38   90.99   0.32   0.11   0.27   1.66
1, 1, 2    1, 2, 2     67.91   89.87   0.43   0.04   0.19   1.61
1, 2, 2    1, 1, 2     68.96   89.82   0.40   0.10   0.00   1.88
1, 1, 2    1, 1, 2     71.82   89.05   0.42   0.22   0.02   2.06
1, 2, 1    1, 2, 2     72.62   89.03   0.52   0.04   0.17   1.47
1, 2, 3    1, 1, 2     76.52   88.74   0.46   0.15   0.00   1.79
1, 2, 1    1, 1, 2     78.99   87.52   0.58   0.13   0.03   1.63
1, 1, 2    1, 2, 3     79.58   85.29   0.46   0.02   0.25   1.55
1, 2, 2    1, 2, 3     80.08   85.52   0.55   0.01   0.35   1.35
1, 2, 1    1, 2, 3     80.15   85.28   0.54   0.00   0.22   1.50
1, 2, 3    1, 2, 2     85.10   85.00   0.42   0.14   0.11   1.59
1, 1, 1    1, 1, 1     85.30   85.17   0.54   0.02   0.03   1.69
1, 1, 2    1, 2, 1     89.93   83.25   0.63   0.05   0.02   1.46
1, 2, 1    1, 2, 1     90.48   83.35   0.73   0.01   0.02   1.41
1, 2, 2    1, 2, 1     92.68   82.89   0.93   0.03   0.00   1.11
1, 2, 3    1, 2, 1    100.28   79.47   0.85   0.03   0.00   1.27
1, 1, 1    1, 2, 2    110.67   75.83   0.50   0.04   0.38   1.36
1, 1, 1    1, 1, 2    110.94   77.86   0.67   0.03   0.30   1.22
1, 1, 2    1, 1, 1    112.04   76.10   0.44   0.12   0.02   2.00
1, 2, 1    1, 1, 1    112.38   78.11   0.37   0.13   0.01   2.30
1, 2, 2    1, 1, 1    118.80   74.58   0.48   0.18   0.00   1.70
1, 1, 1    1, 2, 1    160.09   63.40   0.67   0.04   0.09   1.54
1, 2, 3    1, 1, 1    178.33   57.47   0.45   0.21   0.00   1.88
1, 1, 1    1, 2, 3    180.19   56.85   0.46   0.00   0.36   1.67

Note. v_A, v_B: membership vectors for Categories A and B; -ln L: negative value of the maximized log-likelihood; pvaf: percentage of variance accounted for; w_k: attention weight given to dimension k; c: sensitivity.
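
The $-\ln L$ column of Table 1 can be reproduced in outline by a loop like the following sketch (ours; it assumes the membership_vectors and vam_prob_A helpers from the earlier snippets, observed Category A counts y out of n = 90 decisions per test stimulus, and test-stimulus coordinates X from the MDS solution):

```python
import numpy as np
from itertools import product
from scipy.optimize import minimize

def neg_log_lik(params, X, y, n, cat_A, v_A, cat_B, v_B, D):
    """Binomial -ln L (up to a constant) of one VAM model; the free
    parameters are D-1 attention weights and log(c)."""
    w = np.append(params[:D - 1], 1.0 - np.sum(params[:D - 1]))
    if np.any(w < 0) or np.any(w > 1):
        return np.inf                      # enforce the weight constraints
    c = np.exp(params[-1])
    p = np.array([vam_prob_A(x, cat_A, v_A, cat_B, v_B, w, c) for x in X])
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

def fit_all_models(X, y, n, cat_A, cat_B, D):
    """Fit every pair of membership vectors; return the models ordered
    by fit (smallest -ln L first), as in Table 1."""
    start = np.append(np.full(D - 1, 1.0 / D), 0.0)
    results = []
    for v_A, v_B in product(membership_vectors(len(cat_A)),
                            membership_vectors(len(cat_B))):
        fit = minimize(neg_log_lik, start, method="Nelder-Mead",
                       args=(X, y, n, cat_A, v_A, cat_B, v_B, D))
        results.append((fit.fun, v_A, v_B))
    return sorted(results)
```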

In Table 1, the GCM and the MPM correspond to the models indexed by $v_A = v_B = (1, 2, 3)$ and $v_A = v_B = (1, 1, 1)$, respectively. The table reports, for every model, the negative log-likelihood and, as an auxiliary measure of fit, the percentage of variance accounted for. Although they are of lesser interest for the present goal, the best-fitting parameter values are reported as well.

Our primary interest is in which representation best accounts for the observed data. In Table 1, the models are ordered by fit, with the best-fitting model on the top row. Apparently, of all 25 possible models considered, the model that captured the participants' performance best was the one that assumed an exemplar representation for both categories. Impressively, even when the level of abstraction was allowed to vary, there was no evidence for the presence of abstraction in the category representations.

The Nosofsky (1987) Data

Apart from the prototype-distortion paradigm, a second influential research paradigm involves simple perceptual stimuli that vary along a few salient dimensions. The remaining three data sets we reanalyzed were collected by Nosofsky (1987) using this paradigm.

Data and Results. Nosofsky (1987) conducted a color categorization study using a stimulus set of 12 Munsell color chips varying in brightness and saturation. On the basis of this set of stimuli, six different category structures were constructed, as shown in Figure 5 of Nosofsky (1987). The three category structures studied in the present article are the saturation (A), saturation (B), and criss-cross structures. Figure 3 illustrates these category structures in the psychological space.

Figure 3. Schematic representation of Nosofsky's (1987) saturation (A), saturation (B), and criss-cross conditions in the psychological space. Squares denote training stimuli assigned to Category A, and circles denote those assigned to Category B. The remaining stimuli are unassigned. Adapted from Nosofsky (1987).

To derive these representations, Nosofsky (1987) instructed 34 participants to identify all 12 stimuli, and he used the data from this identification experiment to derive the MDS solution, reported in his Table 3. Twenty-four other participants learned to categorize the same set of 12 stimuli in both the saturation (A) and criss-cross conditions, and 40 others were assigned to the saturation (B) condition. In the saturation (A) condition, participants were presented with one block of 120 trials, and in the two other conditions, two blocks of 120 trials were presented. Each stimulus was presented 10 times in each block. After classifying a stimulus in either of the two categories, feedback was given only in the case in which a stimulus was assigned to a category. Table 4 of Nosofsky (1987) shows the proportion of Category A responses for each stimulus in each condition, averaged across participants. The saturation (A) data were obtained during the final 90 trials of the single block, and those for the two other conditions were obtained during the second block, resulting in sample sizes of 180, 400, and 240 for the saturation (A), saturation (B), and criss-cross conditions, respectively.

As shown in Nosofsky's (1987) Tables 5 and 6, he found that, for the saturation (A) condition, the MPM yielded essentially the same fit as the GCM. For the saturation (B) condition, the MPM was found to fit substantially worse than the GCM.

Finally, the MPM was unable to account for the data in the criss-cross condition, whereas the GCM yielded an impressive fit. In sum, once again no evidence for the operation of an abstraction process could be discerned on the basis of these data. Especially in the criss-cross condition, the evidence against such a process was particularly compelling.

It is instructive to understand why the MPM failed so dramatically in this condition. Inspection of the category structure explains why the prototype representation was insufficient as a basis for categorization: Apparently, the centroids for Categories A and B virtually overlap. If the representation for Category A were identical to that for Category B, the similarity of a stimulus to Category A would be identical to its similarity to Category B, and the model would predict performance at chance. Therefore, it is far from surprising that the MPM failed to account for the data collected in the criss-cross condition.

Importantly, for the criss-cross structure, several intermediate representations seem highly plausible. In this structure, a category can be split up in two subcategories, so it is reasonable to expect that, when abstraction takes place in this condition, subprototypes would be based on the subcategories. This observation led Nosofsky (1987) to test one multiple-prototype representation in the criss-cross condition, consisting of four subprototypes. It is shown in Figure 4, using the graphical conventions adopted earlier (i.e., the subprototypes are shown in black and are connected by lines to the original category members, shown in white). Nosofsky (1987) found that, although this multiple-prototype model fared far better than the MPM, the GCM was still superior. Thus, no evidence for the use of abstraction could be provided once again.

Figure 4. The intermediate representation tested by Nosofsky (1987) in the criss-cross condition.

VAM analysis. Clearly, other multiple-prototype representations than the one considered by Nosofsky (1987) can be formalized. This is exactly what the VAM provides. All three conditions involved two categories of four members each, implying 225 possible models. Table 2 shows the results of the VAM analysis. The table reports, for each condition, the negative log-likelihood, the percentage of variance accounted for, and the parameters for the best-fitting VAM instantiation.

Table 2. Summary Fits and Maximum Likelihood Parameters for the Best-Fitting Model to Nosofsky's (1987) Saturation (A), Saturation (B), and Criss-Cross Data

Condition         -ln L    pvaf    w_1    c
Saturation (A)    41.09   98.52   0.79   1.06
Saturation (B)    56.45   99.04   0.75   1.23
Criss-cross       44.65   99.09   0.62   1.60

Note. -ln L: negative value of the maximized log-likelihood; pvaf: percentage of variance accounted for; w_1: attention weight given to the first dimension; c: sensitivity.

Although the focus of the present research lies on the representation, we briefly mention that the estimated values of the free parameters in the best-fitting models are intuitively acceptable. Particularly in the two saturation conditions, in which the first dimension was clearly more diagnostic than the other dimension for performing the categorization task, the estimated attention weights on the first dimension (.79 and .75 for the saturation (A) and (B) conditions, respectively) are consistent with the expectation that the participants attempted to attend selectively to the first dimension. Furthermore, the weights are in close correspondence with the estimated weights for the GCM and MPM in the original study.

Our primary interest, however, is which of the representations describes the observed data best. Figure 5 shows the best-fitting representation for each condition.

Figure 5. The best-fitting representations for Nosofsky's (1987) saturation (A), saturation (B), and criss-cross conditions, in a psychological space either without (column A) or with (column B) modification by selective attention.

Inspection of these figures reveals that when the level of abstraction was allowed to vary in each condition, the representation yielding the best fit assumed some form of partial abstraction. In the saturation (B) condition, Category A adopted the exemplar representation, but in all other best-fitting representations at least two category members were merged. A comparison of these VAM results with those produced by the GCM is provided in the Appendix.

The best-fitting representation in the saturation (A) condition seems somewhat counterintuitive, since in Category A rather disparate category members are being merged. The representation gains some psychological plausibility in the space modified by selective attention, as is illustrated in the top panel of Figure 5B. In contrast, the best-fitting representations in the saturation (B) and criss-cross conditions are psychologically easily interpretable. In the saturation (B) condition, Category A adopts the exemplar representation, and in the Category B representation, two rather similar category members are merged. In the criss-cross condition, the best-fitting representation has a particularly strong intuitive appeal. As already hypothesized by Nosofsky (1987), this representation involves the formation of subprototypes for the subcategories. However, unlike the intermediate representation considered by Nosofsky (1987), shown in Figure 4, the intermediate representation providing the best fit to the data does not consist of four subprototypes, but of only two. It is formed by merging two stimulus pairs, 2-4 and 9-12, but unlike the representation in Figure 4, Stimuli 1, 3, 8, and 10 are left unmerged. In sum, in contrast to the earlier conclusions, the results of the VAM analysis of Nosofsky's (1987) empirical data provide support for the idea that some form of abstraction is involved in people's category representations.

The Distinguishability of the Representations

Since the VAM enlarges the set of representational possibilities, a potential concern with the varying abstraction approach is that it considers too many representations. In particular, the problem is that, for any data set, there will always be a representation that yields a better fit than the others, but evaluating whether the superior fit of this representation is reliable or accidental is difficult. Therefore, a question of central importance is whether the representations are distinguishable. Obviously, if the discrimination between the different representations fails, the results obtained in the previous section would be due to chance, and the conclusions we reached not legitimate. To gain information regarding the distinguishability of the representations, for all four conditions we conducted a large-scale recovery simulation study. Such a study involves artificial data for which the true underlying representation is known but that are contaminated by sampling variability. Of interest is the ability of the VAM to recover the true, data-generating representation when it is fit to the simulated data. If the VAM analysis is not governed by chance, it should be able to see through the random variation caused by sampling error and correctly discern the representation that generated the data. In the remainder of this section, we first provide details on the procedure used in the recovery study and then report its results for both the Shin and Nosofsky (1992) and the Nosofsky (1987) data.

Procedure

The data were generated from a particular model by the following three steps: selecting a set of parameter values, computing the classification probabilities according to the model, and adding sampling error to these probabilities. Since we wished to generate response patterns that were similar to those observed, the parameter values were obtained by fitting the generating model to the empirical data set. Given the optimal parameter values for this model, a classification probability of a stimulus according to the model was then obtained by substituting these parameter values in Equation 7. Sampling error was introduced by generating a set of binary-valued responses (0 or 1) from the binomial probability distribution corresponding to the classification probability. In each application, the number of binary responses was equal to the sample size of the empirical study, again in order to generate response patterns similar to those observed (Pitt, Kim, & Myung, 2003). Once this procedure had been applied for each of the stimuli, an artificial data set as generated from the model was obtained.

From each model in the VAM, we generated 100 such artificial data sets. Each of these artificial data sets was treated exactly as an empirical data set, so each model of the VAM was fit to each of them in order to determine the best-fitting model. As a result, each artificial data set was classified to a certain model. Since the VAM was fit to every artificial data set, this implies that 25 × 100 × 25 = 62,500 models were fit for the E3S3EF condition and, likewise, 225 × 100 × 225 = 5,062,500 models were fit for each condition from Nosofsky (1987). For all four applications together, over 15 million models were fit.
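
In outline, the three generation steps plus the refitting step look as follows (our sketch; model_probs and fit_model are hypothetical stand-ins for the fitting machinery sketched earlier):

```python
import numpy as np

rng = np.random.default_rng(0)

def recovery_study(models, X, n, n_datasets=100):
    """Parametric recovery study: for each generating model, simulate
    n_datasets artificial data sets and record which model fits best.
    Returns a generating-by-recovered matrix of counts."""
    counts = np.zeros((len(models), len(models)), dtype=int)
    for g, gen in enumerate(models):
        # Steps 1-2: best-fitting parameters -> classification probabilities.
        p_true = model_probs(gen, X)
        for _ in range(n_datasets):
            # Step 3: binomial sampling error at the empirical sample size.
            y = rng.binomial(n, p_true)
            fits = [fit_model(m, X, y, n) for m in models]  # -ln L per model
            counts[g, int(np.argmin(fits))] += 1            # best fit wins
    return counts
```
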
The Shin and Nosofsky (1992) Data

Table 3 shows, for each of the 25 different models, the recovery rate $r_i$ and the false recovery rate $f_i$. The recovery rate of model $M_i$ (i = 1, 2, ..., 25) is the percentage of correctly classified data sets from $M_i$. It is defined by $r_i = ncorr_i / ngen_i$, where $ngen_i$ denotes the number of generated data sets from model $M_i$ (i.e., 100 for every i, in the present case) and $ncorr_i$ denotes the number of correctly classified data sets generated by $M_i$ (i.e., the number of data sets for which $M_i$ both generated the data and was selected as the best-fitting model). The false recovery rate of model $M_i$ is the percentage of cases in which data sets were incorrectly classified to $M_i$. It is defined by $f_i = nfalse_i / nfalse$, where nfalse denotes the total number of incorrectly classified data sets across all models and $nfalse_i$ denotes the number of data sets incorrectly classified to $M_i$ (i.e., the number of data sets for which $M_i$ did not generate the data but was nonetheless selected as the best-fitting model).

Table 3. Recovery Rates and False Recovery Rates for All 25 Models of Shin and Nosofsky's (1992) E3S3EF Condition

v_A        v_B        r_i      f_i
1, 1, 1    1, 1, 1     72.00    4.29
1, 1, 1    1, 1, 2     95.00    3.21
1, 1, 1    1, 2, 1     97.00    0.36
1, 1, 1    1, 2, 2     92.00    1.07
1, 1, 1    1, 2, 3    100.00    0.36
1, 1, 2    1, 1, 1     89.00    0.00
1, 1, 2    1, 1, 2     95.00    6.07
1, 1, 2    1, 2, 1     77.00    9.29
1, 1, 2    1, 2, 2     84.00    6.43
1, 1, 2    1, 2, 3     82.00    9.64
1, 2, 1    1, 1, 1     99.00    2.50
1, 2, 1    1, 1, 2     90.00    6.43
1, 2, 1    1, 2, 1     78.00    8.57
1, 2, 1    1, 2, 2     85.00    6.43
1, 2, 1    1, 2, 3     79.00    7.14
1, 2, 2    1, 1, 1     97.00    0.36
1, 2, 2    1, 1, 2     98.00    3.93
1, 2, 2    1, 2, 1     49.00    2.86
1, 2, 2    1, 2, 2     92.00    4.29
1, 2, 2    1, 2, 3     89.00    2.14
1, 2, 3    1, 1, 1     99.00    0.00
1, 2, 3    1, 1, 2     99.00    1.43
1, 2, 3    1, 2, 1     98.00    0.36
1, 2, 3    1, 2, 2     96.00    2.14
1, 2, 3    1, 2, 3     89.00   10.71

Note. v_A, v_B: membership vectors for Categories A and B; r_i: recovery rate for model M_i (in %); f_i: false recovery rate for model M_i (in %).

Globally, the recovery is quite good, with the individual recovery rates ranging from 49% to 100%. Of all 2,500 artificial data sets, 280 were incorrectly classified. As indicated by the false recovery rates shown in the last column, the model responsible for most of the misclassifications was the GCM. However, this happened for only 30 [i.e., (10.71 × 280)/100] data sets. This seems negligible, since as many as 2,400 data sets had been generated by models other than the GCM. In sum, it seems that in the experimental design (which includes the category structure, the number and positions of the transfer stimuli, and the sample size) of the E3S3EF condition, the different VAM representations are sufficiently distinguishable to legitimize the conclusions of the VAM analysis in the previous section.

The Nosofsky (1987) Data

Each of the three conditions of the Nosofsky (1987) study involved 225 models, so a detailed report of the recovery rates for each model separately would take up too much space. Instead, the 225 individual recovery rates are depicted graphically in Figure 6 using a histogram. Furthermore, to get a global picture of the recovery in these three conditions, the second column of Table 4 reports the overall recovery rates r. The overall recovery rate averages the individual recovery rates $r_i$ (i.e., $r = \sum_{i=1}^{225} ncorr_i / 22{,}500$) and corresponds to the percentage of correctly classified data sets across all models for each of the conditions.

In the saturation (A) condition, recovery was clearly poor, with the correct model recovered for less than 50% of the data sets. Figure 6 shows that some of the models had a very high recovery rate and others a very low recovery rate, but that for the bulk of the models recovery was poor. In the saturation (B) condition, the overall recovery rate was reasonably high, with the true model recovered about 9 times out of 10. As shown in Figure 6, for some exceptional models, recovery was moderate or even poor, but for most of the models recovery ranged from very good to perfect. Impressively, in the criss-cross condition, recovery was virtually perfect: As few as 265 of the 22,500 artificial data sets were incorrectly classified, and the vast majority of the models were perfectly recovered (i.e., 203 of the 225 models had an individual recovery rate of 100%).

Table 4 also reports the recovery and false recovery rates for three models with a privileged status. The models deserving special attention are the MPM, the GCM, and the model that fitted the empirical data best (see Figure 5). The most interesting finding is that, in all three conditions, the recovery rate for the MPM was particularly bad. In fact, in the saturation (B) condition, the MPM had the worst recovery of any of the models, and in the saturation (A) and criss-cross conditions, only two models had worse recovery. In contrast, the GCM had perfect recovery in two of the conditions. Only in the saturation (A) condition was the GCM not clearly distinguishable.

Considering the false recovery results, none of the three models displayed in Table 4 was responsible for a large amount of the misclassifications in the saturation (A) and (B) conditions. In fact, in these conditions, no model at all had a particularly large false recovery rate; the largest were 0.78% and 2.24%, respectively. In the criss-cross condition, the false recovery rate was also very small for the GCM and, most importantly, for the best-fitting model. Quite surprisingly, the MPM was responsible for more than 6% of the misclassifications. In total, 6 models were responsible for at least 5% of the misclassifications each. However, since the number of misclassified data sets was

Figure 6. Histogram of the recovery rates for all models separately in the saturation (A), saturation (B), and criss-cross conditions.
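
Given the generating-by-recovered count matrix produced by the recovery sketch above, the two rates defined in this section are a few lines of arithmetic (our illustration):

```python
import numpy as np

def recovery_rates(counts):
    """r_i = ncorr_i / ngen_i and f_i = nfalse_i / nfalse (both in %),
    computed from a generating-by-recovered matrix of counts."""
    counts = np.asarray(counts, dtype=float)
    ngen = counts.sum(axis=1)                # data sets generated per model
    ncorr = np.diag(counts)                  # correctly recovered
    nfalse_i = counts.sum(axis=0) - ncorr    # wrongly credited to model i
    nfalse = nfalse_i.sum()                  # all misclassified data sets
    return 100 * ncorr / ngen, 100 * nfalse_i / nfalse
```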