Affine Transforms on Probabilistic Representations of Categories

Denis Cousineau
École de psychologie, Université d'Ottawa
Ottawa, Ontario K1N 6N5, CANADA

Abstract

In categorization processes explored in cognitive science, the representation of categories is based on central tendencies, category boundaries, or exemplar memory. Here we explore a probabilistic representation in which categories are abstracted using a probability density function. This representation is compatible with the prototype theory as well as the exemplar theory, and encompasses the category boundaries theory. We examine how affine transforms can be applied to perceived exemplars that are distorted instances of the category and how accurately category membership can be assessed. Simulations demonstrate that category retrieval is near perfect with rotation, translation and contraction or expansion on distorted exemplars.

Keywords: Categorization theories, prototype representation, probabilistic representation, likelihood of category membership, affine transforms.

Introduction

Because we never see an object exactly the same way twice, the ability of humans and animals to classify objects into categories is probably the most important and most frequently used cognitive process. The present document introduces a framework for category representation that is compatible with affine transforms.

In a cognitive system evolving in real-life conditions, an internal representation has very little chance of being an exact picture of the exemplars that were presented to this system. This is because the stimuli perceived during a category formation phase all differ in orientation, size or position, not to mention luminance, contrast, etc. In addition to these variations caused by the viewer's position and the viewing conditions, the stimulus perceived might not be a perfect instance of its category, because of distortion, imperfect development or mutilation.
Hence, the category representation must somehow take these variations into account such that similar exemplars are classified under a single interpretation. For example, dogs (Setters, Dobermans, Fox Terriers, Chihuahuas, etc.) vary a lot but are jointly categorized under the term "dog". To distinguish members of one category from members of other categories, various computational algorithms are possible.

The first theory proposed a "prototype", that is, a representation that captures what is typical in the stimuli. Prototype theories were explored in cognitive psychology (e.g., Rosch & Mervis, 1975; Smith, Redford, & Haas, 2008). According to this theory, a new, ambiguous exemplar is assigned to the prototype closest to it, assuming some internal multidimensional space in which the stimulus can be assigned a location.

Another approach is to suppose that, instead of category centroids, category boundaries are learned. New stimuli are classified according to the side of the boundaries on which they fall. Ambiguous stimuli (located close to a category boundary) take more time to be classified (Hélie & Ashby, 2012).

A generalization encompassing both approaches is to suppose that a category is represented by a distribution. Two largely overlapping distributions are more difficult to discriminate. Such distributions can be built from training exemplars, each possessing imperfect attributes caused by distortions. As more and more exemplars are presented, a sampling distribution can be stored in the cognitive system. Hence, this view is quite compatible with the exemplar theory of categorization, which assumes that each exemplar is somehow stored in the memory system (Nosofsky & Johansen, 2000).

Within this perspective, category membership can be assessed using computations based on statistics: a new exemplar can be seen as being sampled from a distribution of all the possible exemplars.
A posteriori, it is a member of a distribution with a certain likelihood given by

$$ \mathcal{L}(P \mid \mathbf{X}) = f_P(\mathbf{X}) \tag{1} $$

(Edwards, 1992), where $\mathbf{X}$ is a description of the exemplar perceived, $P$ is the distribution of all the exemplars from the category, and $f_P$ is the probability density of the category.

As an illustration, suppose an exemplar is composed of a certain number of attributes, whose presence can be assessed using a real number. In Figure 1, the exemplars are composed of two attributes, illustrated using two dimensions. An example of a two-dimensional stimulus is a sound, which can be described by pitch and frequency. Colors might be described by three dimensions if the Red-Green-Blue representation is used. More complex stimuli might require a larger number of dimensions. For example, a geometric figure composed of straight lines (a form) might be represented by its vertices, so that a 5-vertex figure would require 10 dimensions (five {x, y} coordinates). Finally, the perception of an animal might require hundreds of dimensions (color, limb size, head-to-body proportion, etc.).

Assuming independence of the dimensions (which is of course an oversimplification), Eq. (1) becomes

$$ \mathcal{L}(P \mid \mathbf{X}) = \prod_{i} f_{P_i}(x_i) \tag{2} $$

where $f_{P_i}$ is the marginal distribution of the category along the $i$th dimension, and $x_i$ is the $i$th perceived attribute value.

Figure 1: Three views of the categories "Dog", "Cat" and "Mouse" in recent cognitive theories of categorization. The two axes represent two (real-valued) perceived attributes. Left: categories are represented by a central tendency, a prototype; middle: categories are delimited by boundaries with no reference to category centroids; right: categories are represented by density functions from which category centroids and category boundaries can be deduced.

The decision-bound theory (Maddox & Ashby, 1993) can be represented under this probabilistic view by assuming that the boundaries are located at the equal-probability line between two categories. However, this generalization offers the extra flexibility that biases can be introduced. The equal-probability line is replaced by the line where the ratio between the likelihoods of the two alternative categories reflects the strength of the bias, much in line with signal detection theory (Luce, 1963, 1986).

One last advantage of the present generalization is that the decision to favor the most likely category given one exemplar, $\arg\max_{P \in \mathcal{C}} \Pr(P \mid \mathbf{X})$, results in a relation very similar to Luce's decision theory (incorporated in GCM and EBRW; Nosofsky, 1986; Nosofsky & Palmeri, 1997; here $\mathcal{C}$ is the set of all the categories known to the system). Indeed, owing to Bayes' theorem,

$$ \Pr(P \mid \mathbf{X}) = \frac{\mathcal{L}(P \mid \mathbf{X})\,\Pr(P)}{\sum_{Q \in \mathcal{C}} \mathcal{L}(Q \mid \mathbf{X})\,\Pr(Q)} \tag{3} $$

which suggests that likelihood is a measure of similarity between one exemplar and a category. This measure has the advantage of being bounded between 0 and 1. It also suggests that the contribution of individual dimensions is multiplicative. Hence, if one dimension is unlikely, the whole exemplar becomes an unlikely member of the category.

Figure 2: One figure (the prototype) and four affine transforms performed on the figure. Panel labels include Prototype, Rotation, and Contraction or Expansion.

The above shows that classifying an exemplar as belonging to a certain category could be based on the assessment of the likelihood function using a distribution.
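To make Eqs. 1 to 3 concrete, here is a minimal Python sketch. It assumes that every dimension follows an independent normal distribution with a standard deviation of 1 (the choice made later in the simulations), and that priors are equal unless stated otherwise. The function names and the three two-attribute centroids are hypothetical, chosen only for illustration.

```python
import math

def normal_logpdf(x, mean, sd=1.0):
    """Log density of a univariate normal distribution."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))

def log_likelihood(exemplar, centroid, sd=1.0):
    """Eq. 2 on the log scale: independent dimensions multiply, so logs add."""
    return sum(normal_logpdf(x, m, sd) for x, m in zip(exemplar, centroid))

def posterior(exemplar, categories, priors=None):
    """Eq. 3 (Bayes' theorem) over a dict {name: centroid}."""
    names = list(categories)
    if priors is None:  # assumption: equal priors, i.e. no response bias
        priors = {name: 1.0 / len(names) for name in names}
    logs = [log_likelihood(exemplar, categories[n]) + math.log(priors[n])
            for n in names]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return {n: w / total for n, w in zip(names, weights)}

# Hypothetical two-attribute categories (illustrative values only).
categories = {"dog": (4.0, 4.0), "cat": (1.0, 2.0), "mouse": (-2.0, -1.0)}
probs = posterior((1.5, 2.5), categories)
best = max(probs, key=probs.get)  # the category favored for this exemplar
```

Unequal values passed through `priors` play the role of the bias discussed above: they shift the boundary away from the equal-likelihood line without changing the category densities.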
As said above, this distribution could be learned by the agent from past sampling of previous exemplars, so that an estimate of the distribution can be built and stored internally in a multidimensional space.

In what follows, we indicate how exemplars can be correctly classified despite the fact that they could be perceived from a non-canonical perspective, being seen rotated, from a close or far distance, or not directly from the front. These changes in perspective are easily performed (or cancelled) using affine transforms. Normally, affine transforms can only verify whether there is an exact match between a reference figure and a test figure. However, in a probabilistic memory, an exact match is not an option. Here, a match can be quantified by similarity as measured by likelihood (or log likelihood), which is a value between 0 and 1 (or between $-\infty$ and 0 on the log scale). As we will show next, using a deterministic or a probabilistic representation has no influence on the possibility of using affine transforms.

Affine transforms

An affine transform is a transformation that preserves straight lines and proportions along those lines. Affine transforms include translation, contraction and expansion, rotation, reflection and shear. Figure 2 illustrates some transformations on a figure composed of five points, each shown on a 2-dimensional surface. For a given point $\mathbf{x}$, we get the transformed point $\mathbf{x}_T$ using the transformation matrix $\mathbf{A}$ and the translation vector $\mathbf{b}$:

$$ \mathbf{x}_T = \mathbf{A}\,\mathbf{x} + \mathbf{b} \tag{4} $$

in which $\mathbf{A}$ is a 2 × 2 transformation matrix and $\mathbf{b}$ is a column vector of size 2 for the horizontal and vertical translation. For example, to rotate by $\theta$ radians, expand or contract by a factor $\sigma$, reverse horizontally or vertically (using −1 on the corresponding diagonal entry), or shear horizontally or vertically (by a factor $k$), use the following matrices:

$$ \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \quad \begin{pmatrix} \sigma & 0 \\ 0 & \sigma \end{pmatrix}, \quad \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 1 & k \\ 0 & 1 \end{pmatrix}, \text{ etc.,} $$

and use $\mathbf{b} = (t_x, t_y)^\top$ to translate horizontally or vertically. The augmented transformation matrix of size 3 × 3, which incorporates the translation, can be noted

$$ \mathbf{T} = \begin{pmatrix} \mathbf{A} & \mathbf{b} \\ \mathbf{0}^\top & 1 \end{pmatrix} $$

as long as the point to be transformed is also augmented with a 1; such a point will be noted with a prime, as in $\mathbf{x}' = (x, y, 1)^\top$. Affine transforms can be nested such that

$$ \mathbf{x}'_T = \mathbf{T}_2\,\mathbf{T}_1\,\mathbf{x}' \tag{5} $$

and are reversible, so that

$$ \mathbf{T}^{-1}\,\mathbf{T}\,\mathbf{x}' = \mathbf{x}'. \tag{6} $$

Figure 3: (Left) The prototype P used in Monte Carlo simulation 1; (Center) three distortions (exemplars) of the prototype (shown with a dashed line); (Right) one exemplar shown in its correct location, after a translation to the right, and after a translation up and a rotation of 45 degrees (π / 4 radians).

In the present application, the cognitive system is presented with an object made up of a collection of points. This object $\mathbf{X}$ is a distorted version (an exemplar) of a prototype object P, defined as the centroids of the distribution. The exemplar, in addition to being distorted, can also be translated, rotated and scaled up or down by various amounts. Is it possible to recognize $\mathbf{X}$ as an instance of P?

Category assessment with probabilistic representation and affine transforms

The problem of verifying category membership of the exemplar $\mathbf{X}$ with respect to category $P$ can be summarized as the problem of locating the most likely category

$$ \hat{P} = \arg\max_{P \in \mathcal{C}} \; \max_{\omega \in \Omega} \; \prod_{i} f_{P_i}\!\big( [\mathbf{T}_{\omega}\,\mathbf{X}]_i \big) \tag{7} $$

in which $\mathcal{C}$ is the set of known categories, $f_{P_i}$ is the marginal density of category $P$ along dimension $i$, the attribute dimensions (indexed by $i$) are assumed independent, and $\Omega$ is the space of the rotation parameter, the scale parameter and the translation parameters (as defined above) that would replace the exemplar in a canonical position where it can be compared to the prototype.

The above approach is not guaranteed to function because $\mathbf{X}$ is not precisely the prototype but a distortion of it. Returning a distortion to a canonical point of view might be an unsolvable problem.
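The augmented-matrix notation of the affine transforms section (nesting and reversibility) can be sketched as follows. This is a minimal pure-Python illustration, not the code used for the simulations. The helper names (`affine`, `compose`, `invert`, `apply`) are hypothetical, and only rotation, uniform scaling and translation are covered. The five prototype coordinates are those given in the Methodology of the first simulation.

```python
import math

def affine(theta=0.0, sigma=1.0, tx=0.0, ty=0.0):
    """3x3 augmented matrix: rotate by theta, scale by sigma, then translate."""
    c, s = sigma * math.cos(theta), sigma * math.sin(theta)
    return [[c, -s, tx],
            [s, c, ty],
            [0.0, 0.0, 1.0]]

def compose(t2, t1):
    """Matrix product t2 * t1: t1 is applied first, then t2 (nesting)."""
    return [[sum(t2[i][k] * t1[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def invert(t):
    """Inverse of an augmented matrix [[A, b], [0, 1]]: [[A^-1, -A^-1 b], [0, 1]]."""
    (a, b, tx), (c, d, ty), _ = t
    det = a * d - b * c
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)],
            [0.0, 0.0, 1.0]]

def apply(t, points):
    """Transform (x, y) points; each point is implicitly augmented with a 1."""
    return [(t[0][0] * x + t[0][1] * y + t[0][2],
             t[1][0] * x + t[1][1] * y + t[1][2]) for x, y in points]

prototype = [(-2, 2), (2, 2), (0, 0), (3, -3), (-3, -3)]

# Rotate by 45 degrees and expand by 1.5 first, then translate by (1, 2).
t = compose(affine(tx=1.0, ty=2.0), affine(theta=math.pi / 4, sigma=1.5))
moved = apply(t, prototype)
restored = apply(invert(t), moved)  # reversibility: back to the prototype
```

Because the translation is folded into the matrix, cancelling a whole chain of transforms takes a single inverse. This is what makes searching over rotation, scale and translation jointly convenient.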
In what follows, we evaluate the approach by generating exemplars from one category, applying some affine transforms to them, and then estimating the transform parameters that make the exemplar best fit the prototype. If the transforms are cancelled properly on average, it would suggest that the exemplars are properly aligned on the prototype, so that their likelihood (similarity) to the prototype can be assessed properly.

Perception of a unique object with distortions under rotation

As a first test, we used only one category P, shown in Figure 3, left. We also used only one affine transform, the rotation. It should be possible to undo the rotation to align one exemplar on the prototype even if the exemplar's points are distorted. Hence, on average, the difference between the amount of rotation of the exemplar and the counter-rotation needed to align it to the prototype should be null.

Methodology

The category is built using a 10-dimensional probability distribution. Each dimension represents the position (horizontal or vertical) of one vertex of an abstract figure. The distribution is assumed to be multinormal (i.e., a normal distribution in 10 dimensions) and the vertex positions are assumed independent. For convenience, we plot this distribution on a plane by splitting the distribution per vertex, so that we see 5 bivariate normal distributions. The sub-distributions are located at {-2, 2}, {2, 2}, {0, 0}, {3, -3}, and {-3, -3}, and each has a standard deviation of 1. An exemplar from this category is obtained by sampling 5 random coordinate pairs from the category distribution. Figure 4 shows the distribution and one exemplar. Figure 3, center, shows three possible exemplars. Finally, the exemplar is rotated by a random angle. Once the exemplar is created, category membership is assessed by finding the angle that maximizes the similarity of the exemplar to the prototype, that is, by maximizing likelihood (Eq. 7 using rotation only).
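The procedure just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the original simulation code. The maximization uses a plain grid search over candidate angles (swapped in here for the search algorithm used in the manuscript), rotations are about the origin, and the function names, grid resolution and random seed are all hypothetical.

```python
import math
import random

PROTOTYPE = [(-2, 2), (2, 2), (0, 0), (3, -3), (-3, -3)]

def rotate(points, angle):
    """Rotate (x, y) points about the origin by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def log_likelihood(points, prototype=PROTOTYPE, sd=1.0):
    """Log likelihood for independent normal coordinates sharing one sd."""
    const = -math.log(sd * math.sqrt(2.0 * math.pi))
    return sum(2 * const - ((x - px) ** 2 + (y - py) ** 2) / (2 * sd ** 2)
               for (x, y), (px, py) in zip(points, prototype))

def best_unrotation(points, steps=3600):
    """Grid search over [-pi, pi) for the angle maximizing the likelihood."""
    grid = (math.pi * (2.0 * k / steps - 1.0) for k in range(steps))
    return max(grid, key=lambda a: log_likelihood(rotate(points, a)))

def simulate(trials=200, max_angle=math.pi / 4, steps=720, seed=1):
    """Mean absolute gap between the applied rotation and its estimated undo."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        # Distort the prototype: each coordinate gets normal noise with sd 1.
        exemplar = [(rng.gauss(px, 1.0), rng.gauss(py, 1.0))
                    for px, py in PROTOTYPE]
        theta = rng.uniform(-max_angle, max_angle)
        undo = best_unrotation(rotate(exemplar, theta), steps)
        errors.append(abs(theta + undo))  # a perfect undo equals -theta
    return sum(errors) / len(errors)
```

Calling `simulate(max_angle=math.pi)` probes the widest range. The gap `theta + undo` is not wrapped onto the circle in this sketch, so two nearly identical orientations on either side of ±π would count as a large error.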
This whole process is repeated 5,000 times, and the difference between the true angle of rotation and the estimated angle that un-rotated the exemplar is recorded. Parameter search, here and later in the manuscript, was done using the Simplex algorithm (Nelder & Mead, 1965).

We checked four levels of difficulty by generating angles of rotation within a small range (from −π / 4 to +π / 4), a medium range (from −π / 2 to +π / 2), a moderately large range (from −2/3 π to +2/3 π) and finally using all possible rotation angles (from −π to +π). All angles were picked uniformly within the given range.

Results

The mean absolute differences between the true rotation and the angle that un-rotated the exemplars so that they maximally resemble the prototype are shown in Table 1. As seen, for small to moderately large angles, the exemplars are correctly un-rotated, the mean absolute error being non-significantly different from zero. However, when 180 degree rotations in both directions are possible, the recovery is no longer adequate: the difference between the true rotation angle and the un-rotation angle is 0.27 radian (about 15 degrees), significantly different from zero (p < .001). This is caused by a few rotations executed in the wrong direction (e.g., an estimate of 170 degrees clockwise when the true rotation was 170 degrees counterclockwise, which is recorded as an error of 340 degrees although the two orientations are only 20 degrees apart).

Table 1: Mean absolute difference between the true angle of rotation and the estimated angle that maximized similarity of exemplars to the prototype, along with the standard error (SE).

Difficulty range        error    SE
−π / 4 to +π / 4
−π / 2 to +π / 2
−2/3 π to +2/3 π
−π to +π

Table 1 also shows the mean log likelihood of the category with respect to the presented exemplar. For the category centroid, this value is −9.2. By comparison, the log likelihood of the raw exemplar goes from −19.4 to −69.6 in the four conditions.

Perception of a unique object with distortions under scale change

We ran a similar simulation varying scale only. The change in scale could range from a strong contraction (to about one fifth of the size of the original exemplar) to a large expansion (a fivefold increase in size).

Methodology

Exemplars were generated as previously from the same prototype. They were then scaled up or down by a factor $\sigma$. Then the transformation that cancelled the scale change, so that the exemplar is maximally similar to the prototype, was estimated. For a perfect recovery, the scales should cancel out multiplicatively, so that the product of the applied scale factor and the estimated corrective factor should return 1.
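The scale-only procedure can be sketched as below. Two assumptions differ from the text. First, the corrective factor is obtained in closed form rather than by a numerical search. This shortcut is valid only because the coordinates are independent normals with equal variance, so that maximizing likelihood reduces to least squares. Second, the sampling range for the applied scale (0.5 to 2) is an illustrative choice. All function names are hypothetical.

```python
import random

PROTOTYPE = [(-2, 2), (2, 2), (0, 0), (3, -3), (-3, -3)]

def best_rescale(points, prototype=PROTOTYPE):
    """Factor s maximizing the normal likelihood of s * points.

    Minimizing sum |s * x_i - p_i|^2 over s gives
    s = sum(x_i . p_i) / sum(|x_i|^2).
    """
    num = sum(x * px + y * py for (x, y), (px, py) in zip(points, prototype))
    den = sum(x * x + y * y for x, y in points)
    return num / den

def simulate(trials=5000, low=0.5, high=2.0, seed=1):
    """Mean product of the applied scale and the estimated corrective factor."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Distort the prototype, then scale the distorted exemplar.
        exemplar = [(rng.gauss(px, 1.0), rng.gauss(py, 1.0))
                    for px, py in PROTOTYPE]
        sigma = rng.uniform(low, high)
        scaled = [(sigma * x, sigma * y) for x, y in exemplar]
        total += sigma * best_rescale(scaled)  # 1.0 would be perfect recovery
    return total / trials
```

In this sketch the product does not depend on the applied scale, since `sigma` cancels out algebraically. The distortion noise inflates the denominator, so the mean product stays below 1.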
As previously, we explored several levels of difficulty, here five, from a strong reduction (approximately one fifth of the original exemplar size, the exact scale being sampled randomly in a range around 1/5) to a strong magnification (around 5), exploring also a weak reduction and a weak magnification (around 1/2 and 2) and no important change in scale (around 1).

Results

The results are shown in Table 2. As seen, the amount of scaling up or down makes no difference, all results being equal throughout the table.

Table 2: Mean ratio between the true magnification and the estimated magnification that maximized similarity of exemplars to the prototype, along with standard error (SE).

Magnification range     Mean ratio    SE
around 1/5
around 1/2
around 1
around 2
around 5

However, recovery is never 1 (a perfect restoration of the original size of the object). On average, the exemplars end up 14% smaller than originally created. This result is robust and was obtained for other prototype shapes. According to this system, decreasing the size of an exemplar increases its likelihood of being a member of the prototype distribution.

Perception of a unique object with distortions under rotation, scale change and translation

In this set of simulations, we replicated the previous ones using multiple affine transforms simultaneously. Indeed, there is no guarantee that these transforms do not accumulate imprecision, so that at some point the exemplar can no longer be matched to a prototype. Here, each exemplar generated was rotated, translated and scaled up or down by a random amount. Afterwards, the best-fitting likelihood, varying rotation, scale and translation freely, was found.

Methodology

The same prototype as earlier was used and exemplars from it were generated using the same procedure.
The exemplars were afterward rotated, scaled and translated by random amounts (using the range −π / 2 to +π / 2 for the rotation, the range 0.5 to 2 for the scale change, and the range −2 to +2 for horizontal and vertical translation). Afterwards, the resulting exemplars were adjusted to the prototype, allowing affine transforms to freely alter the exemplars. This process was repeated 5,000 times, and the deviation between the true transformations and the reversed transforms which maximized the fit between the exemplar and the prototype was computed. We did not vary the difficulty; rotations stayed within the range that was not too difficult.

Results

The results, when three affine transforms are combined, show no cumulative impact on the best fit between the exemplar and the category. The mean absolute error between the rotation angle and the un-rotation angle (± SE) is as in the first section. The ratio of scales is 0.89 ± 0.004, very similar to the results of the scale-only simulations. Finally, translations are removed well on average, both horizontally and vertically. These results, except for the scale, are all not significantly different from zero. For the scale, we find the same tendency for the size of the exemplar to be decreased.

Figure 4: One exemplar generation. The distribution of the category is seen on a plane as 5 modes, one for each vertex. From each mode, a pair of coordinates is sampled (the black arrow). Joining the vertices, we see the exemplar lying above the arrows. Then, the exemplar is rotated (here by 22 degrees) to obtain the tested exemplar.

The fit of the exemplar, once moved to the center, un-rotated and un-scaled, is better than the fit of the exemplar before it is moved, scaled and rotated. The log likelihood of the exemplar is improved by 23% on average, relative to the fit of the exemplar prior to the random affine transforms. Hence, a better point of view is achieved by letting the parameters of the affine transforms vary freely.

Categorization of objects from four categories under three affine transforms

Here we develop a classifier, a system that can choose to assign an exemplar to one of many alternative prototypes. In what follows, there will be four categories, each composed of five points. The centroids of these categories are generated randomly by picking 10 random numbers from a normal distribution with mean 0 and standard deviation 4, and forming five {x, y} coordinates. Figure 5 shows three such prototypes. As before, the distribution of each category is multinormal around these points, with a variance of 1 and no covariance.

Figure 5: Three prototypes generated randomly from 10 numbers taken from a normal distribution (4 prototypes were used in the simulations). Each prototype has 5 points.

One exemplar is generated from one of the categories chosen randomly, as in the previous simulations. Finally, this exemplar is rotated, scaled and translated by random quantities as in the previous section.
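A classifier of this kind can be sketched in Python as follows. The sketch departs from the text in two ways. The maximum likelihood over rotation, uniform scale and translation is computed in closed form (a least-squares alignment), which is equivalent to maximizing likelihood only under the stated model of independent normal coordinates with equal variance. The rotation is also left unrestricted during the fit. Function names, trial count and seed are hypothetical.

```python
import math
import random

def centered(points):
    """Subtract the centroid; this absorbs the optimal translation."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]

def best_fit_residual(exemplar, prototype):
    """Smallest squared distance to the prototype over rotation, scale, translation."""
    e, p = centered(exemplar), centered(prototype)
    a = sum(px * ex + py * ey for (ex, ey), (px, py) in zip(e, p))
    b = sum(py * ex - px * ey for (ex, ey), (px, py) in zip(e, p))
    ee = sum(ex * ex + ey * ey for ex, ey in e)
    pp = sum(px * px + py * py for px, py in p)
    # Best angle is atan2(b, a) and best scale is sqrt(a^2 + b^2) / ee.
    return pp - (a * a + b * b) / ee

def max_log_likelihood(exemplar, prototype, sd=1.0):
    """Log likelihood of the best-aligned exemplar under the normal model."""
    n_dims = 2 * len(prototype)
    const = -n_dims * math.log(sd * math.sqrt(2.0 * math.pi))
    return const - best_fit_residual(exemplar, prototype) / (2.0 * sd ** 2)

def random_transform(points, rng):
    """Random rotation, scale and translation within the ranges of the text."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    sigma = rng.uniform(0.5, 2.0)
    tx, ty = rng.uniform(-2, 2), rng.uniform(-2, 2)
    c, s = sigma * math.cos(theta), sigma * math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def accuracy(trials=1000, n_categories=4, n_points=5, seed=1):
    """Proportion of exemplars assigned to the category that generated them."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        prototypes = [[(rng.gauss(0, 4), rng.gauss(0, 4)) for _ in range(n_points)]
                      for _ in range(n_categories)]
        target = rng.randrange(n_categories)
        exemplar = [(rng.gauss(px, 1), rng.gauss(py, 1))
                    for px, py in prototypes[target]]
        test = random_transform(exemplar, rng)
        scores = [max_log_likelihood(test, proto) for proto in prototypes]
        correct += scores.index(max(scores)) == target
    return correct / trials
```

Since all prototypes have the same number of points, the constant term of the log likelihood is identical across categories. The decision therefore depends only on the residual after alignment.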
The likelihood of this test exemplar is measured against each of the four prototypes in turn, allowing rotation, scaling and translation to vary freely to maximize likelihood. Finally, the system assigns the exemplar to the prototype whose likelihood is closest to one (whose similarity is greatest). This response is then compared to the correct response. This process is repeated 5,000 times, generating four new prototypes each time, and the proportion of correct responses is computed.

The results are strikingly good: this model correctly categorized 97.6% of the distorted test exemplars, even though they were altered by affine transforms of random magnitude (avoiding the largest rotations; see the results of the first Monte Carlo simulations). By comparison, the chance level here is 25% correct.

Discussion

All the prototypes in the last simulations had an equal number of dimensions. This is relevant because the likelihood index of fit depends on the number of dimensions. With more dimensions, the likelihood would indicate a weaker fit. Hence, if some of the prototypes had more points than the others, they would be less likely to be selected for this reason alone (resulting in a bias of the model against complexity). It may be more relevant to use the mean log likelihood, that is, the log likelihood divided by the number of dimensions.

The simulations performed very well as a classifier. One reason that might explain this good performance is the fact that the system was provided with some information: it knew the true distribution of the prototypes. In the simulations, this was a multinormal distribution centered on every prototype's points with variances of 1 and no covariance. There is no reason for a system to know these distributions. A real-life system would need to interact with exemplars, building a density of the possible point positions from these interactions. It is essentially a question of building an estimate of the real distributions from sampling.
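As a rough illustration of building such an estimate from sampled exemplars, the sketch below fits a one-dimensional kernel density to a single coordinate of a single vertex. The Gaussian kernel and the normal-reference bandwidth 1.06 s n^(-1/5) are assumptions made for the example and may differ from the estimator used for the figures. The sample values are simulated and the function names are hypothetical.

```python
import math
import random

def bandwidth(samples):
    """Normal-reference bandwidth: 1.06 * sample sd * n**(-1/5)."""
    n = len(samples)
    mean = sum(samples) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in samples) / (n - 1))
    return 1.06 * sd * n ** (-0.2)

def kde(samples, h=None):
    """Return a function estimating the density with Gaussian kernels."""
    h = bandwidth(samples) if h is None else h
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in samples)

    return density

rng = random.Random(1)
# 100 canonical exemplars: one coordinate of one vertex, centered on 2 with sd 1.
observed = [rng.gauss(2.0, 1.0) for _ in range(100)]
estimate = kde(observed)  # estimate(x) approximates the marginal density at x
```

Under the independence assumption, one such marginal estimate per dimension could stand in for the known normal densities when computing the likelihood of a new exemplar.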
If the exemplars are presented in their canonical form (that is, without rotation, without scaling, without translation), then this estimate can be built by superimposing the exemplars, as in Figure 6. As seen, with more than 100 exemplars perceived, the distributions (using linear interpolation and a bandwidth as per Silverman, 1986) are reasonably well estimated. This solution, however, does not hold in the presence of affine transforms. If the exemplars are, for example, rotated freely, the distribution of their points will cover a complete circle (more like a "donut"). If, in addition, they can be magnified or reduced, the distribution will simply cover the whole space. This implies that such a system cannot learn to develop new categories unless a supervisor first places the exemplars in a canonical point of view (always the same orientation, always the same scale, always without translation, etc.). The supervisor does not have to identify the relevant dimensions for the learner, but its role is nonetheless crucial if accurate classification is to be attained.

Figure 6: Estimated densities built from random sampling of exemplars. For better visibility, we used exemplars that had only three points, forming an isosceles triangle; likewise, we did not show lines connecting the points; matching colors indicate points that belonged to the same exemplar. In the top row, only 50 exemplars were presented; in the central row, 100 exemplars were presented; in the bottom row, 500 exemplars were presented. On the left, the exemplars are superimposed; on the right, the smooth kernel estimates of the distribution are shown. For 50 exemplars, the estimated distribution is not smooth, unlike the true distribution, a multinormal distribution with no covariance.

The probabilistic representation of prototypes discussed in this document provides a unifying account of exemplar theories, prototype theories and decision-bound theories of classification. It is not mathematically equivalent to these models, but makes it simpler to generate common predictions from the same training exemplars. In addition, as was shown, it accommodates affine transforms without difficulty. Finally, it indicates how prototypes could be coded in a memory system: it is enough to superimpose neural activities across time, so that the only requirement is a topological organization of the neural substrate, as in Kohonen networks (Kohonen, 1984). Hence, a neural implementation of this categorization model becomes straightforward once a self-organization of perceptions is assumed.

Because prototype, exemplar and decision-bound theories can explain (with numerous free parameters) over 95% of the variance in human categorization tasks, it is difficult to assess in what respect they make different predictions and for what reason. The current unifying framework could address these questions without free parameters.

Overall, this system is testing hypotheses about category membership. When there are two alternative categories, Eq. 3 almost implements a likelihood ratio test. It therefore supports the view held by some researchers that the purpose of the brain is to anticipate events by assigning probabilities to the possible outcomes (Farell, 1985).

Acknowledgments

The author thanks Bradley Harding, Sébastien Hélie, Gyslain Giguère and Guy Lacroix for discussions and the University of Ottawa for a starting fund.

References

Edwards, A. W. F. (1992). Likelihood. Baltimore: The Johns Hopkins University Press.
Farell, B. (1985). "Same"-"Different" judgments: A review of current controversies in perceptual comparisons. Psychological Bulletin, 98.
Hélie, S., & Ashby, F. G. (2012). Learning and transfer of category knowledge in an indirect categorization task. Psychological Research, 76.
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.
Luce, R. D. (1963). Pure recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology. New York: John Wiley and Sons.
Luce, R. D. (1986). Response times: Their role in inferring elementary mental organization. New York: Oxford University Press.
Maddox, W. T., & Ashby, F. G. (1993). Comparing decision bound and exemplar models of categorization. Perception and Psychophysics, 53.
Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115.
Nosofsky, R. M., & Palmeri, T. J. (1997). An exemplar-based random walk model of speeded classification. Psychological Review, 104.
Rosch, E., & Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Smith, J. D., Redford, J. S., & Haas, S. M. (2008). Prototype abstraction by monkeys (Macaca mulatta). Journal of Experimental Psychology: General, 137.

