
CSC2541: NATURAL SCENE STATISTICS FINAL PROJECT

The Statistics and Biology of Higher Order Vision

Marcus Brubaker
Email: mbrubake@cs.toronto.edu

Abstract

Statistics of images beyond simple first order local properties have been difficult to study. This paper reviews some recent results from neurophysiology about how objects are represented. Focus is directed towards how the results could be used to guide studies of higher order image statistics.

I. INTRODUCTION

The last decades have seen a boom in the amount of knowledge about the brain due to the widening use of technology such as fMRI. Much of the study has been devoted to understanding how the brain sees. Unfortunately, the majority of this work has focused on the low-level areas of visual processing such as V1. The result is that, while there is a good understanding of V1, how it fits into the larger picture of visual perception is generally a mystery.

However, the absence of a broader picture of visual perception cannot be entirely explained by the sparsity of data beyond early visual areas. Rather, the complex non-linearity of the phenomena found in the higher visual areas has been notoriously hard to explore and characterize with traditional psychophysical and neurophysiological techniques. Concepts which have driven the exploration of early visual areas for nearly fifty years, such as neural receptive fields, have failed to capture the complexity of higher visual areas.

Parallel to the biology of vision has been the study of image statistics. This area, which has potential implications for both biological and computational vision systems, has done well in understanding simple, low-level visual features. These results, based on low order statistical models, have provided insight into the computational principles underlying the retina, LGN and area V1. However, as with the biology, there has been something of a road block in going beyond these low order models.
This review aims to bring together recent neurophysiological results concerning higher level functions of the visual system, with a particular focus on object recognition. To further limit the scope, we will focus on two questions: how are objects represented in the brain, and what is the nature of this representation? Relevant results will be discussed with the hope of suggesting directions for future statistical and computational modeling efforts. Background on these questions is presented in section II, where their history is briefly surveyed and the dominant theories are introduced. Section III reviews some recent results and section IV discusses them.

II. BACKGROUND ON OBJECT REPRESENTATION

The representation of objects in the brain is one of the most important questions in visual perception, particularly for those interested in building computer vision systems. Answering this question would not simply expand knowledge about visual perception but could lead to a reevaluation of current theories of all levels of visual processing and provide vital clues to the functioning of the brain as a whole.

There are two classical schools of thought with respect to how objects are represented in the brain. The first proposes that objects are represented by 3D models and that the purpose of the primary visual cortex is to extract 2D clues of 3D structure. The canonical example of this school of thought is Recognition-by-Components (RBC) [1], which suggests that objects are represented by 3D primitives called geons which are inferred by detecting non-accidental properties such as co-curvilinearity and parallelism.

The second school of thought suggests that objects are represented by collections of 2D views and thus that recognition amounts to finding the closest 2D view of the perceived object. For instance, Tarr and Bülthoff [2] argued that Recognition-by-Components is inconsistent with a significant amount of experimental data which a view-based theory can explain. Similarly, in [3] Edelman argued for a view-based representation as an intermediary stage to higher level object recognition.

Nearly twenty years of intensive experimentation and debate on parts versus views has yet to produce a conclusive answer, despite the claims of those in both camps such as [4] and [5]. Arguably, one of the few certainties to come out of the debate is that the truth is somewhere in between. Some experiments have shown that recognition is affected by view-based effects [6], while others have shown that non-accidental properties have a disproportionate effect on recognition [7].
Intimately tied to this debate is the imagery debate, which concerns the phenomenon of mental imagery: what is happening when we visualize an object in our mind? The tie to the question of object representation arises from the realization that the mechanisms that support mental imagery also support recognition and perception. For a good review of this particular debate see [8].

III. OBJECT REPRESENTATION AND ATTENTION

A. Complex Objects are Represented in Macaque Inferotemporal Cortex by the Combination of Feature Columns

In a now well-known paper [9], Tsunoda et al. used a neuroimaging technique called intrinsic signal imaging (ISI) to investigate the representation of shape in the inferotemporal (IT) cortex of the macaque. In ISI an optical sensing device is used to record reflectance variations of the targeted cortical area. This recording is called the intrinsic signal (IS). In the brain, slight reflectance variations are caused by changes in blood volume, deoxyhemoglobin concentration, and the tissues themselves [10]. (The presence of deoxyhemoglobin in a region indicates the use of the oxygen carried there by red blood cells and thus recent neural activation.) The IS is inherently noisy and tends to emphasize blood vessels. With careful application and some post-processing, the IS can be cleaned up to give good indications of neural activation with better spatial localization than other neuroimaging techniques [11].

Using ISI, Tsunoda et al. were able to simultaneously observe the activation of neural columns in IT over a significant region of the cortex. Unlike many investigations of visual perception, their stimuli consisted of complex objects. To better understand how an object was being represented in cortex, they recorded activity in IT for both the whole object and several simplifications of the object. See figure 1 for a sample stimulus.

Fig. 1. A sample stimulus and its simplifications from [9].

They found that in a significant number of instances the simplified stimulus activated some regions that were also activated by the more complex object but failed to activate others. Further, by using extracellular recordings in these regions, they found optimal stimuli which were in general simpler than the original objects. Their results point towards a representation in IT based on the combination of active and inactive columns, where each column indicates the presence of a particular visual feature.

Despite the attention that these results received, it is unclear how strongly their conclusions can be justified. Their stimulus simplification procedure consisted of manually segmenting the stimuli, reducing them to gray silhouettes, or masking portions of them. Unfortunately, manual segmentation leaves room for significant experimenter bias in the results. To better support their conclusions, experiments with arbitrarily segmented stimuli are needed.

At first glance the conclusions of this paper seem to support an RBC-like representation.
However, as with most research in support of parts- or view-based representations, it does not exclude the other view. In particular, based on the results presented it is impossible to distinguish whether the columns are representative of 2D or 3D features. To help resolve this ambiguity, experiments with depth-rotated stimuli need to be performed. While most work in IT has focused on 2D shape, sensitivity to stereo disparity has been found [12], which could indicate the representation of 3D information. An answer to whether IT encodes 3D shape and object geometry, or simply 2D shape plus disparity, could significantly sway the parts- versus view-based debate since IT is one of the last purely visual areas of the brain.

On a computational note, the conclusions of Tsunoda et al. fit well with the HMAX model developed by Riesenhuber and Poggio (see [13] for a recent overview). The HMAX model proposes that the neural responses of a given level are built by summing and maximizing the responses of the earlier levels. A distributed representation where columns represent visual features fits well conceptually within an HMAX-like model.
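The summing and maximizing steps described above can be illustrated with a small sketch. This is not the HMAX implementation from [13]; it is a minimal one-dimensional toy under stated assumptions. The function names (`sum_stage`, `max_stage`, `hmax_like`), the weights, the pool size and the input values are all hypothetical and chosen only for illustration.

```python
# Toy illustration of an HMAX-like level: a weighted sum followed by a max.
# All names and numbers here are hypothetical, not taken from [13].

def sum_stage(responses, weights):
    """Weighted sum over each sliding window of lower-level responses."""
    width = len(weights)
    return [
        sum(w * r for w, r in zip(weights, responses[i:i + width]))
        for i in range(len(responses) - width + 1)
    ]

def max_stage(responses, pool):
    """Maximum over non-overlapping pools of neighbouring units."""
    return [max(responses[i:i + pool]) for i in range(0, len(responses), pool)]

def hmax_like(responses, weights, pool):
    """One level: combine features by summing, then pool by maximizing."""
    return max_stage(sum_stage(responses, weights), pool)

# The same two-unit feature at two nearby positions in the lower level.
lower = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shifted = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
weights = [0.5, 0.5]

print(hmax_like(lower, weights, 3))
print(hmax_like(shifted, weights, 3))
```

In this toy the summing step makes a unit selective for a conjunction of lower-level responses, while the max step makes the output tolerant to small shifts of the feature within a pool, so both inputs yield the same higher-level response.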

B. Population Coding of Shape in Area V4

In [14] Pasupathy and Connor studied how the population of neurons in V4 encodes shape. In an expansion of previous work [15], they used single cell recordings in response to 2D shapes of moderate complexity. Beyond simply examining the sensitivities of cells in V4, they constructed a simple 2D shape model and used it to predict responses.

Previous work had found that some cells in V4 were sensitive to boundary curvature. This led the researchers to hypothesize in [15] a model where V4 encodes 2D shape with neurons sensitive to boundary features in object-centric coordinates. Specifically, for a shape centered in a cell's classical receptive field, they proposed that the tuning curve of V4 neurons would be peaked around specific combinations of radial location and curvature. Thus, a spatially localized collection of such neurons could describe the boundary of a 2D shape.

This was modelled with a collection of Gaussians. Specifically, if the boundary of a shape is sampled at $P$ points and each point $p$ has a curvature measure $X_c(p)$ and an orientation angle $X_o(p)$, then the response $r$ of a cell (normalized such that $r \in [0, 1]$) is modelled as

$$ r = \max_p \; k \prod_{f \in \{c,o\}} \exp\left( -\frac{(X_f(p) - \mu_f)^2}{2\sigma_f^2} \right) \qquad (1) $$

where $k$ is a normalization constant and $\mu_c$, $\mu_o$, $\sigma_c$ and $\sigma_o$ are parameters. By fitting this model to data in a least-squares sense, they were able to predict the responses to stimuli with significant accuracy. See figure 2 for a comparison of the observed and predicted responses.

This work was continued in [14] by recording from a larger population of neurons and using the model to reconstruct the stimuli based on recorded activations. The reconstructions provided a convincing demonstration of their model, as can be seen in figure 3.
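The tuning model of equation 1 can be sketched in a few lines. This is an illustrative reading of the equation, not the fitting code used in [15]: the function name `v4_response`, the parameter values and the sample boundary are hypothetical, and the orientation angle is treated as an ordinary real number, ignoring its circular wrap-around.

```python
import math

def v4_response(boundary, mu_c, sigma_c, mu_o, sigma_o, k=1.0):
    """Equation 1 as a sketch: max over boundary points of a product of Gaussians.

    boundary is a list of (curvature, orientation) samples, one per point p.
    Assumption: angle wrap-around is ignored for simplicity.
    """
    def tuning(curvature, orientation):
        z_c = (curvature - mu_c) / sigma_c
        z_o = (orientation - mu_o) / sigma_o
        return k * math.exp(-0.5 * z_c * z_c) * math.exp(-0.5 * z_o * z_o)

    return max(tuning(c, o) for c, o in boundary)

# Hypothetical cell tuned to sharp convexity (curvature 1.0) at 90 degrees.
params = dict(mu_c=1.0, sigma_c=0.2, mu_o=90.0, sigma_o=20.0)

# Hypothetical shapes, each sampled at three boundary points.
with_feature = [(0.1, 0.0), (1.0, 90.0), (-0.3, 200.0)]
without_feature = [(0.1, 0.0), (0.0, 270.0), (-0.3, 200.0)]

print(v4_response(with_feature, **params))
print(v4_response(without_feature, **params))
```

Because of the max over points, the cell responds strongly whenever any single boundary point matches its preferred curvature and orientation, regardless of what the rest of the boundary looks like.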
While clearly not optimal (for instance, reconstruction of straight edges performed poorly due to training set bias), the model did remarkably well at explaining the responses of shape-sensitive cells in V4. The locality of the shape representation led the authors to conclude that this fits with a parts-based model of object representation. Indeed, localized representations of shape are a crucial prediction of such a model. However, localized image descriptors are also a necessary component of any view-based model.

The use of max in equation 1 may seem an odd choice. The authors noted that similar results were obtained by replacing $\max_p$ with $\sum_p$. However, the use of the max operation is consistent with the previously mentioned HMAX model.

C. Underlying Principles of Visual Shape Selectivity in Posterior Inferotemporal Cortex

Recently Brincat and Connor [16] expanded the model used in [15] and [14] to study the representation of boundary shape in IT. They found good fits between their learned model predictions and observed data, and analyzed the features of the learned models. The analysis found significant non-linear effects and indicated that, relative to their receptive field, the neurons showed invariance in terms of size and position.

Fig. 2. Observed versus predicted responses of the shape model. The correlation coefficient between the observed and predicted responses in this case was 0.70. Adapted from [15].

While still using a Gaussian function as the core of their model, they added relative and absolute position to curvature and angular orientation as boundary features. Further, they sum over the boundary points instead of taking the maximum response. In particular, using notation similar to equation 1, the response of a subunit $s$ is

$$ R_s = A_s \sum_p \prod_f \exp\left( -\frac{(X_f(p) - \mu_f^s)^2}{2(\sigma_f^s)^2} \right) \qquad (2) $$

where $X_f(p)$ is the value of feature $f$ at boundary point $p$, $\mu_f^s$ and $\sigma_f^s$ are the parameters of feature $f$ in subunit $s$, and $A_s$ is the amplitude, which is positive if the unit is excitatory and negative if it is inhibitory. Unlike their original model, an individual cell is characterized by a combination of these subunits. In particular, they allow for both linear and non-linear combinations by defining

$$ R = G \left[ \sum_s (w_s R_s) + w_{NL+} \prod_{s^+} R_s + w_{NL-} \prod_{s^-} R_s \right]^+ + b_0 \qquad (3) $$

where terms indexed by $s^+$ range over excitatory subunits, terms indexed by $s^-$ range over inhibitory subunits, the $w_s$ are weights, $[\,\cdot\,]^+$ is half-wave rectification, $G$ is a normalization factor and $b_0$ is the baseline (null stimulus) firing rate. This model has the advantage that fitting the weights allows it to vary between linear and non-linear effects. Note, however, that the half-wave rectification adds a fixed non-linearity regardless of the weights. When fitting, they automatically determined the number of units by requiring the data variance associated with each unit to be less than 2.5%.

Fig. 3. Each box shows the original stimulus (top left), the reconstructed stimulus (top right), the population coding of the shape (bottom height plot), and the angle versus curvature plot of the original stimulus (bottom white curve). The population coding is a height plot of the response-weighted sum of individual tuning functions. The x-axis is angle relative to the center of mass, with 0 degrees on the left indicating the right side of the shape and 180 degrees in the center of the x-axis representing the left side of the shape. The y-axis is the squashed curvature, which ranges from 0.3 at the bottom to 1.0 at the top. Color indicates response strength, with blue indicating r = 0 and red indicating r = 1. Adapted from [14].

Much of the detail of their results was similar in nature to their previous work. They measured the contribution of the simple non-linearity by measuring the magnitude of the non-linear weights ($w_{NL+}$ and $w_{NL-}$) in relation to the sum of all weights. The results, which can be seen in figure 4, show that, while the linear terms have the largest impact, non-linear effects are significant. It is worth noting that this model is a natural, hierarchical extension of their previous model. Thus, the discovered non-linearity is indicative of the nature of the transformation between V4 and IT. While the particular form of the non-linearity may not be entirely correct, the fact that such a simple form of non-linearity is significant is promising.

Fig. 4. Non-linearity index histogram. The non-linearity index is defined as $w_{NL\pm} / (\sum_s w_s + w_{NL+} + w_{NL-})$. Adapted from [16].

They also measured how invariant the neurons were to the size of their preferred stimuli. They found that within a broad range around their preferred size the neurons responded similarly, but once the stimuli became either too small or too large relative to the receptive field the responses dropped off quickly. The size tuning of an exemplary neuron can be seen in figure 5.

Fig. 5. Size sensitivity of a sample neuron. Three objects of varied preference to the neuron (black being most preferred, light gray being least preferred) were shown at different retinotopic scales. Curves of the same shade correspond to the mean response, with the dotted lines indicating one standard deviation. Adapted from [16].

D. Shape-coding in IT cells generalizes over contrast and mirror reversal, but not figure-ground reversal

To continue on the theme of invariance in IT, we now turn to a paper by Baylis and Driver [17] which examines the invariances of IT neurons. For stimuli they used four relatively simple shapes which were transformed in three ways:
1) Contrast reversal,
2) Mirroring across the vertical axis, and
3) Figure-ground reversal.

See figure 6 for an illustration of the shapes and transformations. Single cell recording was used to determine the response of a set of cells to each of the 32 stimuli. To test the invariance of each cell, its response to a given shape was compared to its responses to transformations of that shape. The results showed strong correlations in the firing rate under contrast reversal and mirroring. That is, the responses of the cells were invariant to reversals in contrast and to mirroring. However, there were no such correlations under figure-ground reversal. This result is quite surprising since figure-ground reversal is the only transformation which leaves local regions unchanged.

Fig. 6. Stimuli and transformations used in experiments by Baylis and Driver. Adapted from [17].

These results, while not as illuminating with respect to shape representation as the others we have looked at, provide an interesting link to another crucial topic: attention. Figure-ground separation is known to be an important aspect of visual perception and is crucial in most computer vision systems. Attention is often thought to play a crucial role in figure-ground separation, and this result indicates that IT, if not directly involved in figure-ground separation, is certainly affected by the process.

E. Shape perception reduces activity in human primary visual cortex

Murray et al. set out to study the process of perceptual grouping in [18] with an fMRI study of humans. Three experiments were reported in this paper, and we will look at the first two. Both compared the average fMRI signal from regions marked as the lateral occipital complex (LOC) and V1. In the first experiment, human subjects were shown randomly placed lines, lines which constituted a 2D shape, and lines which constituted a 3D shape. In the second, three kinds of random dot displays were presented: stationary, velocity-perturbed random dots projected onto a 3D structure, and 3D structured motion.

The LOC showed stronger responses to more structured stimuli, consistent with other recent studies which have suggested that the LOC prefers coherent stimuli. The interesting finding in this study was that V1 showed reduced average activation to structured input.
In particular, activation levels of the LOC and V1 seemed inversely correlated when presented with structured versus unstructured input. This result indicates that recognition has a net inhibitory effect on the early levels of vision. The authors note that a perceptual model of predictive coding is consistent with these results.

Murray continued on this line with Wojciulik in [19], where they studied the effects of attention on the selectivity of a population of neurons. Using similar experimental techniques, they compared the population responses to attended versus unattended shapes. Their findings indicated that attention not only enhanced the responses but also increased the shape selectivity of neurons.

IV. DISCUSSION

The results reviewed suggest that higher level visual processes in the brain consist of localized shape descriptors. While none of the work presented has been conclusive, most of it has preferentially supported a parts-based object representation. Moreover, the research has indicated that object representation is intimately tied to attention, figure-ground separation and grouping. Understanding how these processes interact will be key to understanding each of them individually.

The work of Baylis and Driver, while inconclusive on this account, suggests that perhaps IT only represents the shape of the figure and not of the ground. Of course, alternative explanations abound. For instance, the representation may be only of 2D boundary shape, restricted to boundaries which belong to the figure in some sense.

In a review paper, Treisman [20] proposed that the spatial attention window solves the binding problem by having upper visual areas represent only the features found in the attended spatial window. This theory is consistent with the work of Baylis and Driver but has its own limitations when we consider the multiscale nature of visual perception. Under such a theory we should have a hard time recognizing a shape with significant texturing or other shapes on it. Yet we are able to recognize a bus even if it has pictures on it.

While it may appear that there are more questions here than answers, it is fortunate that many of these questions may be answerable in the near future. If we are able to determine whether IT neurons are selective to 3D features or strictly to 2D shape, we could begin to discount figure-ground separation in IT as strictly boundary ownership.

On the Parts Versus Views Debate: Twenty years of research without conclusion has set low expectations for a resolution to this debate.
What can be said is that the truth appears to be somewhere in between. Mounting, yet still inconclusive, evidence favours a parts-based representation at higher levels. However, view-based factors are clearly at work early on and are able to impact recognition performance appreciably [4].

More to the point, though, I think it can be said that the debate is quickly becoming moot. As we get closer to understanding just how neurons in IT and beyond represent shape, the questions that the debate was trying to answer in the first place may be answered, even if the debate itself is not. That it is possible to accurately predict neural responses to non-trivial stimuli, as Pasupathy and Connor have done, is an indication of just how close we are to answering those questions.

On the Statistics of Objects: Higher order statistics of images have been, at best, difficult to study. Attempts at doing so (e.g., [21]) have usually failed to go far beyond local, first order statistics. The stumbling block is generally how to significantly increase the order of a model while constraining it to be both tractable and realistic. The results here, particularly those of Brincat, Connor and Pasupathy, have supported the simplistic HMAX model. This suggests the HMAX model as a natural candidate for examining the statistics of objects in images, as its properties may even be amenable to direct analytical study.

A now classical result [22] concerning unsupervised learning rules in neural networks has shown how biologically plausible (or at least not biologically implausible) learning rules can lead individual neurons to become independent component analyzers. ICA has been applied to images, but the results have remained confined to the early visual cortex [23]. This limitation is inherent in any linear model such as ICA, so the question is how best to add a non-linearity. The work of Brincat and Connor suggests that relatively simple non-linearities may be sufficient and worth exploring.

More recent work on networks of spiking neurons [24] shows that such networks may be performing online belief propagation in the log-likelihood domain. This naturally suggests graphical models as a means of studying higher order statistics of images. While learning in graphical models can be hard, connections with local learning rules of spiking neurons could provide a means of traction. This sits well with the results of Murray et al., as predictive coding could easily be implemented in a belief network with cycles.

V. CONCLUSION

This review has covered several papers which help illuminate the nature of shape and object representation in the brain. The reviewed papers favour a parts-based representation, though they do not preclude view-based modulation of the parts-based representation. Several areas for future work have been proposed in the study of image statistics and object representations. Simple non-linear extensions have been suggested for future exploration, as well as complex probabilistic models. Also suggested have been directions for future research in neurophysiology to help resolve ambiguities in existing work.

REFERENCES

[1] I. Biederman, "Recognition-by-components: A theory of human image understanding," Psychological Review, vol. 94, pp. 115-147, 1987.
[2] M. J. Tarr and H. H. Bülthoff, "Is human object recognition better described by geon structural descriptions or by multiple views? Comments on Biederman and Gerhardstein," Journal of Experimental Psychology: Human Perception and Performance, vol. 21, no. 6, pp. 1494-1505, 1995.
[3] S. Edelman, Representation and Recognition in Vision. MIT Press, 1999.
[4] M. J. Tarr, "Visual object recognition: Can a single mechanism suffice?" in Perception of Faces, Objects and Scenes: Analytic and Holistic Processes, M. A. Peterson and G. Rhodes, Eds. Oxford University Press, 2003.
[5] I. Biederman, "Recognizing depth-rotated objects: A review of recent research and theory," Spatial Vision, vol. 13, pp. 241-253, 2001.
[6] M. J. Tarr, P. Williams, W. G. Hayward, and I. Gauthier, "Three-dimensional object recognition is viewpoint dependent," Nature Neuroscience, vol. 1, no. 4, pp. 275-277, August 1998.
[7] G. Kayaert, I. Biederman, and R. Vogels, "Shape tuning in macaque inferior temporal cortex," The Journal of Neuroscience, vol. 23, no. 7, pp. 3016-3027, April 1 2003.
[8] S. M. Kosslyn, Image and Brain: The Resolution of the Imagery Debate. Cambridge, Massachusetts: MIT Press, 1994.
[9] K. Tsunoda, Y. Yamane, M. Nishizaki, and M. Tanifuji, "Complex objects are represented in macaque inferotemporal cortex by the combination of feature columns," Nature Neuroscience, vol. 4, no. 8, pp. 832-838, August 2001.
[10] C. H. Chen-Bee, D. B. Polley, B. Brett-Green, N. Prakash, M. C. Kwon, and R. D. Frostig, "Visualizing and quantifying evoked cortical activity assessed with intrinsic signal imaging," Journal of Neuroscience Methods, vol. 97, no. 2, pp. 157-173, April 2000. [Online]. Available: http://www.sciencedirect.com/science/article/B6T04-404H21S-8/2/1d433f5cad676d3f9539142bbc511ce0
[11] C. J. Hodge, R. T. Stevens, H. Newman, J. Merola, and C. Chu, "Identification of functioning cortex using cortical optical imaging," Neurosurgery, vol. 41, no. 5, pp. 1137-1145, November 1997.
[12] T. Uka, H. Tanaka, K. Yoshiyama, M. Kato, and I. Fujita, "Disparity selectivity of neurons in monkey inferior temporal cortex," J Neurophysiol, vol. 84, no. 1, pp. 120-132, 2000. [Online]. Available: http://jn.physiology.org/cgi/content/abstract/84/1/120
[13] M. Riesenhuber and T. Poggio, "How the visual cortex recognizes objects: The tale of the standard model," in The Visual Neurosciences, L. M. Chalupa and J. S. Werner, Eds. Cambridge, Mass.: MIT Press, 2004, vol. 2, ch. 111, pp. 1640-1652.
[14] A. Pasupathy and C. E. Connor, "Population coding of shape in area V4," Nature Neuroscience, vol. 5, no. 12, pp. 1332-1338, December 2002.
[15] A. Pasupathy and C. E. Connor, "Shape representation in area V4: Position-specific tuning for boundary conformation," J Neurophysiol, vol. 86, pp. 2505-2519, 2001.
[16] S. L. Brincat and C. E. Connor, "Underlying principles of visual shape selectivity in posterior inferotemporal cortex," Nature Neuroscience, vol. 7, no. 8, pp. 880-886, August 2004.
[17] G. C. Baylis and J. Driver, "Shape-coding in IT cells generalizes over contrast and mirror reversal, but not figure-ground reversal," Nature Neuroscience, vol. 4, no. 9, pp. 937-942, September 2001.
[18] S. O. Murray, D. Kersten, B. A. Olshausen, P. Schrater, and D. L. Woods, "Shape perception reduces activity in human primary visual cortex," PNAS, vol. 99, no. 23, pp. 15164-15169, 2002. [Online]. Available: http://www.pnas.org/cgi/content/abstract/99/23/15164
[19] S. O. Murray and E. Wojciulik, "Attention increases neural selectivity in the human lateral occipital complex," Nature Neuroscience, vol. 7, no. 1, pp. 70-74, January 2004.
[20] A. Treisman, "Feature binding, attention and object perception," Philosophical Transactions: Biological Sciences, vol. 353, no. 1373, pp. 1295-1306, August 29 1998.
[21] Y. Karklin and M. S. Lewicki, "Learning higher-order structures in natural images," Network: Computation in Neural Systems, vol. 14, no. 3, pp. 483-499, 2003. [Online]. Available: http://stacks.iop.org/0954-898x/14/483
[22] E. Oja, "PCA, ICA, and nonlinear Hebbian learning," in Proc. Int. Conf. on Artificial Neural Networks, Paris, France, October 9-13 1995, pp. 89-94.
[23] A. J. Bell and T. J. Sejnowski, "The independent components of natural scenes are edge filters," Vision Research, vol. 37, no. 23, pp. 3327-3338, December 1997.
[24] R. P. N. Rao, "Hierarchical Bayesian inference in networks of spiking neurons," in Advances in NIPS, vol. 17, 2005.