An Introduction to Biologically-Inspired Visual Recognition


Universität Hamburg, Department Informatik, Knowledge Technology (WTM)
Seminar Paper, Biologically-Inspired Artificial Intelligence
Sebastian Starke, Matr.Nr

Abstract

This paper introduces the topic of visual recognition from a biologically-inspired point of view. Since humans can efficiently detect and recognize edges, structures and motion and are able to classify objects in highly complex and dynamic real-world environments, the paper briefly explains the anatomy and functionalities of the primate visual system in order to give an intuitive understanding of the underlying processes. Building on this, the opportunities and limitations of the resulting computational approaches are described. Edge detection, a fundamental lower-level problem, is considered by comparing different conventional algorithms with novel biologically-inspired approaches that use multilevel surround inhibition or hexagonal pixels processed by spiking neural networks. Finally, the hierarchical vision architecture HMAX for the higher-level problem of object recognition is discussed, together with the improvements that could be obtained by sparsity-regularization.

Contents

1 Introduction and Motivation
2 The Primate Visual System
2.1 Anatomy and Functionalities
2.2 Inspirations for Computational Visual Recognition
3 Edge Detection
3.1 Conventional Operators
3.2 Multilevel Surround Inhibition
3.3 Hexagonal Pixels and Spiking Neural Networks
4 Object Recognition
4.1 HMAX: The Standard Model
4.2 Sparsity-Regularization
5 Conclusion
Bibliography


1 Introduction and Motivation

"Here lay a way to formulate the purpose of vision - building a description of the shapes and positions of things from images. Of course, that is by no means all that vision can do; it also tells about the illumination and about the reflectances of the surfaces that make the shapes - their brightness and colors and visual textures - and about their motion. But these things seemed secondary; they could be hung off a theory in which the main job of vision was to derive a representation of shape."
David Marr

This quote captures the problem of vision in an intuitive but very meaningful way. The brain's visual system provides remarkably good abilities in visual information processing by applying highly complex neuronal processing, and it seems to perform this processing almost effortlessly. The ultimate goal of computational image processing is therefore to find algorithms and models that imitate this behaviour as well as possible. In fact, nature often provides very good solutions to problems, so it is beneficial to adopt the underlying patterns and processes where possible. For example, the depth and distance of objects are commonly determined by stereo vision, using two cameras and integrating the correlated information of two separate images. This principle is clearly adopted from nature: two eyes are independently able to see, but are connected to the same visual system for neuronal processing. However, in contrast to cameras, which merely give a raw projection of the environment, the visual system analyses lower-level features like edges, shapes and forms, textures and colors, motion and depth, and provides subjective awareness of higher-level categories like objects and their interpretation. Fig. 1 shows two examples of illusory contour perception that give intuitive evidence of such visual information processing in the brain. Although there is no change in intensity or true luminance along the invisible contours, the brain perceives specific shapes, even at different depth layers.

Figure 1: Illusory contour perception: an illusory white square overlapping four perceived circles (left image) and the famous Kanizsa triangle (right image)

In algorithmic design, biologically-inspired approaches are more likely to fulfill the demands on robustness, selectivity and speed that the primate visual system meets and that many applications require, such as edge and motion detection, object and activity recognition or vision-based navigation. Therefore, the neuronal information processing in the brain has been extensively studied in recent years. It can be modelled as a complex hierarchical processing system that is subdivided into various layers consisting of cells of different complexity [Hubel and Wiesel, 1968]. Each of these layers fulfills specific functionalities [Masland, 2001] which are activated by spiking neurons or spatio-temporal receptive fields of retinal ganglion cells [Hosoya et al., 2005, Hubel and Wiesel, 1968]. These receptive fields have been found to react to different stimuli like contrast change or orientation features [Hosoya et al., 2005, Kandel and Schwartz, 1981]. Earlier research on the human retina has also shown that the cone photoreceptors are arranged in a hexagonal lattice. However, many common approaches to edge detection still use rectangular lattices, which raises the question of why they do so when nature provides another well-working solution. This insight shows that the neighbourhood lattice integrated around a specific pixel or cell has an important influence on the behaviour and efficiency of the visual processing model, in terms of accuracy as well as computational complexity. Therefore, [Kerr et al., 2011] imitated the mechanism of hexagonally shaped receptive fields, modelled by hexagonal pixels which are then processed by spiking neural networks.
Also, the Canny-Edge-Detector [Canny, 1983] could be extended by a neighbourhood-based and biologically-inspired technique, namely multilevel surround inhibition, which suppresses texture edges that do not represent important object boundaries [Papari et al., 2007]. Moreover, conventional algorithms for object recognition and classification typically require many very well chosen training samples while still lacking adaptivity. Therefore, hierarchical models, especially HMAX (Hierarchical Model and X), have been extensively studied in recent years; they try to mirror the structure of the visual cortex and thereby perform generic object recognition [Hubel and Wiesel, 1968, Riesenhuber and Poggio, 1999]. Biologically-inspired feature sets have been developed in order to achieve higher robustness and selectivity while requiring only very few examples in the learning phase [Serre et al., 2005, Ghodrati et al., 2012]. It was also possible to apply sparsity-regularization to HMAX and thus to improve its sparse firing patterns [Zhang et al., 2014]. A major outcome of this modification is that no supervised labelling is needed for learning higher-level features of objects. Finally, it will be shown how these approaches and improvements are able to outperform several state-of-the-art algorithms on a variety of test images from different recognition and classification categories.

2 The Primate Visual System

The primate visual system has been studied for decades, but only the research of recent years on the visual cortex has had a remarkable impact from a neurological or psychological point of view. Most of these investigations were carried out on the brain of the macaque monkey, in which several sensory-motor and cortical areas were found to be homologous to the corresponding areas in the human brain. Still, there is high uncertainty about the functionalities and correlation of certain cortical areas regarding the awareness and consciousness of vision as well as the general information processing under various stimuli. This chapter is largely based on recent reviews of the primate visual system found in [Krüger et al., 2013, Tong, 2003, Yang, 2007].

2.1 Anatomy and Functionalities

The stimuli of the environment are captured by the eye, which from a functional point of view can be compared to a camera. From the retina, with its hexagonally arranged photoreceptors, the impulses are transmitted via the retinal ganglion cells through the optic nerves, which meet and partially cross at the optic chiasm. As a result, each half of the visual field is processed by the opposite cerebral hemisphere. Furthermore, around 90% of the nerve fibres of the optic tract project to the lateral geniculate nucleus (LGN), which in turn is connected via the optic radiation to the primary visual cortex, also known as V1 or striate cortex. This area is currently the most extensively studied part of the visual cortex. The remaining 10% are transmitted to various areas of the extrastriate cortex, which surrounds V1. The most important regions of the extrastriate cortex are the areas V2 to V4 as well as MT (V5, middle temporal area) and MST (medial superior temporal area).
Since the visual cortex as a whole covers the areas of psychological visual information processing, any damage along the pathway from the eyeball to this region (the precortical processing) will result in a loss of conscious visual perception. The majority of information transmission between layers in the visual cortex takes place over feedforward connections from V1 to V4. Areas beyond V4 typically provide feedback projections to V1 and the LGN, and some connections might even be bidirectional. Fig. 2 gives a higher-level view of the anatomy from the eyeball to the visual cortex, while Fig. 3 shows the locations and connectivity of cortical areas in the primate visual cortex.

Figure 2: Higher-level anatomy of the brain from [Pearson-Education, 2004]

Figure 3: Cortical areas in the primate visual cortex from [Krüger et al., 2013]

Information is transmitted by cells whose receptive fields are activated by specific stimuli. They were presented in the famous paper of Hubel and Wiesel [Hubel and Wiesel, 1968]. Such cells are found throughout the visual pathway and can roughly be divided into simple and complex cells. Simple cells respond to light patterns that have a particular orientation, size and position. Owing to their clearly structured excitatory and inhibitory regions, they respond best to a bar or an edge for which most light falls into the excitatory region and only little into the inhibitory one. From a technical point of view, it was possible to show that Gabor wavelet transformations approximate the behaviour of simple cells. Complex cells are also sensitive to the orientation but not to the exact position of the stimulus, which has proven to be most effective when flickering or moving. Furthermore, the size of the receptive fields increases with hierarchical distance, from the lower layers in the striate cortex to the higher layers in the extrastriate cortex. Fig. 4 illustrates the impulse responses of simple and complex cells to different stimuli.

Figure 4: Stimuli input impulses in simple and complex cells from [Hubel and Wiesel, 1968, Yang, 2007]

Considering scene interpretation by the visual cortex, the cells in V1 mainly process visual information related to edges, line endings, motion, color or disparity. This cortical area is therefore responsible for the raw perception of vision. V2 is also sensitive to orientation, color and disparity, but behaves in a more sophisticated way by processing correlated perception in terms of texture-defined and illusory contours or relative depth between shapes. The subsequent cortical areas are highly interconnected. They continue integrating the visual information of V1 and V2 from lower- to higher-level responses that provide direction and speed of motion, connectivity of shapes, curvature selectivity or luminance-invariant perception of hue. These areas (e.g. MT/MST) are not solely responsible for perception, but also influence motor control, such as smooth eye movements, through feedback connections. From a higher-level perspective on the functionality of the visual cortex, there are two important pathways of visual information processing, namely the ventral pathway (V1 → V2 → (V3) → V4) and the dorsal pathway (V1 → V2 → (MT)).
The ventral pathway, also known as the what-stream, is responsible for object recognition in the inferior temporal cortex (IT). Its subareas TEO (posterior IT) and TE (anterior IT) are able to process highly complex visual features while providing position- and size-invariant shape selectivity. The dorsal pathway, described as the where-stream, mostly projects into the frontal lobe, bridging the visual and sensorimotor cortical areas. The corresponding receptive fields typically react to visual motion, optical flow and first- or second-order disparity signals. The cortical areas in the dorsal pathway thus provide higher-level responses for reactive behaviour to various stimuli from the environment.
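The Gabor approximation of simple cells mentioned above can be made concrete with a small sketch. It samples a real-valued Gabor function (a Gaussian envelope multiplied by a cosine carrier) and compares its response to a vertical and a horizontal bar. The function names and the parameters `wavelength`, `sigma` and `gamma` are illustrative assumptions and are not taken from the cited works.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, phase=0.0):
    """Sample a real-valued Gabor function on a size x size grid (size odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates so the carrier wave varies along theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + phase)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def response(kernel, patch):
    """Inner product of kernel and image patch (a linear simple-cell model)."""
    return sum(k * p for krow, prow in zip(kernel, patch)
               for k, p in zip(krow, prow))

# Hypothetical 7x7 test patches: a vertical and a horizontal bright bar.
kernel = gabor_kernel(size=7, wavelength=4.0, theta=0.0, sigma=2.0)
vertical_bar = [[1.0 if c == 3 else 0.0 for c in range(7)] for r in range(7)]
horizontal_bar = [[1.0 if r == 3 else 0.0 for c in range(7)] for r in range(7)]
```

With `theta=0.0` the excitatory centre stripe of the kernel is vertical, so the vertical bar produces the larger response, which mirrors the orientation selectivity described for simple cells.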

2.2 Inspirations for Computational Visual Recognition

The exact functionality of and connectivity between many areas in the primate visual cortex is still largely theoretical and sometimes even controversial, given the differing results of lesion experiments. However, for biologically-inspired approaches to computational visual processing, the neurological behaviour within the primate visual cortex offers some interesting design principles. The areas from V1 and V2 along the dorsal and ventral pathways roughly form an ordered sequence, which suggests the use of hierarchical models. In terms of computational complexity, receptive fields suggest sparse and spatially partitioned neural processing, while the complexity of features increases generically from each layer to the next. This generic approach comes with another beneficial principle regarding learning efficiency: robust layers that have already been designed, with features proven to be invariant, can be reused. Also, since the size of receptive fields increases at layers that are sensitive to more sophisticated features, adopting this principle helps to avoid the general problem of overfitting. Moreover, even the world from which the visual stimuli come is itself hierarchical. Accordingly, objects and their complex and simple features share the same hierarchical layers in the environment. Consequently, they are more correlated with each other and can be processed similarly by a correspondingly limited subset of layers in the computational hierarchical model. Furthermore, different visual processing tasks like selection and connection of edges and shapes, color perception or recognition of motion can be processed separately and can also be used for different purposes at higher layers.
This separation of information channels also provides robustness when certain visual information is unavailable. Additionally, it results in a more efficient combinatorial feature representation, since creating a unique pattern of features for each object would lead to an explosion in combinatorial complexity. Also, previously unseen and therefore unknown objects can be learned and represented easily by combining features provided by separate lower layers, a task related to what is known as the binding problem. Moreover, while most computational approaches for visual recognition are based on pure feed-forward architectures, the visual cortex indicates that feedback connections play an important role as well. This is because vision is shaped not only by the stimuli from the environment, but also by prior knowledge that has been learned. Lastly, the question remains whether nature suggests optimizing the functionalities of the individual layers or rather their hierarchical collaboration and connectivity in the visual cortex. This marks a trade-off in the design of biologically-inspired approaches that try to imitate the behaviour of the brain.

3 Edge Detection

From both a biological and a computational point of view, edge detection is a fundamental task that enables any higher-level visual processing for scene interpretation. In the primate visual cortex, this visual information is mainly processed in V1. Edges in images basically depict the boundaries of objects, which are typically characterized by abrupt changes in local intensity. Furthermore, solving this problem can also be used to filter out noisy high-frequency content in the image while preserving important structural properties. To provide some historical and technical background, this chapter first reviews some conventional operators used for edge detection. Afterwards, the biologically-inspired concept of multilevel surround inhibition as well as another approach that processes hexagonally shaped pixel lattices with spiking neural networks will be discussed.

3.1 Conventional Operators

Over the last decades, there have been numerous publications on the problem of edge detection. The most prominent approaches are the Roberts-Cross-Operator [Roberts, 1963], the Sobel-Operator [Sobel and Feldman, 1968], the Laplacian-of-Gaussian [Marr and Hildreth, 1980] and the famous Canny-Edge-Detector [Canny, 1983]. These are either purely gradient-based, calculating an approximation of the first-order derivative, or, more sophisticatedly, try to find zero-crossings of the second-order derivative convolved with a smoothing filter. However, both kinds of model aim to find the direction vector that denotes the maximum rate of change. The Roberts-Cross-Operator was one of the first solutions for edge detection and provides a simple and fast computation of the gradient. The partial first-order derivatives $g_x$ and $g_y$ are efficiently computed using the two 2×2 convolution kernel masks depicted in Fig. 5.
The gradient magnitude $\mathrm{Mag}(\nabla g)$ is then given by (1) and the angle $\theta(\nabla g)$ of the gradient direction with the maximum rate of change is calculated by (2).

$$\mathrm{Mag}(\nabla g) = \lvert \nabla g(x, y) \rvert = \sqrt{g_x^2 + g_y^2} \approx \lvert g_x \rvert + \lvert g_y \rvert \qquad (1)$$

$$\theta(\nabla g) = \arctan\left(\frac{g_y}{g_x}\right) \qquad (2)$$

According to the masks, this operator responds best to edges that run at an angle of 45° within the pixel neighbourhood.

Figure 5: Convolution kernel masks $g_x$ and $g_y$ used by the Roberts-Cross-Operator

Furthermore, the Sobel-Operator extends this approach by using the 3×3 convolution kernel masks shown in Fig. 6, but instead of approximating the partial first-order derivatives diagonally, the gradients $g_x$ and $g_y$ are computed with respect to the x- and y-axis. Therefore, this operator responds most strongly to edges that run horizontally or vertically. The gradient magnitude and the direction of the maximum rate of change are computed in the same way as for the Roberts-Cross-Operator, by (1) and (2).

Figure 6: Convolution kernel masks $g_x$ and $g_y$ used by the Sobel-Operator

The major disadvantage of both the Roberts-Cross-Operator and the Sobel-Operator is that their overall performance depends on the edge orientation, meaning they might perform very poorly for edges of other orientations. Both are also sensitive to noise such as Gaussian-distributed white noise, so their robustness can be improved by smoothing the image. The Laplacian-of-Gaussian (LoG, also called the Marr-Hildreth-Operator), in contrast, obtains rotation invariance by considering zero-crossings of the second derivative and additionally applies a smoothing filter in the form of a convolution with a Gaussian kernel mask. The Laplacian operator at a certain pixel of the intensity image is given by (3) and the corresponding Gaussian filtering function is defined as (4).

$$\nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2} \qquad (3)$$

$$G(x, y) = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \qquad (4)$$

Interchanging the order of differentiation and convolution in (5) then gives rise to (6), where $c$ is used to normalize the sum of mask elements.

$$\nabla^2 \bigl( G(x, y) * g(x, y) \bigr) = \bigl( \nabla^2 G(x, y) \bigr) * g(x, y) = h(x, y) * g(x, y) \qquad (5)$$

$$h(x, y) = \nabla^2 G(x, y) = c \, \frac{x^2 + y^2 - 2\sigma^2}{\sigma^4} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \qquad (6)$$

The resulting function is also known as the Mexican-Hat-Operator because of its shape. A 5×5 approximation yields the convolution kernel mask shown in Fig. 7.
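As a rough illustration of how a LoG mask can be obtained, the following sketch samples the Laplacian of the Gaussian in (4), that is $(x^2 + y^2 - 2\sigma^2)/\sigma^4 \cdot \exp(-(x^2 + y^2)/(2\sigma^2))$, on an integer grid. The function name `log_kernel` is hypothetical, and shifting the mask so that its elements sum to zero is an assumption standing in for the normalization constant $c$; it is not necessarily the normalization used in the cited works.

```python
import math

def log_kernel(size, sigma):
    """Sample the Laplacian of a Gaussian on a size x size grid (size odd)."""
    half = size // 2
    kernel = [[(x * x + y * y - 2 * sigma ** 2) / sigma ** 4
               * math.exp(-(x * x + y * y) / (2 * sigma ** 2))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    # Assumed normalization: subtract the mean so that all elements sum to
    # zero, which makes the response on constant image regions vanish.
    mean = sum(map(sum, kernel)) / size ** 2
    return [[value - mean for value in row] for row in kernel]

mask = log_kernel(size=5, sigma=1.0)
```

The sampled mask has a negative centre surrounded by positive values. Integer masks with a positive centre, as often printed in the literature, correspond to the negated function, which leaves the positions of the zero-crossings unchanged.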
In a sense, all edge detection algorithms are inspired by the processing of the primate visual system, since they integrate the direct neighbourhood around each pixel. Nevertheless, the Laplacian-of-Gaussian was the first approach that was directly biologically-inspired, because it applies convolution with a smoothing filter, a processing step that is also performed by the LGN. The main benefit of using a smoothing filter is a remarkable improvement in robustness to outlier noise, as also provided by the primate visual system.
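The way gradient operators integrate the direct neighbourhood of a pixel can be sketched as follows for the Roberts-Cross-Operator and the Sobel-Operator. The kernel values are the ones commonly given in textbooks and are an assumption here, since sign conventions differ between sources; the helper names are hypothetical. The kernels are applied by correlation (without flipping), and `math.atan2` is used as a quadrant-aware variant of the arctangent in (2).

```python
import math

# Commonly used kernel values (assumed; signs vary between sources).
ROBERTS_X = [[1, 0], [0, -1]]
ROBERTS_Y = [[0, 1], [-1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def apply_kernel(image, kernel, row, col):
    """Correlate the kernel with the patch whose top-left pixel is (row, col)."""
    return sum(kernel[i][j] * image[row + i][col + j]
               for i in range(len(kernel))
               for j in range(len(kernel[0])))

def gradient(image, kernel_x, kernel_y, row, col):
    """Return exact magnitude, cheap approximation and direction at one patch."""
    gx = apply_kernel(image, kernel_x, row, col)
    gy = apply_kernel(image, kernel_y, row, col)
    magnitude = math.hypot(gx, gy)      # sqrt(gx^2 + gy^2), exact form of (1)
    approximation = abs(gx) + abs(gy)   # |gx| + |gy|, approximation in (1)
    direction = math.atan2(gy, gx)      # angle of the gradient, as in (2)
    return magnitude, approximation, direction

# Hypothetical 4x4 test image with a vertical step edge between columns 1 and 2.
step_edge = [[0, 0, 1, 1] for _ in range(4)]
sobel_result = gradient(step_edge, SOBEL_X, SOBEL_Y, 0, 0)
roberts_result = gradient(step_edge, ROBERTS_X, ROBERTS_Y, 0, 1)
```

On the vertical step edge, the Sobel kernels yield a purely horizontal gradient, while the Roberts kernels yield a diagonal one for which the approximation in (1) overestimates the exact magnitude.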

$$\begin{bmatrix} 0 & 0 & -1 & 0 & 0 \\ 0 & -1 & -2 & -1 & 0 \\ -1 & -2 & 16 & -2 & -1 \\ 0 & -1 & -2 & -1 & 0 \\ 0 & 0 & -1 & 0 & 0 \end{bmatrix}$$

Figure 7: 5×5 approximation of the LoG convolution kernel mask

Lastly, the Canny-Edge-Detector is considered one of the best edge detectors due to its high robustness to noise and its reliable representation of edges. In contrast to the other algorithms, a multi-stage process is applied in which the image is first smoothed by a Gaussian convolution filter. Afterwards, a gradient-based operator like the Roberts-Cross-Operator or the Sobel-Operator is used to highlight regions of the intensity image with high gradient magnitudes, which give rise to edges. The goal that distinguishes this algorithm from all others is that it connects those regions of high gradient magnitude, aiming to return continuous lines. This is achieved by non-maximum suppression, which sets to zero all pixels that are not on the top of a ridge, and by two hysteresis thresholds: a high one that marks where an edge begins and a low one that determines where it ends. The thresholds also suppress noise which might otherwise be detected as an additional edge. Fig. 8 illustrates the performance of the four presented conventional algorithms for edge detection in the presence of noise. While the Roberts-Cross-Operator and the Sobel-Operator tend to detect either too few or too many edges, the Canny-Edge-Detector succeeds in filtering out many noisy edge fragments and returns more separable regions. The Laplacian-of-Gaussian overall detects fewer edges than the Canny-Edge-Detector, but provides clearly structured regions and traceable lines of salient regions in the image.

Figure 8: Conventional edge detectors from [Juneja and Sandhu, 2009]
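The hysteresis step of the Canny-Edge-Detector can be sketched as a region growing procedure on a map of gradient magnitudes: pixels above the high threshold start an edge, and the edge is extended through neighbouring pixels as long as they stay above the low threshold. The function name `hysteresis` and the use of 8-connectivity are assumptions made for this illustration.

```python
from collections import deque

def hysteresis(magnitude, low, high):
    """Keep strong pixels and all weak pixels connected to a strong one."""
    rows, cols = len(magnitude), len(magnitude[0])
    edges = [[False] * cols for _ in range(rows)]
    # Pixels at or above the high threshold mark the beginning of an edge.
    queue = deque((r, c) for r in range(rows) for c in range(cols)
                  if magnitude[r][c] >= high)
    for r, c in queue:
        edges[r][c] = True
    # Grow each edge through 8-connected neighbours above the low threshold.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not edges[nr][nc]
                        and magnitude[nr][nc] >= low):
                    edges[nr][nc] = True
                    queue.append((nr, nc))
    return edges

# Hypothetical one-row magnitude map: the last weak pixel is isolated.
edge_map = hysteresis([[9, 5, 5, 0, 5]], low=4, high=8)
```

In the example, the two weak pixels next to the strong pixel are kept because they are connected to it, whereas the isolated weak pixel at the end of the row is treated as noise and discarded.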

3.2 Multilevel Surround Inhibition

While noise can be suppressed by smoothing or hysteresis thresholding, a major drawback shared by all presented conventional edge detection algorithms is their inability to distinguish luminance or intensity changes caused by an object's texture from those caused by object contours. Studies on the primary visual system of humans have discovered a mechanism called non-classical receptive field inhibition (surround suppression) [Nothdurft et al., 1999, Papari et al., 2006] which performs such additional visual processing. Further experimental studies on the human visual system in [Papari et al., 2006] have also shown that regions of similar edge stimuli are more likely to be perceived as texture than as object contours. Multilevel surround inhibition is a biologically-inspired technique which imitates this mechanism in order to obtain stronger region boundaries and object contours and to suppress texture edges. [Papari et al., 2007] applied this technique as an additional computational step of the Canny-Edge-Detector, taking into account the amount of texture $T(g_{x,y})$ around a pixel $g_{x,y}$, controlled by a parameter $\alpha$ that defines the inhibition strength. The influence of $\alpha$ can be considered an additional thresholding trade-off: too strong inhibition discards weak contours, while too weak inhibition does not reliably suppress texture edges. Fig. 9 illustrates the effect of different levels of thresholding and inhibition strength.

Figure 9: Effect of different levels for thresholding and inhibition strength: high to low from left to right from [Papari et al., 2007]

While the hysteresis thresholding of the Canny-Edge-Detector produces two binary maps, multilevel surround inhibition generalizes this concept by generating $n$ thresholds $t_k$, $1 \le k \le n$, on the gradient magnitude, each of which is associated with an inhibition level $\alpha_k$, $1 \le k \le n$.
Afterwards, the $n$ obtained binary maps are combined by an iterative connectivity-based algorithm. In more detail, $T(g_{x,y})$, which is computed as a weighted average of gradient magnitudes in a local neighbourhood, is high when many similar edge stimuli occur close together. Accordingly, (7) denotes the inhibited gradient magnitude $\mathrm{Mag}_I(g_{x,y})$ and (8) describes the computation of the regions $Q_k$ which are required to combine the binary maps.

$$\mathrm{Mag}_I(g_{x,y}) = \mathrm{Mag}(\nabla g_{x,y}) - \alpha \, T(g_{x,y}) \qquad (7)$$

$$Q_k = \left\{ [\mathrm{Mag}, T]^{\mathsf{T}} \;\middle|\; \mathrm{Mag} - \alpha_k T > t_k \right\} \qquad (8)$$
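The effect of the inhibition term can be illustrated with a minimal sketch of the inhibited gradient magnitude. Here the texture term is approximated by the unweighted mean of the gradient magnitudes in a square window around the pixel, excluding the pixel itself. This uniform weighting, the window radius and the function names are assumptions for illustration and do not reproduce the weighting function of [Papari et al., 2007].

```python
def texture_term(magnitude, row, col, radius=2):
    """Mean gradient magnitude in the surround of (row, col), centre excluded."""
    rows, cols = len(magnitude), len(magnitude[0])
    surround = [magnitude[r][c]
                for r in range(max(0, row - radius), min(rows, row + radius + 1))
                for c in range(max(0, col - radius), min(cols, col + radius + 1))
                if (r, c) != (row, col)]
    return sum(surround) / len(surround) if surround else 0.0

def inhibited_magnitude(magnitude, alpha, radius=2):
    """Subtract alpha times the texture term from every gradient magnitude."""
    return [[magnitude[r][c] - alpha * texture_term(magnitude, r, c, radius)
             for c in range(len(magnitude[0]))]
            for r in range(len(magnitude))]

def binary_map(magnitude, alpha, threshold, radius=2):
    """Binary map of pixels whose inhibited magnitude exceeds the threshold."""
    inhibited = inhibited_magnitude(magnitude, alpha, radius)
    return [[value > threshold for value in row] for row in inhibited]

# Hypothetical magnitude maps: an isolated contour response and dense texture.
isolated = [[0.0] * 5 for _ in range(5)]
isolated[2][2] = 10.0
texture = [[10.0] * 5 for _ in range(5)]
```

An isolated strong response keeps its full magnitude because its surround is empty, whereas the same response embedded in dense texture is cancelled for `alpha = 1.0`. Calling `binary_map` with several pairs of `alpha` and `threshold` yields one binary map per level.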


Fig. 10 illustrates that most texture edges detected by the standard Canny-Edge-Detector are successfully suppressed by multilevel surround inhibition. The important object boundaries remain clearly visible, which can significantly ease the tasks of segmentation and object recognition.

Figure 10: Improvement of multilevel surround inhibition over the standard Canny-Edge-Detector from [Papari et al., 2007]

3.3 Hexagonal Pixels and Spiking Neural Networks

In the human visual system, the cone photoreceptors in the retina are arranged in a hexagonal lattice, from where the perceived visual stimuli of the environment are channelled via the LGN to the visual cortex. This stream of information is carried by receptive field cells that signal through action potentials, also called spikes. Inspired by this mechanism, [Kerr et al., 2011] presented a novel approach for edge detection based on spiking neural networks, which models the behaviour of the hexagonally arranged, near-circular receptive field cells in the human visual system. The images are converted from a standard rectangular lattice into a hexagonal pixel representation using the technique presented in [Middleton and Sivaswamy, 2001]. A significant property of hexagonal lattices is that the distance from a centre pixel to each of its neighbours equals 1. A hexagonal pixel is created by clustering 56 sub-pixels from a corresponding rectangular pixel block, as illustrated in Fig. 11. Each hexagonal pixel has to fulfill the properties that there is no overlap or gap between the sub-pixels of neighbouring hexagonal pixels and that all six edges are of approximately the same length. The pixel intensity is calculated as the average over all 56 sub-pixels. This transformation enabled a higher sampling efficiency and hence better computational performance.

Figure 11: Hexagonal pixel from [Kerr et al., 2011]

Furthermore, a spiking neural network, shown in Fig. 12, was used, which is based on the conductance-based I&F (integrate-and-fire) model. It behaves similarly to the spiking neuron model proposed in [Hodgkin and Huxley, 1952], but is computationally less complex. The applied model consists of three layers. In the receptor layer, each neuron represents a cone photoreceptor and corresponds to one hexagonal pixel in the image. These receptive fields are connected to the intermediate layer, consisting of four different direction-selective neurons, which are finally integrated by a single neuron in the output layer. Depending on the firing rate, the neurons in the output layer generate the corresponding edge image.

Figure 12: Model of the Spiking Neural Network from [Kerr et al., 2011]

For experimental studies, [Kerr et al., 2011] compared their hexagonal SNN to a square SNN proposed in [Wu et al., 2007], which applies square receptive fields to conventional square-pixel images. The results are shown in Fig. 13. The upper row shows the output over the whole image, while the lower row depicts a zoomed section marked in the original image. The performance over all edge types was measured using the Figure of Merit (FoM) [Baddeley et al., 1979], which takes into account missing valid edge points and false-positive classifications due to noise fluctuations. The hexagonal SNN obtained a better signal-to-noise ratio than the square SNN overall and for all edge types, and it also yielded notably better results in areas of high noise.

Figure 13: Performance of hexagonal SNN over square SNN from [Kerr et al., 2011]
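The relation between input strength and firing rate that drives the output layer can be illustrated with a strongly simplified neuron. The sketch below uses a current-driven leaky integrate-and-fire model with Euler integration, which is a simplification and not the conductance-based model used by [Kerr et al., 2011]. All names and parameter values (`tau`, `v_thresh`, `v_reset`) are assumptions chosen for illustration.

```python
def simulate_lif(input_current, dt=1.0, tau=10.0,
                 v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Euler integration of dv/dt = (-(v - v_rest) + I) / tau.

    Returns the indices of the time steps at which the neuron spikes.
    """
    v = v_rest
    spikes = []
    for step, current in enumerate(input_current):
        v += dt * (-(v - v_rest) + current) / tau
        if v >= v_thresh:
            spikes.append(step)   # the neuron fires an action potential
            v = v_reset           # and its membrane potential is reset
    return spikes

def firing_rate(spikes, n_steps, dt=1.0):
    """Average number of spikes per unit of simulated time."""
    return len(spikes) / (n_steps * dt)

# Hypothetical constant inputs of increasing strength over 100 time steps.
weak = simulate_lif([0.5] * 100)
medium = simulate_lif([2.0] * 100)
strong = simulate_lif([4.0] * 100)
```

A sub-threshold input never triggers a spike, while stronger inputs lead to higher firing rates. In an edge detector of this kind, such a rate can then be mapped to the intensity of the corresponding output pixel.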

4 Object Recognition

While the primary visual cortex (V1) is mainly responsible for edge detection and perceiving lower-level features, the connectivity along the whole ventral pathway accomplishes the task of object recognition by integrating higher-level features. Object recognition means finding a proper segmentation of edges in order to determine prominent contour lines, which are required to perform classification using distinctive features. Conventional algorithms solve this problem by computing several position- and scale-shifted subimages to achieve transformation invariance before performing classification. This section discusses a biologically-inspired hierarchical model for object recognition, namely HMAX, introducing the standard model of the architecture as well as the improvements which could be obtained by sparsity-regularization.

4.1 HMAX: The Standard Model

Inspired by the work of [Hubel and Wiesel, 1968] on the macaque monkey brain, [Riesenhuber and Poggio, 1999] proposed a computational hierarchical model which aims to imitate the mechanism of object recognition in the visual cortex. The name HMAX (Hierarchical Model and X), however, was assigned by [Tarr, 1999]. Basically, HMAX is a strictly feed-forward architecture (apart from possible local feedback loops, which are not necessarily needed for its basic processing but are known to play a key role in cortical areas) that models the main properties along the ventral pathway, i.e. from V1 to IT. These are an increasing size and complexity of receptive fields and stimuli-selective neurons as well as a position-, scale- and orientation-invariant perception of features and pattern selectivity. This is obtained by simple cells that have the same orientation-selective receptive fields and are located at different positions but are connected to the same corresponding complex cells.
Additionally, the output of simple cells to a complex cell is computed by one of two pooling mechanisms (MAX or SUM) in order to suppress noise and to give high relevance to strong and important afferent inputs. MAX performs a nonlinear maximum operation in which the strongest afferent input alone determines the postsynaptic response. The key idea is to achieve only little variation of cell responses and therefore to match the best stimulus feature. Moreover, MAX responses have also been shown to remain selective to objects appearing together with multiple other objects, since minor afferents are ignored. SUM performs an equally weighted linear summation of afferent inputs to obtain an isotropic response. In order to enable the perception of higher-level features such as poses or facial expressions, as well as invariance to illumination or perspective, HMAX also includes a learning network based on Gaussian Radial Basis Functions (GRBF) [Poggio, 1990]. This network learns from a set of samples of view-tuned unit (VTU) cells with differently weighted input-output pairs. However, the model is mainly based on MAX operations, while SUM operations are specifically applied to higher-level VTU cells, where the afferents are already particularly sensitive to specific stimulus patterns.

In more detail, Fig. 14 shows the hierarchical architecture proposed in [Riesenhuber and Poggio, 1999]. The model basically consists of five layers, namely S1, C1, S2, C2 and VTU. The S1 layer models the visual processing from the retina (a receptive field of greyscale pixels, 5°) to the cortical area V1, resembling the properties and the mechanism of simple cells that are sensitive to differently oriented bars or edges. The S1 units are two-dimensional Gaussian filters oriented at 0°, 45°, 90° and 135°, where each pixel is centered and normalized such that an S1 activity between -1 and 1 is obtained. The responses of simple S1 cells of the same orientation within a certain pooling range are then MAX-pooled by complex C1 cells of larger receptive field size while preserving feature specificity. C1 cells are then pooled either by S2 cells, which are sensitive to corresponding features of C1 cells, or by larger C2 cells, which are sensitive to the same features as C1 cells but are required to combine features of higher complexity. S2 can be considered the feature dictionary of HMAX, whose entries are combinations of C1 cells of different orientations. S2 cells are in turn MAX-pooled by C2 cells in order to achieve size and position invariance. Accordingly, the mechanism of the C2 layer can be compared to the visual processing in the higher areas of the extrastriate cortex, i.e. V4 or IT. Lastly, the C2 cells feed into the VTU layer, where they are pooled by SUM operations in order to achieve both GRBF-based learning and an invariant recognition of complex features of different objects.

Figure 14: Hierarchical architecture of HMAX from [Riesenhuber and Poggio, 1999]
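The two pooling operations and the Gaussian tuning of a view-tuned unit can be illustrated with a small sketch. It is a toy example under simplifying assumptions, not the implementation of [Riesenhuber and Poggio, 1999]: responses are one-dimensional lists, pooling windows do not overlap, SUM is normalised to a mean, and all function names and numbers are hypothetical.

```python
import math


def max_pool(afferents):
    """MAX pooling: the strongest afferent alone determines the response."""
    return max(afferents)


def sum_pool(afferents):
    """SUM pooling: equally weighted linear combination (here the mean)."""
    return sum(afferents) / len(afferents)


def c1_layer(s1_row, pool_size):
    """MAX-pool a row of S1 responses of one orientation in non-overlapping windows."""
    return [max_pool(s1_row[i:i + pool_size])
            for i in range(0, len(s1_row), pool_size)]


def vtu_response(features, stored_view, sigma=1.0):
    """Gaussian radial basis unit tuned to a stored feature vector."""
    d2 = sum((x - t) ** 2 for x, t in zip(features, stored_view))
    return math.exp(-d2 / (2.0 * sigma ** 2))


# S1 responses of one orientation along a row; the bar sits at index 1 or index 2.
s1_bar_left = [0.1, 0.9, 0.1, 0.0, 0.0, 0.2, 0.0, 0.1]
s1_bar_right = [0.1, 0.1, 0.9, 0.0, 0.0, 0.2, 0.0, 0.1]

c1_left = c1_layer(s1_bar_left, pool_size=4)
c1_right = c1_layer(s1_bar_right, pool_size=4)
print(c1_left == c1_right)   # True: the small shift is pooled away

# With clutter, MAX keeps the strongest input while SUM is diluted by weak ones.
print(max_pool([0.9, 0.1, 0.2]), sum_pool([0.9, 0.1, 0.2]))

# The pooled vector stands in for the C2 output here; the unit responds
# maximally (1.0) when its stored pattern is presented.
print(vtu_response(c1_left, stored_view=[0.9, 0.2]))
```

The first print shows why MAX pooling over neighbouring positions yields position invariance: both bar positions fall into the same pooling window and produce identical pooled responses.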

4.2 Sparsity-Regularization

One major disadvantage of the standard HMAX model is that the lower-level features are static rather than adaptively learned. The neurons of the S1 layer fire for every input that matches a certain stimulus, regardless of its activity, which tends to produce seemingly random mid-level features, and the same behaviour then carries over to the extraction of higher-level features. This is contrary to the actual processing in the brain, where neurons at almost all stages of the ventral pathway, and especially those for lower-level features, are mostly suppressed and fire only occasionally when a certain amount of activity is reached [Abdou and Pratt, 1997, Carlson et al., 2011]. This shows that sparse firing plays a prominent role in designing models that mimic the visual processing in the brain. Sparsity in general means suppressing and filtering out information of minor relevance in order to reduce noise and hence to rely on prominent features with high impact. [Waydo and Koch, 2008] first applied a sparse coding mechanism to the output of HMAX, leading to a sparse invariant representation of objects. Based on this, [Zhang et al., 2014] proposed an advanced sparsity-regularized model for HMAX in which either standard sparse coding (SSC) [Pasupathy and Connor, 2002] or independent component analysis (ICA) [Hurri et al., 2009] is applied to every S layer in order to let mid-level and also higher-level features emerge. Both SSC and ICA are linear unsupervised learning models, where feature extraction with SSC provides a larger dictionary size and ICA succeeds better in inferring different feature maps.
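As an illustration of what a sparse code looks like, the sketch below infers the coefficients of an input for a fixed dictionary by minimising a squared reconstruction error plus an L1 penalty. It uses the standard iterative shrinkage-thresholding algorithm (ISTA), which is swapped in here for illustration; the solver, the penalty weight lam and the toy dictionary are assumptions and not necessarily what [Zhang et al., 2014] used.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; values with |v| <= t become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def sparse_code(x, dictionary, lam=0.1, n_iter=200):
    """ISTA for min_a 0.5 * ||x - D a||^2 + lam * ||a||_1.

    x: input vector (list of floats).
    dictionary: list of basis vectors (the columns of D), each as long as x.
    """
    # Step size 1/L, where L is bounded by the squared Frobenius norm of D.
    L = sum(d_i ** 2 for atom in dictionary for d_i in atom)
    a = [0.0] * len(dictionary)
    for _ in range(n_iter):
        # Reconstruction D a and residual r = x - D a.
        recon = [sum(a[k] * dictionary[k][i] for k in range(len(a)))
                 for i in range(len(x))]
        r = [xi - ri for xi, ri in zip(x, recon)]
        # Gradient step on the reconstruction error, followed by shrinkage.
        a = [soft_threshold(
                 a[k] + sum(dk * ri for dk, ri in zip(dictionary[k], r)) / L,
                 lam / L)
             for k in range(len(a))]
    return a


atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
code = sparse_code([1.0, 0.05, 0.0], atoms, lam=0.1)
print(code)   # the weak second component is suppressed to exactly 0.0
```

The shrinkage step is what makes the representation sparse: coefficients whose evidence stays below the threshold are set to zero, so only prominent features remain active.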
While the original HMAX consists of S1 and S2 layers of different size, which are then MAX-pooled over all receptive field positions to produce single higher-level features at the final C layer, the S1 and S2 bases (filters) in the sparse HMAX are of the same size and are each learned by SSC or ICA. Also, layers higher than C2 are allowed, which makes it possible to extract multiple and also more complex features. Fig. 15 shows the model of Sparse HMAX, which consists of six layers, with S3 and C3 added compared to the original HMAX.

Figure 15: Sparse HMAX from [Zhang et al., 2014]

A major outcome of Sparse HMAX is the ability to learn and robustly recognize higher-level features from unlabeled training images, which directly mimics a prominent capability of the brain. Accordingly, this strongly corresponds to the behaviour of the human cortical areas ITC and MTL. For the experiments, the model was trained with images from different classification categories; Fig. 16 illustrates the extracted features that were learned by the S2 and S3 bases. The classification was done by a linear multiclass SVM, and the resulting ROC curves are depicted in Fig. 17. Compared to the original HMAX, Sparse HMAX offers a strikingly more reliable classification accuracy, which increased from 44±1.5 to 76.13±0.85, and it also manages to outperform many other state-of-the-art models for object recognition. However, it can be considered most efficient on large-scale datasets due to the nature of sparsity.

Figure 16: Learned features at S2 (top row) and S3 (bottom row) bases from [Zhang et al., 2014]

Figure 17: Most selective learned features depicted in Fig. 16 (top row) with corresponding ROC curve (bottom row, vertical axis: true-positive rate, horizontal axis: false-positive rate) from [Zhang et al., 2014]
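For readers unfamiliar with ROC curves, the following sketch shows how the two axes named in the caption of Fig. 17 are obtained from classifier scores. It assumes binary labels, distinct scores and at least one sample of each class; the scores and labels are made-up toy values and do not reproduce any result from [Zhang et al., 2014].

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs of a ROC curve.

    scores: classifier decision values, higher means "more likely positive".
    labels: 1 for positive samples, 0 for negative samples.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        # Lowering the threshold past this sample accepts it as positive.
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # e.g. SVM decision values
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

A curve that rises quickly toward a true-positive rate of 1 at a low false-positive rate, and hence an area close to 1, indicates a highly selective feature.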

5 Conclusion

This paper introduced the general concepts and methods of biologically-inspired visual recognition. To give an initial background on visual processing in the brain, the high-level anatomy of the visual system and the functionalities of the most important cortical areas from V1 along the ventral and dorsal pathways were explained. The important role of simple and complex receptive field cells was also pointed out. Accordingly, the main inspirations one can derive for computational approaches are a hierarchical extraction from lower- to higher-level features, a separate and generic processing of distinct stimuli and, lastly, sparse firing patterns. Considering the lower-level problem of edge detection, which mainly takes place in the cortical area V1, conventional algorithms typically face the problem of capturing noisy texture edges within the true boundaries of an object. Multilevel surround inhibition, an extension of the Canny edge detector, has been shown to suppress such texture edges by applying different thresholding levels and a parameter α which controls the inhibition strength, taking into account the amount of texture around a pixel. Inspired by the hexagonally arranged cone photoreceptors on the retina, traditional images with rectangular pixel lattices were converted to a hexagonal representation and processed by a spiking neural network. It could be shown that the signal-to-noise ratio of detected edges with different orientations was improved, and that the approach was also more robust to highly noisy image content. Furthermore, the higher-level problem of object recognition was presented by means of HMAX, which aims to imitate the hierarchical processing along the ventral pathway.
It mainly represents a strictly feed-forward architecture where the outputs of simple or complex cells are pooled by a MAX or SUM operator in order to generically extract increasingly complex features. The extension by sparsity-regularization (Sparse HMAX) finally obtained sparse firing patterns together with the possibility of adding more layers to the network, which dramatically reduced the rate of misclassification. In conclusion, biologically-inspired approaches can yield remarkably good results that outperform conventional approaches, but finding a convex solution might not always be possible due to the lack of transparency of the underlying processing scheme. However, such approximate solutions are mostly sufficient, since nature itself shares the same limitations but still achieves striking capabilities.

References

[Abdou and Pratt, 1997] Abdou, I. E. and Pratt, W. K. (1997). Quantitative design and evaluation of enhancement/thresholding edge detectors. Proceedings of the IEEE, Vol. 67, No. 5, pp.
[Baddeley et al., 1979] Baddeley, R., Abbott, L. F., Booth, M., Sengpiel, F., and Freeman, T. (1979). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proceedings of the Royal Society B - Biological Sciences, Vol. 264, pp.
[Canny, 1983] Canny, J. F. (1983). Finding edges and lines in images. MIT Press, Masters thesis.
[Carlson et al., 2011] Carlson, E. T., Rasquinha, R. J., Zhang, K., and Connor, C. E. (2011). A sparse object coding scheme in area V4. Current Biology, Vol. 21, pp.
[Ghodrati et al., 2012] Ghodrati, M., Khaligh-Razavi, S., Ebrahimpour, R., Rajaei, K., and Pooyan, M. (2012). How Can Selection of Biologically Inspired Features Improve the Performance of a Robust Object Recognition Model? PLoS One.
[Hodgkin and Huxley, 1952] Hodgkin, A. and Huxley, A. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, Vol. 117, pp.
[Hosoya et al., 2005] Hosoya, T., Baccus, S., and Meister, M. (2005). Dynamic predictive coding by the retina. Nature Neuroscience, Vol. 436, pp.
[Hubel and Wiesel, 1968] Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. J. Physiol.
[Hurri et al., 2009] Hurri, J., Hoyer, P. O., and Hyvärinen, A. (2009). Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer-Verlag.
[Juneja and Sandhu, 2009] Juneja, M. and Sandhu, P. S. (2009). Performance Evaluation of Edge Detection Techniques for Images in Spatial Domain. International Journal of Computer Theory and Engineering, Vol. 1, No. 5.
[Kandel and Schwartz, 1981] Kandel, E. and Schwartz, J. (1981). Principles of neural science. Elsevier.
[Kerr et al., 2011] Kerr, D., Coleman, S., McGinnity, M., Wu, Q., and Clogenson, M. (2011). Biologically Inspired Edge Detection. IEEE, 11th International Conference on Intelligent Systems Design and Applications.
[Krüger et al., 2013] Krüger, N., Janssen, P., Kalkan, S., Lappe, M., Leonardis, A., Piater, J., Rodríguez-Sánchez, A., and Wiskott, L. (2013). Deep Hierarchies in the Primate Visual Cortex: What Can We Learn For Computer Vision? Transactions on Pattern Analysis and Machine Intelligence.
[Marr and Hildreth, 1980] Marr, D. and Hildreth, E. (1980). Theory of Edge Detection. Proceedings of the Royal Society of London, Series B, Biological Sciences, Vol. 207, No. 1167, pp.
[Masland, 2001] Masland, R. H. (2001). The fundamental plan of the retina. Nature Neuroscience, Vol. 4, pp.
[Middleton and Sivaswamy, 2001] Middleton, L. and Sivaswamy, J. (2001). Edge Detection in a Hexagonal-Image Processing Framework. Image and Vision Computing, Vol. 19, pp.
[Nothdurft et al., 1999] Nothdurft, H., Gallant, J., and van Essen, D. (1999). Response modulation by texture surround in primate area V1: Correlates of popout under anesthesia. Visual Neuroscience, Vol. 16, pp.
[Papari et al., 2007] Papari, G., Campisi, P., and Petkov, N. (2007). Multilevel Surround Inhibition. A Biologically Inspired Contour Detector. SPIE, Vol.
[Papari et al., 2006] Papari, G., Campisi, P., Petkov, N., and Neri, A. (2006). A multiscale approach to contour detection by texture suppression. SPIE, In Proceedings of Alg. and Syst., Vol. 6064A.
[Pasupathy and Connor, 2002] Pasupathy, A. and Connor, C. E. (2002). Population coding of shape in area V4. Nature Neuroscience, Vol. 5, pp.
[Pearson-Education, 2004] Pearson-Education (2004). Inc., publishing as Benjamin Cummings.
[Poggio, 1990] Poggio, T. (1990). A theory of how the brain might work. Cold Spring Harbor Symp. Quant. Biol., Vol. 55, pp.
[Riesenhuber and Poggio, 1999] Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature America.
[Roberts, 1963] Roberts, L. (1963). Machine perception of three-dimensional solids. MIT Press.
[Serre et al., 2005] Serre, T., Wolf, L., and Poggio, T. (2005). Object Recognition with Features Inspired by Visual Cortex. CVPR, Vol. 2.
[Sobel and Feldman, 1968] Sobel, I. and Feldman, G. (1968). A 3x3 isotropic gradient operator for image processing. Presented at a talk at the Stanford Artificial Intelligence Project.
[Tarr, 1999] Tarr, M. (1999). News on Views: Pandemonium Revisited. Nature Neuroscience, Vol. 2, pp.
[Tong, 2003] Tong, F. (2003). Primary Visual Cortex and Visual Awareness. Nature Reviews, Neuroscience, Vol. 4.
[Waydo and Koch, 2008] Waydo, S. and Koch, C. (2008). Unsupervised learning of individuals and categories from images. Neural Computation, Vol. 20, pp.
[Wu et al., 2007] Wu, Q., McGinnity, M., Maguire, L., Belatreche, A., and Blackin, B. (2007). Edge Detection Based on Spiking Neural Network Model. Springer, Proceedings of the International Conference on Intelligent Computing.
[Yang, 2007] Yang, L. (2007). Biologically inspired visual models by sparse and unsupervised learning. Student Scholar Archive, Paper 260.
[Zhang et al., 2014] Zhang, J., Hu, X., and Zhang, B. (2014). Sparsity-Regularized HMAX for Visual Recognition. PLoS One.
