Recognizing Actions with the Associative Self-Organizing Map
2013 XXIV International Conference on Information, Communication and Automation Technologies (ICAT), October 30 - November 01, 2013, Sarajevo, Bosnia and Herzegovina

Miriam Buonamente and Haris Dindo
RoboticsLab, DICGIM, University of Palermo, Viale delle Scienze, Ed. 6, Palermo, Italy
{miriam.buonamente,

Magnus Johnsson
Lund University Cognitive Science, Lundagård, Lund, Sweden

Abstract: When artificial agents interact and cooperate with other agents, either human or artificial, they need to recognize others' actions and infer their hidden intentions from the sole observation of their surface-level movements. Indeed, action and intention understanding in humans is believed to facilitate a number of social interactions and is supported by a complex neural substrate (i.e. the mirror neuron system). Implementing such mechanisms in artificial agents would pave the way for the development of a vast range of advanced cognitive abilities, such as social interaction, adaptation, and learning by imitation, to name just a few. We present a first step towards a fully-fledged intention recognition system by enabling an artificial agent to internally represent action patterns, and to subsequently use such representations to recognize - and possibly to predict and anticipate - behaviors performed by others. We investigate a biologically-inspired approach by adopting the formalism of Associative Self-Organizing Maps (A-SOMs), an extension of the well-known Self-Organizing Map. The A-SOM learns to associate its activities with different inputs over time, where inputs are high-dimensional and noisy observations of others' actions. The A-SOM maps actions to sequences of activations in a dimensionally reduced topological space, where each centre of activation provides a prototypical and iconic representation of the action fragment.
We present preliminary experiments on an action recognition task over a publicly available database of thirteen commonly encountered actions, with promising results.

I. INTRODUCTION
Advancements in the field of human-robot interaction are crucial to create the next generation of robots able to assist humans and to support the elderly and disabled in everyday life. A major prerequisite to this goal is the ability to recognize other agents' actions and understand their short- or long-term intentions, where an intention can be defined as a plan of action that the organism chooses and commits itself to in pursuit of a goal; an intention thus includes both a means (an action plan) and a goal [1]. The implementation of similar mechanisms in artificial agents would, in our view, be beneficial to the development of a vast range of advanced cognitive abilities, such as social interaction [2], adaptation [3] and learning by imitation [4], [5], [6]. Indeed, building systems capable of complex intention reading has been recognized to be of immense interest in a variety of domains involving collaborative and competitive scenarios, such as computer games, surveillance, ambient intelligence, decision support, intelligent tutoring and, obviously, robotics [7]. Humans continuously observe others' behavior and, based on their (noisy) observations, infer others' intentions and make appropriate decisions about how to act. Humans do not rely on verbal communication alone: they also take advantage of information hidden in observable behavior. They do this apparently effortlessly, and replicating similar abilities in artificial agents is of paramount importance in building the next generation of social robots. It is believed that action and intention recognition processes in humans are governed by the mirror neuron system (MNS): the same (premotor) neurons responsible for the execution of an action also fire when a similar action is merely observed [8], [9].
A functional view of this motor resonance phenomenon postulates that humans understand others' intentions by internally simulating what they would have done in a similar situation, adopting the same motor programs as if they were actually performing the action. This view entails that humans are able to create rich internal models mirroring both one's own motor capabilities and the complex dynamics observed in the outer world. By exploiting this inner world, humans can simulate different actions, and foresee and evaluate their consequences [10], [11], [12]. However, apart from some isolated implementations of internal models in robotics [13], [14], their full representational and computational power has not yet been investigated in real-world applications. In this paper we present a first step towards a fully-fledged intention recognition system by enabling an artificial agent to internally represent action patterns, and subsequently to use such representations to recognize behaviors performed by others. While action recognition has been an active field, especially in the machine vision community (see [15] for a recent survey), here we investigate a biologically-inspired approach which builds on the idea of internal representations. We adopt the Associative Self-Organizing Map [16], a variation of the Self-Organizing Map (SOM) [17], to parsimoniously represent and efficiently recognize human actions in real time. We adopt the usual distinction between actions and activities where, loosely speaking, the former are characterized by simple motion patterns typically executed by a single human (e.g. walking), while the latter are more complex and involve a sequence of actions (e.g. dancing), possibly including coordinated actions among a small number of humans [18]. While this paper concentrates on action recognition with an A-SOM network, our long-term goal is to reuse the same computational substrate to fully implement the action simulation idea.
In other words, the agent should also be able to simulate the likely continuation of the recognized action. For instance, if we look at a man crossing the street, we are able to infer the continuation of the observed action (i.e. the intention to cross the street) even if an obstacle obscures our view.

Fig. 1. Prototypical postures of the 13 different actions in our dataset: check watch, cross arms, get up, kick, pick up, point, punch, scratch head, sit down, throw, turn around, walk, wave hand.

Fig. 2. A-SOM network connected with two other SOM networks. They provide the ancillary input to the main A-SOM (see the main text for more details).

Our goal is to build agents able to infer the likely continuation of an observed action when their view is obscured by an obstacle or other factors. Indeed, as we will see below, the A-SOM can remember perceptual sequences by associating its current activity with its own earlier activity. Thanks to this ability, the A-SOM can receive an incomplete input pattern and still elicit its likely continuation, effectively performing sequence completion of perceptual activity over time. The results presented here are a first step in this direction. We have tested the A-SOM on an action recognition task over a publicly available dataset of movies representing 13 common actions¹: check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, point, pick up, throw (see Fig. 1).

The paper is structured as follows. An overview of the A-SOM network is given in Section II. Experiments assessing the performance of the proposed method are described in Section III. Finally, conclusions and future work are outlined in Section IV.

¹ The movies are taken from the INRIA 4D repository, which offers several movies representing sequences of actions, each captured from 5 different cameras. For the experiments in this paper we chose the movie Julien1 and a single point of view, the frontal camera cam0.

II.
ASSOCIATIVE SELF-ORGANIZING MAP (A-SOM)
Our approach is based on the Associative Self-Organizing Map (A-SOM), an extension of the Self-Organizing Map (SOM). The SOM is a neural network trained by unsupervised learning to produce a smaller, discretized representation of the input space of the training samples. It resembles the functioning of the brain in pattern recognition tasks: when presented with an input, it excites neurons in a specific area. The goal of learning is to cause different parts of the network to respond similarly to similar input patterns while clustering a large input space onto a smaller output space. Self-Organizing Maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space. The SOM algorithm builds a lattice of neurons in which neurons located close to each other have similar characteristics. The SOM structure consists of one input layer and one output layer, the latter known as the Kohonen layer. Each input neuron is connected to every output neuron, while each output neuron is related to its neighboring neurons through the neighborhood function. Only one of these neurons can be the winner for each input provided to the network; this winner identifies the class to which the input belongs. Furthermore, the SOM has the capability to generalize, i.e. the network can recognize or characterize inputs it has never encountered before. The A-SOM is an extension of the SOM which learns to associate its activity with the activity of other neural networks. It can be considered a SOM with additional ancillary input from other networks (Fig. 2). The A-SOM yields a lattice over the input data in which the spatial locations of the resulting prototypes are indicative of intrinsic statistical features of the input postures. Moreover, the A-SOM can be connected to itself, associating its activity with its own earlier activity.
This makes the A-SOM able to remember and to complete perceptual sequences over time. Several simulations have shown that the A-SOM, once it has received some initial input, can continue to elicit the activity likely to follow in the near future even though no further input is received [19], [20]. The A-SOM learns to associate its activity with (possibly delayed) additional inputs. It consists of an I × J grid of neurons with a fixed number of neurons and a fixed topology. Each neuron n_ij is associated with r + 1 weight vectors w^a_ij ∈ R^n and w^1_ij ∈ R^{m_1}, w^2_ij ∈ R^{m_2}, ..., w^r_ij ∈ R^{m_r}. All elements of all weight vectors are initialized with real numbers randomly drawn from a uniform distribution between 0 and 1, after which all weight vectors are normalized, i.e. turned into unit vectors.

At time t each neuron n_ij receives r + 1 input vectors x^a(t) ∈ R^n and x^1(t − d_1) ∈ R^{m_1}, x^2(t − d_2) ∈ R^{m_2}, ..., x^r(t − d_r) ∈ R^{m_r}, where d_p is the time delay for input vector x^p, p = 1, 2, ..., r.

The main net input s_ij is calculated using the standard cosine metric

    s_ij(t) = (x^a(t) · w^a_ij(t)) / (||x^a(t)|| ||w^a_ij(t)||).    (1)

The activity in the neuron n_ij is given by

    y_ij(t) = [y^a_ij(t) + y^1_ij(t) + y^2_ij(t) + ... + y^r_ij(t)] / (r + 1)    (2)

where the main activity y^a_ij is calculated by using the softmax function [21]

    y^a_ij(t) = (s_ij(t))^m / max_ij (s_ij(t))^m    (3)

where m is the softmax exponent.

The ancillary activity y^p_ij(t), p = 1, 2, ..., r, is calculated by again using the standard cosine metric

    y^p_ij(t) = (x^p(t − d_p) · w^p_ij(t)) / (||x^p(t − d_p)|| ||w^p_ij(t)||).    (4)

The neuron c with the strongest main activation is selected:

    c = arg max_ij y^a_ij(t).    (5)

The weights w^a_ijk are adapted by

    w^a_ijk(t + 1) = w^a_ijk(t) + α(t) G_ijc(t) [x^a_k(t) − w^a_ijk(t)]    (6)

where 0 ≤ α(t) ≤ 1 is the adaptation strength with α(t) → 0 as t → ∞. The neighbourhood function

    G_ijc(t) = exp(−||r_c − r_ij||² / 2σ²(t)),

where r_c ∈ R² and r_ij ∈ R² are the location vectors of neurons c and n_ij, is a Gaussian function decreasing with time.

The weights w^p_ijl, p = 1, 2, ..., r, are adapted by

    w^p_ijl(t + 1) = w^p_ijl(t) + β x^p_l(t − d_p) [y^a_ij(t) − y^p_ij(t)]    (7)

where β is the adaptation strength. All weights w^a_ijk(t) and w^p_ijl(t) are normalized after each adaptation.

In this paper the input vector x^1 is the activity of the A-SOM from the previous iteration rearranged into a vector, and d_1 = 1.

Fig. 3. Walk action movie created with a reduced number of images

III. EXPERIMENT
We tested the representational capabilities of the A-SOM on an action recognition task.
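For concreteness, the A-SOM update rules of Section II (Eqs. 1-7, with a single ancillary input being the map's own previous activity, d_1 = 1) can be sketched in Python. This is an illustrative reimplementation, not the authors' C++/Ikaros code; the grid size, softmax exponent and learning rates are assumed values.

```python
import numpy as np

class ASOM:
    """Illustrative A-SOM sketch: a SOM whose neurons also learn ancillary
    weights associating their activity with a delayed extra input, here
    the map's own activity from the previous time step (d_1 = 1)."""

    def __init__(self, rows=15, cols=15, dim=225, softmax_m=10, seed=0):
        rng = np.random.default_rng(seed)
        n = rows * cols
        # Weights drawn uniformly from [0, 1], then normalized to unit length.
        self.wa = self._unit(rng.random((n, dim)))   # main weights w^a_ij
        self.w1 = self._unit(rng.random((n, n)))     # ancillary weights w^1_ij
        self.m = softmax_m
        # Grid coordinates r_ij, used by the neighborhood function.
        self.pos = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
        self.prev = np.zeros(n)                      # previous map activity

    @staticmethod
    def _unit(w):
        return w / np.linalg.norm(w, axis=1, keepdims=True)

    def step(self, x, alpha=0.1, sigma=3.0, beta=0.05, learn=True):
        x = np.asarray(x, float)
        # Eq. (1): main net input, cosine similarity between x^a and w^a.
        s = self.wa @ (x / (np.linalg.norm(x) + 1e-12))
        # Eq. (3): main activity via the softmax-like normalization.
        ya = s ** self.m / (np.max(s ** self.m) + 1e-12)
        # Eq. (4): ancillary activity from the delayed (previous) activity.
        norm1 = np.linalg.norm(self.prev)
        y1 = self.w1 @ (self.prev / norm1) if norm1 > 0 else np.zeros_like(ya)
        # Eq. (2): total activity averages main and ancillary activities.
        y = (ya + y1) / 2.0
        if learn:
            c = int(np.argmax(ya))                   # Eq. (5): winner neuron
            d2 = np.sum((self.pos - self.pos[c]) ** 2, axis=1)
            g = np.exp(-d2 / (2 * sigma ** 2))       # Gaussian neighborhood G_ijc
            # Eq. (6): move main weights toward the input, scaled by G_ijc.
            self.wa = self._unit(self.wa + alpha * g[:, None] * (x - self.wa))
            # Eq. (7): adapt ancillary weights by the activity mismatch.
            self.w1 = self._unit(self.w1 + beta * self.prev[None, :] * (ya - y1)[:, None])
        self.prev = y
        return y
```

Feeding a sequence of posture vectors to `step` yields one activity map per frame; on the first step the ancillary term is zero, and from the second step onward the previous activity contributes to Eq. (2), which is what allows sequence completion.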
Neurobiological studies argue that the human brain can perceive actions by observing only the human body poses, called postures, during action execution [22]. Thus, actions can be described as sequences of consecutive human body poses, in terms of human body silhouettes [23], [24], [25]. The dataset of actions we chose consists of more than 700 postural images reproducing 13 different actions. Since we want the agent to recognize one action at a time, we split the original movie into 13 different movies, one for each action (see Fig. 1). Each frame is preprocessed to reduce the noise induced by the image acquisition process, and so-called posture vectors are extracted (see Section III-A below), which are used to create the training set required to train the A-SOM. Our final training set is composed of these posture vectors, one per frame. This input is used to train the A-SOM network and obtain the A-SOM weights; the generated weight file is then used to execute the tests. All code for the experiments presented in this paper was implemented in C++ using the neural modeling framework Ikaros [26]. The next sections detail the preprocessing phase as well as the results obtained.

A. Preprocessing phase
To reduce the computational load, preprocessing operations for spatial and temporal reduction are performed. The temporal reduction is performed by reducing the number of images in each movie: we established that all of the 13 movies should have the same duration of 10 frames. This is a good compromise that keeps the actions seamless and fluid while guaranteeing the quality of the movie. Fig. 3 shows the images reproducing the walk action movie; as the picture shows, the reduction in the number of images, and the consequent reduction in time, does not affect the quality of the action reproduction.
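The temporal reduction just described can be sketched as follows. Even spacing of the retained frames is an assumption on our part; the paper only states that each movie is cut down to 10 frames.

```python
import numpy as np

def subsample_frames(frames, n_out=10):
    """Temporal reduction: keep n_out frames spanning the whole movie.
    Even spacing is an assumption; the paper only fixes the count (10)."""
    idx = np.linspace(0, len(frames) - 1, n_out).round().astype(int)
    return [frames[i] for i in idx]
```

For a 55-frame movie this keeps the first and last frames and 8 roughly evenly spaced frames in between, so the whole action is still covered.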
The spatial reduction is performed by resizing the images. Each image is centered at the person's centre of mass, and a bounding box of size equal to the maximum bounding box enclosing the person's body is extracted. We crop the images using the identified bounding box enclosing the person performing the action; in this way, we simulate an attentive process in which the human eye observes and follows only the salient part of the action. To further improve the spatial reduction and obtain a faster execution, every image is shrunk to N_H × N_W pixels, producing binary images of fixed size. Binary posture images are represented as matrices, and these matrices are vectorized to produce posture vectors p ∈ R^D, D = N_H · N_W; each posture image is represented by a posture vector p. In this way every action, consisting of 10 frames, is represented by a set of posture vectors p_i ∈ R^D, i = 1, ..., 10. The result is 13 movies with the same duration, each of them reproducing one action. In the experiment presented in this paper the values N_H = 15, N_W = 15 have been used, and binary posture images are scanned row-wise.

B. Action classification
The goal of the action classification experiment is to verify whether the A-SOM is able to discriminate actions. We evaluate the A-SOM by setting up a system consisting of one A-SOM connected to itself. To this end, 13 sets containing 10 posture vectors each (representing the binary images that form the videos) were constructed as explained above. We fed the A-SOM with one set at a time, and the centers of activity of the A-SOM, generated for each sample of the set, were recorded for all these tests. The coordinates of these centers are plotted in a diagram that shows how they are located in the map, as shown in Fig. 4. The centers are connected to each other by arrows describing the temporal evolution of the represented action. These diagrams depict the motion pattern of each action and allow us to evaluate whether the A-SOM can discriminate the binary images that form each action. What we expect is a different motion pattern for each action, and different centers of activity for each distinct posture vector. In this way the classification procedure is greatly facilitated and can easily be implemented through stochastic state machines (e.g. Markov processes or similar machine learning methods [27]). In the presented experiment the A-SOM elicits different centers of activity for different posture vectors, creating different motion patterns for different actions and thus demonstrating its ability to discriminate actions properly. As mentioned earlier, the patterns are made of centers of activity, so they also give information about the ability of the A-SOM to discriminate between the images that form each action.
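Returning to the preprocessing phase of Section III-A, the whole pipeline (crop to the bounding box around the person, shrink to N_H × N_W = 15 × 15, binarize, and scan row-wise into a posture vector) can be sketched as follows. The block-averaging resize and the 0.5 threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def posture_vector(frame, out_h=15, out_w=15, thresh=0.5):
    """Turn a grayscale silhouette frame (2-D array, values in [0, 1])
    into a binary posture vector p in R^D with D = out_h * out_w."""
    binary = np.asarray(frame, float) > thresh
    ys, xs = np.nonzero(binary)
    if ys.size == 0:                      # empty frame: no person visible
        return np.zeros(out_h * out_w)
    # Crop to the bounding box enclosing the person's body.
    crop = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1].astype(float)
    # Pad so the crop is at least out_h x out_w, then shrink by block
    # averaging (a simple stand-in for the paper's resizing step).
    crop = np.pad(crop, ((0, max(0, out_h - crop.shape[0])),
                         (0, max(0, out_w - crop.shape[1]))))
    r = np.linspace(0, crop.shape[0], out_h + 1).astype(int)
    c = np.linspace(0, crop.shape[1], out_w + 1).astype(int)
    small = np.array([[crop[r[i]:r[i + 1], c[j]:c[j + 1]].mean()
                       for j in range(out_w)] for i in range(out_h)])
    return (small > 0.5).astype(float).ravel()   # row-wise scan order
```

Applying this function to the 10 frames of a movie yields the set of posture vectors p_1, ..., p_10 that is fed to the A-SOM.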
Since the A-SOM elicits the same centers of activity for similar posture vectors, action movies made of similar images have few centers of activation, while actions composed of images with different characteristics present several centers of activation, one for each distinct image. For example, in the check watch movie (Fig. 4 a) the pattern presents only four centers of activation, whereas in the punch movie (Fig. 4 g) the pattern presents ten different centers of activation. The plotted diagrams furthermore show that the A-SOM creates topological maps in which similar binary postures are located close to each other. Consider the motion pattern (Fig. 4 c) of the get up action (Fig. 1 c): the first four binary images, depicting the person sitting, are located close to each other; nearby lie the three images representing the person standing up; distant from both are the last three images depicting the completion of the action. We can therefore argue that the A-SOM is able to recognize and classify body postures representing actions.

IV. CONCLUSION
A new method based on the A-SOM, an associative variant of the self-organizing map, has been proposed to recognize, classify and simulate actions. The proposed method highlights the strength of the A-SOM in classifying observed actions and, thanks to its ability to remember perceptual sequences, the A-SOM should also be suitable for predicting the likely continuation of the perceived behavior of an agent. It has been shown that the A-SOM can properly recognize and classify actions. Future experiments are intended to demonstrate that the A-SOM can receive some initial sensory input and elicit the activity that continues the perceptual sequence even if no further input is received. The A-SOM should be able to internally simulate the sequence of activity likely to follow the activity elicited by the agent's initial behavior, thus providing a way to read the agent's intentions.
ACKNOWLEDGMENT
The authors gratefully acknowledge the support from the Linnaeus Centre Thinking in Time: Cognition, Communication, and Learning, financed by the Swedish Research Council.

REFERENCES
[1] M. Tomasello, M. Carpenter, J. Call, T. Behne, and H. Moll, "Understanding and sharing intentions: the origins of cultural cognition," Behavioral and Brain Sciences, vol. 28, no. 5, 2005.
[2] C. Breazeal, Designing Sociable Robots. The MIT Press, 2002.
[3] G. Pezzulo and H. Dindo, "What should I do next? Using shared representations to solve interaction problems," Experimental Brain Research, vol. 211, no. 3-4, 2011.
[4] A. Chella, H. Dindo, and I. Infantino, "A cognitive framework for imitation learning," Robotics and Autonomous Systems, vol. 54, no. 5, 2006.
[5] A. Chella, H. Dindo, and I. Infantino, "Imitation learning and anchoring through conceptual spaces," Applied Artificial Intelligence, vol. 21, no. 4-5, 2007.
[6] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, 2009.
[7] Y. Demiris, "Prediction of intent in robotics and multi-agent systems," Cognitive Processing, vol. 8, no. 3, 2007.
[8] G. Rizzolatti and L. Craighero, "The mirror-neuron system," Annual Review of Neuroscience, vol. 27, 2004.
[9] M. Iacoboni, I. Molnar-Szakacs, V. Gallese, G. Buccino, J. C. Mazziotta, and G. Rizzolatti, "Grasping the intentions of others with one's own mirror neuron system," PLoS Biology, vol. 3, no. 3, p. e79, 2005.
[10] L. W. Barsalou, "Perceptual symbol systems," Behavioral and Brain Sciences, vol. 22, no. 4, 1999.
[11] D. M. Wolpert, K. Doya, and M. Kawato, "A unifying computational framework for motor control and social interaction," Philosophical Transactions of the Royal Society of London B: Biological Sciences, vol. 358, no. 1431, 2003.
[12] R. Grush, "The emulation theory of representation: motor control, imagery, and perception," Behavioral and Brain Sciences, vol. 27, no. 3, 2004.
[13] A. Dearden and Y. Demiris, "Learning forward models for robotics," in Proceedings of IJCAI-2005, Edinburgh, 2005.
[14] H. Dindo, D. Zambuto, and G. Pezzulo, "Motor simulation via coupled internal models using sequential Monte Carlo," in Proc. of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), 2011.
[15] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, no. 6, 2010.
[16] M. Johnsson, C. Balkenius, and G. Hesslow, "Associative self-organizing map," in IJCCI, 2009.
[17] T. Kohonen, "The self-organizing map," Neurocomputing, vol. 21, no. 1, pp. 1-6, 1998.
Fig. 4. Motion patterns for the 13 actions: a) Check Watch; b) Cross Arms; c) Get Up; d) Kick; e) Pick Up; f) Point; g) Punch; h) Scratch Head; i) Sit Down; l) Throw; m) Turn Around; n) Walk; o) Wave Hand. The points in each diagram represent the action's centers of activity and the arrows indicate the action's evolution over time. Actions made of similar posture vectors present few centers of activity, whereas movies made of posture vectors with different characteristics present several centers of activity. The diagrams give an indication of the ability of the A-SOM to create topological maps in which similar binary postures are located close to each other.

[18] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, "Machine recognition of human activities: A survey," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, 2008.
[19] M. Johnsson, D. Gil, C. Balkenius, and G. Hesslow, "Supervised architectures for internal simulation of perceptions and actions," in Proceedings of BICS.
[20] M. Johnsson, D. G. Mendez, G. Hesslow, and C. Balkenius, "Internal simulation in a bimodal system," in SCAI, 2011.
[21] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[22] M. A. Giese and T. Poggio, "Neural mechanisms for the recognition of biological movements," Nature Reviews Neuroscience, vol. 4, no. 3, 2003.
[23] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, 2007.
[24] N. Gkalelis, A. Tefas, and I. Pitas, "Combining fuzzy vector quantization with linear discriminant analysis for continuous human movement recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, 2008.
[25] A. Iosifidis, A. Tefas, and I. Pitas, "View-invariant action recognition based on artificial neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, 2012.
[26] C. Balkenius, J. Morén, B. Johansson, and M. Johnsson, "Ikaros: Building cognitive models for robots," Advanced Engineering Informatics, vol. 24, no. 1, 2010.
[27] K. P. Murphy, Machine Learning: a Probabilistic Perspective. MIT Press, 2012.
MEMORABILITY OF NATURAL SCENES: THE ROLE OF ATTENTION Matei Mancas University of Mons - UMONS, Belgium NumediArt Institute, 31, Bd. Dolez, Mons matei.mancas@umons.ac.be Olivier Le Meur University of Rennes
More informationRolls,E.T. (2016) Cerebral Cortex: Principles of Operation. Oxford University Press.
Digital Signal Processing and the Brain Is the brain a digital signal processor? Digital vs continuous signals Digital signals involve streams of binary encoded numbers The brain uses digital, all or none,
More informationGray level cooccurrence histograms via learning vector quantization
Gray level cooccurrence histograms via learning vector quantization Timo Ojala, Matti Pietikäinen and Juha Kyllönen Machine Vision and Media Processing Group, Infotech Oulu and Department of Electrical
More informationUniversity of East London Institutional Repository:
University of East London Institutional Repository: http://roar.uel.ac.uk This paper is made available online in accordance with publisher policies. Please scroll down to view the document itself. Please
More informationRobot Learning Letter of Intent
Research Proposal: Robot Learning Letter of Intent BY ERIK BILLING billing@cs.umu.se 2006-04-11 SUMMARY The proposed project s aim is to further develop the learning aspects in Behavior Based Control (BBC)
More informationToward A Cognitive Computer Vision System
Toward A Cognitive Computer Vision System D. Paul Benjamin Pace University, 1 Pace Plaza, New York, New York 10038, 212-346-1012 benjamin@pace.edu Damian Lyons Fordham University, 340 JMH, 441 E. Fordham
More informationDynamic Control Models as State Abstractions
University of Massachusetts Amherst From the SelectedWorks of Roderic Grupen 998 Dynamic Control Models as State Abstractions Jefferson A. Coelho Roderic Grupen, University of Massachusetts - Amherst Available
More informationAn Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns
An Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns 1. Introduction Vasily Morzhakov, Alexey Redozubov morzhakovva@gmail.com, galdrd@gmail.com Abstract Cortical
More informationImproved Intelligent Classification Technique Based On Support Vector Machines
Improved Intelligent Classification Technique Based On Support Vector Machines V.Vani Asst.Professor,Department of Computer Science,JJ College of Arts and Science,Pudukkottai. Abstract:An abnormal growth
More informationLocal Image Structures and Optic Flow Estimation
Local Image Structures and Optic Flow Estimation Sinan KALKAN 1, Dirk Calow 2, Florentin Wörgötter 1, Markus Lappe 2 and Norbert Krüger 3 1 Computational Neuroscience, Uni. of Stirling, Scotland; {sinan,worgott}@cn.stir.ac.uk
More informationCh.20 Dynamic Cue Combination in Distributional Population Code Networks. Ka Yeon Kim Biopsychology
Ch.20 Dynamic Cue Combination in Distributional Population Code Networks Ka Yeon Kim Biopsychology Applying the coding scheme to dynamic cue combination (Experiment, Kording&Wolpert,2004) Dynamic sensorymotor
More informationUsing Fuzzy Classifiers for Perceptual State Recognition
Using Fuzzy Classifiers for Perceptual State Recognition Ian Beaver School of Computing and Engineering Science Eastern Washington University Cheney, WA 99004-2493 USA ibeaver@mail.ewu.edu Atsushi Inoue
More informationThe 29th Fuzzy System Symposium (Osaka, September 9-, 3) Color Feature Maps (BY, RG) Color Saliency Map Input Image (I) Linear Filtering and Gaussian
The 29th Fuzzy System Symposium (Osaka, September 9-, 3) A Fuzzy Inference Method Based on Saliency Map for Prediction Mao Wang, Yoichiro Maeda 2, Yasutake Takahashi Graduate School of Engineering, University
More informationLesson 6 Learning II Anders Lyhne Christensen, D6.05, INTRODUCTION TO AUTONOMOUS MOBILE ROBOTS
Lesson 6 Learning II Anders Lyhne Christensen, D6.05, anders.christensen@iscte.pt INTRODUCTION TO AUTONOMOUS MOBILE ROBOTS First: Quick Background in Neural Nets Some of earliest work in neural networks
More informationSVM-based Discriminative Accumulation Scheme for Place Recognition
SVM-based Discriminative Accumulation Scheme for Place Recognition Andrzej Pronobis CAS/CVAP, KTH Stockholm, Sweden pronobis@csc.kth.se Óscar Martínez Mozos AIS, University Of Freiburg Freiburg, Germany
More informationCompound Effects of Top-down and Bottom-up Influences on Visual Attention During Action Recognition
Compound Effects of Top-down and Bottom-up Influences on Visual Attention During Action Recognition Bassam Khadhouri and Yiannis Demiris Department of Electrical and Electronic Engineering Imperial College
More informationMammogram Analysis: Tumor Classification
Mammogram Analysis: Tumor Classification Term Project Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is the
More informationVIDEO SURVEILLANCE AND BIOMEDICAL IMAGING Research Activities and Technology Transfer at PAVIS
VIDEO SURVEILLANCE AND BIOMEDICAL IMAGING Research Activities and Technology Transfer at PAVIS Samuele Martelli, Alessio Del Bue, Diego Sona, Vittorio Murino Istituto Italiano di Tecnologia (IIT), Genova
More informationRobot Trajectory Prediction and Recognition based on a Computational Mirror Neurons Model
Robot Trajectory Prediction and Recognition based on a Computational Mirror Neurons Model Junpei Zhong, Cornelius Weber, and Stefan Wermter Department of Computer Science, University of Hamburg, Vogt Koelln
More informationIncreasing Spatial Competition Enhances Visual Prediction Learning
(2011). In A. Cangelosi, J. Triesch, I. Fasel, K. Rohlfing, F. Nori, P.-Y. Oudeyer, M. Schlesinger, Y. and Nagai (Eds.), Proceedings of the First Joint IEEE Conference on Development and Learning and on
More informationA Comparison of Collaborative Filtering Methods for Medication Reconciliation
A Comparison of Collaborative Filtering Methods for Medication Reconciliation Huanian Zheng, Rema Padman, Daniel B. Neill The H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, 15213,
More informationRepresentation 1. Discussion Question. Roskies: Downplaying the Similarities of Neuroimages to Photographs
Representation 1 Discussion Question In what way are photographs more reliable than paintings? A. They aren t B. The image in the photograph directly reflects what is before the lens C. The image in the
More informationCognitive Neuroscience History of Neural Networks in Artificial Intelligence The concept of neural network in artificial intelligence
Cognitive Neuroscience History of Neural Networks in Artificial Intelligence The concept of neural network in artificial intelligence To understand the network paradigm also requires examining the history
More informationAnalysis of Mammograms Using Texture Segmentation
Analysis of Mammograms Using Texture Segmentation Joel Quintanilla-Domínguez 1, Jose Miguel Barrón-Adame 1, Jose Antonio Gordillo-Sosa 1, Jose Merced Lozano-Garcia 2, Hector Estrada-García 2, Rafael Guzmán-Cabrera
More informationIncorporation of Imaging-Based Functional Assessment Procedures into the DICOM Standard Draft version 0.1 7/27/2011
Incorporation of Imaging-Based Functional Assessment Procedures into the DICOM Standard Draft version 0.1 7/27/2011 I. Purpose Drawing from the profile development of the QIBA-fMRI Technical Committee,
More informationArteSImit: Artefact Structural Learning through Imitation
ArteSImit: Artefact Structural Learning through Imitation (TU München, U Parma, U Tübingen, U Minho, KU Nijmegen) Goals Methodology Intermediate goals achieved so far Motivation Living artefacts will critically
More informationGender Based Emotion Recognition using Speech Signals: A Review
50 Gender Based Emotion Recognition using Speech Signals: A Review Parvinder Kaur 1, Mandeep Kaur 2 1 Department of Electronics and Communication Engineering, Punjabi University, Patiala, India 2 Department
More informationIJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: 1.852
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY Performance Analysis of Brain MRI Using Multiple Method Shroti Paliwal *, Prof. Sanjay Chouhan * Department of Electronics & Communication
More informationManuscript. Do not cite. 1. Mirror neurons or emulator neurons? Gergely Csibra Birkbeck, University of London
Manuscript. Do not cite. 1 Mirror neurons or emulator neurons? Gergely Csibra Birkbeck, University of London Mirror neurons are cells in the macaque brain that discharge both when the monkey performs a
More informationA FRAMEWORK FOR ACTIVITY-SPECIFIC HUMAN IDENTIFICATION
A FRAMEWORK FOR ACTIVITY-SPECIFIC HUMAN IDENTIFICATION Amit Kale, Naresh Cuntoor and Rama Chellappa Center for Automation Research University of Maryland at College Park College Park MD 20742 USA ABSTRACT
More informationDeSTIN: A Scalable Deep Learning Architecture with Application to High-Dimensional Robust Pattern Recognition
DeSTIN: A Scalable Deep Learning Architecture with Application to High-Dimensional Robust Pattern Recognition Itamar Arel and Derek Rose and Robert Coop Department of Electrical Engineering and Computer
More informationPerceptual Studies. Perceptual Studies. Conclusions. Perceptual Studies. Strengths? Weakness? Follow-on Studies?
Perceptual Studies Jason Harrison, Ron Rensink, and Michiel van de Panne, Obscuring Length Changes During Animated Motion. ACM Transactions on Graphics, 23(3), Proceedings of SIGGRAPH 2004. Perceptual
More informationSocial Cognition and the Mirror Neuron System of the Brain
Motivating Questions Social Cognition and the Mirror Neuron System of the Brain Jaime A. Pineda, Ph.D. Cognitive Neuroscience Laboratory COGS1 class How do our brains perceive the mental states of others
More informationContributions to Brain MRI Processing and Analysis
Contributions to Brain MRI Processing and Analysis Dissertation presented to the Department of Computer Science and Artificial Intelligence By María Teresa García Sebastián PhD Advisor: Prof. Manuel Graña
More informationModule 1. Introduction. Version 1 CSE IIT, Kharagpur
Module 1 Introduction Lesson 2 Introduction to Agent 1.3.1 Introduction to Agents An agent acts in an environment. Percepts Agent Environment Actions An agent perceives its environment through sensors.
More informationAgent-Based Systems. Agent-Based Systems. Michael Rovatsos. Lecture 5 Reactive and Hybrid Agent Architectures 1 / 19
Agent-Based Systems Michael Rovatsos mrovatso@inf.ed.ac.uk Lecture 5 Reactive and Hybrid Agent Architectures 1 / 19 Where are we? Last time... Practical reasoning agents The BDI architecture Intentions
More informationArtificial Intelligence Lecture 7
Artificial Intelligence Lecture 7 Lecture plan AI in general (ch. 1) Search based AI (ch. 4) search, games, planning, optimization Agents (ch. 8) applied AI techniques in robots, software agents,... Knowledge
More informationEEL-5840 Elements of {Artificial} Machine Intelligence
Menu Introduction Syllabus Grading: Last 2 Yrs Class Average 3.55; {3.7 Fall 2012 w/24 students & 3.45 Fall 2013} General Comments Copyright Dr. A. Antonio Arroyo Page 2 vs. Artificial Intelligence? DEF:
More informationNeuro-Inspired Statistical. Rensselaer Polytechnic Institute National Science Foundation
Neuro-Inspired Statistical Pi Prior Model lfor Robust Visual Inference Qiang Ji Rensselaer Polytechnic Institute National Science Foundation 1 Status of Computer Vision CV has been an active area for over
More informationA Comparison of Supervised Multilayer Back Propagation and Unsupervised Self Organizing Maps for the Diagnosis of Thyroid Disease
A Comparison of Supervised Multilayer Back Propagation and Unsupervised Self Organizing Maps for the Diagnosis of Thyroid Disease Arpneek Kaur M. Tech Student CT Instt. of Engg., Mgt. & Narpat Singh M.
More informationA Hubel Wiesel Model of Early Concept Generalization Based on Local Correlation of Input Features
Proceedings of International Joint Conference on Neural Networks, San Jose, California, USA, July 31 August 5, 2011 A Hubel Wiesel Model of Early Concept Generalization Based on Local Correlation of Input
More informationEvolution of Plastic Sensory-motor Coupling and Dynamic Categorization
Evolution of Plastic Sensory-motor Coupling and Dynamic Categorization Gentaro Morimoto and Takashi Ikegami Graduate School of Arts and Sciences The University of Tokyo 3-8-1 Komaba, Tokyo 153-8902, Japan
More informationVisual Categorization: How the Monkey Brain Does It
Visual Categorization: How the Monkey Brain Does It Ulf Knoblich 1, Maximilian Riesenhuber 1, David J. Freedman 2, Earl K. Miller 2, and Tomaso Poggio 1 1 Center for Biological and Computational Learning,
More informationExploring the Structure and Function of Brain Networks
Exploring the Structure and Function of Brain Networks IAP 2006, September 20, 2006 Yoonsuck Choe Brain Networks Laboratory Department of Computer Science Texas A&M University choe@tamu.edu, http://research.cs.tamu.edu/bnl
More informationNMF-Density: NMF-Based Breast Density Classifier
NMF-Density: NMF-Based Breast Density Classifier Lahouari Ghouti and Abdullah H. Owaidh King Fahd University of Petroleum and Minerals - Department of Information and Computer Science. KFUPM Box 1128.
More informationOxford Foundation for Theoretical Neuroscience and Artificial Intelligence
Oxford Foundation for Theoretical Neuroscience and Artificial Intelligence Oxford Foundation for Theoretical Neuroscience and Artificial Intelligence For over two millennia, philosophers and scientists
More informationViewpoint dependent recognition of familiar faces
Viewpoint dependent recognition of familiar faces N. F. Troje* and D. Kersten *Max-Planck Institut für biologische Kybernetik, Spemannstr. 38, 72076 Tübingen, Germany Department of Psychology, University
More informationArtificial Intelligence
Artificial Intelligence COMP-241, Level-6 Mohammad Fahim Akhtar, Dr. Mohammad Hasan Department of Computer Science Jazan University, KSA Chapter 2: Intelligent Agents In which we discuss the nature of
More informationExploring the Functional Significance of Dendritic Inhibition In Cortical Pyramidal Cells
Neurocomputing, 5-5:389 95, 003. Exploring the Functional Significance of Dendritic Inhibition In Cortical Pyramidal Cells M. W. Spratling and M. H. Johnson Centre for Brain and Cognitive Development,
More informationSound Texture Classification Using Statistics from an Auditory Model
Sound Texture Classification Using Statistics from an Auditory Model Gabriele Carotti-Sha Evan Penn Daniel Villamizar Electrical Engineering Email: gcarotti@stanford.edu Mangement Science & Engineering
More informationLearning to Use Episodic Memory
Learning to Use Episodic Memory Nicholas A. Gorski (ngorski@umich.edu) John E. Laird (laird@umich.edu) Computer Science & Engineering, University of Michigan 2260 Hayward St., Ann Arbor, MI 48109 USA Abstract
More informationHow do Categories Work?
Presentations Logistics Think about what you want to do Thursday we ll informally sign up, see if we can reach consensus. Topics Linear representations of classes Non-linear representations of classes
More informationImplementation of Perception Classification based on BDI Model using Bayesian Classifier
Implementation of Perception Classification based on BDI Model using Bayesian Classifier Vishwanath Y 1 Murali T S 2 Dr M.V Vijayakumar 3 1 Research Scholar, Dept. of Computer Science & Engineering, Jain
More informationProceedings of the UGC Sponsored National Conference on Advanced Networking and Applications, 27 th March 2015
Brain Tumor Detection and Identification Using K-Means Clustering Technique Malathi R Department of Computer Science, SAAS College, Ramanathapuram, Email: malapraba@gmail.com Dr. Nadirabanu Kamal A R Department
More informationA Survey on Brain Tumor Detection Technique
(International Journal of Computer Science & Management Studies) Vol. 15, Issue 06 A Survey on Brain Tumor Detection Technique Manju Kadian 1 and Tamanna 2 1 M.Tech. Scholar, CSE Department, SPGOI, Rohtak
More informationSatoru Hiwa, 1 Kenya Hanawa, 2 Ryota Tamura, 2 Keisuke Hachisuka, 3 and Tomoyuki Hiroyasu Introduction
Computational Intelligence and Neuroscience Volume 216, Article ID 1841945, 9 pages http://dx.doi.org/1.1155/216/1841945 Research Article Analyzing Brain Functions by Subject Classification of Functional
More informationComputational Cognitive Science
Computational Cognitive Science Lecture 15: Visual Attention Chris Lucas (Slides adapted from Frank Keller s) School of Informatics University of Edinburgh clucas2@inf.ed.ac.uk 14 November 2017 1 / 28
More informationRajeev Raizada: Statement of research interests
Rajeev Raizada: Statement of research interests Overall goal: explore how the structure of neural representations gives rise to behavioural abilities and disabilities There tends to be a split in the field
More informationERA: Architectures for Inference
ERA: Architectures for Inference Dan Hammerstrom Electrical And Computer Engineering 7/28/09 1 Intelligent Computing In spite of the transistor bounty of Moore s law, there is a large class of problems
More informationAnalysis of Speech Recognition Techniques for use in a Non-Speech Sound Recognition System
Analysis of Recognition Techniques for use in a Sound Recognition System Michael Cowling, Member, IEEE and Renate Sitte, Member, IEEE Griffith University Faculty of Engineering & Information Technology
More informationPupil Dilation as an Indicator of Cognitive Workload in Human-Computer Interaction
Pupil Dilation as an Indicator of Cognitive Workload in Human-Computer Interaction Marc Pomplun and Sindhura Sunkara Department of Computer Science, University of Massachusetts at Boston 100 Morrissey
More informationHierarchical dynamical models of motor function
ARTICLE IN PRESS Neurocomputing 70 (7) 975 990 www.elsevier.com/locate/neucom Hierarchical dynamical models of motor function S.M. Stringer, E.T. Rolls Department of Experimental Psychology, Centre for
More information5.8 Departure from cognitivism: dynamical systems
154 consciousness, on the other, was completely severed (Thompson, 2007a, p. 5). Consequently as Thompson claims cognitivism works with inadequate notion of cognition. This statement is at odds with practical
More information