Bio-Inspired Model Learning Visual Goals and Attention Skills Through Contingencies and Intrinsic Motivations

Valerio Sperati and Gianluca Baldassarre

Abstract: Animal learning is driven not only by biological needs but also by intrinsic motivations (IMs) serving the acquisition of knowledge. Computational modeling involving IMs indicates that the learning of motor skills requires that autonomous agents self-generate tasks/goals and use them to acquire the skills solving/leading to them. We propose a neural architecture driven by IMs that is able to self-generate goals on the basis of the environmental changes caused by the agent's actions. The main novelties of the model are that it is focused on the acquisition of attention (looking) skills and that its architecture and functioning are broadly inspired by the functioning of relevant primate brain areas (superior colliculus, basal ganglia, and frontal cortex). These areas, involved in IM-based behavior learning, play important functions for reflexive and voluntary attention. The model is tested within a simple simulated pan-tilt camera robot engaged in learning to switch on different lights by looking at them, and is able to self-generate visual goals and learn attention skills under IM guidance. The model represents a novel hypothesis on how primates and robots might autonomously learn attention skills, and has the potential to account for developmental psychology experiments and the underlying brain mechanisms.

Index Terms: Goal-directed behavior, intrinsic motivations (IMs), overt attention routines/skills, reinforcement learning (RL), superior colliculus (SC) and basal ganglia (BG), top-down and bottom-up attention.

(Manuscript received August 10, 2016; revised June 26, 2017 and September 8, 2017; accepted November 5, 2017. Date of publication November 13, 2017; date of current version June 8, 2018. This work was supported by the European Union Horizon 2020 Research and Innovation Programme under Grant (Project: GOAL-Robots, Goal-Based Open-Ended Autonomous Learning Robots). Corresponding author: Valerio Sperati. V. Sperati is with the Institute of Cognitive Science and Technologies, National Research Council, Rome, Italy, and also with the School of Computing and Mathematics, Plymouth University, Plymouth PL4 8AA, U.K. (valerio.sperati@istc.cnr.it). G. Baldassarre is with the Institute of Cognitive Science and Technologies, National Research Council, Rome, Italy.)

I. INTRODUCTION

MOST complex animals, in particular mammals, are distinct from other organisms in that their behavior is not necessarily directed to the immediate satisfaction of biological needs, such as the attainment of food or sex (i.e., extrinsic rewards [1]). Since the 1950s, researchers studying rodents have observed how actions that simply have a perceivable effect in the environment, without any direct benefit for biological needs, are often repeatedly executed. For example, sated mice in a Skinner box increase lever-pressing behavior when this causes the onset of a light [2], or prefer to press a lever that moves rather than a stationary one [3].
In these works, the authors already recognized the powerful motivational drive inherent in the detection of action-outcome contingencies, explicitly stating that "a perceptible environmental change, which is unrelated to such need states as hunger and thirst, will reinforce any response that follows" [2]. These behavioral tendencies are even more pronounced in primates like monkeys, who prefer to press a lever that turns on a light rather than one with no effect [4], and are evident also in humans, who are clearly attracted, for example, by interactive toys. More strongly during childhood, but also during adulthood, these species spend a significant amount of time in curiosity-driven exploratory behaviors; for example, they engage in object manipulation especially when this involves unexpected outcomes [5]. Importantly, these activities catch the agent's attention for a limited amount of time and then reduce in intensity, as novelty or surprise effects fade away and boredom sets in [6]. It has been proposed that these behaviors have a notable adaptive advantage [1]: through them an agent gradually gains potential control over the environment, both acquiring general knowledge about the effects that follow specific actions (action-outcome relationships) and developing and refining the corresponding motor skills [7], [8]. Once these basic abilities have been acquired, thus improving the agent's competence in mastering the world, they can later be reused (possibly recombined) to solve practical or biologically relevant problems. For example, Polizzi di Sorrentino et al. [9] observed how capuchin monkeys engaging with an interactive device, presenting movable components and boxes with sliding doors, were able to retrieve effective actions, previously acquired during a nonrewarded exploratory phase, to obtain food hidden in the device itself. These learning processes can be formalized within the computational framework of reinforcement learning (RL) [10] but require suitable additions [11]. Indeed, they cannot leverage the attainment of extrinsic rewards but rather need to rely on the generation of intrinsic rewards directly depending on the execution of actions. In this respect, psychologists say that intrinsic motivations (IMs) drive the performance of actions for their own sake [12]. Although the general idea of these processes was already present in past psychological research [12]-[14], the recent use of computational models to investigate the topic is leading to a more precise specification of the concept of IMs

(e.g., [11] and [15]-[18]). The idea that is emerging is that IMs can serve the function of leading agents to the acquisition of knowledge and skills, useful in later stages, without the guidance of an extrinsic biological reward (in the case of animals) or a reward from another agent such as the user (in the case of robots), while autonomously exploring the environment [1]. The literature (e.g., [6], [19]-[21]) suggests that IM mechanisms can be grouped into three main classes, according to the type of knowledge acquisition they refer to (a toy sketch of the three corresponding reward signals is given below).

1) Prediction-Based IMs: These are related to how much a stimulus violates the agent's expectations (e.g., when a baby discovers that hitting a drum produces a sound very different from the familiar effect obtained when hitting a pillow).

2) Novelty-Based IMs: These are related to how much a stimulus has been previously experienced and has become familiar to the agent (e.g., when a baby is attracted by a new, unfamiliar toy).

3) Competence-Based IMs: These are related to the impact of an agent's actions on the world, in particular to the capacity of accomplishing desired effects (e.g., when a baby realizes that she is becoming able to obtain a certain effect with a certain action on an interactive toy).

This paper mainly focuses on competence-based IMs but also exploits novelty-based ones. Competence-based IMs have been specified in terms of effectance [13], or a general drive for mastery [11], i.e., the capacity to act so as to accomplish what is desired. Many of the animal behaviors illustrated above are related to this class of IMs, as they involve activities that the animal carries out to learn to control the environment, i.e., to produce some desired effects on it. Here, competence-based IMs are specifically employed to drive the agent to decide on which experiences to focus in order to learn to accomplish certain desired effects ("goals", see below) [22], [23]. The comprehension of the mechanisms underlying competence-based IMs is particularly relevant for robotics, since it opens up the possibility of building curious, autonomously learning robots [18]. In other words, an eager-to-learn robot could be left free to interact with unfamiliar objects in an unknown place and, through play and exploration, would autonomously acquire the knowledge necessary for mastering the environment. As with children, this would require a developmental process [24]: through trial and error, the robot would identify and refine those actions that prove effective in modifying the world, gradually forming a repertoire of flexible and potentially useful skills. At a later stage, when required by an external task, the robot could efficiently reuse the acquired skills instead of having to learn them from scratch. Existing robots usually do not have these capabilities, as they are generally trained to solve specific tasks [21]. These topics are a main subject of current scientific investigation and still present important open problems [20]. Regarding the IM-based autonomous acquisition of motor skills, while the literature initially considered IMs as a means to directly guide the autonomous learning of skills (e.g., [16] and [18]), it later highlighted how IMs should guide the self-generation of goals, and goals should then guide the learning of skills [11], [22], [25].
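To make the distinction among the three IM classes listed above concrete, the following minimal sketch (in Python; our illustration, not anything from the paper, with simplified textbook-style formulas and hypothetical names) computes one intrinsic reward signal per class:

import numpy as np

# Illustrative sketch: one intrinsic reward per IM class discussed above.
class IntrinsicRewards:
    def __init__(self, n_states):
        self.visits = np.zeros(n_states)          # for novelty
        self.fail_avg = np.ones(n_states)         # for competence progress

    def prediction_reward(self, predicted, observed):
        # Prediction-based IM: reward grows with the violation of expectations.
        return float(np.linalg.norm(np.asarray(observed) - np.asarray(predicted)))

    def novelty_reward(self, state):
        # Novelty-based IM: rarely experienced states are more rewarding.
        self.visits[state] += 1
        return 1.0 / np.sqrt(self.visits[state])

    def competence_reward(self, goal_id, achieved, rate=0.1):
        # Competence-based IM: reward tracks the improvement of the capacity
        # to achieve a self-generated goal (competence progress).
        old = self.fail_avg[goal_id]
        new = (1 - rate) * old + rate * (0.0 if achieved else 1.0)
        self.fail_avg[goal_id] = new
        return old - new   # positive while the agent keeps improving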
In this context, a goal is intended as an internal representation of a possible future desired state (or set of states) of the agent's own body or of the external world that, by guiding the learning process of the agent, leads it to learn or perform the skill needed to achieve that state. When IMs are directly used to guide skill learning, they furnish a reward in correspondence with many different states of the environment, whereas effective skills should be directed to accomplish a few important states (goals [22]). Moreover, goals facilitate the later recall of the acquired skills to solve new tasks. This goal-based approach also leads to using prediction-based or novelty-based IMs to form goals, and competence-based IMs to select the goals whose skills should be learned [11], [22], [23], [26], [27]. The new perspective raises two important problems.

1) How could the agent self-generate interesting/relevant goals (and hence tasks related to the acquisition of the skills to accomplish them)? This issue is also faced in the RL literature in relation to option discovery [28].

2) How could the agent use the goals to guide the learning of the related skills (i.e., to solve the tasks)?

Recently, various computational architectures have been proposed to face both problems, referring either to goals related to the agent's own body postures [25], [29], [30] or to goals related to the external world [31]-[34]. The architecture presented here starts from this new perspective and in particular develops it for visual goals (images) and the learning of overt attention routines directed to accomplish them. By overt attention routines [35] we refer here to the overt movements of the eye (saccades), which are directed to collect information from the external environment, as opposed to covert attention processes focusing on internal mental contents. Speaking of attention, the term routine is more suitable than skill, as the latter usually regards the motor aspects of movement: in the case of eye control, e.g., leading the eye fovea to a desired location of the scene, the capacity for motor control can be assumed to be already acquired, while the challenge faced by the learning agent involves acquiring the capacity to saccade onto areas of the scene where useful information can be gathered. This being said, since it is common within the IM-based learning literature to talk about skill learning, we will here use the terms routine and skill interchangeably. The system visually explores the environment and autonomously discovers its goals. Goal formation is based on the visual detection of events, i.e., changes in the environment caused by the agent's behavior (as further explained below, saccades themselves cause the environment to change). This idea comes from the biological and psychological literature showing that primates, including children and human adults, are strongly attracted by contingencies, namely effects happening in the environment as a consequence of their actions [5], [8], [9]. Once an event is discovered, the resulting environment state becomes a goal and the system focuses on it to acquire the skill necessary for its reliable accomplishment. Once the system has acquired a repertoire of goal-skill couples, the goals can also be used to recall the related skills. This is shown here by activating the goals manually, but in future work these might be internally and autonomously

activated by the agent. Given its features, we call the proposed system the attention goal-based open-ended learning model (AGO). The presented system represents an advancement of previous models, produced by the authors' research group, focused on learning where to look. In particular, in [36] and [37] we proposed a model integrating a bottom-up attention component, guiding saccades on the basis of the salient features of the scene, with a top-down RL component, learning to guide saccades on the basis of the task at hand. In this context, attention skills involved, as here, the capacity of exploring the scene with the central part of a pan-tilt motorized camera (putatively corresponding to the fovea of primates) in order to collect information useful to accomplish the task. In [38], we used the model to study the emergence of anticipatory gazes in babies, using a hardwired reward representing an intrinsic reward in an abstract fashion. In [39], we extended the model to visually distinguish between random events and events caused by the system ("agency"): these events were used to produce an intrinsic reward directly driving eye-movement skill learning. With respect to these systems, the main novel contributions of this paper are two.

1) The capacity of the model to autonomously form goals when a change in the environment is caused by the agent.

2) The capacity to exploit goals to acquire multiple attention skills (visual routines/saccade sequences) able to accomplish the different goals.

As in some of the previous models, the system architecture is bio-inspired, meaning that a functional comparison can be drawn between its computational components and some features of the primate brain. These correspondences, described in more detail in the next section, involve in particular the vertebrate subcortical nuclei called superior colliculus (SC), basal ganglia (BG), and substantia nigra pars compacta (SNc) [40]. Our interest in these structures stems from the fact that they are involved in a neuroscientific theory, proposed by Redgrave and Gurney [7], that is one of the accounts most closely related to the IM-guided trial-and-error learning processes described above. This link to biology might facilitate future work directed to employ the system to model and study psychological and neuroscientific issues. The model has been tested with a simulated pan-tilt camera robot and a task based on the cognitive-psychology experimental paradigm called gaze contingency [41], [42]. The key idea of this paradigm is that the experiment participants can cause changes in the environment by simply gazing at particular objects shown on a computer screen (an eye tracker is used to detect where the participant is looking and to change the screen image in a suitable fashion). The gaze-contingency paradigm has been successfully exploited in developmental psychology to study high-level cognitive processes in very young infants [41], [42]. Indeed, notwithstanding their immature arm/hand motor control system, these participants can learn very rapidly to use their gaze to cause changes in the environment. In our setup, the robot sees an image on a screen and can modify the environment
(turning on "lights") by simply moving its gaze onto colored buttons.

Fig. 1. Hierarchical organization of the brain system that inspired the design of the model architecture. The schema, partially based on the layout shown in [40], is nonexhaustive and served only as a broad guide for the design of the model.

The rest of this paper is organized as follows. Section I-A introduces the biological knowledge we considered to design the model architecture. Section II illustrates the experimental setup, with detailed sections on the environment and the robot. Section III explains the model architecture and functioning. Section IV reports the results of the tests of the system. Section V presents a review of other systems related to our model, also highlighting their differences. Finally, Section VI summarizes the conclusions and the possible future directions of this paper.

A. Biological Evidence Guiding the Model Architecture Design

Interesting insights about the mechanisms supporting IMs can be gained by looking at biology. For example, in the mammalian brain, the subcortical system composed of the SC, SNc, and BG seems to be strongly involved in the acquisition of motor skills driven by competence-based IMs (see Fig. 1). This complex implements a sensorimotor closed-loop circuitry which appears able to evaluate the consequences of an action on the environment by selecting, training, and refining those behaviors which increase the amount of control exerted on the environment [7]. In particular, the SC is a midbrain nucleus pivotal for both reflexive and voluntary attentional responses [43], [44]. It presents a laminated structure: its superficial layer (SC_s), mainly receiving visual input from the retinas, strongly reacts to abrupt, unexpected luminance changes in the peripheral visual field (generally due to unpredicted stimuli that suddenly appear or quickly move within the visual field), so that it can be considered an event detector; its deep layer (SC_d) then contributes to trigger an immediate spatial reorienting of the eyes, head, and body toward the event location (or avoidance behaviors [45], [46]), to allow a deeper analysis of the stimulus itself. The SC, through direct projections [47], is one of the inputs of the SNc, another midbrain nucleus whose dopaminergic neurons phasically respond not only to unpredicted rewards [48] but also to unexpected, biologically

salient events [49]. The released dopamine (DA) reaches various targets including the BG, a set of nuclei having important motivational and action-selection functions; the BG also implement RL processes in which DA seems to play the role of the reward signal [40], [50]-[52]. The neural circuitry is closed by backward projections from the BG to the SC_d, so that reflexive attentional responses can be biased by purposive behaviors (i.e., saccadic movements in this case). At a higher level, the prefrontal cortex (PFC), a large cortical region associated with goal-directed behavior, can modulate the activity of the SC-BG-SNc system. In particular, the PFC seems to serve a specific function in cognitive control: the active maintenance of patterns of activity which represent both goals and the means to achieve them [53], [54]. In this sense, the PFC appears to act as a sort of working memory which guides the overall learning process and then exploits the acquired procedural knowledge (stored in the subcortical system) to accomplish desired tasks. The model presented in this paper bears several functional and structural resemblances to the neural system just described. In particular, it captures the dynamic interplay between reflexive and voluntary orienting responses. Reflexive responses are controlled by two recurrent neural networks which mimic the activity of both layers of the SC in determining the target of saccadic movements. The voluntary responses are regulated by an RL mechanism which, exploiting a reward signal produced by the SC upon event detection, learns to bias those saccades that manage to cause events. This learning process is reproduced with an actor-critic (AC) method, an algorithm able to capture the learning processes taking place in the BG [55]. Finally, two hierarchically higher networks encode relevant events (called here goals) and control the whole learning process on the basis of the currently selected goal. Most of the computations performed in the model rest on a retinocentric coordinate reference system, with the only exception of a module which implements the inhibition of return (IOR) effect, necessary for improving visual exploration (see Section III-A3). This arrangement reflects the reference system actually used in most brain areas processing visual information [40].

II. EXPERIMENTAL SETUP

To investigate the intrinsically motivated acquisition of multiple skills, we built a simulated experimental setup inspired by those used in [5] and [9]. These works, focused on IMs and contingency-based learning, studied the behavior of children and capuchin monkeys interacting with a mechatronic board. This is a programmable device, equipped with a set of input manipulanda (buttons, levers, and wheels) and sensory outputs (lights, sounds, and sliding boxes occasionally containing food). The causal relationships between inputs and outputs are programmed by the experimenter. The device allowed the study of the evolution of explorative behaviors and the development of goal-directed actions guided by IMs. The simulated experimental setup we used shares some important features with the mechatronic board experiments. In particular, it is formed by a set of colored spheres and
tilted bars. Some spheres, when looked at, can trigger events (i.e., make the bars flash), while other spheres are passive. Some events occur by chance and do not depend on the agent's behavior. A simple robot (i.e., a pan-tilt webcam) explores the environment, acts on the objects (through its gaze), and observes the possible effects. The purpose of the robot is to gain knowledge about the control it can exert on the environment, through a trial-and-error process that leads it to discover which actions can change the environment. The main guidance of behavior comes from the IMs, which establish from time to time what must be learned. As briefly outlined in Section I, two main simplifications characterize the experimental setup.

1) The agent is a pan-tilt webcam, without any arm-like actuator.

2) The environment can change depending on where the viewer is looking (gaze-contingency paradigm).

These choices are mainly due to the need to simplify the robot input: the presence of a moving limb in the visual scene would have introduced a set of problems linked to body movement that, although interesting, would have been outside the scope of this paper. For the description of the equations we use the following notation: bold upper-case letters for matrices; bold lower-case letters for column vectors; italic letters for scalar variables; and Greek letters for scalar constants.

A. Environment

The environment (see Fig. 2) has been simulated as a 3-D black space, containing six spheres and three bars set on the same plane in front of the agent. The spheres are colored (red, green, blue, magenta, cyan, and yellow) and are characterized by similar luminance (l = 0.5, where l = 0 is darkness and l = 1 is maximum brightness); the bars are white and slightly dimmer than the spheres (l = 0.4). The bars can instantaneously increase their brightness (l = 1) for one timestep (the "switching on" of lights). Each timestep lasts 50 ms. This happens according to the following rules (a toy reimplementation of these rules is sketched below).

1) The tilted bar switches on randomly for one timestep with a small fixed probability p_rand per timestep; we refer to this bar as the random light, and to the switching-on event as the random event.

2) The vertical bar switches on for one timestep after the agent gazes at the red sphere; we refer to this bar as the red light, to the event as the red event, and to the sphere as the red button.

3) The horizontal bar switches on six timesteps after the agent gazes at the blue sphere, and stays on for one timestep; we refer to this bar as the blue light, to the event as the blue event, and to the sphere as the blue button.

Fig. 2. Two examples of the agent's visual input; the fovea region is marked by a red square. (a) Agent looking ahead (pan 0° and tilt 0°). (b) Agent looking at the green sphere (pan +10° and tilt −6°) while the random light (the tilted bar) switches on: this corresponds to a temporary increase in luminance.
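The rules above can be condensed into a few lines of code. The following is a minimal sketch (in Python; hypothetical code, not the authors' implementation), where P_RAND is a placeholder since the exact probability value was lost in the source text:

import random

P_RAND = 0.001          # placeholder probability of the random event
BLUE_DELAY = 6          # timesteps between gazing the blue button and the blue event
TIMESTEP_MS = 50        # each timestep lasts 50 ms

class Environment:
    def __init__(self):
        self.blue_countdown = None   # pending blue event, if any

    def step(self, gazed_object):
        """Return the set of lights that switch on (for one timestep)."""
        events = set()
        # advance any pending blue event
        if self.blue_countdown is not None:
            self.blue_countdown -= 1
            if self.blue_countdown == 0:
                events.add("blue_light")        # horizontal bar, delayed effect
                self.blue_countdown = None
        if random.random() < P_RAND:
            events.add("random_light")          # tilted bar, out of the agent's control
        if gazed_object == "red_button":
            events.add("red_light")             # vertical bar, immediate effect
        if gazed_object == "blue_button":
            self.blue_countdown = BLUE_DELAY
        return events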

Fig. 3. Robot sketch. (a) Extension of the visual field and fovea, in degrees. (b) Schema showing the allowed pan-tilt movement area, in degrees. Given these values, the robot can potentially explore an area about 140° wide in both dimensions.

This setup, although it may appear simple at first sight, presents the following interesting features: seven objects out of nine (all the bars and spheres except the two buttons), if acted on (i.e., "looked at"), have no environmental effects; two objects (the two buttons), if acted on, cause distinct outcomes in the environment, namely the blue and red events (together, the causal events); and one event (the random one) is due to chance and consequently is out of the agent's control. The different temporal delay between the effects of the two buttons has been introduced to represent a realistic situation where some events might not immediately follow the action that caused them. Moreover, this feature makes it possible to qualitatively characterize an important aspect related to the timing of DA, a neuromodulator which seems to have a crucial role as reward signal in the discovery of novel actions. Indeed, with the exception of humans, mammals can acquire action-outcome contingencies only when these occur within a very limited temporal gap, generally 1 s, after which they find it increasingly difficult to learn [7], [41], [56].

B. Agent

The agent is a simulated pan-tilt webcam mounted on a fixed body. The camera has a fixed pixel resolution, and its field of view covers about 47° horizontally and 37° vertically [see Figs. 2 and 3(a)]. Pan and tilt movements are bounded, respectively, in [−47°, +47°] and [−52°, +52°]. These values make the agent capable of potentially exploring an area bounded in [−70°, +70°] in both the vertical and horizontal dimensions [see Fig. 3(b)]. The stimuli described in Section II-A are always positioned around the 0° pan and tilt direction, in such a way that when any object is at the center of the webcam image [i.e., in the fovea region, covering about 3°×3°, see Figs. 2 and 3(a)] all the other ones remain within the field of view. As explained in Section II, foveation also corresponds to object manipulation and can possibly trigger an outcome in the environment.

Fig. 4. Functional overview of the whole system. Components represented with a colored box are described in detail in the corresponding sections. Different colors indicate components belonging to the two attentional streams: blue stands for the bottom-up stream (responsible for reflexive saccades, not undergoing learning) and red stands for the top-down stream (responsible for voluntary saccades, undergoing learning). Rounded gray boxes indicate the input and output stages of the system. The same symbols are used in the following figures.

III. CONTROL SYSTEM

AGO guides the pan-tilt movements of the robot and learns according to the current visual input and the system's internal states. The neural architecture of the system is inspired by some structural and functional features of the primate visual system (see Fig. 4 and [40]). For this reason, the various parts of the system have been labeled according to possible anatomical brain equivalents.
Moreover, similarly to the primate visual system, all neural maps of the model encode information with respect to a reference system centered on the fovea; the only exception is the parietal cortex, which uses an absolute reference system anchored to the environment. The SC is a key component of the whole system. The SC_s generates a peak of activity, used as a learning signal, when movement is detected (the switching on of a light is equivalent to a movement in the visual scene); the SC_d is the final station of the SC neural path and determines where the gaze will be directed. At each timestep, two attentional streams, the bottom-up and the top-down stream (see Fig. 4), process information in parallel and finally compete within the SC_d to determine the next

target to observe. The bottom-up stream is responsible for both explorative and reflexive saccades: it is always active and drives the agent's eye to the most salient points of the visual scene, namely zones with high visual contrast and moving objects. The top-down stream is responsible for voluntary saccades: it comes into play when the agent detects, through exploration, possible cause-effect relationships between its behavior and environmental events. When this occurs, the stream learns to influence the SC_d to obtain the new, desired effect in the world. When an action-outcome contingency is reasonably assimilated (through repeated learning trials) and stored as a goal-skill association, the top-down stream becomes quiescent again and ceases to affect the behavior, which goes back to wholly bottom-up control. Concerning motor behavior, the model can perform only saccades: neither pursuit movements nor fixations are possible (but repeated saccades can target the same area). The structural and functional features characterizing the various components of the system are described as follows. Section III-A describes the bottom-up stream, with details about the visual preprocessing stage (Section III-A1), the SC (Section III-A2), and the parietal cortex (Section III-A3). Section III-B describes the top-down stream, with details about the frontal cortex (FC) (Section III-B1) and the BG (Section III-B2). For each component, the corresponding section first describes the function and then gives details on the underlying mechanisms.

A. Bottom-Up Attention Stream

The bottom-up stream supports the visual exploration of the environment (see Fig. 5). Behaviorally, it drives the agent to perform saccades on the six spheres, with about the same frequency, while the lights, when off, are visited less frequently because of their lower brightness. The stream gets as input a low-resolution, grayscale image of the current visual field, locates high-contrast regions of the scene and, at quite regular intervals due to the neural dynamics (generally within 30 or 40 steps), generates a saccade on a selected point. If a movement is detected, a reflexive saccade (orienting reflex) is immediately performed (within four or five steps) on the corresponding location, and at the same time the event generates a reward signal sent to the top-down stream for further processing. The bottom-up stream is always active and is composed of three modules: 1) the visual preprocessing stage; 2) the SC; and 3) the parietal cortex.

1) Visual Preprocessing Stage: This module simplifies the visual input information. The image from the webcam is converted into grayscale and rescaled to a lower resolution. Two filters are then applied to the resulting image: the first generates the edge map, which detects contours and is characterized by a tonic response (i.e., a response persistent in time); the second generates the change map, which detects modifications in the visual scene and is characterized by a phasic response (i.e., a response lasting a short time). The resulting images are rescaled and weighted, respectively, by 0.03 and 3 (so as to give predominant importance to changes), and then summed into the saliency map. This map represents rough visual features of the visual scene and constitutes the sensory input to the SC, via its superficial layer. A minimal sketch of this stage is given below.

Fig. 5. Bottom-up stream. The visual input consists of the grayscale image from the camera. A preprocessing stage combines weighted information about edges and changes, and feeds the SC_s. This layer has two effects: 1) it produces a reward signal exploited by the BG whenever a change is detected in the image and 2) it spreads its activation to the SC_d, which is responsible for saccade triggering. Both layers are characterized by self-recurrent connections. Within the SC_d, the information is influenced by other structures: the BG (undergoing learning and belonging to the top-down stream) and the parietal cortex (tracking information about eye orientation in absolute space). The bottom-up stream accounts for visual exploratory behavior and is responsible for reflexive saccades.
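The following sketch (Python; illustrative code, since the paper's exact filters and map resolutions were lost in the source) shows one plausible realization of the edge map, change map, and saliency map described above, using a Sobel filter as an assumed edge detector:

import numpy as np
from scipy import ndimage

W_EDGE, W_CHANGE = 0.03, 3.0   # weights from the text

def saliency_map(frame_gray, prev_frame_gray):
    """Combine a tonic edge map and a phasic change map into a saliency map.

    Both inputs are float grayscale arrays of the same (downscaled) size.
    """
    # Edge map (tonic): responds to contours, persistent over time.
    edges = (np.abs(ndimage.sobel(frame_gray, axis=0))
             + np.abs(ndimage.sobel(frame_gray, axis=1)))
    # Change map (phasic): responds to frame-to-frame modifications.
    change = np.abs(frame_gray - prev_frame_gray)
    # Weighted sum: changes dominate edges by design.
    return W_EDGE * edges + W_CHANGE * change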
2) Superior Colliculus: This module controls the saccadic movements, integrating the information coming from both cortical and subcortical pathways. It is implemented by two neural maps, the superficial and the deep layers.

a) Superficial layer (SC_s): This map selects the most salient points in the visual scene as possible gaze targets, and emits a reward signal to the BG in case of movement detection.

It is implemented by a network of leaky-integrator units (see firing-rate models in [57]), receiving as input the saliency map. The network is characterized by recurrent connections with a Mexican-hat shape, performing local excitation and distal inhibition proportional to the distance between units in the neural space. This architecture leads to the formation of dynamic bubble-like groups of active units centered on the most salient parts of the stimuli [58]. Formally, to compute the neuronal activation f of the network, first the potential p is computed as

$p_t = p_{t-1} + \frac{1}{\tau}\left(-\alpha\, p_{t-1} + i_t + W f_{t-1}\right)$    (1)

where τ = 3 and α = 1 are temporal constants, f_{t-1} is the previous neuronal activation, W is the connection matrix among units (including self-connections), p_{t-1} is the previous potential, and i_t is the input from the saliency map. Then the result is filtered through the positive part of the hyperbolic tangent function, bounding the neuronal activation in [0, 1] as

$f_t = \tanh^{+}(p_t).$    (2)

Given the chosen parameter values, the features of the input, and the mode of operation of the whole SC (see the next section), the activities in the SC_s in a static scene are always lower than 0.1. However, when a movement is detected, a burst of activity in the corresponding position overwhelms this value and produces a reward signal r_sc (in [0, 1]), which is sent to the BG module in the top-down stream. The function of this signal will be discussed in Section III-B2. For now we only anticipate that the signal is a cue for agency, its magnitude depending on the temporal gap between a saccade (i.e., an action) and the event it caused (i.e., the following environment change). Fig. 6 shows how this value is computed and rapidly fades away within ten timesteps, which constitutes the reinforcing period. In the example, the reward elicited by the red event is maximal and equal to 1, given that the red light switches on immediately after the agent looks at the red button. On the other side, the reward elicited by the blue event is equal to 0.25, given that the blue light switches on six timesteps after the agent looks at the blue button. As shown, a random event can elicit a reward signal only if, by chance, it occurs within the reinforcing period.

Fig. 6. Example of the reward signal generated by the SC_s. The value of the reward (green bars) depends on the temporal gap between a saccade (black bars) and the occurrence of an event (red, dim blue, and blue bars). The reward signal fades away within ten timesteps. The red light (triggered by a saccade on the red button at step 40) produces a high reward, equal to 1, as it takes place close in time to the saccade that caused it. The blue light (triggered by a saccade on the blue button at step 80) produces a lower reward, equal to 0.25, as there is a time delay from the saccade that caused it. The random light (dim blue bar) is too distant in time from any saccade and so does not elicit any reward.
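A minimal sketch of the SC_s dynamics of (1) and (2) follows (Python; illustrative code, where the Mexican-hat parameters are our assumptions, not the paper's calibrated values):

import numpy as np

TAU, ALPHA = 3.0, 1.0

def mexican_hat_weights(h, w, sigma_exc=1.0, sigma_inh=3.0, k_inh=0.5):
    """Local excitation minus broader inhibition, by distance between units."""
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma_exc**2)) - k_inh * np.exp(-d2 / (2 * sigma_inh**2))

def sc_step(p, f, i, W):
    """One leaky-integrator update: equation (1), then the positive tanh of (2)."""
    p = p + (1.0 / TAU) * (-ALPHA * p + i + W @ f)
    f = np.maximum(np.tanh(p), 0.0)   # tanh^+, bounds activity in [0, 1]
    return p, f

The same update, with global rather than distance-dependent inhibition, also covers the deep layer described next.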
b) Deep layer (SC_d): This map selects when and where the next saccade will be performed, integrating data coming from different pathways. The structure and the equations of this network are the same as those of the superficial layer, except for the inhibition, which is global rather than proportional to the distance between neurons. This latter feature enables the evolution of only one activity bubble, centered on the most salient point (winner-take-all neural competition [58]). When a unit of the SC_d exceeds an arbitrary threshold of 0.5, a saccade toward the corresponding point of space is triggered. As in animals, during the saccadic movement the whole system is temporarily blind, a feature ensuring visual stability and avoiding the perception of saccade-dependent changes [59], [60]. The transformation from neuronal activity to eye movement is directly performed by a 1:1 mapping between the net and the field of view of the camera (37×47): the position of the winner unit in neural space encodes the corresponding desired webcam pan and tilt angles as displacements relative to the current gaze. As soon as the saccade is executed, the activity in the whole SC is reset, and a new activity accumulation starts. The SC_d is the key center of the whole system: its response depends on the integration of three inputs, two from the bottom-up stream (excitatory from the SC_s and inhibitory from the parietal cortex) and one from the top-down stream (which can be both excitatory and inhibitory).

3) Parietal Cortex (PC): In order to prevent consecutive saccades on the same salient points when exploring under the guidance of the bottom-up attention stream, a mechanism of IOR [61] is employed. The IOR is obtained through a map encoding the absolute space of possible movements in degrees [see Fig. 3(b)]. At each timestep, a Gaussian input i (height = 1 and sigma = 0.8), centered on the current absolute pan and tilt angles of the camera, is supplied to the map. The activation potential p of the map is computed as

$p_t = p_{t-1} + \frac{1}{\tau}\left(-\alpha\, p_{t-1} + i_t\right)$    (3)

where τ = 1000 and α = 1. The potential is then filtered through the positive part of a tanh function. A submap (37×47 units), centered on the current gaze, is then cut from the absolute map, weighted by 0.1, and given as inhibitory input to the SC_d. According to the spatial position of saccades, the IOR mechanism temporarily inhibits gazes on the same points, so promoting visual exploration (a minimal sketch is given below). As time elapses, the inhibitions fade away until the same points in space are again free to be selected as saccade targets. The PC is the only map in the whole system that represents events on the basis of an absolute spatial reference system. This feature is typical of some areas in the posterior parietal cortex involved in attentional processes [62], [63]. There is indeed some evidence that the PC has a role in generating the IOR effect in concert with the SC [64]-[66], performing the necessary remapping of spatially inhibited points from environmental to retinal coordinates.
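The following sketch (Python; illustrative code, assuming a 1° map resolution over the [−70°, +70°] movement space) shows the IOR accumulation of (3) and the resulting inhibition map:

import numpy as np

TAU, ALPHA, SIGMA, W_IOR = 1000.0, 1.0, 0.8, 0.1
DEGS = np.arange(-70, 71)            # absolute pan/tilt axes in degrees
p_ior = np.zeros((141, 141))         # activation potential of the PC map

def ior_step(pan_deg, tilt_deg):
    """Accumulate a Gaussian at the current gaze; return the inhibition map."""
    global p_ior
    gy = np.exp(-((DEGS - tilt_deg) ** 2) / (2 * SIGMA**2))
    gx = np.exp(-((DEGS - pan_deg) ** 2) / (2 * SIGMA**2))
    i = np.outer(gy, gx)                               # 2-D Gaussian, height 1
    p_ior = p_ior + (1.0 / TAU) * (-ALPHA * p_ior + i) # equation (3)
    return W_IOR * np.maximum(np.tanh(p_ior), 0.0)     # inhibitory input to SC_d

# A 37 x 47 submap centered on the current gaze would then be cut from the
# returned map and subtracted from the SC_d input.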

B. Top-Down Attention Stream

The main purpose of the top-down stream is to influence, when needed, the reflexive operations performed by the bottom-up stream (see the red boxes in Fig. 4). This stream is composed of two modules, the FC and the BG, both undergoing learning, which encode, respectively, goals and skills. In the model, a goal is defined as an internal representation of a change in the world caused by the agent's behavior (that is, an event potentially under its control) that can be kept active in the absence of the change itself. A skill is defined as the ability to perform suitable saccades, based on the visual input, to achieve a goal. Contrary to the bottom-up stream, which is always active, the top-down stream comes into play only if necessary, that is, whenever the agent discovers a possible goal.

1) Frontal Cortex: The FC module is responsible for both the encoding and the selection of goals, and uses the active goal to bias saccades. Moreover, the FC monitors the achievement of goals and contributes to modulating the learning signal from the SC_s. The module consists of two maps and four single units (see Fig. 7).

a) Goal encoding map: This component encodes the perceived environmental changes, due to the agent's actions, as goals. The component is a neural network map implementing unsupervised competitive learning with a winner-take-all learning rule [58]. The input of the map corresponds to the change map from the preprocessing stage, multiplied by the reward signal r_sc from the SC_s. This multiplication tends to exclude from codification those events that are not due to the agent's behavior. The output of the map is a vector of 64 linear units bounded in [0, 1] and arranged in an 8×8 map. Given these features, a selected unit in the map tends to specialize in the representation of a specific change (i.e., a goal): indeed, the connection weights of each unit encode both the relative position of the event (with respect to gaze direction) and its visual appearance. From this point of view, a goal is a perceptual representation of what happens (the shape of the light) and where it happens (the spatial location of the event), with respect to the agent's point of view. Given the nature of the input, the map is characterized by a phasic activation.

b) Goal selection map: This component can be seen as encoding the intention to obtain a specific goal ("intention" is intended here as in the belief-desire-intention literature, i.e., as a currently actively pursued goal [67]). The component is implemented as a recurrent 8×8 map (same structure as the SC_d, with global inhibition and leaky neurons). The input is based on the goal encoding map, where only units with activity greater than 0.3 are taken into consideration. This ensures that when a goal in input reaches a relatively strong encoding, it triggers a self-sustained activity in the corresponding unit of the map: this unit then represents the selected goal to achieve. The strength of the recurrent connections has been calibrated so as to permit the self-sustained activity of only one unit, which inhibits all the others. The selected goal informs the BG about the skill to use and train (see Section III-B2). Once active, the selected goal unit can be switched off only by an active inhibition (or goal resetting), a function implemented by the unit k in the FC.

Fig. 7. Top-down stream, FC module. This module encodes events due to the agent's behavior in the goal encoding map. If a unit in the map is repeatedly activated, the corresponding unit in the goal selection map becomes tonically active (thanks to self-recurrent connections); this in turn selects a corresponding actor-critic network in the BG, which is responsible for learning the goal-related skill. Unit m, comparing the simultaneous activities of both maps, computes a modulatory signal for learning, which indicates whether the desired goal has been reached. Units y and j monitor the current performance (success rate), while unit k inhibits the selected goal when the skill is sufficiently successful.
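A minimal sketch of the goal encoding map follows (Python; illustrative code, where the learning rate and the change-map size are our assumptions):

import numpy as np

N_GOALS, ETA = 64, 0.05
rng = np.random.default_rng(0)
W_goals = rng.uniform(0.0, 0.1, size=(N_GOALS, 30 * 40))   # assumed input size

def encode_goal(change_map, r_sc):
    """Gate the change map by the reward, pick the winner, move it closer."""
    x = change_map.ravel() * r_sc            # excludes events not caused by the agent
    if x.max() <= 0.0:
        return None                          # no rewarded event: nothing to encode
    winner = int(np.argmax(W_goals @ x))     # unit most tuned to this change
    W_goals[winner] += ETA * (x - W_goals[winner])   # winner-take-all update
    return winner

The prototype stored in each row of the weight matrix thus captures both the appearance and the gaze-relative position of the event, as described above.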
c) Units for goal-matching, success rate, and goal resetting: The fact that the goal encoding map and the goal selection map have the same activation (with, respectively, one phasically active unit and one tonically active unit) means that the agent, pursuing a specific goal, managed to obtain the corresponding change in the world. When this happens, the agent's behavior is considered successful, and unit m activates with 1. If the maps do not match, the behavior is considered unsuccessful and unit m is set to −1, to mark the fact that the system caused an undesired change. Finally, when the maps are not active at the same time (lack of simultaneity), the unit m remains at 0. The value of m is then furnished as input to the unit y, which evaluates the performance of the current skill. In order to compute the activation of y, its activation potential is first computed as

$p_t = p_{t-1} + \frac{1}{\tau}\left(-\alpha\, p_{t-1} + m_t\right)$    (4)

where τ = 100 and α = 1. Then y is obtained by filtering the potential through the positive part of the tanh function, so bounding its values in [0, 1]. The unit y reflects the frequency with which the current goal is achieved: the higher the frequency, the higher the value of y. Once y overcomes a threshold of 0.7, the task is considered learned; consequently, an inhibitory connection weight from the unit k, a bias unit tonically activated at 1, to the currently selected goal unit within the goal encoding map is increased, so that such a goal becomes permanently inhibited. This causes the disengagement of the top-down stream. In the experimental setup, it can happen that the agent is deceived by repeated accidental contingencies between the random light and its own actions. This occurs when, by chance, the random light switches on immediately after a saccade. In this case, a unit in the goal encoding map erroneously encodes the event as a goal, thus triggering the tonic activation of a unit within the goal selection map. In order to prevent the system from remaining stuck in the attempt to achieve an impossible goal, due to the fact that unit y can never reach the learning threshold, a further unit j is employed. This unit activates with the same equation as unit y (τ = 1000 and α = 1), with the exception of the input, which is substituted by (1 − y): if no performance increment is possible (that is, y ≈ 0 at all times), j increases and reaches a threshold of 0.7 in about 2000 timesteps (100 s). When this occurs, the same mechanism for the inhibition of the current fake goal is enabled. Note that if the goal is achievable, then the unit j is soon inhibited by the activity of unit y. Finally, the unit m plays a further, fundamental function: the reward signal r_sc from the SC_s is modulated by m (through a simple multiplication). In this way, the information about the identity of a caused event generates positive or negative reinforcements in the BG according to the current goal, thus crucially influencing skill learning. In particular, m = 1 (marking a success of the skill) strengthens the performed behavior, m = 0 suspends learning, and m = −1 (marking the achievement of an undesired event) trains the avoidance of the performed behavior.
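The interplay of units m, y, and j can be sketched as follows (Python; illustrative code with our variable names, directly mirroring the dynamics and constants stated above):

import numpy as np

TAU_Y, TAU_J, ALPHA, THRESHOLD = 100.0, 1000.0, 1.0, 0.7
p_y = p_j = 0.0

def match_signal(encoded_goal, selected_goal):
    """Unit m: +1 on goal match, -1 on mismatch, 0 when no event occurred."""
    if encoded_goal is None or selected_goal is None:
        return 0.0
    return 1.0 if encoded_goal == selected_goal else -1.0

def monitor_step(m):
    """Update y (success rate) and j (give-up timer); return the outcome."""
    global p_y, p_j
    p_y += (1.0 / TAU_Y) * (-ALPHA * p_y + m)           # equation (4)
    y = max(np.tanh(p_y), 0.0)
    p_j += (1.0 / TAU_J) * (-ALPHA * p_j + (1.0 - y))   # grows while y stays low
    j = max(np.tanh(p_j), 0.0)
    if y > THRESHOLD:
        return "goal learned: inhibit it and disengage"
    if j > THRESHOLD:
        return "goal unachievable: inhibit it and disengage"
    return "keep learning"

With these constants, a steady input of (1 − y) = 1 drives j over the 0.7 threshold in roughly 2000 timesteps, consistent with the 100 s figure quoted above.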
2) Basal Ganglia: This module is responsible for skill learning and directly influences the activity in the SC_d. It is implemented by a map which categorizes the input from the fovea and by a set of AC networks [10] (see Fig. 8).

Fig. 8. Top-down stream, BG module. This module is composed of a vector of 64 silent AC networks. An AC can be recruited by the FC. When this occurs, the AC receives as input the activation of the fovea classification map and produces as output to the SC_d a suggested target to focus on or avoid. The learning signal is based on a combination of the reward signal r_sc from the SC_s and the modulatory signal m from the FC.

a) Fovea classification map: This component is implemented as an 8×8 self-organizing map (SOM) [58]. The input is based on a small RGB patch (40×40 pixels, subsampled to 10×10) taken from the central part of the input image [the fovea, see Fig. 3(a)]. The SOM was pretrained to classify the fovea content (the training was based on exploration driven by the bottom-up stream, see Fig. 9): after training, the SOM responds with 1 in the unit specialized for the current input pattern and with 0 in all the others.

Fig. 9. Receptive fields developed by the fovea classification map. The output of the map consists in the activity of the winning unit. In this example, the winner unit is highlighted by a red frame, indicating that the agent is looking at the horizontal bar.

b) Actor-critic vector: This component is a vector of 64 independent AC networks. The elements are always quiescent, unless one of them is temporarily recruited by the corresponding active unit in the FC goal selection map. This selection turns on the learning process and the influence of the top-down stream on the bottom-up stream: the output of each actor is a map which, acting as input to the SC_d (see Section III-A2b), suggests points to gaze at and points to avoid, thus contributing to determine the next saccade target.
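A minimal sketch of the fovea classification map described above follows (Python; illustrative code with a plain SOM update, since the paper's pretraining schedule and rates are not given):

import numpy as np

rng = np.random.default_rng(1)
W_som = rng.uniform(size=(64, 10 * 10 * 3))     # one prototype per 8 x 8 unit

def som_respond(patch):
    """Return the one-hot response: 1 for the winning unit, 0 elsewhere."""
    x = patch.ravel()
    winner = int(np.argmin(((W_som - x) ** 2).sum(axis=1)))
    out = np.zeros(64)
    out[winner] = 1.0
    return out, winner

def som_train(patch, eta=0.1, sigma=1.5):
    """One pretraining step: pull the winner and its neighbors toward x."""
    x = patch.ravel()
    _, winner = som_respond(patch)
    wy, wx = divmod(winner, 8)
    ys, xs = np.mgrid[0:8, 0:8]
    h = np.exp(-((ys - wy) ** 2 + (xs - wx) ** 2) / (2 * sigma**2)).ravel()
    W_som[:] += eta * h[:, None] * (x - W_som)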

The activity of the actor units, bounded in [−1, 1], is described by the following equation:

$a_t = \tanh(W_a x_t)$    (5)

where a_t is the vector of activities at timestep t, x_t is the input from the fovea classification map, and W_a is the actor weight matrix. The critic is a perceptron with one linear output unit whose activation is given by

$v_t = w_c^{T} x_t$    (6)

where v_t is the state evaluation of the critic at timestep t, x_t is the input from the fovea classification map, and w_c^T is the transpose of the critic weight vector. The TD-error is computed as

$\delta_t = r_t + \gamma\, v_t - v_{t-1}$    (7)

where δ_t is the TD-error at time t, r_t is the reward value, γ = 0.95 is a discount factor, and v_t and v_{t−1} are, respectively, the current and previous critic evaluations. The reward r is a combination of the signals r_sc from the SC_s and m from the FC:

$r = r_{sc}\, m.$    (8)

Based on this formula, the reward is: 1) strong in proportion to how temporally close the caused event is to the agent's action and 2) positive if the event matches the goal, negative if it does not match it, and null if no goal is selected. The critic connection weights are updated with the equation

$\Delta w_c = \eta_c\, \delta_t\, x_{t-1}$    (9)

where η_c = 0.1 is a learning rate and x_{t−1} is the previous input from the fovea classification map. The actor connection weights are updated as follows [39]:

$\Delta W_a = \eta_a\, \delta_t \left( \left( g_{t-1} \odot \left( \mathbf{1} - y_{t-1}^2 \right) \right) \odot e_{t-1} \right) x_{t-1}^{T}$    (10)

where η_a = 0.5 is a learning rate, the symbol ⊙ indicates the entrywise product, g_{t−1} is a vector containing a 2-D Gaussian pattern (width = 1.2) centered on the previous saccade target location computed by the SC_d, (1 − y_{t−1}^2) is the derivative of the actor-output tanh function (1 is a vector of ones and y_{t−1} is the activation potential), e_{t−1} is the difference between g_{t−1} and y_{t−1}, and x_{t−1}^T is the transposed previous input. According to the formula, the actor output gets closer to the SC_d output that generated a rewarded saccade, or moves in the opposite direction when the reward signal is negative. Positive and negative outputs act, respectively, as excitatory and inhibitory input to the SC_d, so promoting or discouraging saccades on specific targets.
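A minimal sketch of one AC learning step implementing (5)-(10) follows (Python; illustrative code, where the map sizes, the Gaussian construction, and the exact product ordering in (10), which is partly garbled in the source, are our assumptions):

import numpy as np

ETA_A, ETA_C, GAMMA = 0.5, 0.1, 0.95
N_IN, H, W = 64, 37, 47                      # SOM units; SC_d map size
W_a = np.zeros((H * W, N_IN))                # actor weights
w_c = np.zeros(N_IN)                         # critic weights

def gaussian_on_map(target_yx, sigma=1.2):
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (ys - target_yx[0]) ** 2 + (xs - target_yx[1]) ** 2
    return np.exp(-d2 / (2 * sigma**2)).ravel()

def ac_learn(x_prev, saccade_yx_prev, x_now, r):
    """TD update performed right after a saccade, as in (7), (9), and (10)."""
    global W_a, w_c
    delta = r + GAMMA * (w_c @ x_now) - (w_c @ x_prev)      # (7)
    w_c += ETA_C * delta * x_prev                           # (9)
    y = W_a @ x_prev                                        # actor potential
    g = gaussian_on_map(saccade_yx_prev)                    # target pattern
    e = g - np.tanh(y)                                      # error toward g
    W_a += ETA_A * delta * ((g * (1 - np.tanh(y) ** 2)) * e)[:, None] \
           * x_prev[None, :]                                # (10)
    return delta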
Note that, while the actor output exerts its influence on the SC_d at each timestep, the learning phase occurs at a lower temporal granularity (see Fig. 10): in particular, the input and output of the AC that are taken into account for the TD learning are only those immediately preceding a saccade. Fig. 11 displays a complete schema of all the system components.

Fig. 10. Schematic of how the AC learning process works at a low-granularity timescale (timesteps shaded in pink), coupled with the high-granularity timescale used by the bottom-up stream. In practice, the AC learns only in those timesteps where a saccade is scheduled for the next step. For learning, the AC takes into consideration its previous foveal input, the previous action (i.e., where the SC triggered a saccade), and the current reward. Note that the reward fades away (purple shade) according to its temporal distance from the previous saccade. In the example, the visual shift from the green to the red sphere is reinforced because of the immediate light switching (captured by the SC superficial layer, not shown here).

IV. RESULTS

In order to test the behavior of AGO, 30 experiments, each lasting a fixed number of timesteps without partition into trials, were run. In each run, the positions of the six spheres were randomly swapped, and so were the positions of the lights. In this way, the spatial relations between the causal buttons and the lights changed over runs (the causal relationships remained unaltered). The system had to find and learn which saccadic patterns produced an outcome, that is, which events were under the control of the agent, while ignoring the others. The performance of the agent is displayed in Fig. 12. The graph shows which events the agent codified as goals, and whether it acquired the necessary skills to reproduce them. The results show that the agent's FC encoded the two causal events (first two labels on the x-axis) as goals in close to 100% of the runs (black bars); the model, recruiting two ACs in the BG, learned to reproduce them with a percentage of success between 85% and 95% (green bars). The remaining labels on the x-axis refer to goals that were erroneously encoded in relation to the random light: as explained in Section III-B1a, goal encoding depends on the relative position of the detected event with respect to the agent's gaze; this implies that the same random light could trigger the formation of different goals (only the first four are displayed here, as Rand1, Rand2, ...). Recall, however, that the agent forms goals related to the random light only if its contingency with a specific saccade happens several times. The gray bars show how the agent, in these cases, understood

that it could not learn to accomplish them, and so it stopped the attempts to reproduce them (as the variable j in the FC reached the threshold). The short green bar for the Rand1 event is due to two cases, out of 30 experimental runs, where the system was completely deceived by the random event, without realizing the mistake. Qualitatively, in these cases the agent learned a "superstitious behavior": it stared at the random light, through consecutive foveations on the same target, waiting for the next random event to happen, an improbable behavior generally discouraged by the IOR mechanism.

Fig. 12. Average performance over 30 experiments. The x-axis indicates the events potentially present in the environment; each event is characterized by three bars describing: whether the event was encoded as a goal and the related skill subjected to a learning attempt (black bar); whether the system successfully learned the skill for causing the event (green bar); and whether the system interrupted the skill learning attempt (gray bar), which occurs when the event was erroneously evaluated as causal instead of random. The x-axis shows, besides the red and blue events, some random events: these refer to the same random light, which could potentially be encoded by different (false) goals if seen several times.

A second quantitative result, taking into consideration only the two possible goals, is presented in Fig. 13. The two boxplots show the number of selection steps, averaged over 30 experiments, of the ACs recruited for skill learning. The graph shows that the ACs specialized for the red light (AC_red) learn quicker and with more regular times in comparison to the ACs specialized for the blue light (AC_blue). This difference is due to the longer temporal gap characterizing the blue light event, which makes the learning process more difficult: this time interval causes a lower reward signal and possibly the intervention of distracting events (i.e., a reflexive saccade on the random light).

Fig. 13. Average time of engagement of the ACs that specialized for the causal events in different runs of the simulation. Boxplots have been drawn taking into consideration only those runs that ended with a complete, successful learning (26 for the blue event and 29 for the red event). On each box, the central mark is the median, the edges are the 25th and 75th percentiles, the whiskers extend 1.5 times the interquartile range, and the outlier datapoints outside this range are plotted individually as crosses. As shown, the red event is learned by the ACs in a shorter time with respect to the blue event, due to the longer delay after the saccade that causes the latter.

Fig. 11. The whole system described in detail, specifying the components of each subsystem.

Fig. 14. Experimental setup used to illustrate the functioning of the system.

A. Functioning of the System: Illustrative Example

We illustrate the functioning of the system by examining one simulation run in depth. Other runs are qualitatively similar.

Fig. 15. Functioning of the system in an illustrative example experiment. (a) Distribution of light events, with the probability of each light computed at the end of the experiment. (b) Current AC, with the probability of light events computed within its activity. (c) Performance according to the activity of units j and y in the FC. (d) Saccade occurrences.

The configuration of the stimuli used in the analyzed run is shown in Fig. 14. In Fig. 15, four graphs display the main activities of the system during the whole experiment. A video of the behavior of the system is available at youtube/ekeqbt5qfuq.

Fig. 15(b) shows how, around step 1000, the agent encodes the blue event as a goal, recruits AC_30, and starts to learn how to cause the event repeatedly. This can be inferred by looking at the probability of light events computed within the AC activation, and at Fig. 15(a), which displays the occurrences of light events. It is noteworthy that, during this period, the agent not only learns to cause the blue event (p_blue = 0.016), thanks to positive rewards (in FC, unit m = 1), but also avoids producing the red event (p_red = 0.002), thanks to negative reinforcements (in FC, unit m = -1). Obviously, nothing changes for the random event, as it cannot be affected by the agent. Fig. 15(c) shows the performance of the system through the joint activities of units j and y in FC: as explained in Section III-B1c, the higher the frequency of goal achievement, the faster the increase of y. This unit can hence be considered as indicating the confidence of the system in the effectiveness of the skill. As soon as the confidence threshold is reached, around step 5500, the skill is considered sufficiently good and AC_30 is inhibited by unit k in FC (not shown in the figure). After a brief pause, in which the system is again driven only by the bottom-up stream, AC_7 is recruited by the red event, and the corresponding skill is acquired before step …. Finally, the system is deceived by the random light, which is erroneously encoded as a possible goal: AC_29 is recruited to learn to achieve this goal, but the performance cannot increase [see the activity of unit j in Fig. 15(c)], so the goal is inhibited after step …. Note that during this activity the agent still tends to avoid producing both causal events; that is, it learns to avoid causing the events that differ from the pursued goal (p_red = … and p_blue = 0.004).

Fig. 16 shows the percentages of the light events within the selected ACs. The differences between the distributions highlight skill specialization: the first recruited expert, AC_30, specialized in switching on the blue light, whereas the second, AC_7, specialized in switching on the red light; the similar probabilities of the third, AC_29, are due to the fact that it was recruited for an uncontrollable event (i.e., the random light).

Fig. 16. A different characterization of the data for the same illustrative experiment considered in Fig. 15. Each stacked bar displays the average probability of a light event during the activation of the corresponding AC, when it was selected. The differences in the probabilities indicate skill specialization.

Considering goal encoding, Fig. 17 shows the temporal evolution of the whole goal-encoding map: the repeated experience of both causal events makes their representation progressively stronger. The two receptive fields corresponding to the two events are enlarged in Fig. 18: the images clearly show how the goal representation encodes the appearance of the event (what) and its position with respect to the agent's point of view (where).

Fig. 17. Four snapshots showing the temporal formation of the receptive fields in the 8 × 8 goal-encoding map during an illustrative experiment. Each cell in the map displays the connection weights of the corresponding unit, storing the prototype of the event-related goal. The last image on the right shows how, at the end of the experiment, units 7 and 30 specialized, respectively, for the encoding of the red and blue light events. The progressive darkening of the map is an effect of the normalization used for graphical rendering.

Fig. 18. Two receptive fields outlined in Fig. 17. (a) Unit 30 encodes a change happening at the bottom left with respect to the agent's point of view: this happens when the agent looks at the blue button. (b) Unit 7 encodes a change happening close on the right: this happens when the agent looks at the red button. The two units recruit, respectively, AC_30 and AC_7.

What does an effective behavior of the agent look like? Fig. 19 displays some examples of the ACs' responses to different foveal inputs. As shown, the expert actor suggests where to perform the next saccade in order to reproduce the event corresponding to the currently active goal. The actor activity is both excitatory and inhibitory, shown respectively in red and blue in the figure. Excitatory activity promotes the production of a saccade by exciting the corresponding point in SC_d, while inhibitory activity discourages it. Two interesting phenomena emerge: the facilitation of saccades on the causal lights and the inhibition of the reflex on the random light. With regard to the first point, once the agent is on the button (red for AC_7 and blue for AC_30), the actor immediately suggests jumping to the correct light, even if the event (and so the reward signal) has not yet happened; this is evident for the blue event, which is characterized by a temporal delay.
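How the bottom-up and top-down signals just described could combine into a single saccade decision can be shown with a short sketch. The additive combination, the map size, and all the names are illustrative assumptions rather than the paper's exact equations: bottom-up saliency, the actor's signed (excitatory/inhibitory) bias, and IOR are summed over the decision map, and the most active location is foveated.

```python
import numpy as np

def select_saccade(saliency, actor_bias, ior):
    """Pick the next fixation from three 2-D maps over gaze space.

    saliency   (>= 0): bottom-up stream, high at luminance changes/onsets
    actor_bias       : top-down stream; positive values excite a location
                       (red spots in Fig. 19), negative values inhibit it
                       (blue spots)
    ior        (>= 0): inhibition-of-return, high at recently visited sites
    """
    activation = saliency + actor_bias - ior
    return np.unravel_index(np.argmax(activation), activation.shape)

# A salient distractor (top right) loses against a location the actor
# has learned to excite, because the actor also inhibits the distractor.
saliency   = np.array([[0.0, 0.9], [0.4, 0.0]])
actor_bias = np.array([[0.0, -0.8], [0.6, 0.0]])
ior        = np.zeros((2, 2))
assert select_saccade(saliency, actor_bias, ior) == (1, 0)
```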
This anticipatory behavior is due to the positive evaluation that the light site acquires during learning: at the beginning, when the light switches on, the agent jumps immediately to it, driven by the orientation reflex; as a side effect, the inhibition on the button due to IOR is smaller than on the other sites, and this makes it probable that the actor jumps back to it. In this way the IOR mechanism facilitates the production of a back-and-forth behavior, progressively strengthened by learning, in which the agent tends to repeat the action that caused the desired event. This behavioral schema is predominant in all the observed experiments, but it does not always occur: in some runs the agent gazes at other stimuli after looking at the lights, and consequently a chain of saccades (generally no more than two or three) is learned. It is noteworthy, however, that in this latter case no wrong button is included in the saccadic pattern, as it would produce a negative reward.

Regarding the learned inhibition of the reflex on the random light, this emerges when the attention of the agent is captured by the random event. If this happens during the learning of an AC, the random light site gets a negative evaluation: the reflex interrupts the ongoing rewarded saccadic pattern, so that the AC receives a negative TD-error and learns to actively inhibit the site and avoid the distractor (a minimal sketch of this update is given at the end of this section).

What is the utility of the goals and skills acquired under the drive of goal self-generation and IMs? How can the agent recall the skills needed to accomplish desired goals? We ran some experiments to start to show some mechanisms that might be used for this purpose, but the development of a fully autonomous agent that also decides which goals to activate goes beyond the scope of this paper. Fig. 20 shows a test of the previously learned skills and of how they can be recalled by activating the related goals by hand. In particular, in this test the goals of AC_7 and AC_30 were externally activated, one after the other, for 1000 steps each [see Fig. 20(b)]. The figure shows how, in correspondence with each AC's activation, almost exclusively the desired behaviors and effects occur, i.e., the corresponding causal light switches on [see Fig. 20(a)]. This can be further appreciated by comparing it with a baseline experiment where the top-down stream is kept switched off (see Fig. 21).
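The distractor inhibition described above amounts to a standard actor-critic temporal-difference update, in which a negative TD-error lowers the actor's preference for the site that interrupted a rewarded pattern. A tabular sketch of this idea follows; the learning rates, the discount factor, and the tabular form are assumptions made for illustration, not the paper's network equations.

```python
from collections import defaultdict

ALPHA, BETA, GAMMA = 0.1, 0.1, 0.9  # hypothetical rates and discount

V = defaultdict(float)      # critic: state -> value estimate
prefs = defaultdict(float)  # actor: (state, target) -> saccade preference

def td_update(state, target, reward, next_state):
    """One actor-critic step over discrete states and saccade targets."""
    delta = reward + GAMMA * V[next_state] - V[state]  # TD-error
    V[state] += ALPHA * delta                          # critic update
    prefs[(state, target)] += BETA * delta             # actor update
    # A reflexive saccade to the random light interrupts a rewarded
    # pattern: delta < 0, so the preference for that site is pushed
    # down and the location ends up actively inhibited.
    return delta
```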

Fig. 19. Illustrative snapshots showing the qualitative behavior of AC_7 (middle row) and AC_30 (bottom row) when exposed to five different foveal inputs (top row). The inputs have been chosen for their importance with regard to the task and do not represent a behavioral sequence. For each input in the top row of graphs, the corresponding actor activity is presented (second row of graphs for AC_7 and third row for AC_30): red and blue spots (indicated by arrows) denote, respectively, the excitatory and the inhibitory activity of the actor, whereas the intensity of the spots reflects the intensity of such activity. Note that excessively dim activations, although effective in biasing the behavior, cannot be visually appreciated and so are omitted from the graph (missing pictures). A textbox explains example 3, shown in the central column, in more detail.

Fig. 20. In this evaluation experiment, where no learning is on, the previously learned skills are tested. AC_30 and AC_7, the experts that learned and specialized to switch on the blue and red lights, were artificially activated in sequence for 1000 steps each, with an interval of 3000 steps between them. The test shows how the system is potentially able to exploit the acquired knowledge about action-outcome contingencies to switch on the light that corresponds to a given active goal. (a) Distribution of light events. (b) Current AC. (c) Saccade occurrences.

Fig. 21. Behavioral baseline in an experiment where the top-down stream is kept off, i.e., the behavior is completely driven by the bottom-up stream. As shown, the three lights present about the same probability of occurring. (a) Distribution of light events. (b) Saccade occurrences.

V. RELATED WORKS

Barto et al. [11] and Singh et al. [68] were among the first to introduce into machine learning and robotics the issue of the autonomous acquisition of skills on the basis of IMs, independently of the fulfilment of any extrinsic task. In these works the authors addressed the problem within the RL options framework [69], augmented with concepts drawn from the psychological literature on IMs and operationalized with machine learning algorithms. An option is a macro-action involving a policy (i.e., a skill establishing the primitive action to perform in each state), an initiation set (the conditions under which the option can be executed), and a termination condition (the conditions under which the option execution terminates). Options can also have an option model describing the effects of executing the option. The authors use a simulated grid-world playroom to test an option-based agent provided with an eye and a hand that can perform various actions (e.g., look, press, and turn on) on various objects (e.g., an abstract ball, block, toy, and light switch) to cause salient consequences (e.g., switching on the light, activating the toy, and ringing a bell), where saliency corresponds to putative changes in light and sound intensities. Interestingly, the authors also link saliency to DA production in the brain. A new salient event surprises the agent and triggers the creation of a new option, which the agent then learns through RL. The models of Barto and colleagues tackle the important issue of reusing the acquired skills, possibly organized in hierarchies, to accomplish extrinsic goals: this is an issue that might be faced with our model in future work. The model presented here uses the concept of change to trigger goal formation: this is similar to saliency, with the difference that here we actually simulate the visual processes that allow the agent to detect the visual change that produces the formation of a goal. Indeed, in our system any change in the visual image can potentially lead to goal formation. Another difference is that our system goes beyond a simple grid world to consider an environment involving continuous actions and a more realistic input (the visual images). This allows our system to store a visual representation of the goal in terms of the caused changes.
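The ingredients of an option listed above can be made concrete with a small sketch. The Python types, the fixed pseudo-reward, and the method names are illustrative assumptions; only the three formal components (policy, initiation set, termination condition) come from the options framework [69].

```python
from dataclasses import dataclass
from typing import Callable, Set

State = int   # stand-in for the real state type
Action = int  # stand-in for the real primitive-action type

@dataclass
class Option:
    policy: Callable[[State], Action]     # skill: state -> primitive action
    initiation_set: Set[State]            # states where the option may start
    termination: Callable[[State], bool]  # beta(s): when execution ends
    pseudo_reward: float = 1.0            # intrinsic reward at the subgoal

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set

# Example: an option that keeps stepping left until it reaches state 0.
go_to_0 = Option(policy=lambda s: -1,
                 initiation_set={1, 2, 3},
                 termination=lambda s: s == 0)
```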
Note how some of the features just mentioned derive from an important difference that our model has with respect to all the models considered in this section: it faces goal-based open-ended learning by focusing on the acquisition of attentional routines within a robot endowed with a realistic visual system, rather than on motor skills involving the manipulation of the environment with an arm and hand together with abstract looking actions.

Another thread of research developed within the options framework, called the option discovery problem, directly addresses the question of which mechanisms can be used to support the autonomous self-generation of options [28]. Option discovery has often been framed as the heuristic identification of useful subgoal states within some larger extrinsic task. When a useful subgoal is identified, the agent can learn the policy to accomplish it and then employ it as a reusable skill. As the subgoals are not extrinsically rewarded, the agent generates a pseudo-reward when they are accomplished in order to learn their policy. Various approaches have been used to generate subgoals. Subgoals might be generated genetically, as modeled in [17] and [70]. Another approach is based on the idea of generating subgoals for bottleneck states: these are world states that are visited with a high frequency while solving other tasks [71].

The concept has later been formalized, on the basis of graph-theoretic concepts, as betweenness [72]: the relative frequency with which a certain state is visited when considering all possible pairs of states of a given task domain. These concepts have also been linked to information-theoretic perspectives; in particular, it has been proposed [73] that subgoals can be identified as states marked by relevant changes in the amount of Shannon information that the agent needs to maintain about the current goal in order to select the appropriate action.
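The bottleneck/betweenness heuristic can be illustrated in a few lines with networkx; the toy transition graph is an assumption made for the example. The state joining two otherwise disconnected "rooms" lies on most shortest paths between state pairs and therefore receives the highest betweenness, marking it as a candidate subgoal.

```python
import networkx as nx

# Two small "rooms" ({0, 1} and {3, 4}) connected only through state 2.
G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 4)])

centrality = nx.betweenness_centrality(G)
subgoal = max(centrality, key=centrality.get)
print(subgoal)  # -> 2: the doorway-like bottleneck between the two rooms
```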

A key difference between our model and these approaches is that the latter assume the existence of extrinsic tasks (or, similarly, consider all possible tasks, each defined as an initial state from which the agent has to reach a terminal state). Instead, our approach aims to find possible markers of potentially interesting (sub)goals, such as the detection of changes used here, that can be directly used to self-generate goals even in the absence of external tasks.

The self-generation of goals and their use to learn skills in the current model share many similarities, but also show important differences, with respect to another model of ours [31]. There, a simulated humanoid robot freely explores the environment through arm movements and learns to make things happen when objects are touched; in particular, some spheres turn on when hit with the hand. As in the model presented here, the notion of event is crucial for triggering the formation of a goal on the basis of a perceived change in the world, and both models share the use of AC experts to accomplish the different goals. Both systems use a competence-based IM to decide on which goal to focus the skill learning process, but while the model presented here focuses on the newly created goal for multiple trials, until the visual routine that achieves it has reached a desired level of quality, the other model selects a goal at each trial on the basis of a soft-max function: the relative advantages of these two approaches are still an open issue. The two models, however, also present two important differences. The first is that the model presented here tackles the open-ended learning of visual attention routines rather than of arm/hand behaviors. This difference strongly affects the whole architecture of the system.
1) The various input/internal/output components of the architecture tend to be 2-D maps, due to the nature of visual input and attention control.
2) An action in one step amounts to deciding the desired pan-tilt values, and any point in the motor space can be reached in one time step, i.e., one saccade; this differs from arm/hand motor control, where a given rhythmic or discrete movement requires multiple steps to be accomplished.
3) The attention control behavior is (mainly) directed to gather information from the environment rather than to change it.
The second main difference is that the model presented here has a bio-inspired architecture, and this might facilitate the future linking of the model with behavioral or neuroscientific data.

Another relevant literature thread is that of goal babbling (e.g., [25] and [74]). The main objective of goal babbling is the learning of inverse models able to map a desired goal to an actuator posture leading to it (e.g., arm joint angles), where this posture usually involves many redundant degrees of freedom (a toy sketch of such an inverse mapping is given at the end of this section). In early goal-babbling models the goal was the end-effector position in space, e.g., the position of the robot arm's hand. Recently, the goal-babbling literature has started to propose models that form goals related to the external world rather than to the agent's body. Such goals for example refer to the position of a target object [32], or to a target object and a tool usable to reach the object [33]. The approach has also been empowered through open-ended/IM mechanisms (e.g., [25] and [27]). The work presented here differs from these models for its focus on the vision/attention domain, for its bio-inspired architecture, and for the study of different learning mechanisms supporting open-ended learning (e.g., the use of experts and the mechanisms to compute competence IMs [23]).

To our knowledge, only one other work has proposed a system formed by a bottom-up and a top-down attention component that uses IMs to support autonomous learning [75] (see [76]–[78] for some early models using RL to actively control attention; see also the review [79] on attention, intended as the focusing of exploration and learning activities under the drive of extrinsic and intrinsic motives, for autonomously developing agents). That model controls a humanoid iCub robot able to look at, and manipulate, toys. In particular, the model exploits active curiosity-driven manipulation to improve the perceptual learning of objects through both the robot's own experience and its interaction with humans. Our model differs from this one in its use of goals to encode useful changes caused in the environment and to support open-ended learning.
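To make the goal-babbling idea mentioned above concrete, here is a toy sketch of an inverse model for a two-link planar arm. The link lengths, the random babbling, and the nearest-neighbor lookup are illustrative assumptions: the cited works use goal-directed exploration and regression rather than this brute-force table, but the learned mapping is of the same kind, from a desired goal (hand position) to a posture (joint angles).

```python
import numpy as np

L1, L2 = 1.0, 1.0  # assumed link lengths of the toy arm

def forward(q):
    """Forward model: joint angles q -> hand position (x, y)."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

rng = np.random.default_rng(0)
postures = rng.uniform(-np.pi, np.pi, size=(500, 2))  # babbled postures
hands = np.array([forward(q) for q in postures])      # observed outcomes

def inverse(goal_xy):
    """Return the stored posture whose outcome is closest to the goal."""
    return postures[np.argmin(np.linalg.norm(hands - goal_xy, axis=1))]

q = inverse(np.array([1.2, 0.8]))
print(forward(q))  # approximately the requested goal position
```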
VI. CONCLUSION

We proposed a bio-inspired computational architecture, called AGO, that is able to autonomously create its own goals on the basis of observed action-outcome contingencies, and to use them to learn the visual routines needed to accomplish them. Tasks are created through the exploitation of self-generated goals, that is, internal representations of possible world states caused by the agent's camera fixations. Within the system, goals are shown to serve multiple purposes.
1) A newly formed goal drives the learning of the attention skills (sequences of camera fixations) needed to fulfil the current task; in particular, it supports the measure of a competence-based IM that allows the agent to engage with a given goal for the time needed to acquire the related skill, and it contributes to produce a reward signal when the goal is matched.
2) A stored goal can be reactivated to recall the already acquired skill corresponding to it.
3) The set of stored goals, coupled with the complementary attention skills, represents the amount of potential control that the agent can exert on the environment.

The architecture produces visual exploration and attention control on the basis of the interplay of two complementary components: 1) a bottom-up component that tends to drive the agent's gaze to salient locations of the scene and 2) a top-down component that learns to select visual targets among such locations on the basis of the agent's current goal. Within the authors' research program [36]–[39], [80], which proposes visual systems founded on these two components, this paper represents a further step toward the development of an effective open-ended learning system, through the introduction of self-generated goals and tasks and of mechanisms to suitably process them so as to acquire the related attention skills and later recall them. The tests of the system confirm, in this new context involving open-ended learning and goals, an important principle proposed in the authors' previous works (see [37] for an overview), namely that the synergy between the two components allows the active-vision architecture to acquire the needed attention skills in an efficient way. Indeed, the bottom-up component drives visual exploration toward potentially relevant areas of the scene, and the top-down component confirms or overrides such tendencies on the basis of task-related information. To our knowledge, this is one of the first works to extensively employ goals to support open-ended learning in a system endowed with both a bottom-up and a top-down attention component (cf. [75], the closest work discussed in Section V).

There are a number of possible directions in which the system might be developed in future work. The capacity to acquire control over an unknown environment, even without external guidance, is a fundamental means for animals, in particular higher mammals and humans, to support the acquisition of complex behaviors.

In this respect, the model presented here could be used to reproduce curious agents that visually explore and interact with the world to progressively learn how to master it. Such conduct indeed resembles the curiosity-driven behaviors observed in monkeys and children engaged in exploratory and play activities [5], [9], [81]. To this purpose, however, it would be important to realize a close comparison of the model's looking behavior with the behavior of real animals/humans as measured in experiments (see [82], [83]). We started to do this in a previous work [38], where a model sharing some features with the current one successfully reproduced the development of the gazing behavior of babies engaged in a gaze-contingency experiment [41]. This experimental paradigm is particularly interesting for developmental psychology as it allows the study of the acquisition of high-level cognitive processes in very young infants, since it does not require the engagement of the limb motor system, which is only partially mature in babies. For example, Wang et al. [41] showed how six-month-old babies, characterized by poor limb motor control, learn very rapidly to use their gaze to cause changes in the environment. The model presented here is particularly suited to face this type of scenario thanks to its bio-inspired architecture. In this respect, one might aim to reproduce and study attention exploratory behaviors by also incorporating into the model brain mechanisms underlying the intrinsically motivated acquisition of skills [7], [56], [84], [85].

Another direction in which to improve the model involves its capacity to reuse the autonomously acquired goals and skills to accomplish new tasks. In the current setup, the benefit of the acquired knowledge has only been partially shown, by directly activating the model's goals by hand in order to observe the performance of the related skills. In the future, the system should be enhanced with the capacity to autonomously select a learned goal to pursue a specific practical purpose. This might be done through the introduction of the capacity to process extrinsic rewards, for example by modeling how animals can employ the autonomously acquired knowledge to obtain resources, such as food and sex, that are important for survival or reproduction [7]. Through these rewards, the model should become able to internally reactivate the previously acquired goals so as to perform skills that lead to obtaining the desired resources [86].

A further improvement of the model could aim to overcome an important limitation it has, namely that the environment is changed directly through gaze control. In particular, the model should be made capable of gathering the information needed to control a realistic actuator, such as an arm with a hand. This modification would introduce a first challenge related to the fact that learning signals would be produced by the arm actions and not by the gaze movements [36], [37]. Moreover, it would introduce a second challenge related to the perception of the changes caused by the movement of the limb: these would continuously capture the agent's attention and possibly trigger the formation of goals corresponding to the seen body postures. On one side, this could be interesting to study how the model actually learns visual goals related to its own body, similarly to what happens in babies.
On the other side, the phenomenon could be considered a problem to be solved if one aims to develop a robot that invests its learning resources in mastering the external world: to this purpose, some kind of habituation/prediction mechanism seems to be needed to allow the robot to ignore changes of its own body and to focus on changes of the external environment.

REFERENCES

[1] G. Baldassarre, What are intrinsic motivations? A biological perspective, in Proc. Int. Conf. Develop. Learn. Epigenet. Robot. (ICDL-EpiRob), Frankfurt, Germany, Aug. 2011, pp. E1–E8.
[2] G. B. Kish, Learning when the onset of illumination is used as reinforcing stimulus, J. Comp. Physiol. Psychol., vol. 48, no. 4, pp. –.
[3] G. B. Kish and G. W. Barnes, Reinforcing effects of manipulation in mice, J. Comp. Physiol. Psychol., vol. 54, no. 6, pp. –.
[4] L. E. Moon and T. M. Lodahl, The reinforcing effect of changes in illumination on lever-pressing in the monkey, Amer. J. Psychol., vol. 69, no. 2, pp. –.
[5] F. Taffoni et al., Development of goal-directed action selection guided by intrinsic motivations: An experiment with children, Exp. Brain Res., vol. 237, no. 7, pp. –.
[6] A. Barto, M. Mirolli, and G. Baldassarre, Novelty or surprise? Front. Psychol. Cogn. Sci., vol. 4, no. 907, pp. e1–e15.
[7] P. Redgrave and K. Gurney, The short-latency dopamine signal: A role in discovering novel actions? Nat. Rev. Neurosci., vol. 7, no. 12, pp. –.
[8] B. Elsner and B. Hommel, Contiguity and contingency in action-effect learning, Psychol. Res., vol. 68, nos. 2–3, pp. –.
[9] E. Polizzi di Sorrentino et al., Exploration and learning in capuchin monkeys (Sapajus spp.): The role of action-outcome contingencies, Animal Cogn., vol. 17, no. 5, pp. –.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press.
[11] A. G. Barto, S. Singh, and N. Chentanez, Intrinsically motivated learning of hierarchical collections of skills, in Proc. Int. Conf. Develop. Learn. (ICDL), La Jolla, CA, USA, 2004, pp. –.
[12] D. E. Berlyne, Curiosity and exploration, Science, vol. 153, no. 3731, pp. –, Jul.
[13] R. W. White, Motivation reconsidered: The concept of competence, Psychol. Rev., vol. 66, no. 5, pp. –.
[14] D. E. Berlyne, Curiosity and learning, Motivation Emotion, vol. 2, no. 2, pp. –.
[15] J. Schmidhuber, A possibility for implementing curiosity and boredom in model-building neural controllers, in Proc. Int. Conf. Simulat. Adapt. Behav.: From Animals to Animats, 1991, pp. –.
[16] J. Schmidhuber, Curious model-building control systems, in Proc. Int. Joint Conf. Artif. Neural Netw., vol. 2, Singapore, 1991, pp. –.
[17] M. Schembri, M. Mirolli, and G. Baldassarre, Evolving childhood's length and learning parameters in an intrinsically motivated reinforcement learning robot, in Proc. 7th Int. Conf. Epigenet. Robot. (EpiRob): Model. Cogn. Develop. Robot. Syst., Piscataway, NJ, USA, Nov. 2007, pp. –.
[18] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner, Intrinsic motivation systems for autonomous mental development, IEEE Trans. Evol. Comput., vol. 11, no. 2, pp. –, Apr.
[19] P.-Y. Oudeyer and F. Kaplan, What is intrinsic motivation? A typology of computational approaches, Front. Neurorobot., vol. 1, no. 6, pp. 1–14.
[20] G. Baldassarre and M. Mirolli, Eds., Intrinsically Motivated Learning in Natural and Artificial Systems. Berlin, Germany: Springer-Verlag.
[21] G. Baldassarre and M. Mirolli, Intrinsically motivated learning systems: An overview, in Intrinsically Motivated Learning in Natural and Artificial Systems, G. Baldassarre and M. Mirolli, Eds. Heidelberg, Germany: Springer-Verlag, 2013, pp. –.
[22] V. G. Santucci, G. Baldassarre, and M. Mirolli, Intrinsic motivation mechanisms for competence acquisition, in Proc. IEEE Int. Conf. Develop. Learn. Epigenet. Robot. (ICDL-EpiRob), San Diego, CA, USA, 2012, pp. –.
[23] V. G. Santucci, G. Baldassarre, and M. Mirolli, Which is the best intrinsic motivation signal for learning multiple skills? Front. Neurorobot., vol. 7, pp. e1–e22, Nov.
[24] J. Weng et al., Autonomous mental development by robots and animals, Science, vol. 291, no. 291, pp. –, 2001.

[25] A. Baranes and P.-Y. Oudeyer, Active learning of inverse models with intrinsically motivated goal exploration in robots, Robot. Auton. Syst., vol. 61, no. 1, pp. –.
[26] M. Mirolli and G. Baldassarre, Functions and mechanisms of intrinsic motivations: The knowledge versus competence distinction, in Intrinsically Motivated Learning in Natural and Artificial Systems, G. Baldassarre and M. Mirolli, Eds. Heidelberg, Germany: Springer-Verlag, 2013, pp. –.
[27] F. C. Benureau and P.-Y. Oudeyer, Behavioral diversity generation in autonomous exploration through reuse of past experience, Front. Robot. AI, vol. 3, p. 8, Mar.
[28] C. Diuk et al., Divide and conquer: Hierarchical reinforcement learning and task decomposition in humans, in Computational and Robotic Models of the Hierarchical Organisation of Behaviour, G. Baldassarre and M. Mirolli, Eds. Heidelberg, Germany: Springer-Verlag, 2013, pp. –.
[29] M. Rolf and M. Asada, Autonomous development of goals: From generic rewards to goal and self detection, in Proc. Joint IEEE Int. Conf. Develop. Learn. Epigenet. Robot. (ICDL-EpiRob), Genoa, Italy, 2014, pp. –.
[30] R. F. Reinhart, Autonomous exploration of motor skills by skill babbling, Auton. Robots, vol. 41, no. 7, pp. –.
[31] V. G. Santucci, G. Baldassarre, and M. Mirolli, GRAIL: A goal-discovering robotic architecture for intrinsically-motivated learning, IEEE Trans. Cogn. Develop. Syst., vol. 8, no. 3, pp. –, Sep.
[32] M. Rolf and M. Asada, Where do goals come from? A generic approach to autonomous goal-system development, IEEE Trans. Auton. Mental Develop., to be published.
[33] S. Forestier and P.-Y. Oudeyer, Overlapping waves in tool use development: A curiosity-driven computational model, in Proc. 6th Joint IEEE Int. Conf. Develop. Learn. Epigenet. Robot., 2016, pp. –.
[34] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation, in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. –.
[35] S. Ullman, Visual routines, Cognition, vol. 18, nos. 1–3, pp. –.
[36] D. Ognibene, C. Balkenius, and G. Baldassarre, Integrating epistemic action (active vision) and pragmatic action (reaching): A neural architecture for camera-arm robots, in Proc. 10th Int. Conf. Simulat. Adapt. Behav. (SAB): From Animals to Animats, Osaka, Japan, Jul. 2008, pp. –.
[37] D. Ognibene and G. Baldassarre, Ecological active vision: Four bio-inspired principles to integrate bottom-up and adaptive top-down attention tested with a simple camera-arm robot, IEEE Trans. Auton. Mental Develop., vol. 7, no. 1, pp. 3–25, Mar.
[38] R. Marraffa, V. Sperati, D. Caligiore, J. Triesch, and G. Baldassarre, A bio-inspired attention model of anticipation in gaze-contingency experiments with infants, in Proc. IEEE Int. Conf. Develop. Learn. Epigenet. Robot. (ICDL-EpiRob), San Diego, CA, USA, Nov. 2012, pp. e1–e8.
[39] V. Sperati and G. Baldassarre, Learning where to look with movement-based intrinsic motivations: A bio-inspired model, in Proc. Int. Conf. Develop. Learn. Epigenet. Robot. (ICDL-EpiRob), Genoa, Italy, 2014, pp. –.
[40] O. Hikosaka, Y. Takikawa, and R. Kawagoe, Role of the basal ganglia in the control of purposive saccadic eye movements, Physiol. Rev., vol. 80, no. 3, pp. –.
[41] Q. Wang et al., Infants in control: Rapid anticipation of action outcomes in a gaze-contingent paradigm, PLoS ONE, vol. 7, no. 2, 2012, Art. no. e–.
[42] F. Deligianni, A. Senju, G. Gergely, and G. Csibra, Automated gaze-contingent objects elicit orientation following in 8-month-old infants, Develop. Psychol., vol. 47, no. 6, pp. –.
[43] B. White and D. P. Munoz, The superior colliculus, in Oxford Handbook of Eye Movements, S. Liversedge, I. Gilchrist, and S. Everling, Eds. Oxford, U.K.: Oxford Univ. Press, 2011, pp. –.
[44] R. J. Krauzlis, L. P. Lovejoy, and A. Zénon, Superior colliculus and visual spatial attention, Annu. Rev. Neurosci., vol. 36, pp. –, Jul.
[45] P. Dean, P. Redgrave, and G. Westby, Event or emergency? Two response systems in the mammalian superior colliculus, Trends Neurosci., vol. 12, no. 4, pp. –.
[46] J. T. DesJardin et al., Defense-like behaviors evoked by pharmacological disinhibition of the superior colliculus in the primate, J. Neurosci., vol. 33, no. 1, pp. –.
[47] E. Comoli et al., A direct projection from superior colliculus to substantia nigra for detecting salient visual events, Nat. Neurosci., vol. 6, no. 9, pp. –.
[48] W. Schultz, Predictive reward signal of dopamine neurons, J. Neurophysiol., vol. 80, no. 1, pp. 1–27.
[49] E. Dommett et al., How visual stimuli activate dopaminergic neurons at short latency, Science, vol. 307, no. 5714, pp. –.
[50] P. Redgrave, T. J. Prescott, and K. Gurney, The basal ganglia: A vertebrate solution to the selection problem? Neuroscience, vol. 89, no. 4, pp. –.
[51] J. G. McHaffie, T. R. Stanford, B. E. Stein, V. Coizet, and P. Redgrave, Subcortical loops through the basal ganglia, Trends Neurosci., vol. 28, no. 8, pp. –.
[52] W. Schultz, Behavioral theories and the neurophysiology of reward, Annu. Rev. Psychol., vol. 57, pp. –, Feb.
[53] E. K. Miller and J. D. Cohen, An integrative theory of prefrontal cortex function, Annu. Rev. Neurosci., vol. 24, no. 1, pp. –.
[54] S. Gilbert and P. Burgess, Executive function, Current Biol., vol. 18, no. 3, pp. –.
[55] D. Joel, Y. Niv, and E. Ruppin, Actor-critic models of the basal ganglia: New anatomical and computational perspectives, Neural Netw., vol. 15, nos. 4–6, pp. –.
[56] P. Redgrave, K. Gurney, T. Stafford, M. Thirkettle, and J. Lewis, The role of the basal ganglia in discovering novel actions, in Intrinsically Motivated Learning in Natural and Artificial Systems, G. Baldassarre and M. Mirolli, Eds. Heidelberg, Germany: Springer-Verlag, 2013, pp. –.
[57] P. Dayan and L. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, MA, USA: MIT Press.
[58] T. J. Anastasio, Tutorial on Neural System Modeling. Sunderland, MA, USA: Sinauer.
[59] J. Ross, M. C. Morrone, M. E. Goldberg, and D. C. Burr, Changes in visual perception at the time of saccades, Trends Neurosci., vol. 24, no. 2, pp. –.
[60] M. Ibbotson and B. Krekelberg, Visual perception and saccadic eye movements, Current Opin. Neurobiol., vol. 21, no. 4, pp. –.
[61] J. Lupianez, R. M. Klein, and P. Bartolomeo, Inhibition of return: Twenty years after, Cogn. Neuropsychol., vol. 23, no. 7, pp. –.
[62] R. A. Andersen and C. A. Buneo, Intentional maps in posterior parietal cortex, Annu. Rev. Neurosci., vol. 25, pp. –, Mar.
[63] R. A. Andersen, L. H. Snyder, D. C. Bradley, and J. Xing, Multimodal representation of space in the posterior parietal cortex and its use in planning movements, Annu. Rev. Neurosci., vol. 20, pp. –, Mar.
[64] M. C. Dorris, R. M. Klein, S. Everling, and D. P. Munoz, Contribution of the primate superior colliculus to inhibition of return, J. Cogn. Neurosci., vol. 14, no. 8, pp. –.
[65] R. M. Klein, Inhibition of return, Trends Cogn. Sci., vol. 4, no. 4, pp. –.
[66] A. Sapir, A. Hayes, A. Henik, S. Danziger, and R. Rafal, Parietal lobe lesions disrupt saccadic remapping of inhibitory location tagging, J. Cogn. Neurosci., vol. 16, no. 4, pp. –.
[67] M. E. Bratman, Intentions, Plans, and Practical Reason. Cambridge, MA, USA: Harvard Univ. Press.
[68] S. Singh, A. Barto, and N. Chentanez, Intrinsically motivated reinforcement learning, in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge, MA, USA: MIT Press, 2005, pp. –.
[69] R. S. Sutton, D. Precup, and S. Singh, Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Artif. Intell., vol. 112, nos. 1–2, pp. –.
[70] M. Schembri, M. Mirolli, and G. Baldassarre, Evolving internal reinforcers for an intrinsically motivated reinforcement-learning robot, in Proc. 6th IEEE Int. Conf. Develop. Learn. (ICDL), London, U.K., 2007, pp. –.
[71] A. McGovern and A. Barto, Automatic discovery of subgoals in reinforcement learning using diverse density, Comput. Sci. Dept. Faculty Publ. Series, Univ. Massachusetts Amherst, Amherst, MA, USA.
[72] Ö. Şimşek and A. G. Barto, Skill characterization based on betweenness, in Advances in Neural Information Processing Systems, vol. 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. Boston, MA, USA: MIT Press, 2008, pp. –.

[73] S. G. van Dijk and D. Polani, Grounding subgoals in information transitions, in Proc. IEEE Symp. Adapt. Dyn. Program. Reinforcement Learn. (ADPRL), Paris, France, 2011, pp. –.
[74] M. Rolf, J. J. Steil, and M. Gienger, Goal babbling permits direct learning of inverse kinematics, IEEE Trans. Auton. Mental Develop., vol. 2, no. 3, pp. –, Sep.
[75] S. M. Nguyen et al., Learning to recognize objects through curiosity-driven manipulation with the iCub humanoid robot, in Proc. IEEE Int. Conf. Develop. Learn. Epigenet. Robot. (ICDL-EpiRob), Osaka, Japan, 2013, pp. –.
[76] J. Schmidhuber and R. Huber, Learning to generate artificial fovea trajectories for target detection, Int. J. Neural Syst., vol. 2, nos. 1–2, pp. –.
[77] L. Paletta and A. Pinz, Active object recognition by view integration and reinforcement learning, Robot. Auton. Syst., vol. 31, nos. 1–2, pp. –.
[78] C. Balkenius, Attention, habituation and conditioning: Toward a computational model, Cogn. Sci. Quart., vol. 1, no. 2, pp. –.
[79] K. Merrick, Value systems for developmental cognitive robotics: A survey, Cogn. Syst. Res., vol. 41, pp. –, Mar.
[80] D. Ognibene, G. Pezzulo, and G. Baldassarre, How can bottom-up information shape learning of top-down attention-control skills? in Proc. 9th IEEE Int. Conf. Develop. Learn. (ICDL), Ann Arbor, MI, USA, Aug. 2010, pp. –.
[81] J. Gottlieb, P.-Y. Oudeyer, M. Lopes, and A. Baranes, Information-seeking, curiosity, and attention: Computational and neural mechanisms, Trends Cogn. Sci., vol. 17, no. 11, pp. –, Nov. 2013.
[82] M. Schlesinger, D. Amso, and S. P. Johnson, Simulating the role of visual selective attention during the development of perceptual completion, Develop. Sci., vol. 15, no. 6, pp. –.
[83] M. Schlesinger and D. Amso, Image free-viewing as intrinsically-motivated exploration: Estimating the learnability of center-of-gaze image samples in infants and adults, Front. Psychol. Cogn. Sci., vol. 4, no. 802.
[84] S. I. Di Domenico and R. M. Ryan, The emerging neuroscience of intrinsic motivation: A new frontier in self-determination research, Front. Human Neurosci., vol. 11, p. 145, Mar.
[85] D. Caligiore et al., Intrinsic motivations drive learning of eye movements: An experiment with human adults, PLoS ONE, vol. 10, no. 3, 2015, Art. no. e–.
[86] K. Seepanomwan, V. G. Santucci, and G. Baldassarre, Intrinsically motivated learnt goals boost the achievement of users' goals in a humanoid robot, in Proc. 7th Joint IEEE Int. Conf. Develop. Learn. Epigenet. Robot. (ICDL-EpiRob), Lisbon, Portugal, Sep. 2017, pp. –.

Valerio Sperati received the B.A. and M.A. degrees in psychology from the University of Rome La Sapienza, Rome, Italy. He is currently pursuing the Ph.D. degree in computer science with the University of Plymouth, Plymouth, U.K. Since 2007 he has been a Research Fellow with the Institute of Cognitive Science and Technologies, National Research Council, Rome, participating in the European projects ECAgents (Embodied and Communicating Agents), Swarmanoid (Towards Humanoid Robotic Swarms), and IM-CLeVeR (Intrinsically Motivated Cumulative Learning Versatile Robots). In 2017 he became a Researcher within the EU project GOAL-Robots (Goal-Based Open-Ended Autonomous Learning Robots).

Gianluca Baldassarre received the B.A. and M.A. degrees in economics and the M.Sc. degree in cognitive psychology and neural networks from the University of Rome La Sapienza, Rome, Italy, in 1998 and 1999, respectively, and the Ph.D. degree in computer science from the University of Essex, Essex, U.K., in 2003, with a focus on planning with neural networks. He was a Post-Doctoral Researcher with the Institute of Cognitive Sciences and Technologies, National Research Council, Rome, researching swarm robotics; since 2006 he has been a Researcher there and coordinates the Laboratory of Computational Embodied Neuroscience. He was a Team Leader of the EU project ICEA (Integrating Cognition, Emotion and Autonomy) from 2006 to 2009 and the Coordinator of the European integrated project IM-CLeVeR (Intrinsically-Motivated Cumulative-Learning Versatile Robots) starting in 2009. Since 2016 he has been the Coordinator of the European FET-OPEN project GOAL-Robots (Goal-Based Open-Ended Autonomous Learning Robots). He has about 150 international peer-reviewed publications. His current research interests include brain and robot architectures and mechanisms for the cumulative learning of multiple sensorimotor skills, driven by extrinsic and intrinsic motivations, in animals/humans and robots.


More information

Motor Systems I Cortex. Reading: BCP Chapter 14

Motor Systems I Cortex. Reading: BCP Chapter 14 Motor Systems I Cortex Reading: BCP Chapter 14 Principles of Sensorimotor Function Hierarchical Organization association cortex at the highest level, muscles at the lowest signals flow between levels over

More information

International Journal of Advanced Computer Technology (IJACT)

International Journal of Advanced Computer Technology (IJACT) Abstract An Introduction to Third Generation of Neural Networks for Edge Detection Being inspired by the structure and behavior of the human visual system the spiking neural networks for edge detection

More information

ID# Exam 1 PS 325, Fall 2001

ID# Exam 1 PS 325, Fall 2001 ID# Exam 1 PS 325, Fall 2001 As always, the Skidmore Honor Code is in effect, so keep your eyes foveated on your own exam. I tend to think of a point as a minute, so be sure to spend the appropriate amount

More information

Senses are transducers. Change one form of energy into another Light, sound, pressure, etc. into What?

Senses are transducers. Change one form of energy into another Light, sound, pressure, etc. into What? 1 Vision 2 TRANSDUCTION Senses are transducers Change one form of energy into another Light, sound, pressure, etc. into What? Action potentials! Sensory codes Frequency code encodes information about intensity

More information

Learning. Learning is a relatively permanent change in behavior acquired through experience or practice.

Learning. Learning is a relatively permanent change in behavior acquired through experience or practice. Learning Learning is a relatively permanent change in behavior acquired through experience or practice. What is Learning? Learning is the process that allows us to adapt (be flexible) to the changing conditions

More information

Control of visuo-spatial attention. Emiliano Macaluso

Control of visuo-spatial attention. Emiliano Macaluso Control of visuo-spatial attention Emiliano Macaluso CB demo Attention Limited processing resources Overwhelming sensory input cannot be fully processed => SELECTIVE PROCESSING Selection via spatial orienting

More information

Competing Frameworks in Perception

Competing Frameworks in Perception Competing Frameworks in Perception Lesson II: Perception module 08 Perception.08. 1 Views on perception Perception as a cascade of information processing stages From sensation to percept Template vs. feature

More information

Competing Frameworks in Perception

Competing Frameworks in Perception Competing Frameworks in Perception Lesson II: Perception module 08 Perception.08. 1 Views on perception Perception as a cascade of information processing stages From sensation to percept Template vs. feature

More information

The 29th Fuzzy System Symposium (Osaka, September 9-, 3) Color Feature Maps (BY, RG) Color Saliency Map Input Image (I) Linear Filtering and Gaussian

The 29th Fuzzy System Symposium (Osaka, September 9-, 3) Color Feature Maps (BY, RG) Color Saliency Map Input Image (I) Linear Filtering and Gaussian The 29th Fuzzy System Symposium (Osaka, September 9-, 3) A Fuzzy Inference Method Based on Saliency Map for Prediction Mao Wang, Yoichiro Maeda 2, Yasutake Takahashi Graduate School of Engineering, University

More information

A Dynamic Field Theory of Visual Recognition in Infant Looking Tasks

A Dynamic Field Theory of Visual Recognition in Infant Looking Tasks A Dynamic Field Theory of Visual Recognition in Infant Looking Tasks Sammy Perone (sammy-perone@uiowa.edu) and John P. Spencer (john-spencer@uiowa.edu) Department of Psychology, 11 Seashore Hall East Iowa

More information

Lateral Geniculate Nucleus (LGN)

Lateral Geniculate Nucleus (LGN) Lateral Geniculate Nucleus (LGN) What happens beyond the retina? What happens in Lateral Geniculate Nucleus (LGN)- 90% flow Visual cortex Information Flow Superior colliculus 10% flow Slide 2 Information

More information

OPTO 5320 VISION SCIENCE I

OPTO 5320 VISION SCIENCE I OPTO 5320 VISION SCIENCE I Monocular Sensory Processes of Vision: Color Vision Mechanisms of Color Processing . Neural Mechanisms of Color Processing A. Parallel processing - M- & P- pathways B. Second

More information

Self-Organization and Segmentation with Laterally Connected Spiking Neurons

Self-Organization and Segmentation with Laterally Connected Spiking Neurons Self-Organization and Segmentation with Laterally Connected Spiking Neurons Yoonsuck Choe Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 USA Risto Miikkulainen Department

More information

Geography of the Forehead

Geography of the Forehead 5. Brain Areas Geography of the Forehead Everyone thinks the brain is so complicated, but let s look at the facts. The frontal lobe, for example, is located in the front! And the temporal lobe is where

More information

The Adolescent Developmental Stage

The Adolescent Developmental Stage The Adolescent Developmental Stage o Physical maturation o Drive for independence o Increased salience of social and peer interactions o Brain development o Inflection in risky behaviors including experimentation

More information

Increasing Spatial Competition Enhances Visual Prediction Learning

Increasing Spatial Competition Enhances Visual Prediction Learning (2011). In A. Cangelosi, J. Triesch, I. Fasel, K. Rohlfing, F. Nori, P.-Y. Oudeyer, M. Schlesinger, Y. and Nagai (Eds.), Proceedings of the First Joint IEEE Conference on Development and Learning and on

More information

A model to explain the emergence of reward expectancy neurons using reinforcement learning and neural network $

A model to explain the emergence of reward expectancy neurons using reinforcement learning and neural network $ Neurocomputing 69 (26) 1327 1331 www.elsevier.com/locate/neucom A model to explain the emergence of reward expectancy neurons using reinforcement learning and neural network $ Shinya Ishii a,1, Munetaka

More information

Compound Effects of Top-down and Bottom-up Influences on Visual Attention During Action Recognition

Compound Effects of Top-down and Bottom-up Influences on Visual Attention During Action Recognition Compound Effects of Top-down and Bottom-up Influences on Visual Attention During Action Recognition Bassam Khadhouri and Yiannis Demiris Department of Electrical and Electronic Engineering Imperial College

More information

Supplementary Figure 1. Recording sites.

Supplementary Figure 1. Recording sites. Supplementary Figure 1 Recording sites. (a, b) Schematic of recording locations for mice used in the variable-reward task (a, n = 5) and the variable-expectation task (b, n = 5). RN, red nucleus. SNc,

More information

Sperling conducted experiments on An experiment was conducted by Sperling in the field of visual sensory memory.

Sperling conducted experiments on An experiment was conducted by Sperling in the field of visual sensory memory. Levels of category Basic Level Category: Subordinate Category: Superordinate Category: Stages of development of Piaget 1. Sensorimotor stage 0-2 2. Preoperational stage 2-7 3. Concrete operational stage

More information

Attention Response Functions: Characterizing Brain Areas Using fmri Activation during Parametric Variations of Attentional Load

Attention Response Functions: Characterizing Brain Areas Using fmri Activation during Parametric Variations of Attentional Load Attention Response Functions: Characterizing Brain Areas Using fmri Activation during Parametric Variations of Attentional Load Intro Examine attention response functions Compare an attention-demanding

More information

Representational Switching by Dynamical Reorganization of Attractor Structure in a Network Model of the Prefrontal Cortex

Representational Switching by Dynamical Reorganization of Attractor Structure in a Network Model of the Prefrontal Cortex Representational Switching by Dynamical Reorganization of Attractor Structure in a Network Model of the Prefrontal Cortex Yuichi Katori 1,2 *, Kazuhiro Sakamoto 3, Naohiro Saito 4, Jun Tanji 4, Hajime

More information

The influence of visual motion on fast reaching movements to a stationary object

The influence of visual motion on fast reaching movements to a stationary object Supplemental materials for: The influence of visual motion on fast reaching movements to a stationary object David Whitney*, David A. Westwood, & Melvyn A. Goodale* *Group on Action and Perception, The

More information

A general error-based spike-timing dependent learning rule for the Neural Engineering Framework

A general error-based spike-timing dependent learning rule for the Neural Engineering Framework A general error-based spike-timing dependent learning rule for the Neural Engineering Framework Trevor Bekolay Monday, May 17, 2010 Abstract Previous attempts at integrating spike-timing dependent plasticity

More information

Cell Responses in V4 Sparse Distributed Representation

Cell Responses in V4 Sparse Distributed Representation Part 4B: Real Neurons Functions of Layers Input layer 4 from sensation or other areas 3. Neocortical Dynamics Hidden layers 2 & 3 Output layers 5 & 6 to motor systems or other areas 1 2 Hierarchical Categorical

More information

How has Computational Neuroscience been useful? Virginia R. de Sa Department of Cognitive Science UCSD

How has Computational Neuroscience been useful? Virginia R. de Sa Department of Cognitive Science UCSD How has Computational Neuroscience been useful? 1 Virginia R. de Sa Department of Cognitive Science UCSD What is considered Computational Neuroscience? 2 What is considered Computational Neuroscience?

More information

5.8 Departure from cognitivism: dynamical systems

5.8 Departure from cognitivism: dynamical systems 154 consciousness, on the other, was completely severed (Thompson, 2007a, p. 5). Consequently as Thompson claims cognitivism works with inadequate notion of cognition. This statement is at odds with practical

More information

The Integration of Features in Visual Awareness : The Binding Problem. By Andrew Laguna, S.J.

The Integration of Features in Visual Awareness : The Binding Problem. By Andrew Laguna, S.J. The Integration of Features in Visual Awareness : The Binding Problem By Andrew Laguna, S.J. Outline I. Introduction II. The Visual System III. What is the Binding Problem? IV. Possible Theoretical Solutions

More information

Memory, Attention, and Decision-Making

Memory, Attention, and Decision-Making Memory, Attention, and Decision-Making A Unifying Computational Neuroscience Approach Edmund T. Rolls University of Oxford Department of Experimental Psychology Oxford England OXFORD UNIVERSITY PRESS Contents

More information

Computational & Systems Neuroscience Symposium

Computational & Systems Neuroscience Symposium Keynote Speaker: Mikhail Rabinovich Biocircuits Institute University of California, San Diego Sequential information coding in the brain: binding, chunking and episodic memory dynamics Sequential information

More information

A Model of Dopamine and Uncertainty Using Temporal Difference

A Model of Dopamine and Uncertainty Using Temporal Difference A Model of Dopamine and Uncertainty Using Temporal Difference Angela J. Thurnham* (a.j.thurnham@herts.ac.uk), D. John Done** (d.j.done@herts.ac.uk), Neil Davey* (n.davey@herts.ac.uk), ay J. Frank* (r.j.frank@herts.ac.uk)

More information

Ch 5. Perception and Encoding

Ch 5. Perception and Encoding Ch 5. Perception and Encoding Cognitive Neuroscience: The Biology of the Mind, 2 nd Ed., M. S. Gazzaniga, R. B. Ivry, and G. R. Mangun, Norton, 2002. Summarized by Y.-J. Park, M.-H. Kim, and B.-T. Zhang

More information

A Neural Model of Context Dependent Decision Making in the Prefrontal Cortex

A Neural Model of Context Dependent Decision Making in the Prefrontal Cortex A Neural Model of Context Dependent Decision Making in the Prefrontal Cortex Sugandha Sharma (s72sharm@uwaterloo.ca) Brent J. Komer (bjkomer@uwaterloo.ca) Terrence C. Stewart (tcstewar@uwaterloo.ca) Chris

More information

Do you have to look where you go? Gaze behaviour during spatial decision making

Do you have to look where you go? Gaze behaviour during spatial decision making Do you have to look where you go? Gaze behaviour during spatial decision making Jan M. Wiener (jwiener@bournemouth.ac.uk) Department of Psychology, Bournemouth University Poole, BH12 5BB, UK Olivier De

More information

A Neural Network Model of Naive Preference and Filial Imprinting in the Domestic Chick

A Neural Network Model of Naive Preference and Filial Imprinting in the Domestic Chick A Neural Network Model of Naive Preference and Filial Imprinting in the Domestic Chick Lucy E. Hadden Department of Cognitive Science University of California, San Diego La Jolla, CA 92093 hadden@cogsci.ucsd.edu

More information

Dynamics of Color Category Formation and Boundaries

Dynamics of Color Category Formation and Boundaries Dynamics of Color Category Formation and Boundaries Stephanie Huette* Department of Psychology, University of Memphis, Memphis, TN Definition Dynamics of color boundaries is broadly the area that characterizes

More information

The synergy of top-down and bottom-up attention in complex task: going beyond saliency models.

The synergy of top-down and bottom-up attention in complex task: going beyond saliency models. The synergy of top-down and bottom-up attention in complex task: going beyond saliency models. Enkhbold Nyamsuren (e.nyamsuren@rug.nl) Niels A. Taatgen (n.a.taatgen@rug.nl) Department of Artificial Intelligence,

More information

Learning Habituation Associative learning Classical conditioning Operant conditioning Observational learning. Classical Conditioning Introduction

Learning Habituation Associative learning Classical conditioning Operant conditioning Observational learning. Classical Conditioning Introduction 1 2 3 4 5 Myers Psychology for AP* Unit 6: Learning Unit Overview How Do We Learn? Classical Conditioning Operant Conditioning Learning by Observation How Do We Learn? Introduction Learning Habituation

More information

Introductory Motor Learning and Development Lab

Introductory Motor Learning and Development Lab Introductory Motor Learning and Development Lab Laboratory Equipment & Test Procedures. Motor learning and control historically has built its discipline through laboratory research. This has led to the

More information

Framework for Comparative Research on Relational Information Displays

Framework for Comparative Research on Relational Information Displays Framework for Comparative Research on Relational Information Displays Sung Park and Richard Catrambone 2 School of Psychology & Graphics, Visualization, and Usability Center (GVU) Georgia Institute of

More information

Does scene context always facilitate retrieval of visual object representations?

Does scene context always facilitate retrieval of visual object representations? Psychon Bull Rev (2011) 18:309 315 DOI 10.3758/s13423-010-0045-x Does scene context always facilitate retrieval of visual object representations? Ryoichi Nakashima & Kazuhiko Yokosawa Published online:

More information

Sparse Coding in Sparse Winner Networks

Sparse Coding in Sparse Winner Networks Sparse Coding in Sparse Winner Networks Janusz A. Starzyk 1, Yinyin Liu 1, David Vogel 2 1 School of Electrical Engineering & Computer Science Ohio University, Athens, OH 45701 {starzyk, yliu}@bobcat.ent.ohiou.edu

More information

Oscillatory Neural Network for Image Segmentation with Biased Competition for Attention

Oscillatory Neural Network for Image Segmentation with Biased Competition for Attention Oscillatory Neural Network for Image Segmentation with Biased Competition for Attention Tapani Raiko and Harri Valpola School of Science and Technology Aalto University (formerly Helsinki University of

More information

acquisition associative learning behaviorism B. F. Skinner biofeedback

acquisition associative learning behaviorism B. F. Skinner biofeedback acquisition associative learning in classical conditioning the initial stage when one links a neutral stimulus and an unconditioned stimulus so that the neutral stimulus begins triggering the conditioned

More information

Computational Cognitive Science

Computational Cognitive Science Computational Cognitive Science Lecture 19: Contextual Guidance of Attention Chris Lucas (Slides adapted from Frank Keller s) School of Informatics University of Edinburgh clucas2@inf.ed.ac.uk 20 November

More information

Chapter 5: Learning and Behavior Learning How Learning is Studied Ivan Pavlov Edward Thorndike eliciting stimulus emitted

Chapter 5: Learning and Behavior Learning How Learning is Studied Ivan Pavlov Edward Thorndike eliciting stimulus emitted Chapter 5: Learning and Behavior A. Learning-long lasting changes in the environmental guidance of behavior as a result of experience B. Learning emphasizes the fact that individual environments also play

More information

Ch 5. Perception and Encoding

Ch 5. Perception and Encoding Ch 5. Perception and Encoding Cognitive Neuroscience: The Biology of the Mind, 2 nd Ed., M. S. Gazzaniga,, R. B. Ivry,, and G. R. Mangun,, Norton, 2002. Summarized by Y.-J. Park, M.-H. Kim, and B.-T. Zhang

More information

Information Processing During Transient Responses in the Crayfish Visual System

Information Processing During Transient Responses in the Crayfish Visual System Information Processing During Transient Responses in the Crayfish Visual System Christopher J. Rozell, Don. H. Johnson and Raymon M. Glantz Department of Electrical & Computer Engineering Department of

More information

Validating the Visual Saliency Model

Validating the Visual Saliency Model Validating the Visual Saliency Model Ali Alsam and Puneet Sharma Department of Informatics & e-learning (AITeL), Sør-Trøndelag University College (HiST), Trondheim, Norway er.puneetsharma@gmail.com Abstract.

More information

COGNITIVE SCIENCE 107A. Motor Systems: Basal Ganglia. Jaime A. Pineda, Ph.D.

COGNITIVE SCIENCE 107A. Motor Systems: Basal Ganglia. Jaime A. Pineda, Ph.D. COGNITIVE SCIENCE 107A Motor Systems: Basal Ganglia Jaime A. Pineda, Ph.D. Two major descending s Pyramidal vs. extrapyramidal Motor cortex Pyramidal system Pathway for voluntary movement Most fibers originate

More information