A model to explain the emergence of reward expectancy neurons using reinforcement learning and neural network


Neurocomputing 69 (2006) 1327-1331
www.elsevier.com/locate/neucom

A model to explain the emergence of reward expectancy neurons using reinforcement learning and neural network

Shinya Ishii a,1, Munetaka Shidara b, Katsunari Shibata a,*

a Department of Electrical and Electronic Engineering, Oita University, Oita 870-1192, Japan
b Neuroscience Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan

Available online 17 February 2006

This research was partially supported by JSPS Grants-in-Aid for Scientific Research (#1435227, #15364), and also supported by AIST.
* Corresponding author. E-mail address: shibata@cc.oita-u.ac.jp (K. Shibata).
1 Present address: Graduate School of Comprehensive Human Sciences, University of Tsukuba, Tsukuba 305-8577, Japan.

Abstract

In an experiment on a multi-trial task to obtain a reward, reward expectancy neurons, which responded only in the non-reward trials that are necessary to advance toward the reward, have been observed in the anterior cingulate cortex of monkeys. In this paper, to explain the emergence of the reward expectancy neurons in terms of reinforcement learning theory, a model that consists of a recurrent neural network trained by reinforcement learning is proposed. The analysis of the hidden-layer neurons of the model during learning suggests that the reward expectancy neurons emerge to realize a smooth temporal increase of the state value by complementing the neuron that responds only in the reward trial.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Reward expectancy neuron; Anterior cingulate; Reinforcement learning; Recurrent neural network

1. Introduction

Recently, in an experiment on a multi-trial task using a monkey, in which several successful trials are required until it gets a reward, two types of neurons were observed in the anterior cingulate [5,4]. One type (reward proximity type) responded more vigorously as the monkey approached the reward. The other type (reward expectancy type) increased its activity in each trial toward the reward, but dropped its activity before the reward trial. The anterior cingulate cortex is located in the medial frontal cortex. It has neural connections with various parts of the frontal and limbic areas, and so is a good candidate for integrating various kinds of information related to motivational processes.

The reward proximity neuron can be interpreted as the state value in reinforcement learning. On the other hand, it has been difficult to explain the emergence of the reward expectancy neuron by reinforcement learning theory, because it did not respond in the reward trial. However, we thought that the reward expectancy neuron should be deeply relevant to the state value in reinforcement learning. Thus, in this paper, we hypothesized that reinforcement learning plays an important role not only in the basal ganglia but also in the anterior cingulate cortex, and that the reward expectancy neuron is one of the intermediate representations used to generate the state value in reinforcement learning. To investigate this hypothesis, we used a model that consists of a recurrent neural network (RNN) trained by actor-critic reinforcement learning [1], and analyzed the behavior of the hidden-layer neurons during learning. By the same approach, some neuronal responses in the intraparietal sulcus have already been explained well in terms of reinforcement learning theory [3].
2. Experimental result [5,4]

In the first stage, a monkey is trained on a single visual color discrimination task [5,4]. At the beginning, a white bar called the visual cue is presented at the upper edge of a black monitor. When the monkey touches a bar in the monkey chair, the fixation point presented at the center of the monitor changes to the red target stimulus. After a varying waiting period, the target color becomes green, which instructs the monkey to release the bar. If the monkey releases the bar within 1 s, the target turns blue to indicate that the trial is successful, and the monkey can get juice as a reward.

After the monkey learned this single-trial task, the multi-trial reward schedule task (multi-trial task) is introduced. In this task, the reward is given to the monkey when it performs 1-4 trials correctly. The necessary number of trials to get the reward is determined at random. Since the visual cue becomes brighter as the monkey approaches the reward trial, the monkey can recognize the number of trials remaining before the reward.

Example responses of anterior cingulate neurons are shown in Fig. 1. The responses of the neurons in Fig. 1(A),(B) increased in the non-reward trials, but decreased before the reward trial. The responses of the neurons in Fig. 1(C),(D) decreased after the reward, and they can be interpreted as expressing the distance to the reward. Neurons like (C) and (D) can be explained reasonably as the state value in reinforcement learning. In this paper, we investigate why reward expectancy neurons such as (A) and (B) emerged, in terms of reinforcement learning theory.

3. Proposed model

The architecture of the model proposed in this paper is shown in Fig. 2. The model consists of one RNN whose input is an observation vector. The actor-critic [1] is employed as the reinforcement learning method. The critic generates a state value, and the actors generate action commands. Because all the synapse weights, including those in the hidden layers, must be modified efficiently based on reinforcement learning in a model of the frontal cortex, error-back-propagation-type supervised learning was employed, and the training signals were generated autonomously according to reinforcement learning. Thus, it is expected that the necessary functions emerge purposively, autonomously and in harmony as an intermediate representation to generate the appropriate state value and actions [2].

Fig. 1. The responses of the anterior cingulate neurons (firing rate in Hz over the first, second, third and fourth (reward) trials). These figures are copied from [4]. © 2002 by Igakushoin.

Fig. 2. The proposed model using a recurrent neural network.
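To make the task protocol above concrete, here is a minimal Python sketch of the multi-trial reward schedule, reduced to one decision per trial. The class name, the indexing of cue brightness by trials remaining, and the handling of failed trials are our own assumptions and are not part of the original simulation.

```python
import random

# Cue brightness for 4, 3, 2 and 1 trial(s) remaining before the reward
# (the reward trial itself shows the brightest cue, 1.0).
CUE_BRIGHTNESS = [0.1, 0.4, 0.7, 1.0]

class MultiTrialRewardSchedule:
    """Hypothetical trial-level abstraction of the monkey task: the number of
    correct trials required for a reward (1-4) is drawn at random, and the
    reward (0.9) is delivered only after the last correct trial."""

    def __init__(self):
        self._new_schedule()

    def _new_schedule(self):
        self.trials_required = random.randint(1, 4)
        self.trials_done = 0

    def cue(self):
        # Brightness encodes how close the reward trial is; this indexing is an
        # assumption consistent with the 0.1 -> 0.4 -> 0.7 -> 1.0 sequence.
        remaining = self.trials_required - self.trials_done  # 1 on the reward trial
        return CUE_BRIGHTNESS[len(CUE_BRIGHTNESS) - remaining]

    def complete_trial(self, correct):
        """Advance by one trial and return the reward (0.9 or 0.0)."""
        if not correct:
            return 0.0          # assumption: a failed trial does not advance the schedule
        self.trials_done += 1
        if self.trials_done == self.trials_required:
            self._new_schedule()  # reward delivered; draw the next schedule
            return 0.9
        return 0.0
```

In the actual model the observation is richer (RGB cue, target color, bar-touch and reward signals, sampled every time step), but this trial-level view captures the reward schedule that the critic has to predict.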

The TD-error \hat{r}_t is expressed as

\hat{r}_t = r_{t+1} + \gamma V(x_{t+1}) - V(x_t),   (1)

where r is the reward (0.9 or 0.0), V(x_t) is the critic (state value), x_t is the observation vector, and \gamma is a discount factor. The neural network is trained with the training signals V_{s,t} and a_{s,t} for the critic and actor, respectively. They are generated based on reinforcement learning as

V_{s,t} = \hat{r}_t + V(x_t) = r_{t+1} + \gamma V(x_{t+1}),   (2)

a_{s,t} = a(x_t) + \hat{r}_t \, rnd_t,   (3)

where a(x_t) is the actor vector and rnd_t is the vector of exploration factors added to a(x_t).

As for the neural-network structure, the number of layers is four, and an Elman-type RNN is introduced to deal with past information. Furthermore, direct connections from the input layer to the output layer, which corresponds to the basal ganglia, were added. The frontal cortex is generally considered to realize high-order functions and also short-term memory, as seen in the dorsolateral prefrontal cortex. Here, the information that can be approximated well enough by a linear combination of the input signals is assumed to go directly from the input to the basal ganglia, while the complicated nonlinear transformation is assumed to be done by going through the frontal cortex, which is modeled as the hidden layers. Since a strong nonlinear transformation is required to generate the response of the reward expectancy neurons in the anterior cingulate from the visual cue signal, at least a four-layer structure is necessary after adding the layer corresponding to the basal ganglia. The number of neurons in each layer, from the input layer to the output layer, is 8, 20, 10 and 4, respectively.

The input signals of the neural network are assumed to be the signals after some pre-processing in the visual cortex or some other areas. The RGB signals of the visual cue are input to the first three input neurons, but each value is the same as the others since the visual cue is gray scale. The value was 1.0 in the single-trial task. In the multi-trial task, it became larger as the monkey approached the reward, as 0.1 → 0.4 → 0.7 → 1.0. The next three inputs indicate the RGB signals of the target color. The next one represents, as a binary value, whether the monkey touched the bar or not. The last input signal is 1 when the reward is given, and it is 0 otherwise.

Considering rate coding, the activation function of each neuron in the hidden and output layers is the sigmoid function, whose value ranges from 0.0 to 1.0. One of the output neurons is used as the critic, and the other three are used as actors. One of the three actions, keep, touch or release, is assigned to each actor neuron. An action is selected stochastically by comparing the values after adding the exploration factors rnd_t to the actor vector a(x_t). Each factor of rnd_t is a uniform random number between -0.3 and 0.3.

BPTT (back-propagation through time) [6] is used as the supervised learning algorithm for the RNN, and the truncated trace time into the past was set to 8 steps. The sampling rate, i.e., one step, was set to 1 ms. Furthermore, when the task was changed from the single-trial task to the multi-trial task, the discount factor \gamma was changed from 0.96 to 0.976, since the number of time steps needed to reach the reward becomes larger. In this simulation, when learning had been almost completed in the single-trial task, the simulation moved to the multi-trial task. Here, the number of episodes in the single-trial task was 16,500. An episode is defined as a sequence until the monkey gets the reward.
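As a concrete illustration of Eqs. (1)-(3) and the action selection described above, the following Python sketch computes the TD-error, the critic and actor training signals, and chooses one of the three actions by adding exploration noise to the actor outputs. The function names are ours, the RNN forward pass and BPTT update are assumed to exist elsewhere, and taking the maximum of the noise-perturbed actor outputs is only one plausible reading of "selected stochastically by comparing the values".

```python
import numpy as np

GAMMA = 0.976        # discount factor in the multi-trial task (0.96 in the single-trial task)
EXPLORATION = 0.3    # each element of rnd_t is uniform in [-0.3, 0.3] (reconstructed range)

def training_signals(r_next, V_t, V_next, a_t, rnd_t):
    """Training signals for one transition, following Eqs. (1)-(3).

    r_next      : reward r_{t+1} (0.9 or 0.0)
    V_t, V_next : critic outputs V(x_t) and V(x_{t+1})
    a_t         : actor output vector a(x_t), one element per action
    rnd_t       : exploration vector that was added to a(x_t) at time t
    """
    td_error = r_next + GAMMA * V_next - V_t      # Eq. (1)
    V_target = td_error + V_t                     # Eq. (2), = r_next + GAMMA * V_next
    a_target = a_t + td_error * rnd_t             # Eq. (3)
    return td_error, V_target, a_target

def select_action(a_t, rng=np.random.default_rng()):
    """Add uniform exploration noise to the actor vector and pick the action
    (keep, touch or release) with the largest perturbed value."""
    rnd_t = rng.uniform(-EXPLORATION, EXPLORATION, size=a_t.shape)
    actions = ("keep", "touch", "release")
    return actions[int(np.argmax(a_t + rnd_t))], rnd_t
```

The network would then be trained by BPTT with V_target and a_target as the supervised targets for the critic and actor outputs, so that the gradient reaches the hidden layers that model the frontal cortex.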
4. Simulation result

The response change of some neurons after 18,500 episodes, in other words, soon after switching to the multi-trial task, is shown in Fig. 3. The results are shown for the case in which the reward is given after 4 successful trials. The response of the critic is shown in Fig. 3(a). If the learning were performed ideally, the critic would increase exponentially and smoothly toward the time when the reward is given. However, in this case, the upward trend toward the reward can be seen only in the reward trial, after 6 s. The responses of the upper-hidden neurons 3 and 9 are shown in Fig. 3(b) and (c). Judging from the connection weight to the critic, the upper-hidden neuron 9 made a large contribution to the critic. In the non-reward trials, a large negative TD-error appears because the reward cannot be obtained, contrary to the expectation. Therefore, it is thought that the critic response was depressed greatly in the non-reward trials, and that is the reason why the upper-hidden neuron 9, which is called the reward-trial neuron here, responded only in the reward trial.

Next, the responses after 30,000 episodes are shown in Fig. 4. Comparing the critic in Fig. 4(a) with the previous one in Fig. 3(a), it can be seen that the critic is increasing even before the reward trial. The responses of the upper-hidden neurons are shown in Fig. 4(b) and (c). In this case, a neuron that responded only in the non-reward trials emerged, as shown in Fig. 4(b). The weight from the upper-hidden neuron 3 to the critic was small around the 18,500th episode, but it became large around the 30,000th episode. The upper-hidden neuron 3 is considered to be equivalent to the reward expectancy neuron in the experiment using a monkey. Then, in order to examine how this neuron is represented, the response of the lower-hidden neuron contributing to the upper-hidden neuron 3 was observed. As shown in Fig. 4(d), the response of the lower-hidden neuron 15 is depressed in the reward trial. It had a positive connection to the upper-hidden neuron 3 and a negative connection to the upper-hidden neuron 9.

From the above results, it can be thought that the reward expectancy neuron emerged to realize the smooth temporal increase of the critic by complementing the reward-trial neuron. The lower-hidden neuron, which also contributes to the reward-trial neuron, makes it possible for the reward expectancy neuron to generate this intermediate representation in spite of the strong nonlinear transformation from the visual cue signal.
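The complementarity argument can be made concrete with a small, purely illustrative numerical sketch (the activity values and weights below are invented, not taken from the simulation): a unit that fires only in the reward trial cannot by itself give a critic that rises across all four trials, but adding a unit that is active only in the non-reward trials and ramps up toward the reward can.

```python
# Illustrative per-trial activities (1st, 2nd, 3rd, 4th = reward trial).
reward_trial_unit      = [0.0, 0.0, 0.0, 1.0]   # analogue of upper-hidden neuron 9
reward_expectancy_unit = [0.3, 0.5, 0.7, 0.0]   # analogue of upper-hidden neuron 3

w_reward, w_expectancy = 0.9, 0.8               # hypothetical weights to the critic

critic = [w_reward * r + w_expectancy * e
          for r, e in zip(reward_trial_unit, reward_expectancy_unit)]
print(critic)   # approx. [0.24, 0.40, 0.56, 0.90]: rises smoothly toward the reward trial
```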

Fig. 3. The responses of some neurons after 18,500 episodes: (a) critic, (b) upper-hidden 3, and (c) upper-hidden 9.

Fig. 4. The responses of some neurons after 30,000 episodes: (a) critic, (b) upper-hidden 3, (c) upper-hidden 9, and (d) lower-hidden 15.

5. Conclusion

In this paper, to explain the emergence of the reward expectancy neuron in terms of reinforcement learning theory, a model that consists of an RNN trained by actor-critic reinforcement learning was proposed. In the simulation of the model, a neuron that can be considered a reward expectancy neuron was observed in the hidden layer. The analysis of the result suggests that it emerged to complement the neuron that responds only in the reward trial, so as to realize the smooth temporal increase of the critic. The neuron responding only in the reward trial emerged to realize quite different critic values between the non-reward trials and the reward trial even though the input signals are similar between them, and that in turn makes it possible for the reward expectancy neuron to respond only in the non-reward trials.

References

[1] A.G. Barto, et al., Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Trans. Syst. Man Cybern. 13 (1983) 835-846.
[2] K. Shibata, Reinforcement learning and robot intelligence, in: Proceedings of the 16th Annual Conference of JSAI, 2002, 2A1-5 (in Japanese).
[3] K. Shibata, K. Ito, Hidden representation after reinforcement learning of hand reaching movement with variable link length, in: Proceedings of IJCNN, 2003, pp. 2619-2624.
[4] M. Shidara, Representation of motivational process, reward expectancy, in the brain, Igaku no Ayumi 202 (3) (2002) 181-186 (in Japanese).
[5] M. Shidara, B.J. Richmond, Anterior cingulate: single neuronal signals related to degree of reward expectancy, Science 296 (2002) 1709-1711.
[6] R.J. Williams, D. Zipser, Gradient-based learning algorithms for recurrent connectionist networks, Technical Report NU-CCS-90-9, Northeastern University, 1990.

Shinya Ishii is a graduate student in the Department of Electrical and Electronic Engineering, Faculty of Engineering, Oita University, Japan. He received a B.Eng. in Electrical and Electronic Engineering from Oita University in 2004. His research interests include neural models consisting of a recurrent neural network trained by reinforcement learning.

Munetaka Shidara is a professor at the Institute of Basic Medical Sciences and the Doctoral Program in Kansei, Behavioral and Brain Sciences, Graduate School of Comprehensive Human Sciences, University of Tsukuba, Japan. He received a Ph.D. in Basic Medical Science from the University of Tokyo in 1990. His research interests include neuronal and information-processing mechanisms of brain functions such as motivation and goal-directed behavior, and pattern recognition.

Katsunari Shibata is an associate professor in the Department of Electrical and Electronic Engineering, Faculty of Engineering at Oita University. He received a Dr. of Engineering in Electronics Engineering from the University of Tokyo in 1997. His research interests include the difference in intelligence between real living things and modern robots, and also autonomous learning, especially using reinforcement learning and neural networks.