Emergence of Emotional Appraisal Signals in Reinforcement Learning Agents


Autonomous Agents and Multiagent Systems manuscript No. (will be inserted by the editor)

Emergence of Emotional Appraisal Signals in Reinforcement Learning Agents

Pedro Sequeira · Francisco S. Melo · Ana Paiva

Received: date / Accepted: date

Abstract: The positive impact of emotions in decision-making has long been established in both natural and artificial agents. From the perspective of appraisal theories, emotions complement perceptual information, coloring our sensations and guiding our decision-making. However, when designing autonomous agents, is emotional appraisal the best complement to their perceptions? Mechanisms investigated in affective neuroscience provide support for this hypothesis in biological agents. In this paper, we look for similar support in artificial systems. We adopt the intrinsically motivated reinforcement learning framework to investigate different sources of information that can guide decision-making in learning agents, and an evolutionary approach based on genetic programming to identify a small set of such sources that have the largest impact on the performance of the agent in different tasks, as measured by an external evaluation signal. We then show that these sources of information: (i) are applicable in a wider range of environments than those where the agents evolved; and (ii) exhibit interesting correspondences to emotional appraisal-like signals previously proposed in the literature, pointing towards our initial hypothesis that the appraisal process might indeed provide essential information to complement perceptual capabilities and thus guide decision-making.

Keywords: Emotions · Appraisal theory · Intrinsic motivation · Genetic programming · Reinforcement learning

Accepted manuscript version. The final publication (DOI /s ) is available from the publisher.

P. Sequeira · F.S. Melo · A. Paiva
INESC-ID / Instituto Superior Técnico, Technical University of Lisbon
TagusPark, Edifício IST, Porto Salvo, Portugal
E-mail: pedro.sequeira@gaips.inesc-id.pt; {fmelo,ana.paiva}@inesc-id.pt

1 Introduction

Research in psychology, neuroscience and other related areas has established emotions as a powerful adaptive mechanism that influences cognitive and perceptual processing [6, 8, 27]. Emotions indirectly drive behaviors that lead individuals to act, achieve goals and satisfy needs. Studies show that damage to regions of the brain identified as responsible for emotional processing impairs the ability of humans and animals to properly learn aversive stimuli, plan courses of action and, more generally, make decisions that are advantageous for their well-being [3, 7, 17]. Appraisal theories of emotions [9, 10, 16, 18, 28, 32] suggest that emotions arise from evaluations of specific aspects of the individual's relationship with the environment, providing an adaptive response mechanism to situations occurring therein.

In artificial systems, the area of affective computing (AC) has also investigated the impact of emotional processing capabilities in the development of autonomous agents, often based on appraisal theories of emotions. Appraisal-inspired mechanisms were shown to improve the performance of artificial agents in terms of different metrics, such as robustness, efficiency or believability [23, 29, 31, 33]. In very general terms, computational appraisal models feature an appraisal derivation model that, together with the perceptual information acquired by the agent,^1 guides its decision process (see Fig. 1) [22, 23]. The appraisal signals^2 provided by such a module, also referred to as appraisal variables, translate information about the history of interaction of the agent with its environment that aids decision-making and focuses behavior towards dealing with the situation being evaluated [23]. In other words, such signals complement and color the agent's raw perceptions, indicating, for example, whether a perception is expected or not, or pleasant or not.

Fig. 1 General architecture of an emotional appraisal-based agent [23]: perceptions feed an appraisal derivation module; the resulting appraisal signals, together with the perceptual signal, drive the decision-making module, which selects the agent's actions.^3

^1 Perceptual information can be of an internal nature, e.g., about goals, needs or beliefs, or external, e.g., about objects or events from the environment.
^2 We adopt a rather broad definition of signal. Specifically, we refer to as an appraisal signal any emotional appraisal-based information received and processed, in this case, by the decision-making module.
^3 The diagram does not aim to provide an accurate representation of existing computational appraisal architectures for autonomous agents, but rather to highlight the point that, in such architectures, the decision-making process is driven both by perceptual information from the environment and by some form of appraisal-based information.

Although one of the driving motivations for the use of emotional appraisal-based agent architectures is the creation of better agents (e.g., agents able to successfully perform more complex tasks), one fundamental question remains mostly unaddressed in the literature: in the search for information that may complement an agent's perceptual capabilities, is the emotional appraisal process the best mechanism to provide such information? In this paper we contribute to answering this question, providing empirical evidence that appraisal-like signals may arise as natural candidates when looking for sources of information to complement an agent's perceptual capabilities. Using an evolutionary approach, we show that such signals emerge as sources of information for artificial agents, providing evolutionary advantages. We thus contribute a computational parallel to the evidence observed in biological systems, where the organisms with the most complex emotional processing capabilities are arguably those most fit to their environment [17, 18, 26].

In our study, we rely on intrinsically motivated reinforcement learning (IMRL) agents [39]. The framework of IMRL provides a principled manner to integrate multiple sources of information in the process of learning and decision-making of artificial agents [38].^4 As such, it is a framework naturally suited to our investigation. Starting from an initial population of IMRL agents, each relying on different sources of information to guide their decisions, we use genetic programming to select the agents with maximal fitness.^5 This evolutionary process allows us to identify a minimal set of informative signals that provide general and useful information for decision-making. Finally, we establish a correspondence between the identified sources of information and the information associated with appraisal variables usually identified in the specialized literature. Overall, our experimental study highlights the usefulness of appraisal-like processes in identifying different aspects of a task (the sources of information that complement an agent's perceptual capabilities) in the pursuit of more reliable artificial decision-makers.

Fig. 2 Roadmap for the study in the paper: Identification (Sec. 3), Validation (Sec. 4), Discussion (Sec. 5). We start by identifying optimal sources of information in Section 3. We validate these sources of information in Section 4 and conclude by discussing possible correspondences with appraisal dimensions of emotion in Section 5.

The paper is organized according to the roadmap sketched in Fig. 2. Section 2 introduces the required background and notation on reinforcement learning. Section 3 identifies a minimal set of signals that provide the most useful information to guide IMRL agents. Section 4 analyzes the general applicability of the identified signals in a set of scenarios inspired by the game of Pac-Man. Finally, Section 5 analyzes the identified signals in light of the appraisal theory literature and summarizes our main findings.

^4 These complementary sources of information endow the agent with a richer repertoire of behaviors that may successfully overcome agent limitations [33, 40].
^5 In our approach, we use a fitness metric that directly measures the performance of the agent in the underlying task in different scenarios.

2 Background

As discussed in Section 1, in our study we rely on reinforcement learning (RL) agents. This section reviews basic RL concepts and sets up the notation used throughout the paper. We refer to [13, 42] for a detailed overview of RL.

2.1 Learning and Decision Making

At each time step, and depending on its perception of the environment, an RL agent must choose an action from its action repertoire, in order to meet some pre-specified optimality criterion. Actions determine how the state of the environment evolves over time and, depending on that state, different actions have different values for the agent. Typically, the RL agent knows neither the value nor the effect of its actions, and must thus explore its environment and action repertoire before it can adequately select its actions.

By state of the environment we refer to any feature of the environment that may be relevant for the agent to choose its actions optimally. Ideally, the agent should be able to unambiguously perceive all such features. Sometimes, however, the agent has limited sensing capabilities and is not able to completely determine the current state of the system. When this is the case, the agent is said to have partial observability. Throughout the paper, most agents considered have partial observability.

RL agents can be modeled using the partially observable Markov decision process (POMDP) framework [14]. We denote a POMDP as a tuple M = (S, A, Z, P, O, r, γ), where:
- S is the set of all possible environment states;
- A is the action repertoire of the agent;
- Z is the set of all possible agent observations;
- P(s' | s, a) indicates the probability that the state at time step t+1, S_{t+1}, is s', given that the state at time step t, S_t, is s and the agent selected action A_t = a;
- O(z | s', a) indicates the probability that the observation of the agent at time step t+1, Z_{t+1}, is z, given that the state at time t+1 is s' and the agent selected action a at time t;
- r(s, a) represents the average reward that the agent expects to receive for performing action a in state s;
- 0 ≤ γ < 1 is a discount factor.

A POMDP evolves as follows. At each time step t = 0, 1, 2, 3, ..., the environment is in some state S_t = s. The agent selects some action A_t = a from its action repertoire, A, and the environment transitions to state S_{t+1} = s' with probability P(s' | s, a). The agent receives a reward r(s, a) ∈ R and makes a new observation Z_{t+1} = z with probability O(z | s', a), and the process repeats.^6

^6 Typical RL scenarios assume that Z = S and O(z | s, a) = δ(z, s), where δ denotes the Kronecker delta [42]. When this is the case, the parameters Z and O can be safely discarded, and the simplified model thus obtained, represented as a tuple M = (S, A, P, r, γ), is referred to as a Markov decision process (MDP).
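
To make the dynamics above concrete, the following minimal Python sketch simulates the generative loop of a POMDP. The dictionaries P, O and r, the helper sample and the initial-observation convention are assumptions introduced here for illustration only; they are not part of the paper's formalism.

```python
import random

def sample(dist):
    """Draw an outcome from a {outcome: probability} dictionary."""
    u, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if u <= acc:
            return outcome
    return outcome  # guard against rounding of the probabilities

def simulate(P, O, r, s0, policy, steps=10):
    """Roll out the POMDP loop: S_{t+1} ~ P(.|s,a), Z_{t+1} ~ O(.|s',a), reward r(s,a).

    P[(s, a)] and O[(s_next, a)] are {next: prob} dictionaries, r[(s, a)] is a number,
    and policy maps the current observation to an action.
    """
    s, z, trace = s0, s0, []   # the first observation is assumed to reveal the initial state
    for _ in range(steps):
        a = policy(z)
        s_next = sample(P[(s, a)])        # environment transition
        z = sample(O[(s_next, a)])        # observation emitted from the new state
        trace.append((z, a, r[(s, a)]))   # reward depends on the previous state and action
        s = s_next
    return trace
```

In the typical fully observable setting of footnote 6, O would simply return the new state itself as the observation.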

The objective of the agent can be formalized as that of gathering as much reward as possible throughout its lifespan, usually discounted by the constant γ. This corresponds to maximizing the value

v = E[ Σ_t γ^t r(S_t, A_t) ].   (1)

The reward r(s, a) thus evaluates the immediate utility of performing action a in state s, in light of the underlying task that the agent must learn. In order to maximize the value in (1), the agent must learn a mapping that, depending on its history of observations and actions, determines the next action that the agent should take. Such a mapping, denoted as π, is known as a policy, and is typically learned through a process of trial and error.

In this paper we focus on policies that depend on the agent's current observation. In other words, our agents follow policies π : Z → A that map each observation z ∈ Z directly to an action π(z) ∈ A. If the state is fully observable, then Z = S and Z_t = S_t. In this case, there is a policy π* : S → A, referred to as the optimal policy, maximizing the value in (1). We can associate with π* a function Q* : S × A → R that verifies the recursive relation

Q*(s, a) = r(s, a) + γ Σ_{s'∈S} P(s' | s, a) max_{b∈A} Q*(s', b).   (2)

Q*(s, a) represents the value of executing action a in state s and henceforth following the optimal policy. We can use the recursion in (2) to iteratively compute Q* for all pairs (s, a) ∈ S × A. Additionally, V*(s) = max_{a∈A} Q*(s, a) represents the value obtained by an agent starting from state s and henceforth following π*.

From the above, it should be apparent that the goal of the RL agent can be restated as that of learning Q*, since from the latter it is possible to derive the optimal policy. Since RL agents typically have no knowledge of either P or r, one possibility is to explore the environment (i.e., select actions in some exploratory manner), building estimates for P and r, and then use these estimates to successively approximate Q*. After exploring its environment, the agent can then exploit its knowledge and select the actions that maximize (its estimate of) Q*. Throughout the paper, our RL agents follow a simple variation of this approach known as prioritized sweeping [24].
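
The recursion in (2) translates directly into a simple value-iteration computation when the model is known. The sketch below is an illustration of that computation only; the agents in the paper do not know P and r and instead learn estimates, as discussed next. The discount value and tolerance are placeholders.

```python
def value_iteration(S, A, P, r, gamma=0.9, tol=1e-8):
    """Iterate the recursion in (2) until the Q-values stabilize.

    P[(s, a)] is a {s_next: probability} dictionary and r[(s, a)] the average reward.
    """
    Q = {(s, a): 0.0 for s in S for a in A}
    while True:
        delta = 0.0
        for s in S:
            for a in A:
                backup = r[(s, a)] + gamma * sum(
                    p * max(Q[(s2, b)] for b in A)
                    for s2, p in P[(s, a)].items())
                delta = max(delta, abs(backup - Q[(s, a)]))
                Q[(s, a)] = backup
        if delta < tol:
            return Q

def optimal_policy(Q, S, A):
    """pi*(s) = argmax_a Q*(s, a); similarly, V*(s) = max_a Q*(s, a)."""
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
```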

2.2 Partial Observability and IMRL

As discussed above, our RL agents use prioritized sweeping to build estimates of P and r from which they then approximate Q*. Both P and r can be estimated by maintaining running averages of the corresponding values. For example,

P̂(s' | s, a) = n_t(s, a, s') / n_t(s, a)

models the probability of a transition from state s to state s' by means of action a as the ratio between the number of times that, by time step t, the agent experienced a transition from s to s' after selecting action a, n_t(s, a, s'), and the number of times the agent selected action a in state s, n_t(s, a).

However, as already mentioned, most agents considered in this paper have partial observability, i.e., they are unable to unambiguously determine the state of their environment and are only able to perceive some features of this state. This is similar to what occurs in nature: individuals are only able to perceive the environment in their immediate surroundings. Such limited perception necessarily impacts their decision-making process. For example, while the optimal course of action for a hungry predator is to approach its prey, this actually requires the predator to be able to figure out the position of the prey. Similarly, partial observability also impacts the ability of our RL agents to select optimal actions.

In terms of their learning algorithm, our RL agents treat each observation Z_t as the full state of the environment. They thus build a transition model P̂(z' | z, a) and a reward model r̂(z, a) that will generally provide inaccurate predictions. This model is then used to build a Q-function Q̂ : Z × A → R that the agent uses to guide its decision process. It is a well-established fact that, in scenarios with partial observability, observations alone are not sufficient for the agent to accurately track the underlying state of the system. Therefore, policies computed by treating observations as states can lead to arbitrarily poor performance [37]. Moreover, computing the best such policy is generally hard [19]. In fact, creating robust RL agents that can overcome perceptual limitations often involves significant modeling effort and expert knowledge [40].

The intrinsically motivated reinforcement learning (IMRL) framework [38, 39] proposes the use of richer reward functions that implicitly encode information to potentially overcome the agents' perceptual limitations. In fact, this approach was shown useful both to facilitate reward design [25, 35] and to mitigate agent limitations [4, 40, 41]. In this framework, the performance of RL agents in the original task provides a measure of the fitness of those agents. Different agents, each with a different reward function accounting for multiple sources of information, are then compared in terms of their fitness, and the most fit agent is selected. This selection process allows us to identify, for a given set of environments, which sources of information are most useful to maximize the fitness of RL agents in the task at hand, providing a natural framework for the study in this paper.

Formally, IMRL extends traditional RL and provides a framework to address the optimal reward problem (ORP) [40], which we now describe.
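
The running-average estimates described at the beginning of this subsection can be kept with simple counters, indexed by observations rather than states since the agent treats each Z_t as the full state. The class below is a sketch under that assumption; the names n_za, n_zaz and r_hat mirror n_t(z, a), n_t(z, a, z') and r̂.

```python
from collections import defaultdict

class ModelEstimator:
    """Count-based estimates of the transition and reward models over observations."""

    def __init__(self):
        self.n_za = defaultdict(int)        # n_t(z, a)
        self.n_zaz = defaultdict(int)       # n_t(z, a, z')
        self.r_sum = defaultdict(float)     # cumulative external signal per (z, a)

    def update(self, z, a, rho, z_next):
        self.n_za[(z, a)] += 1
        self.n_zaz[(z, a, z_next)] += 1
        self.r_sum[(z, a)] += rho

    def P_hat(self, z, a, z_next):
        """Estimated transition probability, the ratio n_t(z, a, z') / n_t(z, a)."""
        n = self.n_za[(z, a)]
        return self.n_zaz[(z, a, z_next)] / n if n else 0.0

    def r_hat(self, z, a):
        """Running average of the external evaluation signal at (z, a)."""
        n = self.n_za[(z, a)]
        return self.r_sum[(z, a)] / n if n else 0.0
```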

Let H_t be a random variable representing the history of interaction of an agent with its environment up to time-step t, and let

h_t = {z_1, a_1, ρ_1, ..., z_{t−1}, a_{t−1}, ρ_{t−1}, z_t}

denote a particular realization of H_t. Such a history corresponds to all information perceived by the agent directly from the environment: the sequence {z_τ, τ = 1, ..., t} corresponds to observations about the environment state (according to the POMDP model described in Section 2); similarly, {a_τ, τ = 1, ..., t} corresponds to the sequence of actions performed by the agent; finally, {ρ_τ, τ = 1, ..., t} corresponds to an external evaluation signal that, at each time-step t, depends only on the underlying state S_t of the environment and the action A_t performed by the agent. This signal can be either environment feedback (for example, when an agent receives a monetary prize for performing some action) or physiological feedback (for example, when an agent feels satisfied after feeding).

Given a particular finite history h, we write p_H(h | r, e) to denote the probability of an RL agent^7 observing history h in environment e when its reward function is r. We evaluate the agent's performance by means of some real-valued fitness function f : H → R, where H is the space of all possible (finite) histories. Then, given a space R of possible reward functions, a set E of possible environments, and a distribution p_E over the environments in E, the ORP seeks to determine the optimal reward function, denoted by r*, maximizing the fitness over the set E according to

r* = argmax_{r∈R} F(r),   (3)

where F(r) is the expected fitness of the RL agent using the reward function r, which is given by

F(r) = Σ_{h,e} f(h) p_H(h | r, e) p_E(e),   (4)

where each e and h is sampled according to p_E(e) and p_H(h | r, e), respectively. Throughout this paper, we specifically consider that the fitness associated with a given history h_t is given by

f(h_t) = Σ_{τ=1}^{t} ρ_τ.   (5)

From the above, it should be apparent that the signal {ρ_t, t = 1, ...} actually corresponds to an (external) reward signal that determines the fitness of the agent.^8 We thus define the function r_F : S × A → R as

r_F(s, a) = E[ρ_t | S_t = s, A_t = a],   (6)

and henceforth refer to r_F as the fitness-based reward function.

^7 Our RL agents all follow the prioritized sweeping algorithm and use the exploration policy detailed in Section 3.
^8 We note that our choice of measuring the fitness as the cumulative external evaluation signal is only one among many other possible metrics. In the context of our study, we believe this to be a good metric as it allows us to directly measure the agent's fitness from its performance in the underlying task in the environment.
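
Definition (5) amounts to summing the external signal over the agent's lifetime. The sketch below generates one history and accumulates ρ; the `env` and `agent` objects and their interfaces are hypothetical stand-ins for a scenario and a learning agent, introduced here for illustration.

```python
def rollout_fitness(env, agent, steps=100_000):
    """Run one trial, producing a history h_t = {z_1, a_1, rho_1, ..., z_t},
    and return its fitness f(h_t), i.e., the sum of the external signal (equation (5))."""
    z = env.reset()
    fitness = 0.0
    for _ in range(steps):
        a = agent.act(z)
        z_next, rho = env.step(a)      # rho depends on the hidden state S_t and action A_t
        agent.learn(z, a, rho, z_next)
        fitness += rho
        z = z_next
    return fitness
```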

The function r_F can be seen as the sparsest representation of the task to be learned by the agent, as encoded by the signal {ρ_t}. We consider throughout the paper that r_F ∈ R. The interest of considering the ORP instead of simple RL agents driven only by the reward r_F is that, in the presence of agents with limitations, the solution r* to the ORP is often a better alternative than r_F. In fact, the reward r* obtained often leads to faster learning and induces behaviors that are more robust and efficient than those induced by r_F [33, 40].

3 Identification of Optimal Sources of Information

Referring back to the roadmap in Fig. 2, we now address the baseline question driving our study: which information is (potentially) most useful to complement the perceptual capabilities of an autonomous learning agent? In other words, and referring back to the diagram of Fig. 1, we investigate possible alternatives to the appraisal derivation module that may most significantly impact agent performance (see Fig. 3). Importantly, as part of our approach, in this first set of experiments we are only interested in discovering useful sources of information, regardless of their relation with emotions or appraisal theories.

Fig. 3 General architecture for an artificial agent: perceptions feed an unspecified module (marked "?") whose processed signal, together with the perceptual signal, drives the decision-making module.

To address this general question, we consider foraging scenarios where an IMRL agent acts as a predator in an environment such as those in Fig. 6. The perceptual limitations of the agent in the different environments pose challenges that directly impact its ability to capture its prey and, consequently, its fitness. In order to identify possible sources of useful information to compensate for the agent's perceptual limitations, we start from a primitive population of agents, each endowed with a reward structure containing information about different aspects of the agent's past interactions with its environment. The fittest agents (i.e., those with the greatest ability to capture prey) are used to successively improve the population. Upon convergence, we identify the set of agents able to attain the largest fitness. The analysis of the corresponding reward structure provides the required information about which signals are potentially most useful to complement the perceptual capabilities of our IMRL agents.

3.1 Methodology

In order to determine which reward functions and, consequently, which information best complements the agent's perceptions, we adopt the genetic programming (GP) approach proposed by Niekum et al. [25]. In that work, the authors used GP in the context of IMRL and the ORP as a possible approach to identify optimal rewards for RL agents. The procedure consisted in searching for reward functions represented by programs that combine different elements of the learning domain, such as the agent's position in the environment or its hunger status.

In the context of our work, there are some appealing features in the use of GP. Recall from Section 2.2 that the ORP involves the definition of a space of reward functions R and an optimization procedure to search for the optimal reward function r*. GP facilitates the definition of the space of rewards by alleviating the need to specify an explicit parameterization. Instead, we implicitly define the space of possible rewards by specifying a set of operators and terminal nodes, the latter corresponding to constants or variables. Moreover, the optimization mechanism is implicitly defined by a selection method and mathematical operators that combine the terminal nodes, constructing richer, more complex and potentially more informative signals as the evolutionary procedure progresses. Another appealing feature of GP over other search methods (such as gradient descent [41]), in the context of our study, is its close parallel with natural evolution. In the continuation, we provide a detailed description of the setup and procedure used in this first experiment.

3.1.1 Genetic Programming

In general terms, GP aims to find a program that maximizes some measure of fitness [15]. Programs are represented as syntax trees, where nodes correspond to either operators or terminal nodes representing primitive quantities. In our case, we use as terminal nodes quantities that summarize aspects of the history of interaction of the agent with its environment. The GP approach allows for the discovery of interesting mathematical relations between such primitive quantities.

Fig. 4 shows the basic elements and operations involved in using GP to represent and evolve reward functions within IMRL. Non-operator (terminal) nodes are selected from a set T of possible terminal nodes, and represent either numerical variables or constants. Fig. 4(a) shows an example of a GP tree with a single constant terminal node, representing the reward function r = 2. Fig. 4(b) shows an example of a tree with a single variable terminal node, representing a reward function r = n_z that rewards visits to state z according to the number of times it was observed. Operators are selected from a set O of possible operators, and their arguments are represented as descendants of the operator node in the tree. GP iteratively explores possible solutions by maintaining a population of candidate programs, producing new generations of programs by means of selection, mutation and crossover.
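
As a sketch of this representation, reward programs can be encoded as nested lists whose leaves are terminals and whose internal nodes are operators; crossover swaps random sub-trees between two parents, and mutation regrows a random node. The terminal and operator names below follow the sets T and O defined in the following paragraphs; everything else (tree depth, growth probability, overflow guards) is an arbitrary choice made here for illustration.

```python
import copy
import math
import random

TERMINALS = [0, 1, 2, 3, 5,                                                   # constants in C
             'r_za', 'n_z', 'n_za', 'v_z', 'q_za', 'd_z', 'e_za', 'p_zaz']    # variables in V
UNARY = {'sqrt', 'exp', 'log'}
BINARY = {'+', '-', '*', '/'}

def random_tree(depth=3):
    """Grow a random program: either a terminal, or [operator, child, ...]."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    op = random.choice(sorted(UNARY | BINARY))
    arity = 1 if op in UNARY else 2
    return [op] + [random_tree(depth - 1) for _ in range(arity)]

def evaluate(tree, var):
    """Evaluate a program given current values of the basic variables (a complete dict)."""
    if not isinstance(tree, list):
        return var[tree] if isinstance(tree, str) else float(tree)
    op, args = tree[0], [evaluate(t, var) for t in tree[1:]]
    if op == '+':    return args[0] + args[1]
    if op == '-':    return args[0] - args[1]
    if op == '*':    return args[0] * args[1]
    if op == '/':    return args[0] / args[1] if args[1] else 0.0
    if op == 'sqrt': return math.sqrt(abs(args[0]))
    if op == 'log':  return math.log(abs(args[0]) + 1e-9)
    return math.exp(min(args[0], 50.0))            # 'exp', clipped to avoid overflow

def all_nodes(tree, path=()):
    """Every (path, subtree) pair, where a path is a tuple of child indices."""
    nodes = [(path, tree)]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            nodes.extend(all_nodes(child, path + (i,)))
    return nodes

def replace_at(tree, path, subtree):
    """Copy of `tree` with the node at `path` replaced by `subtree`."""
    if not path:
        return copy.deepcopy(subtree)
    new = copy.deepcopy(tree)
    node = new
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = copy.deepcopy(subtree)
    return new

def crossover(a, b):
    """Replace a random sub-tree of parent a by a random sub-tree of parent b."""
    path, _ = random.choice(all_nodes(a))
    _, donor = random.choice(all_nodes(b))
    return replace_at(a, path, donor)

def mutate(tree):
    """Replace a random node by a freshly grown sub-tree."""
    path, _ = random.choice(all_nodes(tree))
    return replace_at(tree, path, random_tree(depth=2))
```

Under this encoding, the tree of Fig. 4(e) discussed below would be written as ['-', ['*', 2, 'r_za'], ['+', 'n_z', 'n_za']].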

Fig. 4 Defining reward functions as genetic programs. Some examples: (a) a constant GP node; (b) a variable GP node; (c) a GP tree obtained by a crossover operation between the nodes in (a) and (b); (d) a GP tree obtained by a mutation operation made to the tree in (c); (e) a possible evolved GP tree. Bold nodes and lines indicate changes in the tree induced by the several operations. See text for a detailed explanation.

The crossover function randomly replaces some sub-tree (a node and all of its descendants) of a parent program by another sub-tree from another parent upon reproduction. Fig. 4(c) shows an example of a GP tree that could be obtained through a crossover operation between the nodes in Figs. 4(a) and 4(b), where the multiplication operator node was introduced. The resulting tree represents the reward function r = 2n_z. The mutation operator replaces some node by another selected randomly. For example, Fig. 4(d) depicts a possible GP tree obtained by a mutation operation made to the tree in Fig. 4(c), where the left node was replaced. The resulting tree represents the reward function r = n_za · n_z.^9

In our experiments, we used for primitive quantities the set T = C ∪ V, with C corresponding to the set of constants, C = {0, 1, 2, 3, 5}, and V to the set of basic variables, V = {r_za, n_z, n_za, v_z, q_za, d_z, e_za, p_zaz'}, where:

- r_za = r̂^F_t(z, a) is the agent's estimate at time t of the fitness-based reward function for performing action a after observing z. This basic variable essentially informs the agent of its performance with respect to the external signal ρ provided by its environment/designer.^10 It is a function of z, a, and the agent's history up to time t, H_t.

- n_z = n_t(z) is the number of times that z was observed up to time-step t. This signal informs the agent about the frequency of observations. When compared globally across observations, it can be used by the agent, e.g., to determine which states were observed more often or which may need further exploration. It is a function of z and the agent's history up to time t, H_t.

^9 More details on GP can be found in [15].
^10 Recall that r_F(s, a) rewards the agent in accordance with the increase/decrease in fitness caused by executing each a in each state s.

- n_za = n_t(z, a) is the number of times the agent executed action a after observing z up to time-step t. Similarly to n_z, this signal informs the agent about how frequently some action was executed after observing some state. It is a function of z, a, and the agent's history up to time t, H_t.

- v_z = V^F_t(z) is the value function associated with the reward function estimate r̂^F_t. As we have seen in Section 2.1, this function indicates the expected value (relating to fitness attainment) of having observed z and following the current policy being learned henceforth. This signal can be used to inform the agent about the fitness-based long-term utility associated with some observation. It is a function of z and the agent's history up to time t, H_t.

- q_za = Q^F_t(z, a) is the Q-function associated with the reward function estimate r̂^F_t. Like v_z, it can be used to indicate the long-term impact on the agent's fitness of executing some action given some observation. It is a function of z, a, and the agent's history up to time t, H_t.

- d_z = d̂_t(z) corresponds to an estimate of the number of actions needed to reach a goal after observing z. Goals correspond to those observations that maximize r̂^F_t, and therefore this variable denotes observations that are close to or far away from experienced situations providing maximal immediate fitness. This signal can be used by the agent in its planning mechanism to pursue courses of action that will lead to greater degrees of fitness in the long run. It is a function of z and the agent's history up to time t, H_t.

- e_za = E[ΔQ^F_t(z, a)] is the expected Bellman error associated with Q^F_t at (z, a). Given an observed transition (z, a, r, z'), the Bellman error associated with Q^F_t is given by

  ΔQ^F_t(z, a) = r̂^F_t(z, a) + γ max_{b∈A} Q^F_t(z', b) − Q^F_t(z, a).

  This signal essentially indicates the prediction error associated with some transition. If the agent receives a reward and observes a situation whose value greatly differs from the previous value attributed by Q^F_t(z, a), then this transition denotes a discrepancy between what was observed and the agent's previous model of the world. The agent can use this basic variable, e.g., to identify situations that change very often or to choose actions leading to more stable outcomes. It is a function of z, a, and the agent's history up to time t, H_t.

- p_zaz' = P̂_t(z' | z, a) corresponds to the estimated probability of observing z' when executing action a after observing z. Since the learning algorithm used by the agent averages the perceived reward function, p_zaz' is actually equivalent to

  E[P̂(Z_{t+1} | z, a)] = Σ_{z'∈Z} P̂_t(z' | z, a) · P[Z_{t+1} = z' | Z_t = z, A_t = a].

  Similarly to e_za, this signal can be used by the agent to identify the execution of actions leading to more (un)stable outcomes, i.e., the greater the number of distinct transitions z' observed so far after executing a in z, the smaller the value of p_zaz', and hence the more unreliable or erratic the pair (z, a) will be. p_zaz' is a function of z, a, and the agent's history up to time t, H_t.
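
The last two variables can be maintained incrementally from experienced transitions. The sketch below is one possible bookkeeping scheme, assuming a Q-table and count tables like those in the earlier model-estimation sketch; the discount value is a placeholder.

```python
from collections import defaultdict

class BasicVariables:
    """Incremental bookkeeping for e_za and p_zaz' (a sketch)."""

    def __init__(self, actions, gamma=0.9):
        self.actions = actions
        self.gamma = gamma                   # placeholder discount
        self.Q = defaultdict(float)          # Q^F_t(z, a)
        self.n_za = defaultdict(int)         # n_t(z, a)
        self.n_zaz = defaultdict(int)        # n_t(z, a, z')
        self.err_sum = defaultdict(float)    # cumulative Bellman error at (z, a)

    def bellman_error(self, z, a, r_hat, z_next):
        """Delta Q^F_t(z, a) for an observed transition (z, a, r, z')."""
        target = r_hat + self.gamma * max(self.Q[(z_next, b)] for b in self.actions)
        return target - self.Q[(z, a)]

    def update(self, z, a, r_hat, z_next):
        self.n_za[(z, a)] += 1
        self.n_zaz[(z, a, z_next)] += 1
        self.err_sum[(z, a)] += self.bellman_error(z, a, r_hat, z_next)

    def e_za(self, z, a):
        """Average (expected) Bellman error at (z, a)."""
        n = self.n_za[(z, a)]
        return self.err_sum[(z, a)] / n if n else 0.0

    def p_zaz(self, z, a, z_next):
        """Estimate of P(z' | z, a) at time t, from the transition counts."""
        n = self.n_za[(z, a)]
        return self.n_zaz[(z, a, z_next)] / n if n else 0.0
```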

Fig. 5 The GP approach to the ORP, as proposed in [25]. In each generation j, a population R_j contains a set of candidate reward functions r_k, k = 1, ..., K. All are evaluated according to a fitness function F(r_k) and evolve according to crossover, mutation and selection.

The variables above include all elements stored and/or computed by the learning agent, and therefore summarize the agent's history of interaction with its environment. As for the operators used by the GP algorithm, we considered the set O = {+, −, ×, /, √, exp, log}. Throughout time, and according to the fitness obtained for each reward function, the GP procedure applies the aforementioned operations to evolve relations between the primitive variables and constants in set T and the mathematical operators in set O. For example, the GP tree depicted in Fig. 4(e) represents a more complex reward function expressed by the program 2r_za − (n_z + n_za) that could be obtained after a few iterations of the GP algorithm. This example evolved function rewards the agent for fitness-inducing behaviors by means of the term 2r_za and punishes the agent as it becomes more and more familiarized with z and a, as given by −(n_z + n_za).

3.1.2 Evolutionary Procedure

Figure 5 outlines the optimization scheme for the ORP using GP, for a specific set of environments E. At each generation j, a reward function population R_j of size K contains a set of candidate reward functions r_k, k = 1, ..., K. Each r_k ∈ R_j is evaluated according to the fitness function F(r_k). When all the reward functions have been evaluated, the evolutionary procedure takes place by applying the mutation and crossover operations defined earlier and applying selection over the population in order to produce the new generation of reward functions, corresponding to population R_{j+1}. The process repeats for a number J of generations.^11

In our experiments, to run the evolutionary procedure we generate a total of 50 independent initial populations, each containing K = 100 elements, and run the evolutionary procedure for J = 50 generations for each population. For the selection method we use a steady-state procedure [43] that, in each generation j, maintains the 10 most fit elements (the reward functions with highest fitness) and generates 10 new random elements. The remaining 80 elements are generated either by mutating one element or through crossover.

^11 The first generation, corresponding to the population R_1, is randomly generated.
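
One generation of this steady-state scheme can be sketched as follows. The elite/newcomer/offspring proportions and the population size follow the text; the GP operators and the fitness estimate are passed in (for instance, the sketches given earlier and the Monte-Carlo estimate described next), and the 50/50 split between mutation and crossover for the 80 offspring is an assumption, since the text does not specify the proportion.

```python
import random

def next_generation(population, fitness, random_tree, mutate, crossover,
                    n_elite=10, n_random=10, size=100):
    """Produce R_{j+1} from R_j: keep the 10 fittest rewards, add 10 random newcomers,
    and fill the rest by mutation/crossover with rank-based parent selection."""
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:n_elite]
    newcomers = [random_tree() for _ in range(n_random)]

    # Parents are drawn with probability decreasing with their rank (fittest first).
    weights = list(range(len(ranked), 0, -1))
    def parent():
        return random.choices(ranked, weights=weights, k=1)[0]

    offspring = []
    while len(offspring) < size - n_elite - n_random:
        if random.random() < 0.5:                  # assumed 50/50 mutation vs. crossover
            offspring.append(mutate(parent()))
        else:
            offspring.append(crossover(parent(), parent()))
    return elites + newcomers + offspring
```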

This is done by pairing elements of the previous population according to a rank selection that chooses parents with a probability proportional to their fitness, i.e., reward functions with a greater fitness have a higher probability of being mutated or paired with another reward function.

Recall that solving the ORP implies the definition of a space of reward functions R and the determination of the optimal reward function r* for a specific scenario. By space of reward functions we refer to the set of all reward functions that can (potentially) be generated by the GP algorithm. In particular, any possible combination of the primitive quantities in T and the operators in O that may be generated throughout time by the evolutionary procedure corresponds to a possible reward function and, as such, to an element of our so-called space of reward functions. The parameterization of R is therefore implicitly defined by the sets T and O. The evolved optimal reward function is determined according to (3) over all r_k ∈ R_j, j = 1, ..., J, i.e., it corresponds to the reward function with highest fitness considering all generations of all the populations that were initialized.

As an effect of mutation and crossover, reward functions might gain sub-expressions that do not contribute to the overall fitness attained by the agent as time evolves. Because we are interested in identifying only the interesting sources of information from the optimal reward functions, in a post-hoc procedure r* is parsed for sub-expressions that may have no effect on the computed fitness. This is done by first generating all possible sub-combinations of the tree representing r*. For example, an optimal reward function defined by the program 2r_za − (n_z + n_za), depicted in Fig. 4(e), would generate the following sub-expressions: 2, 2r_za, 2r_za − n_z, 2r_za − n_za, 2 − (n_z + n_za), 2 − n_z, 2 − n_za, r_za, r_za − (n_z + n_za), r_za − n_z, r_za − n_za, n_z, (n_z + n_za) and n_za. Each sub-expression is used to form a new reward function and its fitness is estimated. The simplified optimal reward function is then selected as the shortest sub-expression (in number of nodes) whose fitness difference, in relation to the evolved optimal reward function, is not statistically significant.^12 Many simplifications involve operations with the constants 0 and 1, as they sometimes cancel the associated nodes or have no effect on the overall reward: e.g., the expression 0r_za − 1(v_z − (exp(0)q_za)) + log(1) would automatically simplify to q_za − v_z. In general, depending on the results for each scenario, other sub-expressions may be removed from r*.

3.1.3 Estimating the Reward Function Fitness

It is a computationally demanding endeavor to explicitly compute F(r), since it involves computing the expectation of f over p_H and p_E, as seen in (4). As such, in order to estimate the value F(r), corresponding to the reward function evaluation stage in Fig. 5, we run N = 200 independent Monte-Carlo trials of 100,000 time-steps each, where in each trial we simulate an RL agent driven by reward r in an environment selected randomly from the corresponding set of environments E.^13 We then approximate F(r) as the mean fitness across all observed histories, i.e.,

F(r) ≈ (1/N) Σ_{i=1}^{N} f(h_i),   (7)

where h_i is the sampled history in the ith trial.

^12 We resorted to a simple unpaired t-test to determine this statistical significance.
^13 The set E is scenario-specific. For example, in the Hungry-Thirsty scenario, E includes all possible configurations of food and water.
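
Approximation (7) is a plain Monte-Carlo average. In the sketch below, the environment distribution is taken to be uniform over the scenario's set E (the text does not state p_E explicitly), `make_agent` is a hypothetical factory that builds an RL agent driven by the candidate reward, and the rollout helper is the one sketched earlier.

```python
import random
import statistics

def estimate_fitness(reward_fn, environments, make_agent, rollout,
                     n_trials=200, steps=100_000):
    """Estimate F(r) as in (7): the mean fitness over N independent trials,
    each run in an environment drawn from the scenario's set E."""
    samples = []
    for _ in range(n_trials):
        env = random.choice(environments)        # e ~ p_E (assumed uniform here)
        agent = make_agent(reward_fn)
        samples.append(rollout(env, agent, steps))
    return statistics.mean(samples), statistics.stdev(samples)
```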

Fig. 6 Structure of the foraging environments used in the first set of experiments (environments (a) and (b)). The pairs (x:y) indicate the possible locations for the agent.

3.1.4 Scenarios

We used a total of six scenarios (see Fig. 6), either taken from the IMRL literature or modifications thereof [35, 39, 40]. We refer to [35] for a more detailed description of each environment and the associated challenges.

Hungry-Thirsty scenario: The environment is depicted in Fig. 6(a). It contains two inexhaustible resources, corresponding to food and water. Resources can be positioned in any of the environment corners (positions (1:1), (5:1), (1:5), and (5:5)), leading to a total of 12 possible configurations of food and water. The agent's fitness is defined as the amount of food consumed. However, the agent can only consume food if it is not thirsty, a condition that the agent can achieve by consuming the water resource (drinking). At each time-step after drinking, the agent becomes thirsty again with a probability of 0.2. The agent observes its position and its thirst status.

Lairs scenario: In this scenario, the layout of the environment corresponds again to Fig. 6(a). In it, the agent is a predator and there are two prey lairs positioned in different corners of the environment, resulting in 6 possible configurations. The fitness of the agent is defined as the number of prey captured.

Whenever a lair is occupied by a prey, the agent can drive the prey out by means of a Pull action. The state of the lair then transitions to prey outside, and the agent has exactly one time-step to capture the prey with a Capture action, before the prey runs away. In either case, the state of the lair transitions to empty. At every time-step there is a 0.1 probability that a prey will appear in an empty lair. In this scenario, A = {N, S, E, W, P, C}, where N, S, E and W move the agent in the corresponding direction, and P and C correspond to the Pull and Capture actions. The agent is able to observe its position in the environment and the state of both lairs.

Moving-Preys scenario: The environment for this scenario is depicted in Fig. 6(b). In this scenario, the agent is a predator and, at any time-step, there is exactly one prey available, located in one of the end-of-corridor locations (positions (3:1), (3:3) or (3:5)). The agent's fitness is again defined as the number of prey captured. Whenever the agent captures a prey, the latter disappears from the current location and a new prey randomly appears in one of the two other possible prey locations.

Persistence scenario: The environment again corresponds to the one in Fig. 6(b). In this scenario, the environment contains two types of prey, both always available. Hares are located in position (3:1) and contribute to the fitness of the agent with a value of 1. Rabbits are located in position (3:5) and contribute with a value of 0.01 to the agent's fitness. Whenever it captures a prey, the agent's position is reset to the initial position (position (3:3)). The environment also contains a fence, located in position (1:2), that prevents the agent from easily capturing hares. In order for the agent to cross over the fence toward the hare location at time t, it must persistently perform the action N for N_t consecutive time-steps.^14 Every time the agent crosses the fence upwards, the fence is reinforced, requiring an increasing number of N actions to be crossed.^15

Seasons scenario: The environment again corresponds to the one in Fig. 6(b). In this scenario the environment contains two possible types of prey. Hares appear in position (3:1) and contribute to the agent's fitness with a value of 1. Rabbits appear in position (3:5) and contribute to the fitness of the agent with a value of 0.1. As in the Persistence scenario, the agent's position is reset to (3:3) upon capturing any prey. However, unlike the Persistence scenario, in this scenario only one prey is available at each time-step, depending on the season, which changes every 5,000 time-steps. The initial season is randomly selected as either Hare Season or Rabbit Season with equal probability. Additionally, in the rabbit season, for every 10 rabbits that it captures, the agent is attacked by the rabbit farmer, which negatively impacts its fitness by a value of 1.

^14 The fence is only an obstacle when the agent is moving upward from position (1:2).
^15 Denoting by n_t(fence) the number of times that the agent has crossed the fence upwards up to time-step t, N_t is given by N_t = min{n_t(fence) + 1; 30}.
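
Two of the quantitative rules above are compact enough to transcribe directly. The sketch below encodes only the fence rule from footnote 15 and the 5,000-step season switch; the random initial season and everything else about the environment dynamics are omitted.

```python
def fence_crossings_required(n_fence):
    """Persistence scenario: N_t = min{n_t(fence) + 1, 30} consecutive N actions
    are needed to cross the fence after it has been crossed n_fence times."""
    return min(n_fence + 1, 30)

def season_at(t, initial="hare", period=5_000):
    """Seasons scenario: the available prey type switches every 5,000 time-steps."""
    other = "rabbit" if initial == "hare" else "hare"
    return initial if (t // period) % 2 == 0 else other
```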

Poisoned prey scenario: This scenario is a variation of the Seasons scenario. The scenario layout and prey positions are the same, but both rabbits and hares are always available to the agent. Rabbits contribute to the fitness of the agent with a value of 0.1. Hares, when healthy, contribute positively to the agent's fitness by an amount of 1. When poisoned, they contribute negatively to the fitness of the agent with a value of 1. As in the Seasons scenario, the health status of the hares changes every 5,000 steps.

3.1.5 Agent Description

In all scenarios, the agent is modeled as a POMDP whose state dynamics follow from the descriptions above. In all but the Lairs scenario, the agent has 4 actions available, A = {N, S, E, W}, that deterministically move it in the corresponding direction; prey are captured automatically whenever co-located with the agent. In all but the Hungry-Thirsty and Lairs scenarios, the agent is only able to observe its current (x:y) position and whether it is co-located with a prey.

All scenarios use prioritized sweeping RL agents [24] to learn a policy that treats observations as states (see Section 2). In our experiments, prioritized sweeping updates the Q-values of up to 10 state-action pairs in each iteration, using a learning rate of α = 0.3. During its lifetime, the agent uses an ε-greedy exploration strategy with a decaying exploration parameter ε_t = λ^t, with 0 < λ < 1. In all experiments, we consider a fixed discount factor γ.
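
The learning agent just described can be sketched as follows: ε-greedy action selection with the decaying schedule ε_t = λ^t, and a simplified prioritized-sweeping update that refreshes up to 10 observation-action pairs per step with learning rate α = 0.3. The values of λ and γ below are placeholders, Q is assumed to be a defaultdict(float), and `model` is assumed to expose the count-based estimates sketched earlier plus a predecessor list.

```python
import heapq
import random

def epsilon_greedy(Q, z, actions, t, lam=0.999):
    """Pick a random action with probability eps_t = lam**t, otherwise act greedily."""
    if random.random() < lam ** t:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(z, a)])

def prioritized_sweeping(Q, model, z, a, alpha=0.3, gamma=0.9, max_updates=10):
    """Simplified prioritized sweeping after experiencing (z, a): repeatedly back up the
    most urgent observation-action pairs using the learned model (a sketch).

    `model` is assumed to expose r_hat(z, a), successors(z, a) -> {z': prob},
    predecessors(z) -> iterable of (z_prev, a_prev), and an `actions` attribute."""
    queue = [(0.0, (z, a))]                      # (negative priority, pair)
    for _ in range(max_updates):
        if not queue:
            break
        _, (zq, aq) = heapq.heappop(queue)
        backup = model.r_hat(zq, aq) + gamma * sum(
            p * max(Q[(z2, b)] for b in model.actions)
            for z2, p in model.successors(zq, aq).items())
        delta = backup - Q[(zq, aq)]
        Q[(zq, aq)] += alpha * delta             # learning rate alpha = 0.3 (from the text)
        for zp, ap in model.predecessors(zq):    # propagate large changes backwards
            heapq.heappush(queue, (-abs(delta), (zp, ap)))
    return Q
```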

3.2 Results

The results of the GP experiment are summarized in Table 1. We present the average fitness estimated according to (7) and the expression, simplified using the procedure described in Section 3.1.2, obtained by the agent using the evolved optimal reward function r* selected by GP in each of the test scenarios. As a straightforward baseline for comparison, we also present the fitness obtained by an agent driven by the fitness-based reward function r_F = r_za. We note that the compared agents are similar in all aspects except the reward function. In particular, the dimensions of the transition function and Q-function learned are the same.

Table 1 Mean fitness and evolved optimal reward function r* for each scenario. For ease of analysis, we recall the set of basic variables, V = {r_za, n_z, n_za, v_z, q_za, d_z, e_za, p_zaz'}. For each scenario, we also include the performance of the fitness-based reward function r_F. The results correspond to averages over 200 independent Monte-Carlo trials.

Scenario        | Reward function            | Mean Fitness
Hungry-Thirsty  | r* = q_za − v_z − 2        | 10,… ± 6,…
                | r_F = r_za                 | 7,… ± 6,…
Lairs           | r* = q_za − v_z            | 8,… ± 1,…
                | r_F = r_za                 | 7,… ± …
Moving-Preys    | r* = −n_z²                 | 2,… ± 45.4
                | r_F = r_za                 | … ± 18.0
Persistence     | r* = q_za − v_z            | 1,… ± 11.6
                | r_F = r_za                 | … ± 1.5
Seasons         | r* = r_za + q_za − p_zaz'  | 6,… ± …
                | r_F = r_za                 | 4,… ± 1,…
Poisoned prey   | r* = 5 r_za q_za           | 5,… ± …
                | r_F = r_za                 | 1,… ± 4.1

One first observation is that, in all scenarios, the evolved reward function clearly outperforms the fitness-based reward function. Our results are in accordance with findings in previous works on the advantages of allowing additional sources of information to guide the agent's decision-making [4, 33, 39, 40]. Our results also confirm previous findings on the usefulness of an evolutionary approach to search for optimal reward functions [25]. There is, however, one key difference between our approach and that in [25]: we provide the evolutionary approach with domain-independent sources of information relating to the agent's history of interaction with the environment. We expect that the reward functions thus evolved can be applied in domains other than those used in our experiments and described in Section 3.1.

By analyzing the several simplified expressions that emerged from the evolutionary procedure in Table 1, we observe the presence of a particular sub-expression, namely q_za − v_z. Aside from the fact that 3 out of the 6 rewards can be reconstructed directly from this quantity, it is a well-known quantity in the RL literature, known as the advantage function [2]. It proved to be crucial in scenarios having a great diversity of environment configurations, such as the Hungry-Thirsty and Lairs scenarios, and also in the Persistence scenario. In the latter, it was important for the agent to ignore sub-optimal decisions when facing the obstacle in the environment, i.e., where choosing actions other than N (the one with the highest advantage) was prejudicial in terms of the future gains provided by capturing the hare. The result for the Moving-Preys scenario is given by the expression −n_z² and is quite intuitive, as it was important for the agent to explore the environment by choosing states with a low number of visits in order to capture the moving prey. In the Seasons scenario, the resulting expression gives importance both to the fitness-based reward, by means of r_za + q_za, and also to state-action pairs that provide low-probability transitions, as indicated by p_zaz'. Such a sub-expression proved useful for the agent to continue to go after the hares even when the seasons changed, thus avoiding the negative penalties from the rabbits. In the Poisoned prey scenario, a greater importance was given to the fitness-based reward by means of the sub-expression 5r_za, and the value provided by q_za ensured that the agent kept capturing hares, eventually gaining an advantage in the Healthy Season.

3.3 Discussion

We recall that the goal of our first experiment was to identify possible sources of information that could improve the agent's performance if taken into consideration in the process of decision-making.^16 Given the simplification process used to remove unnecessary sub-expressions from the optimal reward functions evolved through GP, each sub-expression indicated in Table 1 can be interpreted as a possible signal that can drive the agent's decision process, allowing it to maximize its fitness. Discarding additive and multiplicative constants, we can distill from Table 1 a set of five signals, Φ = {φ_fit, φ_adv, φ_rel, φ_prd, φ_frq}, given by:

- φ_fit = r_za corresponds to the agent's estimate of the fitness-based reward function. It evaluates the immediate impact on fitness associated with performing action a after observing z.

- φ_rel = q_za corresponds to the estimated Q-function associated with r_za. This function assesses the value of executing action a after observing z in terms of long-term impact on fitness, corresponding to the long-run counterpart of φ_fit.

- φ_adv = q_za − v_z corresponds to the estimated advantage function associated with r_za [2]. This function evaluates how good action a is in state s relative to the best action (its advantage). While φ_rel evaluates the absolute value of actions, φ_adv evaluates their relative value.

- φ_prd = p_zaz' corresponds to the agent's estimate of the transition probabilities. As discussed in Section 3.1, it provides a measure of how predictable the observation at time t+1 is, given that the agent performed action a after observing z.

- Finally, φ_frq = n_z² provides a (negative) measure of how novel z is, given the agent's observations.

The signals φ_k defined above correspond to the minimal set of sub-expressions from which we can form all the optimal reward functions for each scenario by combining them with the constants in C and using the different operators in O.^17 As noted earlier, the expression of the advantage automatically emerged as a natural candidate for our optimal sources of information, as it represents the whole optimal reward function (discarding additive constants) in 3 of the 6 tested scenarios. The expression for novelty also emerged as a natural candidate. As for the remaining signals, we opted to break down the two optimal reward functions r_za + q_za − p_zaz' and 5 r_za q_za into their smallest terms, thus ensuring that a wide range of rewards can be reconstructed.

^16 We again emphasize that our identification procedure is not guided by appraisal theories; the objective for now is precisely to identify useful signals, regardless of their connection with emotional appraisal.
^17 In our distillation process we are focused on extracting a minimal set of domain-independent informative signals. As will become clearer in the next section, apart from additive constants (which have minimal impact on the policy and can therefore be safely discarded), it will be possible to reconstruct the reward functions in Table 1 (and attain comparable degrees of fitness) as a linear combination of these signals.
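
Footnote 17 anticipates that the rewards in Table 1 can be rebuilt, up to additive constants, as linear combinations of these five signals. The sketch below computes the signals from the learner's running estimates and mixes them with a weight vector θ; the estimator interface (r_hat, Q, V, P_hat, n) is assumed to match the earlier sketches, and the weights are placeholders rather than values from the paper.

```python
def appraisal_like_signals(z, a, z_next, est):
    """The distilled set Phi, computed from the agent's internal estimates."""
    return {
        'fit': est.r_hat(z, a),                 # phi_fit: estimated fitness-based reward
        'rel': est.Q(z, a),                     # phi_rel: long-run value of (z, a)
        'adv': est.Q(z, a) - est.V(z),          # phi_adv: advantage of a at z
        'prd': est.P_hat(z, a, z_next),         # phi_prd: predictability of the transition
        'frq': est.n(z) ** 2,                   # phi_frq: (negative) novelty of z
    }

def combined_reward(signals, theta):
    """A linear combination r(z, a) = sum_k theta_k * phi_k(z, a)."""
    return sum(theta[name] * value for name, value in signals.items())
```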

Fig. 7 Architecture for an agent using the identified sources of information: the signals φ_adv, φ_rel, φ_prd, φ_frq and φ_fit, derived from the agent's perceptions, are combined into a reward signal that, together with the perceptual signal, drives the decision-making module.

It was particularly of interest to consider r_za as an independent signal given that, unlike the other basic variables, it is not learned and does not depend on the agent's experience, corresponding to an external evaluative signal. However, we note that this partitioning option is by no means unique; e.g., the reward features r_za q_za and r_za + q_za − p_zaz' are also possibilities that could be used while still assuring a minimal set of information signals. Each of the emerged signals is a function mapping observation-action-history triplets to a real value, and will henceforth be used as a source of information guiding the decision process of the agent. The updated agent architecture is depicted in Fig. 7.

Two observations are in order. First of all, in obtaining these signals, we considered a specific class of agents: our agents run prioritized sweeping with ε-greedy exploration. Had a different learning algorithm, exploration strategy or algorithm parameterization been used, it is possible that variations of the identified features could be observed. However, we would not expect these variations to dramatically change the sort of information provided by such features that is required by the agent to solve the intended tasks, especially in scenarios where the agent has partial observability.^18 Secondly, we would expect GP to yield different (and eventually more complex) signals had we considered more elaborate domains. However, one interesting aspect of our results arises precisely from the fact that the features used throughout the paper were evolved in such simple scenarios. In spite of their simplicity, and as will soon become apparent, they yield significant improvements in performance in significantly more complex settings (that even include other agents). This, in our view, is indicative that, even though simple, they are extremely informative.

4 Validation of Identified Sources

Section 3 focused on identifying general-purpose sources of information that can guide the decision process of an IMRL agent and positively impact its performance. These different sources of information emerged from the interaction of agents with several different environments and, as such, should be applicable in scenarios other than those of Section 3. This section investigates whether this is indeed so, i.e., whether the sources of information identified in Section 3 are also useful in other scenarios.

^18 Further experimental verification would be required to back up this claim; however, this is not within the scope of this paper.


More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

FEELING AS INTRINSIC REWARD

FEELING AS INTRINSIC REWARD FEELING AS INTRINSIC REWARD Bob Marinier John Laird 27 th Soar Workshop May 24, 2007 OVERVIEW What are feelings good for? Intuitively, feelings should serve as a reward signal that can be used by reinforcement

More information

PO Box 19015, Arlington, TX {ramirez, 5323 Harry Hines Boulevard, Dallas, TX

PO Box 19015, Arlington, TX {ramirez, 5323 Harry Hines Boulevard, Dallas, TX From: Proceedings of the Eleventh International FLAIRS Conference. Copyright 1998, AAAI (www.aaai.org). All rights reserved. A Sequence Building Approach to Pattern Discovery in Medical Data Jorge C. G.

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

I. INTRODUCTION /$ IEEE 70 IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 2, NO. 2, JUNE 2010

I. INTRODUCTION /$ IEEE 70 IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 2, NO. 2, JUNE 2010 70 IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, VOL. 2, NO. 2, JUNE 2010 Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective Satinder Singh, Richard L. Lewis, Andrew G. Barto,

More information

Choose an approach for your research problem

Choose an approach for your research problem Choose an approach for your research problem This course is about doing empirical research with experiments, so your general approach to research has already been chosen by your professor. It s important

More information

Commentary on The Erotetic Theory of Attention by Philipp Koralus. Sebastian Watzl

Commentary on The Erotetic Theory of Attention by Philipp Koralus. Sebastian Watzl Commentary on The Erotetic Theory of Attention by Philipp Koralus A. Introduction Sebastian Watzl The study of visual search is one of the experimental paradigms for the study of attention. Visual search

More information

UNIVERSITY OF CALIFORNIA SANTA CRUZ A STOCHASTIC DYNAMIC MODEL OF THE BEHAVIORAL ECOLOGY OF SOCIAL PLAY

UNIVERSITY OF CALIFORNIA SANTA CRUZ A STOCHASTIC DYNAMIC MODEL OF THE BEHAVIORAL ECOLOGY OF SOCIAL PLAY . UNIVERSITY OF CALIFORNIA SANTA CRUZ A STOCHASTIC DYNAMIC MODEL OF THE BEHAVIORAL ECOLOGY OF SOCIAL PLAY A dissertation submitted in partial satisfaction of the requirements for the degree of BACHELOR

More information

Dynamics of Color Category Formation and Boundaries

Dynamics of Color Category Formation and Boundaries Dynamics of Color Category Formation and Boundaries Stephanie Huette* Department of Psychology, University of Memphis, Memphis, TN Definition Dynamics of color boundaries is broadly the area that characterizes

More information

Lesson 6 Learning II Anders Lyhne Christensen, D6.05, INTRODUCTION TO AUTONOMOUS MOBILE ROBOTS

Lesson 6 Learning II Anders Lyhne Christensen, D6.05, INTRODUCTION TO AUTONOMOUS MOBILE ROBOTS Lesson 6 Learning II Anders Lyhne Christensen, D6.05, anders.christensen@iscte.pt INTRODUCTION TO AUTONOMOUS MOBILE ROBOTS First: Quick Background in Neural Nets Some of earliest work in neural networks

More information

On the diversity principle and local falsifiability

On the diversity principle and local falsifiability On the diversity principle and local falsifiability Uriel Feige October 22, 2012 1 Introduction This manuscript concerns the methodology of evaluating one particular aspect of TCS (theoretical computer

More information

Bayesian Reinforcement Learning

Bayesian Reinforcement Learning Bayesian Reinforcement Learning Rowan McAllister and Karolina Dziugaite MLG RCC 21 March 2013 Rowan McAllister and Karolina Dziugaite (MLG RCC) Bayesian Reinforcement Learning 21 March 2013 1 / 34 Outline

More information

Hoare Logic and Model Checking. LTL and CTL: a perspective. Learning outcomes. Model Checking Lecture 12: Loose ends

Hoare Logic and Model Checking. LTL and CTL: a perspective. Learning outcomes. Model Checking Lecture 12: Loose ends Learning outcomes Hoare Logic and Model Checking Model Checking Lecture 12: Loose ends Dominic Mulligan Based on previous slides by Alan Mycroft and Mike Gordon Programming, Logic, and Semantics Group

More information

Emotions of Living Creatures

Emotions of Living Creatures Robot Emotions Emotions of Living Creatures motivation system for complex organisms determine the behavioral reaction to environmental (often social) and internal events of major significance for the needs

More information

IN THE iterated prisoner s dilemma (IPD), two players

IN THE iterated prisoner s dilemma (IPD), two players IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 11, NO. 6, DECEMBER 2007 689 Multiple Choices and Reputation in Multiagent Interactions Siang Yew Chong, Member, IEEE, and Xin Yao, Fellow, IEEE Abstract

More information

AUTONOMOUS robots need to be able to adapt to

AUTONOMOUS robots need to be able to adapt to 1 Discovering Latent States for Model Learning: Applying Sensorimotor Contingencies Theory and Predictive Processing to Model Context Nikolas J. Hemion arxiv:1608.00359v1 [cs.ro] 1 Aug 2016 Abstract Autonomous

More information

EXECUTIVE SUMMARY 9. Executive Summary

EXECUTIVE SUMMARY 9. Executive Summary EXECUTIVE SUMMARY 9 Executive Summary Education affects people s lives in ways that go far beyond what can be measured by labour market earnings and economic growth. Important as they are, these social

More information

Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes

Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes Using Eligibility Traces to Find the est Memoryless Policy in Partially Observable Markov Decision Processes John Loch Department of Computer Science University of Colorado oulder, CO 80309-0430 loch@cs.colorado.edu

More information

Reinforcement learning and the brain: the problems we face all day. Reinforcement Learning in the brain

Reinforcement learning and the brain: the problems we face all day. Reinforcement Learning in the brain Reinforcement learning and the brain: the problems we face all day Reinforcement Learning in the brain Reading: Y Niv, Reinforcement learning in the brain, 2009. Decision making at all levels Reinforcement

More information

The Evolution of Cooperation: The Genetic Algorithm Applied to Three Normal- Form Games

The Evolution of Cooperation: The Genetic Algorithm Applied to Three Normal- Form Games The Evolution of Cooperation: The Genetic Algorithm Applied to Three Normal- Form Games Scott Cederberg P.O. Box 595 Stanford, CA 949 (65) 497-7776 (cederber@stanford.edu) Abstract The genetic algorithm

More information

The benefits of surprise in dynamic environments: from theory to practice

The benefits of surprise in dynamic environments: from theory to practice The benefits of surprise in dynamic environments: from theory to practice Emiliano Lorini 1 and Michele Piunti 1,2 1 ISTC - CNR, Rome, Italy 2 Università degli studi di Bologna - DEIS, Bologna, Italy {emiliano.lorini,michele.piunti}@istc.cnr.it

More information

ERA: Architectures for Inference

ERA: Architectures for Inference ERA: Architectures for Inference Dan Hammerstrom Electrical And Computer Engineering 7/28/09 1 Intelligent Computing In spite of the transistor bounty of Moore s law, there is a large class of problems

More information

CS343: Artificial Intelligence

CS343: Artificial Intelligence CS343: Artificial Intelligence Introduction: Part 2 Prof. Scott Niekum University of Texas at Austin [Based on slides created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All materials

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

Writing Reaction Papers Using the QuALMRI Framework

Writing Reaction Papers Using the QuALMRI Framework Writing Reaction Papers Using the QuALMRI Framework Modified from Organizing Scientific Thinking Using the QuALMRI Framework Written by Kevin Ochsner and modified by others. Based on a scheme devised by

More information

Irrationality in Game Theory

Irrationality in Game Theory Irrationality in Game Theory Yamin Htun Dec 9, 2005 Abstract The concepts in game theory have been evolving in such a way that existing theories are recasted to apply to problems that previously appeared

More information

Chapter 02 Developing and Evaluating Theories of Behavior

Chapter 02 Developing and Evaluating Theories of Behavior Chapter 02 Developing and Evaluating Theories of Behavior Multiple Choice Questions 1. A theory is a(n): A. plausible or scientifically acceptable, well-substantiated explanation of some aspect of the

More information

References. Christos A. Ioannou 2/37

References. Christos A. Ioannou 2/37 Prospect Theory References Tversky, A., and D. Kahneman: Judgement under Uncertainty: Heuristics and Biases, Science, 185 (1974), 1124-1131. Tversky, A., and D. Kahneman: Prospect Theory: An Analysis of

More information

RAPID: A Belief Convergence Strategy for Collaborating with Inconsistent Agents

RAPID: A Belief Convergence Strategy for Collaborating with Inconsistent Agents RAPID: A Belief Convergence Strategy for Collaborating with Inconsistent Agents Trevor Sarratt and Arnav Jhala University of California Santa Cruz {tsarratt, jhala}@soe.ucsc.edu Abstract Maintaining an

More information

The Role of Implicit Motives in Strategic Decision-Making: Computational Models of Motivated Learning and the Evolution of Motivated Agents

The Role of Implicit Motives in Strategic Decision-Making: Computational Models of Motivated Learning and the Evolution of Motivated Agents Games 2015, 6, 604-636; doi:10.3390/g6040604 Article OPEN ACCESS games ISSN 2073-4336 www.mdpi.com/journal/games The Role of Implicit Motives in Strategic Decision-Making: Computational Models of Motivated

More information

Revised Cochrane risk of bias tool for randomized trials (RoB 2.0) Additional considerations for cross-over trials

Revised Cochrane risk of bias tool for randomized trials (RoB 2.0) Additional considerations for cross-over trials Revised Cochrane risk of bias tool for randomized trials (RoB 2.0) Additional considerations for cross-over trials Edited by Julian PT Higgins on behalf of the RoB 2.0 working group on cross-over trials

More information

Study on perceptually-based fitting line-segments

Study on perceptually-based fitting line-segments Regeo. Geometric Reconstruction Group www.regeo.uji.es Technical Reports. Ref. 08/2014 Study on perceptually-based fitting line-segments Raquel Plumed, Pedro Company, Peter A.C. Varley Department of Mechanical

More information

Real-time computational attention model for dynamic scenes analysis

Real-time computational attention model for dynamic scenes analysis Computer Science Image and Interaction Laboratory Real-time computational attention model for dynamic scenes analysis Matthieu Perreira Da Silva Vincent Courboulay 19/04/2012 Photonics Europe 2012 Symposium,

More information

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination Timothy N. Rubin (trubin@uci.edu) Michael D. Lee (mdlee@uci.edu) Charles F. Chubb (cchubb@uci.edu) Department of Cognitive

More information

Motivation represents the reasons for people's actions, desires, and needs. Typically, this unit is described as a goal

Motivation represents the reasons for people's actions, desires, and needs. Typically, this unit is described as a goal Motivation What is motivation? Motivation represents the reasons for people's actions, desires, and needs. Reasons here implies some sort of desired end state Typically, this unit is described as a goal

More information

Assurance Cases for Model-based Development of Medical Devices. Anaheed Ayoub, BaekGyu Kim, Insup Lee, Oleg Sokolsky. Outline

Assurance Cases for Model-based Development of Medical Devices. Anaheed Ayoub, BaekGyu Kim, Insup Lee, Oleg Sokolsky. Outline Assurance Cases for Model-based Development of Medical Devices Anaheed Ayoub, BaekGyu Kim, Insup Lee, Oleg Sokolsky Outline Introduction State of the art in regulatory activities Evidence-based certification

More information

Competition Between Objective and Novelty Search on a Deceptive Task

Competition Between Objective and Novelty Search on a Deceptive Task Competition Between Objective and Novelty Search on a Deceptive Task Billy Evers and Michael Rubayo Abstract It has been proposed, and is now widely accepted within use of genetic algorithms that a directly

More information

Causation, the structural engineer, and the expert witness

Causation, the structural engineer, and the expert witness Causation, the structural engineer, and the expert witness This article discusses how expert witness services can be improved in construction disputes where the determination of the cause of structural

More information

Applying Appraisal Theories to Goal Directed Autonomy

Applying Appraisal Theories to Goal Directed Autonomy Applying Appraisal Theories to Goal Directed Autonomy Robert P. Marinier III, Michael van Lent, Randolph M. Jones Soar Technology, Inc. 3600 Green Court, Suite 600, Ann Arbor, MI 48105 {bob.marinier,vanlent,rjones}@soartech.com

More information

Further Properties of the Priority Rule

Further Properties of the Priority Rule Further Properties of the Priority Rule Michael Strevens Draft of July 2003 Abstract In Strevens (2003), I showed that science s priority system for distributing credit promotes an allocation of labor

More information

Intelligent Machines That Act Rationally. Hang Li Bytedance AI Lab

Intelligent Machines That Act Rationally. Hang Li Bytedance AI Lab Intelligent Machines That Act Rationally Hang Li Bytedance AI Lab Four Definitions of Artificial Intelligence Building intelligent machines (i.e., intelligent computers) Thinking humanly Acting humanly

More information

Decision Analysis. John M. Inadomi. Decision trees. Background. Key points Decision analysis is used to compare competing

Decision Analysis. John M. Inadomi. Decision trees. Background. Key points Decision analysis is used to compare competing 5 Decision Analysis John M. Inadomi Key points Decision analysis is used to compare competing strategies of management under conditions of uncertainty. Various methods may be employed to construct a decision

More information

Representing Problems (and Plans) Using Imagery

Representing Problems (and Plans) Using Imagery Representing Problems (and Plans) Using Imagery Samuel Wintermute University of Michigan 2260 Hayward St. Ann Arbor, MI 48109-2121 swinterm@umich.edu Abstract In many spatial problems, it can be difficult

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary Statistics and Results This file contains supplementary statistical information and a discussion of the interpretation of the belief effect on the basis of additional data. We also present

More information

Self-Corrective Autonomous Systems using Optimization Processes for Detection and Correction of Unexpected Error Conditions

Self-Corrective Autonomous Systems using Optimization Processes for Detection and Correction of Unexpected Error Conditions International Journal of Robotics and Automation (IJRA) Vol. 5, No. 4, December 2016, pp. 262~276 ISSN: 2089-4856 262 Self-Corrective Autonomous Systems using Optimization Processes for Detection and Correction

More information

LECTURE 5: REACTIVE AND HYBRID ARCHITECTURES

LECTURE 5: REACTIVE AND HYBRID ARCHITECTURES Reactive Architectures LECTURE 5: REACTIVE AND HYBRID ARCHITECTURES An Introduction to MultiAgent Systems http://www.csc.liv.ac.uk/~mjw/pubs/imas There are many unsolved (some would say insoluble) problems

More information

Scientific Minimalism and the Division of Moral Labor in Regulating Dual-Use Research Steven Dykstra

Scientific Minimalism and the Division of Moral Labor in Regulating Dual-Use Research Steven Dykstra 33 Scientific Minimalism and the Division of Moral Labor in Regulating Dual-Use Research Steven Dykstra Abstract: In this paper I examine the merits of a division of moral labor regulatory system for dual-use

More information

Reinforcement Learning in Steady-State Cellular Genetic Algorithms

Reinforcement Learning in Steady-State Cellular Genetic Algorithms Reinforcement Learning in Steady-State Cellular Genetic Algorithms Cin-Young Lee and Erik K. Antonsson Abstract A novel cellular genetic algorithm is developed to address the issues of good mate selection.

More information

LEARNING. Learning. Type of Learning Experiences Related Factors

LEARNING. Learning. Type of Learning Experiences Related Factors LEARNING DEFINITION: Learning can be defined as any relatively permanent change in behavior or modification in behavior or behavior potentials that occur as a result of practice or experience. According

More information

REPORT ON EMOTIONAL INTELLIGENCE QUESTIONNAIRE: GENERAL

REPORT ON EMOTIONAL INTELLIGENCE QUESTIONNAIRE: GENERAL REPORT ON EMOTIONAL INTELLIGENCE QUESTIONNAIRE: GENERAL Name: Email: Date: Sample Person sample@email.com IMPORTANT NOTE The descriptions of emotional intelligence the report contains are not absolute

More information

Reinforcement Learning : Theory and Practice - Programming Assignment 1

Reinforcement Learning : Theory and Practice - Programming Assignment 1 Reinforcement Learning : Theory and Practice - Programming Assignment 1 August 2016 Background It is well known in Game Theory that the game of Rock, Paper, Scissors has one and only one Nash Equilibrium.

More information

Remarks on Bayesian Control Charts

Remarks on Bayesian Control Charts Remarks on Bayesian Control Charts Amir Ahmadi-Javid * and Mohsen Ebadi Department of Industrial Engineering, Amirkabir University of Technology, Tehran, Iran * Corresponding author; email address: ahmadi_javid@aut.ac.ir

More information

A Monogenous MBO Approach to Satisfiability

A Monogenous MBO Approach to Satisfiability A Monogenous MBO Approach to Satisfiability Hussein A. Abbass University of New South Wales, School of Computer Science, UC, ADFA Campus, Northcott Drive, Canberra ACT, 2600, Australia, h.abbass@adfa.edu.au

More information

A Computational Framework for Concept Formation for a Situated Design Agent

A Computational Framework for Concept Formation for a Situated Design Agent A Computational Framework for Concept Formation for a Situated Design Agent John S Gero Key Centre of Design Computing and Cognition University of Sydney NSW 2006 Australia john@arch.usyd.edu.au and Haruyuki

More information

Finding Information Sources by Model Sharing in Open Multi-Agent Systems 1

Finding Information Sources by Model Sharing in Open Multi-Agent Systems 1 Finding Information Sources by Model Sharing in Open Multi-Agent Systems Jisun Park, K. Suzanne Barber The Laboratory for Intelligent Processes and Systems The University of Texas at Austin 20 E. 24 th

More information

COMP329 Robotics and Autonomous Systems Lecture 15: Agents and Intentions. Dr Terry R. Payne Department of Computer Science

COMP329 Robotics and Autonomous Systems Lecture 15: Agents and Intentions. Dr Terry R. Payne Department of Computer Science COMP329 Robotics and Autonomous Systems Lecture 15: Agents and Intentions Dr Terry R. Payne Department of Computer Science General control architecture Localisation Environment Model Local Map Position

More information

Encoding of Elements and Relations of Object Arrangements by Young Children

Encoding of Elements and Relations of Object Arrangements by Young Children Encoding of Elements and Relations of Object Arrangements by Young Children Leslee J. Martin (martin.1103@osu.edu) Department of Psychology & Center for Cognitive Science Ohio State University 216 Lazenby

More information

An Escalation Model of Consciousness

An Escalation Model of Consciousness Bailey!1 Ben Bailey Current Issues in Cognitive Science Mark Feinstein 2015-12-18 An Escalation Model of Consciousness Introduction The idea of consciousness has plagued humanity since its inception. Humans

More information

BEHAVIOR CHANGE THEORY

BEHAVIOR CHANGE THEORY BEHAVIOR CHANGE THEORY An introduction to a behavior change theory framework by the LIVE INCITE team This document is not a formal part of the LIVE INCITE Request for Tender and PCP. It may thus be used

More information

Constructivist Anticipatory Learning Mechanism (CALM) dealing with partially deterministic and partially observable environments

Constructivist Anticipatory Learning Mechanism (CALM) dealing with partially deterministic and partially observable environments Berthouze, L., Prince, C. G., Littman, M., Kozima, H., and Balkenius, C. (2007). Proceedings of the Seventh International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems.

More information

Memory-Augmented Active Deep Learning for Identifying Relations Between Distant Medical Concepts in Electroencephalography Reports

Memory-Augmented Active Deep Learning for Identifying Relations Between Distant Medical Concepts in Electroencephalography Reports Memory-Augmented Active Deep Learning for Identifying Relations Between Distant Medical Concepts in Electroencephalography Reports Ramon Maldonado, BS, Travis Goodwin, PhD Sanda M. Harabagiu, PhD The University

More information

Spatial Orientation Using Map Displays: A Model of the Influence of Target Location

Spatial Orientation Using Map Displays: A Model of the Influence of Target Location Gunzelmann, G., & Anderson, J. R. (2004). Spatial orientation using map displays: A model of the influence of target location. In K. Forbus, D. Gentner, and T. Regier (Eds.), Proceedings of the Twenty-Sixth

More information

Aliasing in XCS and the Consecutive State Problem : 1 - Effects

Aliasing in XCS and the Consecutive State Problem : 1 - Effects Aliasing in XCS and the Consecutive State Problem : - Effects Alwyn Barry Faculty of Computer Studies and Mathematics, University of the West of England, Coldharbour Lane, Bristol, BS6 QY, UK Email: Alwyn.Barry@uwe.ac.uk

More information

Partially-Observable Markov Decision Processes as Dynamical Causal Models. Finale Doshi-Velez NIPS Causality Workshop 2013

Partially-Observable Markov Decision Processes as Dynamical Causal Models. Finale Doshi-Velez NIPS Causality Workshop 2013 Partially-Observable Markov Decision Processes as Dynamical Causal Models Finale Doshi-Velez NIPS Causality Workshop 2013 The POMDP Mindset We poke the world (perform an action) Agent World The POMDP Mindset

More information

Abhimanyu Khan, Ronald Peeters. Cognitive hierarchies in adaptive play RM/12/007

Abhimanyu Khan, Ronald Peeters. Cognitive hierarchies in adaptive play RM/12/007 Abhimanyu Khan, Ronald Peeters Cognitive hierarchies in adaptive play RM/12/007 Cognitive hierarchies in adaptive play Abhimanyu Khan Ronald Peeters January 2012 Abstract Inspired by the behavior in repeated

More information

Reactive agents and perceptual ambiguity

Reactive agents and perceptual ambiguity Major theme: Robotic and computational models of interaction and cognition Reactive agents and perceptual ambiguity Michel van Dartel and Eric Postma IKAT, Universiteit Maastricht Abstract Situated and

More information

Omicron ACO. A New Ant Colony Optimization Algorithm

Omicron ACO. A New Ant Colony Optimization Algorithm Omicron ACO. A New Ant Colony Optimization Algorithm Osvaldo Gómez Universidad Nacional de Asunción Centro Nacional de Computación Asunción, Paraguay ogomez@cnc.una.py and Benjamín Barán Universidad Nacional

More information

Sexy Evolutionary Computation

Sexy Evolutionary Computation University of Coimbra Faculty of Science and Technology Department of Computer Sciences Final Report Sexy Evolutionary Computation José Carlos Clemente Neves jcneves@student.dei.uc.pt Advisor: Fernando

More information

THE IMPORTANCE OF MENTAL OPERATIONS IN FORMING NOTIONS

THE IMPORTANCE OF MENTAL OPERATIONS IN FORMING NOTIONS THE IMPORTANCE OF MENTAL OPERATIONS IN FORMING NOTIONS MARIA CONDOR MONICA CHIRA monica.chira@orange_ftgroup.com Abstract: In almost every moment of our existence we are engaged in a process of problem

More information

Implicit Information in Directionality of Verbal Probability Expressions

Implicit Information in Directionality of Verbal Probability Expressions Implicit Information in Directionality of Verbal Probability Expressions Hidehito Honda (hito@ky.hum.titech.ac.jp) Kimihiko Yamagishi (kimihiko@ky.hum.titech.ac.jp) Graduate School of Decision Science

More information

PART - A 1. Define Artificial Intelligence formulated by Haugeland. The exciting new effort to make computers think machines with minds in the full and literal sense. 2. Define Artificial Intelligence

More information

The optimism bias may support rational action

The optimism bias may support rational action The optimism bias may support rational action Falk Lieder, Sidharth Goel, Ronald Kwan, Thomas L. Griffiths University of California, Berkeley 1 Introduction People systematically overestimate the probability

More information

Oxford Foundation for Theoretical Neuroscience and Artificial Intelligence

Oxford Foundation for Theoretical Neuroscience and Artificial Intelligence Oxford Foundation for Theoretical Neuroscience and Artificial Intelligence Oxford Foundation for Theoretical Neuroscience and Artificial Intelligence For over two millennia, philosophers and scientists

More information

1. Before starting the second session, quickly examine total on short form BDI; note

1. Before starting the second session, quickly examine total on short form BDI; note SESSION #2: 10 1. Before starting the second session, quickly examine total on short form BDI; note increase or decrease. Recall that rating a core complaint was discussed earlier. For the purpose of continuity,

More information

BIOLOGY. The range and suitability of the work submitted

BIOLOGY. The range and suitability of the work submitted Overall grade boundaries BIOLOGY Grade: E D C B A Mark range: 0-7 8-15 16-22 23-28 29-36 The range and suitability of the work submitted In this session essays were submitted in a wide range of appropriate

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information