A Dual Process Reinforcement Learning Account for Sequential Decision Making and Skill Learning


A Dual Process Reinforcement Learning Account for Sequential Decision Making and Skill Learning

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering by Research

by

Tejas Savalia

International Institute of Information Technology, Hyderabad, INDIA

July 2018

International Institute of Information Technology, Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "A Dual Process Reinforcement Learning Account for Sequential Decision Making and Skill Learning" by Tejas Savalia, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Advisor: Prof. Bapi Raju Surampudi
                                            Cognitive Science Lab
                                            Kohli Center on Intelligent Systems
                                            IIIT, Hyderabad

Copyright © Tejas Savalia, 2017
All Rights Reserved

"What I cannot create, I do not understand."
Richard P. Feynman

Acknowledgments

I would like to express my sincerest gratitude to my advisor, Prof. Bapi Raju, who was always available to provide direction in a field I entered as a novice. His course on Introduction to Cognitive Science, and his interdisciplinary knowledge of Reinforcement Learning in particular, came as a boon and allowed me to settle into this fascinating field with ease. The journey would not have been possible without the lengthy critical discussions and exchanges of ideas we often had, even after his shift to the University of Hyderabad. A relaxed environment and the motivation to pursue my own ideas suited me best, and Prof. Bapi Raju generously provided both. I learned a lot from him and intend to carry this approach into my future endeavours. I would also like to thank the Department of Science and Technology for funding this research under an Indo-French project titled "Basal Ganglia at Large".

I would like to thank a few special friends I made here. Viral and Utsav, for keeping my Gujju spirit alive. Those late-night talks (or shouts) and the systematic bashing of the universe in general gave the perfect mental massage after a stressful (or not so stressful) day. I would be so much more mature if not for you guys, and I am forever in your debt for not letting that happen. Abhishek, for those early cricket mornings and for being an epitome of simplicity. Friends in my lab, Pramod, Shruti, Ambika, Sneha, Remya, Snigdha, Anuj, Snehil, Gautam and a few others, created an environment which made the lab feel less like a corporate office and more like one that inspired ideas and critical views, a research workspace I can only hope to find anywhere else.

Finally, I would like to thank my parents and my brother who, against all their instincts, supported my less-known research endeavour. While they might not understand what exactly I do, they do not hesitate to put their trust in my judgments and continue to support me in my future goal to stay in the field of science against all social odds. I would be nowhere without them.

Abstract

It has been extensively argued that there are two distinct processes governing behaviour: Goal-Directed and Habitual. There have been multiple attempts at establishing a mechanism that governs which behaviour should be in charge of making sequential decisions leading to a distant reward. This thesis is organized in three parts. First, we explore a simple task of action selection in a rather complex biophysical model of the Basal Ganglia. We implement a probabilistic inference task and show that the model learns to establish a representation of reward contingencies. We then implement a model of Parkinson's disease for patients with and without medication, as an exploration of how the disease and its drug might act in the Basal Ganglia.

Next, we move on to complex sequential tasks and present a framework that unifies three distinct dichotomies: Goal-Directed versus Habitual behaviour, Explicit versus Implicit Learning, and Model-Based versus Model-Free Reinforcement Learning. This framework suggests a hierarchical organization of the mechanisms of these dichotomies in two forms. First, we suggest that the Goal-Directed controller plays a dominant role in behaviour, which is then taken over by habitual behaviour across multiple trials as the chunk size increases. Second, we suggest that the most granular actions are executed habitually whereas more abstract, higher-level actions are goal-directed. We suggest that goal-directed behaviour occurs with the engagement of attention, as opposed to habitual behaviour, which is more automatic, thereby linking the dichotomy to Explicit versus Implicit learning. We then present the idea that acquisition and execution can be organized in opposite directions to each other, with attention playing the role of the switch.

The final part is a computational implementation of the unified theory presented previously. We show that a hierarchical organization of Goal-Directed and Habitual behaviour, implemented using Model-Based and Model-Free Reinforcement Learning, fares well against the respective pure forms, and that such an organization is suitable for explaining motor skill acquisition. We present a possible functional network of interacting brain areas in the frontal cortex, with the final values of states and actions represented in the striatum, following which the Basal Ganglia model explored earlier could be used to perform action selection. We implement the Parkinson's model in simulations of a grid world and suggest a qualitative parallel with the observed literature.

Contents

1 Introduction
    1.1 Problem Statement and Motivation for this Thesis
    1.2 History: Psychology
        Associative Learning: Edward Thorndike, Ivan Pavlov and B.F. Skinner
        Associations to Maps: Edward Tolman
        Outcomes, Goals and Habits: Adams, Dickinson and Balleine
    1.3 History: Modelling
        Hebbian, Rescorla-Wagner and Schultz
        Need for a Dual System
        Related Work on Dual Process Models: Daw and Gläscher
    1.4 Outline of the Thesis

2 Basal Ganglia Action Selection using Nengo and Implications of Parkinson's Disease
    2.1 Detour: Exploring Nengo
        Representation
        Transformation
        Dynamics
    2.2 Action Selection Model of the Basal Ganglia
        Basal Ganglia Overview
        Model
    2.3 The Probabilistic Inference Task
    2.4 Simulating Parkinson's Disease in the Probabilistic Inference Task

3 A Unified Theoretical Framework for Cognitive Sequencing
    3.1 Introduction
        Habitual versus Goal-Directed Behavior
        Model-Free versus Model-Based Paradigm
        Implicit and Explicit Learning
    3.2 Computational Equivalents
        Goal-Directed Behavior as Model-Based Mechanism
        Habitual Behaviors as Model-Free System
    3.3 Unified Theoretical Framework
    3.4 Role of Response-to-Stimulus Interval (RSI) and Prior Information in the Unified Framework
    3.5 Comparison with Other Dual System Theories
    3.6 Conclusion

4 Dual Process Account to Explain Skill Acquisition and Implications of Parkinson's Disease
    Introduction: Mapping Sequential Decision Making to Skill Learning
    Model Description
        Two Mechanisms
        Model-Based RL
        Model-Free RL
        Parametric Exploration
        Arbitration
    Simulation of Parkinson's Disease
    Comparison with Other Arbitration Mechanisms
    A Possible Neural Architecture
    Conclusion

5 Conclusion and Future Work

Publications

List of Figures

1.1 Response Times for insightful versus trial-and-error responses [Reproduced from Thorndike (1911)]
2.1 Anatomy of Basal Ganglia
2.2 Connecting Pathways within BG nuclei. Please see the main text for abbreviations
2.3 BG Network Diagram in Nengo
2.4 Activation in D1 and D2 Striatal MSNs for each action
2.5 Neuron population spiking for each action in different BG areas for the above activation
2.6 Japanese Hiragana character images shown during training along with their reward percentages. For instance, image A was rewarded 80% of the time; the subject was trained to select between characters A and B (20% rewarding) [Figure taken from Frank et al. (2004)]
2.7 PD ON patients were less efficient at avoiding less rewarding choices whereas PD OFF patients were less efficient at selecting more rewarding choices [Figure taken from Frank et al. (2004)]
2.8 Simulation Results
3.1 Actor-critic architecture for learning and execution. Input from the environment passes to the goal-directed mechanism to select a chunk. Action selection at the upper level is enabled by engagement of attention in a goal-directed, model-based manner, whereas at a lower level (without attentional engagement) this process implements the habitual, model-free system. Action selection within a chunk occurs on a habitual, model-free basis. Neural correlates of the various components of this framework are suggested. Adapted from Botvinick (2008). Abbreviations: vmPFC, ventromedial prefrontal cortex; DLS, dorsolateral striatum; VS, ventral striatum; DA, dopaminergic error signal; HT+, hypothalamus; DLPFC, dorsolateral prefrontal cortex; OFC, orbitofrontal cortex; MTL, medial temporal lobe
3.2 Role of the temporal window in engagement / disengagement of attention during learning and execution. The left panel refers to sequence execution (performance), where the flow is from top to bottom and attention gets gradually disengaged going down the hierarchy. The right panel shows the acquisition (learning) of sequences, where the flow is from bottom to top and attention gets gradually engaged going up the hierarchy. The temporal window determines when to switch between the two mechanisms. For example, for an action worth 1 unit of time, with a temporal window size of 5 units and an RSI of 3 units, a two-action chunk would lead to attention engagement / disengagement. A smaller RSI would require more actions to be chunked together to engage / disengage attention towards the underlying task

Figures of Chapter 4:
- Model-Based Tree
- Model-Free Lookup Table
- Comparison for depth search limit of 3 states
- Comparison for depth search limit of 4 states
- Comparison for depth search limit of 3 states, grid size increased to 10x
- Comparison for depth search limit of 4 states, grid size 10x
- Comparison for depth search limit of 3 states with transition probability
- Comparison for depth search limit of 3 states, reduced learning rate
- Estimating the values of any state via backtracking search
- Takes fewer steps than a pure Model-Free mechanism
- Time to reach the goal state; takes less time than a purely goal-directed search
- Response time decreases exponentially with practice
- Traversal time decreases with practice; the behavior goes from goal-directed to habitual in terms of time to reach the goal
- Model-Based actions at successive iterations (three figures)
- Model-Free actions at successive iterations (three figures)
- Value function at successive iterations (three figures)
- Higher stay probability of an overtrained agent; representative finding
- Stay probabilities when the environment is changed at different iteration points in execution
- Relative dominance for Controls (non-diseased) for achieving a reward
- Relative dominance of each mechanism for the Parkinson's simulation to receive a reward
- Doubled iterations to achieve an average reward
- A possible neural architecture calculating values to be represented by the striatum

List of Tables

2.1 Reward Schemes

Chapter 1

Introduction

The typical rationality assumption for decision making suggests that humans (or animals) make decisions with the aim of maximizing their expected utility based on their respective versions of the environment. Such an assumption of utilitarian behavior (Kahneman & Tversky, 2013) has canonically driven wide-scale research in decision making, from modeling simplistic decisions of selecting one choice among several presented, to complex decisions pertaining to investments in stocks in financial markets. Appropriately, this assumption has also driven research in Artificial Intelligence (Sutton, 1988) and in psychology and neuroscience (Schultz et al., 1997; Conway & Gawronski, 2013).

Making decisions in a deterministic environment is simple: you observe a reward once and you keep taking the action that led you to that reward. Such a design is a poor representation of the real world, where most environments are arguably non-deterministic; stock markets are hardly ever non-volatile. The task then becomes one of taking actions that give the best average reward over multiple iterations. A heuristic that aids decision making in an uncertain environment would be supported by estimating the uncertainties that define the environment dynamics, making it essential to learn a model of the environment. Diving further into the real world, it is more often the case that rewards are not instantaneous, tailored to each action we take. Rather, we need to take a sequence of actions to complete a task, which then leads to a reward. For instance, it is taking a sequence of turns to reach the source of potable water that gives us the eventual satisfaction of quenching our thirst, rather than taking an individual turn. Learning a model in this case is not limited to learning the action-reward uncertainties; it needs to include learning a map-like representation of the environment. It is in this kind of sequential setting that we define the problem this thesis addresses.

1.1 Problem Statement and Motivation for this Thesis

A sequential decision making problem can be formally stated over a set of states S = {s_1, s_2, s_3, ..., s_n} and a set of actions that can be taken in these states, A = {a_1, a_2, a_3, ..., a_k}, leading to corresponding next states in S, such that a sequence of states and actions leads to a reward. Formally,

s_i → a_j → s_i' → a_j' → ... → r    (1.1)

Decision making here lies in deciding which action to take at each state in order to receive the delayed reward. Solving this problem via modeling has typically followed two levels of detail. A more biophysical model is inclusive of the known functional and structural anatomy of the brain, designing differential equations to replicate neuron firing as observed in detailed neurophysiological recordings. Another approach usually followed is drawing mathematical models to replicate behavior with little regard to actual brain mechanics. In this thesis, we start by exploring simplistic decision making under uncertainty using a detailed biophysical model of the Basal Ganglia. The main focus of this thesis, however, is on producing behavior pertaining to sequential decision making set in a skill learning environment. This rather complex task is designed in an abstract environment aimed at replicating behavior while drawing parallels from neurophysiology. Accordingly, we first present a theoretical framework unifying two prominent theories in Artificial Intelligence, suggesting a possible way to arrange their execution in order to explain behavior. In the final part, we then converge on a domain that is extremely well suited for such sequential-decision skill learning. We present our simulations in an abstract environment and argue that the arrangement suggested in this thesis is perhaps more in line with what intelligent beings follow.

The theory presented in this thesis is motivated by the aim not to stop at describing intelligent behavior in a learned environment, but also to develop a system that better models the way intelligent beings learn, which is often not optimal. A (very) long-term goal of this work would be to develop an intelligent system that not only passes the Turing test limited to natural language conversations, but is also indistinguishable from humans in the way it develops intelligence.

1.2 History: Psychology

Associative Learning: Edward Thorndike, Ivan Pavlov and B.F. Skinner

It was in the late 1800s and the early 1900s that an American psychologist, Edward Thorndike, suggested the law of effect, marking the beginning of associative learning in the form of Stimulus-Response (S-R) associations. Thorndike's law of effect states that if an association leads to a satisfying (rewarding) outcome, it will be strengthened, and if it is followed by an unsatisfying (punishing) outcome, it will be weakened. An important aspect of Thorndike's studies was the situation a subject was in that led to the response under observation. In the classical experiment of training a cat to escape a puzzle box by pulling a lever, Thorndike emphasized the fact that the cat will not press a lever if it is not in the puzzle box, implying that any response is context-specific.

Another point Thorndike emphasized was that the response-time curves (Figure 1.1) (Thorndike, 1911) show that the animals learn by performing actions, in a trial-and-error manner, rather than by thinking and developing insights into the environment and its functioning.

Figure 1.1: Response Times for insightful versus trial-and-error responses [Reproduced from Thorndike (1911)]

Edward Thorndike's work on the learning process led to the famous theory of Connectionism and kick-started what we now call Reinforcement Learning.

Another set of experiments exploring instrumental conditioning was conducted by B.F. Skinner. Skinner's work on Radical Behaviorism went on to treat an animal's private events, such as thoughts and emotions, as following the same reinforcing principles as overt behavior. In a series of experiments, pigeons in an operant conditioning chamber were delivered a food reward regardless of their movement. It was observed that the subjects tended to repeat the movements they had made before receiving the reward, thereby indicating a superstition about the set of actions needed to be performed to receive the reward (Skinner, 1990, 1992).

Ivan Pavlov, a contemporary of Edward Thorndike, famously established classical conditioning in dogs, showing a response towards a neutral stimulus by pairing the neutral stimulus with a rewarding one (Pavlov & Anrep, 2003) [1]. Pavlov's experiments on classical conditioning, along with Thorndike's and Skinner's experiments on operant conditioning, dominated the field of behavioral psychology for the major part of the early twentieth century.

Associations to Maps: Edward Tolman

The studies described above focused mainly on assigning Stimulus-Response (S-R) associations. A response in the context of an observed stimulus leading to the reward will cause that association to strengthen. While exploring an environment, the subject engages in trial-and-error learning to establish the most rewarding stimulus-response associations for all stimuli. Those studies specifically do not address the animal's ability to learn the dynamics of the environment in which it is executing its actions.

[1] The dog was given food after a bell was rung in the initial stages of the experiment. Even though the bell had nothing to do with the dog receiving the food reward, it was observed that the dog started salivating when the bell was rung, before the actual food was delivered.

In his landmark paper in 1948, Edward C. Tolman suggested a Cognitive Map theory through a set of experiments on rats solving a maze to receive a reward (Tolman, 1948). Rats were put in a maze, by navigating which they would receive a reward in the form of food. The rat was first allowed to explore the maze. After this initial exploration, it was placed at one arm of the cross-shaped maze and conditioned to turn right to receive a reward. After such conditioning, the rat was placed at a different position in the maze and allowed to go to the food. Tolman observed that even after changing the starting position of the rat to a different arm, rather than taking the action it had been conditioned to, the rat took the correct action allowing it to reach the food reward. This indicates that the rat, during its earlier exploration, forms a cognitive map of the environment, and understanding the dynamics allows it to reach the reward even after the environment changes. Had the rat relied on reward-based Stimulus-Response conditioning, it would have taken the same action to which it was conditioned. Tolman called this behavior "thinking of acts", wherein rats (and humans) thought ahead of where their actions would take them and made an optimal decision to receive the reward.

Outcomes, Goals and Habits: Adams, Dickinson and Balleine

In 1981, in their landmark paper, Adams and Dickinson tested the performance of rats' lever pressing under outcome devaluation. More specifically, the rats were initially trained to receive a sucrose reward by pressing a lever when it was presented. After a point in training, the sucrose reward was paired with illness, making the reward less attractive. The prominent theory around that time was that such a lever-pressing response was established by stimulus-response association: the rat saw a lever (stimulus) and pressed it (response), and this episode was reinforced by reward (Adams & Dickinson, 1981). This experiment, however, suggested that learning was much more elaborate, and that the animals also learned an encoding based on Response-Outcome associations.

This experiment led to further investigation of how influential the Response-Outcome and Stimulus-Response associations were over the training period. In further experiments, Dickinson found that the animals were much more reliant on Response-Outcome behavior in the initial phases of training and much more reliant on Stimulus-Response actions when the training was more extensive. This was indicated by a relative insensitivity to the devalued outcome when behavior turned habitual due to overtraining (Dickinson, 1985; Dickinson & Balleine, 1994). Further, these effects were also found in human subjects (E. Tricomi et al., 2009). Balleine & O'Doherty (2010) then went on to investigate the differential roles of cortico-striatal networks in these two types of behavior, now coined Goal-Directed and Habitual behavior.

1.3 History: Modelling

Hebbian, Rescorla-Wagner and Schultz

One of the earliest algorithms to explain associative learning was brought in by Donald Hebb in his book titled The Organization of Behavior (Hebb et al., 1949). This rule, in simplistic terms, stated that "neurons that fire together, wire together." Hebb emphasized that this rule applies when there is a definite causality between the pre-synaptic and post-synaptic neurons.

One of the first mathematical learning rules to explain classical conditioning in an experiment such as Pavlov's came from Rescorla and Wagner (Rescorla et al., 1972). Formally, the Rescorla-Wagner rule is stated as follows:

ΔV_x^{n+1} = α_x β_1 (λ_1 − V_total^n)    (1.2)

and

V_x^{n+1} = V_x^n + ΔV_x^{n+1}    (1.3)

where ΔV_x^{n+1} is the change in associative strength (V) of the conditioned stimulus (CS_x) as a result of pairing with the unconditioned stimulus (US_1) at trial n + 1. α_x is the strength of CS_x, β_1 is the strength of US_1 [2], λ_1 is the maximum strength of US_1, and V_total^n is the sum of the associative strengths of all CSes. The associative strength of a stimulus x is thus updated with the term calculated in equation 1.2.

[2] In the classical experiment by Pavlov on the dog, the US is the food and the CS is the bell.

The Rescorla-Wagner model was then modified by Wolfram Schultz (in collaboration with Dayan and Montague) in his landmark paper aimed at modelling the dopamine firing associated with the Ventral Tegmental Area (VTA) and the Substantia Nigra during classical conditioning (Schultz et al., 1997). This model was inspired by the Reinforcement Learning literature (Sutton, 1988), from which they followed two key assumptions: first, the goal of a learning mechanism is to use sensory cues to predict the discounted future reward V(t) over all possible future states; and second, the states are Markovian, i.e., all information required to make a decision to reach the next state is available in the current state as determined by the sensory cues. Thus the value of a given state at time t is:

V(t) = E[r(t) + γ V(t + 1)]    (1.4)

where r(t) is the immediate reward received in the current state, E denotes the (statistical) expectation operator, and γ is the discount factor that determines the relative weights of current versus future reward estimates. An error in the estimated predictions can be defined with information available at successive time steps as given below.

δ(t) = r(t) + γ V(t + 1) − V(t)    (1.5)
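As a minimal, purely illustrative sketch of the two updates above, the snippet below applies the Rescorla-Wagner rule (equations 1.2-1.3) to a Pavlovian pairing and then computes the error of equation 1.5. All parameter values (alpha, beta, lambda, gamma, the 0.1 learning rate) are placeholder assumptions, not values used in the thesis.

```python
import numpy as np

# Rescorla-Wagner update (Eqs. 1.2-1.3) with assumed parameter values
alpha_x, beta_1, lambda_1 = 0.3, 0.5, 1.0
V = {"bell": 0.0}                       # associative strength of the conditioned stimulus
for trial in range(50):                 # the bell is repeatedly paired with food (the US)
    V_total = sum(V.values())
    dV = alpha_x * beta_1 * (lambda_1 - V_total)
    V["bell"] += dV                     # V approaches lambda_1 over trials

# Prediction error of Eq. 1.5 over a short run of state-value estimates
gamma = 0.9
values = np.array([0.2, 0.5, 1.0])      # illustrative estimates V(t), V(t+1), ...
rewards = np.array([0.0, 0.0, 1.0])     # r(t) received at each step
for t in range(len(values) - 1):
    delta = rewards[t] + gamma * values[t + 1] - values[t]
    values[t] += 0.1 * delta            # standard correction of V(t) using the error
```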

The term δ(t) is called the Temporal Difference (TD) error. This error is used to better predict the expected reward across trials. Schultz's model using the Reinforcement Learning framework was a landmark in the sense that it provided a mechanism for making sequential decisions when reward is not immediate. It successfully explained the dopamine firing (and dips) in the VTA and Substantia Nigra during classical conditioning and marked the beginning of the use of reinforcement learning as a tool for computational modelling in neuroscience.

Need for a Dual System

The models described above specifically aim at establishing and learning Stimulus-Response associations based on the rewards experienced. However, it has been well established that pure Stimulus-Response (S-R) behavior is not what an animal exhibits. From the earlier works by Tolman to more recent findings by Balleine and colleagues, humans and rodents have also been shown to follow Response-Outcome (R-O) behavior. Individually, each of these behaviors is not sufficiently optimal: S-R associations in a complex environment need time to establish and strengthen, and actions based on R-O associations typically need time to execute. Additionally, the need to develop a model inclusive of both these behaviors arises from the need to understand decision making in a complex environment as it occurs in the brain. Accordingly, there have been a few attempts at developing a model that is inclusive of both mechanisms, as explored in the following section.

It should be noted that S-R and R-O behaviors have been recoined as Habitual and Goal-Directed behaviors, respectively. This comes from the intuition that a Stimulus-Response association is similar to getting a sensory input and making a decision based on previous experiences without thinking of rewards, as is often seen in habitual behaviors. On the other hand, R-O associations determine actions based on consideration of the nature of the future outcome one desires; ergo the term Goal-Directed behavior.

Related Work on Dual Process Models: Daw and Gläscher

Daw et al. (2005) proposed an uncertainty-based competition between the two distinct processes to select the driver process. Specifically, both the Goal-Directed and Habitual mechanisms operate in parallel, each with a measure of uncertainty on the actions they propose. The action to execute is the one proposed by the least uncertain mechanism. Gläscher et al. (2010) discuss state prediction errors in goal-directed learning. An animal is first made to learn the environment by exploration without specific feedback signals. Neural recordings confirm the existence of error signals indicating a representation of the difference between the expected and the actual state based on sensory cues. The authors then define a mechanism where the value of each action is a weighted combination of the values calculated by each controller. The weight is adjusted over trials to account for changes in the relative control of each mechanism.

The models stated above, and a few others, address decision making in simplistic two-step environments. While they provide important insights into behavior, they do not explore more complex environments. The model proposed in this thesis suggests a more plausible system, attempting to explain the dual process framework for skill acquisition in more extensive environments, along with qualitative implications of Parkinson's disease for skill learning behavior as often observed.

1.4 Outline of the Thesis

The chapters following this introduction are arranged as follows.

2. Canonical model of action selection in a large-scale model of the basal ganglia. This chapter explores a detailed biophysical model of the Basal Ganglia. The model delves deeper into the functionality of the individual brain areas in the Basal Ganglia, particularly those involved in motor action selection, in a simplistic selection-under-uncertainty task limited to two choices. We then apply a model of Parkinson's disease to this action selection model of the Basal Ganglia.

3. A unified theoretical framework establishing a theory of combining two reinforcement learning theories for complex sequential decision making. In this chapter, we move on from the simple two-choice decision making task to developing a theory for a general cognitive sequencing problem. We suggest a hierarchical arrangement of two distinct processes involved in the formation and execution of sequences.

4. Implementation of the above theory in a skill learning setting using abstract models. We then concretize a part of our earlier theory by implementing it in a grid-world setting. We test our results qualitatively against observed data in a skill learning setting and test the Parkinson's model used in the earlier chapter.

5. Conclusion and future directions.

Chapter 2

Basal Ganglia Action Selection using Nengo and Implications of Parkinson's Disease

In this chapter we explore a detailed biophysical model involving individual brain areas. The model is implemented on a spiking neural simulator, Nengo. The model learns a simplistic task of selecting the more rewarding action among two choices in a stochastic environment. We use a simple error-clamping model to simulate the Parkinsonian syndrome in action selection and test this model on a probabilistic inference task.

2.1 Detour: Exploring Nengo

The Nengo modelling API is based on the Neural Engineering Framework (NEF) developed by Chris Eliasmith (Eliasmith & Anderson, 2004). In general, this framework uses populations of spiking neuron models to implement higher-order functions. The NEF is based on the following three principles.

Representation

A representation problem is formulated as a mapping between the value to be represented, in the form of a vector x, and the neuron activity a. Every neuron i has an encoding vector e_i. This vector forms a preferred direction vector for the neuron: the direction for which the neuron will fire most strongly. If G is the neuron's non-linearity, α_i is a gain parameter and b_i is the constant background current, the neural activity of neuron i is calculated as:

a_i = G(α_i ⟨e_i · x⟩ + b_i)    (2.1)

This neural activity can be decoded using a linear decoder d_i, a set of weights which maps the activity back to an estimate x̂, as follows:

x̂ = Σ_i a_i d_i    (2.2)
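As a minimal illustration of this representation principle, the sketch below (using the standard Nengo Python API) builds a 100-neuron ensemble that encodes a scalar signal and decodes an estimate of it; Nengo solves for the decoders d_i internally. The particular input signal and ensemble size are illustrative choices, not the thesis's model.

```python
import numpy as np
import nengo

model = nengo.Network(label="representation demo")
with model:
    stim = nengo.Node(lambda t: np.sin(2 * np.pi * t))   # the value x(t) to be represented
    ens = nengo.Ensemble(n_neurons=100, dimensions=1)    # population of spiking neurons
    nengo.Connection(stim, ens)                          # encode x into spiking activity (Eq. 2.1)
    probe = nengo.Probe(ens, synapse=0.01)               # record the decoded estimate (Eq. 2.2)

with nengo.Simulator(model) as sim:
    sim.run(1.0)

decoded = sim.data[probe]    # x-hat(t), a filtered approximation of sin(2*pi*t)
```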

The linear decoders can be found using least-squares minimization procedures.

Transformation

The NEF allows one to define transformations with the same mechanism used to define functions in any scripting environment. Typically, a transformation is a neural representation of a function of a variable represented using a population of neurons. This allows arbitrary functions to be implemented and represented by a population of neurons.

Dynamics

While representation and transformation would be sufficient to execute all the functions required in this work, dynamics help in applying control theory to the neurons executing these functions by treating the transformed variables as system state variables. This allows representations to encapsulate system dynamics and modern control theory to be applied.

2.2 Action Selection Model of the Basal Ganglia

Basal Ganglia Overview

The Basal Ganglia (BG) comprise a set of subcortical nuclei located at the base of the forebrain. These nuclei have afferent and efferent connections to the cerebral cortex, brain stem and thalamus, along with several other areas. Figure 2.1 shows the general anatomy of the BG nuclei.

Figure 2.1: Anatomy of Basal Ganglia

A more functional scheme of the Basal Ganglia pathways can be seen in Figure 2.2.

Figure 2.2: Connecting Pathways within BG nuclei. Please see the main text for abbreviations.

The thalamus (Th) receives activation signals from the output nucleus of the BG: the internal segment of the Globus Pallidus (GPi) forms the output of the basal ganglia. The direct pathway goes from the Striatum (Caudate/Putamen) to GPi via a GABAergic inhibitory projection, and GPi projects to Th via a GABAergic inhibitory projection as well. Activation of the striatum therefore causes inhibition of GPi, which in turn releases the inhibition of Th. The indirect pathway makes a GABAergic detour from the striatum to the external segment of the Globus Pallidus (GPe), which makes an inhibitory GABAergic projection to the STN, leading to a glutamatergic excitatory projection to GPi. Thus, activation of the indirect pathway causes inhibition of GPe, dis-inhibition of STN, excitation of GPi and hence the final inhibition of Th.

Such a push-pull mechanism has typically been used to model action selection. The striatum represents the expected values of each action. These expected values are adjusted via dopaminergic error correction mechanisms from the Substantia Nigra Pars Compacta. Assuming that the correct actions have been learned, a correct action will have a higher value represented in the striatum, thereby removing the inhibition of GPi on the thalamus, which thus projects excitatory signals to the cortex. A lower-valued action in the striatum will instead follow the indirect pathway and project an inhibitory (less excitatory) connection via the thalamus, leading the cortex not to select that action (Frank, 2005; Frank et al., 2004, 2007).

Model

The action selection model presented in this section was developed by Balasubramani et al. (2015). Correspondingly, the Nengo implementation of the model involves 100 neurons in each ensemble implemented by the spiking neural simulator. Mathematically, the model is described below. The following equations of the model are taken from Balasubramani et al. (2015), and this canonical model is further adapted in the thesis for a probabilistic learning (inference) task.

The striatum is assumed to contain two types of Medium Spiny Neurons (MSNs). The weight update equations for the striatal D1 and D2 pathways are:

Δw_D1(s_t, a_t) = η_D1 λ_D1^Str(δ(t)) x    (2.3)

Δw_D2(s_t, a_t) = η_D2 λ_D2^Str(δ(t)) x    (2.4)

Here the λs are sigmoid functions over the error for each type of MSN. The dopamine receptors (D1R) on the MSNs receive cortico-striatal connections whose weight is denoted by w_D1. The activity of such an MSN is denoted by:

y_D1 = w_D1 x  and  Q = y_D1    (2.5)

The error accounting for the immediate reward is:

δ(t) = r − Q_t(s_t, a_t)    (2.6)

The STN-GPe system (STN: Subthalamic Nucleus; GPe: external segment of the Globus Pallidus) is assumed to be connected bidirectionally, and each nucleus is assumed to have lateral connections. The number of neurons in STN and GPe is taken to be equal to the number of possible actions. The dynamics of this STN-GPe system are:

τ_s dx_i^STN/dt = −x_i^STN + Σ_{j=1}^{n} W_ij^STN y_j^STN − x_i^GPe    (2.7)

y_i^STN = tanh(λ_STN x_i^STN)    (2.8)

τ_g dx_i^GPe/dt = −x_i^GPe + Σ_{j=1}^{n} W_ij^GPe x_j^GPe + y_i^STN − x_i^IP    (2.9)

Here x_i^GPe is the internal state of the i-th neuron in GPe and x_i^STN is the internal state of the i-th neuron in STN. W^GPe denotes the lateral connections within GPe, equated to a small negative number ε_g for both self (i = j) and non-self (i ≠ j) connections for every GPe neuron. W^STN denotes the lateral connections within STN, equated to a small positive number ε_s for non-self (i ≠ j) and 1 + ε_s for self (i = j) connections for every STN neuron.

The striatal output towards the Direct Pathway (DP) and Indirect Pathway (IP) is modeled by assuming that D1R MSNs project to the DP. The contribution of the DP to GPi is given by:

x_i^DP = α_D1 λ_D1^GPi(δ_U(t)) y_D1(s_t, a_t)    (2.10)

Similarly, the GPe is modeled by assuming that D2R MSNs project to the IP. The contribution of the IP to GPe is thus:

x_i^IP = α_D2 λ_D2^GPe(δ_U(t)) y_D2(s_t, a_t)    (2.11)

where the responses of the various kinds of MSNs are denoted by the variable y:

y_D1(s_t, a_t) = w_D1(s_t, a_t) x    (2.12)

y_D2(s_t, a_t) = w_D2(s_t, a_t) x    (2.13)

and the λ(δ)s are the respective sigmoid functions over δ. Here, δ_U is the temporal difference of utility. If δ_U is high, the action at time t has a higher utility than that at time t − 1. This enables the DP (also known as the Go pathway), which exploits by selecting the same action a_t. On the other hand, if δ_U is low, the IP (also known as the NoGo pathway) is selected, facilitating the action taken at time t − 1, because the action at time t − 1 has a higher utility than the action at time t. If δ_U is somewhere in between the high and low levels, a random choice is made by the IP.

Each action neuron in GPi (internal segment of the Globus Pallidus) is modeled to combine the contributions of the DP and IP:

x^GPi = x^DP + W^{STN→GPi} y^STN    (2.14)

W^{STN→GPi} represents the relative weight of the STN projections to GPi compared to that of the DP projections. The direct and indirect pathways are combined downstream in GPi, or further along in the thalamus, which receives afferents from GPi. The GPi neurons project to the thalamus over inhibitory connections:

x^Th = x^DP − W^{STN→GPi} y^STN    (2.15)

These afferents activate the thalamus (Th) neurons as follows:

dy_i^Th/dt = −y_i^Th + x_i^Th    (2.16)

where y_i^Th is the state of the i-th thalamic neuron. These neurons are represented as vectors and compared to a threshold marking a particular choice.
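To make the dynamics above concrete, the following is a purely illustrative forward-Euler integration of equations 2.7-2.9 and 2.14-2.16 for a two-action choice. All parameter values (time constants, lateral weights, gains, pathway drives, step size) are assumptions made for the sketch; the thesis's model is implemented with spiking ensembles in Nengo rather than in this rate-based form.

```python
import numpy as np

n_actions = 2
tau_s, tau_g, dt = 10.0, 10.0, 0.1        # time constants and step size (assumed)
eps_s, eps_g, lam_stn = 0.1, -0.1, 3.0    # lateral weights and STN gain (assumed)
w_stn_gpi = 0.5                           # relative weight of the STN -> GPi projection (assumed)

W_stn = eps_s * np.ones((n_actions, n_actions)) + np.eye(n_actions)  # 1 + eps_s on the diagonal
W_gpe = eps_g * np.ones((n_actions, n_actions))                      # eps_g everywhere

x_dp = np.array([0.6, 0.1])               # direct-pathway drive from D1 MSNs (assumed)
x_ip = np.array([0.1, 0.6])               # indirect-pathway drive from D2 MSNs (assumed)

x_stn = np.zeros(n_actions)
x_gpe = np.zeros(n_actions)
y_th = np.zeros(n_actions)

for step in range(5000):
    y_stn = np.tanh(lam_stn * x_stn)                              # Eq. 2.8
    dx_stn = (-x_stn + W_stn @ y_stn - x_gpe) / tau_s             # Eq. 2.7
    dx_gpe = (-x_gpe + W_gpe @ x_gpe + y_stn - x_ip) / tau_g      # Eq. 2.9
    x_stn += dt * dx_stn
    x_gpe += dt * dx_gpe
    x_th = x_dp - w_stn_gpi * y_stn                               # GPi/thalamic input, Eqs. 2.14-2.15
    y_th += dt * (-y_th + x_th)                                   # thalamic integration, Eq. 2.16
    if y_th.max() > 0.5:                                          # threshold marking a choice
        break

print("selected action:", int(np.argmax(y_th)))
```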

The BG model as implemented in the Nengo neural simulator is shown in Figure 2.3; Figures 2.4 and 2.5 depict the simulation results.

Figure 2.3: BG Network Diagram in Nengo

Figure 2.4: Activation in D1 and D2 Striatal MSNs for each action

Figure 2.5: Neuron population spiking for each action in different BG areas for the above activation

Figure 2.4 shows the weight setting after 100 trials of learning in the action selection model. These weights are modified via the temporal difference error said to be represented in the Substantia Nigra Pars Compacta (SNc). The error signal projects to the striatum and modifies the previously stored value functions for each action by adding to (or subtracting from) the D1 and D2 pathway weights. The training objective was to learn to select the better of two actions, where the first action was rewarded 80% of the time and the other action was rewarded 20% of the time. The value function in the D1 MSNs is much higher for the first action than for the second, whereas the value in the D2 MSNs is much higher for the second action. This indicates that the first action is pushed by the Go pathway whereas the second action is inhibited by the NoGo pathway.

Figure 2.5 shows the activity of the ensembles representing parts of the BG. The indirect pathway releases the GPe inhibition for the first action and suppresses the second action. GABAergic projections to STN reverse these roles. Glutamatergic projections from STN to GPi, along with direct-pathway GABAergic projections to GPi, converge to suppress GPi activity for the first action and increase GPi activity for the second action. This releases inhibition in the thalamus for the first action and inhibits the second action. A decision is made when the activity representing one of the actions in the thalamus rises beyond a threshold (0.5); thus the first action is selected.

This model therefore selects the action that has higher activation in the D1 striatal MSNs and rejects the action that has higher activation in the D2 striatal MSNs, both of which are learned via error signals from the SNc. It forms a canonical action-selection model of the BG on which the tasks that follow are implemented; the learning and selection loop is sketched in abstract form below. The interneuron ensembles implement intermediate functions required by the model described above. This canonical model of the BG was taken from Balasubramani et al. (2015).
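The following abstract, non-spiking Python sketch caricatures that loop: D1 ("Go") and D2 ("NoGo") weights are nudged by the dopaminergic error δ, and the action with the strongest Go-minus-NoGo drive is chosen. The learning rule and all parameters are simplified stand-ins for equations 2.3-2.6, not the spiking Nengo implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 2
reward_prob = np.array([0.8, 0.2])        # first action rewarded 80% of trials, second 20%
eta_d1, eta_d2, epsilon = 0.1, 0.1, 0.1   # learning rates and exploration rate (assumed)
w_d1 = np.zeros(n_actions)                # Go-pathway (D1) weights, value-like
w_d2 = np.zeros(n_actions)                # NoGo-pathway (D2) weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for trial in range(100):
    if rng.random() < epsilon:                          # occasional exploration
        action = int(rng.integers(n_actions))
    else:                                               # crude stand-in for the BG competition
        action = int(np.argmax(w_d1 - w_d2))
    reward = 1.0 if rng.random() < reward_prob[action] else -1.0
    delta = reward - w_d1[action]                       # dopaminergic prediction error (cf. Eq. 2.6)
    w_d1[action] += eta_d1 * sigmoid(delta) * delta     # D1 weights grow mainly on positive errors
    w_d2[action] -= eta_d2 * sigmoid(-delta) * delta    # D2 weights grow mainly on negative errors

print(w_d1, w_d2)   # after training, the Go weights favour the 80%-rewarded action
```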

2.3 The Probabilistic Inference Task

The action selection model described above learned to select the action that gave the maximum utility over time. Typically, actions were rewarded probabilistically: one action was rewarded 80% of the time and the other 20% of the time. The model successfully learned to select the first action, thereby selecting the action that maximizes the reward expectation. One interesting task built on this probabilistic action selection is to test whether the model is able to infer among actions with different expected utilities. Figure 2.6 shows the probabilistic inference task as explained in Frank et al. (2004).

Figure 2.6: Japanese Hiragana character images shown during training along with their reward percentages. For instance, image A was rewarded 80% of the time; the subject was trained to select between characters A and B (20% rewarding). [Figure taken from Frank et al. (2004)]

Subjects were trained on pairs of Japanese Hiragana characters with reward contingencies as shown in Figure 2.6. During the testing phase, image A was paired with images C, D, and F. Similarly, image B was paired with images C, D, E and F. Optimal behavior would require the subjects to select A in the first case and avoid B in the second case. This experiment was conducted on two groups of participants with Parkinson's disease: one with L-DOPA medication on (PD ON) and one with the medication off (PD OFF). It was observed that the participants with medication on were better at choosing A than at avoiding B, and the participants with medication off were better at avoiding B than at choosing A. This finding was reported by Frank et al. (2004), as shown in Figure 2.7.

Figure 2.7: PD ON patients were less efficient at avoiding less rewarding choices whereas PD OFF patients were less efficient at selecting more rewarding choices. [Figure taken from Frank et al. (2004)]

It is important to note here that the medication degrades performance on avoiding non-rewarding choices to below that of the non-medicated patients, producing a double dissociation between the two cases. On the other hand, age-matched healthy participants (labelled as Seniors) exhibit similar accuracy in choosing versus avoiding decisions.

2.4 Simulating Parkinson's Disease in the Probabilistic Inference Task

This inference task was implemented in the action selection model of the BG described earlier in this chapter. Specifically, six actions with reward probabilities matching those in the experiment were modeled as six action inputs to the striatal MSNs. These actions were rewarded as per Table 2.1.

Choice   Reward   Reward Probability   Punishment   Punishment Probability
A        1        80%                  -1           20%
B        1        20%                  -1           80%
C        1        70%                  -1           30%
D        1        30%                  -1           70%
E        1        60%                  -1           40%
F        1        40%                  -1           60%

Table 2.1: Reward Schemes

Parkinson's disease was implemented using a simple error-clamping model as described in Balasubramani et al. (2015). The model is implemented as follows.

Dopamine (DA) levels are lower in Parkinson's patients. This feature is implemented by clamping the error δ to an upper bound δ_lim. Some amount of DA is indeed produced by the SNc in Parkinson's patients as well, but the maximum amount of DA found in the striatum is lower than in controls. This leads to the rule:

if δ > δ_lim, then δ = δ_lim    (2.17)

PD medication by L-DOPA (DA agonists) increases DA activity. This is simulated by adding a fixed constant (δ_med) to the pre-existing clamped δ:

δ := δ + δ_med    (2.18)

This altered δ, representing the medicated condition, is then used for simulating the corresponding test cases for PD ON patients. Thus the general error representation is given by the following set of equations:

δ(t) ∈ [a, b]                  for Controls
δ(t) ∈ [a, δ_lim]              for PD OFF    (2.19)
δ(t) ∈ [a, δ_lim + δ_med]      for PD ON

where δ_lim + δ_med is less than b.
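A compact sketch of this clamping rule (equations 2.17-2.19) is shown below; the numerical values of δ_lim and δ_med are placeholders, not the values used in the simulations.

```python
def clamp_error(delta, condition, delta_lim=0.3, delta_med=0.2):
    """Return the dopaminergic error under the error-clamping model of Parkinson's disease."""
    if condition in ("PD_OFF", "PD_ON") and delta > delta_lim:
        delta = delta_lim              # reduced DA ceiling in Parkinson's disease (Eq. 2.17)
    if condition == "PD_ON":
        delta = delta + delta_med      # L-DOPA adds a fixed amount of DA activity (Eq. 2.18)
    return delta                       # controls pass through unchanged (Eq. 2.19)
```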

Implementing this Parkinson's model in the action selection model of the BG for the probabilistic inference task, and checking the efficiency of choosing a rewarding action and of avoiding a punishing action, produced the curves for the PD ON and PD OFF simulations shown in Figure 2.8. The results shown are over 100 simulations. Each simulation was trained for 100 trials of action selection followed by reward. Testing on the inference task (for instance, selecting between the 70% and the 80% rewarding actions) was done after each trial in the training phase, giving 100 test trials. After each simulation, the D1 and D2 weights representing the striatal value function were reset to 0.

Figure 2.8: Simulation Results

While we could not obtain the double dissociation reported in the original paper, some interesting parallels can be drawn from these results. First, the Parkinson's simulations fare worse than the no-disease condition, a finding consistent with the experimental study, where PD patients without medication were slow to reach accuracies at par with the controls. We thus argue, in the absence of experimental evidence on accuracies over all trials, that the model simulating Parkinson's disease is a good qualitative representation of the disease. Second, while the model simulating patients on medication did not produce the double dissociation, it did concur that medication makes PD patients more efficient at choosing the rewarding stimulus than at avoiding the punishing stimulus. The model thus meets the halfway mark; it needs further investigation to produce the double dissociation observed empirically.

It is also perhaps interesting to note that while the earlier work of Frank et al. (2004) shows a significant double dissociation between selecting a rewarding and rejecting a punishing stimulus for patients with medication on and off, in later work by Frank et al. (2007) the double dissociation is not significant, while maintaining that patients with medication on were indeed better at selecting the rewarding stimulus than at rejecting the punishing one.

In this chapter we explored a detailed biophysical model of action selection in a simple, single-step decision making environment. We implemented this model using spiking neurons via the Nengo neural simulator and showed that it successfully performs a probabilistic inference task. This work gave significant insights into the mechanisms of Parkinson's disease, especially through the error-clamping model. The Parkinson's model presented in this chapter is also used to simulate Parkinson's disease in an environment aimed at modelling skill learning, as explored in the final chapter. However, as mentioned earlier, the real world is not made of single-stage action selection; rather, complex problems require decision making at multiple levels, with each level dependent on its predecessor. We thus explore a more complex environment, typically requiring sequential decisions to reach a rewarding state. The next chapter presents a theory of how the two aspects of a dichotomy (Goal-Directed and Habitual behavior) can be combined for cognitive sequencing. We now move to more abstract modeling of more complex tasks, compared to the detailed neural models explored in this chapter.

Chapter 3

A Unified Theoretical Framework for Cognitive Sequencing

The capacity to sequence information is central to human performance. Sequencing ability forms the foundation stone for higher-order cognition related to language and goal-directed planning. The order of items, their timing, chunking and hierarchical organization are important aspects of sequencing. Past research on sequencing has emphasized two distinct and independent dichotomies: implicit versus explicit, and goal-directed versus habits. We propose a theoretical framework unifying these two streams. Our proposal relies on the brain's ability to implicitly extract statistical regularities from the stream of stimuli and, with attentional engagement, to organize sequences explicitly and hierarchically. Similarly, sequences that need to be assembled purposively to accomplish a goal require engagement of attentional processes. With repetition, these goal-directed plans become habits with concomitant disengagement of attention. Thus attention and awareness play a crucial role in the implicit-to-explicit transition as well as in how goal-directed plans become automatic habits. Cortico-subcortical loops (basal ganglia-frontal cortex and hippocampus-frontal cortex) mediate the transition process. We show how the computational principles of the model-free and model-based learning paradigms, along with a pivotal role for attention and awareness, offer a unifying framework for these two dichotomies. Based on this framework, we make testable predictions related to the potential influence of the response-to-stimulus interval (RSI) on developing awareness in implicit learning tasks.

3.1 Introduction

Cognitive sequencing can be viewed as the ability to perceive, represent and execute a set of actions that follow a particular order. This ability underlies vast areas of human activity including statistical learning, artificial grammar learning, skill learning, planning, problem solving, speech and language. Many human behaviors, ranging from walking to complex decision making in chess, involve sequence processing (Clegg et al., 1998; Bapi et al., 2005). Such sequencing ability often involves processing repeating patterns: learning while perceiving the recurrent stimuli or actions, and executing accordingly. Sequencing behavior has been studied in two contrasting paradigms, goal-directed and habitual, or under the popular rubric of response-outcome (R-O) and stimulus-response (S-R) behavior.

A similar dichotomy exists on the computational side under the alias of model-based versus model-free mechanisms. The model-based versus model-free computational paradigm has proved vital in designing algorithms for planning and learning in various intelligent system architectures, leading to the proposal of their involvement in human behavior as well. In this article, we use another dichotomy on the learning side, implicit versus explicit, along with a pivotal role for attention and awareness, to connect these dichotomies and suggest a unified theoretical framework targeted towards sequence acquisition and execution. In the following, the three dichotomies will be described along with a summary of their known neural bases.

Habitual versus Goal-Directed Behavior

The existence of a combination of habitual and goal-directed behaviors has been shown in empirical studies on rats and humans. In the experiments studying these behaviors, two phenomena have been used to differentiate them: outcome devaluation, sensitivity to devaluation of the goal [1], and contingency degradation, sensitivity to an omission schedule [2] (H. Yin et al., 2004). Outcome devaluation is achieved by satiating the rats on the rewarding goal, making the reward less appealing, whereas contingency degradation is achieved by omitting a stimulus within a sequence of stimuli leading to the goal. Results demonstrate that overtrained rats and humans seem to be insensitive to both phenomena. That is, even though the outcome of following a path is devalued or a stimulus in the sequence is omitted, habits lead rats to follow the same path, thus relating overtraining to habitual or stimulus-response (S-R) control (Adams & Dickinson, 1981; Killcross & Coutureau, 2003). On the other hand, moderately trained rats have little or no difficulty adapting to the new schedule, relating this behavior to goal-directed or response-outcome (R-O) control (Dickinson, 1985; Balleine & Dickinson, 1998; E. M. Tricomi et al., 2004; Balleine & O'Doherty, 2010; Dolan & Dayan, 2013).

[1] Rats were satiated to make the reward at the end of the maze less appealing.
[2] Omit the reward completely for a few lever presses after a certain period of training.

Based on this proposal of two contrasting mechanisms, quite a few notable neuroimaging studies have attempted to establish the neural substrates related to the two modes of control. fMRI studies related to outcome devaluation point to two sub-areas of the ventromedial prefrontal cortex (vmPFC), the medial orbitofrontal cortex (OFC) and the medial prefrontal cortex (PFC), as well as one of the target areas of the vmPFC structures in the human striatum, namely the anterior caudate nucleus, as being involved in goal-directed actions (Valentin et al., 2007; Balleine & O'Doherty, 2010). Studies aimed at finding the neural substrate for habitual behavior suggest an involvement of a subcortical structure, the dorsolateral striatum (Hikosaka et al., 1999; H. H. Yin & Knowlton, 2006; E. Tricomi et al., 2009).

Model-Free versus Model-Based Paradigm

In order to understand the learning process in the goal-directed and habitual paradigms, two contrasting computational theories have been proposed. Goal-directed learning and control processes have been related to a model-based system (Doya et al., 2002; Khamassi & Humphries, 2012), whereas the habitual paradigm has been related to a model-free system (Dolan & Dayan, 2013). Typically, a goal-directed system uses its view of the environment to evaluate its current state and possible future states, selecting an action that yields the highest reward. A model-based mechanism conceives of this as building a search tree leading toward goal states. Such a system can be viewed as using past experiences to understand the environment and using this view of the environment to predict the future.

In contrast, habitual behavior can be viewed as a repetition of action sequences based on past experience. Acquisition of a habit can be viewed as learning a skill by correcting the residual error in the action sequence leading to the goal. Typically, skills are acquired over time, are sensitive to the effectors used for acquisition (Bapi et al., 2000), and there seems to be a network of brain areas involved at the various stages of skill learning (Hikosaka et al., 1999). A model-free system conceives of the residual error as a sort of prediction error called the temporal difference (TD) error (Sutton, 1988). Dopamine has been noted to be a key neurotransmitter involved in encoding the prediction error, leading to the view that dopamine plays a crucial role in habit learning (Schultz et al., 1997). The influence of dopamine is in no way limited to habitual behaviors. The role of dopamine in functions mediated by the prefrontal cortex such as working memory, the observation that there are wide dopaminergic projections to both the caudate and the putamen, and studies in which manipulating dopamine levels in the prefrontal cortex affects goal-directed behavior, all indicate its involvement in the goal-directed mechanism as well (see Dolan & Dayan (2013) for a review).

Further work has been directed at establishing a connection between the dichotomies: the behavioral dichotomy of goal-directed versus habitual and the computational dichotomy of model-based versus model-free. For example, Daw et al. (2005) suggested an uncertainty-based competition between the two behaviors, with the less uncertain process at any point acting as the driver process. Another interesting aspect of such a combination comes from the hierarchical reinforcement learning (HRL) framework (Sutton & Barto, 1998; Sutton et al., 1999; Botvinick et al., 2009; Botvinick, 2012). An important aspect of the acquisition of sequences is the formation of chunks among the sequence of stimuli. The striatum has been emphasized to be involved in the chunking process, with the chunks then selected and scaled by the output circuits of the basal ganglia (Graybiel, 1998).

Implicit and Explicit Learning

For the past three decades, there has been significant interest in sequence learning. A large body of experimental evidence suggests that there is an important distinction between implicit and explicit learning (see, for example, the Howard & Howard (1992) and Shea et al. (2001) studies). Howard & Howard (1992) used a typical serial reaction time (SRT) task (Nissen & Bullemer, 1987), where the stimuli appeared in one of four locations on the screen and a key was associated with each location. The participants' job was to press the corresponding key as soon as possible. The stimuli were presented in a sequential order. With practice, participants exhibited response time benefits by responding faster to the sequence.
However, their performance dropped to chance level when they were asked to predict the next possible location of the stimulus, suggesting that the participants might have learned the sequence of responses in an implicit manner and thus could not predict the next move explicitly.

Another example is the study by Shea et al. (2001), where participants were given a standing platform task and were asked to mimic the movements of a line presented on a screen. The order of the stimuli was designed in such a way that the middle segment was always fixed whereas the first and the last segments varied, but participants were not told about the stimulus order. It was found that performance on the middle segment improved over time. During the recognition phase, participants failed to recognize the repeated segment, pointing to the possibility that they may have acquired it via implicit learning. A recent study done with a variant of the SRT task, called the oculomotor serial response time (SORT) task, also suggests that a motor sequence can be learned implicitly in the saccadic system (Kinder et al., 2008) and does not pose attentional demands when the SORT task is performed under a dual-task condition (Shukla, 2012).

Another way of differentiating implicit versus explicit learning is to see whether an explicit instruction about the presence of a sequence was given prior to the task. An instruction specifying the presence of a sequence in the task would, in turn, drive attentional learning. Without such explicit prior knowledge, however, it may take more trials for the subjects to become aware of the presence of a sequence, requiring them to engage their attention towards the sequence and turning the concomitant learning and execution explicit.

Apart from these studies, there is a large body of clinical literature which confirms the distinction between implicit and explicit learning. Most of the clinical evidence comes from the artificial grammar learning (AGL) paradigm, where patients learned to decide whether a string of letters followed grammatical rules or not. Healthy participants were found to learn to categorize grammatical and ungrammatical strings without being able to verbalize the grammatical rules. Evidence from amnesic patients points toward implicit learning being intact in patients even though their explicit learning was severely impaired (Knowlton et al., 1992; Knowlton & Squire, 1996; Van Tilborg et al., 2011; Gooding et al., 2000). Willingham et al. (2002) suggested that activation in the left prefrontal cortex was a prerequisite for such awareness, along with activation of the anterior striatum (Jueptner, Frith, et al., 1997; Jueptner, Stephan, et al., 1997). Results of the positron emission tomography (PET) study of Grafton et al. (1995), in which participants performed a motor sequence learning task under implicit or explicit learning conditions, suggest that the motor cortex and supplementary motor areas were activated for implicit learning, whereas the right premotor cortex, the dorsolateral cingulate, the anterior cingulate, the parietal cortex and also the lateral temporal cortex were associated with explicit procedural memories (Destrebecqz et al., 2005; Gazzaniga, 2004). It has been established that the brain areas involved in working memory and attentional processing are more active during explicit learning than during implicit learning. Further, the findings of functional magnetic resonance imaging (fMRI) studies suggest that the prefrontal and anterior cingulate cortex and early visual areas are involved in both implicit and explicit learning (Aizenstein et al., 2004).
However, there is greater prefrontal activation in the case of explicit than implicit processing, which is consistent with findings from the attention literature suggesting that prefrontal activation is associated with controlled and effortful processing (Aizenstein et al., 2004). Nevertheless, the neural bases of implicit

and explicit learning are still inconclusive. For example, Schendan et al. (2003) used fMRI to differentiate brain activation involved in implicit and explicit processing. Their finding suggests that the same brain areas are activated in both types of processing. More specifically, the medial temporal lobe (MTL) is involved in both implicit and explicit learning when a higher-order sequence was given to the participants. Furthermore, Pammi et al. (2012) observed a shift in fronto-parietal activation from anterior to posterior areas during complex sequence learning, indicating a shift in control of sequence reproduction with the help of a chunking mechanism.

In this section we discussed the three dichotomies that have stayed mostly distinct in the literature. While there have been many significant attempts at combining goal-directed behavior with a model-based mechanism and habitual behavior with a model-free mechanism, we attempt to add the third, implicit versus explicit, dichotomy to devise a unifying framework explaining both learning and execution.

3.2 Computational Equivalents

In this section we present how explicit learning and goal-directed behavior can be related to a model-based mechanism whereas implicit learning and habitual behavior can be related to a model-free system. Indeed, there have been previous attempts at bringing together the contrasting paradigms (Doya et al., 2002; Daw et al., 2005; Dezfouli & Balleine, 2012; Dolan & Dayan, 2013; Dezfouli et al., 2014; Cushman & Morris, 2015).

Goal Directed behavior as Model-Based mechanism

A goal-directed behavior can be viewed as keeping the end-point (goal) in mind and selecting the ensuing actions accordingly. This kind of learning and control can be explained by solving a simple Markov decision process (MDP) (Puterman, 2014) using model-based reinforcement learning. Typically, an agent estimates its environment and calculates the value of its current state and possible future states. This estimation can be described by the Bellman equations:

V(s_n) = \sum_{s_{n+1}} T(s_n, a, s_{n+1}) \left[ R(s_n, a, s_{n+1}) + \gamma V(s_{n+1}) \right]    (3.1)

Here, T(s_n, a, s_{n+1}) is the transition probability: the probability of the agent landing in state s_{n+1} if it takes an action a from state s_n. R(s_n, a, s_{n+1}) denotes the reward the agent gets on taking an action a from state s_n and landing in state s_{n+1}, and \gamma is the discount factor enabling the agent to select higher-reward-giving actions first. Such a system can be viewed as building a search tree and looking forward into the future, estimating values of the future states and selecting the maximum-valued one (denoted as V(s_n)).

V(s_n) = \max_a \sum_{s_{n+1}} T(s_n, a, s_{n+1}) \left[ R(s_n, a, s_{n+1}) + \gamma V(s_{n+1}) \right]    (3.2)
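As a concrete illustration, the following minimal Python sketch applies the backup of equations (3.1) and (3.2) to a toy two-state, two-action MDP. The transition and reward tables here are made-up placeholders, not parameters from any experiment in this thesis; the point is only that repeated backups over a known model yield the state values used for forward planning.

import numpy as np

# Toy model: T[s, a, s'] and R[s, a, s'] are illustrative placeholders.
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0          # transitions into state 1 yield reward 1
gamma = 0.9               # discount factor

def bellman_backup(V):
    # One sweep of equation (3.2): V(s) <- max_a sum_s' T[s,a,s'] * (R[s,a,s'] + gamma * V[s'])
    return np.array([
        max(np.sum(T[s, a] * (R[s, a] + gamma * V)) for a in range(T.shape[1]))
        for s in range(T.shape[0])
    ])

V = np.zeros(2)
for _ in range(50):       # repeated backups converge to the optimal value function
    V = bellman_backup(V)
print(V)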

This kind of behavior requires the agent to know the transition probabilities along with the reward function. While these values are not directly available in the environment, the agent builds them gradually over time while interacting with the environment, learning the transition probabilities iteratively. These values, in effect, form a model of the environment, hence the name. Rewards in all the future states are estimated by propagating back the rewards from the final goal state. The transition probabilities can initially be equal for all actions and subsequently get refined with experience.

Model-based RL, thus, takes the current state s and estimates a value function V(s) by building a depth-first search tree up to a certain depth to determine the value of the next states the agent can be in. It uses the learned transition probability function T(s, a, s') to develop a map of the state space in which it looks forward to estimate a state that will give it the maximum cumulative reward R. Once the agent picks an action a, it lands in a state s' and it may receive a reward. The reward would then indicate how good a state was. The agent would compare its original belief of the state's salience, as represented by V(s), and it will update V(s) based on how the new observation differed from its original estimate. The agent also updates T(s, a, s') in a similar fashion, using the error between the state in which the agent landed and the state in which the agent expected to land for that action.

This sort of mechanism can be used to understand how animals act in a goal-directed fashion. The animal determines its current situation using various sensory cues from the environment. It then thinks through the consequences of the actions it could take and determines what it would observe if it takes a particular sequence of actions. The animal maintains a cognitive map in which it would look forward and pick an action that will land it in a situation which is the most salient. After taking that action, the animal records how good the new sensory cues were as compared to how good it expected them to be. Based on this error in expectation, it updates its belief about this particular set of sensory cues for future use. It also similarly updates its belief about its version of the environment's map based on the error in expectation of the sensory cues the animal now receives when compared to what it thought it would. The state s in MB RL represents the set of sensory cues. The transition probability function T(s, a, s') represents the cognitive map the animal maintains. The value function V(s) in MB RL determines how salient these sensory cues are, and the reward R determines the actual reward the animal would get.

Habitual Behaviors as Model-Free System

Habitual behavior can be viewed as a typical S-R behavior, where the end-goal does not influence the current action selection directly. Instead, previous experiences of being in a particular state are cached (Daw et al., 2005). This can be conveyed by a model-free system through the well-established temporal difference (TD) learning. TD learning follows the update rules in equations (3.3) and (3.4):

p_k = R(s_n, a, s_{n+1}) + \gamma V(s_{n+1})    (3.3)

38 V (s n ) = (1 α)v (s n ) + (α)p k V (s n ) = V (s n ) + α(v (s n ) p k ) (3.4) Here α is the learning rate. p k encodes a sample evaluation of state s n when the agent enters state s n for the k th time in the form of a sum of two terms the first term indicating reward the agent would receive from the current state and the second term computing discounted value of the next state s n+1 that it would enter, the value being returned is the agent s version of the value function V ( ). The last term of equation (4) refers to the prediction error signal. The definition of terms such as R( ) and V ( ) are as defined earlier in Bellman equations. This system can be viewed as looking into the past making a small adjustment to optimize performance and taking the next action. There is no explicit model of the system, the agent learns on-line learning while performing. The analogy of the terms drawn in the earlier section between goal-directed behavior and modelbased RL holds for habitual behavior as well. The state, s is the sensory cue, the function V (s) is the salience of that state and R is the reward in that state. The major difference between these two is that the agent does not explicitly update the transition probability function, T (s, a, s ). It does not look forward to the consequences of its actions but uses the historical experiences to read a sensory cue and directly take an action leading to another salient sensory cue. It may be argued that habits imply there is no more learning and hence using the construct of Model- Free Learning may not be appropriate in depicting execution of habits. We argue that no-morelearning implies that the sequence of actions that are taken in each state are perfect; in other words, the temporal difference error on taking these actions is now zero, stopping the MF RL system from learning any further. The way actions are taken using the MF RL framework thus accounts for nonlearning once the habits are established. The neuroscientific analogy to this would be that once the habit is learned, there is no more phasic dopamine firing which is supposed to indicate the TD error. While there are studies indicating a shift in activity from the Dorso-Medial Striatum (DMS) to Dorso-Lateral Striatum(DLS) as habits develop (Redgrave et al., 2010), we are not aware of a study that explicitly tests for phasic projections from the Substantia Nigra or the Ventral Tagmental Area known to produce dopamine to DLS as habits concretize. Our proposed architecture attempts a combination of the two contrasting paradigms. We suggest that implicit learning and control can be viewed in a similar way as habitual behavior and in turn both can be modeled using a model-free computational system. Similarly, explicit learning and control seem to have similar requirements as goal-directed behavior and in turn both can be understood as using a modelbased computational system. We aim to exploit the hierarchical reinforcement learning architecture and chunking phenomenon to propose how these contrasting dichotomies can be combined into a unified framework in the next section. It should be noted that in an RL setting, both the systems use history learned by experiences. Here, we mean habits as an automated behavior where we don t need to maintain a model of the environment and plan ahead habitual actions require maintaining history of the current state experiences in the 27

previous iterations, and the actions are based on that history. The goal-directed mechanism, on the other hand, requires maintaining an explicit model of the system which will then be used to plan into the future.

3.3 Unified theoretical framework

In a model-based mechanism, which searches a tree of possible states, the search tree starts expanding exponentially as one looks further ahead into the future, making such a search computationally infeasible. In the case of a model-free mechanism, by contrast, the system has to be in the exact same state as it was before to enable an update in its policy. To make such an update account for something substantial, there have to be enough samples of a particular state, which might take a larger number of trials exploring the entire state space. The respective inefficiencies of the individual systems (Keramati et al., 2011) and the evidence of the existence of both as part of a continuum allow us to formulate a hybrid scheme combining both computational mechanisms to explain sequence acquisition and execution in the brain. In an attempt to formulate a unifying computational theory, we add the learning dimension: implicit learning is conceived as model-free learning whereas explicit learning is conceived as model-based.

One idea in computational theories that suits our needs is the hierarchical reinforcement learning (HRL) framework (Sutton & Barto, 1998; Sutton et al., 1999). The HRL framework gives an additional power to the agent: the ability to select an option along with a primitive action for the next step. An option is a set of sequential actions (a motor program) leading to a subgoal. The agent is allowed to have separate policies for different options; on selection of an option, the agent follows that option's policy, irrespective of what the external policy for the primitive actions is. We propose that learning within an option, that is, the policy over the primitive actions within an option, occurs in a model-free way. The most granular set of actions a human performs is learned implicitly, by a habitual mechanism. As one moves to learning of a less granular set of actions the roles start to change: habitual, model-free learning gradually transforms into goal-directed learning. At some point, one becomes aware of the recurring patterns being experienced, and the attentional processes thereafter enable a shift from implicit to explicit learning. Indeed, in the serial reaction time studies, it has been observed that as the subjects became aware of the recurring pattern or sequence, their learning might have moved from an implicit to an explicit state. We attribute this conversion to explicit learning to the formation of explicit motor programs or chunks. Chunks are formed when the subject becomes aware of the sequential pattern, and implicit, model-free learning then turns into an explicit and model-based learning process. One interesting theory that can be used to explain the chunking process is the average reward Reinforcement Learning model (Dezfouli & Balleine, 2012).

As depicted in Figure 3.1, a similar analogy can be applied to control or performance of sequences, with a change in direction in the process described above. The most abstract, top-level goals are executed explicitly, in a goal-directed way, using a model-based mechanism, the goal-directed mechanism gradually relinquishing control as the type of actions proceeds downward in the hierarchy. At the finest chunk level, the

subject loses awareness of the most primitive actions executed, and those are then executed entirely in a habitual, model-free, implicit manner. In general, we state that the way a model-free RL system is implemented can be leveraged to explain how humans undergo implicit learning. For instance, in the implicitly learned sequence in the SRT task, the participants are not aware of the existence of a sequence and yet the response time profiles improve. This is analogous to the non-existence of an explicit model of the sequence; the participants not knowing explicitly what comes after the current stimulus and yet improving bears a stark resemblance to the way model-free RL does not update the transition probabilities between consecutive states and yet, with experience, improves. Similarly, model-based RL requires an explicit computation to learn the transition probabilities of a model. This is analogous to stating that humans, when at a stage of explicit learning, are aware of the underlying sequence in an SRT task; using that internal, yet explicit, belief about the next element of the sequence helps in preparing for the next stimulus, leading to faster response time profiles (Otto et al., 2013).
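To make the options idea concrete, the sketch below shows one hypothetical way an option could be represented computationally: a named sub-policy with its own termination condition, executed without consulting the top-level policy once selected. The class and field names, the corridor environment, and the subgoal are all illustrative choices of ours, not part of any standard library or of the experiments in this thesis.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Option:
    """A temporally extended action: a sub-policy followed until its subgoal terminates it."""
    name: str
    policy: Dict[int, int]                 # state -> primitive action, learned model-free
    terminates: Callable[[int], bool]      # True once the subgoal (chunk boundary) is reached

def execute_option(option, state, step_fn, max_steps=100):
    """Run the option's own policy, ignoring the top-level (goal-directed) policy meanwhile."""
    trajectory = [state]
    for _ in range(max_steps):
        if option.terminates(state):
            break
        action = option.policy[state]      # habitual lookup, no forward search
        state = step_fn(state, action)     # environment transition
        trajectory.append(state)
    return state, trajectory

# Toy usage on a 1-D corridor: states 0..5, action +1 moves right, subgoal is state 3.
corridor_step = lambda s, a: min(s + a, 5)
go_to_3 = Option("reach-subgoal-3", {s: 1 for s in range(6)}, lambda s: s == 3)
print(execute_option(go_to_3, 0, corridor_step))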

Figure 3.1: Actor-critic architecture for learning and execution. Input from the environment passes to the goal-directed mechanism to select a chunk. Action selection at the upper level is enabled by engagement of attention in a goal-directed, model-based manner, whereas at a lower level (without attentional engagement) this process implements a habitual, model-free system. Action selection within a chunk occurs on a habitual, model-free basis. Neural correlates of various components of this framework are suggested here. Adapted from Botvinick (2008). Abbreviations: vmPFC: Ventromedial PreFrontal Cortex; DLS: DorsoLateral Striatum; VS: Ventral Striatum; DA: Dopaminergic error signal; HT+: Hypothalamus; DLPFC: DorsoLateral PreFrontal Cortex; OFC: OrbitoFrontal Cortex; MTL: Medial Temporal Lobe.

In neural terms, the ventromedial prefrontal cortex (vmPFC) along with the caudate nuclei may be involved in the goal-directed part, and the dorsolateral striatum and dorsolateral prefrontal cortex (DLPFC) may be engaged in the habitual part, dopamine providing the prediction error signal, while the anterior

regions of the striatum and the left prefrontal and medial frontal cortex play a role in attentional processes. S. J. Gershman & Niv (2010) suggested a role for the hippocampus in task structure estimation, which could be extended to estimating the world model and hence the transition probabilities required for the model-based system. Neural correlates for the options framework are detailed in Figure 3.1.

In this section we presented our unifying framework for combining the three dichotomies. In the subsequent section we attempt to specify the roles of response-to-stimulus interval (RSI) [or more generally, inter-stimulus interval (ISI)] and prior information, along with the pivotal role of attention in switching between the two contrasting mechanisms in the explicit versus implicit and goal-directed versus habitual dichotomies.

3.4 Role of Response-to-Stimulus-Interval (RSI) and Prior Information in the Unified Framework

A model-based search leading to explicit learning is typically slower: the subject is required to deliberate over possible choices leading to the goal. In contrast, the subject does not need to think while performing an action habitually or learning implicitly; a model-free mechanism does not deliberate, it performs an action based on an already available cache of previous experiences and updates the cache as it proceeds further. Based on this, we propose that the response-to-stimulus interval (RSI) [or more generally, the inter-stimulus interval (ISI)] plays a key role in serial reaction time (SRT) experiments. Larger RSIs allow the subject enough time to form a model of the system and deliberate over the actions, and hence this kind of learning and control corresponds to a model-based (explicit) system. On the other hand, smaller RSIs do not allow the subject to form an explicit model, and, as is well known from the literature on serial reaction time experiments, subjects do remain sensitive to (implicitly acquire) the underlying sequential regularities (Robertson, 2007). This sort of implicit learning can be explained with temporal difference (TD) learning, where the error signal leads to an adjustment in action selection while keeping the general habitual control the same. Further, knowledge (prior information) about the existence of sequential regularities in the SRT task leads to the learning and control being explicit and model-based. This can be said to engage attentional processes in our proposal. With attentional engagement, habitual control ceases to exercise control over behavior. While we propose attention-mediated arbitration between model-based and model-free systems, Lee et al. (2014) suggest that such mediation is driven by estimates of the reliability of prediction of the two models. Emerging awareness of the presence of a sequence plays a similar role in mediating learning as explicit attentional processes. The complete architecture is depicted in Figure 3.2.

Figure 3.2: Role of temporal window in engagement / disengagement of attention during learning and execution. The left panel refers to sequence execution (performance) where the flow is from top-to-bottom; attention gets gradually disengaged as you go down the hierarchy. The right panel shows the acquisition (learning) of sequences where the flow is from bottom-to-top; attention gets gradually engaged as you go up the hierarchy. The temporal window determines when to switch between the two mechanisms. For example, for an action worth 1 unit of time with a temporal window size of 5 units and an RSI of 3 units, a two-action chunk would lead to attention engagement / disengagement. A smaller RSI would require more actions to be chunked together to engage / disengage attention towards the underlying task.

Our proposal relies heavily on the hierarchical chunking mechanism and the engagement or disengagement of attention to the underlying repeating pattern or sequence. Learning begins implicitly and in a model-free manner; eventually, as the formation of chunks proceeds up the hierarchy, the size of a chunk, defined in terms of the time it takes to execute the set of actions within it, crosses a threshold, thus engaging the attentional resources of the subject. At this point explicit, model-based learning starts taking control. Similarly, during control (or execution) of a sequence, the topmost selection of chunks happens via a goal-directed, model-based mechanism; on proceeding down the chunk hierarchy, after crossing some chunk-size threshold, the subject no longer pays attention to the execution and it goes on in a habitual, model-free manner. Learning or execution of a set of actions within a chunk proceeds in a habitual, model-free fashion, which at the attentive level in the hierarchy can be explained by habitual control of goal selection as suggested by Cushman & Morris (2015).

Attention engagement or disengagement occurs when the chunk size is equivalent to a certain temporal window. Such a temporal window includes the RSI for a typical SRT task. For instance, larger RSIs need fewer physical actions to reach the threshold size of the temporal window during bottom-up learning and hence cause attentional engagement toward the underlying sequential pattern sooner than in a trial with smaller RSIs. Based on this proposal, it will be interesting to empirically investigate the impact of varying the size of the temporal window and to study the resultant influence on awareness of the presence of an underlying sequence in the standard SRT task. According to our proposal, implicit (associative) learning in the lower level of the hierarchy proceeds without engagement of attention. Further, we propose that as the response-stimulus interval (RSI) increases, the width of the temporal window available for integration of information related to the previous response and the subsequent stimulus increases. Thus, increasing the temporal window allows deliberative and reflective (analytical) processes to kick in, enabling a transition to explicit (awareness-driven) top-down mechanisms. This prediction can be verified experimentally and seems to be supported by preliminary evidence from the work of Cleeremans and colleagues (see Cleeremans, 2014).

Such a hierarchical chunking mechanism for behavior generation has been suggested by Albus (1991), albeit from an intelligent control architecture perspective. According to Albus (1991), a hierarchy producing intelligent behaviors comprises seven levels, covering at the lowest level the finest reflex actions and spanning all the way up to long-term planning of goals. At each higher level, the goals expand in scope and the planning horizons expand in space and time. In neural terms, the motor neuron, spinal motor centers and cerebellum, the red nucleus, substantia nigra and the primary motor cortex, the basal ganglia and prefrontal cortex, and finally the temporal, limbic and frontal cortical areas get involved in increasing order of the hierarchy.
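One way to operationalize this switching rule is sketched below: attention is engaged (during learning) or disengaged (during execution) once the time to execute the current chunk, including the RSIs between its elements, crosses a temporal-window threshold. The function name and the default numbers (1-unit actions, a 5-unit window) are illustrative choices echoing the example in Figure 3.2, not fitted parameters.

def attention_switch(chunk_size, rsi, action_time=1.0, temporal_window=5.0):
    # Time to run the chunk: each action takes action_time, with an RSI between successive actions.
    chunk_duration = chunk_size * action_time + (chunk_size - 1) * rsi
    return chunk_duration >= temporal_window

# Figure 3.2's example: 1-unit actions, 5-unit window, RSI of 3 -> a two-action chunk crosses the window.
print(attention_switch(chunk_size=2, rsi=3))    # True  (2*1 + 1*3 = 5)
print(attention_switch(chunk_size=2, rsi=0.5))  # False (2*1 + 1*0.5 = 2.5); a larger chunk would be needed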

3.5 Comparison with other dual system theories

Many dual system theories related to goal-directed versus habitual behavior or implicit versus explicit learning have been proposed in the recent past. For example, Keele et al. (2003) suggest a dual system where implicit learning is typically limited to a single-dimensional or unimodal system whereas explicit learning involves inputs from other dimensions as well. Our model incorporates this duality in a different sense and does not distinguish the dichotomy between different modalities. Inputs from multiple modalities are treated as actions in an abstract sense, and when a bunch of such actions crosses the threshold (acquisition or execution time), this would lead to attentional modulation (engagement in the case of acquisition or disengagement in the case of performance). A similar idea has been discussed by Cleeremans (2006), who suggested that a representation obtained from exposure to a sequence may become explicit when the strength of activation reaches a critical level. Formation of the chunks is, however, assumed to be driven by bottom-up, unconscious processes. These chunks become available later for conscious processing (Perruchet & Pacton, 2006). We concur with the suggestions of Keele et al. (2003) on the neural correlates of implicit and explicit learning, learning in the dorsal system being implicit whereas that in the ventral system may be related to explicit or implicit modes. However, we emphasize that the ventral system, when learning is not characterized as a uni- versus multi-dimensional dichotomy, would be more related to explicit learning.

Daltrozzo & Conway (2014) discuss three levels of processing: an abstract-level storage for higher-level goals, followed by an intermediate-level encoding of the actions required to reach the goal, and a low level acquiring highly specific information related to the exact stimulus and associated final motor action (Clegg et al., 1998). Our model reflects such a hierarchy by breaking down the actions into a finer set of sub-actions, where the topmost abstract actions or goals are decided by a goal-directed, model-based system whereas the more concrete actions are executed by a habitual, model-free system. Walk & Conway (2011) suggest a cascading account where two mechanisms interact with each other in a hierarchical manner: concrete information is encoded in a modality-specific format, followed by encoding of more domain-general representations. We incorporate such an interleaving phenomenon by suggesting that the actions within a chunk are carried out in a habitual, attention-free manner, the selection of such a chunk being goal-directed and attention-mediated. Thiessen et al. (2013) discuss a dual system involving an extraction and integration framework for general statistical learning. The extraction framework is implicated in conditional statistical learning: the formation of chunks or associations between events occurring together. On the other hand, the integration framework is implicated in distributional statistical learning: generalization of the task at hand. We can relate the extraction framework to the implicit, habitual process and the integration framework to a goal-directed mechanism that involves creation of a model of the environment using information from potentially multiple sources. Batterink et al.
(2015) present evidence suggesting that though there does not seem to be an explicit recognition of statistical regularities, the reaction time task, which is deemed 50% more sensitive to statistical learning, suggests that there is in fact some statistical structure of the presented stimuli learned implicitly. Our framework agrees with the conclusion that implicit and explicit statistical learning occur in parallel, attention deciding the driver process. A similar

account has been suggested by Dale et al. (2012), who state that the system initially learns a readiness response to the task in an associative fashion, mapping the stimulus to a response, and then undergoes a rapid transition into a predictive mode as task regularities are decoded. Reber (2013) suggests a role for the hippocampal-medial temporal lobe (MTL) loop in explicit memory, whereas implicit memory is said to be distributed in the cortical areas. However, evidence from studies with Parkinson's patients suggests an important role for the basal ganglia in acquiring such implicit knowledge. We posit a similar role for the basal ganglia and corticostriatal loops in implicit learning; the knowledge that follows this learning may be stored throughout the cortex while keeping the role of the MTL and hippocampus intact.

3.6 Conclusion

Sequencing is a fundamental ability that underlies a host of human capacities, especially those related to higher cognitive functions. In this perspective, we suggest a theoretical framework for acquisition and control of hierarchical sequences. We bring together two hitherto unconnected streams of thought in this domain into one framework: the goal-directed and habitual axis on the one hand and the explicit and implicit sequencing paradigms on the other, with the help of model-based and model-free computational paradigms. We suggest that attentional engagement and disengagement allow the switching between these dichotomies. While goal-directed and habitual behaviors are related to performance of sequences, explicit and implicit paradigms relate to learning and acquisition of sequences. The unified computational framework proposes how the bidirectional flow in this hierarchy implements these two dichotomies. We discuss the neural correlates in light of this synthesis.

One aspect of applicability of our proposed framework could be skill learning. It is well known that skill learning proceeds from initially being slow, attentive and error-prone to finally being fast, automatic and error-free (Fitts & Posner, 1967). Thus it appears that sequential skill learning starts being explicit and proceeds to be implicit from the point of view of attentional demands. At first sight, this seems to be at odds with the unified framework proposed here, where the hierarchy seems to have been set up to proceed from implicit to explicit learning. However, the phase-wise progression of skill learning is consistent with the framework as per the following discussion. We explore applying this model to skill learning in the following chapter. It has been pointed out that different aspects of skill are learned in parallel in different systems: while improvements in reaction time are mediated by the implicit system, increasing knowledge of the sequential regularities accrues in the explicit system (Hikosaka et al., 1999; Bapi et al., 2000). The proposed unitary network is consistent with these parallel processes, the implicit processes operating bottom-up and the explicit system in a top-down fashion. The key factor is the engagement and disengagement of the attentional system as demarcated in Figure 3.1. One might wonder how this approach can be applied to research in non-human animals, where explicit mechanisms are difficult to realize. Historically, while SRT research identifying implicit versus explicit learning systems is largely based on human experiments, goal-directed and habitual research is based on animal experiments. The proposed

framework is equally applicable to human and non-human participants. What is proposed here is that the lower-level system operates based on associative processes that allow the system to learn implicitly and respond reactively, and the computations at this level are compatible with a model-free framework. On the other hand, the upper-level system is based on predictive processes that allow the system to prepare anticipatory responses that sometimes cause errors. Error-evaluation while learning and error-monitoring during control are part of this system, which learns using explicit processes and enables goal-directed control of actions, and the computations at this level are compatible with a model-based framework. The level of attentional engagement distinguishes these two levels, as shown in Figure 3.2. Of course, non-human animals cannot give verbal reports of their knowledge. The explicit system in the case of pre-verbal infants and non-human animals needs to be understood along the lines of predictive systems that can elicit anticipatory, predictive responses and learn rules and transfer them to novel tasks (Marcus et al., 1999; Murphy et al., 2008). Finally, based on this theoretical proposal, we make predictions as to how the implicit-to-explicit transition might happen in serial reaction time tasks when the response-to-stimulus interval (RSI) is systematically manipulated. The mathematical formulation of such a unified mechanism is yet to be established, along with a formalization of the attentional window and its relation to RSI.

Chapter 4

Dual Process account to explain skill acquisition and implications of Parkinson's Disease

We talked about two levels of sequencing behaviors in the previous chapter. In one, the entire path is broken down into smaller chunks of paths and the goal-directed action, after training, selects one of the chunks, whereas the chunks are executed in a habitual manner, strictly limited to how we observe behavior. Another hierarchy lies in the decreasing granularity of individual motor and cognitive actions. For instance, taking a high-level action of moving an arm to reach an object might engage our attentional resources up to the point where we are certain that we will reach the object. We don't, however, pay attention to individual muscle movements or joint angles; this hierarchy particularly deals with the engagement and disengagement of attention. In this chapter, we explore the first of the two hierarchies in a grid-world setting and argue that such a dual process might be a good fit to model skill learning. We also support our theory with prominent observations in Parkinson's patients using the earlier exponential model of the Parkinson's syndrome.

4.1 Introduction: Mapping Sequential Decision Making to Skill Learning

Goal-directed and habitual behavior have been well established as two distinct but interacting processes for sequential behavior. We first argue that the solution of a sequential decision making problem can be used to explain skill acquisition behavior. We view skill acquisition or learning as a process that happens in parallel with execution of the said skill, i.e., on-line. In this view, skill execution is then a sequence of decisions at each step of the said skill leading to a (delayed) reward for successfully executing the skill. Thus, by this definition, executing a skill is equivalent to executing a sequence of decisions guided by the rationality assumption of maximizing the expected future reward. We further state that each arbitrary next step of the skill being executed is independent of previous steps, as long as the information needed to take the next step is available in the current state of the system, with the representation

of history condensed in the current state. Such a simplification then allows modeling of skill acquisition as a Markov Decision Process (MDP), which is formalized as follows:

\langle S, A, T, R \rangle    (4.1)

where S is the set of states in the environment; A is the set of possible actions from these states; T is the set of transition probabilities of the form T(s, a, s'), which indicate the probability of landing in state s' given the current state s and action a; and R is the reward function of the form R(s, a, s'), which indicates the reward received given that action a in state s took the agent to state s'.

Formalizing the skill learning and execution problem as an MDP allows us to model it using the Reinforcement Learning (RL) paradigm. We assert that there are two arguably distinct but interacting processes akin to Reinforcement Learning which can describe skill learning: Model-Based RL and Model-Free RL. Correspondingly, there have been multiple computational models proposed suggesting interactions between these two processes framed as Model-Based and Model-Free Reinforcement Learning, respectively. For instance, Daw et al. (2005) suggest an uncertainty-based competition to arbitrate control between the two processes, the more confident mechanism being the driver. Gläscher et al. (2010) suggest a weighted value function to make decisions at each step leading to a late reward, with the weights dynamically adjusted to have the goal-directed values account more for decisions initially and the habitual values account more for decisions with incremental exposure to the environment. Here we propose a different mechanism of arbitrating control between the two processes which provides an intuitive explanation of skill learning, and we suggest a possible set of neural correlates that could form the brain basis for the suggested mechanism.

4.2 Model Description

Two Mechanisms

Model-Based RL

Model-Based RL can be viewed as making rational decisions at each step using the internal representation of the model in mind. This internal representation of the environment is defined by the model's Transition and Reward functions (Sutton, 1988). Thus, the agent creates a map of the environment as it knows it, looks forward in that map and takes an action that is most rewarding in the future it can see. Initially, the agent is not aware of the consequences of its actions and thus takes random actions, eventually encountering the goal. However, once the goal is achieved, while redoing the task the agent uses its representation of the environment (the model), based on the previous experience, to navigate efficiently to reach the goal. The value function for a state, as a representation of the delayed reward the agent can receive by passing through that state, is defined by the following Bellman equation:

V(s) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]    (4.2)

Thus the value of being in a state s is the sum of the average reward the agent will receive from transitioning to the next state s' and the discounted (by \gamma) expected future reward.

Figure 4.1: Model-Based Tree

For the purpose of this model, we use a simple Depth-Limited Search to estimate this value function, and we suggest that a state-space search is a more physiologically plausible algorithm, based on neurological evidence of forward-search-like planning in various biological investigations (Mushiake et al., 2006), than a computationally more efficient Dynamic Programming approach of Value or Policy iteration. The Model-Based mechanism is thus seen as looking forward into the consequences of the actions taken at a particular stage of skill execution, leading to successfully reaching the goal state, the end of that skill. These actions are in particular taken according to the outcomes they will lead to, based on an internal representation of the environment, making it convenient to explain a Response-Outcome or Goal-Directed behavior.

Learning in a Model-Based setting involves learning an accurate representation of the environment; in particular, the Transition Probabilities and the Reward function represented by state values. These transition probabilities help create a cognitive map (Tolman, 1948) of the environment, which is then used for a mental traversal leading to an accurate value function. In our model, as the agent traverses through the environment, it receives an observation which carries information about the state the agent is currently in. Based on past experience, the agent maintains a belief about the state it will transition to from a given state by taking a certain action. On receiving the observation from the environment, the agent then updates its transition probabilities representing the environment as per the following equations:

T_{k+1}(s_k, a_k, s') =
\begin{cases}
T_k(s_k, a_k, s') + \alpha_\tau \, (1 - T_k(s_k, a_k, s')) & \text{if } s' = s_{n+1} \\
T_k(s_k, a_k, s') + \alpha_\tau \, (0 - T_k(s_k, a_k, s')) & \text{if } s' \neq s_{n+1}
\end{cases}
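A minimal sketch of this update, assuming a tabular transition model indexed by (state, action, next state); the array shapes and the learning-rate name alpha_tau are illustrative rather than the exact parameters used in the simulations.

import numpy as np

def update_transitions(T, s, a, s_next, alpha_tau=0.1):
    """State-prediction-error update of the learned transition table T[s, a, :].

    The observed successor s_next is nudged toward probability 1, every other
    successor toward 0, so each row of T[s, a, :] remains a probability vector.
    """
    target = np.zeros(T.shape[2])
    target[s_next] = 1.0
    T[s, a, :] += alpha_tau * (target - T[s, a, :])
    return T

# Example: 3 states, 2 actions, transitions initialized uniformly (a maximally uncertain model).
T = np.full((3, 2, 3), 1.0 / 3.0)
T = update_transitions(T, s=0, a=1, s_next=2)
print(T[0, 1])   # probability mass shifts toward state 2; the row still sums to 1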

This update increases the probability of transitioning to the observed state s' and decreases the transition probabilities for all other states (Miller et al., 2017). \alpha_\tau indicates the learning step size for this transition probability update. The update can be viewed in terms of State Prediction Errors (Gläscher et al., 2010): the expectation of landing in the state s' by taking an action a is updated based on the error received with regard to the expectations of landing in states other than s'. The value function is updated by getting the agent to traverse through its internal model of the environment, performing a depth-limited search using the Bellman update equation across traversals through the environment:

V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]    (4.3)

The value of state s at the (k+1)-th iteration, thus, is updated by looking forward into the search tree and averaging the value of the next possible states as estimated in the k-th iteration.

Model-Free RL

On the other hand, Model-Free RL can be viewed as taking steps based on past experiences, where the agent maintains a repository of states and the actions from each state, and decides based on how good a particular action was in a given state previously. The agent does not need to (and cannot) do a depth-limited search through the internal model of the environment; instead it looks up the repository (see Figure 4.2 for a representation of a state-action lookup table based on model-free learning), matches the action that gives the maximum expected reward from the current state and picks that action. This mechanism can be viewed as Stimulus-Response or Habitual behavior, where the agent is impervious to the outcome its actions may lead to. The Model-Free mechanism is implemented in our framework using the Q-learning algorithm, where a Q state, defined as a state-action pair, represents the expected reward of that state-action pair. Thus a value function for Q states will be written as:

Q(s, a) = R(s) + \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')    (4.4)

The above equation estimates the value of an action in a particular state as the sum of the reward received in that state and the average of the maximum reward the agent will receive for the next state-action pair.
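A minimal sketch of such a lookup-table agent is shown below, assuming a tabular Q indexed by (state, action); the table size, the epsilon-greedy choice and the step in the example are illustrative, and the update it uses is the Q-learning rule formalized in the equations that follow.

import numpy as np

rng = np.random.default_rng(0)

# Q-table for a hypothetical 5-state, 4-action task; rows play the role of the lookup table in Figure 4.2.
Q = np.zeros((5, 4))

def choose_action(s, epsilon=0.1):
    """Habitual (model-free) choice: read the row for state s and take the best cached action."""
    if rng.random() < epsilon:                 # occasional exploration
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def td_update(s, a, r, s_next, alpha_q=0.1, gamma=0.9):
    """One Q-learning step: move Q(s,a) toward the sampled return r + gamma * max_a' Q(s',a')."""
    sample = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha_q * (sample - Q[s, a])

# One illustrative transition: in state 2, the chosen action gave reward 1 and led to state 3.
a = choose_action(2)
td_update(2, a, r=1.0, s_next=3)
print(Q[2])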

Figure 4.2: Model-Free Lookup Table

Updating Q states is done via the Q-learning algorithm, which is described by the following set of equations:

\text{sample}_k = R(s_n, a_n, s_{n+1}) + \gamma \max_{a_{n+1}} Q(s_{n+1}, a_{n+1})    (4.5)

Q(s_n, a_n) \leftarrow (1 - \alpha_q) Q(s_n, a_n) + \alpha_q \, \text{sample}_k = Q(s_n, a_n) + \alpha_q \, (\text{sample}_k - Q(s_n, a_n))    (4.6)

The difference term in the final equation represents the Temporal Difference error between the expected value Q(s_n, a_n) and the actual value sample_k. \alpha_q is the step size of the update. The above set of equations can be viewed as keeping a running average of the desirability of a response given a state. These equations were written earlier in Chapter 3; we rewrite them here in the present notation. The equation for the Temporal Difference error here points directly at learning Stimulus-Response (S-R) associations akin to habitual behavior, whereas the equations for state-value functions are written in terms of V(s), akin to Goal-Directed behavior: reach the state which gives maximum value. Ergo, actions in the model-based part of the system are taken by considering the state-value function, aiming to reach a particular state, whereas actions in the model-free part of the system are taken by considering the best action given the current state, regardless of where that action lands the agent.

Parametric Exploration

We first tested our theory that model-based learning is indeed slower in terms of average time taken to reach the goal state across iterations and that it takes fewer averaged-across-iterations steps to reach the goal as compared to model-free learning. All the parametric variations below were run over 50 iterations, over which the averages and corresponding standard deviations were computed. First we used a 7x7 grid with transition probability 0.7 (the agent is allowed to take the desired action 70% of the time and has to

take any other action 30% of the time with equal probability). Figure 4.3 shows a comparison between the two mechanisms in time taken to reach the goal state and average number of steps. The depth search limit is kept at 3 states: the agent is allowed to look 3 states into the future for model-based search.

Figure 4.3: Comparison for depth search limit of 3 states.

Next we increased the depth search limit to 4 states, keeping all other parameters intact. Figure 4.4 shows the comparison. As expected, model-based search is slower than in the previous case and it takes fewer average steps.

Figure 4.4: Comparison for depth search limit of 4 states.

We changed the environment size to a 10x10 gridworld with the reward at the bottom-right corner and the agent starting at the top-left. The depth is kept at 3 in this case. Figure 4.5 shows the comparison.

Figure 4.5: Comparison for depth search limit of 3 states, grid size increased to 10x10.

Testing the two mechanisms after increasing the depth limit to 4 on the same 10x10 environment gives the comparison shown in Figure 4.6.

Figure 4.6: Comparison for depth search limit of 4 states, grid size 10x10.

Next, we made the environment deterministic to ensure that the relative differences are intact when the environment dynamics are not at play. Figure 4.7 shows the comparison.

Figure 4.7: Comparison for depth search limit of 3 states with transition probability 1.

Finally, we reduced the learning rate, which had been kept at 0.1 until this simulation.

Figure 4.8: Comparison for depth search limit of 3 states, reduced learning rate.

The simulations above show that the differences between the two mechanisms are intact across all free parameters. We thus use one set of these parameters to compare our model in the following sections.

Arbitration

The two mechanisms portrayed above, while guaranteed to converge to optimal behavior, lack a few crucial properties. A purely Model-Based approach converges to the optimal path faster because of its ability to look ahead in the environment, but building a search tree at each step of the process is computationally expensive and time consuming. Thus a purely model-based mechanism is slow in

terms of the time it needs to execute through the environment. A purely Model-Free approach, on the other hand, is considerably faster in terms of execution time since the agent does not need to build a search tree and traverse through it depth-first. However, this robs the agent of the ability to converge faster to the optimal path, making a pure Model-Free approach slow in terms of the number of traversals it takes through the environment to generate optimal behavior.

Thus we leverage the faster execution time of the Model-Free mechanism and the faster convergence of the Model-Based mechanism and arrange them hierarchically: the agent is initially completely Model-Based and eventually turns Model-Free. This arrangement is in line with the assumption that skillful behavior is initially Goal-Directed and slow, and with enough practice the behavior gets habitual and faster. To demonstrate this, we simulated the model in a simplistic grid-world environment. We thus performed simulations of the model in a 7x7 grid-world with the agent starting at the top-left corner and going to the reward at the bottom-right. At each step the agent receives a small living penalty, allowing a faster and more directed exploration. The agent selects a model-based action, by performing a search through its internal version of the environment, when the step number in an iteration satisfies the following condition:

j = a \cdot \frac{i}{\text{factor}}    (4.7)

or

j = a \cdot \text{chunk\_size}    (4.8)

Here, a is an integer, i is the iteration number, j is the step number within the iteration, chunk_size limits the number of consecutive Model-Free actions, and factor, incremented by an arbitrary real value, controls the relative dominance of the Model-Based versus the Model-Free mechanism. Trivially, with factor valued at 1 and incremented by 0, in its first iteration the agent will traverse the grid using a pure depth-first search at each step. While the agent learning a skill of traversing through the environment will not know the dynamics of the environment, we use the State Prediction Errors (Gläscher et al., 2010) to update transition probabilities with experience. In the second iteration, after experiencing the reward once, the agent will take a model-based step at every alternate cell of the grid and a model-free step at the other cells. The Model-Free approach thus takes eventual control iteratively with more traversals through the environment. Once the values for the next states (in the case of the Model-Based approach) or the possible actions for the current state (in the case of the Model-Free approach) are estimated, we squish the values through a softmax function to take a probabilistic next step:

\sigma(Q(s_i, a_i))_j = \frac{\exp Q(s_i, a_i)}{\sum_{k=1}^{K} \exp Q(s_k, a_k)} \quad \text{for all } j = 0, 1, \ldots, K    (4.9)
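A sketch of the switching rule and the softmax step, written with the same names as equations (4.7) to (4.9); the values of factor and chunk_size, the temperature parameter and the example calls are illustrative choices rather than the exact settings used in the reported simulations.

import numpy as np

rng = np.random.default_rng(0)

def take_model_based_step(j, i, factor=1.0, chunk_size=5):
    """True when step j of iteration i should be model-based (equations 4.7 and 4.8)."""
    period = max(int(round(i / factor)), 1)   # every (i/factor)-th step is goal directed
    return j % period == 0 or j % chunk_size == 0

def softmax_choice(values, temperature=1.0):
    """Probabilistic selection over candidate values, as in equation (4.9)."""
    prefs = np.exp(np.asarray(values) / temperature)
    probs = prefs / prefs.sum()
    return int(rng.choice(len(values), p=probs)), probs

# Early in training (iteration 1) every step is model-based; by iteration 10 only
# every 10th step, or a chunk boundary (every 5th step here), is model-based.
print([take_model_based_step(j, i=1) for j in range(6)])
print([take_model_based_step(j, i=10) for j in range(6)])
print(softmax_choice([0.2, 1.0, 0.5]))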

This approach can also be viewed as chunk formation during skill acquisition, as suggested by Graybiel (1998): in the n-th iteration the agent takes a goal-directed step every n-th cell, the n-1 steps in between being collapsed into a chunk which is executed habitually [1]. The dual process algorithm is shown in Algorithm 1.

Algorithm 1: Dual Process Algorithm

    Initialize the internal model with equal transition probabilities and zero value for each state
    for iteration i do
        Place the agent at the start state; j = 0
        while the reward is not reached do
            if j mod (i / factor) == 0 or j mod chunk_size == 0 then
                Take a Model-Based step: build a search tree of depth d from the current state based on
                the internal model of the environment, and take the action that leads to the state which
                maximizes the expected reward at the end of the search tree
                Update the transition probabilities
                Update the value of the state and the Q-values of the actions from that state
            else
                Take a Model-Free step: take the action that maximizes the corresponding Q-value
                Update the value of the state and the Q-value of the corresponding action
            end
        end
    end

The algorithm thus works as follows. In a model-based step, it uses the current T(s, a, s') to look forward in the state space from the current state s up to depth d and estimates the V(s) [2] of the next possible states. This V(s) is estimated by backtracking discounted values in the future states up to depth d (see Figure 4.9 for a reference calculation). Based on the current T(s, a, s'), it then determines which action it should take that will land it in a state of maximum V(s). After the agent takes its action a, it updates: (1) the V(s) of the previous state based on the error signal it receives in the new state, (2) the Q(s, a) using the same error signal from the new state, and (3) the T(s, a, s') using the error between its originally expected state and the state in which it actually landed for that action. In a model-free step, the agent looks at the current state s, looks at the function Q(s, a) for all possible actions a from s, and picks the action that has the maximum Q-value. After it takes the action, it updates: (1) the Q(s, a) using the temporal difference error in expectation for the action it took, and (2) the V(s) using the same error in expectation for the state it just left. It does not update the T(s, a, s').

MB RL using V(s) to take the next steps is analogous to saying that an animal acting in a goal-directed fashion would look at the goal state it would like to be in and take an action based on the expected salience of that state; this is also analogous to Response-Outcome behavior.

[1] The switching is arbitrary within the given hierarchical arrangement of the mechanisms.
[2] This V(s) is the optimal V(s) for the state. It is computed by assuming the agent will take an action to receive the maximum reward from the next state.

Figure 4.9: Estimating the values of any state via backtracking search

MF RL using Q(s, a) to take the next steps is analogous to saying that an animal acting in a habitual fashion would look at the actions available from the current state and take the most salient action; this is also analogous to Stimulus-Response behavior. Both these functions are updated in every step of the agent traversing through the environment.

The Response-Time and the Traversal-Time curves show an exponential decrease, which is expected as per the power law of practice. It is also interesting to note that the model-based RL implemented here using a simple tree search shows a stark resemblance to the cognitive phase of skill acquisition. In particular, the cognitive phase is characterized by employing cognitive resources to think through the possible next steps in the skill via self-evaluations. Similarly, the autonomous phase, requiring no cognitive resources dedicated to the execution of the skill, draws a resemblance to Model-Free RL, where we do not use the learned transition function to build an explicit search tree and look into the future. Yet another interesting resemblance to note is the attentional engagement in these phases of skill learning. The cognitive stage requires the subject to employ a large amount of attention to the task at hand, whereas the automatic phase does not (Wulf et al., 2001). This attention requirement is in line with our earlier analogy of using the MB-MF dichotomy to model Explicit versus Implicit learning. In particular, MB RL requires the agent to pay attention whereas MF RL does not. We do not yet explicitly distinguish the

associative stage, which comes in between the cognitive and the autonomous stages. Perhaps, as stated by Fitts & Posner (1967), this stage is a continuous one; forming chunks of actions, which increase in size across iterations, to be executed in the MF way could be a way to describe the associative stage of skill acquisition.

This model performs better than a pure Model-Based mechanism (but not a pure Model-Free mechanism) on the time metric, and it performs better than a pure Model-Free mechanism (but not a pure Model-Based mechanism) when compared on the metric of number of traversals through the environment (refer to Figures 4.10 and 4.24). The results plotted here are over 10 trials of 50 iterations each. The maximum depth is kept at 4 states for both the model-based and the dual system. The learning rate \alpha is 0.1, whereas the discount factor \gamma is kept at 0.9. The transition probability of the environment is 0.7: the agent will be successful in taking the action it aims for 70% of the time and will take any other action with equal probability 30% of the time. Figures 4.14, 4.15 and 4.16 show the cells where model-based actions were taken at various points in the training. Figures 4.17, 4.18 and 4.19 show the same for MF control. Figures 4.20, 4.21 and 4.22 show how the value function develops across training. These heat maps show that at the initial stages the MB mechanism is dominant throughout the environment. This dominance gradually transfers to the model-free mechanism. The value function maps show the salience of each state through the training.

Figure 4.10: Takes fewer steps than a pure Model-Free mechanism.

Figure 4.11: Time to reach the goal state. Takes less time than a purely goal-directed search.

Figure 4.12: Response time decreases exponentially with practice.

Figure 4.13: Traversal time decreases with practice. The behavior is shown to go habitual from goal directed in terms of time to reach the goal.

Figure 4.14: Model Based actions in iteration 1.

Figure 4.15: Model Based actions in iteration 15.

Figure 4.16: Model Based actions in iteration 40.

Figure 4.17: Model Free actions in iteration 1.

Figure 4.18: Model Free actions in iteration 15.

Figure 4.19: Model Free actions in iteration 40.

Figure 4.20: Value function in iteration 1.

Figure 4.21: Value function in iteration 15.

Figure 4.22: Value function in iteration 40.

We then performed the standard outcome devaluation test to indicate that the framework does indeed match psychological expectations. To implement this more effectively, the agent starts at the center of the grid, with the bottom-right corner giving a higher reward and the top-left corner a lower reward. At a specific point in the iterations we devalue the higher-rewarding state. We plotted the stay probabilities for the first action at the start state. Figure 4.23 shows higher stay probabilities with overtraining (outcome devalued at a later stage) and lower stay probabilities with low or moderate training (outcome devalued at an earlier stage). Figure 4.24 shows how the stay probabilities vary with outcome devaluation at different stages in the training.
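The devaluation manipulation itself is simple to express; the sketch below shows the kind of reward switch applied at a chosen iteration, with made-up reward magnitudes, state labels and devaluation point rather than the exact values used in the simulation.

def reward_schedule(state, iteration, devalue_at=40, high_goal="bottom_right", low_goal="top_left"):
    """Reward function for the devaluation test.

    Before iteration `devalue_at` the high goal pays more than the low goal;
    from that iteration on the high goal is devalued (here, set to zero).
    """
    if state == high_goal:
        return 0.0 if iteration >= devalue_at else 10.0
    if state == low_goal:
        return 5.0
    return -1.0   # small living cost on every other step

# An over-trained agent (devalued late) keeps repeating its first action toward the old high goal,
# which is what the higher stay probability in Figure 4.23 summarizes.
print(reward_schedule("bottom_right", iteration=10), reward_schedule("bottom_right", iteration=45))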

Figure 4.23: Higher stay probability of an overtrained agent. Representative finding.

Figure 4.24: Stay probabilities when the environment is changed at different iteration points in execution.

4.3 Simulation of Parkinson's Disease

We simulated Parkinson's disease using the same model that was used for simulating it in the BG action selection model (Equations 2.17, 2.18 and following). Thus we clamped the error signal used for updating the environment to an arbitrary value to simulate Parkinson's disease. It has been suggested that loss of DA neurons mainly affects projections in the posterior putamen (the striatal component for habitual behavior), causing over-reliance on goal-directed control (Redgrave et al., 2010). Thus, we expected the model to show better control under a more dominant goal-directed actor than under a more dominant habitual actor. We simulated the disease and compared the average reward the model attains across conditions without disease and conditions with disease. The environment was designed to provide a reward of +10 on reaching the goal state, with a living cost of 1 for each step taken in the environment. Averaging across 10 trials, with 50 iterations per trial in the 7x7 environment described above, the control group achieved a mean reward of . The model started with goal-directed control at every step and reduced its goal-directed influence below 50% at around the 12th iteration (Figure 4.25). The Parkinson's condition with the same parameters produced a mean reward of . We increased the goal-directed dominance to cross the 50% mark at around the 40th iteration to obtain a reward of 6.23 (Figure 4.26). Doubling the number of iterations brought the average reward closer, at 6.68 (Figure 4.27).
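A sketch of the clamped-error manipulation, assuming the healthy model uses the usual TD error as its teaching signal; the function names and the value of clamp_value are illustrative, and the comparison in the example is only meant to show that the clamped signal ignores the actual outcome.

def td_error(r, v_next, v_current, gamma=0.9):
    """Standard temporal-difference (dopamine-like) error used by the healthy model."""
    return r + gamma * v_next - v_current

def td_error_parkinsonian(r, v_next, v_current, gamma=0.9, clamp_value=0.1):
    """Parkinsonian variant: the teaching signal is clamped to a fixed, arbitrary value,
    so value updates no longer track the true prediction error."""
    return clamp_value

def update_value(V, s, delta, alpha=0.1):
    V[s] += alpha * delta
    return V

# With the clamped signal, every state drifts by the same small amount regardless of outcome,
# consistent with the reduced average reward obtained by the simulated Parkinson's condition.
V = {"s0": 0.0}
update_value(V, "s0", td_error(1.0, 0.0, V["s0"]))               # healthy update
update_value(V, "s0", td_error_parkinsonian(1.0, 0.0, V["s0"]))  # clamped update
print(V)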
