A Simulation of Sutton and Barto's Temporal Difference Conditioning Model


Nick Schmansky
Department of Cognitive and Neural Systems, Boston University
May

Abstract

A simulation of the Sutton and Barto [2] model of classical conditioning is shown to exhibit known timing effects of that conditioning paradigm. The effects include an accurate modelling of the time course of the shift of the CR away from the onset of the UCS and toward the onset of the CS during an acquisition cycle; a graded CR response dependent upon ISI timing; the blocking effect; and second-order conditioning. Key to the success of this model is the inclusion of short-term memory trace variables x̄_i and ȳ, describing the inputs and output; and the dependence of weight adaptation on a temporal-difference term y − ȳ.

1 Introduction

Sutton and Barto [2] introduced a learning network attempting to explain the effects observed during classical conditioning. The Sutton-Barto model is a simple adaptive element described by Equations (1). A three-input form of an element is shown in Figure 1. At its simplest, the model acts as a perceptron: the output is a summation of weighted inputs. However, the model includes two additional sets of variables which are critical to it. Each input stimulus x_i has an associated eligibility trace x̄_i, which acts as a short-term memory of x_i and indicates when and by how much the weight w_i associated with x_i is modified. A similar variable ȳ exists for the output y, where ȳ is a weighted average of the element's past activity.

An important aspect of the Sutton-Barto model is the dependence of weight modification on the difference term y − ȳ (and x̄_i). This is in contrast to Hebbian learning, which depends only on y (and x_i). The inclusion of the y − ȳ temporal difference term in the Sutton-Barto model is critical to accounting for the timing results observed in classical conditioning experiments, namely the blocking effect, interstimulus interval variations, and higher-order conditioning effects.
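The contrast between the two rules can be sketched numerically. The snippet below is only an illustration: the function names, the learning rate and the activity values are assumptions chosen for the example, and the Hebbian rule is written in its simplest product form. It shows that a Hebbian-style update keeps growing while input and output are both active, whereas the difference-based update vanishes once the output matches its own trace.

```python
def hebbian_delta(c, y, x):
    """Hebbian-style change: proportional to output times input."""
    return c * y * x


def sutton_barto_delta(c, y, y_bar, x_bar):
    """Difference-based change: output minus its trace, gated by the input trace."""
    return c * (y - y_bar) * x_bar


c = 0.1  # assumed learning rate, for illustration only

# Output steady at its expected level (y equals y_bar).
steady_hebbian = hebbian_delta(c, y=0.6, x=1.0)
steady_sb = sutton_barto_delta(c, y=0.6, y_bar=0.6, x_bar=1.0)

# Output jumps above expectation while the input trace is still active.
surprise_sb = sutton_barto_delta(c, y=0.6, y_bar=0.0, x_bar=1.0)

print("Hebbian, steady output:      ", steady_hebbian)
print("Sutton-Barto, steady output: ", steady_sb)
print("Sutton-Barto, surprising UCS:", surprise_sb)
```

Under these assumed values the difference-based rule changes the weight only in the "surprise" case, which is the property the timing effects rely on.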

\begin{equation}
\begin{aligned}
\bar{x}_i(t+1) &= \alpha\,\bar{x}_i(t) + x_i(t) \\
\bar{y}(t+1) &= \beta\,\bar{y}(t) + (1-\beta)\,y(t) \\
w_i(t+1) &= w_i(t) + c\,\bigl(y(t) - \bar{y}(t)\bigr)\,\bar{x}_i(t) \\
y(t) &= \sum_{i=0}^{n} w_i(t)\,x_i(t)
\end{aligned}
\tag{1}
\end{equation}

[Figure 1 diagram: inputs UCS, CS1 and CS2 (x_0, x_1, x_2), weighted by w_0, w_1 and w_2, drive a single output, the UR / CR.]

Figure 1: The simulated Sutton-Barto adaptive element described by Equations (1).

In Equations (1), α and β are constants ranging from 0 to 1, and c is a positive learning rate constant. Variables y(t) and ȳ must lie in the interval [0, 1], and the eligibility trace variable x̄ is always greater than 0.

A simulation of the Sutton-Barto model was conducted for the network configuration shown in Figure 1, based on Equations (1). The simulation investigated the effects of timing differences between the input stimuli. Two inputs acted as conditioned stimuli, CS1 and CS2, and a third input simulated an unconditioned stimulus, UCS.

2 Methods

The MATLAB toolkit was used to develop and execute the simulation. Four experiments were conducted, each highlighting known timing effects of classical conditioning.

2.1 Simulation (a) - Acquisition of a CR

The first experiment simulated the acquisition of a conditioned response, CR, upon pairing CS1 with the UCS, following the basic classical conditioning paradigm. The timing relationship between CS1 and the UCS is shown in the plots of Figure 2. The ISI is fixed at time (the time between the onset of CS1 and the onset of the UCS). For this simulation, the parameters were c =., α =.6, β =, w_0 =.6. Note that the weight w_0 associated with the UCS is fixed, whereas the weight w_1 adapts according to Equations (1). The experiment consisted of ten trials, where a single trial simulated time. Ten trials were enough for the weight w_1 to adapt to its asymptote.
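The update equations and the acquisition procedure above can be sketched in a few lines of Python. This is not the original MATLAB code, and the extracted text does not preserve all parameter values, so everything numeric here is an assumption made for illustration: c = 0.1, α = 0.6, β = 0, a fixed UCS weight of 0.6, a 20-step trial, a three-step CS that ends as a three-step UCS begins, and traces that reset at the start of each trial. The helper names `pulse` and `run_trial` are hypothetical.

```python
def pulse(on, off, n_steps):
    """Binary stimulus that is 1.0 for on <= t < off and 0.0 otherwise."""
    return [1.0 if on <= t < off else 0.0 for t in range(n_steps)]


def run_trial(w, stimuli, adaptable, c, alpha, beta):
    """Run one trial, updating the weights in w in place.

    stimuli[i][t] is input i at step t. Traces start at zero each trial
    (an assumption; the source does not state the inter-trial handling).
    Returns the output y at every step.
    """
    x_bar = [0.0] * len(w)
    y_bar = 0.0
    outputs = []
    for t in range(len(stimuli[0])):
        x = [s[t] for s in stimuli]
        # Weighted sum, clipped to [0, 1] as the text requires for y.
        y = min(1.0, max(0.0, sum(wi * xi for wi, xi in zip(w, x))))
        for i in adaptable:
            w[i] += c * (y - y_bar) * x_bar[i]
        x_bar = [alpha * xb + xi for xb, xi in zip(x_bar, x)]
        y_bar = beta * y_bar + (1.0 - beta) * y
        outputs.append(y)
    return outputs


N_STEPS = 20
ucs = pulse(5, 8, N_STEPS)   # assumed UCS timing
cs1 = pulse(2, 5, N_STEPS)   # assumed CS timing, ISI of 3 steps

w = [0.6, 0.0]               # [fixed UCS weight, adaptive CS weight]
history = []
first_output = None
last_output = None
for trial in range(10):
    out = run_trial(w, [ucs, cs1], adaptable=[1], c=0.1, alpha=0.6, beta=0.0)
    if trial == 0:
        first_output = out
    last_output = out
    history.append(w[1])

print("CS weight after each trial:", [round(v, 3) for v in history])
print("output at CS onset, first trial:", first_output[2])
print("output at CS onset, last trial: ", round(last_output[2], 3))
```

With these assumed settings the CS weight rises monotonically and the output at CS onset moves from zero on the first trial to a clearly positive value on the last, which mirrors the leftward shift of the CR described in the text. The exact asymptote depends on the chosen stimulus durations.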

2.2 Simulation (a) - CR dependency on ISI

In the second experiment, the ISI was varied between and 9 time. Ten trials (as described in the prior experiment) were conducted for each ISI setting, thus a total of data points (weight w_1) were gathered. For this simulation, the parameters were c =., α =.9, β =, w_0 =.6.

2.3 Simulation (b) - Blocking effects

The third experiment explored the blocking effect. Blocking occurs when a conditioned stimulus, CS2, is unable to alter the CR of the network because it is paired, with identical timing, with a CS1 that has previously been paired with a UCS and whose weight has already adapted. Figure 5 shows the time course of the CS1, CS2 and UCS signals during the three phases of the experiment. The first phase, shown in the upper plot of Figure 5, is the standard conditioning paradigm where a CS is paired with the UCS, adapting the weight on CS1 to asymptote, a necessary condition for testing the blocking effect. The next phase, shown in the middle plot, attempts to pair a new stimulus, CS2, coincident with the previously trained stimulus CS1. In the third phase, shown in the bottom plot, CS2 is allowed to precede CS1, thus allowing CS2 to appear novel. For this simulation, the parameters were c =., α =.6, β =, w_0 =.6.

2.4 Simulation (c) - Second-order conditioning

The fourth experiment explored the timing effects of the higher-order conditioning paradigm. Figure 7 shows the time course of the CS1 and CS2 signals during the two phases of the experiment. The first phase, shown in the upper plot, follows the standard CS-UCS paradigm, where the weight w_1 adapts to asymptote, a necessary condition for second-order conditioning to occur in the next phase. The second phase, shown in the middle plot, attempts to pair a new stimulus, CS2, with CS1, in the absence of the UCS. This second phase is repeated over a number of trials until both the CS1 and CS2 weights, w_1 and w_2, reach an asymptote. For this simulation, the parameters were c =., α =.6, β =, w_0 =.6.

3 Results

3.1 Simulation (a) - Acquisition of a CR

The results of the first experiment are shown in the two plots of Figure 2. The top plot displays the first trial run, and the bottom plot the tenth (and last) run. In both plots, the rise and decay (short-term memory) nature of the eligibility trace x̄_1 is evident. In the first trial, it is active coincident with the temporal difference term y − ȳ, thus the weight w_1 adapts (at time tick ). By making the weight non-zero, the CS1 signal contributes to the output (the CR), thus moving the onset of the CR signal earlier (leftward). By the tenth trial, the CR is coincident with the onset of CS1, but by this time, because the eligibility trace x̄_1 is zero, w_1 ceases to adapt.

[Figure 2 plots: two panels titled "Simulation (a): TD adaptive elements", one for the first trial and one for the last; vertical axis: relative activity.]

Figure 2: Shown are the first and last of a series of classical conditioning trials. The upper plot is the time course of the adaptive elements in the first of ten pairing trials, and the bottom plot is the time course of the last trial. In each, a CS precedes the UCS in the normal manner. The trace x̄_1 indicates the eligibility for modification of the CS1 weight. Not shown is this adaptive weight, which is zero at the beginning of trials. In trial 1, the output element y initially responds only to the UCS, but by trial 10, adaptation of the weight to its asymptote causes the output element to coincide with the onset of CS1. The product of the term y − ȳ, where ȳ is the expected output level, and x̄_1 determines the rate of weight increase. Thus, adaptation of this weight occurs until the trace ȳ moves leftward to the point where x̄_1 is zero, at which point the weight has reached asymptote.

3.2 Simulation (a) - CR dependency on ISI

Figure 3 plots the results of the experiment on CR dependency on ISI. The results show that the optimal ISI (parameter dependent) is equal to time. The efficacy of w_1 asymptotic adaptation decays after that peak is reached, to the point where no adaptation is possible (ISI > time ).

[Figure 3 plot: "Simulation (a): Variation of ISI"; horizontal axis: ISI (simulation time steps); vertical axis: asymptotic connection weight w_1.]

Figure 3: Shown is the effect of varying the inter-stimulus interval (ISI) between and time in a classical conditioning paradigm. The ISI is the time between the onset of the CS and the onset of the UCS. A test of a particular ISI requires the CS weight to reach asymptote, typically in ten trials (Figure 2 is an example from a trial set where the ISI equalled time ). The above plot demonstrates an optimal ISI equalling time, decaying exponentially to a point where no weight adjustment is possible (here, that point is an ISI greater than time ). Variation is of course dependent on simulation parameters (here, c =., α =.9, w_0 =.6).
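An ISI sweep of this kind can be sketched as follows. Again, this is an illustration under stated assumptions rather than the original simulation: a one-step CS at t = 2, a three-step UCS starting ISI steps later, c = 0.1, α = 0.9, β = 0, a UCS weight of 0.6, and 60 trials per setting so the weight settles. The helper names are hypothetical, and the location of the peak depends on these assumed stimulus durations.

```python
def pulse(on, off, n_steps):
    """Binary stimulus that is 1.0 for on <= t < off and 0.0 otherwise."""
    return [1.0 if on <= t < off else 0.0 for t in range(n_steps)]


def run_trial(w, stimuli, adaptable, c, alpha, beta):
    """One trial of the adaptive element; weights in w are updated in place."""
    x_bar = [0.0] * len(w)   # traces reset each trial (assumption)
    y_bar = 0.0
    for t in range(len(stimuli[0])):
        x = [s[t] for s in stimuli]
        y = min(1.0, max(0.0, sum(wi * xi for wi, xi in zip(w, x))))
        for i in adaptable:
            w[i] += c * (y - y_bar) * x_bar[i]
        x_bar = [alpha * xb + xi for xb, xi in zip(x_bar, x)]
        y_bar = beta * y_bar + (1.0 - beta) * y


def settled_weight(isi, n_trials=60, n_steps=20):
    """CS weight after repeated pairings at a given onset-to-onset ISI."""
    cs = pulse(2, 3, n_steps)                 # assumed one-step CS
    ucs = pulse(2 + isi, 5 + isi, n_steps)    # assumed three-step UCS
    w = [0.6, 0.0]                            # [fixed UCS weight, CS weight]
    for _ in range(n_trials):
        run_trial(w, [ucs, cs], adaptable=[1], c=0.1, alpha=0.9, beta=0.0)
    return w[1]


isis = list(range(0, 10))
weights = [settled_weight(isi) for isi in isis]
best_isi = isis[max(range(len(isis)), key=weights.__getitem__)]

for isi, weight in zip(isis, weights):
    print(f"ISI {isi}: settled CS weight {weight:+.3f}")
print("best ISI under these assumptions:", best_isi)
```

In this sketch simultaneous onset (ISI of 0) drives the weight negative, the largest weight occurs at the shortest forward interval, and the settled weight then falls off by a factor of α per additional step, which is the exponential decay with ISI that the text reports.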

3.3 Simulation (b) - Blocking effects

Figures 4 and 5 show the results of the exploration of blocking effects. Figure 4 plots the adaptation of the CS1 and CS2 weights over the course of the three phases of the experiment. In trials (phase one), the weight w_1 is allowed to reach its asymptote while CS2 is held inactive (thus the CS2 weight w_2 cannot adapt, and remains zero). In trials (phase two), CS2 is activated coincident with CS1, but the CS2 weight adapts only slightly and quickly asymptotes. It is blocked by CS1. However, beginning at trial (phase three), CS2 is allowed to precede CS1, thus the CS2 weight w_2 begins to positively adapt, due to its maximal eligibility trace x̄_2. The weight w_1 decays because its eligibility trace x̄_1 becomes progressively weaker against x̄_2. By trial, CS2 has become the predicting stimulus, and CS1 is now blocked.

[Figure 4 plot: "Simulation (b): Blocking"; curves: CS1 weight w_1 and CS2 weight w_2; horizontal axis: trials; vertical axis: connection weights.]

Figure 4: The results of a simulation sequence exploring the blocking effect are shown. The simulation consists of three phases, where the time course of the adaptive elements for each phase is shown in Figure 5. Phase 1 occurs during trials, where CS1 is paired with a UCS (CS2 is not active). The weight associated with CS1 reaches asymptote. Phase 2 occurs during trials, where CS2 is exactly coincident with CS1 during a conditioning trial. However, the weight associated with CS2 adapts only slightly, demonstrating the blocking effect of the CS1 weight. In phase 3, occurring in trials, CS2 is allowed to precede CS1, thus allowing adaptation of both weights. By trial, CS2 has adapted itself to become the novel stimulus, and CS1 is blocked.

[Figure 5 plots: three panels titled "Simulation (b): TD adaptive elements", one per phase; vertical axis: relative activity.]

Figure 5: Shown is the time course of the adaptive elements during the three phases of a simulation sequence exploring the blocking effect, the results of which are shown in Figure 4. The first phase, shown in the upper plot, is the standard conditioning paradigm where a CS is paired with a UCS, adapting the weight w_1 to asymptote. The next phase, shown in the middle plot, attempts to pair a new stimulus, CS2, coincident with the previously trained stimulus CS1. As shown in Figure 4 (trials ), the weight associated with CS2 adapts only slightly. In the third phase, shown in the bottom plot, CS2 precedes CS1, thus appearing novel, and allowing adaptation of both the CS1 and CS2 weights (shown in trials in Figure 4).
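The first two phases of the blocking experiment can be sketched in the same way, together with a control run in which the compound is trained without pretraining. The third phase (CS2 preceding CS1) is left out to keep the example short. All numeric settings are assumptions for illustration (c = 0.1, α = 0.6, β = 0, UCS weight 0.6, a three-step CS ending at UCS onset, 40 pretraining trials and 20 compound trials), and the helper names are hypothetical.

```python
def pulse(on, off, n_steps):
    """Binary stimulus that is 1.0 for on <= t < off and 0.0 otherwise."""
    return [1.0 if on <= t < off else 0.0 for t in range(n_steps)]


def run_trial(w, stimuli, adaptable, c=0.1, alpha=0.6, beta=0.0):
    """One trial of the adaptive element; weights in w are updated in place."""
    x_bar = [0.0] * len(w)   # traces reset each trial (assumption)
    y_bar = 0.0
    for t in range(len(stimuli[0])):
        x = [s[t] for s in stimuli]
        y = min(1.0, max(0.0, sum(wi * xi for wi, xi in zip(w, x))))
        for i in adaptable:
            w[i] += c * (y - y_bar) * x_bar[i]
        x_bar = [alpha * xb + xi for xb, xi in zip(x_bar, x)]
        y_bar = beta * y_bar + (1.0 - beta) * y


N_STEPS = 20
ucs = pulse(5, 8, N_STEPS)
cs = pulse(2, 5, N_STEPS)        # same timing used for CS1 and CS2
silent = [0.0] * N_STEPS

# Blocking group: pretrain CS1 alone, then train the CS1 + CS2 compound.
blocked = [0.6, 0.0, 0.0]        # [UCS weight (fixed), CS1 weight, CS2 weight]
for _ in range(40):
    run_trial(blocked, [ucs, cs, silent], adaptable=[1, 2])
for _ in range(20):
    run_trial(blocked, [ucs, cs, cs], adaptable=[1, 2])

# Control group: compound training only, with no pretraining of CS1.
control = [0.6, 0.0, 0.0]
for _ in range(20):
    run_trial(control, [ucs, cs, cs], adaptable=[1, 2])

print("blocking group CS1, CS2 weights:", [round(v, 4) for v in blocked[1:]])
print("control group  CS1, CS2 weights:", [round(v, 4) for v in control[1:]])
```

Under these assumptions the pretrained stimulus already accounts for the UCS, so the difference term is close to zero when the compound is presented and the CS2 weight barely moves, whereas in the control run the two stimuli share the available associative strength equally.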

3.4 Simulation (c) - Second-order conditioning

Figures 6 and 7 plot the results of the experiment on second-order conditioning. Referring to Figure 6, during trials, CS1 is paired with the UCS such that weight w_1 reaches asymptote by trial (the time course is shown in the top plot of Figure 7). Second-order conditioning begins at trial with the termination of the UCS and activation of CS2 prior to CS1 (the time course is shown in the middle plot of Figure 7). CS2 now begins positively adapting its weight w_2 in response to CS1. However, because CS1 is itself no longer reinforced by the UCS, the weight w_1 decays. Weights w_1 and w_2 equal each other around trial 9, but both decay to zero by the trial, due to a lack of hard reinforcement (which takes the form of the fixed weight w_0 of the UCS).

[Figure 6 plot: "Simulation (c): Second order conditioning"; curves: CS1 weight w_1 and CS2 weight w_2; horizontal axis: trials; vertical axis: connection weights.]

Figure 6: The results of a simulation sequence exploring the second-order conditioning paradigm are shown. The simulation consists of two phases, where the time course of the adaptive elements for each phase is shown in Figure 7. Phase 1 occurs during trials, where CS1 is paired with a UCS in the standard manner (CS2 is not active). The weight associated with CS1 reaches asymptote by trial. Phase 2 occurs during trials, where CS2 is paired with CS1, which acts as a reinforcing UCS in the absence of the UCS signal. The effect demonstrated in the second phase (trials ) is an initial adaptation of the CS2 weight, coincident with a decrease in the CS1 weight, due to the absence of the UCS to reinforce it. By trial, both the CS1 and CS2 weights have decayed to zero.

[Figure 7 plots: three panels titled "Simulation (c): TD adaptive elements"; vertical axis: relative activity.]

Figure 7: Shown is the time course of the adaptive elements during two phases of a simulation sequence exploring second-order conditioning, the results of which are shown in Figure 6. The first phase, shown in the upper plot, is the standard CS-UCS paradigm, where the weight (shown in Figure 6) associated with CS1 adapts to asymptote. The second phase, shown in the middle plot, attempts to pair a new stimulus, CS2, with CS1, in the absence of the UCS. Pairing is successful early on, as evidenced by the output element trace in the middle plot, although its output is not nearly as strong as the trace of the CS-UCS pairing shown in the top plot. The bottom plot shows the result of repeated second-order conditioning trials. The output element trace has decayed to 0, indicating that the weights associated with CS1 and CS2 have decayed to zero. Figure 6 demonstrates this effect.
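The rise-then-decay pattern of second-order conditioning can also be reproduced in a short sketch. As before, this is not the original code and every number is an assumption: c = 0.1, α = 0.6, β = 0, UCS weight 0.6, three-step stimuli presented back to back (CS2, then CS1, then the UCS), 40 first-order trials and 60 second-order trials. Helper names are hypothetical.

```python
def pulse(on, off, n_steps):
    """Binary stimulus that is 1.0 for on <= t < off and 0.0 otherwise."""
    return [1.0 if on <= t < off else 0.0 for t in range(n_steps)]


def run_trial(w, stimuli, adaptable, c=0.1, alpha=0.6, beta=0.0):
    """One trial of the adaptive element; weights in w are updated in place."""
    x_bar = [0.0] * len(w)   # traces reset each trial (assumption)
    y_bar = 0.0
    for t in range(len(stimuli[0])):
        x = [s[t] for s in stimuli]
        y = min(1.0, max(0.0, sum(wi * xi for wi, xi in zip(w, x))))
        for i in adaptable:
            w[i] += c * (y - y_bar) * x_bar[i]
        x_bar = [alpha * xb + xi for xb, xi in zip(x_bar, x)]
        y_bar = beta * y_bar + (1.0 - beta) * y


N_STEPS = 20
cs2 = pulse(2, 5, N_STEPS)
cs1 = pulse(5, 8, N_STEPS)
ucs = pulse(8, 11, N_STEPS)
silent = [0.0] * N_STEPS

w = [0.6, 0.0, 0.0]          # [UCS weight (fixed), CS1 weight, CS2 weight]

# Phase 1: first-order conditioning, CS1 followed by the UCS.
for _ in range(40):
    run_trial(w, [ucs, cs1, silent], adaptable=[1, 2])

# Phase 2: CS2 followed by CS1, with the UCS withheld.
w1_history, w2_history = [], []
for _ in range(60):
    run_trial(w, [silent, cs1, cs2], adaptable=[1, 2])
    w1_history.append(w[1])
    w2_history.append(w[2])

peak_trial = max(range(len(w2_history)), key=w2_history.__getitem__) + 1
print("peak CS2 weight:", round(max(w2_history), 3), "at phase-2 trial", peak_trial)
print("final CS1, CS2 weights:", w1_history[-1], w2_history[-1])
```

With these settings the CS2 weight climbs for the first few second-order trials while the CS1 weight shrinks on every trial, and both end near zero because nothing in phase 2 replaces the reinforcement that the fixed UCS weight provided.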

4 Discussion

The simulated Sutton-Barto model successfully demonstrates known timing and contextual effects of classical conditioning. It also provides a mechanistic explanation of most aspects of the Rescorla-Wagner theory of classical conditioning. It does so by operating in a lumped-trial manner, where, within a trial, model variables are fully specified, allowing insight into effects that occur across trials. For instance, in the blocking experiment shown in Figure, it is evident that the CR to stimuli cannot exceed some fixed level (λ = w_0). Also, the Sutton-Barto model successfully demonstrates the notion of reinforcement as the difference between actual and expected output level, in contrast to the less sophisticated behavior of the Hebbian learning rule.

References

[1] D.S. Levine. Introduction to Neural and Cognitive Modeling, 2nd Edition. Lawrence Erlbaum Associates (London), 2000.
[2] R.S. Sutton and A.G. Barto. Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88(2):135-170, 1981.