A Simulation of Sutton and Barto's Temporal Difference Conditioning Model

Nick Schmansky
Department of Cognitive and Neural Systems
Boston University
May

Abstract

A simulation of the Sutton and Barto [2] model of classical conditioning is shown to exhibit known timing effects of that conditioning paradigm. The effects include an accurate modelling of the time course of the shift of the CR away from the onset of the UCS and toward the onset of the CS during an acquisition cycle; a graded CR response dependent upon ISI timing; the blocking effect; and second-order conditioning. Key to the success of this model is the inclusion of short-term memory trace variables x̄_i and ȳ, describing the inputs and output; and the dependence of weight adaptation on a temporal-difference term y − ȳ.

Introduction

Sutton and Barto [2] introduced a learning network attempting to explain the effects observed during classical conditioning. The Sutton-Barto model is a simple adaptive element described by Equations (1). A three-input form of an element is shown in Figure 1. At its simplest, the model acts as a perceptron: the output y is a summation of weighted inputs. However, the model includes two additional sets of variables which are critical to the model. Each input stimulus x_i has an associated eligibility trace x̄_i, which acts as a short-term memory of x_i, and is used to indicate when and by how much the weight w_i associated with x_i is modified. A similar variable ȳ exists for the output, where ȳ is a weighted average of the element's past activity.

An important aspect of the Sutton-Barto model is the dependence of weight modification on the difference term y − ȳ (and x̄_i). This is in contrast to Hebbian learning, which depends on only y (and x̄_i).
The inclusion of the y − ȳ temporal-difference term in the Sutton-Barto model is critical to accounting for the timing results observed in classical conditioning experiments, namely the blocking effect, interstimulus interval variations, and higher-order conditioning effects.

\[
\begin{aligned}
\bar{x}_i(t+1) &= \alpha\,\bar{x}_i(t) + x_i(t) \\
\bar{y}(t+1) &= \beta\,\bar{y}(t) + (1-\beta)\,y(t) \\
w_i(t+1) &= w_i(t) + c\,\bigl(y(t) - \bar{y}(t)\bigr)\,\bar{x}_i(t) \\
y(t) &= \sum_{i=1}^{n} w_i(t)\,x_i(t)
\end{aligned}
\tag{1}
\]

[Figure 1: The simulated Sutton-Barto adaptive element described by Equations (1). The inputs x_i (the CSs and the UCS) pass through weights w_i to a single output, the UR / CR.]

In Equations (1), α and β are constants ranging from 0 to 1, and c is a positive learning rate constant. Variables y(t) and ȳ must lie in the interval [0, 1], and the eligibility trace variable x̄ is always greater than 0.

A simulation of the Sutton-Barto model was conducted for the network configuration shown in Figure 1, based on Equations (1). The simulation investigated the effects of timing differences between the input stimuli. Two inputs acted as conditioned stimuli, CS1 and CS2, and a third input simulated an unconditioned stimulus, UCS.

Methods

The MATLAB toolkit was used to develop and execute the simulation. Four experiments were conducted, each highlighting known timing effects of classical conditioning.

Simulation (a) - Acquisition of a CR

The first experiment simulated the acquisition of a conditioned response, CR, upon pairing CS1 with the UCS, following the basic classical conditioning paradigm. The timing relationship between CS1 and the UCS is shown in the plots of Figure 2. The ISI (the time between the onset of CS1 and the onset of the UCS) is held fixed. For this simulation, the parameters were c =., α =.6, β =, w =.6. Note that the weight associated with the UCS is fixed, whereas the CS1 weight adapts according to Equations (1). The experiment consisted of ten trials, each simulating a fixed span of time. Ten trials were enough for the CS1 weight to adapt to its asymptote.
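The update rules of the adaptive element and the acquisition procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the MATLAB code used for the experiments: the function name `run_trial`, the stimulus timing, the parameter values (c = 0.1, α = 0.6, β = 0, UCS weight 0.6), the clipping of the output to [0, 1], and the resetting of traces at the start of each trial are all assumptions made for the example.

```python
import numpy as np

def run_trial(w, x, alpha, beta, c, adaptive):
    """One trial of the adaptive element (hypothetical helper).

    w: weights, shape (n,); x: inputs, shape (T, n);
    adaptive: 1.0 where a weight may change, 0.0 where it is fixed.
    Traces start at zero each trial (an assumption of this sketch).
    """
    xbar = np.zeros_like(w)   # eligibility traces of the inputs
    ybar = 0.0                # trace of the output
    y_trace = np.zeros(len(x))
    for t, x_t in enumerate(x):
        y = float(np.clip(w @ x_t, 0.0, 1.0))      # weighted sum, kept in [0, 1]
        y_trace[t] = y
        w = w + c * (y - ybar) * xbar * adaptive   # weight update uses traces at time t
        xbar = alpha * xbar + x_t                  # input trace for time t + 1
        ybar = beta * ybar + (1.0 - beta) * y      # output trace for time t + 1
    return w, y_trace

# Assumed timing: 30 steps per trial, CS1 on for steps 5-7, UCS on for steps 10-12.
x = np.zeros((30, 2))
x[5:8, 0] = 1.0     # CS1
x[10:13, 1] = 1.0   # UCS

w = np.array([0.0, 0.6])          # CS1 weight starts at zero; UCS weight is fixed
adaptive = np.array([1.0, 0.0])
history = []
for trial in range(10):
    w, y_trace = run_trial(w, x, alpha=0.6, beta=0.0, c=0.1, adaptive=adaptive)
    history.append(w[0])

print([round(v, 3) for v in history])
```

With these assumed settings the CS1 weight grows on every trial and levels off well below the fixed UCS weight. The growth comes from the positive y − ȳ at UCS onset, which coincides with a still-active CS1 eligibility trace.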

Simulation (a) - CR dependency on ISI

In the second experiment, the ISI was varied between and 9 time. Ten trials (as described in the prior experiment) were conducted for each ISI setting, thus a set of data points (CS1 weight values) was gathered across the ISI settings. For this simulation, the parameters were c =., α =.9, β =, w =.6.

Simulation (b) - Blocking effects

The third experiment explored the blocking effect. This is where a conditioned stimulus, CS2, is unable to alter the CR of the network if it is paired in an identically timed manner with a CS1 that has previously been paired with a UCS, resulting in adaptation of the CS1 weight. Figure 5 shows the time course of the CS1, CS2 and UCS signals during the three phases of the experiment. The first phase, shown in the upper plot of Figure 5, is the standard conditioning paradigm where a CS is paired with the UCS, adapting the CS1 weight to asymptote, a necessary condition for testing the blocking effect. The next phase, shown in the middle plot, attempts to pair a new stimulus, CS2, coincident with the previously trained stimulus CS1. In the third phase, shown in the bottom plot, CS2 is allowed to precede CS1, thus allowing CS2 to appear novel. For this simulation, the parameters were c =., α =.6, β =, w =.6.

Simulation (c) - Second-order conditioning

The fourth experiment explored the timing effects of the higher-order conditioning paradigm. Figure 7 shows the time course of the CS1 and CS2 signals during the two phases of the experiment. The first phase, shown in the upper plot, follows the standard CS-UCS paradigm, where the CS1 weight adapts to asymptote, a necessary condition for second-order conditioning to occur in the next phase. The second phase, shown in the middle plot, attempts to pair a new stimulus, CS2, with CS1, in the absence of the UCS. This second phase is repeated over a number of trials until both the CS1 and CS2 weights reach an asymptote. For this simulation, the parameters were c =., α =.6, β =, w =.6.

Results

Simulation (a) - Acquisition of a CR

The results of the first experiment are shown in the two plots of Figure 2. The top plot displays the first trial run, and the bottom plot the tenth (and last) run. In both plots, the rise and decay (short-term memory) nature of the eligibility trace x̄ is evident. In the first trial, the trace is active coincident with the temporal-difference term y − ȳ, thus the CS1 weight adapts. By making the weight non-zero, the CS1 signal contributes to the output (the CR), thus moving the onset of the CR signal earlier (leftward). By the tenth trial, the CR is coincident with the onset of CS1, but by this time, because the eligibility trace x̄ is zero at that onset, the weight ceases to adapt.

[Figure 2 panels: "Simulation (a): TD adaptive elements", first trial (top) and last trial (bottom); traces: UCS, CS1, xbar, yBar, w; vertical axis: relative activity.]

Figure 2: Shown are the first and last of a series of classical conditioning trials. The upper drawing is the time course of the adaptive elements in the first of ten pairing trials, and the bottom drawing is the time course of the last trial. In each, a CS precedes the UCS in the normal manner. The trace x̄ indicates the eligibility for modification of the weight. Not shown is this adaptive weight, which is zero at the beginning of trials. In the first trial, the output element y initially responds only to the UCS, but by the last trial, adaptation of the weight to its asymptote causes the output element to coincide with the onset of CS1. The product of the term y − ȳ, where ȳ is the expected output level, and x̄ determines the rate of weight increase. Thus, adaptation of this weight occurs until the y − ȳ trace moves leftward to the point where x̄ is zero, at which point the weight has reached asymptote.

Simulation (a) - CR dependency on ISI

Figure 3 plots the results of the experiment on CR dependency on ISI. The results show that there is an optimal ISI, whose value is parameter dependent. The efficacy of asymptotic weight adaptation decays after that peak is reached, to the point where no adaptation is possible.

[Figure 3 plot: "Simulation (a): Variation of ISI"; vertical axis: asymptotic connection weight w; horizontal axis: ISI (simulation time steps).]

Figure 3: Shown is the effect of varying the inter-stimulus interval (ISI) in a classical conditioning paradigm. The ISI is the time between the onset of the CS and the onset of the UCS. A test of a particular ISI requires the CS weight to reach asymptote, typically in ten trials (Figure 2 is an example from one such trial set). The above plot demonstrates an optimal ISI, with the asymptotic weight decaying exponentially beyond it to a point where no weight adjustment is possible. Variation is of course dependent on simulation parameters (here, c =., α =.9, w =.6).
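The ISI sweep can be illustrated with the following Python sketch. It is not the original MATLAB experiment: the one-step CS pulse, the three-step UCS, the parameter values (c = 0.1, α = 0.9, β = 0, UCS weight 0.6), the 100 trials per setting, and the helper names `run_trial` and `asymptotic_weight` are all assumptions chosen so that the example is short and fully specified.

```python
import numpy as np

def run_trial(w, x, alpha, beta, c, adaptive):
    """One trial of the adaptive element; traces reset each trial (assumed)."""
    xbar = np.zeros_like(w)
    ybar = 0.0
    for x_t in x:
        y = float(np.clip(w @ x_t, 0.0, 1.0))      # output kept in [0, 1]
        w = w + c * (y - ybar) * xbar * adaptive   # temporal-difference weight update
        xbar = alpha * xbar + x_t                  # eligibility trace
        ybar = beta * ybar + (1.0 - beta) * y      # output trace
    return w

def asymptotic_weight(isi, trials=100, alpha=0.9, beta=0.0, c=0.1, T=30):
    """CS weight after repeated pairings at one ISI (hypothetical helper)."""
    x = np.zeros((T, 2))
    x[2, 0] = 1.0                      # one-step CS pulse at step 2
    x[2 + isi:2 + isi + 3, 1] = 1.0    # three-step UCS, onset isi steps later
    w = np.array([0.0, 0.6])           # CS weight adapts; UCS weight is fixed
    adaptive = np.array([1.0, 0.0])
    for _ in range(trials):
        w = run_trial(w, x, alpha, beta, c, adaptive)
    return w[0]

isis = list(range(0, 10))
weights = [asymptotic_weight(isi) for isi in isis]
best_isi = isis[int(np.argmax(weights))]

for isi, w_cs in zip(isis, weights):
    print(isi, round(w_cs, 4))
print("best ISI:", best_isi)
```

Under these assumptions the weight peaks at the shortest non-zero ISI and then falls off by a factor of α per additional step, because the CS eligibility trace has decayed by that factor when the UCS arrives. Simultaneous onset (ISI of zero) yields no positive weight, since the trace is still zero when the output rises. The peak location and the point at which adaptation vanishes depend on the stimulus shapes and parameters, so this sketch should not be read as reproducing the values in Figure 3.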

Simulation (b) - Blocking effects

Figures 4 and 5 show the results of the exploration of blocking effects. Figure 4 plots the adaptation of the CS1 and CS2 weights over the course of the three phases of the experiment. In phase one, the CS1 weight is allowed to reach its asymptote while CS2 is held inactive (thus the CS2 weight cannot adapt, and remains zero). In phase two, CS2 is activated coincident with CS1, but the CS2 weight adapts only slightly and quickly asymptotes. It is blocked by CS1. However, beginning in phase three, CS2 is allowed to precede CS1, thus the CS2 weight begins to positively adapt, due to its maximal eligibility trace. The CS1 weight decays because its eligibility trace becomes progressively weaker against that of CS2. Eventually, CS2 becomes the predicting stimulus, and CS1 is now blocked.

[Figure 4 plot: "Simulation (b): Blocking"; curves: CS1 weight, CS2 weight; vertical axis: connection weights; horizontal axis: trials.]

Figure 4: The results of a simulation sequence exploring the blocking effect are shown. The simulation consists of three phases, where the time course of the adaptive elements for each phase is shown in Figure 5. In phase 1, CS1 is paired with a UCS (CS2 is not active). The weight associated with CS1 reaches asymptote. In phase 2, CS2 is exactly coincident with CS1 during a conditioning trial. However, the weight associated with CS2 adapts only slightly, demonstrating the blocking effect of the CS1 weight. In phase 3, CS2 is allowed to precede CS1, thus allowing adaptation of both weights. Eventually, CS2 has adapted itself to become the novel stimulus, and CS1 is blocked.
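The first two phases of the blocking experiment can be sketched as follows, together with a control run in which the compound stimulus is trained without any pretraining. The sketch is illustrative only: the one-step CS pulses, the three-step UCS, the trial counts, the parameter values (c = 0.1, α = 0.6, β = 0, UCS weight 0.6), and the helper names `run_trial`, `trial_inputs` and `train` are assumptions, not the settings of the original simulation.

```python
import numpy as np

def run_trial(w, x, alpha, beta, c, adaptive):
    """One trial of the adaptive element; traces reset each trial (assumed)."""
    xbar = np.zeros_like(w)
    ybar = 0.0
    for x_t in x:
        y = float(np.clip(w @ x_t, 0.0, 1.0))      # output kept in [0, 1]
        w = w + c * (y - ybar) * xbar * adaptive   # temporal-difference weight update
        xbar = alpha * xbar + x_t                  # eligibility traces
        ybar = beta * ybar + (1.0 - beta) * y      # output trace
    return w

def trial_inputs(cs1_on, cs2_on, ucs_on, T=20):
    """Input columns are CS1, CS2, UCS; None leaves a stimulus off."""
    x = np.zeros((T, 3))
    if cs1_on is not None:
        x[cs1_on, 0] = 1.0
    if cs2_on is not None:
        x[cs2_on, 1] = 1.0
    if ucs_on is not None:
        x[ucs_on:ucs_on + 3, 2] = 1.0
    return x

def train(w, x, trials, alpha=0.6, beta=0.0, c=0.1):
    adaptive = np.array([1.0, 1.0, 0.0])   # the UCS weight stays fixed
    for _ in range(trials):
        w = run_trial(w, x, alpha, beta, c, adaptive)
    return w

w0 = np.array([0.0, 0.0, 0.6])
phase1 = trial_inputs(cs1_on=2, cs2_on=None, ucs_on=4)   # CS1 alone, then UCS
phase2 = trial_inputs(cs1_on=2, cs2_on=2, ucs_on=4)      # CS1 and CS2 together, then UCS

w_pretrained = train(w0, phase1, trials=50)
w_blocked = train(w_pretrained, phase2, trials=20)
w_control = train(w0, phase2, trials=20)   # same compound training, no pretraining

print("CS2 weight after pretraining CS1:", round(w_blocked[1], 4))
print("CS2 weight without pretraining:  ", round(w_control[1], 4))
```

In this sketch the CS2 weight stays close to zero when CS1 has been pretrained, while the control run gives CS2 a substantial weight. Once the CS1 weight alone accounts for the reinforcement, the positive and negative y − ȳ contributions cancel over a compound trial and leave nothing for CS2 to acquire.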

[Figure 5 panels: "Simulation (b): TD adaptive elements", one panel per phase; traces: UCS, CS1 and its xbar, CS2 and its xbar, yBar; vertical axis: relative activity.]

Figure 5: Shown is the time course of the adaptive elements during the three phases of a simulation sequence exploring the blocking effect, the results of which are shown in Figure 4. The first phase, shown in the upper plot, is the standard conditioning paradigm where a CS is paired with a UCS, adapting the CS1 weight to asymptote. The next phase, shown in the middle plot, attempts to pair a new stimulus, CS2, coincident with the previously trained stimulus CS1. As shown in Figure 4, the weight associated with CS2 adapts only slightly. In the third phase, shown in the bottom plot, CS2 precedes CS1, thus appearing novel, and allowing adaptation of both the CS1 and CS2 weights (also shown in Figure 4).

Simulation (c) - Second-order conditioning

Figures 6 and 7 plot the results of the experiment on second-order conditioning. Referring to Figure 6, during the first phase CS1 is paired with the UCS such that the CS1 weight reaches asymptote (the time course is shown in the top plot of Figure 7). Second-order conditioning begins with the termination of the UCS and the activation of CS2 prior to CS1 (the time course is shown in the middle plot of Figure 7). CS2 now begins positively adapting its weight in response to CS1. However, because CS1 is itself no longer reinforced by the UCS, the CS1 weight decays. The two weights equal each other around trial 9, but both eventually decay to zero, due to a lack of hard reinforcement (which takes the form of the fixed weight of the UCS).

[Figure 6 plot: "Simulation (c): Second order conditioning"; curves: CS1 weight, CS2 weight; vertical axis: connection weights; horizontal axis: trials.]

Figure 6: The results of a simulation sequence exploring the second-order conditioning paradigm are shown. The simulation consists of two phases, where the time course of the adaptive elements for each phase is shown in Figure 7. In phase 1, CS1 is paired with a UCS in the standard manner (CS2 is not active). The weight associated with CS1 reaches asymptote. In phase 2, CS2 is paired with CS1, which acts as a reinforcing UCS in the absence of the UCS signal. The effect demonstrated in the second phase is an initial adaptation of the CS2 weight, coincident with a decrease in the CS1 weight, due to the absence of the UCS to reinforce it. By the final trials, both the CS1 and CS2 weights have decayed to zero.
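The rise and subsequent decay of the CS2 weight can be reproduced qualitatively with a short Python sketch. As before, this is an illustration under stated assumptions rather than the original MATLAB simulation: the one-step CS pulses, the stimulus timing, the trial counts, the parameter values (c = 0.1, α = 0.6, β = 0, UCS weight 0.6), and the helper name `run_trial` are all chosen for the example.

```python
import numpy as np

def run_trial(w, x, alpha, beta, c, adaptive):
    """One trial of the adaptive element; traces reset each trial (assumed)."""
    xbar = np.zeros_like(w)
    ybar = 0.0
    for x_t in x:
        y = float(np.clip(w @ x_t, 0.0, 1.0))      # output kept in [0, 1]
        w = w + c * (y - ybar) * xbar * adaptive   # temporal-difference weight update
        xbar = alpha * xbar + x_t                  # eligibility traces
        ybar = beta * ybar + (1.0 - beta) * y      # output trace
    return w

T = 20
# Input columns are CS1, CS2, UCS.
phase1 = np.zeros((T, 3))
phase1[4, 0] = 1.0      # CS1 pulse
phase1[6:9, 2] = 1.0    # UCS follows CS1

phase2 = np.zeros((T, 3))
phase2[2, 1] = 1.0      # CS2 pulse precedes CS1
phase2[4, 0] = 1.0      # CS1 pulse, no UCS in this phase

params = dict(alpha=0.6, beta=0.0, c=0.1, adaptive=np.array([1.0, 1.0, 0.0]))

w = np.array([0.0, 0.0, 0.6])
for _ in range(50):                 # phase 1: first-order conditioning of CS1
    w = run_trial(w, phase1, **params)

w1_hist, w2_hist = [], []
for _ in range(100):                # phase 2: CS2 paired with CS1 only
    w = run_trial(w, phase2, **params)
    w1_hist.append(w[0])
    w2_hist.append(w[1])

print("peak CS2 weight:", round(max(w2_hist), 4))
print("final CS1, CS2 weights:", w1_hist[-1], w2_hist[-1])
```

In this sketch the CS2 weight first grows, because the onset of the CR to CS1 produces a positive y − ȳ while the CS2 eligibility trace is active, and then decays along with the CS1 weight once nothing reinforces CS1. The trial at which the two weights cross depends on the assumed parameters.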

[Figure 7 panels: "Simulation (c): TD adaptive elements", three panels; traces: UCS, CS1 and its xbar, CS2 and its xbar, yBar; vertical axis: relative activity.]

Figure 7: Shown is the time course of the adaptive elements during two phases of a simulation sequence exploring second-order conditioning, the results of which are shown in Figure 6. The first phase, shown in the upper plot, is the standard CS-UCS paradigm, where the weight associated with CS1 (shown in Figure 6) adapts to asymptote. The second phase, shown in the middle plot, attempts to pair a new stimulus, CS2, with CS1, in the absence of the UCS. Pairing is successful early on, as evidenced by the output element trace in the middle plot, although its output is not nearly as strong as the trace of the CS-UCS pairing shown in the top plot. The bottom plot shows the result of repeated second-order conditioning trials. The output element trace has decayed to zero, indicating that the weights associated with CS1 and CS2 have decayed to zero. Figure 6 demonstrates this effect.

Discussion

The simulated Sutton-Barto model successfully demonstrates known timing and contextual effects of classical conditioning. It also provides a mechanistic explanation of most aspects of the Rescorla-Wagner theory of classical conditioning. It does so by operating in a lumped-trial manner, where, within a trial, model variables are fully specified, allowing insight into effects that occur across trials. For instance, in the blocking experiment shown in Figure 4, it is evident that the CR to stimuli cannot exceed some fixed level (λ = w). Also, the Sutton-Barto model successfully demonstrates the notion of reinforcement as the difference between actual and expected output level, in contrast to the less sophisticated behavior of the Hebbian learning rule.

References

[1] D.S. Levine. Introduction to Neural and Cognitive Modeling, 2nd Edition. Lawrence Erlbaum Associates (London), 2000.
[2] R.S. Sutton and A.G. Barto. Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88(2):135-170, 1981.