Exp Brain Res (1998) 121:350–354 © Springer-Verlag 1998

RESEARCH NOTE

Roland E. Suri · Wolfram Schultz

Learning of sequential movements by neural network model with dopamine-like reinforcement signal

Received: 24 November 1997 / Accepted: 30 April 1998

Abstract  Dopamine neurons appear to code an error in the prediction of reward. They are activated by unpredicted rewards, are not influenced by predicted rewards, and are depressed when a predicted reward is omitted. After conditioning, they respond to reward-predicting stimuli in a similar manner. With these characteristics, the dopamine response strongly resembles the predictive reinforcement teaching signal of neural network models implementing the temporal difference learning algorithm. This study explored a neural network model that used a reward-prediction error signal strongly resembling dopamine responses for learning movement sequences. A different stimulus was presented in each step of the sequence and required a different movement reaction, and reward occurred at the end of the correctly performed sequence. The dopamine-like predictive reinforcement signal efficiently allowed the model to learn long sequences. By contrast, learning with an unconditional reinforcement signal required synaptic eligibility traces of longer and biologically less plausible durations for obtaining satisfactory performance. Thus, dopamine-like neuronal signals constitute excellent teaching signals for learning sequential behavior.

Key words  Basal ganglia · Teaching signal · Temporal difference · Synaptic plasticity · Eligibility

Matlab programs of this model are available at ftp://ftp.usc.edu/pub/bsl/suri/suri_schultz

R.E. Suri · W. Schultz (✉)
Institute of Physiology, University of Fribourg, CH-1700 Fribourg, Switzerland
e-mail: Wolfram.Schultz@unifr.ch, Tel.: +41-26-300 8611, Fax: +41-26-300 9675

R.E. Suri
USC Brain Project, University of Southern California, Hedco Neurosciences Building, 3614 Watt Way, Los Angeles, CA 90089-2520, USA

Introduction

A large body of evidence suggests that learning is crucially dependent on the degree of unpredictability of reinforcers (Rescorla and Wagner 1972; Dickinson 1980). Only reinforcers occurring at least to some degree unpredictably will sustain learning, and learning curves reach asymptotes when all reinforcers are fully predicted. The discrepancy between the occurrence of reinforcement and its prediction is referred to as an error in the prediction of reinforcement. Error-driven learning is employed in a large variety of neural models and has been particularly elaborated in the temporal-difference (TD) algorithm, which computes the prediction error continuously in real time and establishes reinforcer predictions (Sutton and Barto 1990). The TD algorithm can be implemented in an explicit critic-actor architecture (Barto et al. 1983; Sutton and Barto 1998). The prediction error is computed and emitted as a teaching signal by the critic in order to adapt synaptic weights in the actor, which directs behavioral output (Montague et al. 1993; Friston et al. 1994). This architecture resembles the anatomical structure of the basal ganglia: the critic taking the place of the nigrostriatal dopamine neurons, and the actor corresponding to the striatum (Barto 1995; Houk et al. 1995). The prediction-error signal of the critic is strikingly similar to the activities of midbrain dopamine neurons (Montague et al. 1996; Schultz et al. 1997). Both signals are increased by unpredicted rewards, are not influenced by fully predicted rewards, and are decreased by omission of predicted rewards. They are transferred to the earliest reward-predicting events through experience and, thus, predict rewards before they occur rather than reporting them only after the behavior.
Neurophysiological and inactivation studies suggest an important involvement of the basal ganglia in movement sequences (Kermadi and Joseph 1995; Mushiake and Strick 1995; Miyachi et al. 1997). Disorders of dopamine neurotransmission in the striatum impair serially ordered movements in human patients (Benecke et al. 1987; Phillips et al. 1993). As reinforcement-learning tasks may constitute sequential Markov decision processes (Sutton and Barto 1998), we aimed at investigating whether TD prediction-error signals with similar characteristics as dopamine responses could be used for learning sequential movements.

Methods and algorithms

The task consisted of a defined sequence of seven specific stimulus-action pairs. Presentation of one stimulus (A, B, C, D, E, F, or G) elicited one of seven actions (Q, R, S, T, U, V, or W). The stimulus-action pairs followed each other at intervals of 300 ms. Reward was delivered at the end of the sequence when all individual actions had been correctly chosen (A–Q ⇒ B–R ⇒ C–S ⇒ D–T ⇒ E–U ⇒ F–V ⇒ G–W ⇒ reward). The sequence was learned backwards by associating each stimulus with a particular action by trial and error, in a total of seven blocks of 100 trials. In the first block, stimulus G required action W in order to lead to reward (G–W ⇒ reward). Sequence length increased by one stimulus-action pair in each training block of 100 trials. Correct action in each step resulted in the appearance of the previously learned stimulus of the subsequent step, which predicted the terminal reward and thus constituted a conditioned reinforcer (e.g., F–V ⇒ G–W ⇒ reward). Any incorrect action terminated the trial. Learning of the sequence was simulated 10 times, and average learning curves were computed.

Modelling the dopamine prediction-error signal

In the critic component of the model (Fig. 1, right), a stimulus l was represented as a series of signals x_lm(t) of varying durations (Fig. 2A, lines 3–5) in order to reproduce timing mechanisms involved in the depression of dopamine activity at the time of omitted reward. A number of such signals sufficient for covering the duration of the interstimulus intervals was chosen (m=1, 2, 3 for the 300-ms interstimulus interval; m=1, 2, …, 10 for a 1-s interstimulus interval).
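This temporal stimulus representation can be sketched as follows. This is a minimal illustration under assumptions of ours (the function name, array shapes, and onset convention are not taken from the authors' Matlab code): each stimulus is coded as a bank of sustained signals whose durations grow with the index m, so that together they tile the interval up to the next event.

```python
import numpy as np

def stimulus_representation(onset, n_signals, n_steps):
    """Return an (n_signals, n_steps) array of signals x_lm(t) for one
    stimulus: signal m (0-indexed) stays on for m+1 time steps of 100 ms
    after stimulus onset, so longer signals reach further into the
    interstimulus interval."""
    x = np.zeros((n_signals, n_steps))
    for m in range(n_signals):
        x[m, onset:onset + m + 1] = 1.0  # sustained for m+1 steps
    return x

# Three signals (m = 1, 2, 3) cover a 300-ms interstimulus interval;
# ten would be needed for a 1-s interval.
x = stimulus_representation(onset=0, n_signals=3, n_steps=6)
```

Because each weight w_lm multiplies one of these signals, the critic can learn a prediction whose value changes at every 100-ms step, which is what allows the depression at the precise time of an omitted reward.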
Reward prediction P_l(t) was computed as the weighted sum over the stimulus representations x_lm(t):

P_l(t) = Σ_m w_lm x_lm(t)

Adaptive weights w_lm were initialized with value zero. The reward prediction P(t) was the sum over the reward predictions computed from the stimuli l:

P(t) = Σ_l P_l(t)

The desired prediction signal increased successively from one time step to the next by a factor 1/γ until the reward λ(t) occurred, and decreased to baseline value zero after its presentation (Fig. 2A, line 6). Therefore, the prediction-error signal was

r(t) = λ(t) + γ P(t) − P(t−1)

One time step corresponded to 100 msec. The discount factor γ, reflecting the decreased impact of more distant reinforcement, was estimated for dopamine neurons as γ=0.98 (see Fig. 2B). The learning rule for the weights w_lm was

w_lm(t) = w_lm(t−1) + η_c r(t) x_lm(t−1)

with learning rate η_c=0.1.

Modelling sequential movements

Fig. 1  Model architecture consisting of an actor (left) and a critic (right). The prediction-error signal r(t) serves to modify the synaptic weights w_lm of the critic itself and the synaptic weights v_nl of the actor (heavy dots). Critic: every stimulus l is represented over time as a series of signals x_lm(t) of different durations. Each signal x_lm(t) is multiplied with the corresponding adaptive weight w_lm in order to compute the prediction P_l(t). The temporal differences in these predictions are computed, summed over all stimuli, and added to the reward signal λ(t) in order to compute the prediction-error signal r(t). Actor: the actor learns stimulus-action pairs under the influence of the prediction-error signal of the critic. Every actor neuron n (large circles) represents a specific action. A winner-take-all rule prevents the actor from performing two actions at the same time and can be implemented as lateral inhibition between neurons (Grossberg and Levine 1987)

The actor component of the model (Fig.
1, left) used the reward-prediction error as teaching signal for adapting the weights v_nl (initialized with zero values; n=1, …, 7; l=1, …, 7). Activations a_n(t) were computed from stimuli e_l(t) with

a_n(t) = Σ_l v_nl e_l(t) + σ_n(t)

σ_n(t) was a small random perturbation (normally distributed with mean 0.0 and variance 0.1). Actions were only allowed in response to stimuli. In addition, the model performed only one action at a time, which was implemented with a winner-take-all rule between actor neurons. The neuron n with the largest activation elicited action n. Synapses activated when stimulus l elicited action n were considered to be eligible for modification through conjoint pre-postsynaptic activation. In order to extend the eligibility for synaptic modification beyond the immediate activation, we used an eligibility trace e_nl(t), which was initially set to one and decreased during subsequent time steps of 100 msec with a factor δ:

e_nl(t) = (1−δ) e_nl(t−1)

The factor δ determined the rate of trace decay and was set to 0.4 (δ=0.4), resulting in a relatively short eligibility trace (8% remaining after 500 ms, <1% after 1000 ms). For a particular simulation, we used a smaller factor (δ=0.2), resulting in a much longer eligibility trace (33% remaining after 500 ms, 11% after 1000 ms). Weights v_nl were adapted with learning rate η_a=1 according to

v_nl(t) = v_nl(t−1) + η_a r(t) e_nl(t)

Results

The prediction-error signal of the critic component was phasically increased by unpredicted reward (Fig. 2B). The increase was successively transferred to the earliest
Fig. 2A–D  Internal states and signals of the model. A Signals in the critic during the learning of stimulus-reward associations. Binary signals code for the stimulus (line 1) and the reward (line 2). The stimulus is internally represented as a series of sustained signals with different durations (lines 3–5). The prediction signal (line 6) and the prediction-error signal (bottom) are shown before (solid lines) and after learning (dashed lines). (All signals with baseline zero.) B Prediction-error signal of the critic (left) compared with activity of dopamine neurons (right). Occurrence of unpredicted reward results in increased prediction-error signal and dopamine activity after the reward (line 1). After repeated pairings between stimulus G and reward, both prediction-error signal and dopamine activity are already increased after stimulus G, but are at baseline at the time of reward (line 2). After training with an additional stimulus (F) preceding stimulus G, both prediction-error signal and dopamine activity are increased after stimulus F and unaffected by stimulus G and reward (line 3). The response magnitude of dopamine neurons decreases for increased stimulus-reward intervals (lines 1–3). A decrease of 2% per 100-ms increase in the stimulus-reward interval is estimated (γ=0.98). When the reward predicted by conditioned stimulus G is omitted, both prediction-error signal and dopamine activity are decreased below baseline at the predicted time of reward (line 4) (data from Schultz et al. 1993; Mirenowicz and Schultz 1994). C Learning actions with prediction-error signals. Stimulus F elicits the correct action V, which leaves an eligibility trace (line 1). The prediction-error signal (line 2) is increased by predictive stimulus G. The weight associating stimulus F with action V (line 3) is adapted according to the product (eligibility trace × prediction-error signal).
D Reduced weight changes induced by an unconditional reinforcement signal unrelated to reward prediction. In the same situation as in C, modification of the weight F–V using an unconditional reinforcement signal (line 1) results in a later and smaller increase (bottom)

Fig. 3A–C  Learning curves for a sequence of seven stimulus-action pairs. An additional stimulus-action pair is added to the sequence after every block of 100 trials. A Use of a prediction-error signal results in stable learning with minimal numbers of incorrect trials at all sequence lengths tested. Thus, the pair G–W is learned during the first block of 100 trials (left), the sequence F–V ⇒ G–W is learned during the second block, the sequence E–U ⇒ F–V ⇒ G–W is learned during the third block, etc., until the whole sequence of seven steps is learned during the last block (right). B When trained with an unconditional reinforcement signal, only a sequence length of three stimulus-action associations is learned. C Prolongation of the synaptic eligibility trace ameliorates sequence learning with an unconditional reinforcement signal. However, at the present state of knowledge, longer eligibility traces are biologically less plausible. Percentages are means of 10 repetitions

reward-predicting stimulus at each length of sequence. The signal was decreased when predicted reward was omitted. The reinforcement signal of the critic closely resembled the dopamine response in all these characteristics. Before learning, the actor component of the model randomly reacted upon presentation of stimulus G with any action from Q–W, including the correct action W. The correct pairing G–W was followed by reward (G–W ⇒ reward) and was learned in <100 trials in the first training block (Fig. 3A, left block). Then stimulus F was presented at the beginning of the second training block of 100 trials, and correct action V resulted in appearance of stimulus G in the same trial. Thus, the model learned the sequence F–V ⇒ G–W ⇒ reward (Fig.
3A, 2nd block from left). The full sequence of seven steps was learned in subsequent learning blocks without decrement (A–Q ⇒ … ⇒ G–W ⇒ reward). In order to further assess the efficacy of a dopamine-like prediction-error signal for learning, we used an unconditional, non-adaptive reinforcement signal. This signal increased after every predicted or unpredicted reward, did not code a reward-prediction error, and did not increase with reward-predicting stimuli. This is analogous to reward responses in fully established tasks frequently found in neurophysiological experiments (Niki and Watanabe 1979; Nishijo et al. 1988; Watanabe 1989; Hikosaka et al. 1989; Apicella et al. 1991). Increases of synaptic weights in the actor following correct stimulus-action pairs were considerably lower than with a prediction-error signal (compare weights F–V in Fig. 2D vs. C). This resulted in increasingly impaired learning with longer sequences (Fig. 3B). Already the second learning block was impaired (F–V), the third block showed further impairments (E–U), and sequences of more than three pairs were not learned within blocks of 100 trials. In order to maintain some degree of learning, the synaptic eligibility trace was prolonged by reducing its decay rate to 20% per 100 ms (δ=0.2) instead of the usual 40%. Although this permitted learning of intermediate sequences, it still resulted in impairments with longer sequences (Fig. 3C).
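The critic and actor update rules given in the Methods can be condensed into a single simulation step. The sketch below is a minimal reading of those equations under assumptions of ours; function names, array shapes, and the NumPy framing are our choices, not the authors' Matlab implementation.

```python
import numpy as np

GAMMA = 0.98   # discount factor gamma estimated from dopamine responses
ETA_C = 0.1    # critic learning rate eta_c
ETA_A = 1.0    # actor learning rate eta_a
DELTA = 0.4    # eligibility-trace decay delta per 100-ms time step

rng = np.random.default_rng(0)

def critic_step(w, x_prev, x_now, reward):
    """One critic time step: r(t) = lambda(t) + gamma*P(t) - P(t-1),
    then w_lm(t) = w_lm(t-1) + eta_c * r(t) * x_lm(t-1)."""
    r = reward + GAMMA * float(np.sum(w * x_now)) - float(np.sum(w * x_prev))
    return r, w + ETA_C * r * x_prev

def actor_step(v, e, s, r):
    """One actor time step: winner-take-all action choice over noisy
    activations, eligibility-trace decay, and weight adaptation with
    the critic's prediction error r(t).  s is the binary stimulus vector."""
    a = v @ s + rng.normal(0.0, np.sqrt(0.1), size=v.shape[0])  # sigma_n(t)
    action = int(np.argmax(a))            # only one action at a time
    e = (1.0 - DELTA) * e                 # e_nl(t) = (1 - delta) e_nl(t-1)
    e[action] = np.maximum(e[action], s)  # synapses just used become eligible
    v = v + ETA_A * r * e                 # v_nl(t) = v_nl(t-1) + eta_a r(t) e_nl(t)
    return action, v, e
```

With all-zero weights, an unpredicted reward yields r(t)=1, strengthening both the critic weights whose signals were active at t−1 and any eligible actor synapses; once the prediction is learned, P(t−1) cancels the reward term and the weights stop growing, which is the property that prevents runaway synaptic strength.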
Discussion

This study shows that a reinforcement signal with the essential characteristics of dopamine responses was very efficient for learning movement sequences of considerable length. The coding of prediction error restricted signal increases to unpredictable reinforcement during the learning phase. This prevented runaway synaptic strength (Montague and Sejnowski 1994) without requiring additional algorithms, which were used in a different basal-ganglia model of sequential movements (Dominey et al. 1995). In addition, behavioral errors induced a decreased signal and, thus, reduced the strength of synapses involved in erroneous reactions. Whereas the present model concerned sequences of individual stimulus-action pairs, comparable results were obtained with TD models learning ocular foveation through serial, small-step eye movements (Montague et al. 1993; Friston et al. 1994). The transfer of the signal back to the earliest reward-predicting stimulus helped to bridge the long gap between the stimulus-action pairs and the terminal reward. The predictive nature of the reinforcement signal allowed the model to strengthen synaptic weights with relatively short eligibility traces in the actor. The traces only covered the interval of 300 msec between stimulus-action pairs and subsequent predictive reinforcement, but decayed entirely during a trial of 2.4 s. Such short traces are biologically much more plausible than longer ones. Their physiological substrates may consist in sustained neuronal activity frequently found in the striatum (Schultz et al. 1995) or, possibly, prolonged changes in calcium concentration (Wickens and Kötter 1995) or formation of calmodulin-dependent protein kinase II (Houk et al. 1995). In addition, the model assumed dopamine-dependent long-term changes in synaptic transmission, depending on presynaptic activity, postsynaptic activity, and reinforcement signal.
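The decay arithmetic behind these statements follows directly from the trace update e_nl(t) = (1−δ) e_nl(t−1) at 100-ms steps and can be checked in a few lines (the helper name is ours):

```python
def trace_remaining(delta, ms):
    """Fraction of an eligibility trace remaining ms milliseconds after
    its synapse was tagged, with decay by factor delta per 100-ms step."""
    return (1.0 - delta) ** (ms // 100)

short = trace_remaining(0.4, 500)    # ~8%: still bridges the 300-ms gap
gone = trace_remaining(0.4, 2400)    # effectively zero over a 2.4-s trial
long_ = trace_remaining(0.2, 500)    # ~33%: the prolonged trace of Fig. 3C
```

The short trace (δ=0.4) thus retains enough strength over the 300-ms interval to the next predictive reinforcement yet vanishes within a trial, whereas only the biologically less plausible slow decay (δ=0.2) keeps traces alive long enough for an unconditional terminal-reward signal to reach them.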
This form of plasticity was reported in striatal slice preparations (Calabresi et al. 1992, 1997; Wickens et al. 1996) and could provide a biological basis for such a three-factor learning rule. The unconditional reinforcement signal presented severe disadvantages for learning long sequences. Occurring at the time of terminal reward, this signal was unable to strengthen synapses that were used earlier during the sequence. A similar result was obtained with ocular foveation movements (Friston et al. 1994). In the present model, this deficit was to some extent compensated by increasing the duration of synaptic eligibility traces. However, longer eligibility traces become increasingly hypothetical and may not be good bases for biologically plausible neural models.

Acknowledgements  This study was supported by the James S. McDonnell Foundation (grant 94-39).

References

Apicella P, Ljungberg T, Scarnati E, Schultz W (1991) Responses to reward in monkey dorsal and ventral striatum. Exp Brain Res 85:491–500
Barto AG (1995) Adaptive critics and the basal ganglia. In: Houk JC, Davis JL, Beiser DG (eds) Models of information processing in the basal ganglia. MIT Press, Cambridge, Mass., pp 215–232
Barto AG, Sutton RS, Anderson CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans Syst Man Cybern SMC-13:834–846
Benecke R, Rothwell JC, Dick JPR, Day BL, Marsden CD (1987) Disturbance of sequential movements in Parkinson's disease. Brain 110:361–379
Calabresi P, Pisani A, Mercuri NB, Bernardi G (1992) Long-term potentiation in the striatum is unmasked by removing the voltage-dependent magnesium block of NMDA receptor channels. Eur J Neurosci 4:929–935
Calabresi P, Saiardi A, Pisani A, Baik JH, Centonze D, Mercuri NB, Bernardi G, Borrelli E (1997) Abnormal synaptic plasticity in the striatum of mice lacking dopamine D2 receptors. J Neurosci 17:4536–4544
Dickinson A (1980) Contemporary animal learning theory. Cambridge University Press, Cambridge
Dominey P, Arbib M, Joseph JP (1995) A model of corticostriatal plasticity for learning oculomotor associations and sequences. J Cogn Neurosci 7:311–336
Friston KJ, Tononi G, Reeke GN Jr, Sporns O, Edelman GM (1994) Value-dependent selection in the brain: simulation in a synthetic neural model. Neuroscience 59:229–243
Grossberg S, Levine DS (1987) Neural dynamics of attentionally modulated Pavlovian conditioning: conditioned reinforcement, inhibition and opponent processing. Psychobiology 15:195–240
Hikosaka O, Sakamoto M, Usui S (1989) Functional properties of monkey caudate neurons. III. Activities related to expectation of target and reward. J Neurophysiol 61:814–832
Houk JC, Adams JL, Barto AG (1995) A model of how the basal ganglia generate and use neural signals that predict reinforcement. In: Houk JC, Davis JL, Beiser DG (eds) Models of information processing in the basal ganglia. MIT Press, Cambridge, Mass., pp 249–270
Kermadi I, Joseph JP (1995) Activity in the caudate nucleus during spatial sequencing. J Neurophysiol 74:911–933
Mirenowicz J, Schultz W (1994) Importance of unpredictability for reward responses in primate dopamine neurons. J Neurophysiol 72:1024–1027
Miyachi S, Hikosaka O, Miyashita K, Karadi Z, Rand MK (1997) Differential roles of monkey striatum in learning of sequential hand movement. Exp Brain Res 115:1–5
Montague PR, Sejnowski TJ (1994) The predictive brain: temporal coincidence and temporal order in synaptic learning mechanisms. Learn Mem 1:1–33
Montague PR, Dayan P, Nowlan SJ, Pouget A, Sejnowski TJ (1993) Using aperiodic reinforcement for directed self-organization during development. In: Hanson SJ, Cowan JD, Giles CL (eds) Neural information processing systems 5. Morgan Kaufmann, San Mateo, pp 969–976
Montague PR, Dayan P, Sejnowski TJ (1996) A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci 16:1936–1947
Mushiake H, Strick PL (1995) Pallidal neuron activity during sequential arm movements. J Neurophysiol 74:2754–2758
Niki H, Watanabe M (1979) Prefrontal and cingulate unit activity during timing behavior in the monkey. Brain Res 171:213–224
Nishijo H, Ono T, Nishino H (1988) Single neuron responses in amygdala of alert monkey during complex sensory stimulation with affective significance. J Neurosci 8:3570–3583
Phillips JG, Bradshaw JL, Iansek R, Chiu E (1993) Motor functions of the basal ganglia. Psychol Res 55:175–181
Rescorla RA, Wagner AR (1972) A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and non-reinforcement. In: Black AH, Prokasy WF (eds) Classical conditioning II: current research and theory. Appleton-Century-Crofts, New York, pp 64–99
Schultz W, Apicella P, Ljungberg T (1993) Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. J Neurosci 13:900–913
Schultz W, Apicella P, Romo R, Scarnati E (1995) Context-dependent activity in primate striatum reflecting past and future behavioral events. In: Houk JC, Davis JL, Beiser DG (eds) Models of information processing in the basal ganglia. MIT Press, Cambridge, Mass., pp 11–28
Schultz W, Dayan P, Montague PR (1997) A neural substrate of prediction and reward. Science 275:1593–1599
Sutton RS, Barto AG (1990) Time-derivative models of Pavlovian reinforcement. In: Gabriel M, Moore J (eds) Learning and computational neuroscience: foundations of adaptive networks. MIT Press, Cambridge, Mass., pp 539–602
Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press/Bradford Books, Cambridge, Mass.
Watanabe M (1989) The appropriateness of behavioral responses coded in post-trial activity of primate prefrontal units. Neurosci Lett 101:113–117
Wickens J, Kötter R (1995) Cellular models of reinforcement. In: Houk JC, Davis JL, Beiser DG (eds) Models of information processing in the basal ganglia. MIT Press, Cambridge, Mass., pp 187–214
Wickens JR, Begg AJ, Arbuthnott GW (1996) Dopamine reverses the depression of rat corticostriatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neuroscience 70:1–5