Learning of sequential movements by neural network model with dopamine-like reinforcement signal


Exp Brain Res (1998) 121:350–354
Springer-Verlag 1998

RESEARCH NOTE

Roland E. Suri · Wolfram Schultz

Learning of sequential movements by neural network model with dopamine-like reinforcement signal

Received: 24 November 1997 / Accepted: 30 April 1998

Abstract  Dopamine neurons appear to code an error in the prediction of reward. They are activated by unpredicted rewards, are not influenced by predicted rewards, and are depressed when a predicted reward is omitted. After conditioning, they respond to reward-predicting stimuli in a similar manner. With these characteristics, the dopamine response strongly resembles the predictive reinforcement teaching signal of neural network models implementing the temporal difference learning algorithm. This study explored a neural network model that used a reward-prediction error signal strongly resembling dopamine responses for learning movement sequences. A different stimulus was presented in each step of the sequence and required a different movement reaction, and reward occurred at the end of the correctly performed sequence. The dopamine-like predictive reinforcement signal efficiently allowed the model to learn long sequences. By contrast, learning with an unconditional reinforcement signal required synaptic eligibility traces of longer and biologically less plausible durations for obtaining satisfactory performance. Thus, dopamine-like neuronal signals constitute excellent teaching signals for learning sequential behavior.

Key words  Basal ganglia · Teaching signal · Temporal difference · Synaptic plasticity · Eligibility

Matlab programs of this model are available at ftp://ftp.usc.edu/pub/bsl/suri/suri_schultz

R.E. Suri · W. Schultz
Institute of Physiology, University of Fribourg, CH-1700 Fribourg, Switzerland
e-mail: Wolfram.Schultz@unifr.ch, Tel.: +41-26-300 8611, Fax: +41-26-300 9675

R.E. Suri
USC Brain Project, University of Southern California, Hedco Neurosciences Building, 3614 Watt Way, Los Angeles, CA 90089-2520, USA

Introduction

A large body of evidence suggests that learning is crucially dependent on the degree of unpredictability of reinforcers (Rescorla and Wagner 1972; Dickinson 1980). Only reinforcers occurring at least to some degree unpredictably will sustain learning, and learning curves reach asymptotes when all reinforcers are fully predicted. The discrepancy between the occurrence of reinforcement and its prediction is referred to as an error in the prediction of reinforcement. Error-driven learning is employed in a large variety of neural models and has been particularly elaborated in the temporal-difference (TD) algorithm, which computes the prediction error continuously in real time and establishes reinforcer predictions (Sutton and Barto 1990). The TD algorithm can be implemented in an explicit critic-actor architecture (Barto et al. 1983; Sutton and Barto 1998). The prediction error is computed and emitted as a teaching signal by the critic in order to adapt synaptic weights in the actor, which directs behavioral output (Montague et al. 1993; Friston et al. 1994). This architecture resembles the anatomical structure of the basal ganglia: the critic taking the place of the nigrostriatal dopamine neurons, and the actor corresponding to the striatum (Barto 1995; Houk et al. 1995). The prediction-error signal of the critic is strikingly similar to the activities of midbrain dopamine neurons (Montague et al. 1996; Schultz et al. 1997).
Both signals are increased by unpredicted rewards, are not influenced by fully predicted rewards, and are decreased by omission of predicted rewards. They are transferred to the earliest reward-predicting events through experience and, thus, predict rewards before they occur rather than reporting them only after the behavior. Neurophysiological and inactivation studies suggest an important involvement of the basal ganglia in movement sequences (Kermadi and Joseph 1995; Mushiake and Strick 1995; Miyachi et al. 1997). Disorders of dopamine neurotransmission in the striatum impair serially ordered movements in human patients (Benecke et al. 1987; Phillips et al. 1993).

As reinforcement-learning tasks may constitute sequential Markov decision processes (Sutton and Barto 1998), we aimed to investigate whether TD prediction-error signals with characteristics similar to those of dopamine responses could be used for learning sequential movements.

Methods and algorithms

The task consisted of a defined sequence of seven specific stimulus-action pairs. Presentation of one stimulus (A, B, C, D, E, F, or G) elicited one of seven actions (Q, R, S, T, U, V, or W). The stimulus-action pairs followed each other at intervals of 300 ms. Reward was delivered at the end of the sequence when all individual actions had been correctly chosen (A→Q, B→R, C→S, D→T, E→U, F→V, G→W, reward). The sequence was learned backwards by associating each stimulus with a particular action by trial and error, in a total of seven blocks of 100 trials. In the first block, stimulus G required action W in order to lead to reward (G→W, reward). Sequence length increased by one stimulus-action pair in each training block of 100 trials. Correct action in each step resulted in the appearance of the previously learned stimulus of the subsequent step, which predicted the terminal reward and thus constituted a conditioned reinforcer (e.g. F→V, G→W, reward). Any incorrect action terminated the trial. Learning of the sequence was simulated 10 times, and average learning curves were computed.

Modelling the dopamine prediction error signal

In the critic component of the model (Fig. 1, right), a stimulus l was represented as a series of signals x_{lm}(t) of varying durations (Fig. 2A, lines 3–5) in order to reproduce timing mechanisms involved in the depression of dopamine activity at the time of omitted reward. A number of such signals sufficient to cover the duration of the interstimulus intervals was chosen (m = 1, 2, 3 for the 300-ms interstimulus interval; m = 1, 2, ..., 10 for a 1-s interstimulus interval). The reward prediction P_l(t) was computed as the weighted sum over the stimulus representations x_{lm}(t):

P_l(t) = \sum_m w_{lm} x_{lm}(t)

The adaptive weights w_{lm} were initialized with value zero. The reward prediction P(t) was the sum over the reward predictions computed from the stimuli l:

P(t) = \sum_l P_l(t)

The desired prediction signal increased successively from one time step to the next by a factor 1/γ until the reward λ(t) occurred, and decreased to the baseline value zero after its presentation (Fig. 2A, line 6). Therefore, the prediction-error signal was

r(t) = \lambda(t) + \gamma P(t) - P(t-1)

One time step corresponded to 100 ms. The discount factor γ, reflecting the decreased impact of more distant reinforcement, was estimated for dopamine neurons as γ = 0.98 (see Fig. 2B). The learning rule for the weights w_{lm} was

w_{lm}(t) = w_{lm}(t-1) + \eta_c r(t) x_{lm}(t-1)

with learning rate η_c = 0.1.
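The critic just described can be summarized in a few lines of code. The following Python sketch is an independent illustration of the equations above, not the authors' Matlab implementation (which is available at the FTP address given in the header); the class name Critic and arguments such as n_stimuli and n_components are our own illustrative choices.

```python
import numpy as np

GAMMA = 0.98   # discount factor gamma, estimated from dopamine responses
ETA_C = 0.1    # critic learning rate eta_c

class Critic:
    """Reward prediction P(t) and TD prediction-error signal r(t)."""

    def __init__(self, n_stimuli, n_components):
        # w[l, m]: adaptive weight on component m of stimulus l, initialized to zero
        self.w = np.zeros((n_stimuli, n_components))
        self.x_prev = np.zeros((n_stimuli, n_components))
        self.P_prev = 0.0

    def step(self, x, reward):
        """Advance one 100-ms time step.

        x[l, m] -- temporal stimulus representation x_lm(t)
        reward  -- primary reward lambda(t)
        Returns the prediction error r(t).
        """
        P = float(np.sum(self.w * x))          # P(t) = sum_l sum_m w_lm x_lm(t)
        r = reward + GAMMA * P - self.P_prev   # r(t) = lambda(t) + gamma P(t) - P(t-1)
        self.w += ETA_C * r * self.x_prev      # w_lm(t) = w_lm(t-1) + eta_c r(t) x_lm(t-1)
        self.x_prev, self.P_prev = x.copy(), P
        return r
```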
Fig. 1  Model architecture consisting of an actor (left) and a critic (right). The prediction-error signal r(t) serves to modify the synaptic weights w_{lm} of the critic itself and the synaptic weights v_{nl} of the actor (heavy dots). Critic: every stimulus l is represented over time as a series of signals x_{lm}(t) of different durations. Each signal x_{lm}(t) is multiplied with the corresponding adaptive weight w_{lm} in order to compute the prediction P_l(t). The temporal differences in these predictions are computed, summed over all stimuli, and added to the reward signal λ(t) in order to compute the prediction-error signal r(t). Actor: the actor learns stimulus-action pairs under the influence of the prediction-error signal of the critic. Every actor neuron n (large circles) represents a specific action. A winner-take-all rule prevents the actor from performing two actions at the same time and can be implemented as lateral inhibition between neurons (Grossberg and Levine 1987).

Modelling sequential movements

The actor component of the model (Fig. 1, left) used the reward-prediction error as a teaching signal for adapting the weights v_{nl} (initialized with zero values; n = 1, ..., 7; l = 1, ..., 7). Activations a_n(t) were computed from stimuli e_l(t) with

a_n(t) = \sum_l v_{nl} e_l(t) + s_n(t)

where s_n(t) was a small random perturbation (normally distributed with mean 0.0 and variance 0.1). Actions were only allowed in response to stimuli. In addition, the model performed only one action at a time, which was implemented with a winner-take-all rule between actor neurons: the neuron n with the largest activation elicited action n. Synapses activated when stimulus l elicited action n were considered to be eligible for modification through conjoint pre- and postsynaptic activation. In order to extend the eligibility for synaptic modification beyond the immediate activation, we used an eligibility trace e_{nl}(t), which was initially set to one and decreased during subsequent time steps of 100 ms by a factor d:

e_{nl}(t) = (1 - d) e_{nl}(t-1)

The factor d determined the rate of trace decay and was set to d = 0.4, resulting in a relatively short eligibility trace (8% remaining after 500 ms, <1% after 1000 ms). For a particular simulation, we used a smaller factor (d = 0.2), resulting in a much longer eligibility trace (33% remaining after 500 ms, 11% after 1000 ms). Weights v_{nl} were adapted with learning rate η_a = 1 according to

v_{nl}(t) = v_{nl}(t-1) + \eta_a r(t) e_{nl}(t)
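Analogously, a minimal Python sketch of the actor (again an illustration of the equations above rather than the authors' Matlab code; names such as Actor, act, and decay are our own) could look like this:

```python
import numpy as np

ETA_A = 1.0    # actor learning rate eta_a
DELTA = 0.4    # eligibility-trace decay factor d per 100-ms step (0.2 in the long-trace variant)

class Actor:
    """Winner-take-all action selection with synaptic eligibility traces."""

    def __init__(self, n_actions, n_stimuli, rng=None):
        self.v = np.zeros((n_actions, n_stimuli))   # weights v_nl, initialized to zero
        self.e = np.zeros((n_actions, n_stimuli))   # eligibility traces e_nl
        self.rng = rng if rng is not None else np.random.default_rng()

    def decay(self):
        # e_nl(t) = (1 - d) e_nl(t-1), applied at every 100-ms time step
        self.e *= 1.0 - DELTA

    def act(self, stimulus):
        """stimulus is a binary vector e_l(t); returns the index of the chosen action."""
        noise = self.rng.normal(0.0, np.sqrt(0.1), size=self.v.shape[0])  # s_n(t), variance 0.1
        a = self.v @ stimulus + noise          # a_n(t) = sum_l v_nl e_l(t) + s_n(t)
        action = int(np.argmax(a))             # winner-take-all: only one action at a time
        self.e[action, stimulus > 0] = 1.0     # activated synapses become eligible (trace set to 1)
        return action

    def learn(self, r):
        # v_nl(t) = v_nl(t-1) + eta_a r(t) e_nl(t)
        self.v += ETA_A * r * self.e
```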

Fig. 2A–D  Internal states and signals of the model. A Signals in the critic during the learning of stimulus-reward associations. Binary signals code for the stimulus (line 1) and the reward (line 2). The stimulus is internally represented as a series of sustained signals with different durations (lines 3–5). The prediction signal (line 6) and the prediction-error signal (bottom) are shown before (solid lines) and after learning (dashed lines). (All signals have baseline zero.) B Prediction-error signal of the critic (left) compared with activity of dopamine neurons (right). Occurrence of unpredicted reward results in increased prediction-error signal and dopamine activity after the reward (line 1). After repeated pairings between stimulus G and reward, both prediction-error signal and dopamine activity are already increased after stimulus G, but are at baseline at the time of reward (line 2). After training with an additional stimulus (F) preceding stimulus G, both prediction-error signal and dopamine activity are increased after stimulus F and unaffected by stimulus G and reward (line 3). The response magnitude of dopamine neurons decreases for increased stimulus-reward intervals (lines 1–3); a decrease of 2% per 100-ms increase in the stimulus-reward interval is estimated (γ = 0.98). When the reward predicted by conditioned stimulus G is omitted, both prediction-error signal and dopamine activity are decreased below baseline at the predicted time of reward (line 4) (data from Schultz et al. 1993; Mirenowicz and Schultz 1994). C Learning actions with prediction-error signals. Stimulus F elicits the correct action V, which leaves an eligibility trace (line 1). The prediction-error signal (line 2) is increased by predictive stimulus G. The weight associating stimulus F with action V (line 3) is adapted according to the product (eligibility trace × prediction-error signal). D Reduced weight changes induced by an unconditional reinforcement signal unrelated to reward prediction. In the same situation as in C, modification of the weight F→V using an unconditional reinforcement signal (line 1) results in a later and smaller increase (bottom).

Fig. 3A–C  Learning curves for a sequence of seven stimulus-action pairs. An additional stimulus-action pair is added to the sequence after every block of 100 trials. A Use of a prediction-error signal results in stable learning with minimal numbers of incorrect trials at all sequence lengths tested. Thus, the pair G→W is learned during the first block of 100 trials (left), the sequence F→V, G→W is learned during the second block, the sequence E→U, F→V, G→W is learned during the third block, etc., until the whole sequence of seven steps is learned during the last block (right). B When trained with an unconditional reinforcement signal, only a sequence length of three stimulus-action associations is learned. C Prolongation of the synaptic eligibility trace ameliorates sequence learning with an unconditional reinforcement signal. However, at the present state of knowledge, longer eligibility traces are biologically less plausible. Percentages are means of 10 repetitions.

Results

The prediction-error signal of the critic component was phasically increased by unpredicted reward (Fig. 2B). The increase was successively transferred to the earliest reward-predicting stimulus at each length of sequence. The signal was decreased when predicted reward was omitted. The reinforcement signal of the critic closely resembled the dopamine response in all these characteristics. Before learning, the actor component of the model randomly reacted upon presentation of stimulus G with any action from Q–W, including the correct action W.
The correct pairing G–W was followed by reward (G→W, reward) and was learned in <100 trials in the first training block (Fig. 3A, left block). Then stimulus F was presented at the beginning of the second training block of 100 trials, and correct action V resulted in the appearance of stimulus G in the same trial. Thus, the model learned the sequence F→V, G→W, reward (Fig. 3A, 2nd block from left). The full sequence of seven steps was learned in subsequent learning blocks without decrement (A→Q, ..., G→W, reward).

In order to further assess the efficacy of a dopamine-like prediction-error signal for learning, we used an unconditional, non-adaptive reinforcement signal. This signal increased after every predicted or unpredicted reward, did not code a reward-prediction error, and did not increase with reward-predicting stimuli. This is analogous to reward responses in fully established tasks frequently found in neurophysiological experiments (Niki and Watanabe 1979; Nishijo et al. 1988; Watanabe 1989; Hikosaka et al. 1989; Apicella et al. 1991). Increases of synaptic weights in the actor following correct stimulus-action pairs were considerably lower than with a prediction-error signal (compare weights F→V in Fig. 2D vs. C). This resulted in increasingly impaired learning with longer sequences (Fig. 3B). Already the second learning block was impaired (F→V), the third block showed further impairments (E→U), and sequences of more than three pairs were not learned within blocks of 100 trials. In order to maintain some degree of learning, the synaptic eligibility trace was prolonged by reducing its decay rate to 20% per 100 ms (d = 0.2), instead of the usual 40%. Although this permitted learning of intermediate sequences, it still resulted in impairments with longer sequences (Fig. 3C).
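The trace-decay percentages and the backward-chaining block schedule quoted above can be checked with a short, self-contained Python snippet; it is purely illustrative and independent of the model code.

```python
# Fraction of an eligibility trace remaining after a given time, for the two decay factors used
def trace_remaining(d, ms, step_ms=100):
    return (1.0 - d) ** (ms // step_ms)

for d in (0.4, 0.2):
    print(f"d={d}: {trace_remaining(d, 500):.0%} after 500 ms, "
          f"{trace_remaining(d, 1000):.1%} after 1000 ms")
# d=0.4: 8% after 500 ms, 0.6% after 1000 ms   (short, more plausible trace)
# d=0.2: 33% after 500 ms, 10.7% after 1000 ms (prolonged trace)

# Backward chaining: block k (k = 1..7) of 100 trials trains the last k stimulus-action pairs
stimuli, actions = "ABCDEFG", "QRSTUVW"
for block in range(1, 8):
    pairs = ", ".join(f"{s}->{a}" for s, a in zip(stimuli[-block:], actions[-block:]))
    print(f"block {block}: {pairs} -> reward")
```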

Discussion

This study shows that a reinforcement signal with the essential characteristics of dopamine responses was very efficient for learning movement sequences of considerable length. The coding of prediction error restricted signal increases to unpredictable reinforcement during the learning phase. This prevented run-away synaptic strength (Montague and Sejnowski 1994) without requiring additional algorithms, which were used in a different basal-ganglia model of sequential movements (Dominey et al. 1995). In addition, behavioral errors induced a decreased signal and, thus, reduced the strength of synapses involved in erroneous reactions. Whereas the present model concerned sequences of individual stimulus-action pairs, comparable results were obtained with TD models learning ocular foveation through serial, small-step eye movements (Montague et al. 1993; Friston et al. 1994).

The transfer of the signal back to the earliest reward-predicting stimulus helped to bridge the long gap between the stimulus-action pairs and the terminal reward. The predictive nature of the reinforcement signal allowed the model to strengthen synaptic weights with relatively short eligibility traces in the actor. The traces only covered the interval of 300 ms between stimulus-action pairs and subsequent predictive reinforcement, but decayed entirely during a trial of 2.4 s. Such short traces are biologically much more plausible than longer ones. Their physiological substrates may consist in sustained neuronal activity frequently found in the striatum (Schultz et al. 1995) or, possibly, prolonged changes in calcium concentration (Wickens and Kötter 1995) or formation of calmodulin-dependent protein kinase II (Houk et al. 1995). In addition, the model assumed dopamine-dependent long-term changes in synaptic transmission, depending on presynaptic activity, postsynaptic activity, and the reinforcement signal. This form of plasticity was reported in striatal slice preparations (Calabresi et al. 1992, 1997; Wickens et al. 1996) and could provide a biological basis for such a three-factor learning rule.

The unconditional reinforcement signal presented severe disadvantages for learning long sequences. Occurring at the time of terminal reward, this signal was unable to strengthen synapses that were used earlier during the sequence. A similar result was obtained with ocular foveation movements (Friston et al. 1994). In the present model, this deficit was to some extent compensated by increasing the duration of synaptic eligibility traces. However, longer eligibility traces become increasingly hypothetical and may not be good bases for biologically plausible neural models.

Acknowledgements  This study was supported by the James S. McDonnell Foundation (grant 94-39).

References

Apicella P, Ljungberg T, Scarnati E, Schultz W (1991) Responses to reward in monkey dorsal and ventral striatum. Exp Brain Res 85:491–500
Barto AG (1995) Adaptive critics and the basal ganglia. In: Houk JC, Davis JL, Beiser DG (eds) Models of information processing in the basal ganglia. MIT Press, Cambridge, Mass., pp 215–232
Barto AG, Sutton RS, Anderson CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans Systems Man Cyber SMC-13:834–846
Benecke R, Rothwell JC, Dick JPR, Day BL, Marsden CD (1987) Disturbance of sequential movements in Parkinson's disease. Brain 110:361–379
Calabresi P, Pisani A, Mercuri NB, Bernardi G (1992) Long-term potentiation in the striatum is unmasked by removing the voltage-dependent magnesium block of NMDA receptor channels. Eur J Neurosci 4:929–935
Calabresi P, Saiardi A, Pisani A, Baik JH, Centonze D, Mercuri NB, Bernardi G, Borrelli E (1997) Abnormal synaptic plasticity in the striatum of mice lacking dopamine D2 receptors. J Neurosci 17:4536–4544
Dickinson A (1980) Contemporary animal learning theory. Cambridge University Press, Cambridge, Mass.
Dominey P, Arbib M, Joseph JP (1995) A model of corticostriatal plasticity for learning oculomotor associations and sequences. J Cogn Neurosci 7:311–336
Friston KJ, Tononi G, Reeke GN Jr, Sporns O, Edelman GM (1994) Value-dependent selection in the brain: simulation in a synthetic neural model. Neuroscience 59:229–243
Grossberg S, Levine DS (1987) Neural dynamics of attentionally modulated Pavlovian conditioning: conditioned reinforcement, inhibition and opponent processing. Psychobiology 15:195–240
Hikosaka O, Sakamoto M, Usui S (1989) Functional properties of monkey caudate neurons. III. Activities related to expectation of target and reward. J Neurophysiol 61:814–832
Houk JC, Adams JL, Barto AG (1995) A model of how the basal ganglia generate and use neural signals that predict reinforcement. In: Houk JC, Davis JL, Beiser DG (eds) Models of information processing in the basal ganglia. MIT Press, Cambridge, Mass., pp 249–270
Kermadi I, Joseph JP (1995) Activity in the caudate nucleus during spatial sequencing. J Neurophysiol 74:911–933
Mirenowicz J, Schultz W (1994) Importance of unpredictability for reward responses in primate dopamine neurons. J Neurophysiol 72:1024–1027
Miyachi S, Hikosaka O, Miyashita K, Karadi Z, Rand MK (1997) Differential roles of monkey striatum in learning of sequential hand movement. Exp Brain Res 115:1–5
Montague PR, Sejnowski TJ (1994) The predictive brain: temporal coincidence and temporal order in synaptic learning mechanisms. Learn Mem 1:1–33
Montague PR, Dayan P, Nowlan SJ, Pouget A, Sejnowski TJ (1993) Using aperiodic reinforcement for directed self-organization during development. In: Hanson SJ, Cowan JD, Giles CL (eds) Neural information processing systems 5. Morgan Kaufmann, San Mateo, pp 969–976
Montague PR, Dayan P, Sejnowski TJ (1996) A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci 16:1936–1947
Mushiake H, Strick PL (1995) Pallidal neuron activity during sequential arm movements. J Neurophysiol 74:2754–2758
Niki H, Watanabe M (1979) Prefrontal and cingulate unit activity during timing behavior in the monkey. Brain Res 171:213–224
Nishijo H, Ono T, Nishino H (1988) Single neuron responses in amygdala of alert monkey during complex sensory stimulation with affective significance. J Neurosci 8:3570–3583
Phillips JG, Bradshaw JL, Iansek R, Chiu E (1993) Motor functions of the basal ganglia. Psychol Res 55:175–181

Rescorla RA, Wagner AR (1972) A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and non-reinforcement. In: Black AH, Prokasy WF (eds) Classical conditioning II: current research and theory. Appleton-Century-Crofts, New York, pp 64–99
Schultz W, Apicella P, Ljungberg T (1993) Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. J Neurosci 13:900–913
Schultz W, Apicella P, Romo R, Scarnati E (1995) Context-dependent activity in primate striatum reflecting past and future behavioral events. In: Houk JC, Davis JL, Beiser DG (eds) Models of information processing in the basal ganglia. MIT Press, Cambridge, Mass., pp 11–28
Schultz W, Dayan P, Montague PR (1997) A neural substrate of prediction and reward. Science 275:1593–1599
Sutton RS, Barto AG (1990) Time-derivative models of Pavlovian reinforcement. In: Gabriel M, Moore J (eds) Learning and computational neuroscience: foundations of adaptive networks. MIT Press, Cambridge, Mass., pp 539–602
Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press/Bradford Books, Cambridge, Mass.
Watanabe M (1989) The appropriateness of behavioral responses coded in post-trial activity of primate prefrontal units. Neurosci Lett 101:113–117
Wickens J, Kötter R (1995) Cellular models of reinforcement. In: Houk JC, Davis JL, Beiser DG (eds) Models of information processing in the basal ganglia. MIT Press, Cambridge, Mass., pp 187–214
Wickens JR, Begg AJ, Arbuthnott GW (1996) Dopamine reverses the depression of rat corticostriatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neuroscience 70:1–5