Cerebral Cortex: Principles of Operation


Cerebral Cortex: Principles of Operation

Edmund T. Rolls
Oxford Centre for Computational Neuroscience, Oxford, UK

Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries.

© Edmund Rolls 2016

The moral rights of the author have been asserted. First Edition published in 2016. Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above. You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America.

British Library Cataloguing in Publication Data: Data available
Library of Congress Control Number:
ISBN

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY

Oxford University Press makes no representation, express or implied, that the drug dosages in this book are correct. Readers must therefore always check the product information and clinical procedures with the most up-to-date published product information and data sheets provided by the manufacturers and the most recent codes of conduct and safety regulations. The authors and the publishers do not accept responsibility or legal liability for any errors in the text or for the misuse or misapplication of material in this work. Except where otherwise stated, drug dosages and recommendations are for the non-pregnant adult who is not breast-feeding.

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

Preface

The overall aim of this book is to provide insight into the principles of operation of the cerebral cortex. These are key to understanding how we, as humans, function. There have been few previous attempts to set out some of the important principles of operation of the cortex, and this book is pioneering. I have asked some of the leading investigators in neuroscience about their views on this, and most have not had many well-formulated answers or hypotheses. As clear hypotheses are needed in this most important area of 21st-century science, how our brains work, I have formulated a set of hypotheses to guide thinking and future research. I present evidence for many of the hypotheses, but at the same time we must all recognise that hypotheses and theory in science are there to be tested, and hopefully refined rather than rejected. Nevertheless, such theories and hypotheses are essential to progress, and it is in this frame of reference that I present the theories, hypotheses, and ideas that I have produced and collected together.

This book focusses on the principles of operation of the cerebral cortex because at this time it is possible to propose and describe many principles, and many are likely to stand the test of time and provide, I believe, a foundation for further developments, even if some need to be changed. In this context, I have not attempted to produce an overall theory of operation of the cerebral cortex, because at this stage of our understanding such a theory would be incorrect or incomplete. I believe though that many of the principles will be important, and that many will provide the foundations for more complete theories of the operation of the cerebral cortex. Given that many different principles of operation of the cortex are proposed in this book, often with several principles in each Chapter, the reader may find it convenient to take one Chapter at a time, and think about the issues raised in each Chapter, as the overall enterprise is large. The Highlights sections provided at the end of each Chapter may be useful in helping the reader to appreciate the different principles being considered in each Chapter.

To understand how the cortex works, including how it functions in perception, memory, attention, decision-making, and cognitive functions, it is necessary to combine different approaches, including neural computation. Neurophysiology at the single neuron level is needed because this is the level at which information is exchanged between the computing elements of the brain. Evidence from the effects of brain damage, including that available from neuropsychology, is needed to help understand what different parts of the system do, and indeed what each part is necessary for. Neuroimaging is useful to indicate where in the human brain different processes take place, and to show which functions can be dissociated from each other. Knowledge of the biophysical and synaptic properties of neurons is essential to understand how the computing elements of the brain work, and therefore what the building blocks of biologically realistic computational models should be. Knowledge of the anatomical and functional architecture of the cortex is needed to show what types of neuronal network actually perform the computation. And finally, the approach of neural computation is needed, as this is required to link together all the empirical evidence to produce an understanding of how the system actually works.
This book utilizes evidence from all these disciplines to develop an understanding of how different types of memory, perception, attention, and decision making are implemented by processing in the cerebral cortex.

I emphasize that to understand how memory, perception, attention, decision-making, cognitive functions, and actions are produced in the cortex, we are dealing with large-scale computational systems with interactions between the parts, and that this understanding requires analysis at the computational and global level of the operation of many neurons performing a useful function together. Understanding at the molecular level is important for helping to understand how these large-scale computational processes are implemented in the brain, but will not by itself give any account of what computations are performed to implement these cognitive functions. Instead, understanding cognitive functions such as object recognition, memory recall, attention, and decision-making requires single neuron data to be closely linked to computational models of how the interactions between large numbers of neurons and many networks of neurons allow these cognitive problems to be solved. The single neuron level is important in this approach, for the single neurons can be thought of as the computational units of the system, and this is the level at which information is exchanged, by spiking activity, between the computational elements of the brain. The single neuron level is therefore the fundamental level of information processing, and the level at which the information can be read out (by recording the spiking activity) in order to understand what information is being represented and processed in each brain area.

With its focus on how the brain, and especially the cortex, works at the computational neuroscience level, this book is distinct from the many excellent books on neuroscience that describe much evidence about brain structure and function, but do not aim to provide an understanding of how the brain works at the computational level. This book aims to forge an understanding of how some key brain systems may operate at the computational level, so that we can understand how the cortex actually performs some of its complex and necessarily computational functions in memory, perception, attention, decision-making, cognitive functions, and actions. A test of whether one's understanding is correct is to simulate the processing on a computer, and to show whether the simulation can perform the tasks of cortical systems, and whether the simulation has similar properties to the real cortex. The approach of neural computation leads to a precise definition of how the computation is performed, and to precise and quantitative tests of the theories produced. How memory systems in the cortex work is a paradigm example of this approach, because memory-like operations, which involve altered functionality as a result of synaptic modification, are at the heart of how many computations in the cortex are performed. It happens that attention and decision-making can be understood in terms of fundamental operations in, and interactions between, memory systems in the cortex, and therefore it is natural to treat these areas of cognitive neuroscience in this book. The same fundamental concepts based on the operation of neuronal circuitry can be applied to all these functions, as is shown in this book.
One of the distinctive properties of this book is that it firmly links the neural computation approach not only to neuronal neurophysiology, which provides much of the primary data about how the cortex operates, but also to psychophysical studies (for example of attention); to neuropsychological studies of patients with brain damage; and to functional magnetic resonance imaging (fMRI) (and other neuroimaging) approaches. The empirical evidence that is brought to bear is largely from non-human primates and from humans, because of the considerable similarity of their cortical systems.

In this book, I have not attempted to produce a single computational theory of how the cortex operates. Instead, I have highlighted many different principles of cortical function, most of which are likely to be building blocks of how our cortex operates. The reason for this approach is that many of the principles may well be correct, and useful in understanding how the cortex operates, but some might turn out not to be useful or correct.

The aim of this book is therefore to propose some of the fundamental principles of operation of the cerebral cortex, many or most of which will provide a foundation for understanding the operation of the cortex, rather than to produce a single theory of operation of the cortex, which might be disproved if any one of its elements was found to be weak.

The overall aims of the book are developed further, and the plan of the book is described, in Chapter 1, Section 1.1. Some of the main Principles of Operation of the Cerebral Cortex that I describe can be found in the titles of Chapters 2-22; but in practice, most Chapters include several Principles of Operation, which will appear in the Highlights to each Chapter. Section 26.5 may be useful in addition to the Highlights, for Section 26.5 draws together in a synthesis some of the Principles of Operation of the Cerebral Cortex that are described in the book. Further evidence on how these principles are relevant to the operation of different cortical areas and systems, and operate together, is provided in Chapters 24 and 25. In these Chapters, the operation of two major cortical systems, those involved in memory and in visual object recognition, is considered to illustrate how the principles are combined to implement two different key cortical functions.

The Appendices set out some of the more formal and quantitative properties of the operation of neuronal systems; they provide a route to a deeper understanding of the principles, and enable the presentation in earlier Chapters to be at a readily approachable level. The Appendices describe many of the building blocks of the neurocomputational approach, and are designed to be useful for teaching. Appendix D describes Matlab software that has been made available with this book to provide simple demonstrations of the operation of some key neuronal networks related to cortical function. The programs are available online.

Part of the material described in the book reflects work performed in collaboration with many colleagues, whose tremendous contributions are warmly appreciated. The contributions of many will be evident from the references cited in the text. Especial appreciation is due to Gustavo Deco, Simon M. Stringer, and Alessandro Treves, who have contributed greatly in an always interesting and fruitful research collaboration on computational aspects of brain function, and to many neurophysiology and functional neuroimaging colleagues who have contributed to the empirical discoveries that provide the foundation to which the computational neuroscience must always be closely linked, and whose names are cited throughout the text. Much of the work described would not have been possible without financial support from a number of sources, particularly the Medical Research Council of the UK, the Human Frontier Science Program, the Wellcome Trust, and the James S. McDonnell Foundation. I am also grateful to many colleagues whom I have consulted while writing this book, including Joel Price (Washington University School of Medicine) and Donald Wilson (New York University). Dr Patrick Mills is warmly thanked for his comments on the text. The Section on ars memoriae is warmly dedicated to my colleagues at Corpus Christi College, Oxford.

The book was typeset by the author using LaTeX and WinEdt. The cover includes part of the picture Pandora painted in 1896 by J. W. Waterhouse.
The metaphor is to look inside the system of the mind and the brain, in order to understand how the brain functions, and thereby better to understand and treat its disorders. The cover also includes an image of the dendritic morphology of excitatory neurons in S1 whisker barrel cortex (Fig. 1.14) (adapted from Marcel Oberlaender, Christiaan P. J. de Kock, Randy M. Bruno, Alejandro Ramirez, Hanno S. Meyer, Vincent J. Dercksen, Moritz Helmstaedter and Bert Sakmann, Cell type-specific three-dimensional structure of thalamocortical circuits in a column of rat vibrissal cortex, Cerebral Cortex, 2012, Vol. 22, issue 10, by permission of Oxford University Press). The cover also includes a diagram of the computational circuitry of the hippocampus by the author (Fig. 24.1). The aim of these latter two images is to highlight the importance of moving from the anatomy of the cortex, using all the approaches available, including neuronal network models that address and

incorporate neurophysiological discoveries, to an understanding of how the cortex operates computationally.

Updates to, and .pdfs of, many of the publications cited in this book are available online, where updates and corrections to the text and notes are also provided.

I dedicate this work to the overlapping group: my family, friends, and colleagues, in salutem praesentium, in memoriam absentium.

Contents

1 Introduction
    Principles of operation of the cerebral cortex: introduction and plan
    Neurons
    Neurons in a network
    Synaptic modification
    Long-term potentiation and long-term depression
    Distributed representations
    Definitions
    Advantages of different types of coding
    Neuronal network approaches versus connectionism
    Introduction to three neuronal network architectures
    Systems-level analysis of brain function
    Ventral cortical visual stream
    Dorsal cortical visual stream
    Hippocampal memory system
    Frontal lobe systems
    Brodmann areas
    The fine structure of the cerebral neocortex
    The fine structure and connectivity of the neocortex
    Excitatory cells and connections
    Inhibitory cells and connections
    Quantitative aspects of cortical architecture
    Functional pathways through the cortical layers
    The scale of lateral excitatory and inhibitory effects, and modules
    Highlights

2 Hierarchical organization
    Introduction
    Hierarchical organization in sensory systems
    Hierarchical organization in the ventral visual system
    Hierarchical organization in the dorsal visual system
    Hierarchical organization of taste processing
    Hierarchical organization of olfactory processing
    Hierarchical multimodal convergence of taste, olfaction, and vision
    Hierarchical organization of auditory processing
    Hierarchical organization of reward value processing
    Hierarchical organization of connections to the frontal lobe for short-term memory
    Highlights

3 Localization of function
    Hierarchical processing
    Short-range neocortical recurrent collaterals
    Topographic maps
    Modularity
    Lateralization of function
    Ventral and dorsal cortical areas
    Highlights

4 Recurrent collateral connections and attractor networks
    Introduction
    Attractor networks implemented by the recurrent collaterals
    Evidence for attractor networks implemented by recurrent collateral connections
    Short-term Memory
    Long-term Memory
    Decision-Making
    The storage capacity of attractor networks
    A global attractor network in hippocampal CA3, but local in neocortex
    The speed of operation of cortical attractor networks
    Dilution of recurrent collateral cortical connectivity
    Self-organizing topographic maps in the neocortex
    Attractors formed by forward and backward connections between cortical areas?
    Interacting attractor networks
    Highlights

5 The noisy cortex: stochastic dynamics, decisions, and memory
    Reasons why the brain is inherently noisy and stochastic
    Attractor networks, energy landscapes, and stochastic neurodynamics
    A multistable system with noise
    Stochastic dynamics and the stability of short-term memory
    Analysis of the stability of short-term memory
    Stability and noise in a model of short-term memory
    Long-term memory recall
    Stochastic dynamics and probabilistic decision-making in an attractor network
    Decision-making in an attractor network
    Theoretical framework: a probabilistic attractor network
    Stationary multistability analysis: mean-field
    Integrate-and-fire simulations of decision-making: spiking dynamics
    Reaction times of the neuronal responses
    Percentage correct
    Finite-size noise effects
    Comparison with neuronal data during decision-making
    Testing the model of decision-making with human functional neuroimaging
    Decisions based on confidence in one's decisions: self-monitoring
    Decision-making with multiple alternatives
    The matching law
    Comparison with other models of decision-making
    Perceptual decision-making and rivalry
    Symmetry breaking
    The evolutionary utility of probabilistic choice
    Selection between conscious vs unconscious decision-making, and free will
    Creative thought
    Unpredictable behaviour
    Predicting a decision before the evidence is applied
    Highlights

6 Attention, short-term memory, and biased competition
    Bottom-up attention
    Top-down attention: biased competition
    The biased competition hypothesis
    Biased competition: single neuron studies
    Non-spatial attention
    Biased competition: fMRI
    A basic computational module for biased competition
    Architecture of a model of attention
    Simulations of basic experimental findings
    Object recognition and spatial search
    The neuronal and biophysical mechanisms of attention
    Serial vs parallel attentional processing
    Top-down attention: biased activation
    Selective attention can selectively activate different cortical areas
    Sources of the top-down modulation of attention
    Granger causality used to investigate the source of the top-down biasing
    Top-down cognitive modulation
    A top-down biased activation model of attention
    Conclusions
    Highlights

7 Diluted connectivity
    Introduction
    Diluted connectivity and the storage capacity of attractor networks
    The autoassociative or attractor network architecture being studied
    The storage capacity of attractor networks with diluted connectivity
    The network simulated
    The effects of diluted connectivity on the capacity of attractor networks
    Synthesis of the effects of diluted connectivity in attractor networks
    The effects of dilution on the capacity of pattern association networks
    The effects of dilution on the performance of competitive networks
    Competitive Networks
    Competitive networks without learning but with diluted connectivity
    Competitive networks with learning and with diluted connectivity
    Competitive networks with learning and with full (undiluted) connectivity
    Overview and implications of diluted connectivity in competitive networks
    The effects of dilution on the noise in attractor networks
    Highlights

8 Coding principles
    Types of encoding
    Place coding with sparse distributed firing rate representations
    Reading the code used by single neurons
    Understanding the code provided by populations of neurons
    Synchrony, coherence, and binding
    Principles by which the representations are formed
    Information encoding in the human cortex
    Highlights

9 Synaptic modification for learning
    Introduction
    Associative synaptic modification implemented by long-term potentiation
    Forgetting in associative neural networks, and memory reconsolidation
    Forgetting
    Factors that influence synaptic modification
    Reconsolidation
    Spike timing-dependent plasticity
    Long-term synaptic depression in the cerebellar cortex
    Reward prediction error learning
    Blocking and delta rule learning
    Dopamine neuron firing and reward prediction error learning
    Highlights

10 Synaptic and neuronal adaptation and facilitation
    Mechanisms for neuronal adaptation and synaptic depression and facilitation
    Sodium inactivation leading to neuronal spike frequency adaptation
    Calcium-activated hyperpolarizing potassium current
    Short-term synaptic depression and facilitation
    Short-term depression of thalamic input to the cortex
    Relatively little adaptation in primate cortex when it is operating normally
    Acetylcholine, noradrenaline, and other modulators of adaptation and facilitation
    Acetylcholine
    Noradrenergic neurons
    Synaptic depression and sensory-specific satiety
    Neuronal and synaptic adaptation, and the memory for sequential order
    Destabilization of short-term memory by adaptation or synaptic depression
    Non-reward computation in the orbitofrontal cortex using synaptic depression
    Synaptic facilitation and a multiple-item short-term memory
    Synaptic facilitation in decision-making
    Highlights

11 Backprojections in the neocortex
    Architecture
    Learning
    Recall
    Semantic priming
    Top-down Attention
    Autoassociative storage, and constraint satisfaction
    Highlights

12 Memory and the hippocampus
    Introduction
    Hippocampal circuitry and connections
    The hippocampus and episodic memory
    Autoassociation in the CA3 network for episodic memory
    The dentate gyrus as a pattern separation mechanism, and neurogenesis
    Rodent place cells vs primate spatial view cells
    Backprojections, and the recall of information from the hippocampus to neocortex
    Subcortical structures connected to the hippocampo-cortical memory system
    Highlights

13 Limited neurogenesis in the adult cortex
    No neurogenesis in the adult neocortex
    Limited neurogenesis in the adult hippocampal dentate gyrus
    Neurogenesis in the chemosensing receptor systems
    Highlights

14 Invariance learning and vision
    Hierarchical cortical organization with convergence
    Feature combinations
    Sparse distributed representations
    Self-organization by feedforward processing without a teacher
    Learning guided by the statistics of the visual inputs
    Bottom-up saliency
    Lateral interactions shape receptive fields
    Top-down selective attention vs feedforward processing
    Topological maps to simplify connectivity
    Biologically decodable output representations
    Highlights

15 Emotion, motivation, reward value, pleasure, and their mechanisms
    Emotion, reward value, and their evolutionary adaptive utility
    Motivation and reward value
    Principles of cortical design for emotion and motivation
    Objects are first represented independently of reward value
    Specialized systems for face identity and expression processing in primates
    Unimodal processing to the object level before multimodal convergence
    A common scale for reward value
    Sensory-specific satiety
    Economic value is represented in the orbitofrontal cortex
    Neuroeconomics vs classical microeconomics
    Output systems influenced by orbitofrontal cortex reward value representations
    Decision-making about rewards in the anterior orbitofrontal cortex
    Probabilistic emotion-related decision-making
    Non-reward, error, neurons in the orbitofrontal cortex
    Reward reversal learning in the orbitofrontal cortex
    Dopamine neurons and emotion
    The explicit reasoning system vs the emotional system
    Pleasure
    Personality relates to differences in sensitivity to rewards and punishers
    Highlights

16 Noise in the cortex, stability, psychiatric disease, and aging
    Stochastic noise, attractor dynamics, and schizophrenia
    Introduction
    A dynamical systems hypothesis of the symptoms of schizophrenia
    The depth of the basins of attraction: mean-field flow analysis
    Decreased stability produced by reduced NMDA conductances
    Increased distractibility produced by reduced NMDA conductances
    Synthesis: network instability and schizophrenia
    Stochastic noise, attractor dynamics, and obsessive-compulsive disorder
    Introduction
    A hypothesis about obsessive-compulsive disorder
    Glutamate and increased depth of the basins of attraction
    Synthesis on obsessive-compulsive disorder
    Stochastic noise, attractor dynamics, and depression
    Introduction
    A non-reward attractor theory of depression
    Evidence consistent with the theory
    Relation to other brain systems implicated in depression
    Implications for treatments
    Mania and bipolar disorder
    Stochastic noise, attractor dynamics, and aging
    NMDA receptor hypofunction
    Dopamine
    Impaired synaptic modification
    Cholinergic function and memory
    Highlights

17 Syntax and Language
    Neurodynamical hypotheses about language and syntax
    Binding by synchrony?
    Syntax using a place code
    Temporal trajectories through a state space of attractors
    Hypotheses about the implementation of language in the cerebral cortex
    Tests of the hypotheses: a model
    Attractor networks with stronger forward than backward connections
    The operation of a single attractor network module
    Spike frequency adaptation mechanism
    Tests of the hypotheses: findings with the model
    A production system
    A decoding system
    Evaluation of the hypotheses
    Highlights

18 Evolutionary trends in cortical design and principles of operation
    Introduction
    Different types of cerebral neocortex: towards a computational understanding
    Neocortex or isocortex
    Olfactory (pyriform) cortex
    Hippocampal cortex
    Addition of areas in the neocortical hierarchy
    Evolution of the orbitofrontal cortex
    Evolution of the taste and flavour system
    Principles
    Taste processing in rodents
    Evolution of the temporal lobe cortex
    Evolution of the frontal lobe cortex
    Highlights

19 Genetics and self-organization build the cortex
    Introduction
    Hypotheses about the genes that build cortical neural networks
    Genetic selection of neuronal network parameters
    Simulation of the evolution of neural networks using a genetic algorithm
    The neural networks
    The specification of the genes
    The genetic algorithm, and general procedure
    Pattern association networks
    Autoassociative networks
    Competitive networks
    Evaluation of the gene-based evolution of single-layer networks
    The gene-based evolution of multi-layer cortical systems
    Highlights

20 Cortex versus basal ganglia design for selection
    Systems-level architecture of the basal ganglia
    What computations are performed by the basal ganglia?
    How do the basal ganglia perform their computations?
    Comparison of selection in the basal ganglia and cerebral cortex
    Highlights

21 Sleep and Dreaming
    Is sleep necessary for cortical function?
    Is sleep involved in memory consolidation?
    Dreams
    Highlights

22 Which cortical computations underlie consciousness?
    Introduction
    A Higher-Order Syntactic Thought (HOST) theory of consciousness
    Multiple routes to action
    A computational hypothesis of consciousness
    Adaptive value of processing that is related to consciousness
    Symbol grounding
    Qualia
    Pathways
    Consciousness and causality
    Consciousness and higher-order syntactic thoughts
    Selection between conscious vs unconscious decision-making systems
    Dual major routes to action: implicit and explicit
    The Selfish Gene vs The Selfish Phenotype
    Decision-making between the implicit and explicit systems
    Determinism
    Free will
    Content and meaning in representations
    The causal role of consciousness and the relation between the mind and the brain
    Comparison with other theories of consciousness
    Higher-order thought theories
    Oscillations and temporal binding
    A high neural threshold for information to reach consciousness
    James-Lange theory and Damasio's somatic marker hypothesis
    LeDoux's approach to emotion and consciousness
    Panksepp's approach to emotion and consciousness
    Global workspace theories of consciousness
    Monitoring and consciousness
    Highlights

23 Cerebellar cortex
    Introduction
    Architecture of the cerebellum
    The connections of the parallel fibres onto the Purkinje cells
    The climbing fibre input to the Purkinje cell
    The mossy fibre to granule cell connectivity
    Modifiable synapses of parallel fibres onto Purkinje cell dendrites
    The cerebellar cortex as a perceptron
    Highlights: differences between cerebral and cerebellar cortex microcircuitry

24 The hippocampus and memory
    Introduction
    Systems-level functions of the hippocampus
    Systems-level anatomy
    Evidence from the effects of damage to the hippocampus
    The necessity to recall information from the hippocampus
    Systems-level neurophysiology of the primate hippocampus
    Head direction cells in the presubiculum
    Perirhinal cortex, recognition memory, and long-term familiarity memory
    A theory of the operation of hippocampal circuitry as a memory system
    Hippocampal circuitry
    Entorhinal cortex
    CA3 as an autoassociation memory
    Dentate granule cells
    CA1 cells
    Recoding in CA1 to facilitate retrieval to the neocortex
    Backprojections to the neocortex, memory recall, and consolidation
    Backprojections to the neocortex: quantitative aspects
    Simulations of hippocampal operation
    The learning of spatial view and place cell representations
    Linking the inferior temporal visual cortex to spatial view and place cells
    A scientific theory of the art of memory: scientia artis memoriae
    Tests of the theory of hippocampal cortex operation
    Dentate gyrus (DG) subregion of the hippocampus
    CA3 subregion of the hippocampus
    CA1 subregion of the hippocampus
    Evaluation of the theory of hippocampal cortex operation
    Tests of the theory by hippocampal system subregion analyses
    Comparison with other theories of hippocampal function
    Highlights

25 Invariant visual object recognition learning
    Introduction
    Invariant representations of faces and objects in the inferior temporal visual cortex
    Processing to the inferior temporal cortex in the primate visual system
    Translation invariance and receptive field size
    Reduced translation invariance in natural scenes
    Size and spatial frequency invariance
    Combinations of features in the correct spatial configuration
    A view-invariant representation
    Learning in the inferior temporal cortex
    Distributed encoding
    Face expression, gesture, and view
    Specialized regions in the temporal cortical visual areas
    Approaches to invariant object recognition
    Feature spaces
    Structural descriptions and syntactic pattern recognition
    Template matching and the alignment approach
    Invertible networks that can reconstruct their inputs
    Feature hierarchies
    Hypotheses about object recognition mechanisms
    Computational issues in feature hierarchies
    The architecture of VisNet
    Initial experiments with VisNet
    The optimal parameters for the temporal trace used in the learning rule
    Different forms of the trace learning rule, and error correction
    The issue of feature binding, and a solution
    Operation in a cluttered environment
    Learning 3D transforms
    Capacity of the architecture, and an attractor implementation
    Vision in natural scenes: effects of background versus attention
    The representation of multiple objects in a scene
    Learning invariant representations using spatial continuity
    Lighting invariance
    Invariant global motion in the dorsal visual system
    Deformation-invariant object recognition
    Learning invariant representations of scenes and places
    Finding and recognising objects in natural scenes
    Further approaches to invariant object recognition
    Other types of slow learning
    HMAX
    Sigma-Pi synapses
    Deep learning
    Visuo-spatial scratchpad memory, and change blindness
    Processes involved in object identification
    Highlights

26 Synthesis
    Principles of cortical operation, not a single theory
    Levels of explanation, and the mind-brain problem
    Brain computation compared to computation on a digital computer
    Understanding how the brain works
    Synthesis on principles of operation of the cerebral cortex
    Hierarchical organization
    Localization of function
    Recurrent collaterals and attractor networks
    The noisy cortex
    Top-down attention
    Diluted connectivity
    Sparse distributed graded firing rate encoding
    Synaptic modification
    Adaptation and facilitation
    Backprojections
    Neurogenesis
    Binding and syntax
    Evolution of the cerebral cortex
    Genetic specification of cortical design
    The cortical systems for emotion
    Memory systems
    Visual cortical processing for invariant visual object recognition
    Cortical lamination, operation, and evolution
    Highlights

A Introduction to linear algebra for neural networks
    A.1 Vectors
        A.1.1 The inner or dot product of two vectors
        A.1.2 The length of a vector
        A.1.3 Normalizing the length of a vector
        A.1.4 The angle between two vectors: the normalized dot product
        A.1.5 The outer product of two vectors
        A.1.6 Linear and non-linear systems
        A.1.7 Linear combinations, linear independence, and linear separability
    A.2 Application to understanding simple neural networks
        A.2.1 Capability and limitations of single-layer networks
        A.2.2 Non-linear networks: neurons with non-linear activation functions
        A.2.3 Non-linear networks: neurons with non-linear activations

B Neuronal network models
    B.1 Introduction
    B.2 Pattern association memory
        B.2.1 Architecture and operation
        B.2.2 A simple model
        B.2.3 The vector interpretation
        B.2.4 Properties
        B.2.5 Prototype extraction, extraction of central tendency, and noise reduction
        B.2.6 Speed
        B.2.7 Local learning rule
        B.2.8 Implications of different types of coding for storage in pattern associators
    B.3 Autoassociation or attractor memory
        B.3.1 Architecture and operation
        B.3.2 Introduction to the analysis of the operation of autoassociation networks
        B.3.3 Properties
        B.3.4 Use of autoassociation networks in the brain
    B.4 Competitive networks, including self-organizing maps
        B.4.1 Function
        B.4.2 Architecture and algorithm
        B.4.3 Properties
        B.4.4 Utility of competitive networks in information processing by the brain
        B.4.5 Guidance of competitive learning
        B.4.6 Topographic map formation
        B.4.7 Invariance learning by competitive networks
        B.4.8 Radial Basis Function networks
        B.4.9 Further details of the algorithms used in competitive networks
    B.5 Continuous attractor networks
        B.5.1 Introduction
        B.5.2 The generic model of a continuous attractor network
        B.5.3 Learning the synaptic strengths in a continuous attractor network
        B.5.4 The capacity of a continuous attractor network: multiple charts
        B.5.5 Continuous attractor models: path integration
        B.5.6 Stabilization of the activity packet within a continuous attractor network
        B.5.7 Continuous attractor networks in two or more dimensions
        B.5.8 Mixed continuous and discrete attractor networks
    B.6 Network dynamics: the integrate-and-fire approach
        B.6.1 From discrete to continuous time
        B.6.2 Continuous dynamics with discontinuities
        B.6.3 An integrate-and-fire implementation
        B.6.4 The speed of processing of attractor networks
        B.6.5 The speed of processing of a four-layer hierarchical network
        B.6.6 Spike response model
    B.7 Network dynamics: introduction to the mean-field approach
    B.8 Mean-field based neurodynamics
        B.8.1 Population activity
        B.8.2 The mean-field approach used in a model of decision-making
        B.8.3 The model parameters used in the mean-field analyses of decision-making
        B.8.4 A basic computational module based on biased competition
        B.8.5 Multimodular neurodynamical architectures
    B.9 Sequence memory implemented by adaptation in an attractor network
    B.10 Error correction networks
        B.10.1 Architecture and general description
        B.10.2 Generic algorithm for a one-layer error correction network
        B.10.3 Capability and limitations of single-layer error-correcting networks
        B.10.4 Properties
    B.11 Error backpropagation multilayer networks
        B.11.1 Introduction
        B.11.2 Architecture and algorithm
        B.11.3 Properties of multilayer networks trained by error backpropagation
    B.12 Biologically plausible networks vs backpropagation
    B.13 Convolution networks
    B.14 Contrastive Hebbian learning: the Boltzmann machine
    B.15 Deep Belief Networks
    B.16 Reinforcement learning
        B.16.1 Associative reward-penalty algorithm of Barto and Sutton
        B.16.2 Reward prediction error or delta rule learning, and classical conditioning
        B.16.3 Temporal Difference (TD) learning
    B.17 Highlights

C Information theory, and neuronal encoding
    C.1 Information theory
        C.1.1 The information conveyed by definite statements
        C.1.2 Information conveyed by probabilistic statements
        C.1.3 Information sources, information channels, and information measures
        C.1.4 The information carried by a neuronal response and its averages
        C.1.5 The information conveyed by continuous variables
    C.2 The information carried by neuronal responses
        C.2.1 The limited sampling problem
        C.2.2 Correction procedures for limited sampling
        C.2.3 The information from multiple cells: decoding procedures
        C.2.4 Information in the correlations between cells: a decoding approach
        C.2.5 Information in the correlations between cells: second derivative approach
    C.3 Information theory results
        C.3.1 The sparseness of the distributed encoding used by the brain
        C.3.2 The information from single neurons
        C.3.3 The information from single neurons: temporal codes versus rate codes
        C.3.4 The information from single neurons: the speed of information transfer
        C.3.5 The information from multiple cells: independence versus redundancy
        C.3.6 Should one neuron be as discriminative as the whole organism?
        C.3.7 The information from multiple cells: the effects of cross-correlations
        C.3.8 Conclusions on cortical neuronal encoding
    C.4 Information theory terms: a short glossary
    C.5 Highlights

D Simulation software for neuronal network models
    D.1 Introduction
    D.2 Autoassociation or attractor networks
        D.2.1 Running the simulation
        D.2.2 Exercises
    D.3 Pattern association networks
        D.3.1 Running the simulation
        D.3.2 Exercises
    D.4 Competitive networks and Self-Organizing Maps
        D.4.1 Running the simulation
        D.4.2 Exercises
    D.5 Highlights

References
Index

Appendix 2 Neuronal network models

B.1 Introduction

Formal models of neural networks are needed in order to provide a basis for understanding the processing and memory functions performed by real neuronal networks in the brain. The formal models included in this Appendix all describe fundamental types of network found in different brain regions, and the computations they perform. Each of the types of network described can be thought of as providing one of the fundamental building blocks that the brain uses. Often these building blocks are combined within a brain area to perform a particular computation.

The aim of this Appendix is to describe a set of fundamental networks used by the brain, including the parts of the brain involved in memory, attention, decision-making, and the building of perceptual representations. As each type of network is introduced, we will point briefly to parts of the brain in which each network is found. Understanding these models provides a basis for understanding the theories of how different types of memory functions are performed.

The descriptions of these networks are kept relatively concise in this Appendix. More detailed descriptions of some of the quantitative aspects of storage in pattern associators and autoassociators are provided in the Appendices of Rolls and Treves (1998), Neural Networks and Brain Function. Another book that provides a clear and quantitative introduction to some of these networks is Hertz, Krogh and Palmer (1991), Introduction to the Theory of Neural Computation; other useful sources include Dayan and Abbott (2001), Gerstner, Kistler, Naud and Paninski (2014) (who focus on neuronal dynamics), Amit (1989) (for attractor networks), Koch (1999) (for a biophysical approach), Wilson (1999) (on spiking networks), and Rolls (2008d).

Some of the background to the operation of the types of neuronal network described here, including a brief review of the evidence on neuronal structure and function, and on synaptic plasticity and the rules by which synaptic strength is modified, much of it based on studies of long-term potentiation, is provided in Chapter 1.

The network models on which we focus in this Appendix utilize a local learning rule, that is, a rule for synaptic modification in which the signals needed to alter the synaptic strength are present in the pre- and post-synaptic neurons. We focus on these networks because use of a local learning rule is biologically plausible. We discuss the issue of biological plausibility of the networks described, and show how they differ from less biologically plausible networks such as multilayer backpropagation-of-error networks, in Section B.12.

B.2 Pattern association memory

A fundamental operation of most nervous systems is to learn to associate a first stimulus with a second that occurs at about the same time, and to retrieve the second stimulus when the first is presented. The first stimulus might be the sight of food, and the second stimulus the taste of food. After the association has been learned, the sight of food would enable its taste to be retrieved. In classical conditioning, the taste of food might elicit an unconditioned response of

[Fig. B.1 A pattern association memory. An unconditioned stimulus has activity or firing rate e_i for the ith neuron, and produces firing y_i of the ith neuron. An unconditioned stimulus may be treated as a vector, across the set of neurons indexed by i, of activity e. The firing rate response can also be thought of as a vector of firing y. The conditioned stimuli have activity or firing rate x_j for the jth axon, which can also be treated as a vector x. In the figure, the conditioned stimulus input axons make synapses w_ij onto the dendrites of the output neurons, for which h_i is the dendritic activation and y_i the output firing.]

salivation, and if the sight of the food is paired with its taste, then the sight of that food would by learning come to produce salivation. Pattern associators are thus used where the outputs of the visual system interface to learning systems in the orbitofrontal cortex and amygdala that learn associations between the sight of objects and their taste or touch in stimulus-reinforcer association learning (see Chapter 15). Pattern association is also used throughout the cerebral (neo)cortical areas, as it is the architecture that describes the backprojection connections from one cortical area to the preceding cortical area (see Chapters 1, 24 and 6). Pattern association thus contributes to implementing top-down influences in attention, including the effects of attention from higher to lower cortical areas, and thus between the visual object and spatial processing streams (Rolls and Deco 2002) (see Chapter 6); the effects of mood on memory and visual information processing (Rolls and Stringer 2001b); the recall of visual memories; and the operation of short-term memory systems (see Section 4.3.1).

B.2.1 Architecture and operation

The essential elements necessary for pattern association, forming what could be called a prototypical pattern associator network, are shown in Fig. B.1. What we have called the second or unconditioned stimulus pattern is applied through unmodifiable synapses generating an input to each neuron, which, being external with respect to the synaptic matrix we focus on, we can call the external input e_i for the ith neuron. (We can also treat this as a vector, e, as indicated in the legend to Fig. B.1. Vectors and simple operations performed with them are summarized in Appendix A. This unconditioned stimulus is dominant in producing or forcing the firing of the output neurons, y_i for the ith neuron, or the vector y.) At the same time, the first or conditioned stimulus pattern, consisting of the set of firings on the horizontally running

input axons in Fig. B.1 (x_j for the jth axon, or equivalently the vector x), is applied through modifiable synapses w_ij to the dendrites of the output neurons. The synapses are modifiable in such a way that if there is presynaptic firing on an input axon x_j paired during learning with postsynaptic activity on neuron i, then the strength or weight w_ij between that axon and the dendrite increases. This simple learning rule is often called the Hebb rule, after Donald Hebb, who in 1949 formulated the hypothesis that if the firing of one neuron was regularly associated with another, then the strength of the synapse or synapses between the neurons should increase.[36]

After learning, presenting the pattern x on the input axons will activate the dendrite through the strengthened synapses. If the cue or conditioned stimulus pattern is the same as that learned, the postsynaptic neurons will be activated, even in the absence of the external or unconditioned input, as each of the firing axons produces through a strengthened synapse some activation of the postsynaptic element, the dendrite. The total activation h_i of each postsynaptic neuron i is then the sum of such individual activations. In this way, the correct output neurons, that is those activated during learning, can end up being the ones most strongly activated, and the second or unconditioned stimulus can be effectively recalled. The recall is best when only strong activation of the postsynaptic neuron produces firing, that is, if there is a threshold for firing, just like real neurons. The advantages of this are evident when many associations are stored in the memory, as will soon be shown.

Next we introduce a more precise description of the above by writing down explicit mathematical rules for the operation of the simple network model of Fig. B.1, which will help us to understand how pattern association memories in general operate. (In this description we introduce simple vector operations, and, for those who are not familiar with these, refer the reader to Appendix A.) We have denoted above a conditioned stimulus input pattern as x. Each of the axons has a firing rate, and if we count or index through the axons using the subscript j, the firing rate of the first axon is x_1, of the second x_2, of the jth x_j, etc. The whole set of axons forms a vector, which is just an ordered (1, 2, 3, etc.) set of elements. The firing rate of each axon x_j is one element of the firing rate vector x. Similarly, using i as the index, we can denote the firing rate of any output neuron as y_i, and the firing rate output vector as y. With this terminology, we can then identify any synapse onto neuron i from neuron j as w_ij (see Fig. B.1). In this book, the first index, i, always refers to the receiving neuron (and thus signifies a dendrite), while the second index, j, refers to the sending neuron (and thus signifies a conditioned stimulus input axon in Fig. B.1). We can now specify the learning and retrieval operations as follows.

Learning

The firing rate of every output neuron is forced to a value determined by the unconditioned (or external, or forcing) stimulus input e_i. In our simple model this means that for any one neuron i,

\[ y_i = f(e_i) \tag{B.1} \]

which indicates that the firing rate is a function of the dendritic activation, taken in this case to reduce essentially to that resulting from the external forcing input (see Fig. 1.4), and its precise form is irrelevant, at least during this learning phase. For example, the function at its simplest could be taken to be linear, so that the firing rate would be just proportional to the activation.
[36] In fact, the terms in which Hebb put the hypothesis were a little different from an association memory, in that he stated that if one neuron regularly comes to elicit firing in another, then the strength of the synapses should increase. He had in mind the building of what he called cell assemblies. In a pattern associator, the conditioned stimulus need not produce, before learning, any significant activation of the output neurons. The connection strengths must simply increase if there is associated pre- and postsynaptic firing when, in pattern association, most of the postsynaptic firing is being produced by a different input.

The Hebb rule can then be written as follows:

\[ \delta w_{ij} = \alpha \, y_i \, x_j \tag{B.2} \]

where \(\delta w_{ij}\) is the change of the synaptic weight w_ij that results from the simultaneous (or conjunctive) presence of presynaptic firing x_j and postsynaptic firing or activation y_i, and \(\alpha\) is a learning rate constant that specifies how much the synapses alter on any one pairing. The Hebb rule is expressed in this multiplicative form to reflect the idea that both presynaptic and postsynaptic activity must be present for the synapses to increase in strength. The multiplicative form also reflects the idea that strong pre- and postsynaptic firing will produce a larger change of synaptic weight than smaller firing rates. It is also assumed for now that before any learning takes place, the synaptic strengths are small in relation to the changes that can be produced during Hebbian learning. We will see that this assumption can be relaxed later when a modified Hebb rule is introduced that can lead to a reduction in synaptic strength under some conditions.

Recall

When the conditioned stimulus is present on the input axons, the total activation h_i of a neuron i is the sum of all the activations produced through each strengthened synapse w_ij by each active neuron x_j. We can express this as

\[ h_i = \sum_{j=1}^{C} x_j \, w_{ij} \tag{B.3} \]

where the sum is over the C input axons (or connections), indexed by j, onto each neuron. The multiplicative form here indicates that activation should be produced by an axon only if it is firing, and only if it is connected to the dendrite by a strengthened synapse. It also indicates that the strength of the activation reflects how fast the axon x_j is firing, and how strong the synapse w_ij is. The sum of all such activations expresses the idea that summation (of synaptic currents in real neurons) occurs along the length of the dendrite, to produce activation at the cell body, where the activation h_i is converted into firing y_i. This conversion can be expressed as

\[ y_i = f(h_i) \tag{B.4} \]

where the function f is again the activation function. The form of the function now becomes more important. Real neurons have thresholds, with firing occurring only if the activation is above the threshold. A threshold linear activation function is shown in Fig. 1.4b on page 7. This has been useful in formal analysis of the properties of neural networks. Neurons also have firing rates that become saturated at a maximum rate, and we could express this as the sigmoid activation function shown in Fig. 1.4c. Yet another simple activation function, used in some models of neural networks, is the binary threshold function (Fig. 1.4d), which indicates that if the activation is below threshold, there is no firing, and that if the activation is above threshold, the neuron fires maximally. Whatever the exact shape of the activation function, some non-linearity is an advantage, for it enables small activations produced by interfering memories to be minimized, and it can enable neurons to perform logical operations, such as to fire or respond only if two or more sets of inputs are present simultaneously.
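To make the learning and recall operations of equations B.2 to B.4 concrete, the following minimal sketch implements them in Python/NumPy. It is an illustrative sketch only, not the book's accompanying software (the Matlab demonstration programs are described in Appendix D); the function names and the strict-inequality treatment of the threshold are assumptions made here for illustration.

```python
import numpy as np

def hebb_learn(W, x, y, alpha=1.0):
    """One CS-UCS pairing: the Hebb rule of eq. (B.2),
    delta_w_ij = alpha * y_i * x_j, applied to the whole weight matrix W."""
    return W + alpha * np.outer(y, x)

def recall(W, x, theta=2.0):
    """Recall to a CS input x: eq. (B.3) gives the activations h_i as
    sums of x_j * w_ij over the input axons; eq. (B.4) is taken here to be
    a binary threshold activation function (firing 1 only if h_i > theta)."""
    h = W @ x
    y = (h > theta).astype(int)
    return h, y
```

The threshold theta plays the role of the binary threshold activation function of Fig. 1.4d; a threshold-linear or sigmoid f could be substituted in the recall step without changing the learning step.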

B.2.2 A simple model

An example of these learning and recall operations is provided in a simple form as follows. The neurons will have simple firing rates, which can be 0 to represent no activity, and 1 to indicate high firing. They are thus binary neurons, which can assume one of two firing rates. If we have a pattern associator with six input axons and four output neurons, we could represent the network before learning, with the same layout as in Fig. B.1, as shown in Fig. B.2:

[Fig. B.2 Pattern association: before synaptic modification. The unconditioned stimulus (UCS) firing rates are shown as 1 if high and 0 if low, as a row vector being applied to force firing of the four output neurons. The six conditioned stimulus (CS) firing rates are shown as a column vector being applied to the vertical dendrites of the output neurons, which have initial synaptic weights of 0.]

where x is the binary firing rate vector of the conditioned stimulus (CS) on the six input axons, and y is the binary firing produced by the unconditioned stimulus (UCS) on the four output neurons, as shown in Fig. B.2. (The arrows indicate the flow of signals.) The synaptic weights are initially all 0. After pairing the CS with the UCS during one learning trial, some of the synaptic weights will be incremented according to equation B.2, so that after learning this pair the synaptic weights will become as shown in Fig. B.3:

[Fig. B.3 Pattern association: after synaptic modification. The synapses where there is conjunctive pre- and post-synaptic activity have been strengthened to value 1.]

We can represent what happens during recall, when, for example, we present the CS that has been learned, as shown in Fig. B.4:

[Fig. B.4 Pattern association: recall. The activation h_i of each neuron i is converted with a threshold of 2 to the binary firing rate y_i (1 for high, and 0 for low).]

The activation of the four output neurons is 3300, and if we set the threshold of each output neuron to 2, then the output firing is 1100 (where the binary firing rate is 0 if below threshold, and 1 if above). The pattern associator has thus achieved recall of the pattern 1100, which is correct.

We can now illustrate how a number of different associations can be stored in such a pattern associator, and retrieved correctly. Let us associate a new CS pattern with the UCS 0101 in the same pattern associator. The weights will become as shown in Fig. B.5 after learning:

[Fig. B.5 Pattern association: synaptic weights after learning a second pattern association.]
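The binary vectors in the original figures did not survive transcription, so the sketch below uses stand-in CS patterns: they are an assumption, chosen so that the activations quoted in the text (3300 here, and 3401 in the exercise below) are reproduced.

```python
import numpy as np

x1 = np.array([1, 1, 0, 1, 0, 0])   # stand-in CS1 on the six input axons (assumed)
y1 = np.array([1, 1, 0, 0])         # firing forced by UCS1 on the four output neurons
x2 = np.array([0, 1, 1, 0, 1, 1])   # stand-in CS2, overlapping CS1 in one element (assumed)
y2 = np.array([0, 1, 0, 1])         # firing forced by UCS2: the pattern 0101

W = np.outer(y1, x1)                # weights after the first pairing (as in Fig. B.3)
h = W @ x1                          # recall with CS1 (as in Fig. B.4)
print(h, (h > 2).astype(int))       # -> [3 3 0 0] and [1 1 0 0]: the pattern 1100

W = W + np.outer(y2, x2)            # weights after the second pairing (as in Fig. B.5)
```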

If we now present the second CS, the retrieval is as shown in Fig. B.6:

[Fig. B.6 Pattern association: recall with the second CS. The binary output firings were again produced with the threshold set to 2.]

Recall is perfect. This illustration shows the value of some threshold non-linearity in the activation function of the neurons. In this case, the activations did reflect some small cross-talk or interference from the previous pattern association of CS1 with UCS1, but this was removed by the threshold operation, to clean up the recall firing. The example also shows that when further associations are learned by a pattern associator trained with the Hebb rule, equation B.2, some synapses will reflect increments above a synaptic strength of 1. It is left as an exercise to the reader to verify that recall is still perfect for CS1. (The activation vector h is 3401, and the output firing vector y with the same threshold of 2 is 1100, which is perfect recall.)

B.2.3 The vector interpretation

The way in which recall is produced, equation B.3, consists for each output neuron i of multiplying each input firing rate x_j by the corresponding synaptic weight w_ij and summing the products to obtain the activation h_i. Now we can consider the firing rates x_j, where j varies from 1 to N, the number of axons, to be a vector. (A vector is simply an ordered set of numbers; see Appendix A.) Let us call this vector x. Similarly, on a neuron i, the synaptic weights can be treated as a vector, w_i. (The subscript i here indicates that this is the weight vector on the ith neuron.) The operation we have just described to obtain the activation of an output neuron can now be seen to be a simple multiplication operation of two vectors to produce a single output value (called a scalar output). This is the inner product or dot product of two vectors, and can be written

\[ h_i = \mathbf{x} \cdot \mathbf{w}_i . \tag{B.5} \]

The inner product of two vectors indicates how similar they are. If two vectors have corresponding elements the same, then the dot product will be maximal. If the two vectors are similar but not identical, then the dot product will be high. If the two vectors are completely different, the dot product will be 0, and the vectors are described as orthogonal. (The term orthogonal means at right angles, and arises from the geometric interpretation of vectors, which is summarized in Appendix A.) Thus the dot product provides a direct measure of how similar two vectors are.

It can now be seen that a fundamental operation many neurons perform is effectively to compute how similar an input pattern vector x is to their stored weight vector w_i.
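The reader's exercise can be checked with the same stand-in vectors (again an assumption, consistent with the activations quoted in the text), which also makes the dot-product reading of eq. (B.5) explicit: each activation h_i is the inner product of the input vector with the weight vector on neuron i.

```python
import numpy as np

x1 = np.array([1, 1, 0, 1, 0, 0]); y1 = np.array([1, 1, 0, 0])  # stand-ins as before
x2 = np.array([0, 1, 1, 0, 1, 1]); y2 = np.array([0, 1, 0, 1])
W = np.outer(y1, x1) + np.outer(y2, x2)   # both associations stored

h = W @ x1                          # present CS1 again
print(h)                            # -> [3 4 0 1]: the activation vector 3401
print((h > 2).astype(int))          # -> [1 1 0 0]: perfect recall of 1100
print(np.dot(x1, W[1]))             # -> 4: the activation of neuron 2 is the dot
                                    #    product x . w_i of eq. (B.5), row-wise in W
```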

27 Pattern association memory 713 similarity measure they compute, the dot product, is a very good measure of similarity, and indeed, the standard (Pearson product moment) correlation coefficient used in statistics is the same as a normalized dot product with the mean subtracted from each vector, as shown in Appendix A. (The normalization used in the correlation coefficient results in the coefficient varying always between +1 and 1, whereas the actual scalar value of a dot product clearly depends on the length of the vectors from which it is calculated.) With these concepts, we can now see that during learning, a pattern associator adds to its weight vector a vector δw i that has the same pattern as the input pattern x, if the postsynaptic neuron i is strongly activated. Indeed, we can express equation B.2 in vector form as δw i = αy i x. (B.6) We can now see that what is recalled by the neuron depends on the similarity of the recall cue vector x r to the originally learned vector x. The fact that during recall the output of each neuron reflects the similarity (as measured by the dot product) of the input pattern x r to each of the patterns used originally as x inputs (conditioned stimuli in Fig. B.1) provides a simple way to appreciate many of the interesting and biologically useful properties of pattern associators, as described next. B.2.4 B Properties Generalization During recall, pattern associators generalize, and produce appropriate outputs if a recall cue vector x r is similar to a vector that has been learned already. This occurs because the recall operation involves computing the dot (inner) product of the input pattern vector x r with the synaptic weight vector w i, so that the firing produced, y i, reflects the similarity of the current input to the previously learned input pattern x. (Generalization will occur to input cue or conditioned stimulus patterns x r that are incomplete versions of an original conditioned stimulus x, although the term completion is usually applied to the autoassociation networks described in Section B.3.) This is an extremely important property of pattern associators, for input stimuli during recall will rarely be absolutely identical to what has been learned previously, and automatic generalization to similar stimuli is extremely useful, and has great adaptive value in biological systems. Generalization can be illustrated with the simple binary pattern associator considered above. (Those who have appreciated the vector description just given might wish to skip this illustration.) Instead of the second CS, pattern vector , we will use the similar recall cue , as shown in Fig. B.7:

Fig. B.7 Pattern association: generalization using an input vector similar to the second CS.

It is seen that the output firing rate vector, 0101, is exactly what should be recalled to CS2 (and not to CS1), so correct generalization has occurred. Although this is a small network trained with few examples, the same properties hold for large networks with large numbers of stored patterns, as described more quantitatively in Section B.2.7.1 on capacity below and in Appendix A3 of Rolls and Treves (1998).

B.2.4.2 Graceful degradation or fault tolerance

If the synaptic weight vector w_i (or the weight matrix, which we can call W) has synapses missing (e.g. during development), or loses synapses, then the activation h_i or h is still reasonable, because h_i is the dot product (correlation) of x with w_i. The result, especially after passing through the activation function, can frequently be perfect recall. The same property arises if, for example, one or some of the conditioned stimulus (CS) input axons are lost or damaged. This is a very important property of associative memories, and is not a property of conventional computer memories, which produce incorrect data if even only one storage location (for one bit or binary digit of data) of their memory is damaged or cannot be accessed. This property of graceful degradation is of great adaptive value for biological systems.

We can illustrate this with a simple example. If we damage two of the synapses in Fig. B.6 to produce the synaptic matrix shown in Fig. B.8 (where x indicates a damaged synapse which has no effect, but was previously 1), and now present the second CS, the retrieval is as shown in Fig. B.8.

Fig. B.8 Pattern association: graceful degradation when some synapses are damaged (x).

The binary output firings were again produced with the threshold set to 2. The recalled vector, 0101, is perfect. This illustration again shows the value of some threshold non-linearity in the activation function of the neurons. It is left as an exercise to the reader to verify that recall to CS1 is still perfect. (The output activation vector h is 3, 3, 0, 1, and the output firing vector y with the same threshold of 2 is 1100, which is perfect recall.)

B.2.4.3 The importance of distributed representations for pattern associators

A distributed representation is one in which the firing or activity of all the elements in the vector is used to encode a particular stimulus. For example, in a conditioned stimulus vector such as CS1 above, we need to know the state of all the elements to know which stimulus is being represented; another stimulus, CS2, is represented by a different, overlapping vector. We can represent many different events or stimuli with such overlapping sets of elements, and because in general any one element cannot be used to identify the stimulus, but instead the information about which stimulus is present is distributed over the population of elements or neurons, this is called a distributed representation (see Section 8.2). If, for binary neurons, half the neurons are in one state (e.g. 0), and the other half are in the other state (e.g. 1), then the representation is described as fully distributed. The CS representations above are thus fully distributed. If only a smaller proportion of the neurons is active to represent a stimulus, then this is a sparse representation. For binary representations, we can quantify the sparseness by the proportion of neurons in the active (1) state.

In contrast, a local representation is one in which all the information that a particular stimulus or event has occurred is provided by the activity of one of the neurons, or elements in the vector. One stimulus might be represented by a vector whose first element alone is active, another stimulus by a vector whose second element alone is active, and a third stimulus by a vector whose third element alone is active. The activity of neuron or element 1 would indicate that stimulus 1 was present, and of neuron 2, that stimulus 2 was present. The representation is local in that if a particular neuron is active, we know that the stimulus represented by that neuron is present. In neurophysiology, if such cells were present, they might be called grandmother cells (cf. Barlow (1972), (1995); see Chapters 1 and 8 and Appendix C), in that one neuron might represent a stimulus in the environment as complex and specific as one's grandmother. Where the activity of a number of cells must be taken into account in order to represent a stimulus (such as an individual taste), then the representation is sometimes described as using ensemble encoding.

The properties just described for associative memories, generalization and graceful degradation, are only implemented if the representation of the CS or x vector is distributed. This occurs because the recall operation involves computing the dot (inner) product of the input pattern vector x_r with the synaptic weight vector w_i. This allows the activation h_i to reflect the similarity of the current input pattern to a previously learned input pattern x only if several or many elements of the x and x_r vectors are in the active state to represent a pattern. If local encoding were used, then if the single active element of the vector (which might be the firing of axon 1, i.e. x_1, or the strength of synapse i1, w_i1) is lost, the resulting vector is not similar to any other CS vector, and the activation is 0. With local encoding, the important properties of associative memories, generalization and graceful degradation, thus do not emerge. Graceful degradation and generalization are dependent on distributed representations, for then the dot product can reflect similarity even when some elements of the vectors involved are altered. If we think of the correlation between Y and X in a graph, then this correlation is affected only a little if a few X, Y pairs of data are lost (see Appendix A).
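The dependence of these properties on distributed encoding can be checked directly. The following minimal sketch (with hypothetical random patterns) compares the normalized dot product of a distributed and a local binary vector with damaged versions of themselves.

import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    # normalized dot product; defined as 0 if either vector is all zeros
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

n = 100
distributed = (rng.random(n) < 0.5).astype(float)   # about half the elements active
local = np.zeros(n)
local[0] = 1.0                                      # a single active element

for name, x in (('distributed', distributed), ('local', local)):
    damaged = x.copy()
    damaged[np.flatnonzero(damaged)[0]] = 0.0       # knock out one active element
    print(name, cosine(x, damaged))
# distributed: ~0.99, still highly similar, so recall degrades gracefully
# local: 0.0, all information lost with a single damaged element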

B.2.5 Prototype extraction, extraction of central tendency, and noise reduction

If a set of similar conditioned stimulus vectors x are paired with the same unconditioned stimulus e_i, the weight vector w_i becomes (or points towards) the sum (or with scaling, the average) of the set of similar vectors x. This follows from the operation of the Hebb rule in equation B.2. When tested at recall, the output of the memory is then best to the average input pattern vector, denoted <x>. If the average is thought of as a prototype, then even though the prototype vector <x> itself may never have been seen, the best output of the neuron or network is to the prototype. This produces extraction of the prototype or central tendency. The same phenomenon is a feature of human memory performance (see McClelland and Rumelhart (1986) Chapter 17), and this simple process with distributed representations in a neural network accounts for the psychological phenomenon. If the different exemplars of the vector x are thought of as noisy versions of the true input pattern vector <x> (with incorrect values for some of the elements), then the pattern associator has performed noise reduction, in that the output produced by any one of these vectors will represent the output produced by the true, noiseless, average vector <x>.

B.2.6 Speed

Recall is very fast in a real neuronal network, because the conditioned stimulus input firings x_j (j = 1, ..., C axons) can be applied simultaneously to the synapses w_ij, and the activation h_i can be accumulated in one or two time constants of the dendrite (of the order of 20 ms). Whenever the threshold of the cell is exceeded, it fires. Thus, in effectively one step, which takes the brain no more than a few tens of milliseconds, all the output neurons of the pattern associator can be firing with rates that reflect the input firing of every axon. This is very different from a conventional digital computer, in which computing h_i in equation B.3 would involve C multiplication and addition operations occurring one after another, or 2C time steps.

The brain performs parallel computation in at least two senses in even a pattern associator. One is that for a single neuron, the separate contributions of the firing rate x_j of each axon j multiplied by the synaptic weight w_ij are computed in parallel and added in the same timestep. The second is that this can be performed in parallel for all neurons i = 1, ..., N in the network, where there are N output neurons in the network. It is these types of parallel and time-continuous (see Section B.6) processing that enable these classes of neuronal network in the brain to operate so fast, in effectively so few steps.

Learning is also fast ('one-shot') in pattern associators, in that a single pairing of the conditioned stimulus x and the unconditioned stimulus (UCS) e, which produces the unconditioned output firing y, enables the association to be learned. There is no need to repeat the pairing in order to discover over many trials the appropriate mapping. This is extremely important for biological systems, in which a single co-occurrence of two events may lead to learning that could have life-saving consequences. (For example, the pairing of a visual stimulus with a potentially life-threatening aversive event may enable that event to be avoided in future.)
Although repeated pairing with small variations of the vectors is used to obtain the useful properties of prototype extraction, extraction of central tendency, and noise reduction, the essential properties of generalization and graceful degradation are obtained with just one pairing. The actual time scales of the learning in the brain are indicated by studies of associative synaptic modification using long-term potentiation paradigms (LTP, see Section 1.5). Co-occurrence or near simultaneity of the CS and UCS is required for periods of as little as 100 ms, with expression of the synaptic modification being present within typically a few seconds.
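A short simulation illustrates the prototype extraction of Section B.2.5. In this sketch the prototype, the exemplars, and the single output neuron are all hypothetical; the neuron is trained with the Hebb rule of equation B.2 on noisy exemplars only, and is then tested on the never-presented prototype.

import numpy as np

rng = np.random.default_rng(1)

n = 200
prototype = (rng.random(n) < 0.5).astype(float)     # never shown to the network

def noisy(p, flips=20):
    # make an exemplar by flipping some elements of the prototype
    v = p.copy()
    idx = rng.choice(n, size=flips, replace=False)
    v[idx] = 1 - v[idx]
    return v

exemplars = [noisy(prototype) for _ in range(10)]

w = np.zeros(n)                                     # weight vector of one output neuron
for x in exemplars:                                 # trained with y_i = 1 on every pairing:
    w += x                                          # delta w_j = alpha * y_i * x_j, alpha = 1

print('mean exemplar activation:', np.mean([w @ x for x in exemplars]))
print('prototype activation:    ', w @ prototype)
# The unseen prototype typically produces the largest activation, illustrating
# extraction of the central tendency and noise reduction.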

Fig. B.9 (b) In vertebrate pattern association learning, the unconditioned stimulus (UCS) may be made available at all the conditioned stimulus (CS) terminals onto the output neuron because the dendrite of the postsynaptic neuron is electrically short, so that the effect of the UCS spreads for long distances along the dendrite. (a) In contrast, in at least some invertebrate association learning systems, the unconditioned stimulus or teaching input makes a synapse onto the presynaptic terminal carrying the conditioned stimulus.

B.2.7 Local learning rule

The simplest learning rule used in pattern association neural networks, a version of the Hebb rule, is, as shown in equation B.2 above,

δw_ij = α y_i x_j.

This is a local learning rule in that the information required to specify the change in synaptic weight is available locally at the synapse, as it is dependent only on the presynaptic firing rate x_j available at the synaptic terminal, and the postsynaptic activation or firing y_i available on the dendrite of the neuron receiving the synapse (see Fig. B.9b). This makes the learning rule biologically plausible, in that the information about how to change the synaptic weight does not have to be carried from a distant source, where it is computed, to every synapse. Such a non-local learning rule would not be biologically plausible, in that there are no appropriate connections known in most parts of the brain to bring in the synaptic training or teacher signal to every synapse.

Evidence that a learning rule with the general form of equation B.2 is implemented in at least some parts of the brain comes from studies of long-term potentiation, described in Section 1.5. Long-term potentiation (LTP) has the synaptic specificity defined by equation B.2, in that only synapses from active afferents, not those from inactive afferents, become strengthened. Synaptic specificity is important for a pattern associator, and most other types of neuronal network, to operate correctly. The number of independently modifiable synapses on each neuron is a primary factor in determining how many different memory patterns can be stored in associative memories (see Sections B.2.7.1 and B.3.3.7).

Another useful property of real neurons in relation to equation B.2 is that the postsynaptic term, y_i, is available on much of the dendrite of a cell, because the electrotonic length of the dendrite is short. In addition, active propagation of spiking activity from the cell body along the dendrite may help to provide a uniform postsynaptic term for the learning. Thus if a neuron is strongly activated with a high value for y_i, then any active synapse onto the cell will be capable of being modified. This enables the cell to learn an association between the pattern of activity on all its axons and its postsynaptic activation, which is stored as an addition to its weight vector w_i. Then later on, at recall, the output can be produced as a vector dot product operation between the input pattern vector x and the weight vector w_i, so that the output of the cell can reflect the correlation between the current input vector and what has previously been learned by the cell.

It is interesting that at least many invertebrate neuronal systems may operate very differently from those described here, as described by Rolls and Treves (1998) (see Fig. B.9a). If there were 5,000 conditioned stimulus inputs to a neuron, the implication is that every one of these presynaptic terminals would need to receive its own synapse conveying the same UCS, which is hardly plausible. The implication is that at least some invertebrate neural systems operate very differently to those in vertebrates and, in such systems, the useful properties that arise from using distributed CS representations, such as generalization, would not arise in the same simple way as a property of the network.

B.2.7.1 Capacity

The question of the storage capacity of a pattern associator is considered in detail in Appendix A3 of Rolls and Treves (1998). It is pointed out there that, for this type of associative network, the number of memories that it can hold simultaneously in storage has to be analysed together with the retrieval quality of each output representation, and then only for a given quality of the representation provided in the input. This is in contrast to autoassociative nets (Section B.3), in which a critical number of stored memories exists (as a function of various parameters of the network), beyond which attempting to store additional memories results in it becoming impossible to retrieve essentially anything. With a pattern associator, instead, one will always retrieve something, but this something will be very small (in information or correlation terms) if too many associations are simultaneously in storage and/or if too little is provided as input.

The conjoint quality–capacity–input analysis can be carried out, for any specific instance of a pattern associator, by using formal mathematical models and established analytical procedures (see e.g. Treves (1995), Rolls and Treves (1998), Treves (1990) and Rolls and Treves (1990)). This, however, has to be done case by case. It is anyway useful to develop some intuition for how a pattern associator operates, by considering what its capacity would be in certain well-defined simplified cases.

Linear associative neuronal networks

These networks are made up of units with a linear activation function, which appears to make them unsuitable to represent real neurons with their positive-only firing rates. However, even purely linear units have been considered as provisionally relevant models of real neurons, by assuming that the latter operate sometimes in the linear regime of their transfer function. (This implies a high level of spontaneous activity, and may be closer to conditions observed early on in sensory systems rather than in areas more specifically involved in memory.) As usual, the connections are trained by a Hebb (or similar) associative learning rule. The capacity of these networks can be defined as the total number of associations that can be learned independently of each other, given that the linear nature of these systems prevents anything more than a linear transform of the inputs.
This implies that if input pattern C can be written as the weighted sum of input patterns A and B, the output to C will be just the same weighted sum of the outputs to A and B. If there are N input axons, then there can be at most N mutually independent input patterns (i.e. none able to be written as a weighted sum of the others), and therefore the capacity of linear networks, defined above, is just N, or equal to the number of inputs to each neuron.

In general, a random set of fewer than N vectors (the CS input pattern vectors) will tend to be mutually independent but not mutually orthogonal (at 90° to each other) (see Appendix A). If they are not orthogonal (the normal situation), then their dot product is not 0, and the output pattern activated by one of the input vectors will be partially activated by other input pattern vectors, in accordance with how similar they are (see equations B.5 and B.6). This amounts to interference, which is therefore the more serious the less orthogonal, on the whole, is the set of input vectors.

Since input patterns are made of elements with positive values, if a simple Hebbian learning rule like the one of equation B.2 is used (in which the input pattern enters directly with no subtraction term), the output resulting from the application of a stored input vector will be the sum of contributions from all other input vectors that have a non-zero dot product with it (see Appendix A), and interference will be disastrous. The only situation in which this would not occur is when different input patterns activate completely different input lines, but this is clearly an uninteresting circumstance for networks operating with distributed representations. A solution to this issue is to use a modified learning rule of the following form:

δw_ij = α y_i (x_j − x̄)    (B.7)

where x̄ is a constant, approximately equal to the average value of x_j. This learning rule includes (in proportion to y_i) increasing the synaptic weight if (x_j − x̄) > 0 (long-term potentiation), and decreasing the synaptic weight if (x_j − x̄) < 0 (heterosynaptic long-term depression). It is useful for x̄ to be roughly the average activity of an input axon x_j across patterns, because then the dot product between the various patterns stored on the weights and the input vector will tend to cancel out with the subtractive term, except for the pattern equal to (or correlated with) the input vector itself. Then up to N input vectors can still be learned by the network, with only minor interference (provided of course that they are mutually independent, as they will in general tend to be).

Table B.1 Effects of pre- and postsynaptic activity on synaptic modification

                          Postsynaptic activation
Presynaptic firing        0                    high
0                         No change            Heterosynaptic LTD
high                      Homosynaptic LTD     LTP

This modified learning rule can also be described in terms of a contingency table (Table B.1) showing the synaptic strength modifications produced by different types of learning rule, where LTP indicates an increase in synaptic strength (called long-term potentiation in neurophysiology), and LTD indicates a decrease in synaptic strength (called long-term depression in neurophysiology). Heterosynaptic long-term depression is so called because it is the decrease in synaptic strength that occurs to a synapse that is other than that through which the postsynaptic cell is being activated. This heterosynaptic long-term depression is the type of change of synaptic strength that is required (in addition to LTP) for effective subtraction of the average presynaptic firing rate, in order, as it were, to make the CS vectors appear more orthogonal to the pattern associator. The rule is sometimes called the Singer–Stent rule, after work by Singer (1987) and Stent (1973), and was discovered in the brain by Levy (Levy 1985, Levy and Desmond 1985) (see also Brown, Kairiss and Keenan (1990b)). Homosynaptic long-term depression is so called because it is the decrease in synaptic strength that occurs to a synapse which is (the same as that which is) active. For it to occur, the postsynaptic neuron must simultaneously be inactive, or have only low activity. (This rule is sometimes called the BCM rule after the paper of Bienenstock, Cooper and Munro (1982); see Rolls and Deco (2002), Chapter 7.)

Associative neuronal networks with non-linear neurons

With non-linear neurons, that is with at least a threshold in the activation function so that the output firing y_i is 0 when the activation h_i is below the threshold, the capacity can be measured in terms of the number of different clusters of output pattern vectors that the network produces. This is because the non-linearities now present (one per output neuron) result in some clustering of the outputs produced by all possible (conditioned stimulus) input patterns x. Input patterns that are similar to a stored input vector can produce, due to the non-linearities, output patterns even closer to the stored output; and vice versa, sufficiently dissimilar inputs can be assigned to different output clusters, thereby increasing their mutual dissimilarity.

As with the linear counterpart, in order to remove the correlation that would otherwise occur between the patterns because the elements can take only positive values, it is useful to use a modified Hebb rule of the form shown in equation B.7. With fully distributed output patterns, the number p of associations that leads to different clusters is of order C, the number of input lines (axons) per output neuron (that is, of order N for a fully connected network), as shown in Appendix A3 of Rolls and Treves (1998). If sparse patterns are used in the output, or alternatively if the learning rule includes a non-linear postsynaptic factor that is effectively equivalent to using sparse output patterns, the coefficient of proportionality between p and C can be much higher than one, that is, many more patterns can be stored than inputs onto each output neuron (see Appendix A3 of Rolls and Treves (1998)). Indeed, the number of different patterns or prototypes p that can be stored can be derived, for example in the case of binary units (Gardner 1988), to be

p ≈ C / [a_o log(1/a_o)]    (B.8)

where a_o is the sparseness of the output firing pattern y produced by the unconditioned stimulus. p can in this situation be much larger than C (see Appendix A3 of Rolls and Treves (1998), Rolls and Treves (1990) and Treves (1990)). This is an important result for encoding in pattern associators, for it means that provided that the activation functions are non-linear (which is the case with real neurons), there is a very great advantage to using sparse encoding, for then many more than C pattern associations can be stored. Sparse representations may well be present in brain regions involved in associative memory for this reason (see Appendix C). The non-linearity inherent in the NMDA receptor-based Hebbian plasticity present in the brain may help to make the stored patterns more sparse than the input patterns, and this may be especially beneficial in increasing the storage capacity of associative networks in the brain by allowing participation in the storage of especially those relatively few neurons with high firing rates in the exponential firing rate distributions typical of neurons in sensory systems (see Appendix C).
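To get a feel for the sizes involved in equation B.8, one can evaluate it for an illustrative number of inputs per neuron; the value of C below is hypothetical, and the logarithm is taken to base 2, as in Table B.2 (the choice of base rescales the numbers but not the conclusion).

import numpy as np

# Equation B.8: p ~ C / [a_o * log(1/a_o)], evaluated for a few sparseness values.
C = 10_000                       # illustrative number of inputs per output neuron
for a_o in (0.5, 0.1, 0.01):
    p = C / (a_o * np.log2(1.0 / a_o))
    print(f"a_o = {a_o:<5} -> p ~ {p:,.0f} associations")
# a_o = 0.5  -> p ~ 20,000
# a_o = 0.1  -> p ~ 30,103
# a_o = 0.01 -> p ~ 150,515
# The sparser the output patterns, the more associations can be stored.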
B.2.7.2 Interference

Interference occurs in linear pattern associators if two vectors are not orthogonal, and is simply dependent on the angle between the originally learned vector and the recall cue or CS vector (see Appendix A), for the activation of the output neuron depends simply on the dot product of the recall vector and the synaptic weight vector (equation B.5). Also in non-linear pattern associators (the interesting case for all practical purposes), interference may occur if two CS patterns are not orthogonal, though the effect can be controlled with sparse encoding of the UCS patterns, effectively by setting high thresholds for the firing of output units. In other words, the CS vectors need not be strictly orthogonal, but if they are too similar, some interference will still be likely to occur.

The fact that interference is a property of neural network pattern associator memories is of interest, for interference is a major property of human memory. Indeed, the fact that interference is a property of human memory and of neural network association memories is entirely consistent with the hypothesis that human memory is stored in associative memories of the type described here, or at least that network associative memories of the type described represent a useful exemplar of the class of parallel distributed storage network used in human memory. It may also be suggested that one reason that interference is tolerated in biological memory is that it is associated with the ability to generalize between stimuli, which is an invaluable feature of biological network associative memories, in that it allows the memory to cope with stimuli that will almost never be identical on different occasions, and in that it allows useful analogies that have survival value to be made.

B.2.7.3 Expansion recoding

If patterns are too similar to be stored in associative memories, then one solution that the brain seems to use repeatedly is to expand the encoding to a form in which the different stimulus patterns are less correlated, that is, more orthogonal, before they are presented as CS stimuli to a pattern associator. The problem can be highlighted by a non-linearly separable mapping (which captures part of the exclusive OR (XOR) problem), in which the mapping that is desired is as shown in Fig. B.10. The neuron has two inputs, A and B.

Fig. B.10 A non-linearly separable mapping.
Input A    Input B    Required Output
1          0          1
0          1          1
1          1          0

This is a mapping of patterns that is impossible for a one-layer network, because the patterns are not linearly separable(37). A solution is to remap the two input lines A and B to three input lines 1–3, that is, to use expansion recoding, as shown in Fig. B.11. This can be performed by a competitive network (see Section B.4). The synaptic weights on the dendrite of the output neuron could then learn the values shown in Fig. B.12 using a simple Hebb rule, equation B.2, and the problem could be solved. The whole network would look like that shown in Fig. B.11.

Competitive networks could help with this type of recoding, and could provide very useful preprocessing for a pattern associator in the brain (Rolls and Treves 1998, Rolls 2008d). It is possible that the lateral nucleus of the amygdala performs this function, for it receives inputs from the temporal cortical visual areas, and may preprocess them before they become the inputs to associative networks at the next stage of amygdala processing (Rolls 2008d, Rolls 2014a). The granule cells of the cerebellum may operate similarly (Chapter 23).

(37) See Appendix A. There is no set of synaptic weights in a one-layer net that could solve the problem shown in Fig. B.10. Two classes of patterns are not linearly separable if no hyperplane can be positioned in their N-dimensional space so as to separate them (see Appendix A). The XOR problem has the additional constraint that A = 0, B = 0 must be mapped to Output = 0.
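A minimal sketch of expansion recoding follows. The remapping of the two input lines A and B onto three lines is hand-coded here, standing in for the competitive network of Fig. B.11, and the output neuron then learns the weights of Fig. B.12 with the Hebb rule of equation B.2.

import numpy as np

def expand(a, b):
    # three lines, one per input conjunction, as in Fig. B.12
    return np.array([int(a == 1 and b == 0),
                     int(a == 0 and b == 1),
                     int(a == 1 and b == 1)])

w = np.zeros(3)
theta = 0.5                              # illustrative firing threshold
for (a, b), target in {(1, 0): 1, (0, 1): 1, (1, 1): 0}.items():
    w += target * expand(a, b)           # Hebb rule, equation B.2, alpha = 1

print(w)                                 # [1. 1. 0.], the weights of Fig. B.12
for (a, b) in [(1, 0), (0, 1), (1, 1)]:
    y = int(w @ expand(a, b) >= theta)
    print((a, b), '->', y)               # (1,0)->1, (0,1)->1, (1,1)->0

In the original two-dimensional space no single threshold unit can produce this mapping (it would require w_A ≥ θ, w_B ≥ θ, and yet w_A + w_B < θ); after the expansion the problem becomes linearly separable and is solved with one-shot Hebbian learning.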

Fig. B.11 Expansion recoding. A competitive network followed by a pattern associator that can enable patterns that are not linearly separable to be learned correctly.

Fig. B.12 Synaptic weights on the dendrite of the output neuron in Fig. B.11.
Input 1 (A=1, B=0): synaptic weight 1
Input 2 (A=0, B=1): synaptic weight 1
Input 3 (A=1, B=1): synaptic weight 0

B.2.8 Implications of different types of coding for storage in pattern associators

Throughout this section, we have made statements about how the properties of pattern associators, such as the number of patterns that can be stored, and whether generalization and graceful degradation occur, depend on the type of encoding of the patterns to be associated. (The types of encoding considered, local, sparse distributed, and fully distributed, are described above.) We draw together these points in Table B.2.

Table B.2 Coding in associative memories*

                                  Local            Sparse distributed             Fully distributed
Generalization, completion,       No               Yes                            Yes
graceful degradation
Number of patterns that           N                of order C/[a_o log(1/a_o)]    of order C
can be stored                     (large)          (can be larger)                (usually smaller than N)
Amount of information in          Minimal          Intermediate                   Large
each pattern (values if binary)   (log(N) bits)    (N a_o log(1/a_o) bits)        (N bits)

* N refers here to the number of output units, and C to the average number of inputs to each output unit. a_o is the sparseness of output patterns, or roughly the proportion of output units activated by a UCS pattern. Note: logs are to the base 2.

The amount of information that can be stored in each pattern in a pattern associator is considered in Appendix A3 of Rolls and Treves (1998). That Appendix has been made available online, and contains a quantitative approach to the capacity of pattern association networks that has not been published elsewhere.

In conclusion, the architecture and properties of pattern association networks make them very appropriate for stimulus–reinforcer association learning. Their high capacity enables them to learn the reinforcement associations for very large numbers of different stimuli.

B.3 Autoassociation or attractor memory

Autoassociative memories, or attractor neural networks, store memories, each one of which is represented by a pattern of neural activity. The memories are stored in the recurrent synaptic connections between the neurons of the network, for example in the recurrent collateral connections between cortical pyramidal cells. Autoassociative networks can then recall the appropriate memory from the network when provided with a fragment of one of the memories. This is called completion. Many different memories can be stored in the network and retrieved correctly.

A feature of this type of memory is that it is content addressable; that is, the information in the memory can be accessed if just the contents of the memory (or a part of the contents of the memory) are used. This is in contrast to a conventional computer, in which the address of what is to be accessed must be supplied, and used to access the contents of the memory. Content addressability is an important simplifying feature of this type of memory, which makes it suitable for use in biological systems. The issue of content addressability will be amplified below.

An autoassociation memory can be used as a short-term memory, in which iterative processing round the recurrent collateral connection loop keeps a representation active by continuing neuronal firing. The short-term memory reflected in continuing neuronal firing for several hundred milliseconds after a visual stimulus is removed, which is present in visual cortical areas such as the inferior temporal visual cortex (see Chapter 25), is probably implemented in this way. This short-term memory is one possible mechanism that contributes to the implementation of the trace memory learning rule which can help to implement invariant object recognition, as described in Chapter 25. Autoassociation memories also appear to be used in a short-term memory role in the prefrontal cortex. In particular, the temporal visual cortical areas have connections to the ventrolateral prefrontal cortex which help to implement the short-term memory for visual stimuli (in, for example, delayed match-to-sample tasks and visual search tasks, as described in Section 4.3.1). In an analogous way the parietal cortex has connections to the dorsolateral prefrontal cortex for the short-term memory of spatial responses (see Section 4.3.1). These short-term memories provide a mechanism that enables attention to be maintained through backprojections from prefrontal cortex areas to the temporal and parietal areas that send connections to the prefrontal cortex, as described in Chapter 6.

Autoassociation networks implemented by the recurrent collateral synapses between cortical pyramidal cells also provide a mechanism for constraint satisfaction and noise reduction, whereby the firing of neighbouring neurons can be taken into account in enabling the network to settle into a state that reflects all the details of the inputs activating the population of connected neurons, as well as the effects of what has been set up during developmental plasticity and later experience. Attractor networks are also effectively implemented by virtue of the forward and backward connections between cortical areas (see Chapter 11 and Section 4.3.1).
An autoassociation network with rapid synaptic plasticity can learn each memory in one trial. Because of its one-shot rapid learning and its ability to complete, this type of network is well suited for episodic memory storage, in which each past episode must be stored and recalled later from a fragment, and kept separate from other episodic memories (see Chapter 24).

Fig. B.13 The architecture of an autoassociative neural network, showing the external input e_i, the recurrent collateral firing y_j applied through the modifiable synapses w_ij, the dendritic activation h_i, and the output firing y_i.

B.3.1 Architecture and operation

The prototypical architecture of an autoassociation memory is shown in Fig. B.13. The external input e_i is applied to each neuron i by unmodifiable synapses. This produces firing y_i of each neuron, or a vector of firing on the output neurons y. Each output neuron i is connected by a recurrent collateral connection to the other neurons in the network, via modifiable connection weights w_ij. This architecture effectively enables the output firing vector y to be associated during learning with itself. Later on, during recall, presentation of part of the external input will force some of the output neurons to fire, but through the recurrent collateral axons and the modified synapses, other neurons in y can be brought into activity. This process can be repeated a number of times, and recall of a complete pattern may be perfect. Effectively, a pattern can be recalled or recognized because of associations formed between its parts. This of course requires distributed representations.

Next we introduce a more precise and detailed description of the above, and describe the properties of these networks. Ways to analyze formally the operation of these networks are introduced in Appendix A4 of Rolls and Treves (1998) and by Amit (1989).

B.3.1.1 Learning

The firing of every output neuron i is forced to a value y_i determined by the external input e_i. Then a Hebb-like associative local learning rule is applied to the recurrent synapses in the network:

δw_ij = α y_i y_j.    (B.9)

It is notable that in a fully connected network, this will result in a symmetric matrix of synaptic weights, that is, the strength of the connection from neuron 1 to neuron 2 will be the same as the strength of the connection from neuron 2 to neuron 1 (both implemented via recurrent collateral synapses).

It is a factor that is sometimes overlooked that there must be a mechanism for ensuring that during learning y_i does approximate e_i, and must not be influenced much by activity in the recurrent collateral connections, otherwise the new external pattern e will not be stored in the network, but instead something will be stored that is influenced by the previously stored memories. It is thought that in some parts of the brain, such as the hippocampus, there are processes that help the external connections to dominate the firing during learning (see Chapter 24, Treves and Rolls (1992b) and Rolls and Treves (1998)).

B.3.1.2 Recall

During recall, the external input e_i is applied, and produces output firing, operating through the non-linear activation function described below. The firing is fed back by the recurrent collateral axons shown in Fig. B.13 to produce activation of each output neuron through the modified synapses on each output neuron. The activation h_i produced by the recurrent collateral effect on the ith neuron is, in the standard way, the sum of the activations produced in proportion to the firing rate of each axon y_j operating through each modified synapse w_ij, that is,

h_i = Σ_j y_j w_ij    (B.10)

where Σ_j indicates that the sum is over the C input axons to each neuron, indexed by j. The output firing y_i is a function of the activation produced by the recurrent collateral effect (internal recall) and by the external input (e_i):

y_i = f(h_i + e_i).    (B.11)

The activation function should be non-linear, and may be for example binary threshold, linear threshold, sigmoid, etc. (see Fig. 1.4). The threshold at which the activation function operates is set in part by the effect of the inhibitory neurons in the network (not shown in Fig. B.13). The connectivity is that the pyramidal cells have collateral axons that excite the inhibitory interneurons, which in turn connect back to the population of pyramidal cells to inhibit them by a mixture of shunting (divisive) and subtractive inhibition using GABA (gamma-aminobutyric acid) terminals, as described in Section B.6. There are many fewer inhibitory neurons than excitatory neurons (of the order of 5–10%, see Table 1.1) and of connections to and from inhibitory neurons (see Table 1.1), and partly for this reason the inhibitory neurons are considered to perform generic functions such as threshold setting, rather than to store patterns by modifying their synapses. Similar inhibitory processes are assumed for the other networks described in this Appendix. The non-linear activation function can minimize interference between the pattern being recalled and other patterns stored in the network, and can also be used to ensure that what is a positive feedback system remains stable. The network can be allowed to repeat this recurrent collateral loop a number of times. Each time the loop operates, the output firing becomes more like the originally stored pattern, and this progressive recall is usually complete within 5–15 iterations.

B.3.2 Introduction to the analysis of the operation of autoassociation networks

With complete connectivity in the synaptic matrix, and the use of a Hebb rule, the matrix of synaptic weights formed during learning is symmetric. The learning algorithm is fast, 'one-shot', in that a single presentation of an input pattern is all that is needed to store that pattern.

During recall, a part of one of the originally learned stimuli can be presented as an external input. The resulting firing is allowed to iterate repeatedly round the recurrent collateral system, gradually on each iteration recalling more and more of the originally learned pattern. Completion thus occurs. If a pattern is presented during recall that is similar but not identical to any of the previously learned patterns, then the network settles into a stable recall state in which the firing corresponds to that of the previously learned pattern. The network can thus generalize in its recall to the most similar previously learned pattern. The activation function of the neurons should be non-linear, since a purely linear system would not produce any categorization of the input patterns it receives, and therefore would not be able to effect anything more than a trivial (i.e. linear) form of completion and generalization.

Recall can be thought of in the following way, relating it to what occurs in pattern associators. The external input e is applied and produces firing y, which is applied as a recall cue on the recurrent collaterals as y^T. (The notation y^T signifies the transpose of y, which is implemented by the application of the firing of the neurons y back via the recurrent collateral axons as the next set of inputs to the neurons.) The activity on the recurrent collaterals is then multiplied by the synaptic weight vector stored during learning on each neuron to produce the new activation h_i, which reflects the similarity between y^T and one of the stored patterns. Partial recall has thus occurred as a result of the recurrent collateral effect. The activations h_i, after thresholding (which helps to remove interference from other memories stored in the network, or noise in the recall cue), result in firing y_i, or a vector of all neurons y, which is already more like one of the stored patterns than, at the first iteration, the firing resulting from the recall cue alone, y = f(e). This process is repeated a number of times to produce progressive recall of one of the stored patterns.

Autoassociation networks operate by effectively storing associations between the elements of a pattern. Each element of the pattern vector to be stored is simply the firing of a neuron. What is stored in an autoassociation memory is a set of pattern vectors. The network operates to recall one of the patterns from a fragment of it. Thus, although this network implements recall or recognition of a pattern, it does so by an association learning mechanism, in which associations between the different parts of each pattern are learned. These memories have sometimes been called autocorrelation memories (Kohonen 1977), because they learn correlations between the activity of neurons in the network, in the sense that each pattern learned is defined by a set of simultaneously active neurons. Effectively each pattern is associated by learning with itself. This learning is implemented by an associative (Hebb-like) learning rule.

The system formally resembles the spin glass systems of magnets analyzed quantitatively in statistical mechanics. This has led to the analysis of (recurrent) autoassociative networks as dynamical systems made up of many interacting elements, in which the interactions are such as to produce a large variety of basins of attraction of the dynamics. Each basin of attraction corresponds to one of the originally learned patterns, and once the network is within a basin it keeps iterating until a recall state is reached that is the learned pattern itself or a pattern closely similar to it. (Interference effects may prevent an exact identity between the recall state and a learned pattern.) This type of system is contrasted with other, simpler, systems of magnets (e.g. ferromagnets), in which the interactions are such as to produce only a limited number of related basins, since the magnets tend to be, for example, all aligned with each other.
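The operation described in this section can be illustrated with a minimal attractor network sketch. Following the formal analyses discussed here, the sketch uses binary units in the ±1 convention (cf. Hopfield 1982) rather than positive-only firing rates, with synchronous updating and hypothetical random patterns; it stores several patterns with a Hebb-like rule and then completes a degraded cue.

import numpy as np

rng = np.random.default_rng(2)

N, P = 200, 10                              # 200 neurons, 10 stored patterns
patterns = rng.choice([-1, 1], size=(P, N))

W = np.zeros((N, N))
for p in patterns:                          # Hebb-like rule, as in equation B.9
    W += np.outer(p, p)
np.fill_diagonal(W, 0)                      # a unit does not connect to itself
W /= N

def recall(cue, steps=15):
    y = cue.copy()
    for _ in range(steps):                  # iterate round the recurrent loop
        y = np.sign(W @ y)                  # threshold non-linear activation
        y[y == 0] = 1
    return y

target = patterns[0]
cue = target.copy()
cue[: N // 2] = rng.choice([-1, 1], size=N // 2)   # corrupt half the cue

print('cue overlap with target:     ', (cue @ target) / N)            # ~0.5
print('recalled overlap with target:', (recall(cue) @ target) / N)    # typically 1.0

With 10 patterns in 200 units the loading is well below the 0.14N capacity limit discussed later in this section, so the degraded cue typically falls into the correct basin of attraction and completion is essentially perfect.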
The states reached within each basin of attraction are called attractor states, and the analogy between autoassociator neural networks and physical systems with multiple attractors was drawn by Hopfield (1982) in a very influential paper. He was able to show that the recall state can be thought of as the local minimum in an energy landscape, where the energy would be defined as

E = −(1/2) Σ_{i,j} w_ij (y_i − <y>)(y_j − <y>).    (B.12)

This equation can be understood in the following way. If two neurons are both firing above their mean rate (denoted by <y>), and are connected by a weight with a positive value, then the firing of these two neurons is consistent with each other, and they mutually support each other, so that they contribute to the system's tendency to remain stable. If across the whole network such mutual support is generally provided, then no further change will take place, and the system will indeed remain stable. If, on the other hand, either of our pair of neurons was not firing, or if the connecting weight had a negative value, the neurons would not support each other, and indeed the tendency would be for the neurons to try to alter ('flip', in the case of binary units) the state of the other. This would be repeated across the whole network until a situation in which most mutual support, and least 'frustration', was reached. What makes it possible to define an energy function, and for these points to hold, is that the matrix is symmetric (see Hopfield (1982), Hertz, Krogh and Palmer (1991), Amit (1989)).

Physicists have generally analyzed a system in which the input pattern is presented and then immediately removed, so that the network then falls without further assistance (in what is referred to as the unclamped condition) towards the minimum of its basin of attraction. A more biologically realistic system is one in which the external input is left on, contributing to the recall, during the fall into the recall state. In this clamped condition, recall is usually faster and more reliable, so that more memories may be usefully recalled from the network. The approach using methods developed in theoretical physics has led to rapid advances in the understanding of autoassociative networks, and its basic elements are described in Appendix A4 of Rolls and Treves (1998), and by Hertz, Krogh and Palmer (1991) and Amit (1989).

B.3.3 Properties

The internal recall in autoassociation networks involves multiplication of the firing vector of neuronal activity by the vector of synaptic weights on each neuron. This inner product vector multiplication allows the similarity of the firing vector to previously stored firing vectors to be provided by the output (as effectively a correlation), if the patterns learned are distributed. As a result of this type of correlation computation performed if the patterns are distributed, many important properties of these networks arise, including pattern completion (because part of a pattern is correlated with the whole pattern), and graceful degradation (because a damaged synaptic weight vector is still correlated with the original synaptic weight vector). Some of these properties are described next.

B.3.3.1 Completion

Perhaps the most important and useful property of these memories is that they complete an incomplete input vector, allowing recall of a whole memory from a small fraction of it. The memory recalled in response to a fragment is that stored in the memory that is closest in pattern similarity (as measured by the dot product, or correlation). Because the recall is iterative and progressive, the recall can be perfect. This property and the associative property of pattern associator neural networks are very similar to the properties of human memory. This property may be used when we recall a part of a recent memory of a past episode from a part of that episode. The way in which this could be implemented in the hippocampus is described in Chapter 24.

B.3.3.2 Generalization

The network generalizes, in that an input vector similar to one of the stored vectors will lead to recall of the originally stored vector, provided that distributed encoding is used. The principle by which this occurs is similar to that described for a pattern associator.

B.3.3.3 Graceful degradation or fault tolerance

If the synaptic weight vector w_i on each neuron (or the weight matrix) has synapses missing (e.g. during development), or loses synapses (e.g. with brain damage or ageing), then the activation h_i (or vector of activations h) is still reasonable, because h_i is the dot product (correlation) of y^T with w_i. The same argument applies if whole input axons are lost. If an output neuron is lost, then the network cannot itself compensate for this, but the next network in the brain is likely to be able to generalize or complete if its input vector has some elements missing, as would be the case if some output neurons of the autoassociation network were damaged.

B.3.3.4 Prototype extraction, extraction of central tendency, and noise reduction

These arise when a set of similar input pattern vectors {e} (which induce firing of the output neurons {y}) are learned by the network. The weight vectors w_i (or strictly w_i^T) become (or point towards) the average <y> of that set of similar vectors. This produces extraction of the prototype or extraction of the central tendency, and noise reduction. This process can result in better recognition or recall of the prototype than of any of the exemplars, even though the prototype may never itself have been presented. The general principle by which the effect occurs is similar to that by which it occurs in pattern associators. It of course only occurs if each pattern uses a distributed representation.

Related to outputs of the visual system to long-term memory systems (see Chapter 24), there has been intense debate about whether, when human memories are stored, a prototype of what is to be remembered is stored, or whether all the instances or exemplars are each stored separately so that they can be individually recalled (McClelland and Rumelhart (1986), Chapter 17, p. 172). Evidence favouring the prototype view is that if a number of different examples of an object are shown, then humans may report more confidently that they have seen the prototype before than any of the different exemplars, even though the prototype has never been shown (Posner and Keele 1968, Rosch 1975). Evidence favouring the view that exemplars are stored is that in categorization and perceptual identification tasks the responses made are often sensitive to the congruity between particular training stimuli and particular test stimuli (Brooks 1978, Medin and Schaffer 1978, Jacoby 1983a, Jacoby 1983b, Whittlesea 1983). It is of great interest that both types of phenomena can arise naturally out of distributed information storage in a neuronal network such as an autoassociator. This can be illustrated by the storage in an autoassociation memory of sets of stimuli that are all somewhat different examples of the same pattern. These can be generated, for example, by randomly altering each of the input vectors from the input stimulus. After many such randomly altered exemplars have been learned by the network, recall can be tested, and it is found that the network responds best to the original (prototype) input vector, with which it has never been presented. The reason for this is that the autocorrelation components that build up in the synaptic matrix with repeated presentations of the exemplars represent the average correlation between the different elements of the vector, and this is highest for the prototype. This effect also gives the storage some noise immunity, in that variations in the input that are random noise average out, while the signal that is constant builds up with repeated learning.

B.3.3.5 Speed

The recall operation is fast on each neuron on a single iteration, because the pattern y^T on the axons can be applied simultaneously to the synapses w_i, and the activation h_i can be accumulated in one or two time constants of the dendrite (of the order of 20 ms).
If a simple implementation of an autoassociation net such as that described by Hopfield (1982) is simulated on a computer, then 5–15 iterations are typically necessary for completion of an incomplete input cue e. This might be taken to correspond to times of the order of hundreds of milliseconds in the brain, rather too slow for any one local network in the brain. However, it has been shown that if the neurons are treated not as McCulloch–Pitts neurons, which are simply updated at each iteration or cycle of timesteps (and assume the active state if the threshold is exceeded), but instead are analyzed and modelled as integrate-and-fire neurons in real continuous time, then the network can effectively relax into its recall state very rapidly, in one or two time constants of the synapses (see Section B.6 and Treves (1993), Battaglia and Treves (1998a) and Appendix A5 of Rolls and Treves (1998)). This corresponds to perhaps 20 ms in the brain.

One factor in this rapid dynamics of autoassociative networks with brain-like integrate-and-fire membrane and synaptic properties is that with some spontaneous activity, some of the neurons in the network are close to threshold already before the recall cue is applied, and hence some of the neurons are very quickly pushed by the recall cue into firing, so that information starts to be exchanged very rapidly (within 1–2 ms of brain time) through the modified synapses by the neurons in the network. The progressive exchange of information starting early on within what would otherwise be thought of as an iteration period (of perhaps 20 ms, corresponding to a neuronal firing rate of 50 spikes/s) is the mechanism accounting for rapid recall in an autoassociative neuronal network made biologically realistic in this way. Further analysis of the fast dynamics of these networks, if they are implemented in a biologically plausible way with integrate-and-fire neurons, is provided in Section B.6, in Appendix A5 of Rolls and Treves (1998), and by Treves (1993). The general approach applies to other networks with recurrent connections, not just autoassociators, and the fact that such networks can operate much faster than it would seem from simple models that follow discrete time dynamics is probably a major factor in enabling these networks to provide some of the building blocks of brain function.

Learning is fast, 'one-shot', in that a single presentation of an input pattern e (producing y) enables the association between the activation of the dendrites (the postsynaptic term h_i) and the firing of the recurrent collateral axons y^T to be learned. Repeated presentation with small variations of a pattern vector is used to obtain the properties of prototype extraction, extraction of central tendency, and noise reduction, because these arise from the averaging process produced by storing very similar patterns in the network.

B.3.3.6 Local learning rule

The simplest learning rule used in autoassociation neural networks, a version of the Hebb rule, is (as in equation B.9)

δw_ij = α y_i y_j.

The rule is a local learning rule in that the information required to specify the change in synaptic weight is available locally at the synapse, as it is dependent only on the presynaptic firing rate y_j available at the synaptic terminal, and the postsynaptic activation or firing y_i available on the dendrite of the neuron receiving the synapse. This makes the learning rule biologically plausible, in that the information about how to change the synaptic weight does not have to be carried to every synapse from a distant source where it is computed.

As with pattern associators, since firing rates are positive quantities, a potentially interfering correlation is induced between different pattern vectors. This can be removed by subtracting the mean of the presynaptic activity from each presynaptic term, using a type of long-term depression. This can be specified as

δw_ij = α y_i (y_j − z)    (B.13)

where α is a learning rate constant. This learning rule includes (in proportion to y_i) increasing the synaptic weight if (y_j − z) > 0 (long-term potentiation), and decreasing the synaptic weight if (y_j − z) < 0 (heterosynaptic long-term depression).
This procedure works optimally if z is the average activity ⟨y_j⟩ of an axon across patterns. Evidence that a learning rule with the general form of equation B.9 is implemented in at least some parts of the brain comes from studies of long-term potentiation, described in Section 1.5. One of the important potential functions of heterosynaptic long-term depression

is its ability to allow in effect the average of the presynaptic activity to be subtracted from the presynaptic firing rate (see Appendix A3 of Rolls and Treves (1998), and Rolls and Treves (1990)). Autoassociation networks can be trained with the error-correction or delta learning rule described in Section B.10. Although a delta rule is less biologically plausible than a Hebb-like rule, a delta rule can help to store separately patterns that are very similar (see McClelland and Rumelhart (1988), Hertz, Krogh and Palmer (1991)).

B Capacity

One measure of storage capacity is to consider how many orthogonal patterns could be stored, as with pattern associators. If the patterns are orthogonal, there will be no interference between them, and the maximum number p of patterns that can be stored will be the same as the number N of output neurons in a fully connected network. Although in practice the patterns that have to be stored will hardly be orthogonal, this is not a purely academic speculation, since it was shown how one can construct a synaptic matrix that effectively orthogonalizes any set of (linearly independent) patterns (Kohonen 1977, Kohonen 1989, Personnaz, Guyon and Dreyfus 1985, Kanter and Sompolinsky 1987). However, this matrix cannot be learned with a local, one-shot learning rule, and therefore its interest for autoassociators in the brain is limited. The more general case of random non-orthogonal patterns, and of Hebbian learning rules, is considered next. It is, however, important to reduce the correlations between patterns to be stored in an autoassociation network, so as not to limit the capacity (Marr 1971, Kohonen 1977, Kohonen 1989, Kohonen et al. 1981, Sompolinsky 1987, Rolls and Treves 1998), and in the brain mechanisms to perform pattern separation are frequently present (Rolls 2016f), including granule cells, as shown in many places in this book.

With non-linear neurons used in the network, the capacity can be measured in terms of the number of input patterns y (produced by the external input e, see Fig. B.13) that can be stored in the network and recalled later whenever the network settles within each stored pattern's basin of attraction. The first quantitative analysis of storage capacity (Amit, Gutfreund and Sompolinsky 1987) considered a fully connected Hopfield (1982) autoassociator model, in which units are binary elements with an equal probability of being on or off in each pattern, and the number C of inputs per unit is the same as the number N of output units. (Actually it is equal to N − 1, since a unit is taken not to connect to itself.) Learning is taken to occur by clamping the desired patterns on the network and using a modified Hebb rule, in which the mean of the presynaptic and postsynaptic firings is subtracted from the firing on any one learning trial (this amounts to a covariance learning rule, and is described more fully in Appendix A4 of Rolls and Treves (1998)). With such fully distributed random patterns, the number of patterns that can be learned is (for C large) p ≈ 0.14C = 0.14N, hence well below what could be achieved with orthogonal patterns or with an orthogonalizing synaptic matrix. Many variations of this standard autoassociator model have been analyzed subsequently. Treves and Rolls (1991) have extended this analysis to autoassociation networks that are much more biologically relevant in the following ways.
First, some or many connections between the recurrent collaterals and the dendrites are missing (this is referred to as diluted connectivity, and results in a non-symmetric synaptic connection matrix in which w_{ij} does not equal w_{ji}, one of the original assumptions made in order to introduce the energy formalism in the Hopfield model). Second, the neurons need not be restricted to binary threshold neurons, but can have a threshold-linear activation function (see Fig. 1.4). This enables the neurons to assume real continuously variable firing rates, which are what is found in the brain (Rolls and Tovee 1995b, Treves, Panzeri, Rolls, Booth and Wakeman 1999). Third, the representation need not be fully distributed (with half the neurons on, and half off), but instead can have a small proportion of the neurons firing above the spontaneous rate, which is what is found in

parts of the brain such as the hippocampus that are involved in memory (see Treves and Rolls (1994), and Chapter 6 of Rolls and Treves (1998)). Such a representation is defined as being sparse, and the sparseness a of the representation can be measured, by extending the binary notion of the proportion of neurons that are firing, as

a = (Σ_{i=1..N} y_i / N)² / (Σ_{i=1..N} y_i² / N)   (B.14)

where y_i is the firing rate of the ith neuron in the set of N neurons. Treves and Rolls (1991) have shown that such a network does operate efficiently as an autoassociative network, and can store (and recall correctly) a number of different patterns p as follows:

p ≈ (C^RC / (a ln(1/a))) · k   (B.15)

where C^RC is the number of synapses on the dendrites of each neuron devoted to the recurrent collaterals from other neurons in the network, and k is a factor that depends weakly on the detailed structure of the rate distribution, on the connectivity pattern, etc., but is roughly in the order of 0.2–0.3. The main factors that determine the maximum number of memories that can be stored in an autoassociative network are thus the number of connections on each neuron devoted to the recurrent collaterals, and the sparseness of the representation. For example, for C^RC = 12,000 and a = 0.02, p is calculated to be approximately 36,000. This storage capacity can be realized, with little interference between patterns, if the learning rule includes some form of heterosynaptic long-term depression that counterbalances the effects of associative long-term potentiation (Treves and Rolls (1991); see Appendix A4 of Rolls and Treves (1998)).

It should be noted that the number of neurons N (which is greater than C^RC, the number of recurrent collateral inputs received by any neuron in the network from the other neurons in the network) is not a parameter that influences the number of different memories that can be stored in the network. The implication of this is that increasing the number of neurons (without increasing the number of connections per neuron) does not increase the number of different patterns that can be stored (see Rolls and Treves (1998) Appendix A4), although it may enable simpler encoding of the firing patterns, for example more orthogonal encoding, to be used. This latter point may account in part for why there are generally in the brain more neurons in a recurrent network than there are connections per neuron (see e.g. Chapter 24). The non-linearity inherent in the NMDA receptor-based Hebbian plasticity present in the brain may help to make the stored patterns more sparse than the input patterns, and this may be especially beneficial in increasing the storage capacity of associative networks in the brain by allowing participation in the storage of especially those relatively few neurons with high firing rates in the exponential firing rate distributions typical of neurons in sensory systems (see Sections B and C.3.1).
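As a concrete check on these formulas, the following minimal sketch (my illustration, not from the text; the value k = 0.23 and the random binary pattern are assumptions chosen to match the worked example) evaluates equations B.14 and B.15:

```python
import numpy as np

def sparseness(y):
    """Sparseness a of a firing-rate vector y (equation B.14)."""
    return (y.mean() ** 2) / np.mean(y ** 2)

def capacity(c_rc, a, k=0.23):
    """Approximate number of storable patterns p (equation B.15)."""
    return k * c_rc / (a * np.log(1.0 / a))

# Worked example from the text: C_RC = 12,000 recurrent collateral synapses
# per neuron and sparseness a = 0.02 give p of roughly 36,000.
print(capacity(12000, 0.02))        # ~36,000 with k ~ 0.23

# For a binary pattern with a fraction a of neurons active, equation B.14 gives a.
rng = np.random.default_rng(0)
y = (rng.random(10000) < 0.02).astype(float)
print(sparseness(y))                # ~0.02
```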

B Context

The environmental context in which learning occurs can be a very important factor that affects retrieval in humans and other animals. Placing the subject back into the same context in which the original learning occurred can greatly facilitate retrieval. Context effects arise naturally in association networks if some of the activity in the network reflects the context in which the learning occurs. Retrieval is then better when that context is present, for the activity contributed by the context becomes part of the retrieval cue for the memory, increasing the correlation of the current state with what was stored. (A strategy for retrieval arises simply from this property. The strategy is to keep trying to recall as many fragments of the original memory situation, including the context, as possible, as this will provide a better cue for complete retrieval of the memory than just a single fragment.) The effects that mood has on memory, including visual memory retrieval, may be accounted for by backprojections from brain regions such as the amygdala and orbitofrontal cortex, in which the current mood, providing a context, is represented, to brain regions involved in memory such as the perirhinal cortex, and in visual representations such as the inferior temporal visual cortex (see Rolls and Stringer (2001b)). The very well known effects of context in the human memory literature could arise in the simple way just described. An implication of the explanation is that context effects will be especially important at late stages of memory or information processing systems in the brain, for there information from a wide range of modalities will be mixed, and some of that information could reflect the context in which the learning takes place. One part of the brain where such effects may be strong is the hippocampus, which is implicated in the memory of recent episodes, and which receives inputs derived from most of the cortical information processing streams, including those involved in space (see Chapter 24).

B Mixture states

If an autoassociation memory is trained on pattern vectors A, B, and A+B (i.e. A and B are both included in the joint vector A+B; that is, if the vectors are not linearly independent), then the autoassociation memory will have difficulty in learning and recalling these three memories as separate, because completion from either A or B to A+B tends to occur during recall. (The ability to separate such patterns is referred to as configurational learning in the animal learning literature, see e.g. Sutherland and Rudy (1991).) This problem can be minimized by re-representing A, B, and A+B in such a way that they are different vectors before they are presented to the autoassociation memory. This can be performed by recoding the input vectors to minimize overlap using, for example, a competitive network, and possibly involving expansion recoding, as described for pattern associators (see Section B.2, Fig. B.11). It is suggested that this is a function of the dentate granule cells in the hippocampus, which precede the CA3 recurrent collateral network (Treves and Rolls 1992b, Treves and Rolls 1994) (see Chapter 24).

B Memory for sequences

One of the first extensions of the standard autoassociator paradigm that has been explored in the literature is the capability to store and retrieve not just individual patterns, but whole sequences of patterns. Hopfield, in the same 1982 paper, suggested that this could be achieved by adding to the standard connection weights, which associate a pattern with itself, a new, asymmetric component, which associates a pattern with the next one in the sequence. In practice this scheme does not work very well, unless the new component is made to operate on a slower time scale than the purely autoassociative component (Kleinfeld 1986, Sompolinsky and Kanter 1986). A minimal sketch of this two-time-scale scheme is given below.
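The sketch is my illustration, not from the text: a symmetric autoassociative component holds each pattern, while an asymmetric component, driven by a slowly decaying activity trace, pushes the network into the next pattern once the trace has caught up. The pattern statistics, the trace decay lam, and the gain g > 1 on the asymmetric component are assumptions chosen to make the toy model step through its sequence.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 200, 5                                        # neurons, sequence length
X = np.where(rng.random((P, N)) < 0.5, 1.0, -1.0)    # +/-1 patterns forming a sequence

W_sym = sum(np.outer(x, x) for x in X) / N                          # pattern -> itself
W_asym = sum(np.outer(X[(i + 1) % P], X[i]) for i in range(P)) / N  # pattern -> next

y = X[0].copy()                                  # start in the first pattern
trace = y.copy()                                 # slow trace of recent activity
lam, g = 0.8, 1.5                                # trace decay; asymmetric gain
for t in range(60):
    field = W_sym @ y + g * (W_asym @ trace)     # fast hold plus delayed push
    y = np.where(field >= 0, 1.0, -1.0)
    trace = lam * trace + (1 - lam) * y          # the slower time scale
    print(t, int(np.argmax(X @ y / N)))          # dwells roughly 8 steps per pattern
```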
With two different time scales, the autoassociative component can stabilize a pattern for a while, before the heteroassociative component moves the network, as it were, into the next pattern. The heteroassociative retrieval cue for the next pattern in the sequence is just the previous pattern in the sequence. A particular type of slower operation occurs if the asymmetric component acts after a delay τ. In this case, the network sweeps through the sequence, staying for a time of order τ in each pattern. One can see how the necessary ingredient for the storage of sequences is only a minor departure from purely Hebbian learning: in fact, the (symmetric) autoassociative component of

the weights can be taken to reflect the Hebbian learning of strictly simultaneous conjunctions of pre- and post-synaptic activity, whereas the (asymmetric) heteroassociative component can be implemented by Hebbian learning of each conjunction of postsynaptic activity with presynaptic activity shifted a time τ in the past. Both components can then be seen as resulting from a generalized Hebbian rule, which increases the weight whenever postsynaptic activity is paired with presynaptic activity occurring within a given time range, which may extend from a few hundred milliseconds in the past up to and including strictly simultaneous activity. This is similar to a trace rule (see Chapter 25), which itself matches very well the observed conditions for induction of long-term potentiation, and appears entirely plausible. The learning rule necessary for learning sequences, though, is more complex than a simple trace rule, in that the time-shifted conjunctions of activity that are encoded in the weights must in retrieval produce activations that are time-shifted as well (otherwise one falls back into the Hopfield (1982) proposal, which does not quite work). The synaptic weights should therefore keep separate traces of what was simultaneous and what was time-shifted during the original experience, and this is not very plausible. Levy and colleagues (Levy, Wu and Baxter 1995, Wu, Baxter and Levy 1996) have investigated these issues further, and the temporal asymmetry that may be present in LTP (see Section 1.5) has been suggested as a mechanism that might provide some of the temporal properties that are necessary for the brain to store and recall sequences (Minai and Levy 1993, Abbott and Blum 1996, Markram, Pikus, Gupta and Tsodyks 1998, Abbott and Nelson 2000). A problem with this suggestion is that, given that the temporal dynamics of attractor networks are inherently very fast when the networks have continuous dynamics (see Section B.6), and that the temporal asymmetry in LTP may be in the order of only milliseconds to a few tens of milliseconds (see Section 1.5), the recall of the sequences would be very fast, perhaps a few tens of ms per step of the sequence, with every step of a 10-step sequence effectively retrieved and gone in a quick-fire session of a few hundred ms. Another way in which a delay could be inserted in a recurrent collateral path in the brain is by inserting another cortical area in the recurrent path. This could fit in with the corticocortical backprojection connections described in Chapter 11, which would introduce some conduction delay (see Panzeri, Rolls, Battaglia and Lavis (2001)).

B.3.4 Use of autoassociation networks in the brain

Because of its one-shot rapid learning, and ability to complete, this type of network is well suited for episodic memory storage, in which each episode must be stored and recalled later from a fragment, and kept separate from other episodic memories. It does not take a long time (the many epochs of backpropagation networks) to train this network, because it does not have to discover the structure of a problem. Instead, it stores information in the form in which it is presented to the memory, without altering the representation. An autoassociation network may be used for this function in the CA3 region of the hippocampus (see Chapter 24, and Rolls and Treves (1998) Chapter 6).
An autoassociation memory can also be used as a short-term memory, in which iterative processing round the recurrent collateral loop keeps a representation active until another input cue is received. This may be used to implement many types of short-term memory in the brain (see Section 4.3.1). For example, it may be used in the perirhinal cortex and adjacent temporal lobe cortex to implement short-term visual object memory (Miyashita and Chang 1988, Amit 1995, Hirabayashi, Takeuchi, Tamura and Miyashita 2013); in the dorsolateral prefrontal cortex to implement a short-term memory for spatial responses (Goldman-Rakic 1996); and in the prefrontal cortex to implement a short-term memory for where eye movements should be made in space (see Section and Rolls (2008d)). Such an autoassociation memory in

the temporal lobe visual cortical areas may be used to implement the firing that continues for often 300 ms after a very brief (16 ms) presentation of a visual stimulus (Rolls and Tovee 1994) (see e.g. Fig. C.17), and may be one way in which a short memory trace is implemented to facilitate invariant learning about visual stimuli (see Chapter 25). In all these cases, the short-term memory may be implemented by the recurrent excitatory collaterals that connect nearby pyramidal cells in the cerebral cortex. The connectivity in this system, that is the probability that a neuron synapses on a nearby neuron, may be in the region of 10% (Braitenberg and Schütz 1991, Abeles 1991) (Chapter 7). The recurrent connections between nearby neocortical pyramidal cells may also be important in defining the response properties of cortical cells, which may be triggered by external inputs (from, for example, the thalamus or a preceding cortical area), but may be considerably dependent on the synaptic connections received from nearby cortical pyramidal cells.

The cortico-cortical backprojection connectivity described in Chapters 1, 11, and 24 can be interpreted as a system that allows the forward-projecting neurons in one cortical area to be linked autoassociatively with the backprojecting neurons in the next cortical area (see Chapters 11 and 24). This would be implemented by associative synaptic modification in, for example, the backprojections. This particular architecture may be especially important in constraint satisfaction (as well as recall), that is, it may allow the networks in the two cortical areas to settle into a mutually consistent state. This would effectively enable information in higher cortical areas, which would include information from more divergent sources, to influence the response properties of neurons in earlier cortical processing stages. This interaction of associative networks is an important aspect of cortical information processing.

B.4 Competitive networks, including self-organizing maps

B.4.1 Function

Competitive neural networks learn to categorize input pattern vectors. Each category of inputs activates a different output neuron (or set of output neurons, see below). The categories formed are based on similarities between the input vectors. Similar, that is correlated, input vectors activate the same output neuron. In that the learning is based on similarities in the input space, and there is no external teacher that forces classification, this is an unsupervised network. The term categorization is used to refer to the process of placing vectors into categories based on their similarity. The term classification is used to refer to the process of placing outputs in particular classes as instructed or taught by a teacher. Examples of classifiers are pattern associators, one-layer delta-rule perceptrons, and multilayer perceptrons taught by error backpropagation (see Sections B.2, B.3, B.10 and B.11). In supervised networks there is usually a teacher for each output neuron.

The categorization produced by competitive nets is of great potential importance in perceptual systems, including the whole of the visual cortical processing hierarchies, as described in Chapter 25. Each category formed reflects a set or cluster of active inputs x_j that occur together.
This cluster of coactive inputs can be thought of as a feature, and the competitive network can be described as building feature analyzers, where a feature can now be defined as a correlated set of inputs. During learning, a competitive network gradually discovers these features in the input space, and the process of finding these features without a teacher is referred to as self-organization. Another important use of competitive networks is to remove redundancy from the input space, by allocating output neurons to reflect a set of inputs that co-occur.

Fig. B.14 The architecture of a competitive network. (The figure labels the input stimuli x_j, the synaptic weights w_ij, the dendritic activations h_i, and the output firings y_i.)

Another important aspect of competitive networks is that they separate patterns that are somewhat correlated in the input space, to produce outputs for the different patterns that are less correlated with each other, and may indeed easily be made orthogonal to each other. This has been referred to as orthogonalization. Another important function of competitive networks is that, partly by removing redundancy from the input information space, they can produce sparse output vectors, without losing information. We may refer to this as sparsification.

B.4.2 Architecture and algorithm

B Architecture

The basic architecture of a competitive network is shown in Fig. B.14. It is a one-layer network with a set of inputs that make modifiable excitatory synapses w_{ij} with the output neurons. The output cells compete with each other (for example by mutual inhibition) in such a way that the most strongly activated neuron or neurons win the competition, and are left firing strongly. The synaptic weights, w_{ij}, are initialized to random values before learning starts. If some of the synapses are missing, that is if there is randomly diluted connectivity, that is not a problem for such networks, and can even help them (see below).

In the brain, the inputs arrive through axons, which make synapses with the dendrites of the output or principal cells of the network. The principal cells are typically pyramidal cells in the cerebral cortex. In the brain, the principal cells are typically excitatory, and mutual inhibition between them is implemented by inhibitory interneurons, which receive excitatory inputs from the principal cells. The inhibitory interneurons then send their axons to make synapses with the pyramidal cells, typically using GABA (gamma-aminobutyric acid) as the inhibitory transmitter.

B Algorithm

1. Apply an input vector x and calculate the activation h_i of each neuron,

h_i = Σ_j x_j w_{ij}   (B.16)

where the sum is over the C input axons, indexed by j. (It is useful to normalize the length of each input vector x. In the brain, a scaling effect is likely to be achieved both by feedforward inhibition, and by feedback inhibition among the set of input cells (in a preceding network) that give rise to the axons conveying x.)

The output firing y_i^1 is a function of the activation of the neuron:

y_i^1 = f(h_i).   (B.17)

The function f can be linear, sigmoid, monotonically increasing, etc. (see Fig. 1.4).

2. Allow competitive interaction between the output neurons by a mechanism such as lateral or mutual inhibition (possibly with self-excitation), to produce a contrast-enhanced version of the firing rate vector:

y_i = g(y_i^1).   (B.18)

Function g is typically a non-linear operation, and in its most extreme form may be a winner-take-all function, in which after the competition one neuron may be on, and the others off. Algorithms that produce softer competition without a single winner, to produce a distributed representation, are described in Section B below.

3. Apply an associative Hebb-like learning rule:

δw_{ij} = α y_i x_j.   (B.19)

4. Normalize the length of the synaptic weight vector on each dendrite to prevent the same few neurons always winning the competition:

Σ_j (w_{ij})² = 1.   (B.20)

(A less efficient alternative is to scale the sum of the weights to a constant, e.g. 1.0.)

5. Repeat steps 1–4 for each different input stimulus x, in random sequence, a number of times. (A minimal simulation of steps 1–5 is sketched below, following the discussion of Fig. B.15.)

B.4.3 Properties

B Feature discovery by self-organization

Each neuron in a competitive network becomes activated by a set of consistently coactive, that is correlated, input axons, and gradually learns to respond to that cluster of coactive inputs. We can think of competitive networks as discovering features in the input space, where features can now be defined by a set of consistently coactive inputs. Competitive networks thus show how feature analyzers can be built, with no external teacher. The feature analyzers respond to correlations in the input space, and the learning occurs by self-organization in the competitive network. Competitive networks are therefore well suited to the analysis of sensory inputs. Ways in which they may form fundamental building blocks of sensory systems are described in Chapter 25.

The operation of competitive networks can be visualized with the help of Fig. B.15. The input patterns are represented as dots on the surface of a sphere. (The patterns are on the surface of a sphere because the patterns are normalized to the same length.) The directions of the weight vectors of the three neurons are represented by the ×s. The effect of learning is to move the weight vector of each of the neurons to point towards the centre of one of the clusters of inputs. If the neurons are winner-take-all, the result of the learning is that although there are correlations between the input stimuli, the outputs of the three neurons are orthogonal. In this sense, orthogonalization is performed. At the same time, given that each of the patterns within a cluster produces the same output, the correlations between the patterns within a cluster become higher. In a winner-take-all network, the within-cluster correlation becomes 1, and the patterns within a cluster have been placed within the same category.
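The following is a minimal winner-take-all sketch of steps 1–5 (my illustration, not from the text): the learning rate, the network size, and the training set, which loosely mirrors the simulation described in the caption of Fig. B.16, are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 64, 8
W = rng.random((n_out, n_in))
W /= np.linalg.norm(W, axis=1, keepdims=True)       # unit-length weight vectors (B.20)

def train(patterns, alpha=0.1, epochs=16):
    for _ in range(epochs):
        for x in rng.permutation(patterns):         # random presentation order (step 5)
            x = x / np.linalg.norm(x)               # normalize the input vector (step 1)
            winner = np.argmax(W @ x)               # activation + winner-take-all (B.16, B.18)
            W[winner] += alpha * x                  # Hebbian update for the winner (B.19)
            W[winner] /= np.linalg.norm(W[winner])  # renormalize the weight vector (B.20)

# 8 random binary prototypes, each giving 8 noisy exemplars (10% of bits flipped)
protos = (rng.random((8, n_in)) < 0.5).astype(float)
flips = rng.random((64, n_in)) < 0.1
exemplars = np.abs(np.repeat(protos, 8, axis=0) - flips)
train(exemplars)
print(np.argmax(exemplars @ W.T, axis=1))           # exemplars of a cluster tend to share a winner
```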

Fig. B.15 Competitive learning. The dots represent the directions of the input vectors, and the ×s the weights for each of three output neurons. (a) Before learning. (b) After learning. (After Rumelhart and Zipser 1986.)

B Removal of redundancy

In that competitive networks recode sets of correlated inputs to one or a few output neurons, redundancy in the input representation is removed. Identifying and removing redundancy in sensory inputs is an important part of processing in sensory systems (cf. Barlow (1989)), in part because a compressed representation is more manageable as an output of sensory systems. The reason for this is that neurons in the receiving systems, for example pattern associators in the orbitofrontal cortex or autoassociation networks in the hippocampus, can then operate with the limited numbers of inputs that each neuron can receive. For example, although the information that a particular face is being viewed is present in the 10^6 fibres in the optic nerve, the information is unusable by associative networks in this form, and is compressed through the visual system until the information about which of many hundreds of faces is present can be represented by fewer than 100 neurons in the temporal cortical visual areas (Rolls, Treves and Tovee 1997b, Abbott, Rolls and Tovee 1996). (Redundancy can be defined as the difference between the maximum information content of the input data stream (or channel capacity) and its actual content; see Appendix C.) The recoding of input pattern vectors into a more compressed representation that can be conveyed by a much reduced number of output neurons of a competitive network is referred to in engineering as vector quantization. With a winner-take-all competitive network, each output neuron points to, or stands for, one of or a cluster of the input vectors, and it is more efficient to transmit the states of the few output neurons than the states of all the input elements. (It is more efficient in the sense that the information transmission rate required, that is the capacity of the channel, can be much smaller.) Vector quantization is of course possible when the input representation contains redundancy.

B Orthogonalization and categorization

Figure B.15 shows visually how competitive networks reduce the correlation between different clusters of patterns, by allocating them to different output neurons. This is described as orthogonalization. It is a process that is very usefully applied to signals before they are used as inputs to associative networks (pattern associators and autoassociators) trained with Hebbian rules (see Sections B.2 and B.3), because it reduces the interference between patterns stored in these memories. The opposite effect in competitive networks, of bringing closer together very similar input patterns, is referred to as categorization. These two processes are also illustrated in Fig. B.16, which shows that in a competitive network, very similar input patterns (with correlations higher in this case than approximately 0.8)

produce more similar outputs (close to 1.0), whereas the correlations between pairs of input patterns that are smaller than approximately 0.7 become much smaller in the output representation. (This simulation used soft competition between neurons with graded firing rates.) Further analyses of the operation of competitive networks, and of how diluted connectivity can help the operation of competitive networks, are provided in Section 7.4 (Rolls 2016f).

Fig. B.16 Orthogonalization and categorization in a competitive network: (a) before learning; (b) after learning. The correlations between pairs of output vectors (ordinate) are plotted against the correlations of the corresponding pairs of input vectors that generated the output pair (abscissa), for all possible pairs in the input set. The competitive net learned for 16 cycles. One cycle consisted of presenting the complete input set of stimuli in a renewing random sequence. The correlation measure shown is the cosine of the angle between two vectors (i.e. the normalized dot product). The network used had 64 input axons to each of 8 output neurons. The net was trained with 64 stimuli, made from 8 initial random binary vectors with each bit having a probability of 0.5 of being 1, from each of which 8 noisy exemplars were made by randomly altering 10% of the 64 elements. Soft competition was used between the output neurons. (A normalized exponential activation function described in Section B was used to implement the soft competition.) The sparseness a of the input patterns thus averaged 0.5; and the sparseness a of the output firing vector after learning was close to 0.17 (i.e. after learning, primarily one neuron was active for each input pattern; before learning, the average sparseness of the output patterns produced by each of the inputs was 0.39).

B Sparsification

Competitive networks can produce more sparse representations than those that they receive, depending on the degree of competition. With the greatest competition, winner-take-all, only one output neuron remains active, and the representation is at its most sparse. This effect can be understood further using Figs. B.15 and B.16. This sparsification is useful to apply to representations before input patterns are applied to associative networks, because sparse representations allow many different pattern associations or memories to be stored in these networks (see Sections B.2 and B.3).

B Capacity

In a competitive net with N output neurons and a simple winner-take-all rule for the competition, it is possible to learn up to N output categories, in that each output neuron may be allocated a category. When the competition acts in a less rudimentary way, the number of categories that can be learned becomes a complex function of various factors, including the number of modifiable connections per cell and the degree of dilution, or incompleteness, of the connections. Such a function has not yet been described analytically in general, but an

upper bound on it can be deduced for the particular case in which the learning is fast, and can be achieved effectively in one shot, or one presentation of each pattern. In that case, the number of categories that can be learned (by the self-organizing process) will at most be equal to the number of associations that can be formed by the corresponding pattern associators, a process that occurs with the additional help of the driving inputs, which effectively determine the categorization in the pattern associator.

Separate constraints on the capacity result if the output vectors are required to be strictly orthogonal. Then, if the output firing rates can assume only positive values, the maximum number p of categories arises, obviously, in the case when only one output neuron is firing for any stimulus, so that up to N categories are formed. If ensemble encoding of output neurons is used (soft competition), again under the orthogonality requirement, then the number of output categories that can be learned will be reduced according to the degree of ensemble encoding. The p categories in the ensemble-encoded case reflect the fact that the between-cluster correlations in the output space are lower than those in the input space. The advantages of ensemble encoding are that dendrites are more evenly allocated to patterns (see Section B.4.9.5), and that correlations between different input stimuli can be reflected in correlations between the corresponding output vectors, so that later networks in the system can generalize usefully. This latter property is of crucial importance, and is utilized for example when an input pattern is presented that has not been learned by the network. The relative similarity of the input pattern to previously learned patterns is indicated by the relative activation of the members of an ensemble of output neurons. This makes the number of different representations that can be reflected in the output of competitive networks with ensemble encoding much higher than with winner-take-all representations, even though with soft competition all these representations cannot strictly be learned.

B Separation of non-linearly separable patterns

A competitive network can not only separate (e.g. by activating different output neurons) pattern vectors that overlap in almost all elements, but can also help with the separation of vectors that are not linearly separable. An example is that three patterns A, B, and A+B will lead to three different output neurons being activated (see Fig. B.17). For this to occur, the length of the synaptic weight vectors must be normalized (to, for example, unit length), so that they lie on the surface of a sphere or hypersphere (see Fig. B.15). (If the weight vectors of each neuron are scaled to the same sum, then the weight vectors do not lie on the surface of a hypersphere, and the ability of the network to separate patterns is reduced.) The property of pattern separation makes a competitive network placed before an autoassociation (or pattern association) network very valuable, for it enables the autoassociator to store the three patterns separately, and to recall A+B separately from A and B. This is referred to as the configuration learning problem in animal learning theory (Sutherland and Rudy 1991). Placing a competitive network before a pattern associator will enable a linearly inseparable problem to be solved.
For example, three different output neurons of a two-input competitive network could respond to the patterns 01, 10, and 11, and a pattern associator can learn different outputs for neurons 1–3, which are orthogonal to each other (see Fig. B.11). This is an example of expansion recoding (cf. Marr (1969), who used a different algorithm to obtain the expansion). The sparsification that can be produced by the competitive network can also be advantageous in preparing patterns for presentation to a pattern associator or autoassociator, because the sparsification can increase the number of memories that can be associated or stored.
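A minimal sketch of this separation, in the spirit of Fig. B.17 (my illustration; the seed, learning rate, and number of training passes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
patterns = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
patterns /= np.linalg.norm(patterns, axis=1, keepdims=True)  # 11 becomes (0.7, 0.7)

W = rng.random((3, 2))                          # three output neurons, two inputs
W /= np.linalg.norm(W, axis=1, keepdims=True)   # weight vectors on the unit circle
for _ in range(50):
    for x in rng.permutation(patterns):
        w = np.argmax(W @ x)                    # winner-take-all competition
        W[w] += 0.3 * x                         # Hebbian update for the winner
        W[w] /= np.linalg.norm(W[w])            # keep the weight vector unit length

# Each of the three linearly dependent patterns ends up with its own winner,
# so a downstream pattern associator sees orthogonal inputs (expansion recoding).
print([int(np.argmax(W @ x)) for x in patterns])
```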

Fig. B.17 Separation of linearly dependent patterns by a competitive network. The network was trained on patterns 10, 01, and 11, applied on the inputs x_1 and x_2. After learning, the network allocated output neuron 1 to pattern 10, neuron 2 to pattern 01, and neuron 3 to pattern 11. The weights in the network produced during the learning are shown. Each input pattern was normalized to unit length, and thus for pattern 11, x_1 = 0.7 and x_2 = 0.7, as shown. Because the weight vectors were also normalized to unit length, w_31 = 0.7 and w_32 = 0.7.

B Stability

These networks are generally stable if the input statistics are stable. If the input statistics keep varying, then the competitive network will keep following the input statistics. If this is a problem, then a critical period in which the input statistics are learned, followed by stabilization, may be useful. This appears to be a solution used in developing sensory systems, which have critical periods beyond which further changes become more difficult. An alternative approach, taken by Carpenter and Grossberg in their Adaptive Resonance Theory (Carpenter 1997), is to allow the network to learn only if it does not already have categorizers for a pattern (see Hertz, Krogh and Palmer (1991), p. 228). Diluted connectivity can help stability, by making neurons tend to find inputs to categorize in only certain parts of the input space, and then making it difficult for the neuron to wander randomly throughout the space later (see further Section 7.4).

B Frequency of presentation

If some stimuli are presented more frequently than others, then there will be a tendency for the weight vectors to move more rapidly towards frequently presented stimuli, and more neurons may become allocated to the frequently presented stimuli. If winner-take-all competition is used, the result is that the neurons will tend to become allocated during the learning process to the more frequently presented patterns. If soft competition is used, the tendency of neurons to move away from patterns that are infrequently or never presented can be reduced by making the competition fairly strong, so that only a few neurons show any learning when each pattern is presented. Provided that the competition is moderately strong (see Section B.4.9.4), the result is that more neurons are allocated to frequently presented patterns, but one or some neurons are allocated to infrequently presented patterns. These points can all be easily demonstrated in simulations. In an interesting development, it has been shown that if objects consisting of groups of features are presented during training always with another object present, then separate representations of each object can be formed, provided that each object is presented many times, but on each occasion is paired with a different object (Stringer and Rolls 2008, Stringer, Rolls and Tromans 2007b). This is related to the fact that in this scenario the frequency of co-occurrence of features within the same object is greater than that of features between different objects (see Section ).

B Comparison to principal component analysis (PCA) and cluster analysis

Although competitive networks find clusters of features in the input space, they do not perform hierarchical cluster analysis as typically performed in statistics. In hierarchical cluster analysis, input vectors are joined starting with the most correlated pair, and the level of the joining of vectors is indicated. Competitive nets produce different outputs (i.e. activate different output neurons) for each cluster of vectors (i.e. perform vector quantization), but do not compute the level in the hierarchy, unless the network is redesigned (see Hertz, Krogh and Palmer (1991)). The feature discovery can also be compared to principal component analysis (PCA). (In PCA, the first principal component of a multidimensional space points in the direction of the vector that accounts for most of the variance, and subsequent principal components account for successively less of the variance, and are mutually orthogonal.) In competitive learning with a winner-take-all algorithm, the outputs are mutually orthogonal, but are not in an ordered series according to the amount of variance accounted for, unless the training algorithm is modified. The modification amounts to allowing each of the neurons in a winner-take-all network to learn one at a time, in sequence. The first neuron learns the first principal component. (Neurons trained with a modified Hebb rule learn to maximize the variance of their outputs; see Hertz, Krogh and Palmer (1991).) The second neuron is then allowed to learn, and because its output is orthogonal to the first, it learns the second principal component. This process is repeated. Details are given by Hertz, Krogh and Palmer (1991), but as this is not a biologically plausible process, it is not considered in detail here. I note that simple competitive learning is very helpful biologically, because it can separate patterns, but that a full ordered set of principal components as computed by PCA would probably not be very useful in biologically plausible networks. The point here is that biological neuronal networks may operate well if the variance in the input representation is distributed across many input neurons, whereas principal component analysis would tend to result in most of the variance being allocated to a few neurons, and the variance being unevenly distributed across the neurons.

B.4.4 Utility of competitive networks in information processing by the brain

B Feature analysis and preprocessing

Neurons that respond to correlated combinations of their inputs can be described as feature analyzers. Neurons that act as feature analyzers perform useful preprocessing in many sensory systems (see e.g. Chapter 8 of Rolls and Treves (1998)). The power of competitive networks in multistage hierarchical processing to build combinations of what is found at earlier stages, and thus effectively to build higher-order representations, is also described in Chapter 25 of this book. An interesting development is that competitive networks can learn about individual objects even when multiple objects are presented simultaneously, provided that each object is presented several times more frequently than it is paired with any other individual object (Stringer and Rolls 2008) (see Section ).
This property arises because learning in competitive networks is primarily about forming representations of objects defined by a high correlation of coactive features in the input space (Stringer and Rolls 2008).

B Removal of redundancy

The removal of redundancy by competition is thought to be a key aspect of how sensory systems, including the ventral cortical visual system, operate. Competitive networks can also be thought of as performing dimension reduction, in that a set of correlated inputs may

be responded to as one category or dimension by a competitive network. The concept of redundancy removal can be linked to the point that individual neurons trained with a modified Hebb rule point their weight vector in the direction of the vector that accounts for most of the variance in the input, that is, (acting individually) they find the first principal component of the input space (see Section B and Hertz, Krogh and Palmer (1991)). Although networks with anti-Hebbian synapses between the principal cells (in which the anti-Hebbian learning forces neurons with initially correlated activity to effectively inhibit each other) (Földiák 1991), and networks that perform Independent Component Analysis (Bell and Sejnowski 1995), could in principle remove redundancy more effectively, it is not clear that they are implemented biologically. In contrast, competitive networks are more biologically plausible, and illustrate redundancy reduction. The more general use of an unsupervised competitive preprocessor is discussed below (see Fig. B.24).

B Orthogonalization

The orthogonalization performed by competitive networks is very useful for preparing signals for presentation to pattern associators and autoassociators, for this re-representation decreases interference between the patterns stored in such networks. Indeed, this can be essential if patterns are overlapping and not linearly independent, e.g. 01, 10, and 11. If three such binary patterns were presented to an autoassociative network, it would not form separate representations of them, because either of the patterns 01 or 10 would result by completion in recall of the 11 pattern. A competitive network allows a separate neuron to be allocated to each of the three patterns, and this set of orthogonal representations can be learned by associative networks (see Fig. B.17).

B Sparsification

The sparsification performed by competitive networks is very useful for preparing signals for presentation to pattern associators and autoassociators, for this re-representation increases the number of patterns that can be associated or stored in such networks (see Sections B.2 and B.3).

B Brain systems in which competitive networks may be used for orthogonalization and sparsification

One system is the hippocampus, in which the dentate granule cells are believed to operate as a competitive network in order to prepare signals for presentation to the CA3 autoassociative network (see Chapter 24). In this case, the operation is enhanced by expansion recoding, in that (in the rat) there are approximately three times as many dentate granule cells as there are cells in the preceding stage, the entorhinal cortex. This expansion recoding will itself tend to reduce correlations between patterns (cf. Marr (1970), and Marr (1969)). Also in the hippocampus, the CA1 neurons are thought to act as a competitive network that recodes the separate representations of each of the parts of an episode that must be separately represented in CA3, into a form more suitable for the recall using pattern association performed by the backprojections from the hippocampus to the cerebral cortex (see Chapter 24 and Rolls and Treves (1998) Chapter 6). The granule cells of the cerebellum may perform a similar function, but in this case the principle may be that each of the very large number of granule cells receives a very small random subset of inputs, so that the outputs of the granule cells are decorrelated with respect to the inputs (Marr (1969); see Chapter 23).

Fig. B.18 Competitive net receiving a normal forward set of inputs A, but also another set of inputs B that can be used to influence the categories formed in response to A inputs. The inputs B might be backprojection inputs.

B.4.5 Guidance of competitive learning

Although competitive networks are primarily unsupervised networks, it is possible to influence the categories found by supplying a second input, as follows (Rolls 1989a). Consider a competitive network as shown in Fig. B.18, with the normal set of inputs A to be categorized, and with an additional set of inputs B from a different source. Both sets of inputs work in the normal way for a competitive network, with random initial weights, competition between the output neurons, and a Hebb-like synaptic modification rule that normalizes the lengths of the synaptic weight vectors onto each neuron. The idea then is to use the B inputs to influence the categories formed by the A input vectors. The influence of the B vectors works best if they are orthogonal to each other. Consider any two A vectors. If they occur together with the same B vector, then the categories produced by the A vectors will be more similar than they would be without the influence of the B vectors. The categories will be pulled closer together if soft competition is used, or will be more likely to activate the same neuron if winner-take-all competition is used. Conversely, if any two A vectors are paired with two different, preferably orthogonal, B vectors, then the categories formed by the A vectors will be drawn further apart than they would be without the B vectors. The differences in categorization remain present after the learning, when just the A inputs are used.

This guiding function of one of the inputs is one way in which the consequences of sensory stimuli could be fed back to a sensory system to influence the categories formed when the A inputs are presented. This could be one function of backprojections in the cerebral cortex (Rolls 1989c, Rolls 1989a) (Chapter 11). In this case, the A inputs of Fig. B.18 would be the forward inputs from a preceding cortical area, and the B inputs backprojecting axons from the next cortical area, or from a structure such as the amygdala or hippocampus. If two A vectors were both associated with positive reinforcement that was fed back as the same B vector from another part of the brain, then the two A vectors would be brought closer together in the representational space provided by the output of the neurons. If one of the A vectors was associated with positive reinforcement, and the other with negative reinforcement, then the output representations of the two A vectors would be further apart. This is one way in which external signals could influence in a mild way the categories formed in sensory systems. Another is that if any B vector only occurred for important sensory A inputs (as shown by the

immediate consequences of receiving those sensory inputs), then the A inputs would simply be more likely to have any representation formed than otherwise, due to strong activation of neurons only when combined A and B inputs are present.

Fig. B.19 A two-layer set of competitive nets in which feedback from layer 2 can influence the categories formed in layer 1. Layer 2 could be a higher cortical visual area with convergence from earlier cortical visual areas (see Chapter 25). In the example, taste and olfactory inputs are received by separate competitive nets in layer 1, and converge into a single competitive net in layer 2. The categories formed in layer 2 (which may be described as representing flavour) may be dominated by the relatively orthogonal set of a few tastes that are received by the net. When these layer 2 categories are fed back to layer 1, they may produce in layer 1 categories in, for example, the olfactory network that reflect to some extent the flavour categories of layer 2, and are different from the categories that would otherwise be formed to a large set of rather correlated olfactory inputs. A similar principle may operate in any multilayer hierarchical cortical processing system, such as the ventral visual system, in that the categories that can be formed only at later stages of processing may help earlier stages to form categories relevant to what can be identified at later stages.

A similar architecture could be used to provide mild guidance for one sensory system (e.g. olfaction) by another (e.g. taste), as shown in Fig. B.19. (Another example of where this architecture could be used is convergence in the visual system at the next cortical stage of processing, with guiding feedback to influence the categories formed in the different regions of the preceding cortical area, as illustrated in Chapter 11.) The idea is that the taste inputs would be more orthogonal to each other than the olfactory inputs, and that the taste inputs would influence the categories formed in the olfactory input categorizer in layer 1, by feedback from a convergent net in layer 2. The difference from the previous architecture is that we now have a two-layer net, with unimodal or separate networks in layer 1, each feeding forward to a single competitive network in layer 2. The categories formed in layer 2 reflect the co-occurrence of a particular taste with particular odours (which together form flavour in layer 2). Layer 2 then provides feedback connections to both the networks in layer 1. It can be shown in such a network that the categories formed in, for example, the olfactory net in layer 1 are influenced by the tastes with which the odours are paired. The feedback signal is built only in layer 2, after there has been convergence between the different modalities. This architecture

captures some of the properties of sensory systems, in which there are unimodal processing cortical areas followed by multimodal cortical areas. The multimodal cortical areas can build representations that represent the unimodal inputs that tend to co-occur, and the higher-level representations may in turn, by the highly developed cortico-cortical backprojections, be able to influence sensory categorization in earlier cortical processing areas (Rolls 1989a). Another such example might be the effect by which the phonemes heard are influenced by the visual inputs produced by seeing mouth movements (cf. McGurk and MacDonald (1976)). This could be implemented by auditory inputs coming together in the cortex in the superior temporal sulcus onto neurons activated by the sight of the lips moving (recorded during experiments of Baylis, Rolls and Leonard (1987), and Hasselmo, Rolls, Baylis and Nalwa (1989b)), using Hebbian learning with co-active inputs. Backprojections from such multimodal areas to the early auditory cortical areas could then influence the responses of auditory cortex neurons to auditory inputs (see Section 4.10 and Fig. 4.5, and cf. Calvert, Bullmore, Brammer, Campbell, Williams, McGuire, Woodruff, Iversen and David (1997)). A similar principle may operate in any multilayer hierarchical cortical processing system, such as the ventral visual system, in that the categories that can be formed only at later stages of processing may help earlier stages to form categories relevant to what can be identified at the later stages as a result of the operation of backprojections (Rolls 1989a).

The idea that the statistical correlation between the inputs received by neighbouring processing streams can be used to guide unsupervised learning within each stream has also been developed by Becker and Hinton (1992) and others (see Phillips, Kay and Smyth (1995)). The networks considered by these authors self-organize under the influence of collateral connections, such as may be implemented by cortico-cortical connections between parallel processing systems in the brain. They use learning rules that, although somewhat complex, are still local in nature, and tend to optimize specific objective functions. The locality of the learning rule, and the simulations performed so far, raise some hope that, once the operation of these types of networks is better understood, they might achieve similar computational capabilities to backpropagation networks (see Section B.11) while retaining biological plausibility.

B.4.6 Topographic map formation

A simple modification to the competitive networks described so far enables them to develop topological maps. In such maps, the closeness in the map reflects the similarity (correlation) between the features in the inputs. The modification that allows such maps to self-organize is to add short-range excitation and long-range inhibition between the neurons. The function to be implemented has a spatial profile that is described as having a Mexican hat shape (see Fig. B.20). The effect of this connectivity between neurons, which need not be modifiable, is to encourage neurons that are close together to respond to similar features in the input space, and to encourage neurons that are far apart to respond to different features in the input space.
When these response tendencies are present during learning, the feature analyzers that are built by modifying the synapses from the input onto the activated neurons tend to be similar if they are close together, and different if far apart. This is illustrated in Figs. B.21 and B.22. Feature maps built in this way were described by von der Malsburg (1973) and Willshaw and von der Malsburg (1976). It should be noted that the learning rule needed is simply the modified Hebb rule described above for competitive networks, and is thus local and biologically plausible. (For computational convenience, the algorithm that Kohonen has mainly used (Kohonen 1982, Kohonen 1989, Kohonen 1995) does not use Mexican hat connectivity between the neurons, but instead arranges that when the weights to a winning neuron are updated, so, to a smaller extent, are those of its neighbours; see further Hertz, Krogh and Palmer (1991).)
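A minimal sketch of the Kohonen variant just mentioned is given below (my illustration, not from the text): a 1-D array of units learns to map points from the unit square, loosely in the spirit of Fig. B.21, which uses an L-shaped region. The Gaussian neighbourhood function standing in for the excitatory centre of the Mexican hat profile, and the decay schedules, are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, n_steps = 50, 20000
W = rng.random((n_units, 2))                   # weights start at random points (cf. Fig. B.21a)

def neighbourhood(winner, sigma):
    """Gaussian neighbourhood over the 1-D array of units."""
    d = np.arange(n_units) - winner
    return np.exp(-d * d / (2.0 * sigma * sigma))

for t in range(n_steps):
    x = rng.random(2)                          # a point in the unit square
    winner = np.argmin(np.sum((W - x) ** 2, axis=1))
    sigma = 10.0 * (0.02 ** (t / n_steps))     # neighbourhood width shrinks over training
    alpha = 0.5 * (0.02 ** (t / n_steps))      # learning rate decays over training
    W += alpha * neighbourhood(winner, sigma)[:, None] * (x - W)

# After training, adjacent units in the 1-D array have nearby weight vectors, so
# nearby points in the input space activate neighbouring units (cf. Fig. B.21d).
```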

Fig. B.20 Mexican hat lateral spatial interaction profile.

A very common characteristic of connectivity in the brain, found for example throughout the neocortex, consists of short-range excitatory connections between neurons, with inhibition mediated via inhibitory interneurons. The density of the excitatory connectivity even falls gradually as a function of distance from a neuron, extending typically a distance in the order of 1 mm from the neuron (Braitenberg and Schütz 1991), contributing to a spatial function quite like that of a Mexican hat. (Longer-range inhibitory influences would form the negative part of the spatial response profile.) This supports the idea that topological maps, though in some cases probably seeded by chemoaffinity, could develop in the brain with the assistance of the processes just described. It is noted that some cortico-cortical connections even within an area may be longer, skipping past some intermediate neurons, and then making connections after some distance with a further group of neurons. Such longer-range connections are found for example between different columns with similar orientation selectivity in the primary visual cortex. The longer-range connections may play a part in stabilizing maps, and again in the exchange of information between neurons performing related computations, in this case about features with the same orientations. If a low-dimensional space, for example the orientation sensitivity of cortical neurons in the primary visual cortex (which is essentially one-dimensional, the dimension being angle), is mapped to a two-dimensional space such as the surface of the cortex, then the resulting map can have long spatial runs where the value along the dimension (in this case orientation tuning) alters gradually and continuously. Such self-organization can account for many aspects of the mapping of orientation tuning, and of ocular dominance columns, in V1 (Miller 1994, Harris, Ermentrout and Small 1997). If a high-dimensional information space is mapped to the two-dimensional cortex, then there will be only short runs of groups of neurons with similar feature responsiveness, and then the map must fracture, with a different type of feature mapped for a short distance after the discontinuity. This is exactly what Rolls suggests is the type of topology found in the anterior inferior temporal visual cortex, with the individual groupings representing what can be self-organized by competitive networks combined with a trace rule, as described in Chapter 25. Here, visual stimuli are not represented with reference to their position on the retina, because here the neurons are relatively translation invariant. Instead, when recording here, small clumps of neurons with similar responses may be encountered close together, and then one moves into a group of neurons with quite different feature selectivity (personal observations). This topology will arise naturally, given the anatomical

connectivity of the cortex with its short-range excitatory connections, because there are very many different objects in the world and different types of features that describe objects, with no special continuity between the different combinations of features possible.

Fig. B.21 Kohonen feature mapping from a two-dimensional L-shaped region to a linear array of 50 units. Each unit has 2 inputs. The input patterns are the X,Y coordinates of points within the L shape shown. In the diagrams, each point shows the position of a weight vector. Lines connect adjacent units in the 1-D (linear) array of 50 neurons. The weights were initialized to random values within the unit square (a). During feature mapping training, the weights evolved through stages (b) and (c) to (d). By stage (d) the weights have formed so that positions in the original input space are mapped to a 1-D vector in which adjacent points in the input space activate neighbouring units in the linear array of output units. (Reproduced with permission from Hertz, Krogh and Palmer 1991.)

Rolls' hypothesis contrasts with the view of Tanaka (1996), who has claimed that the inferior temporal cortex provides an alphabet of visual features arranged in discrete modules. The type of mapping found in higher cortical visual areas as proposed by Rolls implies that topological self-organization is an important way in which maps in the brain are formed, for it seems most unlikely that the locations in the map of the different types of object seen in an environment could be specified genetically (Rolls and Stringer 2000). Consistent with this, Tsao, Freiwald, Tootell and Livingstone (2006) described with macaque fMRI anterior face patches at A15 to A22. A15 might correspond to where we have analysed face-selective neurons (it might translate to 3 mm posterior to our sphenoid reference, see Section 25.2), and at this level there are separate regions specialized for face identity in areas TEa and TEm on the ventral lip of the superior temporal sulcus and the adjacent gyrus, and for face expression and movement in the cortex deep in the superior temporal sulcus (Hasselmo, Rolls and Baylis 1989a, Baylis, Rolls and Leonard 1987, Rolls 2007e). The middle face patch of Tsao, Freiwald, Tootell and Livingstone (2006) was at A6, which is probably part of the posterior inferior temporal cortex, and, again consistent with self-organizing map principles, has a high concentration of face-selective neurons within the patch. The biological utility of developing such topology-preserving feature maps may be that if the computation requires neurons with similar types of response to exchange information more than neurons involved in different computations (which is more than reasonable), then the total length of the connections between the neurons is minimized if the neurons that need

to exchange information are close together (cf. Cowey (1979), Durbin and Mitchison (1990)).

Fig. B.22 Example of a one-dimensional topological map that self-organized from inputs in a low-dimensional space. The network has 64 neurons (vertical elements in the diagram) and 64 inputs per neuron (horizontal elements in the diagram). The four different diagrams represent the net tested with different input patterns. The input patterns x are displayed at the left of each diagram, with white representing firing and black not firing for each of the 64 inputs. The central square of each diagram represents the synaptic weights of the neurons, with white representing a strong weight. The row vector below each weight matrix represents the activations of the 64 output neurons, and the bottom row vector the output firing y. The network was trained with a set of 8 binary input patterns, each of which overlapped in 8 of its 16 on elements with the next pattern. The diagram shows that as one moves through correlations in the input space (top left to top right to bottom left to bottom right), so the output neurons activated move steadily across the output array of neurons. Closely correlated inputs are represented close together in the output array of neurons. The way in which this occurs can be seen by inspection of the weight matrix. The network architecture was the same as for a competitive net, except that the activations were converted linearly into output firings, and then each neuron excited its neighbours and inhibited neurons further away. This lateral inhibition was implemented for the simulation by a spatial filter operating on the output firings with the following filter weights (cf. Fig. B.20): −5, −5, −5, −5, −5, −5, −5, −5, −5, 10, 10, 10, 10, 10, 10, 10, 10, 10, −5, −5, −5, −5, −5, −5, −5, −5, −5, which operated on the 64-element firing rate vector.

Examples of this include the separation of colour constancy processing in V4 from global motion processing in MT, as follows (see Chapter 3 of Rolls and Deco (2002)). In V4, to compute colour constancy, an estimate of the illuminating wavelengths can be obtained by summing, in the inhibitory interneurons, the outputs of the pyramidal cells over several degrees of visual space, and subtracting this from the excitatory central ON colour-tuned region of the receptive field by (subtractive) feedback inhibition. This enables the cells to discount the illuminating wavelength, and thus compute colour constancy. For this computation, no inputs from motion-selective cells (which in the dorsal stream are colour insensitive) are needed. In MT, to compute global motion (e.g. the motion produced by the average flow of local motion elements, exemplified for example by falling snow), the computation can be performed by averaging, in the larger (several degrees) receptive fields of MT, the local motion inputs received

by neurons in earlier cortical areas (V1 and V2) with small receptive fields (see Chapter 3 of Rolls and Deco (2002) and Rolls and Stringer (2007)). For this computation, no input from colour cells is useful. Having separate areas (V4 and MT) for these different computations minimizes the wiring lengths, for having intermingled colour and motion cells in a single cortical area would increase the average connection length between the neurons that need to be connected for the computations being performed. Minimizing the total connection length between neurons in the brain is very important in order to keep the size of the brain relatively small. Placing close to each other neurons that need to exchange information, or that need to receive information from the same source, or that need to project towards the same destination, may also help to minimize the complexity of the rules required to specify cortical (and indeed brain) connectivity (Rolls and Stringer 2000). For example, in the case of V4 and MT, the connectivity rules can be simpler (e.g. connect to neurons in the vicinity, rather than look for colour-marked or motion-marked cells, and connect only to the cells with the correct genetically specified label specifying that the cell is part of either motion or colour processing). Further, the V4 and MT example shows that how the neurons are connected can be specified quite simply, but of course it needs to be specified differently for different computations. Specifying a general rule for the classes of neurons in a given area also provides a useful simplification of the genetic rules needed to specify the functional architecture of a given cortical area (Rolls and Stringer 2000). In our V4 and MT example, the genetic rules would need to specify the rules separately for different populations of inhibitory interneurons if the computations performed by V4 and MT were performed with intermixed neurons in a single brain area. Together, these two principles, of minimization of wiring length, and of allowing simple genetic specification of wiring rules, may underlie the separation of cortical visual information processing into different (e.g. ventral and dorsal) processing streams. The same two principles operating within each brain processing stream may underlie (taken together with the need for hierarchical processing to enable the computations to be biologically plausible in terms of the number of connections per neuron, and the need for local learning rules, see Section B.12) much of the overall architecture of visual cortical processing, and of information processing and its modular architecture throughout the cortex more generally. The rules of information exchange just described could also tend to produce more gross topography in cortical regions. For example, neurons that respond to animate objects may have certain visual feature requirements in common, and may need to exchange information about these features. Other neurons that respond to inanimate objects might have somewhat different visual feature requirements for their inputs, and might need to exchange information strongly. (For example, selection of whether an object is a chisel or a screwdriver may require competition implemented by mutual (lateral) inhibition to produce the contrast enhancement necessary to result in unambiguous neuronal responses.)
The rules just described would account for neurons with responsiveness to inanimate and animate objects tending to be grouped in separate parts of a cortical map or representation, and thus being separately susceptible to brain damage (see e.g. Farah (1990), Farah (2000)).

B.4.7 Invariance learning by competitive networks

In conventional competitive learning, the weight vector of a neuron can be thought of as moving towards the centre of a cluster of similar, overlapping input stimuli (Rumelhart and Zipser 1985, Hertz, Krogh and Palmer 1991, Rolls and Treves 1998, Rolls and Deco 2002, Perry, Rolls and Stringer 2006, Rolls 2008d). The weight vector points towards the centre of the set of stimuli in the category. The different training stimuli that are placed into the same

category (i.e. activate the same neuron) are typically overlapping, in that the pattern vectors are correlated with each other. Figure B.23a illustrates this.

Fig. B.23 (a) Conventional competitive learning. A cluster of overlapping input patterns is categorized as being similar, and this is implemented by a weight vector of an output neuron pointing towards the centre of the cluster. Three clusters are shown, and each cluster might after training have a weight vector pointing towards it. (b) Invariant representation learning. The different transforms of an object may span an elongated region of the space, and the transforms at the far ends of the space may have no overlap (correlation), yet the network must learn to categorize them as similar. The different transforms of two different objects are represented.

For the formation of invariant representations, there are multiple occurrences of the object at different positions in the space. The object at each position represents a different transform (whether in position, size, view, etc.) of the object. The different transforms may be uncorrelated with each other, as would be the case for example with an object translated so far in the space that there would be no active afferents in common between the two transforms. Yet we need these two orthogonal patterns to be mapped to the same output. It may be a very elongated part of the input space that has to be mapped to the same output in invariance learning. These concepts are illustrated in Fig. B.23b. Objects in the world have temporal and spatial continuity. That is, the statistics of the world are such that we tend to look at one object for some time, during which it may be transforming continuously from one view to another. The temporal continuity property is used in trace rule invariance training, in which a short-term memory in the associative learning rule normally used to train competitive networks is used to help build representations that reflect the continuity in time that characterizes the different transforms of each object, as described in Chapter 25. The transforms of an object also show spatial continuity, and this can also be used in invariance training, in what is termed continuous spatial transform learning, described further in Chapter 25. In conventional competitive learning the overall weight vector points to the prototypical representation of the object. The only sense in which, after normal competitive training (without translations etc.), the network generalizes is with respect to the dot product similarity of any input vector compared to the central vector from the training set that the network learns. Continuous spatial transformation learning works by providing a set of training vectors that overlap, and that between them cover the whole space over which an invariant transform of the object must be learned. Indeed, it is important for continuous spatial transformation learning that the different exemplars of an object are sufficiently close that the similarity of adjacent training exemplars is sufficient to ensure that the same postsynaptic neuron learns to bridge the continuous space spanned by the whole set of training exemplars of a given object (Stringer, Perry, Rolls and Proske 2006, Perry, Rolls and Stringer 2006). This will enable the postsynaptic neuron to span a very elongated space of the different transforms of an object.
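A toy sketch of this idea under stated assumptions: each 'transform' of the object is a block of active inputs shifted in small steps, so that adjacent exemplars overlap heavily even though the end points of the series share no active inputs; the pattern width, shift, learning rate, and number of neurons are illustrative choices, not taken from the text:

```python
import numpy as np

n_in, n_out, width, shift = 64, 10, 16, 2
rng = np.random.default_rng(1)
w = rng.random((n_out, n_in))
w /= np.linalg.norm(w, axis=1, keepdims=True)

def transform(start):
    # One 'transform' of the object: a block of active inputs at this position.
    x = np.zeros(n_in)
    x[start:start + width] = 1.0
    return x / np.linalg.norm(x)

for epoch in range(20):
    for start in range(0, n_in - width, shift):   # adjacent exemplars overlap
        x = transform(start)
        i = np.argmax(w @ x)               # winner-take-all competition
        w[i] += 0.1 * x                    # Hebbian update for the winner...
        w[i] /= np.linalg.norm(w[i])       # ...with weight vector normalization

# The far ends of the transform space share no active inputs, yet the same
# neuron tends to win for both, having bridged the space via the overlaps:
print(np.argmax(w @ transform(0)), np.argmax(w @ transform(n_in - width - 1)))
```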

Fig. B.24 A hybrid network, in which for example unsupervised learning rapidly builds relatively orthogonal representations based on input differences, and this is followed by a one-layer supervised network (taught for example by the delta rule) that learns to classify the inputs based on the categorizations formed in the hidden/intermediate layer.

B.4.8 Radial Basis Function networks

As noted above, a competitive network can act as a useful preprocessor for other networks. In the neural examples above, competitive networks were useful preprocessors for associative networks. Competitive networks are also used as preprocessors in artificial neural networks, for example in hybrid two-layer networks such as that illustrated in Fig. B.24. The competitive network is advantageous in this hybrid scheme because, as an unsupervised network, it can relatively quickly (with a few presentations of each stimulus) discover the main features in the input space, and code for them. This leaves the second layer of the network to act as a supervised network (taught for example by the delta rule, see Section B.10), which learns to map the features found by the first layer into the output required. This learning scheme is very much faster than that of a (two-layer) backpropagation network, which learns very slowly because it takes it a long time to perform the credit assignment needed to build useful feature analyzers in layer one (the hidden layer) (see Section B.11). The general scheme shown in Fig. B.24 is used in radial basis function (RBF) neural networks. The main difference from what has been described is that in an RBF network, the hidden neurons do not use a winner-take-all function (as in some competitive networks), but instead use a normalized activation function in which the measure of distance of the neural input from a weight vector is (instead of the dot product x \cdot w_i used for most of the networks described in this book) a Gaussian measure of distance:

y_i = \exp[-(x - w_i)^2 / 2\sigma_i^2] \, / \, \sum_k \exp[-(x - w_k)^2 / 2\sigma_k^2]     (B.21)

The effect is that the response y_i of neuron i is a maximum if the input stimulus vector x is centred at w_i, the weight vector of neuron i (this is the upper term in equation B.21).
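A minimal sketch of the normalized Gaussian activation of equation B.21; for simplicity a single shared σ is used rather than a per-neuron σ_i, and the centres are illustrative assumptions:

```python
import numpy as np

def rbf_activations(x, w, sigma=0.5):
    # Equation B.21 with a shared sigma: a Gaussian of the distance of the
    # input x from each neuron's weight vector, divisively normalized so
    # that the activations sum to 1.
    d2 = np.sum((w - x) ** 2, axis=1)
    g = np.exp(-d2 / (2.0 * sigma ** 2))
    return g / np.sum(g)

centres = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # illustrative centres
print(np.round(rbf_activations(np.array([0.1, 0.0]), centres), 3))
```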

The magnitude is normalized by dividing by the sum of the activations of all the k neurons in the network. If the input vector x is not at the centre of the receptive field of the neuron, then the response is decreased according to how far the input vector is from the weight vector w_i of the neuron, with the weighting decreasing as a Gaussian function with a standard deviation of σ. The idea is like that implemented with soft competition, in that the relative response of different neurons provides an indication of where the input pattern is in relation to the weight vectors of the different neurons. The rapidity with which the response falls off in a Gaussian radial basis function neuron is set by σ_i, which is adjusted so that for any given input pattern vector, a number of RBF neurons are activated. The positions in which the RBF neurons are located (i.e. the directions of their weight vectors w) are usually determined by unsupervised learning, e.g. the vector quantization that is produced by the normal competitive learning algorithm. The first layer of an RBF network is not different in principle from a network with soft competition, and it is not clear how a Gaussian activation function would be implemented biologically, so the treatment is not developed further here (see Hertz, Krogh and Palmer (1991), Poggio and Girosi (1990a), and Poggio and Girosi (1990b) for further details).

B.4.9 Further details of the algorithms used in competitive networks

B.4.9.1 Normalization of the inputs

Normalization is useful because in step 1 of the training algorithm described in Section B.4.2.2, the neuronal activations, formed by the inner product of the pattern and the normalized weight vector on each neuron, are scaled in such a way that they have a maximum value of 1.0. This helps different input patterns to be equally effective in the learning process. A way in which this normalization could be achieved by a layer of input neurons is given by Grossberg (1976a). In the brain, a number of factors may contribute to normalization of the inputs. One factor is that a set of input axons to a neuron will come from another network in which the firing is controlled by inhibitory feedback, and if the number of axons involved is large (hundreds or thousands), then the inputs will be in a reasonable range. Second, there is increasing evidence that the different classes of input to a neuron may activate different types of inhibitory interneuron (e.g. Buhl, Halasy and Somogyi (1994)), which terminate on separate parts of the dendrite, usually close to the site of termination of the corresponding excitatory afferents. This may allow separate feedforward inhibition for the different classes of input. In addition, the feedback inhibitory interneurons also have characteristic termination sites, often on or close to the cell body, where they may be particularly effective in controlling the firing of the neuron by shunting (divisive) inhibition, rather than by scaling a class of input (see Section B.6).

B.4.9.2 Normalization of the length of the synaptic weight vector on each dendrite

This is necessary to ensure that one or a few neurons do not always win the competition.
(If the weights on one neuron were increased by simple Hebbian learning, and there were no normalization of the weights on the neuron, then it would tend to respond strongly in the future to patterns with some overlap with the patterns to which that neuron has previously learned, and gradually that neuron would capture a large number of patterns.) A biologically plausible way to achieve this weight adjustment is to use a modified Hebb rule:

δw_ij = α y_i (x_j − w_ij)     (B.22)

where α is a constant, and x_j and w_ij are in appropriate units. In vector notation,

δw_i = α y_i (x − w_i)     (B.23)

where w_i is the synaptic weight vector on neuron i.
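The rule of equations B.22 and B.23 is easily simulated; the sketch below, with illustrative firing rates and learning rate, shows each weight vector moving towards the input in proportion to that neuron's firing:

```python
import numpy as np

def modified_hebb(w, x, y, alpha=0.05):
    # Equation B.22: dw_ij = alpha * y_i * (x_j - w_ij).  Weak presynaptic
    # input onto a strongly firing neuron gives a weight decrease
    # (heterosynaptic LTD), keeping the weight vector length in check.
    return w + alpha * y[:, None] * (x[None, :] - w)

rng = np.random.default_rng(2)
w = rng.random((5, 8))          # 5 output neurons, 8 inputs each
x = rng.random(8)               # presynaptic firing rates
y = rng.random(5)               # postsynaptic firing rates
for _ in range(200):
    w = modified_hebb(w, x, y)
# Each weight vector has moved towards x, at a rate set by that neuron's firing.
```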

This implements a Hebb rule that increases synaptic strength according to conjunctive pre- and post-synaptic activity, and also allows the strength of each synapse to decrease in proportion to the firing rate of the postsynaptic neuron (as well as in proportion to the existing synaptic strength). This results in a decrease in synaptic strength for synapses from weakly active presynaptic neurons onto strongly active postsynaptic neurons. Such a modification in synaptic strength is termed heterosynaptic long-term depression in the neurophysiological literature, referring to the fact that the synapses that weaken are other than those that activate the neuron. This is an important computational use of heterosynaptic long-term depression (LTD). In that the amount of decrease of the synaptic strength depends on how strong the synapse is already, the rule is compatible with what is frequently reported in studies of LTD (see Section 1.5). This rule can maintain the sums of the synaptic weights on each dendrite to be very similar without any need for explicit normalization of the synaptic strengths, and is useful in competitive nets. This rule was used by Willshaw and von der Malsburg (1976). As is made clear with the vector notation above, the modified Hebb rule moves the direction of the weight vector w_i towards the current input pattern vector x, in proportion to the difference between these two vectors and the firing rate y_i of neuron i. If explicit weight (vector length) normalization is needed, the appropriate form of the modified Hebb rule is:

δw_ij = α y_i (x_j − y_i w_ij)     (B.24)

This rule, formulated by Oja (1982), makes the weight decay proportional to y_i^2, normalizes the synaptic weight vector (see Hertz, Krogh and Palmer (1991)), is still a local learning rule, and is known as the Oja rule.

B.4.9.3 Non-linearity in the learning rule

Non-linearity in the learning rule can assist competition (Rolls 1989b, Rolls 1996c). For example, in the brain, long-term potentiation typically occurs only when strong activation of a neuron has produced sufficient depolarization for the voltage-dependent NMDA receptors to become unblocked, allowing Ca2+ to enter the cell (see Section 1.5). This means that synaptic modification occurs only on neurons that are strongly activated, effectively assisting competition to select few winners. The learning rule can be written:

δw_ij = α m_i x_j     (B.25)

where m_i is a non-linear (e.g. threshold) function of the postsynaptic firing y_i which mimics the operation of the NMDA receptors in learning. (It is noted that in associative networks the same process may result in the stored pattern being more sparse than the input pattern, and that this may be beneficial, especially given the exponential firing rate distribution of neurons, in helping to maximize the number of patterns stored in associative networks; see Sections B.2, B.3, and C.3.1.)

B.4.9.4 Competition

In a simulation of a competitive network, a single winner can be selected by searching for the neuron with the maximum activation. If graded competition is required, this can be achieved by an activation function that increases more than linearly. In some of the networks we have simulated (Rolls 1989b, Rolls 1989a, Wallis and Rolls 1997), raising the activation to a fixed power, typically in the range 2–5, and then rescaling the outputs to a fixed maximum (e.g. 1), is simple to implement.
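A sketch of this form of graded competition; the power and the activation values are illustrative:

```python
import numpy as np

def compete(h, power=4.0):
    # Graded competition: raise the activations to a fixed power (typically
    # in the range 2-5) and rescale so that the maximum output firing is 1.
    y = np.maximum(h, 0.0) ** power
    m = np.max(y)
    return y / m if m > 0 else y

print(np.round(compete(np.array([0.2, 0.5, 0.9, 0.4])), 3))
# -> the most strongly activated neuron dominates the output firing
```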
In a real neuronal network, winner-take-all competition can be implemented using mutual (lateral) inhibition between the neurons, with non-linear activation

functions, and self-excitation of each neuron (see e.g. Grossberg (1976a), Grossberg (1988), Hertz, Krogh and Palmer (1991)). Another method to implement soft competition in simulations is to use the normalized exponential, or softmax, activation function for the neurons (Bridle (1990); see Bishop (1995)):

y = \exp(h) \, / \, \sum_i \exp(h_i)     (B.26)

This function specifies that the firing rate of each neuron is an exponential function of its activation, scaled by the whole vector of activations h_i, i = 1, ..., N. The exponential function (increasing supralinearly) implements soft competition, in that after the competition the faster-firing neurons are firing relatively much faster than the slower-firing neurons. In fact, the strength of the competition can be adjusted by using a temperature T greater than 0, as follows:

y = \exp(h/T) \, / \, \sum_i \exp(h_i/T)     (B.27)

Very low temperatures increase the competition, until, with T → 0, the competition becomes winner-take-all. At high temperatures, the competition becomes very soft. (When using the function in simulations, it may be advisable to prescale the firing rates to, for example, the range 0–1, both to prevent machine overflow, and to set the temperature to operate on a constant range of firing rates, as increasing the range of the inputs has an effect similar to decreasing T.) The softmax function has the property that activations in the range −∞ to +∞ are mapped into the range 0 to 1.0, and the sum of the firing rates is 1.0. This facilitates interpretation of the firing rates, under certain conditions, as probabilities, for example that the firing rate of each neuron in the competitive network reflects the probability that the input vector is within the category or cluster signified by that output neuron (see Bishop (1995)).

B.4.9.5 Soft competition

The use of graded (continuous-valued) output neurons in a competitive network, and of soft competition rather than winner-take-all competition, has the value that the competitive net generalizes more continuously to an input vector that lies between input vectors that it has learned. Also, with soft competition, neurons with only a small amount of activation by any of the patterns being used will nevertheless learn a little, and will move gradually towards the patterns that are being presented. The result is that with soft competition, the output neurons all tend to become allocated to one of the input patterns or to one of the clusters of input patterns.

B.4.9.6 Untrained neurons

In competitive networks, especially with winner-take-all or finely tuned neurons, it is possible that some neurons remain unallocated to patterns. This may be useful, in case patterns in the unused part of the space occur in future. Alternatively, unallocated neurons can be made to move towards the parts of the space where patterns are occurring by allowing such losers in the competition to learn a little. Another mechanism is to subtract a bias term µ_i from y_i, and to use a conscience mechanism that raises µ_i if a neuron wins frequently, and lowers µ_i if it wins infrequently (Grossberg 1976b, Bienenstock, Cooper and Munro 1982, De Sieno 1988).

B.4.9.7 Large competitive nets: further aspects

If a large neuronal network is considered, with the number of synapses on each neuron in the region of 10,000, as occurs on large pyramidal cells in some parts of the brain, then there is a potential disadvantage in using neurons with synaptic weights that can take on

only positive values. This difficulty arises in the following way. Consider a set of positive normalized input firing rate vectors and synaptic weight vectors (in which each element of a vector can take on any value between 0.0 and 1.0). Such vectors of random values will on average be more highly aligned with the direction of the central vector (1,1,1,...,1) than with any other vector. An example can be given for the particular case of vectors evenly distributed on the positive quadrant of a high-dimensional hypersphere: the average overlap (i.e. normalized dot product) between two binary random vectors with half the elements on, and thus a sparseness of 0.5 (e.g. a random pattern vector and a random dendritic weight vector), will be approximately 0.5, while the average overlap between a random vector and the central vector will be approximately 0.707 (i.e. 1/√2). A consequence of this is that if a neuron begins to learn towards several input pattern vectors it will be drawn towards the average of these input patterns, which will be closer to the (1,1,1,...,1) direction than to any one of the patterns. As a dendritic weight vector moves towards the central vector, it will become more closely aligned with more and more input patterns, so that it is more rapidly drawn towards the central vector. The end result is that in large nets of this type, many of the dendritic weight vectors will point towards the central vector. This effect is not seen so much in small systems, since the fluctuations in the magnitude of the overlaps are sufficiently large that in most cases a dendritic weight vector will have an input pattern very close to it, and thus will not learn towards the centre. In large systems, the fluctuations in the overlaps between random vectors become smaller, by a factor of 1/√N, so that the dendrites will not be particularly close to any of the input patterns. One solution to this problem is to allow the elements of the synaptic weight vectors to take negative as well as positive values. This could be implemented in the brain by feedforward inhibition. A set of vectors taken with random values will then have a reduced mean correlation between any pair, and the competitive net will be able to categorize them effectively. A system with synaptic weights that can be negative as well as positive is not physiologically plausible, but we can instead imagine a system with weights lying on a hypersphere in the positive quadrant of space, but with additional inhibition that results in the cumulative effects of some input lines being effectively negative. This can be achieved in a network by using positive input vectors and positive synaptic weight vectors, and thresholding the output neurons at their mean activation. A large competitive network of this general nature does categorize well, and has been described more fully elsewhere (Bennett 1990). In a large network with inhibitory feedback neurons, and principal cells with thresholds, the network could achieve, at least in part, an approximation to this type of thresholding useful in large competitive networks. A second way in which nets with positive-only values of the elements could operate is by making the input vectors sparse and initializing the weight vectors to be sparse, or to have a reduced contact probability.
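These two overlap values can be checked numerically; the following sketch draws random half-on binary vectors and computes their normalized dot products:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10000

def random_half_on():
    # A random binary vector with half its elements on, normalized to length 1.
    v = np.zeros(N)
    v[rng.choice(N, N // 2, replace=False)] = 1.0
    return v / np.linalg.norm(v)

a, b = random_half_on(), random_half_on()
central = np.ones(N) / np.sqrt(N)
print(round(a @ b, 3))        # ~0.5  : two random vectors
print(round(a @ central, 3))  # ~0.707: a random vector vs the central vector
```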
(A measure a of neuronal population sparseness is defined (as before) in equation B.28:

a = \left( \sum_{i=1}^{N} y_i / N \right)^2 \, / \, \left( \sum_{i=1}^{N} y_i^2 / N \right)     (B.28)

where y_i is the firing rate of the ith neuron in the set of N neurons.) For the relatively small net sizes simulated (N = 100) with patterns with a sparseness a of, for example, 0.1 or 0.2, learning onto the average vector can be avoided. However, as the net size increases, the sparseness required does become very low. In large nets, a greatly reduced contact probability between neurons (many synapses kept identically zero) would prevent learning of the average vector, thus allowing categorization to occur (see Section 7.4). Reduced contact probability will, however, prevent complete alignment of synapses with patterns, so that the performance of the network will be affected.
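Equation B.28 translates directly into code; the two test vectors below illustrate the limiting cases:

```python
import numpy as np

def population_sparseness(y):
    # Equation B.28: a = (mean of y)^2 / (mean of y^2).
    return np.mean(y) ** 2 / np.mean(y ** 2)

dense = np.ones(100)                        # all neurons equally active: a = 1.0
sparse = np.zeros(100); sparse[:10] = 1.0   # 10% of neurons active:     a = 0.1
print(population_sparseness(dense), population_sparseness(sparse))
```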

B.5 Continuous attractor networks

B.5.1 Introduction

Single-cell recording studies have shown that some neurons represent the current position along a continuous physical dimension or space even when no inputs are available, for example in darkness (see Chapter 24). Examples include neurons that represent the positions of the eyes (i.e. eye direction with respect to the head), the place where the animal is looking in space, head direction, and the place where the animal is located. In particular, examples of such classes of cells include head direction cells in rats (Ranck 1985, Taube, Muller and Ranck 1990a, Taube, Goodridge, Golob, Dudchenko and Stackman 1996, Muller, Ranck and Taube 1996) and primates (Robertson, Rolls, Georges-François and Panzeri 1999), which respond maximally when the animal's head is facing in a particular preferred direction; place cells in rats (O'Keefe and Dostrovsky 1971, McNaughton, Barnes and O'Keefe 1983, O'Keefe 1984, Muller, Kubie, Bostock, Taube and Quirk 1991, Markus, Qin, Leonard, Skaggs, McNaughton and Barnes 1995) that fire maximally when the animal is in a particular location; and spatial view cells in primates that respond when the monkey is looking towards a particular location in space (Rolls, Robertson and Georges-François 1997a, Georges-François, Rolls and Robertson 1999, Robertson, Rolls and Georges-François 1998). In the parietal cortex there are many spatial representations, in several different coordinate frames (see Chapter 4 of Rolls and Deco (2002) and Andersen, Batista, Snyder, Buneo and Cohen (2000)), and they have some capability to remain active during memory periods when the stimulus is no longer present. Even more than this, the dorsolateral prefrontal cortex networks to which the parietal networks project have the capability to maintain spatial representations active for many seconds or minutes during short-term memory tasks, when the stimulus is no longer present (see Section 4.3.1). In this section, we describe how such networks representing continuous physical space could operate. The locations of such spatial networks in the brain are the parietal areas, the prefrontal areas that implement short-term spatial memory and receive from the parietal cortex (see Section 4.3.1), and the hippocampal system, which combines information about objects from the inferior temporal visual cortex with spatial information (see Chapter 24).

A class of network that can maintain the firing of its neurons to represent any location along a continuous physical dimension such as spatial position, head direction, etc. is a Continuous Attractor neural network (CANN). It uses excitatory recurrent collateral connections between the neurons to reflect the distance between the neurons in the state space of the animal (e.g. head direction space). These networks can maintain the bubble of neural activity constant for long periods wherever it is started to represent the current state (head direction, position, etc.) of the animal, and are likely to be involved in many aspects of spatial processing and memory, including spatial vision. Global inhibition is used to keep the number of neurons in a bubble or packet of actively firing neurons relatively constant, and to help to ensure that there is only one activity packet.
Continuous attractor networks can be thought of as very similar to autoassociation or discrete attractor networks (described in Section B.3), and have the same architecture, as illustrated in Fig. B.25. The main difference is that the patterns stored in a CANN are continuous patterns, with each neuron having broadly tuned firing which decreases with, for example, a Gaussian function as the distance from the optimal firing location of the cell is varied, and with different neurons having tuning that overlaps throughout the space.

Fig. B.25 The architecture of a continuous attractor neural network (CANN), showing the external input e_i, the firing rates r_j of the other neurons conveyed through the recurrent collateral synaptic weights w_ij, the dendritic activation h_i, and the output firing r_i.

Such tuning is illustrated in Fig. . For comparison, the autoassociation networks described in Section B.3 have discrete (separate) patterns (each pattern implemented by the firing of a particular subset of the neurons), with no continuous distribution of the patterns throughout the space (see Fig. ). A consequent difference is that the CANN can maintain its firing at any location in the trained continuous space, whereas a discrete attractor or autoassociation network moves its population of active neurons towards one of the previously learned attractor states, and thus implements the recall of a particular previously learned pattern from an incomplete or noisy (distorted) version of one of the previously learned patterns. The energy landscape of a discrete attractor network (see equation B.12) has separate energy minima, each one of which corresponds to a learned pattern, whereas the energy landscape of a continuous attractor network is flat, so that the activity packet remains stable with continuous firing wherever it is started in the state space. (The state space refers to the set of possible spatial states of the animal in its environment, e.g. the set of possible head directions.)

In Section B.5.2 we first describe the operation and properties of continuous attractor networks, which have been studied by, for example, Amari (1977), Zhang (1996), and Taylor (1999), and then, following Stringer, Trappenberg, Rolls and De Araujo (2002b), address four key issues about the biological application of continuous attractor network models. One key issue is how the synaptic strengths between the neurons in the continuous attractor network could be learned in biological systems (Section B.5.3). A second key issue is how the bubble of neuronal firing representing one location in the continuous state space should be updated based on non-visual cues to represent a new location in state space (Section B.5.5). This is essentially the problem of path integration: how a system that represents a memory of where the agent is in physical space could be updated based on idiothetic (self-motion) cues such as vestibular cues (which might represent a head velocity signal), or proprioceptive cues (which might update a representation of place based on movements being made in the space, during for example walking in the dark). A third key issue is how stability in the bubble of activity representing the current location can be maintained without much drift in darkness, when it is operating as a memory system (Section B.5.6). A fourth key issue is considered in Section B.5.8, in which we describe networks that store

both continuous patterns and discrete patterns (see Fig. ), which can be used to store, for example, the location in (continuous, physical) space where an object (a discrete item) is present.

B.5.2 The generic model of a continuous attractor network

The generic model of a continuous attractor is as follows. (The model is described in the context of head direction cells, which represent the head direction of rats (Taube et al. 1996, Muller et al. 1996) and macaques (Robertson, Rolls, Georges-François and Panzeri 1999), and can be reset by visual inputs after gradual drift in darkness.) The model is a recurrent attractor network with global inhibition. It is different from a Hopfield attractor network primarily in that there are no discrete attractors formed by associative learning of discrete patterns. Instead there is a set of neurons that are connected to each other by synaptic weights w_ij that are a simple function, for example Gaussian, of the distance between the states of the agent in the physical world (e.g. head directions) represented by the neurons. Neurons that represent similar states (locations in the state space) of the agent in the physical world have strong synaptic connections, which can be set up by an associative learning rule, as described in Section B.5.3. The network updates its firing rates by the following leaky integrator dynamical equations. The continuously changing activation h_i^{HD} of each head direction cell i is governed by the equation

τ dh_i^{HD}(t)/dt = −h_i^{HD}(t) + (φ_0/C^{HD}) Σ_j (w_ij − w^{inh}) r_j^{HD}(t) + I_i^{V}     (B.29)

where r_j^{HD} is the firing rate of head direction cell j, w_ij is the excitatory (positive) synaptic weight from head direction cell j to cell i, w^{inh} is a global constant describing the effect of inhibitory interneurons, and τ is the time constant of the system³⁸. The term −h_i^{HD}(t) indicates the amount by which the activation decays (in the leaky integrator neuron) at time t. (The network is updated in a typical simulation at much smaller timesteps than the time constant of the system, τ.) The next term in equation B.29 is the input from other neurons in the network, r_j^{HD}, weighted by the recurrent collateral synaptic connections w_ij (scaled by a constant φ_0 and by C^{HD}, which is the number of synaptic connections received by each head direction cell from other head direction cells in the continuous attractor). The term I_i^{V} represents a visual input to head direction cell i. Each term I_i^{V} is set to have a Gaussian response profile in most continuous attractor networks, and this sets the firing of the cells in the continuous attractor to have Gaussian response profiles as a function of where the agent is located in the state space (see e.g. Fig. on page 496), but the Gaussian assumption is not crucial. (It is known that the firing rates of head direction cells in both rats (Taube, Goodridge, Golob, Dudchenko and Stackman 1996, Muller, Ranck and Taube 1996) and macaques (Robertson, Rolls, Georges-François and Panzeri 1999) are approximately Gaussian.) When the agent is operating without visual input, in memory mode, then the term I_i^{V} is set to zero. The firing rate r_i^{HD} of cell i is determined from the activation h_i^{HD} by the sigmoid function

r_i^{HD}(t) = 1 / (1 + e^{−2β(h_i^{HD}(t) − α)})     (B.30)

where α and β are the sigmoid threshold and slope, respectively.
³⁸ Note that for this section we use r rather than y to refer to the firing rates of the neurons in the network, remembering that, because this is a recurrently connected network (see Fig. B.13), the output from a neuron, y_i, might be the input x_j to another neuron.
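A minimal simulation of equations B.29 and B.30 on a ring of head direction cells is sketched below. The Gaussian weights are set by formula (learning them is the subject of Section B.5.3), and all parameter values are hand-chosen illustrations; whether a packet is stable depends on the balance of excitation and inhibition (cf. Amari 1977):

```python
import numpy as np

N = 100                                   # head direction cells on a ring
tau, dt, phi0 = 1.0, 0.2, 30.0
w_inh, alpha, beta = 0.3, 1.0, 1.0        # global inhibition; sigmoid threshold, slope

pos = np.arange(N)
delta = np.abs(pos[:, None] - pos[None, :])
d = np.minimum(delta, N - delta)          # wrap-around distance on the ring
w = np.exp(-d ** 2 / (2 * 10.0 ** 2))     # Gaussian recurrent weights (set by formula)

h = np.zeros(N)
r = np.zeros(N)
for step in range(300):
    I_v = 5.0 * np.exp(-d[50] ** 2 / (2 * 5.0 ** 2)) if step < 150 else 0.0
    r = 1.0 / (1.0 + np.exp(-2.0 * beta * (h - alpha)))            # equation B.30
    h += (dt / tau) * (-h + (phi0 / N) * (w - w_inh) @ r + I_v)    # equation B.29

print(np.argmax(r))   # the activity packet persists near cell 50 in 'darkness'
```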

B.5.3 Learning the synaptic strengths between the neurons that implement a continuous attractor network

So far we have said that the neurons in the continuous attractor network are connected to each other by synaptic weights w_ij that are a simple function, for example Gaussian, of the distance between the states of the agent in the physical world (e.g. head directions, spatial views, etc.) represented by the neurons. In many simulations, the weights are set by formula to have these appropriate Gaussian values. However, Stringer, Trappenberg, Rolls and De Araujo (2002b) showed how the appropriate weights could be set up by learning. They started from the fact that, since the neurons have broad tuning that may be Gaussian in shape, nearby neurons in the state space will have overlapping spatial fields, and will thus be co-active to a degree that depends on the distance between them. They postulated that the synaptic weights could therefore be set up by associative learning based on the co-activity of the neurons produced by external stimuli as the animal moved in the state space. For example, head direction cells are forced to fire during learning by visual cues in the environment that produce Gaussian firing as a function of head direction, centred on an optimal head direction for each cell. The learning rule is simply that the weights w_ij from head direction cell j with firing rate r_j^{HD} to head direction cell i with firing rate r_i^{HD} are updated according to an associative (Hebb) rule:

δw_ij = k r_i^{HD} r_j^{HD}     (B.31)

where δw_ij is the change of synaptic weight and k is the learning rate constant. During the learning phase, the firing rate r_i^{HD} of each head direction cell i might be the following Gaussian function of the displacement of the head from the optimal firing direction of the cell:

r_i^{HD} = e^{−s_{HD}^2 / 2σ_{HD}^2}     (B.32)

where s_{HD} is the difference between the actual head direction x (in degrees) of the agent and the optimal head direction x_i for head direction cell i, and σ_{HD} is the standard deviation. Stringer, Trappenberg, Rolls and De Araujo (2002b) showed that after training at all head directions, the synaptic connections develop strengths that are an almost Gaussian function of the distance between the cells in head direction space, as shown in Fig. B.26 (left). Interestingly, if a non-linearity is introduced into the learning rule that mimics the properties of NMDA receptors by allowing the synapses to modify only after strong postsynaptic firing is present, then the synaptic strengths are still close to a Gaussian function of the distance between the connected cells in head direction space (see Fig. B.26, left). They showed that after training, the continuous attractor network can support stable activity packets in the absence of visual inputs (see Fig. B.26, right), provided that global inhibition is used to prevent all the neurons becoming activated. (The exact stability conditions for such networks have been analyzed by Amari (1977).) Thus Stringer, Trappenberg, Rolls and De Araujo (2002b) demonstrated biologically plausible mechanisms for training the synaptic weights in a continuous attractor using a biologically plausible local learning rule.
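The associative training of equations B.31 and B.32 can be sketched as follows; the number of cells, learning rate, and tuning width are illustrative:

```python
import numpy as np

N, k, sigma_hd = 100, 0.01, 20.0          # cells, learning rate, tuning width
preferred = np.linspace(0.0, 360.0, N, endpoint=False)
w = np.zeros((N, N))

def rates(x):
    # Equation B.32: Gaussian firing as a function of the angular difference
    # between the actual head direction x and each cell's preferred direction.
    s = np.abs(x - preferred)
    s = np.minimum(s, 360.0 - s)
    return np.exp(-s ** 2 / (2 * sigma_hd ** 2))

for x in np.arange(0.0, 360.0, 1.0):      # train at all head directions
    r = rates(x)
    w += k * np.outer(r, r)               # associative (Hebb) rule, equation B.31

# The learned weight from cell 50 falls off with distance in head direction
# space, approximating a Gaussian profile (cf. Fig. B.26, left):
print(np.round(w[50, 45:56], 2))
```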
Stringer, Trappenberg, Rolls and De Araujo (2002b) went on to show that if a short-term memory trace is built into the operation of the learning rule, then this can help to produce smooth weights in the continuous attractor if only incomplete training is available, that is, if the weights are trained at only a few locations. The same rule can take advantage, in training the synaptic weights, of the temporal probability distributions of firing when they happen to reflect spatial proximity. For example, the agent will necessarily move through similar head directions before reaching quite different head directions, and so the temporal proximity with which the cells fire can be used to set up the appropriate synaptic weights. This new proposal for training continuous

attractor networks can also help to produce broadly tuned spatial cells even if the driving (e.g. visual) input (I_i^{V} in equation B.29) during training produces rather narrowly tuned neuronal responses.

Fig. B.26 Training the weights in a continuous attractor network with an associative rule (equation B.31). Left: the trained recurrent synaptic weights from head direction cell 50 to the other head direction cells in the network, arranged in head direction space (solid curve). The dashed line shows a Gaussian curve fitted to the weights shown in the solid curve. The dash-dot curve shows the recurrent synaptic weights trained with rule equation (B.31), but with a non-linearity introduced that mimics the properties of NMDA receptors by allowing the synapses to modify only after strong postsynaptic firing is present. Right: the stable firing rate profiles forming an activity packet in the continuous attractor network during the testing phase, when the training (visual) inputs are no longer present. The firing rates are shown after the network has been initially stimulated by visual input to initialize an activity packet, and then allowed to settle to a stable activity profile without visual input. The three graphs show the firing rates for low, intermediate and high values of the lateral inhibition parameter w^{inh}. For both left and right plots, the 100 head direction cells are arranged according to where they fire maximally in the head direction space of the agent when visual cues are available. (After Stringer, Trappenberg, Rolls and de Araujo 2002.)

The learning rule with such temporal properties is a memory trace learning rule that strengthens synaptic connections between neurons based on the temporal probability distribution of the firing. There are many versions of such rules (Rolls and Milward 2000, Rolls and Stringer 2001a), which are described more fully in Chapter 25, but a simple one that works adequately is

δw_ij = k r_i^{HD} \bar{r}_j^{HD}     (B.33)

where δw_ij is the change of synaptic weight, and \bar{r}^{HD} is a local temporal average or trace value of the firing rate of a head direction cell, given by

\bar{r}^{HD}(t + δt) = (1 − η) r^{HD}(t + δt) + η \bar{r}^{HD}(t)     (B.34)

where η is a parameter set in the interval [0,1] which determines the relative contributions of the current firing and the previous trace. For η = 0 the trace rule (B.33) becomes the standard Hebb rule (B.31), while for η > 0 learning rule (B.33) operates to associate together patterns of activity that occur close together in time. The rule might allow temporal associations to influence the synaptic weights that are learned over times in the order of 1 s. The memory trace required for the operation of this rule might be no more complicated than the continuing firing that is an inherent property of attractor networks, but it could also be implemented by a number of biophysical mechanisms, discussed in Chapter 25. Finally, we note that some long-term depression (LTD) in the learning rule could help to maintain the weights of different neurons equally potent (see Section B.4.9.2 and equation B.22), and could compensate for irregularity during training, in which the agent might be trained much more in some locations in the space than in others (see Stringer, Trappenberg, Rolls and De Araujo (2002b)).
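Equations B.33 and B.34 can be written as a short update function; the values of η and k below are illustrative, and with η = 0 the update reduces to the Hebb rule of equation B.31:

```python
import numpy as np

def trace_update(w, r, r_trace, eta=0.5, k=0.01):
    # Equation B.34: update the trace of presynaptic firing, then
    # equation B.33: associate current postsynaptic firing with that trace,
    # so that patterns occurring close together in time become linked.
    r_trace = (1.0 - eta) * r + eta * r_trace
    w = w + k * np.outer(r, r_trace)
    return w, r_trace

w, tr = np.zeros((100, 100)), np.zeros(100)
# ... call trace_update(w, r, tr) with the firing r at each training step ...
```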

B.5.4 The capacity of a continuous attractor network: multiple charts and packets

The capacity of a continuous attractor network can be approached on the following bases. First, since there are no discrete attractor states, but instead a continuous physical space is being represented, some concept of spatial resolution must be brought to bear, that is, the number of different positions in the space that can be represented. Second, the number of connections per neuron in the continuous attractor will directly influence the number of different spatial positions (locations in the state space) that can be represented. Third, the sparseness of the representation can be thought of as influencing the number of different spatial locations (in the continuous state space) that can be represented, in a way analogous to that described for discrete attractor networks in equation B.14 (Battaglia and Treves 1998b). That is, if the tuning of the neurons is very broad, then fewer locations in the state space may be represented. Fourth, and very interestingly, if representations of different continuous state spaces, for example maps or charts of different environments, are stored in the same network, there may be little cost to adding extra maps or charts. The reason for this is that the large part of the interference between the different memories stored in such a network arises from the correlations between the different positions in any one map, which are typically relatively high because quite broad tuning of individual cells is common. In contrast, there are in general low correlations between the representations of places in different maps or charts, and therefore many different maps can be simultaneously stored in a continuous attractor network (Battaglia and Treves 1998b). For a similar reason, it is even possible to have activity packets that operate in different spaces simultaneously active in a single continuous attractor network of neurons, and to have them move independently of each other in their respective spaces or charts (Stringer, Rolls and Trappenberg 2004).

B.5.5 Continuous attractor models: path integration

So far, we have considered how spatial representations could be stored in continuous attractor networks, and how the activity can be maintained at any location in the state space in a form of short-term memory when the external (e.g. visual) input is removed. However, many networks with spatial representations in the brain can be updated by internal, self-motion (i.e. idiothetic) cues even when there is no external (e.g. visual) input. Examples are head direction cells in the presubiculum of rats and macaques, place cells in the rat hippocampus, and spatial view cells in the primate hippocampus (see Chapter 24). The major question arises of how such idiothetic inputs could drive the activity packet in a continuous attractor network and, in particular, how such a system could be set up biologically by self-organizing learning.
One approach to simulating the movement of an activity packet produced by idiothetic cues (which is a form of path integration, whereby the current location is calculated from recent movements) is to employ a lookup table that stores (taking head direction cells as an example), for every possible head direction and head rotational velocity input generated by the vestibular system, the corresponding new head direction (Samsonovich and McNaughton 1997). Another approach involves modulating the strengths of the recurrent synaptic weights in the continuous attractor on one but not the other side of a currently represented position, so that the stable position of the packet of activity, which requires symmetric connections in different directions from each node, is lost, and the packet moves in the direction of the temporarily increased

weights, although no possible biological implementation was proposed of how the appropriate dynamic synaptic weight changes might be achieved (Zhang 1996). Another mechanism (for head direction cells) (Skaggs, Knierim, Kudrimoti and McNaughton 1995) relies on a set of cells, termed (head) rotation cells, which are co-activated by head direction cells and vestibular cells, and drive the activity of the attractor network by anatomically distinct connections for clockwise and counter-clockwise rotation cells, in what is effectively a lookup table. However, no proposal was made about how this could be achieved by a biologically plausible learning process, and this has been the case until recently for most approaches to path integration in continuous attractor networks, which rely heavily on rather artificial pre-set synaptic connectivities.

Fig. B.27 General network architecture for a one-dimensional continuous attractor model of head direction cells which can be updated by idiothetic inputs produced by head rotation cell firing r^{ID}. The head direction cell firing is r^{HD}, the continuous attractor synaptic weights are w^{RC}, the idiothetic synaptic weights are w^{ID}, and the external visual input is I^{V}.

Stringer, Trappenberg, Rolls and De Araujo (2002b) introduced a proposal with more biological plausibility about how the synaptic connections from idiothetic inputs to a continuous attractor network can be learned by a self-organizing learning process. The essence of the hypothesis is described with reference to Fig. B.27. The continuous attractor synaptic weights w^{RC} are set up under the influence of the external visual inputs I^{V}, as described in Section B.5.3. At the same time, the idiothetic synaptic weights w^{ID} (in which ID refers to the fact that they are in this case produced by idiothetic inputs, produced by cells that fire to represent the velocity of clockwise and anticlockwise head rotation) are set up by associating the change of head direction cell firing that has just occurred (detected by a trace memory mechanism described below) with the current firing of the head rotation cells r^{ID}. For example, when the trace memory mechanism incorporated into the idiothetic synapses w^{ID} detects that the head direction cell firing is at a given location (indicated by the firing r^{HD}) and is moving clockwise (produced by the altering visual inputs I^{V}), and there is simultaneous clockwise head rotation cell firing, the synapses w^{ID} learn the association, so that when that rotation cell firing occurs later without visual input, it takes the current head direction firing in the continuous attractor into account, and moves the location of the head direction attractor in the appropriate direction.

For the learning to operate, the idiothetic synapses onto head direction cell i (with firing r_i^HD) need two inputs: the memory-traced term \bar{r}_j^HD from the other head direction cells (given by equation B.34), and the head rotation cell input with firing r_k^ID. The learning rule can be written

\delta w_{ijk}^{ID} = k \, r_i^{HD} \, \bar{r}_j^{HD} \, r_k^{ID} \qquad (B.35)

where k is the learning rate associated with this type of synaptic connection. The head rotation cell firing r_k^ID could be as simple as one set of cells that fire for clockwise head rotation (for which k might be 1), and a second set of cells that fire for anticlockwise head rotation (for which k might be 2). After learning, the firing of the head direction cells would be updated in the dark (when I_i^V = 0) by the idiothetic head rotation cell firing r_k^ID as follows:

\tau \frac{dh_i^{HD}(t)}{dt} = -h_i^{HD}(t) + \frac{\phi_0}{C^{HD}} \sum_j (w_{ij} - w^{inh}) \, r_j^{HD}(t) + I_i^V + \frac{\phi_1}{C^{HD \times ID}} \sum_{j,k} w_{ijk}^{ID} \, r_j^{HD} \, r_k^{ID} \qquad (B.36)

Equation B.36 is similar to equation B.29, except for the last term, which introduces the effects of the idiothetic synaptic weights w_{ijk}^{ID}. These effectively specify that the current firing of head direction cell i, r_i^HD, must be updated by the previously learned combination of the particular head rotation now occurring, indicated by r_k^ID, and the current head direction, indicated by the firing of the other head direction cells r_j^HD, indexed through j (see footnote 39). This makes it clear that the idiothetic synapses operate using combinations of inputs, in this case of two inputs. Neurons that sum the effects of such local products are termed Sigma-Pi neurons (see Section A.2.3). Although such synapses are more complicated than the two-term synapses used throughout the rest of this book, such three-term synapses appear to be useful to solve the computational problem of updating representations based on idiothetic inputs in the way described. Synapses that operate according to Sigma-Pi rules might be implemented in the brain by a number of mechanisms described by Koch (1999), Jonas and Kaczmarek (1999), and Stringer, Trappenberg, Rolls and De Araujo (2002b), including having the two inputs close together on a thin dendrite, so that local synaptic interactions would be emphasized.

Simulations demonstrating the operation of this self-organizing learning to produce movement of the location being represented in a continuous attractor network were described by Stringer, Trappenberg, Rolls and De Araujo (2002b), and one example of the operation is shown in Fig. B.28. They also showed that, after training with just one value of the head rotation cell firing, the network showed the desirable property of moving the head direction represented in the continuous attractor by an amount proportional to the value of the head rotation cell firing. Stringer, Trappenberg, Rolls and De Araujo (2002b) also describe a related model of the idiothetic update of the location represented in a continuous attractor, in which the rotation cell firing directly modulates, in a multiplicative way, the strength of the recurrent connections in the continuous attractor, such that clockwise rotation cells modulate the strength of the synaptic connections in the clockwise direction in the continuous attractor, and vice versa.
Footnote 39: The term φ_1/C^{HD×ID} is a scaling factor that reflects the number C^{HD×ID} of inputs to these synapses, and enables the overall magnitude of the idiothetic input to each head direction cell to remain approximately the same as the number of idiothetic connections received by each head direction cell is varied.
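The learning rule of equation B.35 and the idiothetic term of equation B.36 can likewise be sketched in a few lines, continuing the ring sketch above (packet(), prefs and N are reused from it). Again this is an illustrative sketch under stated assumptions, not the published implementation: the exponential trace used for \bar{r}^HD is a simple stand-in for the trace of equation B.34, the learning rate and trace constant are arbitrary, and "clockwise" is taken here to mean increasing head direction.

```python
# Sketch of the Sigma-Pi learning rule (eq. B.35) and the idiothetic update
# term of eq. B.36.  Parameter values are illustrative assumptions.
import numpy as np

K = 2                          # rotation cell sets: k=0 clockwise, k=1 anticlockwise
eta = 0.001                    # learning rate (the k of equation B.35)
lam = 0.9                      # decay of the simple exponential trace (stand-in for eq. B.34)
w_ID = np.zeros((N, N, K))     # three-term Sigma-Pi weights w_ijk

# Training: rotate a visually anchored packet clockwise while the clockwise
# rotation cells fire, then anticlockwise while the anticlockwise cells fire.
for k, step in [(0, +1.0), (1, -1.0)]:
    r_ID = np.zeros(K)
    r_ID[k] = 1.0
    rbar = np.zeros(N)                               # memory trace of r_HD
    direction = 0.0
    for _ in range(720):
        r_HD = packet(direction)
        # Equation B.35: dw_ijk = eta * r_i^HD * rbar_j^HD * r_k^ID.
        # The trace rbar lags the packet, so cell i (current position) is
        # associated with cells j just behind it, plus the rotation signal k.
        w_ID += eta * np.einsum('i,j,k->ijk', r_HD, rbar, r_ID)
        rbar = lam * rbar + (1.0 - lam) * r_HD       # update the trace
        direction = (direction + step) % 360.0

def idiothetic_drive(r_HD, r_ID, phi1=1.0):
    """Last term of eq. B.36: (phi1/C^{HDxID}) * sum_{j,k} w_ijk r_j^HD r_k^ID."""
    return phi1 / (N * K) * np.einsum('ijk,j,k->i', w_ID, r_HD, r_ID)

# Recall in the dark: with only the clockwise rotation cells firing, the
# learned drive peaks slightly clockwise of the current packet, pushing it on.
push = idiothetic_drive(packet(75.0), np.array([1.0, 0.0]))
print("idiothetic drive peaks at", prefs[np.argmax(push)], "deg")
```

The two einsum products are the point of the sketch: each idiothetic synapse multiplies a head direction input by a rotation cell input before the sum over j and k, which is exactly the Sigma-Pi combination of two inputs described in the text.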

[Fig. B.28. Idiothetic update of the location represented in a continuous attractor network. The firing rate of the cells with optima at different head directions (organized according to head direction on the ordinate) is shown by the blackness of the plot, as a function of time. The activity packet was initialized to a head direction of 75 degrees, and allowed to settle without visual input. From timestep 0 to 100 there was no rotation cell input, and the activity packet remained stable at 75 degrees. From timestep 100 to 300 the clockwise rotation cells were active with a firing rate of 0.15, representing a moderate angular velocity, and the activity packet moved clockwise. From timestep 300 to 400 there was no rotation cell firing, and the activity packet immediately stopped and remained still. From timestep 400 to 500 the anticlockwise rotation cells had a high firing rate of 0.3, representing a high angular velocity, and the activity packet moved anticlockwise with a greater velocity. From timestep 500 to 600 there was no rotation cell firing, and the activity packet immediately stopped.]

It should be emphasized that although the cells in Fig. B.28 are organized according to the spatial position being represented, there is no need for cells in continuous attractors that represent nearby locations in the state space to be close together: the distance in the state space between any two neurons is represented by the strength of the connection between them, not by where the neurons are physically located. This enables continuous attractor networks to represent spaces with arbitrary topologies, as the topology is represented in the connection strengths (Stringer, Trappenberg, Rolls and De Araujo 2002b; Stringer, Rolls, Trappenberg and De Araujo 2002a; Stringer, Rolls and Trappenberg 2005; Stringer and Rolls 2002). Indeed, it is this that enables many different charts, each with its own topology, to be represented in a single continuous attractor network (Battaglia and Treves 1998b).

In the network described so far, self-organization occurs, but one set of synapses is Sigma-Pi. We have gone on to show that the Sigma-Pi synapses are not necessary, and can be replaced by a competitive network that learns to respond to combinations of the spatial position and the idiothetic velocity, as illustrated in Fig. B.29 (Stringer and Rolls 2006).

B.5.6 Stabilization of the activity packet within the continuous attractor network when the agent is stationary

With irregular learning conditions (in which identical training with high precision at every node cannot be guaranteed), the recurrent synaptic weights between nodes in the continuous attractor will not be of the perfectly regular and symmetric form normally required in a continuous attractor network.
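The problem this subsection addresses can be made concrete with one more sketch, reusing w_RC, packet(), settle() and prefs from the sketches above; the 5 per cent multiplicative weight noise is an arbitrary illustrative choice. Perturbing the symmetry of the learned recurrent weights gives each location a small net sideways pull, so the packet no longer stays where it is placed but drifts slowly, even though no rotation cell is firing.

```python
# Sketch of the stabilization problem: slightly irregular (asymmetric)
# recurrent weights make the packet drift with no idiothetic input.
# Noise level and all parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
w_noisy = w_RC * (1.0 + 0.05 * rng.standard_normal(w_RC.shape))  # break symmetry

r = packet(75.0)
for t in range(5):
    r = settle(r, w_noisy, steps=200)
    print(f"after {200 * (t + 1)} steps, packet peak at {prefs[np.argmax(r)]:.1f} deg")
# With the symmetric w_RC the peak stays at 75 deg; with w_noisy it drifts.
```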
