
The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering

EFFICIENT AND SCALABLE BIOLOGICALLY PLAUSIBLE SPIKING NEURAL NETWORKS WITH LEARNING APPLIED TO VISION

A Dissertation in Computer Science and Engineering
by Ankur Gupta

2010 Ankur Gupta

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 2010

The dissertation of Ankur Gupta was reviewed and approved* by the following:

Lyle N. Long, Distinguished Professor of Aerospace Engineering, Bioengineering, and Mathematics; Dissertation Co-Advisor; Co-Chair of Committee
Soundar R. T. Kumara, Professor of Computer Science & Engineering and Allen E. & M. Pearce Professor of Industrial Engineering; Dissertation Co-Advisor; Co-Chair of Committee
Robert T. Collins, Associate Professor of Computer Science & Engineering
William E. Higgins, Distinguished Professor of Electrical Engineering, Computer Science & Engineering, and Bioengineering
John C. Collins, Distinguished Professor of Physics
Raj Acharya, Professor of Computer Science & Engineering; Department Head

*Signatures are on file in the Graduate School.

ABSTRACT

Spiking neural networks are more biologically plausible than rate-based neural networks. By incorporating the aspect of time into the model itself, spiking networks are more like biological neural circuits. However, learning methods for spiking neural networks are not as well developed as those for rate-based networks. In this thesis, it is shown that spiking neural networks can be trained to solve computer vision problems using biologically plausible learning methods. For this, a Hebbian learning method based on spike-time-dependent plasticity is developed and applied to different problems. The algorithms proposed have been used to simulate billions of synapses on a laptop and are shown to be efficient and scalable. The largest simulation on a laptop had about 1.5 billion synapses. The method is implemented in a hierarchical network architecture containing only spiking neurons, with simple cells combining to form more specialized cells, similar to visual processing in non-human primates. This is in contrast to other approaches, which do not use only spiking neurons and/or use non-biology-based learning methods. The processing in the present approach is feed-forward and therefore very fast. Experimental evidence supports feed-forward processing at least in the initial stages of visual processing in the primate brain. The network and the learning method developed in this work are tested on various cases: training on Gabor-like cells, LED numbers, and the MNIST database of handwritten digits. Simulations on simple cells revealed bell-shaped tuning curves similar to those observed experimentally in the V1 and MT/V5 areas of cat and non-human primate brains. Results on the MNIST dataset showed that digits such as 1, 2, and 4 were easier to recognize than digits such as 3, 6, and 5, probably because the former have simpler features than the latter.
An accuracy of 89% on the MNIST dataset is obtained using a semi-supervised learning approach. The results are encouraging considering only spiking neurons were

used throughout, including for learning. This work is important as it demonstrates that an all-spiking neural-network approach with only spike-time-based learning can solve engineering problems without the use of other traditional learning methods. The network is also extended to process color images. Color is often ignored in visual processing codes due to the complexity involved. The opponent-channels theory of color processing is used, and preliminary results using color images of fruits are reported. The results suggest that the network is able to correctly identify fruits not just by shape but also by color. For the simulations, new software for simulating spiking neural networks, called CSpike, is developed in this thesis. The software is written in C++ in an object-oriented manner, exploiting the principles of inheritance, polymorphism, and encapsulation so that it is easy to understand, maintain, and modify in the future. The Qt application development framework is used for handling images.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
Chapter 1 Introduction
Chapter 2 Literature Review
  Rate Based Networks
  Spike Based Networks
  Spike Based Models
    Hodgkin-Huxley Model
    Izhikevich Model
    Leaky Integrate and Fire Model
  Invariant Object Recognition
Chapter 3 Mammalian Vision
  Overview of Image Formation
  Hierarchical Processing
  Contrast and Color
Chapter 4 Learning
  Hebbian Learning and STDP
  Homeostasis and Winner-take-all
  Neuronal and Synaptic Genesis
Chapter 5 CSpike
  Object-Oriented Approach
  Neuron Modeling in the Code
  Performance
  Gabor Filtering
  Translational Invariance and Sub-sampling
  Input and Output
Chapter 6 Results
  Training on 48 Artificial Characters
    Test Problem and Network Structure
    Results
  Training Simple Gabor-like Cells
    Test Problem and Network Structure

    Results
  Training on LED Numbers
    Test Problem and Network Structure
    Results
  Training on MNIST Dataset
    Test Problem and Network Structure
    Results
  Color Object Recognition
    Test Problem and Network Structure
    Results
Chapter 7 Conclusions
  Comparison with Biology
  Comparison with Biologically Motivated Systems
  Comparison with HMAX
  Comparison with spikenet (STDP, Thorpe)
  Comparison with Adaptive Resonance Theory
  Role of Feedback Processing
  Concluding Remarks
Appendix A Code Input File and Pre/Post Processing
  A.1 Sample Input File
  A.2 Sample Makefile
  A.3 Matlab Codes for Post-processing
Appendix B CSpike Code
  B.1 List of Files in the Code
  B.2 List of all Classes
  B.3 Class and Member Declarations for Neuron, Layer, Synapse, SynapseMesh, and Network
References

LIST OF FIGURES

Figure 1-2: Development of Neural Networks has been along two directions.
Figure 2-1: Training time taken as the problem size was increased linearly with the number of processors [11].
Figure 2-2: Time taken for one forward propagation as the problem size increases linearly with the number of processors [11].
Figure 2-3: Training time taken as the problem size was increased linearly with the number of processors [11].
Figure 2-4: Voltage vs. time plot of the Hodgkin-Huxley model (red solid) compared to the LIF model (blue dashed) [18].
Figure 3-1: The cornea and lens focus the light on the retina. The fovea allows for high visual acuity with its high concentration of cone cells. The output signal from the retina passes through the optic nerve [76]. Reprinted with permission from Sinauer publishing.
Figure 3-2: Distribution of rods and cones [77].
Figure 3-3: Recordings from typical retinal ganglion cells; Left: on-center cell; Right: off-center cell. Four types of stimulus for each of the cells are shown towards the left [78]. Reprinted with permission from Freeman publishing.
Figure 3-4: The cornea and lens focus the image onto the photoreceptor cells in the retina. The photoreceptors turn light into electrical signals and provide input to the middle and the retinal ganglion layers. The input then passes through the optic nerves to the optic chiasm, LGN, and the visual cortex. The left part of the scene from both eyes registers in the right hemisphere, whereas the right part registers in the left hemisphere [21]. Reprinted with permission from the Society for Neuroscience.
Figure 3-5: Dorsal and Ventral Streams [80]. Reprinted under the Creative Commons license.
Figure 3-6: Shown is a plausible feed-forward pathway for the rapid visual categorization task in monkeys (ventral stream). Information from the retina passes through the LGN to V1. Simple and complex cells are found in the areas V1 and V2. From V2, the information goes through V4 to the posterior and anterior inferior temporal cortex (ITC), where neurons responding to faces and objects are found. The prefrontal cortex (PFC) contains neurons that categorize objects. Information then passes through the pre-motor cortex (PMC) and motor cortex (MC) to the motor neurons of the spinal cord. In the figure, the first latency is an estimate of the earliest neuronal response and the second is the average latency (modified from Thorpe and Thorpe [81]). Reprinted with permission from the American Association for the Advancement of Science.

Figure 3-7: Responses of a neuron to illuminated rectangular slits at different orientations. Orientations are shown in the left column, with responses shown in the right column. The neuron is strongly activated by a vertical bar and not by bars at other orientations. Such neurons are known as simple cells. From Hubel and Wiesel [82]. Reprinted with permission from John Wiley & Sons publishing.
Figure 3-8: Schematic of the standard HMAX model with five layers. The lowest layer is a layer of simple cells with simple features, whereas the highest layer is a layer of view-tuned cells, which have invariance properties [2]. Reprinted with permission from Nature Publishing Group.
Figure 3-9: Comparison between the hierarchical HMAX model [2] (left) and our model (right). We use spiking neurons throughout, including for learning, whereas HMAX does not use spiking neurons for processing or learning.
Figure 3-10: Images of plants and butterflies. Left: grey-scale; Right: color.
Figure 3-11: Left: additive mixing of light sources, e.g. in CRTs; Right: subtractive mixing, e.g. in printing, pigments, and inks.
Figure 3-12: Effect of intensity shift on channels. The leftmost column is the original intensity, the middle a 50% shift, and the rightmost a 100% intensity shift. RGt and BYt are the RG and BY channels with a threshold of 50%, respectively. Intensities for each channel were normalized to between 0 and 1.
Figure 4-1: Variation of synaptic change with pre- and post-synaptic spike time difference [96]. Reprinted with permission from the Society for Neuroscience.
Figure 4-2: Images of arrays of final synapse weights in a layer of the network after learning, plotted as 8x8 projections to the previous layer. White represents the highest synaptic strength and black the lowest.
Figure 5-1: Some classes in the code. Arrows denote inheritance between classes.
Figure 5-2: A typical network showing how synaptic meshes are interleaved between 2-D arrays of neurons.
Figure 5-3: Flow chart of the code.
Figure 5-4: Snapshot of the HTML documentation of the code.
Figure 5-5: Voltage evolution of a single neuron using the leaky integrate-and-fire model with constant input current.
Figure 5-6: Frequency-current curves for a leaky IF neuron. All time steps are in milliseconds.
Figure 5-7: CPU time variation with number of synapses. Both axes are on logarithmic scale.

Figure 5-8: CPU time variation with number of neurons. Both axes are on logarithmic scale.
Figure 5-9: Volume vs. power requirements of biological and man-made systems [19].
Figure 5-10: Comparison of biological vs. man-made systems [19].
Figure 5-11: Gabor filter kernels with bandwidth parameter values of 0.5, 1, and 2, from left to right. The values of the other parameters are as follows: wavelength 10, orientation 0, phase offset 0, and aspect ratio 0.5 [136]. Plotted as intensities, with white representing the highest and black the lowest value.
Figure 5-12: Illustration of the max operation and sub-sampling in 1-D. There are 8 pre-neurons and 3 post-neurons. The afferent input size is 4 neurons with an overlap of 2 neurons.
Figure 5-13: Pseudo-code for implementing the WTA and sub-sampling.
Figure 5-14: Input file format.
Figure 5-15: Output file formats.
Figure 6-1: Character set used. Black denotes an ON pixel, whereas white is OFF.
Figure 6-2: Output when characters are presented in the following order: 'C', 'D', 'A', 'B', 'C', 'D', ...
Figure 6-3: Weight distribution before (left) and after (right) training.
Figure 6-4: Variation of the Frobenius norm of the weight-change matrix with number of epochs.
Figure 6-5: Voltage plots from the 4 output neurons when presented cyclically with 4 bars of different orientations every 50 msec. Each neuron learns a bar of a different orientation, and the inter-spike time decreases as the bar is being learned.
Figure 6-6: Voltage of an output neuron before training (left) and after training (right).
Figure 6-7: Firing rates of 4 output neurons for 36 Gabor filter test images with a step size of 5 degrees. The output of each neuron is shown in a different color.
Figure 6-8: Experimental tuning curve of a cell from cat striate cortex [139]. The cell has a preferred orientation of around 84 degrees.
Figure 6-9: Weights before training (left) and after training (right) for the 4 output neurons. Each output neuron learns to recognize a unique Gabor filter image.
Figure 6-10: LED-type numbers formed by switching 3 horizontal and 4 vertical bars ON or OFF.

Figure 6-11: Network architecture. Input is passed through 4 fixed stencil-type connections, which perform Gabor-type filtering to extract simple features. The next two connections are learning layers with modifiable connections. The first layer of modifiable connections is a many-to-few connection with projections of size 10x10 onto one neuron. The second layer is an all-to-all connection connecting the 10 output neurons to 36 neurons (9 neurons per stream) in the previous layer.
Figure 6-12: Firing rates of the 10 output neurons plotted as intensities when the number shown in the left-most column is presented as input.
Figure 6-13: Weights after learning, plotted as intensities. Left: weights of the 36 neurons in the first learning layer plotted as 10x10 projections from the previous layer; Right: weights of the second learning layer of 10 output neurons (one per row) plotted as 12x3 projections from the previous layer.
Figure 6-14: Firing rates of the 10 output neurons plotted as intensities for a network configuration with 6 neurons per stream instead of 9. The number shown in the left-most column is presented as input.
Figure 6-15: Some images from the MNIST dataset [1].
Figure 6-16: Network architecture of the spiking neural network simulations. Many-to-few and stencil-type connections are shown by projections onto a single post-neuron in the relevant layers. The flow diagram on the left shows the types of synaptic connections and whether they are learned or fixed.
Figure 6-17: Response of the lower layers of the network as an image of the number two is shown. Intensities correspond to the number of spikes. The Gabor filtering and two-dimensional WTA stages are shown. Spiking neurons are used at all layers.
Figure 6-18: Synapse strengths of the first learning layer plotted as 8x8 intensity patches after training, with white representing the highest synaptic strength and black the lowest. Only the first 45 patches from each of the 4 orientation channels are shown for simplicity. Some patches remain random and are not modified.
Figure 6-19: Synapse strengths of the second learning layer after learning, plotted as intensities. Each row corresponds to the connections of one of the 10 output neurons (each of which has 756x3 synapses).
Figure 6-20: Voltage plot of the output neurons when a number 4 from the testing set is presented after learning. The neuron in the middle left fires with the highest firing rate, as it is tuned to that number.
Figure 6-21: Confusion matrix with intensities as percent correct for each number.
Figure 6-22: Images of 12 fruits.
Figure 6-23: Layout of the network. The input is an image of size 50x50 replicated three times for the three opponent channels.

Figure 6-24: Red, green, blue, grayscale, and the opponent channels, normalized and plotted as intensities.
Figure 6-25: Synapse strengths of the first learning layer plotted as 150x50 projections to the previous layer. White represents the highest synaptic strength and black the lowest. Each row represents the connections of a single post-neuron and the three columns represent the three opponent channels.
Figure 6-26: Firing rates of the 12 output neurons plotted as intensities. The number shown in the left-most column represents each of the 12 fruit images.

LIST OF TABLES

Table 2-1: Table comparing various neural simulators.
Table 5-1: Neuron, Layer, and Network classes.
Table 5-2: Synapse, SynapseMesh, and AllToAllMesh classes.
Table 5-3: Comparing different metrics of the human brain and the Jaguar supercomputer.
Table 5-4: Data and methods in the Gabor class.

ACKNOWLEDGEMENTS

There are so many people who have contributed in many different ways to this work during my stay here. I am grateful to my advisors, Dr. Lyle Long and Dr. Soundar Kumara, for their help and guidance. I am thankful to Dr. Long for encouraging and supporting me. I enjoyed working with him on many projects, and despite many constraints he gave me the freedom to pursue my own ideas. He has been very responsive and was always willing to work with me whenever I got stuck. I am equally thankful to Dr. Kumara for helping me and wishing me well for this work and beyond. Thanks to all the committee members, who provided valuable comments on this thesis and helped me improve it. This thesis would not have been possible without their feedback and suggestions. Special thanks to Dr. John Collins for the discussions and for giving important comments that have been of immense help. In addition, many thanks to all the Penn State teachers who have taught me invaluable skills during my stay here. Even though I have been away from my family, I never felt so during the later part of my stay here; I lived as if in one world family. Thanks to the Art of Living foundation for introducing me to lifetime friends and well-wishers. Inexpressible thanks and wishes to Kelly and Derek for welcoming me into their own family and supporting me in innumerable ways. Moments with them, Zach, and Seth will always be cherished. Thanks to Birjoo, Joann, Mark, John, Kyle, Brendyn, and Ray. They have taught me so much in life and I am blessed to know them. I have enjoyed long conversations and thought-provoking discussions with my friends Gopal and Bikash. These discussions, spanning academic, philosophical, and everyday issues, have deeply enriched my understanding. Thanks to both past and present lab members Scott Hanford, Matt Hill, Oranuj Janarathikarn, and Pankaj Jha for the

friendly office atmosphere they created. It is not possible to list everyone who has influenced or helped me in some way or other; thanks and best wishes to all of them. Last but not least, words cannot describe the love, support, and encouragement I have received from my biological family. They have been very patient and understanding throughout my studies.

Chapter 1 Introduction

Artificial neural networks (ANN) can broadly be classified into three generations. The first generation models consisted of McCulloch and Pitts neurons that restricted the output signals to discrete '0' or '1' values. The second generation models, by using a continuous activation function, allowed the output to take values between '0' and '1'. This made them more suited to analog computations, while at the same time requiring fewer neurons for digital computation than the first generation models [3]. One can think of this analog output between 0 and 1 as a normalized firing rate. This is often called a rate-coding scheme, as it implies some averaging mechanism. Spiking neural networks belong to the third generation of neural networks and are more biologically plausible because the concept of time is inherent in the model. Like biological neurons, they use spikes or pulses to represent information flow. Neurological research also shows the importance of time: biological neurons store information in the timing of spikes and in the synapses. Figure 1-1 shows the importance of precise spike times. Though the two plots in the figure have the same number of spikes in a time window, the lower plot conveys richer information. The top plot in the figure has a constant firing rate, which can be interpreted as the output of a second-generation rate-based neuron. There have been many studies in the past using spiking neuron models to solve different problems, for example spatial and temporal pattern analysis [4], instructing a robot in navigation and grasping tasks [5-7], character recognition [8, 9], and learning visual features [10].

Figure 1-1: Voltage plot for a spiking neuron with constant current (top) and varying current (bottom), having the same number of spikes in the time interval shown. Modeled using the leaky integrate-and-fire model.

Generally speaking, the historical development of neural networks has been along one of two directions, as shown in Figure 1-2. One direction focuses on algorithmic accuracy and efficiency and is developed primarily for engineering applications, without much regard to how information is processed in the biological brain. The other direction focuses more on brain modeling and simulation but is not fast or suitable for engineering applications. The approach used here tries to bridge the gap by using only fast spike-based information processing throughout the network while also solving some engineering problems.
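To make the neuron model behind Figure 1-1 concrete, the following is a minimal sketch of a leaky integrate-and-fire neuron with forward-Euler integration. The structure, names, and parameter values here are illustrative, not the configuration used in CSpike.

```cpp
#include <vector>

// Minimal leaky integrate-and-fire (LIF) neuron sketch.
// Units are illustrative: potentials relative to rest, time in ms.
struct LIFNeuron {
    double v = 0.0;        // membrane potential
    double tau = 20.0;     // membrane time constant (ms)
    double r = 1.0;        // membrane resistance
    double v_thresh = 1.0; // firing threshold
    double v_reset = 0.0;  // reset potential after a spike

    // Advance one time step dt with input current i; returns true on a spike.
    bool step(double i, double dt) {
        v += dt / tau * (-v + r * i);   // leaky integration (forward Euler)
        if (v >= v_thresh) { v = v_reset; return true; }
        return false;
    }
};

// Count spikes over t_total ms of constant input current.
int count_spikes(LIFNeuron n, double i, double t_total, double dt) {
    int spikes = 0;
    for (double t = 0.0; t < t_total; t += dt)
        if (n.step(i, dt)) ++spikes;
    return spikes;
}
```

With a constant current the inter-spike interval is constant (the top plot of Figure 1-1); a time-varying current produces the same spike count with richer timing structure (the bottom plot).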

Figure 1-2: Development of Neural Networks has been along two directions.

Most of the success of the second generation neural networks can be attributed to the development of proper training algorithms, e.g., the backpropagation algorithm [11, 12]. It is one of the most widely known algorithms for training these networks and is essentially a supervised gradient-descent algorithm. In previous papers [11, 13], we showed how these second-generation models could be made scalable and run efficiently on massively parallel computers. We developed an object-oriented, massively parallel ANN software package, SPANN (Scalable Parallel Artificial Neural Network). The software was used to identify character sets consisting of 48 characters at various levels of resolution and network sizes. The code correctly identified all the characters when adequate training was used. Training a problem with 2 billion neuron weights (comparable to a rat brain) on an IBM BlueGene/L computer using 1000 dual PowerPC 440 processors required less than 30 minutes. Even though that network was very fast, it was not as biologically realistic, as it used rate-based neuron modeling. But learning in humans is mostly unsupervised and uses voltage spikes. Unfortunately, learning algorithms for rate-based networks aren't as suitable for the third generation spiking networks [14]. One commonly used unsupervised learning approach for

spiking neural networks is called spike time dependent plasticity (STDP) [15-17]. It is a form of competitive Hebbian learning and uses spike timing information to set the synaptic weights. It is based on the experimental evidence that the time delay between pre- and post-synaptic spikes helps determine the strength of the synapse. We have shown how these spiking neural networks can be trained efficiently using Hebbian-style unsupervised learning and STDP [9, 18, 19]. Nessler et al. [20] show theoretically that STDP and WTA can approximate expectation maximization (EM), a well-known technique for finding maximum likelihood estimates of parameters in probabilistic models. In the present work, a Hebbian learning method based on STDP, together with winner-take-all (WTA) competition, is applied to a neural network architecture inspired by the mammalian vision system for object recognition. We chose the mammalian vision system as it is better understood than most other parts of the brain (especially the early stages of the vision system). Also, about 25% of the human brain is devoted to vision, more than to any other sense [21]. So it is worthwhile to investigate and model the visual processing system in the hope of building a more brain-like and possibly better system for artificial vision tasks. Insights from this modeling could also help in solving learning problems in general. Moreover, it is likely that the brain uses a common learning algorithm across modalities [22-24], which means that if we could solve the learning puzzle in the visual stage, perhaps we could solve it for other sensory modalities as well. Invariant object recognition incorporates learning and is an important part of any vision system. Humans and non-human primates can perform invariant (scale, viewpoint, illumination, expression) recognition in real time, and they outperform the best machine systems.
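A common way to express the STDP dependence on pre/post spike timing is an exponential learning window. The sketch below illustrates this form; the constants and function names are illustrative, not the exact update rule used in this thesis.

```cpp
#include <cmath>

// Sketch of an additive STDP weight update. dt = t_post - t_pre (ms).
// A pre-spike shortly before a post-spike (dt > 0) strengthens the
// synapse (potentiation); the reverse order (dt < 0) weakens it
// (depression). Constants here are illustrative placeholders.
double stdp_dw(double dt,
               double a_plus = 0.01, double a_minus = 0.012,
               double tau_plus = 20.0, double tau_minus = 20.0) {
    if (dt > 0) return  a_plus  * std::exp(-dt / tau_plus);   // potentiation
    if (dt < 0) return -a_minus * std::exp( dt / tau_minus);  // depression
    return 0.0;
}

// Apply the update, clipping the weight to [0, w_max].
double stdp_apply(double w, double dt, double w_max = 1.0) {
    double w_new = w + stdp_dw(dt);
    if (w_new < 0.0)   w_new = 0.0;
    if (w_new > w_max) w_new = w_max;
    return w_new;
}
```

The magnitude of the change decays with the absolute time difference, so only near-coincident spike pairs modify a synapse appreciably, matching the experimentally observed window shown in Figure 4-1.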
There have been many studies in the past on both human and animal vision systems and how they process visual stimuli. Taking motivation from biology, many vision systems have also been built [2, 5, 25-32]. Recently such a biologically motivated system was shown to outperform state of the art

systems on real-world object recognition databases [30-32]. There are robust systems such as HMAX [2, 30, 32] which use biologically plausible processing in the lower layers but traditional supervised learning techniques in the higher layers. There are also biologically realistic networks or brain simulators that either rank medium to high in biological plausibility, such as NCS, GENESIS, and NEURON [33-36], or make simplifications such as allowing only one spike per neuron, as in SPIKENET [37-39]. Then there are other computer vision approaches such as SIFT [28, 40, 41] and traditional neural-network-based approaches such as the Neocognitron [25, 42, 43]. Our approach uses spiking neurons throughout, uses more than one spike per neuron, and is more biologically meaningful. In this work we do not aim to model the entire brain. We want to demonstrate how visual tasks such as object recognition can be performed using only spiking neurons and biologically realistic learning techniques. Such a system could be useful in engineering applications and could also help answer important scientific questions concerning learning and the brain. We next discuss how invariance can be achieved in such a system. Invariance to scale, position, and image-plane rotation can be achieved using a hierarchical, pyramid-like approach similar to [2, 30, 32, 44, 45]; however, in this work we focus on position invariance for simplicity. Rotation in depth, on the other hand, is predominantly learned by experience in primates, so it does not need to be included in the pyramid scheme. The presence of view-tuned cells in large numbers compared to view-invariant cells in the inferior temporal lobe (IT) [46-48] suggests that such coding is acquired by experience, i.e., with a large number of views of an object, novel views can be close enough to old ones to be recognized. Our approach incorporates the above observations and others.
The major features of our approach are:
- Learning based on precise spike times, similar to STDP

- Hierarchical layered learning, with learning in higher layers taking place after the lower layers have developed their properties
- Either overall unsupervised learning, or unsupervised learning in the lower layers with the higher layers having the ability to use class label information
- Homeostasis for stability
- WTA based on spike-time difference
- Latching, meaning that synapses that have reached a certain level of saturation need not be modified and can form permanent memories
- The ability to easily incorporate neurogenesis and synaptogenesis (birth and death of neurons and synapses)

The approach uses multiple layers of spiking neurons throughout (including for learning). We have implemented this method in an object-oriented C++ code (called CSpike), which uses fast neuron models such as the leaky integrate-and-fire (LIF) model and can simulate billions of synapses on a laptop. The present work is organized as follows: Chapter 2 presents a review of the related literature, Chapters 3 and 4 discuss mammalian vision and learning, Chapter 5 presents the code and its performance, and Chapters 6 and 7 present the results and conclusions.
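The WTA-by-spike-time idea listed among the features above can be sketched as follows: the neuron that spikes first in a layer wins the competition and is the one whose synapses are modified. This is an illustrative fragment with hypothetical names, not code from CSpike.

```cpp
#include <vector>
#include <limits>

// Winner-take-all by earliest spike time. first_spike[i] holds neuron i's
// first spike time in the current presentation; neurons that never fired
// carry an infinity sentinel. Returns the winner's index, or -1 if no
// neuron fired.
int wta_winner(const std::vector<double>& first_spike) {
    int winner = -1;
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < first_spike.size(); ++i) {
        if (first_spike[i] < best) {
            best = first_spike[i];
            winner = static_cast<int>(i);
        }
    }
    return winner;
}
```

In a full simulation the winner would then suppress (reset or inhibit) its competitors for the remainder of the presentation, and only its afferent synapses would receive the STDP update.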

Chapter 2 Literature Review

This chapter reviews the literature on rate-based and spike-based networks, spiking neuron models, some existing spiking neural network codes, and invariant object recognition. A wide variety of systems for biologically motivated invariant object recognition have been developed. Some are based on computer vision approaches [28, 40, 41], some use rate-based neural networks [25, 42, 43], and some are more biologically realistic [2, 30, 32, 37-39]. Some of the latter approaches use spiking neurons but apply traditional supervised learning techniques in the upper layers, and some make simplifications such as allowing only one spike per neuron. For spiking neurons, there are a number of models one can use depending on the requirements for biological accuracy and computational cost. There are also so-called brain simulators that can model processes at the ionic-channel level with varying degrees of biological accuracy [33-36].

2.1 Rate Based Networks

Second generation Artificial Neural Networks (ANN) have been used for many complex tasks such as stock prediction and nonlinear function approximation [49-53]. ANNs loosely mimic the human brain and consist of large networks of artificial neurons. These neurons have two or more input ports and one output port. Generally, each input port is assigned a weight, and also a change of weight (delta-weight) to speed up convergence. The output of a neuron is the weighted sum of the inputs. A transfer function is generally applied to the output depending on the desired behavior of the ANN. For example, the sigmoid function is generally used when the

output varies continuously but not linearly with the input. Learning in ANNs occurs by iteratively modifying the input weights of each neuron, often using the back-propagation algorithm [12, 54, 55]. Training massive neural networks can be extremely time-consuming, however, since they do not scale well. We have previously developed ANN software, called SPANN (Scalable Parallel Artificial Neural Network) [11], which runs on massively parallel computers and uses the backpropagation training algorithm. An object-oriented (C++) [56] approach was used to model the neural network. The Message Passing Interface (MPI) library [57] was used for parallelization. Figure 2-1 through Figure 2-3 show the performance results of SPANN on massively parallel computers. In order to maintain parallel efficiency, the number of neurons per layer was scaled linearly with the number of processors. The runs shown in Figure 2-1 and Figure 2-2 were performed on the NASA SGI Columbia computer [58]. The runs for Figure 2-3 were performed on an IBM BlueGene [59]. Figure 2-1 and Figure 2-3 show that the training time is essentially constant when the number of neurons is scaled linearly with the number of processors. All of these runs used the same number of training intervals. These are just to compare the timings; the larger networks may require more training for the same level of accuracy. Figure 2-2 shows that the time taken for a single forward propagation step also remains essentially constant as the number of neurons is scaled linearly with the number of processors. For the case on 500 processors (using 6 layers), the total memory required was about 0.2 GB/processor (each weight was stored as a 4-byte floating point number). The memory required by a single neuron in this case was about 1 KB. The largest case (on 1000 processors) used about 2.5 billion neuron weights and was trained in under 30 minutes.
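The second-generation neuron just described, a weighted sum of inputs passed through a sigmoid transfer function, can be sketched as below. This is a toy illustration with hypothetical names, not SPANN's implementation.

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Sigmoid transfer function, squashing any real input into (0, 1).
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Output of one rate-based neuron: transfer function applied to the
// weighted sum of its inputs (plus an optional bias).
double neuron_output(const std::vector<double>& inputs,
                     const std::vector<double>& weights,
                     double bias = 0.0) {
    assert(inputs.size() == weights.size());
    double sum = bias;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        sum += inputs[i] * weights[i];
    return sigmoid(sum);
}
```

Backpropagation training then adjusts the weights by gradient descent on the output error, which is what makes the continuous (differentiable) transfer function essential for second-generation networks.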

Figure 2-1: Training time taken as the problem size was increased linearly with the number of processors [11].

Figure 2-2: Time taken for one forward propagation as the problem size increases linearly with the number of processors [11].

Figure 2-3: Training time taken as the problem size was increased linearly with the number of processors [11].

2.2 Spike Based Networks

A number of systems have been developed in the past for the simulation of spiking neural networks [33-35, 39]. They vary in biological realism, parallel support, complexity, OS support, and speed. Table 2-1 lists the capabilities and limitations of these systems with respect to some of these parameters. In the table, biological realism is rated on a scale of low, medium, or high. Speed is also rated low, medium, or high for somewhat similar neural processing tasks. The next few paragraphs briefly discuss each of these systems.

System              | Biologically realistic | Parallel support    | Speed  | Language                        | OS                   | GUI      | Multiple spikes per neuron
GENESIS             | High                   | MPI, PVM            | Low    | C, GENESIS scripting            | Linux, OS X, Cygwin  | XODUS    | Yes
NEST                | Medium-High            | MPI, multithreading | High   | C++, user interaction using SLI | Linux                | Possible | Yes
NCS                 | High                   | MPI                 | Medium | C++                             | Linux, Cygwin        | No       | Yes
SpikeNET (research) | Low-Medium             | No                  | High   | C++                             | Linux                | No       | No
NEURON              | High                   | MPI                 | Medium | Based on hoc (similar to C)     | Linux, OS X, Windows | Yes      | Yes
CSpike              | Medium                 | In progress         | High   | C++, Qt                         | Linux, OS X          | No       | Yes

Table 2-1: Comparison of various neural simulators.

GENESIS [35] was designed to be a generic system for building biologically realistic neural network simulations. It has been used for biochemical reactions as well as for more realistic small and large neural networks [60]. It includes a variety of multi-compartmental models with Hodgkin-Huxley (HH) or calcium-dependent conductance models [60]. Because simple neuron models such as integrate-and-fire (IF) and the Izhikevich model [61] are not realistic enough for the intended use of GENESIS, they are not included by default (although one can construct them). GENESIS also offers its own graphical interface, XODUS. The GENESIS core code and XODUS are both written in C; to get the maximum benefit, one has to learn and write code in the GENESIS scripting language. PGENESIS [35] is the parallel version of GENESIS and supports both MPI and PVM. It can be run on a variety of platforms, including Windows with Cygwin, OS X, and Linux on both 32- and 64-bit architectures. NCS [33] was designed mainly as a mammalian brain simulator with large networks of HH-type biologically realistic neurons arranged in columns. Its current version (version 5) is written in C++ and MPI. NCS can run on any Linux cluster and has also been run on an 8000-CPU

Swiss EPFL IBM Blue Brain machine. The largest simulation run on NCS had about 1 million single-compartment neurons connected by 1 trillion synapses and required about 30 min on 120 CPUs to simulate one biological second [60]. The NEST initiative [34] was started to build large networks of neurons with biologically realistic connectivity and a small number of compartments. The simulator can model different neuron types, including IF and HH, and different types of synapses, including the STDP learning rule. Users must learn a stack-oriented simulation language called SLI to build simulation networks. NEST is written in C++ and MPI and supports parallelization by multithreading or message passing. It can be compiled and run on a Linux platform with certain required libraries listed on the NEST initiative website. For a simulation network of neurons and 1 million synapses, the speedup obtained was almost linear on up to 8 processors. The code SpikeNET [39] was developed for building large networks of spiking neurons with simple pulse-coupled IF models. The original code was for research purposes, and current versions are now commercialized [62]; the source code of the research version can still be downloaded from the SpikeNET research page [39]. SpikeNET was designed to process only one spike per neuron and cannot implement synaptic delays. The idea was that the brain could use rank-order coding (the order in which neurons fire) to encode information, rather than rate coding. The main emphasis was on doing the computation in real time for a variety of image processing tasks such as real-time object tracking and face recognition. SNVision technology [62], which is based on SpikeNET, can perform tasks such as near-real-time detection in natural scenes and recognition of multiple targets. The SpikeNET research code was written in C++ and runs on Linux.
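The rank-order coding idea can be made concrete with a small sketch (hypothetical code, not taken from SpikeNET): given each neuron's first-spike time, the stimulus is represented by the firing order alone, discarding the precise times.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Rank-order code: the stimulus is represented not by firing rates but by
// the order in which neurons emit their first spike. Given each neuron's
// first-spike time, return the neuron indices from earliest to latest.
std::vector<std::size_t> rankOrder(const std::vector<double>& firstSpikeTime) {
    std::vector<std::size_t> order(firstSpikeTime.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return firstSpikeTime[a] < firstSpikeTime[b];
                     });
    return order;
}
```

Two stimuli are then considered similar when they evoke similar firing orders, which is what allows a single spike per neuron to carry useful information.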
NEURON [36] was initially developed as a simulator for modeling cells with complex ionic channels or cable properties. A number of papers in the literature have used NEURON as a simulation tool for problems involving cells with complex branched anatomies or

biophysical properties such as multiple ion channels [60]. One of its most important capabilities is that it allows modelers to work at a higher level, without worrying about computational issues, by offering a natural syntax such as the concept of a section. Sections are basically unbranched neurites (axons or dendrites) and can be assembled into branched trees. Models are created in an interpreted language based on hoc (similar to C) [63], and modelers can also write additional functions in the NMODL language. NEURON offers a GUI, and modelers without any programming knowledge can use it to build complex models, but one often needs to do some programming to exploit its full capabilities. It also has parallel support and can be run on Beowulf clusters, the IBM Blue Gene, and the Cray XT3. A thalamocortical network model by Traub et al. (2005), with roughly 5 million equations, 3000 cells, and 1 million connections, showed almost linear speedup on up to 800 processors. It is claimed that the speedup is generally linear with the number of CPUs unless each CPU is solving fewer than 100 equations [60]. Our code CSpike is fully object-oriented, written in C++, and uses the Qt library for image input/output. It can use more than one spike per neuron, in contrast to SpikeNET, and thus is more biologically realistic. It is scalable and can simulate billions of synapses on a laptop, as shown in Section 5.3. It is difficult to compare the computational time of CSpike with the systems mentioned above, as they use very different neuron and synapse models. JSpike [18] is an object-oriented spiking network code built in Java that uses the same neuron model and network structure as CSpike. For similar simulations, CSpike is about 4 times faster than JSpike (see Section 5.3). An advantage of using C++ is better parallelism support.
Even though efficient JIT compilers have been built for Java to make it comparable to C++, parallelization in Java is not as well supported as in C++ with MPI or CUDA.
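At the core of a spiking simulator such as CSpike is a per-timestep membrane-potential update. The following is a minimal forward-Euler sketch of the kind of leaky integrate-and-fire update involved (the LIF model itself is introduced in the next section); the function name, signature, and parameter values here are illustrative, not CSpike's actual implementation:

```cpp
// One forward-Euler step of a leaky integrate-and-fire neuron:
//   tau * dv/dt = R*I - v,  with a spike and reset when v crosses threshold.
// Returns true if the neuron spiked during this step. Units and default
// values are illustrative only.
bool lifStep(double& v, double I, double dt,
             double tau = 10.0,    // membrane time constant
             double R = 1.0,       // membrane resistance
             double vThresh = 1.0, // spike threshold
             double vReset = 0.0) {
    v += dt * (R * I - v) / tau;
    if (v >= vThresh) { v = vReset; return true; }
    return false;
}
```

A simulation loop simply calls such an update once per neuron per time step, which is why the cost per step is only a handful of floating-point operations.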

2.3 Spike Based Models

There are many different models one could use to model both the individual spiking neurons and the nonlinear dynamics of the system. Individual neuron models can be categorized by their biological plausibility and speed; generally speaking, the more biologically plausible models require more computation time. Izhikevich [61] and Long and Fang [64] compare these and many more such models. In the following sections three well-known models are presented: the Hodgkin-Huxley (HH), Izhikevich (IZ), and leaky integrate-and-fire (LIF) models. Among these, Hodgkin-Huxley is the most computationally expensive, whereas the leaky integrate-and-fire is the least. These models are often solved numerically using Euler discretization in time. In the IZ and LIF models, voltages are reset after a spike, so round-off errors do not accumulate over time steps. The HH model, in contrast, keeps evolving even after a spike, so errors can accumulate and lead to accuracy and stability issues. In addition, the storage cost of HH is twice that of IZ and four times that of LIF [64].

2.3.1 Hodgkin-Huxley Model

The Hodgkin-Huxley (HH) [65] model was one of the first detailed neuron models developed, and its authors received a Nobel Prize for this work. It was based on experimental results from the squid giant axon. The HH equations are:

C \frac{dv}{dt} = I - \sum_k I_k = I - \left[ g_{Na} m^3 h (v - E_{Na}) + g_K n^4 (v - E_K) + g_L (v - E_L) \right]

\frac{dx}{dt} = \alpha_x(v)(1 - x) - \beta_x(v) x, \quad x \in \{m, n, h\}   (2.1)

Here v is the voltage across the cell membrane, C is the capacitance, and \sum_k I_k is the sum of the ionic currents passing through the cell membrane; Na, K, and L denote the three types of channels, and m, n, and h are the gating variables. E, g, \alpha, and \beta are the remaining parameters; the classic values and rate functions from Hodgkin and Huxley (with v in mV relative to rest) are:

g_{Na} = 120, \; g_K = 36, \; g_L = 0.3 \; mS/cm^2; \quad E_{Na} = 115, \; E_K = -12, \; E_L = 10.6 \; mV
\alpha_m = 0.1(25 - v)/(e^{(25-v)/10} - 1), \quad \beta_m = 4 e^{-v/18}
\alpha_n = 0.01(10 - v)/(e^{(10-v)/10} - 1), \quad \beta_n = 0.125 e^{-v/80}
\alpha_h = 0.07 e^{-v/20}, \quad \beta_h = 1/(e^{(30-v)/10} + 1)   (2.2)

In this model, after a spike the voltage is not reset but continues to evolve (unlike in the LIF and IZ models), which can lead to accumulation of round-off errors when solving numerically. Long and Fang [64] show that a fourth-order Runge-Kutta scheme gives much better accuracy than Euler discretization in time. Though this model can simulate biology at the ionic-channel level, it is quite expensive to compute: 1 ms of simulation takes about 1200 floating-point operations according to Izhikevich [61]. Long and Fang [64] also showed this model to take roughly 30 times as long as the LIF model using GNU C on a 2.8 GHz 8-core Mac OS X server.

2.3.2 Izhikevich Model

Izhikevich [61] proposed a model which is computationally much less expensive than the HH model, yet can capture much richer spiking activity. The governing equations are given by:

\frac{dv(t)}{dt} = 0.04 v^2 + 5v + 140 - u + I(t)

\frac{du(t)}{dt} = a(bv - u)   (2.3)

with spike resetting modeled according to: if v \geq 30 mV, then v = c and u = u + d. Here v represents the voltage of a neuron, u is a recovery variable adjusting v, and I(t) represents input currents. With a proper choice of the parameters a, b, c, and d, the model can exhibit firing patterns of all known types. When the membrane voltage v(t) reaches 30 mV, a spike is emitted, and the membrane voltage and the recovery variable are reset according to the above equation. One millisecond of simulation takes about 13 floating-point operations [61]. Long and Fang [64] showed this model to take roughly 3 times as long as the LIF model using GNU C on a 2.8 GHz 8-core Mac OS X server. They also showed that this model requires a smaller time step with the first-order Euler scheme for numerical stability, and that many of the interesting effects observed in Izhikevich [61] could be fortuitous.

2.3.3 Leaky Integrate and Fire Model

Another model is the leaky integrate-and-fire (LIF) [66] model, which is much simpler and computationally very inexpensive. Even so, it can model spike times very similar to those of the HH model. Figure 2-4 shows the voltage time history of a single neuron for both the HH and LIF models; it is evident that the spike times are quite similar. This is important, as spiking networks often use the timing

of spikes for learning. Also, detailed neuron behaviors are not needed for some engineering neural network systems. The LIF model is:

\frac{dv_i}{dt} = \frac{1}{\tau} \left( R (I_{input} + I_i) - v_i \right)   (2.4)

where v_i is the voltage, \tau = RC is the time constant, R is the resistance, C is the capacitance, I_{input} is a possible external input current (usually zero), and I_i is the current from the synapses. One step of this model requires about 5 floating-point operations [61]. The numerical implementation of this model is discussed in Section 5.2.

Figure 2-4: Voltage vs. time plot of the Hodgkin-Huxley model (red solid) compared to the LIF model (blue dashed) [18].

2.4 Invariant Object Recognition

The Neocognitron [25, 42] was one of the first invariant hierarchical models, consisting of alternating simple and complex cell layers and using traditional neural networks. It has been used for handwritten character recognition and other pattern recognition tasks. Perrett and Oram [67]

also proposed a hierarchical model and used a Gaussian RBF trained to recognize a paper-clip-like object at various rotation angles. One criticism of such models is that they were not designed to deal with real-world databases. Also, even though their structure was hierarchical, they were not designed to match biological data from experiments, as HMAX [2] was. One of the first object recognition models to be based on physiological data from monkey inferotemporal (IT) cortex and to make testable predictions was HMAX [2]. In a hierarchical manner, it constructs view-tuned units (VTUs) invariant to position and scale (but not to 3D rotation), similar to those observed by Hoffman and Logothetis [47] and Logothetis et al. [48] in experiments on monkey IT cortex. The key feature of the model is a MAX-like operation that signals the best match of any part of the stimulus to the afferent's preferred feature. The network is a feed-forward architecture consisting of simple, complex, and view-tuned units. The model is able to recognize images containing a preferred clip together with a distracter clip as input: in 90% of the cases, the response to the preferred clip was above the response to the distracter clip in the two-clip display. The model is not fooled even when the input image is scrambled into pieces; i.e., it can distinguish a scrambled image from an unscrambled one. Serre et al. [30] proposed a biologically motivated object recognition model similar to HMAX in the earlier stages of processing, but using an SVM or gentle AdaBoost for learning the features in the later stage. They obtained higher accuracies than SIFT [28] on both the MIT-CBCL [68] and Caltech [69] databases. They introduced a set of features, where each feature is obtained by combining local edge detectors that are tolerant to scale and position changes over multiple positions, similar to the complex cells in the primary visual cortex.
Given an input image, a vector of these features is first computed, and then a classifier such as a support vector machine (SVM) is applied to the features. The system is able to learn from fewer examples than traditional systems, as scanning over all positions and scales is not required. They used the Caltech dataset for classification tasks; the highest accuracy obtained was 99.8%, on the cars dataset using an SVM.

An accuracy of 98.2% was obtained on faces and 98% on motorbikes, both using gentle adaptive boosting (AdaBoost). AdaBoost constructs a series of classifiers in which each subsequent classifier is built to favor instances misclassified by the previous ones. Mutch et al. [31] refined the approach of Serre et al. [30] with several improvements, such as a simple version of sparsification (which speeds up processing by considering only the dominant orientation at the S2 stage) and lateral inhibition (which suppresses the non-dominant orientations in the S1 and C1 layer outputs). Their results on the Caltech101 dataset and the UIUC car localization task achieved state-of-the-art performance. Later, Cadieu et al. [44] presented a model of the V4 area of the visual cortex; it showed selectivity and translation-invariant shape representations similar to those observed in V4. Serre et al. [32] also present a hierarchical system for object recognition with cortex-like mechanisms. They applied their model to various tasks, such as invariant single-object recognition in clutter, multi-class categorization, and complex scene understanding relying on both shape-based and texture-based objects. They used various databases: Caltech5, Caltech101, and the MIT-CBCL database for object recognition in clutter, and the MIT StreetScene database [70] for object recognition without clutter and for recognition of texture-based objects. Combining the texture- and object-based approaches, they provided a framework for a complete scene-understanding system. The reported accuracies were all better than those of the existing benchmark systems. Another approach to invariant object recognition is VisNet [71-73], a feature-hierarchy model in which invariant representations are built using self-organizing learning based on input statistics.
The network has four layers with convergence to each part of a layer from a small region of the preceding layer. The neurons within a layer are competitive so that too many surrounding neurons receiving similar inputs are not excited. A modified Hebbian trace-learning rule is used for learning. This trace-learning rule

takes into account the decaying trace of each cell's previous activity; thus, both the current firing rate and the firing rates elicited by recent stimuli are used to modify the weights. Another well-known object recognition algorithm, based on local image features that are largely invariant to translation, scale, and rotation, is SIFT (Scale Invariant Feature Transform) [28, 40, 41]. In this method, image features are transformed into local feature coordinates that are invariant to rotation, translation, and scaling. This involves four major steps: 1) scale-space extrema detection, searching through scale and space to find potential interest points in the scene; 2) key-point localization, fitting a detailed model to determine location and scale; 3) orientation assignment, in which local image gradient directions are used to assign one or more orientations to each key-point location; and 4) the key-point descriptor, in which local image gradients are measured at the selected scale in the region around each key-point and transformed into a representation that allows for local shape distortion and illumination changes. Several stable key points are selected in scale space, and feature detection is performed only at these locations. There are other biology-based approaches too. The idea of using one spike per neuron was initially explored by van Rullen et al. [37], who developed the software SpikeNet [65, 74]. They used the rank ordering of spike times from different neurons as a code, justifying this assumption with electrophysiological data from the monkey temporal lobe. It was observed that some neurons of the monkey temporal lobe responded to face stimuli with a latency of ms. After taking into account that the information has to pass through 10 different stages, and the conduction velocities of neocortical fibers, it was found that each stage had less than 10 ms for computation.
Such rapid processing presents a problem for conventional rate coding if more than one spike is to be emitted by a cell [37]. Later, Delorme and Thorpe [29] also used one spike per neuron, with a learning rule based on spike timings to learn weights in a supervised manner. Their network was robust to contrast and luminance changes. The accuracy

achieved was 97% on a modified test dataset consisting of novel views, but the accuracy dropped rapidly if the images were modified, for example by removal of color or addition of noise. Masquelier and Thorpe [10] used STDP with spiking neural networks to achieve selectivity to intermediate-complexity visual features in natural images, even though they used only one spike per neuron. They used a network architecture similar to that of Serre et al. [30], consisting of alternating simple and complex cells in a hierarchy, and a simplified STDP learning rule with an infinite window, in which only the order of the spikes mattered, not the precise timings. Simple cells gain selectivity from a linear sum operation, whereas complex cells gain invariance from a max operation. The Caltech dataset with faces and motorbikes was used for classification. In the face versus non-face classification task, they achieved an accuracy of 99.1% with the potential+RBF approach, based on training a radial basis function (RBF) classifier on the C2 cells' final potentials. On the motorbike set, using the same approach, the accuracy reached 97.8%.
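The flavor of such an order-based STDP rule can be sketched as follows. This is an illustrative reconstruction with made-up learning rates, not the authors' exact code: with an effectively infinite window, a synapse is potentiated when the presynaptic spike precedes the postsynaptic one and depressed otherwise, with soft bounds keeping the weight in [0, 1].

```cpp
#include <algorithm>

// Simplified, order-based STDP in the spirit of the rule described above:
// only the order of the pre- and postsynaptic spikes matters, not their
// precise timing. The multiplicative w*(1-w) factor softly bounds the
// weight to [0, 1]. The learning rates aPlus and aMinus are illustrative.
double stdpUpdate(double w, double tPre, double tPost,
                  double aPlus = 0.05, double aMinus = 0.03) {
    if (tPre <= tPost)
        w += aPlus * w * (1.0 - w);   // pre before post: potentiate
    else
        w -= aMinus * w * (1.0 - w);  // post before pre: depress
    return std::min(1.0, std::max(0.0, w));
}
```

Because the update depends only on spike order, it fits naturally with one-spike-per-neuron processing: a single wave of spikes per stimulus is enough to drive learning.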

Chapter 3

Mammalian Vision

About 25% of the entire human brain is devoted to vision, more than to any other sensory modality [21]. It is also one of the best understood regions of the brain. The human and non-human primate vision system is remarkable: it performs a wide variety of image understanding and manipulation tasks, in varying scenarios, very quickly. Though there have been advances in computer vision, they are far from matching human vision capabilities [32]. To develop better computer vision systems, it is important to learn from the human system, so the next few paragraphs present the important components and functions of the human vision system.

3.1 Overview of Image Formation

In humans, light passes through the cornea, which accounts for about 65-75% of the focusing, and then through the lens [Figure 3-1] [75]. The cornea and the lens combined have a power of about 60 diopters [75]. The focused light is incident on the sheet of photoreceptor cells in the retina. Photoreceptors absorb light and send signals to nearby neurons for further processing. There are about 125 million photoreceptor cells in each human eye, and they are either rods or cones [21]. Rods are sensitive to low light and do not convey color, whereas cones are used at high light levels and convey color.

Figure 3-1: The cornea and lens focus the light on the retina. The fovea allows for high visual acuity with its high concentration of cone cells. The output signal from the retina passes through the optic nerve [76]. Reprinted with permission from Sinauer publishing.

Figure 3-2: Distribution of rods and cones [77].

There are roughly 20 times more rods than cones, with the fovea having the highest concentration of cones and almost no rods [76]. Figure 3-2 shows the density distribution of the rods and cones. There are three types of cones in the human eye, each sensitive to a different range of wavelengths of light. The three cone types respond maximally at wavelengths of roughly 575, 540, and 450 nanometers (corresponding roughly to red, green, and blue, respectively) [76]. From the rods and cones the signal passes to the cells in the Lateral Geniculate Nucleus (LGN) via an intermediate layer [Figure 3-4] consisting of three types of cells: bipolar, horizontal, and amacrine. Bipolar cells receive input from the receptors, and many connect directly to the retinal ganglion cells. Horizontal cells connect receptors and bipolar cells horizontally through relatively long parallel connections. The amacrine cells link bipolar and retinal ganglion cells [76]. Most cells in the LGN exhibit on-center, off-surround or off-center, on-surround behavior, as shown in [Figure 3-3] [78]. The difference of two 2-D Gaussian filters at different scales is a good model for these cells.

Figure 3-3: Recordings from typical retinal ganglion cells; Left: on-center cell; Right: off-center cell. Four types of stimuli for each cell are shown at the left [78]. Reprinted with permission from Freeman publishing.

Near the center of the gaze, each ganglion cell receives inputs from very few cells in the previous layer, resulting in high resolution. In the periphery, each ganglion cell receives inputs from many cells in the previous layer (predominantly rods), resulting

in poor resolution. The ganglion axons pass the information to the visual cortex for further processing.

Figure 3-4: The cornea and lens focus the image onto the photoreceptor cells in the retina. The photoreceptors turn light into electrical signals and provide input to the middle and retinal ganglion layers. The input then passes through the optic nerves to the optic chiasm, the LGN, and the visual cortex. The left part of the scene from both eyes registers in the right hemisphere, whereas the right part registers in the left hemisphere [21]. Reprinted with permission from the Society for Neuroscience.
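The difference-of-Gaussians model of these center-surround receptive fields mentioned above can be written compactly. The sigma values below are illustrative, not fitted to physiology:

```cpp
#include <cmath>

// Difference-of-Gaussians (DoG) model of an on-center, off-surround
// receptive field: a narrow excitatory center Gaussian minus a broader
// inhibitory surround Gaussian. r is the distance from the field center;
// sigmaC < sigmaS are illustrative center and surround widths.
double dogResponse(double r, double sigmaC = 1.0, double sigmaS = 2.0) {
    const double kPi = 3.14159265358979323846;
    auto gauss = [kPi](double dist, double sigma) {
        return std::exp(-dist * dist / (2.0 * sigma * sigma)) /
               (2.0 * kPi * sigma * sigma);
    };
    return gauss(r, sigmaC) - gauss(r, sigmaS);
}
```

The response is positive near the center and negative in the surround, reproducing the on-center, off-surround behavior; swapping the sign models the off-center case.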

The signals from the ganglion cells go into different streams, each specializing in a different type of visual processing. These streams are connected series of neurons carrying information in segregated parallel pathways. Each stream is believed to communicate relevant spatio-temporal information to specialized brain areas for relevant tasks. The precise number and functions of these streams are still not known. From the occipital lobe, two major, well-known streams have been identified: the ventral ("what") and the dorsal ("where") streams [79], as shown in [Figure 3-5]. The ventral stream is associated with object recognition and form representation, whereas the dorsal stream is associated with motion and with control of the eyes and arms. Our focus here is mostly on the ventral stream.

Figure 3-5: Dorsal and ventral streams [80]. Reprinted under the Creative Commons license.

Humans see with two eyes; this is called binocular vision. Signals from the two eyes pass via the optic nerves to the optic chiasm, where some nerve fibers cross over so that both sides of the brain receive signals from both eyes, which is necessary for binocular vision. As a result, the left part of the visual scene registers in the right hemisphere, and the right part of the scene registers in the left hemisphere [Figure 3-4] [21].

3.2 Hierarchical Processing

Hierarchical processing is an important aspect of vision in both humans and non-human primates. Here we focus on the ventral stream. The input stimuli pass through the retina, LGN, V1, V2, V4, and areas of IT before being fed to the PFC [81]. Figure 3-6 shows how information passes through various regions in the brain of a monkey in a go/no-go task [81].

Figure 3-6: A plausible feed-forward pathway for a rapid visual categorization task in monkeys (ventral stream). Information from the retina passes through the LGN to V1. Simple and complex cells are found in areas V1 and V2. From V2, the information goes through V4 to the posterior and anterior inferior temporal cortex (ITC), where neurons responding to faces and objects are found. The prefrontal cortex (PFC) contains neurons that categorize objects. Information then passes through the pre-motor cortex (PMC) and motor cortex (MC) to the motor neurons of the spinal cord. In the figure, the first latency is an estimate of the earliest neuronal response and the second is the average latency (modified from Thorpe and Fabre-Thorpe [81]). Reprinted with permission from the American Association for the Advancement of Science.

A go/no-go task is a method of measuring reaction time in which the subject is required to press a button when one

stimulus appears and withhold the response for other types of stimuli. Two trends appear as one moves higher up the ventral stream from V1 to IT: receptive field sizes increase, and cells become selective to more and more complex stimuli. These two properties are present throughout the ventral stream. Hubel and Wiesel [82-84] performed experiments on the cat and monkey cortex and identified simple and complex cells. Simple cells, found in V1 [Figure 3-7], respond to stimuli such as edges and bars [82-84], whereas complex cells are invariant to translation. There are also cells higher up in the hierarchy that are invariant to scale, position, and view, including cells selective for faces [47, 85].

Figure 3-7: Responses of a neuron to illuminated rectangular slits at different orientations. Orientations are shown in the left column, with responses shown in the right column. The neuron is strongly activated by a vertical bar and not by bars at other orientations. Such neurons are known as simple cells. From Hubel and Wiesel [82]. Reprinted with permission from John Wiley & Sons publishing.

These invariances are most likely encoded by hierarchical feed-forward processing rather than by feedback signals, as they are very fast. Neuron latencies for object recognition are as low as 100 ms [81], with invariant and selective responses having similar latencies [48, 86, 87]. This suggests that there is minimal feedback processing and that a feed-forward hierarchy is the dominant mode for encoding at least some of the invariances. Hierarchy is also found in other tasks, such as categorization: rhesus monkeys discriminate, without training, between faces of their own species and those of another species, and they discriminate within-species faces better than other-species faces [88]. Auditory processing also shows hierarchical structure; recently, Rauschecker and Scott [89] proposed a hierarchical structure for auditory processing in the primary auditory cortex, similar to that for vision, across non-human primate species. In all, a hierarchical structure is common to many aspects of visual processing. Such hierarchies can be built easily using a pyramid-like approach [2, 10, 30, 32, 44], in both traditional computational vision and in spiking neural networks [10]. One such well-known approach is HMAX [2], which uses five layers of hierarchy, from the S1 layer of simple cells to invariant view-tuned cells like those found in the IT region [Figure 3-8]. In this model, simple cells (S1) of different orientations are first formed by applying arrays of Gaussian filters (obtained by taking the second derivative of a Gaussian) to the original image. These S1 cells are then used to form C1 cells using the MAX operation, which are invariant to translation and scale. Different S1 and C1 cells are combined to form composite feature cells (S2), which are passed through a MAX-like operation to yield complex invariant C2 cells.
These C2 cells finally combine to form view-tuned cells responding to a certain view of the input object. These operations are very similar to those found in the visual stream: Gaussian filtering is very similar to Gabor filtering, and MAX-like operations are similar to the winner-take-all operations found in biological systems. We use these biologically plausible sets of operations

along with spiking neurons for the hierarchies in our network. Figure 3-9 compares our network architecture for vision to the HMAX architecture of Riesenhuber et al. [2]. The major difference is that all of our processing and learning is done with spiking neurons (which is more biologically plausible), whereas HMAX uses traditional learning methods, such as support vector machines, in its higher layers. More differences from HMAX are presented in Chapter 7. There is another advantage of the hierarchical approach with regard to parallelism: as one moves up the hierarchy, fewer cells are involved in recognizing complex stimuli, so fewer processors are needed for the same latency. Also, since less information needs to be passed across processors, decisions can be made very quickly higher up.

Figure 3-8: Schematic of the standard HMAX model with five layers. The lowest layer is a layer of simple cells with simple features, whereas the highest layer is a layer of view-tuned cells, which have invariance properties [2]. Reprinted with permission from Nature Publishing Group.
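The MAX operation at the heart of the C layers described above can be sketched as follows (a minimal illustration, not the CSpike implementation):

```cpp
#include <algorithm>
#include <vector>

// MAX pooling between simple (S) and complex (C) layers: a complex cell's
// response is the maximum response over the simple cells in its pool. The
// response is therefore unchanged wherever in the pool the preferred
// stimulus falls, which is what buys translation (and scale) invariance.
double complexCellResponse(const std::vector<double>& simpleResponses) {
    if (simpleResponses.empty()) return 0.0;
    return *std::max_element(simpleResponses.begin(), simpleResponses.end());
}
```

Note that shuffling the pool leaves the output unchanged, which is exactly the invariance property exploited by the hierarchy.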

Figure 3-9: Comparison between the hierarchical HMAX model [2] (left) and our model (right). We use spiking neurons throughout, including for learning, whereas HMAX does not use spiking neurons for processing or learning.

3.3 Contrast and Color

Humans are able to maintain brightness constancy over six orders of magnitude of absolute intensity levels [76]. This is done by encoding contrast, which involves scaling the absolute intensity signal and can easily be done in spiking neural networks by dividing by the average input current. Many object recognition codes perform well on a particular data set but fail in the real world due to vast changes in intensity levels; humans, however, can recognize the same person on a bright sunny day or inside a dark room with no effort. The same holds for color. Color is encoded mainly by the local contrast of the cone signals. By using local cone contrast rather than absolute cone signals, we achieve color constancy under varying light sources [76]. Color is a very important element of visual processing and object recognition. For example, Figure 3-10 shows the same image in grey scale and in color: in the grey-scale image it is very difficult to see the butterflies, but in the color image the butterflies are very obvious.

Figure 3-10: Images of plants and butterflies. Left: grey scale; Right: color.

But color processing is often omitted in vision models due to the complexity involved. Issues such as color constancy, additive/subtractive mixing, and simultaneous color contrast make color a

48 34 challenging topic. Simultaneous color contrast refers to the effect that two colors side by side interact with one another and change our perception. Additive mixing means that the spectral power distribution of the sum of two lights is the sum of respective spectral power distributions, for example when two light sources of different wavelengths are mixed. The product in additive mixing appears lighter than individual components as each light adds energy to the mixture [Figure 3-11]. Subtractive mixing is when different pigments are mixed such as in inks. As the product of mixing absorbs more light, it always appears darker. The perception of color is a complex process. The color signal entering the eye depends on the illumination source and also the reflective property of the object. Each point in a scene is illuminated by a light source with specific spectral power distribution. Each surface is also characterized by how it reflects the light. This is called as the spectral reflectance function. The product of the reflectance and the illumination yields the spectral power distribution of the color signal entering the eye. This signal is processed by the three cone photoreceptors depending on their respective spectral sensitivities. The brain then works with the responses provided by the three cone photoreceptors. Figure 3-11: Left: Additive mixing of light sources for e.g. CRTs; Right: Subtractive mixing for e.g. in printing, pigments, and inks.

To model color processing, it is important to model the process starting from the receptor level up to the higher levels. The tri-chromatic theory provides an explanation of the process at the receptor level, whereas the opponent process theory describes the neural mechanisms involved in further processing. The tri-chromatic color theory states that any color can be obtained by mixing the three primary colors. There are three cone types, corresponding to long, medium, and short wavelengths, and the absence of one or more of these can lead to color blindness. The opponent theory explains why we cannot see certain color pairs, such as red and green, simultaneously. It also explains the color illusion that when one looks at a red (or blue) patch for about a minute and then immediately looks at a white area, a green (or yellow) patch is seen. We also have double-opponent cells in addition to single-opponent cells. They reside in collections (blobs) in layer 4 of V1 and typically have larger receptive fields than single-opponent cells. They allow us to perceive similar colors even under varying illumination conditions, much as we are able to recognize the same object under varying illumination conditions. The opponent process theory states that the responses from the three cones are combined to produce three antagonist color pairs: red/green, blue/yellow, and black/white. The three opponent channels are as follows:

$$\begin{pmatrix} C_1 \\ C_2 \\ C_3 \end{pmatrix} = \begin{pmatrix} R - G \\ (R + G - 2B)/2 \\ (R + G + B)/3 \end{pmatrix} \qquad (3.1)$$

Note that the first two channels also need to be normalized to lie between 0 and 255 to be consistent with the $C_3$ channel. The normalized channels are:

$$RG = \frac{C_1}{2} + \frac{255}{2}, \qquad BY = \frac{C_2}{2} + \frac{255}{2} \qquad (3.2)$$

In matrix form, the conversion from RGB space to the opponent color space can be expressed as follows:

$$\begin{pmatrix} Gr \\ RG \\ BY \end{pmatrix} = \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/2 & -1/2 & 0 \\ 1/4 & 1/4 & -1/2 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} + \begin{pmatrix} 0 \\ 255/2 \\ 255/2 \end{pmatrix} \qquad (3.3)$$

A quick eigen-analysis shows that these channels are independent. The opponent channels also provide certain invariance properties, which are best understood by analyzing how light-color changes are modeled. Light color changes and shifts in the red, green, and blue channels can be modeled as:

$$\begin{pmatrix} a & 0 & 0 \\ 0 & b & 0 \\ 0 & 0 & c \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} + \begin{pmatrix} d \\ e \\ f \end{pmatrix} \qquad (3.4)$$

Here a, b, c model the changes in color and d, e, f model arbitrary light offsets. If a = b = c, i.e., the light changes by the same factor in all channels, this is a light intensity change. If d = e = f, i.e., equal shifts in all channels, this is a light intensity shift. Light intensity changes include shadows and changes in lighting geometry; light intensity shifts model changes due to increased diffuse light. Thus, the channels $C_1$ and $C_2$ are invariant with respect to a light intensity shift, as any offsets in intensity cancel out in the subtractions. The intensity channel $C_3$ has no invariance properties. Figure 3-12 shows the effect of an intensity shift on the plant and butterfly picture. The original color image, grayscale, red, green, blue, red-green (RG), and blue-yellow (BY) channels are shown. The left-most column is the original intensity, the middle is a 50% intensity shift, and the rightmost is a 100% shift. Each channel was normalized to between 0 and 1. The RG and BY channels are invariant to shifts in intensity, whereas the other channels are not. In addition, the butterfly

and red flowers pop out in the red-green channel, and the blue flowers pop out in the blue-yellow channel. This popping-out effect is further illustrated by the bottom two rows, which show the channels thresholded at 50%; i.e., only values above 0.5 are shown.

Figure 3-12: Effect of intensity shift on the channels. The leftmost column is the original intensity, the middle is a 50% shift, and the rightmost is a 100% intensity shift. RGt and BYt are the RG and BY channels with a threshold of 50%, respectively. Intensities for each channel were normalized to between 0 and 1.

For robust object recognition, invariant to light intensity and color changes, several descriptors have been proposed. The SIFT descriptor by Lowe [41] constructs edge orientation histograms to describe local shape information. The gradient of an image is shift-invariant, as taking derivatives cancels out the offsets. SIFT uses normalized gradient magnitudes and directions, and hence a scaling of intensity has no effect either. The SIFT descriptors are also invariant to light color changes. OpponentSIFT [90] computes SIFT descriptors using the three opponent color channels. In a comparison of different color descriptors, Sande et al. [90] found that OpponentSIFT is the best choice when no prior knowledge about the dataset or image categories is available. They compared color descriptors such as RGB, Opponent, Hue, HSV-SIFT, color moments, and HueSIFT on the PASCAL VOC [91] image benchmark and a NIST TRECVID [92] based video benchmark. The color part (first two channels) of this model is not invariant to light color changes, whereas the intensity part (last channel) is invariant to light color changes and shifts. Color is often ignored in object recognition systems. However, most of human vision is in color, which produces effects such as certain colors popping out against specific backgrounds; artists have often exploited this by placing certain color combinations next to each other to make things stand out. As a preliminary step, the neural network was extended to process and recognize color images of fruits. Color processing is incorporated in the present code by using the three normalized opponent channels described above. The values for all the channels are between 0 and 255. These are converted to realizable current values at the input stage: the intensity value of each pixel is scaled between I_min and I_max.
Here I_min is the minimum current needed to sustain spiking, and I_max is the current corresponding to the maximum allowed firing rate f_max. They are calculated using the following equations from Koch [93]:

$$I_{\min} = \frac{V_{th}}{R}, \qquad I_{\max} = \frac{1}{R}\left(V_{th} - V_{rest}\,E\right)\left(1 - E\right)^{-1}, \qquad E = e^{(t_{ref} - 1/f_{\max})/\tau} \qquad (3.5)$$

Here, V_th is the voltage threshold to produce a spike, V_rest is the resting potential, t_ref is the refractory period after a spike, τ = RC is the time constant, R is resistance, and C is capacitance.

Chapter 4

Learning

Biological learning involves acquiring new knowledge, skills, and understanding. Though it is easy to see the effects of learning in biological systems, the underlying mechanisms are quite complex. There are also issues such as motivation, rewards, and habituation involved in biological learning. No scientist has been able to fully replicate biological learning on a machine, and so-called generalized intelligence with human-like learning seems a distant goal. Though machine systems can perform some tasks, none come close to human adaptability and robustness. Kelley and Long [94] discuss approaches that could lead to generalized intelligence. Long and Kelley [95] argue that developing conscious robots is an achievable goal in this century; efficient learning algorithms, along with cognitive development, hardware implementations, and sensor processing, are all ingredients for building conscious robots. We encounter broadly two types of learning in computer-based models: supervised learning (e.g. support vector machines (SVM), backpropagation of errors, Bayesian methods), in which labels are provided, and unsupervised learning (e.g. clustering, adaptive resonance theory (ART), self-organizing maps (SOM), Hebbian learning), where label information is missing. There are also systems that use a combination of these, as well as other psychologically motivated approaches such as reinforcement learning. In the next few chapters we discuss biology-based learning methods such as Hebbian learning, spike time dependent plasticity (STDP), the learning method that we use in our network, and neural and synaptic genesis (the birth of neurons and synapses).

Hebbian Learning and STDP

Hebb [97] postulated that when an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased. Long term potentiation (LTP) provided strong experimental support for Hebb's postulate. LTP is the long-term increase in synaptic strength; its opposing effect is long term depression (LTD).

Figure 4-1: Variation of synaptic change with pre- and post-synaptic spike time difference [96]. Reprinted with permission from the Society for Neuroscience.

While the above quote from Hebb is well known, another quote from the book is: The old, long-established memory would then last, not reversible except with pathological processes

in the brain; less strongly established memories would gradually disappear unless reinforced. For a computational approach to be stable and robust, there must be an effective method for forgetting as well as for learning. Spike time dependent plasticity (STDP) can be viewed as a more quantitative form of Hebbian learning. It emphasizes the importance of causality in synaptic strengthening or weakening. When pre-synaptic spikes precede post-synaptic spikes by tens of milliseconds, synaptic efficacy is increased, whereas when post-synaptic spikes precede pre-synaptic spikes, the synaptic strength decreases. Figure 4-1 shows the experimentally observed variation of synaptic change with spike time difference. Both in vitro and in vivo experiments have provided evidence for this mechanism [98-101]. Such a temporal LTP and LTD mechanism can possibly be explained by cellular mechanisms involving the rise in postsynaptic Ca2+ levels. Glutamate is the main excitatory neurotransmitter in the central nervous system (CNS). An electrical impulse in the pre-synaptic cell causes an influx of Ca2+ ions along with release of the neurotransmitter. The neurotransmitter diffuses across the synaptic gap and excites (or inhibits) the post-synaptic neuron by interacting with receptor proteins. There are three different sub-types of ionotropic glutamate receptors (ligand-gated ion channels that open or close based on the binding of a chemical messenger, as opposed to voltage-gated ion channels): NMDA, AMPA, and kainate. The NMDA (N-methyl-D-aspartate) receptor is believed to act as a coincidence detector for pre- and post-synaptic spiking in spike-timing dependent LTP (tLTP) induction. GABA receptors respond to the main inhibitory neurotransmitter (GABA) in the CNS.
Depending on the type of connection among neurons, the STDP window response can be divided into three categories: excitatory to excitatory, excitatory to inhibitory, and inhibitory to excitatory. Excitatory to excitatory is the most common type of STDP window, where pre-before-post leads to LTP and post-before-pre leads to LTD.

An excitatory-to-inhibitory response is observed in connections from excitatory neurons to GABAergic neurons [102]. Here pre-before-post within a 60 ms window leads to LTD, whereas post-before-pre leads to LTP. This is just the opposite of the STDP response in excitatory-excitatory connections. In a similar study of excitatory-to-inhibitory connections in mouse brain stem slices [103, 104], it was found that pre-before-post induced LTD but post-before-pre induced no change. Inhibitory-to-excitatory response windows are more variable. For large, non-overlapping post-before-pre times LTP is induced, whereas for small, overlapping windows LTD is induced. Large time differences between pre- and post-synaptic spikes lead to LTD, whereas small time differences lead to LTP. These connections can also exhibit the usual excitatory-excitatory behavior. Thus, there is more than one mode of operation of STDP, but the most common one, found in excitatory-to-excitatory connections, is the one discussed here. Hebbian learning is often implemented as spike-time dependent plasticity (STDP) in spiking neuron codes [17]. Previously, we implemented STDP for character recognition [9] but used the active dendrite dynamic synapse (ADDS) model of Panchev [7]. In the present study, we use the simple leaky integrate and fire (LIF) model and a more efficient learning method. One well-known STDP algorithm modifies the synaptic weights as follows:

$$\Delta w = \begin{cases} A_+ \, e^{\Delta t/\tau_+} & \text{if } \Delta t < 0 \\ -A_- \, e^{-\Delta t/\tau_-} & \text{if } \Delta t \ge 0 \end{cases} \qquad (4.1)$$

$$w_{new} = \begin{cases} w_{old} + \eta\,\Delta w\,(w_{max} - w_{old}) & \text{if } \Delta w \ge 0 \\ w_{old} + \eta\,\Delta w\,(w_{old} - w_{min}) & \text{if } \Delta w < 0 \end{cases} \qquad (4.2)$$

where Δt = t_pre − t_post is the time difference between the presynaptic and postsynaptic spikes. If the presynaptic spike occurs before the postsynaptic spike, it probably helped cause the postsynaptic spike, and consequently we encourage this by increasing the synaptic weight. If the presynaptic spike occurs after the postsynaptic spike, we reduce the weight of the synapse, since there was no cause and effect in this case. STDP can be used for inhibitory or excitatory neurons. The above algorithm is not necessarily the optimal learning approach for spiking networks, even though it has worked well in a number of applications. One issue with STDP involves causality: when a post-synaptic neuron fires in a time-marching code (or in real life), it is unknown at that time whether one of the presynaptic neurons will fire in the future (and at what time). In the laboratory, current can be injected near individual neurons after they fire, but this is not necessarily an efficient computational algorithm. The above algorithm can easily be implemented for very small systems where memory storage is not an issue, but for millions of neurons and billions of synapses we need extremely efficient algorithms that use minimal computer operations, storage, and gather/scatter operations. It is also helpful to keep in mind that most STDP research is still in its infancy. Most STDP experiments are performed on pyramidal cells and excitatory synapses, while it is well known that there is a large variety of cell types with different biophysical properties. Also, it is still unclear how STDP, operating on a timescale of milliseconds, can account for behavioral and learning events occurring on much longer time scales. It is therefore worthwhile to analyze the consequences of STDP in computational simulations with large neural networks.
Here, however, we focus on developing and testing a stable learning method that is motivated by STDP.

We have previously implemented Hebbian learning very efficiently in the JSpike software [18]. It was implemented in an essentially homeostatic, event-driven manner: if a neuron fires, the learning method is called by the neuron that fired. This neuron has ready access to all the presynaptic neurons that connect to it. When the postsynaptic neuron fires (and learning is turned on), it can loop through all the presynaptic neurons and compute which ones also fired during the interval between this and the previous postsynaptic firing. Since the current is reset each time a neuron fires, we simply need to know which presynaptic neurons fired between the last two firings of the postsynaptic neuron. These are the neurons connecting to the synapses that are strengthened; any neuron that has not contributed to the postsynaptic neuron's firing has its weight decreased. This approach is spike-time dependent, but it differs from the standard STDP algorithm. The weight updates are done using:

$$w = \begin{cases} w_{old} + A_+ \, e^{-\Delta t/\tau_+}, & t_{post2} \le t_{pre} \le t_{post1}, \quad \Delta t = t_{post1} - t_{pre} \\ w_{old} - A_- \, e^{-\Delta t/\tau_-}, & t_{pre} \le t_{post2}, \quad \Delta t = t_{post2} - t_{pre} \end{cases} \qquad (4.3)$$

In the present work, we use this approach and also implement a winner-take-all mechanism, described below.

4.2 Homeostasis and Winner-take-all

The homeostasis mechanism implemented here means that the net sum of incoming weights associated with a neuron remains the same [19]. This ensures that the synaptic weights are bounded and the learning is stable. The algorithm can be implemented in the following way:
1) Let (i,j) be a synapse with t_post2 < t_pre < t_post1; there are p such synapses, (i,j) ∈ A. Calculate the sum of the p increments, Σ Δw+.
2) Let q be the number of synapses with t_pre < t_post2, (i,j) ∈ B.

3) If (i,j) ∈ A: w_new = w_old + Δw+. Else, if (i,j) ∈ B: Δw− = −(Σ Δw+)/q and w_new = w_old + Δw−, where Δw+ = A+ e^(−Δt/τ+) and Δt = t_post1 − t_pre.
Here, t_pre is the pre-synaptic spike time, and t_post1 and t_post2 are the last two post-synaptic spike times. We have used τ+ = 15 ms, A+ = . The above algorithm assumes that the weights do not saturate to 1 or become negative. To handle those cases, synapses that are going to saturate are identified beforehand; instead of incrementing them by Δw+, they are set to 1, and (w_old + Δw+ − 1) is subtracted from Σ Δw+ to reflect the correct sum. Synapses that would become negative are treated analogously. A latching mechanism is also implemented so that synapses above a threshold of 0.95 are not modified again in the future. We found that this technique improves the overall accuracy; it is akin to the formation of permanent memories in the brain. Winner take all (WTA) is a concept that promotes competitive learning among neurons. After the competition, only one neuron remains the most active for some input, and the rest of the neurons eventually become inactive for that input. It is worth examining WTA and related learning methodologies in light of their generalizability and discriminatory capacity. Biologically plausible learning methods can be broadly classified as dense, local, or sparse. Competitive learning such as WTA is a local learning rule, as it activates only the unit that best fits the input pattern and suppresses the others through fixed inhibitory connections. This type of grandmother-cell representation is very limited in the number of discriminable input states it can represent, and it also generalizes poorly, as the winner unit turns on simply when the input is within some Hamming distance of its preferred input. Dense coding can be seen as the other extreme, where a large number of units are active for each input pattern.
Thus, it can code a large number of discriminable input states, but the mapping and learning then become more complicated to implement with simple neuron-like units [105]. Sparse coding [106, 107] is a

tradeoff between these two extremes, in which input patterns are represented by activity in a small proportion of the available units. It is very likely that, depending on the requirement, one or a combination of these methods is used in the brain. The grandmother-cell type of response cannot be totally ruled out; for example, grandmother-cell representations are found in monkeys for some crucial tasks [108]. Recently, Quiroga et al. [24] found cells tuned to respond to images of Jennifer Aniston; similarly, they found another cell that selectively responded to Halle Berry or an abstract concept of her. Collins and Jin [109] show that a grandmother-cell type representation could be information-theoretically efficient provided it is accompanied by distributed-coding type cells. Maass [110] shows that WTA is quite powerful compared to the threshold and sigmoidal gates often used in traditional neural networks: any Boolean function can be computed using a single k-WTA unit [110]. This is very interesting, as at least two-layered perceptron circuits are otherwise needed to compute complicated functions. Maass also showed that any continuous function can be approximated by a single soft-WTA unit (a soft winner-take-all operation assumes values depending on the rank of the corresponding input in linear order). Another advantage is that approximate WTA computation can be done very fast, in linear size, in analog VLSI chips [111]. Thus, complex feed-forward multi-layered perceptron circuits can be replaced by a single competitive WTA stage, leading to low-power analog VLSI chips [110]. There have been many implementations of winner-take-all (WTA) computations in recurrent networks in the literature [112, 113], and many analog VLSI implementations of these circuits [113, 114]. The WTA model implemented here is influenced by the WTA implementation on recurrent networks by Oster and Liu [113].
In their implementation, the neuron that receives spikes with the shortest inter-spike interval is the winner. But it is not

clear in their implementation how, starting from random weights, a new neuron can learn a new category. A modified version of winner take all (WTA) with Hebbian learning is implemented here to demonstrate how different neurons can learn different categories. WTA is applied to both learning layers during training and is switched off during testing. The WTA is implemented as follows:
1) At every time step, find the post-synaptic neuron with the smallest spike time difference t_post1 − t_post2 (these last two post-synaptic spike times are readily available). This neuron is declared the winner.
2) The winner inhibits the other neurons from firing by sending them an inhibitory pulse. If the winner neuron has not yet learned any feature, it learns the new feature by the Hebbian learning method above. The neuron remains the winner unless another neuron attains a lower spike time interval or a new image is presented.
This learning approach is followed everywhere except in the last (uppermost) learning layer, where the winner is declared according to the supplied category/label information about the input image rather than the spike time difference. The overall approach is thus semi-supervised, with unsupervised learning in the lower learning layer and supervised learning in the last learning layer. We assume that all membrane potentials are discharged at the onset of a stimulus; this can be achieved, for example, by a decay in the membrane potential. There have been several implementations of sparse coding schemes as well. Földiák [105] shows how a layer of neurons can learn a sparse code by using Hebbian excitatory connections between the input and output units and anti-Hebbian learning among the output connections.
Olshausen and Field [115] showed that minimizing an objective function of high sparseness and low reconstruction error on a set of natural images yields a set of basis functions similar to the Gabor-like receptive fields of simple cells in primary visual cortex. One interesting study is by

Einhauser et al. [116], who developed a neural network model that could develop receptive field properties similar to those of the simple and complex cells found in visual cortex. The network could learn from natural stimuli, obtained by mounting a camera on a cat's head to approximate the input to a cat's visual system. They did not, however, use a spiking neural network.

4.3 Neuronal and Synaptic Genesis

Neurogenesis and synaptogenesis (the birth of neurons and synapses) have been shown to occur in the brain in certain settings [117, 118]. It is also known that they affect some learning and memory tasks [ ]. Many studies have indicated that the brain is much more plastic than previously thought [122]. For example, brain scans of people who lose a limb in an accident show that the parts of the brain maps corresponding to the lost limb are taken over by the surrounding brain maps. This plasticity and change can be made possible by the birth and death of neurons and connections. For efficient simulations it will be important to be able to add neurons and synapses, and also to remove them when necessary. We found that our network would benefit from modeling these processes. For illustration, we trained a network with the architecture shown in the right plot of Figure 3-9 using our learning approach. The goal was to recognize handwritten digits from 0 to 9; more details about the network structure and training problem can be found in Section . Our focus here is on the first learning layer in the architecture, which was trained in an unsupervised way. Initially the synaptic weights in this layer were set to random values. Figure 4-2 shows 2-D arrays of synapse values after training for this learning layer, plotted as 8x8 projections onto the previous layer. The synaptic weights are plotted as intensities, with white representing the highest synaptic strength and black the lowest. These arrays represent features that were learned.
Note that some of them remained unchanged (they retain their random starting values) during training and could just as well have been

removed from the simulation for efficiency. Better yet, we could have started with fewer synapses and added them as needed. This would ensure that unnecessary connections are absent and would save computer memory. Figure 4-2: Images of arrays of final synapse weights, plotted as 8x8 projections to the previous layer, in one layer of the network after learning. White represents the highest synaptic strength and black the lowest.

Chapter 5

CSpike

This chapter describes the CSpike code developed here. Section 5.1 describes the design and the object-oriented approach followed; it also presents key classes and a flow chart of the code. Section 5.2 discusses the neuron modeling in the code. Section 5.3 describes the Gabor filtering and the meaning of the different parameters used. Section 5.4 discusses the performance and scalability of the code. Section 5.5 discusses how translational invariance and sub-sampling are incorporated in the code.

5.1 Object-Oriented Approach

We have followed an object-oriented programming (OOP) approach [56] to build the present code in C++. The OOP principles of polymorphism, encapsulation, and inheritance have been followed for easier understanding and future development. A few classes in the code are the Neuron, Layer, Network, Synapse, and SynapseMesh classes; they are shown in Figure 5-1. The Neuron class and the Synapse class are the two basic classes from which other classes are derived. A Network consists of arrays of Layers, and a Layer consists of a 2-D array of Neurons. Figure 5-1: Some classes in the code. Arrows denote inheritance between classes.

The SynapseMesh class represents the connections and can implement connections of different kinds, such as all-to-all or many-to-few. The code supports four connection types: AllToAll, ManyToFew, StencilMesh, and OneToOne. The AllToAll connection means all neurons in the pre-synaptic layer are connected to all neurons in the post-synaptic layer. The ManyToFew connection means each neuron in the post-synaptic layer is connected to a rectangular region of neurons in the pre-synaptic layer. The StencilMesh connection is similar to ManyToFew except that overlap is allowed at the pre-synaptic level and each post-synaptic neuron connects to a pre-synaptic region centered on itself; this connection type is useful for computing correlations/convolutions. The OneToOne connection connects each neuron in the pre-synaptic layer to the corresponding neuron in the post-synaptic layer; the shape and number of neurons in both layers must be the same in this case. Figure 5-2 shows the configuration of a typical network and how synapse meshes are interleaved between layers of neurons. The 2-D configuration is chosen so as to process images directly. Another advantage of using C++ is the free availability of the widely used parallel libraries MPI [123] and OpenMP [124], which could be used to parallelize the code in the future. Figure 5-2: A typical network showing how synaptic meshes are interleaved between 2-D arrays of neurons.

Neuron class
  Data: mytype Vnew, Vold, Inew, Iold, inputcurrent, lastspiketime; int numspikes; bool refperiod; static mytype R, C, Vth, Vreset, Vhigh
  Methods: Neuron constructor, simpleifstep, updatecurrent, getvoltage, getcurrent, setinputcurrent, getnumspikes

Layer class
  Data: int xdim, ydim, layerid; Neuron **Neurons2D
  Methods: Layer constructor, simpleifstep, updatecurrent, getvoltage, getcurrent

Network class
  Data: int numlayers; Layer *Layers; SynapseMesh **arraysynapsemesh; ConnectionType *ctypearray
  Methods: Network constructor, processsynapticmesh, setrandsynapticmesh, getweights, setweights

Table 5-1: Neuron, Layer, and Network classes.

There are a few key classes, such as Neuron, Layer, Network, Synapse, and SynapseMesh, and they are described here. Table 5-1 shows the data and methods of the Neuron, Layer, and Network classes. The code declares a new data type called mytype, which can be defined to be a float, long, or double according to need. The Neuron class holds information about the neuron, such as its current, voltage, incoming current, last spike time, number of spikes, and whether it just spiked and is in the refractory period (the boolean refperiod). Other neuron parameters such as resistance, capacitance, and threshold are declared as statics, as they are the

same for all neurons. The Neuron class defines methods such as simpleifstep, which implements the leaky integrate-and-fire model, along with the usual constructors and get and set methods. The Layer class contains a 2-D array of neurons, Neurons2D, and stores the number of neurons in the x and y dimensions in the variables xdim and ydim. The Layer class defines its own simpleifstep, which essentially calls the corresponding simpleifstep method of the Neuron class for every neuron in the layer. The Network class consists of arrays of layers. It also contains the pointers to the synaptic meshes sandwiched between the layers, together with information about the type of each synaptic mesh. Its method processsynapticmesh calls the virtual method processsynapses, declared in the class SynapseMesh and defined in the classes inheriting from it. This arrangement provides flexibility in the type of connections by exploiting the polymorphism principle of object-oriented code design. Table 5-2 shows the data and methods of the Synapse, SynapseMesh, and AllToAllMesh classes. The Synapse class defines the data and members associated with each synapse. It stores the synapse strength, of type weighttype, which can be defined to be any data type such as char, float, or double, along with a boolean synapseactive, which controls whether the synapse may be modified. There is also a static delay, which can account for delay in synaptic transmission but is not used in this work. The usual constructors and get and set methods are also defined for this class. SynapseMesh is an abstract class used to define generic connections between layers. It contains layer1dimx and layer1dimy, the dimensions of the left layer, and layer2dimx and layer2dimy, the dimensions of the right layer. It also contains pointers to the left and right layers.
Since it is an abstract class, all the methods it declares are pure virtual functions that are later defined in the classes that implement it. An example of such a class is the AllToAllMesh class, which stores a 4-dimensional array of synapses, with indices corresponding to the x and y dimensions of the left and right layers, and which defines all the virtual methods.

Synapse class
  Data:    static mytype delay; weighttype strength; bool synapseactive
  Methods: Synapse constructor, setrandsynapsestrength, getsynapsestrength, setsynapsestrength

SynapseMesh class (abstract)
  Data:    int layer1dimx, layer1dimy, layer2dimx, layer2dimy, synapsemeshid;
           Layer *layerleft, *layerright; float weightchangenorm
  Methods (virtual): getlayerleftdimx, getlayerleftdimy, getlayerrightdimx, getlayerrightdimy,
           getlayerleft, getlayerright, processsynapses, virtual destructor

AllToAllMesh class (inherits SynapseMesh)
  Data:    Synapse ****synapses4d
  Methods: AllToAllMesh constructor; implements the virtual methods of SynapseMesh

Table 5-2: Synapse, SynapseMesh, and AllToAllMesh classes.

Figure 5-3 shows the flow chart of the code. The first step is reading the file containing the input parameters and initializing Qt [125], whose image-processing APIs are used for reading and writing image files. This is followed by construction of the Gabor filters with different orientations, and then initialization and memory allocation of the network. If the training parameter is set to true, synaptic weights are read from the input file; otherwise they are set to random values. Once these initialization steps are done, the steps of getting an input image, setting and updating the input current, processing neurons and synapses, and advancing the time step are repeated for the specified number of time steps.

Another feature of the code is that it can use chars to store weights instead of floats or doubles. A char requires just 1 byte, in contrast to floats or doubles, which usually require 4 or 8 bytes. This drastically reduces the amount of memory needed to store the synapses or weights, which is important because neural network codes usually have many more synapses than neurons. The human brain has about 10^15 synapses, which corresponds to roughly 10^15 bytes of storage assuming each synapse requires one byte. The fastest supercomputer currently in the world, the Cray XT5 Jaguar [126], has about 3x10^14 bytes of memory. The code can convert the synaptic weights on the fly from char to float or double as required by the computation.

Documentation of the code is done using Doxygen [127]. Figure 5-4 shows a snapshot of the HTML documentation. It presents the classes, methods, data members, and files in an organized fashion, with convenient links to view the code section of interest.

Figure 5-3: Flow chart of the code.

Figure 5-4: Snapshot of the HTML documentation of the code.

Neuron Modeling in the Code

We use the leaky integrate-and-fire (LIF) model for the present work. A brief description follows. For a neuron i in a system of neurons, the voltage is modeled by:

\frac{dv_i}{dt} = \frac{1}{\tau}\left[ (I_{input} + I_i)\,R - v_i \right]   (5.1)

Where v is voltage, \tau = RC is the time constant, R is resistance, C is capacitance, I_{input} is a possible input current (usually zero), and I_i is the current from the synapses. When the membrane potential reaches some threshold value v_{th}, the neuron produces a spike, and immediately afterwards the membrane potential is reset to a value v_{reset}. After this event, the membrane potential remains at the resting potential for the refractory period t_{ref}. Figure 5-5 shows the process for constant current. The current from the synapses can be modeled as:

I_i = \alpha \sum_{j=1}^{N} \sum_{k=1}^{M} w_{ij}\, \delta(t - t_j^k)   (5.2)

Figure 5-5: Voltage evolution of a single neuron using the leaky integrate-and-fire model with constant input current.

Where N is the number of presynaptic neurons and M is the number of times the j-th presynaptic neuron has fired since the i-th neuron fired. The coefficients w_{ij} and \alpha represent the synapse weights and the increment of current injected per spike, respectively.

Figure 5-6: Frequency-current curves for a leaky IF neuron. All time steps are in milliseconds.

One way to solve Eq. 5.1 is with an O(\Delta t) Euler discretization. The voltage at the next time step is then computed using the following equation:

v_i^{n+1} = v_i^n \left( 1 - \frac{\Delta t}{\tau} \right) + \frac{\Delta t}{\tau} (I_{input} + I_i) R   (5.3)

Traditionally, one way to estimate computation time has been to count flops (floating-point operations). This gave good estimates when floating-point operations were slow. That is no longer the case, as many operations are implemented in the hardware itself; instead, issues such as locality of reference and cache latency often play a major role. For example, on some Intel Core microarchitectures [128], the latencies of the first-, second-, and third-level caches are about 3, 15, and 110 cycles, respectively. The second- and third-level latencies can easily exceed the latencies of some floating-point operations, so locality of data becomes very important. It can still be useful, however, to compare flop counts to get rough estimates of the computational costs of different algorithms. Eq. 5.3 requires about 8 flops per time step, assuming each floating-point operation such as addition, multiplication, or division takes 1 flop, which is a reasonable assumption for present-day computer architectures and large problem sizes.

Another approach is to solve Eq. 5.1 analytically over a time interval in which the current is constant. Since the current depends on the weights, and weight changes occur only when there are pre- and post-synaptic spikes, Eq. 5.1 can be solved analytically for the voltage in such an interval. Eq. 5.4 gives the analytical solution; it takes considerably more flops, since the exponential alone takes 5-10 flops depending on the computer.

v(t) = IR \left( 1 - e^{-t/\tau} \right) + v(0)\, e^{-t/\tau}   (5.4)

We choose Eq. 5.3, as that approach is faster and more general: it can also be used for models such as Hodgkin-Huxley and Izhikevich, which have no simple analytical solution. Since Eq. 5.3 is only first-order accurate, the time step needs to be small for accuracy. Higher-order schemes could be used, but they are difficult to implement because of the high gradients and are computationally expensive [64]; also, a fourth-order Runge-Kutta scheme is only slightly more accurate than the Euler scheme for the LIF model [64]. Figure 5-6 shows the frequency vs. current plot of the numerical simulations with different step sizes compared to the analytical solution given by the following equation:

\langle f \rangle = \left[ t_{ref} + \tau \ln\!\left( \frac{IR}{IR - V_{th}} \right) \right]^{-1}   (5.5)

Here, \langle f \rangle is the firing frequency, t_{ref} = 2.68 ms, R = 38.3 M\Omega, V_{th} = 16.4 mV, and C = 0.207 nF. For small time steps such as dt = 0.025 ms, the curve overlaps the analytical solution. For time steps of 0.1 ms and 0.4 ms the numerical solution departs from the analytical solution, especially at higher current values, because with larger time steps the timings of the spikes are not precisely modeled. For a fixed step size, as the current increases the inter-spike interval decreases, and the discrepancy between the analytical and numerical solutions grows.

5.3 Performance

Figure 5-7 and Figure 5-8 show the dependence of CPU time on the number of synapses and neurons on a 2.2 GHz Intel Core 2 Duo laptop running Mac OS X. The compiler used was gcc with O3 optimization. The network had three layers, with each layer having the same number of neurons, and all-to-all connections. All the simulations were run for 500 time steps with the same time step as in Long and Gupta [129]. The dependence is almost linear in both the number of neurons and the number of synapses. For Figure 5-7, the number of synapses varied from 20,000 to 1.5 billion, while the number of neurons varied from 300 to 82,000. The largest network had 1.5 billion synapses, took about 48 minutes of CPU time, and required about 3 GB of memory. The smallest case had about the same number of synapses as a worm, and the largest about the same as a cockroach. For Figure 5-8, there were no synapses. The largest case in this plot had 30 million neurons (about the same as a mouse) and took 19 minutes of CPU time. The smallest case had 300 neurons (about the same as a worm) and took less than 0.01 seconds of CPU time for the same amount of simulated biological time. The simulations in Figure 5-7 are about 4 times faster than in Long and Gupta [129], which used Java, whereas the simulations in Figure 5-8 took about the same time.

Figure 5-7: CPU time variation with number of synapses. Both axes are on logarithmic scale.

Figure 5-8: CPU time variation with number of neurons. Both axes are on logarithmic scale.

It is interesting that supercomputers now have computational power and memory approaching those of the human brain, even though issues such as brain wiring, sensory inputs, and learning remain to be addressed [11]. Even more interesting is the fact that computers are precise and fast, whereas brains are imprecise and their neurons quite slow. The fastest supercomputer at present is the Cray XT5 Jaguar, with a computational power of about 2.3x10^15 flops. It has 3x10^14 bytes of memory and 224,256 cores [130], and consumes about 2.35 megawatts of power. In contrast, the human brain has about 10^11 neurons and roughly 10^15 synapses [131]. Assuming each synapse represents a byte, it can store roughly 10^15 bytes. Its computational speed is estimated to be about 100 teraflops [132]. Table 5-3 shows a comparison of different metrics of the human brain and the Jaguar supercomputer. Most notable are the volume and power requirements, as shown in Figure 5-9. The Jaguar consumes about 6 orders of magnitude more power than the human brain and occupies about 6 orders of magnitude more volume.

               Human Brain           Jaguar Supercomputer
Units          100 billion neurons   224,256 cores
Speed          0.1 Peta Flops        2.33 Peta Flops
Memory         1 Peta Byte           0.3 Peta Byte
Volume         (m^3)                 (m^3)
Power          20 Watts              7.6 Mega Watts

Table 5-3: Comparing different metrics of the human brain and the Jaguar supercomputer.

Figure 5-10 compares biological and man-made systems in terms of memory and speed. A TI-83 calculator is faster than a worm in terms of both memory and speed. A 2 GHz laptop is comparable to a cockroach, and a mouse is comparable to a dual quad-core server. The Jaguar supercomputer is faster than the human brain but has about 3 times less memory according to the above estimates. A word of caution, though: these estimates do not take into account the wiring, connectivity, and algorithms of the brain. Also, more memory is typically required by computer programs simulating the brain, to store computational constructs.

A few years ago, researchers at EPFL simulated a neocortical column of a rat's brain [133]. A neocortical column is only 2 mm long and 0.5 mm in diameter, and contains roughly 10,000 neurons and 30 million synapses. The machine that simulated this column was an IBM Blue Gene/L supercomputer capable of 18.7 trillion calculations per second. This study suggests we would need even more flops to simulate the human brain than previously thought.

Figure 5-9: Volume vs. power requirements of biological and man-made systems [19].

Figure 5-10: Comparison of biological vs. man-made systems [19].

The present code uses biologically plausible vision-processing approaches and spiking neural networks to design a vision-processing system similar to the workings of the brain. The approach is scalable and no more difficult to parallelize than 2nd-generation neural networks [11, 19].

5.4 Gabor Filtering

Gabor filters [134, 135] are used in image processing for edge detection and feature extraction, and also for simulating simple/complex cells in the visual cortex. Their frequency and orientation responses are similar to those of cells found in the human visual system. Mathematically, a Gabor function can be defined as the product of a Gaussian and a cosine function in 2-D as follows:

G_{\lambda,\theta,\psi,\sigma,\gamma}(x, y) = \exp\!\left( -\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2} \right) \cos\!\left( 2\pi \frac{x'}{\lambda} + \psi \right)   (5.6)

x' = x \cos\theta + y \sin\theta
y' = y \cos\theta - x \sin\theta

Here, \lambda is the wavelength of the cosine factor and controls the spatial extent, \theta is the filter orientation, \psi is the phase offset and specifies the symmetry, \sigma is the standard deviation of the Gaussian factor, \gamma defines the ellipticity and specifies the aspect ratio of the filter, and x and y are the pixel coordinates. \sigma is usually not specified directly but via \lambda and the bandwidth b as follows:

\frac{\sigma}{\lambda} = \frac{1}{\pi} \sqrt{\frac{\ln 2}{2}} \cdot \frac{2^b + 1}{2^b - 1}   (5.7)

Here b is the half-response spatial frequency bandwidth of simple cells. The ratio \sigma/\lambda specifies the spatial frequency bandwidth of simple cells and hence the number of parallel excitatory and inhibitory stripes. For b = 1, which is usually the case, \sigma/\lambda is about 0.56. The smaller the

bandwidth b, the larger \sigma and the number of visible parallel excitatory and inhibitory stripe zones, as shown in Figure 5-11.

Figure 5-11: Gabor filter kernels with bandwidth parameter values of 0.5, 1, and 2, from left to right, respectively. The values of the other parameters are: wavelength 10, orientation 0, phase offset 0, and aspect ratio 0.5 [136]. Plotted as intensities, with white representing the highest and black the lowest value.

Gabor class
  Data:    float gbsigma, gblambda, gbfullsize, gbangle, gbpsi, gbgamma; float **gbfloat
  Methods: Gabor constructor, computegaborvalues, getvalues, myqsave (save image)

Table 5-4: Data and methods in the Gabor class.

Table 5-4 shows the data and methods of the Gabor class. Scalar variables such as gbsigma, gblambda, and gbfullsize store the standard deviation, the wavelength, and the size (in pixels) of the filter, respectively. The 2-dimensional array gbfloat stores the intensity values of the filter. The class provides methods such as computegaborvalues to compute the Gabor values and myqsave to save the Gabor filter as an image file using Qt.

For Gabor filtering, stencil-type connections are used in the present work. Stencil connections are just like many-to-few connections except that they are of fixed size and replicated throughout, similar to applying a kernel in computer-vision approaches. These connections remain fixed throughout training and are used for simple feature extraction.

Translational Invariance and Sub-sampling

For translational invariance, 2D max filtering is done using a WTA-type mechanism. For any post-synaptic neuron, the current is set according to the most active pre-synaptic neuron in a 2-dimensional pre-synaptic region. Neighboring post-synaptic neurons can overlap in their pre-synaptic regions, which controls the degree of sub-sampling: for example, if the size of the pre-synaptic region is P x P pixels and there is no overlap, there is 1 post-neuron for every P pre-neurons, reducing the size of the post layer by a factor of P in each direction. The most active neuron is defined as the neuron that fired the highest number of spikes. Figure 5-12 shows what the above WTA operation looks like in a single dimension: there are 8 pre-neurons, and the WTA/max is done over 4 pre-neurons with an overlap of 2 neurons. The number of post-neurons and the sampling factor can be calculated in each dimension as follows:

npost = \frac{npre - affsize + stepx}{stepx}, \qquad SF = \frac{npre}{npost}   (5.8)

Here, npre is the number of pre-neurons, affsize is the number of afferent neurons for the WTA, stepx is the step size (which controls the overlap) in a single direction, and SF is the sampling factor. Note that for npost to be a whole number, the parameters should be chosen such that stepx divides the numerator exactly.

Figure 5-12: Illustration of max operation and sub-sampling in 1-D. There are 8 pre-neurons and 3 post-neurons. The afferent input size is 4 neurons with an overlap of 2 neurons.

Figure 5-13 shows the pseudo-code implementing the WTA and sub-sampling. For each post-neuron, the maximum number of spikes in a 2-D region of pre-neurons is computed and used to set the current.

stepx <- step size in x direction
stepy <- step size in y direction
for each post-synaptic neuron (i, j)
    maxspikes <- 0                        // compute max number of spikes
    for each pre-synaptic neuron (k, m) in the afferent region
        numspikesprevlayer <- number of spikes of previous-layer neuron (i*stepx + k, j*stepy + m)
        maxspikes <- MAX(maxspikes, numspikesprevlayer)
    endfor
    set current of neuron (i, j) to factor * maxspikes
endfor

Figure 5-13: Pseudo-code for implementing the WTA and sub-sampling.

Input and Output

The network parameters are read from an input file present in the same directory as the source code. A sample input file is shown in Figure 5-14. Each line of this file contains a single network parameter.

This is the input file containing neuron and simulation parameters
3000   //simulation steps
38.3   //R, resistance (megaohms)
0.207  //C, capacitance (nanofarads)
16.4   //VthMean, voltage threshold (millivolts); if ispoisson is 1, it is the mean threshold
0.0    //Vreset, reset value of voltage after firing (millivolts)
0.5    //tref, refractory period after firing (milliseconds)
40.0   //Vhigh, voltage after reaching threshold (millivolts)
0.0    //t0, initial time (milliseconds)
0.1    //dt, simulation time step (milliseconds)
0.0    //V0, initial voltage (millivolts)
0.0    //I0, initial current (nanoamperes)
1E36   //starting lastspiketime; set to a very high value when starting the simulation (msec)
0      //set to 1 for an initial refractory period, 0 for none
0.0 //0.43 //1.0   //Iconst, constant current (nanoamperes) (> Vth/R to spike)
0.0    //IstartTime, start time for constant current pulse (milliseconds)
0      //ispoisson (0 or 1), do the spikes follow a Poisson distribution?

Figure 5-14: Input file format.

//Voltages file format (outputvoltages.txt)
//There is one row like this for every time step:
//Time voltageneuron1 voltageneuron2 voltageneuron3 ...

//Spike times file format (outputspikes.txt)
//There is one row like this for each neuron that fires. At a given time step, you could get multiple lines if more than one neuron fires at that time step:
//TimeStepNum Time NeuronNumThatFired

//Firing rates file format (outputfiringrates.txt)
//There is one row like this for each neuron:
//LayerNum NeuronInum NeuronJnum NumberOfTimesNeuronFired FiringRate(Hz)

//Synaptic weights file format (outputweights.txt)
//There is one row for each synaptic connection:
//weightnum LayerNum NeuronInum NeuronJnum weightinum weightjnum weight

Figure 5-15: Output file formats.

A number of other variables, such as the location of the input directory, the type of floating-point computation, and the number of instances, are defined in the globals.h file. The typical output files generated by running the code are shown in Figure 5-15. If the network size is big, the voltages and spikes files tend to be very large and it is better not to generate them.

Chapter 6

Results

This chapter presents the results of training and simulating spiking networks on different problems. Results from our rate-based massively parallel code were presented in Chapter 2. The first section uses the active dendrite, dynamic synapse model, which is described briefly in that section. The results in the rest of the sections use simple leaky integrate-and-fire neurons with the training method developed and described in Chapter 4.

6.1 Training on 48 Artificial Characters

The neuronal model used here is the active dendrite and dynamic synapse model [7, 137], except that the dynamic synapses have not been used. Unlike the simple leaky integrate-and-fire model described in previous sections, this model does not assume the membrane resistance, R, and time constant, \tau, to be constant. In this model, a neuron receives input (via spikes) through a set of synapses and dendrites. The total post-synaptic current for synapse i, with weight w_i, attached to a dendrite is given by:

(6.1)

Here t_i^f is the set of pre-synaptic spike times, filtered as Dirac-delta pulses. The time constant \tau_i^d and resistance R_i^d define the active property of the dendrites as functions of the synaptic weights. The STDP modification rule is used for modifying the weights. For a more detailed description of this model, refer to Gupta and Long [138].

Test Problem and Network Structure

In order to test the network, a character set consisting of 48 characters, shown in Figure 6-1, is used. Each character is represented by an array of 3x5 pixels. The character set and the test problem are the same as used in Long and Gupta [11], which used traditional neural networks and back propagation on massively parallel computers. Integrate-and-fire neurons with constant input current are used as input neurons. If a pixel is "on", a constant current is supplied to the corresponding neuron, whereas if the pixel is "off", no current is supplied to that neuron. The number of input neurons is equal to the number of pixels in the image; thus there are 15 input neurons in the present case. The number of output neurons is the number of characters to be trained.

Figure 6-1: Character set used. Black denotes an ON pixel, whereas white is OFF.

There are two layers in the network. The first layer consists of simple leaky integrate-and-fire neurons, which receive constant or zero input current corresponding to the 'on' or 'off' states of the input pixels. The next layer is the layer of active dendrite neurons, each of which is connected to all of the neurons in the previous layer. Finally, each of the output layer neurons is

connected to every other output neuron via inhibitory lateral connections. These lateral connections reflect the competition among the output neurons.

Results

For preliminary testing, the network was trained using only four characters ('A', 'B', 'C', and 'D'). There were 15 input neurons and 4 output neurons for this case. The weights were initialized to random values between 0.5 and 1.0, so that all the output neurons spike on the first training iteration. Each character was presented one at a time, sequentially, during the training process. When the Frobenius norm of the weight-change matrix was very small, it was assumed that the network was fully trained and no further significant learning was possible. For this simple case, the Frobenius norm of the weight-change matrix had become very small after 100 epochs (an epoch means one full presentation of all four characters), and thus the training was stopped at this point. The simulation time step was 0.2 ms.

Figure 6-2: Output when characters are presented in the following order: 'C', 'D', 'A', 'B', 'C', 'D', ...

Figure 6-2 shows the output of the network when the characters are presented one by one in the order 'C', 'D', 'A', 'B', 'C', 'D', ... and so on. A new character was presented 1 ms before every 300 ms. Only a particular neuron responds to a particular character, spiking continuously until the next character is presented, when another neuron starts spiking, and so on. Figure 6-3 shows the weights of each of the connections before and after training. The post-training weights show that since some inputs are always switched off, the corresponding weights are reduced to very small values. Also, weights corresponding to always-ON pixels saturate to one. The rest of the weights lie between zero and one.

Figure 6-3: Weight distribution before (left) and after (right) training.

The next test case was to train on the full 48-character set in Figure 6-1. The network here consisted of 15 neurons in the input layer and 48 in the output layer. The network structure was the same as before, except that there were more connections because the number of output neurons was 48. The network was trained for 100 epochs. During this training process, each of the 48 characters was presented sequentially until the Frobenius norm of the weight-change matrix was very small. Figure 6-4 shows the variation of the Frobenius norm with the number of training epochs. It decreases roughly linearly (on a log scale) with the number of epochs.

Figure 6-4: Variation of the Frobenius norm of the weight-change matrix with the number of epochs.

In this case, most of the characters (43) were recognized uniquely, in the sense that either a unique neuron fired or the firing times of the same neuron were different. A single neuron, however, responded with the same firing times for the characters M, U, and W. Strikingly, all three have the same number of pixels (11) in their character representations and look very similar due to the coarse representation. Similarly, the characters J and 3, each having 10 pixels, also had non-unique representations. The rest of the characters had unique representations.

6.2 Training Simple Gabor-like Cells

In this section, the network is trained to behave like the simple cells found in mammalian vision systems. Simple cells respond preferentially to bars of specific orientations, and Gabor-filter-type receptive fields (see Section 5.4) are common for these cells. We use the spike-time-based Hebbian model with winner-take-all and homeostasis, as discussed in Chapter 4, for this problem and all subsequent problems in the next sections.

93 Test Problem and Network Structure In this case we use images of 4 Gabor filters oriented at 0, 45, 90, and 135 degrees to train the network. The size of each Gabor image is 25x25 pixels. Thus, there were 625 input neurons, with each neuron connected directly to each pixel. There are 4 output neurons connected using all to all connections. There are also all to all inhibitory lateral connections in the output layer, which encourage competition among neurons. The grey-scale intensity value of each pixel was scaled between I min and I max. Here I min is the min. current needed to sustain spiking and I max is the current corresponding to maximum firing rate. This scaled current is fed as input to the input neurons. During training, each image is presented for a duration of 50 msec cyclically for a total training time of 20 secs. The training is done in a completely unsupervised manner; i.e., label information are not provided. The goal was that each output neuron learns a unique Gabor filter image after sufficient training. Once trained, we later test the network with 36 Gabor filters with orientations in step sizes of 5 degrees. The various parameters used in the network are R=38.3 MΩ, C=0.207 nf, τ=7.93 ms, V th =16.4 mv, V high =40 mv, V reset =0 mv, τ + =15 ms, A + =0.01, α=0.015, t ref =2.68 mv, dt=0.1 ms Results Figure 6-5 shows the voltage plots of training the spiking network with 4 Gabor filter images at 0, 45, 90, and 135 degree orientations. Only the first 0.4 secs of the simulation is shown in the figure. Each output neuron learns to recognize a different Gabor filter and fires more vigorously as it learns to recognize more. Figure 6-6 shows the voltage plots of an output neuron before and after training. It learns a particular image by firing earlier and more vigorously. Once trained, 36 Gabor filters at step size of 5 degrees were presented to the network. Figure 6-7 shows

the firing rates of the 4 output neurons plotted against filters of different orientations. We observe bell-shaped tuning curves similar to the tuning curves of cells experimentally observed in the V1 [82, 84, 139] and MT/V5 [140, 141] areas. Figure 6-8 shows one such tuning curve from a cell in the cat striate cortex. The width of the curve and the firing rates match the computational results quite well.

Figure 6-5: Voltage plots from the 4 output neurons when presented cyclically with 4 bars of different orientations, each for 50 ms. Each neuron learns a bar of a different orientation, and the inter-spike time decreases as the bar is being learned.

Figure 6-6: Voltage of an output neuron before training (left) and after training (right).

Figure 6-7: Firing rates of the 4 output neurons for 36 Gabor filter test images in steps of 5 degrees. The output of each neuron is shown in a different color.

Figure 6-8: Experimental tuning curve of a cell from the cat striate cortex [139]. The cell has a preferred orientation of around 84 degrees.

Figure 6-9 shows the weights associated with each of the 4 output neurons before and after training. The weights were initialized to random values before training. The weights are represented as intensities, with a weight of 1.0 corresponding to white and 0.0 to black. We can see that a unique neuron learns each unique Gabor image. We emphasize, however, that since the learning is unsupervised, it cannot be determined before training which neuron will learn which image.

Figure 6-9: Weights before training (left) and after training (right) for the 4 output neurons. Each output neuron learns to recognize a unique Gabor filter image.

6.3 Training on LED Numbers

Here the network is trained on a set of ten LED-type numbers, a simple case chosen for better understanding. Each LED number can be constructed by switching ON or OFF seven vertical/horizontal bars. For example, the number two can be constructed by switching ON three horizontal and two vertical bars, as shown in Figure 6-10.

Figure 6-10: LED-type numbers formed by switching 3 horizontal and 4 vertical bars ON or OFF.

Test Problem and Network Structure

The network architecture, with the number 2 as input, is shown in Figure 6-11. As shown, the number 2 consists of 3 ON horizontal bars and 2 ON vertical bars. The network has two layers of modifiable connections and one layer of fixed stencil-type Gabor connections. The input image is of size 38x38 pixels, and it passes through Gabor stencil connections at 4 different orientations. This process results in 4 images of size 30x30. Then, many-to-few connections with projections of size 10x10 connect to 9 neurons in each orientation stream. Finally, all-to-all connections connect the neurons in all four streams to the ten outputs. Thus the layer sizes, including the input, are 5776 x 3600 x 36 x 10 neurons. The number of modifiable connections is 3600 in the many-to-few layer and 360 in the all-to-all layer. The training is done in an unsupervised way, with the numbers presented at random; each number has an equal chance of being selected. Since the images have only vertical and horizontal bars, the inclined streams (the last two in Figure 6-11) have no effect and could have been simply removed.

Figure 6-11: Network architecture. The input is passed through 4 fixed stencil-type connections, which do Gabor-type filtering to extract simple features. The next two layers are learning layers with modifiable connections. The first is a many-to-few layer with projections of size 10x10 onto one neuron. The second is an all-to-all layer connecting 10 output neurons to the 36 neurons (9 per stream) in the previous layer.

Results

Figure 6-12 shows the firing rates of the ten output neurons plotted as intensities. Each output neuron is tuned to a unique number. Though this is a simple case, it provides a better understanding of the two learning layers of the network. Figure 6-13 shows the weights after training for these two learning layers. The weights in the first learning layer simply tune to the horizontal and vertical bars, whereas the second-layer weights combine these so

that each output neuron is tuned to a specific number. For example, the weights associated with the neuron specializing in 2 (third row on the right) have all the 0-degree-stream weights turned on, but only about 2 weights of the 90-degree stream fully turned on and 2 weights partially turned on. Similarly, for the weights associated with the neuron specializing in 8 (second-to-last row), all the weights are turned on.

Figure 6-12: Firing rates of the 10 output neurons plotted as intensities, when the number shown in the left-most column is presented as input.

What would happen if we decreased the number of neurons in the intermediate layer? Figure 6-14 shows that the network then has trouble distinguishing between 0 and 8. This is because at least 7 neurons are needed in the many-to-few layer to capture all 7 bars, but in this case there are only 6 neurons per stream (24 total) in the many-to-few layer. The network configuration, including the input, is 5776 x 2400 x 24 x 10, so there are 2400 connections in the many-to-few layer and 240 in the all-to-all layer.

Figure 6-13: Weights after learning plotted as intensities. Left: weights of the 36 neurons in the first learning layer, each plotted as a 10 X 10 projection from the previous layer. Right: weights of the second learning layer for the 10 output neurons (one row each), plotted as 12 X 3 projections from the previous layer.

Figure 6-14: Firing rates of the 10 output neurons plotted as intensities for a network configuration that has 6 neurons per stream instead of 9. The number shown in the left-most column is presented as input.

Training on MNIST Dataset

The MNIST dataset [1] consists of images of handwritten numbers: 60,000 training images and 10,000 test images, each 28x28 pixels. Figure 6-15 shows some images from the dataset. We use the first 10,000 images of the training set for training and the first 10,000 images of the testing set for testing; the images appear in random order in both sets.

Figure 6-15: Some images from the MNIST dataset [1].

As the images are quite small (28x28 pixels), zero-intensity boundaries were added on all four sides to make them 36x36 pixels, allowing proper handling of boundaries during Gabor filtering. Also, the grey-scale intensity value of each pixel was scaled between I_min and I_max and fed to the input neuron layer. Here I_min is the minimum current needed to sustain spiking and I_max is the current corresponding to the maximum allowed firing rate f_max. They are calculated using the following equations from Koch [93]:

I_min = (V_th - V_rest) / R,    I_max = (1/R)(V_th - V_rest)(1 - E)^(-1),    E = e^((t_ref - 1/f_max)/τ)    (6.2)

Here, V_th is the voltage threshold to produce a spike, V_rest is the resting potential, t_ref is the refractory period after a spike, τ = RC is the membrane time constant, R is the resistance, and C is the capacitance.
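A minimal sketch of this input encoding, assuming hypothetical neuron parameters (the dissertation's actual values are not given here): compute I_min and I_max from Eq. 6.2 and scale a pixel's grey value onto that current range.

```python
import math

# Minimal sketch of the input-current encoding, using Eq. 6.2.
# All parameter values below are hypothetical, not the dissertation's.
V_th, V_rest = -50e-3, -65e-3   # spike threshold and resting potential (V)
R, C = 10e6, 1e-9               # membrane resistance (ohm) and capacitance (F)
tau = R * C                     # membrane time constant (s)
t_ref = 2e-3                    # refractory period (s)
f_max = 100.0                   # maximum allowed firing rate (Hz)

I_min = (V_th - V_rest) / R                  # weakest current that sustains spiking
E = math.exp((t_ref - 1.0 / f_max) / tau)
I_max = (V_th - V_rest) / (R * (1.0 - E))    # current giving firing rate f_max

def pixel_to_current(p):
    """Linearly scale a grey-scale pixel value in [0, 255] onto [I_min, I_max]."""
    return I_min + (p / 255.0) * (I_max - I_min)
```

Note that I_max collapses to I_min as f_max approaches zero (E goes to 0), which is a quick sanity check on the reconstruction of Eq. 6.2.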

Test Problem and Network Structure

Figure 6-16 shows the network architecture. There are essentially five layers of neurons and four layers of synapses. The input image is passed through the fixed Gabor connections (four orientations) using stencil-type connections of size 13X13 in the first layer. In the second layer, if needed, 2D max filtering is done for translation invariance using WTA. In the next layer, feature extraction is done using modifiable many-to-few connections. For these connections, the pre-neuron layer is divided into 9 equal parts as shown; each post-neuron receives connections from the 8X8 pre-neurons belonging to the same part. Finally, all-to-all connections combine all the features at different orientations, projecting onto 10 outputs. The numbers of neurons are 5184, 2304, 2304, 2268, and 10 from input to output if no max operations are done for translational invariance.

Figure 6-16: Network architecture of the spiking neural network simulations. Many-to-few and stencil-type connections are shown by projections onto a single post-neuron in the relevant layers. The flow diagram on the left shows the types of synaptic connections and whether they are learned or fixed.

The modifiable synapses in the many-to-few and all-to-all learning layers number 145,152 and 22,680 respectively. Note that this is an upper estimate, as many synapses are never modified and thus play no role in the predictions (see Figure 6-18). If we assume the same percentage of modifiable synapses as shown in Figure 6-18 (about 39%), then about 56,600 many-to-few synapses are actually modified.

The network can be divided functionally into two sub-systems. The lower sub-system handles invariance and consists of the feature extraction and invariance stages. The upper sub-system is where learning takes place. The lower sub-system takes information from the input image in the form of pixels and converts it to features to be passed on to the upper sub-system. Both sub-systems consist of spiking neurons but differ functionally. Essentially, the initial layers of the human visual system are well enough understood that we can pre-wire those layers and set the values of the synapses. The present approach agrees with biological evidence suggesting that learning in higher layers occurs after the lower-level cells have developed their response properties [142, 143]. This is discussed further in the discussion section.

Figure 6-17: Response of the lower layers of the network as an image of the number two is shown. Intensities correspond to the number of spikes. The Gabor filtering and two-dimensional WTA stages are shown. Spiking neurons are used at all layers.
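The layer sizes and synapse counts above follow from the projection geometry. A small sketch reproduces them; the per-part post-neuron count of 63 is inferred from the stated layer size of 2268, not given explicitly in the text:

```python
# Illustrative sketch: derive layer sizes and modifiable-synapse counts
# of the MNIST network from the architecture described above.
orientations = 4
padded = 36                      # 28x28 MNIST images zero-padded to 36x36
gabor = 13                       # stencil size of the fixed Gabor connections

feat = padded - gabor + 1        # 24: width of the valid region after filtering
gabor_layer = orientations * feat * feat             # 2304 neurons

parts = 9                        # each stream's 24x24 grid split into 9 parts
patch = 8                        # each post-neuron sees an 8x8 patch
posts_per_part = 63              # inferred so that the layer totals 2268
many_to_few_posts = orientations * parts * posts_per_part   # 2268 neurons

many_to_few_synapses = many_to_few_posts * patch * patch    # 145,152
all_to_all_synapses = many_to_few_posts * 10                # 22,680
```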

Figure 6-17 shows how the signal looks when an image of the number two is passed through the first sub-system. The Gabor filtering stage separates the image into features at different orientations. After the Gabor filtering stage, a two-dimensional WTA over a local field can be applied for translational invariance and sub-sampling. The Gabor-filtered images look very similar to images obtained with the convolution filtering used in computer vision. For training, each image was presented for 300 ms of simulation time, and the images were presented in random order. The training was done in a layered, semi-supervised manner for the two learning layers. All the synapses were set to random values before training. First, only the ManyToFew mesh layer was switched on and trained in an unsupervised way. After this layer was trained, it was switched off for training and the AllToAll mesh layer was switched on and trained in a supervised way. This training method resembles learning in the brain, where higher layers develop their properties only after lower layers have developed theirs [142, 143], and is discussed further in the discussion section.
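The two-phase schedule can be sketched as follows. This is an illustrative Python sketch with a hypothetical network API (`set_plastic`, `present`) and a stub network that only records the schedule; the dissertation's actual implementation is in C++.

```python
# Illustrative sketch of the layered, semi-supervised training schedule.
# StubNet is a stand-in that records which layer is plastic at each step.
class StubNet:
    def __init__(self):
        self.log = []
        self.plastic = None
    def set_plastic(self, many_to_few, all_to_all):
        self.plastic = "many_to_few" if many_to_few else "all_to_all"
    def present(self, img, duration_ms=300, teacher=None):
        self.log.append((self.plastic, teacher))

def train_layered(images, labels, net, present_ms=300):
    # Phase 1: only the many-to-few layer learns, unsupervised (no teacher).
    net.set_plastic(many_to_few=True, all_to_all=False)
    for img in images:
        net.present(img, duration_ms=present_ms)
    # Phase 2: freeze it; the all-to-all layer learns with a teaching signal.
    net.set_plastic(many_to_few=False, all_to_all=True)
    for img, label in zip(images, labels):
        net.present(img, duration_ms=present_ms, teacher=label)

net = StubNet()
train_layered(["img0", "img1"], [3, 7], net)
```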

Results

Figure 6-18 shows the synapse strengths of the ManyToFew mesh layer after training. The synapse strengths are plotted as intensity patches with the same sizes as the projections from the pre-neurons. Many patches look like bars of different sizes at different positions and orientations. Note that many patches (synapses) are never trained; they could have been removed, or better yet, synapses could be added only when needed. We currently don't have the ability to create and destroy synapses dynamically in the code, but we have made some preliminary progress [144].

Figure 6-18: Synapse strengths of the first learning layer plotted as 8x8 intensity patches after training, with white representing the highest synaptic strength and black the lowest. Only the first 45 patches from each of the 4 orientation channels are shown for simplicity. Some patches remain random and are not modified.

Figure 6-19: Synapse strengths of the second learning layer after learning, plotted as intensities. Each row corresponds to the connections of one of the 10 output neurons (each of which has 756X3 synapses).

Figure 6-19 shows the synapse strength plots for the AllToAll mesh layer. The synapse strengths are plotted for each of the 10 output neurons as projections to the previous layer. The synapse strengths are highest around the middle of each row, as that region represents the center of the images, and the numbers are located roughly in the center of the images. The plots also suggest that fewer features are oriented at 135 degrees than at the other orientations. After training, each image was presented for a duration of 300 ms for testing. Figure 6-20 shows the voltage plot of the output neurons for 300 ms when a number 4 from the test dataset is presented after training. The neuron tuned to the corresponding number fires with the highest

firing rate.

Figure 6-20: Voltage plot of the output neurons when a number 4 from the testing set is presented after learning. The neuron in the middle left fires with the highest firing rate, as it is tuned to the number 4.

The neuron tuned to number 8 (bottom left corner) fires with the second-highest firing rate, whereas the neuron tuned to 6 (just above the bottom left corner) fires only one spike in 300 ms of simulation time. Each output neuron represented a unique number, and the firing rate was used to distinguish between the outputs. The network accuracy was 91.2% on the training set and 89% on the test set. This is encouraging considering that only spiking neurons were used throughout, including for learning, in contrast to other methods, which usually use traditional

neural networks with supervised, gradient-descent-type learning throughout the network [145]. Also, other methods typically use more feature layers along with many layers of convolutions and sub-sampling [2, 32, 145]. Some approaches also augment the training set with artificially shifted, skewed, scaled, and sheared images. The accuracy here would likely increase if all these techniques were incorporated too, but the network would become larger and would require much more training time.

Figure 6-21 shows the confusion matrix, with intensities representing the percent correct for each of the 10 numbers. Numbers 1, 2, and 4 were the top three performers, whereas numbers 3, 6, and 5 had the poorest accuracies. This is not surprising, as the former numbers are simpler and have fewer features than the latter. In addition, number 3 is confused with numbers 2 and 8 most of the time, and numbers 5 and 6 are mostly confused with 8. Most of the numbers tend to be confused with number 8, likely because it roughly incorporates the features of all the other numbers.

Figure 6-21: Confusion matrix with intensities as percent correct for each number.
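The firing-rate readout and the confusion-matrix tally can be sketched as below; this is a minimal illustration with made-up spike counts, not data from the actual runs:

```python
# Illustrative sketch: classify by the output neuron with the highest spike
# count over the 300 ms presentation, and tally a confusion matrix.
def classify(spike_counts):
    """spike_counts: list of 10 spike totals, one per output neuron."""
    return max(range(len(spike_counts)), key=lambda i: spike_counts[i])

def confusion_matrix(predictions, labels, n_classes=10):
    cm = [[0] * n_classes for _ in range(n_classes)]
    for pred, true in zip(predictions, labels):
        cm[true][pred] += 1       # row = true digit, column = predicted digit
    return cm

# Toy example: two presentations of "4", the second misread as "8".
counts_a = [0, 1, 0, 2, 9, 0, 1, 0, 5, 0]   # neuron 4 fires most
counts_b = [0, 0, 1, 2, 4, 0, 0, 0, 7, 0]   # neuron 8 fires most
preds = [classify(counts_a), classify(counts_b)]
cm = confusion_matrix(preds, [4, 4])
```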

The network had 167,832 modifiable synapses and 9,766 neurons (including input and output), though many synapses were not modified during training (see Figure 6-18). Training took about 25 sec per image when each image was presented for 3000 time steps, or 300 ms of simulation time; testing an image took about 13 sec for 300 ms of simulation time. The total memory required to run the network was 16 MB. The code was run on a 2.8 GHz 8-core Mac OS X server with 24 GB RAM using GNU C++.

Color Object Recognition

As discussed in Section 3.3, color is quite a complex phenomenon, but it plays an important role in object recognition. Here we build a preliminary model capable of dealing with color images. Three opponent channels are used: grayscale, red/green, and blue/yellow. More complex issues such as color perception, color constancy, and simultaneous color contrast are not addressed here. The last two opponent channels have the added advantage of being invariant to shifts in light intensity (see Section 3.3). In this simulation, color images of 12 different fruits are used.

Test Problem and Network Structure

Figure 6-22 shows the color images of the 12 fruits. The input to the network is an image of size 50x50 pixels replicated three times for the three normalized opponent channels, as discussed in Section 3.3. Figure 6-23 shows the layout of the network. There are two layers of all-to-all connections, and the network size is 7500x20x12. The training method is the same as before;

i.e., each image is presented for 3000 time steps, or 300 ms, in random order, and the first learning layer is trained first, followed by the second.

Figure 6-22: Images of 12 fruits.

Figure 6-23: Layout of the network. The input is an image of size 50x50 replicated three times for the three opponent channels.

Figure 6-24: Red, green, blue, grayscale, and the normalized opponent channels plotted as intensities.

Results

Figure 6-24 shows the fruit images plotted in the red, green, blue, grayscale, red-green, and yellow-blue channels. The red-green and yellow-blue channels are normalized as described in Section 3.3. One interesting aspect of the normalized red-green and yellow-blue channels is that absolute black and absolute white map onto the same value. Figure 6-25 shows the synapse strengths plotted as intensities. It shows that ten of the fruits are mapped uniquely; on closer examination, the first and the sixth rows each learn two similar-looking fruit images. It also shows that the two grape images, which have the same shape but different colors, are mapped differently. Had we used just grayscale images and a grayscale network instead of the opponent channels, the network would have confused them; this is one advantage of using the opponent channels and the color network. Each fruit is identified uniquely, as shown in Figure 6-26, which plots the firing rates for each of the twelve images.
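A minimal sketch of the opponent-channel computation, under an assumed normalization (the exact Section 3.3 formulas are not reproduced here); it exhibits the black/white property noted above:

```python
# Illustrative sketch (assumed normalization): grayscale plus the two
# opponent channels from an RGB pixel, each scaled to [0, 1].
def opponent_channels(r, g, b):
    """r, g, b in [0, 255]; returns (gray, red_green, yellow_blue)."""
    gray = (r + g + b) / (3 * 255.0)
    red_green = (r - g) / 255.0 / 2.0 + 0.5            # 0.5 when r == g
    yellow_blue = ((r + g) / 2.0 - b) / 255.0 / 2.0 + 0.5
    return gray, red_green, yellow_blue

# Absolute black and absolute white land on the same opponent values (0.5);
# they differ only in the gray channel. A uniform intensity shift
# (r+k, g+k, b+k) leaves both opponent channels unchanged.
black = opponent_channels(0, 0, 0)
white = opponent_channels(255, 255, 255)
```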

Figure 6-25: Synapse strengths of the first learning layer plotted as 150X50 projections to the previous layer. White represents the highest synaptic strength and black the lowest. Each row represents the connections of a single post-neuron, and the three columns represent the three opponent channels.

Figure 6-26: Firing rates of the 12 output neurons plotted as intensities. The number shown in the left-most column identifies each of the 12 fruit images.


More information

Neural Coding. Computing and the Brain. How Is Information Coded in Networks of Spiking Neurons?

Neural Coding. Computing and the Brain. How Is Information Coded in Networks of Spiking Neurons? Neural Coding Computing and the Brain How Is Information Coded in Networks of Spiking Neurons? Coding in spike (AP) sequences from individual neurons Coding in activity of a population of neurons Spring

More information

A quantitative theory of immediate visual recognition

A quantitative theory of immediate visual recognition A quantitative theory of immediate visual recognition Thomas Serre, Gabriel Kreiman, Minjoon Kouh, Charles Cadieu, Ulf Knoblich and Tomaso Poggio Center for Biological and Computational Learning McGovern

More information

Cognitive Neuroscience History of Neural Networks in Artificial Intelligence The concept of neural network in artificial intelligence

Cognitive Neuroscience History of Neural Networks in Artificial Intelligence The concept of neural network in artificial intelligence Cognitive Neuroscience History of Neural Networks in Artificial Intelligence The concept of neural network in artificial intelligence To understand the network paradigm also requires examining the history

More information

ERA: Architectures for Inference

ERA: Architectures for Inference ERA: Architectures for Inference Dan Hammerstrom Electrical And Computer Engineering 7/28/09 1 Intelligent Computing In spite of the transistor bounty of Moore s law, there is a large class of problems

More information

Clusters, Symbols and Cortical Topography

Clusters, Symbols and Cortical Topography Clusters, Symbols and Cortical Topography Lee Newman Thad Polk Dept. of Psychology Dept. Electrical Engineering & Computer Science University of Michigan 26th Soar Workshop May 26, 2006 Ann Arbor, MI agenda

More information

Modeling of Hippocampal Behavior

Modeling of Hippocampal Behavior Modeling of Hippocampal Behavior Diana Ponce-Morado, Venmathi Gunasekaran and Varsha Vijayan Abstract The hippocampus is identified as an important structure in the cerebral cortex of mammals for forming

More information

Evaluating the Effect of Spiking Network Parameters on Polychronization

Evaluating the Effect of Spiking Network Parameters on Polychronization Evaluating the Effect of Spiking Network Parameters on Polychronization Panagiotis Ioannou, Matthew Casey and André Grüning Department of Computing, University of Surrey, Guildford, Surrey, GU2 7XH, UK

More information

CAS Seminar - Spiking Neurons Network (SNN) Jakob Kemi ( )

CAS Seminar - Spiking Neurons Network (SNN) Jakob Kemi ( ) CAS Seminar - Spiking Neurons Network (SNN) Jakob Kemi (820622-0033) kemiolof@student.chalmers.se November 20, 2006 Introduction Biological background To be written, lots of good sources. Background First

More information

Lateral Geniculate Nucleus (LGN)

Lateral Geniculate Nucleus (LGN) Lateral Geniculate Nucleus (LGN) What happens beyond the retina? What happens in Lateral Geniculate Nucleus (LGN)- 90% flow Visual cortex Information Flow Superior colliculus 10% flow Slide 2 Information

More information

Local Image Structures and Optic Flow Estimation

Local Image Structures and Optic Flow Estimation Local Image Structures and Optic Flow Estimation Sinan KALKAN 1, Dirk Calow 2, Florentin Wörgötter 1, Markus Lappe 2 and Norbert Krüger 3 1 Computational Neuroscience, Uni. of Stirling, Scotland; {sinan,worgott}@cn.stir.ac.uk

More information

Learning in neural networks

Learning in neural networks http://ccnl.psy.unipd.it Learning in neural networks Marco Zorzi University of Padova M. Zorzi - European Diploma in Cognitive and Brain Sciences, Cognitive modeling", HWK 19-24/3/2006 1 Connectionist

More information

Frequency Tracking: LMS and RLS Applied to Speech Formant Estimation

Frequency Tracking: LMS and RLS Applied to Speech Formant Estimation Aldebaro Klautau - http://speech.ucsd.edu/aldebaro - 2/3/. Page. Frequency Tracking: LMS and RLS Applied to Speech Formant Estimation ) Introduction Several speech processing algorithms assume the signal

More information

Spiking Inputs to a Winner-take-all Network

Spiking Inputs to a Winner-take-all Network Spiking Inputs to a Winner-take-all Network Matthias Oster and Shih-Chii Liu Institute of Neuroinformatics University of Zurich and ETH Zurich Winterthurerstrasse 9 CH-857 Zurich, Switzerland {mao,shih}@ini.phys.ethz.ch

More information

COGS 101A: Sensation and Perception

COGS 101A: Sensation and Perception COGS 101A: Sensation and Perception 1 Virginia R. de Sa Department of Cognitive Science UCSD Lecture 5: LGN and V1: Magno and Parvo streams Chapter 3 Course Information 2 Class web page: http://cogsci.ucsd.edu/

More information

Dendritic compartmentalization could underlie competition and attentional biasing of simultaneous visual stimuli

Dendritic compartmentalization could underlie competition and attentional biasing of simultaneous visual stimuli Dendritic compartmentalization could underlie competition and attentional biasing of simultaneous visual stimuli Kevin A. Archie Neuroscience Program University of Southern California Los Angeles, CA 90089-2520

More information

Introduction to Computational Neuroscience

Introduction to Computational Neuroscience Introduction to Computational Neuroscience Lecture 5: Data analysis II Lesson Title 1 Introduction 2 Structure and Function of the NS 3 Windows to the Brain 4 Data analysis 5 Data analysis II 6 Single

More information

Imperfect Synapses in Artificial Spiking Neural Networks

Imperfect Synapses in Artificial Spiking Neural Networks Imperfect Synapses in Artificial Spiking Neural Networks A thesis submitted in partial fulfilment of the requirements for the Degree of Master of Computer Science by Hayden Jackson University of Canterbury

More information

University of Cambridge Engineering Part IB Information Engineering Elective

University of Cambridge Engineering Part IB Information Engineering Elective University of Cambridge Engineering Part IB Information Engineering Elective Paper 8: Image Searching and Modelling Using Machine Learning Handout 1: Introduction to Artificial Neural Networks Roberto

More information

Computational Neuroscience. Instructor: Odelia Schwartz

Computational Neuroscience. Instructor: Odelia Schwartz Computational Neuroscience 2017 1 Instructor: Odelia Schwartz From the NIH web site: Committee report: Brain 2025: A Scientific Vision (from 2014) #1. Discovering diversity: Identify and provide experimental

More information

Artificial Neural Networks (Ref: Negnevitsky, M. Artificial Intelligence, Chapter 6)

Artificial Neural Networks (Ref: Negnevitsky, M. Artificial Intelligence, Chapter 6) Artificial Neural Networks (Ref: Negnevitsky, M. Artificial Intelligence, Chapter 6) BPNN in Practice Week 3 Lecture Notes page 1 of 1 The Hopfield Network In this network, it was designed on analogy of

More information

ANAT2010. Concepts of Neuroanatomy (II) S2 2018

ANAT2010. Concepts of Neuroanatomy (II) S2 2018 ANAT2010 Concepts of Neuroanatomy (II) S2 2018 Table of Contents Lecture 13: Pain and perception... 3 Lecture 14: Sensory systems and visual pathways... 11 Lecture 15: Techniques in Neuroanatomy I in vivo

More information

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil CSE 5194.01 - Introduction to High-Perfomance Deep Learning ImageNet & VGG Jihyung Kil ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,

More information

Supplementary materials for: Executive control processes underlying multi- item working memory

Supplementary materials for: Executive control processes underlying multi- item working memory Supplementary materials for: Executive control processes underlying multi- item working memory Antonio H. Lara & Jonathan D. Wallis Supplementary Figure 1 Supplementary Figure 1. Behavioral measures of

More information

Input-speci"c adaptation in complex cells through synaptic depression

Input-specic adaptation in complex cells through synaptic depression 0 0 0 0 Neurocomputing }0 (00) } Input-speci"c adaptation in complex cells through synaptic depression Frances S. Chance*, L.F. Abbott Volen Center for Complex Systems and Department of Biology, Brandeis

More information

Continuous transformation learning of translation invariant representations

Continuous transformation learning of translation invariant representations Exp Brain Res (21) 24:255 27 DOI 1.17/s221-1-239- RESEARCH ARTICLE Continuous transformation learning of translation invariant representations G. Perry E. T. Rolls S. M. Stringer Received: 4 February 29

More information

Investigation of Physiological Mechanism For Linking Field Synapses

Investigation of Physiological Mechanism For Linking Field Synapses Investigation of Physiological Mechanism For Linking Field Synapses Richard B. Wells 1, Nick Garrett 2, Tom Richner 3 Microelectronics Research and Communications Institute (MRCI) BEL 316 University of

More information

Intelligent Control Systems

Intelligent Control Systems Lecture Notes in 4 th Class in the Control and Systems Engineering Department University of Technology CCE-CN432 Edited By: Dr. Mohammed Y. Hassan, Ph. D. Fourth Year. CCE-CN432 Syllabus Theoretical: 2

More information

Task 1: Machine Learning with Spike-Timing-Dependent Plasticity (STDP)

Task 1: Machine Learning with Spike-Timing-Dependent Plasticity (STDP) DARPA Report Task1 for Year 1 (Q1-Q4) Task 1: Machine Learning with Spike-Timing-Dependent Plasticity (STDP) 1. Shortcomings of the deep learning approach to artificial intelligence It has been established

More information

Basics of Computational Neuroscience: Neurons and Synapses to Networks

Basics of Computational Neuroscience: Neurons and Synapses to Networks Basics of Computational Neuroscience: Neurons and Synapses to Networks Bruce Graham Mathematics School of Natural Sciences University of Stirling Scotland, U.K. Useful Book Authors: David Sterratt, Bruce

More information

G5)H/C8-)72)78)2I-,8/52& ()*+,-./,-0))12-345)6/3/782 9:-8;<;4.= J-3/ J-3/ "#&' "#% "#"% "#%$

G5)H/C8-)72)78)2I-,8/52& ()*+,-./,-0))12-345)6/3/782 9:-8;<;4.= J-3/ J-3/ #&' #% #% #%$ # G5)H/C8-)72)78)2I-,8/52& #% #$ # # &# G5)H/C8-)72)78)2I-,8/52' @5/AB/7CD J-3/ /,?8-6/2@5/AB/7CD #&' #% #$ # # '#E ()*+,-./,-0))12-345)6/3/782 9:-8;;4. @5/AB/7CD J-3/ #' /,?8-6/2@5/AB/7CD #&F #&' #% #$

More information

A Detailed Look at Scale and Translation Invariance in a Hierarchical Neural Model of Visual Object Recognition

A Detailed Look at Scale and Translation Invariance in a Hierarchical Neural Model of Visual Object Recognition @ MIT massachusetts institute of technology artificial intelligence laboratory A Detailed Look at Scale and Translation Invariance in a Hierarchical Neural Model of Visual Object Recognition Robert Schneider

More information

LISC-322 Neuroscience Cortical Organization

LISC-322 Neuroscience Cortical Organization LISC-322 Neuroscience Cortical Organization THE VISUAL SYSTEM Higher Visual Processing Martin Paré Assistant Professor Physiology & Psychology Most of the cortex that covers the cerebral hemispheres is

More information

CHAPTER I From Biological to Artificial Neuron Model

CHAPTER I From Biological to Artificial Neuron Model CHAPTER I From Biological to Artificial Neuron Model EE543 - ANN - CHAPTER 1 1 What you see in the picture? EE543 - ANN - CHAPTER 1 2 Is there any conventional computer at present with the capability of

More information

Modeling Depolarization Induced Suppression of Inhibition in Pyramidal Neurons

Modeling Depolarization Induced Suppression of Inhibition in Pyramidal Neurons Modeling Depolarization Induced Suppression of Inhibition in Pyramidal Neurons Peter Osseward, Uri Magaram Department of Neuroscience University of California, San Diego La Jolla, CA 92092 possewar@ucsd.edu

More information

Original citation: Robinson, Leigh and Rolls, Edmund T.. (25) Invariant visual object recognition : biologically plausible approaches. Biological Cybernetics, 9 (4-5). pp. 55-535. 658. Permanent WRAP url:

More information

Competing Frameworks in Perception

Competing Frameworks in Perception Competing Frameworks in Perception Lesson II: Perception module 08 Perception.08. 1 Views on perception Perception as a cascade of information processing stages From sensation to percept Template vs. feature

More information

Competing Frameworks in Perception

Competing Frameworks in Perception Competing Frameworks in Perception Lesson II: Perception module 08 Perception.08. 1 Views on perception Perception as a cascade of information processing stages From sensation to percept Template vs. feature

More information

Self-Organization and Segmentation with Laterally Connected Spiking Neurons

Self-Organization and Segmentation with Laterally Connected Spiking Neurons Self-Organization and Segmentation with Laterally Connected Spiking Neurons Yoonsuck Choe Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 USA Risto Miikkulainen Department

More information

Cerebral Cortex. Edmund T. Rolls. Principles of Operation. Presubiculum. Subiculum F S D. Neocortex. PHG & Perirhinal. CA1 Fornix CA3 S D

Cerebral Cortex. Edmund T. Rolls. Principles of Operation. Presubiculum. Subiculum F S D. Neocortex. PHG & Perirhinal. CA1 Fornix CA3 S D Cerebral Cortex Principles of Operation Edmund T. Rolls F S D Neocortex S D PHG & Perirhinal 2 3 5 pp Ento rhinal DG Subiculum Presubiculum mf CA3 CA1 Fornix Appendix 4 Simulation software for neuronal

More information

What do we perceive?

What do we perceive? THE VISUAL SYSTEM Aditi Majumder What do we perceive? Example: Switch off the light in room What we perceive Not only the property of the scene But also that of the visual system Our perception is filtered

More information

Lab 4: Compartmental Model of Binaural Coincidence Detector Neurons

Lab 4: Compartmental Model of Binaural Coincidence Detector Neurons Lab 4: Compartmental Model of Binaural Coincidence Detector Neurons Introduction The purpose of this laboratory exercise is to give you hands-on experience with a compartmental model of a neuron. Compartmental

More information