Learning and Adaptive Behavior, Part II
April 12, 2007

"The man who sets out to carry a cat by its tail learns something that will always be useful and which will never grow dim or doubtful." -- Mark Twain
A Challenge: Getting RL to Work on Real Robots

When is learning appropriate?
- When the task is originally under-specified or difficult to code exactly by hand
- When the task has parameters that are likely to change over time in unpredictable ways
- When the time taken to learn a control policy is less than that needed to hand-code a comparable policy
- When the learned policy can be executed more efficiently than a hand-coded one
Problems with RL on Robots

- Huge number of states to explore, with a large number of possible actions in each state.
  E.g., 24 sonar sensors, each quantized into 3 range bands: 3^24 ≈ 282 billion possible states.
  If the possible actions in each state are go forwards or go backwards: > 560 billion state-action combinations to try.
- The robot is physical, so it takes time to perform an action: at 1 second per action, nearly 20,000 years to try each combination once.
- During early learning, the robot's actions may be dangerous ("Let's try rolling down the stairwell to see what next state I end up in").
  One possible safeguard: give the robot reflexes to stop dangerous actions.
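The state-space arithmetic on this slide can be checked directly. A minimal sketch (variable names are illustrative):

```python
# Back-of-the-envelope arithmetic from the slide: 24 sonar sensors,
# each quantized into 3 range bands, and 2 possible actions per state.
sensors, bands, actions = 24, 3, 2

states = bands ** sensors            # 3^24 = 282,429,536,481 states
state_actions = states * actions     # ~565 billion state-action pairs

seconds_per_action = 1
years = state_actions * seconds_per_action / (60 * 60 * 24 * 365)

print(f"{states:,} states")
print(f"{state_actions:,} state-action pairs")
print(f"{years:,.0f} years to try each pair once")  # ~17,900 (the slide rounds to 20,000)
```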
One Possible Framework for RL on Robots

Three components:
- Provided system, with initial control policy and immediate reward function
- Policy learning process
- Value learning process

[Diagram: the policy learner and value learner maintain Q(s,a) and refine the initial control policy; the policy sends an action to the environment, which returns the next state and a reward.]
Fundamental Issues

- Quick learning: learn in real-time
- Generalization: take what is known and hypothesize about unknown/unvisited states
- Shifting functions: adapt over time
Other Examples of Reinforcement Learning: Model-Based Reinforcement Learning (videos)
Another Related Example: Imitation Learning (video)
Two New Topics: Objectives

- To understand neural network learning applied to robotics
- To understand genetic algorithms applied to robotics
First: Quick Background in Neural Nets

- Some of the earliest work in neural networks (or connectionist systems) was the McCulloch-Pitts model of neurons (1943).
- McCulloch-Pitts: a simple linear threshold unit, with a synaptic weight associated with each synaptic input. If the threshold is exceeded, the neuron fires, carrying its output to the next neuron.
- Later, Rosenblatt (1958) introduced the Perceptron.

[Diagram: input vector x_1..x_n, synaptic weights w_1..w_n, a summation unit Σ with threshold θ, and an output.]
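The linear threshold unit described above can be sketched in a few lines. This is an illustrative toy, not code from the original work; the function name and the AND example are assumptions:

```python
# A McCulloch-Pitts-style linear threshold unit: the neuron "fires"
# (outputs 1) only when the weighted sum of its inputs exceeds the
# threshold theta.
def threshold_unit(inputs, weights, theta):
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation > theta else 0

# With suitable weights and threshold, the unit computes logical AND:
print(threshold_unit([1, 1], [1.0, 1.0], theta=1.5))  # 1 (fires)
print(threshold_unit([1, 0], [1.0, 1.0], theta=1.5))  # 0 (does not fire)
```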
Neural Net Quick Background (cont.)

- 1960s-1970s: neural net research in decline, due largely to Minsky and Papert's book (1969), which proved limitations of single-layer perceptron networks.
- 1980s: resurgence, due to multi-layer neural networks and the use of backpropagation as a means for training these systems.
- Much work in connectionism since, with significant progress.
- Keep in mind: neural nets are only abstract computational models of biological neurons.
Methods for Encoding Behavior-Based Robotic Control in Neural Networks

1. Hebbian Learning
2. Perceptron Learning
3. Classical Conditioning
4. Adaptive Heuristic Critic Learning
1. Hebbian Learning

Hebb (1949) developed one of the earliest training algorithms for neural networks. Hebbian learning increases synaptic strength along neural pathways associated with a stimulus and a correct response. Specifically:

    w_ij(t+1) = w_ij(t) + η * o_i * o_j

where:
- w_ij(t) and w_ij(t+1) are the synaptic weights connecting neurons i and j before and after updating
- η is the learning rate coefficient
- o_i and o_j are the outputs of neurons i and j, respectively

[Photo: Donald Hebb, 1904-1985]
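Hebb's rule above is easily sketched in code. This is a minimal illustration; the function name, learning rate, and the co-activation loop are assumptions, not from the original:

```python
# Hebb's rule from the slide: w_ij(t+1) = w_ij(t) + eta * o_i * o_j.
# The weight strengthens whenever neurons i and j are active together.
def hebbian_update(w_ij, o_i, o_j, eta=0.1):
    return w_ij + eta * o_i * o_j

w = 0.0
for _ in range(5):  # repeated co-activation of the two neurons
    w = hebbian_update(w, o_i=1.0, o_j=1.0)
print(round(w, 6))  # weight has grown by eta on each co-activation
```

Note that the plain rule only strengthens weights; practical variants (as in the Verschure example later in these slides) add a decay term to keep weights bounded.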
2. Perceptron Learning Has Been Used for Robotic Learning

Overall training procedure:

Repeat:
1. Present an example from a set of positive and negative learning experiences.
2. Verify the output of the network as to whether it is correct or incorrect.
3. If it is incorrect, supply the correct output at the output unit.
4. Adjust the synaptic weights of the perceptrons in a manner that reduces the error between the observed output and the correct output.
Until satisfactory performance (as manifested by convergence) is achieved, or some other stopping condition is met.
How to Update Synaptic Weights?

Delta rule: used for perceptrons without hidden layers. Modify synaptic weights according to the formula:

    Δw_ij = η * o_i * (t_j - o_j)

where:
- Δw_ij is the synaptic adjustment applied to the connection between neurons i and j
- η is the learning rate coefficient
- t_j and o_j are the target (correct) and observed outputs, respectively

The delta rule strives to minimize the error term using a gradient descent approach.
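A minimal sketch of the delta rule in action. The toy setup (one input, one weight, a linear output unit) and all numeric values are illustrative assumptions:

```python
# Delta rule from the slide: dw_ij = eta * o_i * (t_j - o_j),
# where o_i is the presynaptic output, t_j the target, o_j the observed output.
def delta_rule_update(w_ij, o_i, t_j, o_j, eta=0.5):
    return w_ij + eta * o_i * (t_j - o_j)

# Train a single weight so the unit's output w*x matches the target:
w, x, target = 0.0, 1.0, 1.0
for _ in range(20):
    o_j = w * x                          # observed output of the unit
    w = delta_rule_update(w, o_i=x, t_j=target, o_j=o_j)
print(w)  # converges toward 1.0, the weight that reproduces the target
```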
Gradient Descent Approach

Gradient descent refers to learning methods that seek to minimize an objective function by which system performance is measured.
- At each point in time, the policy is to choose the next step that yields the minimal objective function value.
- The learning rate parameter sets the step size taken at each point in time.
- Each step is computed only on the basis of local information. This is extremely efficient, but introduces the possibility of getting trapped in local minima.

Hill climbing: the analogous process whereby an objective function is maximized.
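The stepwise descent described above can be shown on a toy objective. The objective f(x) = (x - 3)^2 and the step size are illustrative choices, not from the slides:

```python
# Gradient descent on f(x) = (x - 3)^2: repeatedly step downhill using
# only the local gradient, with the learning rate as step size.
def grad(x):
    return 2 * (x - 3)        # derivative of (x - 3)^2

x, eta = 0.0, 0.1             # eta is the learning rate (step size)
for _ in range(100):
    x -= eta * grad(x)        # each step uses purely local information
print(round(x, 4))  # close to the minimum at x = 3
```

Because this objective has a single minimum, descent succeeds from any start; on a multi-modal objective the same local steps could stall in a local minimum, as the slide warns.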
Another Method for Updating Weights: Back-Propagation

Back-propagation is the most commonly used method for updating synaptic weights. It employs a generalized version of the delta rule for use in multilayer perceptron networks (which are commonly used in robotic control and vision).

Usually, synaptic weights are initialized to random values. Weights are adjusted by the following update rule as training instances are provided:

    w_ij(t+1) = w_ij(t) + η * δ_j * o_i

where:
- δ_j = o_j(1 - o_j)(t_j - o_j) for an output node, and
- δ_j = o_j(1 - o_j) Σ_k δ_k w_jk for a hidden-layer node
- t_j and o_j are the target (correct) and observed outputs, respectively

The errors are propagated backward from the output layer.
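The two delta formulas above can be exercised on a tiny 2-2-1 network. This is a hedged sketch: the task (logical OR), the fixed initial weights, the learning rate, and the bias-as-extra-input convention are all illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Back-propagation per the slide:
#   w_ij(t+1) = w_ij(t) + eta * delta_j * o_i
#   delta_j = o_j(1-o_j)(t_j-o_j)               at the output node
#   delta_j = o_j(1-o_j) * sum_k delta_k w_jk   at a hidden node
# Bias is handled as a constant extra input of 1.0.
eta = 0.5
w_h = [[0.1, -0.2, 0.0], [0.3, 0.4, 0.0]]  # input (+bias) -> hidden j
w_o = [0.2, -0.1, 0.0]                      # hidden (+bias) -> output

def forward(x):
    xs = x + [1.0]
    h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], xs))) for j in range(2)]
    o = sigmoid(sum(w * hj for w, hj in zip(w_o, h + [1.0])))
    return xs, h, o

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

for _ in range(5000):
    for x, t in data:
        xs, h, o = forward(x)
        d_o = o * (1 - o) * (t - o)                                 # output delta
        d_h = [h[j] * (1 - h[j]) * d_o * w_o[j] for j in range(2)]  # hidden deltas
        for j, hj in enumerate(h + [1.0]):
            w_o[j] += eta * d_o * hj           # errors propagated backward
        for j in range(2):
            for i, xi in enumerate(xs):
                w_h[j][i] += eta * d_h[j] * xi

print([round(forward(x)[2]) for x, _ in data])  # matches the OR targets
```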
3. Classical Conditioning

Studied by Pavlov (1927); assumes that an unconditioned stimulus (US) automatically generates an unconditioned response (UR).

Pavlov determined that the US-UR pair is defined genetically and is appropriate to ensure survival in the agent's environment. In Pavlov's studies, the sight of food (US) results in the dog's salivation (UR).

Associations can also be developed between a conditioned stimulus (CS), which has no intrinsic survival value, and the UR. Further studies: a bell rings repeatedly with the sight of food; eventually the bell ringing alone generates salivation.

NOTE: Hebbian learning can also produce classical conditioning.

[Photo: Ivan Pavlov]
Classical Conditioning for Self-Organization of a Behavior-Based Robotic System

Instead of hard-wiring relationships between stimuli and responses, the learning architecture permits associations to develop over time.

[Diagram: sensors provide an aversive unconditioned stimulus (collision detectors), a conditioned stimulus (range finder), and an appetitive unconditioned stimulus (target detector); the aversive US can inhibit the appetitive US, and the unconditioned response drives the robot motors.]
Robot Example of Classical Conditioning

Studies of Verschure, et al. (1992) for collision avoidance and target acquisition.

The positive unconditioned stimulus (US) is divided into 4 discrete areas in which an attractive (appetitive) target may appear: ahead, behind, left, right.

The unconditioned response (UR) set consists of 6 possible commands: advance, reverse, turn right 9 degrees, turn right 1 degree, turn left 9 degrees, turn left 1 degree.

Two additional collision sensors serve as negative US, producing a response consisting of reversing and turning 9 degrees away from the direction of collision. The negative US can inhibit the positive US, to ensure that managing collisions is the first priority.
Robot Example (cont.)

CSs: a range sensor produces a distance profile over the 180 degrees in the direction the robot is heading. Readings are divided into varying, discrete levels of resolution:
- Forward (-30 to +30 degrees): 20 units covering 3 degrees each
- Area to the right (+30 to +60 degrees): 5 units covering 6 degrees each
- Area to the far right (+60 to +90 degrees): 3 units covering 10 degrees each
- Left side: analogous to the right

Total range readings: 36 units.
Robot Example (cont.)

For the neural network implementation, perceptron-like linear threshold units with binary activation values are used. Synaptic weights are updated according to the following rule:

    Δw_ij = (1/N) * (η * o_i * o_j - ε * ō * w_ij)

where:
- η is the learning rate
- ε is the decay rate
- N is the number of units in the CS field
- o_i and o_j are the binary output values of units i and j, respectively
- ō is the average activity of the US field
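The Hebbian-plus-decay rule above can be sketched as follows. All numeric values (learning rate, decay rate, iteration count) are illustrative assumptions; only the form of the rule comes from the slide:

```python
# Update rule from the slide:
#   dw_ij = (1/N) * (eta * o_i * o_j - epsilon * o_bar * w_ij)
# where o_bar is the average US-field activity and N the number of
# units in the CS field (36 in this example).
def cs_us_update(w_ij, o_i, o_j, o_bar, n_units, eta=0.5, epsilon=0.1):
    return w_ij + (1.0 / n_units) * (eta * o_i * o_j - epsilon * o_bar * w_ij)

w = 0.0
# Repeated pairing of an active CS unit (o_i=1) with an active US unit
# (o_j=1) strengthens the connection; the decay term keeps it bounded.
for _ in range(2000):
    w = cs_us_update(w, o_i=1, o_j=1, o_bar=1.0, n_units=36)
print(round(w, 3))  # approaches the fixed point eta/epsilon = 5.0
```

The decay term is what distinguishes this from plain Hebbian learning: weights stop growing once the Hebbian and decay terms balance.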
Robot Example (cont.)

The robot's task: learn useful behaviors by associating perceptual stimuli with environmental feedback. Behaviors include:
- avoidance, where the robot learns not to bump into things
- approach, to a desired target

Note that the robot has no a priori understanding of how to use range data to prevent the collisions occurring in the US set; this must be learned from the CS.

Simulation studies indicate that successful behavior does occur, in a manner consistent with the agent's goals.
Other Examples of Robot Learning Using Classical Conditioning

Others have shown similar techniques useful for learning:
- Sorting colored blocks that conduct electricity (either strongly or weakly), based on feedback from an aversive or appetitive response
- Teaching a Braitenberg-like robot, consisting of a neural net of 5 neurons, to seek light
- Teaching a robot to develop a topological map using a neural network, by learning suitable responses at different locations in the world
- Robot learning of encodings similar to potential fields for representing learned landmark positions
Example of RL Learning (+ Neural Nets) (Movies): Khepera robot
4. Connectionist Adaptive Heuristic Critic Learning (AHC)

With connectionist adaptive heuristic critic (AHC) reinforcement learning methods, a critic learns the utility or value of particular states through reinforcement. The learned values are then used locally to reinforce the selection of particular actions for a given state.

Gachet, et al. (1994) use this approach as the basis for learning the relative strengths (i.e., gain matrix G) of each behavior's response within an active assemblage.

Specific goal of this study: learn how to effectively coordinate a robot equipped with the following behaviors:
- goal attraction
- two perimeter-following behaviors (left and right)
- free-space attraction
- avoiding objects
- following a path

The output of each behavior is a vector that the robot sums before execution (a la the schema methods we have studied).
Connectionist AHC Learning System

[Diagram: sensor data feeds a classification system (input layer); weight matrix W forms the adaptive search element, which outputs behavioral gains; weight matrix V forms the adaptive critic element, which receives the reinforcement signal and sends local reinforcement signals to the search element.]
AHC Network (cont.)

The AHC network starts with a classification system that maps incoming sonar data onto a set of situations (either 32 or 64, depending on the task) that reflect the sensed environment.

The output layer, containing a weight matrix W and called the associative search element (ASE), computes individual behavioral gains for each behavior:

    W_ki(t+1) = W_ki(t) + α * b_i(t) * e_ki(t)

where α is the learning rate and e_ki(t) is the eligibility of weight W_ki for reinforcement.

The adaptive critic element (ACE) determines the reinforcement signal to be applied to the ASE. The ACE's weights V are updated independently:

    V_ki(t+1) = V_ki(t) + β * b_i(t) * x_k(t)

where β is a positive constant and x_k is the eligibility for the ACE.

This partitioning of the reinforcement-updating rules from the action-element updating is a characteristic of AHC methods in general.
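The two update rules above can be sketched side by side. All numeric values (α, β, the weights, and the signals) are illustrative assumptions; only the form of the rules comes from the slide:

```python
# ASE rule from the slide: W_ki(t+1) = W_ki(t) + alpha * b_i(t) * e_ki(t)
def ase_update(W_ki, b_i, e_ki, alpha=0.1):
    return W_ki + alpha * b_i * e_ki

# ACE rule from the slide: V_ki(t+1) = V_ki(t) + beta * b_i(t) * x_k(t)
def ace_update(V_ki, b_i, x_k, beta=0.05):
    return V_ki + beta * b_i * x_k

# One step under negative reinforcement (b_i = -1, e.g. imminent collision):
# eligible weights in both elements are weakened.
W_new = ase_update(W_ki=0.4, b_i=-1.0, e_ki=0.8)   # ~0.32
V_new = ace_update(V_ki=0.2, b_i=-1.0, x_k=0.5)    # ~0.175
print(W_new, V_new)
```

Note the separation the slide describes: the ACE adjusts its own value weights V, while the ASE's action weights W are driven by the reinforcement signal the ACE produces.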
AHC Example (cont.)

Task for the robot: learn a set of gain multipliers (G) for a particular task-environment. Three different missions:
- learning to explore the environment safely
- learning to move back and forth between alternating goal points safely
- following a predetermined path without collisions

During exploration:
- The robot moves randomly.
- When a collision occurs, negative reinforcement is applied and the robot is moved back to a position it occupied N steps earlier (N = 30 for simulation, 10 for the physical robot).

For goal-oriented missions, negative reinforcement occurs when a collision is imminent, and when the robot is pointing away from the goal (or path) while no obstacles are blocking its way to the goal.

Experiments showed this learning approach successful for these tasks.
Summary of Neural Network Learning

- Neural networks, which can be used to implement reinforcement learning, use specialized, multi-node architectures.
- Learning in neural nets occurs through the adjustment of synaptic weights, using update procedures such as Hebb's rule and backpropagation (an error-minimization method).
- Classical conditioning, in which a conditioned stimulus is eventually associated with an unconditioned response, can be manifested in robotic systems.