Reinforcement Learning and Artificial Intelligence


1 Reinforcement Learning and Artificial Intelligence. PIs: Rich Sutton, Michael Bowling, Dale Schuurmans, Vadim Bulitko; plus: Lori Troop, Mark Lee

2 Reinforcement learning is learning from interaction to achieve a goal. The agent takes actions in an environment and receives back states and rewards. A complete agent: temporally situated, continual learning & planning; its object is to affect the environment; the environment is stochastic & uncertain.
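A minimal sketch of this interaction loop (the toy environment, the -1-per-step reward, and all names here are invented for illustration):

```python
def run_episode(env_step, policy, start_state, max_steps=100):
    """Agent-environment loop: act, observe reward and next state, repeat."""
    state, total_reward = start_state, 0.0
    for _ in range(max_steps):
        action = policy(state)                         # agent chooses an action
        state, reward, done = env_step(state, action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward

# Toy environment: walk right from position 0; the episode ends at position 5.
def env_step(state, action):
    state = state + (1 if action == "right" else -1)
    return state, -1.0, state >= 5                     # -1 per step until done

run_episode(env_step, lambda s: "right", start_state=0)  # cumulative reward -5.0
```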

3 Psychology Artificial Intelligence Reinforcement Learning Neuroscience Control Theory

4 Intro to RL, with a robot bias: policy, reward, value, TD error, world model, generalized policy iteration

5 World or world model

6 Hajime Kimura's RL robots (video panels: Backward; New Robot, Same Algorithm; Before/After)

7 Devilsticking: model-based reinforcement learning of devilsticking. Finnegan Southey (University of Alberta); Stefan Schaal & Chris Atkeson (Univ. of Southern California)

8

9

10 The RoboCup Soccer Competition

11 Autonomous Learning of Efficient Gait Kohl & Stone (UTexas) 2004

12

13

14

15 Policies. A policy maps each state to an action to take, like a stimulus-response rule. We seek a policy that maximizes cumulative reward; the policy is a subgoal to achieving reward.

16 The Reward Hypothesis: the goal of intelligence is to maximize the cumulative sum of a single received number: reward = pleasure - pain. Artificial intelligence = reward maximization.

17 Brain reward systems. What signal does this neuron carry? (Honeybee brain, VUM neuron; Hammer & Menzel)

18 Value

19 Value systems are hedonism with foresight. We value situations according to how much reward we expect will follow them. All efficient methods for solving sequential decision problems determine (learn or compute) value functions as an intermediate step. Value systems are a means to reward, yet we care more about values than rewards.

20 Pleasure = good. "Even enjoying yourself you call evil whenever it leads to the loss of a pleasure greater than its own, or lays up pains that outweigh its pleasures.... Isn't it the same when we turn back to pain? To suffer pain you call good when it either rids us of greater pains than its own or leads to pleasures that outweigh them." (Plato, Protagoras)

21 Reward: $r : \text{States} \to \Pr(R)$, e.g., $r_t \in \Re$. Policies: $\pi : \text{States} \to \Pr(\text{Actions})$; $\pi^*$ is optimal. Value functions: $V^\pi : \text{States} \to \Re$, $V^\pi(s) = E_\pi\{\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \mid s_t = s\}$, with discount factor $\gamma \approx 1$ but $\gamma < 1$.
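The sum inside that expectation can be checked numerically for a finite reward sequence; a small sketch (the function name is mine, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_{k=1}^{K} gamma^(k-1) * r_{t+k}: a reward k steps
    in the future is weighted by gamma^(k-1)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

discounted_return([0.0, 0.0, 1.0], gamma=0.5)  # 0.5**2 * 1.0 = 0.25
```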

22 Backgammon. STATES: configurations of the playing board. ACTIONS: moves. REWARDS: win +1, lose -1, else 0. A big game.

23 TD-Gammon (Tesauro). Value network; action selection by 2-3 ply search; TD error $V_{t+1} - V_t$. Start with a random network; play millions of games against itself; learn a value function from this simulated experience. Six weeks later it's the best player of backgammon in the world.

24 The Mountain Car problem. Gravity wins: the car must back away from the goal to build momentum. SITUATIONS: car's position and velocity. ACTIONS: three thrusts: forward, reverse, none. REWARDS: always -1 until the car reaches the goal. No discounting; a minimum-time-to-goal problem (Moore, 1990).

25 Value functions learned while solving the Mountain Car problem. Minimize time to goal: value = estimated time to goal; goal region marked.

26 TD error

27 Reward: $r : \text{States} \to \Pr(R)$, e.g., $r_t \in \Re$. Policies: $\pi : \text{States} \to \Pr(\text{Actions})$; $\pi^*$ is optimal. Value functions: $V^\pi : \text{States} \to \Re$, $V^\pi(s) = E_\pi\{\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \mid s_t = s\}$, with discount factor $\gamma \approx 1$ but $\gamma < 1$.

28 $V_t = E\{\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k}\} = E\{r_{t+1} + \gamma \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+1+k}\} \approx r_{t+1} + \gamma V_{t+1}$, so $\text{TDerror}_t = r_{t+1} + \gamma V_{t+1} - V_t$.
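This TD error drives the simplest value-learning rule, TD(0); a sketch under invented names (the states "A"/"B" and the step size alpha are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma*V(s')."""
    td_error = r + gamma * V[s_next] - V[s]  # the TD error defined above
    V[s] += alpha * td_error                 # learn a fraction of the error
    return td_error

V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=0.0, s_next="B")  # TD error 0.9; V["A"] moves to 0.09
```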

29 Brain reward systems seem to signal TD error (Wolfram Schultz et al.)

30 World or world model

31 World models

32 Autonomous helicopter flight via reinforcement learning. Ng (Stanford); Kim, Jordan & Sastry (UC Berkeley), 2004

33

34

35 Reason as RL over imagined experience. 1. Learn a predictive model of the world's dynamics: transition probabilities, expected immediate rewards. 2. Use the model to generate imaginary experience: internal thought trials, mental simulation (Craik, 1943). 3. Apply RL as if the experience had really happened: vicarious trial and error (Tolman, 1932).
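Steps 2-3 correspond to the planning phase of Dyna-style algorithms; a sketch, assuming a deterministic learned model stored as a dict (all names here are my own, not the slides'):

```python
import random

def planning_sweep(Q, model, alpha=0.5, gamma=0.9, steps=10):
    """Replay transitions from the learned model as imaginary experience,
    applying an ordinary Q-learning update as if they had really happened."""
    for _ in range(steps):
        (s, a), (r, s2) = random.choice(list(model.items()))  # imagined step
        best_next = max(Q[s2].values()) if s2 in Q else 0.0   # 0 past the model
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```

With a single remembered transition of reward 1, repeated sweeps drive that Q-value toward 1 without any further real experience.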

36 GridWorld Example

37 Intro to RL, with a robot bias: policy, reward, value, TD error, world model, generalized policy iteration

38 Policy Iteration: $\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$, alternating policy evaluation and policy improvement (greedification). Improvement is monotonic; the process converges in a finite number of steps. Generalized policy iteration: intermix the two steps more finely, state by state, action by action, sample by sample.
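For a small deterministic MDP the alternation can be sketched directly (the transition table `P`, reward table `R`, and the overall shape are my own illustration, not the slides' notation):

```python
def policy_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Alternate policy evaluation and greedification until the policy is stable.
    P[s][a] is the (deterministic) next state; R[s][a] the immediate reward."""
    policy = {s: actions[0] for s in states}
    while True:
        # Policy evaluation: iterate V(s) = R(s, pi(s)) + gamma * V(s')
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = R[s][a] + gamma * V[P[s][a]]
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Greedification: make the policy greedy with respect to V
        stable = True
        for s in states:
            best = max(actions, key=lambda a: R[s][a] + gamma * V[P[s][a]])
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```

On a two-state example where "go" from s0 earns reward 1, one round of evaluation plus greedification already finds the optimal policy; each improvement step is monotonic, as the slide notes.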

39 Generalized Policy Iteration: policy evaluation takes the policy $\pi$ to its value function $V^\pi$; greedification takes a value function back to an improved policy; alternating the two converges to $\pi^*$ and $V^*$.

40 Summary: RL's computational theory of mind: Reward, Policy, Value Function, Predictive Model

41 Summary: RL's computational theory of mind: Reward, Policy, Value Function, Predictive Model. The value function: a learned, time-varying prediction of imminent reward

42 Summary: RL's computational theory of mind: Reward, Policy, Value Function, Predictive Model. A learned, time-varying prediction of imminent reward; key to all efficient methods for finding optimal policies

43 Summary: RL's computational theory of mind: Reward, Policy, Value Function, Predictive Model. A learned, time-varying prediction of imminent reward; key to all efficient methods for finding optimal policies. This has nothing to do with either biology or computers

44 Summary: RL's computational theory of mind: Reward, Policy, Value Function, Predictive Model. It's all created from the scalar reward signal

45 Summary: RL's computational theory of mind: Reward, Policy, Value Function, Predictive Model. It's all created from the scalar reward signal, together with the causal structure of the world


47 Additions for next year: Sarsa; how RL might actually be applied to something the students might do

48 World or world model

49 Current work: knowledge is subjective. An individual's knowledge consists of predictions about its future (low-level) sensations and actions. Everything else (objects, houses, people, love) must be constructed from sensation and action.

50 Subjective/predictive knowledge. John is in the coffee room; my car is in the South parking lot. What we know about geography and navigation; what we know about how an object looks and rotates; what we know about how objects can be used; recognition strategies for objects and letters. The portrait of Washington on the dollar in the wallet in my other pants in the laundry has a mustache on it.

51 RoboCup soccer keepaway Stone, Sutton & Kuhlmann, 2005


More information

Consider the following aspects of human intelligence: consciousness, memory, abstract reasoning

Consider the following aspects of human intelligence: consciousness, memory, abstract reasoning All life is nucleic acid. The rest is commentary. Isaac Asimov Consider the following aspects of human intelligence: consciousness, memory, abstract reasoning and emotion. Discuss the relative difficulty

More information

Computational Neuroscience. Instructor: Odelia Schwartz

Computational Neuroscience. Instructor: Odelia Schwartz Computational Neuroscience 2017 1 Instructor: Odelia Schwartz From the NIH web site: Committee report: Brain 2025: A Scientific Vision (from 2014) #1. Discovering diversity: Identify and provide experimental

More information

The Psychology of Street Photography

The Psychology of Street Photography The Psychology of Street Photography I think street photography is 90% psychology if you can master your own mental psychology, then you will be able to achieve your personal maximum in your photography

More information

Cognitive and Computational Neuroscience of Aging

Cognitive and Computational Neuroscience of Aging Cognitive and Computational Neuroscience of Aging Hecke Dynamically Selforganized MPI Brains Can Compute Nervously Goettingen CNS Seminar SS 2007 Outline 1 Introduction 2 Psychological Models 3 Physiological

More information

Intelligent Agents. BBM 405 Fundamentals of Artificial Intelligence Pinar Duygulu Hacettepe University. Slides are mostly adapted from AIMA

Intelligent Agents. BBM 405 Fundamentals of Artificial Intelligence Pinar Duygulu Hacettepe University. Slides are mostly adapted from AIMA 1 Intelligent Agents BBM 405 Fundamentals of Artificial Intelligence Pinar Duygulu Hacettepe University Slides are mostly adapted from AIMA Outline 2 Agents and environments Rationality PEAS (Performance

More information

Learning to Use Episodic Memory

Learning to Use Episodic Memory Learning to Use Episodic Memory Nicholas A. Gorski (ngorski@umich.edu) John E. Laird (laird@umich.edu) Computer Science & Engineering, University of Michigan 2260 Hayward St., Ann Arbor, MI 48109 USA Abstract

More information

Refresh. The science of sleep for optimal performance and well being. Sleep and Exams: Strange Bedfellows

Refresh. The science of sleep for optimal performance and well being. Sleep and Exams: Strange Bedfellows Refresh The science of sleep for optimal performance and well being Unit 7: Sleep and Exams: Strange Bedfellows Can you remember a night when you were trying and trying to get to sleep because you had

More information

Luke A. Rosedahl & F. Gregory Ashby. University of California, Santa Barbara, CA 93106

Luke A. Rosedahl & F. Gregory Ashby. University of California, Santa Barbara, CA 93106 Luke A. Rosedahl & F. Gregory Ashby University of California, Santa Barbara, CA 93106 Provides good fits to lots of data Google Scholar returns 2,480 hits Over 130 videos online Own Wikipedia page Has

More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 2. Rational Agents Nature and Structure of Rational Agents and Their Environments Wolfram Burgard, Bernhard Nebel and Martin Riedmiller Albert-Ludwigs-Universität

More information

CS 331: Artificial Intelligence Intelligent Agents. Agent-Related Terms. Question du Jour. Rationality. General Properties of AI Systems

CS 331: Artificial Intelligence Intelligent Agents. Agent-Related Terms. Question du Jour. Rationality. General Properties of AI Systems General Properties of AI Systems CS 331: Artificial Intelligence Intelligent Agents Sensors Reasoning Actuators Percepts Actions Environmen nt This part is called an agent. Agent: anything that perceives

More information

Introduction to Mental Skills Training for Successful Athletes John Kontonis Assoc MAPS

Introduction to Mental Skills Training for Successful Athletes John Kontonis Assoc MAPS Introduction to Mental Skills Training for Successful Athletes John Kontonis Assoc MAPS "You can always become better." - Tiger Woods Successful Athletes 1. Choose and maintain a positive attitude. 2.

More information

Secrets to the Body of Your Life in 2017

Secrets to the Body of Your Life in 2017 Secrets to the Body of Your Life in 2017 YOU CAN HAVE RESULTS OR EXCUSES NOT BOTH. INTRO TO THIS LESSON Welcome to Lesson #3 of your BarStarzz Calisthenics Workshop! For any new comers, make sure you watch

More information

Brain-Based Devices. Studying Cognitive Functions with Embodied Models of the Nervous System

Brain-Based Devices. Studying Cognitive Functions with Embodied Models of the Nervous System Brain-Based Devices Studying Cognitive Functions with Embodied Models of the Nervous System Jeff Krichmar The Neurosciences Institute San Diego, California, USA http://www.nsi.edu/nomad develop theory

More information

Learning to act by predicting the future

Learning to act by predicting the future Learning to act by predicting the future Alexey Dosovitskiy and Vladlen Koltun Intel Labs, Santa Clara ICLR 2017, Toulon, France Sensorimotor control Aim: produce useful motor actions based on sensory

More information

Introduction to Biological Anthropology: Notes 13 Mating: Primate females and males Copyright Bruce Owen 2010 We want to understand the reasons

Introduction to Biological Anthropology: Notes 13 Mating: Primate females and males Copyright Bruce Owen 2010 We want to understand the reasons Introduction to Biological Anthropology: Notes 13 Mating: Primate females and males Copyright Bruce Owen 2010 We want to understand the reasons behind the lifestyles of our non-human primate relatives

More information

Marcus Hutter Canberra, ACT, 0200, Australia

Marcus Hutter Canberra, ACT, 0200, Australia Marcus Hutter Canberra, ACT, 0200, Australia http://www.hutter1.net/ Australian National University Abstract The approaches to Artificial Intelligence (AI) in the last century may be labelled as (a) trying

More information

Stress is different for everyone While what happens in the brain and the body is the same for all of us, the precipitating factors are very

Stress is different for everyone While what happens in the brain and the body is the same for all of us, the precipitating factors are very 1 Stress is different for everyone While what happens in the brain and the body is the same for all of us, the precipitating factors are very individual. What one person experiences as stressful might

More information

Policy Gradients. CS : Deep Reinforcement Learning Sergey Levine

Policy Gradients. CS : Deep Reinforcement Learning Sergey Levine Policy Gradients CS 294-112: Deep Reinforcement Learning Sergey Levine Class Notes 1. Homework 1 milestone due today (11:59 pm)! Don t be late! 2. Remember to start forming final project groups Today s

More information

MITOCW watch?v=fmx5yj95sgq

MITOCW watch?v=fmx5yj95sgq MITOCW watch?v=fmx5yj95sgq The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To

More information

PART - A 1. Define Artificial Intelligence formulated by Haugeland. The exciting new effort to make computers think machines with minds in the full and literal sense. 2. Define Artificial Intelligence

More information

Solutions for Chapter 2 Intelligent Agents

Solutions for Chapter 2 Intelligent Agents Solutions for Chapter 2 Intelligent Agents 2.1 This question tests the student s understanding of environments, rational actions, and performance measures. Any sequential environment in which rewards may

More information

Agents. This course is about designing intelligent agents Agents and environments. Rationality. The vacuum-cleaner world

Agents. This course is about designing intelligent agents Agents and environments. Rationality. The vacuum-cleaner world This course is about designing intelligent agents and environments Rationality The vacuum-cleaner world The concept of rational behavior. Environment types Agent types 1 An agent is an entity that perceives

More information

Positive Thinking Train Your Mind For Success And Happiness

Positive Thinking Train Your Mind For Success And Happiness Positive Thinking Train Your Mind For Success And Happiness Francisco Bujan - 1 - Contents Intro 12 Part 1 - Mind dynamics 13 What is self talk and how to make it work for you? 14 Mind sets 15 How inspiration

More information

Introduction and Historical Background. August 22, 2007

Introduction and Historical Background. August 22, 2007 1 Cognitive Bases of Behavior Introduction and Historical Background August 22, 2007 2 Cognitive Psychology Concerned with full range of psychological processes from sensation to knowledge representation

More information

Contents. Foundations of Artificial Intelligence. Agents. Rational Agents

Contents. Foundations of Artificial Intelligence. Agents. Rational Agents Contents Foundations of Artificial Intelligence 2. Rational s Nature and Structure of Rational s and Their s Wolfram Burgard, Bernhard Nebel, and Martin Riedmiller Albert-Ludwigs-Universität Freiburg May

More information

Lecture 2 Agents & Environments (Chap. 2) Outline

Lecture 2 Agents & Environments (Chap. 2) Outline Lecture 2 Agents & Environments (Chap. 2) Based on slides by UW CSE AI faculty, Dan Klein, Stuart Russell, Andrew Moore Outline Agents and environments Rationality PEAS specification Environment types

More information