Sequential Decision Making



1 Sequential Decision Making

2 Sequential decisions Many (most) real-world problems cannot be solved with a single action. We need to plan over a longer horizon.

3 Ex: Sequential decision problems We start at START and want to get to the goal (+1) while avoiding the pit (-1). How do we get to the goal (and be rewarded +1)? Assume a fully observable environment. Plan: Up, Up, Right, Right, Right

4 Ex: Sequential decision problems cont'd What if actions are not deterministic? Probability 0.8 to move in the desired direction. Probability 0.1 each to slip to one side (left or right of the desired direction). Cannot create a fixed plan ahead of time! What is the probability that the old plan succeeds?

5 Execute predefined plan Old plan: Up, Up, Right, Right, Right. Probability that all five moves go in the desired direction: $0.8^5 = 0.32768$
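The success probability of the fixed plan can be checked by pushing the state distribution through the plan one action at a time. The sketch below is not from the slides. It assumes the textbook 4x3 layout: one blocked cell at (2, 2), START at (1, 1), goal at (4, 3), pit at (4, 2), and a move into a wall or the border leaves the agent where it is. All names (`move`, `transitions`, `plan_success_probability`) are illustrative.

```python
# Assumed layout: 4 columns x 3 rows, blocked cell at (2, 2),
# START at (1, 1), goal at (4, 3), pit at (4, 2).
WALL = (2, 2)
TERMINALS = {(4, 3), (4, 2)}
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {WALL}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SIDES = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def move(s, direction):
    """Move one cell; bumping into the wall or the border leaves s unchanged."""
    dx, dy = MOVES[direction]
    nxt = (s[0] + dx, s[1] + dy)
    return nxt if nxt in STATES else s


def transitions(s, a):
    """Return {s2: probability} for taking action a in state s."""
    if s in TERMINALS:
        return {s: 1.0}  # the episode has ended; stay put
    out = {}
    for direction, p in ((a, 0.8), (SIDES[a][0], 0.1), (SIDES[a][1], 0.1)):
        s2 = move(s, direction)
        out[s2] = out.get(s2, 0.0) + p
    return out


def plan_success_probability(plan, start=(1, 1), goal=(4, 3)):
    """Propagate the state distribution through a fixed action sequence."""
    dist = {start: 1.0}
    for a in plan:
        nxt = {}
        for s, p in dist.items():
            for s2, q in transitions(s, a).items():
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
    return dist.get(goal, 0.0)


p_success = plan_success_probability(["Up", "Up", "Right", "Right", "Right"])
print(round(p_success, 5))  # 0.32776 under the assumptions above
```

Under these assumptions the result is slightly larger than 0.8^5 = 0.32768, because the agent can also reach the goal by accidentally slipping around the other side of the blocked cell, which contributes 0.1^4 x 0.8 = 0.00008.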

6 Markov Model In a first-order Markov model, the distribution $p(x_t)$ depends only on the distribution $p(x_{t-1})$, i.e. $p(x_t \mid x_{t-1}, \dots, x_0) = p(x_t \mid x_{t-1})$. The present (current state) can be predicted using local knowledge of the most recent past (the state at the previous step)

7 Transition model The transition function defines the probability of transitioning from one state to another. In decision making, each action is associated with its own transition function → select actions to control the system so that it behaves in some desired way or maximizes the chance of achieving some goal. T(s,a,s') (first-order Markov assumption*): probability of reaching s' starting from s given action a. * T depends only on the previous state s and not on the rest of the history

8 Reward function Defines the reward of a certain state. In this problem someone defined the two terminal states (where a mission / an episode would stop) to have reward +1 (goal) and -1 (trap). Coming up with a reward function that achieves a certain behavior is typically part of our job as engineers

9 We need to make it up! Assume the agent gets reward R(s) for being in state s. R([4,3]) = +1 (Go to goal) R([4,2]) = -1 (Avoid trap) R(rest) = … (Get to goal quickly) Let the utility of a certain path be the sum of the state rewards along that path
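As a small illustration of an additive path utility, the sketch below sums R(s) along one path. The reward for the remaining states is not recoverable from the slide; the value -0.04 used here is an assumption (the value commonly used for this 4x3 world in textbooks), and `reward` and `path_utility` are illustrative names.

```python
GOAL, TRAP = (4, 3), (4, 2)
STEP_REWARD = -0.04  # assumed value for R(rest); not given in the slides


def reward(s):
    """R(s): +1 at the goal, -1 at the trap, a small penalty elsewhere."""
    if s == GOAL:
        return 1.0
    if s == TRAP:
        return -1.0
    return STEP_REWARD


def path_utility(path):
    """Utility of a path = sum of the rewards of the states visited."""
    return sum(reward(s) for s in path)


# The path that the plan Up, Up, Right, Right, Right follows if nothing slips.
intended_path = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
utility = path_utility(intended_path)
print(round(utility, 2))  # 0.8 = five steps at -0.04 plus the goal reward +1
```

A more negative step reward makes long paths more costly, which is the sense in which it pushes the agent to get to the goal quickly.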

10 Markov Decision Process This defines a Markov Decision Process (MDP). Assumes a fully observable environment. Defined by: Initial state: $s_0$; Transition model: T(s,a,s'); Reward: R(s). What does a solution look like?

11 Solution to MDP A solution to an MDP cannot be a fixed plan (non-deterministic world, need to sense the state). It is a policy π that maps state to action → a = π(s)

12 How good is a policy? How to measure the quality of a policy → Measure the expected utility over the history (a stochastic environment means that we need to use expectations). Optimal policy, π*: highest possible expected utility

13 Ex: Optimal policy π* Optimal policy for the previous problem. Tells us what to do in each state. The actual path is only known when moving, because actions are non-deterministic

14 Optimal policy depends on T and R!! Different behaviors for different R What if R>0 always? How about changing T?

15 Utility of sequences $U_h([s_0, s_1, \dots, s_n]) = R(s_0) + R(s_1) + \dots + R(s_n)$ What about infinite sequences? Without a terminal state the sum might become infinite. How do we compare two sequences that both have infinite utility?

16 Discounted rewards Idea: Give less weight to future rewards. Captures that a reward tomorrow is less certain than a reward today. Use a discount factor γ, 0 < γ < 1. Gives bounded utility
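Why discounting gives a bounded utility can be written out as a short derivation. This is the standard geometric-series argument rather than something taken from the slide, and it assumes the rewards are bounded, $|R(s)| \le R_{\max}$ for all s ($R_{\max}$ is a symbol introduced here).

```latex
U_h([s_0, s_1, s_2, \dots]) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t),
\qquad
\left| U_h([s_0, s_1, s_2, \dots]) \right|
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = \frac{R_{\max}}{1-\gamma}
  \quad \text{for } 0 < \gamma < 1 .
```

For example, with $R_{\max} = 1$ and $\gamma = 0.9$, no sequence can have utility larger than 10 in magnitude, so infinite sequences can be compared.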

17 Selecting the best policy How do we select the best policy? Many state sequences to compare As before, maximize expected utility

18 Value iteration Key insight: The utility of a state is the immediate reward plus the discounted expected utility of the next states (assuming that we choose the optimal policy). Idea: Iterate: (1) calculate the utility of each state; (2) use the utilities to select the optimal decision in each state

19 Algorithm: Value iteration
Initialize U(s) arbitrarily for all s
Loop until policy has converged
    loop over all states, s
        loop over all actions, a
            $Q(s,a) = R(s) + \gamma \sum_{s'} T(s,a,s')\, U(s')$
        end
        $U(s) = \max_a Q(s,a)$
    end
end
NOTE: The Q-function comes back in reinforcement learning

20 Bellman equation $U(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s')\, U(s')$ Converges to a unique optimal solution. Stop iterating when the largest change in utility for any state is small enough. Can show that …
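The iteration described above (immediate reward plus discounted expected utility of the next states, repeated until the largest change is small) can be sketched in Python as follows. This is an illustrative sketch, not the `value_iteration.m` script referred to in the slides. It assumes the textbook 4x3 layout with a blocked cell at (2, 2), a step reward of -0.04 and γ = 0.9; none of these values are given in the slides.

```python
GAMMA = 0.9          # assumed discount factor
STEP_REWARD = -0.04  # assumed reward for non-terminal states
WALL = (2, 2)        # assumed blocked cell
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {WALL}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SIDES = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def reward(s):
    return TERMINALS.get(s, STEP_REWARD)


def move(s, direction):
    """Move one cell; bumping into the wall or the border leaves s unchanged."""
    dx, dy = MOVES[direction]
    nxt = (s[0] + dx, s[1] + dy)
    return nxt if nxt in STATES else s


def transitions(s, a):
    """T(s, a, .) as a dict {s2: probability} for a non-terminal state s."""
    out = {}
    for direction, p in ((a, 0.8), (SIDES[a][0], 0.1), (SIDES[a][1], 0.1)):
        s2 = move(s, direction)
        out[s2] = out.get(s2, 0.0) + p
    return out


def q_value(s, a, U):
    """Q(s, a) = R(s) + gamma * sum_s' T(s, a, s') U(s')."""
    return reward(s) + GAMMA * sum(p * U[s2] for s2, p in transitions(s, a).items())


def value_iteration(eps=1e-6):
    U = {s: 0.0 for s in STATES}
    while True:
        new_U = {}
        for s in STATES:
            if s in TERMINALS:
                new_U[s] = reward(s)  # episode ends here
            else:
                new_U[s] = max(q_value(s, a, U) for a in MOVES)
        delta = max(abs(new_U[s] - U[s]) for s in STATES)
        U = new_U
        if delta < eps:  # largest change is small enough
            return U


U = value_iteration()
policy = {s: max(MOVES, key=lambda a: q_value(s, a, U))
          for s in STATES if s not in TERMINALS}
print(policy[(3, 3)], round(U[(3, 3)], 3))
```

The stopping test here uses the largest utility change rather than waiting for the greedy policy to stop changing; the greedy policy is read off once at the end.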

21 Value iteration Run value_iteration.m

22 Policy iteration Alternative to value iteration for finding a policy. Choose an arbitrary policy. Loop until the policy does not change any more: (1) compute the value function V(s) given the policy, known as policy evaluation; (2) given this value function, improve the policy for each state
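The two alternating steps, policy evaluation and policy improvement, can be sketched as below. This is a minimal sketch under assumptions that are not in the slides: the textbook 4x3 layout with a blocked cell at (2, 2), a step reward of -0.04, γ = 0.9, iterative (rather than exact linear-algebra) policy evaluation, and "Up" everywhere as the arbitrary initial policy.

```python
GAMMA = 0.9          # assumed discount factor
STEP_REWARD = -0.04  # assumed reward for non-terminal states
WALL = (2, 2)        # assumed blocked cell
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {WALL}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SIDES = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def move(s, direction):
    dx, dy = MOVES[direction]
    nxt = (s[0] + dx, s[1] + dy)
    return nxt if nxt in STATES else s  # bumping leaves s unchanged


def q_value(s, a, V):
    """R(s) + gamma * expected value of the next state under action a."""
    expected = 0.0
    for direction, p in ((a, 0.8), (SIDES[a][0], 0.1), (SIDES[a][1], 0.1)):
        expected += p * V[move(s, direction)]
    return STEP_REWARD + GAMMA * expected


def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate V(s) = Q(s, pi(s)) until it stops changing."""
    V = {s: TERMINALS.get(s, 0.0) for s in STATES}
    while True:
        delta = 0.0
        for s in policy:
            new = q_value(s, policy[s], V)
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < tol:
            return V


def policy_iteration():
    policy = {s: "Up" for s in STATES if s not in TERMINALS}  # arbitrary start
    while True:
        V = evaluate(policy)
        changed = False
        for s in policy:  # policy improvement, state by state
            best = max(MOVES, key=lambda a: q_value(s, a, V))
            if q_value(s, best, V) > q_value(s, policy[s], V) + 1e-9:
                policy[s] = best
                changed = True
        if not changed:
            return policy, V


policy, V = policy_iteration()
print(policy[(3, 3)], policy[(1, 2)])
```

An action is only replaced when it is strictly better by a small margin, which avoids flipping between equally good actions and guarantees that the loop terminates.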

23 Do not always have a model What if we do not have, e.g., the model T(s,a,s')? Use learning, for example: Identify T from lots of data. Q-learning: Estimate Q(s,a) directly. Q(s,a) is the total reward starting from s, applying action a and then acting optimally after that. Ex: DeepMind's Atari games (modeled Q using a deep net where s was a sequence of images, i.e. learned the action from an image sequence)
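A tabular version of Q-learning can be sketched as follows. The learner never looks at T(s,a,s'); it only sees sampled transitions from a simulator (`sample_next`, a stand-in for the real environment). Everything specific here is an assumption for illustration: the textbook 4x3 layout with a blocked cell at (2, 2), a step reward of -0.04, γ = 0.9, learning rate 0.1, random start states and uniformly random exploration.

```python
import random

STEP_REWARD = -0.04  # assumed reward for non-terminal states
WALL = (2, 2)        # assumed blocked cell
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {WALL}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SIDES = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def move(s, direction):
    dx, dy = MOVES[direction]
    nxt = (s[0] + dx, s[1] + dy)
    return nxt if nxt in STATES else s  # bumping leaves s unchanged


def sample_next(s, a, rng):
    """Simulator: the learner only sees samples, never the probabilities."""
    r = rng.random()
    direction = a if r < 0.8 else SIDES[a][0] if r < 0.9 else SIDES[a][1]
    return move(s, direction)


def q_learning(episodes=5000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    actions = list(MOVES)
    starts = sorted(s for s in STATES if s not in TERMINALS)
    Q = {(s, a): 0.0 for s in starts for a in actions}
    for _ in range(episodes):
        s = rng.choice(starts)
        for _ in range(100):            # cap on the episode length
            a = rng.choice(actions)     # pure exploration (off-policy)
            s2 = sample_next(s, a, rng)
            if s2 in TERMINALS:
                future = TERMINALS[s2]  # value of a terminal state is its reward
            else:
                future = max(Q[(s2, b)] for b in actions)
            # Move Q(s, a) a little towards the sampled target.
            Q[(s, a)] += alpha * (STEP_REWARD + gamma * future - Q[(s, a)])
            if s2 in TERMINALS:
                break
            s = s2
    return Q


Q = q_learning()
best_next_to_goal = max(MOVES, key=lambda a: Q[((3, 3), a)])
print(best_next_to_goal)
```

Because Q-learning is off-policy, the random behaviour used for exploration does not prevent the table from approaching the values of the greedy policy. A deep-network variant replaces the table `Q` with a function approximator.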

24 Partially observable environment What if the environment is not fully observable? Remember: there are no sensors that let us see everything, so this is the normal case!

25 Partially observable environment What if the environment is not fully observable? Cannot execute the policy since the state is unknown! Results in a Partially Observable MDP, or POMDP. Sensors provide observations of the environment. The observation model O(s,o) gives the probability of making observation o in state s

26 First attempt at POMDP No measurements. Initial position unknown. First attempt at a plan: Move left 5 times (now likely that you are on the left side). Move up 5 times (now likely that you are in the upper left corner). Move right 5 times. 77.5% success rate. Expected utility only 0.08

27 Belief state The belief state can be used in a partially observable world (conditional planning, vacuum cleaner). Definition: A belief state b is a probability distribution over possible states; b(s) is the probability of being in state s. Example: What is the initial belief state, assuming that the agent can be anywhere except +1 or -1?

28 Belief state update Need to update the belief state as we go along. Assume belief state b and action a. Update? b'(s') = some function of b(s), a, s' and o.


29 Belief state update Need to update the belief state as we go along. Assume belief state b and action a. Update: $b'(s') = \alpha\, O(s',o) \sum_{s} T(s,a,s')\, b(s)$, where α is a normalization factor such that $\sum_{s'} b'(s') = 1$. This is Bayes' rule with: b(s) = p(s) ("prior"); b'(s') = p(s' | s,a,o) ("posterior"); $\sum_{s} T(s,a,s')\, b(s)$ = p(s' | a,s) ("prediction"); O(s',o) = p(o | s') = p(o | s',s,a) {o is independent of s and a, given s'}
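The belief update (predict with the transition model, weight by the observation model, then normalize) can be sketched as below. The slides do not specify a sensor, so the one used here is hypothetical: it reports the number of adjacent walls (1 or 2 in this layout) and is correct with probability 0.9. The layout with a blocked cell at (2, 2) is likewise an assumption, and all names are illustrative.

```python
WALL = (2, 2)  # assumed blocked cell
TERMINALS = {(4, 3), (4, 2)}
STATES = {(x, y) for x in range(1, 5) for y in range(1, 4)} - {WALL}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SIDES = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def move(s, direction):
    dx, dy = MOVES[direction]
    nxt = (s[0] + dx, s[1] + dy)
    return nxt if nxt in STATES else s  # bumping leaves s unchanged


def transitions(s, a):
    """T(s, a, .) as a dict {s2: probability}; terminal states are absorbing."""
    if s in TERMINALS:
        return {s: 1.0}
    out = {}
    for direction, p in ((a, 0.8), (SIDES[a][0], 0.1), (SIDES[a][1], 0.1)):
        s2 = move(s, direction)
        out[s2] = out.get(s2, 0.0) + p
    return out


def wall_count(s):
    """Number of directions in which the agent would bump (1 or 2 here)."""
    return sum(move(s, d) == s for d in MOVES)


def observation_prob(s, o):
    """O(s, o) for the hypothetical wall-count sensor with 10% error."""
    return 0.9 if o == wall_count(s) else 0.1


def update_belief(b, a, o):
    """b'(s') = alpha * O(s', o) * sum_s T(s, a, s') b(s)."""
    predicted = {}
    for s, p in b.items():  # prediction step
        for s2, q in transitions(s, a).items():
            predicted[s2] = predicted.get(s2, 0.0) + p * q
    weighted = {s2: observation_prob(s2, o) * p for s2, p in predicted.items()}
    alpha = 1.0 / sum(weighted.values())  # normalization factor
    return {s2: alpha * p for s2, p in weighted.items()}


# Uniform initial belief over the nine non-terminal states.
nonterminal = [s for s in STATES if s not in TERMINALS]
b0 = {s: 1.0 / len(nonterminal) for s in nonterminal}
b1 = update_belief(b0, "Left", 1)  # move Left, then observe "one wall"
third_column = sum(p for s, p in b1.items() if s[0] == 3)
print(round(third_column, 2))
```

In this layout the non-terminal cells with exactly one adjacent wall are those in the third column, so observing "one wall" shifts most of the belief mass there.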

30 Solving POMDP Key insight: The optimal action depends on the belief state and not on the actual state! Optimal policy π*(b). Decision cycle: 1. Execute action a = π*(b) 2. Receive observation o 3. Update the belief state

31 Turn a POMDP into an MDP Introduce t(b,a,b'), the probability of reaching belief state b' from b given action a, and $r(b) = \sum_{s} b(s) R(s)$. t(b,a,b') and r(b) define an observable MDP over belief states. The optimal MDP strategy π*(b) is also optimal for the original POMDP

32 Second attempt at POMDP No observation → The problem is deterministic in belief space. The policy is a fixed sequence. The optimal sequence is: Left, Up, Up, Right, Up, Up, Right, Up, Up, Right, Up, Right, Up, Right, Up, Right, … Expected utility 0.38 (was 0.08 before)

33 So is it simple? Sounds simple at first, BUT the belief state is a probability distribution! Compare MDP and POMDP in the 4x3 world. MDP state: the position of the agent, i.e. 1 discrete variable with 11 possible values. POMDP belief state: an 11-dimensional vector of continuous variables!!!
