Partially-Observable Markov Decision Processes as Dynamical Causal Models. Finale Doshi-Velez NIPS Causality Workshop 2013


1 Partially-Observable Markov Decision Processes as Dynamical Causal Models Finale Doshi-Velez NIPS Causality Workshop 2013

2 The POMDP Mindset We poke the world (perform an action). [Diagram: Agent acting on World]

3 The POMDP Mindset We poke the world (perform an action). We get a poke back (see an observation). We get a poke back (get a reward: -$1). [Diagram: Agent and World exchanging action, observation, reward]

4 What next? We poke the world (perform an action). We get a poke back (see an observation). We get a poke back (get a reward: -$1). [Diagram: Agent and World]

5 What next? We poke the world (perform an action). We get a poke back (see an observation). We get a poke back (get a reward: -$1). ...the world is a mystery... [Diagram: Agent and World]

6 The agent needs a representation to use when making decisions: a representation of how the world works, and a representation of the current world state. We poke the world (perform an action). We get a poke back (see an observation). We get a poke back (get a reward: -$1). ...the world is a mystery... [Diagram: Agent and World]

7 Many problems can be framed this way Robot navigation (take movement actions, receive sensor measurements) Dialog management (ask questions, receive answers) Target tracking (search a particular area, receive sensor measurements) the list goes on...

8 The Causal Process, Unrolled [Diagram: unrolled sequence of actions a_{t-1}..a_{t+2}, observations o_{t-1}..o_{t+2}, and rewards r_{t-1}..r_{t+2} = -$1, -$1, -$5, $10]

9 The Causal Process, Unrolled Given a history of actions, observations, and rewards, how can we act in order to maximize long-term future rewards? [Diagram: unrolled actions, observations, and rewards]

10 The Causal Process, Unrolled Key Challenge: the entire history may be needed to make near-optimal decisions. [Diagram: unrolled actions, observations, and rewards]

11 The Causal Process, Unrolled All past events are needed to predict future events. [Diagram: unrolled actions, observations, and rewards]

12 The Causal Process, Unrolled The representation is a sufficient statistic that summarizes the history. [Diagram: unrolled actions, observations, and rewards with hidden states s_{t-1}..s_{t+2}]

13 The Causal Process, Unrolled The representation is a sufficient statistic that summarizes the history. We call this representation the information state. [Diagram: unrolled actions, observations, and rewards with hidden states s_{t-1}..s_{t+2}]

14 What is state? Sometimes, there exists an obvious choice for this hidden variable (such as a robot's true position) At other times, learning a representation that makes the system Markovian may provide insights into the problem.

15 Formal POMDP definition A POMDP consists of: a set of states S, actions A, and observations O; a transition function T( s' | s, a ); an observation function O( o | s, a ); a reward function R( s, a ); and a discount factor γ. The goal is to maximize E[ Σ_{t=1..∞} γ^t R_t ], the expected long-term discounted reward.
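The tuple above can be written down directly. A minimal sketch in Python; the two-state toy model, the array layouts (T[a, s, s'], O[a, s', o], R[s, a]), and all numbers are illustrative assumptions, not from the talk:

```python
import numpy as np

# Hypothetical two-state, two-action, two-observation POMDP (toy example).
# T[a, s, s'] = P(s' | s, a)   -- transition function
T = np.array([[[0.9, 0.1], [0.1, 0.9]],    # action 0 mostly keeps the state
              [[0.5, 0.5], [0.5, 0.5]]])   # action 1 randomizes it
# O[a, s', o] = P(o | s', a)   -- observation function (noisy readout of s')
O = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.8, 0.2], [0.2, 0.8]]])
# R[s, a] -- reward function; gamma -- discount factor
R = np.array([[1.0, -1.0], [-1.0, 1.0]])
gamma = 0.95

def discounted_return(rewards, gamma):
    """Sample analogue of the objective E[ sum_t gamma^t R_t ] for one rollout
    (indexing rewards from t = 0 here)."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma))  # 1 + 0.95 + 0.9025 = 2.8525
```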

16 Relationship to Other Models [Diagram: four graphical models arranged by Hidden State? × Decisions?] A Markov Model with hidden state is a Hidden Markov Model; a Markov Model with decisions (actions and rewards) is a Markov Decision Process; with both hidden state and decisions, it is a POMDP.

17 Formal POMDP definition A POMDP consists of: a set of states S, actions A, and observations O; a transition function T( s' | s, a ); an observation function O( o | s, a ); a reward function R( s, a ); and a discount factor γ. The goal is to maximize E[ Σ_{t=1..∞} γ^t R_t ], the expected long-term discounted reward. This optimization is called Planning.

18 Formal POMDP definition A POMDP consists of: a set of states S, actions A, and observations O; a transition function T( s' | s, a ); an observation function O( o | s, a ); a reward function R( s, a ); and a discount factor γ. The goal is to maximize E[ Σ_{t=1..∞} γ^t R_t ], the expected long-term discounted reward. Learning the model components T, O, R is called Learning.

19 Planning Bellman Recursion for the value (long-term expected reward): V(b) = max E[ Σ_{t=1..∞} γ^t R_t | b_0 = b ] = max_a [ R(b,a) + γ Σ_{o∈O} P(b_ao | b) V(b_ao) ]

20 State and State, a quick aside In the POMDP literature, the term state usually refers to the hidden state (i.e., the robot's true location). The posterior distribution over states s is called the belief b(s). It is a sufficient statistic for the history, and thus the information state for the POMDP.
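The belief can be maintained recursively with Bayes' rule: b'(s') ∝ O( o | s', a ) Σ_s T( s' | s, a ) b(s). A sketch in Python; the T[a, s, s'] / O[a, s', o] array layout and the toy numbers are illustrative assumptions, not the talk's notation:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes filter: b'(s') ∝ O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    predicted = b @ T[a]              # P(s' | b, a), shape (n_states,)
    unnorm = O[a][:, o] * predicted   # weight by the likelihood of o
    return unnorm / unnorm.sum()      # normalize by P(o | b, a)

# Usage with a toy two-state model: start uniform, take action 0, observe o = 1.
T = np.array([[[0.9, 0.1], [0.1, 0.9]]])
O = np.array([[[0.8, 0.2], [0.2, 0.8]]])
b = np.array([0.5, 0.5])
b_next = belief_update(b, a=0, o=1, T=T, O=O)
print(b_next)  # observation 1 shifts the belief toward state 1: [0.2, 0.8]
```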

21 Planning Bellman Recursion for the value (long-term expected reward): V(b) = max E[ Σ_{t=1..∞} γ^t R_t | b_0 = b ] = max_a [ R(b,a) + γ Σ_{o∈O} P(b_ao | b) V(b_ao) ]

22 Planning Bellman Recursion for the value (long-term expected reward): V(b) = max E[ Σ_{t=1..∞} γ^t R_t | b_0 = b ] = max_a [ R(b,a) + γ Σ_{o∈O} P(b_ao | b) V(b_ao) ] Here the belief b is the sufficient statistic / information state.

23 Planning Bellman Recursion for the value (long-term expected reward): V(b) = max E[ Σ_{t=1..∞} γ^t R_t | b_0 = b ] = max_a [ R(b,a) + γ Σ_{o∈O} P(b_ao | b) V(b_ao) ] R(b,a) is the immediate reward for taking action a in belief b.

24 Planning Bellman Recursion for the value (long-term expected reward): V(b) = max E[ Σ_{t=1..∞} γ^t R_t | b_0 = b ] = max_a [ R(b,a) + γ Σ_{o∈O} P(b_ao | b) V(b_ao) ] The term γ Σ_{o∈O} P(b_ao | b) V(b_ao) gives the expected future rewards.

25 Planning Bellman Recursion for the value (long-term expected reward): V(b) = max E[ Σ_{t=1..∞} γ^t R_t | b_0 = b ] = max_a [ R(b,a) + γ Σ_{o∈O} P(b_ao | b) V(b_ao) ] Especially when b is high-dimensional, solving for this continuous function is not easy (PSPACE-hard).
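One backup of the recursion above can be sketched directly; here V is any stand-in approximation of the value function over beliefs. The QMDP-style heuristic used as V below, and the toy model, are illustrative assumptions, not the talk's method:

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])   # T[a, s, s'] = P(s' | s, a)
O = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.8, 0.2], [0.2, 0.8]]])   # O[a, s', o] = P(o | s', a)
R = np.array([[1.0, -1.0], [-1.0, 1.0]])   # R[s, a]
gamma = 0.95

def bellman_backup(b, V):
    """One Bellman backup: max_a [ R(b,a) + gamma * sum_o P(b_ao | b) V(b_ao) ]."""
    best = -np.inf
    for a in range(T.shape[0]):
        predicted = b @ T[a]                 # P(s' | b, a)
        q = b @ R[:, a]                      # immediate reward R(b, a)
        for o in range(O.shape[2]):
            p_o = predicted @ O[a][:, o]     # P(o | b, a)
            if p_o > 0:
                b_ao = O[a][:, o] * predicted / p_o   # updated belief
                q += gamma * p_o * V(b_ao)
        best = max(best, q)
    return best

# QMDP-style stand-in for V: pretend the state becomes observable next step.
V_approx = lambda b: float(b @ R.max(axis=1))
print(bellman_backup(np.array([0.5, 0.5]), V_approx))  # ≈ 0.95 for this toy model
```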

26 Planning: Yes, we can! Global: approximate the entire function V(b) via a set of support points b' (e.g., SARSOP). Local: approximate the value for a particular belief with forward simulation (e.g., POMCP). [Diagram: belief points covering belief space vs. a forward-simulation tree of actions and observations from b_t]
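The "local" idea can be illustrated with plain Monte Carlo forward simulation. This random-rollout sketch is a crude stand-in for POMCP's full tree search, and the toy model is an assumption:

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])   # T[a, s, s'] = P(s' | s, a)
R = np.array([[1.0, -1.0], [-1.0, 1.0]])   # R[s, a]
gamma = 0.95
rng = np.random.default_rng(0)

def rollout_value(b, depth=20, n_sims=500):
    """Estimate the value of belief b by sampling a hidden state from b and
    simulating a uniformly random policy forward (no search tree, for brevity)."""
    total = 0.0
    for _ in range(n_sims):
        s = rng.choice(len(b), p=b)        # sample a state from the belief
        ret, discount = 0.0, 1.0
        for _ in range(depth):
            a = rng.integers(T.shape[0])   # random action
            ret += discount * R[s, a]
            s = rng.choice(T.shape[1], p=T[a, s])
            discount *= gamma
        total += ret
    return total / n_sims

print(rollout_value(np.array([0.5, 0.5])))  # near 0 for this symmetric toy model
```

POMCP improves on this by growing a search tree over action-observation histories and biasing action choice toward promising branches, but the forward-simulation core is the same.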

27 Learning Given histories h = (a_1, r_1, o_1, a_2, r_2, o_2, ..., a_T, r_T, o_T), we can learn T, O, R via forward-filtering backward-sampling or <fill in your favorite timeseries algorithm>. Two principles usually suffice for exploring to learn: Optimism under uncertainty: try actions that might be good. Risk control: if an action seems risky, ask for help.
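The forward-filtering half can be sketched as a likelihood computation: it scores a candidate (T, O) on a history, which is the building block that forward-filtering backward-sampling and EM-style learners share. Array layouts and the toy model are illustrative assumptions:

```python
import numpy as np

def log_likelihood(history, b0, T, O):
    """Forward filter: log P(o_1..o_T | a_1..a_T, b0) under a candidate model.
    history is a list of (action, observation) pairs."""
    b, ll = b0.copy(), 0.0
    for a, o in history:
        predicted = b @ T[a]                 # predict the next hidden state
        p_o = predicted @ O[a][:, o]         # P(o_t | everything so far)
        ll += np.log(p_o)
        b = O[a][:, o] * predicted / p_o     # condition the belief on o_t
    return ll

# Usage: score one (action, observation) step under a toy two-state model.
T = np.array([[[0.9, 0.1], [0.1, 0.9]]])
O = np.array([[[0.8, 0.2], [0.2, 0.8]]])
print(log_likelihood([(0, 1)], np.array([0.5, 0.5]), T, O))  # log(0.5) ≈ -0.693
```

Candidate models with higher log-likelihood explain the history better; a backward-sampling pass over the stored filtered beliefs then draws a hidden state sequence.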

28 Example: Timeseries in Diabetes Data: Electronic health records of ~17,000 diabetics with 5+ A1c lab measurements and 5+ anti-diabetic agents prescribed. [Diagram: Clinician Model, Patient Model, Meds (antidiabetic agents), Lab Results (A1c)] Collaborators: Isaac Kohane, Stan Shaw

29 Example: Timeseries in Diabetes Data: Electronic health records of ~17,000 diabetics with 5+ A1c lab measurements and 5+ anti-diabetic agents prescribed. [Diagram: Clinician Model, Patient Model, Meds (antidiabetic agents), Lab Results (A1c)]

30 Discovered Patient States The patient states each correspond to a set of A1c levels (unsurprising), ranging from A1c < 5.5 through intermediate A1c bands up to A1c > 8.5.

31 Example: Timeseries in Diabetes Data: Electronic health records of ~17,000 diabetics with 5+ A1c lab measurements and 5+ anti-diabetic agents prescribed. [Diagram: Clinician Model, Patient Model, Meds (antidiabetic agents), Lab Results (A1c)]

32 Discovered Clinician States The clinician states follow the standard treatment protocols for diabetes (unsurprising, but exciting that we discovered this in a completely unsupervised manner). Next steps: incorporate more variables; identify patient and clinician outliers (quality of care). [Diagram: clinician states over regimens Metformin; Metformin, Glipizide; Metformin, Glyburide; Basic Insulins; Glargine, Lispro, Aspart; transitions on A1c up, self-loops under A1c control]

33 Example: Experimental Design In a very general sense: Action space: all possible experiments + submit. State space: which hypothesis is true. Observation space: results of experiments. Reward: cost of experiment. Allows for non-myopic sequencing of experiments. Example: Bayesian Optimization? Joint with: Ryan Adams/HIPS group

34 Summary POMDPs provide a framework for modeling causal dynamical systems and making optimal sequential decisions. POMDPs can be learned and solved! [Diagram: unrolled POMDP with actions, hidden states, observations, and rewards]


More information

Statement of research interest

Statement of research interest Statement of research interest Milos Hauskrecht My primary field of research interest is Artificial Intelligence (AI). Within AI, I am interested in problems related to probabilistic modeling, machine

More information

HARVARD PILGRIM HEALTH CARE RECOMMENDED MEDICATION REQUEST GUIDELINES

HARVARD PILGRIM HEALTH CARE RECOMMENDED MEDICATION REQUEST GUIDELINES Generic Brand HICL GCN Exception/Other INSULIN REGULAR, HUMAN AFREZZA 37619, 37622, 37623, 38923, 37624, 42833, 38918, 37621 GUIDELINES FOR USE 1. Is the member currently taking the requested medication

More information

Face Your Fear System

Face Your Fear System proudly announces Dr. Rob s Face Your Fear System Programs Freedom From OCD Freedom From Social Anxiety Greatest Me. Anxiety Free Specialized Youth Group Pathway To Peace Specialized Youth Group R.A.M.E

More information

Between-word regressions as part of rational reading

Between-word regressions as part of rational reading Between-word regressions as part of rational reading Klinton Bicknell & Roger Levy UC San Diego CUNY 2010: New York Bicknell & Levy (UC San Diego) Regressions as rational reading CUNY 2010 1 / 23 Introduction

More information

Understanding eye movements in face recognition with hidden Markov model

Understanding eye movements in face recognition with hidden Markov model Understanding eye movements in face recognition with hidden Markov model 1 Department of Psychology, The University of Hong Kong, Pokfulam Road, Hong Kong 2 Department of Computer Science, City University

More information

Towards Learning to Ignore Irrelevant State Variables

Towards Learning to Ignore Irrelevant State Variables Towards Learning to Ignore Irrelevant State Variables Nicholas K. Jong and Peter Stone Department of Computer Sciences University of Texas at Austin Austin, Texas 78712 {nkj,pstone}@cs.utexas.edu Abstract

More information

Towards Applying Interactive POMDPs to Real-World Adversary Modeling

Towards Applying Interactive POMDPs to Real-World Adversary Modeling Proceedings of the Twenty-Second Innovative Applications of Artificial Intelligence Conference (IAAI-10) Towards Applying Interactive POMDPs to Real-World Adversary Modeling Brenda Ng and Carol Meyers

More information

Bayesian Perception & Decision for Intelligent Mobility

Bayesian Perception & Decision for Intelligent Mobility Bayesian Perception & Decision for Intelligent Mobility E-Motion & Chroma teams Inria Research Center Grenoble Rhône-Alpes Christian LAUGIER First Class Research Director at Inria San Francisco, 05/11/2015

More information

Increasing Motor Learning During Hand Rehabilitation Exercises Through the Use of Adaptive Games: A Pilot Study

Increasing Motor Learning During Hand Rehabilitation Exercises Through the Use of Adaptive Games: A Pilot Study Increasing Motor Learning During Hand Rehabilitation Exercises Through the Use of Adaptive Games: A Pilot Study Brittney A English Georgia Institute of Technology 85 5 th St. NW, Atlanta, GA 30308 brittney.english@gatech.edu

More information

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s Using Bayesian Networks to Analyze Expression Data Xu Siwei, s0789023 Muhammad Ali Faisal, s0677834 Tejal Joshi, s0677858 Outline Introduction Bayesian Networks Equivalence Classes Applying to Expression

More information

Bayesian Networks for Modeling Emotional State and Personality: Progress Report

Bayesian Networks for Modeling Emotional State and Personality: Progress Report From: AAAI Technical Report FS-98-03. Compilation copyright 1998, AAAI (www.aaai.org). All rights reserved. Bayesian Networks for Modeling Emotional State and Personality: Progress Report Jack Breese Gene

More information

Accuracy and validity of Kinetisense joint measures for cardinal movements, compared to current experimental and clinical gold standards.

Accuracy and validity of Kinetisense joint measures for cardinal movements, compared to current experimental and clinical gold standards. Accuracy and validity of Kinetisense joint measures for cardinal movements, compared to current experimental and clinical gold standards. Prepared by Engineering and Human Performance Lab Department of

More information

Detecting Cognitive States Using Machine Learning

Detecting Cognitive States Using Machine Learning Detecting Cognitive States Using Machine Learning Xuerui Wang & Tom Mitchell Center for Automated Learning and Discovery School of Computer Science Carnegie Mellon University xuerui,tom.mitchell @cs.cmu.edu

More information

Trauma Introduction to Trauma-Informed Care and The Neurosequential Model

Trauma Introduction to Trauma-Informed Care and The Neurosequential Model Overview of Great Circle s Trauma-Informed Trainings Trauma 101 - Introduction to Trauma-Informed Care and The Neurosequential Model (called Trauma 101 for Great Circle staff) (4 hours) Trauma informed

More information

Chris L. Baker, Julian Jara-Ettinger, Rebecca Saxe, & Joshua B. Tenenbaum* Department of Brain and Cognitive Sciences

Chris L. Baker, Julian Jara-Ettinger, Rebecca Saxe, & Joshua B. Tenenbaum* Department of Brain and Cognitive Sciences Rational quantitative attribution of beliefs, desires, and percepts in human mentalizing Chris L. Baker, Julian Jara-Ettinger, Rebecca Saxe, & Joshua B. Tenenbaum* Department of Brain and Cognitive Sciences

More information

Lecture 10: Learning Optimal Personalized Treatment Rules Under Risk Constraint

Lecture 10: Learning Optimal Personalized Treatment Rules Under Risk Constraint Lecture 10: Learning Optimal Personalized Treatment Rules Under Risk Constraint Introduction Consider Both Efficacy and Safety Outcomes Clinician: Complete picture of treatment decision making involves

More information

Sensory Cue Integration

Sensory Cue Integration Sensory Cue Integration Summary by Byoung-Hee Kim Computer Science and Engineering (CSE) http://bi.snu.ac.kr/ Presentation Guideline Quiz on the gist of the chapter (5 min) Presenters: prepare one main

More information

Analyses of Markov decision process structure regarding the possible strategic use of interacting memory systems

Analyses of Markov decision process structure regarding the possible strategic use of interacting memory systems COMPUTATIONAL NEUROSCIENCE ORIGINAL RESEARCH ARTICLE published: 24 December 2008 doi: 10.3389/neuro.10.006.2008 Analyses of Markov decision process structure regarding the possible strategic use of interacting

More information

Neuro-Inspired Statistical. Rensselaer Polytechnic Institute National Science Foundation

Neuro-Inspired Statistical. Rensselaer Polytechnic Institute National Science Foundation Neuro-Inspired Statistical Pi Prior Model lfor Robust Visual Inference Qiang Ji Rensselaer Polytechnic Institute National Science Foundation 1 Status of Computer Vision CV has been an active area for over

More information

Reinforcement learning and the brain: the problems we face all day. Reinforcement Learning in the brain

Reinforcement learning and the brain: the problems we face all day. Reinforcement Learning in the brain Reinforcement learning and the brain: the problems we face all day Reinforcement Learning in the brain Reading: Y Niv, Reinforcement learning in the brain, 2009. Decision making at all levels Reinforcement

More information

Overcoming Barriers to Change: Insulin Pump Transition. Korey K. Hood, PhD Professor & Staff Psychologist Stanford University School of Medicine

Overcoming Barriers to Change: Insulin Pump Transition. Korey K. Hood, PhD Professor & Staff Psychologist Stanford University School of Medicine Overcoming Barriers to Change: Insulin Pump Transition Korey K. Hood, PhD Professor & Staff Psychologist Stanford University School of Medicine 1 Topics Reviewed Change is hard for everyone and can be

More information

Society for Ambulatory Anesthesia Consensus Statement on Perioperative Blood Glucose Management in Diabetic Patients Undergoing Ambulatory Surgery

Society for Ambulatory Anesthesia Consensus Statement on Perioperative Blood Glucose Management in Diabetic Patients Undergoing Ambulatory Surgery Society for Ambulatory Anesthesia Consensus Statement on Perioperative Blood Glucose Management in Diabetic Patients Undergoing Ambulatory Surgery Girish P. Joshi, MB BS, MD, FFARCSI Anesthesia & Analgesia

More information

Facial Event Classification with Task Oriented Dynamic Bayesian Network

Facial Event Classification with Task Oriented Dynamic Bayesian Network Facial Event Classification with Task Oriented Dynamic Bayesian Network Haisong Gu Dept. of Computer Science University of Nevada Reno haisonggu@ieee.org Qiang Ji Dept. of ECSE Rensselaer Polytechnic Institute

More information