Lecture 13: Finding optimal treatment policies


MACHINE LEARNING FOR HEALTHCARE 6.S897, HST.S53 Lecture 13: Finding optimal treatment policies Prof. David Sontag MIT EECS, CSAIL, IMES (Thanks to Peter Bodik for slides on reinforcement learning)

Outline for today's class: Finding optimal treatment policies; Reinforcement learning / dynamic treatment regimes; What makes this hard?; Q-learning (Watkins '89); Fitted Q-iteration (Ernst et al. '05); Application to schizophrenia (Shortreed et al., '11); Deep Q-networks for playing Atari games (Mnih et al. '15)

Previous lectures: supervised learning (classification, regression); unsupervised learning (clustering). Reinforcement learning is more general than supervised/unsupervised learning: we learn from interaction with an environment to achieve a goal (the agent takes an action, the environment returns a reward and a new state). [Slide from Peter Bodik]

Finding optimal treatment policies. Health interventions / treatments: prescribe insulin and Metformin; prescribe statin. True state vs. what the health system sees: blood pressure = 135, temperature = 99 F; blood pressure = 150, WBC count = 6.8×10^9/L, temperature = 98 F, A1c = 7.7%, ICD9 = Diabetes. How do we optimize the treatment policy?

Key challenges: 1. We only have observational data to learn policies from; this is at least as hard as causal inference (the reduction: causal inference is the special case with just 1 treatment and 1 time step). 2. We have to define the outcome that we want to optimize (the reward function). 3. The input data can be high-dimensional, noisy, and incomplete. 4. We must disentangle the (possibly long-term) effects of sequential actions and confounders → the credit assignment problem.

Robot in a room: a grid world with a START cell and terminal cells worth +1 and -1 (grid figure). Actions: UP, DOWN, LEFT, RIGHT. Taking UP: 80% move UP, 10% move LEFT, 10% move RIGHT. Reward +1 at [4,3], -1 at [4,2]; reward -0.04 for each step. This defines the states, actions, and rewards. What's the strategy to achieve max reward? What if the actions were deterministic? What is the solution? [Slide from Peter Bodik]

Is this a solution? (candidate path shown on the grid) Only if actions were deterministic, which is not the case here (actions are stochastic). A solution/policy is a mapping from each state to an action. [Slide from Peter Bodik]

Optimal policy (grid figure) [Slide from Peter Bodik]

Reward for each step: -2 (grid figure) [Slide from Peter Bodik]

Reward for each step: -0.1 (grid figure) [Slide from Peter Bodik]

Reward for each step: -0.04 (grid figure) [Slide from Peter Bodik]

Reward for each step: -0.01 (grid figure) [Slide from Peter Bodik]

Reward for each step: ??? (grid figure) [Slide from Peter Bodik]

Reward for each step: +0.01 (grid figure) [Slide from Peter Bodik]

Outline for today's class: Finding optimal treatment policies; Reinforcement learning / dynamic treatment regimes; What makes this hard?; Q-learning (Watkins '89); Fitted Q-iteration (Ernst et al. '05); Application to schizophrenia (Shortreed et al., '11); Deep Q-networks for playing Atari games (Mnih et al. '15)

Markov Decision Process (MDP): a set of states $S$, a set of actions $A$, an initial state $s_0$. Transition model $P(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$, e.g., $P([1,1], \text{up}, [1,2]) = 0.8$. Reward function $r(s)$, e.g., $r([4,3]) = +1$. Goal: maximize cumulative reward in the long run. Policy: a mapping from $S$ to $A$, $\pi(s)$ or $\pi(s, a)$ (deterministic vs. stochastic). Reinforcement learning: transitions and rewards are usually not available; how to change the policy based on experience; how to explore the environment. [Slide from Peter Bodik]
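To make these components concrete, here is a minimal sketch of how the robot-in-a-room MDP could be encoded in Python. The grid layout and the single transition entry shown are illustrative assumptions, not a full specification from the slides.

```python
# Toy encoding of MDP components (states, actions, transition model, reward).
states = [(x, y) for x in range(1, 5) for y in range(1, 4)]   # 4x3 grid cells
actions = ["up", "down", "left", "right"]

# Transition model stored as P[(s, a)] = {s': probability}.
# e.g. P([1,1], up, [1,2]) = 0.8, with 10% slipping left/right
# (bumping into a wall keeps the robot in place).
P = {((1, 1), "up"): {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}}

# Reward function r(s): +1 at [4,3], -1 at [4,2], -0.04 per step elsewhere.
def r(s):
    return {(4, 3): 1.0, (4, 2): -1.0}.get(s, -0.04)
```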

State representation: pole-balancing (move the cart left/right to keep the pole balanced). State representation: position and velocity of the cart, angle and angular velocity of the pole. What about the Markov property? We would need more info: noise in the sensors, temperature, bending of the pole. Solution: a coarse discretization of the 4 state variables (left, center, right). Totally non-Markov, but it still works. [Slide from Peter Bodik]

Designing rewards. Robot in a maze: episodic task, not discounted, +1 when out, 0 for each step. Chess: GOOD: +1 for winning, -1 for losing; BAD: +0.25 for taking an opponent's piece (can get high reward even when losing). Rewards indicate what we want to accomplish, NOT how we want to accomplish it. Shaping: when positive reward is very far away, add rewards for achieving subgoals (domain knowledge); also: adjust the initial policy or initial value function. [Slide from Peter Bodik]

Computing return from rewards. Episodic (vs. continuing) tasks: game over after N steps; the optimal policy depends on N; harder to analyze. Additive rewards: $V(s_0, s_1, \ldots) = r(s_0) + r(s_1) + r(s_2) + \ldots$ (can be infinite for continuing tasks). Discounted rewards: $V(s_0, s_1, \ldots) = r(s_0) + \gamma r(s_1) + \gamma^2 r(s_2) + \ldots$ (value bounded if rewards are bounded). [Slide from Peter Bodik]
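As a quick numerical illustration of the discounted-return formula, a minimal sketch (the reward sequence and discount factor below are arbitrary example values):

```python
# Discounted return V = r(s_0) + gamma*r(s_1) + gamma^2*r(s_2) + ...
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r_t for t, r_t in enumerate(rewards))

# Example: three steps at -0.04 each, then the +1 terminal reward.
print(discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9))  # ~0.62
```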

Finding the optimal policy using value iteration. State value function $V^\pi(s)$: the expected return when starting in $s$ and following $\pi$. The optimal policy $\pi^*$ has the property $V^*(s) = \max_a \sum_{s'} P^a_{ss'}\,[\,r^a_{s,s'} + \gamma V^*(s')\,]$; learn it using fixed-point iteration. An equivalent formulation uses the state-action value function $Q^*(s, a) = \sum_{s'} P^a_{ss'}\,[\,r^a_{s,s'} + \gamma V^*(s')\,]$, with $V^*(s) = \max_a Q^*(s, a)$ (the expected return when starting in $s$, performing $a$, and following $\pi^*$ thereafter): $Q_{k+1}(s, a) = \sum_{s'} P^a_{ss'}\,[\,r^a_{s,s'} + \gamma \max_{a'} Q_k(s', a')\,]$.
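A tabular sketch of this Q-value fixed-point iteration, reusing the dictionary-shaped MDP encoding sketched earlier. It simplifies the slide's $r^a_{s,s'}$ to a reward on the next state, and treats unspecified transitions as staying in place; both are sketch assumptions.

```python
# Q-value iteration: Q_{k+1}(s,a) = sum_{s'} P(s'|s,a) [ r(s') + gamma * max_{a'} Q_k(s',a') ]
def q_value_iteration(states, actions, P, r, gamma=0.95, n_iters=100):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        Q = {
            (s, a): sum(
                p * (r(s2) + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in P.get((s, a), {s: 1.0}).items()  # missing entries: stay put
            )
            for (s, a) in Q
        }
    V = {s: max(Q[(s, a)] for a in actions) for s in states}
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return Q, V, policy
```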

Q-learning. Same as value iteration, but rather than assume $\Pr(s' \mid s, a)$ is known, estimate it from data (i.e., episodes). Input: sequences/episodes from some behavior policy, $s \rightarrow (a, r) \rightarrow s' \rightarrow \ldots$. Combine the data from all episodes into a set of $n$ tuples ($n$ = number of episodes × length of each): $\{(s, a, s')\}$. Use these to get an empirical estimate $\hat{P}^a_{ss'}$ and use it instead. In reinforcement learning, episodes are created as we go, using the current policy plus randomness for exploration.
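A sketch of the empirical transition estimate $\hat{P}^a_{ss'}$ from pooled (s, a, s') tuples. The output has the same dictionary shape as the transition model in the earlier sketch, so it can be plugged directly into the q_value_iteration sketch above.

```python
# Empirical estimate of P(s'|s,a) by counting transitions pooled across episodes.
from collections import Counter, defaultdict

def estimate_transitions(tuples):
    """tuples: iterable of (s, a, s_next) observed under some behavior policy."""
    counts = defaultdict(Counter)
    for s, a, s_next in tuples:
        counts[(s, a)][s_next] += 1
    return {
        sa: {s2: c / sum(ctr.values()) for s2, c in ctr.items()}
        for sa, ctr in counts.items()
    }
```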

Where can Q-learning be used? It needs a complete model of the environment and rewards: for robot in a room, the state space, action space, and transition model. Can we use DP to solve robot in a room? Backgammon, or Go? A helicopter? Optimal treatment trajectories? [Slide from Peter Bodik]

Outline for today's class: Finding optimal treatment policies; Reinforcement learning / dynamic treatment regimes; What makes this hard?; Q-learning (Watkins '89); Fitted Q-iteration (Ernst et al. '05); Application to schizophrenia (Shortreed et al., '11); Deep Q-networks for playing Atari games (Mnih et al. '15)

Fitted Q-iteration. Challenge: in infinite or very large state spaces, it is very difficult to estimate $\Pr(s' \mid s, a)$. Moreover, this is a harder problem than we need to solve! We only need to learn how to act. Can we learn the Q function directly, i.e., a mapping from $(s, a)$ to expected cumulative reward? ("model-free" RL). Reduction to supervised machine learning (exactly the same as we did in causal inference using regression). The input is the same: sequences/episodes from some behavior policy. First, let's create a dataset $F = \{(\langle s^n_t, a^n_t\rangle, r^n_{t+1}, s^n_{t+1}),\; n = 1, \ldots, |F|\}$ and learn $\hat{Q}(s_t, a_t) \rightarrow r_{t+1}$.

Fitted Q-iteration. First, let's create a dataset $F = \{(\langle s^n_t, a^n_t\rangle, r^n_{t+1}, s^n_{t+1}),\; n = 1, \ldots, |F|\}$ and learn $\hat{Q}(s_t, a_t) \rightarrow r_{t+1}$. Trick: to predict the cumulative reward, we iterate this process. Initialize $\hat{Q}_0(s^n_t, a^n_t) = r^n_{t+1}$. For $k = 1, 2, \ldots$: 1. Create the training set for the $k$-th learning problem: $TS_k = \{(\langle s^n_t, a^n_t\rangle, \hat{Q}_{k-1}(s^n_t, a^n_t)),\; \forall \langle s^n_t, a^n_t\rangle \in F\}$. 2. Use supervised learning to estimate a function $\hat{Q}^F_{k-1}(s, a)$ from $TS_k$. GOAL: extrapolate to actions other than $a^n_t$ (i.e., compute counterfactuals). 3. Update the Q values for each observed tuple in $F$ using the Bellman equation: $\hat{Q}_k(s^n_t, a^n_t) = r^n_{t+1} + \gamma \max_a \hat{Q}^F_{k-1}(s^n_{t+1}, a)$.
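A compact sketch of this loop, assuming the tuples in F are stored as NumPy arrays, actions are numerically coded from a small discrete set, and scikit-learn's ExtraTreesRegressor is the supervised learner (a tree ensemble in the spirit of Ernst et al.; any regressor that extrapolates across actions would do).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(S, A, R, S_next, actions, gamma=0.98, n_iters=50):
    """S: (N, d) states, A: (N,) actions, R: (N,) rewards, S_next: (N, d) next states."""
    X = np.column_stack([S, A])            # regressor input: (state, action) pairs
    q_targets = R.astype(float).copy()     # Q_0(s, a) = immediate reward
    model = None
    for _ in range(n_iters):
        # Steps 1-2: supervised learning on TS_k = {((s, a), Q_{k-1}(s, a))}
        model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, q_targets)
        # Step 3: Bellman update, extrapolating the fitted Q to all actions at s'
        next_q = np.column_stack([
            model.predict(np.column_stack([S_next, np.full(len(S_next), a)]))
            for a in actions
        ])
        q_targets = R + gamma * next_q.max(axis=1)
    return model

def greedy_action(model, s, actions):
    """Greedy policy with respect to the final fitted Q function."""
    q_vals = [model.predict(np.append(s, a).reshape(1, -1))[0] for a in actions]
    return actions[int(np.argmax(q_vals))]
```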

Example of Q-iteration: an adaptive treatment strategy for treating psychiatric disorders. [Figure: decision tree comparing two sequential strategies, showing initial treatment (A or B), 2-month outcome, second treatment (augment with C or stay/switch), second response, and final status.] Figure 2: A comparison of two strategies. The strategy beginning with medication A has an overall remission rate at 4 months of 58% (16% + 42%). The strategy beginning with medication B has an overall remission rate at 4 months of 65% (25% + 40%). Medication A is best if considered as a standalone treatment, but medication B is best initially when considered as part of a sequence of treatments. [Murphy et al., Neuropsychopharmacology, 2007]

Example of Q-iteration. Goal: minimize the average level of depression over the 4-month period; there are only 2 decisions (initial and second treatment). $Y_2$ = summary of depression over weeks 9 through 12; $S_8$ = summary of side-effects up to the end of the 8th week. First, regress $Y_2$ onto $\beta_0 + \beta_1 S_8 + (\beta_2 + \beta_3 S_8) T_2$, and learn a decision rule that recommends switching treatment for the patient if $\beta_2 + \beta_3 S_8$ is less than zero (and augmenting otherwise). Then consider the initial decision $T_1$, regressing onto $Y_1 + \min\big(\beta_0 + \beta_1 S_8 + (\beta_2 + \beta_3 S_8),\; \beta_0 + \beta_1 S_8\big)$. [Murphy et al., Neuropsychopharmacology, 2007]
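A minimal sketch of this backward (stage-2 then stage-1) regression, assuming $T_2$ is coded 0/1 so that the min matches the slide's expression, and using ordinary least squares. Variable names (Y2, S8, T2, Y1) mirror the slide; the data arrays are assumed to be NumPy vectors.

```python
# Two-stage Q-learning regression for a dynamic treatment regime (sketch).
import numpy as np

def fit_stage2(Y2, S8, T2):
    """Regress Y2 on b0 + b1*S8 + (b2 + b3*S8)*T2; returns (b0, b1, b2, b3)."""
    X = np.column_stack([np.ones_like(S8), S8, T2, S8 * T2])
    beta, *_ = np.linalg.lstsq(X, Y2, rcond=None)
    return beta

def stage2_rule(S8, beta):
    b0, b1, b2, b3 = beta
    # Recommend the T2 = 1 option whenever it lowers predicted depression.
    return (b2 + b3 * S8 < 0).astype(int)

def stage1_pseudo_outcome(Y1, S8, beta):
    b0, b1, b2, b3 = beta
    # Y1 + min over T2 in {0, 1} of the stage-2 regression, as on the slide;
    # this pseudo-outcome is then regressed on the stage-1 features and T1.
    return Y1 + np.minimum(b0 + b1 * S8 + (b2 + b3 * S8), b0 + b1 * S8)
```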

Empirical study for schizophrenia. Clinical Antipsychotic Trials of Intervention Effectiveness: an 18-month, multistage clinical trial of 1460 patients with schizophrenia, with 2 stages. Subjects were randomly given a stage 1 treatment: olanzapine, risperidone, quetiapine, ziprasidone, or perphenazine. They were followed for up to 18 months and allowed to switch treatment if the original was not effective: lack of efficacy (i.e., symptoms still high) or lack of tolerability (i.e., large side-effects). Data were recorded every 3 months (i.e., 6 time points). Reward at each time point: the (negative) PANSS score (a low PANSS score = few psychotic symptoms). [Shortreed et al., Mach Learn, 2011]

Empirical study for schizophrenia. Most of the missing data is due to people dropping out of the study prior to that month. [Shortreed et al., Mach Learn, 2011]

Empirical study for schizophrenia. Data pre-processing: multiple imputation for the features (i.e., state); a Bayesian mixed-effects model for the PANSS score (i.e., reward). Fitted Q-iteration was performed using linear regression: a different weight vector for each action (allows for a nonlinear relationship between state and action); different weight vectors for each of the two time points; weight sharing for variables not believed to be action specific but still helpful for estimating the Q function (e.g., tardive dyskinesia, recent psychotic episode, clinic site type); and bootstrap voting to get confidence intervals for treatment effects (one illustrative way to set up such a per-action linear Q-function is sketched below). [Shortreed et al., Mach Learn, 2011]
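One way to realize "a different weight vector per action plus shared weights" within a single linear regression is to interact the action-specific features with a one-hot action encoding. This is an illustrative construction under that assumption, not the exact parameterization from the paper.

```python
# Feature map for a linear Q-function with per-action weights plus shared weights.
import numpy as np

def q_features(x_action_specific, x_shared, a, n_actions):
    """Place action-specific features in the block for action a; append shared features."""
    d = len(x_action_specific)
    phi = np.zeros(n_actions * d + len(x_shared))
    phi[a * d:(a + 1) * d] = x_action_specific    # separate weights per action
    phi[n_actions * d:] = x_shared                # same weights for every action
    return phi

# Fitting Q then reduces to linear regression of the fitted-Q targets on phi(s, a),
# done separately for each of the two decision points (stages).
```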

Empirical study for schizophrenia. Optimal treatment policy (figure). [Shortreed et al., Mach Learn, 2011]

Empirical study for schizophrenia. Stage 1 state-action value function (figure). [Shortreed et al., Mach Learn, 2011]

Empirical study for schizophrenia. Stage 2 state-action value function (figure). [Shortreed et al., Mach Learn, 2011]

Measuring convergence in fitted Q-iteration. Figure 5: convergence of the Q-function estimated using FQI. [Prasad et al., 2017]

Playing Atari with deep reinforcement learning. Game "Breakout": control the paddle at the bottom to break all the bricks in the upper half of the screen. Do fitted Q-iteration using deep convolutional neural networks to model the Q function. Use the ε-greedy algorithm to perform exploration. Infinite time horizon. [Mnih et al., 2015]
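The exploration component is simple to sketch. Here q_values(state) is a stand-in for a forward pass of the convolutional Q-network (an assumption, since the network itself is not shown on the slide).

```python
# Epsilon-greedy action selection over the Q-values of the current state.
import numpy as np

def epsilon_greedy(q_values, state, n_actions, eps=0.1, rng=None):
    """q_values: callable mapping a state to a length-n_actions vector of Q estimates."""
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(n_actions))   # explore: uniformly random action
    return int(np.argmax(q_values(state)))    # exploit: greedy w.r.t. current Q
```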