Search e Fall /18/15

Size: px

Start display at page:

Download "Search e Fall /18/15"

Edward Snow
6 years ago
Views:

1 Sample Efficient Policy Click to edit Master title style Search Click to edit Emma Master Brunskill subtitle style e Fall

2 Sample Efficient RL Objectives Probably Approximately Correct Minimizing regret Bayes-optimal RL Focus today: Empirical performance 2

3 Last Time: Policy Search Using Gradients Gradient approaches only guaranteed to find a local optima Finite-difference methods scale with # of parameters needed to represent the policy, but don t require differentiable policy Likelihood ratio gradient approaches Require us to be able to compute gradient analytically Cost independent of # params Don t need to know dynamics model Benefit from using value function/baseline to reduce variance 3

4 Question from Last Class Figure from David Silver 4

5 How to Choose a Compatible Value Function Approximator? Ex. Policy parameterization: Compatibility requires that: Therefore a reasonable choice for value function is Example from Sutton, McAllester, SIngh & Mansour NIPS

6 Today: Sample Efficient Policy Search Powerful function approximators to represent policy value May not be easy to take derivative Like last time, may benefit from exploiting structure (e.g. not completely blackbox optimization Figure from David Silver 6

7 Recall: Gaussian Process to Represent MDP Dynamics/Reward Models s = + s s Figure adjusted from Wilson et al. JMLR

8 Today: Gaussian Process to Represent Value of Parameterized Policy V(ϴ) ϴ Figure adjusted from Wilson et al. JMLR

9 Now Generalizing Policy Value (Rather than Model Dynamics/Rewards) V(ϴ) ϴ Figure adjusted from Wilson et al. JMLR

10 Why Use GPs? Used frequently in Bayesian optimization Last ~5 years Bayesian optimization has become very influential & useful Brief (relevant) digression Two big motivations for Bayesian optimization 10

11 Motivation 1: ML Parameter Tuning ML methods getting more powerful and complex Sophisticated fitting methods Often involve tuning many parameters Hard for non experts to use Performance often substantially impacted by choice of parameters Material from Ryan Adams 11

12 Deep Learning Big investments by Google, Facebook, Microsoft, etc. Many choices: number of layers, weight regularization, layer size, which nonlinearity, batch size, learning rate schedule, stopping conditions Material from Ryan Adams 12

13 Classification of DNA Sequences Predict which DNA sequences will bind with which proteins, Miller et al. (2012) Choices: margin param, entropy param, converg. criterion Material from Ryan Adams 13

14 Motivation 2: Increasing Instances Which Blur Experimentation & Optimization 14

15 Website Design Ex. What Photo to Put Here? Material from Peter Frazier 15

16 6 Photos on Website: 51*50*49*48*47*46= 12,966,811,200 Options! Material from Peter Frazier 16

17 Design Choices Matter Material from Peter Frazier 17

18 Standard A/B Testing or Experimentation Doesn t Scale Material from Peter Frazier 18

19 Standard A/B Testing or Experimentation Doesn t Scale Material from Peter Frazier 19

20 Standard A/B Testing or Experimentation Doesn t Scale Material from Peter Frazier 20

21 Don t Want to Have Worse Revenue on 1.25 million Users Material from Peter Frazier 21

22 Why Use GPs? Used frequently in Bayesian optimization Last ~5 years Bayesian optimization has become very influential & useful Brief (relevant) digression Two big motivations for Bayesian optimization ML algorithm parameter tuning Large online experimental settings where care about performance (e.g. revenue) while testing 22

23 Bayesian Optimization Build a probabilistic model for the objective. Include hierarchical structure about units, etc. Compute the posterior predictive distribution. Integrate out all the possible true functions. Gaussian process regression popular Optimize a cheap proxy function instead. The model is much cheaper than that true objective Two key ideas Use model to guide how to search space. Model is an approximation, but when sampling a point in the real world is more costly than computation, very useful Use proxy function to guide how to balance exploration and exploitation 23

24 Historical Background of Bayesian Optimization Closely related to statistical ideas of optimal design of experiments, dating back to Kirstine Smith in As response surface methods, date back to Box and Wilson in 1951 As Bayesian optimization, studied first by Kushner in 1964 and then Mockus in Methodologically, it touches on several important machine learning areas: active learning, contextual bandits, Bayesian nonparametrics Started receiving serious attention in ML in 2007, Interest exploded when it was realized that Bayesian optimization provides an excellent tool for finding good ML hyperparameters. Material from Ryan Adams 24

25 Bayesian Optimization for Policy Search V(ϴ) ϴ Material from Ryan Adams 25

26 Bayesian Optimization for Policy Search Dark points are the policies we ve already tried in the world V(ϴ) Blue line is estimated mean of function (value of policy) Grey areas represent uncertainty over function value ϴ Figure modified from Ryan Adams 26

27 Exercise: Where Would You Sample Next (What Policy Would You Evaluate) & Why? What Algorithm Would You Use for Sampling? Dark points are the policies we ve already tried in the world V(ϴ) Blue line is estimated mean of function (value of policy) Grey areas represent uncertainty over function value ϴ Figure modified from Ryan Adams 27

28 Probability of Improvement Exercise: Where Would You Sample Next (What Policy Would You Evaluate) & Why? What Algorithm Would You Use for Sampling? Dark points are the policies we ve already tried in the world V(ϴ) Blue line is estimated mean of function (value of policy) Grey areas represent uncertainty over function value ϴ Figure modified from Ryan Adams 28

29 Expected Exercise:Improvement Where Would You Sample Next (What Policy Would You Evaluate) & Why? What Algorithm Would You Use for Sampling? Dark points are the policies we ve already tried in the world V(ϴ) Blue line is estimated mean of function (value of policy) Grey areas represent uncertainty over function value ϴ Figure modified from Ryan Adams 29

Upper/Lower Confidence Bound Exercise: Where Would You Sample Next (What Policy Would You Evaluate) & Why? What Algorithm Would You Use for Sampling?

30 Upper/Lower Confidence Bound Exercise: Where Would You Sample Next (What Policy Would You Evaluate) & Why? What Algorithm Would You Use for Sampling? Dark points are the policies we ve already tried in the world V(ϴ) Blue line is estimated mean of function (value of policy) Grey areas represent uncertainty over function value ϴ Figure modified from Ryan Adams 30

31 Probability of Improvement Utility function relative to f, best point so far V(ϴ) Acquisition function 31 Figure modified from Ryan Adams, Adams Equations from

32 Expected Improvement Utility function relative to f, best point so far Acquisition function V(ϴ) 32 Figure modified from Ryan Adams, Equations from

33 Upper/Lower Confidence Bound Acquisition function V(ϴ) 33 Figure modified from Ryan Adams, Equations from

34 Acquisition Function Probability of improvement Expected Improvement Upper confidence bound Other ideas? What are the limitations of these? 34

35 Policy Search as Black Box Bayesian Optimization Bayesian Optimization with a Gaussian Process Create new training point [f(θi),r] π = f(θi) Execute policy π in environment for T steps, get total reward R 35

36 Policy Search as Bayesian Optimization Figure modified from Calandra, Seyfarth, Peters & Deissenroth

37 Gait Optimization GP Policy search Reduced samples needed to find a fast walk by about 3x Lizotte et al. IJCAI

38 More Gait Optimization Gait parameters: 4 threshold values of the FSM (two for each leg) & 4 control signals applied during extension and flexion (separately for knees and hips). Figures modified from Calandra, Seyfarth, Peters & Deissenroth

39 videos 39

40 Why is this Suboptimal? Bayesian Optimization with a Gaussian Process Create new training point [f(θi),r] π = f(θi) Execute policy π in environment for T steps, get total reward R 40

41 Throwing Away All Information But Policy Parameter & Total Reward from Trajectory. Also Ignores Structure of Relationship Between Policies Figure from David Silver 41

42 Covariance function: Key choice for GP When should 2 policies be considered close? Figures from Ryan Adams 42

43 Behavior Based Kernel (Wilson et al. JMLR 2014) KL divergence Probability of trajectory under policy ϴi 43

44 Behavior Based Kernel (Wilson et al. JMLR 2014) KL divergence Prob. trajectory under policy ϴi 44

45 Behavior Based Kernel (BBK) (Wilson et al. JMLR 2014) KL divergence Prob. trajectory under policy ϴi Do we need to know the dynamics model? 45

46 Throwing Away All Information But Policy Parameter & Total Reward from Trajectory. Also Ignores Structure of Relationship Between Policies Figure from David Silver 46

47 Model-Based Bayesian Optimization Algorithm (MBOA) (Wilson et al. JMLR 2014) Bayesian Optimization with a Gaussian Process π = f(θi) Use data to build a model. Use to accelerate learning Create new training point [f(θi),r] Execute policy π in environment for T steps, get total reward R 47

48 Figure from WIlson, Fern & Tadepalli JMLR

49 Figure from WIlson, Fern & Tadepalli JMLR

50 Figure from WIlson, Fern & Tadepalli JMLR

51 Figure from WIlson, Fern & Tadepalli JMLR

52 New Kernel vs Model Based Information? Using model information often greatly improves performance if it s a good model If it s not good, learns to ignore New kernel to relate policies (BBK) much less of an impact 52

53 Other Ways to Go Beyond Black-Box Bayesian Optimization Current work in my group (Rika, Christoph, Dexter Lee, Joe Runde) Bayesian Optimization with a Gaussian Process Create new training point [f(θi),r] π = f(θi) Execute policy π in environment for T steps, get total reward R 53

54 Subtleties of Bayesian Optimization Still have to choose policy class (this determines the input space) Have to choose kernel function Have to choose hyperparameters of kernel function (can optimize these) Have to choose acquisition function 54

55 Impact of Acquisition Function & Hyperparameters Manually fixed hyperparameters led to sub-optimal solutions for all the acquisition functions. Figures modified from Calandra, Seyfarth, Peters & Deissenroth

56 Summary: Bayesian Optimization for Efficient Policy Search Benefits Direct policy search Finds global optima Uses sophisticated function class (GP) to model input policy param & output policy value Use smart (but typically myopic) acquisition function to balance exploration/exploitation in searching policy space Can be very sample efficient Not a silver bullet Still have to decide on policy space Choose kernel function (though squared expl often works well) Should optimize hyperparameters 56

57 Stuff to Know: Bayesian Optimization for Efficient Policy Search Properties (global optima, no gradient information used) Define and know benefits/drawbacks of different acquisition functions Understand how to take a policy search problem and formulate as a black box Bayesian optimization problem Be able to list some things have to do in practice (optimize hyperparameters, etc) 57

Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Ryan Adams, Hugo LaRochelle NIPS 2012

Practical Bayesian Optimization of Machine Learning Algorithms Jasper Snoek, Ryan Adams, Hugo LaRochelle NIPS 2012 ... (Gaussian Processes) are inadequate for doing speech and vision. I still think they're