Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Ryan Adams, Hugo Larochelle. NIPS 2012

1 Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Ryan Adams, Hugo Larochelle. NIPS 2012

3 "... (Gaussian Processes) are inadequate for doing speech and vision. I still think they're inadequate for doing speech and vision. But when you're in a domain where you have no prior knowledge and the only thing that you can expect is that similar inputs should have similar outputs, then Gaussian Processes are ideal. ... Gaussian processes are a way of using machine learning to simulate the graduate student."
- Geoff Hinton

4 Motivation

5 Deep Neural Networks Require Skill to Set Hyperparameters

7 Common Strategies
- Grid Search
- Random Search: sometimes better, because some parameters have no effect
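The remark about random search can be illustrated with a toy case in which only one of two hyperparameters matters. This is a minimal sketch: `toy_error`, the target value 0.37, and the budget of 9 trials are all made up for illustration. With the same budget, a 3x3 grid tries only 3 distinct values of the parameter that matters, while random search tries 9.

```python
import random

def toy_error(lr, unused):
    """Hypothetical validation error: depends on `lr` only; `unused` has no effect."""
    return (lr - 0.37) ** 2

def grid_points(n_per_axis):
    """Evenly spaced grid on the unit square."""
    axis = [i / (n_per_axis - 1) for i in range(n_per_axis)]
    return [(a, b) for a in axis for b in axis]

def random_points(n, seed=0):
    """n points drawn uniformly from the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]

grid = grid_points(3)    # 9 trials, but only 3 distinct values of lr
rand = random_points(9)  # 9 trials, 9 distinct values of lr

distinct_grid = len({lr for lr, _ in grid})
distinct_rand = len({lr for lr, _ in rand})
best_grid = min(toy_error(lr, u) for lr, u in grid)
best_rand = min(toy_error(lr, u) for lr, u in rand)

print("distinct lr values:", distinct_grid, "vs", distinct_rand)
print("best error:", best_grid, "vs", best_rand)
```

The grid spends two thirds of its budget re-testing the same `lr` values at different settings of a parameter that changes nothing.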

8 Can we use Machine Learning instead?
- To predict regions of the hyperparameter space that might give better results.
- To predict how well a new combination of hyperparameters will do, and also to model the uncertainty of that prediction.

12 Bayesian Optimization
- Frame hyperparameter search as an optimization problem.
- Treat the estimation of the function from high-level parameters (hyperparameters) to the error metric as a regression problem.
- Use a GP prior (similar inputs have similar outputs) to build a statistical model of the function. The prior is weak but general and effective.
- Use statistics to tell us:
  - the location of the expected minimum of the function
  - the expected improvement from trying other parameters

13 Bayesian Optimization (Mockus '78)
- A method for the global optimization of multi-modal, computationally expensive black-box functions.
- Assumes that the unknown function was sampled from a Gaussian process (the prior) and uses the observations (the likelihood) to maintain a posterior.
- The observations are measurements of generalization performance under different settings of the hyperparameters we wish to optimize.
- The next set of hyperparameters is selected from the maintained posterior, using a strategy determined by the acquisition function.
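The loop described above can be sketched in a few lines of NumPy. Everything concrete here is an assumption made for illustration: the 1-D `objective` stands in for an expensive training run, the squared-exponential kernel and its length scale are fixed by hand, the candidate set is a simple grid, and expected improvement is used as the acquisition function.

```python
import math
import numpy as np

def objective(x):
    """Hypothetical stand-in for an expensive black-box function (to minimize)."""
    return np.sin(3.0 * x) + x ** 2 - 0.7 * x

def sq_exp_kernel(a, b, length_scale=0.3):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, jitter=1e-6):
    """GP predictive mean and standard deviation at x_new (constant prior mean)."""
    offset = y_obs.mean()
    K = sq_exp_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = sq_exp_kernel(x_obs, x_new)
    mean = offset + K_s.T @ np.linalg.solve(K, y_obs - offset)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, best):
    """Expected improvement over `best` when minimizing."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return std * (z * cdf + pdf)

candidates = np.linspace(-1.0, 2.0, 200)
x_obs = np.array([-0.9, 1.1])           # two initial evaluations
y_obs = objective(x_obs)
initial_best = y_obs.min()

for _ in range(10):
    mean, std = gp_posterior(x_obs, y_obs, candidates)     # update the posterior
    ei = expected_improvement(mean, std, y_obs.min())      # score every candidate
    x_next = candidates[np.argmax(ei)]                     # pick the next experiment
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))            # run it and record the result

print("best x:", x_obs[np.argmin(y_obs)], "best value:", y_obs.min())
```

Each iteration costs one evaluation of the objective, which is the expensive part in practice; the GP fit and the acquisition search are cheap by comparison.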

15 Gaussian Processes
A Gaussian process specifies a distribution over functions such that any finite set of N points follows a multivariate Gaussian distribution. The properties of the resulting distribution over functions are specified by a mean function and a positive definite covariance function.
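The finite-subset property is what makes a GP practical to work with: pick any N inputs, build the N x N covariance matrix, and function values at those inputs are just a draw from a multivariate Gaussian. A small sketch, assuming a zero mean function and a squared-exponential covariance (both chosen here for illustration):

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=0.5):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

x = np.linspace(0.0, 1.0, 5)                        # any finite set of N inputs
mean = np.zeros(len(x))                             # assumed zero mean function
cov = sq_exp_kernel(x, x) + 1e-9 * np.eye(len(x))   # small jitter for stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=3)  # three function draws at x

print(samples.shape)  # one row per sampled function, one column per input
```

Neighboring columns of `samples` tend to hold similar values because nearby inputs have high covariance.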

16 The predictive mean and covariance given the observations are given by:
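The formula itself is not part of this transcript. For reference, the standard GP regression result is written out below. The notation is an assumption and not necessarily the slide's: a zero prior mean, covariance function $k$, training inputs $x_1, \dots, x_N$ with observations $\mathbf{y}$, $K_{ij} = k(x_i, x_j)$, $\mathbf{k}_* = [k(x_1, x_*), \dots, k(x_N, x_*)]^{\top}$, and observation noise variance $\sigma_n^2$.

```latex
\begin{aligned}
\mu(x_*) &= \mathbf{k}_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y}, \\
\Sigma(x_*, x_*') &= k(x_*, x_*') - \mathbf{k}_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \mathbf{k}_*'.
\end{aligned}
```

Setting $x_*' = x_*$ gives the predictive variance $\sigma^2(x_*)$ at a single point.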

17 Intuition
- GPs are a prior for smooth functions.
- Similar inputs (high covariance) should have similar outputs.

19 Intuition
- Exploration: seek places with high variance.
- Exploitation: seek places near where you are already doing well.
The acquisition function balances these to determine the point of the next evaluation.

20 Acquisition Functions
The acquisition function tells us which experiment to run next and how good it is expected to be.
1. GP Upper Confidence Bound. Idea: minimize regret over the course of the optimization by balancing exploration and exploitation.
2. Expected Improvement. Idea: how much can I expect to improve over the best I've seen so far by running an experiment with these parameters?
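Both acquisition functions need only the GP's predictive mean and standard deviation at a candidate point. The sketch below assumes the objective is being minimized, so the confidence bound appears in its lower-bound form, and `kappa` is a hand-picked exploration weight (an assumption, not a value from the slides).

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best):
    """Expected amount by which a candidate beats `best` (minimization)."""
    if std <= 0.0:
        return max(best - mean, 0.0)
    gamma = (best - mean) / std
    return std * (gamma * norm_cdf(gamma) + norm_pdf(gamma))

def lower_confidence_bound(mean, std, kappa=2.0):
    """Optimistic estimate of the objective; lower is more attractive."""
    return mean - kappa * std

best_so_far = 0.0
sure_thing = expected_improvement(mean=0.0, std=1.0, best=best_so_far)
long_shot = expected_improvement(mean=0.0, std=2.0, best=best_so_far)
print(sure_thing, long_shot)
print(lower_confidence_bound(mean=1.0, std=0.5))
```

With equal predicted means, the candidate with the larger uncertainty gets the larger expected improvement, which is the exploration half of the trade-off.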

21-28 Intuition (figure-only slides)

29 An Eggsperiment
Goal: the optimal 'soft boiled egg'
Parameters:
- Boiling time (1-12 min)
- Cooling time (1-12 min)
- Salt (0-10 pinches)
- Pepper (0-10 pinches)

30-39 After 5, 10, 12, 14, 16, 20, and 25 Iterations... (figure-only slides)

40 Practical Bayesian Optimization
- Integrate out all parameters in Bayesian optimization.
- Choose an appropriate covariance.
- The choice of acquisition function is important.

41 Accounting for additional cost
Expected Improvement per Second: incorporate a preference towards choosing points that are not only good, but also likely to be evaluated quickly.
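A minimal sketch of the idea: divide each candidate's expected improvement by its predicted run time and pick the best ratio. The numbers are hypothetical, and the run-time predictions are taken as given here; in practice they would have to come from a separate model of evaluation cost.

```python
def pick_by_ei_per_second(ei_values, predicted_seconds):
    """Index of the candidate with the highest EI per predicted second."""
    scores = [ei / sec for ei, sec in zip(ei_values, predicted_seconds)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical values for three candidate hyperparameter settings.
ei_values = [0.30, 0.25, 0.05]
predicted_seconds = [600.0, 120.0, 60.0]

by_ei_alone = max(range(len(ei_values)), key=ei_values.__getitem__)
by_ei_per_second = pick_by_ei_per_second(ei_values, predicted_seconds)

print(by_ei_alone, by_ei_per_second)
```

Plain EI prefers the first candidate, while the cost-aware rule prefers the second: slightly less promising, but five times cheaper to evaluate.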

43 Parallelizing Bayesian Optimization
- 'N' completed evaluations
- 'J' pending evaluations
Figure panels: posterior samples after 3 observations; expected improvement under individual samples; integrated expected improvement.
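The three figure panels suggest a Monte Carlo recipe: sample possible outcomes for the J pending evaluations from the current posterior, compute expected improvement under each sample, and average. The sketch below follows that recipe under stated assumptions: a fixed squared-exponential kernel, a zero-mean GP, noise-free observations, and made-up data (3 completed and 2 pending evaluations).

```python
import math
import numpy as np

def kernel(a, b, length_scale=0.3):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def posterior(x_obs, y_obs, x_new, jitter=1e-6):
    """Zero-mean GP predictive mean vector and covariance matrix at x_new."""
    K = kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = kernel(x_obs, x_new)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    cov = kernel(x_new, x_new) - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

def expected_improvement(mean, std, best):
    """Expected improvement over `best` when minimizing."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return std * (z * cdf + pdf)

def integrated_ei(x_done, y_done, x_pending, candidates, n_samples=20, seed=0):
    """Average EI over sampled outcomes of the pending evaluations."""
    rng = np.random.default_rng(seed)
    mean_p, cov_p = posterior(x_done, y_done, x_pending)
    cov_p = cov_p + 1e-8 * np.eye(len(x_pending))
    x_all = np.concatenate([x_done, x_pending])
    total = np.zeros(len(candidates))
    for _ in range(n_samples):
        y_fantasy = rng.multivariate_normal(mean_p, cov_p)  # one possible outcome
        y_all = np.concatenate([y_done, y_fantasy])
        mean, cov = posterior(x_all, y_all, candidates)
        std = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
        total += expected_improvement(mean, std, y_all.min())
    return total / n_samples

x_done = np.array([0.1, 0.5, 0.9])      # N = 3 completed evaluations (made up)
y_done = np.array([0.3, -0.2, 0.4])
x_pending = np.array([0.3, 0.7])        # J = 2 evaluations still running
candidates = np.linspace(0.0, 1.0, 51)

ei = integrated_ei(x_done, y_done, x_pending, candidates)
print("next point to launch:", candidates[np.argmax(ei)])
```

Because every sampled outcome pins the function down at the pending locations, the averaged acquisition steers a new worker away from points that are already being evaluated.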

45 Implications
CIFAR-10, 9 hyperparameters: impossible to find by hand!

49 Benefits
- For each input dimension, an appropriate scale for measuring similarity is learned. Are 200 and 300 as similar as 2.0 and 3.0?
- What is the sensitivity to each dimension? Which dimensions don't matter?
- Reproducible research: it levels the playing field and is a lot more honest than human beings.
- If you have the resources to run a fairly large number of experiments, Bayesian optimization is better than a person at finding good combinations of hyperparameters.
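The "200 and 300 versus 2.0 and 3.0" question is answered by per-dimension length scales in the covariance function. The sketch below uses a Matérn 5/2 kernel with one length scale per input dimension; the specific length-scale values are picked by hand for illustration, whereas in Bayesian optimization they would be learned from the data.

```python
import numpy as np

def matern52_ard(a, b, length_scales, amplitude=1.0):
    """Matern 5/2 covariance between rows of a (n, d) and b (m, d),
    with a separate length scale for each of the d input dimensions."""
    diff = (a[:, None, :] - b[None, :, :]) / length_scales
    r2 = np.sum(diff ** 2, axis=-1)
    r = np.sqrt(5.0 * r2)
    return amplitude * (1.0 + r + 5.0 * r2 / 3.0) * np.exp(-r)

# With a length scale of 100, 200 and 300 are exactly as similar as
# 2.0 and 3.0 are with a length scale of 1.
k_large = matern52_ard(np.array([[200.0]]), np.array([[300.0]]), np.array([100.0]))[0, 0]
k_small = matern52_ard(np.array([[2.0]]), np.array([[3.0]]), np.array([1.0]))[0, 0]

# With a length scale of 1, 200 and 300 are effectively unrelated.
k_wrong_scale = matern52_ard(np.array([[200.0]]), np.array([[300.0]]), np.array([1.0]))[0, 0]

# A very long length scale makes a dimension irrelevant: a large move along
# the second input barely changes the covariance.
k_irrelevant = matern52_ard(np.array([[0.0, 0.0]]), np.array([[0.0, 100.0]]),
                            np.array([1.0, 1e6]))[0, 0]

print(k_large, k_small, k_wrong_scale, k_irrelevant)
```

Reading the length scales after fitting gives a rough sensitivity analysis: short length scales mark dimensions the objective reacts to strongly, very long ones mark dimensions that hardly matter.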

50 References
- [Paper] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems, 2012.
- [Talk/Slides] Jasper Snoek. "Bayesian Optimization for Machine Learning and Science".
- [Book] Kevin Murphy. Machine Learning: a Probabilistic Perspective.
