EXPLORATION FLOW 4/18/10

EXPLORATION
Peter Bossaerts
CNS 102b

FLOW (slide 2)
Canonical exploration problem: bandits
Bayesian optimal exploration: the Gittins index
Undirected exploration: e-greedy and softmax (logit)
The economists' and psychologists' view on logit: unobserved heterogeneity or random utility
Directed exploration: exploration bonuses
Contrast with ambiguity aversion and exploration malus
The numerical analyst's view: simulated annealing, tabu search, etc.
Neurobiological foundations: human imaging
The role of the neurotransmitters dopamine and norepinephrine

CANONICAL EXPLORATION PROBLEM: BANDITS (slide 3)
Which slot machine ("arm") to choose?
Can try only one at a time
Don't see the payoff of arms that are not chosen
Assume arms are independent
And stationary, i.e., the payoff distribution remains the same (opposite: "restless")
Multi-armed bandit

BAYESIAN OPTIMAL EXPLORATION: THE GITTINS INDEX (slide 4)
Prior, likelihood, posterior
Maximize expected gain (minimize expected loss)
General property of the optimal policy: eventually play a single arm, which is not necessarily the truly best one
Trade-off between exploitation and exploration
Complicated dynamic programming problem
Solution is of the index type: each arm is tracked using an index, and at any point in time, pick the arm with the highest (Gittins) index
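
The "prior, likelihood, posterior" step can be made concrete for one common special case. The sketch below assumes Bernoulli (0/1) payoffs and a uniform Beta(1, 1) prior on each arm; neither assumption comes from the slides, and `BetaArm` and `pull` are hypothetical names. Only the chosen arm is observed and updated, as in the bandit setup above.

```python
import random
from dataclasses import dataclass


@dataclass
class BetaArm:
    """Beta posterior over one arm's unknown payoff probability (Bernoulli payoffs assumed)."""
    wins: float = 1.0    # Beta 'alpha'; 1.0 encodes a uniform prior (an assumption)
    losses: float = 1.0  # Beta 'beta'

    def update(self, payoff: int) -> None:
        # Conjugate update: Bernoulli likelihood times Beta prior gives a Beta posterior.
        if payoff:
            self.wins += 1
        else:
            self.losses += 1

    def mean(self) -> float:
        return self.wins / (self.wins + self.losses)

    def variance(self) -> float:
        n = self.wins + self.losses
        return self.wins * self.losses / (n * n * (n + 1))


def pull(true_prob: float, rng: random.Random) -> int:
    """Draw a 0/1 payoff from a stationary arm."""
    return 1 if rng.random() < true_prob else 0


rng = random.Random(0)
true_probs = [0.3, 0.7]               # hidden from the decision maker
arms = [BetaArm() for _ in true_probs]

# Arm 1 is pulled 50 times; arm 0 is never chosen, so its posterior stays at the prior.
for _ in range(50):
    arms[1].update(pull(true_probs[1], rng))
```

The posterior variance of the pulled arm shrinks with every observation, while the unvisited arm keeps its prior variance. This is the "estimation uncertainty" that later slides use for bonuses and maluses.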

THE GITTINS INDEX (slide 5)
At t, for each arm k (which is in state s), compute the stopping time that would maximize the expected (discounted) payoff per (discounted) period; the maximized ratio is the arm's Gittins index
Then pick the arm with the highest Gittins index
Problem: in all but a few specific cases, Gittins indices are hard to compute, and they don't apply to more realistic problems like restless bandits
Arms that are not visited are not updated, so one may get stuck with a (truly) suboptimal arm

UNDIRECTED EXPLORATION: E-GREEDY AND SOFTMAX (LOGIT) (slide 6)
Heuristic ways of exploring
Epsilon-greedy: follow the currently deemed optimal arm (option) with probability 1-e, and try any other option with probability e
Problem: exploration does not take into account the estimated value of sub-optimal options
Improvement: softmax: explore sub-optimal options with a frequency that decreases the lower their estimated value
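
The epsilon-greedy rule described above can be sketched in a few lines. This is a minimal illustration, not the lecturer's code: the function name, the value list, and epsilon = 0.1 are all hypothetical, and exploration is assumed to be uniform over the non-greedy options.

```python
import random


def epsilon_greedy(values, epsilon, rng):
    """Return the index of the chosen arm.

    With probability 1 - epsilon pick the arm with the highest estimated value;
    with probability epsilon pick uniformly among the other arms.
    """
    best = max(range(len(values)), key=lambda k: values[k])
    others = [k for k in range(len(values)) if k != best]
    if others and rng.random() < epsilon:
        return rng.choice(others)  # exploration ignores the values of these arms
    return best


rng = random.Random(1)
estimates = [0.2, 0.9, 0.5, 0.1]   # hypothetical estimated values
choices = [epsilon_greedy(estimates, 0.1, rng) for _ in range(10_000)]
share_best = choices.count(1) / len(choices)   # close to 1 - epsilon
```

Note that arm 3 (value 0.1) is explored as often as arm 2 (value 0.5), which is exactly the weakness the slide points out.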

SOFTMAX (slide 7)
E.g., for 6 options with (estimated) values Q(l,T) at T: each option l is chosen with probability proportional to exp(beta*Q(l,T))
1/beta is interpreted as the exploration intensity ("temperature"; see later for the physics interpretation)

SOFTMAX IS OPTIMAL (slide 8)
Find the best (mixed) strategy that trades off the value of exploitation (first term) against the value of exploration (entropy of the mixing; second term)
Equation annotations: value of option l at T (6 options); mixing
This is undirected exploration: choices do not depend on how uncertain the estimated values are, but on the entropy of the choice policy
(Specifically, one would go for one option ONLY if that option is far superior to all others, irrespective of how sure one is about this!)
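
For reference, the two softmax slides can be written out as follows. This is the standard textbook form, not a transcription of the slide equations; the symbols P(l,T) for choice probabilities and \pi_l for the mixing weights are assumed notation.

```latex
% Softmax choice probabilities for 6 options with estimated values Q(l,T):
\[
P(l,T) \;=\; \frac{\exp\big(\beta\, Q(l,T)\big)}{\sum_{j=1}^{6} \exp\big(\beta\, Q(j,T)\big)},
\qquad l = 1,\dots,6 .
\]

% Softmax as the solution of an exploitation-plus-entropy problem:
\[
\max_{\pi}\;\; \sum_{l=1}^{6} \pi_l\, Q(l,T)
\;+\; \frac{1}{\beta}\Big(-\sum_{l=1}^{6} \pi_l \ln \pi_l\Big)
\quad \text{s.t.} \quad \sum_{l=1}^{6} \pi_l = 1,\;\; \pi_l \ge 0 .
\]

% First-order condition with multiplier \lambda on the adding-up constraint:
\[
Q(l,T) - \frac{1}{\beta}\big(\ln \pi_l + 1\big) - \lambda = 0
\;\;\Longrightarrow\;\;
\pi_l \;\propto\; \exp\big(\beta\, Q(l,T)\big).
\]
```

As beta grows (temperature falls), the entropy term loses weight and the policy approaches the greedy choice; as beta goes to zero, it approaches uniform mixing.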

SOFTMAX (LOGIT) FUNCTION (slide 9)
[Figure: choice probability (vertical axis) against the difference in estimated value between choices (horizontal axis)]

THE ECONOMISTS' AND PSYCHOLOGISTS' VIEW ON LOGIT: UNOBSERVED HETEROGENEITY OR RANDOM UTILITY (slide 10)
Humans make choices that look erratic
The economist interprets this as reflecting that she (the economist) has insufficient information to model all aspects of preferences/utility (McFadden)
(Remember, for economists: choice == preference!)
This is referred to as "unobserved heterogeneity"
The psychologist (and some decision theorists, like Luce) interprets this as reflecting "random utility"
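
The softmax (logit) choice function from the slides above can be sketched as below. The function names and the example values are hypothetical; the max-shift inside `softmax` is a standard numerical safeguard and does not change the probabilities.

```python
import math


def softmax(values, beta):
    """Choice probabilities proportional to exp(beta * value)."""
    shift = max(values)  # subtracting the max avoids overflow; probabilities are unchanged
    weights = [math.exp(beta * (v - shift)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def logit(diff, beta):
    """Binary case: probability of choosing option 1 given the value difference Q1 - Q2."""
    return 1.0 / (1.0 + math.exp(-beta * diff))


estimates = [1.0, 0.5, 0.0]            # hypothetical estimated values
cold = softmax(estimates, beta=10.0)   # low temperature: nearly greedy
hot = softmax(estimates, beta=0.1)     # high temperature: nearly uniform
```

With two options, `softmax` reduces to the S-shaped `logit` curve in the value difference, which is the shape plotted on the softmax (logit) function slide.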

RANDOM UTILITY/UNOBSERVED HETEROGENEITY (slide 11)
With binary choice: the utility of each option is a modeled value plus an error term
These errors are either unobserved factors affecting preferences over time (random utility) or unobserved heterogeneity (in cross-section)

LOGIT (slide 12)
Details:
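
The usual route from random utility to the logit formula is sketched below. It assumes i.i.d. type-I extreme value (Gumbel) errors, which is the standard assumption in McFadden-style derivations; the slides themselves do not state the error distribution, and the symbols U, V, and \varepsilon are assumed notation.

```latex
% Binary choice with random utility: modeled value plus error.
\[
U_i = V_i + \varepsilon_i, \qquad i = 1, 2 .
\]

% Option 1 is chosen when its utility is higher:
\[
P(\text{choose } 1) \;=\; P\big(U_1 > U_2\big)
\;=\; P\big(\varepsilon_2 - \varepsilon_1 < V_1 - V_2\big).
\]

% If the errors are i.i.d. Gumbel with scale 1/\beta, their difference is
% logistic with the same scale, which gives the logit (binary softmax) rule:
\[
P(\text{choose } 1) \;=\; \frac{1}{1 + \exp\big(-\beta\,(V_1 - V_2)\big)}
\;=\; \frac{\exp(\beta V_1)}{\exp(\beta V_1) + \exp(\beta V_2)} .
\]
```

Under this reading, beta measures how small the unobserved errors are relative to the modeled value difference, rather than an intensity of exploration.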

IMPORTANT REMARK (slide 13)
Economists do not believe that exploration is something that is added to valuation/preferences
Because: CHOICE == PREFERENCES
Exploration is just the difference between picking the option that is best under myopia (shortsightedness) and the one that is overall the best
Only in rare instances is myopia optimal

LOGIT/SOFTMAX MAY NOT BE THE ENTIRE STORY (slide 14)
IT IS A MAINTAINED, UNTESTED ASSUMPTION IN MUCH EMPIRICAL WORK
[Figure labeled "BAD FIT"; two plotted series; axis labels not recoverable]

DIRECTED EXPLORATION: EXPLORATION BONUSES (slide 15)
Valuation is based on:
Value v at t from continuing to choose an option
PLUS the novelty of the option at t (Kakade-Dayan, Neural Networks 2002)
Novelty = how often has this option been visited? How often has it been seen? Is it salient?
So, the bonus for an option is related to the uncertainty about its estimated value, the "estimation uncertainty"
This is DIRECTED exploration: you visit options about which you want to know more

THE ECONOMISTS' PERSPECTIVE: AMBIGUITY AVERSION AND EXPLORATION MALUS (slide 16)
Ambiguity = not knowing probabilities
Measure of ambiguity = estimation uncertainty (for Bayesians: posterior variance of estimated probabilities)
Most humans are ambiguity averse (see later)
They PENALIZE choices for which they do not know the probabilities (Hansen and Sargent):
Utility = Estimated Value - β * Estimation Uncertainty
This criterion is also used in robust control in engineering (where estimation uncertainty is referred to as "model uncertainty")
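
The contrast between a bonus and a malus comes down to the sign of β in the criterion above. The sketch below assumes Bernoulli payoffs with a Beta(1, 1) prior, so that "estimation uncertainty" is the posterior variance; the records, the function names, and the weights ±5 are hypothetical.

```python
def posterior_mean(successes, failures):
    """Mean of a Beta(1 + successes, 1 + failures) posterior (uniform prior assumed)."""
    return (1 + successes) / (2 + successes + failures)


def posterior_variance(successes, failures):
    """Variance of the same Beta posterior: the 'estimation uncertainty'."""
    a, b = 1 + successes, 1 + failures
    n = a + b
    return a * b / (n * n * (n + 1))


def utility(successes, failures, beta):
    """Utility = Estimated Value - beta * Estimation Uncertainty.

    beta > 0 penalizes uncertainty (malus); beta < 0 rewards it (bonus).
    """
    return posterior_mean(successes, failures) - beta * posterior_variance(successes, failures)


# Hypothetical records (successes, failures): arm 0 is well known, arm 1 barely tried.
records = [(60, 40), (1, 1)]


def pick(beta):
    return max(range(len(records)), key=lambda k: utility(*records[k], beta))


bonus_choice = pick(-5.0)   # directed exploration: goes to the uncertain arm
malus_choice = pick(+5.0)   # ambiguity aversion: stays with the familiar arm
```

With the same data, the two criteria choose opposite arms, which is why the sign of the uncertainty term is an empirical question about behavior.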

HUMAN BEHAVIOR (slide 17)
An exploration bonus or malus? We don't know
Recent evidence from one of my PhD students:
Undirected exploration à la softmax
An exploration malus like in economics (ambiguity aversion)

THE NUMERICAL ANALYST'S VIEW: SIMULATED ANNEALING, TABU SEARCH, ETC. (slide 18)
Numerical analysts have had to deal a lot with difficult optimization problems
They have developed a toolkit of well-performing exploration procedures
Originally inspired by the problem of avoiding local maxima
Most primitive: re-start your search at new initial conditions, to be sure
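
The "most primitive" procedure, re-starting a local search from new initial conditions, can be sketched as follows. The landscape, the steepest-ascent local search, and the number of restarts are hypothetical choices made only for illustration.

```python
import random

# Hypothetical objective with several local maxima (global maximum at index 6).
LANDSCAPE = [1, 3, 2, 5, 4, 3, 9, 2, 1, 6, 0]


def hill_climb(start):
    """Steepest-ascent search: move to the best neighbor until none is better."""
    x = start
    while True:
        neighbors = [n for n in (x - 1, x + 1) if 0 <= n < len(LANDSCAPE)]
        best = max(neighbors, key=lambda n: LANDSCAPE[n])
        if LANDSCAPE[best] <= LANDSCAPE[x]:
            return x  # local maximum
        x = best


def restart_search(n_restarts, rng):
    """Re-run the local search from random initial conditions; keep the best end point."""
    ends = [hill_climb(rng.randrange(len(LANDSCAPE))) for _ in range(n_restarts)]
    return max(ends, key=lambda x: LANDSCAPE[x])


single = hill_climb(0)                         # gets stuck at a local maximum
best = restart_search(50, random.Random(0))    # restarts escape it
```

A single run from index 0 stops at the first local peak; only the restarts give the search a chance to land in the basin of the global maximum.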

RE-STARTING OPTIMIZATION (slide 19)

SIMULATED ANNEALING (slide 20)
Annealing:
Heating metal so that random movement of atoms would change the structure
SLOW cooling: need enough local exploration for these atoms
End result is a better structure
Numerical version:
Randomly try something else
Stay with probability P, even if the value at the trial point is lower than the original one!
Try out the neighborhood

SIMULATED ANNEALING FLOWCHART (slide 21)
As time passes, lower the temperature, i.e., the size of the random move

SIMULATED ANNEALING AND SOFTMAX (slide 22)
Random moves in simulated annealing are often "disciplined": use the softmax rule
This means that random moves to sub-optimal choices are less likely as these choices become (estimated to be) more inferior
The picture on slide 19 is NOT simulated annealing!
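
A minimal sketch of simulated annealing with a softmax-style acceptance rule is given below. It is one possible reading of the slides, not the flowchart itself: the landscape, the geometric cooling schedule, and all parameter values are assumptions, and the temperature here governs the acceptance probability rather than the step size. The acceptance rule is the binary softmax between staying and moving, a known alternative to the more common Metropolis rule.

```python
import math
import random

LANDSCAPE = [1, 3, 2, 5, 4, 3, 9, 2, 1, 6, 0]  # hypothetical objective to maximize


def accept_probability(current, candidate, temperature):
    """Binary softmax between staying put and moving to the candidate."""
    return 1.0 / (1.0 + math.exp(-(candidate - current) / temperature))


def anneal(start, steps, t_start, t_end, rng):
    """Return the best position visited while the temperature is lowered."""
    x = best = start
    for i in range(steps):
        # Geometric cooling from t_start to t_end (one common schedule, assumed here).
        temperature = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        step = rng.choice((-1, 1))
        candidate = min(max(x + step, 0), len(LANDSCAPE) - 1)  # try out the neighborhood
        if rng.random() < accept_probability(LANDSCAPE[x], LANDSCAPE[candidate], temperature):
            x = candidate  # may accept a worse point, more often when hot
        if LANDSCAPE[x] > LANDSCAPE[best]:
            best = x
    return best


best = anneal(start=0, steps=5000, t_start=5.0, t_end=0.05, rng=random.Random(0))
```

Moves to worse points are accepted often at high temperature and almost never at low temperature, and the more inferior the candidate, the less likely the move.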

ANOTHER EXPLORATION RULE: TABU SEARCH (slide 23)
Proven to be useful in very difficult combinatorial optimization problems, such as the traveling salesman problem
The idea is, e.g., to NOT take routes that you think are close to optimal
Very effective where it takes time (several trials) to figure out whether an option is worth it

NOTE DIFFERENCE BETWEEN SOFTMAX AND TABU SEARCH (slide 24)
Softmax: your modal choice is the (currently deemed) optimal one
Tabu search: you are not allowed to re-visit the optimal move (for a while)
Need to INHIBIT the urge to stick to what is temporarily optimal
We (work with K Preuschoff of U Zurich) have recently noticed that tabu search works much better if you are in an environment where payoff probabilities change abruptly
With tabu search, you quickly pick up new optimal learning rates (at the expense of temporary exploration when it is not needed)
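
The tabu idea can be sketched on a small bandit whose payoffs change abruptly. This is an illustrative toy, not the procedure used in the work mentioned above: payoffs are deterministic, the learning rate is fixed at 0.5, and with a tenure of 2 among 3 options the rule simply rotates through the options. All names and numbers are hypothetical.

```python
from collections import deque


def tabu_choice(values, tabu):
    """Pick the highest-valued option that is not currently tabu."""
    allowed = [k for k in range(len(values)) if k not in tabu]
    return max(allowed, key=lambda k: values[k])


def run(payoffs_by_trial, n_options, tenure, learning_rate=0.5):
    """Play the bandit; a chosen option is forbidden for the next `tenure` trials.

    tenure must be smaller than n_options; tenure = 0 gives plain greedy choice.
    """
    values = [0.0] * n_options
    tabu = deque(maxlen=tenure)   # most recent choices; the oldest drops out automatically
    history = []
    for payoffs in payoffs_by_trial:
        k = tabu_choice(values, tabu)
        values[k] += learning_rate * (payoffs[k] - values[k])  # only the chosen option is updated
        tabu.append(k)
        history.append(k)
    return history, values


# Option 0 pays best for 6 trials, then option 2 abruptly becomes the best.
trials = [[1.0, 0.2, 0.0]] * 6 + [[0.0, 0.2, 1.0]] * 6

greedy_hist, greedy_vals = run(trials, n_options=3, tenure=0)
tabu_hist, tabu_vals = run(trials, n_options=3, tenure=2)
```

In this toy run the greedy rule never leaves option 0 and so never learns about the change, while the tabu rule ends with option 2 valued highest, at the cost of visiting inferior options throughout.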

NEUROBIOLOGICAL FOUNDATIONS: HUMAN IMAGING (slide 25)

EXPLORATION-RELATED ACTIVATION (slide 26)

CONTRAST THIS WITH REWARD-RELATED ACTIVATIONS (slide 27)

IMPORT OF THESE FINDINGS (slide 28)
Separation of valuation and exploration
Unlike the economists' view that the value of an option should include both:
Exploitation: value of continuing with the same option forever ("immediate exercise value")
Exploration: value of trying the option and thereby learning that it may become a better one ("optional value")
(In financial economics, however, immediate exercise and optional values are often analyzed separately)

THE ROLE OF THE NEUROTRANSMITTER DOPAMINE (slide 29)
Dopamine neurons signal novelty (Schultz 1998)
SN/VTA activation correlated with novelty ("oddball") (Bunzeck et al., Neuron 2006)
At the genetic level, polymorphisms of dopamine D4 receptors are associated with novelty seeking (mice; humans) (Ebstein, )
A D4 receptor agonist increases time spent with novel objects without affecting overall locomotor activity in mice (Powell et al., 2003)
Link with ADHD and substance abuse

AND NOREPINEPHRINE (slide 30)
Enhancing NE levels induces rats to abandon old hypotheses and find the newly optimal paths in a navigation task
(Or are they more sensitive to signals of UNEXPECTED UNCERTAINTY, i.e., that something changed? Yu-Dayan, Neuron 2005)

FIRING MODE OF NE CHANGES (slide 31)
Note the explicit separation of the value of exploitation and the value of exploration
Also, implicitly, there is the idea that exploration applies to cases where it is not clear that anything has to be learned (boredom)
