The optimism bias may support rational action

Falk Lieder, Sidharth Goel, Ronald Kwan, Thomas L. Griffiths
University of California, Berkeley

1 Introduction

People systematically overestimate the probability of good outcomes [1] and systematically underestimate how long it will take to achieve them [2]. From an epistemic perspective, optimism is irrational because it misrepresents the information we have been given. Yet, surprisingly, optimistic people often perform better than their more realistic peers [1]. How can it be that being irrational leads to better performance than being more rational? In this abstract, we explore a potential solution to this puzzle. Concretely, we investigate the hypothesis that overestimating the probability of achieving good outcomes can compensate for the cognitive limitations that prevent us from looking far ahead into the future and fully considering the long-term consequences of our actions. Previous work in reinforcement learning has used different notions of optimism to promote learning through exploration and thereby benefit the returns of future decisions [3-7]. Here, we instead investigate the immediate benefits of optimism for the returns of present actions rather than for learning and future returns, and we explore a different notion of optimism that formalizes the psychological theory that people overestimate the probability of good events relative to bad events.

2 Model

We model the decision environment E as a Markov Decision Process (MDP) [8]

    E = (S, A, T, \gamma, r),   (1)

where S are the states, A are the actions, T are the transition probabilities, γ = 1 is the discount factor, and r is the reward function. We model the agent's internal model M of the environment E by the MDP

    M = (S, A, \hat{T}, \gamma, r),   (2)

whose transition probabilities T̂ may differ from the true probabilities T. Concretely, we use a distortion of the transition probabilities to model optimism and pessimism according to

    \hat{T}_\alpha(s' \mid s, a) \propto T(s' \mid s, a) \, \mathrm{sig}\big(V^*_E(s') - V^*_E(s)\big)^\alpha,   (3)

where sig is the sigmoid function, V*_E is the optimal value function of the MDP E, α = 0 corresponds to realism, α > 0 corresponds to optimism, and α < 0 corresponds to pessimism.

We model the implications of bounded cognitive resources on people's performance in sequential decision-making by assuming that they can look only h steps ahead and therefore act according to the policy

    \pi_h(s) = \arg\max_a Q_h(s, a),   (4)

    Q_h(s_t, a) = \mathbb{E}_{\hat{T}}\Big[\, r(s_t, a, S_{t+1}) + \max_\pi \sum_{i=t+1}^{t+h-1} r(s_i, \pi(s_i), S_{i+1}) \Big],   (5)

where the expectation is taken with respect to the subjective transition probabilities T̂. We compute this policy using backward induction with planning horizon h [9].
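To make the model concrete, the following Python sketch shows one way to implement the optimistic distortion of Equation 3 and the h-step backward induction of Equations 4 and 5. It is a minimal illustration under our own naming and representation choices (a dense transition tensor T[s, a, s'] and expected immediate rewards R[s, a]), not the code used for the simulations below.

import numpy as np

def distort_transitions(T, V, alpha):
    # Equation 3: re-weight the true transition probabilities T[s, a, s'] by
    # sig(V[s'] - V[s])^alpha and renormalize. alpha > 0 up-weights transitions
    # to higher-valued states (optimism), alpha < 0 down-weights them
    # (pessimism), and alpha = 0 recovers the true probabilities.
    def sig(x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))  # clipped for stability
    n_states, n_actions, _ = T.shape
    T_hat = np.zeros_like(T)
    for s in range(n_states):
        for a in range(n_actions):
            w = T[s, a] * sig(V - V[s]) ** alpha
            T_hat[s, a] = w / w.sum()
    return T_hat

def plan_h_steps(T_hat, R, h):
    # Equations 4-5: backward induction over an h-step horizon under the
    # subjective model T_hat, with gamma = 1. R[s, a] is the expected
    # immediate reward. Returns pi_h(s) = argmax_a Q_h(s, a).
    Q = np.zeros(R.shape)
    for _ in range(h):
        Q = R + np.einsum('sap,p->sa', T_hat, Q.max(axis=1))
    return Q.argmax(axis=1)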

Figure 1: Illustration of the MDP structure used in the simulations and the experiment.

3 Simulation of sequential decision-making with delayed rewards

One of the key challenges in decision-making is that immediate rewards are sometimes misaligned with long-term rewards. This problem is exacerbated by the limited number of steps that agents can plan ahead. Hence, decision-makers tend to underweight distant uncertain rewards relative to immediate certain rewards [10]. Our theory predicts that optimism can compensate for this problem.

To illustrate this prediction, we simulated the effect of optimism on a bounded agent's performance in the sequential decision problem illustrated in Figure 1. The states s_0, ..., s_100 represent the agent's progress towards the goal, that is, having completed 0% to 100% of the required work. At each point in time the agent can choose between working towards the goal (a_1), which costs effort and resources (r_1 = -1), and leisure (a_0), which generates a small reward (r_0 = +1) but does not advance the agent's progress. Once the goal has been achieved (S_t = 100%) the agent can reap a large reward (r_2 = +100). This task is a finite-horizon MDP lasting 12 rounds. The agent can plan only 5 rounds ahead (h = 5), and its average rate of progress when working towards the goal for one round is 20 percentage points:

    T(s_{t+1} \mid s_t, a_1) = \mathrm{Binomial}(s_{t+1} - s_t;\, 100,\, 0.2).   (6)

To simulate the effects of optimism and pessimism on decision-making in this environment, we computed the policy π_5 (Equation 5) for the internal models M_pessimism, M_realism, and M_optimism with T̂_α for α_pessimism = -10, α_realism = 0, and α_optimism = 10, respectively, and simulated the performance of the resulting myopic policies in the environment E.

We found that the bounded agent whose model of the world was accurate (myopic realism) performed much worse than the optimal policy. While the optimal policy chooses action a_1 until it reaches the goal state, which it does in almost all cases, the myopic realistic agent always chose action a_0 in state s_0 (0%) and consequently never reached the goal state (s_100). By contrast, the optimistic bounded agent always chose to invest effort (a_1) in the initial state and consequently performed optimally (see Figure 2A). The pessimistic bounded agent performed at the same level as the realistic one.

Note that the optimistic agent exhibited the planning fallacy [2]: while the expected completion time under the optimal policy was 4.5 months, the optimistic agent's estimate of the expected completion time was only 3.9 months. This made it worthwhile for the optimistic bounded agent to pursue the goal even though it was thinking only 5 steps ahead. Thus, the irrational planning fallacy led to rational action. Hence, according to our theory, people with a very accurate model of the world might perform worse in sequential decision problems with the chain structure shown in Figure 1 than their more optimistic peers. To test this prediction, we have planned an experiment that induces people to be either optimistic, realistic, or pessimistic and then measures their performance in the chain-structured MDP shown in Figure 1, and we conducted a pilot experiment to tune the proposed experimental design.
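A sketch of this simulation, assuming the two functions from the listing in Section 2. The environment construction follows the description above; treating the goal reward r_2 = +100 as a per-round payoff once the goal state is reached is our reading of the setup, not a detail stated explicitly.

import numpy as np
from scipy.stats import binom

N_STATES, GOAL, ROUNDS, H = 101, 100, 12, 5

# True transitions (Equation 6): leisure (a_0) keeps the current state;
# working (a_1) advances progress by a Binomial(100, 0.2) increment,
# capped at the goal state, which is absorbing.
T = np.zeros((N_STATES, 2, N_STATES))
T[np.arange(N_STATES), 0, np.arange(N_STATES)] = 1.0
for s in range(GOAL):
    for k in range(101):
        T[s, 1, min(s + k, GOAL)] += binom.pmf(k, 100, 0.2)
T[GOAL, 1, GOAL] = 1.0

# Expected immediate rewards: r_0 = +1 for leisure, r_1 = -1 for working,
# and r_2 = +100 per round in the goal state (our interpretation).
R = np.zeros((N_STATES, 2))
R[:, 0], R[:, 1] = 1.0, -1.0
R[GOAL, :] = 100.0

# Optimal value function of the true 12-round MDP, required by Equation 3.
Q = np.zeros((N_STATES, 2))
for _ in range(ROUNDS):
    Q = R + np.einsum('sap,p->sa', T, Q.max(axis=1))
V_star = Q.max(axis=1)

# Myopic 5-step policies of the pessimistic, realistic, and optimistic agents.
for alpha in (-10, 0, 10):
    pi = plan_h_steps(distort_transitions(T, V_star, alpha), R, H)
    print(f"alpha = {alpha:+d}: action in s_0 is a_{pi[0]}")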

4 Pilot Experiment

To evaluate the effectiveness of our experimental manipulations and to determine the decisions that our model would predict for the resulting internal models, we conducted a pilot experiment.

Figure 2: A: Performance of the optimistic, realistic, and pessimistic myopic bounded agents and of the optimal policy in the decision environment shown in Figure 1. B: Simulated performance in the main experiment by condition, depending on how many steps people can plan ahead.

4.1 Methods

We recruited 200 adult participants on Amazon Mechanical Turk. Participants received $0.50 for about 6 minutes of work. Eight participants were excluded because their answers to the survey questions indicated that they had not performed the task.

Participants solved a sequential decision problem with the structure shown in Figure 1. To convey this structure to our participants we created a game called Product Manager. In this game participants play the manager of a car company. In each round (month) the participant decides whether the company will focus on marketing the existing product, SportsCar (a_0), or invest in the development of a new product, HoverCar (a_1). Participants started with a capital of $1,000,000 and their task was to maximize the company's capital after 4, 12, 24, or 72 months. The reward for marketing the old product was drawn from a normal distribution with mean $10,000 and standard deviation $1,000 (r_0 ~ N(μ = 10,000, σ = 1,000)), the reward for investing in development was drawn from a normal distribution with mean -$15,000 and standard deviation $1,500 (r_1 ~ N(μ = -15,000, σ = 1,500)), and the reward for marketing the new product was normally distributed with mean $135,000 and standard deviation $13,500 (r_2 ~ N(μ = 135,000, σ = 13,500)). In each round the participant was shown the current state (e.g., "HoverCar is currently 0% developed."), the number of the current round and the total number of rounds, their current balance, and the reward from their most recent decision.

The experiment was structured into three blocks: instructions, a training block, and a survey. The instructions introduced the game and informed the participants about the number of rounds, the return on marketing the existing product, the cost of developing the new product, and the return on marketing the new product. In the training block the participants' task was to explore the effects of investing in development versus marketing in three simulations lasting 10 rounds each. Finally, the survey asked participants to estimate the average rate of progress that occurred when they decided to invest in development and marketing, respectively.

Each participant was randomly assigned to one of three experimental conditions. The conditions differed in the rate of progress used in the simulations of the training phase and were designed to induce pessimism, realism, and optimism, respectively. In the pessimism condition, the average rate of progress in the training block was half the true rate of progress; in the realism condition, it was equal to the true rate of progress; and in the optimism condition, it was twice the true rate of progress. The true rate of progress was set such that the expected number of investments needed to reach 100% development was 80% of the total duration of the game.
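For concreteness, a minimal sketch of one round of the Product Manager game as described above. The function name, the random seed, and the exact form of the progress increment (a binomial step analogous to Equation 6) are our own assumptions; only the reward distributions and the 80% calibration come from the Methods.

import numpy as np

rng = np.random.default_rng(0)

def play_round(progress, invest, p_progress):
    # One round of the game with the reward distributions from the Methods;
    # the binomial progress increment is a stand-in for the true dynamics.
    if invest and progress < 100:
        reward = rng.normal(-15_000, 1_500)           # develop the HoverCar
        progress = min(100, progress + rng.binomial(100, p_progress))
    elif progress >= 100:
        reward = rng.normal(135_000, 13_500)          # market the new HoverCar
    else:
        reward = rng.normal(10_000, 1_000)            # market the old SportsCar
    return progress, reward

# Example: a 12-month game in which the player invests until development is
# complete. The rate is set so that the expected number of investments needed
# to reach 100% is 80% of the game's duration, as in the Methods.
months = 12
p_progress = 1.0 / (0.8 * months)
capital, progress = 1_000_000, 0
for _ in range(months):
    progress, reward = play_round(progress, progress < 100, p_progress)
    capital += reward
print(f"Final capital: ${capital:,.0f} ({progress}% developed)")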
4.2 Results and Discussion

To examine the effectiveness of our experimental manipulations, we compared the three groups' estimates of the average amount of progress achieved by a single investment in product development. We found that people estimated the rate of progress to be higher in the optimism condition than in the realism condition (t(123.7) = 2.30, p = 0.01) and the pessimism condition (t(126.2) = 2.59, p < 0.01). The difference between the realism condition and the pessimism condition was not statistically significant (t(122.5) = 0.53, p = 0.30). In conclusion, our experimental manipulation was successful at inducing optimism and created significant group differences.

Furthermore, we found that each group's estimate of the rate of progress was significantly higher than the rate of progress in the examples they had observed. In the pessimism condition, people overestimated the rate of progress by 12.4% (t(65) = 4.16, p < 0.0001); in the realism condition, participants overestimated the presented rate of progress by 7.4% (t(61) = 2.94, p = 0.0023); and in the optimism condition, people overestimated the presented rate of progress by 5.2% (t(63) = 1.66, p = 0.05). This overestimation of the frequency of positive events is consistent with the optimism bias [1]. Interestingly, the overestimation decreased as the true frequency increased. Hence, at least in our study, the optimism bias could result from Bayesian inference with an optimistic prior.
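The fractional degrees of freedom reported above are characteristic of Welch's unequal-variance t-test; the sketch below shows how such a comparison could be computed with SciPy. The arrays are hypothetical placeholders whose sizes match the pilot's group sizes, not our actual data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical per-participant estimates of the progress per investment
# (n = 64, 62, and 66 follow from the one-sample df values reported above).
est_optimism = rng.normal(0.22, 0.06, 64)
est_realism = rng.normal(0.20, 0.06, 62)
est_pessimism = rng.normal(0.19, 0.06, 66)

# Welch's t-test (equal_var=False) yields fractional degrees of freedom, as
# in t(123.7) above; halving the two-sided p gives a directional test.
t, p = stats.ttest_ind(est_optimism, est_realism, equal_var=False)
print(f"optimism vs. realism: t = {t:.2f}, one-sided p = {p / 2:.3f}")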
5 Planned Main Experiment

The main experiment will use the paradigm of the pilot experiment with the addition of a test block following the training block. In the test block participants will play the Product Manager game for 4, 12, 24, or 72 rounds starting at 0% progress. Participants will receive a financial bonus of up to $2 proportional to their capital at the end of the test phase.

5.1 Model Predictions

We used the subjective transition probabilities induced by our experimental manipulations in the pilot experiment to derive our model's predictions for the results of the main experiment. As shown in Figure 2B, our simulation shows that optimism shortens the number of steps that people have to plan ahead to realize that they should invest in product development. Our model predicts that when the game lasts only four rounds, participants in the optimism condition will invest but participants in the realism and pessimism conditions will not. When the game lasts 12 or more rounds, the benefit of optimism over realism depends on how many steps people can plan ahead.

5.2 Next Steps

Since it appears plausible that people plan fewer than 6 steps ahead, we will refine the experimental manipulations to induce larger differences between the three groups' subjective transition probabilities, such that our model predicts a benefit of optimism in the 12-month condition even if people plan only 5 steps ahead. We will then run the main experiment with the experimental manipulations determined through this iterative piloting. We will use the data to test our theory's prediction that optimism improves people's performance in sequential decision problems in which determining the best action requires planning many steps ahead.

6 Discussion

Our theory suggests that the optimism bias might serve to improve people's decisions in environments in which the rewards for prolonged cumulative effort justify forgoing immediate gratification. According to our model, the optimism bias achieves this by compensating for the cognitive limitations that prevent us from looking far enough into the future to fully consider long-term consequences unless we underestimate how long it will take to achieve them. Hence, at least in some cases, the planning fallacy [2] helps us make better decisions; it may be a sign of bounded rationality rather than irrationality.

This abstract continues the line of work begun by [11], generalizing the definition of optimism proposed therein from a specific class of decision problems to general decision problems and testing its predictions empirically. We have demonstrated that this extension can capture the benefits of optimism when obtaining a high reward requires persistent effort. Furthermore, the proposed experiment will be the first to empirically test our boundedly rational theory of optimism.

The beneficial effects of optimism illustrate that for bounded agents there is a tension between epistemic rationality (having beliefs that are as accurate as possible) and instrumental rationality (choosing the actions that maximize one's expected utility). Concretely, our simulations suggest that bounded agents have to be epistemically irrational to achieve instrumental rationality [12]. This might be the deeper reason why we are optimistic about ourselves but not about others [2]. In conclusion, our theory suggests that optimism and the planning fallacy might not be irrational after all but might instead reflect the rational use of limited cognitive resources [13, 14].

Acknowledgments. This work was supported by ONR MURI N00014-13-1-0341.

References

[1] T. Sharot, "The optimism bias," Current Biology, vol. 21, no. 23, pp. R941-R945, 2011.
[2] R. Buehler, D. Griffin, and M. Ross, "Exploring the 'planning fallacy': Why people underestimate their task completion times," Journal of Personality and Social Psychology, vol. 67, no. 3, p. 366, 1994.
[3] R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proceedings of the Seventh International Conference on Machine Learning, pp. 216-224, 1990.
[4] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs," Journal of Machine Learning Research, vol. 3, pp. 397-422, Mar. 2003.
[5] I. Szita and A. Lőrincz, "The many faces of optimism: A unifying approach," in Proceedings of the 25th International Conference on Machine Learning, pp. 1048-1055, ACM, 2008.
[6] P. Sunehag and M. Hutter, "Rationality, optimism and guarantees in general reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1345-1390, 2015.
[7] P. Sunehag and M. Hutter, "A dual process theory of optimistic cognition," in Proceedings of the 36th Annual Conference of the Cognitive Science Society (P. Bello, M. Guarini, M. McShane, and B. Scassellati, eds.), Austin, TX: Cognitive Science Society, 2014.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[9] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[10] J. Myerson and L. Green, "Discounting of delayed rewards: Models of individual choice," Journal of the Experimental Analysis of Behavior, vol. 64, pp. 263-276, Nov. 1995.
[11] R. Neumann, A. N. Rafferty, and T. L. Griffiths, "A bounded rationality account of wishful thinking," in Proceedings of the 36th Annual Conference of the Cognitive Science Society (P. Bello, M. Guarini, M. McShane, and B. Scassellati, eds.), Austin, TX: Cognitive Science Society, 2014.
[12] F. Lieder, M. Hsu, and T. L. Griffiths, "The high availability of extreme events serves resource-rational decision-making," in Proceedings of the 36th Annual Conference of the Cognitive Science Society (P. Bello, M. Guarini, M. McShane, and B. Scassellati, eds.), Austin, TX: Cognitive Science Society, 2014.
[13] T. L. Griffiths, F. Lieder, and N. D. Goodman, "Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic," Topics in Cognitive Science, vol. 7, no. 2, pp. 217-229, 2015.
[14] F. Lieder, T. L. Griffiths, and N. D. Goodman, "Burn-in, bias, and the rationality of anchoring," in Advances in Neural Information Processing Systems 25 (P. Bartlett, F. C. N. Pereira, L. Bottou, C. J. C. Burges, and K. Q. Weinberger, eds.), 2013.