The optimism bias may support rational action

Falk Lieder, Sidharth Goel, Ronald Kwan, Thomas L. Griffiths
University of California, Berkeley

1 Introduction

People systematically overestimate the probability of good outcomes [1] and systematically underestimate how long it will take to achieve them [2]. From an epistemic perspective, optimism is irrational because it misrepresents the information that we have been given. Yet, surprisingly, optimistic people often perform better than their more realistic peers [1]. How can being irrational lead to better performance than being more rational?

In this abstract, we explore a potential solution to this puzzle. Concretely, we investigate the hypothesis that overestimating the probability of achieving good outcomes can compensate for the cognitive limitations that prevent us from looking far ahead into the future and fully considering the long-term consequences of our actions. Previous work in reinforcement learning has used different notions of optimism to promote learning through exploration and thereby benefit the returns of future decisions [3–7]. Here, we investigate the immediate benefits of optimism for the returns of present actions rather than for learning and future returns, and we explore a different notion of optimism that formalizes the psychological theory that people overestimate the probability of good events relative to bad events.

2 Model

We model the decision environment $E$ as a Markov Decision Process (MDP) [8]

\[ E = (S, A, T, \gamma, r), \tag{1} \]

where $S$ are the states, $A$ are the actions, $T$ are the transition probabilities, $\gamma = 1$ is the discount factor, and $r$ is the reward function. We model the agent's internal model $M$ of the environment $E$ by the MDP

\[ M = (S, A, \hat{T}, \gamma, r), \tag{2} \]

whose transition probabilities $\hat{T}$ may differ from the true probabilities $T$.
Concretely, we use this distortion of the transition probabilities to model optimism and pessimism according to

\[ \hat{T}_\alpha(s' \mid s, a) \propto T(s' \mid s, a) \, \mathrm{sig}\big(V_E(s') - V_E(s)\big)^{\alpha}, \tag{3} \]

where sig is the sigmoidal function, $V_E$ is the optimal value function of the MDP $E$, $\alpha = 0$ corresponds to realism, $\alpha > 0$ corresponds to optimism, and $\alpha < 0$ corresponds to pessimism.

We model the implications of bounded cognitive resources on people's performance in sequential decision-making by assuming that they can look only $h$ steps ahead and therefore act according to the policy

\[ \pi_h(s) = \arg\max_a Q_h(s, a), \tag{4} \]

\[ Q_h(s_t, a) = \mathbb{E}_{\hat{T}}\left[ r(s_t, a, S_{t+1}) + \max_\pi \sum_{i=t+1}^{t+h-1} r(s_i, \pi(s_i), S_{i+1}) \right], \tag{5} \]

where the expectation $\mathbb{E}_{\hat{T}}$ is taken with respect to the subjective transition probabilities $\hat{T}$. We compute this solution using backward induction with planning horizon $h$ [9].
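As a rough illustration of the distortion in Equation 3, the following Python sketch reweights one row of a transition matrix by a power of the sigmoid of the value difference and then renormalizes. The function names `log_sigmoid` and `distort_transitions`, the use of NumPy, the log-space computation, and the numbers in the usage lines are assumptions made for illustration only.

```python
import numpy as np

def log_sigmoid(x):
    """log(1 / (1 + exp(-x))), computed without overflow."""
    return -np.logaddexp(0.0, -np.asarray(x, dtype=float))

def distort_transitions(T_row, V_next, V_current, alpha):
    """Distorted distribution over successor states for one (s, a) pair.

    T_row     : true probabilities T(s' | s, a), one entry per successor s'
    V_next    : V_E(s') for each successor s'
    V_current : V_E(s) of the current state
    alpha     : 0 = realism, > 0 = optimism, < 0 = pessimism
    """
    T_row = np.asarray(T_row, dtype=float)
    V_next = np.asarray(V_next, dtype=float)
    # alpha * log sig(V_E(s') - V_E(s)) is the log of the reweighting factor
    log_w = alpha * log_sigmoid(V_next - V_current)
    # Shift by the largest weight among reachable successors for stability
    log_w = log_w - log_w[T_row > 0].max()
    unnormalized = T_row * np.exp(log_w)
    return unnormalized / unnormalized.sum()

# Hypothetical example: two equally likely successors with values 0 and 1
realistic = distort_transitions([0.5, 0.5], [0.0, 1.0], 0.0, alpha=0.0)
optimistic = distort_transitions([0.5, 0.5], [0.0, 1.0], 0.0, alpha=1.0)
print(realistic, optimistic)
```

With $\alpha = 0$ the row is returned unchanged, while $\alpha > 0$ moves probability mass toward the higher-valued successor and $\alpha < 0$ moves it away.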
Figure 1: Illustration of the MDP structure used in the simulations and the experiment.

3 Simulation of sequential decision-making with delayed rewards

One of the key challenges in decision-making is that immediate rewards are sometimes misaligned with long-term rewards. This problem is exacerbated by the limited number of steps that agents can plan ahead. Hence, decision-makers tend to underweight distant uncertain rewards relative to immediate certain rewards [10]. Our theory predicts that optimism can compensate for this problem.

To illustrate this prediction, we simulated the effect of optimism on a bounded agent's performance in the sequential decision problem illustrated in Figure 1. The states $s_0, \ldots, s_{100}$ represent the agent's progress and correspond to having completed 0% to 100% of the work needed to reach the goal. At each point in time the agent can choose between working towards the goal ($a_1$), which costs effort and resources ($r_1 = -1$), and leisure ($a_0$), which generates a small reward ($r_0 = +1$) but does not advance the agent's progress. Once the goal has been achieved ($S_t = 100\%$), the agent can reap a large reward ($r_2 = +100$). This task is a finite-horizon MDP lasting 12 rounds. The agent can plan only 5 rounds ahead ($h = 5$), and its average rate of progress when working towards the goal for one round is 20%:

\[ T(s' \mid s, a_1) = \mathrm{Binomial}(s' - s;\ 100,\ 0.2). \tag{6} \]

To simulate the effects of optimism and pessimism on decision-making in this environment, we computed the policy $\pi_5$ (Equations 4 and 5) for the internal models $M_{\text{pessimism}}$, $M_{\text{realism}}$, and $M_{\text{optimism}}$ with $\hat{T}_\alpha$ for $\alpha_{\text{pessimism}} = -10$, $\alpha_{\text{realism}} = 0$, and $\alpha_{\text{optimism}} = +10$, respectively, and simulated the performance of the resulting myopic policies in the environment $E$. We found that the bounded agent whose model of the world was accurate (myopic realism) performed much worse than the optimal policy.
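The chain-structured task and the bounded planner can be sketched in Python as follows. This is a minimal sketch under several assumptions that the text does not spell out: rewards depend only on the current state and action, choosing $a_0$ in the goal state collects $r_2$, and progress is capped at 100%. The function names and the use of NumPy are likewise illustrative. The planner accepts any internal model `T_hat`; the usage lines pass in the true model, which corresponds to the realistic agent.

```python
import numpy as np
from math import comb

N_STATES = 101                                  # s_0 ... s_100 (0% to 100%)
GOAL = 100
R_LEISURE, R_WORK, R_GOAL = 1.0, -1.0, 100.0    # r_0, r_1, r_2

def binomial_pmf(n, p):
    """Probabilities of k = 0..n successes in n independent trials."""
    return np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])

def build_chain_mdp(p=0.2):
    """Return T[a, s, s'] and R[a, s] for leisure (a = 0) and work (a = 1)."""
    T = np.zeros((2, N_STATES, N_STATES))
    R = np.zeros((2, N_STATES))
    T[0] = np.eye(N_STATES)              # leisure leaves progress unchanged
    R[0, :] = R_LEISURE
    R[0, GOAL] = R_GOAL                  # assumption: a_0 at the goal yields r_2
    R[1, :] = R_WORK
    step = binomial_pmf(100, p)          # progress gained by one round of work
    for s in range(N_STATES):
        for k, prob in enumerate(step):
            T[1, s, min(s + k, GOAL)] += prob   # assumption: cap at 100%
    return T, R

def lookahead_q(T_hat, R, h):
    """Q_h[a, s] under the internal model T_hat, by backward induction."""
    if h < 1:
        raise ValueError("planning horizon h must be at least 1")
    V = np.zeros(N_STATES)               # value when no steps remain
    for _ in range(h):
        Q = R + T_hat @ V                # immediate reward plus expected value
        V = Q.max(axis=0)                # act greedily at every state
    return Q

T, R = build_chain_mdp()
for h in (5, 12):
    q = lookahead_q(T, R, h)
    print(h, "work" if q[1, 0] > q[0, 0] else "leisure")
```

Under these assumptions, a realistic agent planning 5 steps ahead should prefer leisure in $s_0$, whereas planning over all 12 rounds should favor working. An optimistic or pessimistic agent could be simulated by passing a distorted model in place of `T`.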
While the optimal policy chooses action $a_1$ until it reaches the goal state in almost all cases, the myopic realistic agent always chose action $a_0$ in state $s_0$ (0%) and consequently never reached the goal state ($s_{100}$). By contrast, the optimistic bounded agent chose to always invest effort ($a_1$) in the initial state and consequently performed optimally (see Figure 2A). The pessimistic bounded agent performed at the same level as the realistic one. Note that the optimistic agent exhibited the planning fallacy [2]: while the expected completion time following the optimal policy was 4.5 months, the optimistic agent's estimate of the expected completion time was only 3.9 months. This made it worthwhile for the optimistic bounded agent to pursue the goal even though it was thinking only 5 steps ahead. Thus, the irrational planning fallacy led to rational action.

Hence, according to our theory, people with a very accurate model of the world might perform worse in sequential decision problems with the chain structure shown in Figure 1 than their more optimistic peers. To test this prediction, we have planned an experiment that will induce people to be either optimistic, realistic, or pessimistic and then measure their performance in the chain-structured MDP shown in Figure 1, and we conducted a pilot experiment to tune the proposed experimental design.

4 Pilot Experiment

To evaluate the effectiveness of our experimental manipulations and to determine the decisions that our model would predict for the resulting internal models, we conducted a pilot experiment.
Figure 2: A: Performance of the optimistic, realistic, and pessimistic myopic bounded agents and the optimal policy in the decision environment shown in Figure 1. B: Simulation of the main experiment by condition depending on how many steps people can plan ahead.

4.1 Methods

We recruited 200 adult participants on Amazon Mechanical Turk. Participants received $0.50 for about 6 minutes of work. Eight participants were excluded because their answers to the survey questions indicated that they had not performed the task.

Participants solved a sequential decision problem with the structure shown in Figure 1. To convey this structure to our participants, we created a game called Product Manager. In this game, participants play the manager of a car company. In each round (month), the participant decides whether the company will focus on marketing the existing product SportsCar ($a_0$) or invest in the development of a new product HoverCar ($a_1$). Participants started with a capital of $1 000 000, and their task was to maximize the company's capital after 4, 12, 24, or 72 months. The reward for marketing the old product was drawn from a normal distribution with mean $10 000 and standard deviation $1 000 ($r_0 \sim \mathcal{N}(\mu = 10\,000, \sigma = 1\,000)$), the reward for investing in development was drawn from a normal distribution with mean −$15 000 and standard deviation $1 500 ($r_1 \sim \mathcal{N}(\mu = -15\,000, \sigma = 1\,500)$), and the reward for marketing the new product was normally distributed with mean $135 000 and standard deviation $13 500 ($r_2 \sim \mathcal{N}(\mu = 135\,000, \sigma = 13\,500)$). In each round, the participant was shown the current state (e.g., "HoverCar is currently 0% developed."), the number of the current round and the total number of rounds, their current balance, and the reward of their most recent decision.

The experiment was structured into three blocks: instructions, a training block, and a survey.
The instructions introduced the game and informed the participants about the number of rounds, the return on marketing the existing product, the cost of developing the new product, and the return on marketing the new product. In the training block, the participants' task was to explore the effects of investing in development versus marketing in three simulations lasting 10 rounds each. Finally, the survey asked participants to estimate the average rate of progress that occurred when they decided to invest in development and in marketing, respectively.

Each participant was randomly assigned to one of three experimental conditions. The conditions differed in the rate of progress used in the simulations of the training phase and were designed to induce pessimism, realism, and optimism, respectively. In the pessimism condition, the average rate of progress in the training block was half the true rate of progress; in the realism condition, it was equal to the true rate of progress; and in the optimism condition, it was twice the true rate of progress. The true rate of progress was set such that the expected number of investments needed to reach 100% development was 80% of the total duration of the game.

4.2 Results and Discussion

To examine the effectiveness of our experimental manipulations, we compared the three groups' estimates of the average amount of progress achieved by a single investment in product development. We found that people estimated the rate of progress to be higher in the optimism condition than in the realism condition (t(123.7) = 2.30, p = 0.01) and the pessimism condition (t(126.2) = 2.59, p < 0.01). The difference between the realism condition and the pessimism condition was not
statistically significant (t(122.5) = 0.53, p = 0.30). In conclusion, our experimental manipulation was successful at inducing optimism and created significant group differences.

Furthermore, we found that each group's estimate of the rate of progress was significantly higher than the rate of progress in the examples they had observed. In the pessimism condition, people overestimated the presented rate of progress by 12.4% (t(65) = 4.16, p < 0.0001). In the realism condition, participants overestimated the presented rate of progress by 7.4% (t(61) = 2.94, p = 0.0023), and in the optimism condition, people overestimated the presented rate of progress by 5.2% (t(63) = 1.66, p = 0.05). The overestimation of the frequency of positive events is consistent with the optimism bias [1]. Interestingly, the optimism bias decreased with the true frequency. Hence, at least in our study, the optimism bias could result from Bayesian inference with an optimistic prior.

5 Planned Main Experiment

The main experiment will use the paradigm of the pilot experiment with the addition of a test block following the training block. In the test block, participants will play the Product Manager game for 4, 12, 24, or 72 rounds starting at 0% progress. Participants will receive a financial bonus of up to $2 proportional to their capital at the end of the test phase.

5.1 Model Predictions

We used the subjective transition probabilities induced by our experimental manipulations in the pilot experiment to derive our model's predictions for the results of the main experiment. As shown in Figure 2B, our simulation shows that optimism reduces the number of steps that people have to plan ahead to realize that they should invest in product development. Our model predicts that when the game lasts only four rounds, participants in the optimism condition will invest but participants in the realism and pessimism conditions will not.
When the game lasts 12 or more rounds, the benefit of optimism over realism depends on how many steps people can plan ahead.

5.2 Next Steps

Since it appears plausible that people plan fewer than 6 steps ahead, we will refine the experimental manipulations to induce larger differences between the three groups' subjective transition probabilities, such that our model predicts a benefit of optimism in the 12-month condition even if people plan only 5 steps ahead. We will test this prediction by running the main experiment with the experimental manipulations determined through iterative piloting. We will use the data to test our theory's prediction that optimism improves people's performance in sequential decision problems in which determining the best action requires planning many steps ahead.

6 Discussion

Our theory suggests that the optimism bias might serve to improve people's decisions in environments in which the rewards for prolonged cumulative effort justify forgoing immediate gratification. According to our model, the optimism bias achieves this by compensating for the cognitive limitations that prevent us from looking far enough into the future to fully consider long-term consequences unless we underestimate how long it will take to achieve them. Hence, at least in some cases, the planning fallacy [2] helps us make better decisions. It may therefore be a sign of bounded rationality rather than irrationality.

This abstract continues the line of work begun by [11] by generalizing the definition of optimism proposed therein from a specific class of decision problems to general decision problems and by testing its predictions empirically. We have demonstrated that this extension can capture the benefits of optimism when obtaining a high reward requires persistent effort. Furthermore, the proposed experiment will be the first to empirically test our boundedly rational theory of optimism.
The beneficial effects of optimism illustrate that for bounded agents there is a tension between epistemic rationality (having beliefs that are as accurate as possible) and instrumental rationality (choosing the actions that maximize one's expected utility). Concretely, our simulations suggest
that bounded agents have to be epistemically irrational to achieve instrumental rationality [12]. This might be the deeper reason why we are optimistic for ourselves but not for others [2]. In conclusion, our theory suggests that optimism and the planning fallacy might not be irrational after all but reflect the rational use of limited cognitive resources [13, 14].

Acknowledgments. This work was supported by ONR MURI N00014-13-1-0341.

References

[1] T. Sharot, "The optimism bias," Current Biology, vol. 21, no. 23, pp. R941–R945, 2011.

[2] R. Buehler, D. Griffin, and M. Ross, "Exploring the planning fallacy: Why people underestimate their task completion times," Journal of Personality and Social Psychology, vol. 67, no. 3, p. 366, 1994.

[3] R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proceedings of the Seventh International Conference on Machine Learning, pp. 216–224, 1990.

[4] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs," J. Mach. Learn. Res., vol. 3, pp. 397–422, Mar. 2003.

[5] I. Szita and A. Lőrincz, "The many faces of optimism: a unifying approach," in Proceedings of the 25th International Conference on Machine Learning, pp. 1048–1055, ACM, 2008.

[6] P. Sunehag and M. Hutter, "Rationality, optimism and guarantees in general reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1345–1390, 2015.

[7] P. Sunehag and M. Hutter, "A dual process theory of optimistic cognition," in Proceedings of the 36th Annual Conference of the Cognitive Science Society (P. Bello, M. Guarini, M. McShane, and B. Scassellati, eds.), (Austin, TX), Cognitive Science Society, 2014.

[8] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge: MIT Press, 1998.

[9] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[10] J. Myerson and L.
Green, "Discounting of delayed rewards: Models of individual choice," Journal of the Experimental Analysis of Behavior, vol. 64, pp. 263–276, Nov. 1995.

[11] R. Neumann, A. N. Rafferty, and T. L. Griffiths, "A bounded rationality account of wishful thinking," in Proceedings of the 36th Annual Conference of the Cognitive Science Society (P. Bello, M. Guarini, M. McShane, and B. Scassellati, eds.), (Austin, TX), Cognitive Science Society, 2014.

[12] F. Lieder, M. Hsu, and T. L. Griffiths, "The high availability of extreme events serves resource-rational decision-making," in Proceedings of the 36th Annual Conference of the Cognitive Science Society (P. Bello, M. Guarini, M. McShane, and B. Scassellati, eds.), (Austin, TX), Cognitive Science Society, 2014.

[13] T. L. Griffiths, F. Lieder, and N. D. Goodman, "Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic," Topics in Cognitive Science, vol. 7, no. 2, pp. 217–229, 2015.

[14] F. Lieder, T. L. Griffiths, and N. D. Goodman, "Burn-in, bias, and the rationality of anchoring," in Adv. Neural Inf. Process. Syst. 25 (P. Bartlett, F. C. N. Pereira, L. Bottou, C. J. C. Burges, and K. Q. Weinberger, eds.), 2013.