OBSERVATIONAL LEARNING


LABORATOIRE DE NEUROSCIENCES COGNITIVES, DÉPARTEMENT D'ÉTUDES COGNITIVES

OBSERVATIONAL LEARNING

EMMANUELLE BONNET, SUPERVISED BY STEFANO PALMINTERI

Total word count:

RESEARCH MASTER IN COGNITIVE SCIENCE (COGMASTER), JUNE 6,

ÉCOLE NORMALE SUPÉRIEURE, ÉCOLE DES HAUTES ÉTUDES EN SCIENCES SOCIALES, UNIVERSITÉ PARIS DESCARTES

Originality declaration: To my knowledge, the scientific question that this thesis addresses has not been explicitly addressed before. The originality of the proposed protocol was to include a condition in which imitation was not always beneficial. Although recent work has shown that observational learning is beneficial in terms of performance, the specific question that we target - whether participants imitate despite a maladaptive value of observation - has not been formally elucidated in the literature. The experimental design used to study this question had never been used in a social context. To date and to my knowledge, no other study has strictly compared private learning by trial and error to observational learning in which subjects have to infer the consequences of an observed action. Since biases in the private domain have recently been evidenced, a new computational model was designed and tested against the collected behavioural data to investigate a potential computational bias in observational learning.

Contribution declaration: Have participated in this project: Stefano Palminteri, Emmanuelle Bonnet. Definition of the question of scientific interest: S. Palminteri. Bibliographic literature review, definition of the general approach: S. Palminteri, E. Bonnet. Choice of experimental design: S. Palminteri. Experimental task programming: E. Bonnet, adapted from H. Vanderdriessche's scripts. Subject testing: E. Bonnet. Behavioural data analyses: E. Bonnet, S. Palminteri. Design of the model-based approach: S. Palminteri. Computational scripts production: S. Palminteri, E. Bonnet. Interpretation of results: E. Bonnet, S. Palminteri. Thesis redaction, figure and table production: E. Bonnet. Proofreading and comments on the thesis: S. Palminteri.
Acknowledgments

I would first like to thank Stefano Palminteri, my supervisor, for accepting me despite my short experience in computational neuroscience, accompanying me throughout the project and sharing his expertise. Vasilisa Skvortsova (post-doctoral researcher), Heloïse Théro (PhD candidate), Lou Safran (PhD candidate), Sophie Bavard (trainee), Anis Najar (post-doctoral researcher), Henri Vanderdriessche (lab manager): thank you for being there to answer my questions and helping me find new information. To all the coworkers I had the pleasure of discussing with at conferences and lunches, thank you for these insightful exchanges. You all made this a very enriching experience.

Summary

I. Introduction
  Private Reinforcement Learning in the global framework of Decision Making
  Insights on social learning from Social Sciences
  Current research question
II. Methods
  Task design
  Participants and procedures
  Analyses
III. Results
  Behavioural results
  Computational results
IV. Discussion
  General conclusion
  Perspectives
V. References

Abstract: Reinforcement-based learning deals with the ability to bias choices so as to maximize the occurrence of pleasant events (rewards) and minimize that of unpleasant events (punishments). This process has been quite well characterized at the computational and neurobiological levels in the private setting, i.e. in situations that do not involve any social interaction. This project aimed to further investigate observational reinforcement learning with two research questions. First, do subjects imitate despite a maladaptive value of imitation? Secondly, does imitation learning present a computational bias? To answer these questions, healthy participants played a probabilistic learning task, either by themselves or in sessions in which they could sometimes see the actions of another player. A subtle gain in performance was evidenced when observing a good model, and there was a tendency toward decreased performance when observing a bad model. The model parameter analyses evidenced a tendency of differential learning from confirmatory versus disconfirmatory observed choices, and overall our results qualitatively support our hypothesis of learning biases not only in a private but also in a social context.

I. Introduction

Man has three ways of acting wisely. First, on meditation; that is the noblest. Secondly, on imitation; that is the easiest. Thirdly, on experience; that is the bitterest. Confucius, the Analects, as reported in Chambers Dictionary of Quotations (1997), p.279 [1]

More than two millennia after Confucius, a more descriptive and analytical stance has been taken on the study of decision-making. It has been proposed to be divided into several steps [2] [3]. First, one must have a representation of the current situation involving the decision, i.e. the state. Then, one evaluates the actions in terms of their respective consequences, i.e. valuation, and acts accordingly, i.e. action selection. Finally, with the outcome of the action one can re-evaluate future actions. This Reinforcement Learning, which allows one to act (wisely) on retrospective experience, will be the focus of the present work. Reinforcement Learning will first be described from a historical point of view, with an emphasis on its neurobiology and on the processes that have been interpreted in terms of biases, or systematic deviations from a normative model (I.A). The emerging Social Neuroscience literature has provided numerous theories and experimental tools to investigate decision making in a social context, and specifically learning through others' experience (I.B). This review will lead us to the research question of this project on observational (reinforcement) learning (I.C).

Private Reinforcement Learning in the global framework of Decision Making.

1. From conditioning to the mathematical formalisation of Reinforcement Learning.

Historically, Reinforcement Learning is grounded in the Behaviourist approach to Psychology from the end of the 19th century. A variety of experimental paradigms have evidenced conditioning, or associative learning, mainly in animals.
In classical or Pavlovian conditioning, the association between a state and an outcome is learned, like Pavlov's dog salivating when a bell that predicted food rang. Essentially, the reward is delivered irrespective of the animal's behaviour, as opposed to instrumental conditioning. The establishment and strengthening of Stimulus-Response associations, or response reinforcement, was theorized by Thorndike in the Law of Effect. In instrumental conditioning, a learner establishes an explicit representation of the sequence of actions that leads from a defined state to the associated outcome. The valence of the outcome then guides actions when the state is re-encountered, essentially to increase the probability of obtaining a reward and minimize the probability of a punishment [4]. For associative learning to occur, the action (i.e. conditioned stimulus) and its associated outcome (i.e. unconditioned stimulus) must occur closely in time - i.e. temporal contiguity - and the occurrence of the action must indicate a change in the likelihood of the outcome's occurrence - i.e. contingency. The Rescorla-Wagner model [5] was the first formalisation of the changes in associative strength between a conditioned stimulus and a subsequent unconditioned stimulus as a result of a conditioning trial. Essentially, it posits that associative learning occurs as a function of surprise, because the co-occurrence is unanticipated [6]. The delta rule (cf. Equation 1) adjusts the expected value of the option proportionally to the prediction error, which is the difference between the actual and the expected outcome. Following the methodological paradigms and analyses from Behaviourism [7], instrumental probabilistic (bandit) tasks, in which subjects are presented with a binary choice, are nowadays

used to investigate how human subjects learn by trial and error which of two options is more rewarding. From the 1970s onward, models of Reinforcement Learning emerged in the field of machine learning. From the modelling perspective, Reinforcement Learning is one of three categories of learning, alongside supervised and unsupervised algorithms [8]. Specifically, it formulates how an agent learns to take actions in an environment so as to maximize cumulative reward, by learning from trial and error the consequences of actions and choosing the best action accordingly. Reinforcement Learning theories highlight the importance of both the value function - learning action values to predict the reinforcer - and the policy function, or action selection, which maps stimuli to actions, thereby solving both the prediction and the optimal control problems [9]. A version of the prominent Q-learning algorithm [10] was used in the present study. This iterative, model-free algorithm computes the state-action value function and converges to the true action values given a sufficient number of trials in a fixed environment. After initialising the possible state-action pairs, the expectation of future reward (or expected Q-value) is updated with the Rescorla-Wagner learning rule, as follows:

(1) Q_t+1(s_t, a_t) = Q_t(s_t, a_t) + α × δ

(2) δ = R_t+1 − Q_t(s_t, a_t)

Thus, learning is driven by the prediction error delta. The learning rate alpha quantifies by how much a Q-value is updated by the last observed outcome: when alpha = 0 there is no learning, and when alpha = 1 only the last piece of information is considered. Even if an agent chooses to exploit the option with the highest estimated Q-value, there might be a better option available if one explores the environment. The trade-off between exploring an environment to gather more information and exploiting the current policy is a fundamental aspect of decision making.
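The delta-rule update in Equations (1) and (2) can be sketched in a few lines of code. This is a minimal illustration, not the thesis's actual scripts; the function name, the outcome stream and the learning-rate value are ours.

```python
def q_update(q, reward, alpha):
    """Return the updated Q-value after observing `reward`."""
    delta = reward - q          # prediction error (Equation 2)
    return q + alpha * delta    # Rescorla-Wagner update (Equation 1)

# A hypothetical stream of outcomes (+1 reward, -1 punishment):
q = 0.0
for r in [1, 1, -1, 1]:
    q = q_update(q, r, alpha=0.5)
```

Note how the two limiting cases described above fall out directly: with alpha = 0 the Q-value never changes, and with alpha = 1 it simply becomes the last outcome.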
Computationally, in Reinforcement Learning algorithms, action selection is implemented with a softmax function that grades action probabilities by estimated values. The probability of choosing A over B is a sigmoid function of the value difference between A and B:

(3) P(A) = 1 / (1 + e^((Q(B) − Q(A)) / β))

Beta, a computational parameter referred to as the temperature in analogy to thermodynamics, determines this trade-off: when beta tends to 0, the current policy is exploited. In this limit it resembles the ε-greedy approach, in which a random action is selected with some probability ε and the action with the highest value (the "greedy" one) is chosen the rest of the time. For higher values of the beta parameter, the probabilities of the different actions tend to be similar, which induces exploration. More complex models have been proposed, such as the actor-critic algorithm, which processes the value prediction and the action selection separately, or model-based Reinforcement Learning, which builds and uses a representation of the current environment to anticipate future rewards and select the best action accordingly. Nonetheless, imaging studies have revealed overlapping activations and the use of both types of Reinforcement Learning in healthy participants [11] [12]. Armed with these more complex mathematical formalizations of learning mechanisms, model-based functional imaging [13] was developed in the last 20 years to investigate the processes involved in Reinforcement Learning and their neural implementation.
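Equation (3) translates directly into code. The sketch below is illustrative (function name and parameter values are ours); it shows how beta governs the exploration-exploitation trade-off: a small beta makes the choice nearly deterministic, a large beta pushes both probabilities toward 0.5.

```python
import math

def p_choose_a(q_a, q_b, beta):
    """Probability of choosing option A over B (Equation 3)."""
    return 1.0 / (1.0 + math.exp((q_b - q_a) / beta))

# With equal values, choice is random; with a large beta it is near-random
# even when values differ; with a small beta it is almost deterministic.
p_equal = p_choose_a(0.5, 0.5, beta=1.0)       # 0.5
p_hot   = p_choose_a(1.0, 0.0, beta=1000.0)    # close to 0.5
p_cold  = p_choose_a(1.0, 0.0, beta=0.01)      # close to 1.0
```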

2. Reinforcement Learning: processes, neural implementation and deviations from optimality.

Neurobiologically, reward learning occurs in a dynamic fronto-striatal network in direct link with midbrain dopaminergic neural activity. A seminal neurophysiological study [14] evidenced the role of the neuromodulator dopamine in the strengthening of synaptic connections for Stimulus-Reward associations. If an unpredicted reward occurred, midbrain dopaminergic neurons fired at the time of the outcome. If the reward was predicted and occurred, the neurons fired at the time of the cue. Lastly, if a reward was predicted but did not occur, a burst occurred at cue presentation but firing rates were depleted (i.e. dipped) at the time when the outcome should have been presented. Numerous fMRI studies [15] evidenced the role of the striatum and midbrain, which receive strong dopaminergic connections and encode the Reward Prediction Error. In electroencephalography, an event-related potential (ERP) reflects the valence of the outcome of one's own actions: this feedback-related, medial-frontal negativity component follows error feedback in trial-and-error tasks [16] [17]. In addition to limbic regions, the ventromedial prefrontal cortex has been associated with the representation of both expected and actual rewards. Learning to avoid punishments involves additional modulators (including serotonin) and subcortical structures such as the amygdala and anterior insula [18], evidenced more causally with neuropharmacological and lesion studies [19] [20]. It is beyond the scope of this project to detail all the factors that have been shown to modulate decision-making, such as the integration of cost [21], delayed reward [22] or the risk or uncertainty about obtaining the reward [23], all involving other structures such as the Anterior Cingulate Cortex (ACC) and modulators such as serotonin or norepinephrine [3].
The next section will specifically review, within the framework of Reinforcement Learning, how deviations from normative learning appear when two environmental factors are manipulated, namely the valence of outcomes and the amount of information available for learning. Numerous behavioural data have evidenced that humans are sensitive to the valence of the prediction error, and learn differently in response to better-than-expected and worse-than-expected outcomes [24] [25] [26]. Asymmetric computational models with distinct learning rates have provided a better explanation of experimental data than the basic Reinforcement Learning model, which did not account for this valence-induced bias [27]. Except for one study [28], which interpreted its opposite finding of a higher learning rate for negative compared to positive prediction errors in terms of risk aversion, the behavioural data tend to favour the optimistic bias. In simulations, an optimistic meta-learner outperformed an unbiased learner in attaining rewards, under specific and fixed conditions [24] [29] that are still debated, notably regarding the adaptation of the learning rate to the distribution of rewards [30]. Interestingly, optimistic subjects had a reduced negative learning rate compared to unbiased subjects, which can be understood as underestimating worse-than-expected outcomes. This finding is in line with a long history of analysing biases in economics and psychology [31]. Unrealistic optimism in the domain of general beliefs has been revealed as overconfidence [32] and as underestimation of the likelihood of encountering negative life events - for example, smokers assessing their risk of premature mortality [33]. Furthermore, it suggests that the unrealistic optimism occurring at the level of general beliefs may be a consequence of the low-level asymmetry found in Reinforcement Learning.
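The asymmetric model discussed above replaces the single learning rate of the delta rule with two, applied according to the sign of the prediction error. The following is a minimal sketch under that description (names and rate values are ours, not the thesis's); an "optimistic" learner is one with alpha_pos > alpha_neg.

```python
def asymmetric_update(q, reward, alpha_pos, alpha_neg):
    """Delta-rule update with valence-dependent learning rates."""
    delta = reward - q
    alpha = alpha_pos if delta > 0 else alpha_neg
    return q + alpha * delta

# An optimistic learner integrates good news faster than bad news:
q_after_win  = asymmetric_update(0.0,  1.0, alpha_pos=0.6, alpha_neg=0.2)
q_after_loss = asymmetric_update(0.0, -1.0, alpha_pos=0.6, alpha_neg=0.2)
```

With these illustrative values, a win moves the value three times as far as an equally surprising loss, which captures the underweighting of worse-than-expected outcomes described in the text.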

Reinforcement learning cannot be reduced to learning from one's own obtained outcomes. Learning from the outcome of the alternative, unchosen action has been referred to as counterfactual learning. Computationally, it is implemented by adding to the classic factual module a counterfactual module that updates the value of the unchosen option. Remarkably, the opposite pattern of valence-induced asymmetry was found for counterfactual learning [34]: for unchosen options, the learning rate for integrating worse-than-expected outcomes was higher than for integrating better-than-expected outcomes (i.e. positive prediction errors). In a recent experiment [35], factual and counterfactual learning were compared in trials in which participants either freely chose between two options or were forced to make a choice. When forced to choose an option they did not prefer, learning was driven by a higher learning rate for worse-than-expected than for better-than-expected outcomes. When learning from an unchosen and non-preferred option, the learning rate for better-than-expected outcomes was significantly higher than the learning rate for worse-than-expected outcomes. Overall, these results provide evidence that what occurs at this low level of Reinforcement Learning is a confirmation bias rather than simply a valuation bias: healthy participants preferentially consider the outcomes that support their current choice, in an egocentric manner. In addition to the differences revealed by comparing healthy participants and diverse cohorts of patients, interindividual differences arise over the lifespan, which has been linked to the maturation of the executive prefrontal regions and to developmental changes in connectivity between striatal and prefrontal regions [36] [37].
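The factual-plus-counterfactual architecture described above can be sketched as two parallel delta-rule updates, one per option, each with its own learning rate. This is an illustrative skeleton under the description in the text (all names and values are ours):

```python
def update_both(q_chosen, q_unchosen, r_chosen, r_unchosen,
                alpha_factual, alpha_counterfactual):
    """Update the chosen option from its outcome (factual module) and the
    unchosen option from its foregone outcome (counterfactual module)."""
    q_chosen += alpha_factual * (r_chosen - q_chosen)
    q_unchosen += alpha_counterfactual * (r_unchosen - q_unchosen)
    return q_chosen, q_unchosen

# One trial: the chosen option paid +1, the foregone option would have paid -1.
q_c, q_u = update_both(0.0, 0.0, 1.0, -1.0,
                       alpha_factual=0.5, alpha_counterfactual=0.25)
```

Splitting each of the two learning rates by prediction-error valence, as in the asymmetric model, yields the four-rate scheme in which the confirmation bias reported above can be expressed.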
Reinforcement Learning provides a general framework for investigating behavioural differences and testing hypotheses about different underlying computations, according to personality traits such as optimism, or to age. Briefly, while children learn preferentially from worse-than-expected outcomes [38], the behaviour of adolescents was best explained by simpler models - without the counterfactual learning module - compared to adults [39]. If monitoring one's own errors is a fundamental ability, there are numerous examples in which one acquires knowledge vicariously, by observing the consequences of others' choices. The emerging research on social cognition investigates decision making in a social context, and provides experimental paradigms that have allowed a better understanding of the monitoring of one's own versus others' choices, and of observational learning.

Insights on social learning from Social Sciences.

1. Evidence of social observation and influence

Social learning has been defined quite generally in Ethology as any process through which one individual (the demonstrator) influences the behaviour of another individual (the observer) in a manner that increases the probability that the observer learns [40]. In the animal literature, the model is usually an expert to observe and learn from [41]. Nonetheless, behavioural ecology has further investigated when imitation occurs and from whom behaviour is copied [42]. Essentially, those studies revealed an adaptive advantage of social learning, which allows information to be acquired at a lower cost than through private learning [43]. As a species, humans are effective social learners. Bandura's Social Cognitive Theory [44], initially named Social Learning Theory, first theorised that behaviour is indeed learnt from the environment through observational learning. He evidenced the learning of aggressive behaviour

with the Bobo doll experiment: children who observed an adult model mistreating a doll applied the same aggressive scheme when re-encountering the doll. As with Reinforcement Learning, social neuroscience emerged from a union between social psychology, economics, and cognitive and affective neurosciences. It has offered new insights into the underlying processes and neural implementations of decision making in a social context [45], which necessitates an understanding of the other's behaviour. At the most basic level, the Mirror Neuron System, activated when either performing an action or observing it performed by someone else [46], is thought to mediate the understanding of other people's motor intentions and action goals. Mentalizing refers to the ability to represent another person's psychological perspective. While the theory-of-mind literature focuses on understanding others' beliefs and thoughts, it has been studied jointly with emotional perspective taking, or empathy, that is, the ability to understand others' feelings. An entire range of studies derived from Economics and game theory [47] use non-iterative and anonymous investment games to investigate social preferences in rigorous social contexts. Interestingly, these social preferences, such as punishing unfair players, are represented in the same reward-related brain areas [48]. In the Reinforcement Learning literature, a social reputation learning task [49] investigated how subjects learned by trial and error the associated outcome and the reliability (i.e. being honest or lying) of an observed confederate who advised them. Similarly, there was a prediction-error-like signal about the truth of communicative intentions. In a recent study [50], participants who observed an actor performing an action directed toward an object gave that object higher desirability ratings compared to objects which were not the target of an action.
In addition to this experimental evidence of mimetic desire, connectivity analyses revealed that the mirror neuron system affected the values encoded by the striatum and ventromedial prefrontal cortex, specifically by attributing a positive value to options chosen by others. Similarly, in an advice learning task [51], simulations and behavioural data favoured a model in which the outcomes of an action that was advised by a confederate were more positively evaluated, irrespective of the valence of the advice. While these findings have evidenced social influence on Decision Making in general, the last part of this literature review will refocus on the specific case of observational learning and on the monitoring of self versus non-self performances.

2. Experimental paradigms of observational learning.

The comparison of private versus observational learning, and more specifically the efficacy of passive observational learning, remains rarely investigated. A behavioural study [52] compared an active task of trial-and-error learning to a condition in which subjects matched previously recorded choices of another subject but were tested on the passively learnt associations. Due to the active nature of operant learning, greater engagement in the task compared to passive observation should favour active learning [53]. Nonetheless, high performance was obtained in both the active and the observational conditions when the reward rate was high (such as 80/20 contingencies). Interestingly, subjects overestimated the probability of winning in a harsh observed condition with a low probability of reward (such as 40/20 contingencies), which has been interpreted as an optimistic bias. However, contrasting with previously mentioned studies, this bias was not evidenced at the computational level, nor in actors learning by direct experience. Mainly, this study evidenced that observational learning can be at least as successful in terms of performance accuracy as private learning from one's own outcomes.

The monitoring of one's own and others' performance has been widely investigated in electrophysiological studies with similar experimental paradigms, in which participants could either learn privately by trial and error or observe a confederate performing the task. The electrophysiological marker of feedback-related negativity has recently been related to the encoding of feedback stimuli in situations of observation, specifically in the context of human social interaction [54] [55] [56]. Essentially, these studies evidenced that both groups - active and observational - learned at comparable rates. However, differences in the feedback-related negativity amplitude arose: it was significantly and specifically reduced after negative feedback in the observational condition relative to active learners, which suggests differential processes for monitoring one's own versus others' outcomes, and raises the question of their neural implementation. Making the link with the empathy literature, an EEG study that used a go/no-go task further demonstrated that observed errors produced distinctive electrophysiological brain responses depending on the interpersonal context, in a social setting of competition or cooperation [57]. Regarding the computational processes underlying learning through observation, a pioneering imaging study [58] used the Reinforcement Learning framework to investigate the neural mechanisms of learning when a subject had the possibility to observe the action of another player before choosing for themselves. Participants took turns in a classic instrumental learning task, and could observe either the action alone, or the action and its associated outcome, of another participant. Their results, displayed in Figure 1, evidenced that performance in terms of accuracy increases with the amount of contextual - here, social - information.

Figure 1: Figure taken from Burke et al. (2010) [58].

While individual outcome prediction errors were positively correlated with neural activity in the ventral striatum, two observational signals were evidenced with imaging techniques. First, an observational action prediction error - the actual minus the predicted observed choice - was associated with dorsolateral prefrontal cortex activity. Second, an observational outcome prediction error was correlated with activity in the ventromedial prefrontal cortex and the ventral striatum. This contrasts with another instrumental learning study that evidenced the role of the dorsal striatum in encoding the observed outcome prediction error [59]. Very recently, an intracranial model-based imaging study with epileptic patients [60] linked the ventral Anterior Cingulate Cortex (ACC) with the encoding of the expected value of an observed action and of the outcome prediction error. This calls for further investigation of the neural bases of observational learning and for a subdivision of the ACC: its ventral region seems to be associated with social attention, learning and empathy, whereas its dorsal region is classically activated in reward-guided behaviours. The benefit, in terms of accuracy, of observational learning was replicated to a lesser extent in a developmental study that also manipulated the amount of observed information [61]. Despite using the same reward contingencies - 80% reward and 20% loss for the most rewarding symbol - the behavioural gain when only observing the actions was much more subtle in the latter study: about 1%, compared to the 7% gain originally found. In the developmental replication, children at the end of learning reached a similar level of accuracy whether presented with the action and the outcome or with only the action of another player, whereas in adults there was a significant difference in accuracy between the two conditions of observation, partial or complete. Children were also more prone to copy same-aged child models than adult models, which is not uncommon in the developmental literature, specifically in domains in which children are not perceived as less capable than adults [62] [63]. As rewarding vicarious experience has been shown to be influenced by the perceived similarity with the observed agent [64] [54], this overall emphasizes the importance of social influence even in the context of observational (reinforcement) learning.

Current research question

Experimental studies have evidenced that subjects process not only their own outcomes but also the outcomes of others' choices, and the adaptive value of observational learning and imitation has been revealed as a relative gain in performance in the social interaction condition compared to private learning [58] [61]. Indeed, if the observed agent is a good model, the adaptiveness of imitation is evident by default. Our first question of interest was to modulate the value of imitation, by implementing good and bad models to observe. Following Confucius, if imitation is easy, it could nonetheless be unwise, as non-adaptive and sub-optimal, in the context of observing a bad model. We expected a significant gain in mean private performance when observing a Good model, replicating previous findings.
We further hypothesized that imitation of a Bad model would result in decreased performance in this observational condition compared to private learning. We were interested in quantifying the difference in mean correct choice rate between Private and Observational learning, to compare it to the relatively large gain evidenced in the initial study [58]. Secondly, this work aimed to further investigate the observational reinforcement learning process. As egocentric confirmatory biases have been evidenced in the private setting of Reinforcement Learning, a second question arose: does observational learning present a computational bias? In contrast with the studies mentioned above, participants in our experiment were only presented with the action of their confederate, not its associated outcome, which they had to infer. Nonetheless, the player could compare the observed choice with what they believed the correct symbol was at that time. We hypothesized that subjects would learn differentially from observed choices that did or did not confirm what they believed the best symbol was at the time of choice. Specifically, we investigated whether an asymmetry was found when comparing the social learning rates and, if so, whether the direction of the asymmetry would allow extending the egocentric confirmation bias to the social context of observational learning. To answer these questions, a computational modelling approach was chosen within the global framework of Reinforcement Learning. A variant of a classic instrumental learning task was designed, as presented in the Methods section (II.A). Behavioural data were acquired from healthy participants, and several Reinforcement Learning models were compared to test the aforementioned predictions. The results will be presented and discussed in the following two sections (III & IV).
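The hypothesized confirmatory/disconfirmatory asymmetry can be sketched as follows. This is our own illustrative toy, not the model actually fitted in the thesis: an observed choice nudges the value of the chosen option upward, with a learning rate that depends on whether that choice confirms the observer's current preference. All names and values are assumptions.

```python
def observational_update(q, observed, alpha_conf, alpha_disc):
    """Update option values `q` (dict) after seeing the model pick `observed`.
    Uses alpha_conf when the observed choice matches the observer's currently
    preferred option, alpha_disc otherwise."""
    preferred = max(q, key=q.get)
    alpha = alpha_conf if observed == preferred else alpha_disc
    # Treat the observed choice as (outcome-free) evidence that it is correct.
    q[observed] += alpha * (1.0 - q[observed])
    return q

q_confirmed = observational_update({'A': 0.2, 'B': 0.1}, 'A',
                                   alpha_conf=0.3, alpha_disc=0.1)
q_disconfirmed = observational_update({'A': 0.2, 'B': 0.1}, 'B',
                                      alpha_conf=0.3, alpha_disc=0.1)
```

A confirmation bias in this setting corresponds to alpha_conf > alpha_disc: observed choices that agree with one's own belief are weighted more heavily than those that contradict it.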

II. Methods

Task design

1. Original task design (Experiment I)

Experiment I was designed and conducted as planned in the pre-registration (cf. Annex 1). A variant of a classic instrumental learning task - which had been used in several other studies [23] [65] - was designed. Classically, in this probabilistic instrumental learning task, subjects are repeatedly presented with a binary choice between two abstract visual stimuli that results in either winning or losing a point. One of the two stimuli has a higher probability of reward, and the subject has to find out by trial and error which action has the highest expected value. In this experiment, the probability of a positive reward (+1) was 70% for the correct symbol and, symmetrically, 30% for the incorrect one. Symbols - letters from the Agathodaimon alphabet - were presented in fixed pairs that represented a choice context for 20 trials, thus allowing learning. The main change in the design was to include two conditions of learning: one in which, as classically done, subjects learned by themselves from the outcomes of their own choices, and an observational condition in which they would alternately observe the choice made by another subject and make their own private choices on the same pair of symbols. Unknown to the participants, and to allow for the experimental manipulation of the observational condition, the observed choices were those of an implemented virtual player. Two conditions of observation were included. In the Good Observation condition, the observed choice presented to the participant was the symbol associated with the highest probability of reward 80% of the time. In the other condition, Bad Observation, the symbol with the lowest probability of reward was chosen 80% of the time. At the beginning of each trial, after a fixation cross and a variable inter-trial interval of ms, the two options were presented for 750 ms.
In observational trials, after a go-signal, the other (virtual) player's choice was indicated by a red frame around the chosen stimulus that appeared 500 ms after stimulus presentation. To move on to the subsequent trial, participants had to match the observed choice by pressing the right or left response button, and the frame remained until the correct response button was pressed. Essentially, no feedback was given for the other player's choices. In private trials, after the presentation of the two options and the go-signal, subjects made their own choice by pressing the left or right response button for the left or right stimulus, and received the corresponding outcome (+1 or -1). Figure 2 illustrates the sequence of events for the two types of trials - private or observational - which were displayed separately on the two sides of a partitioned computer screen. After a short training (30 trials), subjects underwent 4 sessions, each consisting of 1 block of 20 private trials and 2 blocks of 40 trials comprising 20 private trials and 20 observational trials. Thus, one session contained 100 trials (60 private and 40 observational in total), and the experiment included 400 trials overall. In one half of the Observation blocks, the virtual player chose the correct option 80% of the time, while in the second half it chose the correct option 20% of the time. Blocks (Private, Observation Good and Observation Bad) were randomized within sessions.
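The virtual player's behaviour described above amounts to a Bernoulli process over the pair of symbols. A minimal simulation sketch of one 20-trial observational block, under the contingencies stated in the text (80% correct choices for the Good model, 20% for the Bad model), might look like this; the function and variable names are ours, not those of the thesis's task scripts:

```python
import random

def simulate_observed_choices(n_trials, p_correct_choice, seed=0):
    """Simulate the virtual player's choices for one observational block."""
    rng = random.Random(seed)
    return ['correct' if rng.random() < p_correct_choice else 'incorrect'
            for _ in range(n_trials)]

good_model = simulate_observed_choices(20, 0.80)  # Good Observation block
bad_model  = simulate_observed_choices(20, 0.20)  # Bad Observation block
```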

Figure 2: Trials and stimuli. Sequence of events on an individual trial for (A) private/action trials and (B) observational trials. Participants' pictures were displayed on the upper part of each side of the computer screen to differentiate private from observational trials.

2. Revised task design (Experiment II)

In the debriefing of Experiment I, participants reported that they were distracted in the Observation condition and that they stopped considering the other player's choice once they had found the most rewarding symbol by themselves. In addition, several participants realized, or had strong intuitions, that they were not really playing in real time with the other player. This motivated several modifications to the original design of our social probabilistic learning task. Instead of being confronted with both good and bad observers, a between-subject design was adopted in which participants were randomly assigned to one of the two observational conditions, either Good or Bad. The performance of the virtual player was unchanged (choice of the correct option 80% and 20% of the time in Good and Bad Observation, respectively), but the difficulty was increased to encourage subjects to make the most of the information available in the game, i.e. looking at the other player's choice. It is important to note that no explicit instruction to take the confederate's choice into account was given to the participants. The correct symbol was here associated with a 60% probability of reward and a 40% probability of loss, and symmetrically the incorrect symbol with a 40% probability of reward and a 60% probability of loss. To maximize the chances of the subjects believing in the social setting, the number of sessions was reduced to three, after a short training of 30 trials. As in Experiment I, one session consisted of 1 block of 20 private trials and 2 blocks of 40 trials with 20 private trials interspersed with 20 observational trials.
Blocks (Private and Observation, either Good or Bad) were randomized within sessions. Additionally, at the end of each observational block, subjects had to report whether they thought they had gained less than, as much as, or more than the other participant, on a 5-point scale. Overall, our two experimental designs (cf. Figure 2) can be schematically represented as follows. Experiment I: S24 × C3 × S4.

In Experiment I, 24 subjects were confronted with 3 conditions of learning (C3: Private, Observation Good and Observation Bad) over 4 sessions. In Experiment II, over 3 sessions, only 2 conditions of learning were presented to each subject (C2: Private or Observation), as the type of Observation (Good or Bad) was changed from a within- to a between-subject factor: S22 <T2> × C2 × S3.

Figure 2: General design of the experiment. OG and OB represent the two types of observation: either Good Observation, in which the most rewarding symbol was chosen 80% of the time, or Bad Observation, with the symmetric performance (the most rewarding symbol was chosen 20% of the time). In Private (P) blocks, there is no observation.

Participants and procedures. Respectively 24 and 44 new healthy participants were enrolled in our two studies (Table 1). Participants were recruited through the Relais d'Information sur les Sciences Cognitives website; the inclusion criteria were being over 18 years old and reporting no history of psychiatric or neurological illness. All study procedures were consistent with the Declaration of Helsinki (1964, revised 2013) and participants gave their written informed consent before the experiment. Participants were reimbursed 10-20 euros for their participation, on average 13.9 ± 4.69 euros.

Table 1: Demographics of the two cohorts of participants.

                                    Experiment I    Experiment II
                                                    Good Obs.      Bad Obs.
  No. of subjects                   24              22             22
  Male/Female subjects (No.)        10/14           9/13           11/11
  Age (years; mean ± SD)            24.5 ± 3.45     24.9 ± 3.5     26.4 ± 3.9
  Pairs (same gender/mixed) (No.)   9/15            9/13           10/12

Participants were put in pairs, either mixed or same gender, and received the instructions jointly. They were told they would engage in a small game in which they could sometimes observe each other's choices in real time, but not the outcomes associated with those actions. Participants were informed that some symbols would result in winning more often than others, and were

encouraged to accumulate as many points as possible. It was emphasized that they were not playing in competition but for themselves and, importantly, there was no explicit directive to observe the other player. In the second experiment, participants were unknowingly and randomly assigned to the two conditions of observation. Subjects were tested in two separate cabins; nonetheless, each participant's picture was displayed on the upper part of each side of the computer screen to differentiate private from observational trials. This choice was doubly motivated: to avoid any verbal or nonverbal communication between the participants, and to make the participants believe in the fictitious social setting. Indeed, differences in speed would have emerged during the course of the experiment, as it was a self-paced task. At the end of each session, the experimenter came into both cabins to set up the subsequent one, and participants were instructed to begin each session at the same time. The debriefing of the experiment was done with each pair of subjects, and all participants were debriefed about the cover story after the experiment.

Analyses

1. Behavioural variables and analyses

As our primary dependent variable, we extracted the subject's choice for each block and calculated the correct choice rate, i.e. the proportion of trials in which the subject chose the most rewarding stimulus (the stimulus associated with a 70% or 60% probability of reward in Experiments I and II, respectively). This measure of performance was analysed with repeated-measures ANOVAs including the factors session (1-4) and condition (Private, Observation Good, Observation Bad). Condition was entered as a within-subject factor for Experiment I and a between-subject factor for Experiment II. The goal of this analysis was to investigate how our experimental learning factors (good vs. bad observation) impacted the subjects' performance.
Our hypothesis was that, compared to private sessions, subjects would have a higher and a lower mean correct rate, respectively, in the condition in which observation, and by extension imitation, was beneficial (Good Observation) or detrimental (Bad Observation). Statistical analyses were performed using Matlab and R.

2. Computational modelling

By analysing each individual's history of choices and outcomes, and by formalizing different learning mechanisms, the model-based approach allows us to make quantitative predictions about the experimental data and to tease apart the possible computational processes underlying differences in performance. Based on our general hypotheses, we fitted a Q-learning model containing different learning rates and two distinct algorithmic modules: a factual learning module, to learn from one's own outcomes, and a social module, to learn from the actions of the observed player. The outcome of the observed choice remained unknown to the active player, but the player could compare the observed choice to what he believed the correct symbol was at that time. We hypothesized that subjects would learn differently from observed choices that were consistent or inconsistent with their current policy, hence the two learning rates in the social module. For the factual module, two private learning rates were implemented, to account for subjects' bias toward learning from positive prediction errors. For each pair of symbols (each choice context), the model estimates the expected value of the two options, i.e. the Q-values. Essentially, Q-values represent the expected reward obtained when

choosing a particular option in a given context. At the first trial of a learning block, there is a 50% chance of receiving a positive outcome (+1) and a 50% chance of receiving a negative outcome (-1), thus the Q-values are set at 0 before learning. After the outcome is revealed at the end of every trial t, the value of the chosen option is updated, in the factual learning system, according to the following rule:

(1) Q_CP(t+1) = Q_CP(t) + α_P+ · PE_C(t)   if PE_C(t) > 0
    Q_CP(t+1) = Q_CP(t) + α_P− · PE_C(t)   if PE_C(t) < 0

In Equation 1, PE_C(t) is the prediction error of the chosen option, i.e. the difference between the expected outcome Q_CP(t) and the actually obtained outcome R_C(t), as in Equation 2. The prediction error is positive if the actual reward is better than the expected one, and negative otherwise. The learning rate α is a parameter that adjusts the amplitude of value change across trials.

(2) PE_C(t) = R_C(t) − Q_CP(t)

In the case of observation, the reward associated with the observed action remained unknown to the subject. The Social prediction Error (SE) was calculated as follows, attributing a positive value to the observed action:

(3) SE(t) = 1 − Q_CS(t)

The observed action's value was then updated in the social module with two learning rates: one for observed actions that confirmed what the participant believed to be the best option on a given trial, and one for disconfirmatory observed choices (α_SC and α_SD):

(4) Q_CS(t+1) = Q_CS(t) + α_SC · SE(t)   if Q_CS(t) = max(Q_S(t))
    Q_CS(t+1) = Q_CS(t) + α_SD · SE(t)   if Q_CS(t) ≠ max(Q_S(t))

For private trials only, the probability of selecting the chosen option was estimated with a softmax rule, the stochastic decision rule described in the Introduction (I.A). The temperature β is another parameter that adjusts for the stochasticity of decision making.
(5) P_CP(t) = 1 / (1 + e^((Q_UP(t) − Q_CP(t))/β))

To disentangle the different assumptions about the decision processes involved in our task, we also considered versions with a reduced number of learning rates and modules, i.e. a nested model comparison. In a BASIC Imitation model, subjects learn indifferently from confirmatory and disconfirmatory observed choices: α_SC = α_SD. As in the original computational observational model [58], and as an extension of the notion of mimetic desire, a positive value is attributed to the option chosen by the other. Learning is governed by different learning rates in the factual and the imitation systems, α_P and α_S respectively.
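The two update rules and the softmax decision rule (Equations 1-5) can be sketched as follows. This is an illustrative Python transcription, not the original Matlab code; the function names are assumptions.

```python
import math

def factual_update(q_c, reward, alpha_pos, alpha_neg):
    """Factual module (Eqs. 1-2): valence-dependent update of the chosen
    option, with separate learning rates for positive and negative PEs."""
    pe = reward - q_c                       # prediction error PE_C(t)
    alpha = alpha_pos if pe > 0 else alpha_neg
    return q_c + alpha * pe

def social_update(q, observed, alpha_conf, alpha_disc):
    """Social module (Eqs. 3-4): the observed action receives a positive
    value via SE = 1 - Q. The confirmatory rate applies when the observed
    choice matches the option the model currently values most."""
    se = 1.0 - q[observed]                  # social prediction error SE(t)
    confirmatory = q[observed] == max(q)
    alpha = alpha_conf if confirmatory else alpha_disc
    q = list(q)                             # copy; leave the input untouched
    q[observed] += alpha * se
    return q

def p_choice(q_c, q_u, beta):
    """Softmax (Eq. 5): probability of the chosen option on a private trial;
    higher temperature beta means more random choices."""
    return 1.0 / (1.0 + math.exp((q_u - q_c) / beta))
```

For example, starting from Q-values of 0, a +1 outcome with α_P+ = 0.3 moves the chosen option's value to 0.3, while a disconfirmatory observed choice is discounted by the smaller α_SD.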

Two simpler asocial models were developed, with only a factual learning module and no social module. In these models, learning is exclusively driven by chosen outcomes, not by observing the other player's choices. These factual modules contained either two learning rates, to account for the valence-dependent learning bias (α_P+ > α_P−), or a single learning rate (α_P+ = α_P−). Our model space therefore included 4 gradually less complex Q-learning models to be compared given our data, summarized below in Table 2.

Table 2: Computational models ordered in increasing complexity. α: learning rate; PE: prediction error; SE: social prediction error; Q(t): action-state value; Q_C(t): value of the chosen option; Q_U(t): value of the unchosen option.

3. Parameter optimization and model comparison

Model fitting and comparison are crucial steps, as they allow us to obtain the best-fitting model parameter values and to identify the model that provides the best account of the data among the set of computational models. Indeed, fitting a single model in isolation isn't informative [66]: even if it has high explanatory power, the good fit could come from over-fitting with a high number of parameters, and does not discard the possibility that a simpler model would fit the experimental data as well. The issue of complexity versus quality of fit, and the pitfalls of over- or under-fitting the data with either too complex or too simple models, has been widely acknowledged in the literature [67] [68]. Bayesian model comparison methods [69] are used to make relative comparisons between models, to ultimately identify a winning model amongst a set of candidates. The four competing candidate models were fitted separately to the data of our two experiments, and to the two conditions of observation for Experiment II.
In a first analysis, the model parameters were optimized by finding the parameter values that maximize the likelihood of observing the experimental data given the model. Parameter optimization assumed one set of free parameters per subject, and was performed using Matlab's fmincon function, which minimizes the negative log-likelihood of the data given different parameter settings (ranges: 0 < β < +∞, and 0 < α < 1).

(6) LL = log(P(Data | Model))
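The likelihood computation and the information criteria derived from it can be sketched in Python. This is a stand-in for the Matlab fmincon pipeline; for brevity the likelihood is written for the simplest one-learning-rate model, a coarse grid search replaces gradient-based optimization, and all names are illustrative.

```python
import math

def neg_log_likelihood(params, choices, rewards):
    """Negative log-likelihood of private choices under a one-learning-rate
    Q-learner (the full model adds valence-dependent and social rates).
    choices: 0/1 chosen-option indices; rewards: +1/-1 outcomes."""
    alpha, beta = params
    q = [0.0, 0.0]                       # Q-values start at 0 (Eq. 1 setup)
    nll = 0.0
    for c, r in zip(choices, rewards):
        u = 1 - c                        # unchosen option
        p = 1.0 / (1.0 + math.exp((q[u] - q[c]) / beta))   # softmax, Eq. 5
        nll -= math.log(max(p, 1e-12))   # guard against log(0)
        q[c] += alpha * (r - q[c])       # factual update, Eqs. 1-2
    return nll

def fit_grid(choices, rewards):
    """Coarse grid search over the bounds 0<alpha<1, beta>0 (fmincon analogue).
    Returns (best_nll, best_alpha, best_beta)."""
    best = None
    for alpha in (i / 20 for i in range(1, 20)):
        for beta in (0.05 * i for i in range(1, 40)):
            nll = neg_log_likelihood((alpha, beta), choices, rewards)
            if best is None or nll < best[0]:
                best = (nll, alpha, beta)
    return best

def bic(ll_max, df, n_trials):
    """BIC = df*log(n) - 2*LLmax (lower is better)."""
    return df * math.log(n_trials) - 2.0 * ll_max

def aic(ll_max, df):
    """AIC = 2*df - 2*LLmax (lower is better)."""
    return 2.0 * df - 2.0 * ll_max
```

In practice, `fit_grid` would be applied to each subject's choice history separately, one parameter set per subject, and the resulting maximal likelihood fed into `bic` and `aic`.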

The negative log-likelihood was used to calculate, at the individual level and for each model, two relative quality-of-fit criteria: the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC). In brief, these model comparison methods estimate the goodness of fit (accuracy) of the candidate models, but penalize a model for its complexity. Derived from information theory, the AIC favours the model whose probability distribution has the smallest discrepancy from the true distribution; the BIC is derived from an approximation to the full Bayesian model comparison. From Equations (7) and (8), one can note that the AIC penalizes the number of parameters (degrees of freedom, df) less strongly than the BIC does.

(7) BIC = log(ntrials) · df − 2 · LL
(8) AIC = 2 · df − 2 · LL

The likelihood maximization doesn't rely on any particular approximation of the model evidence. In a second analysis, model parameters were optimized by maximizing the logarithm of the Laplace approximation to the model evidence (i.e. the Log Posterior Probability, LPP).

(9) LPP = log(P(Data | Model, Parameters))

LPP maximization includes priors over the parameters (temperature: gamma(1.2, 5); learning rates: beta(1.1, 1.1)). Essentially, it avoids wrongful fitting of the parameter estimates that could be driven by noise. The same priors were used for all learning rates, to avoid bias in learning rate comparison. Of note, the priors are based on previous literature [70] and were not chosen specifically for this study. Finally, the individual LPPs were fed into the mbb-vb-toolbox [71]. Contrary to fixed-effect analyses that average the criteria for each model, the random-effect model selection allows the investigation of inter-individual differences and discards the hypothesis that the pooled evidence is biased or driven by a few individuals, i.e. outliers.
This procedure estimates the expected frequency and the exceedance probability of each model within the model space, given the data gathered from all participants. The expected frequency is the probability of the model generating the data obtained from any randomly selected participant; it is a quantification of the posterior probability of the model (PP). It must be compared to chance level, which is one over the number of models in the model space: 0.25 in our two experiments. The exceedance probability (XP) is the probability that a given model fits the data better than all other models in the model space. Theoretically, the model with the highest expected frequency and the highest exceedance probability is considered the winning model. Model comparison will be further detailed, in the light of our data, in the Results and Discussion sections (III.B.1).

4. Statistical analyses on model parameters

The best-fitting learning rates obtained from the LPP maximization were compared using two-tailed t-tests: i) factual learning rates for positive versus negative prediction errors, and ii) imitation learning rates for confirmatory versus disconfirmatory choices. As a reminder, a confirmatory choice is when the other player chooses the option that the participant believes is the best on a given trial.
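Comparing two learning rates across subjects amounts to a paired comparison, which is equivalent to a one-sample t-test on the per-subject differences. A minimal stdlib sketch (variable names illustrative; the thesis used Matlab/R for these tests):

```python
import math
import statistics

def paired_t(x, y):
    """t statistic for a paired comparison across subjects, e.g. each
    subject's alpha_P+ (x) against the same subject's alpha_P- (y).
    Equivalent to a one-sample t-test on the differences d = x - y."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
```

With four hypothetical subjects whose α_P+ exceeds α_P−, e.g. `paired_t([0.5, 0.4, 0.6, 0.3], [0.2, 0.1, 0.3, 0.2])`, the statistic is positive, indicating a higher learning rate for positive prediction errors; the two-tailed p-value would then be read from the t distribution with n−1 degrees of freedom.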

Our goal was twofold. First, to replicate the confirmation bias found in private settings, namely a higher learning rate for positive compared to negative prediction errors. Second, to investigate how learning from observation might be biased. We lastly investigated the exploration rate by comparing the 1/β temperature parameter between the two studies and between the two conditions of Observation in Experiment II.

III. Results

Behavioural results

1. Private performance in the private versus social conditions

The rate of private choices in which subjects picked the most rewarding symbol was analysed and compared for each subject according to the learning condition, either private or observational, with a within- and a between-subject repeated-measures ANOVA for Experiments I and II, respectively. As can be seen in Figure 4.A, learning was evidenced in both experiments, as the rate of correct choices globally increased throughout the experiment, from chance level (0.5) at the beginning to a plateau. The four sessions of learning are represented separately for Experiment I: due to an unfortunate error in a script, the pairs of symbols representing the choice contexts across the 3 blocks were repeated over the four sessions, leading to a learning curve of 80 trials, whereas only one block of 20 trials (the duration of a learning curve in Experiment II) is presented in Figure 4.B. In Experiment I, there was no significant difference between our conditions of learning (F(2,46)=0.519, p=0.58). The mean correct rate was 74.4% (± 0.2%) for Private, 76.6% (± 0.2%) for Good Observation and 73.3% (± 0.2%) for Bad Observation. Of note, the pattern (Figure 4.B) corresponds to that predicted in the preregistration (cf.
Annex 1). In Experiment II, while the effect of learning condition per se was not significant (F(1,21)=0.058, p=0.812), there was a significant interaction between the condition of learning and the type of observation (good/bad) (F(1,126)=5.945, p=0.01), as predicted under the assumption that performance is altered in opposite directions in the Good and Bad conditions.

2. Gain in imitation

Calculating the difference between the observational mean correct rate and the private mean correct rate in the two experiments allowed us to compare the relative gain in performance between the social and private conditions of learning. We expected this difference to be positive in Good Observation, and negative in Bad Observation due to its maladaptive value. In Experiment I, a very small positive gain was found in the Good Observation condition (0.02 ± 0.18) and an even smaller loss in the Bad Observation condition (-0.01 ± 0.16). Importantly, these two measures weren't significantly different (t(1,23)=1.5, p=0.23, paired t-test), which indicates that, in Experiment I, learning with additional contextual and social information was not significantly beneficial (or detrimental) for our participants. In Experiment II, the relative gain was significantly different between the two conditions of Observation (p=0.0066, two-sample t-test). There was a significant gain in performance (mean gain +0.05 ± 0.008) in the Good Observation condition (p=0.0042, t-test). In the Bad Observation condition, the small tendency toward a loss (-0.04 ± 0.136) seen in Figure 5.C wasn't significantly different from zero, falling just short of the significance level (t(21)=, p=), which reveals a subtle tendency toward decreased performance when observing a bad model.

Figure 4: A) Rate of correct responses for the 3 conditions of learning in Experiment 1, over 80 trials (left panel), and for the two conditions of learning in Experiment 2. B) Performance on the learning task: correct choice rate averaged across sessions. C) Mean difference (observational mean correct minus private mean correct), i.e. relative gain or loss due to the social condition of learning. Bars indicate the mean, and error bars indicate the standard deviation. *** p<0.001, ** p<0.005, * p<0.05, NS: non-significant, p>0.05.

3. Rough imitation and self-evaluation in Experiment II

For Experiment II, we analysed the raw percentage of trials in which the participant imitated the last observed action. A mixed-design ANOVA with the factors type of observational condition and session revealed only a significant main effect of type, with no effect of session nor interaction (F(1,42)=53.16, p<0.0001). Chiefly, there were more imitative responses in the Good Observation condition (61% ± 0.09) than in the Bad Observation condition (40% ± 0.09) (p<0.0001, Welch t-test). As the implemented virtual player mainly chose the most rewarding symbol in the Good Observation condition and, symmetrically, the least rewarding symbol in Bad Observation, one interpretation of the reduced percentage of imitative responses in the Bad Observation condition could be that participants discovered that mostly the least rewarding symbol was being chosen. However, the analysis of our subjective measure of self-evaluation does not favour this hypothesis. At the end of each observational session, participants reported on a 5-point scale whether they thought they had gained less than, as much as, or more than the other player. Unexpectedly, in both conditions of observation, participants overall and indistinctly answered that they had gained less than the other player.
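The raw imitation measure above can be computed directly from the trial sequences. A minimal sketch, assuming private choices and the most recently observed actions are coded as matching option indices (names illustrative):

```python
def imitation_rate(private_choices, last_observed):
    """Fraction of private trials on which the subject repeated the most
    recently observed action (the 'rough imitation' measure).
    private_choices[i] and last_observed[i] are option indices for trial i."""
    matches = sum(1 for c, o in zip(private_choices, last_observed) if c == o)
    return matches / len(private_choices)
```

For example, a subject who repeats the observed action on 3 of 4 private trials has an imitation rate of 0.75; the group averages reported above (61% vs. 40%) are means of this per-subject quantity.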
This was surprising, as overconfidence has been largely evidenced in the literature [29], and one could have expected that participants in the Bad Observation condition would have rated themselves more positively. However, both the difficulty of the task and the virtual social context might have impacted this subjective measure of self-evaluation.

These behavioural analyses were blind to the subjects' inner knowledge of the correct or incorrect symbol, and to the subjects' inferences about the correctness of the observed symbol. This motivated our computational approach, to investigate how subjects learned, on a trial-by-trial basis, i) from their own outcomes and ii) from observed choices that were consistent or inconsistent with their current belief about the correct option.

Computational results

1. Model comparison

The different model comparison criteria that were used (BIC and AIC from the LL, and the LPP) provided different answers for selecting the best model amongst the four candidates (Table 3), which calls for further analyses and investigation (cf. IV.B). In the two experiments, the BIC was lower for a model without learning from the actions of others: the asocial model with asymmetric private learning rates. One interpretation could be that participants didn't learn from observed choices. However, our behavioural results from the previous section call this hypothesis into question. Importantly, we investigated the random effects (from the LPP approximation), which take inter-individual variability into account. The posterior probability, i.e. the probability of a model generating the data, exceeded chance level (0.25) only for the most complex model, in both experiments (Fig. 4.A). The probability that a model fits the data better than all other models in the model space (XP, or exceedance probability) was also highest for Model 4 (Fig. 4.B). Our final motivation to investigate the Biased Imitation model came from our behavioural results (III.A), with their quantitatively defined behavioural differences between the private and observational conditions. The free computational parameters, specifically the social and private learning rates, described in the next results section come from the LPP maximization, and are summarized in Table 4.

Figure 4: Model comparison.
The panels represent the exceedance probability (A) and the posterior probability (B) of the models, calculated from the LPP. The red line represents chance-level posterior probability (0.25).

Table 3: Model comparison criteria (LLmax, BIC, AIC, LPP, PP and XP) for Models 1-4, in Experiment I (N=24) and Experiment II (N=44). [Most numerical entries are not recoverable from this transcription; the posterior probability (PP) clearly favoured Model 4 in both experiments: 0.88 in Experiment I and 0.84 in Experiment II, against 0.01-0.11 for Models 1-3.] For each model, a summary of its fitting performance is presented. LLmax: maximal log-likelihood (parameters obtained when minimizing the negative log-likelihood); LPP: log of the posterior probability (parameters obtained when maximizing the Laplace approximation of the posterior probability); AIC: Akaike Information Criterion (computed from LLmax); BIC: Bayesian Information Criterion (computed from LLmax); PP: posterior probability of the model given the data; XP: exceedance probability (computed from the LPP).

Table 4: Computational free parameters.

  Experiment I (N=24)     α           αP+          αP−          αS,conf      αS,disc      1/β
  Model 1                 0.30 ± …    -            -            -            -            0.40 ± 0.57
  Model 2                 -           0.34 ± 0.23  0.25 ± …     -            -            0.39 ± 0.68
  Model 3                 -           0.33 ± 0.22  0.27 ± 0.29  0.02 ± 0.02  -            0.33 ± 0.61
  Model 4                 -           0.35 ± 0.23  0.30 ± 0.30  0.13 ± …     0.02 ± 0.02  0.36 ± 0.65

  Experiment II (N=44)
  Model 1                 0.36 ± …    -            -            -            -            0.25 ± 0.2
  Model 2                 -           0.48 ± 0.32  0.31 ± …     -            -            0.40 ± 0.31
  Model 3                 -           0.47 ± 0.34  0.22 ± 0.21  0.14 ± 0.25  -            0.32 ± 0.34
  Model 4                 -           0.47 ± 0.33  0.22 ± 0.21  0.17 ± 0.25  0.15 ± 0.25  0.29 ± 0.24

Table 4: Computational free parameters for the four gradually more complex models. For the two experiments, the table summarizes, for each model, the best-fitting parameters averaged across subjects. Data are expressed as mean ± s.e.m.; entries lost in transcription are marked "…".

2. Free parameters analysis: investigation of learning biases

The selected model, in which subjects learn from their own outcomes and from the other's choices, was implemented with two modules, a factual (private) module and a social imitation module, each containing two distinct learning rates.
We then investigated these free parameters to evidence possible biases, not only in private learning but also in observational learning.

Bias in private learning: Globally, our results go in the same direction as the robust bias evidenced in the literature, with a higher learning rate for positive than for negative prediction errors, i.e. a higher learning rate for better-than-expected private outcomes compared to worse-than-expected ones (Figure 5.A). In the first experiment, the small tendency toward a higher learning rate for positive compared to negative prediction errors was not significant (t(23)=0.78, p=0.44).

However, it was strictly replicated in Experiment II, where the difference was significant when considering all 44 subjects (t(43)=6.47, p<0.0001), and for the two conditions of observation taken separately (t(21)=4.95, p= for Bad Observation, and t(21)=4.14, p= for Good Observation). When considering the two experiments together, the private learning rate for positive prediction errors was significantly higher than that for negative prediction errors (t(67)=5.14, p<0.0001), which suggests that the valence-induced bias evidenced in the literature is robust and found even in a global social setting.

Bias in observational learning: In the first experiment, the learning rate for confirmatory observed choices was significantly higher than the learning rate for disconfirmatory observed choices (t(23)=2.31, p=0.03), which would suggest that participants learn more from confirmatory than from disconfirmatory observed choices. Overall, this would extend the egocentric confirmation bias [34] to the social setting. In Experiment II, even though a small tendency is visible in Figure 5.B, there was no significant difference between the learning rates for confirmatory and disconfirmatory observed choices (t(43)=0.54, p=0.59). Analysing the social learning rates separately for each condition yielded the same non-significant result (Bad Observation: t(21)=0.18, p=0.86; Good Observation: t(21)=0.56, p=0.58). Even when considering the two experiments together, the difference in the social learning rates was just above the significance level (t(67)=1.77, p=0.08); therefore, we cannot conclude with certainty that observational learning presents a computational bias. Of note, the pattern in Figure 5.D of a higher learning rate for confirmatory than for disconfirmatory observed actions tends to favour the hypothesis that, even in a social context, reinforcement learning is egocentrically biased. It nonetheless calls for further analysis and investigation.
Beta parameter: exploration rate. Regarding the exploration rate, there was no significant difference in the β parameter between the two experiments (t(59.6)=-0.89, p=0.37), nor between the two conditions of observation in the second experiment (t(42)=-0.66, p=0.52). Overall, this suggests that observational reinforcement learning, at least in our two studies in which subjects had to make inferences about the observed action, is associated with explorative behaviour (1/β = 0.31 ± 0.28).

Figure 5: The panels represent the learning rates for the best-fitting model (i.e. the Biased model, Model 4). The left panels show the learning rates following positive prediction errors (α+) and negative prediction errors (α−), in Experiment 1 (5.A) and Experiment 2 (5.B). The right panels show the social learning rates for confirmatory and disconfirmatory observed choices. Bars indicate the mean, and error bars indicate the standard deviation. *** p < 0.001.

IV. Discussion

General conclusion

The goal of the present study, grounded in the general framework of reinforcement learning, was to investigate the adaptive value and the computational processes of learning from observation. A social instrumental probabilistic task was designed in which two factors were modulated: the condition of learning (private or observational sessions) and the value of the observed model (good or bad observer). Overall, our behavioural results converge with previous findings in both adults [58] and children [61] that evidenced a positive gain from imitation even in conditions where the participant didn't see the outcome of the observed choice. In Experiment I, there was a small but non-significant tendency toward differential performance across the conditions of learning, namely Private, Good Observation and Bad Observation. Several explanations can be offered as to why a strict gain and loss weren't found. The task was relatively easy: the correct, most rewarding symbol was associated with a reward 70% of the time, and the incorrect symbol only 30% of the time. Participants efficiently learnt the associations and

obtained a high rate of correct responses privately. Presumably, there is no need to consider the actions of another player if one already has an effective policy. This is the reason why the task difficulty was increased in Experiment II, as we hypothesized that with increased difficulty the participants would try to make the most of the contextual information at hand, namely the action of the observed confederate. In Experiment II, there was a significant gain in the Good Observation condition compared to private learning, which can be attributed to the social information that was added in the observational condition of learning. The size of the gain effect in the Good Observation context was smaller than in the initial study (5% compared to Burke's 7%). As their task was substantially less difficult than ours (the most rewarding symbol yielded a reward 80% of the time, compared to 70% and 60% in our two experiments, respectively), this raises the question of the replicability of the relatively large gain that the authors initially found. It is important to note that no explicit instruction to observe the other player was given to our participants, which was supposedly, but not explicitly, the case in the initial study. Our results subtly support the hypothesis that imitation can be maladaptive, in the sense of suboptimal accuracy, when a subject is presented with a bad model to observe: in Experiment II, there was a tendency toward decreased accuracy in the Bad Observation condition compared to private learning, although a reduced percentage of imitative trials was found in the Bad Observation condition compared to the Good Observation condition. Lastly, our investigation of computational processes has so far favoured a model in which participants learnt not only from their own outcomes, but also from the actions of the observed confederate.
Our results replicate the robust valence-induced bias in private learning, with a higher learning rate for better-than-expected than for worse-than-expected outcomes. Regarding potential biases in observational learning, there was weak but not substantial evidence of an asymmetry in the social learning rates, which calls for replication and further investigation of observational learning.

Perspectives

As our relative model comparison yielded mixed results (cf. III.B) in selecting the winning model, more analyses are required in order to rigorously select the best computational model to explain our data. The next logical step would be to test its generative performance, that is, the capacity of a model to recover the correct parameters in simulated datasets. Based on recent guidelines [68], a model must be rejected if the behavioural phenomenon is not found in the simulated data, or if there are significant differences between the observed and the simulated data. Indeed, differences in learning rates might not reflect true differences in learning but could be an artefact of the parameter optimization procedure. Bayesian model selection procedures are debated in cognitive neuroscience [67][72]. Ideally, the different criteria should provide the same answer, converging on the model that wins the relative comparison, as has been evidenced in the literature [65]. However, it is not uncommon that, based on the same data, different criteria approximating the model evidence provide different answers. Several factors have indeed been shown to impact the results of the relative comparison, namely the sample size, the number of observations and the presence of outliers [73]. In our task, the BIC criterion might have over-penalized the more complex model for the additional parameters introduced in the social module, especially given that the imitation effect is relatively small (~5%) compared to the overall performance of more than 65%. A recent behavioural and computational study [39] rigorously showed that the BIC might not be an adequate criterion, adequate in the sense that it avoids both over-fitting and under-fitting of the data. In simulated data from virtual participants, the BIC criterion failed to identify the model actually used to generate the data. In contrast, in this example, the LPP-based criterion correctly selected the more complex model including a counterfactual module. Once post-fit model simulations have discarded the alternative models and validated the social model, one line of extension would be to have subjects perform the task in the MRI scanner, to explore the neural substrates and dynamics involved in factual and observational learning in light of previous imaging results. To further investigate inter-individual differences, our task could and will be implemented on the internet. There is recent evidence that experimental psychology can be effectively run on AMT (Amazon Mechanical Turk) [74]. This would broaden the range of naive populations participating in cognitive experiments, and eliminate the risk of potential experimenter bias. The first step will be to investigate it with virtual players, to allow for the experimental manipulation of several parameters, such as task difficulty (as initially planned in our pre-registration) or the value of outcomes in gain and loss sessions. Secondly, by implementing the task in real time, one could further investigate how the similarity in performance of the two participants drives imitation, and how similarity (age, sex) and closeness with the other participant drive differential imitation and inference mechanisms.
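To make the complexity penalty discussed above concrete, the BIC can be computed from the maximized log-likelihood as BIC = k·ln(n) − 2·ln(L̂). The sketch below uses made-up numbers purely for illustration:

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion (lower is better): penalizes the
    maximized log-likelihood by n_params * ln(n_obs)."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

# With 400 observations, each extra parameter must improve the
# log-likelihood by ln(400)/2 ~ 3 points to pay for itself.
simple = bic(log_lik=-100.0, n_params=2, n_obs=400)  # basic model
rich = bic(log_lik=-97.0, n_params=4, n_obs=400)     # social module added
# The richer model fits better (-97 > -100) yet loses under BIC.
```

This is exactly the regime in which a small (~5%) imitation effect can be masked by the penalty on the social module's extra parameters.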
A last perspective for future research in social observation would be to investigate traits such as anxiety, depression and compulsion in light of inter-individual differences in the computational processes at play, as has recently been evidenced for goal-directed behaviour and its related deficits [75]. At the time of writing of this thesis, exploratory analyses of three psychological questionnaires, assessing anxiety and depression (Hospital Anxiety and Depression Scale [76]), social anxiety (Liebowitz Social Anxiety Scale [77]), and broad autistic traits (Broad Autistic Phenotype Questionnaire [78]), are being performed to further investigate possible inter-individual differences in our social learning experiment.

V. References

1. Chambers, A. (Ed.) (1997). Chambers dictionary of quotations. New York: Chambers.
2. Rangel, A., Camerer, C., Montague, P.R. (2008). A framework for studying the neurobiology of value-based decision making. Nat. Rev. Neurosci. 9(7).
3. Doya, K. (2008). Modulators of decision making. Nat. Rev. Neurosci. 11.
4. Thorndike, E.L. (1933). A theory of the action of the after-effects of a connection upon it. Psychol. Rev.
5. Rescorla, R.A., & Wagner, A.R. (1972). A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and non-reinforcement. In A.H. Black & W.F. Prokasy (Eds.), Classical conditioning II: current research and theory. New York: Appleton-Century-Crofts.
6. Niv, Y., Schoenbaum, G. (2008). Dialogues on prediction errors. Trends Cogn. Sci. 12.
7. Skinner, B.F. (1938). The Behaviour of Organisms: An Experimental Analysis. Appleton-Century-Crofts.
8. Dayan, P., Abbott, L.F. (2005). Theoretical neuroscience: computational and mathematical modeling of neural systems. The MIT Press.
9. Sutton, R.S., Barto, A.G. (1998). Reinforcement learning: an introduction. MIT Press, Cambridge, Massachusetts.
10. Watkins, C.J.C.H., & Dayan, P. (1992). Q-learning. Mach. Learn. 8(3-4).
11. Daw, N.D. (2011). Trial-by-trial data analysis using computational models. In Affect, Learning and Decision Making, Attention and Performance XXIII, E.A. Phelps, T.W. Robbins, and M. Delgado, eds. New York: Oxford University Press.
12. Worbe, Y., Palminteri, S., Savulich, G., Daw, N.D., Fernandez-Egea, E., et al. (2016). Valence-dependent influence of serotonin depletion on model-based choice strategy. Molecular Psychiatry 21.
13. O'Doherty, J.P., Hampton, A., Kim, H. (2007). Model-based fMRI and its application to reward learning and decision making. Ann. N.Y. Acad. Sci. 1104.
14. Schultz, W., Dayan, P., Montague, P.R. (1997). A neural substrate of prediction and reward. Science 275.
15. Garrison, J., Erdeniz, B., & Done, J. (2013). Prediction error in reinforcement learning: a meta-analysis of neuroimaging studies. Neurosci. Biobehav. Rev. 37.
16. Miltner, W.H., Baum, C.H., and Coles, M.G. (1997). Event-related brain potentials following incorrect feedback in a time-estimation task: Evidence for a generic neural system for error detection. J. Cogn. Neurosci. 9.
17. Frank, M.J., Woroch, B.S., Curran, T. (2005). Error-related negativity predicts reinforcement learning and conflict biases. Neuron 47.
18. Palminteri, S., et al. (2012). Critical roles for anterior insula and dorsal striatum in punishment-based avoidance learning. Neuron 76.
19. Frank, M.J., Hutchison, K. (2009). Genetic contributions to avoidance-based decisions: striatal D2 receptor polymorphisms. Neuroscience 164.
20. Palminteri, S., Pessiglione, M. (2017). Opponent brain systems for reward and punishment learning: causal evidence from drug and lesion studies in humans. In Decision Neuroscience, Dreher, J.C. and Tremblay, L. (Eds.). Elsevier.
21. Skvortsova, V., Palminteri, S., Pessiglione, M. (2014). Learning to minimize efforts versus maximizing rewards: computational principles and neural correlates. J. Neurosci. 34.

22. Palminteri, S., Clair, A.H., Mallet, L., Pessiglione, M. (2012). Similar improvement of reward and punishment learning by serotonin reuptake inhibitors in obsessive-compulsive disorder. Biological Psychiatry 72(3).
23. Behrens, T.E., et al. (2007). Learning the value of information in an uncertain world. Nat. Neurosci. 10(9).
24. Cazé, R.D., & van der Meer, M.A.A. (2013). Adaptive properties of differential learning rates for positive and negative outcomes. Biol. Cybern. 107.
25. den Ouden, H.E.M., et al. (2013). Dissociable effects of dopamine and serotonin on reversal learning. Neuron 80(4).
26. Frank, M.J., Moustafa, A.A., Haughey, H.M., Curran, T., & Hutchison, K.E. (2007). Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proc. Natl Acad. Sci. USA 104.
27. Lefebvre, G., Lebreton, M., Meyniel, F., Bourgeois-Gironde, S., Palminteri, S. (2017). Behavioural and neural characterization of optimistic reinforcement learning. Nat. Hum. Behav. 67(March).
28. Niv, Y., Edlund, J.A., Dayan, P., & O'Doherty, J.P. (2012). Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. J. Neurosci. 32.
29. Johnson, D.D.P., & Fowler, J.H. (2011). The evolution of over-confidence. Nature 477.
30. Gershman, S.J. (2015). Do learning rates adapt to the distribution of rewards? Psychon. Bull. Rev. 22.
31. Sharot, T., Garrett, N. (2016). Forming beliefs: why valence matters. Trends in Cognitive Sciences 20: 25-33.
32. Weinstein, N.D. (1980). Unrealistic optimism about future life events. Journal of Personality and Social Psychology 39.
33. Schoenbaum, M. (1997). Do smokers understand the mortality effects of smoking? Evidence from the health and retirement survey. Am. J. Public Health 87.
34. Palminteri, S., Lefebvre, G., Kilford, E.J., Blakemore, S.J. (preprint). Valence biases factual and counterfactual learning in opposite directions. BioRxiv.
35. Vanderdriessche, H., Thero, H., Chambon, V., Palminteri, S. (2017, May). Choice supportive bias in reinforcement learning. Poster session presented at the SBDM, Bordeaux, France.
36. van den Bos, W., Cohen, M.X., Kahnt, T., & Crone, E.A. (2012). Striatum-medial prefrontal cortex connectivity predicts developmental changes in reinforcement learning. Cereb. Cortex 22.
37. Frank, M.J., Drag, L.L. (2008). Learning to avoid in older age. Psychology and Aging 23(2).
38. Christakou, A., Gershman, S.J., Niv, Y., Simmons, A., Brammer, M., Rubia, K. (2013). Neural and psychological maturation of decision-making in adolescence and young adulthood. Journal of Cognitive Neuroscience 25(11).
39. Palminteri, S., Kilford, E.J., Coricelli, G., Blakemore, S.J. (2016). The computational development of reinforcement learning in adolescence. PLoS Computational Biology.
40. Hoppitt, W., Laland, K.N. (2013). Social learning: an introduction to mechanisms, methods, and models. Princeton, NJ: Princeton University Press.
41. Subiaul, F., Cantlon, J.F., Holloway, R.L., Terrace, H.S. (2004). Cognitive imitation in rhesus macaques. Science 305.
42. Laland, K.N. (2004). Social learning strategies. Learning & Behaviour 32(1).

43. Rendell, L., Fogarty, L., Hoppitt, W.J.E., Morgan, T.J.H., Webster, M.M., & Laland, K.N. (2011). Cognitive culture: theoretical and empirical insights into social learning strategies. Trends in Cognitive Sciences 15.
44. Bandura, A. (2011). Social cognitive theory. In A.W. Kruglanski, E.T. Higgins, & P.A.M. Van Lange (Eds.), Handbook of Theories of Social Psychology: Volume One. London: SAGE.
45. Behrens, T.E.J., Hunt, L.T., Rushworth, M.F.S. (2009). The computation of social behaviour. Science 324(5931).
46. Rizzolatti, G., Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience 27.
47. Camerer, C.F. Behavioural game theory: experiments in strategic interaction. Princeton: Princeton University Press.
48. de Quervain, D.J.F., Fischbacher, U., Treyer, V., Schellhammer, M., Schnyder, U., et al. (2004). The neural bases of altruistic punishment. Science.
49. Behrens, T.E.J., Hunt, L.T., Woolrich, M.W., Rushworth, M.F.S. (2008). Associative learning of social value. Nature 456.
50. Lebreton, M., Kawa, S., Forgeot d'Arc, B., Daunizeau, J., Pessiglione, M. (2012). Your goal is mine: unravelling mimetic desires in the human brain. Journal of Neuroscience 32(21).
51. Biele, G., Rieskamp, J., Krugel, L.K., Heekeren, H.R. (2011). The neural bases of following advice. PLoS Biology 9(6).
52. Nicolle, A., Symmonds, M., Dolan, R.J. (2011). Optimistic bias in observational learning of value. Cognition 119.
53. Cohn, D., Atlas, L., Ladner, R. (1994). Improving generalization with active learning. Machine Learning 15(2).
54. Thoma, P., Bellebaum, C. (2012). Your error's got me feeling. How empathy relates to the electrophysiological correlates of performance monitoring. Frontiers in Human Neuroscience.
55. Rak, N., Bellebaum, C., Thoma, P. (2013). Empathy and feedback processing in active and observational learning. Cogn. Affect. Behav. Neurosci.
56. Fukushima, H., Hiraki, K. Whose loss is it? Human neurophysiological correlates of non-self reward processing. Social Neuroscience 4(3).
57. Koban, L., Pourtois, G., Vocat, R., & Vuilleumier, P. (2010). When your errors make me lose or win: event-related potentials to observed errors of co-operators and competitors. Social Neuroscience 5(4).
58. Burke, C.J., Tobler, P.N., Baddeley, M., Schultz, W. (2010). Neural mechanisms of observational learning. PNAS 107(32).
59. Cooper, J.C., Dunne, S., Furey, T., O'Doherty, J.P. (2011). Human dorsal striatum encodes prediction errors during observational learning of instrumental actions. Journal of Cognitive Neuroscience 24(1).
60. Hill, M.R., Boorman, E.D., Fried, I. (2015). Observational learning computations in neurons of the human anterior cingulate cortex. Nature Communications 7.
61. Rodriguez-Buritica, J.M., Eppinger, B., Schuck, N.W., Heekeren, H.R., Li, S.C. (2015). Electrophysiological correlates of observational learning in children. Developmental Science.
62. Wood, L.A., Harrison, R.A., Lucas, A.J., McGuigan, N., Burdett, E.R.R., et al. (2016). Model age-based and copy-when-uncertain biases in children's social learning of a novel task. Journal of Experimental Child Psychology 150.

63. Zmyj, N., Seehagen, S. (2013). The role of a model's age for young children's imitation: a research review. Infant and Child Development 22(6).
64. Mobbs, D., Yu, R., Meyer, M., Passamonti, L., Seymour, B., et al. (2009). A key role for similarity in vicarious reward. Science 324(5929).
65. Palminteri, S., Khamassi, M., Joffily, M., Coricelli, G. (2015). Contextual modulation of value signals in reward and punishment learning. Nat. Commun. 6.
66. Roberts, S., Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review 107(2).
67. Pitt, M.A., Myung, I.J. (2002). When a good fit can be bad. Trends in Cognitive Sciences 6.
68. Palminteri, S., Wyart, V., Koechlin, E. (2017). The importance of falsification in computational cognitive modeling. Trends in Cognitive Sciences.
69. Burnham, K.P., Anderson, D.R. (2002). Model selection and multimodel inference: a practical information-theoretic approach (2nd ed.). New York: Springer-Verlag.
70. Daw, N.D., Gershman, S.J., Seymour, B., Dayan, P., Dolan, R.J. (2011). Model-based influences on humans' choices and striatal prediction errors. Neuron 69(6).
71. Daunizeau, J., Adam, V., Rigoux, L. (2014). VBA: a probabilistic treatment of nonlinear models for neurobiological and behavioural data. PLoS Comput. Biol. 10.
72. Corrado, G.S., Sugrue, L.P., Brown, J.R., Newsome, W.T. (2009). In Neuroeconomics: Decision Making and the Brain, P.W. Glimcher, E. Fehr, C.F. Camerer, R.A. Poldrack, Eds. London: Academic Press.
73. Stephan, K.E., Penny, W.D., Daunizeau, J., Moran, R.J., Friston, K.J. (2009). Bayesian model selection for group studies. NeuroImage 46(4).
74. Crump, M.J.C., McDonnell, J.V., Gureckis, T.M. (2013). Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PLoS ONE 8(3).
75. Gillan, C.M., Kosinski, M., Whelan, R., Phelps, E.A., Daw, N.D. (2016). Characterizing a psychiatric symptom dimension related to deficits in goal-directed control. eLife.
76. Zigmond, A.S., Snaith, R.P. (1983). The Hospital Anxiety and Depression Scale. Acta Psychiatrica Scandinavica 67.
77. Heeren, A., Maurage, P., Rossignol, M., Vanhaelen, M., Peschard, V., Eeckhout, C., Philippot, P. (2012). The self-report version of the Liebowitz Social Anxiety Scale: Psychometric properties of the French version. Canadian Journal of Behavioural Science 44.
78. Chevallier, C., Peylet, B., & Reeve, P. French version of the Broad Autism Phenotype Questionnaire [validation in progress]. Original tool: Hurley, R.S.E., et al. (2007). "The broad autism phenotype questionnaire." Journal of Autism and Developmental Disorders 37(9).

Pre-registration

Pre-registration internship M2
Emmanuelle Bonnet
Laboratoire de Neurosciences Cognitives (LNC)
Supervisor: Stefano Palminteri
Session: June
Language: English
Suggested reviewers: Mathias Pessiglione, Nicolas Baumard

Observational learning biases

Literature and objectives

Reinforcement-based learning deals with the ability to bias choices in order to maximize the occurrence of pleasant events (rewards) and minimize that of unpleasant events (punishments). This process has been quite well characterized at the computational and neurobiological levels in the private setting, i.e. in situations that do not involve any social interaction. Burke et al. (2010) explored the neural mechanisms of learning when a subject had the possibility to observe the action of another player before choosing for themselves. However, such observational reinforcement learning remains much less investigated and understood. In particular, Burke's study involved a task including only conditions in which imitation had a positive value. Furthermore, Burke et al. only considered a computational model in which imitation was unbiased by the participant's prior about which option is correct: a configuration that is unlikely given recent evidence on the ubiquity of reinforcement learning biases (Lefebvre 2016, Palminteri 2016). This project aims to further investigate observational reinforcement learning with two research questions: 1) Do subjects imitate despite a maladaptive value of imitation? 2) Does imitation learning present a computational bias?

Methods

Study design: stimuli, procedure & participants

In the task, subjects will have to choose between two visual symbols, one of which has a higher probability of reward, with the aim of maximizing monetary gains.
After a short and neutral training (30 trials), subjects will undergo 4 sessions, each consisting of 1 block of 20 private trials and 2 blocks of 40 trials comprising 20 private trials and 20 trials in which the subject can observe the choice of another (virtual) subject. In one half of the Observation blocks, the virtual player will choose the correct option 80% of the time, while in the second half of the Observation blocks, the virtual player will choose the correct option 20% of the time. Blocks ("Private", "Observation Good" and "Observation Bad") will be randomized within sessions. Two experiments will be conducted, as illustrated in Fig. 1. In the first experiment, the task difficulty, as measured by outcome contingency, will be the same in all blocks (70% of reward for the correct option, 30% of reward for the incorrect option). In a second experiment we will manipulate task difficulty (easy block: 80% of reward for the correct option; difficult block: 60% of reward for the correct option).

Sample size

The sample size (N=20) of the first experiment is based on the previous study on observational reinforcement learning (N=21). The sample size of the second experiment is based on that of the first: to achieve the same number of trials per condition, given that the first experiment has 3x1 conditions / 400 trials per subject and the second experiment has 3x2 conditions / 400 trials per subject, we have to double the sample size (N=40). Participants who undergo the 4 sessions will be included in the analyses, and all of the participants' trials will be analyzed.
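The trial structure above can be sketched as a small simulator. This is a plain-Python sketch with hypothetical names, not the actual task code:

```python
import random

def simulate_trial(correct, p_correct=0.7, p_incorrect=0.3, observer_accuracy=None):
    """One trial: optionally sample the virtual player's choice (80% correct
    in 'Observation Good' blocks, 20% in 'Observation Bad'), then return a
    function mapping the subject's choice (0 or 1) to a binary reward."""
    observed = None
    if observer_accuracy is not None:
        # Virtual player picks the correct option with the given accuracy.
        observed = correct if random.random() < observer_accuracy else 1 - correct

    def outcome(choice):
        # Reward contingency depends on whether the chosen symbol is correct.
        p = p_correct if choice == correct else p_incorrect
        return 1 if random.random() < p else 0

    return observed, outcome
```

For example, simulate_trial(0, observer_accuracy=0.8) draws an "Observation Good" trial, while Experiment II would vary p_correct between 0.8 (easy blocks) and 0.6 (difficult blocks).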

Analyses

The first question of interest will be addressed with model-free (i.e., classical) behavioral analyses: multiple-way repeated-measures ANOVAs with condition and difficulty as within-subjects factors. This gives a 3x1 ANOVA for Exp. 1 and a 3x2 ANOVA for Exp. 2. The same statistical models will be applied to two dependent variables: correct choice rate (i.e. performance) and reaction time. Figure 2 illustrates the predictions if i) imitation is maintained regardless of its maladaptive value and ii) imitation interacts with difficulty.

The second question will be addressed with model-based (i.e. computational) analyses. Our model space will include two models. The BASIC model will include a classic factual system to learn from chosen outcomes (in the Private trials) and an Imitation system to learn from the choices of the other participant. The hypothesis underlying this Imitation module is that people attribute positive values to options that have been chosen by others. This basically extends the notion of mimetic desire to the reinforcement learning framework (Lebreton et al., J. Neurosci., 2012). In this BASIC model (which is very similar to that proposed by Burke et al.), learning is governed by different learning rates in the factual and the imitation systems. We will also develop a BIASED model in which the factual learning system has two learning rates, to account for subjects' bias towards learning from positive prediction errors. The Imitation module will also have two learning rates: one for consistent choices and another for inconsistent ones. We will fit the BASIC and the BIASED models to the data using likelihood maximization (fmincon function), and the maximal likelihood will be used to calculate the Bayesian Information Criterion (BIC) to compare the goodness-of-fit (i.e., trading off likelihood against complexity) of the two models.
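The fitting step can be sketched in Python, using scipy.optimize.minimize as an analogue of MATLAB's fmincon. The one-learning-rate softmax Q-learner below is a simplified stand-in for the BASIC model, and all names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards):
    """Negative log-likelihood of a two-armed softmax Q-learner."""
    alpha, beta = params
    q = np.zeros(2)
    nll = 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * q)
        p /= p.sum()                    # softmax choice probabilities
        nll -= np.log(p[c] + 1e-12)
        q[c] += alpha * (r - q[c])      # Rescorla-Wagner update
    return nll

def fit_and_score(choices, rewards):
    """Maximize the likelihood within bounds, then score the fit with BIC."""
    res = minimize(neg_log_likelihood, x0=[0.5, 3.0],
                   args=(choices, rewards),
                   bounds=[(0.0, 1.0), (0.0, 20.0)])
    k, n = 2, len(choices)
    return res.x, k * np.log(n) + 2.0 * res.fun  # res.fun = minimized NLL
```

The BIASED model would add a second factual learning rate (and the imitation module two more parameters), with the BIC then arbitrating between the two fitted models.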
The best-fitting learning rates will also be compared using two-tailed one-sample t-tests: i) factual learning rates for positive and negative prediction errors, and ii) imitation learning rates for consistent and inconsistent choices (note: a consistent choice is when the other player chooses the option that the participant believes is the best in a given trial).

Expected results

Behavioral analyses (illustrated in Fig. 2):
o Experiment 1: Main effect of condition (the figure depicts a case in which the imitation effect is similar in the Observation Good and in the Observation Bad conditions).
o Experiment 2: Main effect of condition + condition by contingency interaction.

Model-based analyses (Fig. 3):
o Model comparison: the BIASED model will account for the data better than the BASIC model, even accounting for its extra complexity (lower BIC). We expect differences in the learning rates as depicted in Figure 3 (lower for negative prediction errors and inconsistent choices).

References

Burke, C.J., Tobler, P.N., Baddeley, M., Schultz, W. (2010). Neural mechanisms of observational learning. PNAS.
Lebreton, M., Kawa, S., Forgeot d'Arc, B., Daunizeau, J., Pessiglione, M. (2012). Your goal is mine: unraveling mimetic desires in the human brain. Journal of Neuroscience 32(21).
Lefebvre, G., Lebreton, M., Meyniel, F., Bourgeois-Gironde, S., Palminteri, S. (2016). Asymmetric reinforcement learning: computational and neural bases of positive life orientation. BioRxiv.
Palminteri, S., Lefebvre, G., Kilford, E.J., Blakemore, S.J. (2016). Confirmation bias in human reinforcement learning: evidence from counterfactual feedback processing.


Learning. Learning is a relatively permanent change in behavior acquired through experience or practice. Learning Learning is a relatively permanent change in behavior acquired through experience or practice. What is Learning? Learning is the process that allows us to adapt (be flexible) to the changing conditions

More information

Hierarchically Organized Mirroring Processes in Social Cognition: The Functional Neuroanatomy of Empathy

Hierarchically Organized Mirroring Processes in Social Cognition: The Functional Neuroanatomy of Empathy Hierarchically Organized Mirroring Processes in Social Cognition: The Functional Neuroanatomy of Empathy Jaime A. Pineda, A. Roxanne Moore, Hanie Elfenbeinand, and Roy Cox Motivation Review the complex

More information

A model of the interaction between mood and memory

A model of the interaction between mood and memory INSTITUTE OF PHYSICS PUBLISHING NETWORK: COMPUTATION IN NEURAL SYSTEMS Network: Comput. Neural Syst. 12 (2001) 89 109 www.iop.org/journals/ne PII: S0954-898X(01)22487-7 A model of the interaction between

More information

Sensory Cue Integration

Sensory Cue Integration Sensory Cue Integration Summary by Byoung-Hee Kim Computer Science and Engineering (CSE) http://bi.snu.ac.kr/ Presentation Guideline Quiz on the gist of the chapter (5 min) Presenters: prepare one main

More information

Double dissociation of value computations in orbitofrontal and anterior cingulate neurons

Double dissociation of value computations in orbitofrontal and anterior cingulate neurons Supplementary Information for: Double dissociation of value computations in orbitofrontal and anterior cingulate neurons Steven W. Kennerley, Timothy E. J. Behrens & Jonathan D. Wallis Content list: Supplementary

More information

Dopamine enables dynamic regulation of exploration

Dopamine enables dynamic regulation of exploration Dopamine enables dynamic regulation of exploration François Cinotti Université Pierre et Marie Curie, CNRS 4 place Jussieu, 75005, Paris, FRANCE francois.cinotti@isir.upmc.fr Nassim Aklil nassim.aklil@isir.upmc.fr

More information

Psychological Foundations of Curriculum. Kevin Thompson

Psychological Foundations of Curriculum. Kevin Thompson Psychological Foundations of Curriculum Kevin Thompson Focusing Questions 1. In what ways do psychological foundations enable curriculum workers (teachers, supervisors, and curriculum developers) to perform

More information

CHAPTER 6. Learning. Lecture Overview. Introductory Definitions PSYCHOLOGY PSYCHOLOGY PSYCHOLOGY

CHAPTER 6. Learning. Lecture Overview. Introductory Definitions PSYCHOLOGY PSYCHOLOGY PSYCHOLOGY Learning CHAPTER 6 Write down important terms in this video. Explain Skinner s view on Free Will. Lecture Overview Classical Conditioning Operant Conditioning Cognitive-Social Learning The Biology of Learning

More information

BEHAVIOR CHANGE THEORY

BEHAVIOR CHANGE THEORY BEHAVIOR CHANGE THEORY An introduction to a behavior change theory framework by the LIVE INCITE team This document is not a formal part of the LIVE INCITE Request for Tender and PCP. It may thus be used

More information

Effects of Sequential Context on Judgments and Decisions in the Prisoner s Dilemma Game

Effects of Sequential Context on Judgments and Decisions in the Prisoner s Dilemma Game Effects of Sequential Context on Judgments and Decisions in the Prisoner s Dilemma Game Ivaylo Vlaev (ivaylo.vlaev@psy.ox.ac.uk) Department of Experimental Psychology, University of Oxford, Oxford, OX1

More information

SAMPLE 3 - ASSESSMENT CRITERIA AND LEARNING OUTCOMES

SAMPLE 3 - ASSESSMENT CRITERIA AND LEARNING OUTCOMES SAMPLE 3 - ASSESSMENT CRITERIA AND LEARNING OUTCOMES PSYCHOLOGY Behaviourism and social learning theory are two perspectives of psychology; discuss how they explain learning: PK1/3/AA/11G (2014-15 VERSION)

More information

The Effects of Social Reward on Reinforcement Learning. Anila D Mello. Georgetown University

The Effects of Social Reward on Reinforcement Learning. Anila D Mello. Georgetown University SOCIAL REWARD AND REINFORCEMENT LEARNING 1 The Effects of Social Reward on Reinforcement Learning Anila D Mello Georgetown University SOCIAL REWARD AND REINFORCEMENT LEARNING 2 Abstract Learning and changing

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

MARK SCHEME for the May/June 2011 question paper for the guidance of teachers 9773 PSYCHOLOGY

MARK SCHEME for the May/June 2011 question paper for the guidance of teachers 9773 PSYCHOLOGY UNIVERSITY OF CAMBRIDGE INTERNATIONAL EXAMINATIONS Pre-U Certificate MARK SCHEME for the May/June 2011 question paper for the guidance of teachers 9773 PSYCHOLOGY 9773/02 Paper 2 (Methods, Issues and Applications),

More information

Associative learning

Associative learning Introduction to Learning Associative learning Event-event learning (Pavlovian/classical conditioning) Behavior-event learning (instrumental/ operant conditioning) Both are well-developed experimentally

More information

March 12, Introduction to reinforcement learning. Pantelis P. Analytis. Introduction. classical and operant conditioning.

March 12, Introduction to reinforcement learning. Pantelis P. Analytis. Introduction. classical and operant conditioning. March 12, 2018 1 / 27 1 2 3 4 2 / 27 What s? 3 / 27 What s? 4 / 27 classical Conditioned stimulus (e.g. a sound), unconditioned stimulus (e.g. the taste of food), unconditioned response (unlearned behavior

More information

Toward a Mechanistic Understanding of Human Decision Making Contributions of Functional Neuroimaging

Toward a Mechanistic Understanding of Human Decision Making Contributions of Functional Neuroimaging CURRENT DIRECTIONS IN PSYCHOLOGICAL SCIENCE Toward a Mechanistic Understanding of Human Decision Making Contributions of Functional Neuroimaging John P. O Doherty and Peter Bossaerts Computation and Neural

More information

Chapter 7 Behavior and Social Cognitive Approaches

Chapter 7 Behavior and Social Cognitive Approaches Chapter 7 Behavior and Social Cognitive Approaches What is Learning o What Learning Is and Is Not Learning - a relatively permanent influence on behavior, knowledge, and thinking skills that comes about

More information

Fundamental Clinical Trial Design

Fundamental Clinical Trial Design Design, Monitoring, and Analysis of Clinical Trials Session 1 Overview and Introduction Overview Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics, University of Washington February 17-19, 2003

More information

Responsiveness to feedback as a personal trait

Responsiveness to feedback as a personal trait Responsiveness to feedback as a personal trait Thomas Buser University of Amsterdam Leonie Gerhards University of Hamburg Joël van der Weele University of Amsterdam Pittsburgh, June 12, 2017 1 / 30 Feedback

More information

Psychology in Your Life

Psychology in Your Life Sarah Grison Todd Heatherton Michael Gazzaniga Psychology in Your Life FIRST EDITION Chapter 6 Learning 2014 W. W. Norton & Company, Inc. Section 6.1 How Do the Parts of Our Brains Function? 6.1 What Are

More information

Challenging Behaviours in Childhood

Challenging Behaviours in Childhood Challenging Behaviours in Childhood A/Professor Alasdair Vance Consultant Child and Adolescent Psychiatrist Department of Paediatrics University of Melbourne Royal Children s Hospital Email: avance@unimelb.edu.au

More information

A Drift Diffusion Model of Proactive and Reactive Control in a Context-Dependent Two-Alternative Forced Choice Task

A Drift Diffusion Model of Proactive and Reactive Control in a Context-Dependent Two-Alternative Forced Choice Task A Drift Diffusion Model of Proactive and Reactive Control in a Context-Dependent Two-Alternative Forced Choice Task Olga Lositsky lositsky@princeton.edu Robert C. Wilson Department of Psychology University

More information

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task

Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit task Why so gloomy? A Bayesian explanation of human pessimism bias in the multi-armed bandit tas Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

More information

Learning Habituation Associative learning Classical conditioning Operant conditioning Observational learning. Classical Conditioning Introduction

Learning Habituation Associative learning Classical conditioning Operant conditioning Observational learning. Classical Conditioning Introduction 1 2 3 4 5 Myers Psychology for AP* Unit 6: Learning Unit Overview How Do We Learn? Classical Conditioning Operant Conditioning Learning by Observation How Do We Learn? Introduction Learning Habituation

More information

The Cognitive Approach

The Cognitive Approach WJEC Psychology A-level The Cognitive Approach Notes Part 1 An Introduction to The Cognitive Approach A01 Introduction and Assumptions: Assumes that the scientific and objective study of internal mental

More information

Supplementary Information

Supplementary Information Supplementary Information The neural correlates of subjective value during intertemporal choice Joseph W. Kable and Paul W. Glimcher a 10 0 b 10 0 10 1 10 1 Discount rate k 10 2 Discount rate k 10 2 10

More information

Emotion Explained. Edmund T. Rolls

Emotion Explained. Edmund T. Rolls Emotion Explained Edmund T. Rolls Professor of Experimental Psychology, University of Oxford and Fellow and Tutor in Psychology, Corpus Christi College, Oxford OXPORD UNIVERSITY PRESS Contents 1 Introduction:

More information

Developmental Psychology and Learning SOCIAL EDUCATION COURSE Academic year 2014/2015

Developmental Psychology and Learning SOCIAL EDUCATION COURSE Academic year 2014/2015 Developmental Psychology and Learning SOCIAL EDUCATION COURSE Academic year 2014/2015 EXTENDED SUMMARY Lesson #6 Monday, Oct. 27 th 2014: 19:00/21:00 Friday, Oct. 31 th 2014, 09:00/11:00 LESSON PLAN: HUMAN

More information

Choosing the Greater of Two Goods: Neural Currencies for Valuation and Decision Making

Choosing the Greater of Two Goods: Neural Currencies for Valuation and Decision Making Choosing the Greater of Two Goods: Neural Currencies for Valuation and Decision Making Leo P. Surgre, Gres S. Corrado and William T. Newsome Presenter: He Crane Huang 04/20/2010 Outline Studies on neural

More information

Outline. History of Learning Theory. Pavlov s Experiment: Step 1. Associative learning 9/26/2012. Nature or Nurture

Outline. History of Learning Theory. Pavlov s Experiment: Step 1. Associative learning 9/26/2012. Nature or Nurture Outline What is learning? Associative Learning Classical Conditioning Operant Conditioning Observational Learning History of Learning Theory Nature or Nurture BEHAVIORISM Tabula Rasa Learning: Systematic,

More information

Academic year Lecture 16 Emotions LECTURE 16 EMOTIONS

Academic year Lecture 16 Emotions LECTURE 16 EMOTIONS Course Behavioral Economics Academic year 2013-2014 Lecture 16 Emotions Alessandro Innocenti LECTURE 16 EMOTIONS Aim: To explore the role of emotions in economic decisions. Outline: How emotions affect

More information

Serotonin and Dopamine Interactions. Atanas Stankov Computational Psychiatry Seminar Translational Neuromodeling Unit

Serotonin and Dopamine Interactions. Atanas Stankov Computational Psychiatry Seminar Translational Neuromodeling Unit Serotonin and Dopamine Interactions Atanas Stankov Computational Psychiatry Seminar Translational Neuromodeling Unit 02.05.2014 Outline NT Review Opponent/complement Interactions DA-5HT opponency L-Dopa

More information

Psychology in Your Life

Psychology in Your Life Sarah Grison Todd Heatherton Michael Gazzaniga Psychology in Your Life SECOND EDITION Chapter 6 Learning 2016 W. W. Norton & Company, Inc. 1 Humans are learning machines! Learning: A change in behavior,

More information

The Role of Modeling and Feedback in. Task Performance and the Development of Self-Efficacy. Skidmore College

The Role of Modeling and Feedback in. Task Performance and the Development of Self-Efficacy. Skidmore College Self-Efficacy 1 Running Head: THE DEVELOPMENT OF SELF-EFFICACY The Role of Modeling and Feedback in Task Performance and the Development of Self-Efficacy Skidmore College Self-Efficacy 2 Abstract Participants

More information

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Lec 02: Estimation & Hypothesis Testing in Animal Ecology Lec 02: Estimation & Hypothesis Testing in Animal Ecology Parameter Estimation from Samples Samples We typically observe systems incompletely, i.e., we sample according to a designed protocol. We then

More information

Brain Based Change Management

Brain Based Change Management Brain Based Change Management PMI Mile Hi Chapter December 2, 2017 Vanita Bellen Executive Coach and Leadership Consultant True North Coaching and Facilitation Vanita Bellen, MHSc, PHR, SHRM-CP, PCC True

More information

Natural Scene Statistics and Perception. W.S. Geisler

Natural Scene Statistics and Perception. W.S. Geisler Natural Scene Statistics and Perception W.S. Geisler Some Important Visual Tasks Identification of objects and materials Navigation through the environment Estimation of motion trajectories and speeds

More information

NSCI 324 Systems Neuroscience

NSCI 324 Systems Neuroscience NSCI 324 Systems Neuroscience Dopamine and Learning Michael Dorris Associate Professor of Physiology & Neuroscience Studies dorrism@biomed.queensu.ca http://brain.phgy.queensu.ca/dorrislab/ NSCI 324 Systems

More information

1983, NUMBER 4 (WINTER 1983) THOMAS H. OLLENDICK, DONNA DAILEY, AND EDWARD S. SHAPIRO

1983, NUMBER 4 (WINTER 1983) THOMAS H. OLLENDICK, DONNA DAILEY, AND EDWARD S. SHAPIRO JOURNAL OF APPLIED BEHAVIOR ANALYSIS 1983, 16. 485-491 NUMBER 4 (WINTER 1983) VICARIOUS REINFORCEMENT: EXPECTED AND UNEXPECTED EFFECTS THOMAS H. OLLENDICK, DONNA DAILEY, AND EDWARD S. SHAPIRO VIRGINIA

More information

Introduction to Computational Neuroscience

Introduction to Computational Neuroscience Introduction to Computational Neuroscience Lecture 5: Data analysis II Lesson Title 1 Introduction 2 Structure and Function of the NS 3 Windows to the Brain 4 Data analysis 5 Data analysis II 6 Single

More information

The Neural Basis of Economic Decision- Making in The Ultimatum Game

The Neural Basis of Economic Decision- Making in The Ultimatum Game The Neural Basis of Economic Decision- Making in The Ultimatum Game Sanfey, Rilling, Aronson, Nystrom, & Cohen (2003), The neural basis of economic decisionmaking in the Ultimatum game, Science 300, 1755-1758

More information

5th Mini-Symposium on Cognition, Decision-making and Social Function: In Memory of Kang Cheng

5th Mini-Symposium on Cognition, Decision-making and Social Function: In Memory of Kang Cheng 5th Mini-Symposium on Cognition, Decision-making and Social Function: In Memory of Kang Cheng 13:30-13:35 Opening 13:30 17:30 13:35-14:00 Metacognition in Value-based Decision-making Dr. Xiaohong Wan (Beijing

More information

References. Christos A. Ioannou 2/37

References. Christos A. Ioannou 2/37 Prospect Theory References Tversky, A., and D. Kahneman: Judgement under Uncertainty: Heuristics and Biases, Science, 185 (1974), 1124-1131. Tversky, A., and D. Kahneman: Prospect Theory: An Analysis of

More information

Hebbian Plasticity for Improving Perceptual Decisions

Hebbian Plasticity for Improving Perceptual Decisions Hebbian Plasticity for Improving Perceptual Decisions Tsung-Ren Huang Department of Psychology, National Taiwan University trhuang@ntu.edu.tw Abstract Shibata et al. reported that humans could learn to

More information

EXPLORATION FLOW 4/18/10

EXPLORATION FLOW 4/18/10 EXPLORATION Peter Bossaerts CNS 102b FLOW Canonical exploration problem: bandits Bayesian optimal exploration: The Gittins index Undirected exploration: e-greedy and softmax (logit) The economists and

More information

1. A type of learning in which behavior is strengthened if followed by a reinforcer or diminished if followed by a punisher.

1. A type of learning in which behavior is strengthened if followed by a reinforcer or diminished if followed by a punisher. 1. A stimulus change that increases the future frequency of behavior that immediately precedes it. 2. In operant conditioning, a reinforcement schedule that reinforces a response only after a specified

More information

Reference Dependence In the Brain

Reference Dependence In the Brain Reference Dependence In the Brain Mark Dean Behavioral Economics G6943 Fall 2015 Introduction So far we have presented some evidence that sensory coding may be reference dependent In this lecture we will

More information

A HMM-based Pre-training Approach for Sequential Data

A HMM-based Pre-training Approach for Sequential Data A HMM-based Pre-training Approach for Sequential Data Luca Pasa 1, Alberto Testolin 2, Alessandro Sperduti 1 1- Department of Mathematics 2- Department of Developmental Psychology and Socialisation University

More information

Production, reproduction, and verbal estimation of duration. John Wearden Keele University U.K.

Production, reproduction, and verbal estimation of duration. John Wearden Keele University U.K. Production, reproduction, and verbal estimation of duration John Wearden Keele University U.K. In this talk I want to illustrate the use of the techniques of production, reproduction, and verbal estimation

More information

Learning. Learning is a relatively permanent change in behavior acquired through experience.

Learning. Learning is a relatively permanent change in behavior acquired through experience. Learning Learning is a relatively permanent change in behavior acquired through experience. Classical Conditioning Learning through Association Ivan Pavlov discovered the form of learning called Classical

More information

Writing Reaction Papers Using the QuALMRI Framework

Writing Reaction Papers Using the QuALMRI Framework Writing Reaction Papers Using the QuALMRI Framework Modified from Organizing Scientific Thinking Using the QuALMRI Framework Written by Kevin Ochsner and modified by others. Based on a scheme devised by

More information

COMMITTEE FOR PROPRIETARY MEDICINAL PRODUCTS (CPMP) POINTS TO CONSIDER ON MISSING DATA

COMMITTEE FOR PROPRIETARY MEDICINAL PRODUCTS (CPMP) POINTS TO CONSIDER ON MISSING DATA The European Agency for the Evaluation of Medicinal Products Evaluation of Medicines for Human Use London, 15 November 2001 CPMP/EWP/1776/99 COMMITTEE FOR PROPRIETARY MEDICINAL PRODUCTS (CPMP) POINTS TO

More information

Neuro-cognitive systems underpinning antisocial behavior and the impact of maltreatment and substance abuse on their development

Neuro-cognitive systems underpinning antisocial behavior and the impact of maltreatment and substance abuse on their development Neuro-cognitive systems underpinning antisocial behavior and the impact of maltreatment and substance abuse on their development R.J.R. Blair Center for Neurobehavioral Research 1 Talk plan Boys Town and

More information

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08 Auditory Perception - Detection versus Discrimination - Localization versus Discrimination - - Electrophysiological Measurements Psychophysical Measurements Three Approaches to Researching Audition physiology

More information

THEORETICAL PERSPECTIVES I. Lecturer: Dr. Paul Narh Doku Contact: Department of Psychology, University of Ghana

THEORETICAL PERSPECTIVES I. Lecturer: Dr. Paul Narh Doku Contact: Department of Psychology, University of Ghana THEORETICAL PERSPECTIVES I Lecturer: Dr. Paul Narh Doku Contact: pndoku@ug.edu.gh Department of Psychology, University of Ghana Session Overview This session will deal with the meaning and scope of psychology,

More information

Intelligent Adaption Across Space

Intelligent Adaption Across Space Intelligent Adaption Across Space Honors Thesis Paper by Garlen Yu Systems Neuroscience Department of Cognitive Science University of California, San Diego La Jolla, CA 92092 Advisor: Douglas Nitz Abstract

More information

Brain Imaging studies in substance abuse. Jody Tanabe, MD University of Colorado Denver

Brain Imaging studies in substance abuse. Jody Tanabe, MD University of Colorado Denver Brain Imaging studies in substance abuse Jody Tanabe, MD University of Colorado Denver NRSC January 28, 2010 Costs: Health, Crime, Productivity Costs in billions of dollars (2002) $400 $350 $400B legal

More information

Selective bias in temporal bisection task by number exposition

Selective bias in temporal bisection task by number exposition Selective bias in temporal bisection task by number exposition Carmelo M. Vicario¹ ¹ Dipartimento di Psicologia, Università Roma la Sapienza, via dei Marsi 78, Roma, Italy Key words: number- time- spatial

More information

Lecturer: Rob van der Willigen 11/9/08

Lecturer: Rob van der Willigen 11/9/08 Auditory Perception - Detection versus Discrimination - Localization versus Discrimination - Electrophysiological Measurements - Psychophysical Measurements 1 Three Approaches to Researching Audition physiology

More information

A Race Model of Perceptual Forced Choice Reaction Time

A Race Model of Perceptual Forced Choice Reaction Time A Race Model of Perceptual Forced Choice Reaction Time David E. Huber (dhuber@psyc.umd.edu) Department of Psychology, 1147 Biology/Psychology Building College Park, MD 2742 USA Denis Cousineau (Denis.Cousineau@UMontreal.CA)

More information

Overview. Non-associative learning. Associative Learning Classical conditioning Instrumental/operant conditioning. Observational learning

Overview. Non-associative learning. Associative Learning Classical conditioning Instrumental/operant conditioning. Observational learning Learning Part II Non-associative learning Overview Associative Learning Classical conditioning Instrumental/operant conditioning Observational learning Thorndike and Law of Effect Classical Conditioning

More information

Chapter 5 Study Guide

Chapter 5 Study Guide Chapter 5 Study Guide Practice Exam Questions: Which of the following is not included in the definition of learning? It is demonstrated immediately Assuming you have eaten sour pickles before, imagine

More information

The Influence of the Initial Associative Strength on the Rescorla-Wagner Predictions: Relative Validity

The Influence of the Initial Associative Strength on the Rescorla-Wagner Predictions: Relative Validity Methods of Psychological Research Online 4, Vol. 9, No. Internet: http://www.mpr-online.de Fachbereich Psychologie 4 Universität Koblenz-Landau The Influence of the Initial Associative Strength on the

More information

Identification of Tissue Independent Cancer Driver Genes

Identification of Tissue Independent Cancer Driver Genes Identification of Tissue Independent Cancer Driver Genes Alexandros Manolakos, Idoia Ochoa, Kartik Venkat Supervisor: Olivier Gevaert Abstract Identification of genomic patterns in tumors is an important

More information

Learning and Adaptive Behavior, Part II

Learning and Adaptive Behavior, Part II Learning and Adaptive Behavior, Part II April 12, 2007 The man who sets out to carry a cat by its tail learns something that will always be useful and which will never grow dim or doubtful. -- Mark Twain

More information

Supplementary Materials

Supplementary Materials Supplementary Materials 1. Material and Methods: Our stimuli set comprised 24 exemplars for each of the five visual categories presented in the study: faces, houses, tools, strings and false-fonts. Examples

More information

Hierarchical Control over Effortful Behavior by Dorsal Anterior Cingulate Cortex. Clay Holroyd. Department of Psychology University of Victoria

Hierarchical Control over Effortful Behavior by Dorsal Anterior Cingulate Cortex. Clay Holroyd. Department of Psychology University of Victoria Hierarchical Control over Effortful Behavior by Dorsal Anterior Cingulate Cortex Clay Holroyd Department of Psychology University of Victoria Proposal Problem(s) with existing theories of ACC function.

More information

THEORIES OF PERSONALITY II

THEORIES OF PERSONALITY II THEORIES OF PERSONALITY II THEORIES OF PERSONALITY II Learning Theory SESSION 8 2014 [Type the abstract of the document here. The abstract is typically a short summary of the contents of the document.

More information

Exploring Reflections and Conversations of Breaking Unconscious Racial Bias. Sydney Spears Ph.D., LSCSW

Exploring Reflections and Conversations of Breaking Unconscious Racial Bias. Sydney Spears Ph.D., LSCSW Exploring Reflections and Conversations of Breaking Unconscious Racial Bias Sydney Spears Ph.D., LSCSW Race the Power of an Illusion: The Difference Between Us https://www.youtube.com/watch?v=b7_yhur3g9g

More information