Biosensor Approach to Psychopathology Classification

Misha Koshelev 1,2, Terry Lohrenz 3, Marina Vannucci 2,4, P. Read Montague 1,2,3,4 *

1 Program in Cell and Molecular Biology, Baylor College of Medicine, Houston, Texas, United States of America, 2 W. M. Keck Center for Interdisciplinary Bioscience Training, Houston, Texas, United States of America, 3 Department of Neuroscience, Computational Psychiatry Unit, Baylor College of Medicine, Houston, Texas, United States of America, 4 Department of Statistics, Rice University, Houston, Texas, United States of America

Abstract

We used a multi-round, two-party exchange game in which a healthy subject played a subject diagnosed with a DSM-IV (Diagnostic and Statistical Manual-IV) disorder, and applied a Bayesian clustering approach to the behavior exhibited by the healthy subject. The goal was to characterize quantitatively the style of play elicited in the healthy subject (the proposer) by their DSM-diagnosed partner (the responder). The approach exploits the dynamics of the behavior elicited in the healthy proposer as a biosensor for cognitive features that characterize the psychopathology group at the other side of the interaction. Using a large cohort of subjects (n = 574), we found statistically significant clustering of proposers' behavior overlapping with a range of DSM-IV disorders including autism spectrum disorder, borderline personality disorder, attention deficit hyperactivity disorder, and major depressive disorder. To further validate these results, we developed a computer agent to replace the human subject in the proposer role (the biosensor) and show that it can also detect these same four DSM-defined disorders. These results suggest that the highly developed social sensitivities that humans bring to a two-party social exchange can be exploited and automated to detect important psychopathologies, using an interpersonal behavioral probe not directly related to the defining diagnostic criteria.
Citation: Koshelev M, Lohrenz T, Vannucci M, Montague PR (2010) Biosensor Approach to Psychopathology Classification. PLoS Comput Biol 6(10): e doi: /journal.pcbi

Editor: Tim Behrens, John Radcliffe Hospital, United Kingdom

Received June 24, 2010; Accepted September 20, 2010; Published October 21, 2010

Copyright: © 2010 Koshelev et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by a training fellowship from the Keck Center for Interdisciplinary Bioscience Training of the Gulf Coast Consortia to MK. MV was partially supported by NIH-NHGRI grant number R01-HG003319, and by NSF-DMS grant number. PRM was partially supported by NIH R01 grants DA11723 and MH. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* read@bcm.tmc.edu

Introduction

Fairness games as probes for social exchange

Social interactions among humans reflect the execution of some of the most important and complex behavioral software with which humans are endowed. Consequently, we should expect the computations involved in human social exchange to be subtle and perhaps even difficult to expose and study in controlled settings. However, exposing these computations is crucial if we are to improve our characterization and understanding of normal human cognitive function and dysfunction. In recent years, the components of social exchange in healthy subjects have been probed using interactive economic exchange games [1-8]. These games typically involve two subjects interacting for one or multiple rounds through the exchange of monetary gestures to one another.
For our purposes here, these games require that three classes of computation be intact and functioning in the minds of the interacting subjects. Each subject must be able to (1) compute norms for what is fair in each exchange, (2) detect monetary gestures that deviate from these norms, and (3) choose actions predicated on such deviations [9-15]. These experimental probes have been used previously in behavioral economics and neuroeconomics, but here we show that the behavioral gestures elicited in the context of economic exchange games can be used to classify certain psychopathologies. The twist in our effort is that we use a data-driven approach that examines the reactions of the healthy partner as a kind of biosensor while that partner plays an exchange game with a subject possessing a psychopathology.

Multi-round trust game

In this paper, we used a multi-round fairness game played by pairs ("dyads") of interacting humans to extract behavioral phenotypes defined by the dynamics of play exhibited over the 10 rounds of a complete game [6,7,16]. The game we employ is called a trust game [17-19]; see Figure 1A. In the 10-round trust game, one player (called the investor or the proposer) is endowed with 20 monetary units and chooses to send some fraction i to their partner (called the trustee or the responder). The amount sent is tripled to 3i on the way to the trustee. The trustee then decides which fraction r to return to the investor; thus each round is represented by two numbers: the investment fraction i and the repayment fraction r. All the rules are transparent to both players. The game is played for 10 rounds, and the repeated exchanges allow the players to build models of what to expect from their partner, provided that their capacity to sense, model, and respond to their partner's decisions is intact.
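The bookkeeping of a single round, as described above, can be sketched in a few lines of Python. This is only an illustration: the function name `play_round` and its parameters are hypothetical, and the payoff accounting (the investor keeps whatever is not sent, and the trustee keeps whatever is not returned) is the usual trust-game convention, assumed here rather than stated in the text.

```python
def play_round(invest_fraction, repay_fraction, endowment=20, multiplier=3):
    """Return (investor_payoff, trustee_payoff) for one round of the trust game.

    invest_fraction is the fraction i of the endowment sent by the investor;
    repay_fraction is the fraction r of the tripled amount sent back.
    The payoff accounting is an assumption made for illustration.
    """
    if not (0.0 <= invest_fraction <= 1.0 and 0.0 <= repay_fraction <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    sent = invest_fraction * endowment      # amount leaving the investor
    received = multiplier * sent            # tripled on the way to the trustee
    returned = repay_fraction * received    # amount sent back to the investor
    investor_payoff = endowment - sent + returned
    trustee_payoff = received - returned
    return investor_payoff, trustee_payoff
```

For instance, an investor who sends half of the endowment and receives half of the tripled amount back ends the round with 25 units, while the trustee keeps 15. Only the two ratios i and r per round enter the analysis that follows.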
In most of the dyads, the subjects were given no information about their partner and did not meet or speak to the partner before, during, or after the task. Following [16], we also included personal dyads, in which the partners met before the task, were instructed together, and saw a picture of their partner during each round.

PLoS Computational Biology, October 2010, Volume 6, Issue 10

Author Summary

Human social interaction is exquisitely complex, and perturbed social interaction is a hallmark of psychological pathology. When someone has a psychological disorder the focus is generally on their behavior, but this behavior is rarely something displayed in isolation and typically induces profound changes in the people interacting with the disturbed individual. In this work we asked if the behavior of one person in a simple two-person economic exchange game is sensitive to features that could classify the pathology of their partner. We analyzed a large group of previously recorded interactions involving healthy persons and people diagnosed with a variety of psychological disorders, and found that a healthy person's behavior is indeed quantitatively and systematically influenced by their partner's pathology. These results could ultimately lead to a different way of understanding and diagnosing psychological disease.

Biosensor hypothesis

The basic approach of this paper derives from our prior work showing that this same game elicits unique behavioral phenotypes when it is played between a healthy investor and a trustee diagnosed with one of a range of DSM-defined disorders: Autism Spectrum Disorder (ASD) [20], Borderline Personality Disorder (BPD) [6], Major Depressive Disorder (MDD), and Attention Deficit Hyperactivity Disorder (ADHD) [20]. In all these studies, we noticed that the behavioral differences appear not only in the trustee but also in the healthy investor who plays with this trustee. A similar conclusion, that a healthy subject senses the psychological nature of the opponent during play, was reached in a recent paper [21], which showed that a subject can gauge the strategic sophistication of the opponent in repeated play of a complex stag hunt game. These observations suggested the hypothesis that the healthy (or typical) investor's behavior might be used to read out features that could characterize the psychopathology group playing in the trustee role.
This possibility was also suggested by the nature of the interpersonal interaction enforced by the game. In any multi-round interaction with another human, a player's choices are rather dramatically entangled with those of her partner. In addition, although the game is characterized by two numbers per exchange (investment and repayment ratio), it does require players to have several cognitive capacities intact to accomplish a normal exchange. These include short-term and working memory, sufficiently accurate models of what to expect from another human in this exchange, appropriate sensitivity to positive and negative social signals, and intact capacity to respond to such signals. Collectively, these observations support the basic hypothesis that humans bring highly developed social sensitivities to two-party interactions that might be profitably exploited as a biological sensor (biosensor), first using a human proposer (investor) and later capturing this behavior in a computer agent.

Figure 1. Model-free clustering of an objective multi-round economic exchange game. A) Depiction of Multi-Round Trust Task. A ten-round task in which two players, an investor and a trustee, undergo repeated interactions. Adapted from previous publications [6,7,16,20]. B & C) Our Approach. Following [23], we cluster investor-trustee dyads based on a regression of previous choices in the trust game. Specifically, we predict ratios of investment $i_t$ in round t as a polynomial of past rounds of investment and return. The number of clusters, order of polynomial, and number of rounds back on which to base this dependence are all taken as free parameters in the model. doi: /journal.pcbi g001

Results

Available data

We analyze the results of 287 dyads, in which healthy investors play against healthy trustees as well as against trustees with one of four disorders: ASD, BPD, MDD, and ADHD. Each subject played only one game. With the exception of some patients with BPD, participants with disorders were not medicated. A detailed description of the data is given in Table S1.

Bayesian classification of multi-round social exchange

We sought to classify the dynamics using only the numbers exchanged in the game between players (investment and repayment ratios), the number of types or styles of play (number of clusters), and the functional dependence of the next investment on preceding investment and repayment ratios. In short, we sought a heavily data-driven approach. We extended a previously published method [22,23] to cluster available trust game data. This method fits a regression to the functional dependence and clusters individuals based on the regression coefficients. It has two advantages over traditional clustering approaches: (i) the number of types in our population is estimated directly from the data, and (ii) classification uncertainty is captured by probabilities rather than categorical cluster assignments. An investor is not classified as either within or not within a cluster; instead, a probability of being in each cluster is computed. This allows us to identify clusters where a style of behavior (a type) is over-represented (in comparison with what is expected by chance), under-represented, or neither (see below for details of this calculation).
Data-driven modeling of two-party social exchange in the trust game

The basic model is determined directly from the numbers exchanged by the two players during the game. We model the healthy proposer's investment at time t as a function of preceding investment and repayment ratios. In this black-box regression approach [22,23], we assume that we can capture meaningful variations in types of investor play by using a regression model based only on previous investment and return ratios, in contrast to other approaches [21,24] which commit to more explicit models of how these values are used in mental processes to generate behavior. It is known that an arbitrary continuous function on a bounded domain can be approximated, with any given accuracy, by a polynomial of an appropriate order. As a result, a widely used approach to describe such functions is to try polynomial dependence of increasing order. For a first order dependence of the current investment $i_{j,t}$ on previous investments and repayments the model is:

$$i_{j,t} = \beta_0 + \beta_1 i_{j,t-1} + \beta_2 r_{j,t-1} + \beta_3 i_{j,t-2} + \beta_4 r_{j,t-2} + \text{error} \qquad (1)$$

where j indexes the subject and t indexes the current round of the game. For a second order dependence of the current proposer investment on previous investments and repayments, this expression would accrue all possible second order terms in lagged investments and repayments, including terms of the type $\beta\, i_{j,t-1} r_{j,t-1}$ that describe interaction between investments and repayments. Such terms acknowledge that the current choice by the investor $i_{j,t}$ is entangled with their previous interactions with their partner. Although expression 1 depicts a first order dependence on previous investments and repayments extending back two rounds of the game, in this paper we do not pre-commit to the exact functional dependence for the current proposer investment, nor to the number of exchanges into the past that best predict the person's current decision.
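As a small illustration of the first-order model above, the following Python sketch evaluates the predicted investment ratio for the next round from the two most recent rounds. The function name `predict_investment` and the coefficient values in the usage note are hypothetical; the coefficient order follows expression 1 (constant, investment one round back, repayment one round back, investment two rounds back, repayment two rounds back), and the error term is omitted.

```python
def predict_investment(coeffs, investments, repayments):
    """Predict the next investment ratio from the last two rounds.

    coeffs holds the five first-order coefficients in the order used in
    expression 1; investments and repayments are the ratios observed so far.
    """
    if len(coeffs) != 5:
        raise ValueError("first-order, two-round model needs 5 coefficients")
    if len(investments) < 2 or len(repayments) < 2:
        raise ValueError("need at least two completed rounds")
    return (coeffs[0]
            + coeffs[1] * investments[-1]   # investment one round back
            + coeffs[2] * repayments[-1]    # repayment one round back
            + coeffs[3] * investments[-2]   # investment two rounds back
            + coeffs[4] * repayments[-2])   # repayment two rounds back
```

With made-up coefficients `[0.1, 0.5, 0.2, 0.0, 0.0]`, investments `[0.4, 0.6]`, and repayments `[0.3, 0.5]`, the prediction is approximately 0.5: the constant 0.1 plus 0.5 times the last investment plus 0.2 times the last repayment.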
Instead, we assume a general polynomial dependence of the current investment ratio $i_{j,t}$ on previous investments and repayments, and determine the order of this polynomial dependence directly from the data. Similarly, we determine the number of rounds into the past required to predict optimally the person's current investment ratio from previous investment and repayment ratios. The details of this general approach follow.

Formally, we model the behavior as a mixture of regressions. For a fixed order of polynomial dependence P, a fixed look-back window M, and a fixed number of clusters K, we assume a single investor's data is given by

$$i_j \sim \sum_{k=1}^{K} \theta_k \, N(X_j \beta_k, \sigma_k^2 I).$$

Here, $i_j$ is an investor's (8-dimensional) vector of investments (we consider models looking back as many as two rounds; to make the models comparable we only consider 8 investments), $X_j$ is the model matrix of independent variables defining the regression (all monomials of degree less than or equal to P in lagged investments and repayments going back M rounds), $\beta_k$ is the k-th cluster's regression coefficients, $\sigma_k^2$ is the variance of the error term in the k-th cluster, $\theta_k$ is the weight assigned to the k-th cluster, N is the multivariate normal, and I is the identity matrix. This behavior model is applied to the data from the whole group, with the data itself determining both the appropriate subdivision into clusters and the regression coefficients within each cluster.

We use the data augmentation approach [25], defining latent variables $\tau = \{\tau_j\}_{j=1}^{n}$ which assign investors to clusters, to form the complete data $(i_j, \tau_j)$. We then get the joint posterior of the parameters and the latent variables by combining the complete data likelihood with priors over the parameters. We choose for priors

$$\theta \sim \mathrm{Dir}(2, 2, \ldots, 2)\ (K \text{ times}), \qquad \beta_k \sim N(0, \sigma_0^2 I) \text{ with } \sigma_0 = 0.1, \qquad \sigma_k^2 \sim \mathrm{IG}(1/2, 1/2),$$

where Dir denotes the Dirichlet distribution and IG the inverse gamma distribution [26].
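To make the model matrix concrete, here is a hedged Python sketch of how such a matrix could be assembled from one dyad's ten investment and repayment ratios. The function name `model_matrix`, its arguments, and the column ordering are assumptions for illustration, not the authors' code. The first predicted round is fixed at round 3 (index 2), so every look-back window up to two rounds yields the same 8 rows, matching the 8-dimensional investment vector described above.

```python
from itertools import combinations_with_replacement
from math import prod


def model_matrix(investments, repayments, order=1, lookback=2, first_target=2):
    """Build rows of monomials (degree <= order) in lagged ratios.

    Each row corresponds to one predicted round t (0-based index), and its
    columns are all monomials of degree 0..order in the lagged investments
    and repayments going back `lookback` rounds.
    """
    if lookback > first_target:
        raise ValueError("lookback cannot exceed the first predicted round index")
    rows = []
    for t in range(first_target, len(investments)):
        lagged = []
        for lag in range(1, lookback + 1):
            lagged.append(investments[t - lag])
            lagged.append(repayments[t - lag])
        row = []
        for degree in range(order + 1):
            # degree 0 yields the single empty product, i.e. the constant 1
            for combo in combinations_with_replacement(range(len(lagged)), degree):
                row.append(prod(lagged[idx] for idx in combo))
        rows.append(row)
    return rows
```

Under these assumptions, a first-order model with a two-round window has 5 columns per row (the constant plus four lagged ratios, in the same order as expression 1), and a second-order model with the same window has 15.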
These are the same priors that were used by Houser, Keane, and McCabe in their work [23]. As our independent variables all lie on the interval [0,1], we chose the prior variance of the coefficients to be proportional to this range.

Estimating the parameter

For the above model, we use a two-stage Gibbs sampling algorithm to estimate the parameters [27]. Start with initial parameters $(\theta^{(0)}, \beta^{(0)}, \sigma^{(0)})$, then repeat:

Step 1: Sample allocations $\tau_j^{(m)}$ given $(\theta^{(m-1)}, \beta^{(m-1)}, \sigma^{(m-1)})$:

$$\tau_j \sim \mathrm{Mult}(p), \qquad p_i = \frac{\theta_i \, N(X_j \beta_i, \sigma_i^2 I)}{\sum_{k=1}^{K} \theta_k \, N(X_j \beta_k, \sigma_k^2 I)},$$

where Mult is a multinomial distribution.
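Step 1 can be illustrated with a short, standard-library-only Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the names `allocation_probabilities` and `sample_allocation` are hypothetical, the normal density is evaluated at the investor's observed investment vector, and the weights are computed in log space (a numerical-stability choice not mentioned in the text).

```python
import math
import random


def allocation_probabilities(i_j, X_j, thetas, betas, sigma2s):
    """Cluster membership probabilities p for one investor (Step 1).

    i_j: observed investments; X_j: model matrix rows; thetas, betas and
    sigma2s hold the current weight, coefficients and variance per cluster.
    """
    n = len(i_j)
    log_weights = []
    for theta, beta, s2 in zip(thetas, betas, sigma2s):
        sse = 0.0
        for y, row in zip(i_j, X_j):
            mean = sum(x * b for x, b in zip(row, beta))
            sse += (y - mean) ** 2
        # log of theta_k times the spherical normal density at i_j
        log_weights.append(math.log(theta)
                           - 0.5 * n * math.log(2.0 * math.pi * s2)
                           - 0.5 * sse / s2)
    top = max(log_weights)
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def sample_allocation(probs, rng=random):
    """Draw one cluster index from the multinomial defined by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Because the covariance is a multiple of the identity, the multivariate density factorizes into a sum of squared residuals, which is why no matrix routines are needed for this step.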

Step 2: Sample $(\theta^{(m)}, \beta^{(m)}, \sigma^{(m)})$ given the $\tau_j^{(m)}$'s:

$$\theta \sim \mathrm{Dir}(2 + n_1, 2 + n_2, \ldots, 2 + n_K), \qquad n_k = \#\{j : \tau_j = k\} \text{ for } k = 1 \text{ to } K$$

$$\sigma_k^2 \sim \mathrm{IG}\big((1 + 8 n_k)/2,\ (1 + s_k^2)/2\big), \qquad s_k^2 = \sum_{\{j : \tau_j = k\}} (i_j - X_j \beta_k)^2, \qquad \sigma_1^2 > \sigma_2^2 > \ldots > \sigma_K^2$$

$$\beta_k \sim N(\bar{\beta}_k, M_k), \qquad M_k = \left[\sigma_k^{-2} X_k^{P\top} X_k^{P} + \sigma_0^{-2} I\right]^{-1}, \qquad \bar{\beta}_k = \sigma_k^{-2} M_k X_k^{P\top} i_k^{P}$$

Here, $i_k^P$ is the pooled investment data over cluster k, $X_k^P$ is the pooled model matrix over cluster k, and $N(x, S)$ is the normal multivariate density with mean x and covariance S. The sequences of samples can then be used to estimate parameters. To avoid possible adverse effects of potential outliers on this Gaussian-based (hence outlier-sensitive) method, we check that the empirical distribution of the differences $i_j - X_j \beta_k$ between the observed and predicted values is indeed consistent with the normality hypothesis. Finally, the optimal number of clusters, polynomial order, and look-back window can be determined by computing the marginal likelihood of each model (see the Methods section for details) and selecting the model with the largest value.

Mapping the Bayesian classification of healthy proposer behavior onto DSM phenomenology of responder

The method described above identified 4 clusters. In terms of the relevant parameters, two rounds were found to be the optimal number of previous moves for predicting the influence of past investments and repayment ratios on the current investment ratio made by the investor. To connect our clusters to the DSM-IV phenomenology, we determined which groups of subjects defined by DSM-specific criteria were over- or under-represented in each cluster and the number of standard deviations by which they were over- or under-represented. The results of the clustering are shown in Figure 2 (see Table S2, Table S3 and Table S4 for a detailed description).

Cluster 1 over-represents individuals with ADHD. Although 54% of these individuals would be expected to fall into this cluster by chance, 89% of them end up in this cluster. Cluster 2 significantly over-represents individuals with Autism Spectrum Disorder. By chance, 23% of these individuals should fall in this cluster; however, we see 44% of them in the cluster. In Cluster 3, medicated and non-medicated individuals with Borderline Personality Disorder are over-represented. By chance, 15% of individuals from each group should fall into this cluster. However, 36% of medicated and 27% of non-medicated Borderline Personality Disorder individuals belong to this cluster. Cluster 4 should by chance contain 8% of individuals with MDD, but 20% of them fall into this cluster. The chi-square analysis confirms the statistical significance of this over-representation (see Methods section).

Figure 2. Groups over-/under-represented in behavioral clusters. We analyzed over-/under-representation of original groups in our clusters. Our approach is depicted in Figure 1 and detailed in the Materials & Methods section. We used the most frequent value of a dyad's cluster assignment over all draws from the posterior to assign a type for this analysis. We computed the number of standard deviations of over-/under-representation in the cluster as compared to that expected by chance. These values are shown for each cluster and each original group. ASD = Adolescents with Autism Spectrum Disorder [20]; ADHD = Children with Attention-Deficit/Hyperactivity Disorder; Per = Healthy individuals who met before playing the trust game [16]; Imp = Healthy individuals who played the trust game remotely with individuals from the California Institute of Technology [7]; BPD-M = Medicated individuals with Borderline Personality Disorder [6]; BPD-N = Non-medicated individuals with Borderline Personality Disorder [6]. doi: /journal.pcbi g002

Additional result: probability of belonging to a cluster is correlated with the severity of the disorder

For two disorders, there are known scores describing their severity: for ASD, the score on the Autism Diagnostic Interview-Revised [28] Repetitive Behavior subscale, and for BPD, the score on the Interpersonal Trust Scale [29]. In both cases, we found a statistically significant correlation between these scores and the probability of belonging to the corresponding cluster (the percent match of the dyad in this cluster across 30,000 draws from the posterior): $R = 0.$ and $R^2 = 0.2569$ for ASD (Figure 3), and $R = -0.$ and $R^2 = 0.3261$ for BPD (Figure 4).

Characterizing corresponding social behavior

With the clusters defined as described, we sought to characterize the kinds of social gestures (signals sent across rounds and between players) that define them.
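Two per-dyad summaries are used in the analysis above: the most frequent cluster assignment over all posterior draws (used to assign a type) and the percent match of the dyad in a cluster (used in the severity correlations). A minimal Python sketch of both is given below; the function names are hypothetical and the draws are assumed to be available as a plain list of cluster labels.

```python
from collections import Counter


def modal_cluster(draws):
    """Most frequent cluster label over the posterior draws for one dyad."""
    if not draws:
        raise ValueError("need at least one posterior draw")
    return Counter(draws).most_common(1)[0][0]


def percent_match(draws, cluster):
    """Percentage of posterior draws that assign the dyad to `cluster`."""
    if not draws:
        raise ValueError("need at least one posterior draw")
    return 100.0 * sum(1 for label in draws if label == cluster) / len(draws)
```

For example, a dyad assigned to clusters `[2, 2, 3, 2, 1]` over five draws has modal cluster 2 with a 60% match; in the paper the same summaries are taken over 30,000 draws.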
In Figure 5, we summarize the across-round social gestures for each cluster in terms of the regression coefficients for the investment and repayment ratios and the constant term (see also Figure S1 and Figure S2). We discuss the potential importance of these findings below; here we summarize the average social gesture of each cluster by plotting the average regression coefficients for each cluster, restricting the number of rounds back to two, the optimal number for predicting the investor's next investment ratio (Figure 5B).

Figure 3. ADI-R C, repetitive interests score, correlates with assignment of dyads with ASD individuals to cluster 2. For dyads with Adolescents with Autism Spectrum Disorder [20] assigned to cluster 2, in which they are over-represented, we analyzed the correlation of (i) the percent match of the dyad into cluster 2 from 30,000 draws from the posterior and (ii) the score on the Autism Diagnostic Interview-Revised [28] Repetitive Behavior subscale of the ASD individual playing in the trustee role. We found a correlation with $R = 0.$ and $R^2 = 0.2569$. doi: /journal.pcbi g003

Figure 4. Interpersonal trust scale correlates with assignment of dyads with BPD trustees to cluster 3. For dyads with Borderline Personality Disorder, Medicated and Non-Medicated [6], assigned to cluster 3, in which they are over-represented, we analyzed the correlation of (i) the percent match of the dyad into cluster 3 from 30,000 draws from the posterior and (ii) the score on the Interpersonal Trust Scale [29] of the BPD individual playing in the trustee role (self-report, lower score implies less trust). We found a correlation with $R = -0.$ and $R^2 = 0.3261$ ($p < 0.05$). doi: /journal.pcbi g004

Notice that in Cluster 4, the dependence is dominated by the constant term; this term reflects universally high investments. In Cluster 4, investors playing subjects with major depressive disorder are over-represented.
The other over-represented group in Cluster 4 consists of investors playing trustees whom they met before the game and whose pictures they saw in each round of the exchange. It is interesting to note that investors playing subjects with ASD end up over-represented in the same cluster (Cluster 2) as investors playing subjects in an impersonal version of the game, in which subjects neither meet nor see each other.

Figure 5. Characteristics of behavioral clusters. A-D) Left Panel: Investor and Trustee Behavior in Behavioral Clusters. For each cluster, the corresponding number of dyads is shown in the title. Further, the corresponding mean investment ratios (red) and return ratios (black) are represented. Standard error of the mean is plotted, but is smaller than the markers used to denote means. Right Panel: Polynomial Coefficients Used to Predict Investment Ratios for Behavioral Clusters. Mean values of polynomial coefficients used to predict investment ratios for each cluster are shown. Specifically, the coefficients for the constant term (gray), return (red), and investment (green) ratios are shown. doi: /journal.pcbi g005

Computer agents as investor-side biosensor

The above results provide evidence that examining investor-side behavior provides a new kind of readout for some important psychopathology groups studied under the probe of the multi-round trust game. The game itself, although simple (in each round only two numbers are exchanged), requires a number of intact cognitive functions, including working memory, short-term memory, the capacity to model and predict the partner's likely response, the capacity to sense deviations from these expectations, good a priori models of human trade instincts (reflected by round one offers and responses), and so on. One value of this approach is that it utilizes a probe that is not directly related to the symptom lists that define DSM classifications, and therefore provides a possible alternative method of classifying some psychopathologies, or at least of identifying or isolating some of their malfunctioning computations.

To verify the robustness of the clustering algorithm, we employed a previously described computer agent designed to play the trustee role. The possibility of designing agents of this type was shown in our previous work [6]. The corresponding k-nearest neighbor agents use the database containing the results of all the rounds of all the dyads. A healthy trustee agent, to decide how much to repay, looks at the vector of 6 previous choices (last 3 investments and last 3 repayments) and finds, among all the records with healthy trustees, the k = 6 situations in which the corresponding previous choices were the closest (in Euclidean distance). Out of the 6 recorded outcomes of these closest situations, the agent selects one with equal probability. A BPD trustee agent similarly selects from dyads with a BPD trustee. These trustee agents were validated in [6]: in interaction with healthy human investors, the BPD agent was shown to reproduce accurately the ruptures in cooperation normally observed when a healthy investor plays a BPD trustee. Such ruptures were not observed in healthy investors playing a healthy computer trustee.

In our case, we need to supplement these agents with a similar investor agent that selects the investment value based on the 6 closest dyads. Our hypothesis is that the same correlation with disorders can be detected when the trustee agents play against the investor agent. Since it was already shown that the trustee agents adequately describe the trustee behavior, we had the healthy investor agent play either the healthy or the BPD trustee agent in the trust task for ten rounds (Figure 6A), 1,000 times.
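The k-nearest neighbor sampling step described above can be sketched as follows. This is an illustrative reconstruction, not the agents' actual code: the function name `knn_agent_choice` and the record layout (a list of pairs holding a 6-element context vector and the ratio that followed it) are assumptions, and how the agents handle the first rounds, before three rounds of history exist, is not covered here.

```python
import math
import random


def knn_agent_choice(history, records, k=6, rng=random):
    """Pick the next ratio by sampling among the k closest recorded situations.

    history: the current game's vector of recent choices (for example the
    last 3 investments and last 3 repayments).
    records: list of (context_vector, next_ratio) pairs drawn from the
    relevant group of recorded dyads (hypothetical layout).
    """
    if len(records) < k:
        raise ValueError("need at least k recorded situations")
    # rank recorded situations by Euclidean distance to the current history
    nearest = sorted(records, key=lambda rec: math.dist(history, rec[0]))[:k]
    # choose one of the k recorded outcomes with equal probability
    return rng.choice(nearest)[1]
```

A healthy trustee agent, a BPD trustee agent, and a healthy investor agent would then differ only in which set of records is passed in, which mirrors the description in the text.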
These interactions were then assigned to the previously determined clusters using the posterior distribution of parameters generated from the analysis of the human dyads (see details in the Methods section). Notably, interactions between the BPD trustee agent and the healthy investor agent were statistically significantly over-represented, by 7.19 standard deviations, in Cluster 3, the same cluster in which investors playing both medicated and non-medicated individuals with Borderline Personality Disorder are over-represented. On the other hand, interactions between the healthy investor and healthy trustee agents were not statistically significantly over-represented in this same cluster; see Figure 7 and Figure S3. Thus, for BPD, the same correlation between the statistical clustering and disorders can indeed be achieved by using the investor agents. (For the ASD group, there were insufficient data (n = 16) to develop an analogous trustee agent, and so no validation along this psychopathology was possible at this time.)

Figure 6. Agent-vs-agent validation of clustering scheme. A) Depiction of agent-vs-agent trust task. Specifically, a k-nearest neighbors agent that samples healthy investor behavior plays the multi-round trust game against a k-nearest neighbors agent that samples healthy or BPD trustee behavior for ten rounds [6]. B) Depiction of the space of sampled interactions. The sampling agent uses the records of investment and return from the trust game as played by either (i) healthy trustees, (ii) healthy investors, or (iii) BPD trustees, depending on the specific agent used. The agent starts with a vector representing several immediate past choices for the game that is currently being played (this vector forms the center of the circle), and selects several records for which the corresponding vectors have the smallest Euclidean distance to the current vector (these vectors are inside the circle). C) The sampling agent finds the next investment (or return) ratios for all the closest recorded game trajectories. In Panel C, these ratios represent $i_3$. D) The agent then selects, with equal probability, one of these next ratios and returns it as the investment (or return) ratio for the game that is currently being played. doi: /journal.pcbi g006

Figure 7. Simulated data over-/under-represented in behavioral clusters. The analysis detailed in Figure 2 was repeated with the simulated interactions. In Figure 2, Cluster 3 over-represents healthy individuals playing BPD trustees. Similarly, in our analysis of simulated interactions, Cluster 3 over-represents simulated healthy-vs-BPD interactions by 7.19 standard deviations. On the other hand, Cluster 3 over-represents healthy-vs-healthy simulated interactions by only 0.46 standard deviations. doi: /journal.pcbi g007

Discussion

We have used a data-driven, Bayesian regression approach to cluster the healthy investor behavioral data from a large set of 287 trust interactions, which included trustees from several DSM mental-illness groups. The Bayesian approach allowed us to determine in a principled way the number of clusters in our population (four) and the probability that each dyad belongs to each cluster. Next, we used a chi-square criterion for over-/under-representation to determine which pre-defined DSM-IV groups are statistically significantly over- and under-represented in each cluster. We found that there is a one-to-one correspondence between the resulting clusters and the DSM-IV disorders: namely, dyads in which trustees have a certain DSM-IV disorder are over-represented in the corresponding cluster. Moreover, for the two disorders with available severity scores (ASD and BPD), there is a correlation between the severity of the disorder and the probability of belonging to the corresponding cluster.

The finding that a trustee's disorder can be detected based on the investor's behavior is in line with the fact that in any multi-round interaction with another human, a player's choices are dramatically entangled with those of her partner. Humans bring highly developed social sensitivities to two-party interactions. Our results show that these sensitivities can serve as a biosensor: the quantitative behavioral dynamics of a healthy person can capture the subtle behavioral abnormalities of her partner (abnormalities that are difficult to capture by the usual statistical analysis).

To further validate our approach, we used a previously described k-nearest neighbor sampling agent, as well as its implementation in the investor role, to simulate healthy vs. healthy and healthy vs. BPD interactions. We showed that healthy vs. BPD agent interactions were over-represented in the same cluster as healthy vs. BPD individuals, whereas the healthy vs. healthy agent interactions were not significantly over-represented in this cluster (0.46 standard deviations).

Having arrived at an initial validation of our clustering, one can ask what further information we can extract from our method. Specifically, what do the patterns of play (Figure 5A) and the polynomial coefficients used to predict investor behavior (Figure 5B) tell us about the behavior of individuals in each group?

We start with the fourth, and smallest, cluster. This cluster over-represents (i) dyads who met before playing the trust game as well as (ii) healthy investors playing trustees with Major Depressive Disorder. In this cluster, investment ratios are very high, and return ratios, in comparison to other clusters, are also high. For this trust cluster, the constant term effectively dominates the polynomial predicting the investment ratio.

The third, next largest, cluster over-represents both medicated and non-medicated individuals with Borderline Personality Disorder. In this cluster, both investment and return ratios are relatively low.

The second cluster over-represents adolescents with Autism Spectrum Disorder. The difference in pattern of play between this cluster and cluster one is difficult to detect by simply looking at the round-by-round average investment and repayment levels.
Notably, the two clusters separate individuals with Autism-Spectrum Disorder from individuals with ADHD, two disorders that are often difficult to separate because they share several symptoms. One of the advantages of our method is that we arrive not only at clusters, but also at polynomial coefficients that can be used to predict investment ratios in each cluster. By looking at these coefficients, one can see a characterizing feature of cluster two: the current investment ratio depends strongly on the ratios of investment and return one round back. It is known that reciprocity is a driving signal in the trust game [25], and that the sensitivity to reciprocity of individuals with Autism-Spectrum Disorder is blunted [28]. The investor behavior in this cluster may be an adaptation to this diminished sensitivity.

While our results show the statistically significant biosensing of certain disorders, the resulting clustering does not provide us with a clear diagnostic, since each cluster contains, in addition to individuals with the corresponding disorder, a large number of healthy individuals; see Table S5. The fact that we did not get a clear separation between normal participants and participants with disorders (i.e., we find healthy participants scattered across the clusters) points to two distinctly different ways to approach psychopathology [30]. One possibility is that psychopathology groups are reflections of quantitative differences along normal cognitive dimensions (and their correlations) that are probed by our interpersonal exchange game. The second is that the first possibility holds but is augmented by the fact that psychopathology groups bring extra (or fewer) or different cognitive dimensions to

the responses elicited by the game (Figure S4). To shed light on this issue in the context of this task, we clustered the healthy dyads alone and then assigned the disordered dyads to these clusters. The algorithm again selected 4 as the optimal number of clusters, 1 as the polynomial order, and 2 as the lookback length, but the assignment of the disordered dyads to the clusters is somewhat different than in the main result (Table S6). For BPD dyads the over-representation result is stronger, but for the other groups it is weaker. Also, while the betas of the regression (Table S7) are quite similar in three of the clusters, the fourth (Figure S5) is substantially different. Finally, the cluster assignments of the healthy dyads are in good concordance across the two clusterings (adjusted Rand index of 0.94 [31-33]; see also Table S8). Taken together, these facts suggest that the second view of psychopathology mentioned above is to be favored in this task, and that as far as this behavioral probe is concerned the disordered individuals are qualitatively different.

Interestingly, a seemingly more direct classification, based on the return values themselves, does not lead to such a statistically significant correlation between clusters and disorders: many differences between healthy and pathological trustees cannot be detected against the background of other behavioral differences; see Table S9. This shows that humans acting as biosensors have the ability to filter out the important differences and thus help in diagnosing psychopathologies.

To summarize: we have data from 287 dyads involved in one such task, the trust game. We use a data-driven, agnostic method [22,23] to arrive, directly from the data, at (i) the number of clusters, (ii) the order of the polynomial that predicts investment ratios, and (iii) the number of prior rounds on which investor decisions depend.
We then arrived at a probabilistic clustering of these dyads, and analyzed over-representation of initial groups in the new clusters. We found that, by clustering dyads based on investor decisions, we were able to over-represent trustees with different disorders in separate clusters. Further, we used previously described k-nearest neighbor sampling agents [6] to generate 1,000 interactions each for healthy vs. healthy and healthy vs. BPD agents. By clustering these interactions based on the polynomial coefficients from our initial clusters, we found that simulated healthy vs. BPD interactions are statistically significantly over-represented in the same cluster as real healthy vs. BPD interactions, but that simulated healthy vs. healthy interactions are statistically significantly under-represented in the same cluster.

We believe that these results constitute a significant step forward in quantitative diagnosis of psychiatric illness. The fact that brain images have helped in the analysis of human behavior in fairness games [2-8] makes us believe that our diagnoses can be further refined by using the corresponding brain imaging data.

Current psychiatric diagnoses are based on the DSM [34]. Essentially, these are lists of criteria used by a trained physician to characterize whether or not a person has a specific disorder. Such clinical, experience-based classification schemes provide a valuable understanding of psychiatric and neurological disorders. However, to uncover genetic underpinnings of various psychiatric disorders and to provide quantitative behavioral and neural measures, it is desirable to have quantitative measures of normal social interactions that can expose computations perturbed in various psychopathologies. Such measures could then be used to quantify abnormalities in social exchange, to diagnose psychiatric and neurological disorders, and to probe the genetic basis of such disorders.
The results presented in this paper show some of our first steps in this direction; however, as more data on this and similar parametric social exchange tasks become available, they should help to construct a quantitative understanding of mental disorders.

Additional observations

Intuitively, one might expect the investment on the next round to be an interactive function of both the previous investment and the repayment the investor received, rather than independent effects of each. However, our analysis shows that the optimal clustering corresponds to polynomials of order $P = 1$, i.e., to the linear dependence (1). This means that, contrary to this intuition, the second-order terms, in particular interaction terms between investments and repayments (such as $\beta\, i_{j,t-1}\, r_{j,t-1}$), do not lead to a statistically significant improvement of the model's explanatory power.

For patients diagnosed with a DSM-IV disorder, medication is an important potential confound. In our study, only some BPD patients were medicated. According to Figure 2, both medicated and non-medicated BPD patients were statistically significantly over-represented in the corresponding Cluster 3. Thus, the presence or absence of medication does not affect our classification.

In this paper, we use a purely data-driven approach to data analysis. This approach is important from the foundational viewpoint, since it enables us, in particular, to further confirm the objective nature of the existing psychopathology classification. From the practical viewpoint, once this classification is established, we can improve the diagnostic efficiency if we explicitly use the known diagnoses in classification and regression analysis. For example, this may make it possible to find the markers that identify healthy subjects with superior discriminatory power.
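The contrast between the linear model and a model with second-order terms can be made concrete with a small sketch. The Python below is an illustration only and not the authors' code: the function names `lagged_features` and `predict`, the ordering of the regressors, and the choice to include squared terms alongside interaction terms at order 2 are all assumptions made for this example, and the ratios are invented.

```python
from itertools import combinations_with_replacement


def lagged_features(invest, ret, t, lookback=2, order=1):
    """Regressors for predicting the investment ratio in round t (0-based).

    invest, ret: per-round investment and return ratios, each in [0, 1].
    lookback:    number of previous rounds used (D in the text).
    order:       polynomial order (P in the text); order 1 is the linear model.
    """
    if t < lookback:
        raise ValueError("round t needs at least `lookback` earlier rounds")
    # First-order terms: i_{t-1}, r_{t-1}, ..., i_{t-D}, r_{t-D}
    base = []
    for d in range(1, lookback + 1):
        base.extend([invest[t - d], ret[t - d]])
    feats = [1.0]  # constant term
    # Add every monomial of degree 1..order in the first-order terms
    # (assumed form of the polynomial; includes i_{t-1} * r_{t-1} at order 2).
    for p in range(1, order + 1):
        for combo in combinations_with_replacement(range(len(base)), p):
            term = 1.0
            for idx in combo:
                term *= base[idx]
            feats.append(term)
    return feats


def predict(beta, feats):
    """Predicted investment ratio: dot product of coefficients and regressors."""
    return sum(b * x for b, x in zip(beta, feats))


# Hypothetical three-round history, predicting round index 2.
invest = [0.5, 0.6, 0.7]
ret = [0.3, 0.4, 0.5]
linear = lagged_features(invest, ret, t=2, lookback=2, order=1)     # 5 regressors
quadratic = lagged_features(invest, ret, t=2, lookback=2, order=2)  # 15 regressors
```

With a lookback of two rounds, moving from order 1 to order 2 under these assumptions triples the number of coefficients per cluster, which illustrates why a simpler model is preferred unless the extra terms clearly improve the fit.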
Materials and Methods

Ethics statement

Informed consent was obtained for all research involving human participants, and all clinical investigations were conducted according to the principles expressed in the Declaration of Helsinki. All procedures were approved by the Baylor College of Medicine Institutional Review Board.

Multi-round trust game

The game is described in the previous section. Healthy participants were invited to the Human Neuroimaging Laboratory at Baylor College of Medicine. Prior to playing the game, each participant was told that they would earn between $20 and $40, scaled by the number of monetary units (MU) each player individually accrued. Following the game, each participant was compensated as follows: <68 MU = $20, MU = $25, MU = $30, MU = $35, and >300 MU = $40.

Bayesian classification: Gibbs sampler

We discarded 1,000 draws as burn-in, sampled 30,000 draws from the posterior, and assessed convergence using the Raftery-Lewis test [35]. We used the R Bayesian Output Analysis program to perform these calculations [36]. We repeated our analyses using 8,000 cycles total as per Houser, Keane, and McCabe [23] and 1,000; 3,000; and 5,000 cycles as burn-in and arrived at similar over-representation results.

Checking normality

To check that the empirical distribution of the differences $i_j - X_j \beta_k$ between the observed and predicted values is indeed consistent with the normality hypothesis, we normalize each difference by subtracting the sample mean of the differences from the corresponding cluster and then dividing by the sample standard deviation of these differences. We then compute the sample skewness and the sample kurtosis of the collection of all these

differences, and use Matlab's Jarque-Bera test to check normality. Normality has been confirmed with $p = 0.77$. Since the null hypothesis of normality is rejected when $p < 0.05$, our value of $p$ indicates strong empirical support for the normality hypothesis.

Bayesian classification: Optimal parameters

We used the Laplace-Metropolis estimator of the marginal likelihood [37], as described in Houser, Keane, and McCabe [23], to compare models with different values of the number $K$ of clusters, the order $P$ of the polynomials, and the number $D$ of past rounds on which the model depends. We did not include any results in which 2 of 3 samplers arrived at at least one empty type in the mode of the last 5,000 of 8,000 draws from the posterior. To maximize marginal likelihood (i.e., to find a posterior mode), we used component-wise optimization (also known as conditional maximization or step-wise ascent; see, e.g., p. 312 of [38]), the use of which is well-established for Bayesian problems such as maximizing the posterior mode, and arrived at the same answer when comparing the maximum log marginal likelihoods for different models. As a result, we concluded that the optimal model has $K = 4$ clusters, a first-order polynomial ($P = 1$), and a dependence of ratios of investment on ratios of investment and return $D = 2$ rounds into the past.

We found that, in contrast to the simpler case described in [23], our marginal likelihood values are in many cases fairly close to one another and thus, the results of comparing these values can potentially change if we repeat the same computational experiment. To make sure that our selection of 4 clusters does not change, we supplemented the conditional maximization with an exhaustive analysis of all possible triples $(K, D, P)$ with up to 10 clusters, polynomials of order 1 to 3, and a time dependence of 1 or 2 rounds into the past.
For each such model, we used several samplers and got several values of marginal likelihoods; when we compare two models, we select the simpler one (the one with the fewest overall parameters) unless the other one has a statistically significantly larger mean. Since for the same model $(K, D, P)$, the distribution of marginal likelihood values is sometimes not Gaussian (see Figure S6B), we could not use the usual t-test. Instead, we used the Wilcoxon rank-sum test at the 5% significance level [39]. The results (shown in Figure S6A and detailed in Table S4) confirm that the model with $(K, D, P) = (4, 2, 1)$ is optimal.

Over-representation analysis

To check whether the observed over-representation of participants with disorders in different clusters is statistically significant, we apply the chi-square test corresponding to a null hypothesis that the participants of different disorders $g = 1, \ldots, G$ are randomly distributed in different clusters $k = 1, \ldots, K$. Let $n_k$ be the number of elements in the $k$-th cluster, $n_{g,k}$ the number of elements of the $g$-th group in this cluster, $k(g)$ the cluster corresponding to the group $g$, and $p_g$ the ratio of group $g$ in the population as a whole. Under the null hypothesis, due to the central limit theorem, the value $n_{g,k}$ is asymptotically normally distributed, with mean $p_g \cdot n_k$ and variance $p_g \cdot (1 - p_g) \cdot n_k$. Thus, the ratio $p_{g,k} \stackrel{\text{def}}{=} n_{g,k} / n_k$ is normally distributed with mean $p_g$ and variance $\sigma_{g,k}^2 = (p_g \cdot (1 - p_g)) / n_k$. Thus, to test the null hypothesis, we can form the test statistic

$$\chi^2 = \sum_{g=1}^{G} o_{g,k(g)}^2, \qquad \text{where} \quad o_{g,k} \stackrel{\text{def}}{=} \frac{p_{g,k} - p_g}{\sigma_{g,k}}$$

is a relative over-(under-)representation of the group $g$ in the cluster $k$. For $G = 4$, the null hypothesis is rejected with $p \le 0.05$ when $\chi^2 \ge 9.49$. Thus, when each of the four terms $o_{g,k}^2$ in the sum satisfies the inequality $o_{g,k}^2 \ge 9.49/4 = 2.47$, the null hypothesis is rejected. We therefore consider the groups which are over-represented at the level $o_{g,k} \ge \sqrt{2.47} = 1.57$.
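The over-representation score defined above is straightforward to compute from cluster counts. The following Python sketch is offered only as an illustration under stated assumptions: the function names `over_representation` and `chi_square_statistic` are hypothetical, the counts in the usage lines are invented, and the code does not reproduce any threshold or result from the paper.

```python
import math


def over_representation(n_gk, n_k, p_g):
    """Relative over-(under-)representation o_{g,k} of group g in cluster k.

    n_gk: number of dyads from group g that fall in cluster k.
    n_k:  total number of dyads in cluster k.
    p_g:  share of group g in the whole population (strictly between 0 and 1).
    """
    if n_k <= 0:
        raise ValueError("cluster must contain at least one dyad")
    if not 0.0 < p_g < 1.0:
        raise ValueError("p_g must lie strictly between 0 and 1")
    p_gk = n_gk / n_k                            # observed share in the cluster
    sigma = math.sqrt(p_g * (1.0 - p_g) / n_k)   # null standard deviation
    return (p_gk - p_g) / sigma


def chi_square_statistic(scores):
    """Sum of squared over-representation scores, one per group."""
    return sum(o * o for o in scores)


# Invented example: a group making up 20% of all dyads supplies
# 20 of the 50 dyads in some cluster.
score = over_representation(n_gk=20, n_k=50, p_g=0.2)  # roughly 3.54
```

A positive score means the group is more common in the cluster than in the population as a whole, a negative score means it is less common, and a score of zero means the cluster share equals the population share.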
Please note that when $o_{g,k} \ge 1.96$, the over-representation of the group $g$ in cluster $k$ is already statistically significant with $p \le 0.05$.

Applying resulting classification to other records

Our clustering is based on iterations of Gibbs sampling. Every additional vector $X$ (e.g., of agents playing) is then classified as follows. For each recorded iteration of the Gibbs sampling (after the burn-in), based on the recorded values of $\beta$ and $\sigma$, we compute the probabilities $p_i$ of $X$ belonging to different clusters $i$ (we use the same formula as in the subsection Estimating the parameter). Then, we select a cluster $i$ with the probability $p_i$. After all these selections, we assign the dyad characterized by the vector $X$ to the cluster to which, among all the iterations, this vector was assigned the largest number of times.

Supporting Information

Figure S1 Mean ratios of investment and return for all dyads considered in analysis. Ratios of investment ( = MU/20) and return ( = MU/[3*investment amount]) are shown for each of the initial groups considered in the analysis. The number of pairs in each group is indicated in the title. Standard errors of the mean are displayed. Found at: doi: /journal.pcbi s001 (0.11 MB PDF)

Figure S2 Polynomial coefficient distributions over 30,000 draws from posterior distribution. The polynomial coefficients that predict investment ratios are stable after 30,000 draws from the posterior distribution and are approximately normally distributed. We show a histogram of each polynomial coefficient whose mean is shown in Fig 5. A red line is placed at the zero position, denoting a monomial that does not contribute to the value of the predicted investment ratio. Found at: doi: /journal.pcbi s002 (0.15 MB PDF)

Figure S3 Mean ratios of investment for agent vs agent interactions.
As seen in human players [6], cooperation fails across rounds when the BPD k-nearest neighbor sampling agent engages in a repeated exchange of trust with the control individual k-nearest neighbor sampling agent. Among 1,000 interactions between control agents (gray) paired with control trustees (gray), investments were large and sustained across early (1 to 5) rounds and late (6 to 10) rounds of the game. However, among 1,000 interactions between control agents (gray) paired with trustee agents sampling from interactions with BPD individuals (red), a decrease in investment level from early to late rounds of the game indicates a failure in cooperation across the iterated exchange. Mean percent invested and SEM are plotted. Found at: doi: /journal.pcbi s003 (0.03 MB PDF)

Figure S4 Clustering healthy versus healthy plus disordered dyads. Here X's represent healthy dyads, while the red and blue o's and +'s represent disordered dyads (the black and white +'s and -'s represent the projection onto the healthy dimensions). If the healthy dyads are clustered alone, then there would be two distinct clusters along the first axis. The disordered dyads would be assigned evenly across these clusters, resulting in no over-representation. In contrast, if all of the dyads are clustered, there would likely be two clusters, one with the o's over-represented, and one with the +'s over-represented. Found at: doi: /journal.pcbi s004 (0.00 MB PDF)

Figure S5 Means and standard deviations of posterior distributions of parameters in four clusters defined by clustering all dyads and healthy dyads only. Found at: doi: /journal.pcbi s005 (0.04 MB PDF)

Figure S6 Model selection. A) Plot of the log marginal likelihoods computed using the method of Lewis-Raftery [37] used by Houser-Keane-McCabe [23]. We ran 9 samplers for each choice of the number of clusters K, number of rounds to look back D, and order of the polynomial P describing investment ratios in a given round in terms of ratios of investment and return in prior rounds. We used the standard quantile function in MATLAB R14 SP3 (Natick, MA) to compute 95% quantiles for our marginal likelihoods, which are plotted in this graph. B) Histograms of log marginal likelihood values for two samplers, showing that the distributions of the log marginal likelihood values are not Gaussian; thus, the Wilcoxon rank-sum test for comparison of medians was used [39]. Found at: doi: /journal.pcbi s006 (0.04 MB PDF)

Table S1 Available data. All trust game datasets used in the analysis are included here, along with appropriate references if applicable. The number of dyads in each dataset is shown, along with the abbreviations used in other figures and tables. Found at: doi: /journal.pcbi s007 (0.03 MB DOC)

Table S2 Matching dyads to clusters. We present the most common and second most common match of all dyads in each original group into each new cluster. As can be seen, assignments are relatively stable over 30,000 draws from the posterior distribution (Fig. S2 shows the polynomial coefficient distributions for the same number of draws from the posterior distribution). Found at: doi: /journal.pcbi s008 (0.25 MB DOC)

Table S3 Groups over-/under-represented in behavioral clusters. We present the number and percentage of each original group matched into each new cluster.
The assignment, as in previous tables, is based on the most common match of each dyad over 30,000 draws from the posterior distribution. Found at: doi: /journal.pcbi s009 (0.04 MB DOC)

Table S4 Model selection. We present log marginal likelihoods estimated using the method of Lewis-Raftery [37] from 9 samplers for each choice of number of clusters K, look-back rounds D, and order of polynomial P describing the dependence of investment ratios in a given round on investment and return ratios in prior rounds. We sort all models by the number of model parameters, and discard models for which >= 6 samplers have an empty type in the mode of all 5,000 draws from the posterior (after the first 3,000 are discarded as burn-in). We used the Wilcoxon rank-sum test [39] to compare a given model's median log marginal likelihood with that of each model with fewer parameters. We chose the model (red) with the largest marginal likelihood for which we can guarantee that it is better than all more parsimonious models. We report, in the right-hand column, for each model, the number of the first model for which we cannot guarantee the marginal likelihood is superior; that is, either (i) the median of this model is lower than the median of the model # in this column or (ii) the Wilcoxon rank-sum test, as implemented in MATLAB R14 SP3 (Natick, MA), does not reject the null hypothesis that the log marginal likelihoods for the two different models come from the same distribution at the 5% significance level. Please also see Figure S5. For the case of two models, K = 4, D = 2, P = 1 and K = 3, D = 2, P = 1, we performed an analysis using 3 samplers to compare the method of marginal likelihood used above, based on the posterior mode [25], with a method, the mean harmonic estimator, that is based on using all draws from the posterior (see [38] for a detailed review of this and other methods of calculating marginal likelihoods and model selection).
We found, for the median value of 3 samplers for each model, the log marginal likelihood for the Lewis-Raftery method preferred the K = 4, D = 2, P = 1 model by units, whereas the mean harmonic estimator preferred the K = 4, D = 2, P = 1 model by units. Thus, the two estimates are in good agreement, further justifying our use of the well-established Lewis-Raftery method. Found at: doi: /journal.pcbi s010 (0.08 MB DOC)

Table S5 Degree of clustering expressed in terms of clinically relevant indices. For each of the four pathologies, this table describes the values of the standard clinically relevant indices: sensitivity, specificity, and positive and negative predictive values. Found at: doi: /journal.pcbi s011 (0.02 MB DOC)

Table S6 Clustering based on all the trustees vs. clustering based only on healthy trustees. For each of the four pathologies, this table describes the over-representation of participants with this pathology in the corresponding cluster. According to our computations (see Methods section), only over-representations of 1.5 and larger are statistically significant. Found at: doi: /journal.pcbi s012 (0.02 MB DOC)

Table S7 Summary statistics of posterior distributions of regression coefficients. Found at: doi: /journal.pcbi s013 (0.46 MB DOC)

Table S8 Relation between clusters based on all dyads and clusters based on only healthy trustees. This table shows that for dyads with healthy trustees, there is a concordance between their cluster assignments across the two clusterings: when we cluster all dyads and when we only cluster dyads with healthy trustees. Of all the healthy dyads which were assigned to Cluster 1 in the clustering based on all the dyads, 100% got assigned to Cluster 1 in the healthy-trustees-only clustering. Of all the healthy dyads which were assigned to Cluster 2 in the clustering based on all the dyads, 91.9% got assigned to Cluster 2 in the healthy-trustees-only clustering.
Of all the healthy dyads which were assigned to Cluster 3 in the clustering based on all the dyads, 95.0% got assigned to Cluster 3 in the healthy-trustees-only clustering. Of all the healthy dyads which were assigned to Cluster 4 in the clustering based on all the dyads, 89.5% got assigned to Cluster 4 in the healthy-trustees-only clustering. Found at: doi: /journal.pcbi s014 (0.02 MB DOC)

Table S9 Clustering based on the investment ratios vs. clustering based on the return ratios. For clustering based on the investment ratios, different optimization techniques lead to the same optimal number of clusters K = 4. For clustering based on the return ratios, different optimization techniques lead to different numbers of clusters: K = 2 and K = 5. For each of the corresponding three clusterings, for each of the four pathologies, we list the largest over-representation and, in parentheses, the number of the cluster in which this over-representation occurs. A blank slot means that the corresponding pathology is not over-represented in any of the clusters. According to our computations (see Methods section), only over-representations of 1.5 and larger are statistically significant. For clustering based on the investment ratios, all over-representations are statistically significant, and different pathology groups are over-represented in different clusters, i.e., this clustering provides a statistically significant separation of different pathologies. For clusterings based on the return ratios, not all over-representations are statistically significant, and some groups are over-represented in the same cluster. Found at: doi: /journal.pcbi s015 (0.02 MB DOC)

Author Contributions

Conceived and designed the experiments: PRM. Performed the experiments: TL PRM. Analyzed the data: MK TL. Contributed reagents/materials/analysis tools: MK TL MV PRM. Wrote the paper: MK TL MV PRM.

References

1. Trivers RL (1971) The evolution of reciprocal altruism. Q Rev Biol 46:
2. Rilling JK, Gutman DA, Zeh TR, Pagnoni G, Berns GS, et al. (2002) A neural basis for social cooperation. Neuron 35:
3. Sanfey AG, Rilling JK, Aronson JA, Nystrom LE, Cohen JD (2003) The neural basis of economic decision-making in the ultimatum game. Science 300:
4. Rilling JK, Sanfey AG, Aronson JA, Nystrom LE, Cohen JD (2004) The neural correlates of theory of mind within interpersonal interactions. Neuroimage 22:
5. Delgado MR, Frank RH, Phelps EA (2005) Perceptions of moral character modulate the neural systems of reward during the trust game. Nat Neurosci 8:
6. King-Casas B, Sharp C, Lomax-Bream L, Lohrenz T, Fonagy P, et al. (2008) The rupture and repair of cooperation in borderline personality disorder. Science 321:
7. King-Casas B, Tomlin D, Anen C, Camerer CF, Quartz SR, et al. (2005) Getting to know you: Reputation and trust in a two-person economic exchange. Science 308:
8. Singer T, Seymour B, O'Doherty JP, Stephan KE, Dolan RJ, et al. (2006) Empathic neural responses are modulated by the perceived fairness of others. Nature 439:
9. Camerer CF (2003) Behavioral game theory: Experiments in strategic interaction. New York: Russell Sage Foundation. 550 p.
10. Camerer CF, Fehr E (2006) When does economic man dominate social behavior? Science 311:
11. Kagel JH, Roth AE (1997) The handbook of experimental economics. Princeton, New Jersey: Princeton University Press. 740 p.
12. Montague PR, Lohrenz T (2007) To detect and correct: Norm violations and their enforcement. Neuron 56:
13. Axelrod R (1984) The Evolution of Cooperation. New York: Basic Books. 256 p.
14. Güth W, Schmittberger R, Schwarze B (1982) An experimental analysis of ultimatum bargaining. J Econ Behav Organ 3:
15. Roth AE (1995) Bargaining experiments. In: Kagel JH, Roth AE, eds. The handbook of experimental economics. Princeton, New Jersey: Princeton University Press. pp
16. Tomlin D, Kayali MA, King-Casas B, Anen C, Camerer CF, et al. (2006) Agent-specific responses in the cingulate cortex during economic exchanges. Science 312:
17. Camerer C, Weigelt K (1988) Experimental tests of a sequential equilibrium reputation model. Econometrica 56:
18. Weigelt K, Camerer CF (1988) Reputation and corporate strategy: A review of recent theory and applications. Strategic Management Journal 9:
19. Berg J, Dickhaut J, McCabe K (1995) Trust, reciprocity, and social history. Games Econ Behav 10:
20. Chiu PH, Kayali MA, Kishida KT, Tomlin D, Klinger LG, et al. (2008) Self responses along cingulate cortex reveal quantitative neural phenotype for high-functioning autism. Neuron 57:
21. Yoshida W, Dolan RJ, Friston KJ (2008) Game Theory of Mind. PLoS Comput Biol 4: e
22. Houser D, Bechara A, Keane M, McCabe K, Smith V (2005) Identifying individual differences: An algorithm with application to Phineas Gage. Games Econ Behav 52:
23. Houser D, Keane M, McCabe K (2004) Behavior in a dynamic decision problem: An analysis of experimental evidence using a Bayesian type classification algorithm. Econometrica 72:
24. Hampton AN, Bossaerts P, O'Doherty JP (2008) Neural correlates of mentalizing-related computations during strategic interactions in humans. Proc Natl Acad Sci U S A 105:
25. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B (Methodological) 39:
26. Greenberg E (2008) Introduction to Bayesian Econometrics. Cambridge: Cambridge University Press. 224 p.
27. Diebolt J, Robert CP (1994) Estimation of finite mixture distributions through Bayesian sampling. J R Stat Soc Series B (Methodological) 56:
28. Lord C, Rutter M, Couteur A (1994) Autism diagnostic interview-revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. J Autism Dev Disord 24:
29. Rotter JB (1967) A new scale for the measurement of interpersonal trust. J Pers 35:
30. Trull TJ, Durrett CA (2005) Categorical and dimensional models of personality disorder. Annu Rev Clin Psychol 1:
31. Hubert L, Arabie P (1985) Comparing Partitions. J Classif 2:
32. Gentleman R, Ihaka R, Bates D, Chambers J, Dalgaard P, et al. (2007) R: A Language and Environment for Statistical Computing. Vienna (Austria): R Foundation for Statistical Computing, ISBN Available: Accessed 1 January
33. Fraley C, Raftery A (2008) Function adjustedRandIndex. In mclust: Model-Based Clustering/Normal Mixture Modeling, R package, version Available: Accessed 1 August
34. American Psychiatric Association (2000) Diagnostic and statistical manual of mental disorders. Washington, DC: American Psychiatric Association. 943 p.
35. Raftery AE, Lewis S (1992) How many iterations in the Gibbs sampler? In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, eds. Bayesian Statistics 4. Oxford: Oxford University Press. pp
36. Smith BJ (2005) BOA: Bayesian Output Analysis Program, version Available: Accessed 1 January
37. Lewis SM, Raftery AE (1997) Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. J Am Stat Assoc 92:
38. Gelman A, Carlin JB, Stern HS, Rubin DB (2003) Bayesian Data Analysis, Second Edition. Boca Raton, Florida: Chapman & Hall/CRC. 696 p.
39. Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1:

PLoS Computational Biology, October 2010, Volume 6, Issue 10

Computational Phenotyping of Two-Person Interactions Reveals Differential Neural Response to Depth-of-Thought

Ting Xiang 1, Debajyoti Ray 2, Terry Lohrenz 3, Peter Dayan 4, P. Read Montague 3,5 *

1 Department of Neuroscience, Baylor College of Medicine, Houston, Texas, United States of America, 2 Computation and Neural Systems, California Institute of Technology, Pasadena, California, United States of America, 3 Virginia Tech Carilion Research Institute and Department of Physics, Virginia Tech, Roanoke, Virginia, United States of America, 4 Gatsby Computational Neuroscience Unit, University College London, London, United Kingdom, 5 Wellcome Trust Centre for Neuroimaging, London, United Kingdom

Abstract

Reciprocating exchange with other humans requires individuals to infer the intentions of their partners. Despite the importance of this ability in healthy cognition and its impact in disease, the dimensions employed and computations involved in such inferences are not clear. We used a computational theory-of-mind model to classify styles of interaction in 195 pairs of subjects playing a multi-round economic exchange game. This classification produces an estimate of a subject's depth-of-thought in the game (low, medium, high), a parameter that governs the richness of the models they build of their partner. Subjects in each category showed distinct neural correlates of learning signals associated with different depths-of-thought. The model also detected differences in depth-of-thought between two groups of healthy subjects: one playing patients with psychiatric disease and the other playing healthy controls. The neural response categories identified by this computational characterization of theory-of-mind may yield objective biomarkers useful in the identification and characterization of pathologies that perturb the capacity to model and interact with other humans.
Citation: Xiang T, Ray D, Lohrenz T, Dayan P, Montague PR (2012) Computational Phenotyping of Two-Person Interactions Reveals Differential Neural Response to Depth-of-Thought. PLoS Comput Biol 8(12): e doi: /journal.pcbi

Editor: Olaf Sporns, Indiana University, United States of America

Received June 12, 2012; Accepted October 31, 2012; Published December 27, 2012

Copyright: © 2012 Xiang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by a Wellcome Trust Principal Research Fellowship (PRM), The Kane Family Foundation (PRM), NIDA grant R01DA11723 (PRM), NIMH grant R01MH (PRM), NIA grant RC4AG (PRM), and The Gatsby Charitable Foundation (DR, PD). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* read@vt.edu

Introduction

Many of the inferences that we make about others, or about their models of us, are silent and subtle [1,2]. One route for understanding the neural basis of such inferences comes from building computational models of social exchange that quantify their nature and evolution over the course of interactions. Recent behavioral and neuroimaging work in this area has employed interactive economic games that required subjects to model their partners' strategies [3-6]. This work focused on relatively small cohorts of subjects, or on subjects knowingly playing a computer partner. Therefore, questions about individual differences in styles of play, and whether or not the partner was treated by the brain like a human partner, remain largely open (but see [6]). Figure 1 outlines the strategy of the approach.
We used a multi-round reciprocation game (the multi-round trust game, Figure 1A), classifying the play of a large number (n = 195) of pairs of players (dyads) [7–9] via a computational realization of the models of each other that they build [10]. This classification used the observed patterns of monetary exchange to infer two parameters important for all such exchanges: (1) the sensitivity of a subject to deviations from fair splits of money across the two players (called inequality aversion) [11]; and (2) the subject's depth-of-thought or cognitive level in the game, that is, the depth to which they modeled their interaction with their partner [12]. After classification along these two dimensions, we sought neural correlates of learning signals (interpersonal error signals) inferred by the model that are important for playing the game successfully (Figure 1B and 1C). We describe the model below.

A player's type is represented by her degree of inequality aversion. Players value immediate payoffs, but to a lesser degree if the split of money between them is inequitable [13]:

$$U_i(x_i, x_j; \alpha_i, \beta_i) = x_i - \alpha_i \max(x_j - x_i, 0) - \beta_i \max(x_i - x_j, 0) \tag{1a}$$

where $x_i$ is the money obtained by player i and $x_j$ is the amount obtained by player j. Two sorts of inequity are potentially important: envy (partner j gets more than subject i; $\alpha_i$ in eqn 1a) and guilt (subject i gets more than partner j; $\beta_i$ in eqn 1a). The envy and guilt parameters comprise what we consider as the type of a player. Empirically, the majority of investors invest more than half of the endowment and the modal behavior of trustees is to split the sum of money evenly. Hence, the influence of envy on subjects' choices was minimal. For simplicity, we assume $\alpha_i = 0$ and consider only guilt, the aversion to inequity favorable to the subject, as the way to type a player. Therefore, player i's type is fully described by $\beta_i \in [0, 1]$, the guilt parameter.
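To make the inequality-aversion utility concrete, the following is a minimal Python sketch of equation (1a) applied to one round of the trust game. The payoff mapping (a $20 endowment, a tripled investment, a repaid fraction) follows the description in Figure 1A; the function names and the example values (half invested, half repaid, guilt of 0.4) are illustrative assumptions, not the authors' code.

```python
def round_payoffs(invest_frac, repay_frac, endowment=20.0, multiplier=3.0):
    """Monetary payoffs (investor, trustee) for one round of the trust game."""
    sent = invest_frac * endowment      # amount the investor sends
    pot = multiplier * sent             # investment is tripled en route
    returned = repay_frac * pot         # trustee repays a fraction of the pot
    investor = endowment - sent + returned
    trustee = pot - returned
    return investor, trustee


def utility(x_own, x_other, alpha=0.0, beta=0.0):
    """Equation (1a): own payoff minus envy and guilt penalties.

    alpha weights envy (the other player gets more); beta weights guilt
    (this player gets more). With alpha = 0 only the guilt term remains.
    """
    envy = max(x_other - x_own, 0.0)
    guilt = max(x_own - x_other, 0.0)
    return x_own - alpha * envy - beta * guilt


# Hypothetical round: invest half of $20, trustee repays half of the tripled amount.
investor, trustee = round_payoffs(invest_frac=0.5, repay_frac=0.5)
print(investor, trustee)                      # 25.0 15.0
print(utility(investor, trustee, beta=0.4))   # 21.0
```

In this example the investor ends the round $10 ahead of the trustee, so a guilt parameter of 0.4 lowers the investor's utility from 25 to 21, while the trustee's utility stays at 15 because envy is set to zero.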
The utility function becomes:

$$U_i(x_i, x_j; \alpha_i, \beta_i) = x_i - \beta_i \max(x_i - x_j, 0) \tag{1b}$$

The second important feature of the model is depth-of-thought in the game [12], which derives from the estimates that each player maintains about the type of their partner. To maximize long-run utility, a player must estimate this type, and update the estimate when observing their partner's actions. Of course, I may estimate your type, your estimate of my type, your estimate of my estimate of your type, and so forth [14]. Thus we define deeper thinkers in the game as those who use more sophisticated simulations of play of this sort to update these estimates.

Computational Phenotyping of Social Interactions

Author Summary

Human social interactions are extraordinarily rich and complex. The ability to infer the intentions of others is essential for successful social interactions. Although most of our inferences about others are silent and subtle, traces of their effects can be found in the behavior we exhibit in various tasks, notably repeated economic exchange games. In this study, we use a computational model that uses an explicit form of other-modeling to classify styles of play in a large cohort of subjects engaging in such a game. We classify players according to their depth of recursive reasoning (depth-of-thought), finding three groups whose performance throughout the task differed according to several measures. Neuroimaging results based on the model classification show a differential neural response to depth-of-thought. The model also detected differences in depth-of-thought between two groups of healthy subjects: one playing patients with psychiatric disease and the other playing healthy controls. These results demonstrate the power of a quantitative approach to examining behavioral heterogeneity during social exchange, and may provide useful biomarkers to characterize mental disorders when the capacity to make inferences about others is impaired.

PLOS Computational Biology 1 December 2012 Volume 8 Issue 12 e
A range of behavioral data suggests one strong constraint on how subjects model their partners, that is, they assume that their partners play one level less sophisticated than themselves [15]. We assume that all players plan ahead and choose actions that have beneficial consequences, but differ in how they interpret the signals from their partners to update their beliefs, and how they expect their partners to perceive them through their actions. To estimate one's partner's type, a level 0 subject does not simulate his partner's play, but assumes his partner is level 0, i.e. also has a naïve model of them. A level 1 subject assumes his partner is a level 0 player and simulates how a level 0 partner makes choices. A level 2 subject assumes his partner is level 1 and simulates how a level 1 player interacts throughout the game. This kind of recursion lies at the heart of mentalizing (simulating) other autonomous agents who concurrently generate models of us; it also lies at the heart of many models of predator-prey interactions [16].

The computational model of behavior: simulating interactions with one's partner

Here we write the model for how player i forms an estimate of optimal play at each round t by calculating the values $Q_i^t$ of their possible actions $a_i^t$. The actions are the amounts to invest or to return. The Q values are the expected summed utilities over the next two rounds (as a simplification, players are assumed to look at the current round and the round after). The utility for player i depends on the actions of player j, which in turn depend on the type of player j, and the reasoning that player j does about player i. Player i does not know player j's type, but can learn about it from the history of their interactions, which, up to round t, is $D^t = \{(a_i^1, a_j^1), \ldots, (a_i^{t-1}, a_j^{t-1})\}$.
Formally, player i maintains beliefs $B_i^t$, in the form of a probability distribution over the type of player j, and computes expected utilities by averaging over these beliefs. Bayes' theorem is used to update the beliefs based on evidence. The $Q_i^t$ value on round t is a sum of two expectations:

$$Q_i^t(a_i^t, B_i^t) = \langle U_i^t \rangle_{B_i^t} + \sum_{a_j^t} P(a_j^t \mid \{a_i^t, D^t\}) \sum_{a_i^{t+1}} Q_i^{t+1}(a_i^{t+1}, B_i^{t+1}) \, P(a_i^{t+1} \mid \{a_i^t, a_j^t, D^t\}) \tag{2}$$

The first is the utility of the exchange on that round. This is

$$\langle U_i^t \rangle_{B_i^t} = \sum_{a_j^t} P(a_j^t \mid \{a_i^t, D^t\}) \, U_i(a_i^t, a_j^t; \beta_i)$$

where, for convenience, we write $U_i(a_i^t, a; \beta_i)$ as a function of the possible actions $a$ of player j rather than the money this player earns. The second term in equation (2) concerns the value of future rounds in the exchange (except in the last round, where this term is 0). This is thus an average over Q values $Q_i^{t+1}(a_i^{t+1}, B_i^{t+1})$ on round t+1, where the new beliefs $B_i^{t+1}$ take account of the action $a_i^t$ being considered by player i, and all the possible actions $a_j^t$ of player j. Equation (2) is a form of Bellman evaluation equation. The players can calculate the Q values, including updating the beliefs, by simulating the course of play with their partners. This simulation is a central feature of the model, with players adopting higher levels of depth-of-thought requiring more simulation (see belief updates in Supporting Information).

Results

Classification of interpersonal interaction

The model described above constitutes a full generative account of a subject playing the multi-round trust game, and incorporates several key cognitive mechanisms engaged by such a staged interpersonal interaction. Player i is characterized by their private type $\beta_i$, their depth-of-thought level $k_i$, and the prior beliefs $B_i^0$. Player j is characterized in just the same way. We estimated the parameters of both players in each dyad by maximizing the log likelihood of their choices over the 10 rounds of the game.
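As a rough illustration of two ingredients of this generative model, the sketch below shows an investor averaging utilities over beliefs about the trustee's type (the first term of equation (2)) and updating those beliefs with Bayes' theorem after observing a repayment. It is a deliberately simplified stand-in, not the authors' implementation (which is described in Text S1): the five-point action grid, the three-point grid of guilt values, the softmax choice rule with its temperature, and the myopic trustee are all assumptions made for the example, and the look-ahead term of equation (2) is omitted.

```python
import numpy as np

ENDOWMENT, MULTIPLIER = 20.0, 3.0
REPAY = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # hypothetical trustee action grid
TYPES = np.array([0.0, 0.5, 1.0])              # hypothetical grid of trustee guilt values
TEMPERATURE = 5.0                              # hypothetical softmax temperature


def payoffs(invest, repay):
    """Investor and trustee earnings for an invested and a repaid fraction."""
    pot = MULTIPLIER * invest * ENDOWMENT
    return ENDOWMENT - invest * ENDOWMENT + repay * pot, (1.0 - repay) * pot


def guilt_utility(x_own, x_other, beta):
    """Guilt-only utility, as in equation (1b)."""
    return x_own - beta * np.maximum(x_own - x_other, 0.0)


def trustee_policy(invest, beta_j):
    """P(a_j | a_i, type): softmax over one-round trustee utilities (assumed)."""
    x_inv, x_tru = payoffs(invest, REPAY)
    u = guilt_utility(x_tru, x_inv, beta_j) / TEMPERATURE
    w = np.exp(u - u.max())
    return w / w.sum()


def predicted_repay(invest, belief):
    """P(a_j | a_i, D): type-specific policies averaged over the belief."""
    policies = np.array([trustee_policy(invest, b) for b in TYPES])
    return belief @ policies


def expected_utility(invest, belief, beta_i):
    """Belief-averaged immediate utility of an investment for the investor."""
    x_inv, x_tru = payoffs(invest, REPAY)
    return predicted_repay(invest, belief) @ guilt_utility(x_inv, x_tru, beta_i)


def update_belief(belief, invest, repay_index):
    """Bayes' rule: posterior over trustee types after one observed repayment."""
    likelihood = np.array([trustee_policy(invest, b)[repay_index] for b in TYPES])
    posterior = belief * likelihood
    return posterior / posterior.sum()


belief = np.ones(len(TYPES)) / len(TYPES)      # flat prior over trustee types
new_belief = update_belief(belief, invest=0.5, repay_index=2)
print(expected_utility(0.5, belief, beta_i=0.4), new_belief)
```

Under these assumptions, observing the trustee repay half of the tripled amount shifts probability toward the higher guilt values, which is the qualitative behavior the belief update is meant to capture.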
The averaged maximal log likelihood of all 195 investors was In our model, we assume that players take one of five possible actions. If all the five possible actions were chosen with equal probability, the log likelihood would take the value $10 \cdot \log(1/5) = -16.1$. Our computational theory-of-mind model fitted the behavior significantly better than assuming that players act randomly (one-sample test, P = ). For the purposes of comparison, we also built a reinforcement learning (RL) model incorporating inequality aversion (details in Supporting Information). We found that the RL model performed poorly; when we optimized the learning rate in the model, the optimum was degenerate in the sense that no learning occurred, and all actions were selected with equal probability (random choices).

Figure 1. Classification of investors. A) One player ("investor") is endowed with $20 at the beginning of each round. The investor chooses any fraction I of the $20 to send to the other player ("trustee"). The investment is tripled to 3I en route to the trustee. The trustee chooses a fraction R of the tripled amount (3I) to repay. Subjects play the same partner for ten consecutive rounds. B) Using the observed exchanges between the players, investors are classified according to their estimated inequality aversion and their depth-of-thought (strategic level) in the game (see main text for a description of the generative model). All 195 pairs were included in this classification; this included 55 pairs where the trustee was diagnosed with Borderline Personality Disorder. C) First and second order interpersonal prediction errors are sought in the investors' brain responses separately for each depth-of-thought category. The 1st order interpersonal prediction error is taken as the difference between the actual repayment ratio R and the expected amount due to the investor's model of the trustee's repayment. The 2nd order prediction error is taken as the difference between the investment ratio I and the investor's model of the trustee's model of what the investor will send; hence the term second order error. doi: /journal.pcbi g001

Figure 2A shows the frequency histogram of depth-of-thought classification achieved by inverting the generative model described above. About half of all the 195 investors are classified as strategic level 0. The remaining investors are almost equally divided into level 1 and level 2 players. There are significant dynamic behavioral features that correlate with the depth-of-thought levels that we estimate using our model.
The style of play across rounds of the game is different and correlates well with the intuition that players with higher depths-of-thought are sensitive to richer features of the game than those possessing lower levels. In Figure 2B, among all 195 investors, level 1 and level 2 players start the game with high offers and maintain them throughout the game, except that the highest depth-of-thought players decrease their offers towards the end of the game (which is strategic). Moreover, level 0 investors open low and stay low throughout the game, a strategy that tends to break cooperation with the trustee. Lastly, level 1 and 2 players make significantly more money overall than level 0 players (Figure 2C).

Figure 2. Investor depth-of-thought classification separates distinct behavioral trajectories through the game. A) The distribution of depth-of-thought levels in all 195 investors. About half of the investors are classified as having depth-of-thought level 0. The remaining half is almost equally divided into having depth-of-thought level 1 and 2. B) Investment ratios by rounds from all three levels of depth-of-thought investors: level 0 (n = 102), level 1 (n = 49), level 2 (n = 44). C) Total monetary points earned at the end of the game in all three levels of investors. Both level 1 and level 2 investors made significantly more points than level 0 investors (Tukey HSD test, P < 10^-6 and P < 10^-5, respectively). No significant difference in total earnings was found between level 1 and level 2 investors (P > 0.1). Error bars represent standard errors (SE). doi: /journal.pcbi g002

Neural representations

According to the generative model, players make predictions about the likely course of events through the game. These predictions lead naturally to prediction errors, which can be used to generate control signals to guide choices. In games against nature, prediction errors associated with rewarding outcomes have frequently been observed in the BOLD signal measured in striatal regions [17–19]. Games against other players offer much richer possibilities for neural responses since players have a range of interpersonal signals that they can model (e.g. Figure 1C). We here focus on the investor side of the interaction because this role has proved to be particularly sensitive for classifying styles of play in prior work [20].

Two types of interpersonal prediction errors emerge naturally in the reciprocating interactions of the multi-round trust game. The first order prediction error in the investor is a comparison between the investor's current model of what the trustee will return and the amount actually returned. This error is computed at the time that the repayment from the trustee is revealed to the investor. This error requires information sent back from the trustee.
By contrast, the second order prediction error in the investor requires a comparison between the investor's offer and the investor's internal model of what the trustee expects from the investor, that is, information that is exclusively internal to the investor. This information is available to the investor before any immediate feedback from the trustee, and is potentially available during the entire epoch, starting from the time of the cue and up until the time when the actual investment is made. In this paper, we choose the time the investor submits as a natural trigger for this signal, but with the understanding that it might have been computed and thus available earlier. Thus, the first order error can be evaluated at the time the repayment from the trustee is revealed. In a similar spirit, the second order error is defined at the time the investor's offer is submitted since it is at this time that the investor's brain can compare their actual offer to their (internal) model of what the trustee expects.

Our hypothesis for the first order interpersonal prediction error was that players classified as level 0 would display a large response to this error, while the higher levels would not, since this signal is not a critical component of the high level players' planning. We divided the first order interpersonal prediction error of all 195 healthy investors classified within a certain cognitive level into quintiles, performed separate GLM analysis at individual rounds, and then generated contrasts between rounds with high 1st order prediction errors (>60%) and rounds with low 1st order prediction errors (≤40%) on the beta images of the events of interest.
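The two error definitions and the high/low split described above can be sketched as follows. This is only an illustration under stated assumptions: the errors are treated as signed differences pooled over the rounds of one depth-of-thought group, the cut points are the 40th and 60th percentiles of that pool, and the function names and toy data are hypothetical rather than taken from the authors' analysis code.

```python
import numpy as np


def first_order_error(actual_repay, expected_repay):
    """Repayment actually received minus the investor's predicted repayment."""
    return actual_repay - expected_repay


def second_order_error(actual_invest, trustee_expected_invest):
    """Investment sent minus what the investor thinks the trustee expects."""
    return actual_invest - trustee_expected_invest


def split_high_low(errors):
    """Masks for rounds in the top two (>60%) and bottom two (<=40%) quintiles."""
    errors = np.asarray(errors, dtype=float)
    low_cut, high_cut = np.percentile(errors, [40, 60])
    return errors > high_cut, errors <= low_cut


# Toy data: ten rounds whose prediction errors happen to be 0, 1, ..., 9.
errors = np.arange(10.0)
high, low = split_high_low(errors)
print(errors[high])   # rounds entering the "high" side of the contrast
print(errors[low])    # rounds entering the "low" side of the contrast
```

Rounds falling in the middle quintile belong to neither mask, which mirrors the contrast between the top two and bottom two quintiles described in the text.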
The contrast analysis at the revelation of the trustee's repayment showed that level 0 investors (n = 102) had robust activations in bilateral striatal regions (Figure 3A, whole-brain FDR corrected at P < 0.05; peak MNI coordinates: right caudate (8, 12, 0), t = 4.49, 57 voxels; left caudate (−12, 12, 4), t = 3.74, 73 voxels; right putamen (24, 4, 0), t = 4.02, 88 voxels; left putamen (−24, 4, 4), t = 4.10, 72 voxels). These striatal activations were not observed in investors with level 1 (n = 49) or level 2 (n = 44) depth-of-thought. We also performed a direct comparison among investors with different depth-of-thought levels on the 1st order interpersonal prediction errors using ANOVA. The group contrast results showed that the level 0 investors had higher caudate activation than level 1 investors (Figure 3B left, P < 0.001, uncorrected; peak MNI coordinates: (4, 16, 0), t = 4.04, FWE corrected at P < 0.05 with small volume correction applying the anatomical mask of bilateral caudate). We also found that level 2 investors had higher right temporal-parietal junction (TPJ) activation than level 0 investors associated with the 1st order interpersonal prediction errors (Figure 3B right, whole-brain FDR corrected at P < 0.05; peak MNI coordinates: (52, −48, 28), t = 4.70, 7 voxels).

Our hypothesis for the second order interpersonal prediction error was that players classified as level 0 would display no response to this higher order interpersonal error (since their model of the other's model of themselves is impoverished), whereas players classified as higher level would. We divided the second order interpersonal prediction error of all 195 healthy investors classified within a certain cognitive level into quintiles, performed separate GLM analysis at individual rounds, and then generated contrasts between rounds with high 2nd order prediction errors (>60%) and rounds with low 2nd order prediction errors (≤40%) on the beta images of the events of interest. The contrast at the submission of the investor's decisions revealed that level 2 investors had significant activations in bilateral putamen (Figure 3C, whole-brain FDR corrected at P < 0.05; peak MNI coordinates: right putamen (24, 8, −4), t = 3.79, 23 voxels; left putamen (−20, 8, −4), t = 3.11, 7 voxels). We did not observe any striatal activations in level 0 and level 1 investors for the 2nd order prediction errors. We also performed an ANOVA analysis on the three depth-of-thought levels of investors.
The group contrast analysis found that level 2 investors had higher ventral striatal activation than level 0 investors when computing the 2nd order interpersonal prediction errors (Figure 3D, P < 0.005 uncorrected; peak MNI coordinates (12, 8, −12), t = 3.41, FWE corrected at P < 0.05 with small volume correction applying the anatomical mask of bilateral caudate).

It is possible that when grouping the rounds according to the high or low quintiles of prediction errors, some subjects might be exclusively included in the high group, or in the low group. This raised the concern that the contrast results above might be biased by those distinct subjects. We therefore counted the number of subjects only present in the high group, or in the low group, for the 1st and 2nd order interpersonal prediction errors, respectively. We showed that the vast majority of subjects made contributions to all quintiles of prediction errors, with only an extremely small number of subjects contributing to just the high or low quintiles (Table S1). We also plotted the magnitudes of the interpersonal prediction errors divided into high or low quintiles across the depth-of-thought levels. We did this to rule out the possibility that a few subjects were dominating the observed results. The differences between the high and low quintiles were comparable across all the three levels of investors for both the 1st and 2nd order interpersonal prediction errors (Figure 4). Thus, the differential neural activations to the prediction errors observed here cannot be attributed to the differences in the magnitudes of prediction errors per se.

Figure 3. Interpersonal prediction errors: differential neural response as a function of investor depth-of-thought. A) Contrast analysis between rounds with high (>60%) and low (≤40%) 1st order interpersonal prediction errors when repayments were revealed. Level 0 investors (n = 102) had robust activations in bilateral striatal regions (whole-brain FDR corrected at P < 0.05; peak MNI coordinates: caudate (8, 12, 0), t = 4.49; putamen (24, 4, 0), t = 4.02). These striatal activations were not observed in investors with level 1 (n = 49) or level 2 (n = 44) depth-of-thought. B) Group contrast analysis on the 1st order interpersonal prediction errors. Left, level 0 investors had higher caudate activation than level 1 investors (P < 0.001, uncorrected; peak MNI coordinates: (4, 16, 0), t = 4.04, FWE corrected at P < 0.05 with small volume correction applying the anatomical mask of bilateral caudate). Right, level 2 investors had higher right temporal-parietal junction (TPJ) activation than level 0 investors associated with the 1st order interpersonal prediction errors (whole-brain FDR corrected at P < 0.05; peak MNI coordinates: (52, −48, 28), t = 4.70, 7 voxels). C) Contrast analysis between rounds with high (>60%) and low (≤40%) 2nd order interpersonal prediction errors when investments were submitted. Level 2 investors had significant activations in bilateral putamen (whole-brain FDR corrected at P < 0.05; peak MNI coordinates: putamen (24, 8, −4), t = 3.79). We did not observe any striatal activations in level 0 and level 1 investors for the 2nd order prediction errors. D) Group contrast analysis on the 2nd order interpersonal prediction errors. Level 2 investors had higher ventral striatal activation than level 0 investors when computing the 2nd order interpersonal prediction errors (P < 0.005 uncorrected; peak MNI coordinates (12, 8, −12), t = 3.41, FWE corrected at P < 0.05 with small volume correction applying the anatomical mask of bilateral caudate). Color bars display t scores. doi: /journal.pcbi g003

Figure 4. Magnitude of interpersonal prediction errors as a function of estimated depth-of-thought for investors. Average 1st order (A) and 2nd order (B) interpersonal prediction errors: low (bottom two quintiles), high (top two quintiles). The differences between the high and low 1st order interpersonal prediction errors were as follows: level 0 investors (mean = 10.05, SE = 0.38), level 1 investors (mean = 15.97, SE = 0.55), level 2 investors (mean = 14.30, SE = 0.58). The differences between the high and low 2nd order interpersonal prediction errors were: level 0 investors (mean = 9.76, SE = 0.22), level 1 investors (mean = 10.62, SE = 0.31), level 2 investors (mean = 11.72, SE = 0.33). doi: /journal.pcbi g004

Biosensor manipulation: Trustee types induce depth-of-thought distributions in healthy investors

Earlier work [9] found that trustees diagnosed with Borderline Personality Disorder (BPD) played uncooperatively to an extent that they could not maintain the cooperation of their partner investor. In that work, the impact of the trustee behavior was read out through the willingness of the investor to sustain high offer levels throughout the rounds of the game. Figure 5 shows two distributions of estimated investor depth-of-thought levels as a function of distinct trustee types. Panel A shows the distribution when healthy investors play anonymous healthy trustees (n = 48 pairs).
In this exchange, healthy subjects never meet their partner before the game and do not see or meet them after the game. They arrive at the lab and are randomly assigned roles in separate rooms. Panel B shows the distribution when healthy investors play subjects diagnosed with borderline personality disorder. There is a more dramatic shift toward lower depth-of-thought levels despite the fact that these subjects play the healthy investor anonymously. The distributions in panels A and B are statistically different (see legend Figure 5). We also recruited 38 trustees matched for lower socio-economic status (SES) as an SES match for the borderline personality disorder trustees. These trustees also played anonymously and induced a similar lower depth-of-thought distribution in the investors (Figure S2), suggesting that lower SES may be one source of influence for the incapacity of the borderline subjects to sustain cooperation with their investor partners.

Figure 5. Distribution of depth-of-thought in investors as a function of trustee group. A) Anonymous trustees (n = 48) remain anonymous to their investor partner for the entire game (and vice versa). B) Borderline personality disorder trustees were identified through an extensive set of formal interview procedures (see King-Casas et al., 2008). On Fisher's exact test, the borderline personality disorder-induced investor depth-of-thought distribution was significantly different from that of investors playing anonymous trustees (panel A; p = ). doi: /journal.pcbi g005

Discussion

In this paper, we used a Bayesian computational model that involves an explicit representation of theory of mind to classify a large number of subjects playing an economic exchange game. We used the model to assess their level of depth-of-thought. Our classification produces three levels of players whose behaviour correlates with important measures of performance through the task. Neuroimaging results based on the model classification showed a differential response to depth-of-thought. Additionally, we found a significant difference for investor depth-of-thought distributions when comparing play with healthy trustees to play with subjects diagnosed with borderline personality disorder (BPD), a disorder known to disrupt interpersonal interactions. BPD subjects are characterized by their unstable relationships, and when they have played this game, they have tended to break cooperation. Indeed, it has been shown that, for this group, the anterior insula failed to sense the opponent's low offers [8].

The striatum has long been shown to encode reward prediction error signals in both passive and instrumental conditioning tasks [17,21–23]. Recently, striatal activation has also been observed in social learning tasks [24] and tasks requiring mentalizing a partner's intention [3]. Here we found that striatum activity correlated with two types of interpersonal prediction errors evoked in a repeated social exchange game, and that these signals were modulated by players' depth-of-thought levels.
Level 0 players, but not level 2 players, had robust activations in the striatum to high 1st order interpersonal prediction errors, suggesting that the naïve players were particularly sensitive to their opponent's actions and mainly used this type of error to adjust their own action policy. However, the striatum in level 2 players responded only to the 2nd order interpersonal errors, suggesting that these relatively sophisticated players discounted the direct influence of their opponent's actions and instead put more emphasis on simulating and manipulating their opponent's beliefs and actions. Other imaging experiments requiring subjects to model others' intentions have also reported activations in frontoparietal regions [3,5,24]. It is not clear why frontoparietal regions were not observed in our paradigm. However, there is a clear path from known error signaling in the striatum to our observations here of 2nd order interpersonal prediction errors, since a 2nd order prediction error can be seen as a direct proxy for future returns to the investor. In this reciprocation game, we have previously reported that deviations from neutral reciprocity or tit-for-tat behavior cause players to change their behavior [7,9]. Therefore, an investment that deviates positively from what the trustee expects (based on their model of the investor) should generate a positive error signal in the trustee's brain, which would itself lead to the investor expecting an increased return. Under this interpretation, the signal is exactly analogous to the range of prediction error signals that show up encoded in BOLD responses in the striatum.

These neural results are congruent with our behavioral observations. The most sophisticated level 2 investors invested high at the beginning to cultivate trust and promote cooperation with their partners. But towards the end of the exchange, they responded to the horizon of the game and risked less money, reflecting their manipulative maneuver in the beginning.
Furthermore, we found that the sophisticated level 2 investors had higher activations in the right TPJ in response to the 1st and 2nd order interpersonal prediction errors than the naïve level 0 investors. Right TPJ has been demonstrated to play a critical role in belief reasoning tasks involving theory of mind [25,26]. Right TPJ has also been found to be specifically modulated in people with higher strategic levels [27]. Furthermore, the coordinates of the peak voxel of this activation place it in a recently designated posterior region of the TPJ (TPJp) that is well-connected to areas identified with social cognition [28]. The TPJ activation and its specific location within TPJ are consistent with the idea that level 2 investors build more sophisticated models of their opponents.

Computational accounts developed in the framework of Markov Decision Processes (MDP), and in particular reinforcement learning models [29], have been successful in representing behavior and illuminating neural substrates in situations where agents interact with nature, and in which the environmental states are fully observable. Such models have furthered our understanding of the role of dopamine and related neural structures in reward learning and decision-making [30,31]. However, those models are limited in the typical social situations where agents interact and effectively create an ever-changing, adapting landscape, which is plausibly a raison d'être for sophisticated cognition. Recently, some progress has been made in establishing model-based approaches to social interaction [3,4,32,33]. Our approach makes a commitment to an explicit, generative model of higher-order thinking about other social actors, some aspects of which are in common with the recent work by Yoshida et al. (who also use their models to compare autistic and healthy subjects) [4–6]. The space of such models is vast, and explicit choices must be made at many steps [4,10]. Nonetheless, our model is able to capture striking heterogeneity in the behavior, which we are then able to connect to differences in neural activity. Further developments of this approach that also incorporate genetic data promise to help uncover the genetic underpinnings of social heterogeneity.

Materials and Methods

Ethics statement

Informed consent was obtained for all research involving human participants, and all clinical investigation was conducted according to the principles expressed in the Declaration of Helsinki. All procedures were approved by the Institutional Review Board of the Baylor College of Medicine.
Subject characteristics

Data from four groups, in total 195 pairs of subjects (18–64 yrs) who played the trust game previously [5–8], were examined, including an Impersonal group (48 pairs), a Personal group (54 pairs), a BPD group (55 pairs), and a BPD control group (38 pairs). Subject pairs from the Impersonal, BPD, and BPD control groups never met each other throughout the experiment. Subject pairs in the Personal group were introduced to each other before playing the task. Trustees in the BPD group were diagnosed with borderline personality disorder (BPD), and were matched to trustees in the BPD control group on socioeconomic status (SES). In addition, investors in the BPD and BPD control groups were recruited with socioeconomic status matched to trustees. Investors in the Impersonal group were students from Caltech and Baylor College of Medicine.

Image acquisition and preprocessing

All scans were carried out on 3.0 Tesla Siemens Allegra scanners. High-resolution T1-weighted scans (1.0 mm × 1.0 mm × 1.0 mm) were acquired using an MP-RAGE sequence (Siemens). Subjects then played the iterated trust game for 10 rounds while undergoing whole-brain functional imaging. The detailed settings for the functional run were as follows: echoplanar imaging, gradient recalled echo; repetition time (TR) = 2000 ms; echo time (TE) = 40 ms; flip angle = 90°; matrix, 26 4-mm axial slices angled parallel to the anteroposterior commissural line, yielding functional 3.3 mm × 3.3 mm × 4.0 mm voxels. Images were analyzed using SPM2 ( uk/spm/software/spm2/). Slice timing correction was first applied to temporally align all the images. Motion correction to the first functional image was performed using a 6-parameter rigid-body transformation. The average of the motion-corrected images was co-registered to each subject's structural images using a 12-parameter affine transformation.
Images were subsequently spatially normalized to the Montreal Neurological Institute (MNI) template by applying a 12-parameter affine transformation, followed by nonlinear warping using standard basis functions. Finally, images were smoothed with an 8 mm isotropic Gaussian kernel and then high-pass filtered (128 s width) in the temporal domain.

General Linear Model (GLM) analysis

Separate general linear models were specified for individual rounds of each subject (6). All visual stimuli, motor responses and motion parameters were entered as separate regressors that were constructed by convolving each event onset with a canonical hemodynamic response function in SPM2. Beta maps were estimated for regressors of interest. The SPM images shown in Figure 3 were generated as follows: both the first-order and second-order interpersonal prediction errors of subjects classified with the same depth-of-thought were divided into quintiles. For the 1st order interpersonal prediction errors, beta images associated with the event when the repayments were revealed were sorted according to the prediction error quintiles. Contrast analyses between the beta images from the top two quintiles (>60%) and images from the bottom two quintiles (≤40%) were performed. Similarly, contrasts for the 2nd order interpersonal prediction errors were generated from beta images associated with the event when the investments were submitted.

Computational theory-of-mind model

See Text S1 for detailed descriptions. We also include a reinforcement learning model in Text S1 for comparison.

Supporting Information

Figure S1 Depth-of-thought distribution for investors playing trustees with lower SES. The trustee group was matched to the SES of the identified BPD trustees, which tended to be lower than that of the average healthy trustee. In reciprocation games (including the multi-round trust game), it is known that lower SES correlates with lower offers and increased difficulty of sustaining cooperation.
This investor depth-of-thought distribution suggests that the reduced SES that can attend BPD may be one of the causative factors in their style of play; however, these data are simply consistent with that hypothesis and do not show causality. The lower-SES trustees induce a depth-of-thought distribution that is significantly different from that of investors playing anonymous healthy trustees using Fisher's exact test (p = ). (TIF)

Figure S2 Depth-of-thought distribution for investors playing healthy trustees non-anonymously. Healthy trustees meet their investor partner at the beginning of the game and are paid in front of their partner at the end of the game. These subjects are not known to one another at the start of the game and are randomly assigned the role of trustee or investor. This depth-of-thought distribution is not statistically different from the distribution in figure S2 (Fisher's exact test p = 0.032). (TIF)

Table S1 Did a small number of subjects drive differences in the quintiles of interpersonal prediction errors? The number of distinct subjects in low (bottom two quintiles only) and high (upper two quintiles only) 1st and 2nd order prediction errors, the total subjects in each category, and the percentage. Extremely few subjects were present in the low or high categories only. The majority of investors made contributions to all the quintiles for both the 1st and 2nd order interpersonal errors, regardless of their depth-of-thought levels. (TIF)

Table S2 Parameters for reinforcement learning models. Estimated parameters k and b for different learning rates e for reinforcement learning model. (TIF)

Table S3 Model fit comparison. Comparison of average negative log-likelihoods for reinforcement learning models using the estimated parameters, and the computational theory of mind model. (TIF)

Table S4 Joint classification table. Joint Investor/Trustee depth-of-thought classification frequency table. Chi-Square test gives p = 6.4e-05. (TIF)

Text S1 Supplementary model information. (DOC)

Author Contributions

Conceived and designed the experiments: TX DR TL PD PRM. Performed the experiments: TX DR TL PD PRM. Analyzed the data: TX DR TL PD PRM. Contributed reagents/materials/analysis tools: TX DR TL PD PRM. Wrote the paper: TX DR TL PD PRM.

References

1. Sanfey AG (2007) Social decision-making: insights from game theory and neuroscience. Science 318:
2. Lee D (2008) Game theory and neural basis of social decision making. Nat Neurosci 11:
3. Hampton AN, Bossaerts P, O'Doherty JP (2007) Neural correlates of mentalizing-related computations during strategic interactions in humans. Proc Natl Acad Sci USA 105:
4. Yoshida W, Dolan RJ, Friston KJ (2008) Game theory of mind. PLoS Comput Biol 4:e
5. Yoshida W, Seymour B, Dolan RJ, Friston KJ (2010) Neural Influence of belief inference during social games. J Neurosci 30:
6. Yoshida W, Dziobek I, Kliemann D, Heekeren HR, Friston KJ, et al. (2010) Cooperation and heterogeneity of the Autistic Mind. J Neurosci 30:
7. King-Casas B, Tomlin D, Anen C, Camerer CF, Quartz SR, et al. (2005) Getting to know you: reputation and trust in a two-person economic exchange. Science 308:
8. Tomlin D, Kayali MA, King-Casas B, Anen C, Camerer CF, et al. (2006) Agent-specific responses in the cingulate cortex during economic exchanges. Science 312:
9. King-Casas B, Sharp C, Lomax-Bream L, Lohrenz T, Fonagy P, et al. (2008) The rupture and repair of cooperation in Borderline Personality Disorder. Science 321:
10. Ray D, King-Casas B, Montague PR, Dayan P (2008) Bayesian model of behaviour in economic games. NIPS 21:
11. Fehr E, Camerer CF (2007) Social neuroeconomics: the neural circuitry of social preferences. Trends Cogn Sci 11:
12. Camerer CF, Ho T-H, Chong J-K (2004) A cognitive hierarchy model of games. Q J Econ 119:
13. Fehr E, Schmidt KM (1999) A theory of fairness, competition, and cooperation. Q J Econ 114:
14. Harsanyi JC (1967) Games with incomplete information played by Bayesian players. Manage Sci 14:
15. Camerer CF (2003) Behavioral Game Theory: Experiments in Strategic Interaction. Princeton, New Jersey: Princeton University Press.
16. Dugatkin LA, Reeve HK (2000) Game Theory and Animal Behavior. New York: Oxford University Press.
17. O'Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ (2003) Temporal difference models and reward-related learning in the human brain. Neuron 38:
18. Pagnoni G, Zink CF, Montague PR, Berns GS (2002) Activity in human ventral striatum locked to errors of reward prediction. Nat Neurosci 5:
19. Haruno M, Kuroda T, Doya K, Toyama K, Kimura M, et al. (2004) A neural correlate of reward-based behavioral learning in caudate nucleus: a functional magnetic resonance imaging study of a stochastic decision task. J Neurosci 24:
20. Koshelev M, Lohrenz T, Vannucci M, Montague PR (2010) Biosensor approach to psychopathology classification. PLoS Comput Biol 6:e
21. McClure SM, Berns GS, Montague PR (2003) Temporal prediction errors in a passive learning task activate human striatum. Neuron 38:
22. O'Doherty J, Dayan P, Schultz J, Deichmann R, Friston K, et al. (2004) Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304:
23. Montague PR, Hyman SE, Cohen JD (2004) Computational roles for dopamine in behavioral control. Nature 431:
24. Behrens TEJ, Hunt LT, Woolrich MW, Rushworth MFS (2008) Associative learning of social value. Nature 456:
25. Saxe R, Kanwisher N (2003) People thinking about thinking people. The role of temporo-parietal junction in theory of mind. Neuroimage 19:
26. Young L, Camprodon JA, Hauser M, Pascual-Leone A, Saxe R (2010) Disruption of the right temporoparietal junction with transcranial magnetic stimulation reduced the role of beliefs in moral judgements. Proc Natl Acad Sci USA 107:
27. Bhatt MA, Lohrenz T, Camerer CF, Montague PR (2010) Neural signatures of strategic types in a two-person bargaining game. Proc Natl Acad Sci USA 107:
28. Mars R, Sallet J, Schüffelgen U, Jbabdi S, Toni I, et al. (2012) Connectivity-based subdivisions of the human right temporoparietal junction area: evidence for different areas participating in different cortical networks. Cereb Cortex 22: Epub 2011 Sep
29. Sutton RS, Barto AG (1998) Reinforcement Learning: An Introduction. Cambridge, Massachusetts: MIT Press.
30. Montague PR, Dayan P, Sejnowski TJ (1996) A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci 16:
31. Montague PR, King-Casas B, Cohen JD (2006) Imaging valuation models in human choice. Annu Rev Neurosci 29:
32. Seo H, Lee D (2008) Cortical mechanisms for reinforcement learning in competitive games. Phil Trans R Soc B 363:
33. Behrens TEJ, Hunt LT, Rushworth MFS (2009) The computation of social behavior. Science 324:

PLOS Computational Biology December 2012 Volume 8 Issue 12 e

RESEARCH ARTICLE

Monte Carlo Planning Method Estimates Planning Horizons during Interactive Social Exchange

Andreas Hula 1*, P. Read Montague 2,3,4, Peter Dayan 5

1 Wellcome Trust Centre for Neuroimaging, University College London, London, United Kingdom, 2 Wellcome Trust Centre for Neuroimaging, University College London, London, United Kingdom, 3 Human Neuroimaging Laboratory, Virginia Tech Carilion Research Institute, Roanoke, Virginia, United States of America, 4 Department of Physics, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States of America, 5 Gatsby Computational Neuroscience Unit, University College London, London, United Kingdom

* ahula@ucl.ac.uk

OPEN ACCESS

Citation: Hula A, Montague PR, Dayan P (2015) Monte Carlo Planning Method Estimates Planning Horizons during Interactive Social Exchange. PLoS Comput Biol 11(6): e doi: /journal. pcbi

Editor: Samuel Gershman, Massachusetts Institute of Technology, UNITED STATES

Received: September 13, 2014

Accepted: March 23, 2015

Published: June 8, 2015

Copyright: 2015 Hula et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: The data are available from the following GitHub repository: com/andreashula/trust

Funding: This work was supported by a Wellcome Trust Principal Research Fellowship (PRM, AH) under grant /Z/10/Z, The Kane Family Foundation (PRM), NIDA grant R01DA11723 (PRM), NIMH grant R01MH (PRM), NIA grant RC4AG (PRM), and The Gatsby Charitable Foundation (PD). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Abstract

Reciprocating interactions represent a central feature of all human exchanges.
They have been the target of various recent experiments, with healthy participants and psychiatric populations engaging as dyads in multi-round exchanges such as a repeated trust task. Behaviour in such exchanges involves complexities related to each agent's preference for equity with their partner, beliefs about the partner's appetite for equity, beliefs about the partner's model of their partner, and so on. Agents may also plan different numbers of steps into the future. Providing a computationally precise account of the behaviour is an essential step towards understanding what underlies choices. A natural framework for this is that of an interactive partially observable Markov decision process (IPOMDP). However, the various complexities make IPOMDPs inordinately computationally challenging. Here, we show how to approximate the solution for the multi-round trust task using a variant of the Monte-Carlo tree search algorithm. We demonstrate that the algorithm is efficient and effective, and therefore can be used to invert observations of behavioural choices. We use generated behaviour to elucidate the richness and sophistication of interactive inference.

Author Summary

Agents interacting in games with multiple rounds must model their partner's thought processes over extended time horizons. This poses a substantial computational challenge that has restricted previous behavioural analyses. By taking advantage of recent advances in algorithms for planning in the face of uncertainty, we demonstrate how these formal methods can be extended. We use a well-studied social exchange game called the trust task to illustrate the power of our method, showing how agents with particular cognitive and social characteristics can be expected to interact, and how to infer the properties of individuals from observing their behaviour.

This is a PLOS Computational Biology Methods paper.

Competing Interests: The authors have declared that no competing interests exist.

PLOS Computational Biology DOI: /journal.pcbi June 8, / 38

Introduction

Successful social interactions require individuals to understand the consequences of their actions on the future actions and beliefs of those around them. Mapping these processes is a complex challenge in at least three different ways. The first is that other people's preferences or utilities are not known exactly. Even if the various components of the utility functions are held in common, the actual values of the parameters of partners, e.g., their degrees of envy or guilt [1–6], could well differ. This ignorance decreases through experience, and can be modeled using the framework of a partially observable Markov decision process (POMDP). However, normal mechanisms for learning in POMDPs involve probing or running experiments, which has the potential cost of partners fooling each other.

The second complexity lies in characterizing the form of the model that agents have of others. In principle, agent A's model of agent B should include agent B's model of agent A; and in turn, agent B's model of agent A's model of agent B, and so forth. The beautiful theory of Nash equilibria [7], extended to the case of incomplete information via so-called Bayes-Nash equilibria [8], dispenses with this so-called cognitive hierarchy [9–12], looking instead for an equilibrium solution. However, a wealth of work (see for instance [13]) has shown that people deviate from Nash behaviour. It has instead been proposed that people model others to a strictly limited, yet non-negligible, degree [9, 12].
The final complexity arises when we consider that although it is common in experimental economics to create one-shot interactions, many of the most interesting and richest aspects of behaviour arise with multiple rounds of interactions. Here, for concreteness, we consider the multi-round trust task, which is a social exchange game that has been used with hundreds of pairs (dyads) of subjects, including both normal and clinical populations [16–18]. This game has been used to show that characteristics that only arise in multi-round interactions, such as defection (agent A increases their cooperation between two rounds; agent B responds by decreasing theirs), have observable neural consequences that can be measured using functional magnetic resonance imaging (fMRI) [14, 19–22].

The interactive POMDP (IPOMDP) [23] is a theoretical framework that formalizes many of these complexities. It characterizes the uncertainties about the utility functions and planning over multiple rounds in terms of a POMDP, and constructs an explicit cognitive hierarchy of models about the other (hence the moniker "interactive"). This framework has previously been used with data from the multi-round trust task [22, 24]. However, solving IPOMDPs is computationally extremely challenging, restricting those previous investigations to a rather minuscule degree of forward planning (just two rounds out of what is actually a ten-round interaction). Our main contribution is the adaptation of an efficient Monte Carlo tree search method, called partially observable Monte Carlo planning (POMCP), to IPOMDP problems. Our second contribution is to illustrate this algorithm through examination of the multi-round trust task. We show characteristic patterns of behaviour to be expected for subjects with particular degrees of inequality aversion, other-modeling and planning capacities, and consider how to invert observed behaviour to make inferences about the nature of subjects' reasoning capacities.

Results

We first briefly review Markov decision processes (MDPs), their partially observable extensions (POMDPs), and the POMCP algorithm invented to solve them approximately, but efficiently. These concern single agents. We then discuss IPOMDPs and the application of POMCP to solving them when there are multiple agents. Finally, we describe the multi-round trust task.

Partially Observable Markov Decision Processes

A Markov decision process (MDP) [25] is defined by sets $S$ of states and $A$ of actions, and several components that evaluate and link the two, including transition probabilities $T$, and information $\mathcal{R}$ about possible rewards. States describe the position of the agent in the environment, and determine which actions can be taken, accounting for, at least probabilistically, the consequences for rewards and future states. Transitions between states are described by means of a collection of transition probabilities $T$, assigning to each possible state $s \in S$ and each possible action $a \in A$ from that state a transition probability distribution or measure

$$T^{a}_{s\hat{s}} = T(\hat{s}, a, s) := P[\hat{s} \mid s, a],$$

which encodes the likelihood of ending in state $\hat{s}$ after taking action $a$ from state $s$. The Markov property requires that the transition (and reward) probabilities depend only on the current state (and action), and are independent of past events. An illustration of these concepts can be found in Fig 1.

Fig 1. A Markov decision process. The agent starts at state $s_0$ and has two possible actions $a_1$ and $a_2$. Exercising either, it can transition into three possible states, one of which ($s_2$) can be reached through either action. Each state and action combination is associated with a particular reward expectation $R(a, s)$. Based on this information, the agent can choose an action and transitions with probability $T(\hat{s}, a, s_0)$ to a new state $\hat{s}$, obtaining an actual reward $r$ in the process. The procedure is then repeated from the new state, with its given action possibilities, or else the decision process might end, depending on the given process. doi: /journal.pcbi g001

By contrast, in a partially observable MDP (i.e., a POMDP [26]), the agent can also be uncertain about its state $s$. Instead, there is a set of observations $o \in O$ that incompletely pin down states, depending on the observation probabilities

$$W^{a}_{\hat{s}o} = W(o, a, \hat{s}) := P[o \mid \hat{s}, a].$$

These report the probability of observing $o$ when action $a$ has occasioned a transition to state $\hat{s}$. See Fig 2 for an illustration of the concept.

Fig 2. A partially observable Markov decision process. Starting from an observed interaction history $h$, the agents use their belief state $B(h)$ to determine how likely they are to find themselves in one of three possible actual states $s_1$, $s_2$, $s_3$. The POMDP solution requires integrating over all possible states according to the belief state at every possible following history. The solution allows the next action $a$ to be chosen. Following this, an observation $o$ is obtained by the agent and the new history $\{h, a, o\}$ becomes the starting point for the next decision. doi: /journal.pcbi g002

We use the notation $s_t = s$, $a_t = a$ or $o_t = o$ to refer explicitly to the outcome state, action or observation at a given time. The history $h \in H$ is the sequence of actions and observations, wherein each action from the point of view of the agent moves the time index ahead by 1, $h_t := \{o_0, a_0, o_1, a_1, \ldots, a_{t-1}, o_t\}$. Here $o_0$ may be trivial (deterministic or empty). The agent can perform Bayesian inference to turn its history at time $t$ into a distribution $P[S_t = s_t \mid h_t]$ over its state at time $t$, where $S_t$ denotes the random variable encoding the uncertainty about the current state at time $t$. This distribution is called its belief state $B(h_t)$, with $P_{B(h_t)}[S_t = s_t] := P[S_t = s_t \mid h_t]$. Inference depends on knowing $T$, $W$ and the distribution over the initial state $S_0$, which we write as $B(h_0)$.

Information about rewards $\mathcal{R}$ comprises a collection of utility functions $r \in \mathcal{R}$, $r: A \times S \times O \to \mathbb{R}$, a discount function $\Gamma \in \mathcal{R}$, $\Gamma: \mathbb{N} \to [0, 1]$, and a survival function $H \in \mathcal{R}$, $H: \mathbb{N} \times \mathbb{N} \to [0, 1]$.
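Before the reward components are spelled out, the Bayesian inference that turns one belief state into the next can be made concrete for a small discrete problem. The sketch below is a minimal illustration, assuming finite state, action and observation sets stored as nested lists; the function name `belief_update`, the array layout and the toy numbers are hypothetical and are not taken from the paper.

```python
def belief_update(belief, T, W, a, o):
    """One step of Bayesian filtering for a discrete POMDP.

    belief : list, belief[s] = P[S_t = s | h_t]
    T      : nested list, T[a][s][s_next] = P[s_next | s, a]
    W      : nested list, W[a][s_next][o] = P[o | s_next, a]
    Returns the belief over states after taking action a and observing o.
    """
    n = len(belief)
    # Prediction: push the current belief through the transition model.
    predicted = [
        sum(T[a][s][s_next] * belief[s] for s in range(n))
        for s_next in range(n)
    ]
    # Correction: weight each candidate state by the observation likelihood.
    unnormalised = [W[a][s_next][o] * predicted[s_next] for s_next in range(n)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalised]


# Hypothetical example: two states, one action, two observations.
T = [[[0.9, 0.1],
      [0.2, 0.8]]]
W = [[[0.8, 0.2],
      [0.3, 0.7]]]
b0 = [0.5, 0.5]
b1 = belief_update(b0, T, W, a=0, o=1)
print(b1)
```

The update needs only $T$, $W$ and the previous belief, which is why knowing the initial distribution $B(h_0)$ is sufficient to track the belief state along any history.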
The utility functions determine the immediate gain associated with executing action $a$ at state $s$ and observing $o$ (we sometimes write $r_t$ for the reward following the $t$-th action). From the utilities, we define the reward function $R: A \times S \to \mathbb{R}$ as the expected gain for taking action $a$ at state $s$, $R(a, s) = E[r(a, s, o)]$, where this expectation is taken over all possible observations $o$. Since we usually operate on histories, rather than fixed states, we define the expected reward from a given history $h$ as $R(a, h) := \sum_{s \in S} R(a, s) P[s \mid h]$. The discount function weights the present impact of a future return, depending only on the separation between present and future. We use exponential discounting with a fixed number $\gamma \in [0, 1]$ to define our discount function:

$$\Gamma(\tau - t) = \gamma^{\tau - t} \quad \forall t, \tau \in \mathbb{N},\ \tau \geq t. \qquad (1)$$

Additionally, we define $H$ such that $H(\tau, t)$ is 0 for $\tau > K$ and 1 otherwise. $K$ in general is a random stopping time. We call the second component $t$ the reference time of the survival function. The survival function allows us to encode the planning horizon of an agent during decision making: if $H(\tau, t)$ is 0 for $\tau - t > P$, we say that the local planning horizon at $t$ is less than or equal to $P$.

The policy $\pi \in \Pi$, $\pi(a, h) := P[a \mid h]$, is defined as a mapping of histories to probabilities over possible actions. Here $\Pi$ is called the set of admissible policies. For convenience, we sometimes write the distribution function as $\pi(h)$. The value function of a fixed policy $\pi$ starting from present history $h_t$ is

$$V^{\pi}(h_t) := \sum_{\tau = t}^{\infty} \gamma^{\tau - t} H(\tau, t) E[r_{\tau} \mid \pi, h_t], \qquad (2)$$

i.e., a sum of the discounted future expected rewards (note that $h_{\tau}$ is a random variable here, not a fixed value). Equally, the state-action value is

$$Q^{\pi}(a, h_t) := R(a, h_t) + \sum_{\tau = t+1}^{\infty} \gamma^{\tau - t} H(\tau, t) E[r_{\tau} \mid \pi, h_t]. \qquad (3)$$

Definition 1 (formal definition: POMDP). Using the notation of this section, a POMDP is defined as a tuple $(S, A, O, T, W, \mathcal{R}, \Pi, B_0)$ of components as outlined above.

Convention 1 (softmax decision making). A wealth of experimental work (for instance [27–29]) has found that the choices of humans (and other animals) can be well described by softmax policies based on the agent's state-action values, to encompass the stochasticity of observed behaviour in real subject data. See [30] for a behavioural economics perspective and [10] for a neuroscience perspective. In view of using our model primarily for experimental analysis, we will base our discussion on the decision making rule:

$$\pi(a, h) = P[a \mid h] = \frac{e^{\beta Q^{\pi}(a, h)}}{\sum_{b \in A} e^{\beta Q^{\pi}(b, h)}} \qquad (4)$$

where $\beta > 0$ is called the inverse temperature parameter and controls how diffuse the probabilities are.
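The effect of the inverse temperature in Eq (4), and of the planning horizon encoded by the survival function, can be illustrated with a short sketch. It assumes a finite action set with the state-action values already computed and a fixed sequence of expected rewards; the function names `softmax_policy` and `discounted_return` and the numerical values are hypothetical, chosen only for illustration.

```python
import math


def softmax_policy(q_values, beta):
    """Return P[a | h] for each action from state-action values Q(a, h), as in Eq (4)."""
    if not beta > 0:
        raise ValueError("inverse temperature beta must be positive")
    top = max(q_values)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(beta * (q - top)) for q in q_values]
    normaliser = sum(weights)
    return [w / normaliser for w in weights]


def discounted_return(expected_rewards, gamma, horizon):
    """Sum gamma^(tau - t) * H(tau, t) * E[r_tau], with H = 1 while tau - t is at most `horizon`.

    expected_rewards[k] is the expected reward k steps after the present time t.
    """
    return sum(
        (gamma ** k) * r
        for k, r in enumerate(expected_rewards)
        if horizon >= k
    )


q = [1.0, 2.0, 0.5]                  # hypothetical Q(a, h) for three actions
diffuse = softmax_policy(q, beta=0.01)
sharp = softmax_policy(q, beta=50.0)
value = discounted_return([1.0, 1.0, 1.0, 1.0], gamma=0.5, horizon=2)
print(diffuse, sharp, value)
```

With a small $\beta$ the probabilities are close to uniform, whereas with a large $\beta$ nearly all the mass falls on the action with the highest value.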
The policy

$$\pi(a, h) = \begin{cases} 1 & \text{if } Q^{\pi}(a, h) = \max\{Q^{\pi}(b, h) \mid b \in A\} \text{ (assuming this is unique)} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

can be obtained as a limiting case for $\beta \to \infty$.

Convention 2. From now on, we shall denote by $Q(a, h)$ the state-action value $Q^{\pi}(a, h)$ with respect to the softmax policy.

POMCP

POMCP was introduced by [31] as an efficient approximation scheme for solving POMDPs. Here, for completeness, we describe the algorithm; later, we adapt it to the case of an IPOMDP. POMCP is a generative model-based sampling method for calculating history-action values. That is, it builds a limited portion of the tree of future histories starting from the current $h_t$, using a sample-based search algorithm (called upper confidence bounds for trees (UCT); [32]) which provides guarantees as to how far from optimal the resulting action can be, given a certain number of samples (based on results in [33] and [34]). Algorithm 1 provides pseudo code for the adapted POMCP algorithm. The procedure is presented schematically in Fig 3.

Fig 3. Illustration of POMCP. The algorithm samples a state $s$ from the belief state $B(h)$ at the root Y (Y representing the current history $h$), keeps this state $s$ fixed until step 4), follows UCT in already visited domains (labelled tree nodes T) and performs a rollout and Bellman backup when hitting a leaf (labelled L). Then steps 1)–4) are repeated until the specified number of simulations has been reached. doi: /journal.pcbi g003

The algorithm is based on a tree structure $T$, wherein nodes $T(h) = (N(h), \tilde{Q}(h), B(h))$ represent possible future histories explored by the algorithm, and are characterized by the number $N(h)$ of times history $h$ was visited in the simulation, the estimated value $\tilde{Q}(h)$ for visiting $h$ and the approximate belief state $B(h)$ at $h$. Each new node in $T$ is initialized with initial action exploration counts $N(h, a) = 0$ for all possible actions $a$ from $h$, an initial action value estimate $\tilde{Q}(a, h) = 0$ for all possible actions $a$ from $h$, and an empty belief state $B(h) = \emptyset$. The value $N(h)$ is then calculated from all action counts from the node, $N(h) = \sum_{a \in A} N(h, a)$. $\tilde{Q}(h)$ denotes the mean of obtained values for simulations starting from node $h$. $B(h)$ can either be calculated analytically, if it is computationally feasible to apply Bayes' theorem, or be approximated by the so-called root sampling procedure (see below).

In terms of the algorithm, the generative model $G$ of the POMDP determines $(s', o, r) \sim G(s, a)$, the simulated reward, observation and subsequent state for taking $a$ at $s$; $s$ itself is sampled from the current history $h$.
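The node bookkeeping described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `Node`, the dictionary keyed by history tuples, the helper `get_or_create` and the representation of $B(h)$ as a list of sampled states are hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Statistics stored for one history h in the search tree."""
    actions: tuple
    visit_counts: dict = field(default_factory=dict)      # N(h, a), starts at 0
    q_values: dict = field(default_factory=dict)          # Q~(a, h), starts at 0
    belief_particles: list = field(default_factory=list)  # B(h), empty at creation (assumed particle form)

    def __post_init__(self):
        for a in self.actions:
            self.visit_counts.setdefault(a, 0)
            self.q_values.setdefault(a, 0.0)

    @property
    def visits(self):
        """N(h), the sum of N(h, a) over all actions."""
        return sum(self.visit_counts.values())


# The tree maps each history (a tuple of actions and observations) to its node.
tree = {}


def get_or_create(history, actions):
    """Return the node for `history`, creating and initialising it on first visit."""
    if history not in tree:
        tree[history] = Node(actions=tuple(actions))
    return tree[history]


root = get_or_create((), ["low", "high"])   # hypothetical action labels
root.visit_counts["low"] += 2
print(root.visits, root.q_values)
```

Keying the tree by the full history keeps the sketch close to the notation $T(h)$; a real implementation would more likely link child nodes from their parents to avoid rebuilding history tuples.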
Every (future) history of actions and observations $h$ then defines a node $T(h)$ in the tree structure $T$, which is characterized by the available actions and their average simulated action values $\tilde{Q}(a, h)$ under the policy SOFTUCT at future states. If the node has been visited for the $N(h)$-th time, with action $a$ being taken for the $N(h, a)$-th time, then the average simulated value is updated (starting from 0) using sampled simulated rewards $R$ up to terminal time $K$, when the current simulation/tree traversal ends, as:

$$\tilde{Q}_{\text{new}}(a, h) = \tilde{Q}_{\text{old}}(a, h) + \frac{1}{N(h, a)} \left( R - \tilde{Q}_{\text{old}}(a, h) \right). \qquad (6)$$

Algorithm 1 Partially Observable Monte Carlo Planning.

    procedure SEARCH(h, t, n)
        for SIMULATIONS = 1, ..., n do
            k ← t
            if h_t = o_0 then
                s ∼ B_0
            else
                s ∼ B(h_t)
            end if
            SIMULATE(s, h, t, k)
        end for
        return a ∼ SOFTUCT(Q(·|h))
    end procedure

    procedure ROLLOUT(s, h, t, k)
        if H(k, t) ≤ 0 then
            return 0
        end if
        a ∼ π_rollout(h, ·)
        (s', o, r) ∼ G(s, a)
        h ← {h, a, o}
        k ← k + 1
        return r + γ ROLLOUT(s', h, t, k)
    end procedure

    procedure SIMULATE(s, h, t, k)
        if H(k, t) ≤ 0 then
            return 0
        end if
        if h ∉ T then
            for all a ∈ A do
                T(ha) ← (N(h, a), Q̃(a, h), ∅)
            end for
            return ROLLOUT(s, h, t, k)
        end if
        a ∼ SOFTUCT(Q(·|h))
        (s', o, r) ∼ G(s, a)
        h ← {h, a, o}
        k ← k + 1
        R ← r + γ SIMULATE(s', h, t, k)
        N(h) ← N(h) + 1
        N(h, a) ← N(h, a) + 1
        Q̃(a, h) ← Q̃(a, h) + (R − Q̃(a, h)) / N(h, a)
        return R
    end procedure

doi: /journal.pcbi t001

The search algorithm has two decision rules, depending on whether a traversed node has already been visited or is a leaf of the search tree. In the former case, a decision is reached using SOFTUCT by defining

$$\text{SoftUCT}(Q(\cdot \mid h)): \quad Q(a, h) := \tilde{Q}(a, h) + c \sqrt{\frac{\log N(h)}{N(h, a)}}, \qquad P[a \mid h] = \frac{e^{\beta Q(a, h)}}{\sum_{b} e^{\beta Q(b, h)}} \qquad (7)$$

where $c$ is a parameter that favors exploration (analogous to an equivalent parameter in UCT). If the node is new, a so-called rollout policy is used to provide a crude estimate of the value of the leaf. This policy can be either very simple (uniform or $\varepsilon$-greedy based on a very simple model) or specifically adjusted to the search space, in order to optimize performance. The rollout value estimate together with the SOFTUCT exploration rule is the core mechanism for efficient tree exploration. In this work, we only use an $\varepsilon$-greedy mechanism, as is described in the section on the multi-round trust game.

Another innovation in POMCP that underlies its dramatically superior performance is called root sampling.
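The SOFTUCT selection rule of Eq (7) and the running-mean update of Eq (6) can be sketched as follows. The function names and the toy statistics are hypothetical. The source does not say how actions with $N(h, a) = 0$ are handled, so the sketch assumes, purely for illustration, that untried actions are selected first with equal probability.

```python
import math
import random


def softuct_probabilities(q_est, counts, c, beta):
    """Action probabilities from Eq (7): softmax over value estimate plus exploration bonus.

    q_est  : dict mapping action to Q~(a, h)
    counts : dict mapping action to N(h, a)
    """
    untried = [a for a in q_est if counts[a] == 0]
    if untried:
        # Assumption (not from the source): visit untried actions first, uniformly.
        return {a: (1.0 / len(untried) if a in untried else 0.0) for a in q_est}
    log_total = math.log(sum(counts.values()))  # log N(h)
    scores = {a: q_est[a] + c * math.sqrt(log_total / counts[a]) for a in q_est}
    top = max(scores.values())                  # subtract the maximum for stability
    weights = {a: math.exp(beta * (scores[a] - top)) for a in scores}
    normaliser = sum(weights.values())
    return {a: w / normaliser for a, w in weights.items()}


def sample_action(probabilities, rng=random):
    """Draw one action according to the given probabilities."""
    actions = list(probabilities)
    return rng.choices(actions, weights=[probabilities[a] for a in actions], k=1)[0]


def running_mean_update(q_old, n_visits, sampled_return):
    """Eq (6): update the mean after the n_visits-th sampled return."""
    return q_old + (sampled_return - q_old) / n_visits


# Two actions with equal value estimates; "b" has been tried less often.
probs = softuct_probabilities({"a": 1.0, "b": 1.0}, {"a": 4, "b": 1}, c=1.0, beta=1.0)
first = softuct_probabilities({"a": 0.0, "b": 0.5}, {"a": 0, "b": 3}, c=1.0, beta=1.0)
q = running_mean_update(0.0, 1, 2.0)
q = running_mean_update(q, 2, 4.0)
print(probs, first, q, sample_action(probs))
```

In the example the less-visited action receives the larger exploration bonus and hence the higher selection probability, even though both value estimates are equal.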
Root sampling allows the belief state at later states to be formed, as long as the initial belief state $B_0$ is known. This means that, although it is necessary to perform inference to draw samples from the belief state at the root of the search tree, one can then use each sample as if it were (temporarily) true, without performing inference at states that are deeper in the search tree to work out the new transition probabilities that pertain to the new belief states associated with the histories at those points. The reason for this is that the probabilities of getting to the nodes in the search tree represent exactly what is necessary to compensate for the apparent inferential infelicity [31], i.e., the search tree performs as a probabilistic filter. The technical details of the root sampling procedure can be found in [31]. In the presence of analytically tractable updating rules (or at least analytically tractable approximations), the belief state at a new node can instead be calculated by Bayes' theorem. This will also be the case for the multi-round trust game below, where we follow the approximate updating rule in [22].

Interactive Partially Observable Markov Decision Processes

An Interactive Partially Observable Markov Decision Process (IPOMDP) is a multi-agent setting in which the actions of each agent may observably affect the distribution of expected rewards for the other agents. Since IPOMDPs may be less familiar than POMDPs, we provide more detail about them; consult [23] for a complete reference formulation and [35] for an excellent discussion and extension. We define the IPOMDP such that the decision making process of each agent becomes a standard (albeit large) POMDP, allowing the direct application of POMDP methods to IPOMDP problems.

Definition 2 (formal definition: IPOMDP). An IPOMDP is a collection of POMDPs such that the following holds: Agents are indexed by the finite set $I$. Each agent $i \in I$ is described by a single POMDP $(S^i, A^i, O^i, T^i, W^i, \mathcal{R}^i, \Pi^i, B^i_0)$ denoting its actual decision making process.
We first define the physical state space S i : an element phys si 2 S i phys is a complete setting of all features of the environment that determine the action possibilities A i and obtainable rewards R i of i for the present and all possible following histories, from the point of view of i. The physical state space S i is phys augmented by the set D i of models of the partner agents θ ij 2 D i,j2i\{i}, called intentional models, which are themselves POMDPs θ ij =(S ij, A ij, O ij, T ij, W ij, R ij, P ij, B ij 0Þ. These describe how agent i believes agent j perceives the world and reaches its decisions. The possible state space of agent i can be written S i ¼ S i phys Di and a given state can be written ~s i ¼ðs i ; j y ij Þ, where s i 2 S i is the physical state of the environment and phys θij are the models of the other agents. Note that the intentional models θ ij contain themselves state spaces that encode the history of the game as observed by agent j from the point of view of agent i. The elements of S i are called interactive states. Agents themselves act according to the softmax function of history-action values, and assume that their interactive partner agents do the same. The elements of the definition are summarized in Fig 4. Convention 3. We denote by S and ~S the random variables, that encode uncertainty about the physical state and the interactive state respectively. When choosing the set of intentional models, we consider agents and their partners to engage in a cognitive hierarchy of successive mentalization steps [9, 12], depicted in Fig 5. The simplest agent can try to infer what kind of partner it faces (level 0 thinking). The next simplest agent could additionally try to infer what the partner might be thinking of it (level 1). Next, the agent might try to understand their partner s inferences about the agent s thinking about the partner (level 2). 
Generally, this would enable a potentially unbounded chain of mentalization steps. It is a tenet of cognitive hierarchy theory [9] that the hierarchy terminates finitely and for many tasks after only very few steps (e.g., Poisson, with a mean of around 1.5). We formalize this notion as follows.

Fig 4. Interactive partially observable Markov decision process. Compared to a POMDP, the process is further complicated by the necessity to keep different models Θ of the other agent's intentions, so that evidence about the correct intentional model may be accrued in the belief state B(h). The IPOMDP solution requires integrating over all possible states and intentional models according to the belief state at every possible history. doi:10.1371/journal.pcbi.1004254.g004

Fig 5. Computational theory of mind (ToM) formalizes the notion of our understanding of other people's thought processes. doi:10.1371/journal.pcbi.1004254.g005

Definition 3 (a hierarchy of intentional models). Since models of the partner agent may contain interactive states in which it in turn models the agent \(i\), we can specify a hierarchical intentional structure \(D^{i,l}\), built from what we call the level \(l\) intentional models \(D^{i,l}\), \(l \geq -1\). \(D^{i,l}\) is defined inductively from

\[ \theta^{ij,-1} \in D^{i,-1} \;\Leftrightarrow\; S^{ij,-1} = S^{ij}_{\text{phys}} \times \{\emptyset\}. \tag{8} \]

This means that any level −1 intentional model reacts strictly to the environment, without holding any further intentional models. The higher levels are obtained as

\[ \theta^{ij,l} \in D^{i,l} \;\Leftrightarrow\; S^{ij,l} = S^{ij}_{\text{phys}} \times D^{ij,l-1}. \tag{9} \]

Here \(D^{ij,l-1}\) denotes the level \(l-1\) intentional models that agent \(i\) thinks agent \(j\) might hold of the other players. These level \(l-1\) intentional models arise by the same procedure applied to the level −1 models that agent \(i\) thinks agent \(j\) might hold.

Definition 4 (theory of mind (ToM) level). We follow a similar assumption as the so-called k-level thinking (see [12]), in that we assume that each agent operates at a particular level \(l_i\) (called the agent's theory of mind (ToM) level, which it is assumed to know), and models all partners as being at level \(l_j = l_i - 1\). We chose definition 4 for comparability with earlier work [22, 24].

Convention 4. It is necessary to be able to calculate the belief state in every POMDP that is encountered. An agent updates its belief state in a Bayesian manner, following an action \(a^i_t\) and an observation \(o^i_{t+1}\). This leads to a sequential update rule operating over the belief state \(P[\tilde{S}^i_t \mid h^i_t]\) of a given agent \(i\) at a given time \(t\):

\[ P[\tilde{S}^i_{t+1} = \tilde{s}_1 \mid \{h^i_t, a^i_t, o^i_{t+1}\}] = \eta \, W(o^i_{t+1}, a^i_t, \tilde{s}_1) \sum_{\tilde{s} \in S^i} T(\tilde{s}_1, a^i_t, \tilde{s}) \, P[\tilde{S}^i_t = \tilde{s} \mid h^i_t]. \tag{10} \]

Here \(\eta\) is a normalization constant associated with the joint distribution of transition and observation probability, conditional on \(\tilde{s}\), \(\tilde{s}_1\), \(o^i_{t+1}\) and \(a^i_t\). The observation \(o^i_{t+1}\) in particular incorporates any results of the actions of the other agents, before the next action of the given agent. We note that the above rule applies recursively to every intentional model in the nested structure \(D^i\), as every POMDP has a separate belief state. This is slightly different from [23] so that the above update is conventional for a POMDP.

Convention 5 (Expected Utility Maximisation).
The decision making rule in our IPOMDP treatment is based on expected utility as encoded in the reward function. The explicit formula for the action value \(Q(a^i_t, h^i_t)\) under a softmax policy (Eq (4)) is:

\[ Q(a^i_t, h^i_t) = R(a^i_t, h^i_t) + \sum_{o^i_{t+1} \in O^i} P[o^i_{t+1} \mid \{h^i_t, a^i_t\}] \sum_{b \in A^i} \gamma^{(i)} H(t+1, t) \, Q(b, h^i_{t+1} \mid t) \, P[b \mid h^i_{t+1}]. \tag{11} \]

Here \(h^i_{t+1} = \{h^i_t, a^i_t, o^i_{t+1}\}\) and \(Q(b, h^i_{t+1} \mid t)\) denotes the action value at \(t+1\) with the survival function conditioned to reference time \(t\). \(\gamma^{(i)}\) is the discount factor of agent \(i\), rather than the \(i\)-th power. This defines a recursive Bellman equation, with the value of taking action \(a^i_t\) given history \(h^i_t\) being the expected immediate reward \(R(a^i_t, h^i_t)\) plus the expected value of future actions conditional on \(a^i_t\) and its possible consequences \(o^i_{t+1}\), discounted by \(\gamma^{(i)}\). The belief state \(B(h^i_t)\) allows us to link \(h^i_t\) to a distribution of interactive states and use \(W\) to calculate \(P[o^i_{t+1} \mid \{h^i_t, a^i_t\}]\), in particular including the reactions of other agents to the actions of one agent. We call the resulting policy the solution to the IPOMDP.

Equilibria and IPOMDPs

Our central interest is in the use of the IPOMDP to capture the interaction amongst human agents with limited cognitive resources and time for their exchanges. It has been noted in [9] that the distribution of subject levels favours rather low values (e.g., Poisson, with a mean of around 1.5). In the opposite limit, sufficient conditions are known under which taking the cognitive hierarchy out to infinity for all involved agents allows for at least one Bayes-Nash equilibrium solution (part II, theorem II, p. 322 of Harsanyi [8]), and sufficient conditions have been shown in [36] under which a solution to the infinite hierarchy model can be approximated by the sequence of finite hierarchy model solutions. A discussion of a different condition can be found in [37]; however, this condition does assume an infinite time horizon in the interaction. In general, as [9], p. 868 notes, it is not true that the infinite hierarchy solution will be a Nash equilibrium. For the purposes of computational psychiatry, we find the very mismatches and limitations that prevent subjects' strategies from evolving to a (Bayes-)Nash equilibrium in the given time frame to be of particular interest. Therefore we restrict our attention to quantal response equilibrium-like behaviours [30], based on potentially inconsistent initial beliefs held by agents with ultimately very limited cognitive resources and finite time exchanges.

Applying POMCP to an IPOMDP

An IPOMDP is a collection of POMDPs, so POMCP is, in principle, applicable to each encountered POMDP. However, unlike the examples in [31], an IPOMDP contains the intentional model POMDPs \(\theta^{ij}\) as part of the state space, and these themselves contain a rich structure of beliefs. So, the state sampled from the belief state at the root for agent \(i\) is an \(|I|\)-tuple \((\hat{s}^i, \hat{\theta}^{i1}, \ldots, \hat{\theta}^{i(|I|-1)})\) of a physical state \(\hat{s}^i\) and \((|I| - 1)\) POMDPs, one for each partner. (This is also akin to the random instantiation of players in [8].) Since the \(\hat{\theta}^{ij}\) still contain belief states in their own right, it is still necessary to do some explicit inference during the creation of each tree.
Indeed, explicit inference is hard to avoid altogether during simulation, as the interactive states require the partner to be able to learn [23]. Nevertheless, a number of performance improvements that we detail below still allow us to apply the POMCP method with substantial planning horizons.

Simplifications for Dyadic Repeated Exchange

Many social paradigms based upon game theory, including the iterated ultimatum game, prisoner's dilemma, iterated rock, paper, scissors (for 2 agents) and the multi-round trust game, involve repeated dyads. In these, each interaction involves the same structure of physical states and actions \((S_{\text{phys}}, A)\) (see below), and all discount functions are 0 past a finite horizon.

Definition 5 (Dyadic Repeated Exchange without state uncertainty). Consider a two-agent IPOMDP framework in which there is no physical state uncertainty: both agents fully observe each other's actions and there is no uncertainty about environmental influence; and in which agents vary their play only based on intentional models. Additionally, the framework is assumed to reset after each exchange (i.e., after both agents have acted once). Formally this means: there is a fixed setting \((S_{\text{phys}}, A, T)\), such that physical states, actions from these states, transitions in the physical state and hence also obtainable rewards differ only by a changing time index, and there is no observational uncertainty. Then after each exchange the framework is assumed to reset to the same distribution of physical initial states \(S_{\text{phys}}\) within this setting (i.e., the game begins anew).

Games of this sort admit an immediate simplification:

Theorem 1 (Level 0 Recombining Tree). In the situation of definition 5, level 0 action values at any given time depend only on the total set of actions and observations so far, and not on the order in which those exchanges were observed.

Proof.
The level −1 partner model only acts on the physical state it encounters, and the physical state space variable \(S\) is reset at the beginning of each round in the situation of definition 5. Therefore, given a state \(s\) in the current round and an action \(a\) by a level 0 agent, the likelihood of each transition to some state \(s_1\), \(T(s_1, a, s)\), and of making observation \(o\), \(W(o, a, s_1)\), is the same at every round from the point of view of the level 0 agent. It follows that the cumulative belief update from Eq 10, from the initial beliefs \(B_0\) to the current beliefs, will not depend on the order in which the action-observation pairs \((a, o)\) were observed.

This means that, depending on the size of the state space and the depth of planning of interest, we may analytically calculate level 0 action values even online, or use precalculated values for larger problems. Furthermore, because their action values will only depend on past exchanges and not on the order in which they were observed, their decision making tree can be reformulated as a recombining tree.

Sometimes, an additional simplification can be made:

Theorem 2 (Trivialised Planning). In the situation of definition 5, if the two agents do not act simultaneously and the state transition of the second agent is entirely dependent on the action executed by the first agent (as in the multi-round trust task), and additionally the intentional model of the partner can not be changed through the actions of the second agent, then a level 0 second agent can gain no advantage from planning ahead, since their actions will not change the action choices of the first agent.

Proof. In the scenario described in theorem 2, the physical state variable \(S\) of agent 2 is entirely dependent on the action of the other agent.
If the agent is level 0, they model their partner as level −1 and, by the additional assumption, the second agent does not believe that the partner can be made to transition between different intentional models by the second agent's actions. Hence their partner will not change their distribution of state transitions depending on the agent's actions, and so their distribution of future obtainable rewards will not change either.

Theorem 3 (Trivialised Theory of Mind Levels). In the situation of theorem 2, we state that for the first-to-go agent, only the even theory of mind levels \(k \in \{0\} \cup 2\mathbb{N}\) show distinct behaviours, while the odd levels \(k \in 2\mathbb{N} - 1\) behave like one level below, meaning \(k - 1\). For the second-to-go partner equivalently, only the odd levels \(k \in \{0\} \cup (2\mathbb{N} - 1)\) show distinct behaviours.

Proof. In the scenario described in theorem 2, the second-to-go level 0 agent behaves like a level −1 agent, as it does not benefit from modeling the partner. This implies that the first-to-go agent gains no additional information at level 1 thinking, since the partner behaves like level −1, which was already modeled by the level 0 first-to-go agent. In turn, the level 2 second-to-go agent gains no additional information over the level 1 second-to-go agent, as their partner model does not change between modeling the partner at level 0 or level 1. By induction, we get the result.

Examples of the additional simplifications in theorems 2 and 3 can be seen in the ultimatum game and the multi-round trust game.

The Trust Task

The multi-round trust task, illustrated in Fig 6, is a paradigmatic social exchange game. It involves two people, one playing the role of an investor and the other that of a trustee, over 10 sequential rounds, expressed by a time index \(t = 1, 2, \ldots, 10\). Both agents know all the rules of the game. In each round, the investor receives an initial endowment of 20 monetary units. The investor can send any of this amount to the trustee.
The experimenter trebles this quantity and then the trustee decides how much to send back to the investor, between 0 points and the whole amount that she receives. The repayment by the trustee is not increased by the experimenter. After the trustee's action, the investor is informed, and the next round starts.

Fig 6. Physical features of the multi-round trust game. doi:10.1371/journal.pcbi.1004254.g006

We consider the trust task as an IPOMDP with two agents, i.e., \(I = \{I, T\}\) contains just \(I\) for the investor and \(T\) for the trustee. We consider the state to contain two components: one physical and observable (the endowment and investments), the other non-physical and non-observable (in our case, parameters of the utility function). It is the latter that leads to the partial observability in the IPOMDP. Following [24], we reduce complexity by quantizing the actions and the (non-observable) states of both investor and trustee, shown for one complete round in Fig 7. The actions are quantized into 5 fractional categories shown in Fig 7. For the investor, we consider \(a_I \in \{0, 0.25, 0.5, 0.75, 1\}\) (corresponding to an investment of $20 × \(a_I\), and encompassing even investment ranges). For the trustee, we consider \(a_T \in \{0, 0.167, 0.333, 0.5, 0.67\}\) (corresponding to a return of $3 × 20 × \(a_I\) × \(a_T\), and encompassing even return ranges). Note that the trustee's action is degenerate if the investor gives 0.

Fig 7. Discretized actions of both players. Investor: (left) The 21 possible actions are summarized into 5 possible investment categories. Trustee: (right) returns are classified into 5 possible categories, conditionally on investor action. Impossible returns are marked in black. doi:10.1371/journal.pcbi.1004254.g007

The pure monetary payoffs for both agents in each round are

\[ \text{investor:} \quad \chi_I(a_I, a_T) = 20 - 20\, a_I + 3 \cdot 20\, a_I a_T \tag{12} \]

\[ \text{trustee:} \quad \chi_T(a_I, a_T) = 3 \cdot 20\, a_I - 3 \cdot 20\, a_I a_T. \tag{13} \]

The payoffs of all possible combinations and both partners are depicted in Fig 8.

Fig 8. Payoffs in the multi-round trust task. (left) Investor payoffs for a single exchange. (right) Trustee payoffs for a single exchange. doi:10.1371/journal.pcbi.1004254.g008

In IPOMDP terms, the investor's physical state is static, whereas the trustee's state space is conditional on the previous action of the investor. The investor's possible observations are the trustee's responses, with a likelihood that depends entirely on the investor's intentional model of the trustee. The trustee observes the investor's action, which also determines the trustee's new physical state, as shown in Fig 9.

Inequality aversion-compulsion to fairness. The aspects of the states of investor and trustee that induce partial observability are assumed to arise from differential levels of cooperation. One convenient (though not unique) way to characterize this is via the Fehr-Schmidt inequality aversion utility function (Fig 10). This allows us to account for the observation that many trustees return an even split even on the last exchange of the 10 rounds, even though no further gain is possible. We make no claim that this is the only explanation for such behaviour, but it is a tractable and well-established mechanism that has been used successfully in other tasks [1, 14, 27]. For the investor, this suggests that:

\[ r_I(a_I, a_T, \alpha_I) = \chi_I(a_I, a_T) - \alpha_I \max\{\chi_I(a_I, a_T) - \chi_T(a_I, a_T),\, 0\}. \tag{14} \]

Here, \(\alpha_I\) is called the guilt parameter of the investor and quantifies their aversion to unequal outcomes in their favor. We quantize guilt into 3 concrete guilt types \(\{0, 0.4, 1\} = \{\alpha_1, \alpha_2, \alpha_3\}\). Similarly, the trustee's utility is

\[ r_T(a_I, a_T, \alpha_T) = \chi_T(a_I, a_T) - \alpha_T \max\{\chi_T(a_I, a_T) - \chi_I(a_I, a_T),\, 0\} \tag{15} \]

with the same possible guilt types.
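The payoffs of Eqs (12) and (13) and the Fehr-Schmidt utilities of Eqs (14) and (15) can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical and are not taken from the authors' implementation.

```python
# Illustrative sketch of Eqs (12)-(15); all names here are assumptions.
ENDOWMENT = 20   # monetary units the investor receives each round
MULTIPLIER = 3   # the experimenter trebles the invested amount


def investor_payoff(a_i: float, a_t: float) -> float:
    """Eq (12): the part of the endowment that is kept, plus the repayment."""
    return ENDOWMENT - ENDOWMENT * a_i + MULTIPLIER * ENDOWMENT * a_i * a_t


def trustee_payoff(a_i: float, a_t: float) -> float:
    """Eq (13): the trebled investment minus the repayment."""
    return MULTIPLIER * ENDOWMENT * a_i - MULTIPLIER * ENDOWMENT * a_i * a_t


def investor_utility(a_i: float, a_t: float, guilt: float) -> float:
    """Eq (14): payoff penalised by guilt when the investor is ahead."""
    own, other = investor_payoff(a_i, a_t), trustee_payoff(a_i, a_t)
    return own - guilt * max(own - other, 0.0)


def trustee_utility(a_i: float, a_t: float, guilt: float) -> float:
    """Eq (15): payoff penalised by guilt when the trustee is ahead."""
    own, other = trustee_payoff(a_i, a_t), investor_payoff(a_i, a_t)
    return own - guilt * max(own - other, 0.0)


if __name__ == "__main__":
    # Trustee utilities after a full investment, for each guilt type.
    for guilt in (0.0, 0.4, 1.0):
        row = [round(trustee_utility(1.0, a_t, guilt), 2)
               for a_t in (0.0, 0.167, 0.333, 0.5, 0.67)]
        print(guilt, row)
```

Under these definitions, after a full investment a trustee with guilt 1 assigns utility 0 to keeping everything and utility 30 to the even split, whereas a trustee with guilt 0 assigns utility 60 to keeping everything.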
We choose these particular values, as guilt values above 0.5 tend to produce behaviours similar to \(\alpha = 1\), and values below 0.3 tend to behave very similarly to \(\alpha = 0\). Thus we take \(\alpha_1\) to represent guilt values in [0, 0.3], \(\alpha_2\) to represent guilt values in (0.3, 0.5) and \(\alpha_3\) to represent guilt values in [0.5, 1]. We assume that neither agent's actual guilt type changes during the 10 exchanges.

Fig 9. (Physical) transitions and observations: (left) Physical state transitions and observations of the investor. The trustee's actions are summarized to \(a_T\), as they can not change the following physical state transition. (right) Physical state transitions and observations of the trustee. The trustee's actions are summarized to \(a_T\), as they can not change the following physical state transition. doi:10.1371/journal.pcbi.1004254.g009

Planning behaviour. The survival functions \(H_I\) and \(H_T\) are used to delimit the planning horizon. The agents are required not to plan beyond the end of the game at time 10, and within that constraint they are supposed to plan \(P\) steps ahead into the interaction. This results in the following form for the survival functions (regardless of whether for investor or trustee):

\[ H^P(\tau, t) = \begin{cases} 1 & \text{if } (\tau - t) \le P \,\wedge\, (t + \tau) \le 10, \\ 0 & \text{if } (\tau - t) > P \,\vee\, (t + \tau) > 10. \end{cases} \tag{16} \]

The value \(P\) is called the planning horizon. We consider \(P \in \{0, 2, 7\}\) for immediate, medium and long planning types. We chose these values as \(P = 7\) covers the range of behaviours from \(P = 4\) to \(P = 9\), while planning 2 yields compatibility with earlier works [22, 24] and allows for agents with short planning but high level, covering the range of behaviours for planning \(P = 1\) to \(P = 3\). We confirm later that the behaviour of \(P = 7\) and \(P = 9\) agents is almost identical; and the former saves memory and processing time. Agents are characterized as assuming their opponents have the same degree of planning as they do. The discount factors \(\gamma_I\) and \(\gamma_T\) are set to 1 in our setting.
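The role of the survival function can be illustrated with a small Python sketch. It follows the prose description above (plan at most P steps ahead and never beyond round 10) rather than any particular argument convention; the function names and the choice of passing absolute round indices are assumptions made for illustration only.

```python
# Illustrative sketch of a planning-horizon survival function.
# Assumption: rounds are absolute indices 1..10, and a future round "survives"
# if it is at most `horizon` steps ahead and not past the end of the game.
LAST_ROUND = 10


def survival(future_round: int, reference_round: int, horizon: int) -> int:
    """Return 1 if `future_round` is still planned for from `reference_round`."""
    within_horizon = (future_round - reference_round) <= horizon
    within_game = future_round <= LAST_ROUND
    return 1 if (within_horizon and within_game) else 0


def planned_rounds(reference_round: int, horizon: int) -> list:
    """List the rounds, from the reference round onward, that receive weight 1."""
    candidates = range(reference_round, LAST_ROUND + horizon + 1)
    return [r for r in candidates if survival(r, reference_round, horizon)]


if __name__ == "__main__":
    for horizon in (0, 2, 7):
        print(horizon, planned_rounds(1, horizon), planned_rounds(9, horizon))
```

With this reading, an agent with planning horizon 7 at round 9 only considers rounds 9 and 10, so long and short horizons coincide near the end of the game.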
Belief State

Since all agents use their own planning horizon in modeling the partner and level \(k\) agents model their partner at level \(k - 1\), inference in intentional models in this analysis is restricted to the guilt parameter \(\alpha\). Using a categorical distribution on the guilt parameter and a Dirichlet prior on the probabilities of the categorical distribution, we get a Dirichlet-Multinomial distribution for the probabilities of an agent having a given guilt type at some point during the exchange. Hence \(B_0\) is a Dirichlet-Multinomial distribution, with the initial belief state

\[ B_0 \sim \text{DirMult}(a_0), \qquad a_0 = (1, 1, 1) \tag{17} \]

\[ P[\alpha_{\text{partner}} = \alpha_i \mid h = \emptyset] = \tfrac{1}{3}. \tag{18} \]

Keeping consistent with the model in [22], our approximation of the posterior distribution is a Dirichlet-Multinomial distribution with the parameters of the Dirichlet prior being updated to

\[ a^i_{t+1} = a^i_t + P[o_{t+1} = \text{observed action} \mid \alpha_{\text{partner}} = \alpha_i] \tag{19} \]

writing \(\alpha_{\text{partner}}\) for the intentional models.

Fig 10. Immediate Fehr-Schmidt utilities for a single exchange [1]. Left column shows investor preferences: (top left) Completely unguilty investor values only the immediate payoff, (middle left) Guilt 0.4 investor is less likely to keep everything to themselves (bottom left corner option), (bottom left) Guilt 1 investor will never keep everything to themselves (bottom left option). Right column shows trustee preferences: (top right) Unguilty trustee would like to keep everything to themselves, (middle right) Guilt 0.4 trustee is more likely to return at least a fraction of the gains, (bottom right) Guilt 1 trustee will always strive to return the fair split. doi:10.1371/journal.pcbi.1004254.g010

Theory of mind levels and agent characterization. Since the physical state transition of the trustee is fully dependent on the investor's action and one agent's guilt type can not be changed by the actions of the other agent, theorem 2 implies that the level 0 trustee is trivial, gaining nothing from planning ahead. Conversely, the level 0 investor can use a recombining tree as in theorem 1. Therefore, the chain of cognitive hierarchy steps for the investor is \(l_I \in \{0\} \cup \{2n \mid n \in \mathbb{N}\}\), and for the trustee, it is \(l_T \in \{0\} \cup \{2n - 1 \mid n \in \mathbb{N}\}\). Trustee planning is trivial until the trustee reaches at least theory of mind level 1. Assuming \(\beta = \tfrac{1}{3}\) in Eq 4, determined empirically from real subject data [22] for suitably noisy behaviour, our subjects are then characterized via the triplet \((k, \alpha, P)\) of theory of mind level \(k\), guilt parameter \(\alpha \in \{0, 0.4, 1\}\) and planning horizon \(P \in \{0, 2, 7\}\).

Level −1 and POMCP Rollout Mechanism

The level −1 models are obtained by having the level −1 agent always assume all partner types to be equally likely (\(P[\alpha_{\text{partner}} = \alpha_i] = \tfrac{1}{3}, \forall i\)), setting the planning horizon to 0, meaning the partner acts on immediate utilities only, and calculating the agent's expected utilities after marginalizing over partner types and their respective response probabilities based on their immediate utilities.
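The pseudo-count update of Eq 19 can be sketched as follows. This is a minimal illustration, assuming that the probability assigned to each guilt type is the normalised pseudo-count (which reproduces the uniform 1/3 of Eq 18 at the start); the function names and the likelihood values in the usage example are hypothetical.

```python
# Illustrative sketch of the approximate belief update in Eq (19).
# Assumption: type probabilities are the normalised Dirichlet pseudo-counts.
GUILT_TYPES = (0.0, 0.4, 1.0)


def initial_counts() -> list:
    """Eq (17): one pseudo-count per guilt type."""
    return [1.0 for _ in GUILT_TYPES]


def update_counts(counts: list, likelihoods: list) -> list:
    """Eq (19): add the likelihood of the observed action under each guilt type."""
    if len(counts) != len(likelihoods):
        raise ValueError("need one likelihood per guilt type")
    return [c + p for c, p in zip(counts, likelihoods)]


def type_probabilities(counts: list) -> list:
    """Normalise pseudo-counts into a belief over guilt types."""
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    belief = initial_counts()
    print(type_probabilities(belief))
    # Hypothetical likelihoods of one observed partner action under each type.
    belief = update_counts(belief, [0.1, 0.3, 0.6])
    print(type_probabilities(belief))
```

Because the update is additive, applying the same set of likelihood vectors in a different order yields the same pseudo-counts, which is the property that theorem 1 exploits for the level 0 recombining tree.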
In the POMCP treatment of the multi-round trust game, if a simulated agent reaches a given history for the first time, a value estimate for the new node is derived by treating the agent as level −1 and using an ε-greedy decision making mechanism on the expected utilities to determine their actions until the present planning horizon.

Behavioural Results

We adapted the POMCP algorithm [31] to solve IPOMDPs [23], and cast the multi-round trust task as an IPOMDP that could thus be solved. We made a number of approximations that were prefigured in past work in this domain [22, 24], and also made various observations that dramatically simplified the task of planning, without altering the formal solutions. This allowed us to look at longer planning horizons, which is important for the full power of the intentional modeling to become clear. Here, we first seek to use this new and more powerful planning method to understand the classes of behaviour that arise from different settings of the parameters, as shown in the following section. From the study of human interactions [16], the importance of coaxing (returning more than the fair split) has been established. From our own study of the data collected so far, we define four coarse types of pure interactions, which we call Cooperation, Coaxing to Cooperation, Coaxing to Exploitation and Greedy; we conceptualize how these might arise. We also delimit the potential consequences of having overly restricted the planning horizon in past work in this domain, and examine the qualitative interactive signatures (such as how quickly average investments and repayments rise or fall) that might best capture the characteristics of human subjects playing the game. We then discuss the quality of statistical inference, by carrying out model inversion for our new method and comparing to earlier work in this domain [24]. Finally, we treat real subject data collected for an earlier study [22] and show that our new approach recovers significant behavioural differences not obtained by earlier models, and offers a significant improvement in the classification of subject behaviour through the inclusion of the planning parameter in the estimation and the quality of estimation on the trustee side.

Modalities

All simulations were run on the local cluster at the Wellcome Trust Centre for Neuroimaging. For sample paths and posterior distributions, for each pairing of investor guilt, investor sophistication, trustee guilt and trustee sophistication, 60 full games of 10 exchanges each were simulated, totaling 8100 games. Additionally, in order to validate the estimation, a uniform mix of all parameters was used, implying a total of 2025 full games. To reduce the variance of the estimation, we employed a pre-search method. Agents with ToM greater than 0 first explored the constant strategies (offering/returning a fixed fraction) to obtain a minimal set of \(\tilde{Q}\) values from which to start searching for the optimal policy using SOFTUCT. This ensures that inference will not get stuck in a close-to-optimal initial offer just because another initial offer was not adequately explored. This is more specific than just increasing the exploration bonus in the SOFTUCT rule, which would diffuse the search during all stages, rather than helping search from a stable initial grid.
We set a number \(n\) of simulations for the initial step, where the beliefs about the partner are still uniform and the time horizon is still furthest away. We then reduce the number of simulations as the time horizon approaches: \((n, n_9, n_8, \ldots, n_1)\).

Simulation and Statistical Inference

Unless stated otherwise, we employ an inverse temperature in the softmax of \(\beta = \tfrac{1}{3}\) (noting the substantial scale of the rewards). The exploration constant for POMCP was set to \(c = 25\). The initial beliefs were uniform, \(a_i = 1, \forall i\), for each subject. For the 3 possible guilt types we use the following expressions in the text: \(\alpha = \alpha_1\) is greedy, \(\alpha = \alpha_2\) is pragmatic and \(\alpha = \alpha_3\) is guilty. However, on all the graphs, we give the exact model classification in the form I:\((k_I, \alpha_I, P_I)\) for the investor and T:\((k_T, \alpha_T, P_T)\) for the trustee. We present average results over multiple runs generated stochastically from each setting of the parameter values. In the figures, we report the actual characteristics of investor and trustee; however, in keeping with the overall model, although each agent knows their own parameters, they are each inferring their opponent's degree of guilt based on their initial priors. As a consequence of our earlier observation in theorem 2, we only consider \(k \in \{0, 2\}\) for the investor and \(k \in \{0, 1\}\) for the trustee. Planning horizons are restricted to \(P \in \{0, 2, 7\}\), as noted before, with the level 0 trustee always having a planning horizon of 0. Actions for both agents are parametrized as in section The Trust Task and averaged across identical parameter pairings. In the graphs, we show actions in terms of the percentages of the available points that are offered or returned. For the investor, the numerical amounts can be read directly from the graphs; for the trustee, these amounts depend on the investor's action.
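To make the role of the inverse temperature concrete, the following Python sketch computes softmax choice probabilities from a list of action values. Eq 4 itself is not reproduced in this section, so the sketch assumes the standard form in which the probability of an action is proportional to the exponential of β times its value; the function name and the example values are hypothetical.

```python
# Illustrative sketch of a softmax choice rule with inverse temperature beta.
# Assumption: P(a) is proportional to exp(beta * Q(a)), the standard softmax form.
import math


def softmax_policy(q_values: list, beta: float = 1.0 / 3.0) -> list:
    """Map action values to choice probabilities."""
    # Subtracting the maximum leaves the probabilities unchanged
    # and avoids overflow for large action values.
    q_max = max(q_values)
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical action values on the scale of the trust game payoffs.
    print(softmax_policy([20.0, 25.0, 30.0]))
```

With rewards on the order of tens of monetary units, a small inverse temperature such as 1/3 keeps choices noisy without making them uniform.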

Dual to generating behaviour from the model is to invert it to find parameter settings that best explain observed interactions [22, 24]. Conceptually, this can be done by simulating exchanges between partners of given parameter settings \((k, \alpha, P)\), taking the observed history of investments and responses, and using a maximum likelihood estimation procedure which finds the settings for both agents that maximise the chance that simulated exchanges between agents possessing those values would match the actual, observed exchange. We calculate the action likelihoods through the POMCP method outlined earlier and accumulate the negative log-likelihoods, looking for the combination that produces the smallest negative log-likelihood. This is carried out for each combination of guilt and sophistication for both investor and trustee.

Paradigmatic Behaviours

The following figures show the three characteristic types of behaviour, in each case for two sets of parameters for investor and trustee. The upper graphs show the average histories of actions of the investor (blue) and trustee (red) across the 10 rounds; the middle graphs show the mean posterior distributions over the three guilt parameters (0, 0.4, 1) as estimated by the investor and the lower graphs show the mean posterior distribution by the trustee (right) at four stages in the game (rounds 0, 3, 6 and 9). These show how well the agents of each type are making inferences about their partners.

Fig 11 shows evidence for strong cooperation between two agents who are characterized by high inequity aversion (i.e., guilty). Cooperation develops more slowly for agents with shorter (left) than longer (right) planning horizons, enabling a reliable distinction between different guilty pairs. This is shown more explicitly in Fig 12 in terms of the total amount of money made by both participants.
Both cases can be seen as instances of a tit-for-tat-like approach by the players, although unlike a strict tit-for-tat mechanism, the process leading to high level cooperation is generally robust against subsequent below-par actions by either player. Rather, high level players would employ coaxing to reinforce cooperation in this case. This is true even for lower level players: once they have formed beliefs about the partner, they will not immediately reduce their offers after a few low offers or returns, due to the Bayesian updating mechanism. The posterior beliefs show both partners ultimately inferring the other's guilt type correctly in both pairings; however, the \(P_I = 7\) investors remain aware of the possibility that the partners may actually be pragmatic, and therefore the high level long horizon investors are prone to reduce their offers preemptively towards the end of the game. This data feature was noted in particular in the study [22], and our generative model provides an explanation for it, based on the posterior beliefs of higher level agents explained above.

Fig 13 shows that level 1 trustees employ coaxing (returning more than the fair split) to get the investor to give higher amounts over extended periods of time. In the example settings, the level 0 investor completely falls for the trustee's initial coaxing (left), coming to believe that the trustee is guilty rather than pragmatic until towards the very end. However, the level 2 investor (right) remains cautious and starts reducing offers soon after the trustee gets greedy, decreasing their offers faster than if playing a truly guilty type. The level 2 investor on average remains ambiguous between the partner being guilty or pragmatic. Either inference prevents them from being as badly exploited as the level 0 investor. In these plots, investor and trustee both have long planning horizons; we later show what happens when a trustee with a shorter horizon (\(P_T = 2\)) attempts to deceive.
A level 1 trustee can also get pragmatic investors to cooperate through coaxing, as demonstrated in Fig 14. The returns are a lot higher than for a level 0 guilty trustee, who lacks a model of their influence on the investor, and hence does not return enough to drive up cooperation. This initial coaxing is a very common behaviour of high level healthy trustees, trying to get the investor to cooperate more quickly, for both guilty and pragmatic high level trustees.

Fig 11. Guilty types. Averaged Exchanges (upper) and posteriors (mid and lower). Left plots: Investor \((k_I, \alpha_I, P_I)\) = (2,1,2); Trustee (1,1,2); right plots: Investor (2,1,7) and Trustee (1,1,7). The posterior distributions are shown for α = (0,0.4,1) at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value. doi:10.1371/journal.pcbi.1004254.g011

Fig 12. Average overall gains for the exchanges in Fig 11 with planning 2 (dark blue) and 7 (light blue). The difference is highly significant (p < 0.01) at a sample size of 60 for both parameter settings. Error bars are standard deviations. doi:10.1371/journal.pcbi.1004254.g012

Inconsistency or Impulsivity

Trustees with planning horizon 2 tend to find it difficult to maintain deceptive strategies. As can be seen in Fig 15, even when both agents have a planning horizon of 2, a short-sighted trustee builds significantly less trust than a long-sighted one. This is because it fails to see sufficiently far into the future, and exploits too early. This planning horizon thus captures cognitive limitations or impulsive behaviour, while the planning horizon of 7 generally describes the consistent execution of a strategy during play. Such a distinction may be very valuable for the study of clinical populations suffering from psychiatric disorders such as attention deficit hyperactivity disorder (ADHD) or borderline personality disorder (BPD), who might show high level behaviours, but then fail to maintain them over the course of the entire game. Inferring this requires the ability to capture long horizons, something that had eluded previous methods. This type of behaviour shows how important the availability of different planning horizons is for modeling, as earlier implementations such as [24] would treat this impulsive type as the default setting.

Greedy Behaviour

Another behavioural phenotype with potential clinical significance arises with fully greedy partners, see Fig 16.
Greedy low level investors invest very little, even if trustees try to convince them of a high guilt type on their part as described above (coaxing). Cooperation repeatedly breaks, which is reflected in the high variability of the investor trajectory. Two high level greedy types initially cooperate, but since the greedy trustee egregiously over-exploits, cooperation usually breaks down quickly over the course of the game, and is not repaired before the end. In the present context, the greedy type appears quite pathological in that they seem to hardly care at all about their partner's type. The main exception to this is the level 2 greedy investor (an observation that underscores how theory of mind level and planning can change behaviour that would seem at first to be hard coded in the inequality aversion utility function). The level 0 greedy investor will cause cooperation to break down, regardless of their beliefs, as

Fig 13. Deceptive trustees. Averaged Exchanges (upper) and posteriors (mid and lower). Left plots: Investor (k_I, α_I, P_I) = (0,1,7); Trustee (1,0.4,7); right plots: Investor (2,1,7) and Trustee (1,0.4,7). The posterior distributions are shown for α = (0,0.4,1) at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value. doi: /journal.pcbi g013

Fig 14. Driving up cooperation. Average Exchanges (upper) and posteriors (mid and lower), Investor (0,0.4,7) and Trustee (1,1,7). The posterior distributions are shown for α = (0,0.4,1) at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value. doi: /journal.pcbi g014

Fig 15. Impulsive trustee cannot exploit consistently. Average Exchanges (upper) and posteriors (mid and lower), Investor (0,1,2) and Trustee (1,0.4,2). The posterior distributions are shown for α = (0,0.4,1) at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value. doi: /journal.pcbi g015

Fig 16. Greedy agents break cooperation. Averaged Exchanges (upper) and posteriors (mid and lower). Left plots: Investor (k_I, α_I, P_I) = (0,0,7); Trustee (1,0,7); right plots: Investor (2,0,7) and Trustee (1,0,7). The posterior distributions are shown for α = (0,0.4,1) at four stages in the game. Error bars are standard deviations. The asterisk denotes the true partner guilt value. doi: /journal.pcbi g016

in Fig 16 the posterior beliefs of the level 0 investor show that they believe the trustee to be guilty, but do not alter their behaviour in the light of this inference.

Planning Mismatch: High Level Deceived by Lower Level

In Fig 17, the investor is level 2, and so should have the wherewithal to understand the level 1 trustee's deception. However, the trustee's longer planning horizon permits her to play more consistently, and thus exploit the investor for almost the entire game. This shows that the advantage of sophisticated thinking about other agents can be squandered given insufficient planning, and poses an important question about the efficient deployment of cognitive resources to the different demands of modeling and planning of social interactions.

Confusion

Model inversion. A minimal requirement for using the proposed model to fit experimental data is self-consistency. That is, it should be possible to recover the parameters from behaviour that was actually generated from the model itself. This can alternatively be seen as a test of the statistical power of the experiment, i.e., whether 10 rounds suffice in order to infer subject parameters. We show the confusion matrix which indicates the probabilities of the inferred guilt (top), ToM (middle) and planning horizon (bottom) for investor (left) and trustee (right), in each case marginalizing over all the other factors. Afterwards, we discuss a particular special case of the obtained confusion. This confusion relates to observations made in empirical studies (see [20, 22]) and suggests the notion of the planning parameter as a measure of consistency of play. Later, we show comparative data reported in the study [24], which only utilized a fixed planning horizon of 2 and 2 guilt states and did not exploit the other simplifications that we introduced above. These simplifications implied that the earlier study would find recovery of theory of mind in particular to be harder.
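The marginal confusion matrices described above can be sketched in a few lines. This is only an illustration of the bookkeeping, not the authors' code: the function name `confusion_matrix` and the true/inferred guilt values below are invented for the example; only the three guilt levels (0, 0.4, 1) come from the text.

```python
from collections import Counter


def confusion_matrix(true_values, inferred_values, levels):
    """Rows: true level. Columns: inferred level. Entries: row percentages."""
    counts = Counter(zip(true_values, inferred_values))
    matrix = []
    for t in levels:
        row_total = sum(counts[(t, e)] for e in levels)
        matrix.append([
            100.0 * counts[(t, e)] / row_total if row_total else 0.0
            for e in levels
        ])
    return matrix


# Hypothetical recovery results (invented data): true guilt of simulated
# agents and the guilt value recovered by model inversion.
guilt_levels = [0, 0.4, 1]
true_guilt = [0, 0, 0.4, 0.4, 0.4, 0.4, 1, 1]
inferred_guilt = [0, 0, 0.4, 0.4, 0.4, 0, 1, 1]

cm = confusion_matrix(true_guilt, inferred_guilt, guilt_levels)
for level, row in zip(guilt_levels, cm):
    print(level, row)
```

Marginalizing over the other parameters amounts to pooling all simulated dyads that share the same true value of the parameter of interest before counting.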
As Fig 18 shows, guilt is recovered in a highly reliable manner. By contrast, there is a slight tendency to overestimate ToM in the trustees. The greatest confusion turns out to be inferring a P_I = 7 investor as having P_I = 2 when playing an impulsive trustee (P_T = 2), a problem shown more directly in Fig 19. The issue is that when the trustee is impulsive, far-sighted investors (P_I = 7) can gain no advantage over near-sighted ones (P_I = 2), and so the choices of this dyad lead to mis-estimation. Alternatively put, an impulsive trustee brings the investor down to his or her level. This has been noted in previous empirical studies, notably the observations in [20, 22] of the effect on investors of playing erratic trustees. The same does not apply on the trustee side, since the reactive nature of the trustee's tactics makes them far less sensitive to impulsive investor play. Given the huge computational demands of planning, it seems likely that investors could react to observing a highly impulsive trustee by reducing their own actual planning horizons. Thus, the inferential conclusion shown in Fig 19 may in fact not be erroneous. However, this possibility reminds us of the necessity of being cautious in making such inferences in a two-player compared to a one-player setting.

Confusion comparison to earlier work. We compare our confusion analysis to the one carried out in the grid based calculation in [24]. In [24] the authors do not report exact confusion metrics for the guilt state, only noting that it is possible to reliably recover whether a subject is characterized by high guilt (0.7) or low guilt (0.3). We can however compare to the reported ToM level recovery. The comparison with [24] faces an additional difficulty in that, despite using the same formal framework as the present work, the indistinguishability of the level 1 and 2 trustees and the level 0 and 1 investors had not yet been identified.
This explains the somewhat higher amount of confusion when classifying ToM levels, reported in [24]. Also,

Fig 17. Higher level investor deceived by consistent trustee. Average Exchanges, Investor (2,1,2) and Trustee (1,0.4,7). Error bars are standard deviations. The asterisk denotes the true partner guilt value. doi: /journal.pcbi g017

Fig 18. Percentage of inferred guilt, theory of mind and planning horizon for investor (left) and trustee (right) as a function of the true values, marginalizing out all the other parameters. Each plot corresponds to a uniform mix of 15 pairs per parameter combination and partner parameter combination. doi: /journal.pcbi g018

Fig 19. Planning misclassification. Maximum likelihood estimation result, P_I = 7 and P_T = 2 agent combinations, marginalized maximum likelihood estimation of investor planning horizon over all other parameters. doi: /journal.pcbi g019

Fig 20. Classification probability reported in [24]. In analogy to Fig 18 we depict the generated vs estimated values in a matrix scheme. doi: /journal.pcbi g020

since calculation of the Dirichlet-Multinomial probability was done numerically in this study, some between-level differences will only derive from changes in quadrature points for higher levels. As can be seen in Fig 20 (left), almost all of the level 1 trustees at low guilt are misclassified. This is due to them being classified as level 2 instead, since both levels have the same behavioral features, but apparently the numerical calculation of the belief state favored the level 2 classification over the level 1 classification. The tendency to overestimation is true on the investor side as well, with there being a considerable confusion between level 0 and level 1 investors,

who should behaviorally be equivalent. In sum, this leads to the reported overestimation of the theory of mind level. We have depicted the confusion levels reported in [24] in Fig 20.

Fig 21. Numerical properties. (left) Average running times for calculating the first action value of a level 2, guilt 1 investor from a given number of simulations, as a function of planning horizon (complexity). (right) Discrepancy to the converged case of the action probabilities for the first action measured in squared discrepancies. doi: /journal.pcbi g021

Computational Issues

The viability of our method rests on the running time and stability of the obtained behaviours. In Fig 21, we show these for the case of the first action, as a function of the number of simulation paths used. All these calculations were run at the local Wellcome Trust Center for Neuroimaging (WTCN) cluster. Local processor cores were of Intel Xeon E312xx (Sandy Bridge) type clocked at 2.2 GHz and no process used more than 4 GB of RAM. Note that, unless more than 25k paths are used, calculations take less than 2 minutes. We quantify simulation stability by comparing simulations for 120 level 2 investors (a reasonable upper bound, because the action value calculation for this incorporates the level 1 trustee responses) based on varying numbers of paths with a simulation involving 10^6 paths that has converged. We calculate the between (simulated) subject discrepancies C of the probabilities for the first action for P_I ∈ {2,3,4,5,6,7,8,9}:

$$C_{ij} = \frac{1}{120} \sum_{k=1}^{120} \left( P_k\left[a^I_0 = \tfrac{i}{4}\right] - \hat{P}\left[a^I_0 = \tfrac{i}{4}\right] \right) \left( P_k\left[a^I_0 = \tfrac{j}{4}\right] - \hat{P}\left[a^I_0 = \tfrac{j}{4}\right] \right), \quad i, j \in \{0, \dots, 4\} \qquad (20)$$

where $\hat{P}[a^I_0 = i/4]$ are the converged probabilities, and $P_k[a^I_0 = i/4]$ is the action likelihood of simulated subject k. If the sum of squares of the entries in the discrepancy matrix is low, then the probabilities will be close to their converged values.
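As a rough illustration of the discrepancy measure just described, the sketch below builds a matrix of averaged products of deviations from the converged first-action probabilities and then sums its squared entries. The averaging over simulated subjects is an assumption about the normalisation, and the function names and the toy probabilities are hypothetical, not taken from the paper.

```python
def discrepancy_matrix(subject_probs, converged_probs):
    """C[i][j]: mean over simulated subjects of the product of deviations
    from the converged probabilities for actions i and j.
    (Averaging over subjects is an assumed normalisation.)"""
    n_subjects = len(subject_probs)
    n_actions = len(converged_probs)
    c = [[0.0] * n_actions for _ in range(n_actions)]
    for probs in subject_probs:
        diffs = [p - q for p, q in zip(probs, converged_probs)]
        for i in range(n_actions):
            for j in range(n_actions):
                c[i][j] += diffs[i] * diffs[j] / n_subjects
    return c


def sum_of_squares(matrix):
    """Scalar summary: small values mean the probabilities have converged."""
    return sum(entry * entry for row in matrix for entry in row)


# Toy example with 5 investor actions (invented numbers).
converged = [0.2, 0.2, 0.2, 0.2, 0.2]
subjects = [
    [0.2, 0.2, 0.2, 0.2, 0.2],   # identical to the converged case
    [0.3, 0.1, 0.2, 0.2, 0.2],   # deviates on the first two actions
]
C = discrepancy_matrix(subjects, converged)
print(sum_of_squares(C))
```

In practice `subject_probs` would hold one first-action distribution per simulated investor at a given number of sample paths, and `converged_probs` the distribution obtained from the long reference run.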
As can be seen from Fig 21 (right), for 25k paths even agents planning 9 steps ahead have converged in their initial action probabilities, such that their action probabilities vary from the converged value by no more than about 0.1. However, note that this convergence is not always monotonic in either the planning horizon or the number of sample paths. The former is influenced by the differing complexity of preferences for different horizons: sometimes, actions are

harder to resolve for short than long horizons. The latter is influenced by the initial pre-search using constant strategies. Although 25k steps suffice for convergence even when planning 9 steps ahead, this horizon remains computationally challenging. We thus considered whether it is possible to use a shorter horizon of 7 steps, without materially changing the preferred choices. Fig 22 illustrates that the difference is negligible compared with the fluctuations of the Monte Carlo approach, even for the worst case involving the pairing of 2 pragmatic types, with high ToM levels and long planning horizons. At the same time, the calculation for P = 7 is twice as fast as P = 9 for the level 2 investor, which even just for the first action is a difference of 100 seconds. Finally, we compare our algorithm at planning 2 steps ahead to the grid-based calculation used before [22, 24]. The speed advantage is a factor of 200 for 10^4 paths in POMCP, demonstrating the considerable improvement that enables us to consider longer planning horizons.

Fig 22. Planning horizon comparison. Average Exchanges, Investor (2,0.4,7) (dark blue) and Trustee (1,0.4,7) (red), as well as Investor (2,0.4,9) (light blue) and Trustee (1,0.4,9) (rose). The difference between the 2 planning horizons is not significant at any point. Error bars are standard deviations. doi: /journal.pcbi g022

Comparison To Earlier Subject Classifications

We will show below, using real subject data taken from [22], that our reduction to 3 guilt states does not render likelihoods worse and only serves to improve classification quality. We compared the results of our new method with the results obtained in earlier studies ([24], [22]).

Dataset. We performed inference on the same data sets as in Xiang et al. [22] (which were partially analysed in [24, 16] and [17]). This involved 195 dyads playing the trust game over 10 exchanges.
The investor agent was always a healthy subject; the trustees comprised various clinical groups, including anonymous, healthy trustees (the impersonal group; 48 subjects), healthy trustees who were briefly encountered before the experiment (the personal group; 52

subjects), trustees diagnosed with Borderline Personality Disorder (BPD) (the BPD group; 55 subjects), and anonymous healthy trustees matched in socio-economic status (SES) to the (lower than healthy) SES distribution of BPD trustees (the low SES group; 38 subjects).

Models used. We compared our models to the results of the model used in [22] on the same data set (which incorporates the data set used in [24]). The study [22] uses 5 guilt states {0,0.25,0.5,0.75,1} compared to our 3, a planning horizon of 2 and an inverse temperature of 1; otherwise the formal framework is exactly the same as in the section on the trust task. Action values in [22] were calculated by an exact grid search over all possible histories and a numerical integration for the calculation of the belief state. For comparison purposes we built a clamped model in which the planning horizon was fixed at the value 2, with 3 guilt states and an inverse temperature set to β = 1/3. Additionally, we compared to the outcome for the full method in this work, including estimation of the planning horizon. We noted that in the analysis in [22], an additional approximation had been made at the level 0 investor level, which set those investors as non-learning. This kept their beliefs uniform and yielded much better negative log likelihoods within said model than if they were learning.

Subject fit. A minimal requirement to accept subject results as significant is that the negative log likelihood is significantly better than random on average at p < 0.05; otherwise we would not trust a model based analysis over random chance and the estimated parameters would be unreliable. This criterion is numerically expressed as a negative log likelihood of 16.1 for 10 exchanges, calculated from 5 possible actions at a probability of 0.2 each, with independent actions each round.
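The chance-level figure quoted above follows directly from the stated assumptions (5 equiprobable actions, 10 independent rounds). A minimal check, with hypothetical function names:

```python
import math


def random_negative_log_likelihood(n_rounds=10, n_actions=5):
    """Negative log likelihood of a uniformly random player:
    each round contributes -ln(1 / n_actions)."""
    return -n_rounds * math.log(1.0 / n_actions)


def negative_log_likelihood(action_probs):
    """Negative log likelihood of one subject, given the probability the
    model assigned to each action the subject actually took."""
    return -sum(math.log(p) for p in action_probs)


baseline = random_negative_log_likelihood()
print(round(baseline, 1))  # 10 * ln(5), about 16.1
```

A fitted subject whose `negative_log_likelihood` falls below this baseline is explained better than by uniform random choice; whether the improvement is significant still has to be tested across subjects.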
For the analysis in [22], we found that the level 0 approximation made in [22] allowed for significantly better negative investor log likelihoods (mean 11.98); if this approximation is removed, the investor data fit at an inverse temperature of 1 would be worse than random for this data set. Additionally, the model used in [22] did not fit the trustee data significantly better than random at p < 0.05 (mean negative log likelihood 15.6, standard deviation > 3). Conversely, for both our clamped and full model analysis at β = 1/3, the trustee likelihood is significantly better than random (11.7 for the full model) and the investor negative log likelihood is slightly better on average (smaller) than found in [22] with 5 guilt states (11.7 for our method, vs 11.98). This confirms that reducing the number of guilt states to 3 only reduces confusion and does not worsen the fit of real subjects' data. Additionally, it becomes newly possible to perform model-based analyses on the BPD trustee guilt state distribution, since the old model did not fit trustees significantly better than random at p < 0.05.

The seemingly low inverse temperature of β = 1/3 is a consequence of the size of the rewards and the quick accumulation of higher expectation values with more planning steps, as the inverse temperature needs to counterbalance the expectation size to keep choices from becoming nearly deterministic. Average investor outcome expectations (at the first exchange) for planning 0 steps stand at 18 with an average 18 being added at each planning step.

Marginal parameter distributions: significant features. Fig 23 shows the significant parameter distribution differences (Kolmogorov-Smirnov two sample test, p < 0.05).
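To illustrate the earlier remark on the inverse temperature, the sketch below applies a standard softmax choice rule to a set of action values at β = 1 and β = 1/3. The action values are invented for illustration and the softmax form is assumed here as the choice rule; the point is only that larger value differences push choices toward determinism unless β is lowered.

```python
import math


def softmax(values, beta):
    """Choice probabilities proportional to exp(beta * value)."""
    top = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical expected outcomes for the 5 investor actions.
action_values = [18.0, 20.0, 22.0, 24.0, 26.0]

probs_beta_1 = softmax(action_values, beta=1.0)
probs_beta_third = softmax(action_values, beta=1.0 / 3.0)

print(max(probs_beta_1), max(probs_beta_third))
```

With these toy values the best action dominates much more strongly at β = 1 than at β = 1/3, which is the sense in which a lower inverse temperature counterbalances large expectation values.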
For investor theory of mind and trustee guilt distribution, many of the same differences are significant for the analysis reported in [22] (see Fig 23, upper panels), for an analysis using our model with a clamped planning horizon of 2 steps ahead (see Fig 23, middle panels, to match with the approach of [24]) and for our full model, using 3 guilt states, ToM level up to 2 and 3 planning horizons (see Fig 23, bottom panels and Fig 24). We find significantly lowered ToM in most other groups, compared to the impersonal control group. We find a significantly lowered guilt distribution in BPD trustees; however, the guilt difference was not used for fMRI analysis in [22], because, as noted above, the trustee was not fit significantly better than

Fig 23. Parameter distributions for different models on the data set of [22]. (upper left) Investor ToM distribution is significant (p < 0.05) between the impersonal control condition and all other conditions. (upper right) Trustee Guilt distribution is significant between impersonal controls and the BPD trustees. (middle left) Planning 2 investor ToM distribution with 3 guilt states. BPD and low SES differences to impersonal are significant. (middle right) Planning 2 trustee guilt, the difference between BPD trustees and impersonal controls is significant. (bottom left) Full planning model investor ToM, all differences to impersonal are significant. (bottom right) Full planning model trustee guilt. BPD trustees are significantly different from controls. The asterisk denotes a significant (p < 0.05) difference in the Kolmogorov-Smirnov two sample test, to the impersonal control group. doi: /journal.pcbi g023

Fig 24. Planning horizon distribution on data set of [22]. Planning distribution for Investors, distinguished between personal condition controls (non significant) and BPD and low SES trustees (significantly lower than impersonal). The asterisk denotes a significant (p < 0.05) difference in the Kolmogorov-Smirnov two sample test, to the impersonal control group. doi: /journal.pcbi g024

random at p < 0.05 in the earlier model. For our full model with 3 planning values, we find additional significant differences on the investor side: while all ToM distributions are significantly different from the impersonal condition, the planning difference between the personal and impersonal conditions is not significant at p < 0.05, while it is significant for the other groups (see Fig 24). Thus, this is the only model keeping the parameter distribution of the personal group distinct from both the impersonal group (from which it is not significantly different in the clamped model) and the low SES playing controls and BPD playing controls (from which it is not significantly different based on the parameters in [22]) at the same time. This supports the planning horizon as a consistency of play and additional rationality measure, as the subjects might not think about possible partner deceptions as much in the personal condition, having just met the person they will be playing (resulting in lowered ToM). However, their play is non-disruptive, if lower level, and consistent exchanges result. BPD and low SES trustees, however, disrupt the partner's play, lowering their planning horizon.
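The group comparisons above rely on the two-sample Kolmogorov-Smirnov test. A minimal sketch of that test follows, using the classical asymptotic series for the p-value; this is a generic illustration rather than the procedure used in the study, the ToM samples are invented, and with small samples and heavily tied discrete levels the asymptotic p-value is only a rough guide.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    d = 0.0
    for x in points:
        cdf_a = sum(1 for v in sample_a if v <= x) / len(sample_a)
        cdf_b = sum(1 for v in sample_b if v <= x) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Asymptotic two-sided p-value (Kolmogorov series); approximate."""
    if d <= 0.0:
        return 1.0
    lam = d * math.sqrt(n * m / (n + m))
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


# Hypothetical inferred ToM levels for two investor groups (invented data).
impersonal_tom = [2, 2, 2, 1, 2, 0, 2, 2]
other_group_tom = [0, 0, 1, 0, 1, 0, 2, 0]

d_stat = ks_statistic(impersonal_tom, other_group_tom)
p_value = ks_pvalue_asymptotic(d_stat, len(impersonal_tom), len(other_group_tom))
print(d_stat, p_value)
```

A group difference would be marked with an asterisk in the sense of Figs 23 and 24 when the resulting p-value falls below 0.05.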
