The Social Bayesian Brain: Does Mentalizing Make a Difference When We Learn?

When it comes to interpreting others' behaviour, we almost irrepressibly engage in the attribution of mental states (beliefs, emotions…). Such "mentalizing" can become very sophisticated, eventually endowing us with highly adaptive skills such as convincing, teaching or deceiving. Here, sophistication can be captured in terms of the depth of our recursive beliefs, as in "I think that you think that I think…" In this work, we test whether such sophisticated recursive beliefs subtend learning in the context of social interaction. We asked participants to play repeated games against artificial (Bayesian) mentalizing agents, which differ in their sophistication. Critically, we made people believe either that they were playing against each other, or that they were gambling as in a casino. Although both framings are similarly deceiving, participants win against the artificial (sophisticated) mentalizing agents in the social framing of the task, and lose in the non-social framing. Moreover, we find that participants' choice sequences are best explained by sophisticated mentalizing Bayesian learning models only in the social framing. This study is the first demonstration of the added value of mentalizing for learning in the context of repeated social interactions. Importantly, our results show that we would not be able to decipher intentional behaviour without a priori attributing mental states to others.


Deriving the Bayesian update rule of 0-ToM
In the following, we posit that a 0-ToM observer a priori believes that the probability of her opponent's choice may vary smoothly over time (as in a bounded random walk). For numerical reasons, the corresponding prior transition density is defined on the log-odds $x^0_t$ rather than on the probabilities themselves, i.e.:
This is in fact the main difference between 0-ToM and the HGF model, which includes an update rule for the volatility $\sigma^0$ (Mathys et al. 2011). We assume that 0-ToM learns by assimilating new observations recursively in time or trials, as follows:
This recursion is a Bayes-optimal probabilistic scheme for online tracking of the log-odds, whose first line is 0-ToM's prediction about her opponent's behavioural tendency, whereas its second line derives from Bayes' rule. Let us assume that 0-ToM holds a Gaussian probabilistic belief
where the constant is the log-normalization factor. The first iteration of the Laplace approximation (Friston et al., 2007) consists in approximating L by its second-order Taylor expansion around $\mu^0_t$, and deriving the approximate first- and second-order moments of the corresponding Gaussian density from there on, as follows:
The limitations of this "early-stopping" variant of the Laplace approximation are discussed in Mathys et al. (2011). Inserting Equation A5 into Equation A4 yields 0-ToM's learning rule:
where the right-hand term is the explicit form of the evolution function of the sufficient statistics of 0-ToM's belief about the log-odds $x^0$.
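The update described in this section can be sketched in a few lines. This is a hedged reconstruction from the verbal description, not the authors' code. It assumes a Bernoulli likelihood whose probability is the sigmoid of the log-odds, a Gaussian random walk with a fixed variance (called `volatility` here, a hypothetical name), and a single Laplace (Gauss-Newton) step taken from the previous posterior mean. Under these assumptions, the new variance is the inverse of the prior precision plus the curvature of the Bernoulli log-likelihood, and the mean moves by the new variance times the prediction error.

```python
import math


def sigmoid(x):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def zero_tom_update(mu, var, a_op, volatility):
    """One trial of the (assumed) 0-ToM learning rule.

    mu, var    : mean and variance of the current Gaussian belief on the log-odds
    a_op       : opponent's observed action (1 = first option, 0 = second)
    volatility : fixed variance of the random walk (hypothetical name)
    """
    p = sigmoid(mu)                 # predicted probability that a_op = 1
    prior_var = var + volatility    # prediction step: the random walk inflates the variance
    # Single Laplace step: posterior precision = prior precision + curvature
    # of the Bernoulli log-likelihood, p * (1 - p), evaluated at the previous mean.
    new_var = 1.0 / (1.0 / prior_var + p * (1.0 - p))
    new_mu = mu + new_var * (a_op - p)   # move along the prediction error
    return new_mu, new_var


# Usage: track an opponent who mostly picks the first option.
mu, var = 0.0, 1.0
for a in [1, 1, 0, 1, 1, 1]:
    mu, var = zero_tom_update(mu, var, a, volatility=1.0)
```

Because the volatility is held fixed in this sketch, the effective learning rate (`new_var`) settles to a stable range instead of being adapted over trials.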

Volterra decompositions of choice sequences
Volterra series allow a systematic decomposition of dynamical systems' input-output relationships, where the output is typically a function of the history of past inputs. In our context, this means fitting the following logistic convolution model:
$$P(a^{self}_t = 1) = s\Big(\omega^0 + \sum_{\tau} \omega^{op}_{\tau}\, a^{op}_{t-\tau} + \sum_{\tau} \omega^{self}_{\tau}\, a^{self}_{t-\tau}\Big) \qquad (A8)$$
where $s$ is the sigmoid function, $P(a^{self}_t = 1)$ is the probability that the agent chooses the first option at trial $t$ and $\tau$ is some arbitrary time lag. In Equation A8, $\omega^0$ is a bias term that captures a potential average tendency to favour one of the alternative options. Here, the choice of the input basis functions ($a^{op}$ and $a^{self}$) was motivated by: (i) their simplicity; (ii) their completeness, i.e. $a^{op}$ and $a^{self}$ are the actual available information at each trial (e.g., the game's outcome can be derived as a nonlinear function of both players' actions); (iii) the fact that they induce a very efficient Volterra decomposition of reinforcement learning algorithms.
First-order Volterra kernels $\omega^{op}$ (resp. $\omega^{self}$) capture the impact of the opponent's (resp. one's own) lagged actions $a^{op}$ (resp. $a^{self}$) on people's choice probability. First-order Volterra kernels would be equivalent to impulse response functions, were the system linear. Note that this analysis is essentially similar to Lau & Glimcher (2005).
We estimated each participant's Volterra kernels $\omega^{op}$ and $\omega^{self}$ in each condition of the 2x4 factorial design. This was done using a variational Bayesian approach (Daunizeau 2014
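As an illustration of the logistic convolution model, the sketch below builds the lagged design matrix and fits first-order kernels. It rests on assumptions that are not in the text: actions are coded 0/1, the number of lags is chosen by the user, and the fit is a simple penalised maximum-likelihood (MAP) gradient ascent standing in for the variational Bayesian scheme. All function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))


def lagged_design(a_op, a_self, n_lags):
    """Build the regressors of the logistic convolution model.

    Each row corresponds to one trial t >= n_lags and contains
    [1, a_op[t-1], ..., a_op[t-n_lags], a_self[t-1], ..., a_self[t-n_lags]].
    Returns the design matrix X and the choices y = a_self[n_lags:].
    """
    a_op = np.asarray(a_op, dtype=float)
    a_self = np.asarray(a_self, dtype=float)
    rows = []
    for t in range(n_lags, len(a_self)):
        op_hist = a_op[t - n_lags:t][::-1]      # most recent lag first
        self_hist = a_self[t - n_lags:t][::-1]
        rows.append(np.concatenate(([1.0], op_hist, self_hist)))
    return np.array(rows), a_self[n_lags:]


def fit_kernels(X, y, n_iter=2000, step=0.1, prior_precision=1.0):
    """MAP estimate of [bias, opponent kernel, own kernel] by gradient ascent.

    A zero-mean Gaussian prior (assumed here) keeps the weights finite.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ w)) - prior_precision * w
        w += step * grad / len(y)
    return w


# Usage on a short, made-up sequence of actions.
a_op = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
a_self = [0, 1, 1, 0, 1, 1, 0, 1, 0, 1]
X, y = lagged_design(a_op, a_self, n_lags=2)
w = fit_kernels(X, y)
bias, kernel_op, kernel_self = w[0], w[1:3], w[3:5]
```

The first entry of each kernel is the weight of the most recent action, so plotting `kernel_op` and `kernel_self` against lag gives the kind of kernel profile discussed in the text.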

RFX-BMS: Group-level Bayesian model selection
All models m have unknown parameters, whose impact on the data y is nonlinear and obscured by measurement noise. This is why we rely upon variational approaches to approximate Bayesian inference (Beal 2003; Penny 2012). In our context, this approach was used to approximate 14×26×2×4 = 2912 model evidences (14 models, 26 participants, 2 task framings, 4 opponents), given each participant's choice sequence in each condition of the main task. These summary statistics were then taken to a random-effect group-level Bayesian model selection (RFX-BMS), as follows.
RFX-BMS assumes that the population is composed of subjects that differ in terms of the model that describes them best. In this view, an experiment is a poll that randomly samples n subjects from the population, who are labelled according to their corresponding model. Let
where $p(m_i \mid r)$ is given in Equation A10 and the prior $p(r)$ is typically set to a non-informative (flat) density, i.e.: $p(r) \propto 1$. It follows that differences in model evidences $L_{ik}$ will be expressed in the posterior distribution $p(r \mid y)$, which will deviate from the prior, i.e.: $E[r_k \mid y] \neq 1/K$ for some models.
One can also derive the so-called exceedance probability (EP) $\varphi_k$, i.e. the probability that the $k$th model is more frequent in the population than any other model (given observed data): $\varphi_k = P(r_k > r_{k'},\ \forall k' \neq k \mid y)$.
Note that this idea is used to assess the stability of models across conditions, which we call "between-conditions" RFX-BMS (Rigoux 2013). One can think of two conditions as inducing an augmented model space composed of $K^2$ 2-tuples that encode all combinations of candidate models and conditions. Here, any 2-tuple identifies the models associated with each condition (which may or may not be the same), and its log-evidence is derived by summing up the corresponding log model evidences over conditions. To assess the probability that the same model underlies both conditions, one uses family inference on a partition of the $K^2$ tuples that divides them into a first subset, in which the same model underlies both conditions, and a second subset containing the remaining tuples (with distinct condition-specific models). The ensuing family EP then measures the probability that different conditions most frequently correspond to different models.
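Exceedance probabilities rarely have a closed form, so they are commonly estimated by sampling. The sketch below assumes, as in standard treatments of RFX-BMS, that the posterior over model frequencies is a Dirichlet density with parameters `alpha`. The text above does not state this, and the function names and example values are hypothetical. The second function applies the same idea to the between-conditions case, where the posterior is defined over the K×K grid of 2-tuples and the "same model" family is its diagonal.

```python
import numpy as np


def exceedance_probabilities(alpha, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(r_k > all other r_k' | y) for each model k.

    alpha: parameters of the (assumed) Dirichlet posterior over frequencies.
    """
    rng = np.random.default_rng(seed)
    r = rng.dirichlet(alpha, size=n_samples)   # one frequency profile per row
    winners = r.argmax(axis=1)                 # most frequent model in each sample
    return np.bincount(winners, minlength=len(alpha)) / n_samples


def same_model_family_ep(alpha_tuples, n_samples=100_000, seed=0):
    """EP that 'same model in both conditions' is the more frequent family.

    alpha_tuples: (K, K) array of Dirichlet parameters, one per 2-tuple
    (row: model in condition 1, column: model in condition 2).
    """
    alpha_tuples = np.asarray(alpha_tuples, dtype=float)
    k = alpha_tuples.shape[0]
    rng = np.random.default_rng(seed)
    r = rng.dirichlet(alpha_tuples.ravel(), size=n_samples)
    diagonal = np.arange(k) * (k + 1)          # flat indices of the (i, i) tuples
    same_freq = r[:, diagonal].sum(axis=1)     # family frequency = sum over its members
    return float(np.mean(same_freq > 0.5))


# Usage with made-up posterior counts.
ep = exceedance_probabilities([20.0, 2.0, 2.0])
ep_same = same_model_family_ep([[10.0, 1.0], [1.0, 10.0]])
```

With two families, the EP of the "different models" family is simply one minus `ep_same`.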

Details about the experimental procedure
The experiment was run at the Laboratoire d'Economie Expérimentale de Paris (LEEP, Paris Experimental Economics Laboratory). We performed two experimental sessions on two different days with two different groups of people. Participants were recruited through the LEEP database.
Participants of each group were welcomed together in the same room, and a computer was randomly assigned to each participant. Small partitions between participants' computers prevented them from communicating or looking at other participants' screens during the experiment.
Before the beginning of the experiment, people were instructed that they could not communicate with each other, that they could freely call off the experiment at any point and that they would receive a monetary bonus that would depend on their performance in the different tasks. Each of the different tasks was then briefly described, along with its payment rule (see below). At this point, participants were invited to ask any question regarding the experiment.
Once the set-up was clear for all participants, the experimental session started. At the beginning of each task, written instructions were displayed on each participant's computer screen.
At the end of the experiment, participants came individually into the "control cabin" of the room to receive payment and answer a few debriefing questions. Participants were first asked to describe their strategy during both the hide and seek and casino games. Then, they were asked to report any perceived differences between the different players and sessions. Finally, they were invited to freely comment on their subjective experience during these two games, as well as during the other tasks.
Below are the payment rules for each task, in the order in which they were presented to and completed by participants:
• Hide and Seek: you will play 4 games of Hide and Seek against 4 different players. At the end of the experiment one of the four games will be randomly selected, and each correct answer will yield 0.15€.
• Vicky's Violin task: this task is not financially rewarded.
• MCST: this task is composed of 40 trials, and each correct answer will yield 0.05€.
• Casino Task: you will play 4 sessions of the game. At the end of the experiment one of the four games will be randomly selected, and each correct answer will yield 0.15€.
• Frith-Happé animations: this task is composed of 20 trials, and each correct answer will yield 0.10€.
• Go-No Go Task: you will be rewarded according to the number of errors (false alarms or missed trials) you make. For instance, making fewer than 3 errors will yield 4€, whereas making more than 40 errors will lead to no monetary payoff.
• Empathy Quotient: this task is not financially rewarded.
• 3-back task: this task is composed of 400 trials, and each correct answer will yield 0.02€.
• Imposing Memory Task: this task is composed of 40 trials, and each correct answer will yield 0.07€.

Results
In what follows, actions $a^{op}$ and $a^{self}$ take binary values encoding the first ($a = 1$) and the second ($a = 0$) available options, by convention. The game outcome at trial t can take the value 1 ("correct") or 0 ("incorrect"). People's performance in each condition of the main task is defined as the difference between the total numbers of correct and incorrect trials.
Note: although we report summary statistics that are not corrected for multiple comparisons, we indicate the family-wise error rate threshold (FWER$_{5\%}$) when necessary. More precisely, we used the standard Sidak correction, i.e.: $\mathrm{FWER}_{5\%} = 1 - 0.95^{1/n}$, where n is the number of multiple tests (Sidak 1967).
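Both definitions above reduce to one-line computations, sketched here with hypothetical function names. The performance score follows the verbal definition (correct minus incorrect trials), and the threshold is the Sidak formula. With n = 8 tests the formula gives about 0.0064, which is consistent with the threshold quoted for the secondary-task analysis.

```python
def performance(outcomes):
    """Number of correct trials minus number of incorrect trials.

    outcomes: sequence of game outcomes coded 1 ("correct") or 0 ("incorrect").
    """
    n_correct = sum(outcomes)
    n_incorrect = len(outcomes) - n_correct
    return n_correct - n_incorrect


def sidak_threshold(n_tests, fwer=0.05):
    """Per-test significance threshold that keeps the family-wise error rate at `fwer`."""
    return 1.0 - (1.0 - fwer) ** (1.0 / n_tests)


# Usage
score = performance([1, 0, 1, 1])      # 3 correct, 1 incorrect
threshold = sidak_threshold(8)         # threshold for 8 simultaneous tests
```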

Design sanity check
Although the k-ToM algorithms were developed without any systematic preference for a given alternative action, their behavioural policy is stochastic in nature. This could have resulted in non-negligible biases that could be different across framing conditions. In turn, this would induce a confound in our interpretation of the pattern of participants' performances across conditions. Thus, we performed the following analysis. First, we measured the absolute bias b of each opponent, against each participant, in each condition:
By construction, a "fair coin" (chance level of 50%) would have zero absolute bias. Figure 1 below depicts the average absolute bias for each opponent in each framing.
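A minimal sketch of such a bias measure is given below. The exact formula is not available here, so the sketch assumes one natural definition: the absolute deviation of the opponent's choice frequency from 50%. This is an assumption, chosen only because it gives zero for a fair coin, and the function name is hypothetical. Under this definition, an opponent that picks one option 65% of the time has an absolute bias of 0.15.

```python
def absolute_bias(actions):
    """Absolute deviation of the frequency of the first option from 1/2.

    actions: opponent's choices coded 1 (first option) or 0 (second option).
    Assumed definition: b = |mean(actions) - 0.5|.
    """
    frequency = sum(actions) / len(actions)
    return abs(frequency - 0.5)


# Usage: a balanced sequence versus a 65%-biased one (13 of 20 trials).
fair = absolute_bias([1, 0, 1, 0])
biased = absolute_bias([1] * 13 + [0] * 7)
```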

Figure 1: average opponents' bias. Group average absolute biases of the four different opponents, plus or minus one standard error (red: non-social framing, blue: social framing).
Reassuringly, RB exhibits a bias at exactly 65%. One can also see that, on average, k-ToM algorithms are left with a small bias of about 55%. We then performed an ANOVA to assess the effect of framing, opponent and their interaction on the opponents' biases. Results show an effect of opponent (F=155.1, $p<10^{-5}$), but no effect of framing (F=0.7, p=0.40) or interaction (F=2.1, p=0.11). This is important, because this makes the small residual bias in the 0-ToM, 1-ToM and 2-ToM conditions an unlikely explanation for people's performance pattern. In particular, the residual bias cannot explain the observed performance difference between framings.
Based on final earnings only, we had summarized the results as follows: in the non-social framing, participants seem to continuously lose against all mentalizing opponents, be even with 0-ToM, and win against RB. In the social framing, participants seem to win against all artificial agents except 2-ToM (null earnings). It is reassuring to see that overall, visual extrapolations of accumulated earnings yield qualitatively similar predictions.

Reaction times analysis
Participants' reaction time was recorded on each trial of each condition of the main task. Figure 3 below summarizes the results in terms of mean reaction times (in log space). One can see that none of our experimental factors (opponent type and task framing) appears to have a clear impact on people's reaction times. In fact, this is confirmed by an ANOVA, which shows no evidence for a main effect of framing (p=0.80) or of opponent type (p=0.33).

Effect of performance in the secondary tasks
We analysed the impact of performance in the seven secondary tasks on people's performance in each session of the main task using a general linear model, which also included participants' age and gender. We used omnibus F-tests to test for the effect of any of the secondary tasks on people's performance in the main task. First, no effect was found in the social (F=0.63, p=0.72) or in the non-social (F=1.55, p=0.38) conditions, when final earnings were averaged across opponents. This holds true for the difference between the social and non-social framings (F=2.13, p=0.10).
ToM task on each of the 2x4 conditions of the main task (columns: four opponents, rows: two task framings).
One can see that none of these tests reaches the 5% false positive rate significance threshold.
Only when we looked at the opponent-specific difference in accumulated earnings between framings did we find an effect (omnibus F-test: F=2.65, p=0.04). More precisely, participants' performance against 1-ToM increases with their performance in the Frith-Happé task (t=2.3, p=0.02), but is not significantly related to other tasks. This makes sense, given that performance in this task is related to the ability to discriminate between intentional and physical causation. Note however that this result is but a statistical tendency, since the corresponding statistical tests were not corrected for multiple comparisons (FWER$_{5\%}$ = 0.0064).
choice sequences (using exponential Volterra kernels, see above paragraph).
It is reassuring to see that the results of the parametric and non-parametric Volterra analyses are qualitatively similar to each other. However, the main effects of our experimental factors (framing and opponent type) are somewhat easier to eyeball in the parametric setting.

RFX-BMS diagnostics
To complement random-effect Bayesian Model Selection (RFX-BMS), we derived simple group-level summary statistics of model inversions. Note that log-evidences in Fig. 6 have been mean-corrected. Recall that no direct comparison between log-evidences in different framings is possible (e.g., one cannot compare the likelihood of a given model in the social versus the non-social framing).
In the social framing, although no model clearly stands out as being more probable than others, one can see that T+ models dominate. In the non-social framing however, it seems that the WSLS strategy is the likeliest explanation for participants' trial-by-trial responses.
Results of the RFX-BMS demonstrate that most participants behave as a 2-ToM agent in the social framing (i.e. 2-ToM has the maximum model frequency, cf. Fig. 7 in the main manuscript).
However, there is strong variability in 2-ToM's fit accuracy across subjects, which explains why 2-ToM does not clearly stand out in Fig. 6. This is illustrated in Figure 7 below, which shows 2-ToM's fit quality for both the best and the worst subject in the social framing (across opponent types). One can see that 2-ToM's fit quality varies from an almost perfect fit (left panel of Fig. 7) to clearly poor fit accuracy (right panel of Fig. 7). This simply indicates that 2-ToM may not be the best explanation for all subjects. In other words, it is likely that the population is composed of subjects that differ in terms of the model that describes them best (cf. main assumption of RFX-BMS).
Figure 8 below summarizes our between-condition RFX-BMS, performed for each experimental factor (framing and opponent type) separately. The main objective of this analysis is to address the question of whether our experimental factors induced a difference in model family (T+ or T-) or not. When assessing the impact of the framing factor, we report the exceedance probability (EP) that people's behaviour in the social and in the non-social framing most frequently corresponds to the same family, for each opponent type. When assessing the impact of the opponent factor, we report the EP that people's behaviour against two different opponents most frequently corresponds to the same family, for each framing. A small EP indicates that people's behaviour in the corresponding pair of conditions is likely to be best described by different model families. One can see that the opponent type factor has a much smaller impact on the best description of people's behaviour than the framing factor. This is confirmed by eyeballing condition-specific RFX-BMS analyses, which are summarized in Figure 9.

RFX-BMS: model identifiability
Different models may yield similar predicted choice sequences, which may confuse Bayesian model selection. We thus performed Monte-Carlo simulations designed to quantify model identifiability, under conditions similar to our experimental data analyses.
We first generated choice sequences under each agent's model (13 models, 60 trials per game, 4 opponents, 26 dummy subjects). For each simulated dataset, we performed a Bayesian Model Selection, based upon the VB approximation to the log evidence of each of the 13 candidate models. For any given type of simulated data, we then measured the frequency with which each candidate model is eventually selected. The so-called confusion matrix derives from renormalizing these frequency profiles, to yield the probability of having simulated the data under each model, given that a particular candidate model was selected. It is shown in Figure 10 below. Any non-diagonal element in this matrix signals a potential confusion between the inferred model and the true (hidden) model. More precisely, the $i$th row shows how often each model was actually generating the data, given that the $i$th model was identified as the most likely.
First of all, one can see that there is almost no possible confusion between models belonging to the T+ family and models belonging to the T- family. In addition, there is almost no confusion between models within the T+ family (lower-right quadrant). However, there are partial model non-identifiabilities within the T- family (upper-left quadrant). In particular, eventually selecting the model 1-BSL is in fact strong evidence for data generated under the model hBL. To a much lesser extent, eventually selecting the model WSLS may in fact be taken as evidence for Bayesian sequence learning (2-BSL and 3-BSL). This is important, since WSLS is the most likely model in the non-social framing (cf. Fig. 7 in the main text).
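The renormalization step that turns selection frequencies into a confusion matrix can be sketched as follows. The sketch assumes that the same number of datasets was simulated under each model (i.e. a uniform prior over generating models) and that the input is a table of counts. The function name and the example counts are hypothetical.

```python
import numpy as np


def confusion_matrix(selection_counts):
    """Convert selection counts into P(true model | selected model).

    selection_counts[j, i]: number of datasets simulated under model j
    for which model i was selected.
    Returns C with C[i, j] = P(data generated under j | model i selected),
    so each row with at least one selection sums to one. Rows for models
    that were never selected are left as NaN.
    """
    counts = np.asarray(selection_counts, dtype=float)
    joint = counts.T                                # rows: selected, columns: true
    row_sums = joint.sum(axis=1, keepdims=True)
    result = np.full_like(joint, np.nan)
    np.divide(joint, row_sums, out=result, where=row_sums > 0)
    return result


# Usage with two made-up models and ten simulations each.
counts = [[8, 2],   # simulated under model 0: selected 0 eight times, 1 twice
          [1, 9]]   # simulated under model 1: selected 0 once, 1 nine times
C = confusion_matrix(counts)
```

Here `C[0, 1]` is the probability that the data actually came from model 1 when model 0 was selected, which is the kind of off-diagonal confusion discussed above.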