Modeling the formation of social conventions from embodied real-time interactions

What is the role of real-time control and learning in the formation of social conventions? To answer this question, we propose a computational model that matches human behavioral data in a social decision-making game that was analyzed both in discrete-time and continuous-time setups. Furthermore, unlike previous approaches, our model takes into account the role of sensorimotor control loops in embodied decision-making scenarios. For this purpose, we introduce the Control-based Reinforcement Learning (CRL) model. CRL is grounded in the Distributed Adaptive Control (DAC) theory of mind and brain, where low-level sensorimotor control is modulated through perceptual and behavioral learning in a layered structure. CRL follows these principles by implementing a feedback control loop handling the agent’s reactive behaviors (pre-wired reflexes), along with an Adaptive Layer that uses reinforcement learning to maximize long-term reward. We test our model in a multi-agent game-theoretic task in which coordination must be achieved to find an optimal solution. We show that CRL is able to reach human-level performance on standard game-theoretic metrics such as efficiency in acquiring rewards and fairness in reward distribution.
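For readers who want a concrete picture of the layered structure described in the abstract, below is a minimal sketch of how a reactive control loop and an adaptive reinforcement-learning layer can be combined. The class names, the arbitration rule, and the sensor format are illustrative assumptions, not the published implementation.

```python
import random

# Illustrative sketch of a two-layer control architecture in the spirit of CRL.
# Names, arbitration rule, and sensor format are assumptions for illustration.

class ReactiveLayer:
    """Pre-wired reflex: steer toward the currently sensed reward."""
    def act(self, sensed_direction):
        # sensed_direction: -1 (reward to the left), 0 (none), +1 (right)
        return sensed_direction

class AdaptiveLayer:
    """Round-level reinforcement learner over discrete outcomes."""
    def __init__(self, actions=("high", "low", "none"), epsilon=0.1):
        self.q = {}  # (state, action) -> learned value estimate
        self.actions, self.epsilon = actions, epsilon

    def choose_goal(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q.get((state, a), 0.0))

class CRLAgent:
    """Adaptive layer sets a goal once per round; reactive layer handles
    moment-to-moment control within the round."""
    def __init__(self):
        self.reactive, self.adaptive = ReactiveLayer(), AdaptiveLayer()
        self.goal = "none"

    def begin_round(self, previous_outcome):
        self.goal = self.adaptive.choose_goal(previous_outcome)

    def step(self, sensed_direction):
        if self.goal == "none":
            # No strategic bias: behavior is driven purely by the reflex.
            return self.reactive.act(sensed_direction)
        # Otherwise the goal biases low-level control (illustrative rule).
        return +1 if self.goal == "high" else -1
```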


Now we perform a pairwise statistical comparison between human and model results to analyse the main points of deviation from the human data.
2. Also, account for the specific comments of Reviewer 1 on how to reorganize the material for better understanding. Reviewer 1 has useful suggestions here on clarifying the motivation of your paper for a general audience, for example by adding examples and including relevant literature on social norms. Please follow these suggestions as well. Also, correct the non-working gitlab link(s).

Response:
All the points emphasized by the Associate Editor regarding Reviewer 1's comments have been addressed. Moreover, all the references provided by Reviewer 1 have been incorporated. Please find a point-by-point reply below.

Replies to the reviewer's comments

Below we provide detailed answers to all of the reviewer's comments.

Reviewer 1
The paper studies the role that real-time control and learning play in the formation of social conventions by using different computational models (simulations) and compares the outcome of these simulations with data from human interactions. The authors show that, in order to reach stable (across-round) coordination, an adaptive layer in the algorithm is required. The paper is well written, methodologically sound and makes an interesting contribution. I summarize minor comments (in no particular order) below:

1. Regarding the motivation and literature on social norms and human behaviour, I believe you are missing several papers (summarized under literature). These papers could help to better motivate the paper with respect to how human cooperation and social norms evolve over time (see, e.g., Cialdini and Trost).

Response: Several paragraphs of the introduction have been added to integrate the suggested literature regarding the formation and evolution of social norms (see page 1, par. 1,2).
The Introduction has been updated as follows:

But what is a convention, and how is it formed and maintained over time? Although some forms of cooperative behavior are observed across different animal and insect species, current evolutionary models alone do not seem to provide a sufficient explanation of why humans, but not other animals, exhibit large-scale cooperation among genetically unrelated individuals [3]. Behavioral evidence suggests that "strong reciprocity", defined as a tendency to voluntarily cooperate, if treated fairly, and to punish non-cooperators [4], can account for this uniquely human type of cooperation. Such conditional cooperation allows humans to reach conventions, although groups can differ greatly in the particular social norms they adopt. This variability in social conventions also serves another function: promoting conformity within groups and heterogeneity across groups [5].
2. The motivation of the paper could be clearer. This is a general-interest journal and you might have readers who are from other disciplines and not familiar with the literature. In this regard, I would like to see a better motivation of why your research is important. It is clear that human cooperation is important. It is also obvious that algorithms play an important role; why, however, should we be interested in algorithms that mimic human behaviour? To me, your contribution would be to better understand human cooperation per se, but this is not so clear in the current version of the manuscript. It is also not clear why we would need an algorithm for this. What can we learn from the algorithm that we cannot learn from experiments with humans (in the lab or in the fMRI, for example)? It is also not clear why "a model that can account for how lower-level dynamic processes interact with higher-level (strategic) cognitive processes" is exactly the next step needed. It would be good if the authors could elaborate more on this point. Moreover, I would like to see where such algorithms may be applied in the real world, if there is any application. Furthermore, why would we be interested in algorithms (for application) that are similar to humans? Should we not compose algorithms that outperform humans with regard to fairness and efficiency, too?
Response: Several paragraphs have been added to the introduction to improve the motivation of the paper (see page 2, par. 6,7; page 3, par. 1).
The Introduction has been updated as follows:

These models can be used at later stages not only for the prediction and validation of existing theories, but can also be applied to the control of artificial intelligent agents. If we aim to integrate robots or intelligent machines into our daily lives, we have to provide them with cognitive models that are able to learn and adapt to our social norms and practices. To do so, the algorithms governing the agent must integrate high-level information, such as rules and plans, with embodied information relevant for acting in the real world. One fundamental step in this direction is a model that can account for how lower-level dynamic processes interact with higher-level (strategic) cognitive processes: incorporating real-time components will not only add ecological validity, but can also bootstrap learning by alleviating the sample-inefficiency problem and reducing the time it takes for the model to achieve acceptable performance levels, often seen as a major drawback of machine learning methods [24].
3. I would also like to see a better motivation of the restrictions applied to the different algorithms. The TD algorithm, for example, focuses on long-term reward. Would incorporating an algorithm that solely focuses on short-term reward change the results? (I know the focus is coordination over time, but there is a substantial literature showing that humans often act irrationally and focus on the present rather than thinking long term and considering future interactions/outcomes; see, e.g., present bias.) I am not asking for such simulations, but I would like to see a more focused discussion of the properties of the algorithms and what might change if they had different properties, or, e.g., why considering different properties is of minor importance.

Response:
We have provided a better motivation for the selection of the TD-learning algorithm in the Methods section (see page 6, par. 1,2).
The Methods section has been updated as follows:

Functionally, it determines the agent's action at the beginning of the round, based on the state of the previous round and its policy. There are three possible states S: high, low and tie; they indicate the outcome of the previous round for each agent. That is, if an agent got the high reward in the previous round, the state is high; if it got the low reward, the state is low; and if both agents went to the same reward, the state is tie. There are likewise three actions A: go to the high reward, go to the low reward, and none. The parameter values reported below were obtained through a parameter search aimed at fitting the behavioral data in [23].
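As an illustration of this setup, a minimal tabular TD sketch over the three states and three actions could look as follows. The learning rate, discount factor, exploration rate, and the specific update rule are placeholder assumptions; the fitted parameter values and exact algorithm are those reported in the manuscript.

```python
import random

# Illustrative tabular TD sketch for the round-level decision described above.
# States encode the previous round's outcome; actions pick a target reward.
# alpha, gamma, and epsilon are placeholders, not the fitted parameters.
STATES = ("high", "low", "tie")
ACTIONS = ("high", "low", "none")

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def choose_action(state):
    """Epsilon-greedy action selection at the start of a round."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def td_update(state, action, reward, next_state):
    """One-step TD update toward reward plus discounted next-state value."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```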
Moreover, we have added a paragraph to the Discussion addressing this issue, pointing to related work in which we report a detailed exploration of the model parameters (see page 14, par. 1,6).
The Discussion section has been updated as follows:

In this paper we have chosen a TD-learning algorithm because it is a validated model of animal learning, tuning its parameters through a search that gave us the best fit to the data. The CRL architecture, however, allows for the implementation of other algorithms, such as the one shown in an extension of this work [65] that integrates a loss-aversion bias into a Q-learning algorithm. Therefore, a fruitful avenue for further research would be a more in-depth exploration of the algorithmic and parameter space of the Adaptive Layer.
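The mechanism used in [65] is not spelled out here, but one common way to fold a loss-aversion bias into a Q-learning update, shown purely as a hedged sketch, is to weight negative prediction errors more heavily than positive ones:

```python
# Hedged sketch: a loss-aversion bias can be added to Q-learning by scaling
# negative prediction errors by a factor lambda_loss > 1. The extension
# cited as [65] may implement the bias differently.
def loss_averse_update(Q, state, action, reward, next_state, actions,
                       alpha=0.1, gamma=0.9, lambda_loss=2.0):
    best_next = max(Q[(next_state, a)] for a in actions)
    delta = reward + gamma * best_next - Q[(state, action)]
    if delta < 0:
        delta *= lambda_loss  # losses loom larger than gains
    Q[(state, action)] += alpha * delta
```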
4. Since I am not from the field, I do not want to (and cannot) judge the implications of the models themselves. What I am missing, though, is a clear explanation of why the TD is only applied in the ballistic setting. Additionally, would it not make sense to compare the models directly, too?

Response:
This may have been a source of confusion: TD-learning is a component of the Adaptive Layer, so it operates in both conditions (ballistic and dynamic). We had labeled the model-ballistic condition 'TD' because the model operates only with the Adaptive Layer in that condition. Since this created confusion, we have changed the labels so that the distinction is now clearer. Moreover, we have added a whole section to the Results, directly comparing the models as requested (see the aforementioned response to the Academic Editor's main points, and Figure 5).

5. Statistical testing: you use non-parametric tests throughout the paper (and present no tests in Section 3.1) but refer to the statistical significance of the results. I would appreciate it if you could show the robustness of the results in parametric regressions as well (can you account for individual fixed effects when using the human data?). Moreover, please also provide the results for the tests performed in Section 3.1.

Response: Non-parametric tests were only used where the normality assumption was violated. Parametric tests were performed in the remaining cases. As for the results of the tests performed in Section 3.1, they have been added to the manuscript (page 11, par. 3,4; page 12, par. 1).
The Results (Model Comparison) section has been updated as follows:

(F(2, 47) = 659.98, p < .001). The results in the Fairness score of the ablated models are comparable to those of the complete CRL model. However, note that these results are computed from fewer rounds, precisely because of the high number of ties reached by the ablated models (the fairness metric measures how evenly the high reward is distributed among the agents). Finally, regarding Stability, the normality tests showed a non-Gaussian distribution, so the non-parametric Kruskal-Wallis test was performed, showing significant differences among the three models in both the high (H(2) = 36.82, p < .001) and low (H(2) = 51.03, p < .001) conditions. With the post-hoc Mann-Whitney U-tests, we observe that both ablated models are significantly less stable than the CRL model (M = 1.09) in the high payoff condition (reactive M = 1.17, p < .001; adaptive M = 1.16, p < .001).

As for the low condition, the results indicate that the reactive model (M = 1.25) performs significantly worse in terms of stability than the adaptive (M = 1.18, p < .001) and CRL (M = 1.17, p < .001) models. These results show that, overall, the ablated models are less stable than the CRL model, as indicated by their higher surprise values (see Figure 7, right panels). From these model-ablation studies we can conclude that either layer working alone leads to more unstable and less efficient outcomes.
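For readers who wish to reproduce this style of analysis, the tests reported above can be run with scipy. The arrays below are random placeholders standing in for the per-dyad stability (surprise) scores; the actual values come from the simulations reported in the manuscript.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data standing in for per-dyad stability (surprise) scores.
crl = rng.normal(1.09, 0.05, 50)
reactive = rng.normal(1.17, 0.05, 50)
adaptive = rng.normal(1.16, 0.05, 50)

# Omnibus non-parametric test across the three models, as in the text.
H, p = stats.kruskal(crl, reactive, adaptive)
print(f"Kruskal-Wallis: H = {H:.2f}, p = {p:.4f}")

# Post-hoc pairwise comparisons against the full CRL model.
for name, scores in [("reactive", reactive), ("adaptive", adaptive)]:
    U, p = stats.mannwhitneyu(crl, scores, alternative="two-sided")
    print(f"CRL vs {name}: U = {U:.1f}, p = {p:.4f}")
```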
6. Discussion: The discussion should be a bit broader. In the paper, the authors concentrate on a very specific game, which is important, but also assume certainty in payoffs. In many situations, however, payoffs are uncertain, which has consequences for cooperation (see, e.g., Xiao and Kunreuther (2016)). It would be desirable if you could at least mention this as a further limitation of your algorithms, which work only under certainty of payoffs.