Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin

Direct reciprocity, that is, repeated interaction, is a main mechanism for sustaining cooperation in social dilemmas between two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. The mechanisms underlying these behaviors remain largely unclear. Here we provide a proximate account of these behaviors by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff exceeds a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results of the present study are general in that they explain extant experimental results on both so-called moody and non-moody conditional cooperation, in prisoner’s dilemma and public goods games, and in well-mixed groups and networks. In contrast to previous theory, individuals are assumed to have no access to information about what other individuals are doing, such that they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity of cooperation is modulated in every discrete time step, explains the conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from Pavlov, a reinforcement learning strategy that promotes mutual cooperation in two-player situations.

Macy-Flache model
Under the Macy-Flache rule, the probability of cooperation is updated as

p_t = p_{t−1} + (1 − p_{t−1}) ℓ s_{t−1}   (a_{t−1} = C and s_{t−1} ≥ 0),
p_t = p_{t−1} + p_{t−1} ℓ s_{t−1}   (a_{t−1} = C and s_{t−1} < 0),
p_t = p_{t−1} − p_{t−1} ℓ s_{t−1}   (a_{t−1} = D and s_{t−1} ≥ 0),
p_t = p_{t−1} − (1 − p_{t−1}) ℓ s_{t−1}   (a_{t−1} = D and s_{t−1} < 0),   (1)

where p_{t−1}, s_{t−1}, a_{t−1}, and r_{t−1} are the probability of cooperation, stimulus, action, and reward (i.e., payoff), respectively, in the (t − 1)th round. In Eq. (1), ℓ controls the learning rate and plays a role similar to that of β in Eq. (2) in the main text.
Note that the implementation error is not included in this model. We simulated the dynamics of BM players obeying the Macy-Flache rule in the repeated PDG on the square lattice. For three values of ℓ, the dependence of the probability of cooperation on f_C is shown in Figs C(a)-C(c). Similarly to the results in the main text (Fig 2), we observe CC and MCC patterns for ℓ = 0.2 (Fig C(b)) and ℓ = 1 (Fig C(c)). Owing to the absence of the implementation error, the probability of cooperation is close to zero for ℓ = 1 (Fig C(c)). The results of the linear fit to the relationship between the probability of cooperation and f_C are summarized in Figs C(d)-C(g) for various values of ℓ and A. The figures indicate that CC and MCC occur when A < 1 and ℓ is not small. These results are consistent with those for the BM model analyzed in the main text, including the range of A.
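For concreteness, the following minimal Python sketch implements one step of Eq. (1). The function name, the argument names, and the normalization of the stimulus by a constant s_max are our illustrative choices, not part of the original model specification.

    def macy_flache_update(p, action, payoff, A, ell, s_max):
        """One step of the Macy-Flache update rule, Eq. (1).

        p      -- probability of cooperation in the previous round (p_{t-1})
        action -- previous action a_{t-1}, 'C' or 'D'
        payoff -- previous payoff r_{t-1}
        A      -- aspiration level
        ell    -- learning rate, 0 < ell <= 1
        s_max  -- normalization constant keeping the stimulus in [-1, 1]
                  (an assumption; e.g., the largest possible |payoff - A|)
        """
        s = (payoff - A) / s_max  # stimulus s_{t-1}
        if action == 'C':
            # satisfaction reinforces C; dissatisfaction anti-reinforces it
            return p + (1 - p) * ell * s if s >= 0 else p + p * ell * s
        # satisfaction after D lowers p; dissatisfaction raises it
        return p - p * ell * s if s >= 0 else p - (1 - p) * ell * s

    # example: a cooperator paid below its aspiration becomes less likely to cooperate
    p_next = macy_flache_update(p=0.8, action='C', payoff=0.0, A=0.5, ell=0.2, s_max=1.0)

By construction, p remains in [0, 1] for any stimulus in [−1, 1], mirroring the self-limiting form of Eq. (1).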

Noisy GRIM strategy
The noisy GRIM strategy in the two-player PDG is defined as follows [6]. If both players cooperate, the focal player will cooperate with probability p_t = 1 − ϵ in the next round, where 0 < ϵ < 1/2 is the probability of action misimplementation. Otherwise, the focal player will cooperate with probability ϵ in the next round. This action rule can be rephrased in terms of the payoff to the focal player, r_t. If r_t = T, R, or P, the player is satisfied and sticks to the current action (i.e., C or D) with probability 1 − ϵ. If r_t = S, the player is dissatisfied and switches the action with probability 1 − ϵ. The noisy GRIM action rule generalizes to the multiplayer PDG as follows. For a given aspiration threshold A, where S < A < P, a player does not flip the action with probability 1 − ϵ if r_t > A and flips the action with probability 1 − ϵ if r_t < A. This action rule corresponds to β = ∞ in our BM model.
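As an illustration, a minimal Python sketch of this multiplayer action rule follows; the function name and the encoding of actions as the strings 'C' and 'D' are our own conventions.

    import random

    OTHER = {'C': 'D', 'D': 'C'}

    def noisy_grim_next_action(action, payoff, A, eps):
        """Next action under the noisy GRIM rule in the multiplayer PDG.

        action -- current action, 'C' or 'D'
        payoff -- current payoff r_t
        A      -- aspiration threshold
        eps    -- misimplementation probability, 0 < eps < 1/2
        """
        # keep the action when satisfied (r_t > A), switch when dissatisfied
        intended = action if payoff > A else OTHER[action]
        # the intended action is misimplemented with probability eps
        return intended if random.random() > eps else OTHER[intended]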
The probability of cooperation conditioned on a_{t−1} is shown in Fig D for two values of A. When a_{t−1} = D, the player cooperates with probability ϵ regardless of f_C, because the payoff to a defector is at least P, which exceeds A. When a_{t−1} = C, cooperation occurs with the larger probability, 1 − ϵ, when the number of cooperators in the neighborhood is at least two or at least one, depending on whether (R + 3S)/4 < A < P (Fig D(a)) or S < A < (R + 3S)/4 (Fig D(b)), respectively. Otherwise, cooperation occurs with probability ϵ. This binary nature of the conditional probability of cooperation does not agree with the MCC patterns observed in the behavioral experiments.

Directional learning model for the PGG
In directional learning in the PGG, the direction of the previous change in the amount of contribution is reinforced if the player is satisfied and anti-reinforced otherwise; the expected contribution of each player is updated accordingly. Except for this change, the directional learning model is the same as the BM model for the PGG.
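The sketch below illustrates one directional-learning update under the assumption that a satisfied player keeps shifting its contribution in the same direction as before, whereas a dissatisfied player reverses the direction. The fixed step size delta and the clipping of the contribution to [0, c_max] are hypothetical details, not taken from the model specification.

    def directional_update(contribution, direction, payoff, A, delta, c_max):
        """One directional-learning update of the expected contribution.

        contribution -- current expected contribution
        direction    -- +1 or -1, direction of the previous change
        payoff       -- payoff obtained in the previous round
        A            -- aspiration level
        delta        -- step size (hypothetical fixed value)
        c_max        -- maximum contribution allowed in the PGG
        """
        # keep the direction when satisfied, reverse it when dissatisfied
        new_direction = direction if payoff > A else -direction
        new_contribution = min(max(contribution + new_direction * delta, 0.0), c_max)
        return new_contribution, new_direction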
We simulated the repeated PGG in a group of four players adopting the directional learning rule. The average contribution is plotted against that of the other group members in the previous round in Figs F(a)-F(c) for three values of A. The figures do not indicate CC or MCC patterns. We did not observe CC or MCC patterns either when we searched a wider region of the β-A parameter space (Figs F(d)-F(g)).

Analysis of the Cimini-Sánchez model
In the Cimini-Sánchez model [25], the linear relationship between the probability of cooperation, p_t, and the fraction of neighbors that cooperated in the previous round, f_C, adaptively changes. We parameterize the linear relationship as p_t = α_{1,t} f_C + α_{2,t}. Variables α_{1,t} and α_{2,t} correspond to p_i^t and r_i^t (for the ith player) in Ref. [25]. Depending on the sign of the stimulus s_{t−1} and the action of the focal player in the previous two rounds, α_{1,t} and α_{2,t} are updated according to either

α_{1,t} = α_{1,t−1} + (1 − α_{1,t−1}) s_{t−1},   (4)
α_{2,t} = α_{2,t−1} + (1 − α_{2,t−1}) s_{t−1},   (5)

or

α_{1,t} = α_{1,t−1} − α_{1,t−1} s_{t−1},   (6)
α_{2,t} = α_{2,t−1} − α_{2,t−1} s_{t−1},   (7)

where 0 ≤ s_{t−1} ≤ 1. Equations (4) and (5) imply

α_{1,t} − α_{2,t} = (1 − s_{t−1}) (α_{1,t−1} − α_{2,t−1}),

which is also implied by Eqs. (6) and (7). Therefore, we obtain lim_{t→∞} (α_{1,t} − α_{2,t}) = 0 except for the pathological case in which the stimulus is vanishingly small such that ∏_{t′=1}^{∞} (1 − s_{t′}) > 0.

[Figure caption (panels (c) and (d)): Contribution, a_t, plotted against the average contribution from the other group members, f_C, conditioned on a_{t−1} ≥ X, on a_{t−1} < X, and on all rounds. We set β = 0.4, A = 0.9, and X = 0.4. The group has (c) one and (d) two free riders; therefore, the maximum value of f_C is equal to 2/3 and 1/3 in (c) and (d), respectively. We calculated the probability of cooperation and the mean contribution by excluding the free riders.]
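The contraction of α_{1,t} − α_{2,t} is easy to check numerically. The short Python sketch below iterates the two update pairs in random alternation over a hypothetical stimulus sequence and shows that the difference decays toward zero, so that p_t eventually loses its dependence on f_C.

    import random

    # numerically confirm that alpha_1 - alpha_2 shrinks by a factor (1 - s)
    # per round, whichever pair of update equations is applied
    alpha1, alpha2 = 0.9, 0.1  # hypothetical initial slope and intercept
    for t in range(200):
        s = random.uniform(0.0, 1.0)  # stimulus in [0, 1]
        if random.random() < 0.5:  # Eqs. (4) and (5): both parameters move toward 1
            alpha1 += (1 - alpha1) * s
            alpha2 += (1 - alpha2) * s
        else:  # Eqs. (6) and (7): both parameters move toward 0
            alpha1 -= alpha1 * s
            alpha2 -= alpha2 * s
    print(alpha1 - alpha2)  # approximately 0: p_t no longer depends on f_C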