Reinforcement learning account of network reciprocity

Evolutionary game theory predicts that cooperation in social dilemma games is promoted when agents are connected as a network. However, when networks are fixed over time, humans do not necessarily show enhanced mutual cooperation. Here we show that reinforcement learning (specifically, the so-called Bush-Mosteller model) approximately explains the experimentally observed network reciprocity and the lack thereof in a parameter region spanned by the benefit-to-cost ratio and the node’s degree. Thus, we significantly extend previously obtained numerical results.


Introduction
Human society is built upon cooperation among individuals. However, our society is full of social dilemmas, in which cooperative actions, which are costly to individuals, tend to be superseded by non-cooperative, selfish actions that exploit cooperative others [1][2][3]. Several mechanisms explain cooperative behavior in social dilemma situations [4][5][6]. Evolutionary game theory has provided firm evidence that static networks enhance cooperation as compared to well-mixed populations under broad conditions, an effect called spatial reciprocity (in the case of finite-dimensional networks) and network reciprocity (in the case of general networks) [4][7][8][9][10]. This finding is in line with the broadly made observation that humans as well as animals interact on contact networks in which a node is an individual [11][12][13].
However, a series of laboratory experiments with human participants playing the prisoner's dilemma game (PDG) has produced results that are not necessarily consistent with spatial and network reciprocity. In fact, the structure of the network (e.g., scale-free, random, and lattice) did not correlate with the propensity of human cooperation in the PDG [14][15][16][17][18][19][20]. In contrast, Rand et al. showed that humans do exhibit network reciprocity if the benefit-to-cost ratio, a main parameter of the PDG, is larger than the degree of the nodes in the network (i.e., the number of neighbors per player) [21], which is consistent with the prediction of evolutionary game theory [8]. Note that the earlier experimental studies used smaller benefit-to-cost ratios [14][15][16][17][18][19][20].
The theoretical results in Ref. [8] are derived from the probability of fixation of cooperation, i.e., the probability that unanimity of cooperation is reached before that of defection under weak selection (i.e., the difference in fitness between cooperators and defectors is assumed to be small). While theoretically elegant, fixation under weak selection may differ from the population dynamics taking place in laboratory experiments with human participants, such as those in Ref. [21]. (However, see [22] for conditions for cooperation derived in the case of infinite populations, and therefore no fixation, assuming replicator dynamics.) In laboratory experiments, unanimity of cooperators is rarely reached. The aim of the present study is to look for an alternative mechanism that explains behavioral results in the PDG on networks. We hypothesize that a type of reinforcement learning, implemented as a strategy of players, produces game dynamics consistent with the aforementioned experimental results on network reciprocity. In particular, aspiration-based reinforcement learning [23][24][25][26][27], with which players modulate their behavior based on the magnitude of the earned reward relative to a threshold, has been successful in explaining conditional cooperation and its variant called moody conditional cooperation [28,29]. Furthermore, aspiration-based reinforcement learning, not evolutionary game theory, yielded the absence of network reciprocity in numerical simulations [30]. This result is consistent with others showing that outcomes of aspiration-based learning and those of evolutionary dynamics are intrinsically different [31,32].
In the present paper, we vary the benefit-to-cost ratio and the node's degree, two key parameters in the discussion of network reciprocity in the literature, to show that aspiration-based reinforcement learning gives rise to network reciprocity under the conditions consistent with the previous experimental study [21]. In this way, we significantly extend the previous numerical results [30].

Prisoner's dilemma game on networks
Consider players placed on the nodes of a network. They repeatedly play the donation game, a special case of the PDG, over t_max rounds as follows. In each round, each player selects either to cooperate (C) or defect (D), and a donation game occurs on each edge in both directions. The submitted action (i.e., C or D) is consistently used toward all neighbors. On each edge, a cooperating player pays a cost c to benefit the other player by b. If a player defects (i.e., selects D), that player pays nothing, and the neighbor gains nothing from that player. We impose b > c > 0. For example, if both players on an edge cooperate, each gains b − c. Each player is assumed to have k neighbors. Therefore, a player submitting C pays kc in total and gains b multiplied by the number of cooperating neighbors. After the donation game has taken place bidirectionally on all edges, each player's payoff in the round is defined as the payoff that the focal player has gained, averaged over the k neighbors.
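The payoff rule above can be sketched in a few lines; this is an illustrative implementation of our own (the helper names are not from the paper), assuming the ring network used in the treatments below.

```python
# Illustrative sketch (our own helper names) of one round of the donation
# game on a ring in which each node has k neighbors, with k even.

def ring_neighbors(i, n, k):
    """Indices of the k nearest neighbors of node i on a ring of n nodes."""
    return [(i + d) % n for d in range(-k // 2, k // 2 + 1) if d != 0]

def round_payoffs(actions, b, c, k):
    """Per-player payoff in one round, averaged over the k neighbors.

    actions: list of 'C'/'D', one per node; b: benefit; c: cost (b > c > 0).
    A cooperator pays c on each of its k edges (kc in total), and every
    player receives b per cooperating neighbor.
    """
    n = len(actions)
    payoffs = []
    for i in range(n):
        gain = b * sum(1 for j in ring_neighbors(i, n, k) if actions[j] == 'C')
        cost = k * c if actions[i] == 'C' else 0.0
        payoffs.append((gain - cost) / k)
    return payoffs
```

For instance, with b = 2 and c = 1, a population of mutual cooperators earns b − c = 1 per player per round, while a population of defectors earns nothing.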

Static- and shuffled-network treatments
We compare the propensity of cooperation between static and dynamically shuffled networks, mimicking the situation of a laboratory experiment [21]. In both the static- and shuffled-network treatments, the network in each round is a ring network in which each node has k neighbors, where k is an even number (Fig 1). Each player is adjacent to k/2 players on each side of the ring. In the static-network treatment, the positions of the players are fixed throughout all the rounds. In the shuffled-network treatment, while the network structure is fixed over the rounds, we randomize the positions of all the players after each round.
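A minimal sketch of the shuffling step (the function name is ours): the ring itself never changes across rounds; in the shuffled-network treatment, only the assignment of players to nodes is re-randomized after each round.

```python
import random

def shuffle_positions(players, rng):
    """Re-randomize which player occupies which node of the fixed ring.

    In the static-network treatment this step is simply skipped, so every
    player keeps the same neighbors in every round.
    """
    perm = list(players)  # the ring topology itself is unchanged ...
    rng.shuffle(perm)     # ... only the player-to-node assignment varies
    return perm
```

Passing an explicit `random.Random` instance keeps simulated runs reproducible under a fixed seed.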

BM model
We consider players that obey the Bush-Mosteller (BM) model of reinforcement learning to update their actions over rounds [23][24][25][27]. We use the following variant of the BM model [29,33]. Each player has the intended probability of cooperation, p_t (t = 1, . . ., t_max), as its sole internal state. Probability p_t is updated in response to the payoff obtained in the previous round, denoted by r_{t-1}, and the previous action, denoted by a_{t-1}, as follows:

p_t =
\begin{cases}
p_{t-1} + (1 - p_{t-1}) s_{t-1} & (a_{t-1} = C,\ s_{t-1} \ge 0), \\
p_{t-1} + p_{t-1} s_{t-1} & (a_{t-1} = C,\ s_{t-1} < 0), \\
p_{t-1} - p_{t-1} s_{t-1} & (a_{t-1} = D,\ s_{t-1} \ge 0), \\
p_{t-1} - (1 - p_{t-1}) s_{t-1} & (a_{t-1} = D,\ s_{t-1} < 0).
\end{cases} \quad (1)

In Eq (1) the stimulus, denoted by s_{t-1} ∈ (−1, 1), is defined by

s_{t-1} = \tanh\left[\beta (r_{t-1} - A)\right], \quad (2)

where β > 0 and A are the sensitivity parameter and the aspiration level, respectively. The action selected in the previous round is reinforced if the realized payoff is larger than the aspiration level, i.e., r_{t-1} − A > 0. Conversely, if the payoff is smaller than the aspiration level, the previous action is suppressed. For example, when a player submitted C in the previous round and the obtained payoff was larger than the aspiration level, the stimulus is positive. Then, the probability of cooperation is increased in the next round [according to the first line in the RHS of Eq (1)]. Note that the updating scheme [Eq (1)] guarantees p_t ∈ (0, 1) if p_1 ∈ (0, 1). We set p_1 = 0.8, which roughly agrees with the observations made in previous laboratory experiments [14,17,21]. In each round, players are assumed to misimplement the action, i.e., to submit the action opposite to the intended one (D if the player intends C, and C if the player intends D), with probability ε [29][33][34][35]. Thus, the actual probability of cooperation is given by p̃_t = p_t(1 − ε) + (1 − p_t)ε. In this way, even defectors that are satisfied with their D action sometimes cooperate. This behavior is not produced by variation in β.
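The update rules above, together with the implementation error, can be sketched as follows; the function names are ours, and the parameter values in the comments (p_1 = 0.8, β = 0.2) follow the text.

```python
import math

def stimulus(payoff, A, beta):
    """Stimulus s = tanh(beta * (payoff - A)), which lies in (-1, 1).

    A is the aspiration level; beta > 0 is the sensitivity parameter.
    """
    return math.tanh(beta * (payoff - A))

def bm_update(p, action, payoff, A, beta):
    """One BM update of the intended cooperation probability p.

    A satisfying outcome (s >= 0) reinforces the previous action;
    a dissatisfying one (s < 0) suppresses it.
    """
    s = stimulus(payoff, A, beta)
    if action == 'C':
        return p + (1 - p) * s if s >= 0 else p + p * s
    else:  # action == 'D'
        return p - p * s if s >= 0 else p - (1 - p) * s

def realized_coop_prob(p, eps):
    """Implementation error: the intended action is flipped with prob. eps."""
    return p * (1 - eps) + (1 - p) * eps
```

Because |s| < 1, the update keeps p strictly inside (0, 1) whenever the initial value is, matching the guarantee noted in the text.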
To examine the robustness of the results shown in Fig 2, we carried out simulations over a region of the (A, ε) parameter space and four values of k. We did not vary β (= 0.2) because β did not considerably alter the behavior of the players unless it took extreme values [29]. With b/c = 2, the fraction of cooperative players averaged over the first 25 rounds is shown in Fig 3(a) and 3(b) for the static and shuffled networks, respectively. The difference between the two types of networks, shown in Fig 3(c), is small in the entire parameter region, in particular for large k, suggesting a marginal effect of network reciprocity. In contrast, when b/c = 6, the fraction of cooperators is larger in the static-network than in the shuffled-network treatment in a relatively large region of the (A, ε) parameter space [Fig 3(e), 3(f) and 3(g)]. As k increases, the difference between the two treatments decreases. In summary, a static as opposed to shuffled network promotes cooperation only when b/c is large and k is small. These results are consistent with the experimental findings [21].
Network reciprocity is attributed to assortative connectivity between cooperative players [7][8][9][10]. In other words, cooperation can thrive if a cooperator tends to find other cooperators at the neighboring nodes. To measure this effect, we defined the assortment by P(C|C; t) − P(C|D; t), where P(C|C; t) is the probability that a neighbor of a cooperative player is cooperative in round t, and P(C|D; t) is the probability that a neighbor of a defective player is cooperative in round t [21,36]. For various values of A and ε, the assortment values in the static-network treatment averaged over the first 25 rounds are shown in Fig 3(d) and 3(h) for b/c = 2 and b/c = 6, respectively. The figures indicate that the assortment tends to be positive when cooperation is more abundant in the static-network than in the shuffled-network treatment, regardless of the value of b/c, suggesting that cooperative players are clustered in these parameter regions. In the shuffled-network treatment, we confirmed that the assortment was ≈ 0 in the entire parameter region.
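For a single round on the ring, the assortment P(C|C; t) − P(C|D; t) can be computed roughly as follows. This is our own sketch of the definition above; it returns None when the measure is undefined, i.e., when the round contains no cooperator or no defector.

```python
def assortment(actions, k):
    """P(C|C) - P(C|D) for one round of actions on a ring with degree k."""
    n = len(actions)

    def neighbors(i):
        # k/2 nearest neighbors on each side of the ring
        return [(i + d) % n for d in range(-k // 2, k // 2 + 1) if d != 0]

    def prob_coop_neighbor(focal_action):
        focal = [i for i in range(n) if actions[i] == focal_action]
        if not focal:
            return None  # no player took this action in this round
        coop = sum(1 for i in focal for j in neighbors(i)
                   if actions[j] == 'C')
        return coop / (k * len(focal))

    pcc = prob_coop_neighbor('C')
    pcd = prob_coop_neighbor('D')
    return None if pcc is None or pcd is None else pcc - pcd
```

Positive values indicate that cooperators are clustered together; a well-shuffled population yields values near zero on average, consistent with the ≈ 0 assortment observed in the shuffled-network treatment.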

Conclusions
We have numerically shown that an aspiration-based reinforcement learning model, the BM model, produces network reciprocity if and only if the benefit-to-cost ratio in the donation game is large relative to the node's degree. The results are consistent with the previous experimental findings [14][15][16][17][18][19][20][21]. In addition to network reciprocity, the BM model also accounts for conditional cooperation, which is hard to explain by evolutionary game theory [28,29,37]. Aspiration-based reinforcement learning may thus be able to describe cooperative behavior of humans and animals in broader contexts. Finally, we remark that, although network reciprocity is not observed in the shuffled-network treatment, dynamic-linking treatments, which allow players to strategically sever and create links, promote cooperation in laboratory experiments [17][38][39][40][41]. Evolutionary game theory also predicts cooperation under dynamic linking [42][43][44][45][46][47][48]. Reinforcement learning may likewise account for enhanced cooperation under dynamic linking.