Learning and Innovative Elements of Strategy Adoption Rules Expand Cooperative Network Topologies

Cooperation plays a key role in the evolution of complex systems. However, in the widely used models of repeated games the level of cooperation varies extensively with the topology of the agent network. Here we show that cooperation remains rather stable when the reinforcement learning strategy adoption rule, Q-learning, is applied to a variety of random, regular, small-world, scale-free and modular network models in repeated, multi-agent Prisoner's Dilemma and Hawk-Dove games. Furthermore, we found that in these model systems other long-term learning strategy adoption rules also promote cooperation, while introducing a low level of noise (as a model of innovation) to the strategy adoption rules makes the level of cooperation less dependent on the actual network topology. Our results demonstrate that long-term learning and random elements in the strategy adoption rules, when acting together, extend the range of network topologies that enable the development of cooperation at a wider range of costs and temptations. These results suggest that a balanced duo of learning and innovation may help to preserve cooperation during the re-organization of real-world networks, and may play a prominent role in the evolution of self-organizing, complex systems.

Supplementary Results

Both Q-learning and the long-term versions of all three strategy adoption rules above outperformed the short-term variants, resulting in a higher proportion of cooperators in Hawk-Dove games on small-world and scale-free model networks, especially at high cooperation costs (Figures S1.2A, S1.2B and S1.3). Long-term strategy adoption rules (including Q-learning) were also more efficient inducers of cooperation, even at high costs, in modular networks (Figure S1.7). Moreover, long-term strategy adoption rules maintained cooperation even in randomly mixed populations, as well as in repeatedly re-randomized networks (Figure S1.5). Interestingly, long-term strategy adoption rules (especially the long-term version of the best-takes-over strategy adoption rule) resulted in an extended range of all-cooperator outcomes in Hawk-Dove games (Figures S1.3-S1.5 and S1.7). Finally, long-term strategy adoption rules helped cooperation in canonical and extended Prisoner's Dilemma games for all three strategy adoption rules tried (Figure S1.6).
While short- and long-term strategy adoption rules resulted in remarkable variation of the cooperation level across a large variety of random, regular, small-world, scale-free and modular networks in Hawk-Dove and both canonical and extended Prisoner's Dilemma games (Figures S1.1-S1.6), Q-learning induced a surprising stability of cooperation levels under all the above circumstances (Figures S1.2-S1.6). Interestingly, but expectedly, Q-learning also stabilized final cooperation levels when games were started from a different ratio of cooperators (ranging from 10% to 90%) than the usual 50% (data not shown). When we introduced innovation into long-term strategy adoption rules in Hawk-Dove games (for the description of these innovative strategy adoption rules see Methods), similarly to that shown for the canonical Prisoner's Dilemma game in Figure 2, cooperation levels in small-world and scale-free networks were closer to each other than those observed when using long-term, but not innovative, strategy adoption rules (Figure S1.7). Importantly, innovation alone, when applied to the best-takes-over short-term strategy adoption rule, could also stabilize cooperation levels in small-world and scale-free networks (Figure S1.7C).
When we compared different levels of innovation by changing the value of the innovation probability P in our simulations (Figure S1.8), an intermediate level of innovation proved optimal for the stabilization of cooperation in small-world and scale-free networks. Scale-free networks and the Prisoner's Dilemma game were more sensitive to higher innovation levels than small-world networks or Hawk-Dove games, respectively (Figure S1.8). Summarizing our results, Figures S1.9 and S1.10 show that, similarly to canonical Prisoner's Dilemma games (Figure 3), both in Hawk-Dove games (Figure S1.9) and in extended Prisoner's Dilemma games (Figure S1.10) long-term strategy adoption rules and innovation (including Q-learning) resulted in stable, non-zero cooperation across a large variety of network topologies only in combination. Figure S1.11 shows the distribution of hawks (blue dots) and doves (orange dots) in the last round of a repeated Q-learning game on small-world (Figures S1.11A and S1.11B) or scale-free networks (Figures S1.11C and S1.11D) at low (Figures S1.11A and S1.11C) and high (Figures S1.11B and S1.11D) relative gain/cost (G) values. Under these conditions both hawks and doves remained isolated (see arrows). In contrast, when Hawk-Dove games were played with any of the three short-term, non-innovative strategy adoption rules, not only doves but even hawks showed a tendency to form networks (Figure S1.12 and data not shown). This effect was especially pronounced for doves in both small-world and scale-free networks, as well as for hawks in small-world networks; it was present, but not always as strong, for hawks in scale-free networks, where hawks remained more isolated in all configurations. Interestingly, the proportional updating strategy adoption rule quite often showed an extreme behavior, in which all agents were either doves or hawks in the last round of the play. This behavior was less pronounced with a larger number (2,500) of players. All the above findings were similarly observed in extended Prisoner's Dilemma games (data not shown).
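To make the Q-learning strategy adoption rule discussed above concrete, the following minimal sketch in Python shows a state-free Q-learning agent for the Hawk-Dove game. The payoff parameterization (benefit b and fight cost c, rather than the paper's relative gain/cost G), the learning rate and the cooling factor are our illustrative assumptions, not the parameters given in the paper's Methods.

```python
import math
import random

# Standard Hawk-Dove payoff to the focal player with benefit b and
# fight cost c; the paper parameterizes this by a relative gain/cost
# value G, so this mapping is only an illustrative assumption.
def hd_payoff(my, other, b=1.0, c=2.0):
    if my == "H" and other == "H":
        return (b - c) / 2.0   # escalated fight, cost shared
    if my == "H" and other == "D":
        return b               # hawk takes everything
    if my == "D" and other == "H":
        return 0.0             # dove retreats
    return b / 2.0             # doves share the resource

class QAgent:
    """State-free Q-learning over the two pure strategies."""
    def __init__(self, alpha=0.1, temperature=1.0):
        self.q = {"H": 0.0, "D": 0.0}
        self.alpha = alpha      # learning rate (assumed value)
        self.t = temperature    # Boltzmann exploration temperature

    def choose(self):
        # Boltzmann exploration: strategies with higher Q-values are
        # chosen with exponentially higher probability.
        weights = [math.exp(self.q[s] / max(self.t, 1e-9)) for s in ("H", "D")]
        return random.choices(["H", "D"], weights=weights)[0]

    def update(self, strategy, payoff):
        # Nudge the played strategy's Q-value toward the payoff received.
        self.q[strategy] += self.alpha * (payoff - self.q[strategy])
        self.t *= 0.999         # slowly anneal exploration toward exploitation

# One illustrative round between two Q-learning agents:
a, b = QAgent(), QAgent()
sa, sb = a.choose(), b.choose()
a.update(sa, hd_payoff(sa, sb))
b.update(sb, hd_payoff(sb, sa))
```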

Supplementary Discussion
Explaining cooperation has been a perennial challenge across a wide range of scientific disciplines. The major finding of our work is that learning and innovation extend the network topologies enabling cooperative behavior in the Hawk-Dove game (Figures S1.1-S1.5, S1.7-S1.9, S1.11 and S1.12) and even in the more stringent Prisoner's Dilemma games (Figures 1-3, S1.6, S1.8 and S1.10). The meaning of 'learning' is extended here from the restricted sense of imitation or learning from a teacher. Learning is used in this paper to denote all types of information collection and processing that influence game strategy and behavior. Therefore, learning here includes communication, negotiation, memory and various reputation-building mechanisms. Learning makes life easier, since instead of bearing the cognitive burden of foreseeing and predicting the 'shadow of the future' [4-6], learning allows agents to count on the 'shadow of the past': the experiences and information obtained about themselves and/or other agents [12]. As with learning, the meaning of 'innovation' is extended here from the restricted sense of innovation by conscious, intelligent agents. Innovation is used in this paper to denote all irregularities in the strategy adoption process of the game. Therefore, innovation here includes errors, mutations, mistakes, noise, randomness and increased temperature, besides conscious changes in game strategy adoption rules.
In this Supplementary Discussion, we first summarize the effects of network topology on cooperative behavior, then discuss previous knowledge on how learning and innovation help cooperation, and finally compare our findings with existing data in the literature and show their novelty and implications.
Effect of network topology on cooperation. Cooperation is not an evolutionarily stable strategy [13], since in the well-mixed case, and even in simple spatial arrangements, it is outcompeted by defection. As the data summarized in Table S1.1 make clear, the emergence of cooperation requires an extensive spatial segregation of players, which helps cooperative communities to develop, survive and propagate. Cooperation in repeated multi-agent games is very sensitive to network topology. Cooperation becomes hindered if the network gets over-connected [14-16]. In contrast, high clustering [17,18] and the development of fully connected cliques (especially overlapping triangles) and rather isolated communities [14,18] usually help cooperation. The heterogeneity of small-world networks and, especially, of networks with a scale-free degree distribution can establish cooperation even in cases when the costs of cooperation become exceedingly high.
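The topology families compared above can be generated with standard library models. The sketch below (assuming the networkx package is available) approximates the paper's modified Watts-Strogatz and Barabási-Albert constructions, and reproduces the shortest-path and clustering sweeps described in the figure legends; the degree parameter k = 4 is an illustrative assumption.

```python
import networkx as nx

# Four topology families compared in the text, built with standard
# generators; the paper used modified variants, so treat these as
# approximations for experimentation rather than exact replicas.
n = 225  # number of agents, as in Figures S1.11-S1.12

regular     = nx.connected_watts_strogatz_graph(n, k=4, p=0.0)   # rewiring p = 0
small_world = nx.connected_watts_strogatz_graph(n, k=4, p=0.05)  # sparse rewiring
random_net  = nx.connected_watts_strogatz_graph(n, k=4, p=1.0)   # fully rewired
scale_free  = nx.barabasi_albert_graph(n, m=3)                   # m edges per new node

for name, g in [("regular", regular), ("small-world", small_world),
                ("random", random_net), ("scale-free", scale_free)]:
    print(name,
          "mean shortest path:", round(nx.average_shortest_path_length(g), 2),
          "clustering:", round(nx.average_clustering(g), 3))
```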
However, in most spatial arrangements cooperation is rather sensitive to the strategy adoption rules of the agents, especially those of agents which are hubs or otherwise occupy an influential position in the network. Moreover, minor changes in the average degree, actual degree, shortest paths, clustering coefficients or assortativity of the network topology may induce a profound change in the cooperation level. Since real-world networks may undergo rather abrupt changes in their topologies [17,20-26], it is highly important to maintain cooperation during network evolution.
Effect of learning on cooperation. From the data of Table S1.2 it is clear that learning generally helps cooperation. Cooperation can already be helped by repeated play, which assumes 'learning' even among spatially disorganized players. Memory-less or low-memory strategy adoption rules do not promote cooperation efficiently. In contrast, high-memory rules and complex negotiation and reputation-building mechanisms (requiring the learning, conceptualization and memorization of a whole database of past behaviors, rules and motives) can enormously enhance cooperation, making it almost inevitable. In summary, in the competitive world of games, it pays to learn in order to achieve cooperation. However, it is not helpful to know too much: if the ranges of learning and of the actual games differ too much, cooperation becomes impossible [18].
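As a concrete illustration of the low-memory rules surveyed in Table S1.2 (see its footnotes below), the following sketch implements tit-for-tat and Pavlov as described there. The 'C'/'D' move encoding and the conventional first-move choice of cooperation are our assumptions.

```python
# Two classic low-memory rules mentioned in the text: tit-for-tat
# copies the opponent's previous move, while Pavlov ('win stay, lose
# shift') repeats its move after a good payoff and switches after a
# bad one. 'C' = cooperate, 'D' = defect; None = no previous round.

def tit_for_tat(my_last, opp_last):
    # Cooperate first, then mirror the opponent's previous move.
    return "C" if opp_last is None else opp_last

def pavlov(my_last, opp_last):
    # Cooperate first; afterwards 'win' means the opponent cooperated
    # (payoff R or T under standard Prisoner's Dilemma payoffs), so
    # stay with the last move; otherwise shift to the other move.
    if my_last is None:
        return "C"
    won = (opp_last == "C")
    return my_last if won else ("C" if my_last == "D" else "D")
```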
Learning requires a well-developed memory and complex signaling mechanisms, which are costly. This helps the selection process in evolution [13], since only 'high-quality' individuals can afford the luxury of both the extensive memory and the costly signaling [27]. However, cooperation is rather widespread among bacteria, where even the 'top-quality' individuals do not have the extensive memory mentioned above. Here 'learning' is achieved by the fast succession of multiple generations. The Baldwin effect, describing the genetic (or epigenetic) fixation of behavioral traits that were beneficial for individuals, may significantly promote the development of bacterial cooperation and the establishment of biofilms [28-32]. Genetically 'imprinted' aids to cooperation are also typical in higher organisms, including humans. The emotional reward of cooperation, uncovered by a specific activation of the amygdala region of our brains [33], may be one of the genetically stabilized mechanisms that help the extraordinary level of human cooperation, besides the complex cognitive functions, language and other determinants of human behavior.

Effect of randomness ('innovation') on cooperation. From the data of Table S1.3 it is clear that a moderate amount of randomness ('innovation') generally helps cooperation. Many of the learning mechanisms above imply sudden changes, i.e. innovations. Bacteria need a whole set of mutations for interspecies communication (such as quorum sensing), which adapt individual organisms to the needs of cooperation in biofilms or symbiotic associations. Improved innovation in the behavior of primates and humans during games has been well documented [34-36].
An appropriate level of innovation rescues the spatial assembly of players from deadlocks and accelerates the development of cooperation [18]. Noise often acts in a stochastic resonance-like fashion, enabling cooperation even in cases when it could not develop in a zero-noise situation [37,38]. As a special example, the development of cooperation between members of a spatial array of oscillators (called synchrony) is greatly aided by noise [39]. Egalitarian motives also introduce innovative elements into strategy selection, helping the development of cooperation [40].
However, innovation serves the development of cooperation best if it remains a rare, luxury event of development. Continuous 'innovations' make the system so noisy that it loses all the benefits of learning and spatial organization, and reaches the mean-field limit of randomly selected agents with random strategy adoption rules (Table S1.3).
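A minimal sketch of how such an innovative element can be grafted onto a deterministic rule follows, using the best-takes-over rule as the core. The wrapper form and the illustrative innovation probability are our assumptions; the paper's exact innovative rules are defined in its Methods.

```python
import random

def best_takes_over(agent, neighbors, payoffs, strategies):
    # Deterministic core rule: adopt the strategy of the neighbor
    # (or of the agent itself) with the highest payoff last round.
    candidates = [agent] + list(neighbors)
    best = max(candidates, key=lambda a: payoffs[a])
    return strategies[best]

def innovative(rule, p_innovation=0.01):
    # Wrap any strategy adoption rule so that with a small probability
    # P the agent 'innovates', i.e. adopts a random strategy instead.
    # P = 0.01 is an illustrative value; the text shows an intermediate
    # level of innovation is optimal (Figure S1.8).
    def wrapped(agent, neighbors, payoffs, strategies):
        if random.random() < p_innovation:
            return random.choice(["C", "D"])
        return rule(agent, neighbors, payoffs, strategies)
    return wrapped

# Usage: innovative_bto = innovative(best_takes_over, p_innovation=0.01)
```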
Comparison and novelty of our findings. In Hawk-Dove games on modified Watts-Strogatz-type small-world [2,9] and Barabási-Albert-type [10] scale-free model networks, we obtained cooperation levels very similar to those of Tomassini et al. [2,3] with all synchronously updated pair-wise comparison dynamics, proportional updating and best-takes-over strategy adoption rules. The success of our various 'long-term' strategy adoption rules in promoting cooperation agrees with the success of pair-wise comparison dynamics and best-takes-over strategy adoption rules with accumulated payoffs on scale-free networks [1,3].
In contrast to Hawk-Dove games, in the Prisoner's Dilemma game defection always has a fitness advantage over cooperation, which makes the achievement of substantial cooperation levels even more difficult. In extended Prisoner's Dilemma games on scale-free networks [10], we obtained cooperation levels very similar to those of Tomassini et al. [3] using synchronously updated pair-wise comparison dynamics and best-takes-over strategy adoption rules. As in the Hawk-Dove game, our extended Prisoner's Dilemma results with various 'long-term' strategy adoption rules on scale-free networks agree with those of pair-wise comparison dynamics and best-takes-over strategy adoption rules using accumulated payoffs [1,3].
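For reference, the defection advantage mentioned above follows from the standard textbook payoff ordering of the Prisoner's Dilemma; the paper's exact payoff values are specified in its Methods, so the conditions below are only the generic form:

$$T > R > P > S, \qquad 2R > T + S,$$

where $T$ is the temptation to defect against a cooperator, $R$ the reward for mutual cooperation, $P$ the punishment for mutual defection, and $S$ the sucker's payoff. The first chain makes defection the dominant strategy in a single round, while the second keeps alternating exploitation less profitable than sustained mutual cooperation.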
We note that the definition of the pair-wise comparison dynamics strategy adoption rule was slightly different here than in previous papers; moreover, in contrast to the non-averaged payoffs used previously [1-3], we used average payoffs, which allows only a rough comparison with results obtained before, and resulted in a lower level of cooperation than that of e.g. ref. [1]. The reason we used average payoffs is that this made the final level of cooperators more stable on scale-free networks even after the first 5,000 rounds of the play (data not shown). When we used non-averaged payoffs in the extended Prisoner's Dilemma game with 100,000 rounds of play, we regained the cooperation levels of ref. [1] on scale-free networks (m = 4, data not shown). Additional papers on the subject used differently designed small-world networks or different strategy adoption rules, and therefore cannot be directly compared with the current data. It is worth mentioning that none of the previous papers describing multi-agent games on various networks [1-3] used the canonical Prisoner's Dilemma game, which we used to obtain the data in the main text, and which provides the most stringent condition for the development of cooperation.
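The averaged-versus-accumulated payoff distinction discussed above fits in two lines of code; the data-structure choices here (a graph as an adjacency mapping, per-round payoffs keyed by ordered node pairs) are illustrative assumptions, not the paper's implementation.

```python
# Accumulated payoff sums a node's payoffs over all its games in a
# round, which systematically favors hubs on heterogeneous networks;
# the averaged payoff divides by the node's degree and removes that
# degree bias, at the price of weaker hub influence.
def accumulated_payoff(node, graph, round_payoffs):
    return sum(round_payoffs[(node, nb)] for nb in graph[node])

def averaged_payoff(node, graph, round_payoffs):
    degree = max(len(graph[node]), 1)
    return accumulated_payoff(node, graph, round_payoffs) / degree
```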
In summary, our work significantly extends earlier findings and shows that introducing learning and innovation into game strategy adoption rules helps the development of cooperation among agents situated in a large variety of network topologies. Moreover, we showed that learning and innovation help cooperation separately, but act synergistically if introduced together, especially in the complex form of reinforcement learning, Q-learning.

Interactions of learning and innovation; conclusions. The real complexity and excitement of games requires both learning and innovation. In Daytona-type car races, skilled drivers use a number of reputation-building and negotiation mechanisms and, by continuously bringing novel innovations to their strategies, skilfully navigate between at least four types of games [41].
Noise is usually regarded as disturbing the development of cooperation. Importantly, however, complex learning strategies can actually utilize noise to drive them to a higher level of cooperation. Noise may act as in the well-known cases of stochastic resonance or stochastic focusing (with extrinsic and intrinsic noise, respectively), enabling cooperation even in cases when it could not develop without noise. In a similar fashion, mistakes increase the efficacy of learning [37,38,42]. Additional noise also greatly helps optimization in the simulated annealing process [43-45].
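Since simulated annealing is invoked here (and an initial annealing temperature of 10,000 is mentioned for Q-learning in the figure legends below), a generic sketch may help. The energy and neighbor functions, cooling factor and step count are placeholders, not the paper's settings.

```python
import math
import random

def anneal(energy, neighbor, x0, t0=10_000.0, cooling=0.999, steps=5_000):
    """Generic simulated annealing: noise (temperature) lets the search
    escape local minima, mirroring how noise can rescue cooperation
    from deadlocks. t0 = 10,000 echoes the initial annealing
    temperature mentioned for Q-learning; the cooling schedule is an
    illustrative assumption."""
    x, t = x0, t0
    for _ in range(steps):
        y = neighbor(x)
        delta = energy(y) - energy(x)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if delta <= 0 or random.random() < math.exp(-delta / t):
            x = y
        t *= cooling
    return x
```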
Noise not only can extend the range of cooperation to regions where the current level of learning would not be sufficient to achieve it, but extra learning can also 'buffer' an increased level of noise [19]. Thus, learning and innovation act side by side and, in gross terms, correct each other's deficiencies. Learning and innovation also cooperate in the Baldwin effect, where beneficial innovations (in the form of mutations) are selected by the inter-generational 'meta-learning' process of evolution [28-32]. Mutual learning not only makes innovation tolerable, but also provokes a higher level of innovation to surpass the other agent [36].
Our work adds to this emerging picture the important point that the cooperation between learning and innovation also works in extending and buffering the range of network configurations in which cooperation becomes possible.

Footnotes to Table S1.2. a The term 'learning' is used here in the sense of the collection and use of information influencing game strategy adoption rules and behavior, and not in the restricted sense of imitation or of directed information flow from a dominant source (the teacher). Therefore, learning here includes communication, negotiation, memory, label-assignment and label-recognition, etc. b HD = Hawk-Dove (Snowdrift, Chicken) game; PD = Prisoner's Dilemma game (please note that in this supplementary table we did not discriminate between conventional and cellular automata-type games, where in the latter, simulating evolution, agents 'die' and are occasionally replaced; in our simulations we used only 'conventional' games, where agent replacement was not allowed). c Tit-for-tat = this strategy adoption rule copies the opponent's step in the previous round; Pavlov = a 'win stay, lose shift' strategy adoption rule; generous strategy adoption rules = allow 'extra' cooperation options with a given probability. d These strategy adoption rules are interchangeably called 'memory-one' or 'memory-two' strategy adoption rules, referring to the fact that e.g. in the Pavlov strategy adoption rule agents remember the outcome of only the last step ('memory-one') but that of both players ('memory-two').

Footnotes to Table S1.3. a The term 'innovation' is used here in the sense of irregularities in the process of the game. Therefore, innovation here includes errors, mutations, mistakes, noise, randomness and increased temperature, besides the sensu stricto innovation of conscious, intelligent agents. b HD = Hawk-Dove (Snowdrift, Chicken) game; PD = Prisoner's Dilemma game (see the corresponding footnote to Table S1.2).

[Fragment of a figure legend; its beginning was lost in extraction, with only the payoff symbols S, P, R surviving:] ...the temptation T was changed from 3 to 6 [6]. In the canonical Prisoner's Dilemma games, when using Q-learning, the initial annealing temperature was set to 10,000 to extend the annealing process [115]. For each game, strategy adoption rule and T value, 100 random runs of 5,000 time steps were executed.

Figure S1.9. Long-term learning and innovative strategy adoption rules extend cooperative network topologies in the Hawk-Dove game. The top middle panel shows the level of cooperation at different network topologies. Small-world (spheres) and scale-free (cones) networks were built as described in the Methods. The rewiring probability p of small-world networks was increased from 0 to 1 in 0.05 increments, the number of edges linking each new node to former nodes in scale-free networks was varied from 1 to 7, and the means of shortest path lengths and clustering coefficients were calculated for each network. Cubes and cylinders denote the regular (p = 0) and random (p = 1.0) extremes of the small-world networks, respectively. For the description of the games and of the best-takes-over (green symbols), long-term learning best-takes-over (blue symbols), long-term learning innovative best-takes-over (magenta symbols) and Q-learning (red symbols) strategy adoption rules used, see Methods.

The left and right panels show the 2D side views of the 3D top middle panel using the same symbol set. For each network, 100 random runs of 5,000 time steps were executed at a fixed G value of 0.8. The bottom middle panel shows a color-coded illustration of the various network topologies used in the top middle panel. Here the same simulations are shown as in the top middle panel, with a different color code emphasizing the different network topologies. The various networks are represented by the following colors: regular networks, blue; small-world networks, green; scale-free networks, yellow; random networks, red (from the angle of the figure the random networks are behind some of the small-world networks and are therefore highlighted with a red arrow to make their identification easier).
Figure S1.10. Long-term learning and innovative strategy adoption rules extend cooperative network topologies in the extended Prisoner's Dilemma game. The top middle panel shows the level of cooperation at different network topologies. Small-world (spheres) and scale-free (cones) networks were built as described in the Methods. The rewiring probability p of small-world networks was increased from 0 to 1 in 0.05 increments, the number of edges linking each new node to former nodes in scale-free networks was varied from 1 to 7, and the means of shortest path lengths and clustering coefficients were calculated for each network. Cubes and cylinders denote the regular (p = 0) and random (p = 1.0) extremes of the small-world networks, respectively. For the description of the games and of the best-takes-over (green symbols), long-term learning best-takes-over (blue symbols), long-term learning innovative best-takes-over (magenta symbols) and Q-learning (red symbols) strategy adoption rules used, see Methods. The left and right panels show the 2D side views of the 3D top middle panel using the same symbol set. For each network, 100 random runs of 5,000 time steps were executed at a fixed T value of 1.8. The bottom middle panel shows a color-coded illustration of the various network topologies used in the top middle panel. Here the same simulations are shown as in the top middle panel, with a different color code emphasizing the different network topologies. The various networks are represented by the following colors: regular networks, blue; small-world networks, green; scale-free networks, yellow; random networks, red (from the angle of the figure the random networks are behind some of the small-world networks and are therefore highlighted with a red arrow to make their identification easier).
Figure S1.11. Both hawks and doves become isolated when in extreme minority, if they use the innovative Q-learning strategy adoption rule in Hawk-Dove games on small-world and scale-free networks. The small-world [2] and scale-free networks [10] were built, and Hawk-Dove games were played, as described in the Methods, using 225 agents. Networks showing the last round of 5,000 plays were visualized using the Kamada-Kawai algorithm of the Pajek program [116]. Blue and orange dots correspond to hawks and doves, respectively. Green, orange and grey lines denote hawk-hawk, dove-dove and dove-hawk contacts, respectively. Arrows point to lonely hawks or doves using the respective colors above. A, Small-world network with a rewiring probability of 0.05, G = 0.15. B, Small-world network with a rewiring probability of 0.05, G = 0.95. C, Scale-free network with m = 3, G = 0.1. D, Scale-free network with m = 3, G = 0.98. We obtained similar data when playing extended Prisoner's Dilemma games (data not shown).

Figure S1.12. Hawks, and especially doves, are not extremely isolated when in extreme minority, if they use the non-innovative best-takes-over strategy adoption rule in Hawk-Dove games on small-world and scale-free networks. The small-world [2] and scale-free networks [10] were built, and Hawk-Dove games were played, as described in the Methods, using 225 agents. Networks showing the last round of 5,000 plays were visualized using the Kamada-Kawai algorithm of the Pajek program [116]. Blue and orange dots correspond to hawks and doves, respectively. Green, orange and grey lines denote hawk-hawk, dove-dove and dove-hawk contacts, respectively. Arrows point to lonely hawks or doves using the respective colors above. A, Small-world network with a rewiring probability of 0.05, G = 0.15. B, Small-world network with a rewiring probability of 0.05, G = 0.75. C, Scale-free network with m = 3, G = 0.1. D, Scale-free network with m = 3, G = 0.8. We obtained similar data using other non-innovative strategy adoption rules, such as pair-wise comparison dynamics or proportional updating, as well as when playing extended Prisoner's Dilemma games (data not shown).