Learning, exploitation and bias in games

We focus on learning during development in a group of individuals that play a competitive game with each other. The game has two actions and there is negative frequency dependence. We define the distribution of actions by group members to be an equilibrium configuration if no individual can improve its payoff by unilaterally changing its action. We show that at this equilibrium, one action is preferred in the sense that those taking the preferred action have a higher payoff than those taking the other, more prosocial, action. We explore the consequences of a simple ‘unbiased’ reinforcement learning rule during development, showing that groups reach an approximate equilibrium distribution, so that some achieve a higher payoff than others. Because there is learning, an individual’s behaviour can influence the future behaviour of others. We show that, as a consequence, there is the potential for an individual to exploit others by influencing them to be the ones to take the non-preferred action. Using an evolutionary simulation, we show that population members can avoid being exploited by over-valuing rewards obtained from the preferred option during learning, an example of a bias that is ‘rational’.

so that an individual taking action $u_1$ gets a strictly lower payoff by switching to action $u_2$. It can also be seen that $k^*$ is the unique value of $k$ for which it is detrimental for any group member to switch action. At the equilibrium configuration, the payoff to an individual choosing action $u_1$ is $W_1(k^*)$. Similarly, the payoff to an individual choosing action $u_2$ is $W_2(k^* - 1)$. We have $W_1(k^*) > W_2(k^*) \ge W_2(k^* - 1)$ (S4), where the second inequality follows from our assumption that action $u_2$ is beneficial. Thus those taking action $u_1$ do strictly better than those taking action $u_2$. We will refer to action $u_1$ as the preferred action.
In the non-generic case there is an integer $\bar{k}$ such that $D(\bar{k}) = 0$. There are then two possible definitions of $k^*$, namely $k^* = \bar{k}$ and $k^* = \bar{k} + 1$.
First consider a group in which $k^* = \bar{k}$ individuals take action $u_2$. Then an individual taking action $u_1$ gets the same payoff by switching to action $u_2$, while an individual taking action $u_2$ gets a strictly lower payoff by switching to $u_1$. In this group $W_1(k^*) \ge W_2(k^* - 1)$, and this inequality is strict if the payoff $W_2(k)$ is strictly increasing with $k$. Now consider a group in which $k^* = \bar{k} + 1$ individuals take action $u_2$. Then an individual taking action $u_1$ gets a strictly lower payoff by switching to action $u_2$, while an individual taking action $u_2$ gets the same payoff by switching to $u_1$. In this group $W_1(k^*) > W_2(k^*) \ge W_2(k^* - 1)$. Thus in the non-generic case the equilibrium configuration is not unique, but at either configuration those taking action $u_1$ do strictly better than those taking action $u_2$, provided $W_2(k)$ is strictly increasing with $k$ (as is true in the Hawk-Dove game and the Producer-Scrounger game). The analyses and computations below assume the value $k^* = \bar{k} + 1$ when payoffs are non-generic.
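The equilibrium condition can be checked numerically. The following is a minimal Python sketch, not the authors' code. It assumes a user-supplied payoff difference `D(k)`, equal to $W_1(k) - W_2(k)$, that is strictly increasing in $k$ with $D(0) < 0 < D(G-1)$, where $k$ counts the other group members taking action $u_2$. The function names are hypothetical.

```python
from typing import Callable


def equilibrium_k(D: Callable[[int], float], G: int) -> int:
    """Return the smallest k in 0..G-1 with D(k) > 0.

    When D is strictly increasing this is k* in the generic case, and it is
    k_bar + 1 in the non-generic case where D(k_bar) = 0.
    """
    for k in range(G):
        if D(k) > 0:
            return k
    raise ValueError("D(k) <= 0 for every k, so the end condition fails")


def is_equilibrium(D: Callable[[int], float], k_star: int) -> bool:
    """Check that no one gains by switching when k_star members take u2.

    A u1-taker sees k_star others on u2 and would get W2(k_star) after
    switching, so it does not gain if D(k_star) >= 0.
    A u2-taker sees k_star - 1 others on u2 and would get W1(k_star - 1)
    after switching, so it does not gain if D(k_star - 1) <= 0.
    """
    return D(k_star) >= 0 and D(k_star - 1) <= 0
```

With a non-generic difference such as `D = lambda k: k - 3`, both `k_star = 3` and `k_star = 4` pass `is_equilibrium`, and `equilibrium_k` returns 4, matching the convention $k^* = \bar{k} + 1$.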
The Hawk-Dove game. We consider the standard Hawk-Dove game with value of reward $V$ and cost of losing a fight $C$. We assume that $V < C$. The actions are $u_1$ = Hawk and $u_2$ = Dove. Payoffs are [equation missing in source]. From these payoffs we have [equation missing in source]. Thus conditions A1 and A2 of the main text are satisfied, since $W_1(k) - W_2(k)$ and $W_2(k)$ are strictly increasing functions of $k$. Let $k^*$ be the minimum value of $k$ such that $W_1(k) > W_2(k)$, so that [equation missing in source]. The advantage of a Hawk over a Dove at the equilibrium configuration is [equation missing in source].
The resource exploitation game. Each group member can either obtain a resource in a communal place (the social foraging action $u_1$) or in its own territory (the solitary foraging action $u_2$). All those choosing the communal place share a resource of value $V$ equally. An individual that chooses its own territory gains unit resource. Payoffs are [equation missing in source]. We assume that $1 < V < G$. Under this condition, assumptions A1 and A2 both hold. We have $k^* = 1 + \lfloor G - V \rfloor$ (S13), where $\lfloor x \rfloor$ denotes the integer part of $x$. The advantage of a social over a solitary forager at the equilibrium configuration is [equation missing in source] (S15).
The Producer-Scrounger game. In this game $u_1$ = Scrounge and $u_2$ = Produce. Payoffs are [equation missing in source] (S17). As can be seen, both of these payoffs are strictly increasing functions of $k$. Thus producing is beneficial.
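For the resource exploitation game, equation (S13) can be checked against a direct search. The sketch below is illustrative only. It assumes, from the verbal description of the game, that a social forager with $k$ solitary foragers among the other group members shares $V$ among $G - k$ individuals, so that $W_1(k) = V/(G-k)$ and $W_2(k) = 1$. These explicit forms and all function names are assumptions, not taken from the original equations.

```python
import math


def w1_social(k: int, G: int, V: float) -> float:
    """Assumed payoff to a social forager when k others forage solitarily."""
    return V / (G - k)


def w2_solitary(k: int, G: int, V: float) -> float:
    """Payoff to a solitary forager: one unit of resource."""
    return 1.0


def k_star_closed_form(G: int, V: float) -> int:
    """Equation (S13): k* = 1 + integer part of (G - V)."""
    return 1 + math.floor(G - V)


def k_star_search(G: int, V: float) -> int:
    """Smallest k with W1(k) > W2(k), found by direct search."""
    return next(k for k in range(G) if w1_social(k, G, V) > w2_solitary(k, G, V))


def advantage(G: int, V: float) -> float:
    """W1(k*) - W2(k* - 1) under the assumed payoffs."""
    k = k_star_closed_form(G, V)
    return w1_social(k, G, V) - w2_solitary(k - 1, G, V)
```

With $V = 0.5G$ and $G = 10$, both routes give $k^* = 6$, so four individuals forage socially and each obtains $5/4$, an advantage of $0.25$ over a solitary forager.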
In order to check the negative frequency dependence assumption we set $D(k) = \frac{1}{A}\,(W_1(k) - W_2(k))$, so that [equation missing in source]. Then for $0 \le k \le G - 2$ we have [equation missing in source]. The term in the first square bracket is non-negative and is positive for $k \ge 1$. The term in the second square bracket is non-negative and is positive for $k = 0$. It follows that $D(k)$ is a strictly increasing function of $k$.
To check the end conditions we first note that $W_1(0) < W_2(0)$. We also have [equation missing in source]. Computations are based on assuming that $a = 2$ and $A = 3$. For these values we require [equation missing in source]. Let the real-valued function $d$ be given by $d(x) = J_2 x^2 - J_1 x + J_0$. This quadratic function of $x$ satisfies $d(0) > 0$. It also satisfies $d(G - 1) < 0$ by condition (S22). Let $x^*$ be the smaller of the two real roots of the equation $d(x) = 0$. Then $0 < x^* < G - 1$, and we have $d(x) > 0$ for $0 < x < x^*$ and $d(x) < 0$ for $x^* < x \le G - 1$. It follows that [equation missing in source]. The usual quadratic formula gives [equation missing in source]. The advantage of a Scrounger over a Producer at the equilibrium configuration is [equation missing in source].

Appendix B. Learning
Time structure
The Hawk-Dove game. Group members are assigned a random ordering, $1, 2, \ldots, G$. In a cycle of rounds, group member 1 chooses a randomly selected opponent from the other $G - 1$ group members and plays the Hawk-Dove game against this opponent, then group member 2 does the same, and so on until all group members have done so. Thus in a cycle there are $2G$ rounds of the game. Updating of the subjective reward rates occurs after each round. Computations used to produce the figures are based on learning over $K = 10000$ cycles.
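The cycle structure just described can be sketched as follows. This is a hypothetical illustration rather than the authors' implementation: the function name is invented, group members are labelled $0, \ldots, G-1$ in their assigned order, and the opponent is drawn uniformly from the other members.

```python
import random
from typing import List, Tuple


def hawk_dove_cycle(G: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Return the (focal, opponent) contests of one cycle.

    Each member, taken in its assigned order, picks one opponent uniformly
    at random from the other G - 1 members.
    """
    contests = []
    for focal in range(G):
        others = [j for j in range(G) if j != focal]
        contests.append((focal, rng.choice(others)))
    return contests
```

Under these assumptions a cycle contains $G$ contests. Counted per participant this is $2G$ plays, so each member takes part in two contests per cycle on average.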
The resource exploitation game. During a round of the game each group member decides whether to forage socially or solitarily. The subjective reward rates are updated after each round. Computations used to produce the figures are based on learning over $K = 10000$ rounds.
The Producer-Scrounger game. At the start of a round, each group member decides whether to be a producer or a scrounger. If no individual chooses to produce, there is another round of choice. This continues until at least one of the group members chooses to be a producer. This choice phase is instantaneous. Once at least one individual is a producer, all producers search for a food source. We assume that each producer finds food sources as a Poisson process of unit rate, independently of others. Once the first producer finds a food source, this source is consumed by that producer and all the scroungers. This consumption phase is instantaneous. The round then ends and another begins, with all again choosing to be producers or scroungers. Thus exactly one food source is consumed during a round, and the time taken to complete a round has an exponential distribution with parameter equal to the number of producers in that round. The number of producers tends to increase with group size. Thus the mean time taken for a round tends to decrease with group size. The probability that a given producer finds a food source in a round also tends to decrease with group size. Because of these effects, we have chosen the duration of the learning phase to decrease with group size. Specifically, in the computations used to produce the figures we have assumed that rounds continue until the total time exceeds $T_{\max} = 250 + 15000/G$. This is a compromise between having too short a time to learn and having more rounds in which to learn as group size increases.
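The timing of the Producer-Scrounger learning phase can be illustrated with a short sketch. It is a simplification under stated assumptions: the number of producers is held fixed across rounds (in the model it varies with each round's choices), and the function names are hypothetical. Round durations are drawn from an exponential distribution whose rate is the number of producers, as described above.

```python
import random


def t_max(G: int) -> float:
    """Duration of the learning phase: T_max = 250 + 15000 / G."""
    return 250 + 15000 / G


def count_rounds(G: int, n_producers: int, rng: random.Random) -> int:
    """Count rounds played until the total elapsed time exceeds T_max.

    Assumes a fixed number of producers, so every round duration is
    exponential with rate n_producers (mean 1 / n_producers).
    """
    elapsed = 0.0
    rounds = 0
    while elapsed <= t_max(G):
        elapsed += rng.expovariate(n_producers)
        rounds += 1
    return rounds
```

For example, with $G = 10$ the learning phase lasts $T_{\max} = 1750$ time units, and with five producers the expected number of rounds is roughly $5 \times 1750 = 8750$.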

Subjective rewards
The Hawk-Dove game. The true reward (payoff) of obtaining the resource in a contest is $V$. The true reward of losing a Hawk versus Hawk fight is $-C$. The true reward from failing to obtain a reward when choosing Dove is zero. If the individual has inflation bias $\alpha$ then the subjective rewards for these three outcomes are $\alpha V$, $-C$ and $0$ respectively. Computations used to produce the figures are based on the values $V = 2$, $C = 4$.
The resource exploitation game. The true reward from solitary foraging is 1. When $n$ group members choose to forage socially the true reward to each is $V/n$. If an individual has inflation factor $\alpha$ then the subjective rewards for these two outcomes are $1$ and $\alpha V/n$ respectively. Computations used to produce the figures assume that $V = 0.5G$.
The Producer-Scrounger game. In a round of this game any producer that does not find a food source has true reward zero. The producer that finds the food source has true reward $a + A/n$ when there are $n$ scroungers in the group. The true reward to each scrounger is $A/n$. The subjective reward to a producer is the true reward. The subjective reward to a scrounger is $\alpha$ times the true reward. Computations used to produce the figures assume that $a = 2$, $A = 3$.
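The mapping from true to subjective rewards in the three games can be summarised in a few lines. The sketch below is illustrative: the function name and the string labels for games and actions are hypothetical, and the true reward is assumed to have been computed elsewhere.

```python
def subjective_reward(game: str, action: str, true_reward: float, alpha: float) -> float:
    """Apply the inflation factor alpha to a true reward.

    hawk-dove:             obtaining the resource (true reward V > 0) becomes
                           alpha * V; the outcomes -C and 0 are unchanged.
    resource-exploitation: social foraging is inflated, solitary is not.
    producer-scrounger:    scrounging is inflated, producing is not.
    """
    if game == "hawk-dove":
        return alpha * true_reward if true_reward > 0 else true_reward
    if game == "resource-exploitation":
        return alpha * true_reward if action == "social" else true_reward
    if game == "producer-scrounger":
        return alpha * true_reward if action == "scrounge" else true_reward
    raise ValueError(f"unknown game: {game}")
```

Setting `alpha = 1` recovers the unbiased learner, for which subjective and true rewards coincide.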

The choice rule
Let $w_i(s)$ be the subjective reward on round $s$ if action $u_i$ is chosen on this round, with $w_i(s) = 0$ if the other action is chosen. Let [equation missing in source] be the total subjective reward from action $u_i$ in the first $t$ rounds. Consider first the Hawk-Dove and Resource Exploitation scenarios. Let $n_i(t)$ be the number of times action $u_i$ is chosen in the first $t$ rounds. Then after $t$ rounds the subjective rate of reward under action $u_i$ is set to be [equation missing in source]. Computations used to produce the figures assume that $r_0 = 50$. For the Producer-Scrounger scenario, let $\tau_i(t)$ denote the total time devoted to action $u_i$ in the first $t$ rounds. Then after $t$ rounds the subjective rate of reward under action $u_i$ is [equation missing in source]. Computations used to produce the figures assume that $r_0 = T_{\max}/20$. The probability that the individual chooses action $u_2$ on round $t + 1$ is a function, [remainder of sentence missing in source]
Let $\bar{R}$ denote the mean payoff rate during learning. In the Hawk-Dove and the Resource Exploitation scenarios this rate is the total payoff obtained by an individual divided by the total number of rounds played. In the Producer-Scrounger scenario the rate is the total payoff divided by the total time. Note that these payoffs are true fitness increments rather than inflated values. The fitness of an individual is $W = 1 + \bar{R}$, where the background contribution of 1 is the same for all population members. We performed evolutionary simulations for a population with discrete non-overlapping generations using $W$ as the fitness measure. The evolving trait $\alpha$ is regarded as a quantitative trait. There is a single mating type, but inheritance is sexual in that each individual in the next generation is the offspring of two parents from the current generation. Each parent is chosen with a probability that is proportional to its fitness measure $W$, with the two choices being independent.
Inheritance is specified by the infinitesimal model: the trait of the offspring is the average trait of the two parents plus an error that is normally distributed with mean zero and standard deviation $\sigma$. All computations are based on the value $\sigma = 0.02$. For the properties and merits of this form of inheritance see [1]. In all simulations $\alpha = 1$ for all population members in generation 0. The population size is $N = 15000$ in this and all subsequent generations. Figure 4 of the main text illustrates the evolution of $\alpha$ and the resultant changes in population characteristics for the Hawk-Dove scenario. Figures S3 and S4 illustrate the analogous evolutionary simulations for the Resource Exploitation game and the Producer-Scrounger game, respectively.
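One generation of the reproduction scheme described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code. The function name is hypothetical, and the fitness values $W$ are assumed to have been computed already from the learning phase.

```python
import random
from typing import List


def next_generation(alphas: List[float], fitness: List[float],
                    rng: random.Random, sigma: float = 0.02) -> List[float]:
    """Produce the offspring trait values for the next generation.

    Each offspring has two parents drawn independently, with replacement,
    with probability proportional to fitness W. Its trait is the mean of
    the parental traits plus a Normal(0, sigma) deviation, following the
    infinitesimal model. Population size is held constant.
    """
    n = len(alphas)
    first = rng.choices(alphas, weights=fitness, k=n)
    second = rng.choices(alphas, weights=fitness, k=n)
    return [(p + q) / 2 + rng.gauss(0.0, sigma) for p, q in zip(first, second)]
```

Because the two parents are drawn independently, the same individual can occasionally be chosen as both parents of an offspring, which is consistent with the description of independent choices.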