Altruism Can Proliferate through Population Viscosity despite High Random Gene Flow

The ways in which natural selection can allow the proliferation of cooperative behavior have long been seen as a central problem in evolutionary biology. Most of the literature has focused on interactions between pairs of individuals and on linear public goods games. This emphasis has led to the conclusion that even modest levels of migration would pose a serious obstacle to the spread of altruism through population viscosity in group structured populations. Here we challenge this conclusion by analyzing evolution in a framework that allows for complex group interactions and random migration among groups. We conclude that contingent forms of strong altruism that benefit all group members equally, regardless of kinship and without greenbeard effects, can spread when rare under realistic group sizes and levels of migration, due to the assortment of genes resulting only from population viscosity. Our analysis combines group-centric and gene-centric perspectives, allows for arbitrary strength of selection, and leads to extensions of Hamilton's rule for the spread of altruistic alleles, applicable under broad conditions.

S1 Examples. Meaning of ρ and ν. Self-organization

Example 1 (PG, public goods game): The payoff functions are

v^A_k = −C + B(k − 1)/(n − 1),   v^N_k = Bk/(n − 1),

where, unless stated otherwise, we suppose that 0 < C < B. This is, by far, the most widely studied model of cooperative/altruistic behavior. At a net cost C to itself, each type A provides a total benefit B shared uniformly among the other n − 1 members of the group. This model can also be used when, at a gross cost c to itself, each type A provides a total benefit b shared uniformly among the n members of the group, self included. By defining C = c − b/n and B = (n − 1)b/n, we obtain the payoff functions displayed above also in this case.

Example 2 (IPG, iterated public goods game): Here we suppose that a public goods game is repeated a random number of times τ ≥ 1, with average IE(τ) = T. Each time, each member of the group can cooperate at a cost C to itself, resulting in a benefit B/(n − 1) to each one of the other members of its group. Defectors incur no costs and produce no benefits. We suppose that types A cooperate in the first round, and afterwards only cooperate if at least a other members of the group cooperated in the previous round. This is a generalization of the well known tit-for-tat strategy, which corresponds to the case n = 2, a = 1. We will then refer to the strategy of the types A in this example as "multi-individual tit-for-tat (with threshold a)". In the paper, we emphasized the case in which each type A cooperates in the first round, and in later rounds cooperates if and only if its payoff in the previous round was not negative. This corresponds to a taking the smallest integer value that is larger than or equal to (n − 1)C/B (since the PG has v^A_k < 0 if and only if k < 1 + (n − 1)C/B). In this important case, we call the strategy of types A "payoff dependent contingent cooperation".
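The total IPG payoffs under multi-individual tit-for-tat can be sketched mechanically. The following minimal Python illustration is ours (the function name is hypothetical, and a fixed number of rounds T is used in place of the random τ); since all types A in a group behave identically, cooperation either persists for all rounds or stops after the first one:

```python
import math

def ipg_total_payoffs(n, k, C, B, T, a):
    """Total payoffs over T rounds of the iterated public goods game when the
    k type-A members of a group play multi-individual tit-for-tat with
    threshold a.  Returns (total payoff per type A, total payoff per type N).
    The dynamics is deterministic: all k types A cooperate in round 1; in
    later rounds each cooperates iff at least a OTHER members cooperated in
    the previous round."""
    pay_A = pay_N = 0.0
    cooperators = k                  # round 1: all k types A cooperate
    for _ in range(T):
        if cooperators > 0:
            # each cooperator pays C and gives B/(n-1) to every other member
            pay_A += -C + (cooperators - 1) * B / (n - 1)
            pay_N += cooperators * B / (n - 1)
        # a type A keeps cooperating iff >= a other members cooperated
        cooperators = k if cooperators - 1 >= a else 0
    return pay_A, pay_N

# payoff dependent contingent cooperation: a = smallest integer >= (n-1)C/B
n, C, B, T = 10, 1.0, 3.0, 100
a = math.ceil((n - 1) * C / B)       # here a = 3
```

Groups with k − 1 ≥ a sustain cooperation for all T rounds; in the other groups cooperation stops after the first round, so a lone altruist pays the cost C essentially once rather than T times.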
The IPG was studied independently in [3] and in [9], in the trait-group framework, which corresponds to the 2lFW with m = 1, i.e., random assortment in groups. Both papers identified stable equilibria with positive fractions of types A, but they also observed that, when m = 1, altruists cannot invade when rare.
The relevance of this model was emphasized in the paper, and it will be analyzed in detail in Section S8.

Example 3 (THR, threshold model): The payoff functions are

v^A_k = −C + A 1{k ≥ θ},   v^N_k = A' 1{k ≥ θ},

where 1{k ≥ θ} is the indicator that k ≥ θ, for positive constants C, A and A', and an integer θ ∈ {1, 2, ..., n}. The idea here is simple: the allele A carries a cost, but allows its carriers to gain benefits if sufficiently many of them are in the group. Unless otherwise stated, we assume that A > C. Types N obtain benefits also when types A do, but we allowed for the possibility that these benefits are different from those of the types A. These payoff functions can be seen as simplifications of more realistic ones, in which payoffs to types A initially grow slowly with k, then steeply, and then quickly saturate. For instance, the model of coordinated punishment of [2] provides examples of this sort. In [1] a class of models that bridge between the PG and the THR was introduced and their relevance discussed. Similarly to the results of [3] and [9] for the IPG, it was shown in [1] that the THR can have polymorphic equilibria, with types A and N coexisting, when assortment is random, i.e., in the trait-group framework, which corresponds to the 2lFW with m = 1. In this case, as for the IPG, types A can nevertheless never proliferate when p is small (the equilibrium with no types A is an ESS). In [17], the case n = 3 of the threshold model (called there "stag hunt game") was discussed in connection with the conceptual issue of the role of Hamilton's rule. The THR behaves in a very simple way under iteration. Suppose that the game is repeated T times over a life-cycle, and suppose that types A are "payoff dependent conditional cooperators", meaning that a type A cooperates in the first round, and in later rounds cooperates if and only if its payoff in the previous round was not negative. The result is that types A cooperate exactly once if k < θ, and cooperate T times if k ≥ θ. The total payoffs to types A and to types N over the life-cycle are then those of a THR with the same value of C, with A replaced by T(A − C) + C and A' replaced by TA'. Fig. ?? displays instances of the three models above.
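The reduction just described can be checked mechanically. The sketch below (Python; function names are ours, and a deterministic number of rounds T is assumed) simulates the T repetitions with payoff dependent conditional cooperators and compares the totals with single-round THR payoffs after the substitutions A → T(A − C) + C and A' → TA':

```python
def thr_payoffs(n, k, C, A, Aprime, theta):
    """Single-round threshold-model payoffs: a type A pays C and, when the
    number k of types A in the group satisfies k >= theta, receives A;
    a type N receives Aprime when k >= theta."""
    vA = -C + (A if k >= theta else 0.0)
    vN = Aprime if k >= theta else 0.0
    return vA, vN

def iterated_thr_payoffs(n, k, C, A, Aprime, theta, T):
    """T repetitions with payoff dependent conditional cooperators: types A
    cooperate in round 1 and keep cooperating iff the previous round's
    payoff was not negative (we assume A > C)."""
    totA = totN = 0.0
    active = True                    # are types A still cooperating?
    for _ in range(T):
        if active:
            vA, vN = thr_payoffs(n, k, C, A, Aprime, theta)
            totA += vA
            totN += vN
            active = vA >= 0
    return totA, totN
```

For every k the totals coincide with thr_payoffs(n, k, C, T(A − C) + C, TA', theta): groups above the threshold earn T(A − C) in total, while in groups below it the types A pay the cost C only once.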
A more detailed discussion of these and other examples can be found in [15], where several conceptual issues are also considered. A good source of motivated examples is [8]. Fig. S2 displays curves of ρ vs. m for instances of the three examples above. In each case, ρ is decreasing in m, and there is a single critical value m_s defined by ρ(m_s) = 1 (the subscript 's' stands for 'survival' of the allele A, when rare). The critical rate of gene flow that allows types A to proliferate when rare can also be identified by the critical relatedness R^0_s = R^0(m_s), where

R^0(m) = (1 − m)^2/(n − (n − 1)(1 − m)^2) ≈ 1/(1 + 2nm)

(the approximation refers to the case in which m is small) is the relatedness obtained from neutral genetic markers (see Sections S5 and S7). Fig. S3, Fig. S5 and Fig. S7 display values of m_s as a function of the strength of selection δ. In these pictures, the short horizontal red lines indicate the limits as δ → 0, obtained by using (A2). Note the good agreement between these values obtained from (A2) and the asymptotic behavior of m_s with vanishing δ in the pictures. As in Panel A of Fig. 2 in the paper, the region above the curve defined by m_s represents the region where the allele A cannot invade the population, while in the region below this curve it can invade the population. Fig. S4, Fig. S6 and Fig. S8 display corresponding values of the critical relatedness R^0_s = R^0(m_s) above which types A proliferate, as a function of the strength of selection δ. In Fig. S5 and Fig. S6, the case T = 1 corresponds to the PG, and comparison between the results for the various values of T illustrates again the strength of group selection under contingent cooperative behavior. In Fig. S7 and Fig. S8 a similar phenomenon is shown for the threshold model, under contingent cooperation. Fig. S9 shows how, even for the standard public goods game with B = 2C, the distribution ν depends on δ and m. Fig. S10, Fig. S11, Fig. S12 and Fig. S13 illustrate features of the evolution f(t + 1) = f(t)M(A + B), which self-organizes the copies of the allele A. These pictures also illustrate the meanings of ρ and ν.
It is natural to wonder if there are simple conditions under which one can guarantee (a) that when m = 1 (random assortment) types A will not proliferate when rare, and (b) that they will proliferate when rare when the migration rate m is low. The following conditions, satisfied by the three examples above, imply, respectively, each one of these two facts (see Section S7).
(C1) v^A_1 < 0, i.e., w^A_1 < 1 = w^N_0, so that an isolated type A individual has lower fitness than the wild type N has in groups without altruists.
(C2) v^A_n > 0, i.e., w^A_n > 1 = w^N_0, so that type A individuals have greater fitness when in single-type groups than type N individuals have when in single-type groups.

Figure S5 or S6 (caption fragment): ... and T = 500 (green full line). Panel B depicts the same conditions except for a = 8. Short horizontal red lines indicate critical values at the weak selection limit, obtained from (A2) in the paper. Each curve has T fixed, but to compare different values of T, the product δT is a natural measure of the strength of selection, and is used on the horizontal axis.

Figure S8 (caption fragment): As in Fig. S7, each curve has A fixed, but to compare different values of A, the product δA is a natural measure of the strength of selection, and is used on the horizontal axis. Short horizontal red lines indicate critical values at the weak selection limit, obtained from (A2) in the paper. These values are: Panel A: 0.012, 0.017, 0.044, 0.071. Panel B: 0.061, 0.075, 0.137, 0.194. Note the extremely low values of critical relatedness in Panel A. The large values of A can result from contingent cooperation, based on feedback, as for the IPG. For instance, suppose that a certain activity repeats itself T times over a life-cycle, and that in each repetition the payoff is well described by the threshold model. If types A discontinue their participation when their payoff in the previous round was negative (as in the IPG discussed in Fig. 2 in the paper), then the resulting payoff over the T iterations is also given by a threshold model, with the same value of C, but A replaced by (A − C)T + C, and A' replaced by TA'. This gives plausibility to values of A and A' as large as those in this figure, since T can be in the hundreds, or thousands (see the discussion of the IPG in the paper).

Figure S10 (caption fragment): Note that in all cases f(t) reaches in a few generations a steady state, in which it shrinks (ρ < 1), or grows (ρ > 1), as a multiple of ν.
When m approaches m_s, the eigenvalue ρ becomes close to 1; the stationary movement along the direction given by ν slows down, while the trajectories towards this direction straighten themselves, but the approach to this direction is not slowed down.

Figure S11: Self-organization of copies of A. In this picture we have the IPG with n = 10, C = 1, B = 3, T = 100, a = 2, δ = 0.01, and m = 0.153, slightly smaller than m_s = 0.163. The top part shows the evolution of p(t), and the bottom part shows the corresponding evolution of f(t) = (f_1(t), ..., f_10(t)), displayed as normalized histograms. Two initial conditions are compared: (Red) f(0) = 10^−2 (1, 0, ..., 0), so that p(0) = 10^−3. (Black) f(0) = 10^−5 (0, ..., 0, 1), so that p(0) = 10^−5. Note that from generation to generation the distribution of copies of A adjusts itself to the same stationary distribution, "losing memory of the initial distribution".
Figure S12: Self-organization of copies of A. This picture corresponds to the same model and situation described in Fig. S11, but with a different time-frame, including later times. Note that eventually the two curves of p(t) become parallel straight lines, illustrating the exponential growth of p(t) at rate ρ independently of the initial condition. This picture also illustrates two other important points: 1) The possible non-monotonicity of p(t).
2) The fact that the asymptotic rate of growth may be smaller than the initial rate of growth. Indeed, computations of ∆p only indicate the long term prospects for the allele A when done under stationary conditions, as in (A1). The initial distribution of copies of A in the red line produces neighbor-modulated fitness for A below that of allele N, so that ∆p(0) < 0. In contrast, the initial distribution of copies of A in the black line produces neighbor-modulated fitness for A well above that of allele N, so that not only is ∆p(0) > 0, but this growth happens at an unsustainably high rate. The distribution ν, towards which the copies of A self-organize, is optimal for their stationary, stable growth. This is so because (ρ, ν) is the leading eigenpair of the driving matrix M(A + B): ν is the probability vector ν' that satisfies the eigenvalue (stationarity) equation ν'M(A + B) = ρ'ν' with maximum ρ'.
Figure S13: Self-organization of copies of A. This picture corresponds to the same model described in Fig. S11, but now m = 0.173 is slightly larger than m s = 0.163. Note that again eventually the two curves of p(t) become parallel straight lines, illustrating in this case the exponential decrease of p(t) at a rate independent of the initial condition. Here again one can see that ∆p(0) is not indicative of the relevant long term evolution. The self-organized distribution ν is still optimal for the proliferation of the allele A in a stable, sustainable, fashion. But when ρ < 1, as in this picture, this optimal stable distribution is still not good enough for A to spread, and instead, its copies are eliminated by natural selection.

S2 Use of Perron-Frobenius Theorem
When 0 < m < 1, the n × n matrix M(A + B) has all its entries strictly positive. The Perron-Frobenius Theorem [13] states that from this it follows that:

1) M(A + B) has an eigenvalue ρ that is real and positive and is larger in absolute value than all the other eigenvalues of this matrix.

2) To ρ there corresponds a single left-eigenvector ν and a single right-eigenvector ζ, up to multiplication by constants. Both eigenvectors have all their entries strictly positive. With no loss, we normalize them as follows: Σ_{k=1}^n ν_k = 1, and Σ_{k=1}^n ν_k ζ_k = 1.
We apply the theorem to our dynamical system f(t) as follows. If f(0) ≠ (0, ..., 0), then some of its entries are strictly positive, and hence f(0)ζ is a 1 × 1 matrix with a strictly positive entry C. Therefore,

f(t) = f(0)(M(A + B))^t = Cρ^t ν + e(t),   (S1)

where the error term e(t) decays, relative to Cρ^t ν, at an exponential rate β^t, with β < 1 given by the ratio between the absolute value of the second largest eigenvalue of M(A + B) and ρ. Illustrations of (S1) appear in Fig. S10, Fig. S11, Fig. S12 and Fig. S13. The error term in (A1) can now be expressed as follows. Combining the error terms in (S1) and from the approximation W̄ = 1, which is of order p (a randomly chosen individual has probability of order p of being in a group with types A), we obtain an error of order p^2 + pβ^t in (A1). The left-hand side of (A1) in this estimate refers to the linearized evolution f(t + 1) = f(t)M(A + B), but if one wants to estimate ∆p for the original non-linear dynamical system f(t), then the additional error term in (A1) is also of order p^2, by the very fact that this dynamical system admits a linearization close to (0, ..., 0).
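The way the theorem is used here can be emulated numerically. The sketch below (Python; the matrix entries are illustrative stand-ins, not the actual M(A + B) of the paper) extracts the leading eigenpair (ρ, ν) by power iteration, which mirrors how repeated multiplication by M(A + B) self-organizes f(t) towards the direction ν:

```python
def leading_left_eigenpair(M, iters=500):
    """Power iteration for the Perron-Frobenius eigenpair of a matrix with
    strictly positive entries: returns (rho, nu) with nu M = rho nu and
    nu normalized as a probability vector."""
    n = len(M)
    nu = [1.0 / n] * n
    rho = 1.0
    for _ in range(iters):
        w = [sum(nu[i] * M[i][j] for i in range(n)) for j in range(n)]
        rho = sum(w)            # since nu summed to 1, this estimates rho
        nu = [x / rho for x in w]
    return rho, nu

# a small positive matrix standing in for M(A + B) (illustrative values only)
M = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.9]]
rho, nu = leading_left_eigenpair(M)
```

Iterating f(t + 1) = f(t)M multiplies the component of f along ν by ρ each generation, while the other components decay relative to it; this is the content of (S1).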

S3 Focal individual. Sampling bias. Relatedness. F ST . IBD
Focal individual: Suppose that in generation t a group is chosen at random, and its members are ordered in a random fashion. The chosen group is called the focal group, the first individual in the ordering is called the focal individual, and the second individual in the ordering is called the co-focal individual. Note that this random experiment is equivalent to choosing a focal individual at random and then choosing a co-focal individual at random from the other n − 1 individuals in the focal's group. We will denote by IP_t probabilities that refer to the sampling just described, and use IE_t for corresponding expected values. We denote by A_j the event that the j-th individual in the random ordering of the members of the focal group is a type A. And we denote by I_j the indicator of the event A_j, i.e., I_j is the random variable that takes value 1 if A_j occurs and value 0 if its complement, A^c_j, occurs. We denote by K = I_1 + ... + I_n the random number of types A in the focal group, and we set p̄ = K/n, the random fraction of types A in the focal group. Clearly IE_t(p̄) = p(t).
Bayesian sampling bias: Conditioning on the focal being type A introduces a bias on the number of types A in the focal group. This is well known and intuitive, since it is more likely that the focal individual will be type A if the focal group has many types A. The standard Bayesian computation follows. For k = 0, clearly IP_t(K = k | A_1) = 0, while for k ≥ 1,

IP_t(K = k | A_1) = (k/n) f_k(t) / Σ_{j=1}^n (j/n) f_j(t) = k f_k(t) / Σ_{j=1}^n j f_j(t).   (S2)
Relatedness as regression coefficient: We define the relatedness r_t as the regression coefficient of I_2 on I_1, so that

Cov_t(I_1, I_2) = r_t Var_t(I_1).   (S3)

Equivalently, r_t can be defined by each one of several standard identities.

Wright's F_ST: The relatedness r_t is closely associated with Wright's F_ST statistic, defined as

F_ST,t = Var_t(p̄) / ( Var_t(p̄) + IE_t(p̄(1 − p̄)) ).

The numerator measures intergroup variability, and the second term in the denominator measures average intragroup variability. Note that the denominator can also be written as

Var_t(p̄) + IE_t(p̄(1 − p̄)) = IE_t(p̄) − (IE_t(p̄))^2 = p(t)(1 − p(t)) = Var_t(I_1),

showing that it measures total variability in the population. To see how F_ST,t relates to r_t, we use (S3) to write

Var_t(K) = Var_t(I_1 + ... + I_n) = n Var_t(I_1) + n(n − 1) Cov_t(I_1, I_2) = n Var_t(I_1) + n(n − 1) r_t Var_t(I_1) = n Var_t(I_1)(1 + (n − 1)r_t).

Since p̄ = K/n, we have Var_t(p̄) = Var_t(K)/n^2, and we obtain

F_ST,t = (1 + (n − 1) r_t)/n.

In particular, when n is large F_ST,t is close to r_t.

Figure S14: This diagram illustrates the concept of identity by descent (IBD) in the 2lFW. Two individuals X and Y in a given group in generation t, regardless of their type, are identical by descent (IBD) if their lineages, when followed back in time, coalesce before a migration event (indicated by a dashed arrow in the right panel). With a migration rate of m, a migration event typically takes place within a random number of generations, of order 1/m, when lineages are followed backwards.
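The relatedness and F_ST identities above are straightforward to verify numerically. In the sketch below (Python; the list of group compositions is hypothetical), p, Var_t(K), Var_t(I_1), r_t and F_ST,t are computed exactly from the empirical distribution defined by the focal-group sampling scheme, and the identity Var_t(K) = n Var_t(I_1)(1 + (n − 1)r_t) is checked together with the resulting relation between F_ST,t and r_t:

```python
def group_stats(ks, n):
    """Given the number of type-A members in each group (list ks, group size
    n), compute p, Var(K), Var(I1), the relatedness r = Cov(I1,I2)/Var(I1)
    and F_ST, exactly, from the empirical distribution of the sampling
    scheme (random focal group, random focal and co-focal individuals)."""
    g = len(ks)
    mean_k = sum(ks) / g
    p = mean_k / n                        # overall frequency of types A
    var_K = sum(k * k for k in ks) / g - mean_k ** 2
    var_I1 = p * (1 - p)                  # total variability
    # P(both focal and co-focal are A) = E[K(K-1)] / (n(n-1))
    cov_I12 = sum(k * (k - 1) for k in ks) / (g * n * (n - 1)) - p * p
    r = cov_I12 / var_I1
    f_st = var_K / (n * n * var_I1)       # Var(pbar) / total variability
    return p, var_K, var_I1, r, f_st

ks = [0, 0, 1, 3, 7, 10, 10, 2, 0, 5]     # hypothetical group compositions
p, var_K, var_I1, r, f_st = group_stats(ks, 10)
```

On any such data set the two relations Var(K) = n Var(I_1)(1 + (n − 1)r) and F_ST = (1 + (n − 1)r)/n hold exactly, up to floating point error.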
Identity by descent: The 2lFW allows for a very natural definition of "kin" and "genetic identity by descent", IBD for short (see Fig. S14). Say that two individuals in the same generation are l-kin in case they share a common ancestor at most l generations in their past. (The 1-kin of an individual are its siblings, its 2-kin are its siblings and cousins, etc.) Because the number of groups g is large, individuals from different groups have negligible probability of being l-kin, unless l is comparable to g. As for individuals in the same group, migration events in their lineages play a major role in whether or not they are kin. If, when we follow their lineages back in time, we find a migration event in one of these lineages before they coalesce, then we know that the probability that they will coalesce in a time that is much shorter than g is negligible. With this in mind, say that two individuals in the same generation are IBD in case, following their lineages backwards in time, they coalesce before (in the backwards sense) a migration event happens in either one. From the discussion above, we know that being IBD in the 2lFW framework is essentially the same as being l-kin for some l that may have to be large, but is not comparable to g. We denote by D the event that the focal and the co-focal individuals are IBD, and define genetic relatedness for the allele A by

R_t = IP_t(D | A_1).

Then, neglecting terms of order p, we have, from (S3), r_t = IP_t(A_2 | A_1).
To relate now r_t to R_t, we proceed as follows. It is clear that IP_t(A_2 | D A_1) = 1. On the other hand, if the event D^c happened, the type of the co-focal was sampled, independently of the type of the focal, from the whole population in a generation t − s, when there was a migration event in the lineage of the co-focal. Since migration occurs at fixed rate m, s is of the order of the constant 1/m, and the frequency of types A in generation t − s was of the same order as p. This implies that IP_t(A_2 | A_1 D^c) is of order p, and hence, neglecting again terms of order p,

IP_t(A_2 | A_1) = IP_t(D | A_1) IP_t(A_2 | D A_1) + IP_t(D^c | A_1) IP_t(A_2 | A_1 D^c) = R_t.   (S4)

We have now, using also (S4),

r_t = R_t.   (S5)

Case of low prevalence in steady state: We suppose now the conditions of (A1): in addition to p(t) << 1, also t >> 1. Then we have the steady state relation f(t) = Cρ^t ν, and (S5) and (S2) yield

r_t = R_t = Σ_{k=1}^n [(k − 1)/(n − 1)] k ν_k / Σ_{j=1}^n j ν_j =: R_ses,

when p(t) << 1 and t >> 1. (The symbol =: means that a new notation is being defined on its right side. The meaning of the subscript 'ses' will be explained in Section S9.)

S4 Linear payoff function
Suppose that

v^A_k = −C + B(k − 1)/(n − 1),

as in the public goods game. Note that any linear function of k can be written in this form. We will not make any assumption on v^N_k, or on the values of C and B. Combining the notation introduced in the derivation of (A1) and that introduced in Section S3, we have W̄_A = 1 + δ IE_t(v^A_K | A_1). When p << 1 we have, neglecting terms of order p, W̄ = W̄_N = 1, and therefore

W̄_A = 1 + δ(−C + B R_t),

where we used (S5). Hence, using W̄ ∆p = p(W̄_A − W̄), and neglecting terms of order p^2,

∆p = pδ(B R_t − C)   (S6)

in this case. Note that here we did not need the assumption that t >> 1, but the relatedness R_t is time-dependent, even if selection is weak. If we now also assume t >> 1, then, from (S5),

∆p = pδ(B R_ses − C).   (S7)

S5 Limit of weak selection. IBD distribution π as a needed extension of relatedness

Separation of time scales: In this section, we will denote by Bin(N, q) a random variable with binomial distribution, corresponding to N independent attempts, each one with probability q of success. When δ = 0, we have w^A_k = w^N_k = w̄_k = 1, for all values of k. The 2lFW evolves in this case by neutral genetic drift (A is a neutral marker). In this case, when p << 1, we have the linearized evolution f(t + 1) = f(t)M^0(A + B), which yields, after some simplifications,

(M^0(A + B))_{k,j} = (k/j) ( m δ_{j,1} + (1 − m) IP(Bin(n − 1, (1 − m)k/n) = j − 1) ).   (S8)

When 0 < p(0) << 1 and t >> 1, we have from the Perron-Frobenius theorem, as in the paper and in Section S2, f(t) = C(ρ^0)^t ν^0, where C > 0 depends on f(0), ρ^0 is the Perron-Frobenius eigenvalue of M^0(A + B) and ν^0 its corresponding left-eigenvector, normalized as a probability vector. But when δ = 0, A and N have the same relative fitness, so the expected frequency of A cannot change. This means that p(t) = Σ_{k=1}^n (k/n) f_k(t) must be a constant, and hence we must have ρ^0 = 1. We want to think of the case of weak selection, i.e., 0 < δ << 1, as a perturbation of the case with δ = 0. We consider for this purpose the limit δ → 0. By continuity of eigenvalues and eigenvectors, we have then ρ → 1 and ν → ν^0.
Note that, since ν^0 does not depend on v^A_k and v^N_k, the dependence of ν on these payoff functions should vanish as δ → 0 (see Fig. S15 for an illustration). This is the well known separation of time scales that occurs under weak selection (see, e.g., [14], [11]). Most of the time, in most places, evolution is occurring as if δ were 0, i.e., via neutral genetic drift. Only occasionally do events happen that are caused by the slight differences in fitness of the individuals. Quantitatively, demographics operates even when δ = 0, in times of order 1/m, while selection operates in times of order 1/δ. When 0 < δ << 1, the right-hand side of (A1) can be well approximated by replacing ν with ν^0 in it. The error term in this approximation is of order δ^2, since the error in replacing ν with ν^0 is of order δ. Define now π_k = kν^0_k / Σ_{i=1}^n iν^0_i. Then (A1) yields, in good approximation, ∆p = pδ Σ_{k=1}^n π_k v^A_k. The quickest way to see that π is the invariant distribution of the Markov chain driven by Q, as introduced in (A2), is as follows. Note that from (S8) and the definition of Q, we have j(M^0(A + B))_{k,j} = kQ_{k,j}. Therefore, the eigenvalue equation ν^0 M^0(A + B) = ν^0 is equivalent to the equation πQ = π. (See Fig. S16 for an illustration.) This completes the proof of the claims pertaining to (A2) in the paper. But the last argument above hides the reasons for these results, and was not our original argument. In the next subsection a less formal but more intuitive line of argumentation will be presented.
The Markov chain driven by Q can be described as follows. If at time t it is in state X t ∈ {1, ..., n}, then at time t + 1 it will take the value X t+1 = 1 with probability m, or the value X t+1 = 1 + Bin(n − 1, (1 − m)X t /n) with probability 1 − m. We will explain next how this relates to IBD and how it makes the relationship between ν and this Markov chain intuitive.
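This chain is trivial to simulate. The sketch below (Python; parameter values are illustrative) runs the chain and records the empirical distribution of X_t; its mean should approach 1 + (n − 1)R^0(m), with R^0(m) = (1 − m)^2/(n − (n − 1)(1 − m)^2), the neutral-marker relatedness quoted in Section S1:

```python
import random

def simulate_chain(n, m, steps, seed=1):
    """Simulate the Markov chain driven by Q: with probability m the focal is
    a migrant and the chain jumps to 1; otherwise it moves to
    1 + Bin(n - 1, (1 - m) X / n)."""
    rng = random.Random(seed)
    x = 1
    visits = [0] * (n + 1)
    for _ in range(steps):
        if rng.random() < m:
            x = 1
        else:
            q = (1 - m) * x / n
            x = 1 + sum(rng.random() < q for _ in range(n - 1))
        visits[x] += 1
    return [v / steps for v in visits]    # empirical distribution on {1..n}

n, m = 10, 0.1
pi_hat = simulate_chain(n, m, steps=200_000)
mean_hat = sum(k * pi_hat[k] for k in range(1, n + 1))
R0 = (1 - m) ** 2 / (n - (n - 1) * (1 - m) ** 2)
```

For n = 10 and m = 0.1 the empirical mean settles near 1 + 9 R^0(0.1) ≈ 3.69, in line with the interpretation of the chain as the number of group members IBD to the focal individual.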
Markov chain driven by Q and IBD: From this point on, in this section, we suppose that δ = 0, except when stated otherwise. As before, suppose that a focal group is chosen at random in generation t, and that a focal individual from this group is then also chosen at random. Define K D as the random variable that gives the number of individuals in the focal group that are IBD to the focal individual, herself included in the counting. Then we claim that K D evolves from generation to generation as a Markov chain driven by Q. To see this (see Fig. S17), note that if the focal is a migrant, then she is only IBD to herself in her group, while if she is not a migrant, then she is IBD to herself and to each one of the other members of her group that are not migrants and that have a mother who is IBD to her mother. Since the assignment of mothers is given by the intragroup Fisher-Wright process (with no selection, since δ = 0), we have for K D the same probabilistic evolution as for the chain X t above.
We want to understand now, on intuitive grounds, why ν^0 relates to the invariant distribution of the Markov chain driven by Q in the way we found computationally in the previous subsection. Suppose that 0 < p(0) << 1 and t >> 1, so that f(t) = Cν^0. Suppose that at time t we choose a random focal group and from that group a random focal individual. We condition now on the event A_1, that the focal individual is type A. Because p(t) = p(0) << 1, it is very likely then that the number of types A in the focal group is K = K^D, since individuals in the focal group who are not IBD to the focal individual have only probability p of being type A, while those who are IBD to the focal must be type A. Because we are under neutral drift, with δ = 0, conditioning on A_1 does not affect lineages. This means that

IP_t(K = k | A_1) ≈ IP_t(K^D = k | A_1) = IP_t(K^D = k).

By the Bayesian bias in sampling, the left-hand side is

k f_k(t) / Σ_{j=1}^n j f_j(t) = k ν^0_k / Σ_{j=1}^n j ν^0_j = π_k,

where the first equality is a good approximation when 0 < p(0) << 1 and t >> 1, and the second equality is the definition of π. On the other hand, since t >> 1, the right-hand side is close to the invariant distribution of the Markov chain driven by Q, as we saw in the previous paragraph. This shows that π is that invariant distribution.

Figure S17: This diagram illustrates why K^D evolves as a Markov chain driven by Q. In this picture K^D_u represents the number of individuals that are IBD to the focal individual F_u in generation u (red circle). Two scenarios are discernible. MC1 (left panel): the focal individual is a migrant. This happens with probability m and implies that K^D_u = 1. MC2 (right panel): the focal individual F_u is not a migrant, and she is a child of F_{u−1}. Each individual in the focal group in generation u chooses a mother from the group of F_{u−1} in the previous generation with uniform probability, as δ = 0. With probability K^D_{u−1}/n the chosen mother is IBD to F_{u−1} (orange circles) and, consequently, her children are also IBD to F_u, provided that they are not migrants. In this case, the number of individuals in generation u that are IBD to the focal is, therefore, 1 (for the focal individual herself) plus a number of individuals given by a binomial random variable with probability of success (1 − m)K^D_{u−1}/n in n − 1 trials.

Moments of π:

The Markov chain introduced above can be used to write recursions for quantities of interest. For instance, this method can be used for computing the moments of the distribution π, M_l = Σ_k k^l π_k, l = 1, 2, .... To express the l-th moment, M_l, in terms of M_j, j = 1, 2, ..., l − 1, we will use the following fact. If Y has a Bin(N, q) distribution, then

IE(Y^l) = Σ_{j=1}^l S(l, j) N_j q^j,

where N_j = N(N − 1)···(N − j + 1), and the Stirling number of the second kind, S(l, j), is the number of ways in which a set with l elements can be partitioned into j non-empty sets. Clearly S(l, 1) = S(l, l) = 1, for all l. They are also known to satisfy the recurrence S(l + 1, j) = jS(l, j) + S(l, j − 1). For our purpose, it is convenient to write s = 1 − m, and to consider the stationary Markov chain, i.e., the case in which IP(X_0 = k) = π_k, so that also IP(X_1 = k) = π_k. Look now at the random variable (X_1 − 1)^l. With probability m it takes the value 0, and with probability s it takes the value (Bin(n − 1, sX_0/n))^l. Therefore

IE((X_1 − 1)^l) = s IE((Bin(n − 1, sX_0/n))^l).

But since X_0 and X_1 both have distribution π, we obtain

Σ_{j=0}^l C(l, j)(−1)^{l−j} M_j = s Σ_{j=1}^l S(l, j)(n − 1)_j (s/n)^j M_j,

where C(l, j) denotes the binomial coefficient. This provides us with the aimed recursion:

M_l = [1 − s^{l+1}(n − 1)_l / n^l]^{−1} { s Σ_{j=1}^{l−1} S(l, j)(n − 1)_j (s/n)^j M_j − Σ_{j=0}^{l−1} C(l, j)(−1)^{l−j} M_j }.   (S9)

For l = 1, (S9) reduces to

M_1 = n / (n − (n − 1)s^2).   (S10)

Relatedness: The mean M_1 of π is directly related to the relatedness in groups under δ = 0 and demographic equilibrium (t >> 1), which we will denote by R^0. First we observe that when δ = 0, conditioning on the focal being type A does not change the demographic distribution, hence R^0 = lim_{t→∞} IP_t(D | A_1) = lim_{t→∞} IP_t(D), where, as before, D is the event that the focal and the co-focal individuals are IBD. Recall the enumeration of the members of the focal group, in which the first individual is the focal. Then IE_t(K^D) = 1 + (n − 1)IP_t(D), and therefore IP_t(D) = (IE_t(K^D) − 1)/(n − 1). Putting these pieces together yields

R^0 = (M_1 − 1)/(n − 1) = (1 − m)^2 / (n − (n − 1)(1 − m)^2).   (S11)

(See Fig. S18.) When m is small, (S11) yields the approximation

R^0 ≈ 1/(1 + 2nm).   (S12)

This is a well known result by Wright for the infinite islands model, for haploid individuals.
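The moment recursion can be implemented directly. The sketch below (Python; function names are ours) computes M_l from the stationarity of IE((X_1 − 1)^l), using the Stirling recurrence S(l + 1, j) = jS(l, j) + S(l, j − 1). For n = 2 and m = 1/2 the stationary distribution of the chain can be found by hand (π = (6/7, 1/7)), which gives an independent check of the first two moments:

```python
from math import comb

def stirling2(l, j):
    """Stirling numbers of the second kind, S(l+1,j) = j S(l,j) + S(l,j-1)."""
    S = [[0] * (l + 1) for _ in range(l + 1)]
    S[0][0] = 1
    for a in range(1, l + 1):
        for b in range(1, a + 1):
            S[a][b] = b * S[a - 1][b] + S[a - 1][b - 1]
    return S[l][j]

def falling(N, j):
    """Falling factorial N_j = N (N-1) ... (N-j+1)."""
    out = 1
    for i in range(j):
        out *= N - i
    return out

def moments_pi(n, m, lmax):
    """Moments M_l of pi, l = 0..lmax, from the stationarity relation
    E[(X_1 - 1)^l] = (1 - m) E[(Bin(n - 1, (1 - m) X_0 / n))^l]."""
    s = 1.0 - m
    M = [1.0]                               # M_0 = 1
    for l in range(1, lmax + 1):
        rhs = s * sum(stirling2(l, j) * falling(n - 1, j) * (s / n) ** j * M[j]
                      for j in range(1, l))
        lhs = sum(comb(l, j) * (-1) ** (l - j) * M[j] for j in range(l))
        denom = 1.0 - s ** (l + 1) * falling(n - 1, l) / n ** l
        M.append((rhs - lhs) / denom)
    return M
```

For l = 1 the output reproduces the closed form n/(n − (n − 1)(1 − m)^2), and (M_1 − 1)/(n − 1) reproduces the neutral relatedness formula quoted in Section S1.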
It is important to clarify that even when δ = 0, our 2lFW framework is not identical to the infinite islands model. In that model there are a large number g (to be taken to ∞ in the computations) of islands, each one with n individuals. In generation t + 1 the n individuals in each island choose, independently, a parent from the individuals in the same island in generation t. This is followed by migration at rate m, implemented in exactly the same way as in the 2lFW framework. In contrast with what happens in the 2lFW, each island is "parent" to exactly one island in the next generation, while in the 2lFW, when δ = 0, each group is parent to a random number of groups, with distribution Bin(g, 1/g). In spite of this difference, it is not a mere coincidence that both frameworks lead to the same formula (S11): they share the same coalescence structure when we ask questions related to IBD. Not only is R^0 identical in these frameworks, but so is the distribution π. Indeed, when we follow lineages backwards in time, until migration events occur, it does not matter whether the group that we are following from generation to generation is the same (as in the infinite islands case) or changes (as in our 2lFW case).
For arbitrary payoff functions v^A_k and v^N_k, we have, from (S6), in steady state and in the limit of weak selection,

∆p = pδ Σ_{k=1}^n π_k v^A_k.   (S13)

Mnemonic recursions: There is a well known (in the infinite island case, but the same applies to the 2lFW) and simpler way to derive (S11). If a focal and a co-focal individual are chosen from the same focal group in generation t >> 1, then the probability R^0 that they are IBD can be calculated as follows. For them to be IBD neither can be a migrant (probability (1 − m)^2). Given that they are not migrants, they will be IBD if they have the same mother (probability 1/n), or if they have distinct mothers and their mothers are IBD (probability (1 − 1/n)R^0, using the fact that t >> 1 implies demographic equilibrium). Combining these, we have

R^0 = (1 − m)^2 [ 1/n + (1 − 1/n) R^0 ],   (S14)

which yields (S11). The very same kind of powerful reasoning can be used to obtain the recursions for π_1, ..., π_n that are summarized in the equation πQ = π. This is done in the second derivation of (A2) in the paper, and reproduced next, so that its parallel with the derivation of (S14) becomes apparent. For this we will use the standard Kronecker notation: δ_{i,j} = 1 if i = j and δ_{i,j} = 0 if i ≠ j. Now, the probability π_k that the focal is IBD to exactly k − 1 other members of her group is δ_{k,1} if the focal is a migrant (probability m), while if she is not a migrant (probability 1 − m), then we have to consider how many individuals in her mother's group were IBD to her mother. If, counting her mother, that number was j (probability π_j, assuming demographic equilibrium), then the probability that the focal is IBD to exactly k − 1 other members of her group is equal to the probability that, of the n − 1 other members of the focal's group, exactly k − 1 chose one of the j candidate mothers that were IBD to the focal's mother, and moreover did not become migrants (probability IP(Bin(n − 1, (1 − m)j/n) = k − 1)). Combining these pieces, we have

π_k = m δ_{k,1} + (1 − m) Σ_{j=1}^n π_j IP(Bin(n − 1, (1 − m)j/n) = k − 1).   (S15)

This is exactly the same as the set of equations πQ = π. And Fig. S17 is a diagram of that reasoning.
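The stationary equations can also be solved numerically, giving a consistency check between πQ = π, the mnemonic recursion (S14) and the closed formula for the neutral relatedness from Section S1 (a minimal Python sketch; parameter values are illustrative):

```python
from math import comb

def transition_Q(n, m):
    """Transition matrix Q on states {1, ..., n}: with probability m the
    focal is a migrant (next state 1); otherwise the next state is
    1 + Bin(n - 1, (1 - m) j / n), j being the current state."""
    s = 1 - m
    def bin_pmf(N, q, k):
        return comb(N, k) * q ** k * (1 - q) ** (N - k)
    return [[m * (k == 1) + s * bin_pmf(n - 1, s * j / n, k - 1)
             for k in range(1, n + 1)] for j in range(1, n + 1)]

def stationary(Q, iters=5000):
    """Iterate the distribution vector until it settles on pi with pi Q = pi."""
    n = len(Q)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[j] * Q[j][k] for j in range(n)) for k in range(n)]
    return pi

n, m = 8, 0.05
Q = transition_Q(n, m)
pi = stationary(Q)
R0 = (1 - m) ** 2 / (n - (n - 1) * (1 - m) ** 2)
```

The resulting π satisfies the stationarity equations to machine precision, its mean satisfies (mean − 1)/(n − 1) = R^0, and R^0 solves the mnemonic fixed-point recursion.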
When is π needed over R^0? Precisely when v^A_k is not a linear function of k. Suppose that v^A_k = −c + b_1 k + b_2 k^2 + ··· + b_l k^l. Then the weak selection viability condition derived from ∆p > 0 in (A2) becomes

c < b_1 M_1 + b_2 M_2 + ··· + b_l M_l,   (S16)

involving moments of π. In the linear case l = 1, this condition becomes c < b_1(1 + (n − 1)R^0), which only depends on R^0. If in this case we write v^A_k = −C + B(k − 1)/(n − 1), corresponding to c = C + B/(n − 1) and b_1 = B/(n − 1), then this viability condition takes the familiar form C < BR^0. This is the same condition that one obtains from combining (S7) with (S13). In the next section we will study the limit (S18), under which (S16) simplifies considerably, becoming equivalent to the condition ∆p > 0 in (A4).
Another good illustration of the need to use π rather than R_0 is given by the threshold model, Example 3. In this case the viability condition derived from ∆p > 0 in (A2) becomes

C < A Σ_{k > θ} π_k.    (S17)

In this case, the relevant information contained in π, rather than being in R_0 or in higher moments, is in the tail of that distribution. The results from the next section will lead to a major simplification of (S17), in the limit (S18), assuming that θ/n converges to some θ̃, so that the conditions of (A3) are satisfied. (See the second subsection of Section S7.)

S6 Limit of large n and small m under weak selection. IBD distribution is close to beta distribution
Convergence of π to a beta distribution: In this section we continue to assume that selection is weak, and we study the limit in which n → ∞, m → 0 and nm → m̃, so that R_0 → 1/(1 + 2m̃), where we used (S11), (S12). We suppose that 0 < m̃ < ∞. Suppose also that K_D is a random variable with distribution π, i.e., P(K_D = k) = π_k. Then we will show that

lim P(K_D/n ≤ x) = 1 − (1 − x)^{2m̃},    (S19)

for arbitrary 0 ≤ x ≤ 1. This means that K_D/n converges in distribution to a beta distribution with parameters α = 1 and β = 2m̃, which has density f_m̃(x) = 2m̃ (1 − x)^{2m̃−1}, 0 < x < 1. (See Fig. S19 for illustrations.) To prove (S19), it is sufficient to show that the moments of K_D/n converge to those of this beta distribution (see, e.g., Section 2.3.e of [4]), namely that, for l = 1, 2, ...,

M_l / n^l → l! / [(1 + 2m̃)(2 + 2m̃) · · · (l + 2m̃)],    (S20)

where M_l = E(K_D^l). We will prove (S20) by induction on l = 1, 2, .... First, from (S10) we obtain it in the case l = 1. Now suppose that (S20) is true for l = 1, 2, ..., i − 1. Then M_l / n^{i−1} → 0, for l = 1, ..., i − 2, so that from (S9) we obtain a recursion for M_i / n^i with an error term ∆(n, m) that vanishes as n → ∞. These displays, together with the induction hypothesis, imply that (S20) is true for l = i, completing the induction argument and the proof of (S19).
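The convergence (S19) can be illustrated by simulating the Markov chain behind the recursion πQ = π and comparing the empirical law of K_D/n with the Beta(1, 2m̃) distribution function 1 − (1 − x)^{2m̃}; the values n = 400 and m̃ = 1 below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_KD(n, m, steps, burn=1000):
    """Simulate the Markov chain whose stationary law is pi: with prob m
    the focal is a migrant (state resets to 1); otherwise
    K_D = 1 + Bin(n-1, (1-m) j / n), where j is the previous state."""
    j = 1
    out = []
    for t in range(steps + burn):
        if rng.random() < m:
            j = 1
        else:
            j = 1 + rng.binomial(n - 1, (1 - m) * j / n)
        if t >= burn:
            out.append(j)
    return np.array(out)

n, mt = 400, 1.0                     # mt plays the role of m-tilde = n*m
ks = sample_KD(n, mt / n, steps=200_000)
for x in (0.2, 0.5, 0.8):
    emp = float(np.mean(ks / n <= x))
    beta_cdf = 1 - (1 - x)**(2 * mt)   # Beta(1, 2 m-tilde) CDF from (S19)
    print(x, round(emp, 3), round(beta_cdf, 3))
```

With m̃ = 1 the limiting mean is 1/(1 + 2m̃) = 1/3, and the empirical CDF should track 1 − (1 − x)^2 up to finite-n and Monte Carlo error.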

Application: Suppose that there is a piecewise continuous and bounded function ṽ^A such that, in good approximation (in the sense of (S21)), v^A_k = ṽ^A(k/n). Then the weak convergence result (S19) above, combined with the continuous mapping theorem, (2.3) in Section 2.2.b of [4], implies that, when 0 < m̃ < ∞, in the limit (S18),

Σ_k π_k v^A_k → ∫_0^1 ṽ^A(x) 2m̃ (1 − x)^{2m̃−1} dx.

This proves that (A3) follows from (A2) in this limit.

S7 Critical values
Definitions: When ρ > 1 we are in the regime in which the fixed point (0, ..., 0) of the dynamical system f(t) is unstable (the allele A can invade), while when ρ < 1 this fixed point is stable (the allele A cannot invade). (The boundary case ρ = 1 could fall in either category, depending on the model.) It is natural to define the critical value

m_s = sup{0 ≤ m ≤ 1 : ρ(m) > 1}.    (S23)

(The subscript 's' stands for 'survival' of the allele A, when rare.) Typically (see Fig. S2), we found that ρ(m) is a monotone decreasing function of m, and

ρ(m_s) = 1    (S24)

has a single solution. In this case (S24) is equivalent to (S23), and the allele A can invade when m < m_s and cannot invade when m > m_s. Conditions (C1) and (C2) from Section S1 imply that 0 < m_s < 1, as we show next. For this purpose, we use continuity of eigenvalues and eigenvectors. In particular, ρ(m) is continuous, and it is enough to observe how it behaves as m approaches 0 or 1. When m = 1, the matrix A = 0, and M(A + B) = MB has only its first column not identically 0. Therefore any of its left-eigenvectors must be of the form (a, 0, ..., 0). When such a vector is multiplied by MB, the result is w^A_1 (a, 0, ..., 0). This shows that w^A_1 is the only eigenvalue of M(A + B). Therefore ρ(m) must converge to w^A_1 < 1 (by (C1)) as m → 1. When m = 0, M(A + B) has the eigenvalue w^A_n > 1 (by (C2)), corresponding to the left-eigenvector (0, 0, ..., 0, 1). This implies that ρ(m) must be larger than 1 when m is close to 0.
The parameter 1 − m indicates the level of assortment in the 2lFW. Another natural parameter for this purpose, also in one-to-one correspondence with m, is R_0(m), given by (S11) (see Fig. S18). It provides the relatedness measured, using neutral genetic markers, in the same demographics in equilibrium, by means of regression coefficients, Wright's F_ST, or lineages (IBD). This leads to the following natural definition:

R_0s = R_0(m_s).

When the allele A is under weak selection, so that m_s solves ∆p = 0 in (A2), R_0s gives the critical value of relatedness also if measured for the allele A itself, in demographic equilibrium. But this is not the case if A is under strong selection. Even then, R_0s is a meaningful way to describe the critical value of assortment needed for A to be able to invade the population, as, for instance, in Panel B of Fig. 2 in the paper.
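For the linear public goods game under weak selection, the viability condition C < B R_0(m), together with the monotonicity of R_0(m), pins down m_s by bisection. The sketch below uses the closed form obtained from (S11) and illustrative values n = 20, C = 1, B = 5 (not parameter values from the paper):

```python
def R0(n, m):
    """Equilibrium IBD probability from (S11), for group size n and migration m."""
    return (1 - m)**2 / (n - (n - 1) * (1 - m)**2)

def critical_m(n, C, B, tol=1e-12):
    """Bisection for m_s solving R0(n, m) = C/B: under weak selection the
    linear public-goods allele invades iff C < B * R0(n, m), and R0 is
    strictly decreasing in m on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if B * R0(n, mid) > C:
            lo = mid          # still viable: the critical point is higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

ms = critical_m(20, 1.0, 5.0)
print(ms)    # the allele A invades for m below this value
```

For these values, solving R_0(m) = C/B in closed form gives (1 − m_s)^2 = 5/6, which the bisection reproduces.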
If the allele A is under weak selection, n is large, and in good approximation (in the sense of (S21)) v^A_k = ṽ^A(k/n), for some piecewise continuous and bounded function ṽ^A, then R_0s gives the critical level of relatedness in the population, measured with the allele A itself or with neutral genetic markers, that allows the allele A to invade. The critical migration rate is then close to m̃_s/n.
Threshold model with large n: A simple example of the use of (A3) is given by the threshold model, Example 3. Suppose that θ = ⌈nθ̃⌉, for some 0 ≤ θ̃ ≤ 1, where ⌈y⌉ is the smallest integer larger than or equal to y. In this case, ṽ^A(x) = −C, for 0 ≤ x ≤ θ̃, and ṽ^A(x) = −C + A, for θ̃ < x ≤ 1. We can now apply (A3), when A is under weak selection and n is large. The equation ∆p = 0 can be easily integrated in this case, giving C = A(1 − θ̃)^{2m̃_s}, which yields

m̃_s = log(A/C) / (2 log(1/(1 − θ̃))),

where log can be in any base. If θ̃ is small, then we have the further approximations m̃_s ≈ log(A/C)/(2θ̃) and m_s ≈ log(A/C)/(2θ).
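The closed form for m̃_s, and its small-θ̃ approximation, can be sanity-checked directly; the values C = 1, A = 3 below are illustrative, not taken from the paper:

```python
from math import log

def mt_critical(C, A, theta_t):
    """Closed form for m-tilde_s obtained from C = A (1 - theta~)^(2 m~_s)."""
    return log(A / C) / (2 * log(1 / (1 - theta_t)))

C, A, theta_t = 1.0, 3.0, 0.25
mt = mt_critical(C, A, theta_t)
# at mt the integrated weak-selection condition should vanish:
residual = -C + A * (1 - theta_t)**(2 * mt)
print(mt, residual)

# small theta~: m~_s is close to log(A/C)/(2 theta~)
print(mt_critical(C, A, 0.01), log(A / C) / (2 * 0.01))
```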

S8 Iterated public goods game. Types A play multi-individual tit-for-tat
In this section we will analyse in detail the iterated public goods game (IPG, Example 2), introduced in Fig. 2 of the paper and in Section S1. This model was studied independently in [3] and in [9], in the trait-group framework. Both papers identified stable equilibria with positive fractions of types A, but they also observed that altruists could not invade when rare. In [3] an approach was then introduced to provide assortment and allow the allele A to invade when rare. The authors concluded that such invasion by the allele A could only occur under very restrictive conditions, and that therefore this model and the corresponding notion of multi-individual tit-for-tat were of marginal relevance. One of our contributions in the current paper is to rectify this perception. We show here that in the 2lFW framework, the allele A can invade under very reasonable conditions. Therefore we propose that conditional mechanisms of cooperation be seriously considered. Later in this section, we will explain why the estimates in [3] relied on an unreasonably pessimistic assumption that is not supported in the 2lFW. Suppose that a = ⌊nã⌋, for some 0 ≤ ã ≤ 1 (the notation α used in the paper is changed to ã in this section, to make it more similar to other quantities related to the limit studied in Section S6; ⌊y⌋ is the largest integer smaller than or equal to y, i.e., the integer part of y). Then, when n is large, we have in good approximation (in the sense of (S21)) ṽ^A(x) = −C + Bx, for 0 ≤ x ≤ ã, and ṽ^A(x) = T(−C + Bx), for ã < x ≤ 1. In case selection is weak and n is large, the viability condition is obtained by setting ∆p > 0 in (A3). Integration of the right-hand side of (A3) then gives the following equation for the critical level of relatedness R_0s = R:

C − BR = (T − 1)(1 − ã)^{(1/R)−1} (Bã + BR(1 − ã) − C),    (S27)
except in the trivial case ã = 1, in which the right-hand side of (S27) is 0. This case corresponds to a = ⌊nã⌋ = n, so that types A cooperate only in the first round. Obviously then the game reduces to the one-shot public goods game, and indeed (S27) reduces to the usual condition C = BR.
In the opposite extreme, when ã = 0, so that a = ⌊nã⌋ = 0 and types A cooperate in each round of the game, the right-hand side of (S27) reduces to (T − 1)(BR − C). So (S27) is again reduced to C = BR, as one should expect.
Note that, obviously, also when T = 1, the game is a one-shot public goods game, and (S27) reduces once more to C = BR. One can see the right-hand side of (S27) as a correction to that viability condition in the more general and challenging example that we are analyzing here.
Analysis of (S27). Results: While somewhat intimidating at first sight, (S27) provides good information about the critical value R = R_0s that solves it. We first state the main features of its behavior, and then explain in the next subsection how to do the computations leading to these claims. Of special biological and mathematical importance is the case in which ã = C/B = min{0 ≤ x ≤ 1 : ṽ^A(x) ≥ 0}. In this case, in which types A are payoff dependent conditional cooperators, (S27) simplifies to:

C − BR = (T − 1) BR (1 − C/B)^{1/R}.    (S28)

We will see that with C, B and T fixed, R_0s is a continuous function of 0 ≤ ã ≤ 1 that takes the value C/B on both end-points of this domain, is strictly decreasing on 0 ≤ ã ≤ C/B, and strictly increasing on C/B ≤ ã ≤ 1. It therefore reaches its minimum at ã = C/B, where it solves (S28). It is intuitive that R_0s should reach its minimal value when ã = C/B, since then altruists persist precisely when it is to their benefit to do so. Fig. S21 compares the solution of (S27), as a function of ã, with values of R_0s obtained from (A2), for n = 20 and n = 100. Note the features described above present in these pictures.
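Equation (S27) is easy to solve numerically, since the root is bracketed between R̃ = max{(C − Bã)/(B(1 − ã)), 0} and C/B. A minimal bisection sketch, using the illustrative values C = 1, B = 5, T = 100 of Fig. S21, confirms that the solution is smallest at ã = C/B:

```python
def solve_S27(C, B, T, a_t, tol=1e-12):
    """Bisection for the critical relatedness R solving (S27):
    H(R) = C - B R  equals  G(R) = (T-1)(1-a~)^{(1/R)-1}(B a~ + B R (1-a~) - C).
    The root lies in (R_tilde, C/B)."""
    Rt = max((C - B * a_t) / (B * (1 - a_t)), 0.0)
    f = lambda R: (C - B * R) - (T - 1) * (1 - a_t)**(1 / R - 1) \
                  * (B * a_t + B * R * (1 - a_t) - C)
    lo, hi = max(Rt, 1e-9), C / B - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

C, B, T = 1.0, 5.0, 100.0
for a_t in (0.05, 0.2, 0.5):      # a~ = C/B = 0.2 is the payoff-dependent case
    print(a_t, solve_S27(C, B, T, a_t))
```

The minimum at ã = C/B and the bound R_0s < C/B appear directly in the output.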
With T and ã fixed, R_0s is a decreasing function of B/C that goes to 0 as B/C → ∞ (see Fig. S22). With C, B and 0 < ã < 1 fixed, R_0s is a decreasing function of T that behaves as follows when T → ∞. If 0 ≤ ã < C/B, then R_0s converges to the strictly positive value R̃ defined below, while if C/B ≤ ã < 1, then R_0s converges to 0. In this latter case, the convergence is rather slow, in that R_0s behaves asymptotically as (log(1/(1 − ã)))/log(T). (See Fig. S23.)

Analysis of (S27). Computations: The claims above follow from the behavior of the left-hand side and the right-hand side of (S27). We denote them respectively by H(R) = C − BR and G(R) = (T − 1)(1 − ã)^{(1/R)−1} {Bã + BR(1 − ã) − C}. The function H(R) is strictly decreasing, positive for R < C/B and negative for R > C/B. Define R̃ = max{(C − Bã)/(B(1 − ã)), 0}, and observe that 0 < R̃ < C/B, when ã < C/B, and R̃ = 0, when ã ≥ C/B. The term inside the curly braces, and therefore also G(R), is negative for R < R̃, and positive for R > R̃. Notice also that once it is positive, G(R) is strictly increasing in R, since it is then the product of strictly increasing positive functions. The behaviors described so far for H(R) and G(R) immediately imply that they are equal to each other at exactly one point R = R_0s, and that this point is in the open interval (R̃, C/B).
The various claims about the behavior of R_0s as a function of ã, or of B/C, or of T, follow now from analyzing the behavior of the graphs of H(R) and G(R) as these parameters change. We have ∂G_{C,B,T,ã}(R)/∂ã = (T − 1)(C − Bã)((1/R) − 1)(1 − ã)^{(1/R)−2}, which, regardless of the value of R, is positive for 0 < ã < C/B and negative for C/B < ã < 1. This means that the graph of G(R) moves upwards as ã increases from 0 to C/B, and then moves downwards as ã increases from C/B to 1. In the extremes, G(R) → (T − 1)(BR − C), as ã → 0, for all R > 0. (This convergence is not uniform, since the function G(R) converges to 0 as R → 0.) And G(R) → 0, uniformly in R, as ã → 1. These facts, and the trivial behavior of H(R), which does not depend on ã, provide us with the facts about the dependence of R_0s on ã. The fact that R_0s depends on C and B only through B/C can be seen by dividing both sides of (S27) by C. The fact that R_0s decreases as B/C increases follows easily from observing that the graphs of H(R) and G(R) move, respectively, down and up, as B increases with C fixed. And the fact that R_0s → 0, as B/C → ∞, is immediate from 0 < R_0s < C/B (Fig. S22). If we keep B, C and 0 < ã < 1 fixed and let T → ∞, then G(R) goes monotonically to ∞, for R̃ < R < C/B, and stays at 0 for R = R̃. The corresponding behavior of the graph of G(R), and the fact that R_0s is the point in R̃ < R < C/B where this graph intersects the graph of the decreasing function H(R), show that, as claimed above, R_0s is decreasing in T, and that, as T → ∞, R_0s → R̃ (Fig. S25). The claim about the slow speed of this convergence, in case C/B ≤ ã < 1, can be obtained by taking the logarithm of both sides of (S27) and analyzing how the resulting terms behave as T → ∞.
Analysis of (S28): For each value of T, this equation is a relationship between the values of C/B and R, since it can be written as C/B − R = (T − 1) R (1 − C/B)^{1/R}. Fig. S26 shows what this relationship looks like. For T = 1 we have R = C/B, and as T increases, R decreases towards 0, for each value of 0 < C/B < 1. To obtain an approximation for R when T is large, we observe that, if we define ϕ = C/(BR), then (S28) can be written as

ϕ − 1 = (T − 1)(1 − C/B)^{ϕB/C}.    (S29)

Taking logarithms (in an arbitrary base) on both sides and dividing by ϕ, we get

log(ϕ − 1)/ϕ = log(T − 1)/ϕ − (B/C) log(1/(1 − C/B)).    (S30)

Suppose 0 < C/B < 1 fixed, and let T → ∞. Then (S29) implies that ϕ → ∞, since otherwise the right-hand side would go to ∞, while the left-hand side would not. But since ϕ → ∞, the left-hand side of (S30) goes to 0, and hence log(T − 1)/ϕ → (B/C) log(1/(1 − C/B)), so that R = C/(Bϕ) ≈ log(1/(1 − C/B))/log(T − 1). This provides us with the approximation (A5), when T is large.
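A short numerical check of this large-T behavior solves (S28) by bisection and compares with the approximation R ≈ log(1/(1 − C/B))/log T; the values C = 1, B = 5 are illustrative:

```python
from math import log

def solve_S28(C, B, T, tol=1e-14):
    """Bisection for R in (0, C/B) solving (S28):
    C - B R = (T - 1) B R (1 - C/B)^(1/R)."""
    f = lambda R: C - B * R - (T - 1) * B * R * (1 - C / B)**(1 / R)
    lo, hi = 1e-9, C / B
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

C, B = 1.0, 5.0
for T in (1e3, 1e6, 1e12):
    exact = solve_S28(C, B, T)
    approx = log(1 / (1 - C / B)) / log(T)   # large-T approximation
    print(T, exact, approx)
```

The relative error shrinks as T grows, but only logarithmically slowly, as the analysis above predicts.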
A note on the assumptions in [3], and some intuition on our results: It is important to explain why the assumptions made in [3] on how much assortment to expect in groups, later used in several papers, including [2], are excessively pessimistic. In [3] it was supposed that, conditioned on the focal individual being type A, the other n − 1 members of the focal group would be type A independently, with probability P_t(A_2|A_1). In the 2lFW, conditioned on the focal being type A, conditioning also on the co-focal being type A further increases the conditional probability that a third member of the group is type A. Conditioning then on a third individual being type A again further increases the probability that a fourth individual is type A, and so on. This is so because the information being successively provided keeps increasing the probability that there were several types A in the previous generation in the group from which the focal descends. It is very interesting to carry out the computation of R_0s using the assumption of [3] and compare the result with what we have obtained from (S27). Under that assumption of conditional independence, a central limit theorem shows that, in the limit (S18), the viability condition for the regime of weak selection would become ṽ^A(R) > 0, where, at a possible discontinuity point x of ṽ^A, we interpret ṽ^A(x) as the average between the limits of ṽ^A from the left and from the right at x. To understand the reason for the substantially different results for the IPG in the 2lFW framework and in the ad hoc conditional independence scenario of [3], we observe the following. In the 2lFW, when δ = 0, n is large and m is small, the fraction of individuals in the focal group who are IBD to the focal individual is well approximated by a beta distribution with mean R and standard deviation R√((1 − R)/(R + 1)). In contrast, under the assumption of [3], that fraction is well approximated by a normal distribution with mean R and standard deviation √(R(1 − R)/n).
So, in the 2lFW framework, the distribution has a spread that is independent of n, is comparable to its mean, and has a heavy tail. (See (S19) and Fig. S19. For instance, in this figure, when m = 0.1, we have R_0 = 0.2 in panel A and R_0 = 0.04 in panel B. But see how much probability the beta distribution allows for substantially larger values of k/n.) In the scenario of [3], when n is large, the distribution is narrowly concentrated around the same mean, with a light Gaussian tail. The conditionality of behavior in the IPG makes the fitness function v^A_k grow fast when k is large (see Fig. S1) and gives a fundamental role to the tail of that distribution, explaining the discrepancy. This comparison with [3] highlights some fundamental facts about the assortment of genes under weak selection in the 2lFW, and also in Wright's infinite island model, since both have the same coalescent. The number K_D − 1 of individuals in a focal group that are IBD to and distinct from a focal individual in this group is a random variable that does correspond to n − 1 attempts, each with probability R_0 of success. But, contrary to the assumption in [3], these n − 1 attempts are highly dependent, so that we do not get a good Gaussian (central limit theorem) approximation, but rather a good beta distribution approximation, with a fatter tail. The observation of this lack of normality and the identification of the appropriate approximate distribution as a beta are central contributions of the current work, with powerful computational and conceptual implications.
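The difference between the two tails can be made concrete. With the illustrative values R = 0.2 (m̃ = 2) and n = 100, the Beta(1, 2m̃) limit assigns vastly more probability than the Bin(n − 1, R) model of [3] to fractions of IBD group members above one half:

```python
from math import comb

mt = 2.0                        # m-tilde; R = 1/(1 + 2 mt) = 0.2
R = 1 / (1 + 2 * mt)
n, x = 100, 0.5

# 2lFW / island-model limit: K_D/n approximately Beta(1, 2 m-tilde)
beta_tail = (1 - x)**(2 * mt)

# conditional-independence assumption of [3]: K_D - 1 ~ Bin(n-1, R)
binom_tail = sum(comb(n - 1, k) * R**k * (1 - R)**(n - 1 - k)
                 for k in range(n) if (k + 1) / n > x)
print(beta_tail, binom_tail)
```

The beta tail here is about 6%, while the binomial tail is astronomically small; this is exactly the tail mass that the conditional fitness function of the IPG rewards.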
When are types A altruistic in the IPG? There is more than one way in which the concept of altruism in the context of the trait-group framework has been defined [10]. These different definitions carry over to the 2lFW. A particularly simple concept is called in [10] the "multilevel interpretation" of altruism. That definition requires the two conditions: w^A_k < w^N_k, for all k, and w̄_k increasing in k. The first one means that types A are always worse off than types N in the same group, and the second one means that the more types A in a group, the better for the group. Both conditions are clearly always satisfied in the IPG. There are nevertheless good arguments for considering other definitions of altruism [10]. A particularly appealing definition is called the "focal-complement interpretation" in [10], and is often known as the "mutation condition". Suppose that a type N mutated into a type A, everything else remaining unchanged. Would this cause a decrease in the fitness of the mutant? Would it cause an increase in the average fitness of the other members of the mutant's group? Since the average fitness of the n members of the mutant's group increases with the mutation, the answer to the second question will be affirmative whenever the answer to the first question is affirmative. Therefore the condition for types A to be altruistic is that the mutation decreases the fitness of the mutant, whatever the number k of types A among the other members of its group:

w^A_{k+1} < w^N_k, for all k.    (S31)

The difference between the l.h.s. and the r.h.s. of (S31), divided by δ, takes the values: −C, if k < a; T(−C + aB/(n − 1)) − aB/(n − 1), if k = a; and −TC, if k > a. Therefore, only the case k = a needs to be analyzed. When T = 1 and the IPG is identical to the PG, (S31) holds. For arbitrary values of T, the same is true for a = 0 (again, the IPG is identical to a PG, in this case with the values of C and B both multiplied by T). For each value of T and n, (S31) holds for values of a that are small, and may start failing at some critical value of a. In case T is large, that critical value of a is close to (n − 1)C/B, which, when n is also large, corresponds to ã = C/B.
The intuition for this critical value of a is as follows. When k ≠ a, it is always costly for a type N to mutate into a type A, since that mutation will not change the number of times the other types A in the group cooperate, and will cost the mutant a net amount C for each round of cooperation in the group. But in case k = a, the mutation changes the group from being one in which cooperation occurs only once into one in which cooperation occurs an average of T times. When T is large, this is beneficial for the mutant when the net benefit obtained from each one of the many additional acts of cooperation, −C + aB/(n − 1), is positive, but is detrimental otherwise. In case a > (n − 1)C/B, the mutation can therefore benefit the mutant in the long run. In this case, types A are not altruistic when in groups with a total of a + 1 types A. Indeed, their acts of cooperation are then needed to keep cooperation going in their group, and therefore these initial acts increase the fitness of the actor in the long run (life cycle). In contrast, when a ≤ (n − 1)C/B (as in Fig. 2 in the paper, where n = 50, C/B = 1/3 and a = 16), all the types A are altruistic, even those in groups with a total of a + 1 types A. It is still true that their acts of cooperation are needed to keep cooperation going, but, while this benefits their group, it comes at a net cost (or at least no gain) per action for each type A individual.
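The pivotal case k = a of (S31) can be checked arithmetically. The sketch below evaluates the long-run payoff change of the mutant (divided by δ) at k = a, with the values n = 50 and C/B = 1/3 of Fig. 2 but a hypothetical T = 1000; the critical a is then close to (n − 1)C/B ≈ 16.3:

```python
def mutant_gain_at_pivot(a, n, C, B, T):
    """Payoff change (divided by delta) for a type N mutating to type A in a
    group where exactly k = a others are of type A: the mutation turns one
    round of cooperation into an average of T rounds."""
    return T * (-C + a * B / (n - 1)) - a * B / (n - 1)

n, C, B, T = 50, 1.0, 3.0, 1000.0
for a in (10, 16, 17, 20):
    altruistic = mutant_gain_at_pivot(a, n, C, B, T) < 0
    print(a, altruistic)
```

With these values, a = 16 still satisfies the altruism condition while a = 17 already violates it, in line with the discussion above.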
It is known from [12] (see also [10]) that when (S31) holds, the 2lFW with m = 1 (which is identical to the trait-group population structure, i.e., random assortment in groups) has no stable equilibrium with types A present. This is therefore the case for the model in Fig. 2 in the paper. It is only through the effects of limited dispersal (population viscosity), and the resulting assortment of types A, that this altruistic allele can proliferate in the population.

Figure S21: Limit of large n under weak selection for the iterated public goods (IPG) game (Example 2). Panels represent critical migration rates (A and C) and critical relatedness (B and D) for the IPG with C = 1, B = 5 and T = 100, as a function of ã = a/n. Top panels A and B depict the case n = 20. Bottom panels C and D depict the case n = 100. In each panel, critical values obtained from the viability condition under weak selection derived from (A2) (black full lines) are compared with the approximation for large n given by solving (S27) in R (approx., dashed blue lines). In panel B we have R_0s = 4.02% when ã = 20%, and in panel D we have R_0s = 5.54% when ã = 20%. Types A are altruistic in the strong sense of (S31) when 0 ≤ ã ≤ 20%.

Time periods: When p(0) = p << 1, the evolution is well described by the linearized form f(t + 1) = f(t)M(A + B), until p(t) is no longer small. We refer to the time period while the linearized evolution is a good approximation as the "early stage", E.S., as opposed to the "late stage", L.S., when the evolution of f(t) can no longer be well approximated in this linear fashion. During the E.S. we have, from (S1), f(t) = Cρ^t(ν + e(t)), where the error term e(t) satisfies max_i |e_i(t)| < β^t, with 0 ≤ β < 1, and C = Σ_k f_k(0)ζ_k is a strictly positive constant of order p (since all the entries of f(0) must be of order p or smaller). Therefore, we can think of the E.S. as the period of time while ρ^t << 1/p, or, equivalently, t << log(1/p)/log ρ. The E.S.
can be further split into the "very early stage", V.E.S., during which the error term e(t) is still relevant, and the "stationary early stage", S.E.S., characterized by max_i |e_i(t)| << 1, so that f(t) = Cρ^t ν in good approximation. The S.E.S. can be thought of as the period of time 1 << t << log(1/p)/log ρ. Throughout the E.S. the linearized evolution holds, but it is only in the S.E.S. that ∆p(t) takes the stationary form given by (A1). This is illustrated in Fig. S10.

Small mutation rate: The condition ρ < 1 (or ρ > 1) implies the local stability (or instability) of the fixed point (0, ..., 0) of the dynamical system f(t). Stability here can be interpreted with respect to small deviations from that fixed point as initial condition, or in terms of the effect of a small mutation rate µ. For this latter purpose, we introduce mutation at rate 0 < µ < 1 per individual per generation, by assuming that each offspring is of the same type as her mother with probability 1 − µ and otherwise is of the other type. Suppose that f(0) = (0, ..., 0) and that µ is small. When types A are rare, the main effect of mutation is to create new types A in groups that have no types A. Therefore, neglecting terms of order µ^2 or smaller, we have

f(t) = µn Σ_{s=0}^{t−1} (1, 0, ..., 0)[M(A + B)]^s.

This series converges when ρ < 1, and diverges when ρ > 1, since its generic term behaves as Cρ^t ν when t is large. This means that when ρ < 1, f(t) converges to the perturbed fixed point f_µ = µn Σ_{t=0}^{∞} (1, 0, ..., 0)[M(A + B)]^t, in which p is of order µ and which represents a small perturbation of the fixed point f_0 = (0, ..., 0). In contrast, when ρ > 1, the allele A invades the population, for any mutation rate µ > 0.
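The convergence of the series defining f_µ, for ρ < 1, is just the convergence of a matrix geometric series to µn e_1 (I − M(A + B))^{-1}. The sketch below illustrates this with a small toy matrix standing in for M(A + B); the matrix entries and the values of µ and n are hypothetical, not derived from any 2lFW model:

```python
import numpy as np

# Toy 3x3 stand-in for M(A+B) with spectral radius rho < 1
# (the regime in which the allele A cannot invade).
M = np.array([[0.4, 0.2, 0.0],
              [0.1, 0.3, 0.2],
              [0.0, 0.2, 0.5]])
mu, n = 1e-4, 10
e1 = np.array([1.0, 0.0, 0.0])

# f_mu = mu * n * sum_{t>=0} e1 [M]^t, summed term by term...
f_series = np.zeros(3)
term = mu * n * e1
for _ in range(500):
    f_series += term
    term = term @ M          # row-vector convention, as for f(t)
# ...equals mu * n * e1 (I - M)^{-1}, since the geometric series converges
f_solve = mu * n * e1 @ np.linalg.inv(np.eye(3) - M)
print(f_series, f_solve)
```

Every entry of the perturbed fixed point is of order µ, matching the statement that p is of order µ when ρ < 1.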
Viability conditions: By a viability condition for an allele we mean a condition for it to spread when rare. In the 2lFW setting, this means the instability of the fixed point f = (0, ..., 0), either with respect to a small perturbation of the initial condition or by introducing a small mutation rate µ. Both are expressed by the mathematical condition ρ > 1. This is equivalent to the condition ∆p > 0, under no mutation and for small p(0) = p, provided that one waits until the S.E.S. If we consider, say, an iterated public goods game and start from f(0) = (np, 0, ..., 0) (resp. f(0) = (0, ..., 0, p)), then initially p(0) = p and ∆p(t) is negative (resp. positive), regardless of the value of m. In both cases, in the S.E.S., f(t) becomes parallel to ν and (A1) holds. Fig. S11, Fig. S12 and Fig. S13 illustrate this discussion. The evolution of the distribution f(t) towards multiples of ν may be thought of metaphorically as "natural selection organizing the copies of the allele A in the best way, compatible with the demographics, for them to grow". Alternatively, we can use the metaphor that "the allele A self-organizes itself in the best way, compatible with the demographics, for it to grow". Regardless of the language used, the condition ∆p > 0 is only equivalent to the viability condition in the S.E.S., when the distribution of the copies of the allele A is stationary in a way compatible with natural selection and the demographics. We believe that this lesson, namely that to obtain viability conditions one should only impose requirements of the type ∆p > 0 under stationarity, had not been fully appreciated before, and is a contribution of the current work.
Our viability condition in the 2lFW setting can be written, respectively: in general, as ρ > 1 or, equivalently (thanks to (A1)), as ∆p > 0 in the S.E.S.; under weak selection (thanks to (A2)), as Σ_k π_k v^A_k > 0; and, under the conditions of (A3), as

∫_0^1 ṽ^A(x) 2m̃ (1 − x)^{2m̃−1} dx > 0,    (S35)

with R = 1/(1 + 2mn). In the special case of (S35) in which ṽ^A(x) = −C + Bx + B_2 x^2 + ... + B_l x^l, we have the viability condition C < BR + B_2 R_2 + ... + B_l R_l, where R_l = l! R^l/[((l − 1)R + 1)((l − 2)R + 1) · · · (R + 1)]. In the special case of (S35) in which we have the THR, with ṽ^A(x) = −C, for 0 ≤ x ≤ θ̃, and ṽ^A(x) = −C + A, for θ̃ < x ≤ 1, we have the viability condition (see Section S7)

C < A(1 − θ̃)^{(1/R)−1}.

In the special case of (S35) in which we have the IPG, with ṽ^A(x) = −C + Bx, for 0 ≤ x ≤ α, and ṽ^A(x) = T(−C + Bx), for α < x ≤ 1, we have the viability condition (see Section S8)

C − BR < (T − 1)(1 − α)^{(1/R)−1}(Bα + BR(1 − α) − C),

which, when α = C/B, simplifies to C − BR < (T − 1) BR (1 − C/B)^{1/R}.
The viability conditions above involve only parameters of the models. These inequalities provide the exact conditions under which the allele A can proliferate in the corresponding mathematical model. It is natural to see them as counterparts to, and generalizations of, Hamilton's rule R > C/B for the PG.
Covariance-regression formulas: In the 2lFW framework, the fitness w_• of the focal individual is a function of I_1 (that takes value 1 if the focal is of type A and 0 if it is of type N) and of κ = (I_2 + ... + I_n)/(n − 1), the fraction of types A among the other members of the focal group. Note that κ describes the social environment of the focal individual. Using (S3), the regression coefficient of κ on I_1 can be seen to be:

β_{t;κ,I_1} = Cov_t(I_1, κ)/Var_t(I_1) = (1/(n − 1)) Cov_t(I_1, I_2 + ... + I_n)/Var_t(I_1) = Cov_t(I_1, I_2)/Var_t(I_1) = r_t.
If we define c_t = −β_{t;w_•,I_1|κ} and b_t = β_{t;w_•,κ|I_1}, then (S37) reads

W̄(t)∆p(t) = p(t)(1 − p(t))(−c_t + b_t r_t),    (S38)

valid in the 2lFW setting for any model, any initial state f(0) (with no conditions on p) and for all times t. In the case of the public goods game, w_• = 1 + δ(−CI_1 + Bκ), and (S38) becomes the well known [7] W̄(t)∆p(t) = p(t)(1 − p(t))δ(−C + Br_t). But the simplicity of this special case hides the fact that for other models the regression coefficients c_t and b_t depend not only on the parameters of the model, but also on the distribution of alleles in the population in generation t. Admittedly, (S38) is always valid, while (S32) is valid only when p(0) << 1, during the E.S. But for the purpose of deciding when, in the 2lFW, the allele A can invade the population, (S32) and its form (A1) for the S.E.S. give the answer, while we do not see how to use (S38) for this purpose without first studying how f(t) evolves. This is so because c_t, b_t and r_t are regression coefficients associated with the random experiment of choosing the focal individual when the distribution of alleles in the groups is given by f(t). And once the evolution of f(t) has to be studied, one can use (S32) directly, without any need to invoke (S38).
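The identity β_{t;κ,I_1} = r_t rests only on the exchangeability of the co-members of the focal group. A Monte Carlo sketch over a hypothetical collection of group compositions (not a 2lFW simulation) illustrates that the two covariance ratios agree:

```python
import random
random.seed(0)

# Synthetic metapopulation: groups of size n with these counts of type A
# (the compositions are illustrative).
n = 5
group_counts = [0, 0, 1, 2, 5, 3]
trials = 400_000
s1 = sk = s12 = 0.0
for _ in range(trials):
    k = random.choice(group_counts)
    slots = [1] * k + [0] * (n - k)
    random.shuffle(slots)            # random labelling within the group
    I1, I2 = slots[0], slots[1]      # focal and co-focal indicators
    kappa = sum(slots[1:]) / (n - 1) # social environment of the focal
    s1 += I1; sk += I1 * kappa; s12 += I1 * I2
p_hat = s1 / trials
var = p_hat * (1 - p_hat)
r_via_kappa = (sk / trials - p_hat**2) / var    # Cov(I1, kappa)/Var(I1)
r_via_pair = (s12 / trials - p_hat**2) / var    # Cov(I1, I2)/Var(I1)
print(r_via_kappa, r_via_pair)
```

For this particular collection of groups, the exact value is r = (28/120 − (11/30)^2) / ((11/30)(19/30)) ≈ 0.426, which the two estimates reproduce up to Monte Carlo error.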
Cannot replace regression coefficients with partial derivatives: A popular procedure for approximating (S38) by something easier to analyse, under weak selection, is the replacement of the regression coefficients c_t = −β_{t;w_•,I_1|κ} and b_t = β_{t;w_•,κ|I_1} by partial derivatives of fitness functions with respect to I_1 and κ, respectively. See for instance [16], [6] (Box 6), or [18] (condition (6.7)). (The standard notation involves defining y = δI_1, z = δκ and w(y, z) = w_•, and expressing (S38) in terms of y and z, rather than I_1 and κ. This would make no relevant difference in the computations below.) We compute next the critical level of relatedness needed for invasion by the allele A in the case of the IPG, using that approximation. For the IPG we have

w_• = 1 + δ(−CI_1 + Bκ), if (n − 1)κ + 1 ≤ a, and w_• = 1 + δ[T(−CI_1 + Bκ)], if (n − 1)κ + 1 > a.
We are interested in analysing (S38) when p is close to 0. In this case the relevant partial derivatives would be taken at I_1 = κ = 0, and would be ∂w_•/∂I_1 = −δC and ∂w_•/∂κ = δB. Therefore (S38) would become ∆p = pδ(−C + Br_t). We would then conclude that the allele A could only invade, when δ is small, if R_0 > C/B. But as we know from Section S8, this is incorrect in a substantial way. Note that the limit δ → 0 corresponds to the limit of "vanishing trait variation" under which [6] and [18] claim that the replacement of regression coefficients by derivatives, as in the computation above, would give the right condition for invasion (when p is small, the condition of "vanishing genetic variation" required in [18] is also met).
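The size of the discrepancy can be quantified by comparing the derivative-based critical relatedness C/B with the solution of (S28) from Section S8, here with the illustrative values C = 1, B = 5, T = 100:

```python
def solve_S28(C, B, T, tol=1e-12):
    """Critical relatedness for the IPG in the payoff-dependent case
    a~ = C/B, from (S28): C - B R = (T-1) B R (1 - C/B)^(1/R)."""
    f = lambda R: C - B * R - (T - 1) * B * R * (1 - C / B)**(1 / R)
    lo, hi = 1e-9, C / B
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

C, B, T = 1.0, 5.0, 100.0
R_derivative = C / B             # from replacing regressions by derivatives
R_correct = solve_S28(C, B, T)   # from the analysis of Section S8
print(R_derivative, R_correct)
```

With these values, the derivative-based computation overstates the critical relatedness severalfold, which is the substantial error referred to above.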