Social evolution under demographic stochasticity

How social traits such as altruism and spite evolve remains an open question in evolutionary biology. One factor thought to be potentially important is demographic stochasticity. Here we provide a general theoretical analysis of the role of demographic stochasticity in social evolution. We show that the evolutionary impact of stochasticity depends on how the social action alters the recipient’s life cycle. If the action alters the recipient’s death rate, then demographic stochasticity always favours altruism and disfavours spite. On the other hand, if the action alters the recipient’s birth rate, then stochasticity can either favour or disfavour both altruism and spite depending on the ratio of the rate of population turnover to the population size. Finally, we also show that this ratio is critical to determining if demographic stochasticity can reverse the direction of selection upon social traits. Our analysis thus provides a general understanding of the role of demographic stochasticity in social evolution.

not explicitly specify the dynamics of the environmental variable for reasons which will 10 become apparent shortly, but suppose that it may under go any possible transition of 11 the form X → X ± e 3 or X → X ± e 3 ∓ e i . In the latter case, the effect upon type 1 or 12 2 is subsumed into the birth and death terms of eq S1. Let Q(X, t) be the probability 13 density for X at time t. Then the master equation for the stochastic process (ignoring 14 changes in Y ) is Let Ω be a system size parameter (e.g., [1,2]), and take x ≡ (x 1 , x 2 , y) ≡ X/Ω, and 16 q(x, t) ≡ ΩQ(X, t). In our context Ω can be thought of as habitat size and so x can 17 be thought of as densities. If we suppose Ω is sufficiently large such that the variables 18 x are approximately continuous and rescale time as τ = t/Ω, we can perform a series 19 expansion in powers of 1/Ω to give the Fokker-Planck (or forward Kolmogorov) equation 20 , and the B ij (x) are the entries of with n = x 1 + x 2 .

22
The Fokker-Planck equation given by eq S3 is associated with the system of Ito 23 stochastic differential equations (SDEs) where A(x) = (A 1 (x), A 2 (x)) T , C(x)C(x) T = B(x) and W τ is a vector of N indepen-25 dent Wiener processes [1,3]. Thus C(x) is a 2 × N matrix, where N is a integer whose 26 value will depend upon how the matrix C(x) is chosen (the choice of matrix C(x) is not 27 January 14, 2019 2/14 unique). If habitat size becomes large, Ω → ∞, stochasticity disappears from eq S5, and 28 we are left with the system of ordinary differential equations (ODEs)ẋ = A(x), where 29 x = (dx 1 /dτ, dx 2 /dτ ) T (we have purposefully neglected dy/dτ inẋ). 30 As in the main text, we assume that type 1 is the social actor, and so if we let 31 r i (x) ≡ b i (x) − m i (x) denote the per-capita growth rate of type i, then we assume that 32 r 1 (x) = b(x) − m(x) − c(x) and r 2 (x) = b(x) − m(x). Thus whenever > 0, type 2 has 33 the selective advantage, and will ultimately fix in the ODE model without mutations. 34 Note that the costs may either reduce birth rate or increase death rate; although in 35 general this will have implications for the structure of B(x) and thus C(x), because we 36 will assume is sufficiently small that we can neglect terms of order /Ω, we only need 37 to take into account how the costs alter expected per-capita growth rate. 38 Reduction of system to slow manifold 39 We wish to reduce system S5 into a more manageable problem. To do so, suppose that 40 in the absence of selection, mutations, and stochasticity ( = 0, µ = 0, and Ω → ∞, 41 respectively), there exists a globally asymptotically stable curve of ecological equilibria 42 in the ODE system given by r i (x) = 0. We parameterize this curve in terms of x 1 , and 43 so let γ(w) = (w, γ 2 (w), γ y (w)) denote the value of x along the curve. We will assume 44 that both w/γ 2 (w) and w/(w + γ 2 (w)) are invertible. When µ = = 0 and Ω → ∞, 45 the ODE system will asymptotically approach different points on γ(w) dependent upon 46 the initial conditions (γ(w) is a center manifold [4]). If instead Ω is large but finite, 47 and , µ are nonzero but small, then system S5 will rapidly approach γ(w) along the 48 flow lines of the ODE system. However, once in the vicinity of γ(w) the stochastic 49 component of system S5 will dominate the dynamics, since as x → γ(w), A(x) → 0. 50 Because the movement along γ(w) is slow relative to the rate at which a system initially 51 distant moves to the vicinity of γ(w), in this context γ(w) is often referred to as a 'slow 52 manifold' [5][6][7]. Since the movement along the slow manifold corresponds to change in 53 population composition (and so represents the evolutionary timescale), our goal is to 54 derive an equation approximating the motion of system S5 along γ(w).

55
To do so, observe that when = µ = 0 and Ω → ∞ we have and so for the initial condition x w = (w 1 , w 2 , w 3 ), the solution of eq S6 is x 1 (t) = 57 (w 1 /w 2 )x 2 (t). Now suppose (stochastic) trajectories along the slow manifold receive 58 random "kicks" displacing the system from the slow manifold. Once the system is 59 "kicked away" from the slow manifold, then provided Ω is large (and , µ small), the 60 system will return to the slow manifold approximately along the (deterministic) flow 61 lines. These flow lines are the solutions of eq S6 with the initial condition being the 62 position system was "kicked" too. Thus a trajectory initially at (x 1 , x 2 , y) will return to 63 the point on the slow manifold, γ(w), implicitly given by From eq S7, since w parameterizes the slow manifold, dw/dt reveals how the position 65 of the system along the slow manifold evolves with time [6,8]. Let G(w) = w/γ(w) 66 and g(w) = G −1 (w). Then from eq S7 we have w = g(x 1 /x 2 ) and so applying the 67 multivariable version of Ito's formula [1,3] to w gives where we have neglected terms of order /Ω and µ/Ω.

69
Let p ≡ w/(w + γ 2 (w)) be the frequency of type 1 along the slow manifold. Then 70 applying Ito's formula to compute dp using eq S8 and evaluating the result on the slow 71 manifold (see [6,8] for justification) gives

74
To obtain eq S9 we did not need to specify the dynamics of the environmental variable 75 y, and instead only needed to know the density of the environmental variable along the 76 slow manifold, γ y (w). This is because the separation of time scales assumes population 77 composition changes on a slower time scale than demographic processes and so on the 78 slow timescale the environmental variable will be in a quasi-steady state if we use the 79 tools of Parsons and Rogers [6].

80
As an aside, we note that for problems which satisfy condition eq S6, then eq S9 is the 81 same result we would have obtained if we instead applied Ito's formula to p = x 1 /(x 1 +x 2 ) 82 and then restricted the resulting equation to γ(w). The intuitive reason for why this is 83 true is that a system (stochastically) perturbed away from the slow manifold will return 84 roughly along the flow lines of the ODE system when = 0. But along these flow lines, 85 by eq S6 we see that the proportion p remains constant.

86
To summarize our analysis to this point, we first specified a discrete state space 87 stochastic process involving two types of competing individuals in eq S1. We then 88 assumed that habitat size (and so population size) was sufficiently large that the discrete 89 state space is approximately continuous, and obtained a diffusion approximation of the 90 stochastic process (eq S3). We then applied the techniques of Parsons and Rogers [6] to 91 derive a single-variable SDE approximating the dynamics of motion of eq S3 along the 92 slow manifold by eliminating the dynamics on the fast timescale.

93
Speed measure and stationary distribution 94 As in the main text, let α(p) ≡ µ(1−2p)− c(p)p(1−p) and σ 2 (p) ≡ p(1−p)T (p)/[Ωn(p)]. 95 Then α(p) and σ 2 (p) are the infinitesimal mean and variance, respectively, of a one-96 dimensional diffusion process [9,10]; this is the diffusion process associated with the 97 SDE S9. Define for some constant C 0 . If C 0 can be chosen such that 1 0 π(p)dp = 1, then π(p) is 99 normalizable and so represents the stationary distribution of the diffusion process with 100 infinitesimal mean α(p) and infinitesimal variance σ 2 (p) [1,3,10]. However, regardless of 101 whether such a C 0 exists, π(p) is the speed measure of the diffusion process [9]. The 102 relevance of this is that the speed measure is proportional to the expected time a diffusion 103 process initially at p takes to exit an interval (ε − p, p + ε) for small ε [9]. Thus as π(p) 104 increases, the process will tend to spend more time at state p, and so we are more likely 105 to observe the process in such a state. Hence as π(p) increases for a given p, we will say 106 state p is increasingly favoured.

107
There are three factors that influence eq S10: 108 1. selection, which is controlled by c(p), and biases the stochastic process against 109 the social actor, 110 2. mutation rate, µ, which pushes π(p) away from the boundaries (p = 0 and p = 1) 111 towards p = 1/2, 112 3. and demographic stochasticity, which is controlled by the infinitesimal variance, 113 σ 2 (p). If we inspect σ 2 (p), however, we see that it is the product of a symmetric 114 term, p(1 − p)/Ω, and the ratio, T (p)/n(p). Thus this ratio controls the effect of 115 demographic stochasticity, and as we will see is key to understanding social trait 116 evolution.

117
Evolution of cost-free social traits 118 To understand the role of demographic stochasticity, suppose that selection is turned off 119 ( = 0). Then the only two factors present in eq S10 are mutation rate and the ratio 120 T (p)/n(p). First, consider how T (p)/n(p) changes in p, that is, 121 d dp dT /dp T (p) − dn/dp n(p) . (S11) If the social trait is spite, dn/dp < 0, whereas if the social trait is altruism, dn/dp > 0. 122 Since on the slow manifold, T (p) = 2b(p) = 2m(p), if the social trait acts upon death 123 rate, then dT /dp = 2dm/dp > 0 if the trait is spite, whereas if the trait is altruism, 124 dT /dp < 0. Thus for social traits acting on the death rate, T (p)/n(p) is monotonic in p, 125 and so is either minimized by the social actor (if the trait is altruism) or the non-social 126 actor (if the trait is spite). If instead the social trait acts upon birth rate, then since 127 dT /dp = 2db/dp, if the trait is spite dT /dp < 0 and if the trait is altruism dT /dp > 0. 128 Hence for both altruism and spite the ratio T (p)/n(p) may be smaller for a population 129 monomorphic for the social actor, or it may be smaller for a population monomorphic 130 for the non-social actor. Moreover, it is also possible that T (p)/n(p) is non-monotonic 131 in p (and so minimized by a polymorphic population).

132
To understand the implications of the behaviour of T (p)/n(p), first we will assume 133 that T (p)/n(p) is monotonic in p, and so is minimized by either type 1 or type 2, before 134 considering what happens when T (p)/n(p) is non-monotonic in p. There are three 135 different regimes based upon mutation rate, and we consider each in turn. 136 1. Low mutation rate. When mutation rate is sufficiently low, then in the interval 137 p ∈ (0, 1), π(p) ≈ 1/σ 2 (p), and π(p) is U -shaped. We are interested in whether 138 π(p) is increasing or decreasing in p. Taking the derivative gives 139 dπ dp ≈ d dp Since the first term, , is symmetric in p, we can ignore it and 140 instead focus upon how the ratio T (p)/n(p) changes with respect to p. If the ratio 141 is decreasing (resp. increasing) in p, then dπ/dp > 0 (resp. dπ/dp < 0), and the 142 social actor is favoured (resp. disfavoured).

143
An alternative way to arrive at this conclusion is to note that in the absence 144 of selection (and mutations), the diffusion process is on its natural scale (i.e., 145 α(p) = 0), and so the fixation probability of either type is simply equal to its 146 proportion in the population. Suppose there are k possible types of individuals, 147 and let mutations be sufficiently rare such that between mutations the population 148 returns to a monomorphic state. Then we can construct a Markov chain on the 149 state space of possible strain types [11]. In particular, let µ ij be the rate at which 150 strain i mutates to strain j and N i be the number of type i individuals in the 151 population at the moment of the (rare) mutation. Then is the rate at which the population transitions from a monomorphic strain i state to 153 a monomorphic strain j state [11]. Hence in the absence of mutational biases, the 154 Markov chain is equally likely to be in any particular state [11][12][13]. This implies 155 that the process will spend equal time in any of the monomorphic states. But if 156 so, then what type we are most likely to observe in the population will be dictated 157 by the time the process takes to transition between states, that is, the expected 158 time from the initial appearance of a mutation for the population to return to a 159 monomorphic state. Using standard techniques [10], when there are two types in 160 the population and they are selectively neutral, = 0, then the expected time till 161 absorption (time to reach either p = 0 or p = 1) from an initial state p 0 is Since time spent in any monomorphic state is equal, the type that is most likely 163 to be observed is the type which maximizes absorption time, that is, the type that 164 maximizes the time spent transitioning between states.

165
To determine this, let ν control the effect of the social behaviour, and suppose 166 that the effect of the social behaviour upon the ratio T (p)/n(p) is small. Since the 167 two types only differ due to the effects controlled by ν, we can write T (p)/n(p) = 168 T (νp)/n(νp), and so a Taylor expansion of the ratio T (p)/n(p) with respect to ν 169 gives If we then use this approximation for the ratio T (p)/n(p) in the differential equation 171 used to derive eq S14, we obtain the time till absorption as where T /n no longer depends upon ν or p (or p 0 ). Notice from eq S16 that if the 173 two types are identical in every respect, ν = 0, then absorption time is symmetric 174 in p about p = 1/2, and for any p,t(p) will decrease (resp. increase) as T /n 175 becomes large (resp. small). Thus absorption time is maximized (resp. minimized) 176 when we have minimized T /n (resp. maximized T /n). If instead ν > 0, then 177 time till absorption will no longer be symmetric about p = 1/2, and instead will 178 depend upon how the social behaviour effects the ratio T /n. In particular, if we 179 consider the differencet(p 0 ) −t(1 − p 0 ) on the interval p 0 ∈ (0, 1/2), we see that 180 this quantity will be negative, that is, the process will take longer to absorb from 181 state 1 − p 0 then p 0 , provided the social trait minimizes the ratio T /n. In this 182 circumstance, it follows immediately that 1 1/2t (p 0 )dp 0 / 1 0t (p 0 )dp 0 > 1/2. Hence 183 if the social trait minimizes the ratio T /n, then absorption time will be biased 184 in favour of the social actor and so the type minimizing T /n is favoured when 185 mutations are sufficiently rare. 186 2. Intermediate mutation rate. Suppose mutation rate is non-negligible, but is 187 not sufficiently high so as to admit a normalizable stationary distribution. Now 188 the picture is more complex, and it is not clear how to determine which type is 189 favoured. To see why, consider how π(p) changes in p: . (S17) From how the terms are grouped, we see that as p → 1/2 term (1) disappears, 191 whereas when p → 0 or p → 1, term (2) disappears (note it is also possible that 192 there exists a p 0 such that µ = T (p 0 )/[2Ωn(p 0 )], in which case term (1) also 193 disappears as p → p 0 ). Roughly speaking, this implies that term (1) is the stronger 194 effect near the boundaries of the interval p ∈ [0, 1], whereas term (2) is the stronger 195 effect in the interior. Term (1) represents the effect of mutations, µ, which pushes 196 π(p) towards p = 1/2, and genetic drift, T (p)/[2Ωn(p)], which pushes π(p) towards 197 the boundaries. These effects would be present even if the ratio T (p)/n(p) were 198 constant, that is, type 1 and 2 were mathematically interchangeable. Term (2), 199 however, occurs due to how the magnitude of demographic stochasticity (or genetic 200 drift) changes due to the consequences of the social trait, that is, how the ratio 201 T (p)/n(p) changes in p.

202
Consider the behaviour of π(p) at the boundaries, and so focus upon term (1). 203 When mutation rate satisfies µ < T (p) 2Ωn(p) for all p, then the distribution tends to 204 accumulate at both boundaries since genetic drift is stronger then mutations. As 205 µ increases, however, because in general T (0) 2Ωn(0) = T (1) 2Ωn(1) , there will be a regime 206 in which µ > T (p1) 2Ωn(p1) but µ < T (p2) 2Ωn(p2) where {p 1 , p 2 } ∈ {0, 1} with p 1 = p 2 . Thus 207 at the boundary which minimizes T (p)/n(p) (p 1 boundary), mutations will be a 208 stronger force then genetic drift, pushing the distribution towards the interior, 209 whereas at the other boundary (p 2 boundary) the distribution will accumulate at 210 the boundary in a state of quasi-fixation as the force of genetic drift outweighs 211 mutations. This will tend to give rise to the sideways S-distribution as seen in Fig 212  1e,f in the main text.

213
This behaviour at the boundary prevents us from formulating a clear criteria 214 about which type is 'favoured'. We can no longer focus exclusively upon the ratio 215 T (p)/n(p) as in the case of low mutation rate, but we also cannot use an integral 216 measure to determine which type is stochastically favoured (i.e., does more of 217 the mass of π(p) occur for p > 1/2 or p < 1/2?), since π(p) is not normalizable. 218 However, because away from the boundaries the strongest effect upon the shape 219 of π(p) will be how the ratio T (p)/n(p) changes in p (term (2) from eq S17), we 220 may be inclined to argue that given sufficient 'segregating variation' exists, and 221 so for polymorphic populations, the type minimizing the ratio T (p)/n(p) will be 222 favoured.
sense to use 1 1/2 π(p)dp to determine which type is favoured. In particular, if 226 1 1/2 π(p)dp > 1/2, then the social actor is favoured (this assumes we have chosen 227 C 0 such that 1 0 π(p)dp = 1). There are two possibilities here, either (a) mutations 228 are of small effect, or (b) mutations are of large effect. We consider these cases in 229 turn.

248
To make further progress, we need to compute ∂S ∂ν ν=0 . For ease of notation, 249 let R(p, ν) ≡ T (p, ν)/n(p, ν) and ω ≡ 2Ωµ R . Then Now what is ∂R/∂ν? Suppose we can write per-capita growth as b(n) − d(n) + 251 νθ(n)np, where b(n) and d(n) are the birth and death rates when ν = 0, and 252 θ(n)ν is the (additional) social effect of type 1 individuals (which is multiplied 253 by the density of type 1 individuals, np). The precise action of the social 254 trait may be upon either the birth or death rate, while if θ(n)ν < 0, the trait 255 is spite whereas if θ(n)ν > 0, the trait is altruism. The function θ(n) controls 256 any density-dependent effects of how the social action is distributed among 257 members of the population.

258
Since on the slow timescale the process is in demographic equilibrium, per-259 capita growth is zero, and so at equilibrium , and assuming G(n) is invertible, n = 261 g(νp) ≡ G −1 (νp). Thus we see that in order for G(n) to be invertible, n must 262 be a monotonic function of νp (which it is by our classification of the social 263 traits). Since at equilibrium, T (p, ν) is either two times the per-capita birth 264 rate or two times the per-capita death rate (e.g., if the social trait acts on 265 birth rate, then the birth rate is b(n) + νθ(n)np and the death rate is d(n) and 266 thus T (p, ν) = 2(b(n) + νθ(n)) = 2d(n)), we can write R as a function of n, 267 that is R(n) = R(g(νp)) (e.g., using our previous example, R(n) = 2d(n)/n). 268 So all of the instances of ν in R are mediated through their presence in n. 269 Then and when ν = 0, where δ ≡ R (g(0))g (0) is constant with respect to p (and ν).

272
Using this information and the fact that in eq S19 gives Now using eq S20 in eq S18 yields Since 1 1/2 f dp − 1/2 0 f dp is a function of a single variable, ω, it can be plotted. 277 From inspection of Fig A, this quantity is positive for ω > 0 and goes to zero 278 as ω → ∞. It follows that if δν < 0, type 1 is favoured, whereas if δν > 0, 279 type 2 is favoured. But the sign of δν has the same interpretation as before: 280 whichever type minimizes the ratio T /n is favoured. Thus when mutations 281 are of small effect, whichever type minimizes the ratio T /n is favoured. 1/2 1 f(p, )dp -0 1/2 f(p, )dp Fig A. As ω → ∞, 1 1/2 f dp − 1/2 0 f dp → 0, but remains positive.
(b) If mutations are not of small effect, that is, ν 0, and mutation rate is also 283 high, then mutations will exert a strong effect near the boundary pushing 284 the weight of the distribution towards p = 1/2. In this regime, the ratio 285 T (p)/n(p) plays the strongest role. So provided T (p)/n(p) is monotonic in 286 p, then the type minimizing the ratio T (p)/n(p) will tend to be favoured in 287 the sense that the weight of the distribution will be displaced away from 1/2 288 towards this boundary.

289
The preceding analysis assumed that the ratio T (p)/n(p) was monotonic in p and so 290 minimized at either p = 0 or p = 1. Suppose instead that T (p)/n(p) is non-monotonic 291 in p. Then for some p ∈ (0, 1), say p * , d dp T (p) n(p) = 0. From our analysis above, it is 292 apparent where the issues are going to arise. For low mutation rate, when we are in 293 a small neighbourhood of p * consideration of eq S12 predicts that whichever type is 294 more abundant will be favoured. Likewise, at intermediate mutation rates, now term 295 (2) in eq S17 will be zero at p * and so in a neighbourhood of p * the dominant force 296 shaping π(p) will be term (1). But away from the boundary, whether term (1) favours 297 or disfavours the social actor will depend both upon the value of p * (is it greater or 298 less than 1/2?) and the magnitude of mutation rate relative to genetic drift at p * (is 299 µ greater or less than T (p * )/[2Ωn(p * )]?). Finally, consider the case in which mutation 300 rate is sufficiently high such that there is a normalizable distribution. If mutations are 301 of small effect, that is, the two types are sufficiently similar in terms of the social trait, 302 then implicit within the preceding analysis was that one of the types will minimize the 303 ratio T (p)/n(p) (and so T (p)/n(p) is monotonic in p for sufficiently small ν). However, 304 when mutations are of large effect, then consideration of eq S17 reveals again how the 305 results become more complex and how the magnitude of mutation rate will alter the 306 expected outcome.

307
Evolution of costly social traits 308 If we instead suppose that > 0, then the social actor is selected against, and so can only 309 be favoured if the influence of demographic stochasticity outweighs that of selection. This 310 is similar to the observation from classical population genetics that in large populations, 311 selection dominates, whereas in small populations, genetic drift does. Because of the 312 inherent complexities owing to how the costs of the social traits are formulated (are 313 they density-dependent, or is c(p) constant for all p?), as well as how the costs interact 314 with mutations and demographic stochasticity, in the main text we simply focus upon 315 numerical calculations to show that a stochastic reversal of selection is possible using the 316 measure 1 1/2 π(p)dp. If 1 1/2 π(p)dp > 1/2, then despite the social actor being selected 317 against, the population is most likely to be observed in a state in which the social actor 318 is at greater frequency, and so we characterize this as a stochastic reversal of selection. 319 Examples 320 In this section we detail the calculations used to obtain the examples found in the main 321 text.
c(x) ≡ 0 with β > d, a > 0, and ν ∈ [0, 1]. Here the social trait is spite which alters 341 birth rate. In particular, with probability νx 1 /(x 1 + x 2 + a) a type 1 individual 342 blocks another individual from reproducing. For this example, γ 2 (w) can be 343 computed analytically, but the expression is unwieldy and so we do not show it 344 here. Importantly, however, T (p)/n(p) is a nonlinear function of ν, and the level 345 of spite minimizing T (1)/n(1) is Simulations predict this level of spite also tends to be favoured in the n-type model 347 (Fig 3). 4. b(x) ≡ r + νx 1 , m(x) ≡ κ(x 1 + x 2 ), and c(x) ≡ r, with r, ν, κ > 0 and κ > ν. 349 Here the social trait is altruism which increases birth rate. On the slow manifold, 350 γ 2 (w) = (r − [κ − ν]w)/κ, and so h(p) = rp/(κ − νp). Using this information yields 351 the stationary distribution where proportionality is up to a positive constant. In the absence of costs ( = 0), 353 eq S24 is symmetric about p = 1/2, and so neither type is stochastically favoured. 354 The reason for this result is that T (p)/n(p) = 2κ which is constant for all p.
, and c(x) ≡ r, with β > d > 0, κ > ν > 0 356 and r = β − d, and so the social trait is altruism increasing the birth rate. The 357 per-capita growth rate in this model is the same as in example 4 and so γ 2 (w) and 358 h(p) are the same. However now the stationary distribution is Here, in the absence of costs, the distribution is asymmetric and favours type 1 360 (the altruist). This can be seen by noticing that T (p)/n(p) = 2(βκ − dνp)/r, which 361 is decreasing in p.

362
Interestingly, example 4 is the non-spatial model of Constable, Rogers, McKane & 363 Tarnita [8] (CRMT), differing only in that we have explicitly included mutations (our 364 notation also slightly differs). CRMT concluded that stochasticity induced an advantage 365 for the altruist whereas our analysis shows altruism is stochastically neutral if cost-free 366 and disfavoured otherwise. There are two reasons for this discrepancy. 367 1. Density versus frequency. Rather than dealing with the SDE for proportions 368 directly (eq S9), CRMT focused upon interpreting the infinitesimal mean of the 369 density SDE, dx 1 (this can be obtained from (S9) by application of Ito's formula [1]). 370 However, evolution is a change in frequency, not density, and a change in density 371 is not equivalent to a change in frequency. This is clear from eq S9 where 372 in the absence of mutations and selection, the infinitesimal mean is zero while 373 σ 2 (p) = 2κ Ω p(1 − p) so the stochastic process is mathematically equivalent to 374 pure genetic drift in a population of constant size [14]. 375 2. Inclusion of mutations. At selective neutrality, which is the scenario most conducive 376 to the evolution of altruism, the fixation probability of a particular type is equal to 377 its proportion in the population. Thus the invasion probability of a type j mutant 378 into a monomorphic type i population of size N i is simply 1/N i . As the altruist 379 can grow to a larger population size then the non-altruist, pairwise comparison 380 of invasion probabilities predicts the altruist is favoured [8,15,16]. However, if 381 we assume type i mutates to type j at a per-capita rate µ ij , then as shown in 382 eq S13 the transition rate from an all-type i state to an all-type j state is µ ij . So 383 in the absence of mutational biases, the Markov chain is equally likely to be in any 384 state, a standard result for neutral evolution in sequential-fixation models [11][12][13]. 385 Hence mutations erase any numerical advantage of the altruists. However, this 386 does not take into account that in our model the expected time till fixation varies 387 based upon population composition. Consideration of expected time till fixation 388 (or absorption time) reveals the importance of the ratio T (p)/n(p) (see eq S16).

389
Simulations 390 To support our analytic predictions, we use two types of simulations: Gillespie's algorithm 391 [17] and the Euler-Maruyama (EM) method [3]. In particular, for plots involving the 392 stationary distribution of the two-type model, we have used Gillespie's algorithm (Fig 393  1, Fig 2 and Fig 4) to simulate the full stochastic process specified by eq S1. When 394 we consider more then two types (as in Fig 3) we simulate the system of stochastic 395 differential equations which approximates eq S1 using the EM method [3], which we 396 detail briefly here. For the EM method, we assume that a type i individual mutates to 397 a type j individual at rate µ. Thus the total rate at which a type i individual mutates 398 to a different type is µ(n − 1). Let η be a n × n matrix of standard normal random 399 variables, and let ∆τ be the step-size. Then using EM, the change in variable x i over 400 the time increment ∆τ , i.e., ∆x i ≡ x i (τ + ∆τ ) − x i (τ ), is given by For all the n-type simulations used in this paper, ∆τ = 0.01, Ω = 10 4 , and µ = 10 −6 , 402 and the initial conditions were chosen to be x i (0) = 0.05 for all i; this was then simulated 403 until τ = 2 × 10 6 for 10 4 individual simulations (we checked that the distribution had 404 settled down by τ = 2 × 10 6 ). Then if x The primary reason for using EM method rather than Gillespie's algorithm when 407 the number of types is greater then 2 is that the computational costs of Gillespie's 408 algorithm rapidly become prohibitive. This is because when we construct the stationary 409 distribution in the 2-type case, in order to have the distribution normalizable, mutations 410 must be artificially high. When mutations are high, the population composition can 411 change more rapidly, and so simulations reach the stationary distribution more rapidly. 412 When we extend the system to include more than 2-types, mutations have a homogenizing 413 effect, and so we lower the mutation rate. However, this means that the population 414 composition changes less quickly, and so more time must elapse to obtain the stationary 415 distribution.