Mutation Bias Favors Protein Folding Stability in the Evolution of Small Populations

Mutation bias in prokaryotes varies from extreme adenine and thymine (AT) in obligatory endosymbiotic or parasitic bacteria to extreme guanine and cytosine (GC), for instance in actinobacteria. GC mutation bias deeply influences the folding stability of proteins, making proteins on the average less hydrophobic and therefore less stable with respect to unfolding but also less susceptible to misfolding and aggregation. We study a model where proteins evolve subject to selection for folding stability under given mutation bias, population size, and neutrality. We find a non-neutral regime where, for any given population size, there is an optimal mutation bias that maximizes fitness. Interestingly, this optimal GC usage is small for small populations, large for intermediate populations and around 50% for large populations. This result is robust with respect to the definition of the fitness function and to the protein structures studied. Our model suggests that small populations evolving with small GC usage eventually accumulate a significant selective advantage over populations evolving without this bias. This provides a possible explanation to the observation that most species adopting obligatory intracellular lifestyles with a consequent reduction of effective population size shifted their mutation spectrum towards AT. The model also predicts that large GC usage is optimal for intermediate population size. To test these predictions we estimated the effective population sizes of bacterial species using the optimal codon usage coefficients computed by dos Reis et al. and the synonymous to non-synonymous substitution ratio computed by Daubin and Moran. We found that the population sizes estimated in these ways are significantly smaller for species with small and large GC usage compared to species with no bias, which supports our prediction.


A. Maximum likelihood equations
We can analytically predict how the population size N and the neutrality parameter S influence stability and fitness by exploiting the formal analogy between population dynamics and statistical mechanics proposed by Sella and Hirsh [1]. They noticed that, for monomorphic populations for which fixation events only involve the wild-type genotype and a single mutant, several evolutionary processes studied in population genetics tend to a stationary fitness distribution of the form exp(Nϕ) that is formally equivalent to a Boltzmann distribution, with population size N playing the role of inverse temperature and the logarithm of fitness ϕ = log( f ) playing the role of minus energy. This selective process has to be combined with the mutation process. We define P mut (α, F, GC) as the probability to find stability values α and F under an evolutionary process with no selection ( f is constant for all genotypes) and a mutation process with given GC usage (i.e., the stationary GC content of the evolutionary process is GC). Combining mutation and selection, we can compute the probability to observe misfolding and unfolding stability α and F in the stationary state of a population of N evolving individuals as where we introduced the notation ϕ = log( f ) and σ (α, F, GC) = log (P mut (α, F, GC)). This latter quantity can be interpreted as the entropy in sequence space compatible with stabilities α and F in the absence of selection, which depends on the mutation process, hence on the GC usage. We also define the normalized stabilities x α = α/α thr and x F = F/F thr . Since σ is proportional to sequence length L, and both L and N are large in biologically relevant situations, the distribution described by Eq. (1) is narrowly peaked around the values x α (S, N, GC) and x F (S, N, GC) that have maximum probability, i.e. the stabilities that maximize the "evolutionary free energy" G = log(P sel ) ≡ σ + Nϕ. In the following, we take a mean-field perspective and we identify the mean stabilities with the maximum likelihood stabilities x α (S, N, GC) and x F (S, N, GC), and the mean fitness with the fitness corresponding to these values, ϕ(x α , x F , S) ≈ ϕ(x α , x F , S). We consider the additive fitness function (see main text) for positive x α and x F , and ϕ = −∞ if either stability is not positive. The fitness takes the value f = exp(ϕ) if both stabilities are positive and f = 0 if either x α or x F are negative, which are strictly forbidden. Fitness grows with stability. The maximum value of stability is of course finite, since the number of sequences is finite. Fitness becomes a binary value f ∈ {0, 1} when S tends to infinite, with f = 1 when stabilities are above the neutral thresholds x i = 1 and f = 0 when either stability is below the threshold. We call this the neutral limit. With now consider the logarithm of the probability in Eq. (1), The quantity −G can be the interpreted as an evolutionary free energy. The most likely stability values x α and x F maximize G or, equivalently, they minimize the evolutionary free energy. They can be computed solving the maximum likelihood (ML) equations with i = α, F. These equations express the balance between the relative decrease in the number of sequences with enhanced stability due to mutation entropy and their relative increae due to selective pressure. Validity of the ML equations requires that G has a maximum, so that the Hessian matrix H of the second derivatives of G must be negative definite. This matrix is defined by This is the sum of the Hessian of ϕ(x α , x F ) = − log( f ), which is negative by construction, as it is easy to verify, and the Hessian of σ (x α , x F ), which is the logarithm of the probability to find stability values x α and x F under mutation alone. We assume that σ (x α , x F ) has a single maximum at x α = x mut α , x F = x mut F . Therefore, ∂ σ /∂ x i are negative for x i > x mut i , as it is required for the existence of solutions of the ML equations, and the Hessian of σ is negative, as required. We can go beyond the ML approximation writing the exponent G at second order as , which is equivalent to approximating the distribution P sel as a Gaussian with covariance matrix −H −1 . Therefore, negativity of the Hessian matrix is equivalent to requiring that the covariance matrix is positive.
The definition of the normalized energy gap α implies that x mut α is always negative. In contrast, our numerical results show that x mut F is positive for small GC usage corresponding to very hydrophobic sequences. ¿From the ML equations, we can obtain the following implicit equations that expresses the stability x F as a afunction of x α .
We see from this equation that the smaller stability x i is the one for which the absolute value of the derivative ∂ σ /∂ x i is larger.

Influence of parameters on stability
We now calculate how x α and x F depend on the parameters λ ∈ {N, S, GC}. We perform this calculation by taking the total derivative of the ML equations ∂ G/∂ x i = 0 with respect to the parameter λ and equating it to zero, since the ML equations must be satisfied for all values of λ . The total derivative is the sum of the partial derivative with respect to λ plus the partial derivatives with respect to x j multiplied times ∂ x j /∂ λ . We therefore get where H i j = ∂ 2 G/∂ x i ∂ x j . As mentioned above, the inverse of the Hessian matrix can be interpreted as minus a covariance matrix and it is negative definite. For λ = N we find which is always positive, since −H −1 i j is a positive matrix and both ∂ ϕ/∂ x j are positive. Therefore, both stabilities always increase with population size, consistent with the statistical mechanical analogy described above. For very small N, and provided that S is not too large so that SN is small, stabilities satisfy the equations x S+1 is positive as it is the case for x mut F when the GC usage is small. For very large N stabilities approach the maximum possible values. However, they can not be maximized simultaneously since there is a trade-off between the two kinds of stability (see main text).
For λ = GC, we find Numerical results show that x α is an increasing function and x F is a decreasing function of GC usage, so that these two variables are anticorrelated when GC is varied. Finally, for λ = S we find where k is the stability different from j (if j = α than k = F and the other way round). The term in round brackets in the numerator can vanish at some value of S, meaning that stabilities need not to be monotonic function of S. For S = 0 stabilities only have to fulfill the conditions x i > 0, and their values are given by x i = min(x mut i , 0). Our numerical results indicate that stabilitiies increase with S for small S. An N-dependent maximum is reached at intermediate S when both terms in round brackets with j = α and j = F in Eq. (10) vanish. It is possible to see that the maximum can only be reached when both x i are larger than one. For large S the fitness landscape is almost neutral, so that finite differences in stability determine very small differences in the additive fitness ϕ such that N∆ϕ ≪ 1. Such differences can not be fixed in the population. In the limit S → ∞ and for finite N, fitness only depends on the smaller of the two stabilities, x = min(x α , x F ), which tends to the value x = 1, i.e. to the neutral threshold at which fitness approaches the value f = 1. Therefore, stability is predicted to decrease with S at large S. This prediction is confirmed by numerical results. To simplify the analytic calculations, we also confirmed the prediction that stability has a maximum at intermediate S by using the simplified fitness function ϕ = − log(1 + x −S ), with x = min(x α , x F ) (see below).

Influence of parameters on fitness
Since fitness is an increasing function of stability and stability increases with population size, fitness always increases with population size, as expected. However, fitness is not a monotonic function of S but it starts from the value f = 1/3 at S = 0, it decreaes to a minimum at small S and then grows towards the neutral limit f = 1 at large S.

Optimal mutation bias
We now consider the dependence of the mutation bias on fitness. The two stabilities x α and x F depend on the mutation process through the sequence entropy σ (x α , x F , GC). High GC usage favors hydrophilic proteins, enhancing x α at the expenses of x F . Since fitness depends on both stabilities, there is an optimal mutation bias at which the fitness is maximal for given S and N, satisfying the ML equations (4) plus the implicit equation dϕ/dGC = 0. Using the ML equations, we can write The fitness is a decreasing function of δ = x −S min 1 + (x min /x max ) S . In the neutral limit S → ∞, x min tends to 1 independent of GC usage and the fitness is maximal for x α = x F = 1. From the saddle-point equations, this condition implies ∂ σ /∂ x F (1, 1, GC) ≈ ∂ σ /∂ x α (1, 1, GC). Therefore, the optimal GC does not depend on N in the neutral limit.

Single stability fitness function
To simplify the calculations and get more analytic insight into the model, we consider here a simplified fitness function that depends only on the smaller between the two stabilities, ϕ = − log(1 + x −S ) with x = min(x α , x F ). This fitness function is the neutral limit of Eq. (2) for S ≫ 1. In fact, and the Hessian becomes a scalar function H = (∂ 2 σ /∂ x 2 ) + N(∂ 2 ϕ/∂ x 2 ) < 0. We now compute how stability and fitness depend on S. For stability, we find Since H < 0, the derivative is positive for z < z * and negative for z > z * , where z = S log(x) and z * ≈ 1.278 is the non-trivial root of the equation 1 − z + e −z = 0 at which stability reaches a maximum. Therefore, stability increases with the neutral exponent S for small S and it decreases for large S, consistent with the biological intuition. For fitness, we find The denominator can be written as For small S, it holds x ≈ 0 so that S log x < 0. The second term in the above equation is positive, but it is smaller than the first term since it is multiplied times S, hence the fitness decreases with S at very small S. At large S, log x is positive and the fitness increases with S. The previous analysis shows that, for fixed N and GC usage, there are two values of S at which two "evolutionary potentials" are extremal: The value of S = S mut (N) at which the fitness is minimal, and the larger value S = S opt (N) at which stability is maximal. This analysis defines in a natural way three regimes for the model, which are graphically represented in Fig. 1. 1. At small S, stability is close to the lethal threshold x = 0 where fitness drops to zero, purifying selection is weak and mutation entropy is large. We call this the mutation regime, and we define the mutation cross-over as the point where stability reaches the neutral threshold x = 1 at which the mean fitness takes the value f = 1/2 that it has for S = 0. We see from Eq. (12) that the boundary of the mutation regime is defined by the inequality Thus the larger is S, the smaller the populations size required to leave the mutation regime.
2. In the large S limit, the fitness becomes a binary variable with f ≈ 1 for x > 1 and f ≈ 0 for x < 0. In the neutral regime finite differences in stability imply very little differences in fitness and they cannot be fixed by natural selection, so that stability decreases with S towards the neutral threshold x = 1 for which mutation entropy is largest. We write the ML equation Eq. (12) in the form where we approximate x ≈ 1 in the r.h.s. At dominant order in S we obtain We now define the neutral cross-over as the point where x ≤ 1 + ε, with small ε. From the above equations it follows that the neutral regime is defined through the inequality Thus, the larger is S, the larger the population sizes N required to leave the neutral regime.
3. For N larger than the mutation cross-over (small S) or N larger than the neutral cross-over (large S) the population enters the non-neutral regime where stability is above the neutral threshold. For fixed N, stability has a maximum x * (N) at S = S opt (N) given by .
The above formulas suggest that the typical scale of population size for entering the non-neutral regime from the mutation or the neutral regime is proportional to 2 |∂ σ /∂ x| x=1 . We estimated this quantity from our simulations using the approximate relationship Eq. (18) to yield This estimate is predicted to be independent of N for very large S. We used data for S = 20, finding that there is still some weak dependence on N that can be attributed to corrections to the above approximations. Averaging over N between 10 and 4000, we found that 2 |∂ σ /∂ x| x=1 varies from a minimum of 28 at GC ≈ 0.5 to a maximum of 100 at GC ≈ 0, see Fig. 6. The minimum of |∂ σ /∂ x| x=1 at GC ≈ 0.5 is another way to see that this mutation bias is optimal for neutral evolution.
Notice that there is a critical value of S that separates the mutation regime (small S) from the neutral regime (large S). This can be obtained by equating the boundaries of the two regimes, Eq. (16) and (19), which yields For ε = 0.5 we find S * = 1.63. Similarly, there is a value of N below which the system passes directly from the mutation regime to the neutral regime without entering non-neutrality. This is given by N = 2/S * |∂ σ /∂ x| x=0 . Entropy/Stability seq_entr N=80 seq_entr N=320 seq_entr N=1280 x N=80 x N=320 x N=1280 FIG. 5: Minimal stability x = min{α/α thr , F/F thr } and difference between actual sequence entropy and entropy expected under mutation alone, versus neutrality S for various population sizes and GC = 0.5. Entropy was computed through the independent sites approximation as Entropy = − ∑ i ∑ a P i (a) log (P i (a)), where P i (a) is the asymptotic distribution of amino acid a at site i. Entropy under mutation alone is computed using the distribution P mut (a) in which each of the codons codifying the amino acid a have the frequency expected under the mutation model (with zero frequency for stop codons). The difference between the two entropies estimates the reduction in entropy due to negative selection for conserving protein stability. Notice that the maximum of stability corresponds to maximum reduction of sequence entropy.