Evolutionary forces affecting synonymous variations in plant genomes

Base composition is highly variable among and within plant genomes, especially at third codon positions, ranging from GC-poor and homogeneous species to GC-rich and highly heterogeneous ones (particularly Monocots). Consequently, synonymous codon usage is biased in most species, even when base composition is relatively homogeneous. The causes of these variations are still under debate, with three main forces being possibly involved: mutational bias, selection and GC-biased gene conversion (gBGC). So far, both selection and gBGC have been detected in some species but how their relative strength varies among and within species remains unclear. Population genetics approaches allow to jointly estimating the intensity of selection, gBGC and mutational bias. We extended a recently developed method and applied it to a large population genomic dataset based on transcriptome sequencing of 11 angiosperm species spread across the phylogeny. We found that at synonymous positions, base composition is far from mutation-drift equilibrium in most genomes and that gBGC is a widespread and stronger process than selection. gBGC could strongly contribute to base composition variation among plant species, implying that it should be taken into account in plant genome analyses, especially for GC-rich ones.

For substitutions there is no second term with errors as in equations (S2.1) as a wrong polarization corresponds to no substitution. Note also that the last terms in parentheses correspond to mutations polymorphic in the population but fixed in the sample. The full likelihood is thus obtained by combining equation (S2.6) with B1 instead of B and the product of these three new probabilities: !"# , !" !" !!" , !" (S2.8) We tested five different models: -The null model: B0 = B1 = 0 -B0 = 0 and B1 free -B0 free and B1 = 0 -B0 = B1 = B, B being free -B0 and B1 free Nested models were compared by likelihood ratio tests (LRT).

Joined estimation of gBGC and SCU
The models presented above can also be extended to the joint estimation of gBGC or to SCU. To do so we need to distinguish nine DAF spectra and nine categories of substitution and to fit a model with distinct gBGC and selection parameters, as summarized in the following Table S2.1: As above, input of mutations could be written as a function of GC content, mutational bias and proportion of preferred and unpreferred codons belonging to the different base combinations. However, here we are not directly interested in estimating the different mutation parameters so we simply assume a different mutational input for each category. Then, the full likelihood can be written as in equation (S2.8) as the product over all SNP categories of the nine SFSs and of the nine substitution counts. Assuming an ancestral and a recent process, this leads to four parameters of interest, B0, B1, S0, and S1 and thus to a potential large amount of models. We tested all combinations of models where each parameter can be either null or free, so from the null neutral model, B0 = B1 = S0 = S1 = 0, to the model with the four parameters being free. We then chose the best model using the Akaike Information Criterium (AIC) as all models are not nested. When AIC were very close we chose the model with the lower number of free parameters.
In some species, there is neither SNPs nor substitutions for the WSPU and SWUP categories. To avoid numerical problem, a value of 10 -4 was set to the corresponding SFSs.

Simulation procedure
We tested these two new models by applying them to simulated datasets. As gBGC is equivalent to genic selection, we simulated selection in a haploid population (R script). Thus B equals 2Neb. Every generation, mutations are drawn from a Poisson distribution with mean Nu, where N is the population size and u the mutation rate. As the model assumes independent new mutations, we consider that at max only one mutation can occur per generation. This corresponds to truncating the Poisson distribution but the probability of having more than one mutation is negligible for Nu << 1 as used in simulations. Each mutation is then followed independently until lost or fixation: the expected allele frequency is changed deterministically depending on selection coefficient and drift is simulated by sampling in a binomial distribution of size N and probability given by the expected frequency. Population size and/or selection coefficient and/or mutation bias can change across generations. At the end of a simulation sampling is simulated by drawing alleles from a binomial distribution of size n and probability given by the final allele frequency in the population. By summing over all possible mutations, the total number of mutations in each frequency, including fixed mutations, is recorded. Neutral, W!S / U!P and S!W / P!U mutations are simulated separately. Similarly, for the full gBGC/SCU model, the nine categories of mutations are simulated separately. We then applied different polarization error rates to the simulated dataset. For simplicity we only assumed equal rate to all SFS: 3%, 5% and 10%.

Change in gBGC/SCU intensity
We first tested the rate of false positive in detection of different ancestral and recent gBGC/SCU by applying the model to simulated datasets under constant B/S. The model with error correctly retrieved both B0 and B1 values and the type I error is close to the 5% ( Figure S2.1). However, when polarization errors are not taken into account this leads to overestimating B1, as already shown by [1], and also B0 but at a lower level. Overall this leads to high rate of false positives when polarization errors are high and not taken into account. The problem is less pronounced for high B values (compare left B = 0.5) and right (B = 1) panels.

Figure S2.1: Estimations under constant gBGC (or SCU).
A population of size N = 50 is simulated for 1000 generations with B = 0.5 (left) or B = 1 (right). We assumed a GC content of 0.5 and mutation rates were set to 2.10 -3 , 10 -3 and 5.10 -4 for S!W, W!S and neutral mutations respectively. 5000 independent sites were simulated, corresponding to datasets of around 5,000 SNPs and 15,000 substitutions. Different error rates (from 0 to 10%) were applied as indicated. One hundred datasets of 20 chromosomes were simulated for each combination. We applied the estimation model with two gBGC episodes allowing for polarization errors (green boxes) or not (orange boxes). Dashed lines represent either the simulated B values or the 5% threshold for the p-values. On the last two panels, number above boxes corresponds to the percentage of significant tests (at the 5% level).   When B either decreases or increases, B1 is well estimated when polarization errors are taken into account ( Figure S2.2). However, B0 is less well estimated and is slightly over or underestimated, being closer to the B1 value than expected. However, it is worth noting that fixations also occur during the second period so that the model actually estimate the weighted mean of B0 and B1, which is accurately done (dotted lines on Figure S2.2). Overall, the power to detect different B values is rather high, although it decreases as polarization error rates increases. The model that does not take polarization errors into account could appear better to detect variation in heterogeneity of gBGC or SCU under certain conditions but it would be for bad reasons and the power varies with scenarios and error rates.

Figure S2.3: Estimations under increasing (upper panels) or decreasing (lower panels) gBGC (or SCU) for different time period
Same legend as in Figure S2.1 except that we first simulated T0 generations with B = 0.5 (upper) or 1 (lower) then 400 generations with B = 1 (upper) or 0.5 (lower). It corresponds to datasets of around 5,000 SNPs and 8,000, 12,000, 18,000 and 30,000 substitutions from T0 = 200 to 1600. White dotted lines correspond to the weighted average of B over the two periods.
We tested further the robustness of the model by letting time, T0, vary when population evolves under the B0 regime ( Figure S2.3). When T0 is rather low, B0 is much closer to B1 than expected and the power to detect changes in gBGC/SCU intensity is rather low. This is expected as the B0 value estimated in the model corresponds to an average over what currently fixed mutations have experienced during their lifetime. Here, B1 is less affected because T1 is rather large (= 4N) so most polymorphic mutations experienced the same conditions (see below for various demographic scenarios). Complex demographic scenarios We also tested several demographic scenarios, keeping b or s constant but letting N vary. When N varies, the relevant Ne for polymorphism-based measure (estimate of B1 here) can be obtained by computing half the average coalescent time between two gene copies under the given demographic scenario [6]. This is strictly true for neutral mutations but should be an accurate approximation here because gBGC/selection is weak (4Neb of the order of 1). The coalescent Ne is given by: where !"#$ is the probability of coalescing at time t. For demographic scenarios involving discrete changes in N described by a vector of population sizes N = (N1,N2,…) and a vector of durations T = (T1,T2,…), !"#$ can be computed as follows: for in the th time period (2.10) For fixation-based measures (estimate of B0 here), the harmonic mean of N can give an accurate approximation [7].
We explored the following scenarios (without polarization errors for simplicity): For each scenario we compared estimates of B0 with 2Nhb, where Nh is the harmonic mean of N in the scenario, and B1 with 2Ncb, where Nc is the coalescent Ne computed using (2.9) and (2.10). Results are presented on Figure S2.4 and show that the method is accurately accommodates demography for most scenarios.

Figure S2.4: Estimations under various demographic scenarios
Parameters of the different scenarios are given in the text above. Other parameters as in Figure S2

Change in mutation bias
If we relax the assumption of constant mutational bias, changes in both bias and selection/gBGC are no more identifiable. Recent S/B estimates are not affected but ancestral estimates are underestimated (resp. overestimated) when mutation bias decreases (resp. increases). However, the method is still powerful to detect departure from a constant regime of selection/mutation/drift equilibrium, although here the model detects a difference in B/S instead of a difference in mutation bias ( Figure S2.5).

Figure S2.5 (above): Estimations under decreasing (Left panels) or increasing (right panels) mutation bias
Same legend as in Figure S2.1 with B = 0.5 and a change in mutation bias after 500 generations. Mutation rates were set to 10 -3 and 5.10 -4 for W!S and neutral mutations respectively for the 1,000 generations. S!W mutation rate was set to 2.10 -3 (resp. 1.5.10 -3 ) during the 500 first generations and to 1.5.10 -3 (resp. 2.10 -3 ) during the 500 last ones for decreasing (resp. increasing bias). We also tested the power of the full model to distinguish between gBGC and selection. As we were interested in this specific question we only perform simulations without error (but applied the estimation model with error as in the main text). As we focus on the distinction between gBGC and selection we also only considered scenarios where B0 = B1 and S0 = S1. We considered two cases, either B = S = 0.5 or B = 0.75 and S = 0.25. The second case mimics the case similar to our observation where gBGC is stronger than SCU. We then assumed than SNPs and substitutions are fully balanced among the nine categories, unbalanced with an excess of NN, WSUP and SWPU categories, or highly unbalanced with no WSUP and SWPU, as observed in some species such as banana (see Table S2.1 above for categories definition). We then applied the same procedure as in the main text, testing for all 16 possible models.
On average, the method retrieves well the simulated values, especially for B0 and S0, but B1 tends to be slightly underestimated and S1 slightly overestimated. This could be due to the fact that the model assumes that polarisation errors affect mutations depending on their GC status not on their preference ( Figure S2.6). As a consequence, when S = B, the best model (according to AIC) more often assumes that B1 = 0 than S1 = 0. For unbalanced datasets, the best model often assumes than at least B or one S is null ( Figure S2.7). However, SCU tends to be preferred to gBGC when the two forces are of similar intensities and even when B > S, SCU is preferred to gBGC in some simulations. Overall, these simulations suggest that our result of higher gBGC than SCU is robust and conservative.

Figure S2.6: Joint estimations of ancestral and recent gBGC and SCU.
A population of size N = 50 is simulated for 1000 generations with constant B and S as indicated above each panel. We assumed a GC content of 0.5 and mutation rates were set to 2.10 -3 , 10 -3 and 5.10 -4 for S!W, W!S and neutral mutations respectively. 6000 independent sites were simulated, corresponding to datasets of around 6,500 SNPs and 18,000 substitutions. For the three categories of mutations we assumed (i) an equal proportion of neutral, U!P and P ! U (=balanced) (ii) 83.33% of NN, WSUP and SWPU and 8.33% of the others (= unbalanced) or (iii) 83.33% of NN and 8.33% for NUP and NPU, and 91% of WSUP and SWPU, 9% of WSN and SWN and 0% of WSPU and SWUP (= highly unbalanced). One hundred datasets of 20 chromosomes were simulated for each combination. We applied the estimation model with two gBGC and SCU episodes with polarization errors. Dashed lines represent the simulated B and S values. Simulation conditions as in Figure S2.6. Sixteen models where tested corresponding to all combinations where each of the four parameters (B0, B1, S0, S1) can be freely estimated or set to 0. Model choice was performed according to AIC.
We then tested the effect of imperfect identification of preferred and unpreferred codons. This problem is more likely for four-fold and six-fold degenerated codons. For these codons, misidentification can both change U!P and P!U mutations but also neutral ones. We thus chose this case for simulation by considering that two codons are misidentified over the four according to the following scheme: codons 1 and 2 are preferred and 3 and 4 are unpreferred but codons 1 and 3 are identified as preferred and 2 and 4 as unpreferred. The mutation matrix is thus changed as follows: " We reused previous simulations with B0 = B1 = S0 = S1 = 0.5 and introduced 10%, 20% and 50% errors according to the matrix scheme: half of N mutations are assigned to UP and the other half to PU, half UP mutations are assigned to N, one quarter to PU and the last quarter stays UP, and half PU mutations are assigned to N, one quarter to UP and the last quarter stays PU. Misidentification of codon preferences induces underestimation of S but does not affect B for balanced datasets ( Figure S2.8), but also tends to overestimate B for highly unbalanced datasets ( Figure S2.10) and unbalanced dataset with very high error rates (50%) (Figure S2.9). For balanced and unbalanced datasets, SCU is still detected, except for very high error rates (50%) (Figures S2.8, S2.9 and S2.11). Moderate error rate is problematic only for highly unbalanced dataset ( Figures  S2.10 and S2.11) for which it is difficult to distinguish gBGC and SCU event without error rates ( Figure S2.7).  Figure S2.