Conceived and designed the experiments: FH AEW. Performed the experiments: FH AEW. Analyzed the data: FH AEW. Contributed reagents/materials/analysis tools: FH AEW. Wrote the paper: FH AM AEW.
The authors have declared that no competing interests exist.
The genomic GC-content of bacteria varies dramatically, from less than 20% to more than 70%. This variation is generally ascribed to differences in the pattern of mutation between bacteria. Here we test this hypothesis by examining patterns of synonymous polymorphism using datasets from 149 bacterial species. We find a large excess of synonymous GC→AT mutations over AT→GC mutations segregating in all but the most AT-rich bacteria, across a broad range of phylogenetically diverse species. We show that the excess of GC→AT mutations is inconsistent with mutation bias, since it would imply that most GC-rich bacteria are declining in GC-content; such a pattern would be unsustainable. We also show that the patterns are probably not due to translational selection or biased gene conversion, because optimal codons tend to be AT-rich, and the excess of GC→AT SNPs is observed in datasets with no evidence of recombination. We therefore conclude that there is selection to increase synonymous GC-content in many species. Since synonymous GC-content is highly correlated to genomic GC-content, we further conclude that there is selection on genomic base composition in many bacteria.
Shortly after it was proved that DNA was the genetic material it became apparent that organisms, and in particular bacteria, use the four letters of the genetic code to very different extents; the use of G and C varies from less than 20% in some species, to more than 70% in others. This variation in the use of G and C is usually attributed to differences in the pattern of mutation between species. Here we test this hypothesis, and show that, on the contrary, there seems to be pervasive selection on the base composition of the bacterial genome, particularly in GC-rich species. This suggests that many, if not all, sites may be subject to natural selection in many bacteria. Unfortunately, the reason why some bacteria should be selected for high GC-content remains unclear.
Bacteria show an astonishing diversity of genomic GC-contents, from species such as the endosymbiont
The reasons for the variation in genomic GC-content are controversial. In the first example of a “neutral” theory being used to explain a phenomenon at the molecular level, Sueoka
For a few bacteria there is evidence that the genomic GC-content is not a simple consequence of mutation bias. The mutation pattern has been directly measured in
A similar excess of GC→AT substitutions is seen in the pseudogenes between strains of
Here we test whether genomic GC-content is a simple consequence of mutation bias by investigating the pattern of synonymous genetic variation in 149 bacterial species.
To investigate whether genomic GC-content is solely a consequence of mutation bias, we analysed the pattern of synonymous polymorphism at the third position of 4-fold degenerate codons. Since the GC-content of 4-fold sites (GC4) is strongly correlated with genomic GC-content
The concept of a species is potentially problematic in bacteria because they do not undergo conventional sexual reproduction. However, “population genetic” species do exist, in the sense that strains exist that undergo random genetic drift and selection together
Ideally we would infer the pattern of mutation from our SNPs within a likelihood framework, integrating across all possible phylogenetic trees and ancestral states. However, this approach was not possible because a closely related outgroup was not available for most datasets. We therefore used two alternative methods to infer the direction of mutation. In the first, we used the allele frequencies, inferring the minor allele to be the new mutation; in the second, we reconstructed the phylogenetic tree between strains for each gene and used parsimony to infer the ancestral state and hence the direction of mutation. These two approaches gave qualitatively similar results, but we present the results from the frequency method because its potential biases are easier to estimate; simulations suggest that parsimony typically outperforms the frequency method, but its biases are less easy to predict. We restrict the analysis to datasets in which synonymous diversity was less than 0.1 for two reasons: first, to concentrate the analysis on strains that are likely to form a species, in the sense that they undergo selection and random genetic drift together, and second, to limit problems with violation of the infinite sites assumption (the assumption that each mutation is fixed or lost at a site before the next one occurs).
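The frequency method described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the handling of ties (skipped) and the toy alignment columns are all assumptions made for the example.

```python
from collections import Counter

STRONG = {"G", "C"}      # GC bases
WEAK = {"A", "T"}        # AT bases
BASES = STRONG | WEAK    # anything else (gaps, N) is ignored


def infer_direction(column):
    """Classify one site by the frequency method.

    column: the bases observed at one 4-fold degenerate site, one per strain.
    The major allele is taken as ancestral and the minor allele as the new
    mutation. Returns "GC->AT", "AT->GC", or None if the site is uninformative.
    """
    counts = Counter(b.upper() for b in column if b.upper() in BASES)
    if len(counts) != 2:
        return None  # monomorphic, or more than two alleles
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    if n_major == n_minor:
        return None  # tie: no minor allele (skipped here by assumption)
    if major in STRONG and minor in WEAK:
        return "GC->AT"
    if major in WEAK and minor in STRONG:
        return "AT->GC"
    return None  # G<->C or A<->T: GC-content unchanged


def count_directions(columns):
    """Tally the numbers of GC->AT and AT->GC SNPs over many sites."""
    tally = Counter(infer_direction(col) for col in columns)
    return tally["GC->AT"], tally["AT->GC"]


# Hypothetical columns from an alignment of 8 strains.
sites = ["GGGGGGGA", "TTTTTTCC", "GGGGCCCC", "AAAAAAAT", "CCCCCCCC"]
n_gc_to_at, n_at_to_gc = count_directions(sites)
print(n_gc_to_at, n_at_to_gc)
```

Only the first two toy columns are informative; the others are a G↔C tie, an A↔T change and a monomorphic site.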
Overall we observe a large excess of GC→AT mutations at 4-fold sites (11045 GC→AT versus 8309 AT→GC, p<0.0001 using a two-tailed binomial test), with similar patterns evident at 2-fold sites (6282 GC→AT and 5196 AT→GC, p<0.0001) (
The proportion of GC↔AT mutations that are GC→AT,
The figure shows the correlation between Z, the proportion of GC↔AT SNPs that are GC→AT, and GC4, across 149 bacterial species.
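As an illustration of how Z and the accompanying binomial test can be computed from the pooled SNP counts reported above, the following sketch uses only the Python standard library. The function names are assumptions, and the exact test is evaluated in log space so that the large counts do not overflow; it is not necessarily the implementation the authors used.

```python
import math


def proportion_gc_to_at(n_gc_to_at, n_at_to_gc):
    """Z: the proportion of GC<->AT SNPs that are GC->AT."""
    return n_gc_to_at / (n_gc_to_at + n_at_to_gc)


def binomial_two_tailed(k, n):
    """Exact two-tailed binomial test of k successes in n trials, p = 0.5.

    With p = 0.5 the distribution is symmetric, so the two-tailed p-value
    is twice the probability of the more extreme tail, capped at 1.
    """
    k_hi = max(k, n - k)
    log_half_n = n * math.log(0.5)
    tail = 0.0
    for i in range(k_hi, n + 1):
        # log of C(n, i) * 0.5**n, via log-gamma to avoid huge integers
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + log_half_n)
        tail += math.exp(log_pmf)
    return min(1.0, 2.0 * tail)


# Pooled counts quoted in the text.
z_4fold = proportion_gc_to_at(11045, 8309)
p_4fold = binomial_two_tailed(11045, 11045 + 8309)
z_2fold = proportion_gc_to_at(6282, 5196)
p_2fold = binomial_two_tailed(6282, 6282 + 5196)

print(f"4-fold: Z = {z_4fold:.3f}, p = {p_4fold:.3g}")
print(f"2-fold: Z = {z_2fold:.3f}, p = {p_2fold:.3g}")
```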
Phylum | Class | No. of species | GC4 range | Mean Z (GC4<0.34) | Mean Z (GC4>0.34) |
Actinobacteria | Actinobacteria | 3 | 0.64–0.93 | no species | 0.64 |
Bacteroidetes/chlorobi | Bacteroidetes | 3 | 0.12–0.46 | 0.43 | 0.36 |
Chlamydiae/verrucomicrobia | Chlamydiae | 2 | 0.21–0.30 | 0.45 | no species |
Cyanobacteria | Chroococcales | 2 | 0.38–0.51 | no species | 0.53 |
Cyanobacteria | Nostocales | 3 | 0.26–0.31 | 0.45 | no species |
Cyanobacteria | Oscillatoriales | 2 | 0.41 | no species | 0.38 |
Cyanobacteria | Stigonemales | 1 | 0.40 | no species | 0.59 |
Firmicutes | Bacilli | 27 | 0.085–0.68 | 0.44 | 0.58 |
Firmicutes | Clostridia | 5 | 0.050–0.28 | 0.34 | no species |
Proteobacteria | Alphaproteobacteria | 16 | 0.099–0.94 | 0.43 | 0.65 |
Proteobacteria | Betaproteobacteria | 6 | 0.66–0.96 | no species | 0.67 |
Proteobacteria | delta/epsilon subdivisions | 6 | 0.15–0.99 | 0.49 | 0.78 |
Proteobacteria | Gammaproteobacteria | 62 | 0.095–0.95 | 0.50 | 0.66 |
Spirochaetes | Spirochaetes | 7 | 0.12–0.60 | 0.45 | 0.54 |
Tenericutes | Mollicutes | 4 | 0.023–0.24 | 0.33 | no species |
The excess of GC→AT mutations in GC-rich species and the excess of AT→GC mutations in AT-rich species could potentially be due to sequencing error or a violation of the infinite sites assumption. The infinite sites assumption is important for the following reason. Let us imagine that we have a GC-rich species in which high GC content is a consequence of mutation bias. This implies that AT nucleotides are more mutable than GC nucleotides, but when mutation rates are low, such that all mutations occur at sites which are monomorphic, we expect on average to observe equal numbers of GC→AT and AT→GC mutations
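The expectation of equal numbers can be made explicit with a short calculation. The notation is an assumption made for illustration: u is the per-site mutation rate from GC to AT, v is the rate from AT to GC, and f is the equilibrium GC-content under mutation bias alone.

\[
f = \frac{v}{u+v}
\quad\Longrightarrow\quad
\underbrace{f\,u}_{\text{GC}\to\text{AT per site}}
= \frac{uv}{u+v}
= \underbrace{(1-f)\,v}_{\text{AT}\to\text{GC per site}}
\]

However strong the mutation bias, the rarer base type is exactly as much more mutable as it is rarer, so a stationary composition implies Z = 0.5 when the infinite sites assumption holds.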
However, several lines of evidence suggest that violation of the infinite sites assumption is not responsible for the biases in SNPs that we observe. First, we note that the frequency method will be unbiased under the mutation bias hypothesis when base composition is stationary and the GC-content is 50%, whether or not there is a violation of the infinite sites assumption: Z>0.5 in 6 out of 7 species with GC4 between 0.45 and 0.55 (p = 0.13), and there are 770 GC→AT and 529 AT→GC mutations in these species (p<0.0001). Second, we note that if we restrict the data to singletons, which are more likely to reflect the pattern of mutation, we find a large excess of GC→AT mutations in GC-rich species and the opposite pattern in AT-rich species: Z>0.5 in 69 out of 82 GC-rich species (p<0.0001), and Z<0.5 in 47 of 67 AT-rich species (p = 0.001). However, to further investigate whether the biases could be due to the infinite sites assumption we used population genetic theory to predict the value of Z, allowing a violation of the infinite sites assumption (see
The figure shows the predicted value of Z, allowing for a violation of the infinite sites assumption, assuming that base composition is due to mutation bias alone and base composition is stationary, plotted against GC4 under the (A) constant rate and (B) exponential rate models, along with the effect of removing this bias from the observed value (Z-Zpred) for the (C) constant and (D) exponential models.
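The species-level comparisons quoted above (for example, Z>0.5 in 6 of 7 species with GC4 near 0.5) count species rather than SNPs, so they can be read as sign tests. The sketch below assumes a two-tailed exact binomial test with p = 0.5, which reproduces the reported p = 0.13 for the 6-of-7 case; the function name is hypothetical.

```python
from math import comb


def sign_test_two_tailed(successes, n):
    """Exact two-tailed sign test: binomial with p = 0.5, for modest n."""
    k_hi = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k_hi, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_mid_gc = sign_test_two_tailed(6, 7)     # species with GC4 of 0.45 to 0.55
p_gc_rich = sign_test_two_tailed(69, 82)  # singletons, GC-rich species
p_at_rich = sign_test_two_tailed(47, 67)  # singletons, AT-rich species

print(p_mid_gc, p_gc_rich, p_at_rich)
```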
The mutation rate is known to differ between sites in bacteria, so we also investigated a model in which the mutation rate was exponentially distributed across sites. An exponential distribution of rates represents substantial variation in the mutation rate: the mutation rate of the 95th percentile is ∼60-fold higher than that of the 5th percentile, and that of the 99th percentile is ∼460-fold higher than that of the 1st percentile. As expected, under an exponential distribution the biases in Zpred are more extreme than under a constant rate model (
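The percentile ratios quoted for the exponential model follow directly from the exponential quantile function, Q(p) = −mean × ln(1 − p). The short check below (function names are illustrative) gives ratios of about 58 and about 458, in line with the rounded values of ∼60 and ∼460.

```python
import math


def exponential_quantile(p, mean=1.0):
    """Quantile function of an exponential distribution with the given mean."""
    return -mean * math.log(1.0 - p)


def percentile_ratio(upper, lower):
    """Ratio of two exponential quantiles; the mean cancels out."""
    return exponential_quantile(upper) / exponential_quantile(lower)


ratio_95_to_5 = percentile_ratio(0.95, 0.05)
ratio_99_to_1 = percentile_ratio(0.99, 0.01)
print(round(ratio_95_to_5, 1), round(ratio_99_to_1, 1))
```

Because the mean cancels, these ratios describe any exponential distribution of rates, whatever the average mutation rate.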
The excess of GC→AT mutations in GC-rich species does not seem to be due to sequencing error since the results remain qualitatively unaffected by the removal of singletons: Z>0.5 for 73% of GC-rich species (p<0.0001).
The pattern of SNPs implies, assuming that GC4 is determined by mutation bias alone, that most GC-rich species are declining in GC4. This can be illustrated by using the observed numbers of GC→AT and AT→GC mutations to predict the GC4 value, GC4pred, to which each species would evolve under mutation bias if there was no selection (
The figure shows the relationship between GC4pred, the GC4 to which each species is predicted to evolve under mutation pressure, and the current GC4.
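One way to obtain such a prediction is sketched below. It assumes that the relative GC→AT rate per GC site is proportional to the number of GC→AT SNPs divided by the current GC4, that the relative AT→GC rate is proportional to the number of AT→GC SNPs divided by (1 − GC4), and that the equilibrium GC-content is v/(u+v). These are assumptions made for illustration, and the counts in the example are hypothetical rather than taken from the data.

```python
def predicted_equilibrium_gc4(n_gc_to_at, n_at_to_gc, gc4):
    """GC4 expected at mutational equilibrium, given SNP counts and current GC4."""
    if not 0.0 < gc4 < 1.0:
        raise ValueError("gc4 must lie strictly between 0 and 1")
    u = n_gc_to_at / gc4          # relative GC->AT rate per GC site
    v = n_at_to_gc / (1.0 - gc4)  # relative AT->GC rate per AT site
    return v / (u + v)


# Hypothetical GC-rich species: GC4 = 0.8 with an excess of GC->AT SNPs.
declining = predicted_equilibrium_gc4(600, 400, 0.8)

# Equal numbers of SNPs in each direction imply a stationary composition.
stationary = predicted_equilibrium_gc4(500, 500, 0.8)

print(round(declining, 3), round(stationary, 3))
```

In the first case the predicted value lies below the current GC4, which is the signature of a species that would be declining in GC-content under mutation bias alone.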
It is well known that selection acts upon synonymous codon use in bacteria to increase translational efficiency
The figure shows the relationship between GC4 for putatively highly expressed genes and GC4 for all other annotated genes. The line is for GC4high = GC4other.
The base composition of many eukaryotes is thought to be affected by biased gene conversion (BGC)
Bacteria can undergo horizontal gene transfer (HGT) in which a gene, or gene fragment, from a distantly related species can be incorporated into the genome
In contrast to nhHGT, hHGT could explain the excess of GC→AT SNPs in GC-rich species. It is likely that many genes or gene fragments transferred by hHGT will be less extreme in GC-content than the genome they integrate into if the genome is GC-rich. The introduced sequence may therefore generate a series of GC→AT SNPs. This situation will be temporary because either the new sequence will be lost, or it will become fixed. If it becomes fixed, it will then evolve to the GC-content of its new host, in the process generating an excess of AT→GC SNPs under the mutation bias hypothesis. Thus an excess of GC→AT SNPs can only be generated if AT-rich sequences are continually introduced by hHGT and then lost. This process would have to be pervasive to explain our results, affecting the majority of GC-rich species and generating most of the SNPs within them. This seems unlikely. However, to investigate the matter further we used Maynard Smith's
We have shown that there is a large excess of GC→AT synonymous SNPs segregating at 4-fold degenerate sites in GC-rich bacteria, with AT-rich bacteria showing the opposite pattern. These patterns are found across different phyla and classes of bacteria, suggesting that they are not restricted to select groups of bacteria. We have shown that the excess in GC-rich bacteria is probably not due to sequencing error, a violation of the infinite sites assumption, translational selection, biased gene conversion or horizontal gene transfer. In contrast, the excess of AT→GC SNPs in AT-rich species may be due to a violation of the infinite sites assumption, translational selection, or selection for low GC-content. The excess of GC→AT SNPs in GC-rich species is consistent with selection
The figure shows the effect of selection in favour of GC on
Our results are in accord with those in an accompanying paper in this journal by Hershberg and Petrov
We have investigated whether the bias towards GC→AT SNPs in GC-rich species is due to biased gene conversion by removing all datasets that fail the four gamete test (FGT), and by testing whether GC-content and the bias towards GC→AT SNPs are correlated with measures of recombination. Biased gene conversion is a process that drives mutations through a population; since it is not expected to affect the frequency of mutations at linked sites, it is expected to generate four gametes during this process. Nevertheless, the FGT may miss some datasets that are undergoing gene conversion, and we cannot completely rule out biased gene conversion as an explanation. Intriguingly, it has recently been shown that the GC-content across the
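For readers unfamiliar with the four gamete test, the sketch below shows its logic on a set of aligned haplotypes. Under the infinite sites assumption and without recombination, two biallelic sites can show at most three of the four possible allele combinations, so observing all four indicates recombination or recurrent mutation. The function name and the toy alignments are assumptions; this is not the authors' implementation.

```python
from itertools import combinations


def fails_four_gamete_test(haplotypes):
    """Return True if any pair of biallelic sites shows all four gametes.

    haplotypes: equal-length strings, one aligned sequence per strain.
    """
    n_sites = len(haplotypes[0])
    columns = [[h[i] for h in haplotypes] for i in range(n_sites)]
    biallelic = [col for col in columns if len(set(col)) == 2]
    for site_a, site_b in combinations(biallelic, 2):
        gametes = set(zip(site_a, site_b))  # allele pairs seen together
        if len(gametes) == 4:
            return True
    return False


# Toy alignments of two variable sites each.
recombinant = ["AA", "AT", "TA", "TT"]  # all four combinations present
clonal = ["AA", "AT", "TT", "TT"]       # only three combinations present

print(fails_four_gamete_test(recombinant), fails_four_gamete_test(clonal))
```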
It has been suggested that there is a universal mutational bias in both prokaryotes and eukaryotes towards AT
Endosymbiotic bacteria typically have low GC-contents
Although most obligate endosymbionts have low GC content,
Although we have evidence of selection on GC-content in GC-rich bacteria, the nature of the selective agent is unclear. Recently, Foerstner et al.
The Popset database of Genbank was searched for the keyword “bacteria”. From this we extracted datasets in which we had at least 8 sequences from the same species, defined as a group of bacterial strains with the same species and genus name. These sequences were translated, aligned using MUSCLE
Using population genetic theory we can infer the expected proportion of SNPs that are GC→AT, Z, under models in which the base composition is determined by mutation bias alone, and in which there is both selection and mutation bias acting. The direction of mutation is assumed to be inferred from the allele frequencies. We only consider changes between GC and AT so the system is effectively biallelic. Let the mutation rate from GC→AT be
In
We can also use the equations above to demonstrate that Zpred is expected to be generally greater than 0.5 if selection favours GC irrespective of the mutation bias (and since the system is symmetrical we expect Z<0.5 when selection favours AT) (
GC-content correlations in prokaryotes. The figure shows the GC-content of the (A) first two and (B) third codon positions versus genomic GC-content for 855 complete bacterial genomes.
(0.28 MB EPS)
Using parsimony to infer the direction of SNPs. The figure shows the relationship between the proportion of GC↔AT SNPs that are GC→AT, Z, and GC4, where the direction of a SNP is inferred by parsimony. The line is where Z = 0.5.
(0.21 MB PDF)
The equilibrium GC content under the mutation bias model. The figure shows the relationship between GC4pred, the GC4 to which each species is predicted to evolve under mutation pressure from SNPs inferred by parsimony, and the current GC4.
(0.28 MB EPS)
Violation of the infinite sites assumption. The figure shows the relationship between Zpred and πs under the neutral equilibrium model for different equilibrium GC-contents for (A) 8 strains and (B) 50 strains, when violation of the infinite sites assumption is allowed. The lines from left to right are for f = 0.95, 0.9, 0.8, 0.7, 0.6.
(0.30 MB EPS)
The species analysed, along with their phylum and class; the numbers of GC→AT and AT→GC mutations at 4-fold sites (U4 and V4 respectively) and at 2-fold sites (U2 and V2 respectively); Z; GC4; the GC4 to which the sequence is predicted to evolve under mutation bias alone; GC2, the GC-content of 2-fold sites; and Zpred under the uniform and exponential models. Also included are the nucleotide diversity for GC↔AT mutations at 4-fold sites and the genomic GC4 of highly expressed and other genes.
(0.11 MB XLS)
The authors are very grateful to Toni Gossman for helpful discussion and to Nina Stoletzki, Maria Warnefors and several anonymous referees for numerous helpful comments on earlier versions of this manuscript.