Linking high GC content to the repair of double strand breaks in prokaryotic genomes

Genomic GC content varies widely among microbes for reasons unknown. While mutation bias partially explains this variation, prokaryotes near-universally have a higher GC content than predicted solely by this bias. Debate surrounds the relative importance of the remaining explanations of selection versus biased gene conversion favoring GC alleles. Some environments (e.g. soils) are associated with a high genomic GC content of their inhabitants, which implies that either high GC content is a selective adaptation to particular habitats, or that certain habitats favor increased rates of gene conversion. Here, we report a novel association between the presence of the non-homologous end joining DNA double-strand break repair pathway and GC content; this observation suggests that DNA damage may be a fundamental driver of GC content, leading in part to the many environmental patterns observed to-date. We discuss potential mechanisms accounting for the observed association, and provide preliminary evidence that sites experiencing higher rates of double-strand breaks are under selection for increased GC content relative to the genomic background.

general, our sampling should not be an issue since Ku tends to be uniformly present or absent within a species. We have clarified this point in the methods (lines 385-390) and added S11 Fig:

"To assess trait vs. Ku relationships we sampled a single genome per species from our RefSeq dataset to determine Ku presence/absence in species with trait data available (617, 2062 without). Most species either always have or always lack Ku (S11 Fig), meaning that sampling should give a reliable estimate of whether we can expect a species to typically have Ku."
With respect to S1 Fig (now S2 Fig)

"The correlation of trait values for microbial species with their average genomic GC content is similar to the correlation of trait values with the presence/absence of Ku. Note that each point is an individual trait, as shown in Fig 1. The dashed diagonal line indicates the x = y line. For a direct analysis of the relationship between GC content and Ku incidence among organisms see Fig 2 and Table 1." and we have also noted the correlation between Ku and GC across all genomes in the main text (lines 104-107):
"Using a large set of genomes from RefSeq we found that genomes with Ku have a dramatically shifted GC content relative to genomes without Ku (Fig 2A, S3 Fig; Pearson

correlation between GC content and Ku across genomes, r = 0.54, p < 2.2×10−16)"
The main issue with the results of the manuscript is that the authors are using the presence of Ku as evidence for more frequent DSBs. As stated by the authors, the NHEJ pathway is either present or absent but DSBs can occur at different rates. Figure 2A is rather convincing but it would be interesting to indicate the size of each sample. Also, it is not clear to me whether the data were computed on the entire dataset of genomes or if only one genome were selected for each species.
We agree with the reviewer (and discuss in the main text, as the reviewer notes) that Ku presence/absence is an imperfect measure of the rate of DSB formation an organism experiences. That said, we do expect that NHEJ will be favored in specific environments (e.g. , Fig 1), and it's widespread but sparse distribution makes it a promising indicator of frequent damage. We hope to motivate future work that surveys the rate of DSB formation directly in the environment, though as we note in our conclusions this is a non-trivial task.
We also thank the reviewer for pointing out a part of the analysis that was inadequately described. While the sample size was indicated in the methods (21389 out of 104297 genomes with Ku), it was not specified on the actual figure. We have modified the legend of Fig 2 to reflect this:

"The relationship between genomic GC content and the NHEJ pathway in prokaryotes. (a) Microbes that code for the Ku protein tend to have much higher genomic GC content than those that do not (all RefSeq assemblies shown, 21389 out of 104297 genomes encode Ku)."
Species that don't encode Ku appear to present a wider range of GC-content while species that encode Ku appear much more biased toward high GC-content. It would be informative to explore and discuss the rare species that encode Ku but present a relatively low GC-content. These cases might be insightful, but maybe the authors did not find anything worth reporting in the manuscript.
It is true that Ku-encoding organisms have a much narrower range of GC content than Kulacking organisms. This is partially attributable to the fact that there are many more Ku-lacking organisms in the dataset. At the same time, we might expect this pattern for other reasons. Specifically, while the presence of Ku might indicate environmental conditions with high damage and low growth, the lack of Ku does not necessarily indicate low rates of damage. Organisms without Ku but still experiencing high rates of DSB formation would still be expected to have high GC content under our hypothesis. We make this point briefly in our conclusions (lines 352-357):

"While the presence of NHEJ cannot single-handedly explain high GC content in all organisms (there are many organisms incapable of NHEJ that still have high GC content, Fig 2), it is possible that DSB formation can (or at least come close). For example, Deinococcus radiodurans is resilient to extremely high rates of DSB formation [61] and has high genomic GC content, but lacks Ku."
The reviewer makes an interesting point about the organisms encoding Ku but with low GC content. In fact, 80% of these genomes come from the Baccilaceae. This family typically has low GC content (>99% of genomes in RefSeq have <50% GC, and 76% have <40% GC), and an ancestral state reconstruction suggests that the MRCA of the group had Ku (which has been lost multiple times). We have added these analyses to the manuscript (lines 170-178) along with S8 Fig:

"Finally, we note that there is a small subset of genomes in Fig 2 that both encode Ku and have a low GC content (< 40%). Of these, 80% belong to the family Baccilaceae. This family has uniformly low GC content (> 99% of genomes have GC content < 50%, and 76% have GC content < 40%), and an ancestral state reconstruction suggests that its most recent common ancestor encoded Ku (S8 Fig and see methods), though Ku has been lost multiple times across the group. We do not know why the Baccilaceae violate the pattern seen across the rest of the dataset; it may be an accident of evolutionary history or some particular aspect of this group's ecology and/or physiology."
and (lines 422-429): The authors argue the restriction systems might elevate the frequency of DSBs and use the presence of RM systems as an indirect indicator for elevated DSBs. If we follow their logic, we might expect species encoding larger numbers of RM systems to present higher GC-content. I find this argument not very convincing considering that Helicobacter pylori encodes an exceptionally high number of RM systems (Oliveira, Touchon and Rocha, NAR 2014) but present a relatively low GC-content (~39%). This doesn't necessarily follow, since even genomes with an exception number of RM systems are likely to potentially target only a small proportion of the genome. It would be possible to see a local effect of RM systems, which have very specific target sites, without seeing much overall signal for high GC across the genome. The very specificity of these systems constrains the magnitude of these effects (Especially since we see that often target sites themselves are selected against). We now note this in the main text (Lines 323-329):

This is similar to the logic described in Lassalle et al. [7] that you might see correlations between recombination rate and GC content along a genome, but not within genomes.
Finally, I think it is interesting that GC-content and genome length correlate. This observation is not new, but it supports the hypothesis of the authors. I believe it can be safely assumed that, overall, bacteria with larger genomes endure more frequent DSBs. Under the authors' assumption, the higher GC-content of larger genomes could be explained by the need to repair more frequent DNA breaks. I think this was not explicitly formulated in the manuscript and could be emphasized.
We generally agree with the reviewer and, in fact, have made this point in a previous version of this manuscript (submitted to another journal). Unfortunately a different reviewer took issue with this point previously, and since we deemed it non-essential to our overall message we opted to remove it. Specifically, they questioned the assumption that larger genomes are more prone to breaks, and suggested that the rate of DSB formation would have to scale super-linearly with genome size to see this type of effect (though we suspect that this intuition may be incorrect, since all breaks must be repaired before replication can proceed and thus are likely felt on a pergenome rather than a per-base scale). Nevertheless, since we do not know of any datasets describing how the rate of DSB formation scales with genome size, we are wary of making this point in the paper. It occurs to us that we already speculate about mechanisms to a degree some readers may find excessive, and we think it may be wise to restrain ourselves here (since this is a minor, though interesting, point).
Reviewer #2: I have reviewed this article before for another journal. In this new version, most of my initial critics have been addressed and I find the article quite good. The relationships between genomic GC-content and the presence of the NHEJ pathway is an interesting point to bring to the debate on the evolution of GC-content in genomes. However, I still have difficulties with the hypothesis of the authors that there is selection for high GC-content to favor double strand breaks repair. It is not clear to me why the hypothesis that NHEJ is itself a repair mechanism that is biased toward GC (just like BGC in HR) is not considered and discussed. The argument that GC-rich regions are better repaired seems relatively weak to me.
We thank the reviewer for spending multiple rounds of review on our manuscript. We do feel that the work has substantially benefited from previous review.
There is an important distinction here between the mutation-generation process and the fixation process. The data (fig 3) shows that GC alleles are fixed at a higher rate than AT alleles (not simply produced at a higher rate). If Ku repair favored the creation of GC alleles, we would see a mutational rather than a fixation bias among Ku-favoring organisms. BGC is slightly different since it is not producing novel GC alleles, but rather is increasing their rate of spread across a population. We have clarified this in the main text (lines 162-169):

"Thus, the association between Ku and genomic GC content is not due to differences in mutational bias. This implies that DSBs are either leading to selection for high GC content or influencing the rate and/or biases of homologous recombination to increase the overall action of BGC. We emphasize that this e#ectively rules out the possibility that biases during NHEJ repair are causing the observed patterns. NHEJ repair may be error-prone, but if those errors (i.e., mutations) were driving genome-wide GC-bias it would affect the GC-bias of polymorphisms as well as fixed alleles in the test described above."
We readily admit that we do not have direct evidence for our repair hypothesis. Nevertheless, we think this hypothesis is particularly useful as it proposes a specific mechanistic basis for selection on GC content across the entire genome that can be directly probed with experimental approaches.
Reviewer #3: This manuscript presents new observations and a new hypothesis to explain the long-time puzzle of prokaryotic GC content heterogeneity, and the discrepancy between observed GC contents and their -almost universally lower -expected value based on mutational patterns. They report that the non-homologous end-joining (NHEJ) protein Ku is strikingly associated with high genome GC content, and also with the departures from mutational equilibrium, in a stronger way than any previously considered trait (notably those associated to lifestyle). The authors interpret this Ku-GC association as a signature of GC elevation being a response to frequent exposure to double-stranded DNA (dsDNA) break (DSBs). This is considered under several hypotheses, including that GC elevation and Ku occurrence may both be correlated responses the high incidence of DSBs, via separate mechanisms. Alternatively, they investigate a hypothesis where Ku is causally linked to GC elevation, via selective process promoting the elevation of GC content in the genome and in particular in regions susceptible to regular DSBs such as self-target sites for restriction enzymes to improve the efficiency of Ku repair function. They conclude that Ku (or the NHEJ pathway) is unlikely to account on its own for the whole higher-than-expected-GC phenomenon, but may at least be the functional mechanism of a selective process that accounts for part of this phenomenon.
The manuscript is very well written and documented, and presents relevant analyses to test the new hypothesis. The authors also attempt to link these new results to observations made previously regarding other hypotheses of mechanism for above-mutational-equilibrium genome GC contents, namely selection for higher %GC per se, and biased gene conversion (BGC).
The evidence presented in support of the Ku-GC association is sufficient and convincing, and its interpretation is cautiously discussed to consider known evidence, and to take into account potential interactions or confounding signatures with other mechanisms.
However, it would be desirable that the authors bring their study a step further, and bring a bit more material to help the reader (and future investigations) to resolve this puzzle. Namely, in order to test the relevance of the BGC hypothesis in the light of the facts presented in this study, they confront Ku occurrence and GC content data are to homologous recombination (HR) data, which are only recovered from other studies. This brings the concern that these data are not properly matched with the sty's own datasets: summary statistic from studies using different genome sets (and potentially different set of sequences within genomes) on the basis of the sole species name is unlikely to reflect the exact properties of the genome datasets investigated here. Considering the scale of the present investigation (the whole prokaryotic tree of life), it is crucial that each data point be accurately representing the properties of the considered organism, and hence that all measurements be made on the same dataset. As explained below, applying the HR test/quantification procedures described in the cited literature to this dataset would be a feasible undertaking, and would add much value to the paper.
Finally, I notice that the intermediary data is not made available. This includes tables describing the sets of genomes used, the occurrence of Ku in these genomes, the list of restriction enzyme found in them and their corresponding target sequences, the genome tree presented Fig 2 in machine-readable format, the estimates of GC at the mutational equilibrium, etc. the scripts used to generate such data, as well as those used to test their association, should be provided as well. I think that publication in scientific journal, and especially in the open-access pioneer PLoS journals, should always be backed by full access to data and proceedings of the analyses so they can be replicated. Please attach them as a supplement, or provide a link to an external data/code repository (my recommendation).
I let the editor appreciate the relevance of the request for additional data on HR. Provided that the few minor comments below are addressed and that intermediary data are provided, I think the manuscript would be otherwise generally fit for publication in PLoS Genetics. I thus recommend the paper for minor revision.

We thank the reviewer for their in-depth comments. As detailed below, we have incorporated an analysis of recombination using the PHI statistic as requested (which nicely complemented our other analyses and resulted in the addition of a figure to the main text -Fig 4). We have also made our code and intermediate datasets available on github: https://github.com/jlw-ecoevo/gcku
Detailed comments L70-82: this paragraph belongs to the introduction, with which it is slightly redundant.
While we appreciate that there is some redundancy here, we think having a bit of motivation at the outset of our Results and Discussion section helps adequately frame the section and guide the reader. Additionally, we feel this context is important for readers who tend to read papers out of order. Since this is a combined Results/Discussion section we do feel it is appropriate to have some discussion of previous work here. As this is a matter of style rather than substance, we are opting to go with our gut in this specific instance and keep the paragraph as is.
L87-89: this correlation of the Ku and GC, as revealed by correlation of each with 'third-party' trait, is striking. However, it would be nice to have a more straightforward estimate and visualisation of their association. Could the authors provide a correlation r^2 and p-value for GC ~ Ku occurrence? In complement of the PCA in fig. 1, could they also plot the result of a linear discriminant analysis (LDA) maximizing the separation of the samples based on their Ku +/-state, and plotting the %GC over it (as well as showing the explained variance of such a projection)? Actually, something like a heatmap of a correlation matrix of all these traits would be helpful (in supplement) for the reader to see how the traits are associated with each other.
We agree that such information will be useful to our readers. We now note the correlation between Ku and GC across all genomes in the main text (lines 104-107): (Fig 2A, S3 Fig; Pearson

correlation between GC content and Ku across genomes, r = 0.54, p < 2.2×10−16)"
We also have added S1 Fig that describes the pairwise correlation between traits as requested.

With respect to the recommended LDA analysis, we do not understand quite what the reviewer is asking for. In addition, we are worried that many redundant analyses of the trait data may confuse a reader and dilute the overall message (especially since our more central and robust analysis of the GC-Ku relationship comes in the next section -the trait analysis was meant mostly to motivate downstream analyses).
L90-92 / S1 Table: I think that the table legend should spell out how the model was formulated (like give the R code or a more formal string like 'y ~ trait1 + trait2'). Once that is clarified, it would be interesting to present results of a general linear model where the prioritization of would have been different: with Ku as first explanatory variable, would the other traits have any variance left to explain?
We have modified the S1 Table legend to include the model formula.

"Nevertheless, the inclusion of Ku along with ecological traits in a linear model to explain genomic GC content resulted in most other environmental traits still being statistically significant (S1 Table), indicating that either there is some aspect of the environment affecting GC content that is not attributable to DSBs or that NHEJ is an imperfect indicator of the rate of DSB formation (or both)."
L95-96: "In fact this is trivially true, as Ku presence is a discrete, binary variable whereas the rate of DSB formation is continuous." This is a relevant point, and should be considered further. In fact, the presence/absence of the Ku protein (used as a proxy of a functional NHEJ pathway) is a trait that can vary among strains of a clade or species, as stated by the authors L176-178. Transitions between the Ku +/-states might have happened recently in certain strain lineages, and at potentially high frequency over time. On the contrary, %GC increase is expected to be a long process, given that the effect size of either selection for higher %GC or BGC phenomena are likely small, that they act against the mutational bias, and that selection for other traits may interfere with this background amelioration process. This is to be opposed to phenotypic traits (usually considered for correlation under BM or OU models) that result from the expression of the genotype of an individual organism, i.e. in sync with its current genotype. It follows that the association between a potentially recently acquired trait (Ku presence) and the result of a long-standing process (%GC increase away from the mutational equilibrium) could possibly be coincidental. The authors should try and repeat their analyses by restricting them to genomes in clades where the Ku +/-state is conserved, and where we can expect that it has been present/absent for long enough so that the base substitution process is in its steady state. The situation that "Ku presence/absence is sprinkled throughout the prokaryotic phylogeny", and described in Figure 2B, where it seems that many clades have a homogeneous pattern of Ku occurrence, should allow them to run such restricted analyses with enough statistical power (while still using the phylogenetically-aware regression models to avoid over-counting the replicated data points within such homogeneous clades). This is an important point, as most studies trying to confirm/invalidate the hypothesis of BGC have tried to correlate the %GC with the recombination rate inferred from recent polymorphism data, which again reflect a recent property of the population, but might not reflect the long-term average recombination rate that the lineage has experienced -a major issue that prevented most past analyses to settle the debate on the existence or not of BGC in Prokaryotes. Ku occurrence is a simple binary trait and its past distribution is more easily estimated than the past recombination rate, which estimation from polymorphism data is inherently biased towards recent times due to saturation of homoplasy signals; by studying this simpler trait, the authors here have an opportunity to bring stronger evidence on that subject than any other previous study.
We thank the reviewer as we had not considered this type of analysis previously. We performed the requested analysis (lines 416-420):

"Finally, for our "Uniform Ku" models we excluded all genera from our dataset that had fewer than two genomes with which to assess Ku incidence, and then excluded any genera for which Ku incidence was not uniform (all genomes had Ku or all genomes lacked Ku). We then repeated our above analysis (779 taxa with Ku, 2365 without)."
and found that our results were qualitatively unchanged (lines 117-121):

"Finally, to control for the possibility that Ku gain/loss via horizontal transfer is frequent and potentially confounding, we also restricted our analysis to a subset of the data where Ku presence/absence did not vary within each genera (discarding variable genera) and found qualitatively the same result (Table 1)."
Section "No Apparent Relationship Between Rate of Homologous Recombination and NHEJ": I agree with the general conclusions of the authors for this section, that is the impossibility to conclude given the data, but I think they could try and provide further evidence to fuel the debate. In particular, they only rely on data from previous study to quantify the effect of homologous recombination (HR) on species they investigated in their own dataset. The third-party data they report is likely to be inadequate to answer the question asked, for several reasons. The quantification of HR rates (r/m) by Vos and Didelot (ref [44]) is made using ClonalFrame, a method that is able to grasp the long-term average HR rate (see comment above), which is a good thing, but was based on multi-locus data and on quite a variable set of strains depending on the species, thus unlikely to reflect findings from sets of whole-genomes of calibrated diversity (from the ATGC database) used in the present study. The data from Ruendules et al. [45] are also unlikely to have used the same set of genomes, and use simple linkage disequilibrium-based metrics which have been designed to perform test of occurrence of HR, not to quantify it, and which application at the whole-genome scale is unlikely to grasp any nuance in such signal. The fairer comparison is with the data from Lassalle et al. (ref [7]), but again the genome datasets are unlikely to be matched. Published genome data expand rapidly and, as a consequence, prokaryotic species definitions are being regularly revised; the genomes available for what was considered to be B. anthracis by Lassalle et al. in 2015 is thus unlikely what is available today in ATGC database under this same name. I believe this drastically limits the scope of what the authors are able to say about HR in the framework of this study. I would suggest that the authors replicate the procedure used by Lassalle et al., that is running the PHI test on the core gene alignments of their species datasets (or at least a representative subset), as provided by the ATGC database. The PHI test is very fast and can easily be ran in parallel on a large collection of gene alignments. This is not essential to the core argument of the paper, but would help going further on the matter.
We thank the reviewer for pointing out a potential issue with our analysis. Nevertheless, since Ku incidence tends to be relatively uniform within a species (S11 Fig) this is something of a moot point (this can also be seen in S9 Fig and S10 Fig as most of the points lie at the extremes of the xaxis). Nevertheless, we agree it is important to be sure that our data is consistent. As requested we ran the PHI test (S10 Fig) and found nearly identical results to those using separate datasets (S9 Fig), (lines 187-192):

"We saw no positive association between Ku incidence and inferred rates of homologous recombination looking between genomes, as would be predicted by this hypothesis (S9 Fig with data from [44, 45], and S10 Fig with data from the ATGC database [43]). In fact the relationship appeared to be negative regardless of method to measure recombination rate (though not significant)."
The idea to replicate the within-genome analysis of Lassalle et al. using the ATGC data is an interesting one, and led to some results that we think nicely complement our other analyses (so much so that we have added a figure to the main text -Fig 4). We found some evidence that could by taken as supporting BGC in general (though see Bobay and Ochman [10] for why this type of evidence is flawed), but that rejects any link between BGC and the Ku-GC pattern we observe (lines 215-228): small (paired t-test, df = 154, p = 1.503 × 10−11; Fig 4). Interestingly, while a link between recombination and GC content was apparent, it seemed to explain none of the difference between Ku-encoding and Ku-lacking organisms (Fig 4a). , df = 83.698, p = 0.0308; Fig 4b)." and (Fig 4 legend):

"We obtained all available alignments of shared genes within each cluster of organisms in the ATGC database ([43]). We then ran the program PhiPack [46] using 10000 permutations to generate p-values for the occurrence of recombination in each cluster-gene pair. To correct for multiple testing we used a Benjamini-Hochberg correction with a false-discovery rate of 5%. Altogether this yielded 52117 genes with significant evidence of recombination out of 438580 cluster-gene pairs with sufficient information to run PhiPack. To obtain GC content for each cluster-gene pair we took the mean GC content across sequences in the relevant alignment. To obtain cluster-wide estimates of GC content and Ku incidence we took the mean across genomes associated with organisms in that cluster (each cluster member in ATGC is associated with a RefSeq genome)."
L216-220: "the extremely strong and specific association between GC and Ku suggests that this relationship may be particular to the specific conditions selecting for Ku (especially considering the absence of an association between HR and GC when looking between genomes [5]; S7 Fig)" As discussed above, these datasets are very unlikely to be matched with the authors', and rejecting the association of elevated %GC (or Ku occurrence) with HR rates on this basis is possibly flawed. Again, I would suggest the authors run their own recombination tests/quantifications on their own datasets so they can draw robust conclusions.

See comments directly above and S10 Fig.
Also, we would like to point out that even Lassalle et al. didn't find a link between recombination and GC content when looking between genomes, but only when looking within genomes.