Genome Dynamics of Short Oligonucleotides: The Example of Bacterial DNA Uptake Enhancing Sequences

Among the many bacteria naturally competent for transformation by DNA uptake (a phenomenon with significant clinical and financial implications), Pasteurellaceae and Neisseriaceae species preferentially take up DNA containing specific short sequences. The genomic overrepresentation of these DNA uptake enhancing sequences (DUES) causes preferential uptake of conspecific DNA, but the function(s) behind this overrepresentation and its evolution are still a matter for discovery. Here I analyze DUES genome dynamics and evolution and test whether the results hold for other selectively constrained oligonucleotides. I use statistical methods and computer simulations to examine DUES accumulation in Haemophilus influenzae and Neisseria gonorrhoeae genomes. I analyze DUES sequence and nucleotide frequencies, as well as those of all their mismatched forms, and prove the dependence of DUES genomic overrepresentation on their preferential uptake by quantifying and correlating both characteristics. I then argue that mutation, uptake bias, and weak selection against DUESs in less constrained parts of the genome are together sufficient to cause DUES accumulation in susceptible parts of the genome, with no need for any other DUES function. The distribution of overrepresentation values across sequences with different mismatch loads compared to the DUES suggests a gradual yet non-linear molecular drive of DNA sequences depending on their similarity to the DUES. Other genomically overrepresented sequences, both pro- and eukaryotic, show a similar distribution of frequencies, suggesting that the molecular drive reported above applies to other frequent oligonucleotides. Rare oligonucleotides, however, seem to be gradually drawn to genomic underrepresentation, suggesting a molecular drag.
To my knowledge this work provides the first clear evidence of the gradual evolution of selectively constrained oligonucleotides, including repeated, palindromic and protein/transcription factor-binding DNAs.

Fewer than 10 DUESs are expected in random sequences of similar length and base composition to the H. influenzae Rd Kw20 and N. gonorrhoeae FA1090 genomes (see Table S1). However, Smith et al. [13,14] found 1471 DUESs in the first genome, and Davidsen et al. [11] counted 1965 in the latter. The search for a function to account for the evolution of these sequences is still inconclusive. Goodman and Scocca [22] suggested a DUES role in transcription termination, and Karlin et al. [23] viewed the significant even spacing of these sequences around the genome as a sign of their role in DNA replication, repair, or compaction. However, any transcription termination activity is difficult to envisage for at least three reasons: (i) 65% of H. influenzae and 35% of N. gonorrhoeae DUESs are located within open reading frames [11,13,14], (ii) DUESs are not palindromes and only ~10% occur as inverted repeats (M. Bakkali, unpublished), and (iii) there are more DUESs in non-coding regions than expected for an unbiased distribution [15]. Furthermore, no intracellular DUES-binding protein has been identified, and DUESs show no orientation bias around the chromosome [13][14][15], although both features would be expected from a sequence that interacts with the replication machinery.
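As a rough cross-check of the expectation quoted above, the number of occurrences of a short sequence in a random genome can be estimated analytically from base composition alone. The sketch below is not the randomization procedure behind Table S1. The genome length (about 1.83 Mb) and G+C content (about 38%) used for H. influenzae are approximate values assumed here for illustration, and the function name is hypothetical.

```python
def expected_kmer_count(kmer, genome_length, gc_content):
    """Expected occurrences of `kmer` on both strands of a random sequence.

    Assumes independent bases with P(G) = P(C) = gc/2 and
    P(A) = P(T) = (1 - gc)/2 (a simplifying assumption).
    """
    freq = {"G": gc_content / 2, "C": gc_content / 2,
            "A": (1 - gc_content) / 2, "T": (1 - gc_content) / 2}
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    revcomp = "".join(complement[b] for b in reversed(kmer))

    def prob(seq):
        # Probability of observing `seq` at one fixed position.
        p = 1.0
        for base in seq:
            p *= freq[base]
        return p

    windows = genome_length - len(kmer) + 1
    # A palindromic k-mer equals its reverse complement,
    # so it must not be counted twice.
    strands = [kmer] if kmer == revcomp else [kmer, revcomp]
    return windows * sum(prob(s) for s in strands)


# Assumed, approximate inputs (not taken from the text above).
HI_DUES_CORE = "AAGTGCGGT"
expected = expected_kmer_count(HI_DUES_CORE, 1_830_000, 0.38)
print(f"expected DUES cores in a random genome: {expected:.1f}")
```

Under these assumptions the estimate stays below 10, in line with the expectation stated in the text, and far below the 1471 copies observed.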
DUESs as bacterial mate recognition systems [24] is the only uptake-related function proposed for these sequences. In this case, DUESs could be tags for 'safe sex' among bacterial cells seeking recombination through competence for natural transformation. Given the striking DUES genomic overrepresentation, any preferential uptake of DUES-containing DNA will inevitably result in preferential uptake of conspecific DNA, and of DNA from species sharing the same DUES (i.e., closely related species). Hence, a DUES-biased system of DNA uptake could evolve given a higher selective advantage of recombination with DNA from conspecifics compared to DNA from unrelated species, and computer simulations seem to confirm this possibility [25].
However, bacteria were suggested to take up DNA mainly as a nutrient rather than for recombination [26,27]. Increasingly supported by recent findings [28][29][30][31], this hypothesis raises further questions about the evolution of a DUES-biased DNA uptake system that significantly limits the quantity of DNA available to the competent cells.
H. influenzae has 'only' 764 singly-mismatched DUESs [13][14][15], even though these can arise from the species' 1471 nine base pair (bp) long DUESs by any of the 27 possible mutations. Testing this ostensible sequence homogeneity may therefore help understand DUES evolution. High sequence homogeneity is expected in repeats arising by copying, such as transposable elements and telomeres, and Smith et al. [13] suggested that the unexpectedly low frequency of singly-mismatched DUESs relative to the non-mismatched one could be due to a balance between mutation away from the latter and its restoration by preferential uptake. However, DUESs are not known to be transposable, and we have no reason to discard uptake bias in favor of singly-mismatched ones. Both H. influenzae and Neisseria meningitidis even take up DUES-lacking DNAs, though less efficiently [22,32,33,34].
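The figure of 27 possible mutations follows from 9 positions times 3 alternative bases each. The short sketch below enumerates them and also gives the number of distinct sequences at every mismatch load. The core sequence AAGTGCGGT is the upper-case part of the H. influenzae consensus reported in this paper; the function names are hypothetical.

```python
from math import comb

BASES = "ACGT"


def single_mismatches(seq):
    """Return every sequence that differs from `seq` at exactly one position."""
    variants = []
    for i, original in enumerate(seq):
        for base in BASES:
            if base != original:
                variants.append(seq[:i] + base + seq[i + 1:])
    return variants


def n_sequences_at_load(k, m):
    """Number of distinct k-mers with exactly m mismatches to a given k-mer."""
    # Choose which m positions differ, then one of 3 other bases at each.
    return comb(k, m) * 3 ** m


HI_DUES_CORE = "AAGTGCGGT"
variants = single_mismatches(HI_DUES_CORE)
print(len(variants))  # 9 positions x 3 alternative bases
print([n_sequences_at_load(9, m) for m in range(10)])
```

Because the number of possible sequences grows quickly with the mismatch load, raw counts at different loads are only comparable after normalization by their expected frequencies.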
Based on sequence comparison of homologous pasteurellacean genes, DUESs were suggested to evolve by gradual accumulation of point mutations in preexisting sequences rather than by insertion/deletion of entire sequences [15]. As incoming DNA can replace homologous chromosomal regions by recombination, and provided there is enough bias towards some mismatched DUESs as well, DNA uptake could generate a drive that gradually imposes the DUES and some of its mismatched forms in the susceptible parts of the genome (i.e., non-coding and unconstrained coding regions). The resulting molecular drive, tested in [35], offers a simpler explanation of DUES accumulation where the perceived sequence homogeneity could be a consequence of stronger uptake bias towards the DUES than its mismatched forms.
Evaluation of these interpretations requires careful analysis of the genomic representation of all the DUES-like sequences as well as the actual bias of DNA uptake and its contribution to DUES evolution. In this work I examine and computer simulate the accumulation of DUESs in bacterial genomes by estimating and analyzing the frequencies of H. influenzae and N. gonorrhoeae DUESs and all their mutated forms. I subsequently monitor the overrepresentation of DUES nucleotides in sequences with different mismatch loads (i.e., number of mismatched positions compared to the DUES) to detect the genomic footprints of the DNA uptake bias and, thus, infer DUES evolutionary history. I then test whether uptake bias in itself can explain DUES accumulation by correlating the strength of DUES-like sequences, as estimated from the genomic frequencies of their nucleotides, to the uptake bias in their favor. Finally, I analyze the distribution of the genomic frequencies of several pro- and eukaryotic sequences to test whether the DUES mode of evolution could be extrapolated to other selectively constrained oligonucleotides. DUES offers a precious system for the study of such sequences, especially protein/transcription factor-binding ones, as it is selectively bound by the extracellular receptor for DNA uptake, and the uptake bias could be analogous to any other function/selective force.

Sequence frequencies and overrepresentation
This analysis aims at answering two questions: (i) Do DUESs evolve by gradual accumulation of mutations? If so, then (ii) what is the minimum number of matches a sequence needs to share with the DUES for its uptake to be significantly preferential?
For a gradual evolution of DUES to take place, a degree of DNA uptake bias towards some of its mutated forms is also required. In such conditions, and assuming no interference from evolutionary forces other than mutation and DNA uptake bias, over evolutionary time the latter should leave a trace in the genome in the form of significant overrepresentation of the DUES and of its mutated yet preferentially taken up forms. Furthermore, the magnitude of overrepresentation should correlate inversely with the mismatch load of the sequence compared to the DUES. However, no significant overrepresentation is expected for sequences not preferentially taken up and, if there is any, it need not correlate with the mismatch load.
Only DUESs in H. influenzae and N. gonorrhoeae genomes show significant overrepresentation that consistently decreases with the increase in mismatch load until underrepresentation at mismatch load 6 (Figure 1, Table 1, and Table S1). Significant overrepresentation then reappears at the last two sequence categories where it increases with the mismatch load. As expected from nonaccumulating DNAs, the control tests show neither overrepresentation of the sequences analyzed, nor any consistent trend in sequence frequencies with respect to the mismatch load. Thus, if sequence overrepresentation is indicative of DNA uptake, these results suggest that DUESs gradually evolve in the genome by accumulation of point mutations.
The distribution of the overrepresentation values is strikingly similar between H. influenzae and N. gonorrhoeae, suggesting similar DUES genome dynamics and evolution. Nonetheless, overall, N. gonorrhoeae values are noticeably higher than those of H. influenzae, possibly due to more efficient/frequent DNA uptake. Two major drops in overrepresentation values can easily be identified in both species: the first between mismatch loads 0 and 1 (92-fold in H. influenzae and 82-fold in N. gonorrhoeae), and the second between loads 1 and 2 (9-fold in H. influenzae and 7-fold in N. gonorrhoeae). This reflects a non-linear relation between the mismatch load and DNA uptake, possibly due to high specificity of the receptor binding the DUES. Coinciding with the base length difference between N. gonorrhoeae and H. influenzae DUESs, significant overrepresentation values reach one mismatch further in the first species than in the latter. This highlights that what matters for DNA uptake is the number of matches to the DUES, not the number of mismatches.
At four matches (i.e., mismatch load 5 in H. influenzae, and 6 in N. gonorrhoeae), the results are ambiguous, with non-significant overrepresentation in H. influenzae but significant underrepresentation in N. gonorrhoeae. This may be due to: (i) non-preferential uptake, (ii) insufficient uptake for balancing sequence loss by mutational decay, (iii) insufficient drive from higher mismatch loads to balance the drive towards lower ones, or (iv) functional constraint selecting against some sequences at this mismatch load. Sequences with 3 and 2 matches to the DUES are significantly underrepresented and are thus probably not preferentially taken up. They could possibly comprise the pool from which the drive imposed by the DNA uptake bias receives new sequences after mutation. Overrepresentation of sequences with 0 and 1 match to the DUES shows no negative correlation with the mismatch load and is clearly due to uptake-independent factor(s) (e.g., functional and mutational constraints or codon bias).
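The kind of comparison described in this section can be sketched as follows: count the windows of a sequence at each mismatch load relative to the DUES, and divide by the count expected from base composition. This is an illustrative sketch, not the procedure used for Table 1 and Table S1 (which relies on randomized genomes). The base frequencies, the toy input and all function names are assumptions, and only one strand is scanned for brevity.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def observed_by_load(seq, target):
    """Count windows of `seq` at each mismatch load (0..k) relative to `target`."""
    k = len(target)
    counts = Counter(hamming(seq[i:i + k], target)
                     for i in range(len(seq) - k + 1))
    return [counts.get(m, 0) for m in range(k + 1)]


def expected_by_load(n_windows, target, base_freq):
    """Expected window counts per mismatch load under independent bases."""
    dist = [1.0]  # dist[m] = P(m mismatches) over the positions processed so far
    for base in target:
        p_match = base_freq[base]
        nxt = [0.0] * (len(dist) + 1)
        for m, q in enumerate(dist):
            nxt[m] += q * p_match
            nxt[m + 1] += q * (1.0 - p_match)
        dist = nxt
    return [n_windows * q for q in dist]


def overrepresentation(observed, expected):
    """Observed/expected ratio per mismatch load."""
    return [o / e if e > 0 else float("nan") for o, e in zip(observed, expected)]


# Toy input for illustration only (not genomic data).
toy = "AAGTGCGGTAAGTGCGGA"
target = "AAGTGCGGT"
freq = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}  # assumed composition
obs = observed_by_load(toy, target)
expd = expected_by_load(len(toy) - len(target) + 1, target, freq)
ratios = overrepresentation(obs, expd)
```

On a real genome the same functions would be applied to both strands, and the significance of each ratio would still need a statistical test such as the one behind Table S1.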

Computer simulations
Overrepresentation of mismatched sequences suggests that they are preferentially taken up. Still, it could also be due to mutational decay of sequences with fewer mismatches that are preferentially taken up. DUESs are so overrepresented that singly-mismatched sequences could reach overrepresentation only by DUES mutational decay, as suggested in [13]. It is also possible that the two major drops in sequence overrepresentation reported above reflect uptake bias only towards sequences with fewer than 2 mismatches to the DUES. I therefore simulated the evolution of sequence frequencies in a population of 10^18 genomes with the same distribution of sequence frequencies as those expected for randomized H. influenzae and N. gonorrhoeae genomes (see Table S1), evolving at 10^-2 mutations per base per generation. Mutations alone cannot drive either the H. influenzae or the N. gonorrhoeae DUES to its observed overrepresentation level (data not shown), and an uptake bias of 104.4 and 477.5, respectively, is needed (Figure 2A;D). However, the results do not support the possibility of mutational decay. By themselves, these uptake biases can neither drive the singly-mismatched sequences to their observed overrepresentation levels, nor give distributions of sequence frequencies similar to the observed ones (Figure 2A;D). An additional 1.02 and 5.284 uptake bias in favor of singly-mismatched H. influenzae and N. gonorrhoeae DUESs was needed to attain their observed frequencies, yet it did not give an overall sequence frequency distribution similar to the observed one (Figure 2B;E). The same was true for additional uptake biases towards other mismatch loads (data not shown). An approximation to the observed distributions of sequence frequencies with an error margin of less than 0.01% was obtained only with the distributions of uptake biases shown in Figure 2C;F.
[Table note: Sequence names as in Figure 1. Significant values, as by χ² test, are underlined (see Table S1).]
This distribution agrees with my interpretation of the sequence overrepresentation results, as it supports uptake bias towards mismatched sequences and negative correlation between the mismatch load and the uptake bias. The exception seems to be H. influenzae sequences at mismatch loads 4 to 9, where uptake bias value increases with the mismatch load, which may reflect the involvement of other selective forces than uptake. The values of uptake bias used in the simulations certainly include the other selective forces, such as functional constraints, that led to the observed distributions of sequence frequencies in the real genomes.
To reach the currently observed genomic distributions of sequence frequencies, simulations needed higher uptake bias values and more cycles (i.e., generations) for N. gonorrhoeae than for H. influenzae (Figure 2C;F). This may be due to: (i) differences in the ancestral organization of the real genomes which, obviously, was not random as assumed in the simulations, (ii) more frequent uptake of DNA by the almost permanently competent N. gonorrhoeae [36,37] than by the occasionally competent H. influenzae [4,38,39], or (iii) earlier evolution of the DUES-biased DNA uptake in N. gonorrhoeae than in H. influenzae.
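The interplay of mutation and uptake bias discussed in this section can be illustrated with a deliberately simplified deterministic model. The sketch below is not the simulation used for Figure 2. It assumes an infinite population, equal base frequencies, at most one mutation per sequence per generation, and it treats uptake bias as a relative weight applied to each mismatch-load class. All names and parameter values are hypothetical.

```python
from math import comb


def step(freqs, mu, bias):
    """One generation: mutation between adjacent mismatch loads, then uptake bias."""
    k = len(freqs) - 1
    mutated = [0.0] * (k + 1)
    for m, f in enumerate(freqs):
        gain_match = m * mu / 3.0   # a mismatched position mutates to the DUES base
        lose_match = (k - m) * mu   # a matching position mutates away from it
        mutated[m] += f * (1.0 - gain_match - lose_match)
        if m > 0:
            mutated[m - 1] += f * gain_match
        if m < k:
            mutated[m + 1] += f * lose_match
    # Uptake bias: reweight each mismatch-load class, then renormalize.
    weighted = [f * w for f, w in zip(mutated, bias)]
    total = sum(weighted)
    return [x / total for x in weighted]


def simulate(k, mu, bias, generations):
    """Iterate `step` starting from the random-sequence expectation."""
    # With equal base frequencies, each position matches with probability 1/4.
    freqs = [comb(k, m) * 0.75 ** m * 0.25 ** (k - m) for m in range(k + 1)]
    for _ in range(generations):
        freqs = step(freqs, mu, bias)
    return freqs


# Hypothetical parameters chosen for illustration only.
neutral = simulate(9, 0.01, [1.0] * 10, 1000)
biased = simulate(9, 0.01, [2.0] + [1.0] * 9, 1000)
print(f"DUES frequency without bias: {neutral[0]:.2e}, with bias: {biased[0]:.2e}")
```

In this toy model mutation alone leaves the frequencies at the random expectation, whereas a weight above 1 on mismatch load 0 raises the DUES frequency. Reproducing an observed distribution would require fitting one weight per mismatch load.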

Sequence logos and evolution of DUES nucleotides
The results reported above clearly indicate gradual evolution of DUESs by accumulation of point mutations. However, this is only certain if we see a gradual increase in the overrepresentation of the actual DUES nucleotides, as nucleotide overrepresentation at any particular position of the DUES may not show any trend with regard to the mismatch load.
So far I focused solely on the 9 and 10 bp H. influenzae and N. gonorrhoeae DUESs. However, for H. influenzae, this sequence is only the core of a larger (29 bp) DUES consensus containing two additional less conserved regions [14,16]. Conversely, there is no report of less conserved DUES regions in N. gonorrhoeae [7]. DNA sequence logos were therefore generated for 100 bp sequences, containing the DUES cores or one of their mismatched forms, in order to include the less conserved regions and look for possible additional ones. The H. influenzae DUES consensus is 5′-aAAGTGCGGTnrwttttnnnnnnrwtttw, where r = A or G and w = A or T (Figure 3A), which is similar to the consensus reported in [14]. The N. gonorrhoeae DUES also seems to show less conserved nucleotides (Figure 3B), especially an adenine and a thymine at the 5′ side. Its consensus sequence is thus 5′-mdatGCCGTCTGAAvv, where d = A, G or T, m = A or C and v = A, G or C.
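The two consensuses use standard IUPAC degenerate nucleotide codes. As a minimal sketch, a candidate sequence can be compared with a consensus by counting the positions whose base is not allowed by the corresponding code. The function name and the example sequences are made up for illustration, and the case of the consensus letters (used in the text to mark conservation) is ignored here.

```python
# IUPAC codes used by the two consensuses in the text, plus the four bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "W": "AT", "M": "AC",
    "D": "AGT", "V": "ACG", "N": "ACGT",
}

HI_CONSENSUS = "aAAGTGCGGTnrwttttnnnnnnrwtttw"  # 29 bp, H. influenzae
NG_CONSENSUS = "mdatGCCGTCTGAAvv"               # 16 bp, N. gonorrhoeae


def consensus_mismatches(seq, consensus):
    """Count positions of `seq` not permitted by the IUPAC `consensus`."""
    if len(seq) != len(consensus):
        raise ValueError("sequence and consensus must have the same length")
    return sum(base.upper() not in IUPAC[code.upper()]
               for base, code in zip(seq, consensus))


# Made-up example sequences, not taken from any genome.
hi_example = "AAAGTGCGGT" + "CATTTTT" + "CCCCCC" + "GATTTT"
ng_example = "AGAT" + "GCCGTCTGAA" + "CC"
print(consensus_mismatches(hi_example, HI_CONSENSUS))
print(consensus_mismatches(ng_example, NG_CONSENSUS))
```

A count of this kind treats all positions equally, whereas the strength measure used later in the paper weights nucleotides by their genomic overrepresentation.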
The logos support previous interpretations, as they show clear gradual accumulation of DUES nucleotides starting from the third match, with some nucleotides gaining overrepresentation earlier/faster than others, probably due to their importance for uptake. This, together with the results of the simulations, resolves the ambiguity regarding sequences with 4 matches and suggests a certain bias in their uptake, though not sufficient to drive them to significant overrepresentation. Coinciding with the biases in genomic base composition, DUES Ws emerge first in H. influenzae whilst, in N. gonorrhoeae, the whole DUES core emerges at the same time with a slightly higher overrepresentation of Ss than Ws. Unlike the core, DUES nucleotides of the less conserved regions emerge only after mismatch load 2 and, at H. influenzae positions 14 and 15, there is a switch in the nature of the most overrepresented nucleotide, probably because of similarity in their overrepresentation. In neither species does overrepresentation seem to depend on uptake at fewer than 3 matches, since the logos do not show resemblance to the respective DUES cores.
Dependence of the uptake bias on the strength of the sequence
Dependence of sequence overrepresentation on DNA uptake is the key to interpreting the results described above. I therefore used the 28 DNA fragments tested for uptake efficiency by competent H. influenzae in [40], which are the best experimental data set available on DNA uptake, to test the dependence of their uptake on the strength (equation 19) of their 29 bp DUES-like sequences. The small differences in the naming and sizes of some of the DNA fragments between what is reported in [40] and the sequences available at the database are insignificant and will not affect the intended analyses. Fifteen of these fragments were classified by Goodgal and Mitchell [40] as uptake fragments and the remaining 13 as non-uptake. As Table 2 shows, the uptake fragments are, overall, larger than the non-uptake ones, and the average strength of their strongest 29 bp sequences is higher, whilst the average mismatch load of these sequences is lower. The mean mismatch load of all the 29 bp sequences of every uptake and non-uptake fragment is similar, whereas their mean strength is lower in the uptake than the non-uptake fragments. In principle, this may be sufficient to infer dependence of the uptake efficiency on the strength of the strongest sequence of a DNA fragment. However, DNAs were named by Goodgal and Mitchell [40] prior to quantification of their uptake, and their classification into uptake and non-uptake fragments does not reflect the clear division between fragments taken up at more than 60 molecules per cell and those taken up at 30 or less. Three fragments originally classified as non-uptake are better seen as uptake fragments (14nu, 27nu and 58nu), and six vice versa (159u, 201u, 205u, 30u, 62u and 71u).
After this reclassification, the overall uptake fragments (i.e., all their 29 bp sequences), on average, show fewer mismatches, but less strength, than the non-uptake ones, whereas their strongest 29 bp sequences show fewer mismatches and more strength.
Regression analysis to detect dependence of the number of DNA molecules taken up by competent H. influenzae on the strength and mismatch load of their sequences shows very significant values for the strongest 29 bp sequence in each of the 28 DNA fragments (Table 3). This suggests that the efficiency of DNA uptake depends on the strength of the best region of the DNA fragment. However, the results are also significant for the second 29 bp sequences in strength, implying that the preferential uptake might not always target the strongest sequence in a DNA fragment. Nevertheless, weaker sequences show no significant results, and the significance of the regression is much higher for the strongest sequence. The significant results of the second sequences in strength could, thus, be due to similarities with the strongest ones in some DNA fragments. In fact, 11 DNA fragments show less than 5% difference in strength between their strongest and the following sequence (fragments 8u2, 30u, 37nu2, 39nu, 45nu, 48u, 51nu, 71u, 159u, 201u, 205u). At 10% cut-off, the number of these fragments already reaches half the sample size (due to addition of fragments 43u, 48nu, 62nu).
Evolution of Oligonucleotides PLoS ONE | www.plosone.org
Even though uptake DNA fragments on average are larger than the non-uptake ones (Table 2), DNA uptake efficiency does not depend on the length of the DNA fragment taken up (Table 4). However, had I tested more DNA fragments, such an effect could have been detected as, in theory, the larger a DNA fragment is, the [...]
Ultimately, the results suggest that the efficiency of DNA uptake significantly depends on the strength of one sequence only, the strongest, which is the most likely to be bound and taken up independently of its position or orientation in the DNA fragment. Since the strength of sequences was estimated based on the genomic overrepresentation (conservation) of their nucleotides (see equation 19), these results suggest that DNA uptake bias alone can explain the genomic overrepresentation of DUESs and some of their mismatched forms.

Experimental verification of the theoretical results
For experimental confirmation of the abovementioned results, uptake was measured for 222 bp DNA fragments generated in vitro to contain the 'best' DUES consensus, consensuses mutated to the least frequent nucleotide at positions 2-11, 13 and 14, 20, or 25 and 26, or the worst 29 bp sequence. Uptake experiments were replicated several times and for different amounts of DNA (10, 20, 30 and 40 ng). The data show considerable variance between experiments, and a nested ANCOVA was performed on the uptake values and their standard deviations to detect possible dependence of uptake on the availability of DNA and the strength of the DUES, as well as to estimate the magnitude of the noise (i.e., error) emanating from the variation between experiments. The results (Table 5) suggest that the amount of DNA taken up strongly depends on the nature of the DUES (i.e., stronger DUESs are [...]
Overall, the experimental findings agree with those of the theoretical analysis reported above, and mutations of theoretically important DUES positions also result in less DNA uptake (Figure 4). The small effect on uptake of mutating the second, third and fourth positions of the DUES consensus, however, does not agree with their conservation results. Still, this result could reflect the evolutionary variability of nucleotides at these positions, as they show differences between the two pasteurellacean DUESs (Figure 4). The experiment confirms that, even within the DUES core, the importance of the different positions for DNA uptake varies, with the evolutionarily conserved ones being the most important. Indeed, Karlin et al. [23] reported a biased distribution of singly-mismatched DUESs in the H. influenzae genome.

The findings on DUES could be generalized to other oligonucleotides
In this work I took advantage of the significant amount of experimental data on DUES-mediated DNA uptake and the convenience of working with bacterial genomes, and used H. influenzae and N. gonorrhoeae DUESs as examples of selectively driven (protein-binding) oligonucleotides. As Figure 5A;B and Tables 6 and 7 show, the gradual mode of evolution suggested by the results on these DUESs seems to be a common feature of other selectively constrained oligonucleotides, including the other pasteurellacean DUES as well as several other pro- and eukaryotic short repeated, palindromic and protein-binding DNAs. Sequences showing genomic overrepresentation seem to gradually evolve in the same way as the DUES (i.e., by molecular drive), whilst those underrepresented seem to evolve in an inverse fashion (i.e., by molecular drag). The differences in over- or underrepresentation levels between sequences and mismatch loads reflect the differences in strength and specificity of the selective force(s) directing the evolution of these sequences.

DISCUSSION
The current names of pasteurellacean uptake signal sequences (USS) and neisseriacean DNA uptake sequences (DUS) could be misleading, as they do not accurately describe the effect of these sequences on DNA uptake and imply that they represent different genetic elements. Both sequences show similar genomic distributions in the corresponding genomes [9,10,12,13,14], and have the same effect on DNA uptake by the corresponding competent bacteria [5,8,9,22,32,41]. Therefore, USS and DUS refer to sequence variants of the same genetic element, which makes the case for a common nomenclature. This should not be USS, given that neither USSs nor DUSs actually signal DNA uptake, as bacteria need to be competent beforehand (i.e., non-competent cells will not take up DNA, no matter how many USSs or DUSs one gives them). DUS is not a satisfactory name either, as both competent Pasteurellaceae and Neisseriaceae species also take up DNA that lacks USS/DUS, though less efficiently [22,32,33,34].
Here I suggest the unifying and more accurately descriptive name of DNA Uptake Enhancing Sequences (DUES) to highlight that both sequences are variants of the same genetic element which enhance the uptake of DNA rather than cause it.
DUES genomic overrepresentation raises questions about their function and how the normally small bacterial genomes tolerate them in such large quantities. Among the several intracellular functions suggested [22,23], none seems to be supported by the genomic distributions of these sequences (see Introduction) as: (i) DUESs are not palindromes, and most are neither inverted repeats nor biased towards locations downstream of coding sequences, as would normally be expected from sequences terminating transcription, (ii) they show no orientation bias around the chromosome, a feature expected from sequences interacting with the replication machinery, and we do not know of any intracellular DUES-binding protein, and (iii) they are not as evenly spaced as expected from sequences that help compact the chromosome. The bacterial mate recognition system, discussed in [24], is the only non-intracellular function proposed for these sequences, and the only one to account for their involvement in DNA uptake. The differences between the DUESs of the two groups of Pasteurellaceae species [16], and between these and the Neisseriaceae DUES ([5,13]; Figure 3A;B) clearly support this last hypothesis. Nevertheless, competence is suggested to be a mechanism more for nutritional than for sex and recombination purposes [26][27][28][29][30][31] and, even if this was not the case, some significantly distant bacterial species share the same DUES. These include: (i) H. influenzae, P. multocida, H. somnus, A. actinomycetemcomitans and M. succiniciproducens, (ii) A. pleuropneumoniae and M. haemolytica [6,9,13,14,15,16,18] and (iii) N. gonorrhoeae and N. meningitidis [7,10,11,12,13]. On the other hand, the concept of bacterial species is still a matter for debate [42][43][44][45] and, in some cases, including Neisseria [46], differentiation between species seems to be rather fuzzy.
Furthermore, even if competence was for nutritional purposes, some of the DNA taken up still survives digestion and recombines with the chromosome. Occasional recombination with potentially harmful DNA from distantly related species may, thus, be a sufficient driving force for DUESs to evolve as a mechanism for minimizing such unwanted recombination. DUESs might also not be that advantageous, thus representing a sort of 'lucky DNA' probably driven to overrepresentation in genomic locations where they cause insignificant loss of fitness due to a possibly coincidental specificity of the protein that binds DNA (i.e., receptor) at the cell surface during competence. Every DNA-binding protein has some sort of inherent specificity towards a combination of nucleotides due to its amino acid sequence and spatial configuration [47].
The possible implication of DNA uptake bias towards DUESs in the accumulation of these sequences in the genome was oddly overlooked. Earlier, we [15] hypothesized that the biased DNA receptor at the cell surface of a competent species may gradually enrich its genome with DNA fragments containing the preferred DNA sequence (i.e., DUES) and some of its mutated forms. Mutations towards the preferred DUES in the DNA of the 'donor' cell may preferentially spread to other cells, by uptake and recombination, when the 'donor' cell dies. On the other hand, mutations away from the DUES in the recipient chromosomes may be restored by uptake and recombination with incoming DNA from 'donors' with less-mutated sequences. This way, the combined action of mutation, uptake bias and reduced constraints in some genomic regions could be sufficient for DUESs to gradually accumulate in susceptible parts of the genome, and no additional function is needed. This would explain the preferential location of DUESs in non-coding regions [15], and in parts of the coding regions where they have little influence on protein configuration/function [23].
If such an interpretation is correct, DNA uptake should leave a footprint on the genome, in the form of higher overrepresentation of sequences depending on their similarity to the DUES, which is exactly what this work shows. Sequence overrepresentation is observed even for some mutated DUESs, where it negatively correlates with the mismatch load, and DUES nucleotides seem to emerge and start accumulating as soon as the third match. Simulations of DUES evolution in hypothetical genomes subjected to mutation and to uptake bias towards the DUES only failed to reproduce the observed distribution of sequence overrepresentation across mismatch loads, whilst adding uptake bias towards mismatched sequences did reproduce it. This, together with the clear correlation between genomic overrepresentation and uptake efficiency, suggests that mutations and DNA uptake bias result in a force sufficient to allow gradual DUES evolution by accumulation of point mutations in susceptible parts of the genome (i.e., by molecular drive). This is supported by the experimentally detectable uptake of DNAs from unrelated species (i.e., with mismatched or no DUESs) by competent Pasteurellaceae and Neisseriaceae species [22,32,33,34].
The higher overrepresentation of the N. gonorrhoeae DUES compared to that of H. influenzae could be due to the almost permanent character of competence in the first species [36,37] as opposed to its occasional nature in the latter [4,38,39]. Even so, the increase in sequence overrepresentation accompanying the decrease in mismatch load is not linear in either of the two species, as we see sharp drops between the DUES and its singly-mismatched forms and, to a lesser degree, between the latter and sequences with two mismatches. This distribution could reflect a strong specificity of the DNA-binding receptor, as well as the length of the evolutionary history of the DNA uptake bias and/or its efficiency. In principle, the more specific the receptor is, the more pronounced the overrepresentation of the DUES compared to its mismatched forms will be. Similarly, the longer the uptake bias has been running, or the more efficient it is, the closer genomes will be to an equilibrium, where the relative overrepresentations of the DUES and its mismatched forms better reflect the differences in the relative efficiency of their uptake.
In spite of all the similarities in genomic frequencies and evolutionary dynamics between H. influenzae and N. gonorrhoeae DUESs, their actual sequence consensuses are different both in shape and composition. In this work I identify previously unreported, less conserved nucleotides at each side of the N. gonorrhoeae DUES core, bringing the consensus up to 16 bp, with four additional nucleotides at the 5′ side of the core (two of which are strongly conserved) and two at the 3′ side. The resulting consensus is identical to that of the N. meningitidis DUES reported in [13], suggesting common ancestry of the DUESs of these two sister species. Therefore, just like the Pasteurellaceae [15], Neisseria species seem to have ancestral DUES-mediated preferential uptake of conspecific DNA. The obvious reason for the differences between H. influenzae and N. gonorrhoeae DUESs is differences in structure and binding specificity of the DNA receptors at the cell surfaces of both species. PilC is the extracellular 110 kDa adhesin [48] suggested as binding DNA at the tips of N. gonorrhoeae's type IV pili [49,50]. Its equivalents at the tips of H. influenzae type IV pili are a 216 amino acid protein (called HifD) and two units of a 435 amino acid protein (called HifE) [51][52][53][54][55]. Given this difference, it is hard to refrain from speculating that whilst PilC could be the receptor driving the entire N. gonorrhoeae DUES, HifD might be binding to and driving the core sequence of the H. influenzae DUES, whereas the two HifE proteins could have specificity towards the two almost identical less conserved regions of this DUES. However, this can only be confirmed after proper experimental testing (e.g., gene knockouts, DNA-protein cross-linking, Southern blotting, band-shifts and DNaseI foot-printing, antibodies…).
[Table note: Abbreviations as in Table 6.]
The results of this work also show that, at least for H. influenzae, only one DUES seems to be bound by the receptor at the cell surface at any given time, independently of its orientation or position in the DNA fragment. A similar result was obtained experimentally in A. actinomycetemcomitans [9], suggesting that this is a general characteristic of DUES-mediated DNA uptake. In addition, the sequence most similar to the DUES in a DNA fragment seems to be the one most likely to be bound. Both characteristics are important to the molecular drive model of DUES evolution postulated in [15], since the simultaneous binding of more than one DUES would not explain the genomic distribution of these sequences (it would favor significant clustering in some chromosomal regions). In addition, bias towards DUESs in a particular orientation would have caused bias in their orientation around the chromosome. Indiscriminate binding of the receptor to sequences independently of their degree of similarity to the DUES, however, would loosen or even abolish the drive responsible for the emergence and gradual accumulation of DUESs in the genome.
Several genomically overrepresented oligonucleotides, both pro- and eukaryotic, seem to be selectively driven in a similar mode to the DUES, whilst underrepresented ones seem to be gradually eliminated (dragged) in an inverse fashion. This suggests that the gradual mode of sequence evolution discussed above might be a general feature of short DNAs selectively driven or dragged to genomic over- or underrepresentation, including protein/transcription factor-binding, palindromic, repeated and other oligonucleotides. Statistical deviation from the expected frequencies [56][57][58][59] is used as an indicator of DNA functionality, including transcription-factor binding [60][61][62]. DUESs themselves resemble sequence families that interact with sequence-specific DNA-binding proteins (e.g., CRP, LexA and other pro- and eukaryotic transcription factor-binding DNAs), which typically consist of a short conserved core extended by less conserved regions clustering around a consensus sequence without exactly reproducing it (see [63][64][65]).
In conclusion, the results of this work suggest a gradual accumulation of genomic oligonucleotides by molecular drive towards overrepresentation due to the combined effects of mutation, the function of the oligonucleotide and some of its mutated forms, and weak selection against changes in the functionally unconstrained regions of the genome. Underrepresented oligonucleotides, however, seem to be gradually eliminated by a molecular drag emanating from the combined effects of mutation, the negative effect of the oligonucleotide and some of its mutated forms on the fitness of the carrier, and selection against changes in the functionally constrained regions of the genome. Several oligonucleotides seem to evolve in this way, including transcription factor/protein-binding and functionally constrained palindromic and repeated DNAs. DUESs themselves might be a sort of 'lucky DNA' driven to genomic overrepresentation, in the susceptible parts of the genome, not due to any other function but the possibly coincidental specificity of the receptor that binds DNA for uptake during natural competence for transformation.

Theoretical analysis
The expected number of sequences at each mismatch load was calculated as:
where $m$ is the mismatch load analyzed, $t$ is the number of adenines and thymines (Ws) in the DUES, $s$ is the number of cytosines and guanines (Ss) in the same DUES, $x$ is the maximum number of mismatches that could affect DUES W positions at the mismatch load $m$, $h$ is the genomic frequency of Ws, $z$ is the genomic frequency of Ss and $L$ is the genome's length in base pairs. Sequence representation was calculated at each mismatch load as $\mathrm{Rep}_{Seq} = (\mathrm{Seq}_{obs} - \mathrm{Seq}_{exp})/\mathrm{Seq}_{exp}$ (equation 2), where $\mathrm{Seq}_{obs}$ is the observed number of sequences and $\mathrm{Seq}_{exp}$ the expected one. Significant deviation of the observed numbers of sequences from the expected ones was tested using the goodness-of-fit Chi-squared ($\chi^2$) test, the values of which were multiplied by $|\mathrm{Seq}_{obs} - \mathrm{Seq}_{exp}|/(\mathrm{Seq}_{obs} - \mathrm{Seq}_{exp})$ (equation 3) to differentiate overrepresentation from underrepresentation.

Computer simulations
To test the contribution of DNA uptake bias to the observed distribution of sequence frequencies, I wrote the Perl-based program Genome_Dynamics.pl (program available on request). It simulates the evolution of the genomic frequencies of a given sequence and all its mismatched forms, starting from a given distribution and running until equilibrium or any other target is reached. Each generation, a mismatch load loses the sequences mutated to the immediately higher and lower mismatch loads while gaining some of the sequences mutating from the same. Assuming there is no evolutionary force other than single (i.e., point) mutation, sequences at a mismatch load $m$ should decay (i.e., mutate to mismatch load $m+1$) at the proportion $\mu\,\mathrm{Seq}\,(1 - m/n)$ (equation 4) per mutation per generation, and improve (i.e., mutate to mismatch load $m-1$) at $\mu\,\mathrm{Seq}\,(m/(3n))$ (equation 5), where $\mu$ is the mutation rate per base per generation, $\mathrm{Seq}$ is the number of sequences at the mismatch load and generation analysed, and $n$ is the size of the sequence in bp.
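The representation and signed Chi-squared statistics of equations 2 and 3 in the theoretical analysis can be illustrated with a short Python sketch. This is not the author's code. The function names are hypothetical, the expected count is taken as an input rather than derived, and the Chi-squared value is assumed to be the single-term statistic (observed - expected)^2 / expected. The zero returned when observed equals expected is also an assumption, since the sign factor of equation 3 is undefined in that case.

```python
def representation(seq_obs, seq_exp):
    """Equation 2: relative deviation of the observed from the expected count."""
    return (seq_obs - seq_exp) / seq_exp


def signed_chi_squared(seq_obs, seq_exp):
    """Chi-squared value carrying the sign factor of equation 3.

    Assumption: a single-term goodness-of-fit statistic is used here.
    Positive values indicate overrepresentation, negative values
    underrepresentation.
    """
    diff = seq_obs - seq_exp
    if diff == 0:
        # Equation 3 is undefined here; returning 0 is a convention of this sketch.
        return 0.0
    chi2 = diff ** 2 / seq_exp
    return chi2 * abs(diff) / diff
```

For instance, with hypothetical counts of 150 observed against 100 expected, `representation(150, 100)` gives 0.5 and `signed_chi_squared(150, 100)` gives 25.0, whereas 50 observed against 100 expected gives -0.5 and -25.0.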
Uptake biases $U_0$ to $U_n$ in favour of sequences at the mismatch loads 0 to $n$ should result in an additional generational increase of the frequency of sequences at any mismatch load $m$ by the proportion $U_m$ of sequences mutated from the mismatch loads $m-1$ and $m+1$. In addition, there will be a generational decrease by the proportion $U_{m-1}$ of sequences improving from the mismatch load $m$ to $m-1$, and by the proportion $U_{m+1}$ of sequences decaying from the load $m$ to $m+1$ (note that $U_m$ could also include selective forces other than DNA uptake). Repeating these calculations at each mismatch load for many generations allows the simulation of the evolution of a given initial distribution of sequence frequencies in a population under a given mutation rate and a given distribution of uptake bias values towards sequences at different mismatch loads. Genome_Dynamics.pl assumes panmixia, and I used a population of $10^{18}$ individuals (i.e., cells) evolving at $10^{-2}$ mutations per base per generation.
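The generational bookkeeping described above can be sketched in Python as follows. This is an illustrative reading of the text, not a port of Genome_Dynamics.pl. All names are hypothetical, and the sketch assumes that every flow of sequences arriving at a mismatch load k is multiplied by (1 + U_k), so that the gain of the receiving load and the loss of the donating load always match and the total number of sequences is conserved. The helper `mutation_equilibrium` is an addition of this sketch: it returns the binomial distribution at which the flows of equations 4 and 5 balance when there is no bias.

```python
from math import comb


def step(seq, mu, bias):
    """Advance the counts per mismatch load (index 0..n) by one generation.

    seq  : number of sequences at each mismatch load
    mu   : mutation rate per base per generation
    bias : uptake bias U_0..U_n; a flow entering load k is scaled by (1 + U_k)
           (assumed reading of the text)
    """
    n = len(seq) - 1
    new = list(seq)
    for m, count in enumerate(seq):
        if m < n:
            # Equation 4: decay from load m to m + 1.
            decay = mu * count * (1 - m / n) * (1 + bias[m + 1])
            new[m] -= decay
            new[m + 1] += decay
        if m > 0:
            # Equation 5: improvement from load m to m - 1.
            improve = mu * count * (m / (3 * n)) * (1 + bias[m - 1])
            new[m] -= improve
            new[m - 1] += improve
    return new


def simulate(seq, mu, bias, generations):
    """Apply step() repeatedly and return the final distribution."""
    for _ in range(generations):
        seq = step(seq, mu, bias)
    return seq


def mutation_equilibrium(n, total):
    """Distribution at which equations 4 and 5 balance without any bias."""
    return [total * comb(n, m) * 0.75 ** m * 0.25 ** (n - m) for m in range(n + 1)]
```

Starting from `mutation_equilibrium(9, 1000.0)`, a run with all biases set to zero should leave the distribution unchanged, whereas a positive `bias[0]` should steadily increase the number of perfect matches at the expense of the mismatched forms.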
DNA sequence logos and DUES nucleotides evolution
DNA sequence logos were generated in order to detect possible nucleotide conservation outside the 9 or 10 bp DUES, and to graphically monitor the emergence and evolution of DUES nucleotides in the respective genomes. I used the program Sequences_Extractor.pl to search both strands of the H. influenzae and N. gonorrhoeae genomes and extract all the 100 bp sequences containing the respective DUES or any of its mismatched forms at position 38 from the 5′ end of the sequence. I then used WebLogo version 2.8.2 [66,67] to generate logos for the sequences extracted at each mismatch load. Deviation of the genomic G+C content from the 50% assumed by WebLogo was accounted for by amending its source code and adding a correction factor $F$ to the value of each base frequency used for generating the logo. $F$ was calculated as $|V|(0.5 - h)/h$ (equation 6) for Ws, and $|V|(0.5 - f)/f$ (equation 7) for Ss, where $V$ is the value being corrected and $f$ is the genomic frequency of Ss. This adjustment permits a more accurate assessment of the selective pressure on DUES Ss and Ws by minimising underestimation of the importance of the less frequent ones. Contrary to the unbiased distribution of nucleotides in sequence parts not targeted by the extraction program (positions outside the DUES core), the distribution of nucleotides within DUES core positions was biased due to the targeting of specific combinations of nucleotides by the program Sequences_Extractor.pl. Thus, for nucleotides in the first type of positions, $\mathrm{Nuc}_{exp}$ was calculated as $h\,\mathrm{Seq}_{obs}/2$ (equation 9) for Ws, and $f\,\mathrm{Seq}_{obs}/2$ (equation 10) for Ss, whereas for matches at DUES core positions, $\mathrm{Nuc}_{exp}$ was calculated using equation 11 for Ws, and equation 12 for Ss. For mismatches at DUES core positions, however, the calculation of $\mathrm{Nuc}_{exp}$ depends on the nature of both the mismatched nucleotide in question and the DUES match at the same position.
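As a worked illustration of the composition correction (equations 6 and 7) and of the expected nucleotide counts outside the DUES core (equations 9 and 10), the following Python sketch may help. It does not reproduce the amended WebLogo source. The function names are hypothetical, and the genomic frequency of Ss is assumed to be f = 1 - h.

```python
def class_frequency(base, h):
    """Genomic frequency of the base's class: h for W (A/T), f = 1 - h for S (C/G).

    Assumption of this sketch: W and S frequencies sum to one.
    """
    base = base.upper()
    if base in {"A", "T"}:
        return h
    if base in {"C", "G"}:
        return 1.0 - h
    raise ValueError("base must be one of A, C, G or T")


def correction_factor(value, base, h):
    """Equations 6 and 7: F = |V| (0.5 - freq) / freq for the base's class."""
    freq = class_frequency(base, h)
    return abs(value) * (0.5 - freq) / freq


def corrected_logo_value(value, base, h):
    """Value used for the logo after adding the correction factor F."""
    return value + correction_factor(value, base, h)


def expected_outside_core(base, h, seq_obs):
    """Equations 9 and 10: expected count of one nucleotide at a position
    outside the DUES core, given seq_obs extracted sequences."""
    return class_frequency(base, h) * seq_obs / 2
```

With an illustrative W frequency of h = 0.62, a value of 1.0 is scaled down to about 0.81 for an A or T and up to about 1.32 for a C or G, and the correction vanishes when h = 0.5.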
Thus, for a W as mismatch at a DUES core position, $\mathrm{Nuc}_{exp}$ was calculated
A modified version of equation 3 in [66] was used to calculate a nucleotide conservation index ($\mathrm{Con}_{Nuc}$) for each nucleotide at each position of sequences at each mismatch load as

Dependence of the uptake bias on the strength of the sequence
I searched the 28 DNA fragments tested for DNA uptake in [40] (Accession no. M33432 to M33459), extracted all their 29 bp sequences, and calculated the strength of each using the program Sequences_Extractor.pl and equation 19. To test whether the genomic overrepresentation of DUES nucleotides reflects the efficiency of their uptake, I tested the dependence of the uptake of these 28 DNA fragments on the strength of their 29 bp sequences by regression analyses using Statistica version 5.1. As a negative control, the positions of these 29 bp sequences were randomized using a script written in Perl, and a similar regression analysis was performed on the resulting sequence strengths and the uptake of the DNA fragment from which the sequences were originally extracted.

Experimental analysis
To further test the theoretical results, synthetic DNA sequences containing the most conserved H. influenzae DUES consensus or one of its mutated forms were tested for uptake efficiency by competent H. influenzae. Both a 29 bp sequence of the most frequent nucleotide at each H. influenzae DUES consensus position (best DUES) and another of the least frequent ones (worst sequence) were synthesized and cloned by blunt-end ligation into the SmaI site of the plasmid pGEM-7Zf(-) (Accession no. X65311). Control 222 bp DNA template fragments containing the best DUES or the worst sequence were then PCR amplified from the constructs. Sequences carrying point mutations towards the least frequent nucleotide were generated from the best DUES construct using a three-step cohesive-end PCR process.
For each desired mutant, two half-fragments (113 bp and 135 bp, with 26 bp overlap) were first produced using overlapping internal primers mutated at the chosen DUES position. After gel electrophoresis and purification, both PCR products were combined as template for a third cohesive-end PCR reaction. The final 222 bp PCR products were then gel purified and sequenced before use as templates for Klenow radio-labeling reactions. These were carried out for three hours at room temperature and, for the initial hour, contained limiting α-³³P-dATP (6.0 µM). Subsequent addition of unlabeled (cold) dATP ensured complete replication of each molecule. DNA was then gel purified, and incorporation of α-³³P-dATP was checked by autoradiography after electrophoresis of an aliquot in an acrylamide gel. Radio-labeled DNAs were then mixed with unlabeled DNAs of the same sequence to a specific radioactivity of 1,000 counts per minute (cpm) per ng of DNA. Then 10, 20, 30 or 40 ng of each DNA sequence was separately added to 0.5 ml of competent H. influenzae cells for transformation as described in [68]. After 15 min incubation at 37 °C in rotating tubes, 25 µl of ice-cold DNaseI (1 mg/ml) was added to each tube before gentle vortexing and incubation for 5 min on ice. Then 50 µl of ice-cold 5 M NaCl was added and the tubes were gently vortexed, then centrifuged at 16,000 g for 1 min at 4 °C. After re-suspension in ice-cold MIV medium [39] containing 1 M NaCl, gentle vortexing and centrifugation, the final pellets were re-suspended in scintillation vials containing 200 µl MIV medium and 1 ml aqueous scintillation fluid at room temperature. Scintillation counting was carried out in a Beckman Scintillation Counter. For standardization and accuracy, vials were counted simultaneously, the ³³P decay factor was taken into account, and the background ³³P cpm was subtracted. The nanograms of DNA taken up were estimated using the specific radioactivity of the DNA. Experiments were carried out in triplicate.
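The final conversion from scintillation counts to nanograms of DNA taken up can be sketched in Python as below. This is only an illustration of the arithmetic implied by the text. The function name, the argument names and the way the decay factor is applied are assumptions, as is the ³³P half-life of about 25.3 days used for the correction.

```python
P33_HALF_LIFE_DAYS = 25.3  # approximate half-life of 33P (assumed value)


def dna_taken_up_ng(sample_cpm, background_cpm, days_since_labeling=0.0,
                    specific_activity_cpm_per_ng=1000.0):
    """Estimate the nanograms of DNA taken up from a scintillation count.

    The background is subtracted first. The specific radioactivity
    (1,000 cpm per ng at the time of labeling, as in the text) is then
    reduced by the fraction of 33P remaining after the elapsed time.
    """
    net_cpm = sample_cpm - background_cpm
    remaining = 0.5 ** (days_since_labeling / P33_HALF_LIFE_DAYS)
    return net_cpm / (specific_activity_cpm_per_ng * remaining)
```

For example, a hypothetical vial reading of 5,050 cpm with a 50 cpm background, counted on the day of labeling, would correspond to 5 ng of DNA taken up. The same reading one half-life later would correspond to 10 ng.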