Phylogenetic background and habitat drive the genetic diversification of Escherichia coli

Escherichia coli is mostly a commensal of birds and mammals, including humans, where it can act as an opportunistic pathogen. It is also found in water and sediments. We investigated the phylogeny, genetic diversification, and habitat-association of 1,294 isolates representative of the phylogenetic diversity of more than 5,000 isolates from the Australian continent. Since many previous studies focused on clinical isolates, we investigated mostly other isolates originating from humans, poultry, wild animals and water. These strains represent the species genetic diversity and reveal widespread associations between phylogroups and isolation sources. The analysis of strains from the same sequence types revealed very rapid change of gene repertoires in the very early stages of divergence, driven by the acquisition of many different types of mobile genetic elements. These elements also lead to rapid variations in genome size, even if few of their genes rise to high frequency in the species. Variations in genome size are associated with phylogroup and isolation sources, but the latter determine the number of MGEs, a marker of recent transfer, suggesting that gene flow reinforces the association of certain genetic backgrounds with specific habitats. After a while, the divergence of gene repertoires becomes linear with phylogenetic distance, presumably reflecting the continuous turnover of mobile element and the occasional acquisition of adaptive genes. Surprisingly, the phylogroups with smallest genomes have the highest rates of gene repertoire diversification and fewer but more diverse mobile genetic elements. This suggests that smaller genomes are associated with higher, not lower, turnover of genetic information. Many of these genomes are from freshwater isolates and have peculiar traits, including a specific capsule, suggesting adaptation to this environment. Altogether, these data contribute to explain why epidemiological clones tend to emerge from specific phylogenetic groups in the presence of pervasive horizontal gene transfer across the species.


Abstract 23
Escherichia coli is a commensal of birds and mammals, including humans. It can act as an 24 opportunistic pathogen and is also found in water and sediments. Since most population 25 studies have focused on clinical isolates, we studied the phylogeny, genetic diversification, 26 and habitat-association of 1,294 isolates representative of the phylogenetic diversity of more 27 than 5,000, mostly non-clinical, isolates originating from humans, poultry, wild animals and

136
There was a significant negative correlation between GRR and the patristic distance (Spearman's rho

151
Some O-groups are abundant, e.g., O8, O2 and O1 (each present in >50 genomes) but 152 almost half of the groups occur in a single genome and 43% of the strains could not be 153 assigned an O-group (even when the wzm/wzt and wzx/wzy genes were present). In contrast, 154 most H-types were previously known (87%). We found 311 combinations of O:H serotypes 155 among the 726 typeable genomes. Of these, 64% are present in only one genome,17% are 156 in multiple STs and 7% in multiple phylogroups (e.g. O8:H10). Conversely, half of the 95 STs homoplasic. They also show extensive variation of gene repertoires within STs. The gene 160 repertoire relatedness (GRR) between genomes (see Methods) decreases very rapidly with 161 phylogenetic distance for closely related strains, as revealed by spline fits (Fig. 1d). Similar 162 results were observed when removing singletons, which only account for on average 0.5% of 163 the genes in genomes, suggesting that this result is not due to annotation or sequencing 164 errors ( Supplementary Fig. 6). As a consequence, 85% of the intra-ST comparisons have a 165 GRR lower than 95% (corresponding to ~235 gene differences per genome pair), and some 166 as little as 77% (Fig. 1f). Hence, even genomes of the same ST can differ substantially in the 167 sequence of other persistent genes and in the overall gene repertoires.

169
To check if the dataset is representative of the species and can be used to assess its 170 diversity, we compared it with the ECOR collection 48 Table 1). Since the gene repertoire diversity of E. coli in Australia 176 is at least as high as that of ECOR and RefSeq, we studied the variation in gene repertoires 177 beyond the intra-ST level. After the rapid initial drop in GRR described above, the values of 178 this variable decrease linearly with phylogenetic distances (Fig. 1d). The average values of 179 GRR given by the regression vary between 90% for very close genomes and 80% for the 180 most distant ones. The variance around the regression line is constant and a spline fit shows 181 few deviations around the regression line. This is consistent with a model where initial 182 divergence in gene repertoires is driven by rapid turnover of novel genes. After this initial 183 process, divergence in gene repertoires increases linearly with patristic distance.

Phylogroups vary in the rates of gene repertoire diversification 185
We used the species phylogeny to study the associations between phylogroups and genetic 186 diversity (Fig. 2a). The tree showed seven main phylogenetic groups very clearly separated 187 by nodes with 100% bootstrap support. The 17 phylogroup C strains were all included within 188 the B1 phylogroup and were thus grouped with the latter in this study. For the rest, the 189 analysis showed a good correspondence between the assignment into the known  Fig. 7). The phylogroup G was the least diverse, but it is also poorly represented in our 199 dataset (33 genomes from three STs). Overall, genetic diversity is proportional to the depth 200 of the phylogroup, i.e. the average tip-to-MRCA distance, except for phylogroup F which is 201 more diverse than expected (Fig. 2b). These results suggest that genetic diversity varies 202 between phylogroups and that within phylogroups it is strongly affected by the time of 203 divergence since the most recent common ancestor.

214
Association between the rarefied pan-and persistent-genomes in each phylogroup. We used 1,000 215 permutations (genomes orderings) of 50 randomly selected genomes (rarefied datasets) to compute 216 the pan-and the persistent-genomes in each phylogroup (ignoring the G group), and then averaged  Table 2). This revealed larger pan-genomes for 230 phylogroups A, D, and B1 followed by E, B2 and F. Intriguingly, the larger the pan-genome of a phylogroup, the smaller the fraction of its genes that are part of the persistent genome ( Fig.   232 2c). This suggests that differences of pan-genome sizes across phylogroups are caused by 233 different rates of gene turnover in certain phylogroups. They affect all types of genes, even 234 those at high frequency in the species.

236
To quantify the similarities in gene repertoires, we analyzed the GRR values between 237 phylogroups. The smallest values were observed when comparing B2 strains with the rest 238 ( Supplementary Fig. 10). Accordingly, a principal component analysis of the 239 presence/absence matrix of the pan-genome shows a first axis (accounting for 23.6% of the 240 variance) clearly separating the B2 from the other phylogroups (Fig. 2d). This shows that 241 gene repertoires of B2 strains are the most distinct from the other groups, even if B2 is not a 242 basal clade in the species tree. Hence, phylogroups differ in terms of their gene repertoires 243 and in their rates of genetic diversification.

Mobile genetic elements drive rapid initial turnover of gene repertoires 245
Different mechanisms can drive the rapid initial diversification of gene repertoires. Mobile 246 genetic elements encoding the mechanisms for transmission between genomes (using 247 virions or conjugation) or within genomes (insertion sequences, integron cassettes) are 248 known to transfer at high rates and be rapidly lost 49-51 . We detected prophages using 249 VirSorter 52 , plasmids using PlaScope 53 , and conjugative systems using ConjScan 54 250 ( Supplementary Figs. 11-13). These analyses have the caveat that some mobile elements 251 may be split in different contigs, resulting in missed and/or artificially split elements. This is 252 probably more frequent in the case of plasmids, since they tend to have many repeated 253 elements 55 . Only two genomes lacked identifiable prophages and only 9% lacked plasmid 254 contigs. We identified 929 conjugative systems, with some genomes containing up to seven, 255 most often of type MPFF, the type present in the F plasmid. On average, prophages 256 accounted for 5% and plasmids for 3% of the genomes (Fig. 3a). Together they account for 257 more than a third of the pan-genomes of each phylogroup. We also searched for elements    suggests that the long-term cost of MGEs themselves is significant and/or their contribution 303 to fitness is low (or temporary).

305
Is the distribution of MGEs associated with phylogroups leading to preferential paths of gene 306 transfer? It has been suggested that homologous recombination is much rarer between than 307 within phylogroups 18 . To test if this applies to the transfer of MGEs, we analyzed the 308 distribution of the pan-genome gene families that are part of MGEs (excluding singletons, for 309 the separate analysis of prophages and plasmids, see Supplementary Fig. 15). Even if these genes are at low frequency in the pan-genome and are observed in a single phylogroup 311 more often than expected by chance (Z-score>20, see Methods), 75% of the phage and 312 plasmid gene families were found in more than one phylogroup and 8% were found in all 313 phylogroups (usually at low frequency, Fig. 3d). Accordingly, the number of gene families 314 present in two to six phylogroups is barely lower, even if significantly so, than expected by     Table 6). Accordingly, the number of 403 MGE-associated gene families specific to a given source was higher than expected (Z-404 score >17, Supplementary Fig. 15), and nearly one third of these source-specific families 405 were observed in multiple phylogroups. When we focused on the number of co-occurring 406 recently acquired gene pairs (encoding for MGE or not), we found that they are more 407 frequent within most of the isolation sources than expected by chance (Fig. 4c, col 15, see 408 Methods). These results suggest that the contribution of MGEs to genome size is primarily 409 driven by isolate source rather than phylogroup membership.

411
The previous result could arise from preferential co-gains of MGEs in an isolation source 412 relative to a phylogroup. To test this hypothesis, we used the results from Count and built a 413 matrix where for each gene family we indicate the acquisition or not of a gene in a terminal branch of the phylogenetic tree. We then compared the clustering of these recent 415 acquisitions by phylogroup and by isolation source using Shannon indexes (see Methods). If 416 the hypothesis is correct, we expected higher clustering (lower diversity) across sources than 417 across phylogroups. We observed slightly higher clustering across phylogroups than across 418 sources, both for MGE-associated and for the other genes (Fig. 4e). We conclude that the 419 contribution of MGEs to genome size depends largely on the isolation source but that this 420 does not reflect systematic co-gains of MGEs in the same source.

422
It is tempting to speculate that the association between the number of MGE-associated   isolated from humans (this study) and other mammals 2 . The genome size of freshwater 464 strains' is the smallest among all groups of isolates and across phylogroups (Fig. 4c, col 1,

503
We started our study by quantifying gene repertoire diversification, which we found to follow 504 a two-step dynamics. The very rapid initial diversification, where GRR quickly decreases to 505 ~90%, implicates substantial heterogeneity in terms of gene repertoires for strains that are 506 from the same sequence type and are almost identical in the sequence of persistent genes.

507
Some of this divergence may be due to genome sequencing or assembling artifacts 508 producing singletons and thus inflating pan-genomes. Yet, we have annotated all genomes in 509 the same way. We also confirmed key results by excluding singletons, and showed that

579
Genetic diversity, created by HGT, recombination, or mutation, affects a species' ability to 580 adapt to novel ecological opportunities. The higher the diversity of gene repertoires in a 581 population, the more likely that one of those genes will prove helpful in the face of 582 environmental challenges such as antibiotics. We observed that the generalist phylogroups, 583 such as A and B1, have broader pan-genomes than specialist phylogroups like B2. This was 584 not expected based on their smaller genome sizes or the lower frequency of MGEs in their 585 genomes. We propose that this reflects the high variability of the environments where they 586 circulate and the consequent diversity of local adaptation processes. Phylogroup B2 strains, by comparison, have developed very specific traits that may let them take advantage of 588 some particular resources, e.g. they are better adapted to mammal gut environment 2 . This 589 has resulted in large genomes that have diverged more from the other E. coli, as revealed by 590 the PCA analysis, but that are overall more conserved (largest persistent-genome, smaller 591 pan-genomes, fewer recent gene acquisitions). Altogether, these results suggest that the 592 habitat and the phylogenetic structure jointly determine the size of genomes. The results also 593 suggest the hypothesis that the large genomes of some phylogroups, like B2, are caused by                 indicating that the pan-genomes size depend on the number of genomes analyzed. Thus, to 731 compare genetic diversity across datasets (e.g. phylogroups), we rarefied the genome 732 datasets, i.e., each pan-genome was constructed with the same number of genomes in each 733 comparison. To do this, 1,000 subsets of X genomes (X depending on the analysis, specified 734 in the results section) were randomly selected for comparison in each group, resulting to 735 datasets called hereafter rarefied datasets (Supplementary Fig. 8).

781
The phylogenetic tree of Escherichia genus was inferred from the persistent-genome 782 obtained with the 87 outgroup genomes and the 100 E. coli reference genomes (see above) 783 using the same procedure as the species tree. In this case, the persistent-genome is 784 composed of 1,589 proteins families, and the resulting alignment of 1,469,523 bp. The genus 785 phylogenetic tree was extremely well supported: all nodes had bootstrap support higher than 95%. Its topology was consistent with a previous study 110 (Supplementary Fig. 3c). Then, we 787 used it to precisely root the species tree ( Supplementary Fig. 3d).

816
To characterize the genetic diversification of each phylogroup of the Australian dataset, we 817 computed the three different standard indexes: the GRR, the Jaccard, and the Manhattan 818 indexes. All these indexes were highly correlated ( Supplementary Fig. 9). Thus, only 819 analyses with GRR were reported and illustrated in the main text. Note that we always used

855
We counted the number of recently acquired gene pairs (co-gains) from the same pan-  i.e. the sum of pairs within and between phylogroups, in the real data, and in each simulation, resp.). We applied the same approach (i) considering only gene pairs encoding 869 for MGEs (similar result as in Fig. 3), (ii) for sources (instead of phylogroups, Fig. 4).

952
were detected with one quarter located at the edges of contigs. We only found one copy per 953 genome. They were often located on very short contigs (20 proteins on average), and five 954 make all the contigs. Most (86%) were located on contigs predicted as plasmid by Plascope, 955 the remaining were on unclassified contigs. Except for the latter, intI genes were always 956 located next to ARGs. IS elements were identified using ISfinder 56 . Only hits with an e-value 957 lower than 10 -10 , a minimum alignment coverage of 50% and with at least 70% identity were 958 selected, we extracted the IS name of the best hit. Therefore, we identified 47,592 genes 959 encoded for IS elements, among them 43% were located at the edges of contigs 960 (20,329/47,592). They represented 1,006 gene families (~1% of the pan-genome), of which 961 41% were singletons. Only 13% were multigenic protein families (i.e., with more than one 962 member in at least one genome). Among them, 9 protein families were found in more than 10 963 copies in at least one genome, i.e., ISEc1 (10 copies), IS1397 (11), ISSoEn2 (11), IS621 (11),