Diversification of the type IV filament super-family into machines for adhesion, secretion, DNA transformation and motility

Processes of molecular innovation require tinkering and co-option of existing genes. How this occurs in terms of molecular evolution at long evolutionary scales remains poorly understood. Here, we analyse the natural history of a vast group of membrane-associated molecular systems in Bacteria and Archaea, the type IV filament super-family (TFF-SF), which diversified into systems involved in flagellar or twitching motility, adhesion, protein secretion, and natural DNA transformation. We identified such systems in all phyla of the two domains of life, and their phylogeny suggests that they may have been present in the last universal common ancestor. From there, two lineages, one bacterial and one archaeal, diversified by multiple gene duplications of the ATPases, gene fission of the integral membrane platform, and accretion of novel components. Surprisingly, we find that the Tad systems originated from the inter-kingdom transfer from Archaea to Bacteria of a system resembling the Epd pilus. The phylogeny and content of ancestral systems suggest that initial bacterial pili were engaged in cell motility and/or DNA transformation. In contrast, specialized protein secretion systems arose much later, and several independent times, in natural history. All these processes of functional diversification were accompanied by genetic rearrangements with implications for genetic regulation and horizontal gene transfer: systems encoded in fewer loci were more frequently exchanged between taxa. Overall, the evolutionary history of the TFF-SF by itself provides an impressive catalogue of the variety of molecular mechanisms involved in the origins of novel functions by tinkering and co-option of cellular machineries.

also arise by processes of neo-functionalization or sub-functionalization following the duplication of genes encoding proteins with multiple functions or the acquisition of a genetic system with homologs in the genome. This mechanism may have been at the origin of the bacterial flagellum [8]. It has been proposed that non-adaptive processes acting on redundant genes in species with small effective population sizes provide substrates for the secondary evolution of complex traits by natural selection [9]. Less is known about how key macromolecular complexes can evolve novel functions, or specialize in one of several initial functions, thanks to a combination of mutational processes and horizontal gene transfer [10].

The appendages of bacteria are striking examples of functional diversification. They are complex macromolecular machineries encoded by many genes and spanning several cellular compartments that can evolve towards novel functions. For example, the type III protein secretion system (T3SS) evolved from the secretion apparatus of the bacterial flagellum [11], the T4SS from the conjugation apparatus [12], and the T6SS possibly from co-option of phage structures [13,14]. A particularly remarkable

genomics analyses. We adapted previously published models of T4aP (including the Com systems of diderms), T2SS, and Tad [41,47], to which we incorporated additional components and stricter rules in terms of genetic composition and organization (Fig. S1). We used the available literature to produce equivalent models, and associated hidden Markov models (HMM) protein profiles, for the Com pilus of monoderms designated as ComM, and for the Archaeal-T4P. For the latter, we used 66 arCOGs identified from [37], after a step of re-analysis of the initial 191 arCOGs to remove redundancy.
We could not build equivalent models for T4bP and MSH systems at this point because too few systems were described in the literature. This resulted in five initial models, of which two are novel, including 154 HMM protein profiles, of which 17 are novel (Table S1).

The phylogenies of the components of the type IV filament super-family

The presence of homologs of the major functional components of the TFF-SF in most types of systems raises the question of how their functional diversification took place from a common ancestor. To study this, we added to the models described above a very simple generic model to identify all systems that have a minimal number of essential components (the ATPase, the integral membrane platform, and a major pilin) (Fig. S1). The search for systems using the MacSyFinder models resulted in the identification of 6652 systems in 3700 genomes (1486 species) (Fig. S2), of which 1584 were classed as generic systems, reflecting the conservative character of the initial models. This dataset was too large to analyse using sophisticated phylogenetic methods and included many systems that were very similar, e.g. from different strains of the same species. We reduced this redundancy by clustering very similar systems. We then picked one representative per cluster, thus preserving most of the diversity of the dataset. In this process, we prioritized the inclusion of experimentally validated systems, including MSH (1), and T4bP (5) for which models were not available (see Methods). This non-redundant set contains 309 representative systems (33 T4aP, 47 Archaeal-T4P, 29 T2SS, 5 T4bP, 1 MSH, 31 ComM, 72 Tad, 101 generic) (Table S2). Hence, the systems used in the subsequent analyses are associated with a (sometimes large) number of other very similar systems that are from the same cluster.
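The selection of one representative per cluster, with priority given to experimentally validated systems, can be illustrated with a minimal sketch. The cluster identifiers, system names, data layout and the tie-breaking rule (first member in list order) are hypothetical and are not taken from the study; the actual procedure is described in the Methods.

```python
from typing import Dict, List, Set


def pick_representatives(clusters: Dict[str, List[str]],
                         validated: Set[str]) -> Dict[str, str]:
    """Return one representative system per cluster.

    Experimentally validated systems are preferred. Ties are broken by
    list order, which is an assumption made for this illustration only.
    """
    representatives = {}
    for cluster_id, members in clusters.items():
        preferred = [m for m in members if m in validated]
        representatives[cluster_id] = preferred[0] if preferred else members[0]
    return representatives


# Toy input: three clusters of very similar systems (hypothetical names).
clusters = {
    "c1": ["sysA", "sysB", "sysC"],
    "c2": ["sysD"],
    "c3": ["sysE", "sysF"],
}
validated = {"sysB", "sysF"}
representatives = pick_representatives(clusters, validated)
```

In this toy case the validated systems sysB and sysF are retained for their clusters, and the single member sysD represents the cluster that has no validated system.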
We inferred the phylogeny of each of the five key protein components (AAA+ ATPase, IM platform, major pilin, secretin and prepilin peptidase) by maximum likelihood with IQ-Tree [51]. We made ten reconstructions per component with the most thorough mode of topological search to account for the stochasticity of the method. The detailed analysis of key events revealed by these trees can be found in Table S3 (the trees themselves are in Table S4). The ATPase trees are very well supported at most of the key nodes, they are consistent across replicated inferences, and they clearly separate the different types of systems (Fig. S3). The trees include two system-specific duplication events of the ATPases, one ancestral to the large clade including T4aP, T4bP, MSH, T2SS, and ComM (PilT/PilB), and another within a clade of T4aP (PilT/PilU). The IM platform also discriminates the different systems. It includes the well-known paralogs TadB/TadC in the Tad system and some duplications in the Archaea. Apart from these duplications that span large numbers of systems, there were other duplicates in the systems that were rare (present in 8% of the representatives' dataset) and were dealt with by a de-replication procedure (see Methods

sequences are monophyletic (100% UF-Boot support) and branch between two large clades: the Tad and Archaeal-T4P on one side (100% UF-Boot) and a clade grouping the T2SS, T4aP, ComM and T4bP on the other side (100% UF-Boot) (Fig. S4). The overall rooted topology is very similar to that of the unrooted tree in eight out of ten trees (Table S3). The inclusion of the ubiquitous ATPase of T4SS (VirB4) as an outgroup with FtsK also showed a split between the archaeal and the bacterial branches of the tree (Fig. S5). This confirms that this ATPase family is also an outgroup of the TFF-SF.
We rooted the trees of the IM platform and major pilin using the root of the ATPase trees, since all three proteins showed a consistent split between Tad/T4P-Archaea on one side and the remaining systems on the other (Figs

The analysis of gene duplications provides additional information on the possible roots of the super-family phylogenetic tree, because placing duplication events on a tree amounts to setting as ancestral the node where the duplications occurred, and as descendants those with the duplication [57,58]. The duplications of the ATPases exclude the root from the group T4aP, T4bP, ComM, MSH and T2SS. The duplication of the IM platform in the Tad system, also present in some Archaeal-T4P, excludes the root from within these groups. Hence, the analyses of duplication events are consistent with the root as defined above by the tree of ATPases.

Producing a concatenate tree

Since the two major components (ATPase and IM platform) have phylogenetic trees that are broadly consistent (Table S3), we computed a phylogenetic tree of their concatenate using a partition model (best model for each gene partition, as computed by IQ-Tree). The major pilin was excluded from the concatenate because it shows less consistent and less supported topologies. Concatenation required the use of a procedure to deal with paralogs (to have one marker per component per system). For paralogs present in a few taxa, we chose in each system the protein most similar in sequence to the most closely related systems lacking paralogs (see Methods). For the ATPases, we used PilB, because this ATPase is responsible for the assembly of the pilus, which is a function that is essential in all families, contrary to the function of PilT/PilU (retraction).
There was no good argument to pick TadB or TadC platform proteins and we therefore made phylogenetic reconstructions with each of them in parallel (Fig. 3, S10). As expected, the concatenate trees showed relationships similar to those of the ATPase and the IM platform. We then tested the congruence between the two concatenate trees (PilB/TadB and PilB/TadC) and the individual protein alignments (PilB and TadB for the first, and PilB and TadC for the second) using the AU test implemented in IQ-Tree (v1.6.7.2) [59]. This showed that the best trees of each individual protein were not significantly different from the best tree of the concatenate (p>0.05, Table S7). Furthermore, after correction for multiple comparisons, only two of the 40 comparisons between the individual trees and the concatenate tree were significantly incongruent. Overall, these analyses strongly suggest that the TFF-SF derived from an ancestral system, which diversified initially into an archaeal system ancestor of the Tad/Archaeal-T4P and a bacterial system ancestor of the T4aP/T4bP/T2SS/MSH/ComM.
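The construction of a two-marker concatenate with one partition per gene, as described in this section, can be sketched as follows. This is a minimal illustration under stated assumptions: alignments are held as plain dictionaries mapping a system identifier to its aligned sequence, one marker per component per system has already been chosen, and the system names and sequences are toy data. It does not reproduce the IQ-Tree workflow itself; it only shows how partition boundaries follow from the alignment lengths.

```python
from typing import Dict, List, Tuple


def alignment_length(aln: Dict[str, str]) -> int:
    """Return the common length of an alignment, or raise if rows differ."""
    lengths = {len(seq) for seq in aln.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return lengths.pop()


def concatenate_markers(atpase: Dict[str, str], platform: Dict[str, str]
                        ) -> Tuple[Dict[str, str], List[Tuple[str, int, int]]]:
    """Concatenate two alignments for the systems present in both.

    Returns the concatenated alignment and 1-based, inclusive partition
    coordinates (name, start, end), so that a separate substitution model
    could be fitted to each gene partition.
    """
    len_a = alignment_length(atpase)
    len_p = alignment_length(platform)
    shared = sorted(set(atpase) & set(platform))
    concat = {s: atpase[s] + platform[s] for s in shared}
    partitions = [("ATPase", 1, len_a),
                  ("IM_platform", len_a + 1, len_a + len_p)]
    return concat, partitions


# Toy alignments (hypothetical systems and sequences).
atpase = {"sys1": "MK-LV", "sys2": "MKALV", "sys3": "MR-LI"}
platform = {"sys1": "GT-", "sys2": "GSA"}
concat, partitions = concatenate_markers(atpase, platform)
```

Here sys3 is dropped because it lacks an IM platform sequence; whether such systems are dropped or padded with gaps is a design choice that this sketch does not take from the study.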

The archaeal systems and the emergence of Tad

The ATPase, IM platform and concatenate trees are broadly consistent with five groups within Archaea (Fig. 3, S3, S6, S10), of which four replicate previous findings [37]. All experimentally validated archaella are part of a highly-supported clade (100% UF-Boot, group 3 in [37]) that is the sister clade to another highly supported clade containing two pili involved in surface adhesion in Halobacteria (group 2 in [37]). They are sister-groups of a clade gathering the Aap, Bindosome, and Ups (pili

The systems at the base of the tree make a distinct clade but have unknown functions.
Unexpectedly, the position of the root places Tad as a system derived from Archaeal-T4P systems. This feature is found in the trees of the three key components with high confidence. Furthermore, all these trees showed a monophyletic clade including the Tad and the Epd pilus (clade "Epd-like"), whose major pilins have similarly short sequence lengths when compared to the others from Archaeal-T4P (Fig. S7). Both Epd-like pili and Tad have two homologous genes encoding the IM platform, suggesting that their common ancestor already contained them both. We examined the domain structure of these two genes and found that each has one "T2SSF" domain, where most other Archaeal-T4Ps have two such domains and longer IM platform proteins. This strongly suggests that TadB and TadC were derived from an ancestral event of gene fission. To confirm this observation, we aligned the TadB and TadC profiles with the archaeal IM platforms containing two T2SSF domains. In these cases, TadC aligned best with the N-terminal domain, while TadB aligned best with the C-terminal domain of the archaeal proteins. This further supports the gene fission scenario. Finally, the Tad systems have a protein, TadZ, which has significant HMM-HMM profile alignments with Archaeal-T4P components (arCOG00589 and arCOG05608) including those from the Epd-like clade (group 1 from [37]), but not with profiles from the bacterial systems. Altogether, these results strongly suggest that an ancestral Archaeal-T4P harbouring two genes encoding the IM platform diversified into Epd-like systems in Archaea and was transferred horizontally, apparently only once, to Bacteria, leading to the extant Tad systems. The transfer of the system from Archaea to Bacteria was very ancient.
Tad systems have been frequently transferred among Bacteria since then (see below), and it is not possible to infer the precise bacterial taxa that acquired the original system. However, the Tad systems at the base of the clade are from Proteobacteria in 18 out of 20 concatenate trees, often with very good support (Table S3). The two odd concatenate trees place Firmicutes at the base of the Tad clade, but with very low support. This suggests that the ancestor of the Tad system was acquired by a diderm bacterium, and the accretion of the outer-membrane, pore-forming secretin to the

The key early event in the ATPase trees of the bacteria-only T4P large clade was the amplification leading to the paralogs PilB (the assembly ATPase) and PilT (the retraction ATPase). This event appears as a simple duplication at the base of the tree in certain of the ATPase trees, but also shows more complex scenarios in others (Table S3). In the PilB part of the ATPase tree, the T4bP is basal and the other systems are regrouped with T4aP. This scenario is consistent with that of the secretin tree, where if one places the root between T4aP and T4bP one finds the T2SS deriving from a T4aP system, as in the PilB trees. This is also sustained, albeit with low support, by the major pilin tree, where one finds at basal positions the T4aP and the T4bP. The presence of PilT in very early parts of the tree suggests that the most ancient systems were able to retract the pilus. This is consistent with the ability of both T4aP and T4bP systems to promote twitching motility.

One of the most interesting functions of the super-family, from the evolutionary point of view, is the involvement of some of its systems in natural transformation. The ComM system is commonly found in Firmicutes, even if it is unclear whether it is always involved in transformation.
It is monophyletic in all the phylogenetic reconstructions we made, usually with very high support (≥95%). In the concatenate trees, ComM branches apart from a group gathering T4aP, MSH and T2SS, after the divergence with T4bP. The trees of individual components show similar scenarios once one accounts for the effects of the ATPase paralogs, and for the low support of some parts of the IM platform trees. Interestingly, the major pilin trees show ComM branching within T4aP, with poor resolution, close to systems that are known to be involved in natural transformation. This evidence is weak, but suggests a link between the major pilin and transformation. In summary, these results suggest that ComM arose early and only once in the history of the TFF-SF. The T4P systems experimentally linked to natural transformation in diderms were systematically identified as T4aP, and also tend to cluster together in the tree.

TFF-SF elements are ubiquitous in the prokaryotic world

We used the rooted concatenate tree to class the numerous generic systems that we had previously identified. We assumed that clades where all systems were either generic or of a single type (of which at least one validated experimentally) could be tentatively assigned to that type. Generic systems in clades lacking experimentally validated systems were left unassigned. Only two types of systems were paraphyletic in the tree, T4aP and Archaeal-T4P, and were thus treated differently. T4aP were split in a few monophyletic clades, and systems within each clade were re-assigned using the method above. The Archaeal-T4P systems, from which the Tad derives, can be easily distinguished from the latter, and thus re-assigned, using a taxonomic

We used these tentatively assigned systems to produce more sensitive MacSyFinder models.
First, we changed the HMM profiles to account for the genetic diversity introduced by the re-assigned systems. Second, we created models to detect the T4bP and the MSH pilus, since we now had a much larger number of examples of these systems. Finally, we searched for genes systematically associated with the systems' loci, in a neighbourhood of ±20 genes, that were not matched by any of the HMM profiles of the models. We clustered the proteins by sequence similarity and analysed the largest families. This "guilt-by-association" approach failed to show other proteins systematically associated with a particular type of system (Table S5), suggesting that our models already encompass their most frequent components. This process resulted in more sensitive models that accounted for all known types of systems and correctly identified the 81 experimentally validated systems of bacteria analysed in Table S2, except the T2SS of Chlamydia and Bacteroidetes (shown above to be peculiar).

Using the novel improved models, we found 9026 systems within 4610 genomes,

Interestingly, the genetic organization of Epd is very similar to the Tad: the two IM platform genes are contiguous and followed by the major ATPase and the secondary one (TadZ in Tad and FlaH (arCOG04148) in Epd) (see Fig. 5). Hence, the core genetic organization of Tad evolved in Archaea before the transfer of the system to Bacteria.
The patterns of genetic organization of the homologous components differ between systems. For example, T4aP shows a conserved triplet of genes pilBCD of which the pair pilBC is often conserved in other types of systems but pilCD is not. In general, pilins tend to be encoded together, but can vary in their co-localization with the rest of the genes: they can be apart (T4aP), at the edge of the locus (ComM, Tad) or in the middle (T2SS, T4bP). In archaea, all cases were found. Interestingly, many duplicated genes tend to be contiguous. This is the case of many pilins, of the pilUT genes encoding the ATPases, and of the integral membrane platform genes tadBC. This is consistent with models suggesting a bias towards gene duplication in tandem

We thus hypothesized that systems encoded in single loci are more likely to undergo horizontal gene transfer. To test this hypothesis, we compared the phylogenetic tree of each system, a sub-tree of the larger phylogenetic reconstruction, with a maximum likelihood tree of the 16S rRNA sequences of the species carrying the systems (Fig. S14). We excluded the archaeal systems from these analyses because their loci are harder to define precisely (sometimes scattered and multiple systems per genome) and their functions are still poorly delimited in most cases (complicating the definition of the clade to use in the analysis). We found that systems encoded systematically in a single locus are more frequently transferred than those encoded in several loci (Fig. 6). These results are reinforced by the analysis of the frequency with which systems are encoded in plasmids, which follows closely the trends observed for the frequency of transfer (highest in Tad and lowest in ComM, Fig. 6).
The contrast is especially interesting between the Tad and T4aP systems that are both present in many different clades and are encoded almost exclusively in one locus (Tad) or many loci (T4aP). This association between rates of transfer and organization suggests that systems that are frequently gained and lost endure a selective pressure for being encoded in a single locus.

DISCUSSION

We used comparative genomics and phylogenetics to produce models and protein profiles that identify TFF-related systems in genomic data of all Prokaryotes. They are publicly available and provide a significant advance relative to our previous work, since they are more sensitive and cover more types of systems (Archaeal-T4P, ComM, MSH and T4bP). We used them to quantify the frequency and taxonomic distribution of the different systems. Strikingly, every inspected phylum of Prokaryotes has some type, and often several types, of systems from the TFF-SF.

of Archaeal-T4P. Further experimental study of these systems is required to produce reliable MacSyFinder models for each of them.

Our approach may be regarded as conservative. First, some components of the systems are not sufficiently conserved in sequence for reliable phylogenetic analyses at this large time-scale and were not used in the phylogenetic inference. The minor pilins are a particularly important set of proteins that were ignored because they produced short and very poor multiple alignments. Second, we rely on the existence of experimentally validated systems and on monophyletic clades having such systems to build the models. If the systems have been described in few species, or in a small number of phyla, then this limits our ability to identify them, especially when they are very different in terms of gene repertoires and protein sequences.
This may explain why our models missed the T2SS of Chlamydiae: they carry few components and these are of different origins. Notably, its major pilin is atypical, with a disulphide bond and presumably lacking the Ca-binding site. In other cases, systems may actually differ from the descriptions in the literature. This is probably the case of the so-called T2SS of Bacteroidetes. This system is involved in protein secretion [63], but consistently branches apart from T2SS in all analyses of the phylogenetic markers. The major pilin of this system is very divergent compared to major pseudopilins from proteobacteria. Our analysis raises the exciting possibility that it might represent a novel type of secretion system derived from the T4aP independently of the T2SS.

The phylogeny of the key components of the TFF-SF revealed an initial split between archaeal and bacterial systems, suggesting that these structures may have pre-dated the last common ancestor of all cellular organisms (Fig. 7). This ancestral system presumably had one ATPase for its assembly (the function performed by PilB in

T4bP, ComM, Tad and Archaeal-T4P. They were found in MSH and T2SS (Fig. 2, Table S5), suggesting that they arose more recently and that other systems do not require these proteins for assembly (Fig. 7). In short, our results are consistent with the idea that the ancestral system was able to energise its assembly and build up a

For this work, we could use the models previously proposed by TXSScan [41] for T2SS, T4P, and Tad, but we wished to add a few components that were missing there. For the Archaeal-T4P and for ComM we did not have an initial model. We proceeded in two steps. First, we made conservative initial models that matched the archetypal systems, but sometimes were too strict for some atypical systems. This resulted in a list of systems in which we had strong confidence.
However, it also missed many systems. To identify these systems, we built a model called generic that had only the basic building blocks of these systems, with all the homologous proteins set as "exchangeable". Following the comparative and phylogenetic analyses, we re-defined all the models to make final, less strict models that could identify a larger number of systems. Both sets of models are made available in Dataset S1. The table with all protein profiles is given in Table S4.

according to their occurrence in the systems. Accordingly, the number of MGR and MMGR were increased to 8. The prepilin peptidase pilD was changed to mandatory, loner and exchangeable with a number of homologous components from the T2SS (gspO) and ComM (comC), because it can be found alone in the genome and because the HMM profiles of these two genes often have better e-values than that of the T4aP.

The final model of T4P includes pilW, pilX and pilY as new accessory components. We decreased MMGR to 4 and MGR to 5, which fits the data better. We set fimT, pilM, pilP and pilA as accessory, to help MacSyFinder to search more complete T4P in the genome. We also removed the forbidden genes gspN, tadZ and gspC.

T2SS. The initial model of T2SS followed closely the definitions proposed in [41], except that we increased the MGR to 8 and set the prepilin peptidase gspO as mandatory, loner and exchangeable with a number of homologous components from the T4aP (pilD).

The final model was relaxed to identify a larger fraction of the systems. We reduced the MMGR to 4 and the MGR to 5. To fit the data better, we added the prepilin peptidase of ComM (comC) as another exchangeable gene of gspO. We set the gspC gene as mandatory, gspM as accessory and gspD as loner to fit the data better.

ComM.
In this initial model only the genes that compose the pilus were used, without the genes that encode the DNA uptake system, such as comEA, comEB and comEC [34, 95-98]. The minimal distance between genes was set to 5. The MMGR was set to 3 and the MGR to 5. The system was set as multi_loci because some genes are loner. The genes comC, comGA, comGB, comGC and comGD were set as mandatory and the other ones were set as accessory according to their presence in experimentally validated systems, curated with an exploratory phase to assess the relative abundance of the genes in the systems (the genes present in more than 80% of the detected systems were set as mandatory and the others as accessory). comC was set as loner and exchangeable with pilD of the T4aP because we found cases where the HMM profile of pilD had a better e-value than that of comC; for the same reason, comGA was set as exchangeable with pilB of T4aP. The genes comB, comK and comX were set as loner because they are often found alone in the genome.
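The 80% presence rule used in the exploratory phase can be sketched as below. This is an illustration only: the detected systems are represented as sets of gene names, the toy data are hypothetical (including the use of comGE and comGF as examples of less frequent genes), and in the study the final status of each gene also depended on the experimentally validated systems.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def classify_genes(systems: Iterable[Set[str]],
                   threshold: float = 0.8) -> Dict[str, str]:
    """Label each gene 'mandatory' or 'accessory' by presence frequency.

    A gene is 'mandatory' if it occurs in strictly more than `threshold`
    of the detected systems, and 'accessory' otherwise.
    """
    systems = list(systems)
    n_systems = len(systems)
    counts = Counter(gene for genes in systems for gene in set(genes))
    return {
        gene: "mandatory" if count / n_systems > threshold else "accessory"
        for gene, count in counts.items()
    }


# Toy set of five detected systems (hypothetical gene content).
detected = [
    {"comC", "comGA", "comGB", "comGF"},
    {"comC", "comGA", "comGB"},
    {"comC", "comGA", "comGB", "comGE"},
    {"comC", "comGA", "comGB", "comGF"},
    {"comC", "comGA", "comGB", "comGF"},
]
status = classify_genes(detected)
```

With these toy data, comC, comGA and comGB occur in all five systems and are labelled mandatory, whereas comGF (three of five) and comGE (one of five) are labelled accessory.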
In the final model we changed the number of genes for the MMGR and MGR to 4. We also added the genes encoding the DNA uptake system in the plasma membrane

Core "mandatory" components were defined based on the literature and experimentally validated systems. Other components were set as "accessory". The arCOG families that matched the same component were defined as exchangeable. The prepilin peptidase was set as a loner gene that can be part of multiple systems. This initial model asked for a minimal number of mandatory genes and overall number of genes of 4. Of 14 experimentally validated systems found in the literature, 10 were detected with this initial model (Table S1). After counting the occurrence of the different arCOG in the detected Archaeal-T4P, we removed those without any occurrence, to reduce the number to 109 arCOG families.

Final model. The number of genes for MMGR and MGR was reduced to 3, which fits better the data.

We created 20 HMM profiles, the MMGR was set to 3 and the MGR to 4. The system was set as multi_loci as some genes are loner. The genes mshA, mshE, mshG, mshL and mshM were set as mandatory and the others were set as accessory,

The dataset with all the systems identified in the genomes is too large to make phylogenetic inferences. It also contains many very closely related systems that may provide little additional information to infer the deeper nodes of the tree. Hence, we developed a method to remove redundancy in the dataset while maximising its genetic diversity. The method prioritizes the inclusion of systems that were experimentally validated to facilitate the analysis of the results. The method consists of several sequential steps (Figure S15).
1) We inferred the maximum likelihood tree for each key components' family as mentioned above and extracted the matrices of patristic distances (using the R function "cophenetic.phylo" of the package ape) between all leaves of the trees. This resulted in a set of distance matrices between systems.
2) When a system had multiple copies of a family of clade-specific paralogs, it was represented multiple times in the phylogeny and in the distance matrix. To solve this problem and to have only one distance between two systems, we chose the minimal distance between the paralogs of the two systems.
3) Each core protein family has a different rate of evolution. To compare them,

To reduce the number of paralogs in each system, we used the following method (Fig. S16).

1) We inferred the maximum likelihood tree for each key components' family of the representative dataset as mentioned above and extracted the matrices of patristic distances (using the R function "cophenetic.phylo" of the package ape) between all leaves of the trees. This resulted in a set of distance matrices between proteins.
2) For each system with more than one copy per gene, we found the nearest system based on patristic distances extracted from the ATPase or the integral membrane platform tree (depending on the number of copies of the ATPase), which had only one copy of this gene.
3) We used this nearest system to choose the copy of the duplicate gene with the smallest distance to its homolog in this nearest system.

For each detected system (those indicated in Fig. 4), the edge width represents the number of times the two genes are contiguous divided by the number of times the rarest gene is present in the system. The colour of the edge represents the number of times the two genes are contiguous in the system divided by the number of systems.

Table legends

Table S1. List of all the profiles of the TFF-SF used in the analysis.
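The two distance-based rules described in the Methods above (taking the minimal patristic distance over paralogs to obtain a single distance between two systems, and keeping the paralog copy closest to its homolog in the nearest single-copy system) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the patristic distances are assumed to have been computed already (the text uses the R package ape for this step) and are stored in a plain dictionary keyed by unordered protein pairs; all protein names and distance values are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterable


def pair(a: str, b: str) -> FrozenSet[str]:
    """Unordered key for a pair of proteins."""
    return frozenset((a, b))


def system_distance(dist: Dict[FrozenSet[str], float],
                    copies_a: Iterable[str],
                    copies_b: Iterable[str]) -> float:
    """Single distance between two systems: the minimum over all pairs
    formed by one paralog copy from each system."""
    return min(dist[pair(x, y)] for x, y in product(copies_a, copies_b))


def choose_copy(dist: Dict[FrozenSet[str], float],
                copies: Iterable[str],
                reference: str) -> str:
    """Keep the paralog copy closest to its homolog in the nearest
    system that has a single copy of the gene (`reference`)."""
    return min(copies, key=lambda c: dist[pair(c, reference)])


# Toy patristic distances between proteins of systems A and B.
dist = {
    pair("A1", "B1"): 0.9,
    pair("A2", "B1"): 0.4,
    pair("A1", "B2"): 0.7,
    pair("A2", "B2"): 0.6,
}
d_ab = system_distance(dist, ["A1", "A2"], ["B1", "B2"])
kept = choose_copy(dist, ["A1", "A2"], "B1")
```

In this toy example the distance between the two systems is 0.4 (the A2/B1 pair), and A2 is the copy retained for system A when B1 is the homolog in the nearest single-copy system.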