A systematic pipeline for classifying bacterial operons reveals the evolutionary landscape of biofilm machineries

In bacteria functionally related genes comprising metabolic pathways and protein complexes are frequently encoded in operons and are widely conserved across phylogenetically diverse species. The evolution of these operon-encoded processes is affected by diverse mechanisms such as gene duplication, loss, rearrangement, and horizontal transfer. These mechanisms can result in functional diversification, increasing the potential evolution of novel biological pathways, and enabling pre-existing pathways to adapt to the requirements of particular environments. Despite the fundamental importance that these mechanisms play in bacterial environmental adaptation, a systematic approach for studying the evolution of operon organization is lacking. Herein, we present a novel method to study the evolution of operons based on phylogenetic clustering of operon-encoded protein families and genomic-proximity network visualizations of operon architectures. We applied this approach to study the evolution of the synthase dependent exopolysaccharide (EPS) biosynthetic systems: cellulose, acetylated cellulose, poly-β-1,6-N-acetyl-D-glucosamine (PNAG), Pel, and alginate. These polymers have important roles in biofilm formation, antibiotic tolerance, and as virulence factors in opportunistic pathogens. Our approach revealed the complex evolutionary landscape of EPS machineries, and enabled operons to be classified into evolutionarily distinct lineages. Cellulose operons show phyla-specific operon lineages resulting from gene loss, rearrangement, and the acquisition of accessory loci, and the occurrence of whole-operon duplications arising through horizonal gene transfer. Our evolution-based classification also distinguishes between PNAG production from Gram-negative and Gram-positive bacteria on the basis of structural and functional evolution of the acetylation modification domains shared by PgaB and IcaB loci, respectively. We also predict several pel-like operon lineages in Gram-positive bacteria and demonstrate in our companion paper (Whitfield et al PLoS Pathogens, in press) that Bacillus cereus produces a Pel-dependent biofilm that is regulated by cyclic-3’,5’-dimeric guanosine monophosphate (c-di-GMP).

pre-existing pathways to the requirements of particular environments. Despite the fundamental 25 importance that these mechanisms play in bacterial environmental adaptation, a systematic approach 26 for studying the evolution of operon organization is lacking. Herein, we present a novel method to 27 study the evolution of operons based on phylogenetic clustering of operon-encoded protein families 28 and genomic-proximity network visualizations of operon architectures. We applied this approach to 29 study the evolution of the synthase dependent exopolysaccharide (EPS) biosynthetic systems: cellulose, 30 acetylated-cellulose, poly-β-1,6-N-acetyl-D-glucosamine (PNAG), Pel, and alginate. These polymers 31 have important roles in biofilm formation, antibiotic tolerance, and as virulence factors in opportunistic 32 pathogens. Our approach reveals the complex evolutionary landscape of EPS machineries, and enabled 33 operons to be classified into evolutionarily distinct lineages. Cellulose operons show phyla-specific 34 operon lineages resulting from gene loss, rearrangement, and the acquisition of accessory loci, and the 35 occurrence of whole-operon duplications arising through horizonal gene transfer. Our evolutionary-36 based classification also distinguishes between the evolution of PNAG production between Gram-37 negative and Gram-positive bacteria on the basis of structural and functional evolution of the 38 acetylation modification domains shared by PgaB and IcaB loci, respectively. We also predict several 39 pel-like operon lineages in Gram-positive bacteria, and demonstrate in our companion paper 40 (BIORXIV/2019/768473) that Bacillus cereus produces a Pel-dependent biofilm that is regulated by 41 cyclic-3',5'-dimeric guanosine monophosphate (c-di-GMP). 42

AUTHOR SUMMARY 44
In bacterial genomes biological processes are frequently encoded as operons of co-transcribed 45 neighbouring genes belonging to diverse protein families. Therefore, studying the evolution of bacterial 46 operons provides valuable insight into understanding the biological roles of genes involved in 47 environmental adaptation. However, no systematic approach has yet been devised to examine both the 48

INTRODUCTION 61
The generation of novel genomes through next generation sequencing is creating a wealth of 62 opportunities for understanding the evolution of biological systems. A key challenge is the development 63 of robust and systematic approaches that allow genes not only to be classified into functional 64 categories, but also infer evolutionary relationships. In bacterial genomes, functionally-related genes 65 corresponding to metabolic pathways or protein complexes are often encoded by neighbouring co-66 transcribed loci, termed an operon. Computational prediction of operons based on the conservation of 67 short inter-genetic distances has frequently been used to assign functions to uncharacterized genes from 68 the known biological roles of their co-conserved neighbours(1-3). Analyzing patterns of sequence 69 divergence within each gene can subsequently yield insights into species-specific functionalities. 70 However, genes in an operon do not function isolation but tend to form parts of higher-order, biological 71 modules (e.g. protein complexes or metabolic pathways). Consequently, analysing evolutionary events 72 in an operonic context provides additional opportunities to better infer functional relationships. For 73 example, while sequence divergence has the potential to impact the function of a single gene, 74 evolutionary events that alter operon structure (e.g. rearrangements, duplications, gains and losses) 75 have the potential to dramatically alter the overall function of the operon(4,5). 76 77 Due to the lack of a systematic framework, very few studies have attempted to examine the role of 78 evolutionary events on operon structure(6,7). Phylogenetic-tree based classification of 197 ATP binding 79 motif sequences associated with operon-encoded bacterial ATP-binding cassette (ABC) transporters 80 was successful in resolving two evolutionarily distinct transporter clades associated with import and 81 export functions(8). Gene duplications have been shown to play an important role in driving protein 82 superfamily expansion among bacterial genomes and its frequency is significantly correlated with 83 genome size(9). The study of co-localized "gene blocks" across bacteria has also shown that gene 84 duplication, loss, and rearrangement play important roles in shaping the large-scale organization of 85 bacterial genomes(7). Key to these analyses is the use of a rigorous and systematic approach for 86 assigning genes into evolutionarily related 'families' that are likely to share similar functions. However, 87 the inference of biological function based on sequence similarities of genes or proteins are often 88 complicated by functional divergence arising through recent gene duplication events. A variety of 89 metrics have been employed for determining the relatedness of genes and their protein products from 90 which groups (i.e. clustering) can be defined. These metrics include: evolutionary distances derived 91 through the construction of phylogenetic trees(10-13); global protein sequence similarities(14-16); and 92 shared sequence features such as conserved amino acids at specific sites or shared amino acid 93 subsequences(17,18). The aim of these approaches is to automatically resolve large protein families 94 comprising potentially thousands of genes into a smaller number of clusters defining evolutionarily 95 related subfamilies with similar biological roles. Here we build on these methods and present a novel 96 framework for the systematic classification and analysis of genes in the context of operons. Focusing 97 on synthase-dependent exopolysaccharide (EPS) biosynthetic machineries, we use our framework to 98 explore how gene divergence in combination with duplication, loss, and rearrangement events have 99 shaped the evolution of EPS operons, and may have influenced the biofilm producing capabilities of 100 evolutionarily diverse bacteria. 101 102 EPS are an important constituent of bacterial biofilms that not only ensure survival in response to 103 limited nutrient availability, but are also involved in antibiotic tolerance, immune evasion and serve as 104 virulence factors in many clinically relevant pathogens (19-21). Distinct mechanisms have been 105 identified in the production of bacterial EPS, including the well-characterized Wzx/Wzy and ABC 106 transporter-dependent pathways involved in capsular polysaccharide and lipopolysaccharide secretion 107 in Gram-negatives (22), and the more recently identified synthase-dependent systems(23). 108 109 Typically, synthase-dependent EPS systems are organized as discrete operons comprised of genes 110 encoding: 1) an inner membrane associated polysaccharide synthase and copolymerase subunit; 2) a 111 regulatory domain responsible for binding the intracellular signaling molecule cyclic-3',5'-dimeric 112 guanosine monophosphate (c-di-GMP); 3) periplasmic polysaccharide modification enzymes; and 4) a 113 periplasmic tetratricopeptide repeat (TPR) domain coupled with an outer membrane pore(23). This 114 operonic organization allows bacteria to acquire complete EPS functionality through discrete lateral 115 gene transfer events and may act as a key driver in niche adaptation(24). To date five synthase-116 dependent EPS have been identified: cellulose, acetylated-cellulose, poly-β-1,6-N-acetyl-D-117 glucosamine (PNAG), alginate and the Pel polysaccharide. While much interest has focused on the 118 molecular basis of biofilm formation, much less is known about how these systems have propagated 119 across bacterial taxa. Further, it is not known how EPS operons evolve to help bacteria adapt to diverse 120 168

Evolution of EPS Operons is Driven by Gene Duplication, Loss and Rearrangements 169
The processes underlying EPS operon evolution across diverse bacterial phyla is poorly understood.. 170 We examined how operon organization is influenced by the following evolutionary events that are 171 likely to affect EPS production capabilities among bacteria: 1) single locus or whole operon 172 duplications, corresponding to dosage effects altering the level of EPS modification or export; 2) locus 173 losses, that may indicate a reduction or loss in EPS production or modification, or may suggest 174 supplementation of the lost function with a novel gene; 3) operon rearrangements which may affect the 175 regulation of EPS production through the order of expression of individual EPS system components; 176 and, 4) gene-fusions, resulting in enhanced co-expression of interacting subunits. 177 178 For each set of predicted EPS operons, the resulting number of operon evolutionary events were 179 assessed relative to the locus composition and ordering of reference Gram-negative experimentally 180 characterized operons defined from previously published studies(19,28-32) . With the exception of 181 acetylated-cellulose, locus losses were found to be the most frequent event (~46% of predicted operons 182 lacked one or more reference loci), and occurred with the greatest frequency for Pel which exhibited an 183 average loss of 2.6 loci lost per operon (Supplemental Table 4). Among all EPS systems the majority 184 of locus losses were associated with the outer-membrane pore encoding loci (441 / 993 -44% of all 185 locus loss events identified) among Gram-positive species (Supplemental Table 4), consistent with the 186 lack of an outer-membrane bilayer in Gram-positive membrane architectures. Operon rearrangements 187 were the next most frequent evolutionary events (~ 39%), largely associated with cellulose operons(25) 188 (327 / 407 -80%). Focusing on duplication events, within-operon loci duplications tended to be more 189 common than whole-operon duplications (2 or more core EPS loci identified >= 1 Kbp apart), with the 190 exception of cellulose operons (29 whole operon duplications v 24 loci duplications). All duplicated 191 operons were found to be separated by at least 10 kbp, suggesting they may have been acquired through 192 HGT rather than tandem duplication of a pre-existing operon(32,33). 193 194

Systematic Phylogenetic Distance-Based Clustering of EPS Operon Loci and Genomic-Proximity 195
Networks Identifies Evolutionarily Distinct Operon Clades 196 To better understand how these evolutionary events may have altered operon function, we next devised 197 an agnostic, systematic classification strategy to cluster each family of EPS operon loci on the basis of 198 phylogenetic distance (Figure 2A; see Methods). In brief, for each EPS operon locus, multiple 199 sequence alignments were generated and used to construct phylogenetic trees. From these trees, we 200 defined sets of clusters through an iterative scan of the tree structure that captures an increasing 201 sequence distance between family members, starting at the leaves and ending at the root. During this 202 scan, sequences are grouped into a cluster if they share a common node (i.e. are within a specified 203 evolutionary distance). To define the optimal set of clusters for each locus, we then applied three 204 cluster quality scoring schemes (Q1, Q2 and Q3) based on the following metrics: proportion of 205 sequences clustered (to maximize the number of sequences clustered); average silhouette score (to 206 minimize the occurrence of clusters containing highly divergent sequences); and the Dunn index (to 207 maximize the separation of closely related sequences from divergent sequences). For each scoring 208 scheme, we defined the optimal pattern of clustering based on the evolutionary distance (expected 209 number of substitutions per site) derived from a maximum-likelihood constructed phylogenetic tree 210 (see Methods for more details) that maximizes the quality score. Comparisons across scoring schemes 211 (see below) for cellulose operon loci identified Q2 as providing the most informative sets of clusters. 212 Applying this scoring scheme to all EPS loci revealed the average number of sequence clusters 213 generated correlated with the total number of operons predicted for each type of EPS system ( Figure  214 2B), which further corresponded to the underlying differences in species distributions of EPS systems 215 ( Figure 1C). For example, the cellulose system was predicted to have the largest average number of 216 sequence clusters overall (30 clusters) and also had the greatest species diversity (shannon index 2.16 -217 Supplemental Figure 2) compared to all other systems. Furthermore, for each EPS system the 218 variability of the number of sequence clusters predicted per locus ( Figure 2C) suggests differing 219 degrees of locus evolution that are likely to be the result of different structural and functional 220 constraints. For example, a higher degree of conservation would be expected for glycosyl transferase 221 (GT) subunits to maintain efficient co-ordination between polymerization and inner-membrane 222 transport of EPS, while increased variability of periplasmic modification enzymes suggests that only a 223 subset of highly conserved motifs are required to carry out polysaccharide modification reactions. 224

225
To compare patterns of clusters identified by each scoring scheme, we applied the three scoring 226 schemes to each set of genomically-neighbouring protein sequences assigned by HMM searches to a 227 given cellulose locus (bcsA, bcsB, bcsZ, and bcsC). From the resulting clusters we generated operon 228 genomic-proximity networks ( Figure 2D). These networks provide a visual display of the conservation 229 of individual loci, together with their respective genomic proximity to yield patterns of sequence 230 divergence associated with the emergence of distinct forms of operon organization. In the absence of 231 any clustering (Q0), the network trivially resolves into individual operons featuring up to four loci. 232 Applying the Q1 scoring scheme to each locus, the network reveals a variable number of clusters 233 across operon loci, with each cluster generally comprising sequences belonging to the same bacterial 234 genus. Application of the Q2 scoring scheme results in the generation of clusters of increased size, 235 encompassing species featuring distinct operon organizations and compositions. For example, two 236 distinct lineages of alpha-proteobacterial cellulose operons can be easily distinguished, one of which is 237 more closely related in sequence and composition to gamma-proteobacterial operons, and a second 238 which lacks two loci and appears evolutionarily divergent from gamma-proteobacterial operons(25). 239 However, these distinctions were more difficult to resolve using the Q3 scoring scheme due to 240 clustering of highly divergent sequences. Given the trade-off between clustering highly divergent 241 sequences (Q3) with the depiction of individual operons (Q1), we applied the Q2 scoring scheme to 242 generate clusters for all EPS loci (Supplemental Table 5). 243 244 Using this locus-specific phylogenetic clustering approach, we were able to devise a classification 245 scheme to define EPS locus clades based on the average evolutionary distance of a group of clustered 246 locus sequences to a reference operon sequence (Supplemental Table 3). For example, the cellulose 247 polysaccharide synthase locus, bcsA, from Escherichia coli is assigned to clade 1, while divergent 248 alpha-proteobacterial species including Rhodobacter sphaeroides are assigned to clade 2. We further 249 resolved operons into distinct groups based on the genomic co-occurrence patterns of locus clades; e.g. 250 for the cellulose operon (bcsABZC) we identify clade combinations of 1:1:1:1, 1:2:2:2 and 1:3:5:3, 251 which correspond to operons identified in Escherichia spp. and other closely related enterobacteria, 252 Klebsiella spp., and Burkholderia spp., respectively.  clades, which have previously been inferred as originating by HGT(19) and is further supported by our 282 phylogenetic clustering assignments ( Figure 4C). Furthermore, we identified two additional divergent 283 BcsB sequences associated with a novel organization of operon clade B1 and include several loci with 284 other roles in cellulose production (designated operon clade B2; Figure 4D). The divergence of BcsB 285 sequences associated with clade B2 were also found to distinguish bacterial genomes possessing 286 multiple cellulose operons of distinct evolutionary lineages: Proteus mirabilis (2 cellulose operons: 287 Clades A1 and B2) and Enterobacter spp. (3 cellulose operons: Clades A1, B1 and B3) ( Figure 4E). 288 Additional sequence database searches revealed that the non-core loci associated with operon clades B2 289 and B3 share functionally homologous loci to the cellulose accessory protein D (AxcesD), which has 290 been characterized as increasing the efficiency of cellulose production in the Acetobacter xylinus

Bacterium, Bacillus cereus that is Regulated by c-di-GMP 299
Examination of the genomic-proximity networks of pel loci also reveal novel operon organizations 300 across phylogenetically divergent bacteria ( Figure 5). As with cellulose loci bcsA and bcsZ, we identify 301 examples of operon rearrangements involving pelB (outer membrane transport pore + TPR domain) 302 loci and pelA (periplasmic modification hydrolase) (Figure 5(ii), (iiib), (iv)), across several species 303 associated with diverse environments. Again consistent with our findings for cellulose, we noted loci 304 losses and acquisitions. Although it has not been demonstrated that the pel operon forms a trans-305 envelope biosynthetic complex, the ordering of operon loci has been shown to play an important role in 306 the assembly of macromolecular complexes(36) and optimizing biosynthetic pathways(37), suggesting 307 that there exists a functional coupling between pel outer-membrane transport and periplasmic 308 modification(38). However, the effects of these rearrangement events on Pel production still remain to 309 be experimentally investigated. 310

311
We also observed a high degree of overall conservation among components which are known to play 312 key roles in Pel biogenesis, such as the putative polysaccharide synthase (PelF), putative inner-313 membrane protein (PelG), hydrolase/deacetylase (PelA) and cyclic-di-GMP receptor (PelD)(21). In 314 contrast, a greater degree of divergence can be seen among inner (PelE) and outer-membrane (PelB, 315 PelC) transport associated loci, which appear to follow a consistent pattern of clustering across 316 bacterial phyla suggesting co-evolution of potentially physically interacting components, however no 317 evidence of interaction has been shown to date.  197N (Supplemental Figure 6). Contrary to cellulose 388 phylogenetic clusters, the polysaccharide synthase, wssB, was divided into distinct Gamma-and Beta-389 proteobacterial clusters. We also found a distinct phylogenetic cluster identifying a unique tandem 390 duplication of wssC in Bordetella avium 197N, which was not observed among orthologous cellulose 391 bcsB copolymerase loci (Supplemental Figure 6 (ii)). This observation might suggest a divergent 392 mechanism of action of cellulose inner-membrane transport. As we previously observed (Figure 1), 3 393 out of 4 of the predicted acetylated-cellulose operons were also found to co-occur with alginate 394 operons. Additional HMM-searches identified significant sequence similarity between acetylated-395 cellulose wssBCDE operon sequences to those previously identified as bcsABZC, as well as between 396 acetylated-cellulose acetylation-machinery and their functional homologs in alginate operons (WssH -397 AlgI; WssI -AlgJ/AlgX). Taken together, these findings suggest that acetylated cellulose production 398 has likely evolved through the duplication and operonic acquisition of the alginate acetylation 399 machinery loci. 400 401

Conservation of Cellulose Biosynthesis Machinery 403
With the availability of a crystal structure for the BcsA-BcsB inner membrane complex responsible for 404 cellulose biosynthesis and transport(46), we examined the potential structural and functional 405 consequences of the sequence variability of the BcsA and BcsB phylogenetic clusters highlighted 406 above (Figure 3). In brief, we generated multiple sequence alignments of eight BcsA and BcsB 407 sequences summarizing the evolutionary diversity of cellulose operon clades identified in Figure 3. Staphylococci PNAG production does not depend on c-di-GMP and is likely regulated by an alternate 446 signaling pathway(57). Deacetylation of PNAG is carried out by pgaB and icaB loci and has been 447 shown to play a crucial role in biofilm formation and immune evasion(52,58). pgaB also possesses an 448 additional C-terminal glycoside hydrolase domain which cleaves the PNAG polymer following its 449 partial deacetylation(59), although the mechanism of how these activities are coordinated and the 450 biological role of the hydrolase activity is unknown. Unique to pga operons is a loci encoding an outer 451 membrane export pore, pgaA(60), and in ica operons an additional integral membrane protein, icaC, 452 which has been proposed to be involved in PNAG O-succinylation(53). Using Gram-negative pga loci 453 as seed sequences for the reconstruction of synthase-dependent PNAG operons, we were also able to 454 identify Gram-positive ica operons based on significant sequence similarities to pgaB and pgaC loci. 455 Our phylogenetic clustering approach also revealed that pgaC/icaA sequences clustered into a single 456 clade, while pgaB/icaB were associated with distinct sequence clades (Supplemental Figure 4) Figure 4 (v)). Although PNAG production in these species has not been 469 experimentally confirmed, these findings suggest if the polymer is produced it is under a novel mode of 470 regulation by c-di-GMP, that glycoside hydrolase activity might not be essential for PNAG export 471 across all Gram-negative species, and that other modes of export may exist. The fusion of would also 472 suggest that the de-acetylase activity of PgaB in these organisms may be associated with the 473 periplasmic face of the inner membrane, in contrast to dual domain PgaB clades where the protein is 474 predicted to function at the periplasmic face of the outer membrane(60). 475

476
In addition to these novel domain fusion events, PgaB phylogenetic clustering enabled us to resolve 477 distinct events affecting the evolution of the deacetylase domain across different operon clades. Using 478 the E. coli K12 MG1655 sequence of the largest PgaB clade (PgaB_G1) as a reference, multiple 479 sequence alignments against other representative PgaB clade sequences identified several regions of 480 insertion/deletion events (Supplemental Figure 8A). When these regions were mapped to the 481 published crystal structure of PgaB (PDB ID: 4F9D(61)), they were found to correspond to distinct 482 structural elements surrounding the conserved deacetylase core (Supplemental Figure 8B-C). We 483 assigned insertion/deletion regions a number according to their order of appearance in the multiple 484 sequence alignment of PgaB deacetylase domains, and divided them into two categories 485 (Supplemental Figure 8D) OrthoMCL(13) and EggNOG(14)), or through the generation of hierarchical evolutionary relationships 538 and construction of phylogenetic trees (e.g. TreeFAM(66) and TreeCL(67)). However, these methods 539 are limited in their ability to provide further resolution of sequence diversity within a family that might 540 otherwise offer additional insights into evolutionary events that allow taxa to adapt to specific 541 environments. 542 543 Agnostic approaches to define sub-clusters of evolutionarily related protein families have ranged from 544 phylogenetic tree reconstructions (68) to hierarchical clustering of pairwise global sequence 545 alignments(69). Here we present an extension of previous efforts, and introduce a novel systematic 546 approach for defining protein sub-family relationships through the clustering of phylogenetic trees. Key 547 to this approach is defining a scoring function that allows a phylogenetic tree to be resolved into 548 optimal clusters that best capture the similarities between cluster members, as well as the dissimilarities 549 between clusters. Combining two clustering quality metrics (Silhouette and Dunn index) and 550 proportion of sequences clustered, we demonstrate that our approach is able to classify a diverse array 551 of operon-associated protein families into taxonomically consistent and functionally informative sub-552 clusters. Genomic-proximity networks were also constructed to provide an intuitive means of utilizing 553 phylogenetic clusters to examine diverse mechanisms of operon evolution across taxonomically diverse 554 bacterial genomes. Genomic-proximity networks have previously been utilized for inferring functional 555 relationships(70), understanding mechanisms underlying bacterial genomic organization into 556 functionally related gene clusters(71), and transcriptional regulation of bacterial operons(72). In this 557 study we extend the application of genomic-proximity networks as a tool for the systematic exploration 558 of operon evolution resulting from locus divergence, loss, duplication, and rearrangement events. 559

560
To demonstrate the effectiveness of our approach, we applied our methods to classify the stynthase-561 dependent bacterial EPS operon machineries for 5 different polymers: cellulose, acetylated-cellulose, 562 alginate, Pel and PNAG. There has been only one previous attempt to classify synthase dependent EPS 563 operons and this focused specifically on the cellulose system(25). In that study, cellulose operons were 564 categorized into four major types, based on the presence or absence of experimentally validated 565 accessory loci involved in cellulose production. Here, we based our analysis on the four core operon 566 loci, bcsABZC, deemed essential for cellulose production. Cellulose operon clades identified in this 567 study showed little consistency with the previously defined four major cellulose operon types(25), 568 suggesting that the conservation of accessory loci is more variable across bacterial species compared to 569 loci encoding core EPS functionalities. However, one operon type was identified in this analysis, 570 representing the loss of the BcsC outer membrane transporter identified among a subset of alpha-571 proteobacterial genomes, which include several known cellulose producing species(47,73) suggesting a 572 novel mechanism of cellulose export (Figure 3(iii))(25). We also found that the loss of BcsC has 573 resulted in an increased divergence of BcsB loci in these genomes, which highlights the key role of 574 BcsB as an intermediary between cellulose biogenesis and periplasmic transport (Figure 6). 575 576 In general, inner membrane components involved in EPS polymerization were found to be relatively 577 conserved across all systems examined, while periplasmic and outer-membrane components showed a 578 relatively increased degree of evolution, which are likely to have important functional implications. For 579 example, in the cellulose and Pel operon networks (Figures 3 and 5, and Supplemental Table 4), 580 rearrangement events involving the periplasmic glycosyl hydrolase (BcsZ) and glycosyl 581 hydrolase/deacetylase (PelA) were found to be a defining feature of several operon clades. It is 582 interesting to note that these rearrangements have resulted in a change in the ordering of bcsZ and pelA 583 relative to their respective outer-membrane transport pore loci, which highlights the important role of 584 polysaccharide modification in both the biogenesis and regulating extracelluar EPS transport(20,38,74). 585 Similarly, the rearrangement of alginate modification machinery loci (algIJF) was observed as a 586 distinguishing feature of Pseudomonas spp. operon clades. These findings suggest that rearrangement 587 and locus ordering may serve as an important means of regulating EPS production by modifying the 588 timing of translation of modification enzymes, which could affect the assembly of EPS complexes or 589 the structural properties of EPS produced (37,49,75). 590 591 Furthermore, identifying operon clades through a phylogenetic approach elucidated numerous instances 592 of cellulose whole operon duplications arising from HGT of two evolutionary distinct operon clades 593 (Figure 4). Such large-scale duplications, if they are functional, may either serve as a dosage response 594 to given environmental stressors, as observed in the duplication of bacterial multiple-drug transporter 595 operons(76), or could be under the regulation of different environmental stimuli. Interestingly, 596 representative species of the two cellulose operon lineages identified in HGT events, e.g. the plant and 597 human pathogens, D. dadantii and S. enterica, respectively, are known to produce structurally distinct 598 forms of cellulose with different properties and roles in pathogenesis(77,78). Furthermore, we 599 identified that BcsB divergence was also seen to accompany the rearrangement or horizontal transfer of 600 these operons, which further suggests that it may play a key role in the fine-tuning of cellulose 601 production by coordinating the export of growing cellulose polymers through the periplasm.  Table 3). Fully sequenced genomes and associated protein sequences were obtained for 630 1861 bacteria from the NCBI (Retrieved April 20th 2015) (Supplemental Table 6). For each bacterial 631 strain predicted to possess an EPS operon, metadata corresponding to niche (host-associated or 632 environmental) and lifestyle (pathogenic or non-pathogenic) were collated from literature searches 633 (Supplemental Table 7).  Table 4) and were classified into the following evolutionary events; 1) locus 662 losses -the total number of reference loci missing or not detected by HMM searches; 2) locus 663 duplications -number of distinct loci appearing as multiple significant hits to the same HMM model < 664 10kBP apart; 3) locus fusions -the number of loci that were significant hits to two or more reference 665 For each sequence, i, its silhouette score, s(i), is defined as: 688 Where a(i)=average evolutionary distance (expected number of substitutions per site) i) is the lowest 689 average evolutionary distance to any other cluster of which i is not a member; and (3) Dunn index 690 (DI)(86), for a set of m clusters, its Dunn index, DI, is defined as: 691 Where DI is the evolutionary distance between clusters i and j and Δ c is the size of cluster c. Note that 694 a higher s(i) indicates that a sequence is well matched to other members of its cluster and not well 695 matched to neighbouring clusters. Furthermore, a higher DI indicates clusters that are compact (smaller 696 cluster sizes) and well differentiated (larger inter-cluster distances). Thus, the evolutionary distance 697 cutoff which maximizes p + s_avg + DI is chosen as the optimal phylogenetic clustering for a given set 698 of EPS locus sequences. 699 700

Construction of EPS Operon Genomic-Proximity Networks 701
To visualize evolutionary and genomic organization relationships of predicted EPS operons, genomic 702 proximity networks were generated in which each node represents an individual EPS locus cluster (as 703 defined above), and an edge connecting a pair of nodes represents the average genomic distance (base 704 pairs) between loci represented by each node found in the same genome. Further, nodes are represented 705 as pie-charts indicating phylogenetic distribution of each EPS locus, as defined by NCBI taxonomic 706 classification scheme. Networks were visualized using Cytoscape (version 3.5)(87). 707 708 This work was also supported in part by grants from the Canadian Institutes of Health Here evolutionary distance is defined as the number of expected amino-acid substitutions normalized 961 over the multiple sequence alignment length. To identify optimal patterns of clusters, we examined 962 three scoring schemes (Q1, Q2 and Q3). Q1 is defined as the sum of the average silhouette score for all 963 clusters: μ(s(i)), and the Dunn index (DI). Q2 is defined as the sum of the proportion of sequences 964 identified in clusters (Σc/Σm), μ(s(i)) and DI. Q3 is defined as the product of Σc/Σm and the sum of bcsA among betaproteobacteria -Here, bcsA appears in closer proximity to bcsC than to bcsB or bcsZ 984 (as indicated by a cyan coloured edge for the former and a grey coloured edge for the latter). bcsB is closer to bcsC than bcsZ, the cyan edges suggest that bcsZ is present, but appears after bcsC 989 (example 1), while in other operons, bcsZ appears missing (example 2). Detailed inspection of example 990