
Bioinformatics characterization of BcsA-like orphan proteins suggest they form a novel family of pseudomonad cyclic-β-glucan synthases


Bacteria produce a variety of polysaccharides with functional roles in cell surface coating, surface and host interactions, and biofilms. We have identified an ‘Orphan’ bacterial cellulose synthase catalytic subunit (BcsA)-like protein found in four model pseudomonads, P. aeruginosa PA01, P. fluorescens SBW25, P. putida KT2440 and P. syringae pv. tomato DC3000. Pairwise alignments indicated that the Orphan and BcsA proteins shared less than 41% sequence identity, suggesting they may not have the same structural folds or function. We identified 112 Orphans among soil and plant-associated pseudomonads as well as in phytopathogenic and human opportunistic pathogenic strains. The wide distribution of these highly conserved proteins suggests they form a novel family of synthases producing a different polysaccharide. In silico analysis, including sequence comparisons, secondary structure and topology predictions, and protein structural modelling, revealed a two-domain transmembrane ovoid-like structure for the Orphan protein with a periplasmic glycosyl hydrolase family GH17 domain linked via a transmembrane region to a cytoplasmic glycosyltransferase family GT2 domain. We suggest the GT2 domain synthesises β-(1,3)-glucan that is transferred to the GH17 domain where it is cleaved and cyclised to produce cyclic-β-(1,3)-glucan (CβG). Our structural models are consistent with enzymatic characterisation and recent molecular simulations of the PaPA01 and PpKT2440 GH17 domains. They also provide a functional explanation linking PaPAK and PaPA14 Orphan (also known as NdvB) transposon mutants with CβG production and biofilm-associated antibiotic resistance. Importantly, cyclic glucans are also involved in osmoregulation, plant infection and induced systemic suppression, and our findings suggest this novel family of CβG synthases may provide a similar range of adaptive responses for pseudomonads.


Advances in sequencing technology have added a new dimension to microbial molecular ecology allowing biochemical potential to be inferred from protein coding sequences [1, 2]. However, investigation of protein function remains difficult, and most new gene sequences are annotated based on homologies and should be viewed with scepticism [3]. Very few proteins have experimentally verified functions, misannotations are frequent, and many protein families remain annotated as hypothetical or uncharacterized [4–7]. Critically, homologous regions identified between pairs of proteins may not be responsible for conserved function [8] and the assumption that proteins with very similar sequences and structures should have similar functions may not always be reliable [9]. Domain recognition and modelling provide complementary means of recognising and comparing structural and functional (protein) folds and overall, functional predictions based on homologous sequences and structures have been improving [10–12]. However, ‘twilight’ proteins with less than 20–40% amino acid sequence homology remain difficult to annotate, as they are unlikely to share the same protein fold and therefore function [13–16]. Furthermore, functional homology may not be retained within diverse families, as subtle differences in sequences, folds, and active sites may result in altered activity [17].

We have encountered this problem in functional annotation through our attempts to understand the nature of a novel family of proteins initially annotated as bacterial cellulose synthase catalytic subunits (BcsA) or more generally as glycosyltransferases. Glycosyltransferase (GT) family enzymes [17, 18] catalyse the transfer of sugar groups from activated donor molecules to specific acceptor molecules (Leloir type GTs) and the Carbohydrate-Active enZYmes Database (CAZy) [19] classifies these into distinct sequence-based families where the same three-dimensional structural and functional fold [20] is expected to occur within each family. However, in the GT family, enzymes with different donor and/or acceptors (poly-specificity) are common, making functional predictions problematic with 115 GT families assigned to 144 Enzyme Commission (EC) activities (CAZy, August 2022). This is not surprising, as over 75% of the GT families belong to one of three superfamilies (GT-A, B & C) [21]. Poly-specificity is also seen within the GT2 family enzymes [17, 22], which have an α/β/α sandwich Rossmann-like fold [18] that binds nucleotide sugars and are members of the nucleoside-diphosphosugar transferase (GT-A) superfamily [21]. Hypervariable loops in the shared core component of the GT-A fold contribute to diversity within this superfamily in which inverting and retaining catalytic mechanisms have emerged multiple times [23]. Poly-specificity also confuses homolog / ortholog / paralog distinctions, and a lack of clear evolutionary relationships between proteins limits both homology and non-homology-based functional prediction methods [24].

GT2 synthases transfer sugars to a growing acceptor molecule in a processive manner to produce long-chain cellulose, chitin, curdlan, hyaluronan and mixed-linkage glucans [25–31], as well as a range of other linear, branched, and cyclic polysaccharides (CAZy lists 18 GT2 family EC activities and over 20 ‘characterized’ bacterial synthases producing different polymers; August 2022). High-resolution X-ray structures [32, 33] are available for the Rhodobacter sphaeroides (now Cereibacter sphaeroides) 2.4.1 bacterial cellulose synthase catalytic subunit A protein (RsBcsA), showing the GT2 domain which includes the conserved DD, DxD, ED, and Q(Q/R)xRW motifs in the active site cavity, the Rossmann-like fold, a transmembrane (TM) region that forms a channel through which the cellulose chain translocates to the periplasm, and the cyclic-di-GMP–binding PilZ regulatory domain [25, 30, 31, 34]. The RsBcsA structure has been used to model related GT2 synthases through template-based modelling (TBM) and predictive modelling that uses Artificial Intelligence (AI) / Deep Learning methods incorporating free modelling (FM) approaches [35–37]. However, despite the relative ease of modelling novel GT2 sequences today, the question of how diverse homologous proteins can become before a significant change in function occurs is still relevant, even when a conserved protein fold and conserved motifs and residues are retained.

We identified a bcsA-like gene in several model pseudomonads including the plant pathogen P. syringae pathovar tomato DC3000, and the plant and soil-associated P. fluorescens SBW25 and P. putida KT2440 [38–40]. Each of these strains requires BcsA and other cellulose synthase subunits encoded by the bcs operon to produce cellulose [41–46]. As the bcsA-like gene was not associated with these operons, we referred to it as an ‘Orphan’ and expected the protein to add to the pool of BcsA subunits to modify cellulose synthesis in different environmental conditions. However, as we began to characterise the Orphan protein in silico, we realised that the link to cellulose synthesis through the conserved GT2 domain and associated transmembrane helices was misleading, and the identification of a second but smaller GH17 glycosyl hydrolase family domain suggested that the Orphan was more likely to be a transmembrane cyclic β-(1,3)-glucan synthase.

In this work we describe our identification and characterisation of the Orphan proteins in model pseudomonads and show that they in fact belong to a highly conserved group found across the genus and in other related bacteria. We use secondary structure predictions to identify the functional domains, and protein structural modelling to provide a consensus structure. We produce a functional model, in agreement with earlier enzymatic investigations and molecular dynamics simulations [47–49], which suggests that the Orphan proteins form a novel family of cyclic β-(1,3)-glucan (CβG) synthases and explains the production of CβG by the opportunistic human pathogens P. aeruginosa PAK and PaPA14 that is associated with biofilm antibiotic resistance and tolerance [50, 51]. We suggest that the Orphan synthases may have a broader role in a range of environments, as other cyclic glucans are also involved in bacterial osmoregulation and adaptation, membrane structure, plant infection and induced systemic suppression [52–59].


Accessing genomes and protein sequences and other information

Pseudomonas fluorescens SBW25 [40], P. putida KT2440 [38], and P. syringae pv. tomato DC3000 [39] genomes were accessed through the Pseudomonas Community Annotation Project (PseudoCAP) [60] or downloaded as GBK files and viewed using Artemis [61]. Operon predictions [62, 63] were used to assess whether orphan and dapE genes were likely to be part of the same operon. Additional information about proteins was obtained from Universal Protein Knowledge Database (UniProtKB) [64] entries. Gene synteny was also checked using EnsemblBacteria [65] and NCBI Nucleotide [66]. Unpublished draft genomes generated in other projects (A. Koza, K. Kabir & A. Spiers) from pseudomonad strains isolated in earlier work, including the soil-associated DBG-1, DBG-3, DBG-6, DBG-15, DBG-16, and DBG-23 strains [67] and the mushroom pathogen NZ092 [68], were investigated using Artemis and protein sequences provided (see S1 File for protein sequences analysed in this work). Locus tags and UniProtKB accessions are provided for key proteins where appropriate.

Identification of other Orphans and other proteins

Orphan proteins were identified in PseudoCAP by searching complete or draft genomes using the ‘Pseudomonas Ortholog Group’ generated for PfSBW25 DapE (locus tag PFLU1259; UniProtKB C3K5D5) [69]. For each genome sampled, the gene immediately upstream of dapE was examined and recorded if the protein had a functional annotation similar to the PfSBW25, PpKT2440 or PsDC3000 Orphan proteins (i.e., orphan–dapE gene synteny was used for selection, but no alternative genetic arrangements involving Orphan homologs were seen in those genomes sampled from PseudoCAP). Further Orphan orthologs were identified by NCBI BLAST+ [70] searches of UniProtKB using PfSBW25, PpKT2440 and PsDC3000 Orphan proteins, and by sampling proteins listed by InterPro [71] as having the same Pfam PF00332–PF13641 domain architecture. Duplications were removed as well as metagenome or whole genome shotgun entries which were labelled as preliminary data in UniProtKB, and all searches were completed by the end of October 2021 (see S1 File for the final list of Orphan, BcsA and related proteins analysed in this work). A PseudoCAP BLASTP search was also used to identify Escherichia coli MdoB (UniProtKB P39401) homologs in PaPA01, PfSBW25, PpKT2440 and PsDC3000.

Pairwise and multiple sequence alignments, phylogenetic trees, and cladograms

Similarities between sequences were investigated using the Water pairwise sequence alignment tool [70] and presented as heatmaps using JMP statistical software (SAS Institute, UK). Clustal Omega and MView [70] were used to produce and view multiple sequence alignments coded according to conserved physicochemical amino acid classes [72]. Sequence conservation was assessed by Shannon entropy using the Protein Residue Conservation Prediction tool [73]. Phylogenetic trees were produced from Clustal Omega multiple sequence alignments using Simple Phylogeny [70] and the computationally fast unweighted pair group method with arithmetic mean (UPGMA). Trees were constructed with distance correction on, as recommended for divergent sequences; because UPGMA assumes a constant rate of change, the distance from the root to each tip is the same. We constructed UPGMA trees in an iterative approach which maximised the number of orthologs that could be processed and selected representative sequences for clades that could be removed without affecting the tree topology. Hierarchical Cluster Analysis (HCA) was used to cluster proteins using amino acid profiles obtained by Protein Stats [74] and JMP 12 using the Ward method with an equal weighting of variables.
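As an illustration of the Shannon entropy measure of sequence conservation mentioned above, the following is a minimal Python sketch that scores each column of a toy alignment. It is not the implementation of the Protein Residue Conservation Prediction tool, which may normalise or weight scores differently; the function names, the choice to ignore gap characters, and the four-sequence alignment are assumptions made for illustration only.

```python
import math
from collections import Counter


def column_entropy(column, ignore_gaps=True):
    """Shannon entropy (in bits) of one alignment column.

    0.0 means the column is fully conserved; larger values mean more variable.
    Assumption: gap characters ('-') are dropped before counting.
    """
    residues = [r for r in column if not (ignore_gaps and r == "-")]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(aligned_seqs):
    """Per-column entropies for a list of equal-length aligned sequences."""
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_entropy(col) for col in zip(*aligned_seqs)]


# Hypothetical toy alignment (not taken from the Orphan sequences).
toy_alignment = ["MKDA", "MKEA", "MRD-", "MKDG"]
scores = alignment_entropy(toy_alignment)
```

In this toy case the first column is invariant and scores zero, whereas the remaining columns each contain two residue types and score higher.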

Functional domains, predicted secondary structure and topology, and modelling

Functional domains were identified using HMMSCAN [75] and secondary structures and topologies predicted using HMMSCAN, LipoP [76], Phobius [77], PRED-TAT [78], Proteus2 [79], Protter [80], and SignalP [81]. Predicted structures (models) were produced using AlphaFold Colab Notebook [82, 83], IntFOLD6 [84], Phyre2 [85], RoseTTAFold implemented through Robetta [86], SWISS-MODEL [87, 88], and TrRosetta [89–91], following the suggested approaches and default options and using the full-length protein sequences (i.e., including putative signal sequences). All modelling was completed by the end of June 2022 and Protein Data Bank (PDB) files are available (see S2 File for a list of all models and DOI for downloads). Models were visualised using Mol* 3D Viewer [92] using representation and residue property settings, and membranes visualised using the ANVIL algorithm [93] implemented by Mol*. Models were visually compared by identifying secondary structure features characteristic of the reference structures and quantitatively using Pairwise Structure Alignment [94] with FATCAT-rigid body alignments suited to the identification of structural equivalences between closely related proteins with similar shapes [95, 96].

Results and discussion

Orphan genes are not duplications of the bcsA cellulose synthase subunit gene

Pseudomonas fluorescens SBW25 [40–42, 97], P. putida KT2440 [38, 43, 45, 97], and P. syringae pv. tomato DC3000 [39, 44, 46, 97] each contain a bacterial cellulose synthase (bcs) operon encoding BcsA and other subunits required for cellulose production [98]. These pseudomonads also contain a second bcsA-like ‘orphan’ gene located in a different region of the genome, immediately upstream of dapE [69] and not associated with other cellulose synthesis-related genes (Fig 1A).

Fig 1. Orphan proteins are not recent duplications of BcsA proteins.

Shown here is the bacterial cellulose synthase (bcs) operon found in Pseudomonas fluorescens SBW25 known as the Wrinkly Spreader Structural (wss) operon [41] (A). WssB/BcsA (dark blue) and WssC–E (light blue) are the core cellulose synthase subunits, WssF–I (green) are involved in the partial acetylation of the cellulose polymer [41, 42], and WssA and WssJ (rose) are likely to be involved in the positioning of the cellulose synthase and are required for cellulose production. A second BcsA homolog known as the Orphan is located upstream of dapE (yellow) in a different region of the chromosome. DapE and other genes (grey) indicated here are not involved in cellulose production. P. fluorescens SBW25 locus tag (PFLU) numbers are shown below the genes. Heatmaps of the amino acid sequence identity (left panel) and similarity (right panel) (B) determined from Water pairwise comparisons [70] of amino acid sequences of BcsA and Orphan proteins from P. fluorescens SBW25, P. putida KT2440 and P. syringae DC3000, as well as with the Escherichia coli MG1655 BcsA (EcBcsA) reference protein, suggest that the Orphan genes are not recent duplications of the bcsA genes in these model pseudomonads (see S1 Table for protein locus tags & UniProtKB accessions and S1 File for protein sequences). These comparisons also show that the PfSBW25, PpKT2440 and PsDC3000 Orphan proteins and EcBcsA share a central overlapping region (C) of ~240 residues in the multiple sequence alignment (this region is indicated in blue with the residue numbers provided for PfSBW25 Orphan and EcBcsA proteins).

The Orphans are currently annotated as β-(1,3)-glucosyl transferases or glycosyl transferase family proteins in PseudoCAP [60], and we confirmed by pairwise sequence alignment that each has limited amino acid sequence identity (24.7–26.5%) with the functionally-active BcsA protein (UniProtKB P37653) from Escherichia coli MG1655 (EcBcsA) which was chosen as a non-pseudomonad reference protein for this work [99] (see S1 Table for percentage identity & similarities). The Orphan genes are not recent duplications of the bcsA gene in each of the three pseudomonads, as pairwise alignments revealed limited DNA sequence identity (47.9–48.8%) and amino acid sequence identity (21.6–24.5%) between gene and protein pairs. Homologous proteins sharing more than 30% sequence identity are likely to share similar structure and function [16], but the poor level of amino acid identity seen between BcsA and Orphan proteins suggests that a BcsA or BcsA-like functional annotation for the Orphan proteins may not be justified.
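To make the percentage identity values concrete, the following minimal Python sketch computes identity from a gapped pairwise alignment of the kind a local alignment tool reports. It assumes that identity is the number of identical, non-gap columns divided by the total alignment length, which is one common convention; the function name and the toy sequences are hypothetical and are not taken from the Orphan or BcsA alignments.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of a gapped pairwise alignment.

    Assumption: the denominator is the full alignment length (gap columns
    included); other conventions divide by the shorter sequence length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Hypothetical toy alignment: 4 identical columns out of 7.
toy_identity = percent_identity("MKT-AYL", "MRTQAYV")
```

Because the choice of denominator changes the reported value, identities from different tools should only be compared when they follow the same convention.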

We attach no significance to the positioning of the Orphan gene immediately upstream of dapE, as DapE has never been associated with cellulose production in bacteria, and Operon predictions [62] suggest that in PfSBW25, PpKT2440 and PsDC3000 the orphan and dapE genes are unlikely to be in the same operon (estimated probability that the pair is in the same operon, pOp values of 0.017, 0.038 & 0.207, respectively). The beginning of the Orphan genes is marked by a region of low GC content, and in PsDC3000, the GC content of the orphan gene is clearly lower than that of dapE, indicating recent recombination or horizontal/lateral gene transfer (LGT) (see S1 Fig for GC content traces). Recombination is part of the dynamic nature of pseudomonad genomes including P. syringae [100], and gene amelioration is expected to reduce the difference in GC content of genes acquired by LGT over time [101]. This suggests that in PsDC3000 the acquisition or repositioning of the Orphan gene might be more recent than it was in PfSBW25 or PpKT2440.
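A GC content trace of the kind referred to above can be sketched as a sliding-window calculation along a nucleotide sequence. The Python below is an illustrative sketch only: the window and step sizes, the function names, and the artificial sequence are assumptions and do not reproduce the settings used to generate S1 Fig.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window=120, step=30):
    """Return (window_start, gc_fraction) pairs along the sequence.

    Assumption: only full-length windows are scored, so a sequence shorter
    than the window yields an empty trace.
    """
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Hypothetical sequence: an AT-only stretch followed by a GC-only stretch.
toy_sequence = "AT" * 50 + "GC" * 50
toy_trace = sliding_gc(toy_sequence, window=100, step=50)
```

In the toy trace the GC fraction steps from 0.0 through 0.5 to 1.0, which is the kind of shift between adjacent regions that a GC content trace makes visible.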

Orphan proteins are more like one another than they are to BcsA proteins

The PfSBW25 and PpKT2440 Orphans could be aligned with the PsDC3000 Orphan protein with 47.3–47.9% amino acid identity, with only five insertion-deletions (INDELs) of 3–9 amino acids required in the PsDC3000 Orphan to align with the PfSBW25 and PpKT2440 proteins. Heatmaps based on Water pairwise alignments demonstrate that the Orphan proteins are more like one another than they are to their cognate BcsA proteins or to the EcBcsA reference sequence (Fig 1B). Furthermore, the C-terminal region of these proteins overlaps with the N-terminal region of EcBcsA with a central core of ~240 residues with an overall 21% amino acid sequence identity (Fig 1C). Our analysis of conserved domains, motifs, and residues described in a later section shows that the central core includes most of the active site found in BcsA proteins [25], but the C-terminal cyclic-di-GMP-associated regulatory PilZ domain is missing in the Orphans.

The difference between Orphan and BcsA proteins was maintained when we expanded our comparison to include Orphans identified in draft genomes of pseudomonads we and others had isolated earlier [67, 68], as well as Orphan proteins from P. aeruginosa PAK, PaPA01 and PaPA14, which are referred to as NdvB in PseudoCAP and elsewhere (PaPA01 locus tag and UniProtKB: PA1163 / Q9I4H4). We had originally ignored P. aeruginosa strains because this species has not been reported to produce cellulose and we did not expect them to have bcsA-like duplications (the P. aeruginosa Orphans share 46.2–47.9% normalised amino acid identity with the PfSBW25, PpKT2440 and PsDC3000 Orphan proteins). A phylogenetic analysis of this larger set of proteins placed the Orphan and BcsA proteins in separate clades rooted by a common ancestral sequence in the UPGMA [70] tree (Fig 2A). This relationship was largely confirmed by HCA, which does not presume common ancestral sequences but uses the amino acid profiles to group proteins instead, although in this analysis the PsDC3000 Orphan was found to group with EcBcsA and PpBcsA (Fig 2B).
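To illustrate how UPGMA builds a tree in which every tip is equidistant from the root, the following is a minimal Python sketch of the clustering step applied to a small distance matrix. It is not the Simple Phylogeny implementation; the function name, the nested-tuple tree representation, and the four-taxon distances are hypothetical and chosen only to show the method.

```python
from itertools import combinations


def upgma(labels, distances):
    """Cluster labels by UPGMA.

    distances maps frozenset({a, b}) -> pairwise distance between labels.
    Returns (tree, height): tree is a nested tuple of labels and height is
    the root-to-tip distance, identical for every tip by construction.
    """
    # Active clusters: id -> (subtree, number of leaves, node height).
    clusters = {i: (label, 1, 0.0) for i, label in enumerate(labels)}
    d = {
        frozenset((i, j)): distances[frozenset((labels[i], labels[j]))]
        for i, j in combinations(range(len(labels)), 2)
    }
    next_id = len(labels)
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        tree_i, n_i, _ = clusters.pop(i)
        tree_j, n_j, _ = clusters.pop(j)
        height = d[frozenset((i, j))] / 2.0
        # Distance to the merged cluster is the size-weighted arithmetic mean.
        for k in clusters:
            d[frozenset((next_id, k))] = (
                n_i * d[frozenset((i, k))] + n_j * d[frozenset((j, k))]
            ) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j, height)
        next_id += 1
    tree, _, height = next(iter(clusters.values()))
    return tree, height


# Hypothetical distances between four taxa (not derived from real sequences).
toy_distances = {
    frozenset(("A", "B")): 2.0, frozenset(("A", "C")): 6.0,
    frozenset(("A", "D")): 10.0, frozenset(("B", "C")): 6.0,
    frozenset(("B", "D")): 10.0, frozenset(("C", "D")): 10.0,
}
toy_tree, toy_height = upgma(["A", "B", "C", "D"], toy_distances)
```

Because each merge places the new node at half the distance between the merged clusters, the method imposes a constant-rate assumption, which is why divergent or unevenly evolving sequences can be misplaced.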

Fig 2. Pseudomonas spp. Orphan and BcsA proteins originate from two different but related groups of proteins.

Shown here is a simple un-rooted UPGMA phylogenetic tree produced by Clustal Omega Simple Phylogeny [70] (A) of BcsA and Orphan proteins from eleven pseudomonads plus the Escherichia coli MG1655 BcsA reference protein (see S1 File for protein sequences). The tree is drawn with the real (relative) genetic distances and as a cladogram with a uniform distance between the root, indicated by the small grey circle, and the terminal nodes shown as large circles for the Orphans and squares for the BcsA proteins. The dashed lines separate the two main clades of the tree with the Orphan and BcsA proteins in different branches. The real and uniform distance (horizontal) scales are shown at the bottom-left of each cladogram. The Orphan and BcsA proteins can also be differentiated by hierarchical cluster analysis (HCA) based on amino acid profiles (B). The same symbols and colours are used to indicate the arbitrary root, which is located at the mid-point of the longest branch, and the Orphan and BcsA proteins. The dashed arc separates the branch containing most of the Orphans from the rest of the cladogram, which includes the BcsA proteins and the remaining Orphan. The x-y scale is indicated at the bottom-left of the cladogram.

Further analysis of Orphan homologs indicates that they are widely distributed amongst the pseudomonads with 84 Orphans identified in 41 pseudomonad species with a further 28 in strains not classified to species level, and 42 additional homologs from other genera (see S2 Table for a list of Pseudomonas spp. Orphans; note that our selection of Orphan proteins for this larger set of proteins was by sampling and was not exhaustive). We undertook a UPGMA phylogenetic analysis of these proteins using the fungal Schizosaccharomyces pombe 972 α-(1,3)-glucan synthase Ags1 (UniProtKB Q9USK8) and the bacterial Rhizobium meliloti 1021 cyclic β-(1,2)-glucan synthase NdvB (UniProtKB P20471) as outlier sequences. It should be noted that while the PaPA14 Orphan protein was given the same name as RmNdvB because of (limited) sequence homology [50], RmNdvB is a significantly larger protein with only 12.2–13.6% sequence identity with the PaPAK, PaPA01, PaPA14, PfSBW25, PpKT2440, and PsDC3000 Orphan proteins (RmNdvB was named because of its role in nodule development [101–103]).

The UPGMA tree we constructed consisted of seven clades, placing plant and fungal homologs in Clade 1, BcsA proteins in Clade 2, and the Orphan proteins in Clades 3–6, with SpAgs1 and RmNdvB in Clade 7 (Fig 3A; see S2 Fig for the full tree listing all proteins and Table 1A for clade characteristics). We note that in the case of the Clade 3–5 representatives, Arcobacter butzleri ED-1, Azoarcus strain DN11 and Methylomonas methanica MC09, the Orphan–dapE gene synteny seen in our earlier analysis of key pseudomonads was not retained, perhaps because of ancient genome rearrangements, further supporting the lack of functional linkage between DapE and the Orphan protein. In contrast, the gene synteny was retained by the Clade 6 representative, Pseudomonas viridiflava LMCA8, and by the other pseudomonad members of the clade, as this was part of the criteria used to select Orphans from PseudoCAP. However, Clade 6 also includes two small subclades which include four Orphans from other genera (Subclades 6.2 and 6.3), and in each case the gene synteny is not conserved. This suggests that the Orphan–dapE gene synteny is restricted to the pseudomonads and may reflect a relatively more recent genome rearrangement that brought the two genes together early in the development of the genus.

Fig 3. Orphan proteins are predominantly found within the Pseudomonas genus.

Shown here are two schematics of an un-rooted UPGMA phylogenetic tree produced by Clustal Omega Simple Phylogeny [70] showing the seven main clades (A) and five subclades within Clade 6 containing all Pseudomonas spp. Orphan proteins (B). These are simplifications of the original UPGMA phylogenetic tree of 190 Orphan protein homologs (see S2 Fig for the full tree; see S1 File for protein sequences) and are drawn with real (relative) genetic distances. Clades and Subclades: Clade 1, This clade includes the fungal Rhizomucor miehei CAU432 β-(1,3)-Glucanosyltransferase, RmBgt17A, and contains a total of 20 proteins from fungi, plants, and Gammaproteobacteria with glucosidase, glucanosyltransferase, glycogen synthase, and mannosyltransferase annotations. Clade 2, This clade includes the Escherichia coli MG1655 and Rhodobacter sphaeroides 2.4.1 BcsA reference proteins and contains a total of 14 BcsA cellulose synthase proteins from the Alphaproteobacteria and Gammaproteobacteria. Clade 3, This clade contains 11 proteins from the Alphaproteobacteria and Epsilonproteobacteria with glucosyl/glycosyl transferase and glucanase annotations. Clade 4, This clade contains 10 proteins from the Betaproteobacteria and Gammaproteobacteria with benzoate transporter, glucanase, glucosyl transferase, and glycosyl hydrolase family annotations. Clade 5, This clade contains 17 proteins from the Alphaproteobacteria, Deltaproteobacteria, and Gammaproteobacteria, with glucanase, glycosyltransferase, and cellulose synthase annotations. Clade 6, This clade contains 112 representative Pseudomonas spp. Orphan proteins and four non-pseudomonad homologs in the following five subclades. Subclade 6.1, This subclade contains six P. syringae strain Orphan proteins including PsDC3000, with glucosyl/glycosyl transferase annotations. Subclade 6.2, This subclade contains two non-Pseudomonas spp. Orphan proteins from the Betaproteobacteria and Gammaproteobacteria with gluco/glycosyl transferase annotations. Subclade 6.3, This subclade contains two non-Pseudomonas spp. Orphan proteins from the Deltaproteobacteria and Gammaproteobacteria with glucanase annotations. Subclade 6.4, This subclade contains a total of 94 Pseudomonas spp. Orphan proteins, including P. fluorescens SBW25 and P. putida KT2440 and excluding all P. aeruginosa and P. syringae Orphans, with glucanase, glucan biosynthesis protein, glucosyl/glycosyl transferase, Glyco trans 2-like domain-containing protein, and cellulose synthase annotations. Subclade 6.5, This subclade contains eight P. aeruginosa strain Orphan proteins including PaPA01 and one additional Pseudomonas spp. Orphan, with glucanase, glycosyl transferase, and synthases of periplasmic glucan annotations. The unmarked nodes in (B) represent single Pseudomonas spp. Orphan proteins that are not included in the subclades. Clade 7, This clade includes the Alphaproteobacteria bacterium Rhizobium meliloti 1021 NdvB and the fungal Schizosaccharomyces pombe 972 Ags1 proteins chosen as outliers for this tree.

All Pseudomonas spp. Orphan proteins were located in Clade 6 which we suggest is the true Orphan family (we recognise that this is oxymoronic) and these proteins are distributed across five subclades (Fig 3B and Table 1B). P. aeruginosa and P. syringae Orphan proteins formed separate subclades (Subclades 6.5 and 6.1, respectively), but most pseudomonad Orphan proteins were in a large and unstructured subclade (Subclade 6.4) and included proteins from P. fluorescens and P. putida, as well as some other plant pathogens. This suggests that these proteins may have a relatively recent ancestor or that environmental / host adaptation may not be particularly strong, except for P. aeruginosa and P. syringae where host adaptation might be selecting for divergent sequences. A multiple sequence alignment of P. aeruginosa, P. fluorescens, P. putida and P. syringae Orphan proteins grouped sequences by species, with the P. syringae proteins the most divergent and including a large 14 residue insertion not seen in the other proteins (see S3 File for an annotated multiple sequence alignment of 26 Pseudomonas spp. Orphan proteins showing domains, conserved motifs, and residues). Further analysis involving more Orphans may confirm the pseudomonad family as the dominant clade, but this may change as genome sequences become available for under-represented sister and more distant genera.

Our larger comparison of Orphan homologs confirms their distant relationship to BcsA proteins and other orthologs. The poor level of amino acid sequence conservation in the central core region of the Orphan proteins suggests that they are unlikely to function as additional or alternate BcsA subunits in the cellulose synthase holoenzyme and are more likely to be involved in the synthesis or modification of some other polysaccharide. This is supported by the identification of more general glycosyltransferase (GT) annotations provided for Orphan proteins in PseudoCAP, though this is not particularly informative as GT proteins are highly diverse with 114 families currently listed by the Carbohydrate Active Enzymes Database; for example, the GT2 family is also highly diverse and includes sixteen distinct enzymatic activities including cellulose synthase [25]. The sequence divergence within the Orphan clade also suggests that this family of proteins may provide an adaptive advantage in different environments with continued adaptation and altered enzymatic activities.

Conserved domains and structural comparison by homology modelling suggest a two-domain structure and function for the Orphan proteins

We used HMMSCAN to identify conserved domains in the PfSBW25 Orphan protein using profile Hidden Markov Models [75] (Fig 4). This suggested a two-domain model corresponding to the upstream and overlap regions seen in our earlier pairwise and multiple sequence alignments. Similar HMMSCAN results were obtained for P. aeruginosa PA01, P. putida KT2440, and P. syringae DC3000 Orphan proteins, and these, plus the high degree of amino acid sequence identity seen between all Orphan proteins examined (e.g., see S3 File for the multiple sequence alignment of 25 Pseudomonas spp. Orphan proteins), also provide evidence for an Orphan ‘protein’ family which complements our earlier presentation of the Orphan ‘phylogenetic’ family.

Fig 4. Functional domains identified in the Orphan protein sequence by HMMER.

Shown here are schematics of the fungal Rhizomucor miehei CAU432 β-(1,3)-Glucanosyltransferase, the Pseudomonas fluorescens SBW25 Orphan protein, and the Escherichia coli MG1655 cellulose synthase catalytic BcsA subunit, aligned to show the positioning of the (first) GH17 (trans)glycosidase and (second) GT2 nucleotide-diphospho-sugar transferase domains identified by HMMSCAN [75]. The PfSBW25 Orphan protein includes a peptide signal sequence (yellow) and a series of transmembrane helices (dark grey), but not the regulatory PilZ domain (Pfam 07238) present in BcsA. HMMSCAN also identified catalytic residues (red marker) but not in both homologous domains. The first domain of the PfSBW25, P. putida KT2440 and P. syringae DC3000 Orphan proteins shares significant homology with the (Trans)glycosidase superfamily / β-Glucanases family (Superfam 51445 and 51487; Conditional E-values of 1.3e-45–5.4e-57) and Glycosyl hydrolase family 17 (Pfam 00332; 2.0e-07–2.7e-8). The second domain shares significant homology with the NDP-sugar-transferase superfamily (Superfam 53448; 2.4e-50–1.2e-52) and Glycosyltransferase-like family 2 (Pfam PF13641; 1.1e-27–1.0e-34).

HMMSCAN, Proteus2 [79], and Protter [80] identified a signal peptide sequence (residues 1–25) in the PfSBW25 Orphan protein sequence, and the Protter protein topology prediction suggested a N-terminal periplasmic domain (residues 1–311) and a central cytoplasmic domain (residues 392–680) linked by two transmembrane (TM) regions (residues 312–391 with three TM helices, and residues 681–847 with five TM helices).

The first PfSBW25 Orphan domain (residues 26–311) included matches to transglycosidase / glycosyl hydrolase (GH) domains (Superfamily 51445, Family 51521 & 51487, and Protein family (Pfam) 00332), and we suggest that this domain is involved in the sequential hydrolysis and rearrangement of β-glucans producing elongated, branched, or cyclised structures, in agreement with earlier enzymatic characterisation studies and recent molecular simulations [48, 49]. To avoid confusion, we now refer to this region of the Orphan protein as the GH17 domain.

The second PfSBW25 Orphan domain (residues 312–847) included matches to nucleotide-diphospho-sugar transferase / glycosyltransferase domains (Superfamily 53448; Pfam 13641) and we therefore propose that it is a GT2 family glycosyltransferase involved in the processive addition of glycosyl subunits from UDP-hexose to an elongating glucan polymer. This part of the Orphan protein also included the two TM regions (residues 312–391 and 681–847) which are commonly found either side of the GT2 domain in BcsA-like synthases [25, 30, 31]. We therefore refer to these regions of the Orphan protein as the TM region / GT2 domain.

We used Phyre2 [85] and SWISS-MODEL [87, 88] to produce template-based models of the GH17 domain and the TM region and GT2 domain of the PfSBW25 Orphan protein (all models are available as PDB files, see S2 File). Both Phyre2 and SWISS-MODEL identified the fungal Rhizomucor miehei CAU432 β-(1,3)-glucanosyltransferase Bgt17A (RmBgt17A) structure [104] (UniProtKB: A0A0M3KKZ6; Protein Databank (PDB) 4WTP) as the best template for the GH17 domain (Phyre2, mean ProQ2 score of 0.23 for residues; SWISS–MODEL, QMEANDisCo score of 0.50 for the model). However, pairwise alignment of the PfSBW25 Orphan GH17 domain (residues 37–306) and RmBgt17A showed that they shared only 26.9% sequence identity and 53.1% similarity, respectively.

Template-based modelling also identified the Rhodobacter sphaeroides 2.4.1 BcsA (RsBcsA) structure [32, 33] (UniProtKB: A0A3G6W9S6) as the best template to model the TM region / GT2 domain (Phyre2, PDB 4HG6 template [32], mean ProQ2 score of 0.39 for residues; SWISS-MODEL, PDB 4P00 and 4P02 [33] templates, QMEANDisCo scores of 0.49 and 0.51 for each model). No corresponding structure was available for our reference protein, EcBcsA, but RsBcsA (residues 13–740) and the PfSBW25 Orphan TM region / GT2 domain (residues 314–860) shared 24.4% sequence identity and 38.4% similarity, respectively.
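For readers who wish to reproduce identity and similarity percentages of this kind, the following is a minimal sketch that works on an already-aligned pair of sequences. It is illustrative only: the function name, the normalisation by alignment length, and the residue similarity groups are our assumptions here (alignment tools normally score similarity with a substitution matrix such as BLOSUM62, and conventions for the denominator differ between tools).

```python
from typing import List, Tuple

# Hypothetical similarity groups used for illustration only; real tools
# usually count a pair as "similar" when its substitution-matrix score is positive.
SIMILAR_GROUPS: List[str] = ["ILVM", "FWY", "KRH", "DE", "NQ", "ST", "AG"]


def identity_and_similarity(aligned_a: str, aligned_b: str) -> Tuple[float, float]:
    """Return (% identity, % similarity) for two aligned sequences.

    Gaps are written as '-'. Percentages are normalised by the number of
    alignment columns that are not gaps in both sequences (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("alignment contains no residues")

    identical = 0
    similar = 0
    for a, b in columns:
        if a == "-" or b == "-":
            continue  # a gap opposite a residue counts as neither
        if a == b:
            identical += 1
            similar += 1  # identical residues are also counted as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1

    n = len(columns)
    return 100.0 * identical / n, 100.0 * similar / n


# Toy alignment (not taken from the Orphan or template sequences).
ident, simil = identity_and_similarity("ACDE-", "ACEEK")
print(f"identity {ident:.1f}%, similarity {simil:.1f}%")
```

Because the denominator and the definition of similarity vary between programs, values computed this way should only be compared with values obtained under the same conventions.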

We used Pairwise Structure Alignments [94] and the FATCAT-rigid algorithm [95, 96] to compare single-domain models and the template structures. Despite the low pairwise sequence identities between the PfSBW25 Orphan protein and the template sequences, the RMSD (0.55–2.15) and TM-scores (0.57–0.94) indicate that these models and structures were very similar (Table 2A). To put this into context, most single-domain proteins can be folded onto the best homologous template with an overall average RMSD of 2.3 and high-quality models can be produced with TM-scores above 0.8 when there is better than 40% amino acid sequence identity with the template [105] (notwithstanding the fact that model quality is dependent on the quality of the template). However, differences between models were evident at the local scale when comparing the positioning of the start and ends of secondary structure features along the primary amino acid sequence. Although each model used the same protein sequence and structural templates, there were only four starts and three ends conserved in eight reference α-helices compared in the first domain, and seven starts and five ends conserved in 14 TM helices, α-helices, and β-sheets compared in the second domain (see S3 Table for our comparison of secondary structures), perhaps reflecting the problems caused by adjusting to five INDELS needed to align with RmBgt17A and over 20 INDELS needed to align with RsBcsA.
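To make the two global measures used here concrete, the sketch below computes an RMSD and a TM-score from paired Cα coordinates that have already been aligned and superposed (the step performed by FATCAT in our comparisons). It is a simplified illustration rather than a re-implementation of any server: the function names are hypothetical, coordinates are assumed to be in Å, and the TM-score uses the standard length-dependent scale d0 = 1.24(L − 15)^(1/3) − 1.8 without the search over superpositions that dedicated programs perform.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def _sq_dist(p: Coord, q: Coord) -> float:
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root-mean-square deviation over paired, pre-superposed coordinates."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    return math.sqrt(sum(_sq_dist(p, q) for p, q in zip(a, b)) / len(a))


def tm_score(a: Sequence[Coord], b: Sequence[Coord], target_length: int) -> float:
    """TM-score for one fixed superposition, normalised by target_length.

    Only aligned residue pairs are passed in; unaligned target residues
    contribute zero, which is why the sum is divided by target_length.
    """
    if len(a) != len(b):
        raise ValueError("coordinate lists must be of equal length")
    if target_length < 19:
        # Below this length d0 is not positive; short chains are out of
        # scope for this sketch.
        raise ValueError("target_length must be at least 19")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    total = sum(1.0 / (1.0 + _sq_dist(p, q) / d0 ** 2) for p, q in zip(a, b))
    return total / target_length


# Toy example: 20 paired atoms, every pair displaced by 5 Å.
model = [(0.0, 0.0, 0.0)] * 20
template = [(3.0, 4.0, 0.0)] * 20
print(rmsd(model, template), tm_score(model, template, target_length=20))
```

Unlike RMSD, the TM-score down-weights large local deviations and is normalised by length, which is why the two measures can disagree for models that share a fold but differ in a flexible region.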

Table 2. Pairwise comparisons of protein structures and models.

The first Orphan domain adopts a TIM-barrel–like structure and active cleft seen in fungal β-glucosyltransferases.

The 278-residue RmBgt17A protein produces branched glucans from linear β-(1,3)-glucan during fungal cell-wall assembly and rearrangement and is a single domain (GH17) protein with a classical (α/β) TIM-barrel fold [104]. In RmBgt17A and related fungal glucosyltransferases, two sub-domains (SD1 & 2) and two catalytic glutamic acids in the conserved motifs (VGxEV and ExGWPx) have been identified previously, though RmBgt17A has a shorter catalytic cleft located on the rim of the TIM-barrel compared to other proteins [104]. RmBgt17A has β-(1,3)-glucanase and β-(1,3)-glucanosyltransferase activities producing elongated and branched β-glucans [106]. Residues 305–269 of the RmBgt17A protein sequence aligned to the PfSBW25 Orphan GH17 domain (positions 60–299) with a 50.6% sequence similarity and there was a matching pattern of α-helices and β-sheets as predicted by Proteus2. We confirmed the similarity with RmBgt17A by identifying corresponding TIM-barrel α-helices in the Phyre2 and SWISS-MODEL single-domain models with those in the RmBgt17A template (S3 Table) and by FATCAT comparisons (Table 2A). However, although the RmBgt17A sub-domain SD1 α-helix–β-sheet and SD2 β-sheet–loop sequences were not found in the PaPA01, PfSBW25, PpKT2440 and PsDC3000 Orphan proteins (Table 3), the corresponding regions had similar predicted secondary structures. The corresponding SD1 sequence was more highly conserved than the SD2 sequence in an alignment of a larger set of Orphan homologs (S3 File), suggesting a structural importance for these sub-domains.

Table 3. Conserved motifs and residues identified in the Orphan proteins.

The VGxEV and ExGWP motifs and conserved glutamic acid residues were recognised in the PfSBW25 Orphan sequence, and in both Phyre2 and SWISS-MODEL single-domain models they were localised in an exposed cleft running across the bottom of the TIM-barrel, as in RmBgt17A [104]. Further comparison using Shannon entropy analysis [73] indicates that GH17 sub-domains, motifs, and conserved residues are in regions of high conservation across Orphan proteins (Fig 5) and may therefore retain some GH17 / RmBgt17A-like functionality. The first major INDEL seen among Orphan homologs occurs after the end of the GH17 domain (S3 File).
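Short motifs of this kind can be located in a protein sequence with a simple pattern search. The sketch below is an illustration only, not the procedure used to build Table 3: the function name and the toy sequence are hypothetical, and 'x' in each motif is treated as any single residue.

```python
import re
from typing import Dict, List, Optional, Tuple

# GH17 motifs named in the text, written as regular expressions
# ('x' = any single residue).
GH17_MOTIFS: Dict[str, str] = {
    "VGxEV": r"VG.EV",
    "ExGWP": r"E.GWP",
}


def find_motifs(
    sequence: str, motifs: Optional[Dict[str, str]] = None
) -> Dict[str, List[Tuple[int, str]]]:
    """Return, for each motif, a list of (1-based start position, matched text)."""
    patterns = GH17_MOTIFS if motifs is None else motifs
    seq = sequence.upper()
    return {
        name: [(m.start() + 1, m.group()) for m in re.finditer(pattern, seq)]
        for name, pattern in patterns.items()
    }


# Synthetic toy sequence, not an Orphan or RmBgt17A sequence.
toy = "MKVGNEVAAAETGWPS"
for name, hits in find_motifs(toy).items():
    print(name, hits)
```

A match found this way only shows that the residues are present; whether they sit in the active-site cleft still has to be checked against the structural model.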

Fig 5. The Orphan proteins share conserved features found in GH17 glucanosyltransferases and GT2 cellulose synthases.

Shown here is a map of amino acid conservation scores (A) determined by Shannon entropy [73] from a Clustal Omega multiple sequence alignment [70] of 26 Orphan proteins found in Pseudomonas aeruginosa AZPAE12140, BL14, PAK, PA01, PA14, LESB58, 19BR and 3573, P. fluorescens ICMP 11288, ICMP 3512, KF1, LMG 5329, SBW25, SS101, WH6 and WS 5037, P. putida KT2440, S610, W619 and YKD221, and P. syringae B728a, DC3000, ICMP 9617, NCPPB 4273, UMAF0158 and 41a strains (see S1 File for protein sequences and S3 File for an annotated multiple sequence alignment of these proteins). A value of 1 indicates complete conservation while lower values indicate less conservation of that residue. Overlaid onto this map are conserved domains, motifs, and residues, found in fungal GH17 β-(1,3)-glucanosyltransferases and Rhizomucor miehei CAU432 Bgt17A, as well as bacterial GT2 synthases such as Rhodobacter sphaeroides 2.4.1 BcsA, that are indicated by vertical blue lines and coloured rectangles. Below this is a simplified schematic of the Orphan protein (B) in which the proposed functional GH17 and GT2 domains are indicated along with the signal peptide sequence (yellow) (not identified in P. syringae strain Orphans) and transmembrane domains (grey). Note that the x-axis and Orphan schematic shown here are longer than individual Orphan proteins and the additional length is a result of the INDELS (mainly insertions) introduced by the multiple sequence alignment. Fungal GH17 / RmBgt17A conserved domains, motifs, and residues: 1, Conserved D. 2, Conserved R. 3, Conserved Y. 4, Conserved E. 5, Conserved G. 6, Conserved W. 7, VGNE motif. SD1 & SD2, Sub-domains. 8, Putative catalytic E. 9, GWP catalytic site. 10, Conserved G. 11, Conserved G. 12, Conserved WK. 13, Conserved WG. Bacterial GT2 / RsBcsA conserved domains, motifs, and residues: TM1–TM3, Transmembrane helices. IF1, Amphipathic interface helix 1. BcsA Active Site, Active site of GT2 synthases. 14, DDG motif (but only D). 15, HAKAG motif. 16, DAD motif. 17, QTPH motif. 18, FFCGS motif (but only G). 19, TED motif (but only ED). 20, Conserved E not seen in the Orphans. IF2, Amphipathic interface helix 2. 21, QRxRW motif. TM4–TM6, Transmembrane helices. 22, Conserved T not seen in the Orphan proteins. TM7 & TM8, Transmembrane helices. 23, RxxxR motif associated with c-di-GMP binding not seen in the Orphan proteins. Note some conserved motifs / residues not identified in the Orphan proteins are also indicated for reference.
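As an illustration of how a per-column conservation score on a 0 to 1 scale can be derived from Shannon entropy, a minimal sketch follows. It should not be read as the exact procedure of [73]: the function name, the normalisation by the logarithm of a 20-letter alphabet, and the choice to ignore gap characters within a column are assumptions made for this example.

```python
import math
from collections import Counter
from typing import List


def conservation_scores(alignment: List[str], alphabet_size: int = 20) -> List[float]:
    """Per-column conservation = 1 - H / log(alphabet_size).

    H is the Shannon entropy of residue frequencies in the column.
    A score of 1 means the column is completely conserved. Gaps ('-')
    are ignored (assumption); all-gap columns are scored 0.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log(alphabet_size)
    scores: List[float] = []
    for column in zip(*(seq.upper() for seq in alignment)):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        n = len(residues)
        entropy = -sum(
            (count / n) * math.log(count / n)
            for count in Counter(residues).values()
        )
        scores.append(1.0 - entropy / max_entropy)
    return scores


# Toy alignment of four short sequences (not Orphan sequences).
toy_alignment = ["ACD", "ACE", "AGE", "A-E"]
print([round(s, 3) for s in conservation_scores(toy_alignment)])
```

How gaps are treated matters for alignments with many insertions, such as the one shown in Fig 5, so scores from different implementations are not directly interchangeable.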

We note that the GH17 domains of the PaPA01 and PpKT2440 Orphan proteins have been enzymatically characterised as fusion or tagged proteins (referred to as Glt1 and Glt3, respectively) that showed non-Leloir trans-β-glucosylation activity on linear β-(1,3)-glucan, cleaving short β-(1,3)-oligosaccharides and retaining the non-reducing end which is transferred to another acceptor glucan with a β-(1,3) linkage in an elongation reaction [48]. Glt1 and Glt3 template-based models were also created using the RmBgt17A template and the two catalytic glutamic acids within the active site cleft confirmed by superimposition of RmBgt17A [49]. Molecular dynamics simulations using these models demonstrated that Glt1 and Glt3 could bind β-(1,3)-glucan with the major cleavage site occurring at the third or fourth β-(1,3) linkage.

The second Orphan domain includes a Rossmann-like fold and a BcsA-like active site positioned at the base of a twisted column of transmembrane helices.

The RsBcsA catalytic cellulose synthase sub-unit contains a transmembrane (TM) region formed by a twisted cylinder of eight transmembrane helices that cross the inner membrane in a truncated cone of twisted cylinders, with a large intracellular Leloir (sugar-nucleotide-dependent) GT2 domain linked to a six-stranded β-barrel PilZ domain by a curved α-helical region [25, 30, 31, 34]. The GT2 domain contains an α/β/α sandwich Rossmann-like fold that includes seven β-sheets arranged in a 3214657 topology in which β-sheet 6 is antiparallel to the others [18]. The domain also contains the Q(Q/R)xRW motif which forms part of the active site and is on amphipathic interface helix IF2 located perpendicular to the base of the transmembrane helices on the cytoplasmic surface of the inner membrane. The first aspartate in each of the conserved DD and DxD motifs is involved in the coordination of the UDP-Glucose substrate while the aspartate in the ED motif is the catalytic base [32]. The DD, DxD, ED and Q(Q/R)xRW motifs form the active site where they mediate substrate (UDP-Glucose) and acceptor (non-reducing end of the glucan chain) binding. This is positioned just below a narrow transmembrane channel formed by six transmembrane helices (TM3–8) [32] and an additional TM helix is provided by the accessory membrane-anchored RsBcsB protein which is required for BcsA activity [107]. The regulatory PilZ domain includes the RxxxR and DxSxxG c-di-GMP-associated motifs [34]. When c-di-GMP is bound, there is a conformational change in a gating loop allowing UDP-Glucose to bind to the active site (the gating loop also contains the conserved FxVTxK motif) [33, 108]. The glucosyl unit is transferred to the non-reducing end of the cellulose polymer with a β-(1,4) linkage and the elongating polymer is translocated up through the transmembrane channel to the periplasm and the RsBcsC porin located in the outer membrane [25, 30, 31, 109].

Residues 15–566 of the RsBcsA protein aligned to the PfSBW25 Orphan protein TM region / GT2 domain (residues 305–813) with a 37.7% normalised similarity and there was a matching pattern of TM helices, α-helices, and β-sheets, in RsBcsA and the PfSBW25 Orphan as predicted by Proteus2. We confirmed the similarity with RsBcsA by identifying corresponding TM helices, Rossmann-like fold β-sheets, and IF2 / Q(Q/R)xRW in the Phyre2 and SWISS-MODEL models (S3 Table) and by FATCAT comparisons (Table 2A).

However, the Phyre2 and SWISS-MODEL coverage of the RsBcsA structure differed, with Phyre2 modelling residues 314–856, including the TM helices on either side of the GT2 domain, and SWISS-MODEL a shorter sequence covering residues 416–855 that missed out the TM helices encoded before the GT2 domain (S3 Table). Differences were seen in the positioning of the start and ends of TM helices, α-helices, and β-sheets along the primary amino acid sequence, with SWISS-MODEL predicting a significantly shorter TM5, and both models presenting TM6 (predicted by both HMMSCAN and Proteus2) as a normal α-helix located perpendicular to the base of the TM region. Despite these differences, Shannon entropy analysis indicates that GT2 motifs and conserved residues in the active site are in regions of high conservation across Orphan proteins (Fig 5), suggesting that this domain may retain some GT2 / BcsA-like functionality. Poor levels of structural conservation with RsBcsA were seen in the transmembrane helices TM1–3, just before TM4, and in TM7, though the only major INDEL among Orphan homologs in the TM region and GT2 domain in fact occurs just after IF2 / Q(Q/R)xRW outside the BcsA-like active site. It is noteworthy that in a larger alignment of GT-A Rossmann-like folds, the Q(Q/R)xRW motif is located on a hypervariable region (HV3) rather than on a conserved α-helix (IF2 in BcsA proteins), β-sheet, or loop structure, despite the enzymatic importance of this sequence [23].

Structural predictions of the Orphan protein suggest a transmembrane ovoid-like protein with a periplasmic GH17 domain and a cytoplasmic GT2 domain

Template-based homology modelling of the PfSBW25 Orphan protein by Phyre2 and SWISS-MODEL, and comparisons of these models with the RmBgt17A and RsBcsA proteins, have provided some insight into the likely function of the Orphan protein. However, these models are biased by the need to align the PfSBW25 sequence to the X-ray structures, and neither Phyre2 nor SWISS-MODEL was able to confirm the Protter protein topology prediction by producing a combined structure linking the GH17 and TM region / GT2 domains. We decided to explore different predicted structures produced using AI / Deep Learning-based free modelling, which allows structures to be predicted beyond the limits of available templates [35–37]. The limited sequence homology seen between RmBgt17A and the GH17 domain (26.1% sequence identity) and between RsBcsA and the TM region / GT2 domain (23.8%) suggests that the PfSBW25 Orphan protein is in the ‘twilight’ modelling category [16] and it is possible that free modelling might produce significantly different domain structures as well as a combined GH17 / TM region / GT2 domain structure stabilised by tertiary interactions. However, we would expect to see consistency across predictions allowing a consensus structure to be produced in agreement with our earlier secondary structure and protein topology predictions.

We submitted the PfSBW25 Orphan sequence initially to AlphaFold Colab Notebook [82, 83], and then to IntFOLD6 [84], RoseTTAFold [86] and transform-restrained Rosetta (TrRosetta) [89–91] for comparison, and discuss here the top-ranked predicted models they produced (S2 File). In this modelling, we chose to retain the signal peptide identified in the PfSBW25 Orphan protein, as in other Orphans it was not clear whether a signal sequence and cleavage site was present or not (see our test of the significance of the signal sequence on the AlphaFold PfSBW25 predicted structure which is described later). To assess these models, we visually compared structures to determine the relative positioning and orientation of the GH17 domain, TM region and GT2 domain and compared the start/stop positions of the reference TM helices, α-helices, and β-sheets we used to assess our single-domain models. We complemented this using Pairwise Structure Alignments and compared FATCAT-rigid RMSD and TM-score values to assess global differences in the positioning of the Cα backbone and similarities in protein topologies (due to the number of possible pairwise combinations we focussed on comparisons with the AlphaFold predicted structure).

The AlphaFold predicted PfSBW25 structure reveals the association between the GH17 domain, transmembrane region, and GT2 domain.

The PfSBW25 Orphan model is a transmembrane ovoid-like structure with the GH17 domain in the periplasm and the GT2 domain in the cytoplasm (Fig 6) (it should be noted that Mol* was used to visualize the membrane using the ANVIL algorithm [93] rather than it being modelled by AlphaFold; ANVIL positions membranes based on amino acid hydrophobicity and the likelihood that these residues are embedded in membranes). The GH17 domain is dominated by the TIM-barrel with the GH17 active site cleft running horizontally across the bottom of the structure and immediately above the top of the TM region formed by a twisted cylinder of transmembrane helices and the α-helix signal peptide (see S3 Fig for a schematic identifying each of the TM helices). The GT2 domain appears to be compressed against the bottom of the TM region, with several α-helices including IF2 lying perpendicular to the transmembrane helices and the seven β-sheets of the Rossmann-like fold located further away from the membrane surface. In addition to the TIM-barrel and the Rossmann-like fold, we were able to identify the α-helices and transmembrane helices we had seen in RmBgt17A and RsBcsA, and in the Orphan single-domain models (S3 Table), with the AlphaFold model having similar variation in secondary structure start and stop positions as SWISS-MODEL compared to the Phyre2 reference models.

Fig 6. Two-domain structural prediction of the PfSBW25 Orphan protein.

Shown here is the AlphaFold predicted structure of the Pseudomonas fluorescens SBW25 Orphan protein showing the relative positioning of the GH17 domain, transmembrane (TM) region and GT2 domain. The transmembrane ovoid-like structure of the protein (A) is dominated by α-helices (magenta) in the TM region with some β-sheets (gold) found in the GH17 TIM-barrel and GT2 Rossmann-like fold. Loops (green) (sections with poor certainty are in light green and white) are also indicated. Surface hydrophobicity (B) is represented by cold colours with hydrophilic surfaces indicated by warmer colours. The model was produced by AlphaFold Colab Notebook [82, 83] and the PDB file is available (see S2 File). The model was visualised with Mol* 3D Viewer [92] using molecular surface and membrane orientation representations and colouring residues according to secondary structure or hydrophobicity.

Surface representations of the model suggested substantive interactions between the GH17 domain and TM region though they were only linked by one unstructured sequence from the end of the last α-helix identified in the TIM-barrel to the start of TM1 (residues 283–311). We identified six non-covalent interactions between the GH17 domain, unstructured sequence and the TM region that might help stabilise the positioning of the GH17 domain against the periplasmic face of the TM region (see S4 Table for a list of non-covalent interactions). The signal peptide, shown as a long α-helix and associated with the TM region, is unlikely to have any impact on the prediction, as AlphaFold produced an almost identical structure for a truncated PfSBW25 Orphan protein lacking this sequence (FATCAT-rigid RMSD 0.69, TM-score 0.96, and Score 2468.03) (S2 File).

Pairwise Structure Alignments with the RmBgt17A and RsBcsA structures and single-domain models produced low RMSD values of 1.99–3.17 (Table 2B) reflecting the relative ease of predicting single-fold structures and confirming the AlphaFold predictions of the GH17 and GT2 domains (AlphaFold was trained to produce predicted structures most likely to appear in protein database structures [82]). We obtained a low TM-Score in the comparison with the RmBgt17A template but suspect this was the result of difficulties in the sequential alignment of α-helices, as in a restricted comparison omitting the AlphaFold α-helix signal peptide (residues 36–305), a better TM-Score of 0.87 was obtained in line with our other comparisons. A smaller improvement, to a TM-Score of 0.64, was also obtained by comparing the AlphaFold TM region / GT2 domain (residues 416–855) with the corresponding region of the BcsA protein (residues 140–700), reflecting problems associated with the prediction of TM α-helices either side of the GT2 domain and subsequent alignment. However, the alignment between the RsBcsA template and the AlphaFold model was sufficiently good to allow us to superpose structures including the cellulose polymer which passes up through the RsBcsA TM channel and turns sharply to follow the membrane-proximal surface of RsBcsB in the periplasm [30]. In the AlphaFold model, the superposed cellulose non-reducing end is positioned near IF2 which includes the Q(Q/R)xRW motif, whilst the reducing end projects out from the base of the GH17 domain (see S4 Fig for images of the superposed model). The cellulose polysaccharide binds in the RsBcsA TM domain with a significant bend, which is determined by the overall structural arrangement in complex with RsBcsB; however, the RsBcsA-cellulose complex structure lacks a GH17 domain and therefore we suggest that in Orphan proteins the cellulose polysaccharide reducing end would extend into the GH17 active site cleft.

Structural prediction programmes reveal common features and a consensus model.

A visual comparison of the IntFOLD6, RoseTTAFold and TrRosetta predicted structures for the PfSBW25 Orphan protein (Fig 7) suggests that the TrRosetta prediction is very similar to AlphaFold, except for the projection of two TM helices (TM7 & TM8) apart from the main TM structure. The way TrRosetta has broken up the TM region looks unusual, and it would be interesting to simulate the dispersion of lipids around this structure to see whether the projection breaks the upper lipid layer or not. In both the AlphaFold and TrRosetta predictions, surface representations also suggest substantive tertiary interactions stabilizing the GH17 domain next to the TM region. In contrast, RoseTTAFold suggests a different configuration that places the GH17 domain out on an extended unstructured linking sequence away from the TM region and the lipid bilayer (Fig 7). IntFOLD6 suggests a third configuration with the GH17 domain perpendicular to a more cone-shaped TM region and below the membrane visualised by Mol* 3D Viewer. The linking sequence lengths differ slightly between models, with AlphaFold and RoseTTAFold predicting a short α-helix (residues 284–288) in this region, and all but RoseTTAFold looping the sequence around rather than showing it as an extended, linear structure. As for AlphaFold, non-covalent interactions linking the GH17 domain, unstructured sequence and the exposed surface of the TM region were seen in the other models (S4 Table), with the most identified in the IntFOLD6 model where the periplasmic face of the TM region was larger, allowing greater contact with the GH17 domain. The differences in positioning the GH17 domain in these models illustrate problems in positioning long unstructured regions linking discrete domains and assessing interactions which might stabilise tertiary structures. We note that the relative positioning of the GH17 and GT2 domains with respect to the bacterial inner membrane could be confirmed experimentally using Alkaline phosphatase (PhoA) fusions [110] while protease-sensitivity assays could be used to investigate the separation between the top of the TM region and the base of the GH17 domain. Transmembrane and membrane-associated proteins are often found in complexes with other proteins, and the interactions between these might have a significant impact on the positioning of flexible structures such as the TrRosetta TM projection and the RoseTTAFold GH17 domain.

Fig 7. Common features are seen in PfSBW25 Orphan protein structural predictions produced by different servers.

Shown here are predicted structures of the Pseudomonas fluorescens SBW25 Orphan protein produced by AlphaFold (A), TrRosetta (B), RoseTTAFold (C), and IntFOLD6 (D). The relative positioning of the GH17 domain, transmembrane (TM) region and GT2 domain are indicated along with the position of a predicted lipid bilayer (grey ovals). Note that relative sizes vary from image to image and volumes may be hard to assess. Surface hydrophobicity is represented by cold colours with hydrophilic surfaces indicated by warmer colours. The models were produced by AlphaFold Colab Notebook [82, 83], IntFOLD6 [84], RoseTTAFold [86], and TrRosetta [89–91] and PDB files are available (see S2 File). Models were visualised with Mol* 3D Viewer [92] using molecular surface and membrane orientation representations and colouring residues according to hydrophobicity.

As expected, the Pairwise Structure Alignments of predicted structures resulted in higher RMSD values (9.65–18.05) and lower TM-Scores (0.33–0.62) (Table 2C) than seen in our earlier comparisons of single-domain models and structural templates. Differences between homologous structures are generally the result of changes in the positioning of α-helices and β-sheets packed within domains as well as local changes especially in α-helices [13], and AI / Deep learning methods are successful in producing higher-level representations of predicted structures but may not be so consistent with smaller-scale details [35]. Although we found variation between models in the positioning of structures along the primary amino acid sequence (S3 Table), we were able to recognise similar organisation of TM helices, α-helices, and β-sheets across the predicted structures. The AlphaFold and TrRosetta models adopted similar organisation with the RoseTTAFold model having an extended unstructured sequence linking the GH17 domain and TM region. The IntFOLD6 model differs substantially from these structures but retains the same relationship between the TM region and GT2 domain. Molecular surface representations of the AlphaFold and TrRosetta models suggest that the GH17 domain and the exposed surface of the TM region are closely fitted, though no significant difference in non-covalent interactions that might stabilise the positioning of the GH17 domain was found between the four models.

On this basis we propose a consensus predicted structure for the PfSBW25 Orphan protein based on the AlphaFold transmembrane ovoid-like model which places the GH17 domain, TM region and GT2 domain along a central axis with the GH17 domain in the periplasm and the GT2 domain in the cytoplasm, in agreement with our HMMSCAN, Proteus2, and Protter predictions and earlier speculative schematic structures [47, 48]. It should be noted that we have not determined that the AlphaFold predicted structure is somehow better than the others (this would require a physical structure for comparison), rather it shows the most common positioning of the GH17 domain (with TrRosetta), TM region structure (with RoseTTAFold), and positioning of the GT2 domain (with TrRosetta and RoseTTAFold) (Fig 7). As for all predicted structures, it is unclear whether our consensus represents the actual structure adopted by the real protein [35] and how much this would be distorted by substrate and product binding and interaction with other proteins in the inner membrane and periplasm of the bacterial cell, and these possibilities would need to be investigated experimentally.

Predicted structures for other Orphan proteins.

We also determined AlphaFold predicted protein structures for the PaPA01, PpKT2440 and PsDC3000 Orphan protein sequences (S2 File). Both PaPA01 and PpKT2440 predictions were very similar to the AlphaFold PfSBW25 Orphan model (Fig 8 & Table 2), confirming our consensus structure for these highly homologous proteins. In the recent AlphaFold Protein Structure Database update (v27 January 2022) [83], we noted the release of the PaPA01 Orphan structure (AlphaFold DB AF_Q914H4-F1) created by the AlphaFold Monomer v2.0 pipeline. This prediction is almost identical to our AlphaFold Colab Notebook structure (FATCAT RMSD 0.73, TM-score 1, Score 2587.7) and demonstrates the reliability of the more accessible but simplified Colab Notebook server. The small RMSD value differentiating the two AlphaFold models serves as a reminder that as AI/Deep learning methods improve and databases expand, predicted structures will become outdated and should not be used as fixed references or gold standards.

Fig 8. The Orphan protein two-domain structure is conserved across clades.

Shown here are the AlphaFold predicted structures of Orphan proteins from other model Pseudomonas spp. and representatives of sister clades identified in the UPGMA analysis of homologs. Clade 3 representative Arcobacter butzleri ED-1 (A); Clade 4 representative Azoarcus strain DN11 (B); Clade 5 representative Methylomonas methanica MC09 (C); Clade 6 representative P. viridiflava LMCA8 (D); Clade 6 member P. aeruginosa PA01 (E); Clade 6 member P. fluorescens SBW25 (F); Clade 6 member P. putida KT2440 (G); Clade 6 member P. syringae DC3000 (H). The relative positioning of the GH17 domain, transmembrane (TM) region and GT2 domain are indicated along with the position of a predicted lipid bilayer (grey ovals). Note that relative sizes vary from image to image and volumes may be hard to assess. Surface hydrophobicity is represented by cold colours with hydrophilic surfaces indicated by warmer colours. The models were produced by AlphaFold Colab Notebook [82, 83] and the PDB files are available (see S2 File). Models were visualised with Mol* 3D Viewer [92] using molecular surface and membrane orientation representations and colouring residues according to hydrophobicity.

In contrast to the PaPA01 and PpKT2440 Orphan models, the PsDC3000 prediction positioned the GH17 domain perpendicular to the TM region and on the same side of the membrane as the GT2 domain, as in the IntFOLD6 PfSBW25 Orphan model. The PsDC3000 Orphan structure and Protter protein topology prediction agree, with residues 1–43 shown in the model as an extended and largely unstructured region leading right up to the edge of the TIM-barrel. HMMSCAN and other transmembrane topology and signal peptide predictions including LipoP [76], Phobius [77], PRED-TAT [78], and SignalP [81] failed to identify a signal peptide in the PsDC3000 Orphan. Protter also failed to identify signal peptides and predicted the GH17 domain to be cytoplasmic for five other P. syringae Orphans we analysed (B728a, ICMP 9617, NCPPB 4273, UMAF0158, and 41a), and they do not share any significant sequence conservation until residue 35 of the PsDC3000 Orphan protein, suggesting that the P. syringae Orphan subclade has lost the signal peptide found in other Orphan proteins. However, as our investigation of the PfSBW25 Orphan protein has shown the signal sequence had no significant impact on the predicted structure, it is unlikely that residues 1–43 have a great impact on the structure of the PsDC3000 Orphan protein. Further modelling of hybrid PfSBW25/PsDC3000 proteins might reveal which sequences are responsible for the different orientation of the GH17 domain in these AlphaFold models, and PhoA fusions [110] could be used to confirm the positioning of the PsDC3000 Orphan protein in the bacterial inner membrane. The different structure predicted for PsDC3000 suggests that the protein may be non-functional despite the high level of sequence conservation of these proteins within the P. syringae subclade.

Predicted consensus structure is conserved in more distant homologs.

More distant homologs of the PfSBW25 Orphan protein, such as Arcobacter butzleri ED-1 which was identified as having the same Pfam PF00332–PF13641 domain architecture as other Orphan proteins rather than by BLAST+ analysis or through PseudoCAP, and shares only 35.1% amino acid sequence identity (54.1% similarity) to the PfSBW25 Orphan, may not therefore be expected to share the same protein folds [13–16]. However, AlphaFold predicted structures for representative Orphans from the sister clades identified in our earlier UPGMA comparison of homologs (Clade 3, AbED-1; Clade 4, Azoarcus strain DN11; and Clade 5, Methylomonas methanica MC09, respectively, Table 1) were visually similar to the PfSBW25 Orphan consensus structure (Fig 8) (like AbED-1, ADN11 and MmMC09 were identified by domain architecture though their annotations vary: AbED-1: glycosyltransferase, ADN11: Uncharacterised protein, and MmMC09: Glycosyl transferase family 2).

Pairwise Structure Alignments confirmed structural similarities, with FATCAT RMSD values of 1.86–2.62 and TM-Scores of 0.85–0.92 (Table 2D). Unsurprisingly, the predicted structure of the Clade 6 representative, Pseudomonas viridiflava LMCA8, which is also a member of the P. syringae Subclade 6.1, was more similar to the PsDC3000 protein (Fig 8). These comparisons suggest that the consensus Orphan structure is highly conserved within Clade 6 containing the Pseudomonas Orphans (excepting the P. syringae proteins as discussed), and across into sister clades where sequence conservation is progressively reduced.
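For readers less familiar with these two metrics, the following is a minimal Python sketch of how an RMSD and a TM-score can be computed from the distances between aligned Cα pairs once two structures have been superposed. It is illustrative only and is not the FATCAT or Pairwise Structure Alignment implementation: the function names and the short-chain guard are assumptions of this sketch, and a full TM-score additionally searches for the superposition that maximises the sum rather than scoring a single fixed superposition.

```python
import math


def d0(ref_length: int) -> float:
    """Length-dependent distance scale used in the standard TM-score formula."""
    # Guard chosen for this sketch: the formula gives a non-positive scale
    # for very short chains, so they are rejected here.
    if ref_length < 19:
        raise ValueError("reference chain too short for this d0 formula")
    return 1.24 * (ref_length - 15) ** (1.0 / 3.0) - 1.8


def rmsd(distances):
    """Root-mean-square deviation over aligned residue pairs (in angstroms)."""
    if not distances:
        raise ValueError("no aligned residue pairs")
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_score(distances, ref_length: int) -> float:
    """TM-score for one fixed superposition, normalised by the reference length.

    `distances` holds one distance per aligned residue pair; residues of the
    reference that are not aligned contribute nothing to the sum.
    """
    scale = d0(ref_length)
    total = sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances)
    return total / ref_length
```

Because the TM-score is normalised by the length of the reference chain and down-weights large deviations, it is less sensitive than RMSD to a few badly placed loops, which is why the two values are usually reported together.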

Orphans are likely to be cyclic-β-glucan (CβG) synthases

Our interpretation of the structure and function of the Orphan protein was initially driven by homology to the BcsA cellulose synthase subunit. However, significant sequence and structural variation is found among GT2 / GT-A homologs which have a range of synthase activity with predictive structures of cellulose, chitin and curdlan synthases, as well as β-(1–3,1–4)-glucan synthase, all produced using the RsBcsA structure [27, 28, 111, 112]. This means that the GT2 domain may not synthesize cellulose or a related β-(1,4) glucan such as chitin, or even a β-(1,3) glucan such as curdlan, and we are not aware of any definitive sequence or structural feature in the Orphan protein which suggests it produces a particular β-glucan. Despite the differences in width and stacking of the glycosyl units in different glucans, it is possible to position the curdlan chain in the transmembrane channel of the predicted structure of a curdlan synthase based on the RsBcsA template [28], and single amino acid changes in TM6 of the plant β-(1–3,1–4)-glucan synthase alter the proportion of the two linkages in the glucan product [111]. Nonetheless, we are able to suggest a model for function of the Orphan synthase based on earlier work linking the Orphan protein (NdvB) in PaPAK and PaPA14 to the production of a partially glycerol-phosphorylated cyclic-β-glucan (CβG) [48, 50, 51], enzymatic characterization and molecular dynamics simulations of Glt1 and Glt3 fusion proteins [48, 49], and our understanding of the relationship of the GH17 domain, TM region and GT2 domain through the consensus structure obtained in this work (the distorted predicted structure obtained for PsDC3000 suggests that this Orphan would be non-functional and we note that no periplasmic cyclic glucans were identified in PsR32 [113]).

We suggest that the Orphan GT2 domain acts as a Leloir-type synthase to produce a β-(1–3)-glucan chain which passes through the transmembrane channel to the GH17 active site cleft where transglycosylation hydrolysis, elongation, and cyclization reactions produce the 12–16 glucosyl backbone [51] (we accept that the GH17 and GT2 functions might be provided by separate proteins, but the identification of double-domain containing Orphan proteins within the Pseudomonas and in other genera suggests that the two domains are functionally connected in the same transmembrane protein). The major cleavage site of Glt1 is at the fourth linkage from the non-reducing end of the glucan chain [48, 49], but 3–4 oligosaccharide fragments would need to be cleaved and re-joined through the elongation reaction, retaining the original β-(1–3) linkage, before cyclization to produce the CβG. However, the same product might be more efficiently produced by looping the β-(1–3)-glucan chain through the active site with only one hydrolysis reaction between the 12–16 linkages followed by cyclization. Clearly, this process needs to be confirmed by experimentation, and comparison of the enzymatic behaviour of Orphan mutant proteins produced by site-directed or random scanning mutagenesis might identify the critical residues and motifs used by the GH17 domain to modify the glucan chain produced by the GT2 domain.

We suggest that a second, membrane-associated enzyme is then responsible for the partial substitution of glucosyl groups in the CβG. Phosphoglycerol transferase (MdoB; UniProtKB P39401) transfers phosphoglycerol residues from phosphatidylglycerol to membrane-derived oligosaccharides in the periplasm of E. coli [114] and we have identified possible but distant homologs of this protein in PaPA01, PfSBW25, PpKT2440 and PsDC3000 with 22–27% sequence identity which might perform this function (see S5 Table for a list of MdoB homologs) (we note that in Desulfofustis glycolicus DSM 9705 the orphan gene is located immediately upstream of an MdoB homolog, but this is the only example we are aware of). The involvement of an MdoB homolog in the partial substitution of glucosyl groups would need to be confirmed experimentally by chemical analyses and genetic approaches. It would be interesting to model the interactions between the Orphan protein and EcMdoB or one of the MdoB homologs to determine if CβG synthesis and substitution are linked, which is perhaps supported by the high level of substitution observed [51], though the two processes may happen independently if the MdoB homolog had a high affinity for CβG. We note that of all the Orphan–dapE gene syntenies we have checked, one has the orphan gene immediately upstream of an MdoB homolog (Desulfofustis glycolicus DSM 9705), and this system in particular might be worthy of future investigation.


Bacteria produce a wide range of structurally diverse polysaccharides with numerous functional roles, yet it appears that our understanding of the genes and enzymes involved in producing these remains incomplete even in well-studied model strains. The rapid development of sequence and structure-based annotation allows the rapid identification of genes [1, 2], but it remains problematic that sequence, structure, and function are not always robustly linked [3, 8, 9, 17] and that some sequences may be misannotated. Our investigation of the Orphan proteins, highly conserved within the Pseudomonas, is a good example of this problem, as sequence homology to the cellulose synthase catalytic subunit BcsA suggested a role in cellulose production, whereas further investigation of conserved domains and predicted structures has allowed us to suggest they represent a novel family of cyclic-β-glucan (CβG) synthases, in agreement with earlier characterization of the GH17 glycosyl hydrolase family domain and transposon mutants [47–49, 51]. Our comparison of predicted structural models has allowed us to identify a consensus transmembrane ovoid-like structure which positions the GH17 domain in the periplasm and the GT2 glycosyltransferase family domain in the cytoplasm. These findings have given us sufficient insights to plan further in silico and biochemical analysis to confirm Orphan function and investigate the functional role of CβG produced by pseudomonads in a wide range of environments including soil and plant surfaces as well as during plant and human pathogenesis.

Our use of predictive modelling also highlights variation in the structures produced by different AI / Deep Learning-based free modelling approaches. In the absence of highly homologous template structures, we advise the use of two or three approaches followed by pairwise structural comparison to evaluate models and propose a consensus structure. Furthermore, models made available in various databases need to be regularly revised as algorithms become more sophisticated and template databases expand.
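As an illustration of the comparison step recommended above, the following Python sketch picks a representative model from a set of predictions using their pairwise TM-scores and flags models that disagree with the rest. It is one possible way of organising such a comparison and not the procedure used in this work: the model labels, the function names and the 0.5 cutoff (a commonly used rule of thumb for two structures sharing a fold) are assumptions of this sketch, and any scores passed to it would need to come from a real pairwise structure alignment.

```python
from itertools import combinations


def mean_similarity(models, pair_scores):
    """Mean pairwise TM-score of each model against all other models.

    `pair_scores` maps frozenset({model_a, model_b}) to a TM-score.
    """
    if len(models) < 2:
        raise ValueError("at least two models are needed for a comparison")
    totals = {m: 0.0 for m in models}
    for a, b in combinations(models, 2):
        score = pair_scores[frozenset((a, b))]
        totals[a] += score
        totals[b] += score
    n_other = len(models) - 1
    return {m: totals[m] / n_other for m in models}


def pick_consensus(models, pair_scores, cutoff=0.5):
    """Return the most central model and the models falling below the cutoff."""
    means = mean_similarity(models, pair_scores)
    representative = max(models, key=lambda m: means[m])
    outliers = [m for m in models if means[m] < cutoff]
    return representative, outliers
```

A model flagged as an outlier in this way is not necessarily wrong, but it marks a prediction that should be inspected before a consensus structure is proposed.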

Supporting information

S1 Fig. GC content of the orphan-dapE regions of Pseudomonas fluorescens SBW25, P. putida KT2440 and P. syringae DC3000.

Shown here are GC content plots covering the orphan (blue) and dapE (gold) genes and adjacent genes (grey) not involved in DapE activity or cellulose production. Each of the plots covers approximately 6,000 bp and the locus tags (from left to right) are Pf. SBW25: PFLU1258, PFLU1259, PFLU1260 and PFLU1261; Pp. KT2440: PP1524, PP1525, PP1526 and PP1527; and Ps. DC3000: PSPTO1522, PSPTO1523, PSPTO1524 and PSPTO1525. The horizontal dashed line indicates the mean GC content for each genome. Genomes were obtained from PseudoCAP [60] as GBK files and were viewed using Artemis [61]. The GC plots are copies of the Artemis graphs.
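For reference, a GC content plot of this kind can be approximated with a simple sliding-window calculation. The Python sketch below is illustrative only and does not reproduce the Artemis implementation; the function names and the window and step sizes are assumptions of this sketch.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int = 120, step: int = 1):
    """Return (start, GC fraction) for each full window along the sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Comparing each window with `gc_fraction` of the whole genome gives the equivalent of the dashed mean line in the plots.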


S2 Fig. Phylogenetic tree of Orphan protein homologs.

Shown here is a composite figure of an un-rooted UPGMA phylogenetic tree produced by Clustal Omega Simple Phylogeny [70] of 190 Orphan protein homologs and drawn with real and cladogram (uniform) scales. Species and protein annotation (in parentheses) and genetic distances are provided for each protein (see S1 File for protein sequences). The UPGMA tree is divided into seven clades (A) with Clade 6 containing all Pseudomonas spp. Orphan proteins and further subdivided into five subclades (B) (inset figures are from Fig 3). Rhizomucor miehei CAU432 Bgt17A is indicated by the white circle (Clade 1). Escherichia coli MG1655 and Rhodobacter sphaeroides 2.4.1 BcsA reference proteins are indicated by the black circles (Clade 2). Orphan proteins from Pseudomonas aeruginosa PA01 (Clade 6, Subclade 5), P. fluorescens SBW25 (Clade 6, Subclade 4), P. putida KT2440 (Clade 6, Subclade 4), and P. syringae DC3000 (Clade 6, Subclade 1) are indicated by coloured squares. Rhizobium meliloti 1021 NdvB and Schizosaccharomyces pombe 972 Ags1 were chosen as outliers for this tree (Clade 7). The real and cladogram trees and text are copied from the Simple Phylogeny output.


S3 Fig. Identification of the transmembrane helices in the AlphaFold predicted structure of the Pf SBW25 Orphan protein.

Shown here is a view of the AlphaFold model of the Pseudomonas fluorescens SBW25 Orphan protein with the cartoon representation colour-coded according to secondary structure (A). These include α-helices (magenta), β-sheets (gold), and loops (green) (sections with poor certainty are in light green and white). The GH17 domain, transmembrane (TM) region and GT2 domain are indicated along with the position of a predicted lipid bilayer (grey ovals). The TM region includes seven transmembrane helices (TM1–7) which were also identified by Proteus2 [79] and Protter [80] (B). The signal peptide (SP), identified by HMMSCAN [75], Proteus2 and Protter, is shown aligned with the other TM helices. The model was produced by AlphaFold [82, 83] and the PDB file is available (see S2 File). The model was visualised with Mol* 3D Viewer [92] using cartoon and membrane orientation representations and colouring residues according to secondary structure.


S4 Fig. The elongating cellulose chain seen in the RsBcsAB crystal structure can be superposed on the AlphaFold predicted structure of the Pf SBW25 Orphan protein.

Shown here are views of the AlphaFold model of the Pseudomonas fluorescens SBW25 Orphan protein superimposed with the cellulose polymer as visualised in the homologous Rhodobacter sphaeroides 2.4.1 BcsAB (RsBcsAB) X-ray crystal structure [32, 33]. The Orphan protein (A) is shown as a cartoon representation colour-coded according to secondary structure with α-helices (magenta), β-sheets (gold), and loops (green) (sections with poor certainty are in light green and white). The GH17 domain, transmembrane (TM) region and GT2 domain are indicated along with the superimposed position of a short cellulose chain (linked purple beads) with the reducing end projecting away from the base of the GH17 domain and the non-reducing (elongating) end buried at the base of the TM region. In the RsBcsAB crystal structure, the cellulose chain passes up through a transmembrane channel where it is then threaded into the RsBcsC porin in the outer membrane. A similar transmembrane channel appears to be present in the AlphaFold Orphan model, but the cellulose chain is likely to continue to project towards the GH17 Orphan domain rather than adopting an acute turn as seen in the RsBcsAB crystal structure. A second view of the Orphan protein is given looking down into the centre of the TIM-barrel like structure of the GH17 domain (B). Although it seems as if the cellulose chain could project up into the TIM-barrel, it is more likely that it will come into contact with the GH17 cleft and active site residues. The AlphaFold model (S2 File) [82, 83] was superposed with the RsBcsAB crystal structure which included the cellulose chain with Pairwise Structure Alignment [94]. The superposed model was then visualised with Mol* 3D Viewer [92] with only the cellulose chain and Orphan protein visible and using cartoon representation and colouring residues according to secondary structure.


S1 File. FASTA file of proteins investigated in this work.

Proteins are listed by species and strain and were included as Reference proteins or Orphan or BcsA homologues identified in PseudoCAP [60], unpublished genomes, by BLAST or InterPro IPR000490-PF13641. UniProt Accessions are provided plus protein function if known.


S2 File. Single-domain homology models and predicted structure models.

This is a list of single-domain homology models and predicted structure models (PDB files) generated in this work and available from DOI: XXX (to be added after acceptance).


S3 File. Multiple sequence alignment of 26 Orphan proteins from representative P. aeruginosa, P. fluorescens, P. putida and P. syringae strains.

Clustal Omega [70] was used to produce a multiple sequence alignment of 26 Orphan proteins identified in Pseudomonas aeruginosa AZPAE12140, BL14, PAK, PA01, PA14, LESB58, 19BR and 3573, P. fluorescens ICMP 11288, ICMP 3512, KF1, LMG 5329, SBW25, SS101, WH6 and WS 5037, P. putida KT2440, S610, W619 and YKD221, and P. syringae B728a, DC3000, ICMP 9617, NCPPB 4273, UMAF0158 and 41a strains, using the PaPA14 Orphan (NdvB) as the reference sequence (see S1 File for protein sequences). Orphan proteins show 95.3–100.0% coverage and 46.9–99.2% sequence identity normalised by aligned length. The Mview [70] file was copied and annotated to show the signal peptide sequence predicted by Proteus2 [79], the first (GH17) and second (GT2) Orphan domains, and conserved domains, motifs, and residues found in homologous fungal proteins and BcsA proteins.
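To make the two summary statistics concrete, the following Python sketch computes percent identity and coverage for one pair of rows taken from a gapped alignment. It is illustrative only: the function name is hypothetical, and the definitions used here (identity counted over columns where both sequences have a residue, coverage as the share of reference residues aligned to a residue) are assumptions of this sketch that may differ in detail from the normalisation applied by MView.

```python
def identity_and_coverage(ref_aln: str, other_aln: str, gap: str = "-"):
    """Percent identity and coverage for two rows of a gapped alignment.

    Identity is matches divided by columns where both rows have a residue;
    coverage is those columns divided by the ungapped reference length.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned rows must have the same length")
    ref_len = sum(1 for c in ref_aln if c != gap)
    aligned = 0
    matches = 0
    for r, o in zip(ref_aln.upper(), other_aln.upper()):
        if r == gap or o == gap:
            continue  # skip columns with a gap in either row
        aligned += 1
        if r == o:
            matches += 1
    identity = 100.0 * matches / aligned if aligned else 0.0
    coverage = 100.0 * aligned / ref_len if ref_len else 0.0
    return identity, coverage
```

Normalising by aligned length means that terminal extensions or missing signal peptides lower the coverage value rather than the identity value, so the two numbers should be read together.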


S1 Table. BcsA and Orphan proteins from Pseudomonas fluorescens SBW25, P. putida KT2440 and P. syringae DC3000.

This lists locus tags and UniProtKB accessions, PseudoCAP annotations, number of residues, genome coordinates, and amino acid identity and similarity to the EcBcsA reference protein.


S2 Table. List of Pseudomonas species containing Orphan proteins.

This lists Pseudomonas species and strains containing Orphan genes identified in our sampling of PseudoCAP entries, by BLAST searches, and inspection of unpublished draft genome sequences. See S1 File for all protein sequences.


S3 Table. Comparison of secondary structures in Orphan homology models and predicted structures.

This lists reference transmembrane helices, α-helices and β-sheets identified in the Rhizomucor miehei CAU432 and Rhodobacter sphaeroides 2.4.1 BcsA X-ray crystal structures and seen in the single-domain models and predicted structures of the Pseudomonas fluorescens SBW25 Orphan protein. The start and stop residues for each structure are recorded, showing the variation from those shown in the Phyre2 homology models for the GH17 domain and the TM region and GT2 domain.


S4 Table. Interactions between GH17 domain and TM region in Orphan predicted structures.

This lists the non-covalent bonds identified in the AlphaFold, InterFOLD6, RoseTTAFold and TrRosetta predicted structures of the Pseudomonas fluorescens SBW25 Orphan protein connecting residues located in the GH17 domain, linking (unstructured) sequence, and the exposed surface of the TM region.


S5 Table. Homologs of the Escherichia coli Phosphoglycerol transferase MdoB.

This lists MdoB homologs identified in Pseudomonas aeruginosa PA01, P. fluorescens SBW25, P. putida KT2440, and P. syringae DC3000.



JMcG, AN and KS acknowledge the support of the Nuffield Research Placements, who organised their projects with AJS and RJ.


  1. 1. Xu J. Microbial ecology in the age of genomics and metagenomics: concepts, tools, and recent advances. Mol Ecol. 2006; 15: 1713–31. pmid:16689892
  2. 2. Pérez-Cobas AE, Gomez-Valero L, Buchrieser C. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. Microb Genom. 2020; 6. pmid:32706331
  3. 3. Gerlt JA, Babbitt PC. Can sequence determine function? Genom Biol. 2000; 1: reviews0005.1–0005.10. pmid:11178260
  4. 4. Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol. 2009; 5: e1000605. pmid:20011109
  5. 5. Uchiyama I, Mihara M, Nishide H, Chiba H. MBGD update 2015: microbial genome database for flexible ortholog analysis utilizing a diverse set of genomic data. Nucleic Acids Res. 2015; 43: D270–6 (Database issue). pmid:25398900
  6. 6. Cozzetto D, Jones DT. Computational methods for annotation transfers from sequence. Methods Mol Biol. 2017; 1446: 55–67. pmid:27812935
  7. 7. Hong J, Luo Y, Zhang Y, Ying J, Xue W, Xie T, et al. Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning. Brief Bioinform. 2020; 21: 1437–47. pmid:31504150
  8. 8. Sasson O, Kaplan N, Linial M. Functional annotation prediction: All for one and one for all. Protein Sci. 2006; 15: 1557–62. pmid:16672244
  9. 9. Whisstock JC, Lesk AM. Prediction of protein function from protein sequence and structure. Q Rev Biophys. 2003; 36: 307–40. pmid:15029827
  10. 10. Jiang Y, Oron TR, Clark WT, Bankapur AR, D’Andrea D, Lepore R, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 2016; 17: 184.
  11. 11. Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 2019; 20: 244.
  12. 12. Kryshtafovych A, Schwede T, Topf M, Fidelis K, Moult J. Critical assessment of methods of protein structure prediction (CASP)–Round XIV. Proteins 2021; 89: 1607–17. pmid:34533838
  13. 13. Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986; 5: 823–6. pmid:3709526
  14. 14. Chung SY, Subbiah S. A structural explanation for the twilight zone of protein sequence homology. Structure 1996; 4: 1123–7. pmid:8939745
  15. 15. Rost B. Twilight zone of protein sequence alignments. Protein Eng 1999; 12: 85–94. pmid:10195279
  16. 16. Khor BY, Tye GJ, Lim TS, Choong YS. General overview on structure prediction of twilight-zone proteins. Theor Biol Med Model. 2015; 12: 15. pmid:26338054
  17. 17. Coutinho PM, Deleury E, Davies GJ, Henrissat B. An evolving hierarchical family classification for glycosyltransferases. J Mol Biol. 2003; 328: 307–17. pmid:12691742
  18. 18. Breton C, Šnajdrová L, Jeanneau C, Koca J, Imberty A. Structures and mechanisms of glycosyltransferases. Glycobiol. 2006; 16: 29R–37R. pmid:16037492
  19. 19. Drula E, Garron M-L, Dogan S, Lombard V, Henrissat B, Terrapon N. The carbohydrate-active enzyme database: functions and literature. Nucleic Acids Res. 2022; 50: D571–7 (Database issue). pmid:34850161
  20. 20. Sun PD, Foster CE, Boyington JC. Overview of protein structural and functional folds. Curr Protoc Protein Sci. 2004; 35: 17.1.1–17.1.189. pmid:18429251
  21. 21. Liu J, Mushegian A. Three monophyletic superfamilies account for the majority of the known glycosyltransferases. Protein Sci. 2003; 12: 1418–31. pmid:12824488
  22. 22. Campbell JA, Davies GJ, Bulone V, Henrissat B. A classification of nucleotide-diphospho-sugar glycosyltransferases based on amino acid sequence similarities. Biochem J. 1997; 326: 929–42. pmid:9334165
  23. 23. Taujale R, Venkat A, Huang L-C, Zhou Z, Yeung W, Rasheed KM, et al. Deep evolutionary analysis reveals the design principles of fold A glycosyltransferases. eLife 2020; 9: e54532. pmid:32234211
  24. 24. Eisen JA, Wu M. Phylogenetic analysis and gene functional predictions: Phylogenomics in action. Theor Pop Biol. 2002; 61: 481–7. pmid:12167367
  25. 25. McNamara JT, Morgan JLW, Zimmer J. A molecular description of cellulose biosynthesis. Annu Rev Biochem. 2015; 84: 895–921. pmid:26034894
  26. 26. Weigel PH. Hyaluronan synthase: the mechanism of initiation at the reducing end and a pendulum model for polysaccharide translocation to the cell exterior. Int J Cell Biol. 2015; Article 367579.
  27. 27. Dorfmueller HC, Ferenbach AT, Borodkin VS, van Aalten DMF. A structural and biochemical model of processive chitin synthesis. J Biol Chem. 2014; 289: 23020–28. pmid:24942743
  28. 28. Oehme DP, Shafee T, Downton MT, Bacic A, Doblin MS. Differences in protein structural regions that impact functional specificity in GT2 family β-glucan synthases. PLoS ONE 2019; 14: e0224442.
  29. 29. Agarwal G K, Prasad SB, Bhaduri A, Jayaraman G. Biosynthesis of Hyaluronic acid polymer: Dissecting the role of sub structural elements of hyaluronan synthase. Sci Reports 2019; 9: 1251.
  30. 30. Abidi W, Torres-Sánchez L, Siroy A, Krasteva PV. Weaving of bacterial cellulose by the Bcs secretion systems. FEMS Microbiol Rev. 2021; fuab051.
  31. 31. Tajima K, Imai T, Yui T, Yao M, Saxena I. Cellulose-synthesizing machinery in bacteria. Cellulose 2022; 29: 2755–77.
  32. 32. Morgan JLW, Strumillo J, Zimmer J. Crystallographic snapshot of cellulose synthesis and membrane translocation. Nature 2013; 493: 181–6. pmid:23222542
  33. 33. Morgan JLW, McNamara JT, Zimmer J. Mechanism of activation of bacterial cellulose synthase by cyclic-di-GMP. Nat Struct Mol Biol. 2014; 21: 489–96. pmid:24704788
  34. 34. Poulin MB, Kuperman LL. Regulation of biofilm exopolysaccharide production by cyclic di-guanosine monophosphate. Frontiers Microbiology 2021; 12: 730980. pmid:34566936
  35. 35. Torrisi M, Pollastri G, Le Q. Deep learning methods in protein structure prediction. Comput Struct Biotechnol J. 2020; 18: 1301–10. pmid:32612753
  36. 36. Kuhlman B, Bradley P. Advances in protein structure prediction and design. Nat Rev Mol Cell Biology 2019; 20: 681–97. pmid:31417196
  37. 37. Pearce R, Zhang Y. Toward the solution of the protein structure prediction problem. J Biol Chem 2021; 297: 100870. pmid:34119522
  38. 38. Nelson KE, Weinel C, Paulsen IT, Dodson RJ, Hilbert H, Martins dos Santos VAP, et al. Complete genome sequence and comparative analysis of the metabolically versatile Pseudomonas putida KT2440. Environ Microbiol. 2002; 4: 799–808.
  39. 39. Buell CR, Joardar V, Lindberg M, Selengut J, Paulsen IT, Gwinn ML, et al. The complete genome sequence of the Arabidopsis and tomato pathogen Pseudomonas syringae pv. tomato DC3000. Proc Natl Acad Sci. (USA) 2003; 100: 10181–6.
  40. 40. Silby MW, Cerdeño-Tárraga AM, Vernikos GS, Giddens SR, Jackson RW, Preston GM, et al. Genomic and genetic analyses of diversity and plant interactions of Pseudomonas fluorescens. Genome Biol. 2009; 10: R51.
  41. 41. Spiers AJ, Kahn SG, Bohannon J, Travisano M, Rainey PB. Adaptive divergence in experimental populations of Pseudomonas fluorescens. I. Genetic and phenotypic bases of Wrinkly Spreader fitness. Genetics 2002; 161: 33–46.
  42. 42. Spiers AJ, Bohannon J, Gehrig SM, Rainey PB. Biofilm formation at the air–liquid interface by the Pseudomonas fluorescens SBW25 wrinkly spreader requires an acetylated form of cellulose. Mol Microbiol. 2003; 50: 15–27.
  43. 43. Gjermansen M, Nilsson M, Yang L, Tolker-Nielsen T. Characterization of starvation-induced dispersion in Pseudomonas putida biofilms: genetic elements and molecular mechanisms. Mol Microbiol. 2010; 75: 815–26.
  44. 44. Pérez-Mendoza D, Aragón IM, Prada-Ramírez HA, Romero-Jiménez L, Ramos C, Gallegos MT, et al. Responses to elevated c-di-GMP levels in mutualistic and pathogenic plant-interacting bacteria. PLoS One 2014; 9: e91645. pmid:24626229
  45. 45. Nielsen L, Li X, Halverson LJ. Cell–cell and cell-surface interactions mediated by cellulose and a novel exopolysaccharide contribute to Pseudomonas putida biofilm formation and fitness under water limiting conditions. Environ Microbiol. 2011; 13: 1342–56.
  46. 46. Prada-Ramírez HA, Pérez-Mendoza D, Felipe A, Martínez-Granero F, Rivilla R, Sanjuán J, et al. AmrZ regulates cellulose production in Pseudomonas syringae pv. tomato DC3000. Mol Microbiol. 2016; 99: 960–77.
  47. 47. Hochstenbach F, Klis FM, Van den Ende H, Van Donselaar E, Peters PJ, Klausner RD. Identification of a putative alpha-glucan synthase essential for cell wall construction and morphogenesis in fission yeast. Proc Natl Acad Sci. (USA) 1998; 95: 9161–6.
  48. 48. Hreggvidsson GO, Dobruchowska JM, Fridjonsson OH, Jonsson JO, Gerwig GJ, Aevarsson A, et al. Exploring novel non-Leloir β-glucosyltransferases from proteobacteria for modifying linear (β1→3)-linked gluco-oligosaccharide chains. Glycobiol. 2011; 21: 304–28.
  49. 49. Linares-Pastén JA, Jonsdottir LB, Hreggvidsson GO, Fridjonsson OH, Watzlawick H, Karlsson EN. Modeled 3D-structures of proteobacterial transglycosylases from glycoside hydrolase family 17 give insight in ligand interactions explaining differences in transglycosylation products. Appl Sci. 2021; 11: 4048.
  50. 50. Mah T-F, Pitts B, Pellock B, Walker GC, Stewart PS, O’Toole GA. A genetic basis for Pseudomonas aeruginosa biofilm antibiotic resistance. Nature 2003; 426: 306–10.
  51. 51. Sadovskaya I, Vinogradov E, Li J, Hachani A, Kowalska K, Filloux A. High-level antibiotic resistance in Pseudomonas aeruginosa biofilm: the ndvB gene is involved in the production of a highly glycerol-phosphorylated β-(1→3)-glucans, which bind aminoglycosides. Glycobiol. 2010; 20: 895–904.
  52. 52. Breedveld MW, Miller KJ. Cyclic β-glucans of members of the family Rhizobiaceae. Microbiol Rev. 1994; 58: 145–61.
  53. 53. Bohin J-P. Osmoregulated periplasmic glucans in Proteobacteria. FEMS Microbiol Letters 2000; 186: 11–9. pmid:10779706
  54. 54. Komaniecka I, Choma A. Isolation and characterization of periplasmic cyclic β-glucans of Azorhizobium caulinodans. FEMS Microbiol Letters 2003; 227: 263–9.
  55. 55. Rigano LA, Payette C, Brouillard G, Marano MR, Abramowicz L, Torres PS, et al. Bacterial cyclic β-(1,2)-glucan acts in systemic suppression of plant immune responses. Plant Cell 2007; 19: 2077–89.
  56. 56. Gay-Fraret J, Ardissone S, Kambara K, Broughton WJ, Deakin WJ, Le Quéré. Cyclic-β-glucans of Rhizobium (Sinorhizobium) sp. strain NGR234 are required for hypo-osmotic adaptation, motility, and efficient symbiosis with host plants. FEMS Microbiol Letters 2012; 333: 28–36.
  57. 57. Martirosyan A, Pérez-Gutierrez C, Banchereau R, Dutarte H, Lecine P, Dullars M, et al. Brucella β 1,2 cyclic glucan is an activator of human and mouse dendritic cells. PLoS Pathog 2012; 8: e1002983.
  58. 58. Guidolin LS, Arce-Gorvel V, Ciocchini AE, Comerci DJ, Gorvel J-P. Cyclic β-glucans at the bacteria-host cells interphase: one sugar ring to rule them all. Cell Microbiol. 2018; 20: e12850.
  59. 59. Javvadi S, Pandey SS, Mishra A, Pradhan BB, Chatterjee S. Bacterial cyclic β-(1,2)-glucans sequester iron to protect against iron-induced toxicity. EMBO Reports 2018; 19: 172–86.
  60. 60. Winsor GL, Griffiths EJ, Lo R, Dhillon BK, Shay JA, Brinkman FS. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database. Nucleic Acids Res. 2016; 44: D646–53 (Database issue).
  61. 61. Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 2012; 28: 464–9. pmid:22199388
  62. 62. Price MN, Huang KH, Alm EJ, Arkin AP. A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 2005; 33: D880–92 (Database issue). pmid:15701760
  63. 63. Dehal PS, Joachimiak MP, Price MN, Bates JT, Baumohl JK, Chivian D, et al. MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. 2010; 38: D396–400 (Database issue). pmid:19906701
  64. 64. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021; 49: D480–9 (Database issue). pmid:33237286
  65. 65. Yates D, Allen J, Amode RM, Azov AG, Barba M, Becerra A, et al. Ensembl Genomes 2022: an expanding genome resource for non-vertebrates. Nucleic Acids Res. 2022; 50: D996–1003 (Database issue) pmid:34791415
  66. 66. Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, Comeau DC, et al. NCBI Nucleotide Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022; 50: D20–6 (Database issue)
  67. 67. Kabir K, Deeni YY, Hapca SM, Moore L, Spiers AJ. Uncovering behavioural diversity amongst high-strength Pseudomonas spp. surfactants at the limit of liquid surface tension reduction. FEMS Microbiol Letters 2018; 365: fny008.
  68. 68. Godfrey SAC, Harrow SA, Marshall JW, Klena JD. Characterization by 16S rRNA sequence analysis of Pseudomonads causing blotch disease of cultivated Agaricus bisporus. Appl Environ Microbiol. 2001; 67: 4316–23.
  69. 69. Bouvier J, Richaud C, Higgins W, Bögler O, Stragier P. Cloning, characterization, and expression of the dapE gene of Escherichia coli. J Bacteriology 1992; 174: 5265–71.
  70. 70. Madeira F, Park MY, Lee J, Buso N, Gur T, Madhusoodanan N, et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res. 2019; 47: W636–41 (Web Server issue). pmid:30976793
  71. 71. Blum M, Chang H-Y, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 2021; 49: D344–54 (Database issue). pmid:33156333
  72. 72. Taylor WR. The classification of amino acid conservation. J Theor Biol. 1986; 119: 205–18. pmid:3461222
  73. 73. Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics 2007; 23: 1875–1882. pmid:17519246
  74. 74. Stothard P. The Sequence Manipulation Suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques 2000; 28: 1102–4. pmid:10868275
  75. 75. Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011; 7: e1002195. pmid:22039361
  76. 76. Rahman O, Cummings SP, Harrington DJ, Sutcliffe IC. Methods for the bioinformatic identification of bacterial lipoproteins encoded in the genomes of Gram-positive bacteria. World J Microbiol 2008; 24: 2377–82.
  77. 77. Käll L, Krogh A, Sonnhammer ELL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol 2004; 338: 1027–36. pmid:15111065
  78. 78. Bagos PG, Nikolaou EP, Liakopoulos TD, Tsirigos KD. Combined prediction of Tat and Sec signal peptides with Hidden Markov Models. Bioinformatics 2010; 26: 2811–7. pmid:20847219
  79. 79. Montgomerie S, Cruz JA, Shrivastava S, Arndt D, Berjanskii M, Wishart DS. PROTEUS2: a web server for comprehensive protein structure prediction and structure-based annotation. Nucleic Acids Res. 2008; 36: W202–9 (Web Server issue). pmid:18483082
  80. 80. Omasits U, Ahrens CH, Müller S, Wollscheid B. Protter: interactive protein feature visualization and integration with experimental proteomic data. Bioinformatics 2014; 30: 884–6. pmid:24162465
  81. 81. Almagro Armenteros JJ, Tsirigos KD, Sønderby CK, Petersen TN, Winther O, Brunak S, et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat Biotechnol 2019; 37: 420–23. pmid:30778233
  82. 82. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021; 596: 583–9. pmid:34265844
  83. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022; 50: D439–44 (Database issue). pmid:34791371
  84. McGuffin LJ, Adiyaman R, Maghrabi AHA, Shuid AN, Brackenridge DA, Nealon JO, et al. IntFOLD: an integrated web resource for high performance protein structure and function prediction. Nucleic Acids Res. 2019; 47: W408–13 (Web Server issue). pmid:31045208
  85. Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJE. The Phyre2 web portal for protein modeling, prediction and analysis. Nature Protocols 2015; 10: 845–58. pmid:25950237
  86. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, et al. Accurate prediction of protein structures and interactions using a 3-track neural network. Science 2021; 373: 871–6.
  87. Bienert S, Waterhouse A, de Beer TAP, Tauriello G, Studer G, Bordoli L, et al. The SWISS-MODEL Repository–new features and functionality. Nucleic Acids Res. 2017; 45: D313–9 (Database issue). pmid:27899672
  88. Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 2018; 46: W296–303 (Web Server issue). pmid:29788355
  89. Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci. (USA) 2020; 117: 1496–1503. pmid:31896580
  90. Du Z, Su H, Wang W, Ye L, Wei H, Peng Z, et al. The trRosetta server for fast and accurate protein structure prediction. Nature Protocols 2021; 16: 5634–51. pmid:34759384
  91. Su H, Wang W, Du Z, Peng Z, Gao S-H, Cheng M-M, et al. Improved protein structure prediction using a new multi-scale network and homologous templates. Advanced Science 2021; 8: 2102592. pmid:34719864
  92. Sehnal D, Bittrich S, Deshpande M, Svobodová R, Berka K, Bazgier V, et al. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res. 2021; 49: W431–7 (Web Server issue). pmid:33956157
  93. Postic G, Ghouzam Y, Guiraud V, Gelly JC. Membrane positioning for high- and low-resolution protein structures through a binary classification approach. Protein Eng Des Sel 2016; 29: 87–92. pmid:26685702
  94. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000; 28: 235–42.
  95. Ye Y, Godzik A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 2003; 19 Suppl. 2: ii246–55. pmid:14534198
  96. Li Z, Jaroszewski L, Iyer M, Sedova M, Godzik A. FATCAT 2.0: towards a better understanding of the structural diversity of proteins. Nucleic Acids Res. 2020; 48: W60–64 (Web Server issue). pmid:32469061
  97. Ude S, Arnold DL, Moon CD, Timms-Wilson T, Spiers AJ. Biofilm formation and cellulose expression among diverse environmental Pseudomonas isolates. Environ Microbiol 2006; 8: 1997–2011.
  98. Römling U, Galperin MY. Bacterial cellulose biosynthesis: diversity of operons, subunits, products, and functions. Trends Microbiol. 2015; 23: 545–57. pmid:26077867
  99. Zogaj X, Nimtz M, Rohde M, Bokranz W, Römling U. The multicellular morphotypes of Salmonella typhimurium and Escherichia coli produce cellulose as the second component of the extracellular matrix. Molec Microbiol 2001; 39: 1452–63.
  100. Newberry EA, Ebrahim M, Timilsina S, Zlatkovic N, Obradovic A, Bull CT, et al. Inference of convergent gene acquisition among Pseudomonas syringae strains isolated from watermelon, cantaloupe, and squash. Front Microbiol 2019; 10: 270. pmid:30837979
  101. Lawrence JG, Ochman H. Amelioration of bacterial genomes: Rates of change and exchange. J Mol Evol 1997; 44: 383–97. pmid:9089078
  102. Ielpi L, Dylan T, Ditta GS, Helinski DR, Stanfield SW. The ndvB locus of Rhizobium meliloti encodes a 319-kDa protein involved in the production of β-(1–2)-glucan. J Biological Chem 1990; 265: 2843–51.
  103. Chen R, Bhagwat AA, Yaklich R, Keister DL. Characterization of ndvD, the third gene involved in the synthesis of cyclic β-(1→3), (1→6)-D-glucans in Bradyrhizobium japonicum. Can J Microbiol. 2002; 48: 1008–16.
  104. Qin Z, Yan Q, Lei J, Yang S, Jiang Z, Wu S. The first crystal structure of a glycoside hydrolase family 17 β-1,3-glucanosyltransferase displays a unique catalytic cleft. Acta Cryst. 2015; D71: 1714–24.
  105. Zhang Y. Protein structure prediction: when is it useful? Curr Opin Struct Biol. 2009; 19: 145–55. pmid:19327982
  106. Qin Z, Yan Q, Yang S, Jiang Z. Modulating the function of a β-1,3-glucanosyltransferase to that of an endo-β-1,3-glucanase by structure-based protein engineering. Appl Microbiol Biotechnol. 2016; 100: 1765–76.
  107. Omadjela O, Narahari A, Strumillo J, Mélida H, Mazur O, Bulone V, et al. BcsA and BcsB form the catalytically active core of bacterial cellulose synthase sufficient for in vitro cellulose synthesis. Proc Natl Acad Sci. (USA) 2013; 110: 17856–61. pmid:24127606
  108. Morgan JLW, McNamara JT, Fischer M, Rich J, Chen H-M, Withers SG, et al. Observing cellulose biosynthesis and membrane translocation in crystallo. Nature 2016; 531: 329–35.
  109. Acheson JF, Derewenda ZS, Zimmer J. Architecture of the cellulose synthase outer membrane channel and its association with the periplasmic TPR domain. Structure 2019; 27: 1855–61. pmid:31604608
  110. Manoil C, Mekalanos JJ, Beckwith J. Alkaline phosphatase fusions: Sensors of subcellular location. J Bacteriol. 1990; 172: 515–8. pmid:2404939
  111. Jobling SA. Membrane pore architecture of the CslF6 protein controls (1–3,1–4)-β-glucan structure. Sci Adv. 2015; 1: e1500069.
  112. Salgado L, Blank S, Esfahani RAM, Strap JL, Bonetta D. Missense mutations in a transmembrane domain of the Komagataeibacter xylinus BcsA lead to changes in cellulose synthesis. BMC Microbiol. 2019; 19: 216.
  113. Talaga P, Fournet B, Bohin J-P. Periplasmic glucans of Pseudomonas syringae pv. syringae. J Bact. 1994; 176: 6538–44. pmid:7961404
  114. Jackson BJ, Bohin J-P, Kennedy EP. Biosynthesis of membrane-derived oligosaccharides: Characterization of mdoB mutants defective in Phosphoglycerol transferase I activity. J Bact. 1984; 160: 976–81. pmid:6094515