Properties and Phylogeny of 76 Families of Bacterial and Eukaryotic Organellar Outer Membrane Pore-Forming Proteins

We here report statistical analyses of 76 families of integral outer membrane pore-forming proteins (OMPPs) found in bacteria and eukaryotic organelles. Forty-seven of these families fall into one superfamily (SFI), which segregates into fifteen phylogenetic clusters. Families with members of the same protein size, topology and substrate specificities often cluster together. Virtually all OMPP families include only proteins that form transmembrane pores. Nine such families, all of which cluster together in the SFI phylogenetic tree, contain both α- and β-structures, are multidomain, multisubunit systems, and transport macromolecules. Most other SFI OMPPs transport small molecules. SFII and SFV homologues derive from Actinobacteria while SFIII and SFIV proteins derive from chloroplasts. Three families of actinobacterial OMPPs and two families of eukaryotic OMPPs apparently consist primarily of α-helices (α-TMSs). Of the 71 families of (putative) β-barrel OMPPs, only twenty could not be assigned to a superfamily, and these derived primarily from Actinobacteria (1), chloroplasts (1), spirochaetes (8), and proteobacteria (10). Proteins were identified in which two or three full length OMPPs are fused together. Family characteristics are described, and the evidence agrees with a previous proposal suggesting that many arose by adjacent β-hairpin structural unit duplications.


Introduction
Most bacteria, including all Gram-negative bacteria and some Gram-positive Firmicutes and Actinobacteria, as well as mitochondria and chloroplasts of eukaryotes, have envelopes consisting of two membranes, an inner cytoplasmic or matrix membrane and an outer membrane with special protective functions [1]. In Gram-negative bacteria and eukaryotic organelles, most integral outer membrane pore-forming proteins (OMPPs) contrast with integral inner membrane proteins with respect to their structural features. While integral inner membrane proteins generally possess transmembrane α-helical segments (α-TMSs), integral outer membrane proteins (OMPs) usually consist of transmembrane β-strands (β-TMSs) that form β-barrels [2]. Large proportions of these β-barrel proteins are OMPPs that non-selectively allow passage of molecules across the outer permeability barrier. These proteins also serve as cell surface antigens that provide targets for vaccine development [3,4]. However, many other outer membrane pore-forming proteins exhibit substrate selectivity, and we here designate porins and all other outer membrane pore-forming proteins collectively as OMPPs [5].
Bioinformatic analyses and evolutionary considerations have led to the conclusion that many proteins have arisen from ancient peptide modules coded for by genes that underwent repeated intragenic multiplication (duplication, triplication, quadruplication, etc.) to generate larger proteins [6-8]. Replication slippage provides one mechanism for the generation of multiple repeats, and stable protein complexes have apparently evolved more frequently from identical units than from dissimilar ones [9]. In fact, some of the most common folds found in proteins include structural repeats [8]. It has been argued that these repeat sequences arose by divergent rather than convergent evolutionary processes, a conclusion that, in many cases, has been extensively documented [6,7].
Our laboratory has identified over 1000 families of transport proteins and over 60 superfamilies of integral membrane transport proteins (see superfamily hyperlink in TCDB) [6,7,18,20]. These superfamilies include six types of α-helical channel proteins, four pore-forming toxin types, seven holin types, one viral envelope type, five toxic channel-forming peptide types, eleven secondary carrier types, six primary active transporter types, and two group translocator types. In any one of these superfamilies, the protein members always exhibit the same internal repeat units. However, some superfamilies have diversified to include channels, carriers and primary carriers [26,30].
In 2010, Remmert et al. [31] proposed that most outer membrane β-barrel proteins have a common origin, being derived from a single ancestral ββ-hairpin structure. This suggestion was based on three types of experimental evidence. First, the authors used transitive profile searching (a search with BLAST is performed, and all significantly matched sequences are used in new searches) [32]. Second, they identified repeat signature sequences in some OMPPs in which the repeated sequence units coincided with the proposed ββ-hairpin repeats. Third, they provided evidence that similarity between some of the outer membrane β-barrel hairpins could not be explained by structural or membrane constraints on their sequences. This last consideration addressed the issue of convergent versus divergent evolution as the process responsible for the sequence similarity observed. They rejected the notion of convergent pathways in favor of divergent pathways, suggesting that the proteins arose by amplification and recombination of ββ-hairpin modules that might previously have evolved as RNA co-factors [31]. The Protein Family (Pfam) Database also provides evidence that many β-barrel OMPP families are related by common ancestry [33,34].
As noted above, our laboratory has focused primarily on superfamilies of α-TMS transport proteins with little emphasis on outer membrane pore-forming proteins (OMPPs) [6,7,10-12,21,35]. In order to define superfamily relationships, we have developed statistical means to evaluate the probability of homology, i.e., common origin [36]. Our standard methods involve the use of computer programs that use the Superfamily Principle to determine the significance of sequence similarities [34]. The Superfamily Principle states that if protein A is homologous to protein B, and protein B is homologous to protein C, then protein A must be homologous to protein C, regardless of the degree of sequence similarity observed between these two proteins. It should be noted that homology is an absolute term meaning "derived from a common ancestor" and does not imply a specific degree of sequence similarity. Thus, two proteins or protein domains are homologous if they share common descent.
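The transitivity expressed by the Superfamily Principle can be illustrated with a short sketch. It is not taken from the programs cited above; the function name `superfamilies` and the protein labels are hypothetical, and the input is assumed to be a list of protein pairs for which homology has already been established by some statistical test. Proteins connected through any chain of such pairs are placed in the same group.

```python
from collections import defaultdict


def superfamilies(homologous_pairs):
    """Group proteins connected by chains of pairwise homology (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in homologous_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical input: A~B and B~C were shown directly; A~C was never tested.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(superfamilies(pairs))  # [['A', 'B', 'C'], ['D', 'E']]
```

In this toy case A and C end up in the same group although they were never compared directly, which is the situation the principle is meant to cover.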
We have decided to use the methods developed in our laboratory [36] to independently examine the possibility of common origin of recognized OMPP families in TCDB using rigorous quantitative statistical approaches. We use these methods to establish the relationships of the various families to each other and describe their families' characteristics, based on our bioinformatic analyses as well as the published literature [37]. To our surprise, and in contrast to previous analyses with α-helical type transport proteins [6,7,12], we observed a remarkable degree of sequence similarity among many of the 76 currently recognized families of OMPPs and putative OMPPs included in TCDB as of 5/2015 (see Table 1).
Specifically, we could provide evidence that 47 of the 68 (putative) β-barrel OMPP families belong to a single superfamily, hereafter referred to as Superfamily I (SFI). Using the superfamily tree (SFT) program [12,21,35], we have drawn the first phylogenetic tree for the superfamily, revealing which of these families are likely to be most closely related. The results support the suggestion [31] that many families of OMPPs derive from a single common ββ-hairpin structure. We also confirm relationships suggested from family assignments in Pfam (see Table 1). However, indirect evidence is presented suggesting that some OMPPs do not derive from the same source. We also provide evidence for the existence of four small OMPP superfamilies, two in eukaryotic organelles, and two (α-TMS and β-TMS structural OMPPs, respectively) in Actinobacteria. The results reported extend suggestions made previously and put OMPPs in a phylogenetic framework.

Family and superfamily identification and characterization
In order to estimate relative family sizes, OMPPs of the 76 families in TCDB were used as query sequences for BLAST searches of the non-redundant NCBI protein database (default settings), which were conducted without iterations [38]. From one to 5,000 homologous proteins were retrieved from the NCBI database for each of the families, and these numbers were recorded in Table 1 to indicate the relative sizes of the families. Redundant and incomplete sequences were eliminated, and remaining selected proteins were retained for topological and phylogenetic analyses.
The CLUSTAL X program [39] was used with default parameters for multiple alignment of homologous sequences, and the TreeView [40] and FigTree programs [41] were used for the construction of phylogenetic trees for members of individual families. Alternative methods of tree construction, dependent on tens of thousands of BLAST bit scores and obviating the need for construction of a multiple alignment, were provided by the SuperfamilyTree (SFT) programs, SFT1 and SFT2 [21,35,42]. Previous publications have shown that these two programs give excellent agreement with trees derived using ClustalX/TreeView when sequences are sufficiently similar to generate reliable multiple alignments [21,35,42]. However, the SFT programs are superior when proteins with more divergent sequences are analyzed [11,13,43]. The SFT1 program shows the relationships of all proteins included in an analysis, while the SFT2 program shows the subfamily or family relationships within a superfamily. Topological analyses of individual proteins were performed using the WHAT [44], HMMTOP [45] and Spoctopus [46] programs, which we have shown are among the most reliable programs for topological predictions [47]. Average hydropathy, amphipathicity and similarity plots were generated using the AveHAS program [48]. PRED-TMBB [49] (http://bioinformatics.biol.uoa.gr/PRED-TMBB/) was used to predict numbers and positions of transmembrane β-strands for β-barrel proteins. HHrepID [50] (http://toolkit.tuebingen.mpg.de/hhrepid#), a bioinformatics tool kit that uses HMM-HMM comparisons, was used to find structural repeats in protein sequences.
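To make the idea of an alignment-free, bit-score-based tree concrete, the following sketch converts pairwise bit scores into distances and clusters them by average linkage (UPGMA). It is an illustration only and is not the SFT algorithm, whose details are given in the cited publications; the normalization by the smaller self score, the function names, and the example scores are all assumptions made for this example.

```python
def scores_to_distances(bits):
    """Turn bit scores into distances in [0, 1].

    bits[(a, b)] is a pairwise bit score and bits[(a, a)] the self score.
    Normalizing by the smaller self score is an assumption of this sketch.
    """
    names = sorted({name for pair in bits for name in pair})
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = bits.get((a, b), bits.get((b, a), 0.0))
            norm = score / min(bits[(a, a)], bits[(b, b)])
            dist[(a, b)] = 1.0 - min(norm, 1.0)
    return names, dist


def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    trees = {n: n for n in names}
    sizes = {n: 1 for n in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(trees) > 1:
        keys = sorted(trees)
        a, b = min(
            ((x, y) for i, x in enumerate(keys) for y in keys[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        merged = f"({a},{b})"
        for other in keys:
            if other not in (a, b):
                # Size-weighted mean distance from the merged cluster to `other`.
                d[frozenset((merged, other))] = (
                    sizes[a] * d[frozenset((a, other))]
                    + sizes[b] * d[frozenset((b, other))]
                ) / (sizes[a] + sizes[b])
        trees[merged] = (trees.pop(a), trees.pop(b))
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(trees.values()))


# Hypothetical bit scores for three proteins.
bits = {
    ("P1", "P1"): 100.0, ("P2", "P2"): 100.0, ("P3", "P3"): 100.0,
    ("P1", "P2"): 80.0, ("P1", "P3"): 20.0, ("P2", "P3"): 25.0,
}
names, dist = scores_to_distances(bits)
print(upgma(names, dist))  # (('P1', 'P2'), 'P3')
```

Because only pairwise scores are needed, no multiple alignment is built at any step, which is the property the text attributes to the bit-score-based approach.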

Statistical approaches to homology establishment
Statistical sequence similarity comparisons between proteins, and between internal regions of these proteins, were conducted using the IC [36], GAP [45], Protocols 1 and 2 [51] and GSAT [36] programs. These programs randomly shuffle the sequences of the proteins or protein segments under scrutiny and compare these shuffled sequences with the native sequences. They thereby correct for abnormal protein compositions such as those that can occur in integral membrane proteins. Two thousand random shuffles and default settings have proven to be satisfactory for obtaining statistically significant values with both Protocol 2 and GSAT (see below). A comparison score of 12 standard deviations (SD) for comparable regions of two proteins of at least 60 amino acyl residues (aas) has been reported to correspond to a probability of 10⁻²⁷ that the observed degree of sequence similarity arose by chance [52]. Although the actual probability may be much higher due to Gaussian skewing, this value has been considered sufficient to strongly suggest homology, given the NCBI protein database size when these studies were conducted [7].
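The shuffle-based statistic described above can be sketched as follows. This is a minimal illustration and not the GSAT or Protocol2 implementation: it assumes a simple match/mismatch scoring scheme with a linear gap penalty instead of a substitution matrix with separate gap creation and extension penalties, and it shuffles only the second sequence. The function names and the example peptide are hypothetical.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def comparison_score_sd(a, b, shuffles=2000, seed=0):
    """Return (native score - mean shuffled score) / SD of shuffled scores."""
    rng = random.Random(seed)
    native = global_score(a, b)
    residues = list(b)
    shuffled_scores = []
    for _ in range(shuffles):
        rng.shuffle(residues)  # preserves composition, destroys residue order
        shuffled_scores.append(global_score(a, "".join(residues)))
    mean = statistics.fmean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (native - mean) / sd


# Hypothetical 30-residue peptide compared with itself (fewer shuffles for speed).
peptide = "MKTLLAVAGSALLAGCSSNDQKPEFRWYHI"
print(round(comparison_score_sd(peptide, peptide, shuffles=200), 1))
```

Because the shuffled sequences keep the amino acid composition of the native one, a biased composition raises the background scores as well as the native score, which is how this kind of test corrects for the unusual compositions of membrane proteins.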

Obtaining homologues and removing redundancies
Query sequences used to identify members of OMPP families were taken from families 1.B.1 to 1.B.76 in TCDB. NCBI PSI-BLAST searches were conducted with two iterations (e⁻⁴ and e⁻⁶ cutoff values, respectively). These searches were performed using Protocol1 [36] to identify members of each family. The Protocol1 program compiles homologous sequences from each BLAST search into a single file in FASTA format. It then eliminates redundancies and fragmentary sequences and generates a table of the resultant collection of sequences containing protein abbreviations, sequence descriptions, organismal sources, protein sizes, gi numbers, organismal groups or phyla, and organismal domains. Protocol1's CD-HIT option was used to remove redundancies and highly similar sequences [36,53]. An 85% identity cut-off was used to retrieve sequences that were subsequently used to establish homology between family members, and a 70% identity cut-off was used to create more easily viewed average hydropathy plots and phylogenetic trees. These percent identity values refer to the values above which all but one of the most similar sequences were removed. Thus, an 85% cutoff means that no two protein sequences retained for analysis were more than 85% identical. FASTA files from Protocol1 were considered representative of each respective protein family, although selected proteins that demonstrated apparent homology between families were sometimes confirmed with Pfam, NCBI's Conserved Domain Database (CDD) [54], and PSI-BLAST [55] results as outlined above (see also Discussion).
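The effect of an identity cutoff can be illustrated with a greedy filter in the spirit of CD-HIT. This sketch does not call CD-HIT or Protocol1; it assumes a crude identity measure built on Python's `difflib.SequenceMatcher` (matched residues divided by the shorter length), and the function names and toy sequences are hypothetical. Sequences are processed from longest to shortest, and a sequence is kept only if it is not more than the cutoff identical to any sequence already kept.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude identity proxy: matched residues divided by the shorter length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def remove_redundant(sequences, cutoff=0.85):
    """Keep one representative of every group more than `cutoff` identical."""
    kept = {}
    # Longest first, then by name, so the result is deterministic.
    ordered = sorted(sequences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        if all(identity(seq, rep) <= cutoff for rep in kept.values()):
            kept[name] = seq
    return kept


# Hypothetical toy sequences: p2 differs from p1 at one of 20 positions (95%).
toy = {
    "p1": "ACDEFGHIKLMNPQRSTVWY",
    "p2": "ACDEFGHIKWMNPQRSTVWY",
    "p3": "YWVTSRQPNMLKIHGFEDCA",
}
print(list(remove_redundant(toy, cutoff=0.85)))  # ['p1', 'p3']
```

With the 85% cutoff, p2 is dropped in favor of p1, so that no two retained sequences exceed the cutoff, matching the meaning of the cutoff given in the text.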

Multiple alignments and topological analyses
The ClustalX program was used to create multiple alignments of homologous proteins within individual families, and the few sequences that introduced large gaps into the alignment (usually a reflection of fragmentation, inclusion of introns or artifactual sequences) were removed. This allowed the generation of coherent multiple alignments where all or most sequences are homologous throughout most of their lengths. Results obtained with this program have been compared with 5 other programs, and when sequence similarity was sufficient to give reliable multiple alignments, phylogenetic trees obtained with the six programs (Neighbor Joining or Parsimony) were very similar [14]. The conserved domain database (CDD) (57) was also used to analyze protein sequence extensions that can result from the presence of extra protein domains as initially revealed using AveHAS plots [48].

Establishing homology between families
Initially, a large screen was performed, comparing distantly related OMPP family members against members of all OMPP families (TC subclass 1.B) [36]. The Targeted Smith-Waterman Search (TSSearch) feature of Protocol2 was then run in order to compare each family to all other OMPP superfamily members. TSSearch uses a rapid search algorithm to find distant homologues within the two different FASTA files that may not readily be apparent from BLAST or PSI-BLAST searches [36]. The most promising comparisons between proteins were automatically analyzed using the Global Sequence Alignment Tool (GSAT) feature of Protocol2 [36]. Comparison scores obtained using GSAT are reported in standard deviations (SD). Scores were calculated with the Needleman-Wunsch algorithm [56]. Promising results with comparison scores of 12.0 SD or greater were confirmed and analyzed further using the GSAT program set at default settings with a gap creation penalty of 8 and a gap extension penalty of 2 with 2,000 random shuffles. Two families within TC subclass 1.B were initially excluded from our studies. These were the Autotransporter-2 (AT-2) Family (1.B.40) and the Intimin/Invasin (Int/Inv) Family (AT-3, 1.B.54). Many of these proteins have huge passenger domains of >1,000 aas with relatively small transmembrane β-barrel domains. Since the passenger domains frequently include β-structure, their presence complicated the assignment of homology, warranting their initial exclusion from our homology studies with other OMPP families. However, in subsequent studies, the transmembrane domains of these families were examined for tentative relationships with other families.
Comparison scores were calculated using Mathematica (Wolfram Research, Inc., Champaign, IL, USA). Comparisons involved protein segments of at least 60 amino acyl residues (aas), the average size of a prototypical protein domain, and required a comparison score of at least 12.0 SD to provide evidence for homology [57]. Convergent sequence evolution is possible and has been demonstrated for short motifs, but not for large segments of proteins such as entire domains. GSAT alignments were sometimes performed on sequences by taking segments of at least 60 aas, maximizing the number of identities, minimizing gaps, and removing non-aligned sequences at the ends of the alignment, but never in central regions of an alignment. Thus, all segments analyzed are derived from contiguous portions of proteins.
The Ancient Rep (AR) and GSAT programs [36] were used to identify internal repeats, and the HHRep [58] and HHRepID [50] programs provided independent search approaches. The AR program compares potential transmembrane repeat sequences (e.g., transmembrane regions predicted by HMMTOP) within a single protein and between proteins in a FASTA file, giving a comparison score in SD in the same format as Protocol2. The HHRep programs show graphical representations of similarities with repeat sequences revealed as lines parallel to the diagonal line representing the protein sequence itself. Results could often be confirmed using the MEME program [59] for conserved motif identification.

OMPP families in TCDB
TCDB included 76 families of OMPPs in TC subclass 1.B at the time these studies were updated (5/2014), 62 of them being transmembrane β-barrel structures with varying numbers of transmembrane β-strands (β-TMSs), nine containing both α- and β-structure, and five consisting only of transmembrane α-structure (see Table 1). The large majority of these families (64) include members from Gram-negative bacteria, but six families are primarily from Actinobacteria, and six are primarily from eukaryotes. Table 1 summarizes characteristics of these 76 families as well as three additional OMPP families added more recently, while Table 2 summarizes the dominant phyla from which the members of these families derive, and Table 3 summarizes characteristics of the five superfamilies identified (see below). Column 1 in Table 1 presents the family TC numbers while column 2 presents the family names and their abbreviations. Column 3 lists the dominant organismal phyla from which these proteins are known to derive. Column 4 provides the average protein sizes ± SD, expressed in numbers of amino acyl residues (aas) for family members included in TCDB as of 5/2014. Column 5 gives the relative family sizes, estimated by the number of proteins retrieved in a single PSI-BLAST search of the NCBI NR protein database without iterations when the first member of each family (1.B.X.1.1) was used as the query sequence. The maximal number of proteins retrievable in any one search was 5,000, so the few families reported to have this number of members are larger than indicated. Column 6 indicates the superfamily, if any, as defined in this paper, to which the family belongs. Column 7 presents the known or estimated numbers of transmembrane β-strands (β-TMSs) in protein members. An asterisk indicates that for one or more representative member(s), the 3-D structure is known, and consequently the topology of that protein is established.
It should be noted that not all members of a family necessarily have the same number of β-TMSs. For those families lacking an asterisk, the numbers recorded were estimated using average hydropathy/amphipathicity/similarity (AveHAS) plots as well as the PRED-TMBB β-TMS prediction program for β-barrel proteins. In some cases, the proteins are known or predicted to consist of both α-helical and β-structural regions, and in these cases, we indicated this fact by "α + β". Finally, the last column indicates the designation of the family or superfamily used by the Conserved Domain Database (CDD), often derived from the Pfam database. Although Table 1 is self-explanatory, some of its features are described below.
As discussed in greater detail below, we have been able to assign many of the OMPP families to one large and four small superfamilies (Table 1, column 6 and Tables 2 and 3), which we have designated with Roman numerals. Thirty-three Pfam/CDD families, corresponding to 47 TC families, proved to fall into our Superfamily I (SFI; Table 1). Each TC family usually corresponds to a distinct CDD family, although some TC families encompass more than one CDD family [60]. Some TC families were not recognized by CDD (Table 1), which merely reports "no putative conserved domains have been detected". In such cases, column 8 in Table 1 is left blank. OMPP families not recognized by CDD derive from a variety of organismal sources, and in general, they include low to moderate numbers of members. SFI families are derived almost exclusively from Gram-negative bacteria. CDD recognizes our Superfamily II (SFII) as the MspA Superfamily, while a single family (1.B.47) of our Superfamily III (SFIII) was recognized by CDD, but the other family of this superfamily (1.B.28) was not recognized. CDD recognized SFIV but not SFV. Establishment of the number of β-TMSs for one member of a family does not necessarily imply that all members of that family have the same topology, as noted above. Finally, in several cases, some members of a TC family were recognized by CDD while others were not.
About two thirds of the TC OMPP families have their members derived primarily from Proteobacteria (Table 2). Nine families are derived primarily from spirochaetes, six from Actinobacteria, four from chloroplasts, two from chlamydiae, and one family each is derived from mitochondria (eukaryotes), peroxisomes (eukaryotes), fusobacteria, cyanobacteria and Bacteroidetes (Table 2). Thus, eukaryotic OMPPs include four from chloroplasts, one from mitochondria and one from peroxisomes. It should be noted that this skewed distribution, with so many families derived predominantly from Proteobacteria, undoubtedly reflects in part the facts that so many proteobacterial genomes have been sequenced and so much experimental work has been conducted with these organisms. Although most firmicutes lack an outer membrane and therefore lack OMPPs, a few have been reported to have these structures, and these unusual firmicutes sometimes proved to encode OMPP homologues in their genomes [61].
Table 3. Five superfamilies of OMPPs identified in this analysis. The table presents column 1, the superfamily number; column 2, the number of TC families in each superfamily; column 3, the relative superfamily size in numbers of proteins identified; column 4, the average protein size, expressed in numbers of amino acyl residues, ± standard deviations; column 5, numbers of superfamily proteins in TCDB as of 5/2014, and dominant organismal type represented.
Establishing homology between families with the formation of superfamilies
[...] Fig 1 as an example. Third, if comparison scores were insufficient to strongly suggest homology, Protocol 1 was used to retrieve homologues of the two query sequences using NCBI PSI-BLAST with one or two iterations followed by comparison of all retrieved sequences in one list with those in the other list using Protocol 2 [36]. Fourth, top scores obtained with Protocol2 were confirmed using GSAT with 2,000 random shuffles. When adequate values were obtained, the two sequences compared by Protocol2 were then compared with the original query sequences from TCDB using GSAT with 2,000 random shuffles (Table 4). Only if all three values exceeded 12 SD did we conclude that evidence for homology was appreciable (see below). It should be noted that the inability to establish homology does not prove that a family is not a member of a superfamily.
Fig 1. doi:10.1371/journal.pone.0152733.g001
An example of this procedure is shown in Fig 2A-2C, and the results of this comparison and others are summarized in Table 4. Protocol 1 retrieved Req1 when the query sequence was 1.B.24.1.2. The alignment obtained between these two proteins is shown in Fig 2A [...] and Table 4 provide the basis for the conclusion that our studies have defined the TC family compositions of Superfamilies I-IV. Comparable scores could not be obtained for SFV because of the small sizes (<60 aas) of these proteins (see Tables 1 and 3).
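The acceptance rule described above (three comparisons, each scoring at least 12 SD over a segment of at least 60 aas) can be written down as a small check. This is a sketch under stated assumptions: the class and function names are hypothetical, the scores in the example are invented for illustration and are not values from Table 4, and the threshold is treated as inclusive.

```python
from dataclasses import dataclass

MIN_SD = 12.0      # minimal comparison score, in standard deviations
MIN_LENGTH = 60    # minimal aligned segment length, in amino acyl residues


@dataclass(frozen=True)
class Comparison:
    protein_a: str
    protein_b: str
    score_sd: float
    aligned_length: int


def supports_homology(c):
    """One comparison counts only if both score and length criteria are met."""
    return c.score_sd >= MIN_SD and c.aligned_length >= MIN_LENGTH


def families_linked(query1_vs_hom1, hom1_vs_hom2, hom2_vs_query2):
    """All three links of the chain must pass for two families to be joined."""
    chain = (query1_vs_hom1, hom1_vs_hom2, hom2_vs_query2)
    return all(supports_homology(c) for c in chain)


# Invented values for illustration only.
ok = families_linked(
    Comparison("queryA", "homologA", 45.0, 210),
    Comparison("homologA", "homologB", 13.1, 95),
    Comparison("homologB", "queryB", 38.2, 180),
)
too_short = families_linked(
    Comparison("queryA", "homologA", 45.0, 210),
    Comparison("homologA", "homologB", 13.1, 48),  # segment shorter than 60 aas
    Comparison("homologB", "queryB", 38.2, 180),
)
print(ok, too_short)  # True False
```

The second case mirrors the situation noted for SFV, where proteins shorter than 60 aas cannot satisfy the length criterion regardless of score.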

OMPP Superfamily I (SFI)
We have identified five superfamilies, each of which includes at least two TC OMPP families (Tables 3 and 4). [...] TMS proteins could have arisen by duplication of 8 β-TMS proteins, it appeared possible that the most common topological type (16 β-TMSs) observed in this superfamily arose by duplication of an 8 β-TMS precursor. This has been suggested previously on the basis of the properties of artificially constructed 16 β-strand OMPPs, generated by intragenic duplication of 8 β-strand OMPPs [62]. It should be noted that our attempts to document such a duplication using our statistical methods were unsuccessful. Fig 3 also shows the numbers of established or predicted β-TMSs in the β-barrel OMPPs not in a superfamily. Interestingly, two of the three most common topologies observed for SFI (16 and 18 β-TMSs) are not represented at all among the non-SF families, and while 6 families of SFI have 8 β-TMSs, only one of the non-SF families has this topology. Other striking differences can be seen (Fig 3). These differences in topologies between proteins within SFI and those excluded from a superfamily suggest fundamental differences between these two sets of OMPPs (see Discussion). Four small OMPP families derived from Actinobacteria and two families from eukaryotes have been extensively characterized (see reference citations in TCDB under each family), and they apparently consist primarily of α-TMSs. The latter two OMPP families (1.B.30 and 1.B.69) belong to superfamily IV and appear to have 4 α-TMSs per OMPP. They are in the Tim17 family of Pfam and are included in a more extensive superfamily in TCDB (see the TCDB Superfamily hyperlink for a list of other proteins included in the Tim17 Superfamily). Two of the actinobacterial families comprising superfamily V have been shown to be related, and they include small proteins, usually of 40-60 aas, with a single α-TMS (see Table 1). These two families are the PorA Family (TC# 1.B.34) and the PorH Family (TC# 1.B.59) (Table 1).
These can be either hetero- or homo-oligomeric [63]. Because of their small sizes and the substantial sequence divergence of these two families, these proteins did not allow construction of reliable phylogenetic trees. In an independent study, we have shown that these two families consist of homologous proteins. In addition to being called OMPP SFV, they have been designated the Corynebacterial PorA/PorH Superfamily (T. Su and M.H. Saier; unpublished results; see TCDB Superfamily hyperlink). This superfamily will not be discussed further here.

OMPP Superfamilies II-V (SFII-V)
An additional actinobacterial OMPP family including proteins of α-structure is the Corynebacterial PorB Family (TC# 1.B.41) [64]. A high resolution (1.8 Å) x-ray structure of the Corynebacterium glutamicum PorB monomer is available, revealing a globular bundle of 4 α-helices tied together by a disulfide bond [65]. The native membranous structure must be oligomeric to form a pore, and a model for such a structure has been proposed [65]. PorC homologues of Corynebacteria [64] are members of this family, but PorB/PorC homologues are not believed to be related to PorA/PorH proteins.

OMPPs not included in superfamilies
Twenty-one families of OMPPs in TCDB did not fall into one of the five superfamilies mentioned above. Twenty of these families consist of proteins forming β-barrels that exhibit predicted topologies with any of the following numbers of putative β-TMSs within the barrel, based on PRED-TMBB: 12 (8 families) > 10 (4 families) > 2 (3 families) > 13 (2 families) > 8 = 24 = 26 (1 family each). It is interesting to note that the distribution of topological types observed for these families is strikingly different from that observed for SFI (Fig 3). For example, three non-SF OMPP families have members with 2 predicted β-TMSs and presumably form oligomeric structures, but such proteins are not found in SFI. Moreover, the predominant predicted topologies are 12 > 10 β-TMSs, and none of these families appears to consist of proteins with 16 or 18 β-TMSs, two of the most common topologies observed for SFI. These observations suggest that there is a basic difference between families included in Superfamily I and those not included in a superfamily. Possibly, these differences in topological distribution reflect a fundamental difference in their evolutionary pathways, which could suggest that not all β-TMS OMPPs belong to a single superfamily (see Discussion).
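The distribution quoted above can be tallied in a few lines as a consistency check. The counts are taken directly from the paragraph; the variable names are hypothetical, and the comparison with SFI is limited to the two topologies the text names explicitly (16 and 18 β-TMSs).

```python
# Predicted β-TMS count -> number of non-superfamily β-barrel families (from the text).
non_sf_topologies = {12: 8, 10: 4, 2: 3, 13: 2, 8: 1, 24: 1, 26: 1}

total_families = sum(non_sf_topologies.values())
ranked = sorted(non_sf_topologies.items(), key=lambda kv: (-kv[1], kv[0]))

# Topologies named in the text as common in SFI but absent from this set.
absent_in_non_sf = [n for n in (16, 18) if n not in non_sf_topologies]

print(total_families)    # 20
print(ranked[:2])        # [(12, 8), (10, 4)]
print(absent_in_non_sf)  # [16, 18]
```

The total of 20 agrees with the number of non-superfamily β-barrel families stated at the start of the paragraph.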

Topological analysis of the 76 TC OMPP families
All 76 OMPP families were examined in three ways using the AveHAS and PRED-TMBB programs. First, the proteins of each family in TCDB were used to construct multiple alignments using the ClustalX program followed by input of the alignment into the AveHAS program for topological prediction. Second, homologues of the family were retrieved using the first protein in each TC family (1.B.X.1.1) as the query sequence in a single Protocol 1 search of the NCBI NR protein database with a cut off value of 80% using PSI-BLAST without iterations. Proteins obtained were again multiply aligned, and topology was again predicted using the AveHAS program. Third, several randomly chosen individual proteins in each family were examined using the PRED-TMBB program. A consensus value was then obtained and recorded in column 7 of Table 1. When 3-D structural data were available (values in column 7 of Table 1 marked with asterisks), the known topology was compared with the predictions, leading to the observation that these predictions were about fifty percent accurate. However, when several homologues of a single family are examined and the results are averaged, much greater accuracy can be attained. Examples are shown in Fig 4A-4D for subfamily 1.B.6.1 with 8 β-TMSs (Fig 4A), subfamily 1.B.1.1 with 16 β-TMSs (Fig 4B), subfamily 1.B.3.1 with 18 β-TMSs (Fig 4C) and subfamily 1.B.35.1 with 12 β-TMSs (Fig 4D). Because the 3-D structures are known for representative members of these four families, we could assign β-TMSs with high confidence. If the topology is the same for all included members of each subfamily, an assumption that may or may not be valid, depending on the family, this conclusion applies to all members of the subfamily. Thus, the numbers above the peaks of hydrophobicity in Fig 4A-4D give the positions of the known TMSs.
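How a consensus value might be derived from several per-protein predictions can be sketched as follows. The text does not specify the rule that was used, so taking the most frequent predicted strand count (with ties broken toward the smaller number) is an assumption of this example, as are the function name and the prediction values.

```python
from collections import Counter


def consensus_strand_count(predictions):
    """Most frequent predicted β-TMS count; ties go to the smaller value."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    return min(n for n, c in counts.items() if c == best)


# Hypothetical predictions for five homologues of one family.
print(consensus_strand_count([16, 16, 14, 16, 18]))  # 16
```

A single outlying prediction (14 or 18 here) does not change the result, which illustrates why pooling several homologues is more robust than relying on one protein.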
We emphasize that all of these fusion proteins were found in α-proteobacteria, some of which are known to form protein fusions and domain rearrangements with higher frequency than other proteobacteria [15].

Fusion of OMPPs with other OMPPs and other protein domains
Additional fusion proteins were found in subfamily 2 of family 1.B.72, the Protochlamydial OMPP (PomS/T) Family. Members of subfamily 1 in this family include characterized chlamydial OMPPs [68]. The two proteins exhibiting fusions are (1) [...] (2) 1.B.72.2.4 of 1086 aas which consists of a long N-terminal sequence including tetratricopeptide repeats and a C-terminal PomS/T domain. Interestingly, both of these sequences are also from α-Proteobacteria, and surprisingly, the PomS/T domain shows limited sequence similarity to members of the Omptin (Protease 7) Family. It is possible that Omptin Proteases (9.B.50), for which high resolution x-ray structures are available [69], are related to PomS/T OMPPs.

Superfamily I phylogenetic analyses
Phylogenetic trees were constructed in six different ways. First, all or many representative proteins from a family within TCDB, included within any one of the 47 families that comprise Superfamily I, were used to generate a multiple alignment using the ClustalX program followed by tree construction using TreeView or FigTree, two equivalent methods. Second, the same method was used with a smaller collection of representative proteins from each of the TC families included in the study. Third, the SuperfamilyTree 1 (SFT1) program was used with all members of the families in TCDB included in the Superfamily I input file. Fourth, the same program was used with a single representative member of each subfamily in Superfamily I. In this last tree, we selected the first members of all subfamilies (1.B.X.X.1). Fifth, the SFT1 program was used with several representative proteins from each Superfamily I family. These studies revealed that several members of a family (at least five) must be included to obtain accurate relationships. Finally, the SFT2 program was used to derive a consensus tree using all of the data from SFT1 (Fig 5). This tree shows the positions of the families relative to each other, revealing their probable relationships. See Tables 1 and 5 for details of these families, their properties and their relationships to each other. Table 5 presents the proposed phylogenetic relationships, functions, substrates when known, protein sizes, and (putative) topologies.
ClustalX-derived trees (e.g. S1 Fig) revealed clustering patterns that were inconsistent with the known phylogenies of the proteins. For example, the coherent but sequence diverse family of outer membrane receptors (OMR; 1.B.14) showed a majority of the members in one large cluster, but sequence divergent members of this family were found in eight additional clusters around the tree. Another large family, the Outer Membrane OMPP Family (Opr; 1.B.25), showed a majority of the members in a single large cluster, but other members appeared in three more clusters. The large General Bacterial OMPP Family (GBP; 1.B.1) had most of its members clustering in two large groups, separate from each other. A few families had all members correctly clustering together, as was true of the OprB Family (1.B.19), but these were the exception. In general, members of the large diverse families did not show consistent clustering, although members of the small sequence similar families sometimes did.
In other ClustalX based trees, where select sequence divergent proteins were included, a similar situation existed. When all members of a family selected for inclusion were derived from a single subfamily, the proteins frequently clustered together. This was true for families 7, 19, 31, 39 and 47. Only in one case (family 9), where the members selected were from different subfamilies, did they still cluster together. These results reveal the limitations of trees based on multiple alignments when sequence divergence is considerable as has been noted before [70].
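The kind of inconsistency described above can be quantified in a simple way once each protein in a tree has been assigned to a cluster. The following Python sketch is illustrative only: the function name, the cluster labels and the toy assignments are hypothetical and are not data from this study. For each family it reports how many clusters its members occupy and what fraction of the members fall in the largest one.

```python
from collections import Counter, defaultdict


def family_coherence(assignments):
    """Summarize how coherently each family clusters.

    assignments: iterable of (family, cluster_label) pairs, one per protein.
    Returns {family: {"members", "clusters", "fraction_in_largest"}}.
    """
    per_family = defaultdict(Counter)
    for family, cluster in assignments:
        per_family[family][cluster] += 1

    summary = {}
    for family, counts in per_family.items():
        total = sum(counts.values())
        summary[family] = {
            "members": total,
            "clusters": len(counts),
            "fraction_in_largest": max(counts.values()) / total,
        }
    return summary


# Toy data (hypothetical): one coherent family and one fragmented family.
toy = (
    [("1.B.19", "c1")] * 4
    + [("1.B.14", "c2")] * 6
    + [("1.B.14", "c3"), ("1.B.14", "c4")]
)
for family, stats in family_coherence(toy).items():
    print(family, stats)
```

A family whose members all fall in one cluster scores 1.0; lower values indicate members scattered around the tree.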

The SFT phylogenetic tree for Superfamily I
Using the SFT1 program, we first included all TC proteins within all of the 47 Superfamily I families. Clustering patterns were in general consistent with family assignments (see below), but the tree was so congested that it was impossible to display all of the proteins included. We next created a tree using only the first member of each subfamily within all families of the superfamily. This tree did not show the expected clustering of subfamilies within specific families, showing that it was necessary to include a substantial number of closely related proteins in order to generate a reliable tree. Consequently, we generated a final tree in which five proteins from each family within the superfamily were included. This tree proved to have members of each family generally clustering together with few exceptions. Thus, very few proteins fell outside of the cluster representing the family to which these proteins belonged. The tree, showing family relationships (Fig 5), derived using the SFT2 program, will be described in detail below.
The phylogenetic tree shown in Fig 5 includes fifteen major clusters, labeled I-XV. Most of these clusters include multiple families, although clusters VII and IX include only one family each. The clustering pattern reveals which families within Superfamily I are most closely related. Clusters IV, V, VIII, XI and XII contain two families each, clusters III, VI, XIII and XV have three families each; clusters X and XIV each have four families, cluster II has six families, and cluster I includes nine families. The families included in the 15 clusters are shown in Fig 5, and their properties are summarized according to cluster and subcluster (A, B and C) in Table 5.
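As a consistency check, the per-cluster family counts stated above can be tallied and compared with the 47 families of Superfamily I. The short Python sketch below uses only the numbers given in the preceding paragraph; the variable names are arbitrary.

```python
# Number of families in each Superfamily I cluster, as stated in the text.
families_per_cluster = {
    "I": 9, "II": 6, "III": 3, "IV": 2, "V": 2,
    "VI": 3, "VII": 1, "VIII": 2, "IX": 1, "X": 4,
    "XI": 2, "XII": 2, "XIII": 3, "XIV": 4, "XV": 3,
}

n_clusters = len(families_per_cluster)
total_families = sum(families_per_cluster.values())

print(n_clusters, total_families)
```

The tally gives fifteen clusters and 47 families, in agreement with the number of families assigned to Superfamily I.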
Cluster I has three primary subclusters; subcluster A includes families 20 (The Two Partner Secretion (TPS) Family), 33 (The Outer Membrane Protein Insertion OMPP (OmpIP) Family), and 11 (The Outer Membrane Fimbrial Usher OMPP (FUP) Family). TPS and OmpIP OMPPs are most closely related, with the FUP Family branching from a point closer to the center of the tree. All three families include members that are derived from a variety of Gram-negative bacteria, especially proteobacteria (Table 1). Although as noted below, members of the OmpIP Family are present in mitochondria and chloroplasts, the organismal types cited in Table 1 represent those included in TCDB. At least in some of these families, other phyla are represented in the UniProt and GenBank databases. Both of the first two families in subcluster A include C-terminal 16 stranded β-barrel OMPPs as well as additional domains that function in the insertion of outer membrane proteins (OmpIP), or the export of proteins across the outer membrane (TPS) [71,72]. In both cases, the substrate protein folds into its native configuration during or soon after the export process [73]. It is interesting that the functional Omp85 (YaeT, BamA) OMPPs of the OmpIP family are related to the chloroplast import-associated β-barrel channel proteins (IAP75; 1.B.33.2.1) of the Chloroplast Envelope Protein Translocase (CEPT or Tic Toc) Family (TC# 3.A.9), and the Mitochondrial Sorting and Assembly Machinery (SAM) OMPPs, SAM50 (TC# 1.B.33.3.1), which assembles outer mitochondrial membrane β-barrel proteins [74][75][76].
Table 5. SFI families included in clusters I-XV arranged according to the cluster/subcluster as shown in Fig 5. A bracket ({) indicates that these families are most closely related within the indicated (sub)cluster. See footnotes for explanation of the columns.
As for the TPS OMPPs, the N-terminal domains of Omp85 homologues are localized to the periplasm, where they function in substrate protein binding and pore gating, while the C-terminal domains comprise the 16-stranded β-barrels [77,78]. Interestingly, signals in bacterial Omp85 homologues are functional in eukaryotic cells for targeting to and assembly of mitochondrial OMPs into the outer membranes of these organelles [79].
The third family in subcluster IA, alongside the TPS and OmpIP OMPPs, is the Fimbrial Usher Protein (FUP) Family (1.B.11). These large usher proteins resemble the TPS and OmpIP OMPPs in having extra N-terminal domains involved in substrate protein recognition as well as C-terminal extracellular domains that function in fimbrial subunit folding and assembly. In this case, the OMPP domain is central and has about 24 β-TMSs [80,81]. Fimbrial ushers serve essentially the same function as TPS systems in exporting proteins, in this case, for assembling the subunits of bacterial fimbriae. They evolved in parallel with the periplasmic chaperone proteins that feed the subunits to the usher proteins [16].
The second major subcluster in cluster I, subcluster B, includes the Outer Membrane Auxiliary (OMA) family of capsular polysaccharide exporters (1.B.18) and the bacterial Secretin (Secretin) Family (1.B.22), usually involved in protein secretion (most closely related), as well as the Outer Membrane Factor (OMF) Family (1.B.17) (more distantly related), involved in the export of extracellular proteins and polysaccharides as well as small molecules such as drugs, aromatic acids and divalent metal ions, depending on the inner membrane transport systems with which the proteins of this family associate [82,83]. Like subcluster A, subcluster B is concerned with macromolecular export. Both the OMFs and the Secretins have proteins with α- and β-structure with 12-16 β-TMSs, and both form oligomeric structures [84]. The octameric transmembrane ring of the OMAs has been compared with that of Secretins [85], although their transmembrane domains are formed of unusual α-helical barrels with three layered ring domains of mixed composition, mainly of β-strands in the periplasm [86]. Near the base of subcluster B is the Poly Acetyl Glucosamine OMPP (PgaA) Family (1.B.55), another family concerned with extracellular polysaccharide export [87]. These OMPPs have the structure of a β-barrel with variable numbers of predicted β-strands.
Table 5 footnotes: 1 Family TC # in subclass 1.B. (see Table 1). 2 Family abbreviation; see Table 1 for full name. 3 Substrates shown to be transported by members of the indicated OMPP families. ?, substrates unknown. 4 Average protein size is provided in numbers of amino acyl residues ± standard deviation (SD). 5 Topology expressed in numerical values refers to the established or predicted numbers of β-strands in the transmembrane β-barrel. All Cluster I OMPPs contain both α-helical and β-strand structures. doi:10.1371/journal.pone.0152733.t005
At the base of cluster I is a subcluster (subcluster C) consisting of two families, the Autotransporter-I (AT-1) Family (1.B.12) and the Lipopolysaccharide Export OMPP (LPS-EP) Family (1.B.42). Like all other OMPPs in cluster I, these OMPPs are concerned with macromolecular (protein and carbohydrate, respectively) export. These two families cluster more closely to each other than to any other family in cluster I. While representative members of the AT-1 family are known to have 12 β-TMSs in β-barrel structures, the topology of the LPS-EP family is not known. It is remarkable that the nine OMPP families that comprise cluster I all serve a unified function in macromolecular export, particularly because few OMPP families include members with this capability. They also display the properties of being multi-domain multisubunit systems in most cases.
Cluster II includes families that fall into two subclusters, each containing three families.  [89,90]. The former two OMPP families are more closely related to each other than to the POP family. All cluster II families consist of members in about the same size range (400-540 aas) and are present in many different Gram-negative bacterial phyla. Moreover, all are predicted to have 16-18 β-TMSs arranged in β-barrels. The close relationships of the functionally uncharacterized PBP and BBP4 families with the four functionally characterized families provide the strongest evidence currently available that these two families do, in fact, consist of OMPPs. In contrast to cluster I OMPPs that have the capacity to export macromolecules, all characterized cluster II OMPPs apparently function to allow transport of small molecules.
Cluster III consists of three large families, all of which have small porin domains of about 200-250 aas with 8 established (2 families) or putative (1 family) β-strands in a barrel structure. However, several of these proteins are larger due to fusions with domains such as the peptidoglycan binding domain in several OmpA proteins [91].
The best characterized of these families are the OmpA-OmpF OMPP (OOP) Family (1.B.6) [92,93] and the OmpW Family (1.B.39) [94]. The third family in cluster III is the Anaplasma P44 (A-P44) Family (1.B.49) with established OMPP activity [95]. Members of a family of spirochaete proteins, the Putative Spirochaete Omp-like OMPP (Sp-Omp) Family (9.B.184) [96] have size, topological and sequence characteristics resembling those of E. coli OmpA and OmpW homologues, but the function of no member of this family has been established. This family will therefore not be considered further.
Cluster IV includes two families, the General Bacterial OMPP (GBP) Family (1.B.1) and the Rhodobacter PorCa OMPP (RPP) Family (1.B.7). The Rhodobacter OMPP was the first OMPP to have its high resolution structure solved [97]. Subsequently, the structures of several members of the GBP family were solved and all proved to consist of trimeric pores, each subunit having 16 β-TMSs, like the Rhodobacter OMPP [59]. That these two families belong to a single subcluster was therefore not unexpected.
Cluster V includes the well characterized FadL OMPP Family (1.B.9), concerned with transport of hydrophobic molecules such as fatty acids, benzene derivatives, hydrocarbons, hemin and salicylate esters [98] and the poorly characterized Legionella Major OMP (LM-OMP) Family (1.B.57) [99]. Members of these two families are of similar sizes (Table 1), and while FadL of E. coli is known to have 14 established β-TMSs, LM-OMP Family members are predicted to have 12-14 β-TMSs. Structural similarities with FadL seem likely.
Cluster VI consists of three OMPP families: first, the well characterized Sugar OMPP (SP) Family (1.B.3) that includes the trimeric E. coli maltoporin with 18 established β-TMSs [100,101] as well as the sucrose and β-glucoside OMPPs, and second, the much less well characterized Raffinose (RafY) Family (1.B.15), which includes the E. coli RafY OMPP that transports several oligosaccharides including the trisaccharide, raffinose [102,103]. The OMPPs of these two families have overlapping specificities for oligosaccharides, and they are closer to each other on the tree than to any other OMPP family. However, a third family, the Chlamydial OMPP (CP) Family (1.B.2) also occurs in this cluster. The members of this family have sizes and topologies similar to those of the RafY family, and like maltoporin, these Chlamydial OMPPs, which are known to transport small nutrients, are homotrimers [104].
Branches (Clusters) VII, VIII and IX include just 1, 2 and 1 families, respectively. Moreover, the branch points for the two families in cluster VIII are so close to the center of the tree that it cannot be concluded with confidence that they are more closely related to each other than to the families in clusters VII and IX. These families are the uncharacterized Putative β-Barrel OMPP-4 (BBP4) Family (1.B.68) (branch VII), the Porphyromonas gingivalis OMPP (PorT) Family (1.B.44) (cluster VIII), the Outer Membrane Receptor (OMR) Family (1.B.14) (cluster VIII) and the Alginate Export OMPP (AEP) Family (1.B.13; cluster IX). Members of these families exhibit differing sizes and topologies with 8, 22, 8 and 18 predicted β-TMSs, respectively, in agreement with the fact that they stem from points near the center of the tree. Of these four families, only the OMR and AEP families include members that are functionally well characterized [105,106].
Cluster X includes four families. The Cyclodextrin OMPP (CDP) Family (1.B.26) and the Oligogalacturonate OMPP (KdgM) Family (1.B.35) [107,108], which includes the structurally characterized 12 TMS NanC OMPP [109], are families of polysaccharide export OMPPs that cluster closely together, a fact that is noteworthy since both are specific for complex carbohydrates. Members of these two families form β-barrels, but they differ in average protein size (about 350 aas versus 530 aas) and possibly topologies (14-16 predicted β-TMSs versus 12 established β-TMSs, respectively). Branching lower within cluster X is the OmpG OMPP (OmpG) Family (1.B.21), and closest to the center of the tree, we find the Fusobacterial Outer Membrane OMPP (FomA) Family (1.B.32). Proteins of these two families have about the same sizes and numbers of β-TMSs (14) as the CDP and KdgM families, but they are reported to catalyze non-specific export of small molecules, restricted only by the sizes of the substrates [110][111][112].
Cluster XI consists of just two families, the Nucleoside-specific Channel-forming Outer Membrane OMPP (Tsx) Family (1.B.10) [113] and the Capsule Biogenesis/Assembly (CBA) Family (1.B.73) [41]. They differ in protein sizes and numbers of β-TMSs (12 versus 18 established β-TMSs). The molecular functions of CBA family members are not well established, but Wzi of E. coli, a member of the CBA family, is a carbohydrate binding protein (lectin) that is somehow involved in extracellular capsule formation [114].
Cluster XIII includes three families. The poorly characterized Intimin/Invasin or Autotransporter-3 (AT-3) Family (1.B.54) and the Proteobacterial/Verrucomicrobial OMPP (PVP) Family (1.B.71) cluster more closely together than to the much better characterized, but more distantly related Mitochondrial and Plastid OMPP (MPP) Family (1.B.8) that includes outer mitochondrial membrane OMPPs called VDAC. While members of the first two of these families may consist of 12 β-TMS barrels, VDAC OMPPs have 19 established β-TMSs. The latter proteins can differ in cellular location, size and structure due to alternative splicing [115], but they always appear to form anion-selective pores.
Cluster They transport a variety of small molecules including amino acids, peptides, phenolic compounds, antibiotics and sugar derivatives [118,119]. It can be concluded that like many other OMPPs, these channel proteins exhibit broad specificities and are simply size limited.

Protein phylogenetic trees for Superfamilies II and III
Acid fast Gram-positive Actinobacteria have OMPPs of two types, β-barrel and α-helical, in their outer membranes (see Table 1 and TCDB [120,121]). Two of these OMPP families, the Mycobacterial OMPP (MBP or MspA) Family (1.B.24) and the Nocardial Heterooligomeric Cell Wall Channel (NfpA/B) Family (1.B.58) are believed to consist of β-barrels and comprise Superfamily II [122,123]. These two families include proteins, most of which are of 200-290 aas with a single N-terminal α-helical TMS followed by a proposed β-barrel OMPP-type structure. The tree shown in Fig 6A reveals that, as expected, members of these two families segregate into two distinct clusters. Because the proteins in SFII are quite similar, it is not surprising that the SFT1 and ClustalX/FigTree trees were in good agreement (compare Fig 6A and S2A Fig).
Superfamily III (Fig 6B) includes the Plastid Outer Envelope OMPP of 24 kDa (OEP24; 1.B.28) and the Plastid Outer Envelope OMPP of 37 kDa (OEP37; 1.B.47) Families. These β-barrel OMPPs are of about 220 and 330 aas in size, respectively, and are predicted to have 12 and 14 β-TMSs, respectively. Presumably, because of their simplicities, the SFT1 tree proved to be in good agreement with the tree based on the ClustalX multiple alignment (see Fig 6B and S2B Fig).

OMPP repeat sequences
Attempts were made to identify the proposed 8 β-TMS repeats in the 16 β-TMS proteins of Superfamilies I & II using AncientRep [36], but these attempts were unsuccessful. We then used the HHrepID program to look for internal repeats and the PRED-TMBB program to predict the β-TMSs. While these programs also failed to identify the proposed eight β-TMS repeats, they did identify what appeared to be hairpin repeats with P values between e-2 and e-10. , which was predicted to have 14 β-TMSs, appeared to have at least five and possibly seven β-hairpin repeats, starting with β-TMS 4 and ending with β-TMS 13, with P values between e-3 and e-9 (Fig 7). Different proteins in family 1.B.4 were predicted to have variable numbers of β-TMSs and anywhere between 1 and 3 putative β-hairpin repeats. Several proteins in family 1.B.6 were predicted to have at least three β-hairpin repeats with P values in the same range. Putative adjacent β-hairpin repeats were also identified in families 9 (up to 7 repeats), 14 (up to 8 repeats), and 24 (up to 5 repeats). However, hairpin repeats were not identified in members of several families including 28, 47 and 58. Interestingly, four adjacent putative 1 β-TMS repeats were identified in one member of family 28, the protein with TC# 1.B.28.1.4. It is worth noting that families 28 and 47, where no hairpin repeats were identified, belong to Superfamily III, and families 24 and 58, which comprise Superfamily II, in general, did not exhibit identifiable repeats. However, in one protein within family 24, 1.B.24.1.4, five β-hairpin repeats were predicted. Because of concerns about convergent sequence evolution and the short lengths of these sequences, we do not consider it certain that these results can be interpreted in terms of a primordial hairpin structure being the precursor of the proteins in Superfamilies I & II although a previous report came to this conclusion [31].
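The relationship between a span of β-TMSs and the number of adjacent hairpins it can hold can be made explicit with a small Python sketch. It assumes, purely for illustration, that hairpins are non-overlapping pairs of consecutive strands in register with the first strand of the span; the function name is hypothetical and the sketch does not reproduce what HHrepID or PRED-TMBB compute.

```python
def adjacent_hairpins(first_strand, last_strand):
    """Pair consecutive beta-strands into non-overlapping hairpins.

    Assumption: hairpin k consists of strands (first + 2k, first + 2k + 1),
    so the span must contain an even number of strands.
    """
    n_strands = last_strand - first_strand + 1
    if n_strands <= 0 or n_strands % 2:
        raise ValueError("span must contain a positive, even number of strands")
    return [(s, s + 1) for s in range(first_strand, last_strand + 1, 2)]


# Span described in the text: beta-TMS 4 through beta-TMS 13.
pairs = adjacent_hairpins(4, 13)
print(len(pairs), pairs)
```

Under this assumption, ten strands (β-TMSs 4 to 13) accommodate five adjacent hairpins, which corresponds to the lower of the two repeat counts quoted above.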

Discussion
In 1997, Paulsen et al. published a review concerning a family of Gram-negative outer membrane factors involved in the export of proteins, complex carbohydrates, drugs and heavy metals [124]. They showed that these substrates could be exported via a complex of three proteins, an inner membrane transporter of any one of several types that provided the energy for export across the entire cell envelope, a "Membrane Fusion Protein" (MFP) that may serve primarily as an "adaptor", joining the two membranes of the Gram-negative bacterial envelope, and the above mentioned Outer Membrane Factor that provided the β-barrel pore through which the substrates permeate the outer membrane. Since then, sixteen mechanisms of protein export across one or the other of the two membranes of the Gram-negative bacterial envelope or both have been described, and all require the presence of a pore-forming outer membrane protein (OMPP) [125,126]. However, outer membrane pore-forming proteins can function in non-specific or substrate-specific transport into or out of the cell, and many of these are not coupled to an energy input source. They can therefore facilitate transport in both directions [5,127]. Few attempts have been made to classify these proteins into families and superfamilies. As part of a massive attempt to categorize all well conserved cellular proteins, the Pfam database has used Hidden Markov Models (HMMs) and UniProtKB reference proteins to categorize outer membrane pore-forming proteins among others [128]. The current release of this database (Pfam 29.0) includes 16,295 entries classified into 559 Clans, and relationships between families within a clan (i.e., a superfamily) have been suggested using newly prepared bioinformatic tools [128].
Based in part on Pfam, two other databases, devoted exclusively to β-barrel outer membrane proteins from Gram-negative bacteria, have been derived [129]. They define families including about 20,000 and 80,000 proteins respectively, again based largely on HMMs but also using the transitivity rule and 3-d structures when available [130]. The Pfam family/clans are presented in Table 1, last column, and comparisons between TCDB and Pfam are discussed in the text as well as in reference 131. It is apparent that many of the families in TCDB often correlate with those in Pfam and OMPdb, and some of the TC families found within a single superfamily are also found in single clans in Pfam.
While Pfam, HHomp and OMPdb are based substantially on HMMs as noted above, TCDB is based largely on the Superfamily Principle, also known as the Transitivity Rule. Surprisingly, perhaps, there is considerable concordance. We have screened OMPdb and Pfam for overlap, ensuring that all of the members in these databases are included in TCDB (see reference [131]). The first comprehensive analysis of outer membrane β-barrel proteins was that of Remmert et al., 2010 [37]. These investigators used three criteria for suggesting homology as discussed in the Introduction. Despite the differences in topology (8-24 β-strands in the barrel) they provided evidence that many β-barrel porins "arose by amplification and recombination of short peptide modules". They did not, however, assign families to superfamilies or attempt phylogenetic analyses of the proteins within these superfamilies. Using defined criteria for quantitative homology assignment and the Superfamily Tree programs, SFT1 and 2, to construct phylogenetic trees, we were able to correct for these deficiencies as reported here.
We have identified 76 TC families of outer membrane OMPPs, most being from Gram-negative bacteria, but some being from outer membrane-possessing Gram-positive Firmicutes and Actinobacteria, and some being from eukaryotic organelles. Of the Gram-negative bacterial OMPPs, 47 of the families appear to form a single large superfamily which we have designated SFI. Phylogenetic analyses using SFT1 and SFT2 [11,12,21,35,42] revealed for the first time that members of each of these families cluster coherently together in almost all cases, and that the families fall into fifteen clusters. Because these proteins are highly divergent in sequence, it is not surprising that ClustalX/FigTree phylogenetic trees were not reliable as documented previously for other sequence-divergent superfamilies [11,12,21,35,42]. The SFT2 tree, showing family relationships (Fig 5), in general, confirmed the SFT1 tree of the proteins (not shown) with a few minor exceptions that proved in every case to be uncertainties due to deep branching. The analyses revealed common functions and/or topologies among some but not all of the most closely related families. However, even within a single family, or within a closely related set of families, functions and topologies may differ substantially. These facts must reflect the ease with which OMPPs change their numbers of β-strands and alter their pore sizes as well as their substrate specificities during their evolutionary divergence. In this respect it is interesting that Korkmaz et al. [132] duplicated the last 38-aa β-hairpin at the end of the 14 β-stranded OmpG of E. coli (TC #1.B.21.1.) to produce a 16 β-TMS OMPP with similar properties and stabilities, but altered pH sensitivity. Thus, duplication of a β-hairpin structure has been documented with retention of a functional β-barrel.
We noted that OMPPs from Gram-negative bacteria are often related to each other and reside in SFI, that the small SFII includes β-barrel proteins only from Actinobacteria, and that the SFIII proteins derive exclusively from chloroplasts of eukaryotes. Three families of outer membrane proteins from Actinobacteria and two families from eukaryotes are known to consist of transmembrane α-helices [65,133]. Two of the former and two of the latter comprise SFIV and SFV, respectively. Of the seventy one putative β-barrel OMPP families, many of the families in each superfamily may have related topologies (8 and 16 β-TMSs for Superfamilies I & II, but 12 and 14 β-TMSs for Superfamily III). Because the sequences that comprise SFII, and those that comprise SFIII, are similar within each superfamily, it was not surprising that the SFT trees showed good agreement with the trees based on multiple alignments (compare Fig 6 with S2 Fig). For SFI, the demonstration of homology and construction of phylogenetic trees provided the first evidence that several families of structurally and functionally uncharacterized proteins are, in fact, OMPPs. Since these proteins are extremely divergent in sequence, it is not surprising that trees based on multiple alignments proved inaccurate. Moreover, some families in SFI have proteins from different phyla. Sequence divergence between phyla has contributed substantially to the tremendous diversity of SFI. These results should serve as guides for further molecular biological experimentation.
The proteins that reside in β-barrel OMPP families not included in one of the identified β-TMS superfamilies in general proved to have a very different distribution of predicted topologies from those of Superfamilies I-III (Fig 3). This observation provides preliminary evidence that many of these outlying families may not be related to SFs I-III, and may instead have evolved independently. Thus, the dominant topology for these families appears to involve twelve β-strand barrels (8 of 21 families), and none of these 21 families exhibited the 16 or 18 β-TMS protein topology found in SFI as two of the most common topological types. This suggestion is in agreement with our inability to demonstrate homology between members of the five recognized OMPP superfamilies classified here. However, we were not able to document this proposal, for example, by showing that the route of evolution taken for the appearance of any of these proteins differs from those taken by other OMPPs. Such studies remain work for future investigations.
We noted that major topologies in SFs I and II involve 16 and 8 β-TMS topologies. This observation suggested the possibility that the larger of these proteins arose by intragenic duplication events, where the precursors were the smaller of these homologous proteins. Our attempts to demonstrate this pathway were not successful, although Arnold et al. provided evidence that artificial internal duplication of 8 β-TMS OMPPs can give rise to 16 β-TMS OMPPs that are fully functional [62]. We could, however, observe limited similarities between adjacent 2 TMS hairpin structures in a number of these families in agreement with the observations of Remmert et al. [31]. The demonstration of apparently homologous adjacent hairpin structures in β-barrel proteins suggests that topological variations among these proteins could have arisen by gain or loss of β-hairpin structures. However, this does not eliminate the possibility of larger scale duplications such as the 8 β-strand duplications to give 16 β-strands proposed by Arnold et al. [62]. This last possibility is more in agreement with observations of intragenic duplication as occurred in most α-type integral membrane transport proteins [7]. Nevertheless, it should be kept in mind that these two transporter types (α and β) may have evolved via very different routes.
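The two routes to topological change discussed here, duplication of a whole barrel and gain or loss of individual β-hairpins, can be contrasted with a minimal Python sketch. This is a toy arithmetic model with hypothetical function names; it tracks strand counts only and makes no claim about the actual evolutionary pathways.

```python
def duplicate_barrel(n_strands):
    """Whole-barrel intragenic duplication, e.g. 8 strands giving 16."""
    return 2 * n_strands


def add_hairpins(n_strands, n_hairpins=1):
    """Gain (positive) or loss (negative) of two-strand beta-hairpins."""
    return n_strands + 2 * n_hairpins


def hairpin_steps_between(start, target):
    """Hairpin gains (positive) or losses (negative) linking two topologies.

    Returns None when the strand counts differ in parity, since whole
    hairpins always change the count by two in this toy model.
    """
    diff = target - start
    if diff % 2:
        return None
    return diff // 2


print(duplicate_barrel(8))            # 8-strand barrel duplicated
print(add_hairpins(14, 1))            # one hairpin added to a 14-strand barrel
print(hairpin_steps_between(8, 16))   # hairpin gains needed from 8 to 16
print(hairpin_steps_between(8, 19))   # odd target is unreachable by hairpins
```

In this model an 8-strand barrel reaches 16 strands either by one duplication or by four hairpin gains, and a 14-strand barrel reaches 16 by a single hairpin gain. An odd strand count cannot be reached from an even one by whole-hairpin steps, which is a property of the model rather than a conclusion of the study.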
Of the 76 families of OMPPs described here, we could identify phylogenetic relationships for many of these families, revealing which families most recently diverged from common ancestors. These proteins were additionally analyzed for topology with the observation that at least in the large Superfamily I, the gain and/or loss of β-strands or of β-hairpin structures may have occurred repeatedly during the evolutionary divergence of members of this superfamily. We do, however, suggest that not all β-barrel OMPPs derived from a common β-hairpin structure. Interestingly, we have identified naturally occurring proteins in which two or even three OMPP domains are present within a single polypeptide chain. These may function to coordinate the transport of physiologically related compounds. Further studies will be required to define the specific pathways that gave rise to these proteins of dissimilar topologies.
After completing the work described in this article, evidence for five new (putative) OMPP families has appeared. (1) The Oep23 Family (1.B.77) includes a characterized OMPP in the outer envelope of chloroplasts [134]. Bacterial homologues, particularly from Actinobacteria, have been identified. We have not been able to show that this family is related to any other OMPP family. (2) The Electron Transport-associated OMPP (ETOMPP) Family (1.B.78) includes proteobacterial members involved in the transenvelope transport of electrons, allowing extracellular metal oxidoreduction [135,136]. This family proved, by our criteria, to be a member of SFI. (3) The SpmT Family (1.B.79), with members in Actinobacteria, is a putative OMPP (N-terminus) sphingomyelinase (C-terminus) fusion protein. Evidence that the OMPP domain can transport glucose and phosphocholine has been presented [137]. (4) The Putative Trans-Outer Membrane Electron Flow OMPP (TOM-EF) Family (1.B.80) appears to be very distantly related to members of the ETOMPP Family (TC#1.B.78). It probably serves the same or a closely related function [138]. Both families function with periplasmic and outer membrane cytochrome c proteins and may belong to SFI. (5) Finally, the most recently identified putative porin family (DUF2490; TC#1.B.81) may also belong to SFI, but its functions are not established.
In subclass 9.B of TCDB, we have listed 14 additional families, which, on the basis of tentative predictions, may consist of OMPP proteins, even though in no case has an OMPP function been established, and in no case has homology with members of an established OMPP family been demonstrated. These putative OMPP families in TC subclass 9.B include families 138, 153, 155, 161-165, 167, 168, 170-172 and 184. Clearly, many novel OMPP families are in need of functional and structural characterization.