Distribution and preservation of the components of the engulfment. What is beyond representative genomes?

Engulfment requires the coordinated, targeted synthesis and degradation of peptidoglycan at the leading edge of the engulfing membrane to allow the mother cell to completely engulf the forespore. Proteins such as the DMP and Q:AH complexes in Bacillus subtilis are essential for engulfment, as are a set of accessory proteins including GerM and SpoIIB, among others. Experimental and bioinformatic studies of these proteins in bacteria distinct from Bacillus subtilis indicate that fundamental differences exist regarding the organization and mechanisms used to successfully perform engulfment. As a consequence, the distribution and prevalence of the proteins involved in engulfment and other proteins that participate in different sporulation stages have been studied using bioinformatic approaches. These works are based on the prediction of orthologs in the genomes of representative Firmicutes and have been helpful in tracing hypotheses about the origin and evolution of sporulation genes, some of which have been postulated as sporulation signatures. To date, an extensive study of these signatures outside of the representative Firmicutes is not available. Here, we asked whether phyletic profiles of proteins involved in engulfment can be used as signatures able to describe the sporulation phenotype. We tested this hypothesis in a set of 954 Firmicutes, finding preserved phyletic profiles defining signatures at the genus level. Finally, a phylogenetic reconstruction based on non-redundant phyletic profiles at the family level shows the non-monophyletic origin of these proteins due to gain/loss events along the phylum Firmicutes.


Introduction
Bacteria have developed different strategies to survive. Several members of the phylum Firmicutes can form spores that are resistant to chemical and physical insults that remain dormant until favorable conditions allow the spore to germinate [1][2][3][4]. Genomic comparisons among Firmicutes, show a core set of genes that are almost conserved in all endospore formers [5][6][7][8][9][10]. (POs). We tested this approach on the proteins of the phosphorelay initiating sporulation in Firmicutes [36]. The approach was a determinant to identify the transferases Spo0B and Spo0F in some species of Clostridia, which was considered for a long time to be a two-component system composed only of an orphan histidine kinase and the response regulator Spo0A [1,18,38].
In the present work, we used our aforementioned method to construct protein architectures based on HMM-profiles, which describe the proteins that participate in the engulfment such as accessory proteins that help to assemble the Q:AH and DMP complexes in B. subtilis. With the constructed protein architectures, we searched for the POs in the proteomes of a set of 4,852 fully sequenced genomes, and found that the POs are confined into Firmicutes, and as previously described, are extensively distributed into endospore formers. The sporulation and nonsporulation phenotypes for this investigation were manually curated from the literature, which revealed that of the 954 Firmicutes, 26% of the phenotypes were inferred from other species that belong to the same genus and 8% of the species have an unknown phenotype.
As mentioned before, the prevalence of the proteins of the engulfasome among endospore formers, has caused them to be considered as sporulation signatures, and some of them prevailing up to 100% in some of the representative Firmicutes genomes [7,8,10,29]. Our results show that the representatives chosen to determine the prevalence of the orthologs that control sporulation in several cases do not reflect the preservation of the ortholog even between strains. However, as has been stated, the preceding discussion does not prevent determining a minimum set of genes that defines each stage of sporulation. Moreover, the results obtained in this work for the first time allow offering specific signatures to each genus based on the observed phyletic profiles. Additionally, the prediction of POs through our method and the construction of phyletic profiles are demonstrated to be a valuable tools that allowed the recognition of incorrect annotations. Finally, a selection of species based on nonredundant phyletic profiles used to reconstruct a phylogenetic tree results in a consistent clustering of species across genera and families, which enables us to estimate gain and loss events, which were necessary to adapt the Firmicutes genomes to the ever-changing environments.

Datasets
The genomic data for 4,852 completely sequenced bacterial genomes were retrieved from the KEGG Database (https://www.kegg.jp/) in April 2019 [39]. The database at this time stored the genomes of 954 Firmicutes, for which a sporulation phenotype was assigned as described later in the Methods section. From the KEGG-GENOME database, we extracted the taxonomy and the phylogenetic lineages taken by KEGG from the NCBI RefSeq repository (https://www. ncbi.nlm.nih.gov/taxonomy). We also extracted the amino acid sequences of each protein, and the KEGG gene identifier. From the KEGG orthologs (KOs) database [40], we extracted the groups of KOs.

Construction of protein architectures
The identification of POs is based on our own methodology [36] in which the first step, as shown in Fig 1, is to define the protein models, by consulting the pertinent literature to identify the proteins experimentally proven to be involved in the engulfment. We found that experimental evidence is available for C. difficile and B. subtilis. The selected proteins that serve as models, were those of the complexes Q:AH and DMP, the proteins SpoIIB, which were suggested to regulate the septal thinning during engulfment in B. subtilis [33], and GerM which is required to assemble the basal platform of the Q:AH complex also in B. subtilis [31].
The KOs of the model proteins were retrieved, and all the sequences related to the KO were employed to create a database in which we performed a hhmscan with HMMER v3.1b1 [41] using the profiles of the protein families stored in the curated PfamA database [42]. We preserved the sequences with the domains that have an expectation (E) value < = 0.001 since Pfam considers 0.001 to be a significant cutoff. This cutoff is actually used to reconstruct the clans and protein families in their update published in 2019 [42]. Then, we identified the domain that covers the longer length of each protein, and we found that some protein sequences, have two or more overlapping domains. When a protein presents two or more overlapped domains, we selected the domain that encompassed the longest length and a significant (E) value. An example is SpoIIQ, which has two domains: the RnfC barrel sandwich hybrid domain and the Peptidase family M23 [43,44] that encompasses the RnfC domain. Therefore, the Peptidase family M23 was selected as the representative domain, which was previously used for ortholog identification [28].
As some of the Pfam domains hold structural or sequence motifs, we also used this information to construct the protein architectures, for example, the protein SpoIIIAA of the Q:AH complex, which is a typical ATPase that preserves the Walker A and Walker B motifs, and the His and Asp boxes in B. subtilis and C. difficile [25,28]. We found that this protein could be represented by 229 Pfam profiles grouped in the P-loop_NTPase clan [42]. From this, according to the KO, the most extensive length, and an (E) values < = 0.001, we selected the profiles PF00004 (AAA), PF00437 (T2SSE), and PF03266 (NTPase_1), which were organized in four different architectures as shown in Fig 2. The inspection retrieved 14 Pfam profiles that can identify all the proposed proteins (Fig 2). In the case of the SpoIIIAG protein, we built a specific profile since the Pfam Spore_III_AF (PF09581), equally identified the sequences related to the SpoIIIAF and SpoIIIAG proteins. The profile was built with 27 nonredundant sequences selected by running the Cd-hit program [45], with a cutoff threshold of 80%. The seed group consisted of 73 sequences that have K06396. The 27 selected proteins were first aligned using MUSCLE [46] and converted to a Stockholm format using the online converter program reported in (http://sonnhammer.sbc. su.se/Stockholm.html). We then used hmmbuild, which is part of the HMMER v3.1b1 suite [41], to construct a profile for SpoIIIAG using hidden Markov models and default parameters. The SpoIIIAG profile (see File 1 via https://doi.org/10.6084/m9.figshare.13338677.v1) was then used to perform a hmmscan search.

Identification of POs using the genomic context
The identification of the homologs that correspond to a KO [40] is the first step to discern the paralogues. As expected, this is not the best criterion to identify POs since not all the proteins have an assigned KO. Considering this observation, we inspected the genomic context of the model proteins (see S1 Table in S1 File), as proposed in our previous work [36]. We limited the context to three genes upstream and downstream from the reference protein. To simplify the context inspection, the neighbor genes were also classified by their KOs. We selected as representative contextual KOs (RCKOs) the KOs that appear at least eight times (an arbitrary cutoff), and the full list of contextual KOs and their frequencies can be consulted in S2 Table in S1 File. The next step was to search for the RCKOs upstream and downstream from a protein with a predicted architecture. As expected, some of the homologs that have an architecture and others found in the RCKOs do not possess a predicted KO as shown in S3 Table in S1 File. For the predicted homologs in which the KO was missing the RCKO was used to define a PO. For the predicted homologs where at least one contextual KOs was not defined, the prediction was dismissed. In S2 Table in S1 File, we show each KO's frequency in the genomic context. The final rules used to consider a protein as a PO were as follows. 1. The protein should have a defined architecture that includes the most extended domain, and when known, the domain or domains should include the protein sequence motifs. 2. The protein should have at least one RCKO. 3. When a KO was assigned to the query, we used it, but this condition was not required. The selected RCKOs and KOs of the target genes are shown in S4 Table in S1 File. The results of the predictions that use the architectures and the rules mentioned above were represented in a presence/absence matrix of phyletic patterns in which rows correspond to species, and columns correspond to binary characters; "0" stands for absence, and "1" stands for presence (S5 Table in S1 File).

Sporulation phenotype compendia
In this work, we curated the sporulation phenotype of the 629 Firmicutes. The other 325 phenotypes from the 954 Firmicutes were recently published by our group [36]. To this end, we used two methods to acquire information. First, we searched the databases that offer curated phenotypes such as the Interactive Atlas for Exploring Bacterial Genomes (BacMap), the Genomes OnLine database (GOLD) and the Pathosystems Resource Integration Center (PATRIC). Second, we inspected the literature for the missing organisms in these databases [47][48][49]. Additionally, we searched for recent information to corroborate the phenotypes offered by these databases and found that most of the entries were recently updated. The curated phenotypes were added to the presence/absence matrix shown in S5 Table in S1 File. We also included in the matrix the links to the manuscripts or databases from which the phenotypes were curated.

Annotation of missing spoIIIA genes
We conducted a BlastX sequence search [50] in the upstream regions of the missing proteins of the spoIIIA operon into the genomes of the curated endospore former Firmicutes. We considered it to be a significant hit when the searched protein has as the best match the protein of the query organism or a nearly related species or a hit on multispecies, which included the searched species. We then scanned the translated sequences with the Pfam profiles stored in the PfamA database. In this case, we used an (E) value < = 0.01 since we wanted to recover the hits not found in the first hhmscan round. Despite the cutoff, the profiles always gave significant matches when found. The SpoIIIAA sequences that matched were aligned and inspected because we wanted to preserve the proteins that have the Walker A and B motifs and the His and the Asp boxes.

Methods for the identification of phyletic patterns signatures
To estimate the preservation of the phyletic patterns within each genus, we calculated a measure of aggregation or segregation, as evaluated by a Checkerboard score (C-score), which was originally used to measure the association between species pairs and that was adapted to test the presences (1) and absence (0) matrix of a protein (column) in a species (row). Thus, for any particular species/protein pair, the C-score is a numerical index that ranges from a maximally aggregated (a minimum of 0) to a maximally segregated with no shared proteins in the profile. [51]. We used the module cooc_null_model of the EcoSimR package, developed in R [51]. The EcoSimR package in our case conducted a protein cooccurrence analysis using null models, with parameters "sim9" and "c_score" that have satisfactory performance with most types of random matrices (= low Type I errors). As stated before, in the input data were the presence/absence matrix of each phyletic profile per genus with a number of species > = 2. We considered a positive result to be all the genus matrices that possess an arbitrary C-score < = 0.2.

Species tree construction
As a first step, all orthologous proteins of 168 organisms that have a distinctive phyletic profile into its family were identified with the bidirectional-best hit (BDBH) method when comparing proteomes using BLASTp [50], with a cut-off value of (E) value <1e-04. To consider that two proteins were orthologous, in addition to the BDBH criterion, the coverage of the region of similarity identified by BLASTp [50] should cover at least 50% of the length of the smallest protein of the pair of proteins compared. Once the set of orthologous proteins were identified, for the construction of the mega-alignment, these proteins found in at least 90% of the set of study organisms were chosen (see File 3 via https://doi.org/10.6084/m9.figshare.13338686.v1). These sequences were aligned using the MUSCLE program [46]. For all the alignments to have the same number of elements, the missing protein sequences in the alignments were replaced by a chain of dashes of the same length as that of the substituted proteins. Subsequently, the columns of the alignments that did not have at least 90% of the informative positions were eliminated. Finally, all the alignments were concatenated to form the mega-alignment that was used for the phylogenetic analysis. To reconstruct the tree, we used IQ-TREE multicore version 1.6.11 for Linux 64-bits, (maximum likelihood) with the aLRT correction (a standard likelihood ratio test approximation) with 1,000 replicates to find the best bootstrap distance model [52]. The tree that grouped 168 species was visualized and rooted with the iTol tool [53].

Gain and loss analysis
We used the Gain Loss Mapping Engine (GLOOME) to estimate the number of gain and loss events that have occurred across the Firmicutes chosen to construct the species tree over the course of the evolution of the proteins involved in engulfment [54]. The gain-loss analysis implemented by GLOOME integrates the presence-absence data for each gene of interest across the phylogenetic profile to estimate the posterior expectation of gain and loss across all branches. These events are then summed to calculate the total number of gene gain and loss events that have occurred for each family across the phylogenetic tree. We performed this analysis using the mixture model with a variable gain/loss ratio and a gamma rate distribution. The interquartile distance distribution of a branch length was calculated and used as a parameter to organize the gain and loss events from the root to the tips. This means that the first quartile (q1 25%) groups the shortest distances from the root, and q4 (75%) groups the longest distances.

Identification of the probable orthologs involved in the engulfment
At present, the number of organisms that can be currently analyzed has grown considerably, which makes it possible to exhaustively compare the genome composition of a vast number of organisms. In this work, we performed a comparative genomic approach to investigate the phylogenetic extent of the proteins involved in engulfment and other proteins necessary to assemble two of the complexes that form the engulfasome beyond the representative Firmicutes. Currently, Firmicutes are divided into seven classes, namely, Bacilli, Clostridia, Negativicutes, Erysipelotrichia, Limnochordia, Thermolitobacteria, and Tissierellia (source: NCBI taxonomy), but since we are working with fully sequenced genomes, the analysis excluded Thermolitobacteria.
To perform the analysis, we employed a methodology of our own that can discover distant POs based on the construction of protein architectures using Pfam profiles that describe the proteins of the Q:AH and DMP complexes that form a structure recently named engulfasome [9] and the accessory proteins SpoIIB, which is suggested to regulate the septal thinning during engulfment in B. subtilis [33], and GerM that probably localizes SpoIIQ in the Q:AH complex also in B. subtilis [55]. The procedure yielded 17 architectures sufficient to identify both complexes and the two accessory proteins. The profiles described in the Pfam database identified the homologs of sixteen of the seventeen proteins, except for the SpoIIIAG part of the Q:AH complex. SpoIIIAG displays remote homology to the "Ring Building Motif (RBM)" also present in the homologs of SpoIIIAF and SpoIIIAH [24]. To resolve this ambiguity, we constructed a profile for SpoIIIAG and found 415 hits, distinct from those proteins with a predicted SpoIIIAF and SpoIIIAH.
Once an architecture was assigned, the genomic context was inspected, assuming that the genes that are most similar at the sequence level should retain the ancestral gene-neighborhood after duplication [56]. We also assumed that some contextual information will be consistent with that found in the genomes of B. subtilis and C. difficile (see S1 Table in S1 File) organisms whose engulfment components are also well studied. We also defined the presence of other proteins frequently represented in the genomic context of homologs with a predicted KO (S2 Table in S1 File), which is a decision that may leave out some true positives, in which the context is poorly defined. Therefore, the selection of the final set of POs need both a) the prediction of a valid architecture and b) the presence of a valid genomic context. The application of these rules also allowed us to identify 297 POs with a KO not assigned (see S3 Table in S1 File). Homologs without a KO were found in the entire set of POs, except for the group of homologs defined by the architectures of SpoIID and SpoIIP.
At the end of the process shown in Fig 1, we proposed seventeen architectures that can be consulted in Fig 2, which also shows the number of predicted POs in the Firmicutes proteomes. From these, 15 architectures were constructed with a unique domain; however, proteins such as SpoIIIAA are represented by more than one Pfam. We identified four architectures that define SpoIIIAA; the SpoIIIAA_1 architecture was constructed with the NTP_1 domain (PF03266), the SpoIIIAA_2 architecture uses the NTP_1 domain in the tandem copy (PF03266), the SpoIIIAA_3 architecture is represented by an AAA profile (PF00004), and the SpoIIIAA_4 architecture uses the T2SSE (PF00437) Pfam. In the case of SpoIIIAA_2, the tandem copies of NTP were necessary to cover the Walker A and B motifs and the His and Asp boxes [25]; the detailed results of the Pfam organization for each species are shown in the S6 Table in S1 File. Other proteins that need tandem copies for PO recognition are GerM, which was previously reported to have two copies called GerMN1 and GerMN2 [55], described by the Pfam profile (PF10646) called Germane. Another protein that need a tandem copy for a full detection was SpoIIIAD that has SpoIIIAC (PF06686) in tandem copies.
As we demand that the domains that define SpoIIIAA have the Walker motifs, and the His and Asp boxes, we aligned the hhmscan hits, and found 43 proteins in which the profile just covered the Walker A or Walker B motifs. Nonetheless, the inspection of the alignment showed that the rest of the motifs were present in the protein sequences, except for two species of Clostridia, specifically, Heliobacterium modesticaldum and Clostridium botulinum H04402 065, which have two contiguous small proteins that hold the expected motif.

Distribution of the probable orthologs beyond the representative Firmicutes
We searched for the PO involved in engulfment by inspecting, the 4,852 bacterial proteomes, which were found only in the proteomes of 954 Firmicutes. The results were then organized in a presence/absence matrix of phyletic profiles, which were associated with a curated sporulation phenotype (S5 Table in S1 File). The observed phenotype frequencies grouped by the phylogenetic class are presented in Table 1. The table is also organized by considering the types of evidence that describe the phenotype. As a result, two groups show the number of species with phenotypes and present rigorous experimental evidence which was as sporulating or nonsporulating. Two other groups show the species for which the projects described a sequenced genome, whose sporulating or nonsporulating phenotype was inferred from typical morphologies that describe the family or genus, as in the case of the Streptococcaceae family described as a group of non-endospore formers [57]. Finally, we found a fifth group that included organisms with unknown phenotypes, and this group has members in which in the family or genus of both sporulating and nonsporulating species have been observed. Both nonsporulating types include "asporogenic" organisms, which are defined as bacteria with an impaired sporulation process but contain the majority of sporulation genes, and "nonsporeforming" that include organisms with an absence of sporulation-specific genes [5].
Both curated phenotypes and POs were used to analyze the distribution of the studied proteins. The results presented in Fig 3a show that the endospore formers have the highest proportion of preserved POs (median = 290 POs from 305 species), but this proportion decreases in non-endospore formers (median = 29.5 POs from 318 species), as seen in Fig 3b, which is an observation that is consistent with the findings reported by Galperin [6,29], Abecasis [7] and recently, Ramos-Silva [8]. Previous studies and ours also show that even in proteins whose genes are structured in operons, as is the case of spoIIIA, the ortholog proteins are not 100% conserved (media = 297.5, n = 305). The POs of SpoIIQ (286 from 305) and the DMP complex (median = 281, n = 305) are also preserved in corroborated endospore formers, but as seen, the orthologs of the DMP complex tend to be absent in almost 24 species. The POs of SpoIIQ, SpoIID, and SpoIIM were significantly less identified (median = 22.5, n = 318) than those of the POs of spoIIIA.
We also separately analyzed the distribution of the POs in 251 species without an experimentally confirmed phenotype. These species that have an inferred sporulating phenotype (n = 32) (Fig 3c) have more than 75% of the calculated POs, except for Sporosarcina sp. P37, Sporosarcina sp. P33 and Lysinibacillus fusiformis RB 21, which have a PO prediction < = 60%. All of these species lack SpoIIIAA, SpoIIIAB, SpoIIIF, SpoIIM SpoIIB and GerM (S5 Table in S1 File). A previous work shows that Sporosarcina sp. P37 and Sporosarcina sp. P33, which belong to the same clade and are similar to other cocci-shaped Sporosarcina, lack 38% of the sporulating genes [58]. However, contrary to this work, we do not predict an ortholog for Spo0A in Sporosarcina sp. P33. In the case of Lysinibacillus sphaericus C3-41 that can form spores [59], sporulation genes such as bofC, spmA, spmB, sda, spoVAA, spoVAB, spoVID, tlp and yqfC involved in several stages of the differentiation process are absent; however, our method identified the product of spoIIIAD (SpoIIIAD), which was previously described as absent [29]. In the category nonsporulating-inferred, from the 290 species (Fig 3d), Eubacterium hallii EH1 (Anaerobutyricum hallii EH1), a species of the class Clostridia, is the only species that presents 10 predicted POs. This includes the Q:AH complex that excludes SpoIIIAB and SpoIIP from the DMP complex. From the 80 species with an unknown phenotype, the predicted POs are represented on average in 37 of the observations (S1 Fig). The work of Abecasis demonstrated that 75% of the genes that form a signature are preserved in endospore formers, which is a feature that is clearly distinguishable from non-endospore formers that have less than 10% the genes preserved [7]. Considering this observation and those presented in this work for the spore formers with experimental evidence, we consider that in our set, 21% (n = 17) of the studied organisms with an unknown phenotype should be spore formers; however, experiments that prove the phenotype should be performed. The list of potential endospore formers is shown in S7 Table in S1 File. The table shows the distribution of the curated phenotypes of 954 manually curated species of Firmicutes. Column 1 shows the classes that group the curated species.
The frequency by class of the sporulating and nonsporulating species are shown in columns 2 and 4, which describe the species that were tested in wet laboratory conditions to define the phenotype. Columns 3 to 5 show the total number of sporulating and nonsporulating phenotypes by class as inferred from other species, and column 6 presents the frequencies of the class of species whose phenotype is not yet defined. At the end of each column, we preset the total number of species that have the curated phenotype. https://doi.org/10.1371/journal.pone.0246651.t001

Missed annotations of the spoIIIA operon
The results of the previous section showed that the spoIIIA operon tends to be entirely represented in spore formers. We also showed that the organization of the operon spoIIIA is an essential trait to identify POs, since it preserves the organization of the genomic context. While inspecting the genomic contexts of spoIIIA, we noticed that some intergenic regions are sufficiently sizeable to hold genes. To corroborate if these sizeable regions were a product of incomplete annotations, as described in the Methods section, we performed a BLAST search using BlastX [50], to find missed annotations in 46 endospore formers that lack one or more POs. The predicted proteins were then subjected to a domain search using hhmscan to corroborate the presence of the architectures. The procedure yielded 33 proteins whose annotations were missed in the 32 genomes of the 46 tested (see S8 Table in S1 File). The examination also revealed in five species alternative Pfam domains not considered by our method for the SpoIIIAA protein (PF05621, PF13401, PF13191, and PF13481), since the prediction did not fulfill the proposed selection criteria; this result suggests that a step of manual curation is always necessary to improve the predictions. Remarkably, the new annotated homologs of SpoIIIAA were concentrated in Clostridia. Meanwhile, the homologs of the SpoIIIAE to SpoIIIAH proteins were frequently distributed in Bacilli in different proportions. The SpoIIIAF protein has the highest proportion of homologs (0.93% of the 14 hits were distributed in Bacilli (13 hits), and 0.07% were distributed in Clostridia (one hit)). From the tested set, in Bacillus sp. OxB-1, Jeotgalibacillus malaysiensis D7, Lysinibacillus sp. B2A1, Lysinibacillus sp. YS11, Lysinibacillus sphaericus C3-41, Paenisporosarcina, Rummeliibacillus stabekisii PP9, Solibacillus silvestris DSM 12223, Solibacillus silvestris StLB049, Sporosarcina psychrophila DSM 6499, and Sporosarcina ureae P32, we could not find the missing homologs of the spoIIIA operon.
As in the case of the prediction made using the proposed protein architectures, the RCKO criteria was applied to inspect the upstream or downstream contexts of the blasted sequences. For example, the presence downstream of an acetyl-CoA carboxylase (K01961) from the spoIIIAH gene, reinforces the prediction, since this feature is also shared by the 261 other genomes that have SpoIIIAH.

Signatures beyond representative organisms
The exhaustive examination beyond the representative species may show differences in the phyletic profiles of closely related species. If so, then the observed distribution will generate several questions about the origin and evolution of the sequences that encode the predicted orthologs in the not previous explored species. Therefore, we addressed these two questions, using a curated compendium of experimentally confirmed endospore formers, and the prediction of POs (S5 Table in S1 File). The prediction of the orthologs into a wide set of confirmed phenotypes, is crucial bearing in mind, that several genome projects related to Firmicutes do not prove whether the respective strains can form spores and that several of them infer the phenotype from phylogenetically related strains.
To evaluate the extent of an endosporulation genomic signature among the species proven to sporulate, we organize each genus by its correspondent class, in order to facilitate an inspection of the results. The organization of the major classes of sporulating Firmicutes (Bacilli and Clostridia), as shown in Figs 4 and 5, indicates that these proteins tend to be preserved into the genus. Despite this, a vast number of them lack at least one PO involved in the engulfment. In 2012, a work of Galperin [29], showed that the Q:AH and the DMP complexes are fully present in sporulating Bacilli, and that particularly the SpoIIIA-SpoIIQ 'zipper' should be a common feature of bacillar sporulation. However, the results presented in Fig 4, are more similar to the findings of Ramos-Silva [8], in which the complex is present above 50%. A more precise calculation of the frequency of the representation of the PO per genus of all classes in our results, shows that 98% of the genera preserve at least 50% of the predicted POs. As seen in Fig 5, a search into this wider set found that Acetoanaerobium and Lachnoanaerobaculum, that have only one species and which belong to Clostridia, are the only genera in which any of the POs was identified. The remaining 31 confirmed endosporulators that have one representative preserve at least 50% of the predicted POs for Bacilli and Clostridia, as seen in Figs 4 and 5. The remaining 61 genera have at least 2 representatives. The genus Paenibacillus, for example, has 34 representatives that stand out for being the only Bacilli in which a species, Paenibacillus bovis, none of the tested proteins was detected, despite having Spo0A and a sporulating phenotype [60]. In contrast, as observed in Fig 4, from the 97 species grouped in Bacillus, nine do not conserve SpoIIP orthologs, and four of them are strains of Bacillus megaterium. Other Bacillus, as seen in Fig 4, also lack SpoIIP, such as those of the genera Fictibacillus, Terribacillus, and Alycibacillus, and four species belong to Paenibacillus. The proteins of the DMP complex are also absent among other Bacilli, such as Jeotgalibacillus, Sporosarcina, Paenisporosarcina, Solibacillus, Rummeliibacillus and Lysinibacillus, and all of these consistently lack SpoIIM. The genera organized into Clostridia present similar results as seen in Fig 5; nonetheless, the percentage of the missing POs in the DMP complex, is under 0.9% on average and is concordant with Abecasis' results [7], which consider the orthologs of these complex as signatures.
The GerM protein, a lipoprotein previously implicated in spore germination, is required to assemble the basal platform of the Q:AH transenvelope complex during sporulation in B. subtilis [31]. The GerM protein was missing in sixteen species of Bacilli, 95 Clostridia, and the two sporulating Negativicutes. The POs of the GerM in Bacilli were not found in the species of the genera Jeotgalibacillus, Sporosarcina, Paenisporosarcina, Solibacillus, Rummeliibacillus, and Alicyclobacillus. Additionally, only one species of the genera Bacillus and Paenibacillus, lacked this protein (see Fig 4). In Clostridia, several genera lacked a PO of GerM; this was not the case for all the species that form the genera Hungateiclostridium and Ruminiclostridium. We observed a PO of GerM in nine genera represented by only one species. No orthologs of GerM are present in C. difficile. As shown here, a significant number of species of Clostridia and less in Bacilli have an undetected PO. The absence of this protein in such species may imply the existence of a simplified version of the Q:AH complex or that other not yet identified mechanisms will be shaping the engulfasome.
We did not predict the PO of SpoIIB in 78 Bacilli. The POs were also missing in 67 sporulating Clostridia, in two Negativicutes and Gottschalkia acidurici 9a, which is classified as a Tissierellia. From Clostridia, the genus Clostridium is the more populated; from its 46 species, only nine do not have an identified PO of SpoIIB. Other genera of Clostridia such as Desulfitobacterium and Desulfosporosinus have more than one representative, and all species possess a predicted PO of SpoIIB. It has been suggested that SpoIIB may have been added late to the engulfment machinery of B. subtilis, which serves to increase the speed and efficiency of engulfment [32], but this observation may suggest that this protein is perhaps not required by some species to complete engulfment.
Considering this observation, we performed a test to measure the concurrency of the phyletic profile per genus, as described in the Methods section, to determine the prevalence of the studied proteins in the proteomes of the spore formers. The Spo0A response regulator was included in the phyletic profile since it has been mentioned as a sporulation signature of stage 0 in several publications [7,29,36,37], which is stage prior to engulfment. The results in Table 2 show the genus that presents phyletic profiles with a significant C-score in genera that have two or more species. The analysis yielded 28 genera with a defined phyletic profile signature. Remarkably, even in the genera that have the largest number of members (Bacillus (n = 97), Clostridium (n = 46), Paenibacillus (n = 34) and Geobacillus (n = 16)), a robust signature was identified. The C-score in the genera Desulfitobacterium and Solibacillus was not significant. The results probably obey to a dissimilarity in the PO distribution observed within Desulfitobacterium dehalogenans ATCC 51507, which lacks SpoIIIAA and SpoIID in contrast to the rest of the members of the genus. Two other species, Desulfitobacterium metallireducens DSM 15288 and Desulfitobacterium dichloroeliminans LMG P-21439, also lack GerM, which affects the distribution across the genus. In contrast, two species grouped as Solibacillus (both from the species silvetris) lack several POs, which justifies the observed result.
The studied set presents 33 genera defined solely by a species (S9 Table in S1 File) a condition that makes the determination of reliable signatures unfeasible. In this case, the inclusion of new fully sequenced species in each genus will help to proposed phyletic profile signatures.
Finally, to test if the proposed signatures can predict sporulating phenotypes, we tested them on the species with an inferred and unknown phenotype. We found that from the 31 species with a sporulating-inferred phenotype, four do not follow the proposed phyletic profile signature (S10 Table in S1 File). The four species belong to different genera (Bacillus, Clostridium, Geobacillus, and Paenibacillus). A result suggesting that our signatures are well preserved, and these species do not sporulate. We applied the same procedure to 29 species that belong to the unknown phenotype. We found that in the species of the genera Bacillus, Lachnoclostridium, Lysinibacillus Sporosarcina, Thermoanaerobacterium and Virgibacillus, the proposed phyletic profile signature was preserved. In the genus Paenibacillus, one of the four species (Paenibacillus sp. IHB B 3084) has a different phyletic profile organization from that proposed as a signature. For this reason, we predicted a nonsporulating phenotype; in this case, experimental evidence should be collected to verify this result. Additionally, we assign a nonsporulating phenotype to two species of Ruminiclostridium, both having distinct phyletic profiles. The signature for Ruminiclostridium, was constructed using two species with a reported phenotype; in this case, two species are perhaps not sufficient to establish a signature; however, several  The phyletic profile signatures are shown for a genus with more than two species. Columns from one to five shows in the following order, the genus grouping the species, the class to which the genus belongs, the number of species per class, and the "Checkerboard score" (C-score) < = 0.2 were considered to be significant, and a C-score > = 0.2 was not considered to be significant; a phyletic profile signature is proposed.

Reconstruction of the phylogenetic tree based on the engulfment phyletic profile
The reconstructions of the Firmicutes species trees based on the markers of the 16S gene and ribosomal proteins are often formed by the selection of representative strains. Some of the selected strains are bacterial models that have peculiar phenotypes or traits and biotechnological or medical model organisms. When the biological parameters of selection are not available, the use of a clustering sequence analysis based on their similarities [45], will provide a set of nonredundant sequences that are expected to be able to classify living organisms into taxa that should be consistent with the true phylogenetic tree [61]. Since the phylum Firmicutes is polyphyletic [57] and the species are prone to horizontal gene transfer, the phylum taxonomy is continuously reorganized, especially the Clostridia class [62,63], which comprises various species of a gram stain-negative cell wall structure [64]. The aforementioned descriptions along with those that claim that the genes related to sporulation have been frequently lost throughout the evolution of Firmicutes [7,8,29,37,65], motivate to construct a Firmicutes species tree in which the organism selection was based on the organization of the phyletic profiles of the proteins involved in the engulfment and on the curated phenotypes, this means that a genus or a family may have more than one representative. We assume that a selection that uses species that have characteristic phyletic profiles, even if they belong to the same genus or family, will provide new clues that describe the evolutionary history of genes involved in the engulfment.
To test this hypothesis, we concatenated 585 proteins preserved in 90% of the species to construct a species tree. This final tree was then compared with some previously proposed Firmicutes trees [8,57,[62][63][64]. Our final tree clustered 167 Firmicutes including 99 sporulating species, 5 species with an inferred sporulating phenotype, 44 nonsporulating species, 19 species with an unknown phenotype, and one outgroup. The phyletic profile associated with each species is represented next to the tree as shown in Fig 6. The predictions calculated for Spo0A for 259 Firmicutes [36] and complemented with the genes related to K07699 assigned by KEGG to Spo0A were included in the represented phyletic profile. These additions were made to ensure the presence of a protein considered to be a signature involved in the preceding stage of sporulation.
As seen in Fig 6, the tree is well resolved at almost all nodes, in which as previously reported [8,37,63], the classes Bacilli (n = 71) and Clostridia (n = 87) are separated. The analysis yielded a first split that separates Clostridia into the two groups GI and GII. GI forms four clusters that harbor members of the families Leptotrichiaceae, Thermoanaerobacterales Family III. Incertae Sedis, Gottschalkiaceae, Clostridiaceae, Peptostreptococcacea, Ruminococcaceae, Oscillospiraceae, Lachnospiraceae, Eubacteriaceae, and two unknown families. The tips resolution shows the species consistently clustered, except for Thermoanaerobacterium thermosaccharolyticum DSM 571, Clostridium sp. BNL1100, Geosporobacter ferrireducens IRF9, and Clostridium saccharolyticum WM1, which appeared in branches out of the reported family. GIa is divided into two clusters, the first one, GIa1 including Thermoanaerobacterium thermosaccharolyticum DSM 571, Gottschalkia acidurici 9a a member of the class Tissierellia, class, which as reported forms a cluster with the members of Clostridiaceae and Peptostreptococcacea [62], and all of them lacking SpoIIB. Geosporobacter ferrireducens IRF9 that branches with Gottschalkia acidurici 9a share the presence of the POs of GerM, as opposed to the rest of the members within the clade. Acetoanaerobium sticklandii DSM 519 is also grouped in GIa1, which was recently transferred to the Peptostreptococcacea family [66]. Despite being reported as a spore former (see S5 Table in S1 File) [67], our analysis shows that this bacteria preserves neither the genes that encode the engulfasome and the accessory proteins nor Spo0A, a trait seen in some Firmicutes, an example is Limnochorda pilosa a genome that lost the genes involved in sporulation although they have been regarded as essential for the endospore forming system [68]. A second branching grouped the family Clostridiaceae, which all form spores and lack GerM but preserve the rest of the studied proteins, except for Clostridium tetani E88 for which we did not find a PO for SpoIIB.
GIb holds two groups of Firmicutes, a set of nonsporulating species of family Thermoanaerobacterales Family III. Incertae Sedis, which is represented by a group of Caldicellulosiruptor that share a last common ancestor with a branch of sporulating Ruminococcaceae and Clostridium sp. BNL1100. In contrast to what is reported [69], and looking at the bootstrap value (27.2%), this cluster is probably not well resolved. However, the internal branches that grouped Hungateiclostridium clariflavum DSM 19732 and Thermoclostridium stercorarium subsp. stercorarium DSM 8532, are consistent, with previous reports. These species previously named Clostridium, are subject to fast evolution [65], which may explain the bootstrap value and missed locations of Thermoanaerobacterium thermosaccharolyticum DSM 571, placed near the root.
GIc harbors two groups of Ruminococcaceae of the genus Ruminococcus, Faecalibacterium and Ruminiclostridiu, an Oscillospiraceae and two species with an unknown family. This cluster displays an important number of species with an unknown phenotype characterized by preserving the POs of the spoIIIA operon except for Flavonifractor plautii YL31, that lacks POs of the SpoIIQ, SpoIIB, GerM proteins and the POs of the DMP complex. Notably, all the species in this group lack SpoIID and SpoIIB, and the rest of the proteins of the DMP complex are almost missing in all species but are better preserved in the sporulating ones.
GId which groups the species from the families Lachnospiraceae and Eubacteriaceae (n = 19) shared a common ancestor; however, two endospore formers, specifically, Cellulosilyticum lentocellum DSM 5427 and Anaerotignum propionicum DSM 1682, are grouped separately from a larger clade that clustered two Eubacteriaceae (Eubacterium hallii EH1 and Eubacterium eligens ATCC 27750), previously reported to be grouped within Lachnospiraceae [70,71]. Twelve of the species clustered within this clade are not endospore formers; nonetheless, most of them have preserved POs of the spoIIIA operon and all conserve a PO of SpoIIP, but they lack POs of SpoIID, SpoIIB and GerM. Remarkably, the nonsporulating species were also split from the sporulating ones, which holds Eubacterium rectale DSM 17629 that has an unknown phenotype but a different phyletic profile.
The second split, GII, shows several divisions that harbor a group of Negativicutes and the families from the order Halanaerobiales (n = 4), which belong to the classically monoderm Firmicutes that possess outer membranes with lipopolysaccharide [64]. GIIa, which groups Tepidanaerobacter acetatoxydans Re1 and Thermosediminibacter oceani DSM 16646 belong to different families. The families Natranaerobiaceae, Halobacteroidaceae and Halanaerobiaceae, grouped in GIIb, are consistent with previous reports [6,64]. All of the species do not form spores, despite they have POs of the DMP and Q:AH complexes. The members of families Halobacteroidaceae and Halanaerobiaceae are diderm bacteria with a gram negative-type cell envelope a characteristic shared with the families that belong to the class Negativicutes and Limnochordia [64,72]. A recent phylogenetic bacterial reconstruction based on outer membrane proteins, shows that these families shared a diderm common ancestor [72]. Our tree reconstruction agrees with the reconstruction made with the just mentioned work except for Natranaerobius thermophilus JW/NM-WN-LF that was grouped with uncultured species of bacteria and archaea.
Consistent with other reports the class Negativicutes, (n = 6), is branched within the class Clostridia [6,63,64], in group GIIc. The Negativicutes are branched in clades that separate sporulating from nonsporulating species for which any PO was predicted. As seen in reported reconstructions, the proposed trees and the one reconstructed in this work relate the Negativicutes to the families of Peptococcaceae and Thermoanaerobacteraceae (GIId) [6,63,64]. GIId also holds species of the families Syntrophomonadaceae and Heliobacteriaceae, which are reported to cluster with Peptococcaceae and Thermoanaerobacteraceae [69]. GIId split Peptococcaceae into two groups, GIId1, in which all the species lack the POs of SpoIIB and GerM except for nonsporulating Thermincola potens JR. GIId2 housing the second group of Peptococcaceae and the families Thermoanaerobacteraceae, Heliobacteriaceae, and Syntrophomonadaceae. The two last ones present the same POs organization patterns. All the Peptococcaceae in this branch lack GerM, not being the case of Desulfosporosinus meridiei DSM 13257 and Desulfitobacterium hafniense Y51.
GIe clustered Limnochorda pilosa HC45T, a gram-negative stained bacteria that represents the family Limnochordaceae, and this strain is clustered with Thermaerobacter marianensis DSM 12885T, Sulfobacillus acidophilus DSM 10332T and Symbiobacterium thermophilum IAM 14863T, which have a high GC content except for Sulfobacillus acidophilus TPY that has a lower GC content [68]. The four species are spore formers. Several sporulation genes are absent in Limnochorda pilosa [68], for which we observed only the absence of a PO of SpoIID. For its part, Sulfobacillus also lacks a PO of SpoIID and SpoIIB, which is a characteristic shared with Thermaerobacter that also lacks the POs of GerM. All of the species of this group except for Thermaerobacter have a complete version of the spoIIIA operon for which SpoIIIAH is missing in this species. In this cluster, SpoIIQ is missing only in Sulfobacillus acidophilus TPY.
The class Bacilli contains nine families that are organized in our tree in seven groups. GIIf clustered the spore formers of the family Alicyclobacillaceae in which Alicyclobacillus acidocaldarius subsp. acidocaldarius DSM 446 has undetected POs of SpoIIP, SpoIIB and GerM, as opposed to the rest of the members of the group that have a complete set of the studied proteins. This result contrasts with another report in which the orthologs of SpoIIP and SpoIIB were not found in Kyrpides tuscia DSM 2912 [29]. The next division shows Novibacillus thermophilus SG-1 located as a singleton within Bacilli in the family Thermoactinomycetaceae; in another report, this species group with another strain that forms a separate node from the rest of the Thermoactinomycetaceae [73], and in this nonspore former, we were not able to detect the POs of SpoIIIAA and SpoIID.
The family Paenibacillaceae appeared in our tree as divided into two branches, and GIIg grouped the species of the genera Paenibacillus and Thermobacillus in the same branch consistent with previous reports [8,69,74]. In this group, Paenibacillus bovis is a spore former that lacks all the studied proteins. A study in which the conservation of modularity was inspected in several Firmicutes showed that Paenibacillus bovis has a relative low conservation to the average of the sporulation genes across the analyzed species [74]. GIIh grouped Paenibacillaceae of the genus Brevibacillus and Aneurinibacillus in this cluster, and the nonsporulating Brevibacillus formosus NF2 that lacks the POs of SpoIIB and SpoIID.
Within GII, Exiguobacterium sibiricum 255-15 as Paenibacillus bovis is another species for which the analysis does not reveal the POs of the studied proteins, and similar to Paenibacillus bovis, the study based on modularity showed a lower conservation of genes depending on σ E and σ F expressions [74], that as mentioned these sigma factors control genes that are involved the engulfment in Bacilli and Clostridia. In this GII mega-cluster, the families Planococcaceae and Bacillaceae have several genera that form independent clusters. GIIi holds species of the family Bacillaceae that mainly lack the orthologs of SpoIIP and SpoIIB. GIIj preset Lentibacillus amyloliquefaciens LAM0015, a nonspore former that encodes the POs of all the proteins in its genome but lacks Spo0A, which probably prevents the onset of sporulation. In GIIj, we also found sporulating Bacillaceae from the genera Virgibacillus and Oceanobacillus in which Virgibacillus necropolis LMG 19488 lacks SpoIIB. We found in GIIk 16 species of Bacillus (GkII1,2,3), two Geobacillus (GIIk3) and one of Anoxybacillus (Anoxybacillus flavithermus WK1) (GIIk3), which are consistently grouped [69].
GII(I) group members of the families Bacillaceae and Planococcaceae have a branch that clusters congruently with the genus Lysinibacillus (n = 3) and Solibacillus (n = 3), respectively [69]. Lysinibacillus is known to be highly polyphyletic [75], which is a feature preserved in our reconstructed tree. Other genera such as Rummeliibacillus, Paenisporosarcina and Sporosarcina are congruently resolved [75]. The Sporosarcina clade includes within species from the genus Sporosarcina along with Bacillus sp. OxB-1 [75]. Despite forming spores, all the members lack several genes that encode the POs of the spoIIIA operon (SpoIIIAA, SpoIIIAB and SpoIIIAF in all species and SpoIIIAD in five species) and all the POs of SpoIIM, SpoIIB and GerM. It is remarkable the absence of Spo0A in Sporosarcina sp. P33, for which the KEGG database did not assigned a KO, and we did not test by our method [36], so an update of our data will corroborate this result. Jeotgalibacillus malaysiensis D5 possess POs of SpoIIIAA and SpoIIB, this is not the case of SpoIIIAC and SpoIIIAE, which are missing in the profile, a result that contrasts with the rest of the observations within the group.
As seen, GIIm clusters a group of species that belongs to the families chosen because all the species are nonsporulating, except for Auricoccus indicus S3 in which SpoIIIAB was predicted, indicating that maybe the gene that encodes this protein was gain as a consequence of the adaptive forces prevalent in the phylum. The species of the class Erysipelotrichia (n = 4), are consistently branched near Staphylococcaceae [8,69], but the omission of the Tenericutes (Mollicutes) species in our tree prevents seeing the remote relationship between Erysipelotrichia and Mollicutes. Erysipelotrichaceae bacterium I46 does not form spores, and Turicibacter sp. H121 has an unknown phenotype; they preserved the POs mainly related to the DMP complex, SpoIIB, GerM, and Spo0A but lack most of the POs encoded in spoIIIA, suggesting that the absence of these gene products and the sigma factors regulating their expression [8] underline the versatility of strategies to adapt to different environment.

Gain and loss events are common in the proteins involved in the engulfment
The patchy distribution observed in the phyletic patterns across the 954 Firmicutes and the prediction of orthologs in previous works [7-9, 29, 65], show that the sporulation genes, even those defined as signatures, may be missing in some species. Several authors postulated that horizontal gene transfer and loss events may be important evolutionary forces of the sporulation genes among Firmicutes. Therefore, we also sought to analyze the expected number of the gene gain and loss events of the proteins involved in engulfment across the species used to reconstruct the phylogenetic tree by using GLOOME, which is a tool that provides accurate estimates of the expectations and probabilities of both gains and losses [54]. GLOOME infers the position of the gene gain and loss events across a phylogenetic tree but does not consider duplication and speciation events. To evaluate the GLOOME results, the distance from the root to the tips was distributed in quantiles (q n ). The first quantile (q1) represents the nodes that have the shortest distances from the root, and q4 represents the nodes with positions frequently on the tips (S11 Table in S1 File). As a result, we found that the nodes and branches distributed in q 1 have 133 gains and 441 loss events, with a major loss event of GerM in Clostridia (q1), and as seen in Fig 7b, the loss of GerM in Clostridia seems to occurred in the last common ancestor of the group, but a set of few gains appeared in q1, less in q2, and several gains in q3, which can be seen in Fig 7a. These gains were preserved among most of the sporulating Bacilli.
An important gain event of SpoIIB can be seen in Clostridia (q1), as also some gains in Bacilli in q1 and q2. The results suggest that these gains were acquired after a major loss event that occurred twice in nodes placed in q1 a q2. A minor loss of SpoIIP is observed, in the nodes and branches distributed in q1 also in Bacilli. The orthologs of SpoIIM were also frequently lost in Clostridia in the branches grouped in q3 and q4 but not in q1, which suggests that this protein was present in the last common ancestor of the species within this class similarly to the POs of the genes encoded in the spoIIIA operon, which were lost more recently (q3 and q4). SpoIIQ had several losses across the tree that were more frequent in the branches positioned in q3 and q4 in species related to Clostridia that are located in the GII branch of the tree. Notably, the gains observed in Negativicutes in q2 were then lost in the nonsporulating species in the nodes and branches distributed in q3. The loss events found in Erysipelotrichia were seen in ancestral branches located in q2 that began in the asporogenous lineages of the class Bacilli (Listeriaceae) with an important gain (q4) in Erysipelotrichia in all of the genes encoding the studied proteins except for SpoIIIAA, SpoIIIAB and SpoIIIAF for which the event was not observed, a result that is consistent with that found in a previous work [8]. Important losses of the SpoIIIAA, SpoIIIAB, SpoIIIAD and SpoIIIAF were also seen in Bacillus in q1, this losses correspond to species of the family Planococcaceae and some species of the genus Lysinibacillus in which several of them form spores. The results found in Planococcaceae are concordant with a previous report that shows an extensive loss in the genus Solibacillus [8], which are extended in this work to other Planococcaceae (Fig 6). In the same family, the sporulating species Jeotgalibacillus malaysiensis D5, presents in q2 a gain of SpoIIIAA and the loss of SpoIIIAC and SpoIIIAE; this result shows the need for experimental studies in Firmicutes other than C. difficile and B. subtilis, which explain the mechanisms used by these organisms to carry out the sporulation process.

Components of engulfment beyond representatives
The analysis conducted in this work based on the similarity of the proteins involved in engulfment and the proteins SpoIIB and GerM showed that the distribution of these proteins varied among the 954 studied Firmicutes, even in the species that produce spores. These deviations have been better studied in B. subtilis and C. difficile whose experimental evidence shows differences in the regulation and mechanisms and in the genes needed to promote engulfment [1,18,19]. The differences were also found in other stages of sporulation that have been the object of study by groups that, given the lack of or little experimental evidence in other species, have been motivated to search for these differences by sequences comparison, and some of them have found genes that serve as sporulation markers or signatures [5, 7-9, 29, 76]. However, the prevalence of sporulation markers among the "asporogenic" species of Firmicutes and the absence of some of them in sporulating species generate important questions regarding their evolution in the phylum. Particularly, in the cases of the genes involved in engulfment, several studies have identified the products of the spoIIIA operon as part of the core genes conserved in endospore formers [5, 7-9, 29, 76]. Our findings that consider 954 Firmicutes show that the genes of the operon prevail in the highest proportion of endospore formers in contrast to the genes that do not form spores, which is a result shown in other reports. In a similar work conducted with 258 representative Firmicutes [8], the authors showed the absence of SpoIIIAF in the genera Paenibacillus (n = 4), Thermobacillus (n = 1) and in Solibacillus silvestris StLB046. Using a wider data set and our developed methodological approach, we found differences between our results and those presented in this similar work. For these aforementioned genera, we predicted the POs of SpoIIIAF in eleven tested Paenibacillus and in one Thermobacillus and an agreement with the absence of SpoIIAF in Solibacillus silvestris StLB046. We also found that the Solibacillus species included in our study and other members of the family Planococcaceae also lack the POs of this protein. Remarkably, other absent orthologs of the spoIIIA operon in Planococcaceae were SpoIIIA-B and SpoIIIAG; SpoIIIAG was searched with the HMM profile constructed in this work, which efficiently recovered the SpoIIIAG orthologs. In B. subtilis, it is suggested that the role of the proteins encoded in the spoIIIA operon sustain transcription activity of σ G in the forespore [25], so the mechanism used to promote the engulfment by the analyzed Planococcaceae and the members of the genus Lysinibacillus remains to be elucidated.
The orthologous genes that encode SpoIIQ and the proteins of the DMP complex are better conserved in sporulating Bacilli and Clostridia, and the frequencies are lower than those found for the PO of spoIIIA, especially in Clostridia. These results suggest that the mechanisms proposed for the engulfment progression may be different in some of the analyzed species, as for example the genera Sulfobacillus and Lachnoclostridium, which lack POs of SpoIIQ and SpoIID, which in C. difficile were proven to have strong interaction, however, some cells carrying the spoIID deletion seem to be able to partially initiate engulfment [27].
The finding in 24 representatives that was recently reported [9], shows the absence of the orthologs of SpoIIP and SpoIIM in only two species, namely, Symbiobacterium thermophilum and Lysinibacillus sphaericus, respectively. Our study detected the POs of DMP in Symbiobacterium thermophilum, but the results for Lysinibacillus sphaericus prevail in this species as in other two analyzed Lysinibacillus. The observed results may lead to a hypothesis that the same selection pressures affected the patterns observed for the complex in these species.
Comparisons of our extended search with the results found in other works that use a limited number of species to represent each genus showed that the phyletic patterns may change within clades, and as discussed in other studies [7,8,29], a few sets of genes that participate in this complex differentiation process can be considered to be permanent signatures. Moreover, the preservation of the sporulation genes in asporogenic species still make it difficult to infer that a species is an endospore former solely based on the presence of the so-called signatures.

Signature based on phyletic profile
The employed approach that uses HMM profiles and the preservation of the RCKOs are powerful tools to predict distant orthologs, as they allow the extension of the number of POs across a wider set, and we predicted orthologs not found using other methodologies. Moreover, the careful inspection of the genomic context was fundamental to identifying missed annotated genes that belong to the spoIIIA operon and showed that the genes of the spoIIIA operon were preserved on average in 97% of the endospore formers, which confirms that these proteins and the proteins of the DMP complex were also preserved on average in 97% of the endospore formers and are, as previously described, sporulation signatures [7,29]. This observation is more consistent for the species in the class Bacilli, however as other authors discussed these genes as others define as signature are not a strong trace allowing to infer the sporulation phenotype in Firmicutes. Despite this, we used the distribution of these signatures into each genus to estimate the probability that a given phyletic profile can function as a signature of this sporulation stage, finding a set of proteins prevailing for each genus, that were then used to predict a sporulation phenotype in species for which the experimental evidence does not exist. The constructed phyletic profile signatures as shown, were in almost all of the tested genus able to predict a phenotype, a result that should be experimentally proven. Unfortunately, for 33 genera that have an only member a dependable signature was impossible to predict, we are confident that the exploration of new environments, will provide new Firmicutes that may help to refine the proposed signatures and get some for the species that in this work were represented by an only organism.

Evolution of the protein involved in the engulfment based on phyletic profiles
events that result in the gain of the studied proteins, and we cannot be certain whether the coding genes were gained by HGT or gene duplication; however, the results found in this analysis shows consistent results as those found in previous works. Finally, the gain events were seen at the base of the tree for the protein SpoIIB in Clostridia and a posterior loss in some members of Bacilli, a result suggesting that as previously tested in B. subtilis, a mutant in the gene spoIIB can complete engulfment, that probably causes a defect affecting only the speed of engulfment [32].

Conclusions
In this study we inspected the proteomes of 954 Firmicutes, to identify POs of a set of proteins involved in the engulfment. The performed study included a wide number of species, showing for the first time the distribution of this component in a significant number of species, showing phyletic profiles that prevail into genus, and that we proposed to be used as sporulation signatures. The patterns were tested over Firmicutes with an unknown phenotype for which we proposed that this signature can be used to suggest a sporulating or a nonsporulating phenotype. We think that this first attempt should be complemented with proteins participating in other stages of sporulation, which can better support the predicted phenotype. We also showed a phylogenetic tree in which the displayed species were selected using unique phyletic profiles across taxonomic families, which yield a consistent reconstruction revealing that the genes encoding proteins involved in the engulfment was present in the last common ancestor of this phylum, with subsequent losses and gains that occurred in different points along the phylogeny.