Comparative genomic analysis of mollicutes with and without a chaperonin system

The GroE chaperonin system, which comprises GroEL and GroES, assists protein folding in vivo and in vitro. It is conserved in all prokaryotes except in most, but not all, members of the class of mollicutes. In Escherichia coli, about 60 proteins were found to be obligatory clients of the GroE system. Here, we describe the properties of the homologs of these GroE clients in mollicutes and the evolution of chaperonins in this class of bacteria. Comparing the properties of these homologs in mollicutes with and without chaperonins enabled us to search for features correlated with the presence of GroE. Interestingly, no sequence-based features of proteins such as average length, amino acid composition and predicted folding/disorder propensity were found to be affected by the absence of GroE. Other properties such as genome size and number of proteins were also found to not differ between mollicute species with and without GroE. Our data suggest that two clades of mollicutes re-acquired the GroE system, thereby supporting the view that gaining the system occurred polyphyletically and not monophyletically, as previously debated. Our data also suggest that there might have been three isolated cases of lateral gene transfer from specific bacterial sources. Taken together, our data indicate that loss of GroE does not involve crossing a high evolutionary barrier and can be compensated for by a small number of changes within the few dozen client proteins.


Introduction
Proteins can fold into their three-dimensional native structures spontaneously and without any assistance by other factors [1]. In vivo, however, protein aggregation and mis-folding can occur owing to macromolecular crowding and other conditions that exist in cells. Protein misfolding is harmful to cells because of the costs involved in the synthesis and degradation of non-functional proteins, toxic effects of protein aggregates (such as disruption of cell membranes [2]) and the absence of functional protein molecules that may have crucial roles. Hence, it is not surprising that selection against mis-folding is a major driving force in evolution [3]. Molecular chaperones, which prevent aggregation and promote efficient protein a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 K12) [83333]" AND proteome: up000000625. The chosen reference genome of E. coli is 4.64 Mbp long and contains 4,306 proteins. The complete mollicute genomes were selected by NCBI Genome search: http://www.ncbi.nlm.nih.gov/genome/browse/. Resulting entries were downloaded into a text file containing clade ID, genome size and FTP links. A python script was used to select one genome per species, where species was defined by the clade ID. Some genomes lacked a clade ID and were, therefore, assigned the respective identifiers: 'Mycoplasma suis', 'Mycoplasma haemolamae', 'Mycoplasma arginine', 'Mycoplasma ovis', 'Mycoplasma wenyonii' and 'Mycoplasma parvum'. Genomes with similar names to species with existing clade IDs were assigned those IDs: 'Mycoplasma canis': 21069, 'Mycoplasma mycoides': 21078, 'Mycoplasma pneumoniae': 21053 and 'Spiroplasma turonicum': 39903'. The longest genome of each species was selected and downloaded from the NCBI ftp site. A local DNA BLAST database was then created for each of these genomes. In order to get the corresponding DNA sequences of the protein sequences, we downloaded the nucleotide sequences corresponding to the CDS annotation ( Ã _cds_from_genomic.fna.gz) from ftp://ftp.ncbi.nlm.nih. gov/genomes/refseq/bacteria/ (These files are part of the data files provided for every assembly). The nucleotide sequences of 56 out of 59 mollicute species were downloaded since for 3 genomes the CDS are not available. The SILVA database [16] was used to retrieve 16S rRNA sequences (SSU) of the mollicute species.

Comparing properties of proteins
For length, charge and FoldIndex comparisons, the mean of all homologs found in GroE + or GroEof one query protein was calculated and then all (57 or less) the means (per GroE + or GroEgroup) were averaged again for the group of clients or non-clients. For amino acid composition, amino acid substitution events and codon bias analysis, there is no meaning to "averaging". Hence, every homolog (not every group of homologs per query protein) contributed equally.

Codon bias analysis
In order to check whether the codon usage of the homologs of E. coli proteins differs between GroE + and GroEmollicutes, the gi number of every protein was mapped to its Refseq ID via Uniport ID mapping. For the homologs of the E. coli GroE client proteins in mollicutes, 385 gi numbers (out of 454) were successfully mapped to their corresponding Refseq ID and for the homologs of the E. coli GroE non-client proteins in mollicutes, 165 gi numbers (out of 191) were successfully mapped to their corresponding Refseq ID. The DNA sequences of the homologs of the E. coli client and non-client proteins were downloaded from NCBI. Four Fasta files were created for the homologs of the E. coli clients and non-clients in GroE + and GroEmollicutes, respectively. EMBOSS ( [17]) CUSP was run on every file to analyze the codon usage. The output included fractions of codon usage for every DNA sequence that were added up to reflect the codon bias of the corresponding groups of proteins. The Shannon entropy of the codon usage for each amino acid (plus the Stop codons) was calculated according to PðcodonÞ Ã log 2 PðcodonÞ. Since Met and Trp residues are encoded by a single codon, their entropy is 0. Thus, comparison of entropies can be done for 19 (18 amino acids + stop) cases.
The Welch's t-test is a two-sample location test, which assumes normal distribution of populations without the additional assumption of equal variances and it tests if two populations have the same mean.

Evolutionary tree construction
Phylogenetic and molecular evolution analysis of the GroE + and GroEmollicute species listed in S1 and S2 Tables, were conducted using MEGA version 6 [19]. In the analysis, sequences were aligned using ClustalW and phylogenetic trees were constructed using the Maximum Likelihood method (and the MEGA6 default parameters) in accordance with the General Time Reversible and Gamma evolutionary models.

Results
Many different comparisons can be made between E. coli clients and non-clients and their homologs in GroE + and GroEmollicute species (Fig 1). For example, E. coli clients can be compared with their homologs in GroEmollicute species in order to uncover changes that occurred to compensate for the absence of GroE. Given, however, that E. coli and mollicutes are evolutionary distant and, thus, differ in many respects (e.g. GC content, codon usage, genome size etc.), a more controlled comparison is between the GroE + and GroEmollicute homologs. Thus, most of the analysis here was done in this manner but, for completeness, we also compared other groups in Fig 1 with regard to some properties.

General comparison of mollicute genomes with and without chaperonin systems
A search for homologs of E. coli GroEL (amino acid sequence P0A6F5) in the entire genomes of 59 mollicute species was carried out using TBLASTN with default parameters and an Evalue threshold of 0.001. E-values of about 1e-100 were obtained for 13 species. We note that GroES was also present in all these species. Thus, these species were classified as GroE positive (GroE + ). The other 46 species were classified as GroE negative (GroE -). The members of the GroE + and GroEgroups of species are listed in S1 and S2 Tables, respectively. The genome sizes of GroE + and GroEspecies were then compared. The average genome sizes of GroE + and GroEmollicutes were found to be 1.13 ± 0.40 and 0.95 ± 0.25 Mbp, respectively. This difference is, however, not statistically significant as indicated by a P-value of 0.152 obtained from a Welch's unequal variances t-test. The average number of proteins in GroE + and GroEmollicutes was found to be 1022.23 ± 405.28 and 828.09 ± 279.64, respectively. This difference is also not statistically significant as indicated by a P-value of 0.138 obtained using the Welch's unequal variances t-test. The average lengths of proteins in the GroE + and GroEmollicute species were found to be 315.35 ± 48.41 and 342.59 ± 48.41 amino acids, respectively. A Pvalue of 0.131 obtained using the Welch's unequal variances t-test indicated that also this difference is not significant.

Identifying homologs of E. coli GroEL clients in mollicutes
A subset of class III clients identified by Kerner et al. [11] together with 4 obligate GroEL clients identified by Fujiwara et al. [12] were defined as class IV GroEL clients that comprises 57 proteins. A control set was created by selecting at random 57 E. coli proteins that are not GroEL clients and have the same length distribution as the clients (i.e. each client protein has a corresponding control protein with the same length except for 3 protein pairs that differ in their length by one amino acid). The length distributions were chosen to be the same since it is well established [20,21] that protein length is strongly correlated with folding properties. A search for homologs of the 57 obligate substrates and 57 control proteins in the 59 species of mollicutes was then carried out using BLASTP. Hits with an E-value threshold of 0.001 for which the sequences covered at least 80% of each other were considered as homologs. The lists of obligate substrates and control proteins are given in S3 and S4 Tables, respectively. In total, 130 and 340 homologs of the E. coli obligate clients were found in the GroE + and GroEmollicute species, respectively. For the control group of E. coli non-clients, 53 and 152 homologs were found in the GroE + and GroEmollicute species, respectively. The probability of finding a homolog of a client was higher than that of finding a homolog of a non-client control protein in both GroE + and GroEmollicutes. For 28 E. coli obligate clients, at least one homolog was assigned in a GroE + mollicute species and for 31 at least one in a GroEspecies. For 16 E. coli non-clients in the control group, at least one homolog was assigned in a GroE + mollicute and for 19 at least one in a GroEspecies. For the number of homologs of each query protein, see the last two columns of S3 and S4 Tables.

Comparing properties of client homologs in GroE + and GroEmollicutes
The lengths of the homologs in GroE + and GroEmollicutes of the obligate clients were compared (Table 1A) and the average length of the latter group was found to be slightly smaller. The difference is, however, not statistically significant. We also note that, in general, mollicute proteins tend to be shorter than their E. coli counterparts but also this difference is not statistically significant.
A second feature we examined is protein net charge (Table 1B), which was defined as the absolute difference in the number of positively charged and negatively charged amino acid residues (at pH 7) divided by the total number of residues [22]. No statistically significant Comparative genomic analysis of mollicutes with and without a chaperonin system PLOS ONE | https://doi.org/10.1371/journal.pone.0192619 February 13, 2018 differences were found between the average net charges of the various groups. In a previous study [23] we noticed that GroE clients in E. coli have, on average, a lower folding propensity (as calculated by the FoldIndex program [24]) than random sets of E. coli proteins whereas control sets of proteins in U. urealyticum (which is a GroEorganism) have a low FoldIndex. The current larger study shows (Table 1C) that mollicute proteins do have lower FoldIndex values than their E. coli counterparts, but this property is common to all mollicutes and is not unique to GroEorganisms (i.e. the average FoldIndex values of proteins from GroE + and GroEmollicutes are quite similar). Interestingly, however, the homologs of the E. coli GroE obligatory clients have a lower FoldIndex value, on average, than the homologs of the non-clients in both GroEand GroE + species (Table 1).
The overall sequence compositions of proteins in E. coli and the GroE + and GroEmollicutes are given in Table 2. Also given there are the sequence compositions of the GroE clients and control non-clients in E. coli and the homologs of these two groups in GroE + and GroEmollicutes. There are clear differences in the sequence compositions of E. coli and the proteins of the mollicutes. Ala, Trp, Cys and Arg are strongly depleted in mollicute proteins while Ile, Asn and Lys are significantly enriched in mollicute proteins compared with their E. coli homologs. Interestingly, Lys residues in mollicutes are about 3 times more frequent than Arg whereas in E. coli the frequency of Lys is slightly smaller than that of Arg. No significant differences are observed, however, between the protein sequence composition of GroE + and GroEmollicutes. There are also no significant differences between the compositions of the homologs of the GroE client and non-clients in the protein sequences of the GroE + and GroEmollicutes.

Multiple sequence alignments
As mentioned above, for the 57 client sequences, 28 and 31 sets of homologs were found in the GroE + and GroEmollicute species, respectively, and for the non-clients, 16 and 19 set of homologs were found in the GroE + and GroEmollicute species, respectively. The sequences of the 57 E. coli obligate GroEL clients and the 57 control non-clients were aligned using MUS-CLE with their respective homologs. The heat maps in S1 and S2 Figs show the rate of substitution of each amino acid between the E. coli clients (on the vertical axis) and the aligned positions in GroE + and GroEmollicute homologs (on the horizontal axis). The rates were calculated by counting the number of changes at all positions in the mollicute sequences and then normalizing to 1. While there are differences in the values within each matrix and between the matrices, we could not identify any large changes or consistent patterns in these matrices.

Codon usage
We compared the codon usage of the homologs of the E. coli clients in GroE + and GroEmollicutes. S5 Table (left columns) shows the fraction of usage of every codon for each amino acid in the client homologs in GroE + and GroEmollicutes. Overall, the codon usage is quite similar but there are several differences that can be noticed. First, there are noticeable differences in the usage of TAA (stop) and GGA (Gly) codons. Moreover and interestingly, the entropies of the codon usage for 17 out of the 19 cases are lower for the client homologs in GroE + mollicutes than in the GroEmollicutes. However, these differences are not unique to the client homologs. Similar results are found (S5 Table, right columns) when comparing the control (i.e. the non-clients) codon usage. Again, the main difference seems to be in the usage of the TAA and GGA codons between the GroE + and GroEmollicute species. Also, when comparing the entropies, we noticed that for 16 out of the 19 amino acids the entropy is lower in the case of the non-client homologs in GroE + mollicutes than in the GroEmollicutes. This difference in the entropies between GroE + and GroEmollicutes is intriguing but it is not a property that distinguishes between the client homologs in GroE + and GroEmollicutes.

The evolution of the GroE system in mollicutes
The evolutionary tree of the distribution of the GroE system in mollicutes was constructed as described in the Methods (Fig 2). Mollicutes have evolved from other bacteria and underwent reductive evolution and lost many of their genes [25]. Given that the GroE system is absent in most branches of the evolutionary tree of this class, it is likely that the root of the mollicute tree lost the GroE system and that it re-emerged polyphyletically in several branches of the tree. We can distinguish two major clades that contain the GroE system (Fig 2): a big clade that includes seven species, acholeplasma and phytoplasma, marked with red diamonds (P3, P9, , and a smaller clade that includes three species marked with red squares (P5, P7, P10).
Of special interest are the two species Spiroplasma kunkelii (P13) and Spiroplasma turonicum (P12) (marked with red circles in Figs 2 and 3) that have the GroE system whereas their immediate evolutionary neighbors do not. Interestingly, their GroEL sequences and GroES   Fig 2. Distribution of the GroE system in a 16S rRNA-based evolution tree of mollicutes. The branches of the GroE + and GroEspecies are marked in red and blue, respectively. The numbers on the branching points represent the bootstrapping frequencies, the branch length represents the sequence distance and the scale shows a distance of 5%. The diamonds and squares show the two major clades that contain the GroE system. The triangle marks the species where lateral gene transfer (LGT) was suggested before and circles mark the two additional species for which we suggest LGT events.
https://doi.org/10.1371/journal.pone.0192619.g002 sequences show high similarity to the GroEL and GroES sequences from bacteria belonging to the Bacilli class. BlastP shows that GroEL from Spiroplasma turonicum (P12) has 85% sequence similarity (71% identity) to GroEL from Enterococcus raffinosus with a 96% query coverage. GroEL from Spiroplasma kunkelii (P13) has 80% sequence similarity (65% identity) to that from Enterococcus raffinosus with a 96% query coverage. The sequence similarity between GroELs from Spiroplasma turonicum and Spiroplasma kunkelii is 78%. The similarity between these two GroE + spiroplasma to the GroEL of the closest other mollicutes is lower and is about 75 and 71%, respectively. GroES from Spiroplasma turonicum (P12) has 62% sequence similarity (43% identity) to GroES from Salinicoccus sediminis with 97% query coverage. GroES from Spiroplasma kunkelii (P13) has 64% sequence similarity (43% identity) to GroES from Gorillibacterium massiliense with 98% query coverage. These results point to a possible event of lateral gene transfer between one of the species from the Bacilli class and S. kunkelii (P13) and S. turonicum (P12). Our analysis also points to a lateral gene transfer event for the emergence of the GroE system in Mycoplasma penetrans (P2, marked with a red triangle) as both the GroEL The diamonds and squares show the two major clades that contain the GroE system. The triangle marks the species where lateral gene transfer (LGT) was suggested before and circles mark the two additional species for which we suggest LGT events. and GroES sequences from Mycoplasma penetrans show higher similarity to their homologues in Helicobacter pylori than to those in any mollicutes species. The GroEL from mycoplasma penetrans (P2) has 85% sequence similarity and 70% sequence identity to GroEL from Helicobacter pylori with 98% coverage, while the GroES sequence from Mycoplasma penetrans (P2) has 71% sequence similarity (43% identity) with 97% query coverage to GroES from Helicobacter pylori. This finding confirmed the lateral gene transfer previously observed by Clark and Tillier [14]. Building the same tree with neighbor joining using the PHYLIP software resulted in the same branch structure (data not shown). In addition to the tree above, we built separate evolutionary trees for GroEL and GroES sequences for the GroEL + organisms. Both trees of GroEL and GroES sequences coincide with the "standard" evolutionary tree presented in Fig  2, which is based on 16S rRNA with minor differences. The GroEL tree is presented in Fig 3  and the similar GroES tree is presented in S3 Fig (Helicobacter pyroli 26695 was used as an outgroup in the evolutionary trees in Fig 3 and S3 Fig).

Discussion
Our analysis shows that 13 species of mollicutes are GroE + and 46 are GroE -. Previous experimental studies [26,27] reported the existence of a GroEL homolog in Mycoplasma fermentans and Mycoplasma suis but we did not detect one in these species and, thus, classified them as GroE -(the existence of the GroE system in Mycoplasma suis has also been challenged before [28]. In all aspects that we checked (i.e. genome size, number of proteins, average protein length, amino acid composition, folding propensities and codon bias), no statistically significant differences are observed between homologs of E. coli GroE clients in GroE + and GroEmollicute species. Our data do not support, therefore, the suggestion [29] that an excess of Lys and Arg (when compared to the E. coli proteins) is related to the ability of N-acetylneuraminic acid aldolase from M. synoviae, a GroEorganism, to fold in the absence of GroE. Our data show ( Table 1) that the excess of Lys at the expense of Arg is, in fact, a general property of mollicutes that does not depend on whether they do or do not contain the GroE system. Similarly, the suggestion by Georgescauld et al. [29] that the number of Phe and Tyr residues is higher in the GroEspecies compared to the GroE + species is not supported by our data, which show that this is a general property of mollicutes regardless of their GroE content.
Obvious differences between E. coli proteins that are obligate clients of the GroE systems and those that are not have not been identified so far [30]. Nevertheless, the loss of the GroE system in the root of the mollicute class may have conferred a strong and immediate pressure on proteins that originate from obligatory clients to escape from the GroE system. According to this evolutionary scenario, systematic changes in the sequences of these proteins might have been expected. Our study suggests, however, that no large systematic differences exist between the sequence features of proteins that originated from clients of the GroE system and those that are not dependent on the GroE system. This observation is consistent with the report [31] that single amino acid changes are sufficient to convert a GroE-independent protein into a dependent one. Furthermore, in a recent study [32] we studied the GroEL dependence of GFP, a eukaryotic protein that is often used as a fluorescent marker also in prokaryotic systems and folds in a GroE-dependent manner. We found that single mutations in GFP can decrease the GroEL dependence of its folding. These mutations were of residues at "frustrated" positions [33]. The conclusion of that study is that even a single mutation is sufficient to change the GroE dependence of a protein. This conclusion is consistent with the fact that large-scale differences between homologs of E. coli obligate substrate in GroE+ and GroE-organisms were not found in this study and in two previous studies [13,14]. It is also in accord with our conclusion (Fig 2) that regaining the GroE system occurred in a polyphyletic manner [14] and not monophyletically as suggested before [15]. The fact that we have many closely related species with and without the GroE system suggests that compensation for the absence of the GroE system does not require crossing a high evolutionary barrier and can be achieved, as argued above, by a small number of changes within the few dozen proteins that are dependent on GroE in other organisms.
Given such a compensatory mechanism, one can ask why there are several mollicute species that have regained the GroE system. We note that GroEL has been suggested to be a moonlighting protein, i.e. to have roles other than in assisted folding [34] of which some may be related to pathogenicity (see [35]). Thus, it is possible that the regain of GroEL in some mollicute species was driven by functions of GroEL not related to its role in folding.
Supporting information S1