A Systematic Survey of Mini-Proteins in Bacteria and Archaea

Background: Mini-proteins, defined as polypeptides containing no more than 100 amino acids, are ubiquitous in prokaryotes and eukaryotes. They play significant roles in various biological processes, and their regulatory functions are attracting increasing attention. However, the functions of most mini-proteins remain largely unknown because of the limitations of experimental methods and bioinformatic analysis.
Methodology/Principal Findings: We extracted a total of 180,879 mini-proteins from the annotations of 532 sequenced genomes, comprising 491 strains of Bacteria and 41 strains of Archaea. Mini-proteins account for approximately 10.99% of all genomic proteins on average, but the proportion fluctuates remarkably between strains. These mini-proteins display two notable characteristics. First, the majority are species-specific proteins, with an average proportion of 58.79% among six representative phyla. Second, an even larger proportion (70.03% among all strains) are hypothetical proteins. However, a fraction of highly conserved hypothetical proteins potentially play crucial roles in organisms. Among mini-proteins with known functions, regulatory and metabolic proteins appear to be more abundant than essential structural proteins. Furthermore, domains in mini-proteins appear to be more widely distributed in Bacteria than in Eukarya. Analysis of the evolutionary progression of these domains reveals that they have diverged into new patterns from a single ancestor.
Conclusions/Significance: Mini-proteins are ubiquitous in bacterial and archaeal species and play significant roles in various functions. The number of mini-proteins in each genome fluctuates remarkably, likely as a result of differential selective pressures that reflect the respective lifestyles of the organisms. The answers to many questions surrounding mini-proteins remain elusive and need to be resolved experimentally.


Introduction
Mini-proteins are polypeptides consisting of no more than 100 amino acids (AA). They are widespread in both prokaryotes and eukaryotes and play important roles in a variety of biological functions. Mini-proteins usually contain a single domain. In prokaryotes, well-known mini-proteins include the chaperonin Hsp10, translation initiation factor IF-1, ribosomal proteins and others. In eukaryotes, certain important signalling molecules, animal toxins and protease inhibitors belong to the mini-protein family [1]. Kastenmayer et al. reported that the Saccharomyces cerevisiae genome codes for 299 mini-proteins, based on experimental approaches and computational analysis [2].
Some mini-proteins have been used as model systems to study the determinants of protein folding and stability because of their simple and typical structures [3,4]. Moreover, some exhibit structural scaffolds that are valuable for studying binding activities, identifying frameworks for peptidomimetic design, or searching for novel drug candidates [5]. Beyond their importance in structural studies, the regulatory functions of mini-proteins have recently attracted considerable interest, especially in Bacteria. For instance, Wu et al. [6,7] elucidated the functions of two mini-proteins from Pseudomonas aeruginosa. These proteins are expressed in response to specific environmental stresses and actively participate in the suppression of the type III secretion system, achieving coordinated gene expression and thus playing a critical role in host infection. Within dormant spores of Bacillus, Clostridium and related species, a group of small, acid-soluble spore proteins (SASP) are the crucial factors that enable spores to survive for years by protecting spore DNA from damaging agents [8].
According to binding studies of peptides of various sizes, the minimal size of a functional epitope is around 8 AA, with an average size of 15-20 AA. Therefore, a mini-protein as short as 8 AA is capable of binding targets and exhibiting biological functions. It is not surprising, then, that mini-proteins with sizes up to 100 AA can perform a variety of relevant functions and participate in the regulation of various biological processes. However, little effort has been devoted to exploring their functions; instead, most research focuses on large proteins that are conserved and/or essential among organisms [9]. The characterization of mini-proteins is difficult for both experimental and bioinformatic approaches. Experimentally, mini-proteins are difficult to isolate and identify because of their small size; likewise, in bioinformatic analyses, short genes are the most difficult to predict. Therefore, in-depth and systematic studies of mini-proteins are needed to provide clues to their functions.
In this report, we analyzed all annotated protein sequences of ≤100 amino acids (AA) from 532 completed genomes, comprising 491 genomes of Bacteria and 41 genomes of Archaea, deposited in the Microbial Genome Database at the National Center for Biotechnology Information (NCBI) [10]. We focused our attention on three aspects: the component distribution of mini-proteins (including length, number, and conservation), the characteristics of mini-proteins in bacterial and archaeal species, and the possible reasons why they possess such characteristics. The results indicate that mini-proteins account for an average of 10.99% of all annotated sequences in Bacteria and Archaea and comprise numerous species-specific proteins and hypothetical proteins. The functions of very few mini-proteins are known, but these are involved in many important biological processes. Moreover, hypothetical mini-proteins contain a fraction of highly conserved sequences, indicating that they play important functional roles.

Mini-protein length distribution
We downloaded the sequenced genomes of 532 prokaryotes, consisting of 491 strains of Bacteria and 41 strains of Archaea, from the National Center for Biotechnology Information (NCBI). A total of 180,879 annotated protein sequences with no more than 100 amino acids were extracted. The length distribution of these mini-proteins shows an increase in frequency for progressively longer sequences (Figure 1A). Mini-proteins of ≤30 AA are the minority in all data, representing merely 1,897 sequences and accounting for 1.05% of all mini-proteins. The longest sequences, 90 AA < length ≤ 100 AA, are more common than any other category, with 37,280 sequences accounting for 20.61% of all mini-proteins. Figure 1B displays the detailed length distribution of mini-proteins, with lengths from 1 AA to 100 AA. The general trend that mini-protein numbers increase with mini-protein length is obvious. Only 5 mini-protein sequences were ≤10 AA, with the shortest protein containing 6 amino acids. Three of these are hypothetical proteins. Of the other two mini-proteins, one is predicted to be a fragment of the PE-PGRS protein family, whose members are probably related to surface antigens in mycobacterial species; the other is annotated as transposase-like protein B (remnant) in Clostridium difficile 630. Proteins of 100 amino acids are the most abundant, with 4,092 sequences.
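The binning behind a length distribution of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `length_histogram`, the 10-AA bin width and the half-open bins (0, 10], (10, 20], ..., (90, 100] are assumptions chosen to match the "90 AA < length ≤ 100 AA" category quoted above.

```python
from collections import Counter


def length_histogram(lengths, bin_width=10, max_len=100):
    """Return {(lower, upper): (count, percent)} for bins lower < length <= upper.

    Hypothetical helper: bin width and bin boundaries are assumptions.
    Sequences longer than max_len are ignored (they are not mini-proteins).
    """
    counts = Counter()
    for n in lengths:
        if 1 <= n <= max_len:
            # Round the length up to the upper bound of its bin (ceiling division).
            upper = -(-n // bin_width) * bin_width
            counts[upper] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        (upper - bin_width, upper): (counts[upper], 100.0 * counts[upper] / total)
        for upper in range(bin_width, max_len + 1, bin_width)
    }


# Toy input: six mini-proteins plus one 101-AA protein that is excluded.
histogram = length_histogram([6, 10, 11, 95, 100, 100, 101])
```

With real data, `lengths` would be the lengths of all extracted sequences, and the percentage in the last bin would correspond to the share reported for the longest category.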

Mini-protein overview in phylum
The 532 sequenced genomes we collected from NCBI belong to species classified in 18 distinct phyla, 3 from Archaea and 15 from Bacteria. Four phyla were each represented by a single genome sequence, i.e., Nanoarchaeota from Archaea and Aquificae, Fusobacteria and Planctomycetes from Bacteria. Moreover, we described the five classes of Proteobacteria (Alpha-, Beta-, Delta-, Epsilon- and Gamma-) as if they were separate phyla, because Proteobacteria are represented by the largest number of genomes, with 258 strains accounting for nearly half of the sequenced genomes.
As shown in Table 1, the overall proportion of mini-proteins among all annotated genomic proteins is 10.99%. Planctomycetes has the highest proportion of mini-proteins, 26.54% or 1,944 sequences. In contrast, Aquificae has the lowest, encoding merely 48 mini-proteins in the whole genome, or 3.08%. However, these two phyla contain only one genome each, Rhodopirellula baltica and Aquifex aeolicus, respectively. Apart from these two extremes, other phyla encode similar proportions of mini-proteins, although greater variability is observed when considering individual strains. For instance, the alphaproteobacterium Anaplasma phagocytophilum contains 33.39% mini-proteins, more than any other genome. At the other extreme, in the genome of Clostridium tetani (Firmicutes), a human pathogen causing tetanus, no mini-proteins are annotated except for 4 sequences on its plasmid. Genomes from nine other species of Clostridium have been sequenced. In sharp contrast to C. tetani, these nine genomes contain a normal proportion of mini-proteins, ranging from 8.27% to 14.25%. Moreover, similar average proportions of mini-proteins, 11.28%, 11.30% and 9.33%, respectively, are annotated in the genomes of the three archaeal phyla.

Specific and shared mini-proteins
To investigate conservation among mini-proteins, we took several representative phyla and determined the proportion of their mini-proteins that are specific to or shared at each taxonomic level (species, genus, family, order, class, phylum and domain). Conservation was established by sequence similarity as determined by BLAST comparisons (see Table 2). Our criteria for the definition of specific vs. shared are the following: (i) Except for species-specific proteins, specificity at any other taxonomic level must meet two conditions: the proteins must be restricted to a certain taxon at that level, and they must also exist in all categories at the lower levels. For instance, suppose that a query mini-protein belongs to a certain species within a certain genus, and the results indicate that its homologs are present only in that genus and in all species of that genus. In this case, we call it a "genus-specific" protein. Similarly, if its homologs are found in other genera in the same family, then we call it "genus-shared"; (ii) Given that a genus might have only one sequenced species, a mini-protein named "species-specific" does not automatically become genus-specific. This rule also applies to other levels; (iii) Because of filtration by various parameters, the entries shown in the results are fewer than the mini-proteins used in the initial searches.
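One possible reading of criteria (i) and (ii) can be sketched in code. This is an interpretation, not the authors' implementation: the function `classify`, the tuple layout of a lineage, and the handling of cases the text leaves open (for example, labelling homologs found in only some species of a genus as "species-shared") are all assumptions.

```python
# Taxonomic levels ordered from lowest to highest; a lineage is a tuple in this order.
LEVELS = ("species", "genus", "family", "order", "class", "phylum", "domain")


def classify(query, hits, genomes):
    """Label a mini-protein as '<level>-specific' or '<level>-shared'.

    query   -- lineage tuple of the genome encoding the query protein
    hits    -- lineage tuples of genomes that contain a homolog
    genomes -- lineage tuples of all sequenced genomes (species names unique)
    """
    hits = set(hits) | {query}
    # Find the lowest level whose taxon contains every homolog.
    for i, level in enumerate(LEVELS):
        if all(h[i] == query[i] for h in hits):
            break
    else:
        # Homologs occur in more than one domain (e.g. Bacteria and Archaea).
        return "domain-shared"
    if i == 0:
        # Criterion (ii): never promoted, even if the genus has a single species.
        return "species-specific"
    # Criterion (i): specific only if every sequenced species in the taxon has a homolog.
    members = {g[0] for g in genomes if g[i] == query[i]}
    covered = {h[0] for h in hits}
    if members <= covered:
        return f"{level}-specific"
    return f"{LEVELS[i - 1]}-shared"
```

Under this reading, a protein whose homologs occur in another genus of the same family, but not in every species of that family, is labelled "genus-shared", which matches the example given in criterion (i).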
From Table 2, it is clear that species-specific mini-proteins are the majority in all of the phyla (the average proportion is 58.79%), suggesting that these proteins potentially take on unique functions that contribute to the adaptation of organisms to different habitats. However, 85.81% of them are annotated as "hypothetical protein" and their existence has not been confirmed. In contrast, shared or conserved proteins account for a small fraction, with 6.20% phylum-shared and 0.73% domain-shared (conserved in both Archaea and Bacteria). It is noteworthy that Firmicutes comprise a larger proportion of these shared mini-proteins than any other bacterial phylum. In addition, although the proportion of hypothetical proteins is low among the conserved proteins, some hypothetical proteins are phylum-shared or domain-shared. Most phylum-shared proteins, however, are well characterized. Examples in the phylum-shared class include various ribosomal proteins, cold shock protein and translation initiation factor IF-1; examples in the domain-shared class include rubredoxin, transcriptional regulator and gas vesicle protein (see Table 3 for a complete list).

Conservation of hypothetical proteins
The aforementioned results show that hypothetical proteins account for a large proportion of mini-proteins, even among the conserved phylum-shared and domain-shared ones. In fact, about 70.03% or 126,670 mini-proteins are designated as hypothetical proteins, while merely 29.97% or 54,209 proteins possess functional or structural annotations. Moreover, 25,394 mini-proteins have been classified in the COG (Clusters of Orthologous Groups) database [11], and approximately 17.81% of them are of unknown function (see Figure 2 for details). We further focused on these hypothetical proteins (including uncharacterized proteins and proteins of unknown function, here together referred to as "hypothetical proteins") to search for more conserved mini-proteins for better classification. From each genus in all phyla, we selected the strain that contains the most mini-proteins as a representative (Table 1) and analyzed the conservation of its mini-proteins among all data (see Materials and Methods). We then picked out the mini-proteins whose homologous proteins are present in at least five of the phyla. As before, the five classes in Proteobacteria were treated as distinct phyla.
As a result, we found many new groups of conserved mini-proteins and obtained 28 groups of proteins conforming to the above conditions. We then compiled the data and searched for their functional domains on the Pfam [12] or InterProScan [13] websites (see Table 4 for details). These 28 groups of mini-proteins can be divided into three types. The first type consists of well-studied mini-proteins with detailed functional and/or structural information, including groups 01-07 and groups 27-28. The second type consists of mini-proteins with domains named DUF (Domain of Unknown Function) or UPF (Uncharacterized Protein Family). The third type is less conserved than the other two; its domains are found only in Pfam-B, which supplements the principal body of the database (Pfam-A) and contains small families of proteins. The conservation of the mini-proteins represented in Table 4 is generally high, with a lowest similarity of 42% among these groups. Mini-proteins assembled in a group usually belong to the same domain of life, but groups 07, 18 and 20 include representatives from both Bacteria and Archaea. However, proteins of groups 18 and 20 are poorly characterized.

Evolutionary analysis of domains
We further investigated the domains (or motifs and conserved regions) within the conserved hypothetical proteins in Table 4, as well as the phylum-shared and domain-shared proteins in Table 3, and observed four patterns in the process of their evolution (see Figure 3). We noticed that (i) these domains are highly conserved and widespread. Four domains, Plasmid-killer, Plasmid-Txe, RHH-2 and DUF370, were specific to Bacteria; other domains were conserved in Bacteria as well as in Viruses, Archaea and Eukarya. Except for Zf-UBP, which is mainly represented in eukaryotes, all other domains exist mainly in Bacteria. This suggests that the domains in mini-proteins are more likely to contribute to bacterial species than to eukaryotes; (ii) these domains seem to have evolved independently in mini-proteins, except for the PAAR-motif, which is often observed in tandem repeats. However, as protein lengths increased, the domains developed at least two patterns, except for those with independent evolution such as RHH-2, DUF37, DUF196, DUF370 and DUF528; (iii) independent domains seem to be more frequent than any of the other three patterns, whereas self-tandem is the major pattern for PAAR-motifs, and chimera with other domains is the major pattern for Zf-UBP and YHS domains.
Domains represent the functional and evolutionary units of proteins, and almost all mini-proteins contain one domain. The results of our analysis indicate that individual domains evolve independently. Most domains develop new patterns during long-term evolution, although independent domains account for the majority in terms of number. In the course of evolution, proteins have a general tendency to fuse from single units into two-domain or multi-domain units, which may help proteins develop new functions. As shown in Figure 3, proteins in pattern 2 achieve functional integration through combination with different domains, which is a predominant route of protein evolution. Pattern 3 is also a relatively common route of protein evolution from single to multiple domains. The number of self-tandem domains in a protein is variable. For example, the BMC (bacterial microcompartment) domain always occurs as two tandem repeats, whereas the number of CSD (cold-shock domain) repeats is not fixed and can reach six.
A typical example of independent evolution is DUF37, which originates from group 08 in Table 4. This group of mini-proteins contains the largest number of retrieved sequences, 144 in total, and covers all 15 phyla of Bacteria. The majority of them are hypothetical proteins or proteins of unknown function, except 4 proteins that are annotated as alpha-hemolysin, a bacterial toxin that can assemble a transmembrane pore. In the InterPro database [14], we detected 653 proteins possessing this domain, including one sequence in a virus, 9 sequences in green plants and 643 sequences in Bacteria. These proteins contain no other domains, which suggests that DUF37 evolved independently. In addition, many domains exhibit at least two patterns. A good example is BMC within group 01 of the mini-proteins, which involves 103 sequences from 11 phyla. We found that 843 proteins contain this domain in the InterPro database and summarized its evolutionary patterns. Figure 4A shows that, besides the independent domain (62.51%), BMC has developed two other patterns: self-tandem (18.04%) and chimera with other domains (19.45%). In spite of the different patterns, the proteins still possess similar functions, which indicates that a single BMC domain is sufficient to exert its function and that a tandem of two BMC domains is not required. We further investigated its phylogeny, using Cyanobacteria as an example (Figure 4B). It is clearly observed that the self-tandem and chimera-with-other-domain patterns have diverged from the independent domain, because the BMC domains in pattern 3 or 4 and in pattern 2 or 5 form two independent clusters, respectively. The left and right domains in pattern 3 or 4 each form their own cluster. This implies that the existence of tandem domains may not be the result of domain duplication, but rather of the transfer of domains between proteins.
Table 2 note: The averages of species-specific, phylum-shared and domain-shared mini-proteins are 58.79%, 6.20% and 0.73%, respectively. Because Actinobacteria contains only one class, its class-specific mini-proteins are equal to its phylum-specific ones; similarly, Spirochaetes contains one class and one order, so its order-specific mini-proteins are equal to its phylum-specific ones.

Discussion
Our study collected all annotated mini-protein sequences from the sequenced genomic data and carried out a comprehensive and systematic analysis; previously, there were only a few sporadic reports on the structural and functional analysis of mini-proteins [2-8]. We found that the number of mini-proteins gradually increases with their length in amino acids. In particular, mini-proteins in the range 70 AA < length ≤ 100 AA account for 57.79% of the total, which concurs with the view that the size of a protein domain is generally below 100 amino acids [15]. This is the reason why we chose this length as the cut-off for analysis. With regard to smaller proteins, it has been suggested that mini-proteins (40-50 AA) can exhibit a well-defined three-dimensional structure through disulfide bridges, metal ion binding and specific hydrophobic interactions [4]. However, Samuel et al. reported that mini-proteins with just 20 amino acids can also adopt well-defined globular shapes [16]. Surprisingly, our analysis indicated that the number of mini-proteins of ≤30 AA is very low. It is possible that many small mini-proteins were filtered out during annotation, so that the actual number of mini-proteins is grossly underestimated.
Our results indicate that mini-proteins are numerous, accounting for an average of 10.99% of all annotated proteins in Bacteria and Archaea. Despite the enormous total, the distribution of mini-proteins varies remarkably among strains. For example, more than 30% of the proteins encoded by the genome are mini-proteins in two strains of Prochlorococcus marinus (30.83% and 30.30%) as well as in Anaplasma phagocytophilum HZ (33.39%), which has the greatest percentage of mini-proteins of any genome. By contrast, Clostridium tetani (Firmicutes) is a unique strain with no known mini-proteins encoded on its genome. Interestingly, both the maximum and the minimum belong to the bacterial domain. Consequently, the range of mini-protein content is much wider in Bacteria (0.16%-33.39%) than in Archaea (7.83%-18.23%). Although the concrete biological significance is unknown, we speculate that this phenomenon may relate to the fact that the ecological conditions of bacterial species are more diverse and complicated than those of archaeal species, which mostly live in constant but extreme environments [17,18].
In addition, even among closely related species, the relative proportions of mini-proteins vary greatly. In Clostridium, except for C. tetani, which encodes no mini-proteins, the other nine strains all encode mini-proteins, ranging from 8.27% to 14.25% of the total number of proteins. Species of this genus are ubiquitous in soils, aquatic sediments and the intestinal tracts of animals and humans; hence they display metabolic and biological diversity. Surprisingly, ferredoxin, ATP synthase subunit C and 50S ribosomal protein L27 are shorter than 100 amino acids and therefore count as mini-proteins in the nine strains, but in C. tetani they are 290 AA, 333 AA and 101 AA long, respectively. It is plausible that even subtle changes in the environment may become a selective pressure on mini-proteins, and that the differences among Clostridium species result from the influence of multiple factors. However, it is difficult to determine which environmental factors affect the evolution of mini-proteins. Nonetheless, a few examples can provide some clues. For instance, the aforementioned Anaplasma includes two sequenced species, A. phagocytophilum HZ (33.39% mini-proteins) and A. marginale str. St. Maries (7.59% mini-proteins). Both are obligate intracellular pathogens, but they inhabit granulocytes and erythrocytes, respectively [19,20]; differences in the host intracellular environments might therefore account for the significant difference in the relative proportion of mini-proteins. Moreover, the proportions of mini-proteins can differ even between isolates of the same species: two strains in Spirochaetes, Leptospira interrogans serovar Copenhageni str. Fiocruz L1-130 and L. interrogans serovar Lai str. 56601, contain 10.74% and 28.71% mini-proteins, respectively. We speculate that this difference in mini-protein proportions may reflect the different selection pressures to which the two strains are exposed, resulting in different leptospiral serovars that are derived from structural heterogeneity in the carbohydrate component of lipopolysaccharides [21].
Our results reveal that one characteristic of the mini-protein data is that species-specific proteins predominate, whereas conserved proteins are the minority, which is likely the chief reason for the fluctuations in mini-protein content. Why are species-specific proteins so numerous? We propose several possible reasons. First, mini-proteins help organisms adapt to diverse and distinctive ecological niches, so many of them are species-specific. Particularly in Bacteria, some species live freely in various aqueous or terrestrial environments, while others are intracellular parasites or obligate and facultative parasites of animals and plants. Second, some mini-proteins are short remnants of longer genes that were present in early ancestors. Third, some proteins probably evolved too rapidly to maintain homologues and intermediate sequences. Fourth, similar proteins may have been incorrectly annotated [22]. In fact, mini-proteins are very good candidates for species-specific proteins. On the one hand, the vast majority of mini-proteins contain one domain, which lets them exert functions simply and directly through protein-protein interactions or by binding DNA or RNA sequences. On the other hand, since mini-proteins require fewer resources to translate and fold, organisms can use them to regulate relevant pathways and respond to subtle changes in the environment, which accords with the hypothesis that organisms tend to minimize the costs of protein biosynthesis [23]. Additionally, although conserved proteins are fewer, most of them are necessary for the survival of organisms, especially the phylum-shared and domain-shared ones.
Another characteristic of the mini-protein data is that, although hypothetical proteins are the majority and proteins with known functions are the minority, the functions of mini-proteins are diverse. As shown in Figure 2, mini-proteins are involved in broad functional classes, including information storage and processing, cellular processes and signalling, and metabolism. In fact, they are distributed in nearly all subclasses of these three larger classes, except for RNA processing and modification, nuclear structure, cytoskeleton and extracellular structures (data not shown). This result implies that regulatory and metabolic proteins are more common than constitutive or structural proteins, which can also be observed clearly among the phylum-shared and domain-shared proteins. As previously mentioned, some of the 299 mini-proteins in the S. cerevisiae genome are required for growth under genotoxic conditions, including exposure to hydroxyurea (HU), bleomycin and ultraviolet (UV) radiation, suggesting that they play important roles in the response to harsh environmental conditions [2]. Furthermore, the proportion of hypothetical proteins is very high, about 70.03%. This might be because (i) a great many mini-proteins are species-specific; (ii) some mini-proteins might be incorrectly annotated; and (iii) there are technical difficulties in identifying the functions of mini-proteins. However, we discovered that even among hypothetical proteins there is still a fraction of conserved sequences, including conserved proteins at each taxonomic level and 28 groups of proteins spanning at least five phyla. These will be useful for correctly annotating proteins and further exploring the function and evolution of mini-proteins, especially the highly conserved sequences listed in Table 4, which are more biologically significant and will be an emphasis of our future studies.
Mini-proteins have received significantly less attention from researchers because of the constraints of experimental and bioinformatic approaches. Here, we investigated the annotated mini-proteins from the sequenced genomic data and discovered some overall rules that could establish a foundation for further studies. However, the answers to many questions remain elusive and await resolution in the future. They include (i) how to identify potentially more mini-proteins in various genomes; (ii) how to confirm the functions of identified mini-proteins; (iii) what the biological functions of the mini-proteins are; and (iv) what the driving forces for the evolution of the mini-proteins are.

Materials and Methods
We collected the 532 completed prokaryotic genomes available from the National Center for Biotechnology Information (NCBI) as of 2 July 2007 and extracted all annotated protein sequences of ≤100 amino acids from their chromosomes and plasmids, because the length of one domain is usually below that cut-off value. Moreover, every strain was classified according to the NCBI taxonomy database.
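The extraction step can be illustrated with a short sketch. It assumes the annotated proteins are available as FASTA text; the helper names `read_fasta` and `mini_proteins` and the toy records are hypothetical and not part of the original pipeline.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def mini_proteins(handle, max_len=100):
    """Keep only proteins of at most max_len amino acids (the paper's cut-off)."""
    return [(h, s) for h, s in read_fasta(handle) if len(s) <= max_len]


# Toy input: one 6-AA protein, one 100-AA protein split over two lines, one 101-AA protein.
example = io.StringIO(
    ">p1 hypothetical protein\nMKTAYI\n"
    ">p2 exactly at the cut-off\n" + "A" * 60 + "\n" + "A" * 40 + "\n"
    ">p3 just above the cut-off\n" + "A" * 101 + "\n"
)
selected = mini_proteins(example)
```

Applied to the protein file of each chromosome and plasmid, the same filter would yield the per-genome mini-protein sets whose sizes are compared throughout the paper.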
We first analyzed the overall length distribution of mini-proteins and described the main characteristics of each phylum. Then, to detect the specific or shared mini-proteins of six representative phyla, we carried out a BLASTP search of every mini-protein sequence in one phylum against all the mini-protein data we extracted. From the results, we recorded the matches for each protein sequence with an E-value lower than 10^-5 and sequence identity higher than 60%, with low-complexity sequences filtered out.
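The hit-filtering thresholds can be sketched as below. The sketch assumes that the BLASTP results were saved in the standard 12-column tabular format (query, subject, percent identity, ..., E-value, bit score), which is an assumption about the output format rather than something stated above; low-complexity filtering is assumed to have been applied by BLAST itself during the search. The function name `filter_hits` and the exclusion of self-hits are likewise illustrative choices.

```python
import csv
from collections import defaultdict


def filter_hits(lines, max_evalue=1e-5, min_identity=60.0):
    """Map each query to the set of subjects passing both thresholds.

    Columns assumed: 0 = query id, 1 = subject id, 2 = percent identity,
    10 = E-value (standard 12-column tabular BLAST output).
    """
    hits = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity, evalue = float(row[2]), float(row[10])
        # Strict inequalities mirror "lower than 10^-5" and "higher than 60%".
        if query != subject and evalue < max_evalue and identity > min_identity:
            hits[query].add(subject)
    return dict(hits)


# Toy result lines (made-up identifiers and scores).
example_lines = [
    "q1\ts1\t85.0\t90\t13\t0\t1\t90\t1\t90\t1e-30\t150",
    "q1\ts2\t55.0\t90\t40\t0\t1\t90\t1\t90\t1e-10\t60",   # identity too low
    "q1\ts3\t70.0\t30\t9\t0\t1\t30\t1\t30\t1e-3\t40",     # E-value too high
    "q1\tq1\t100.0\t90\t0\t0\t1\t90\t1\t90\t1e-50\t200",  # self-hit
    "q2\ts1\t61.5\t80\t30\t0\t1\t80\t1\t80\t2e-6\t55",
]
kept = filter_hits(example_lines)
```

The resulting query-to-subject map is the kind of input from which taxon-level labels such as species-specific or phylum-shared could then be derived.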
In addition, to explore the conservation of mini-proteins in all phyla, we carried out BLASTP searches using mini-protein queries from a representative species of every genus against all mini-protein data, with the parameters described above. Amino acid sequence alignments were obtained with the ClustalX software [24]. For the domain analysis, we used the Pfam or InterPro websites, with the protein sequences searched including all prokaryotic and eukaryotic data. Moreover, we used the MEGA software [25] (bootstrapped neighbor-joining method) for phylogenetic reconstructions.