Comparative Analysis of Codon Usage Bias Patterns in Microsporidian Genomes

The sub-3 Mbp genomes from microsporidian species of the Encephalitozoon genus are the smallest known among eukaryotes and paragons of genomic reduction and compaction in parasites. However, their diminutive stature is not characteristic of all Microsporidia, whose genome sizes vary by an order of magnitude. This large variability suggests that different evolutionary forces are applied on the group as a whole. In this study, we have compared the codon usage bias (CUB) between eight taxonomically distinct microsporidian genomes: Encephalitozoon intestinalis, Encephalitozoon cuniculi, Spraguea lophii, Trachipleistophora hominis, Enterocytozoon bieneusi, Nematocida parisii, Nosema bombycis and Nosema ceranae. While the CUB was found to be weak in all eight Microsporidia, nearly all (98%) of the optimal codons in S. lophii, T. hominis, E. bieneusi, N. parisii, N. bombycis and N. ceranae are fond of A/U in third position whereas most (64.6%) optimal codons in the Encephalitozoon species E. intestinalis and E. cuniculi are biased towards G/C. Although nucleotide composition biases are likely the main factor driving the CUB in Microsporidia according to correlation analyses, directed mutational pressure also likely affects the CUB as suggested by ENc-plots, correspondence and neutrality analyses. Overall, the Encephalitozoon genomes were found to be markedly different from the other microsporidians and, despite being the first sequenced representatives of this lineage, are uncharacteristic of the group as a whole. The disparities observed cannot be attributed solely to differences in host specificity and we hypothesize that other forces are at play in the lineage leading to Encephalitozoon species.


Introduction
Microsporidia are spore-forming, single-celled fungal pathogens best known for their unique infection apparatus called the polar tube and for harboring species with the smallest reported nuclear genomes. Microsporidia as a group are highly diverse, with more than 1,500 distinct species infecting vertebrate and invertebrate hosts widely spread across the Tree of Life, and cause growing concerns due to their medical, environmental and economic relevance [1]. Their diversity is reflected at the genetic level, as the extreme levels of reduction encountered in the Encephalitozoon lineage [2][3][4][5] are not characteristic of the group. The microsporidian genetic paraphernalia vary by at least an order of magnitude, from as little as 2.3 Mbp [2] to more than 25 Mbp [6], with the underlying content and structure changing accordingly. These changes may reflect, at least in part, the different evolutionary pressures applied by the various host ranges with which each microsporidian species co-evolve, and a better understanding of these changes may lead to better predictions models about their zoonotic and lethal potentials.
The genetic code plays a critical role in living cells, but not all species use its built-in redundancy in the same way. Codon usage biases (CUB) are widespread across the Tree of Life and are affected by nucleotide composition [7], translation processes [8], tRNA abundance [9], gene function [10] and length [11], protein structure [12] and hydrophobicity [13], environment temperature [14] and other factors. In particular, the balance between gene mutation and natural selection determines the CUB [15]. The genetic code itself is not universal and deviations from the standard code can have profound impacts on the translational apparatus of the corresponding organisms. Conversely, irreversible modifications to a species' translational apparatus can force it to adapt its CUB accordingly. With genome reduction often comes simplification and forced specialization at the expense of versatility.
Microsporidia have simpler ribosomes than their fungal relatives with sediment coefficients that are similar to that of prokaryotes [16]. This simplification could potentially limit the breadth of possible codon usage biases that they can adopt. While CUB in Microsporidia have been investigated to various degrees [4,[17][18][19], only one study addressed their CUB in a systematic, albeit succinct fashion [18]. Here, we expanded on previous studies by using a wider selection of eight taxonomically dispersed microsporidian representatives with available genomic sequences. By examining several statistics that characterize CUB, we identify several trends and uncover some implications for selection pressures affecting this idiomatic phylum.

Nucleotide composition analyses
The GC content of the entire CDS (GC cds ) as well as the first (P 1 ), second (P 2 ), and third (P 3 ) codon position GC content were calculated using a custom PERL script (available on https:// github.com/hxiang1019/calc_GC_content.git). To account for the inequality of α and γ at the third codon position [25], the three stop codons (UAA, UAG, and UGA) and the three codons for isoleucine (AUU, AUC, and AUA) were excluded in calculation of P 3 , and the two single codons for methionine (AUG) and tryptophan (UGG) were excluded from P 1 , P 2 , and P 3 . Neutrality plots were drawn using the average value of P 1 and P 2 (P 12 ) as the vertical axis and the P 3 as the horizontal axis. The nucleotide compositions of the third codon position (A 3 , U 3 , C 3 , and G 3 ) were also obtained and used to calculate the AU-bias [A 3 / (A 3 +U 3 )] and GC-bias [G 3 / (G 3 +C 3 )]. The Parity rule 2 (PR2) plots were drawn based on AU-bias and GC-bias.

Codon usage indices and ENc-plot
The Codon Adaptation Index (CAI), the Effective Number of Codons (ENc), and the third synonymous codon position GC content (GC 3s ) were calculated using CodonW (John Peden, http://www.molbiol.ox.ac.uk/cu, version 1.4.2) using Saccharomyces cerevisiae as reference [16]. The ENc vs GC 3s plots were generated from this data.

RSCU and correspondence analyses
The relative synonymous codon usage (RSCU) was calculated using CodonW. The high-and low-expression gene datasets were defined as genes in the upper and lower 5% of CAI values for each microsporidian species. RSCU values of these two datasets were compared through a chi-squared test, and the codons whose usage frequency in the high-expression genes was significantly higher (P-value < 0.05) than in the low-expression genes were identified as the optimal codons [26]. Codons with RSCU values less than 0.1 were classified as rare codons. A heat map was drawn with CIMMiner (http://discover.nci.nih.gov/cimminer) [27] and clustered the microsporidian RSCU values using a Euclidean distance method and an Average Linkage cluster algorithm. The correspondence analysis (COA) [28] was performed with CodonW utilizing the RSCU values to compare the intra-genomic variation of 59 informative codons, partitioned along 59 orthogonal axes with 41 degrees of freedom. Correlation analyses, ANOVA and significance tests were performed with Microsoft Excel and SPSS 18.0 (http://www.spss.com/).

Codon usage biases
Codon usage patterns for microsporidian genomes were investigated by calculating RSCU values ( Table 1). The RSCU is the observed frequency of a codon divided by the expected one. If the RSCU is close to 1, synonymous codons are used without apparent biases. When the RSCU value is greater or less than 1, the codons investigated are used more or less frequently than expected, respectively.
The preferred codons (RSCU > 1, Table 1; in bold) in S. lophii, T. hominis, E. bieneusi, N. parisii, N. bombycis, and N. ceranae are strongly biased towards A/U bases in third position, in contrast to E. intestinalis and E. cuniculi where more than half of the codons end with G or C ( Table 2). Optimal codons (shown in Ã or @, Table 1) identified by chi-squared tests are similarly biased. Nearly all of the optimal codons in S. lophii (17 A/U-end in 17 optimal codons), T. hominis (28 A/U-end in 29 optimal codons), E. bieneusi (29 A/U-end in 29 optimal codons), N. parisii (28 A/U-end in 28 optimal codons), N. bombycis (27 A/U-end in 29 optimal codons), and N. ceranae (17 A/U-end in 17 optimal codons) are A/U-end whereas more than half of the optimal codons in E. intestinalis (17 G/C-end in 23 optimal codons) and E. cuniculi (14 G/Cend in 25 optimal codons) are G/C-end (Table 2). When clustering these biases according to a heat map (Fig 1), these values display a remarkable difference between the Encephalitozoon species and the other six microsporidians. Both the sign * (P-value < 0.01) and @ (0.01 < P-value < 0.05) represent the optimal codons, while the sign -(RSCU < 0.10) denotes the rarely used codons. The preferred codons (RSCU > 1) are in bold. doi:10.1371/journal.pone.0129223.t001

Correlation analyses
Nucleotide composition is an important factor influencing CUB, and the mean values of all of the microsporidian GC cds are similar to their reported overall genomic GC content (  (Table 4) shows that the GC cds , P 1 , P 2 and P 3 are significantly related to each other for all eight Microsporidia. In addition, correlations between ENc and GC cds , P 1 , P 2 and P 3 were also investigated. ENc relates the overall synonymous codon usage, ranging from only 20 codons being used for each of the 20 amino acids, to all 61 codons being used randomly [29]. In Table 4, the ENc is significantly correlated to GC cds , P 1 , P 2 and P 3 for Microsporidia. Correlations between ENc and  GC cds , P 1 , P 2 and P 3 were either negative or null for Encephalitozoon species but positive for the other six Microsporidia.
To judge whether the nucleotide composition is the only factor to influence CUB for Microsporidia, the ENc-plot (Fig 2) was drawn. If genes follow the standard curve ENc = 2+GC 3s +-29/[GC 3s 2 +(1-GC 3s ) 2 ], the microsporidian CUB is determined primarily by the nucleotide composition [29]. In Fig 2, the distribution of most genes below the standard curve indicates that there are other factors acting on microsporidian CUB.

Parity Rule 2 plot analyses
Examining the codons with RSCU > 1 and the optimal codons, species from the Encephalitozoon genus prefer G-end over C-end, despite the other microsporidians preferring A and U ends to a roughly equivalent degree ( Table 2). Because of this bias of guanine over cytosine, all codons were examined by PR2 plot analysis (Fig 3)

Neutrality plot analyses
The neutrality plot analysis (Fig 4) was carried out to characterize the correlation among the three codon positions, and then identify the presence of selective mutation on CUB [25]. In Table 4. Correlation analysis among GC cds , P 1 , P 2 , P 12 , P 3 and ENc for eight Microsporidia. the neutrality plot, if a gene is located on the slope of unity there is a significant correlation between its P 12 and P 3 , meaning the gene is under neutral mutation via random selection pressure. If the gene is under a directed mutational pressure it should fall below the slope of unity, closer to the X-axis. Thus a regression line with a slope less than 1 would indicate a whole genome trend of non-neutral mutational pressure [31]. The Encephalitozoon species have regression slopes of 0.0589 and -0.0048, and correlation coefficients of 0.082 and  -0.021 respectively. Their extremely low relative neutralities (5.9% and 4.8%) might suggest a large amount of directed mutational pressure, although the low Spearman correlations combined with P-values > 0.05 make these unreliable (Fig 4 and Table 4). S. lophii, T. hominis, E. bieneusi, N. parisii, N. bombycis and N. ceranae all have relative neutralities ranging from 20-38% and significant correlation coefficients with P-values < 0.01 (Fig 4 and  Table 4), indicating that directed mutational pressure plays an important role in shaping CUB for these Microsporidia.

Correspondence analyses
The correspondence analysis was used to check what other factors shape the microsporidian CUB. This multivariate statistical method surveys the variation of RSCU values within the genome [28]. The correspondence analysis shows the distribution of genes and reflects the distribution of their corresponding codons, unveiling potential influences on CUB [13]. In the correspondence analysis, a series of orthogonal axes were produced to represent the factors responsible for CUB (Fig 5). For E. intestinalis, E. cuniculi, S. lophii, T. hominis, E. bieneusi, N. parisii, N. bombycis, N. ceranae, Axis 1 accounted for 9.81%, 11.98%, 12.44%, 16  31.31%, 24.54%, respectively. This suggests that the first axis is the primary factor (9-18% of the overall variation), which was also found to be significantly correlated with CAI, ENc, and GC 3s . However, other factors are also responsible for the codon usage variation based on Axes 2 to 4.

Codon usage indices
The notable differences observed in CUB between Encephalitozoon species and the other microsporidians (Fig 1) were confirmed by one-way ANOVA (F-value: 1734.906 in CAI, 1919.114 in ENc; and P-value < 0.01 in both) based on the CAI and ENc (Fig 6) and by a Ttest (P-value < 0.01). CAI is a ratio of the synonymous codon bias in a gene to a highly expressed reference gene. With values that range between 0 and 1, a higher CAI value indicates a stronger bias of synonymous codon usage and a potentially higher gene expression [32]. For microsporidian genomes, their mean CAI values are less than 0.15, and their mean ENc values are larger than 37 (Fig 6). These indicate the presence of mild synonymous codon usage bias across microsporidian genomes as a whole, with Encephalitozoon species showing the highest degree of randomization (Fig 6).

Discussion
The codon usage bias, an important feature of species that can reflect the evolutionary patterns of their genome, has been reported in numerous organisms [33]. Here, the CUB in Microsporidia was studied based on eight genomes and 22,467 coding sequences, with optimal codons identified by RSCU values. The optimal codon usage pattern was found to be significantly different between species from the genus Encephalitozoon and those from other microsporidian lineages. Nearly all of the optimal codons in S. lophii, T. hominis, N. parisii, E. bieneusi, N. bombycis, N. ceranae feature a biased A/U third codon position, while the Encephalitozoon species (E. intestinalis, E. cuniculi) have a more balanced nucleotide distribution, yet slightly biased towards G/C. Although the microsporidian CUB are mild according to CAI and ENc values, these statistics are likely the result of a larger distribution of varying biases for individual genes averaging out to a less significant overall genome bias. This has previously been described [3,5], where Encephalitozoon GC content smoothly arced across the chromosomes, but averaged an unremarkable GC%. Still, nucleotide composition is clearly a factor in microsporidian CUB, which was confirmed by correlation analysis.
Besides nucleotide composition, the microsporidian CUB is also influenced by other factors including directional mutation pressure, which appears to play a much larger role than selection in the CUB of Microsporidia according to neutrality analyses. While the neutrality plots do suggest an even larger directional mutation pressure for the Encephalitozoon species, the observed correlation coefficients undermine its overall significance. Being the first completely sequenced microsporidian genome, Encephalitozoon cuniculi has long been regarded as the model Microsporidia. Its reduction and compaction were for a time thought to be typical traits of microsporidian genomes. However, comparative analyses of recently released microsporidian genomes [6] rather indicates that Encephalitozoons are the exception rather than the norm, and that the evolutionary trends they have followed are not characteristic of the group. This is corroborated by our analyses. As the Encephalitozoon species are known as the smallest eukaryotic genomes (assembled genomic sizes of 2.22 and 2.50 Mbp; Table 3), they are likely under stronger reductionist pressure (directional mutation pressure) than the larger microsporidian genomes, of which many have expanded by genome duplication, horizontal gene transfer and transposable elements proliferation [1].
Intuitively, different hosts respond differently to parasitic infections, and while arthropods do possess an innate immunity, they lack the adaptive immune response found in mammals. However, the Encephalitozoon species represented here are not the only microsporidians infecting mammals, with both E. bieneusi and T. hominis reported as human pathogens [18,22]. If host specificity was the sole factor involved, we would expect these four species to display similar trends, which is clearly not the case. Unfortunately, it is unknown how long these species have been coevolving with their host, and the observed disparities may be due to differences in duration rather than in the relative strengths of the underlying evolutionary pressure applied. Here, we hypothesize that the Encephalitozoon species may have been under mammalian host pressure for longer evolutionary periods, which would explain why they are so markedly different. A caveat of this is that the host specificity of these organisms is unclear. In fact, the Encephalitozoon species E. romaleae infects grasshoppers but also most likely can pester mammalian cells based on its uncanny genomic similarities with is sister species E. hellem [4], and we don't know if its presumed absence in humans is real or rather due to limited sampling. Alternatively, the observed differences may be a direct consequence of the strongly reduced metabolic potential inherent to their Lilliputian gene repertoire, and it would be interesting to revisit the genome of the closest known relative of Encephalitozoon species, Ordospora colligata [6], which has been released during the latter stages of publication of this manuscript. O. colligata infects the water flea Daphnia (an arthropod [34]) and displays a similarly reduced genome, whose overall characteristics may help better delineate what has happened in the lineage leading to the Encephalitozoon species.