Mutation and Selection Cause Codon Usage and Bias in Mitochondrial Genomes of Ribbon Worms (Nemertea)

The phenomenon of codon usage bias is known to exist in many genomes and it is mainly determined by mutation and selection. To understand the patterns of codon usage in nemertean mitochondrial genomes, we use bioinformatic approaches to analyze the protein-coding sequences of eight nemertean species. Neutrality analysis did not find a significant correlation between GC12 and GC3. ENc-plot showed a few genes on or close to the expected curve, but the majority of points with low-ENc values are below it. ENc-plot suggested that mutational bias plays a major role in shaping codon usage. The Parity Rule 2 plot (PR2) analysis showed that GC and AT were not used proportionally and we propose that codons containing A or U at third position are used preferentially in nemertean species, regardless of whether corresponding tRNAs are encoded in the mitochondrial DNA. Context-dependent analysis indicated that the nucleotide at the second codon position slightly affects synonymous codon choices. These results suggested that mutational and selection forces are probably acting to codon usage bias in nemertean mitochondrial genomes.


Introduction
The genetic code is degenerate (64 codons for 20 amino acids and the termination signal), with most amino acids encoded by two to six synonymous codons. In protein-coding genes, synonymous codons that code for the same amino acid do not appear at the same frequency [1,2]. Bias in synonymous codon usage exists in a wide variety of organisms, from prokaryotes, to unicellular and multicellular eukaryotes [3,4,5]. Since different genomes have their own characteristic patterns of synonymous codon usage [6], it has not been easy to provide a satisfactory explanation for the particular pattern that is found in a given genome. In some prokaryotes, such as E. coli, and some unicellular eukaryotes, e.g., yeast, codon usage is thought to be determined by the equilibrium between natural selection and mutation bias [7,8,9,10]. Mutation bias is the major factor accounting for the variation in codon usage in some prokaryotes with extremely high A + T or G + C contents [11], and in many mammals [12]. However, translational selection plays the most important role in shaping codon usage in Drosophila [13], and in some plants [14,15,16].
Analysis of codon usage facilitates the understanding of evolution and environmental adaptation of living organisms [17]. While numerous reports on synonymous codon usage bias have focused on nuclear genomes, only few mitochondrions have been analyzed. Mitochondrial genomes are an evolutionary paradox with a relatively conserved gene content and small size. The genetic code of mitochondria furthermore often differs from the standard genetic code [18]. We know that the codon usage pattern changes during evolution, but how is still not completely known. Mitochondrial codon usage has been studied mainly in vertebrates, whereas among invertebrates so far only some parasitic platyhelminthes had been surveyed for mitochondrial codon usage and bias [19]. Currently, the complete or nearly complete mitochondrial genomes of nine nemerteans have been sequenced. These species include two palaeonemertans (Cephalothrix hongkongiensis, C. rufifrons, C. sp.), three heteronemerteans (Lineus alborostratus, L. viridis, Zygeupolia rubens), and three hoplonemerteans (Emplectonema gracile, Nectonemertes cf. mirabilis, Paranemertes cf. peregrina) (e.g., [20,21,22,23,24]). Thus all three major lineages of nemerteans are represented and it is now possible to investigate codon usage bias and associated forces in phylum Nemertea. Knowledge of codon patterns in the different nemerteans should improve the understanding of the mechanism of codon distribution and variation in nemertean mitochondrial genomes, and of factors shaping codon usage patterns. In this paper, we report the analysis of codon usage bias in eight nemertean mitochondrial genomes by using methods of multivariate statistical analysis and correlation analysis, and we also compare codon usage and tRNAs.

General aspects of nemertean mitogenomes
GC content of the entire gene, first, second, and third codon positions (GCall, GC1, GC2, and GC3 respectively) were calculated after excluding the stop codons. GC12 is the average of GC1 and GC2, and was used for analysis of neutrality plots (GC12 vs GC3 [25]). To investigate the nucleotide bias, skew was calculated as (A2T)/(A+T) or (G2C)/(G+C) [26]. Relative synonymous codon usage (RSCU) values were calculated with MEGA 5 [27]. RSCU values greater than 1.0 indicate that the corresponding codons are used more frequently than the expected frequency whereas the reverse is true for RSCU values less than 1.0.

Codon usage bias: measured by ENc
Effective number of codons (ENc) [28] is a widely used measure of codon bias. It ranges from 20 (in an extreme case, only one codon for each codon family was used) to 61 (all synonymous codons were equally used). ENc value was calculated with DAMBE 5 [29] for both combined coding sequences and individual genes. The ENc-GC3s plot has been widely used to determine whether codon usages of given genes are affected by mutation only (corresponding points would lie around the expected curve) or also by other factors such as selection (corresponding points would depart away from, considerably below the expected curve).
At individual gene level, ENc calculation is sometimes inaccurate due to statistical fluctuation and internal mathematical flaws, as pointed out by Fuglsang [30]. For this reason, only those protein-coding gene sequences that were 200 codons or more in length were used to draw ENc-GC3s plots.

PR2-bias plot analysis
The Parity Rule 2 (PR2) bias is detected by the value of AT-bias [A/(A+ T)] as the ordinate and GC-bias [G/(G + C)] as the abscissa [31]. In this plot, the center of the plot, where both coordinates are 0.5, is the place where A = T and G = C (PR2), with no biases between the two complementary strands of DNA in mutation and selection rates (substitution rates). A vector from the center represents the extent and direction of biases from PR2. PR2 bias plots are particularly informative when PR2 biases at the third codon position in the four-codon amino acids of individual genes are plotted [32]. In this case, 'A3/(A3 + T3)' and 'G3/(G3+C3)' are plotted as the ordinate and abscissa, respectively.

Statistical tests for context-dependent codon usage bias
Methods previously applied to animal mitogenomes [33] and plant mitogenomes [34] were adopted to detect the influences on codon usage by context-dependent mutation in each nemertean mitochondrion. To avoid the constraint from amino acid usage, only data of the four-fold degenerate (FFD) families (Ala, Arg, Gly, Leu, Pro, Ser1, Ser2, Thr, Val) were used. The null hypothesis is that mutations are independent single-site events; therefore, the base frequencies of the third codon position in FFD families are not influenced by bases at the second codon positions. To test the independence between the second and third codon positions, the following three datasets were used:

Statistical analysis
Statistically significant (P,0.05) associations were tested using x 2 test and correlations were analyzed using the Spearman's rank correlation test. GraphPad Prism 5 software (GraphPad Software, San Diego, CA, USA) was used for statistical evaluation. Table 1 summarizes nucleotide composition of the analysed mitochondrial genomes. All the protein-coding genes of the eight species are transcribed in the same direction (L strand). Among the eight species, the genomic GC contents (GCall) varied from 26.6% in Cephalothrix sp. to 37.2% in Lineus alborostratus. The GC1, GC2 and GC3 are low in the eight nemertean mitochondrial genomes, which is consistent with the generally lower genomic GC contents in invertebrate mitochondrial genomes. The GC3s are significantly higher in heteronemerteans than those in hoplonemerteans

Codon bias level of nemertean mitochondrial genes: measured by ENc
The effective number of codons (ENc) has been widely used to measure the codon bias level of individual genes [28]. The more biased a gene, the smaller the ENc value. In all eight mitochondrial genomes, ENc values of all individual genes (.600 bp) were calculated. Table 2 lists ENc values of 9 mitochondrial genes (except atp8, nad4L, nad3, nad6) from all eight genomes. For each gene, the ENc value is the highest in L. alborostratus; in contrast, the values are lowest in two Cephalothrix species. ENc-plot ( Figure 2) of eight nemerteans showed that although there were a small number of genes close to the expected ENc-plot curve, which indicates that mutation plays a role in defining the codon usage variation among those genes, a majority of the points are below the expected curve. This suggests that not only mutation but also other factors, such as translational selection, are likely to be involved in determining the selective constraints on codon bias in nemertean mitochondrial genomes.

Base composition bias and codon usage bias
The AT skews for the eight nemertean mitochondrial genomes are all negative, while GC skews are all positive. This indicates asymmetry in nucleotide composition between the two strands, with one being rich in A and C, and the other being rich in T and G (Table 1). This phenomenon is common in mitochondrial genomes [26], and suggests the presence of asymmetric patterns of mutational changes between strands [31,35].
RSCU analysis unraveled a general bias toward codons having nucleotides A or U at the third position and U was detected more frequently (Table 3), similar to what has been found in some other invertebrate mitochondrial genomes [19]. These findings may indicate directional mutation in the codon usage patterns of nemertean mitochondrial genomes, and suggest the correlation between genome A+T content and overall codon bias, which is consistent with vertebrate mitogenomes [36].
To investigate whether these biased codon choices are restricted in highly biased protein-coding genes, the association between purines (A and G) and pyrimidines (C and T) in four-codon amino acid families was analyzed by Parity Rule 2 (PR2) bias plot [32] ( Figure 3). G and T were more frequent than C and A in all comparisons except in two Cephalothrix species (Figure 3). Differences between C and G and between A and T contents were observed commonly in most protein-coding genes.
The wobble nucleotides of tRNA anticodons in nemertean mitochondrial genomes show a strong bias towards G and U, which is congruent with the mutation bias of the coding strand.
In comparing tRNA anticodon and synonymous codon families (Table 3), except for stop codons, we found seven of the NNY codon family (where N represents any of the four nucleotides and Y either C or U) are ''non-optimal codon-anticodon usage'' [37]. Each with one synonymous codon, whose corresponding tRNA is not encoded in the mtDNA, used strongly preferred. The optimal codons ended with U, such as UGU for Cys. The RSCU value of UGU was higher than for UGC, yet the wobble position of the tRNA-Cys anticodon was G. Also, ''combined codon-anticodon usage'' is found in the NNN codon family; two synonymous codons are used substantially more preferentially and corresponding tRNA is both encoded and not encoded in the mtDNA. Both of the two codon RSCUs were distinctly higher than the other two codons in the four-fold synonymous codon families. For example, the RSCU values of ACA and ACU, which encode Thr, were the highest, but the wobble position of tRNA-Thr anticodon was only U. Thus, only ACA could completely pair with the anticodon.
Codon usage also can be influenced by translational selection. Because tRNAs are involved in protein translation, there is a high correlation between preferred codons and the abundance of corresponding tRNAs in many bacteria [1,38]. Only one tRNA gene translates a synonymous codon family in animal mitochondrial genomes. Nevertheless, it is still possible for translational selection to occur between synonymous codons translated by a single tRNA, if one codon interacts more effectively than another with the anticodon of tRNA [33]. The codon-anticodon adaptation hypothesis described for vertebrate mitogenomes by Xia [36] predicts that the anticodon should match the most abundant codon. Our analysis of the RSCU and tRNAs encoded by the mtDNA shows that most NNY and NNN family codons could not pair completely with the tRNA anticodons in nemertean mitochondrial genomes. Thus, our findings from nemerteans do not support the selection hypothesis of anticodon adaptation, which also was refuted by studies on bivalves, arthropods and fungi [39,40,41].

Context-dependent mutation in nemertean mitogenomes
To investigate the complexity of mutational dynamics in nemertean mitogenomes, data of the nine FFD families were analyzed. The independence of nucleotides at the second and third codon positions was analyzed from three datasets (Table 4): Leu/ Pro/Arg; Val/Ala/Gly; and (Leu+Val)/(Ser1+Pro+Thr+Ala)/ (Arg+Gly+Ser2). Each of these three datasets forms a 364 contingency table, and was tested by a x 2 test for independence between nucleotide frequencies at the third codon position (codon usages) and second codon position (U, C or G) ( Table 4).
Evidence of context-dependent mutation has previously been reported from both animal and plant mitochondrial genomes [33,34]. The relative frequencies of the four codons in each FFD family should be similar, if the frequency of the nucleotide at the third codon positions is not influenced by other sites. This expectation is not borne out in our results (Table 4). Based on three different datasets and the x 2 tests with 6 degrees of freedom, the calculated P-values were extremely significant in all cases, suggesting that the base at the second codon position affected the synonymous codon choices.
Among all of the nine FFD families, the nucleotide at the second codon position (Y in a given codon XYZ) could be U, C, or G, and the nucleotide at the third codon position (Z in a given codon XYZ) could be any of the four regular bases. Context-dependent mutational processes create correlations between neighboring  bases. These can be measured by dinucleotide frequency ratio R(YZ) defined as f YZ /(f Y f Z ), as in a previous study [33]. For a total of 12 possible dinucleotides, R(YZ)s were calculated using all three datasets for each nemertean mitochondrial genome (Figure 4). The calculated frequency ratios did not deviate much from 1 for most dinucleotides, but there are some exceptions. When the nucleotide is G at the second codon position, the nucleotide at the third codon position is more likely to be G, with mean R(GG) in eight nemertean mitochondrial genomes equal to 1.8460.34, which is a statistically significant deviation from the expected ratio 1. When the nucleotide is C at the second codon position, the nucleotide G is less likely to be at the third codon position, with mean R(CG) equal to 0.4360.16, reflecting a prejudice against NCG codons.
Jia and Higgs [33] had previously reported R(YZ) ratios based on mammal mitochondrial data. The nemertean mitochondria showed similar ratios for the dinucleotides. The largest ratio is R(GG) (nemertean to mammal: 1.8460.34 vs 1.65, whereas the smallest is R(CG) (0.4360.16 vs 0.55) in all cases. These results suggest that there are context-dependent mutational effects influencing dinucleotide positions in a similar way in both nemerteans and mammals.
Relative dinucleotide frequencies in vertebrate mitochondrial genomes are also reported by [42]. They consider all pairs, irrespective of reading frame or whether the sequence is coding or noncoding. They also observe GG is most overrepresented and CG is most underrepresented. One process known as 'CpG effect' showed that CG dinucleotide is a mutational hotspot in mammalian genomes [5], which may explain a low frequency ratio of R(CG) in animal mitogenomes.

Conclusion
This study conclusively demonstrates that the genome-wide codon usage bias in nemertean mitochondrial genomes is mainly set by mutational force, though the dynamics may be complex. Codon usage of the nemertean mitochondrial genomes appears to be not necessarily constrained by tRNA anticodons, suggesting that codon usage and anticodons evolve independently. Therefore, the mutation together with selection dynamics may play an important role in shaping the pattern of codon usages in nemertean mitochondrial genomes. Different genomes have their own characteristic patterns of synonymous codon usage [43]. The previous studies on genomes of Drosophila [44] and nematodes [45,46] demonstrated that codon usage similarity usually persists over the breadth of a genus but then rapidly diminishes within each clade of higher taxonomic rank. However, Ingvarsson [47] showed that five species within the genus Populus with close phylogenetic relationship have significantly different synonymous codon bias. Therefore, it may be difficult to draw conclusions about assumed codon usage patterns to unsampled species beyond the genus level. Additionally, different life history characteristics can potentially influence effective population size which will affect patterns of codon usage (e.g., [48]). This might be the same phenomenon in nemerteans, for instance, palaeonemertans and hoplonemerteans have direct development, and heteronemerteans have indirect development with larval stages. Nevertheless, basic knowledge of codon usage patterns of nemertean mitochondrial genomes and the factors regulating the codon usage are presented in the study, more comprehensive analysis is necessary for revealing the deeper characteristic of codon usage, when more nemertean mitochondrial genomes are available.