Analysis of Synonymous Codon Usage Bias of Zika Virus and Its Adaption to the Hosts

Zika virus (ZIKV) is a mosquito-borne virus (arbovirus) in the family Flaviviridae, and the symptoms caused by ZIKV infection in humans include rash, fever, arthralgia, myalgia, asthenia and conjunctivitis. Codon usage bias analysis can reveal much about the molecular evolution and host adaption of ZIKV. To gain insight into the evolutionary characteristics of ZIKV, we performed a comprehensive analysis on the codon usage pattern in 46 ZIKV strains by calculating the effective number of codons (ENc), codon adaptation index (CAI), relative synonymous codon usage (RSCU), and other indicators. The results indicate that the codon usage bias of ZIKV is relatively low. Several lines of evidence support the hypothesis that translational selection plays a role in shaping the codon usage pattern of ZIKV. The results from a correspondence analysis (CA) indicate that other factors, such as base composition, aromaticity, and hydrophobicity may also be involved in shaping the codon usage pattern of ZIKV. Additionally, the results from a comparative analysis of RSCU between ZIKV and its hosts suggest that ZIKV tends to evolve codon usage patterns that are comparable to those of its hosts. Moreover, selection pressure from Homo sapiens on the ZIKV RSCU patterns was found to be dominant compared with that from Aedes aegypti and Aedes albopictus. Taken together, both natural translational selection and mutation pressure are important for shaping the codon usage pattern of ZIKV. Our findings contribute to understanding the evolution of ZIKV and its adaption to its hosts.


Introduction
Zika virus (ZIKV) is classified as a mosquito-borne arbovirus of the family Flaviviridae, genus Flavivirus [1]. This virus was first isolated from a blood sample of a Rhesus monkey in Uganda in 1947 and, before its outbreak in Oceania in 2007, it was confined to Africa and Southeast Asia [1]. Since then, ZIKV has been circulating in the Americas, and in May 2015, the first case of ZIKV originating from the Americas was reported in Brazil. Thus far, ZIKV has expanded from South America to more than 28 countries and has aroused the attention of the World Health Organization (WHO) as well as that of many governments [2,3]. Clinical presentation of ZIKV fever is non-specific; the most common symptoms are rash, fever, arthralgia, myalgia, asthenia, and conjunctivitis. ZIKV is thought to be transmitted to humans mainly the frequency of the nucleotide Y, and f xy the frequency of the dinucleotide XY. As a criterion, If ρ xy >1.23 or <0.78, the XY dinucleotide is considered to be over-represented or under-represented compared with a random association of mononucleotides [22].
Comparison between the codon usage pattern in ZIKV and those in its hosts RSCU. RSCU was employed to investigate the overall synonymous codon usage bias among the genes, and this value was defined as the ratio of the observed codon usage to the expected value [23]. Codons with a RSCU value of >1.6 were regarded as over-represented, while codons with a RSCU value of <0.6 were regarded as under-represented. Codons used at an average level (no bias) have the RSCU values of 1 [24]. In our comparison of the ZIKV codon usage pattern with those of its hosts, if the RSCU value for the polyprotein-coding region of ZIKV and that of the same codon for the host were both <0.6, >1.6, or between 0.6 and 1.6, their codon usage patterns were judged to be similar [25]. The codon usage data of ZIKV's hosts, including human (Homo sapiens) and mosquitoes (A. aegypti and A. albopictus) was retrieved from the codon usage database (http://www.kazusa.or.jp/codon).

D(A,B).
To determine the influence of the overall codon usage of hosts on that of ZIKV, the similarity index D(A,B) [26] was calculated as follows: where R(A,B) is termed as a cosine value of an included angle between A and B spatial vectors  and represents the extent of similarity between ZIKV and hosts in the aspect of overall codon usage pattern. a i is defined as the RSCU value for a specific codon among 59 synonymous codons in ZIKV polyprotein-coding region. b i is defined as the RSCU value for the same codon of ZIKV's hosts. D(A,B) represents the potential effect of the overall codon usage of the hosts on that of ZIKV, and its value varies from 0 to 1 [26]. The higher D(A,B) means the stronger influence of environment related synonymous codon usage patterns of hosts on that of ZIKV. tRNA adaptation index. tRNA adaptation index (tAI) is used to estimate tRNA usage for the coding sequences of a species [27]. It represents the levels of co-adaption between a special codon and a corresponding tRNA pool and has greater correlations with protein abundance compared with other indicators [28]. The tAI value of ZIKV polyprotein-coding region based on the tRNA copy number of H. sapiens was calculated by Visual Gene Developer [29].

Effect of mutation pressure and translational selection on the codon usage bias
ENc-GC3s plot. A ENc-GC3s plot was used to investigate the influence of the GC3s content on codon usage [20]. The expected ENc values for each GC3s were calculated using the following formula: where s represents the GC3s value. Parity rule 2 (PR2) plot. A Parity rule 2 (PR2) plot was used to assess the influence of mutation pressure and translational selection on the codon usage of genes [30]. This plot is shown by the value of AU-bias [A 3 /(A 3 +U 3 )] as the ordinate and GC-bias [G 3 /(G 3 +C 3 )] as the abscissa at the third codon position of the four-codon amino acids. The center of the plot, where both coordinates are 0.5, is the position where A = U and G = C (PR2), with no bias between influence of mutation and translational selection rates.
Neutrality plot (GC12 Vs GC3). Analysis of the correlation between the GC contents at the first and second codon positions (GC12) and that at the third codon position (GC3) is useful to examine the effect of mutation pressure and translational selection on the base composition [31]. Therefore, GC12 and GC3 were calculated by using the EMBOSS CUSP program [32] and then subjected to correlation analysis.

Correspondence analysis (CA)
Correspondence analysis (CA) is a useful multivariate statistical method for studying the internal relationship between variables and samples [33]. The mathematics procedure of CA transforms the RSCU values into a series of dimensional factors, and the results can be used to analyze the major trend in codon usage patterns among different samples. Each gene is represented with 59 dimensional variables, and each dimension matches the RSCU value of one codon, with the exclusions of AUG, UGG and stop codons. The CA was performed using the CodonW 1.4.2 program. The first two axes of CA (Axis 1 and Axis 2) were subjected to a correlation analysis.

Statistical analysis
The statistical analyses were performed using SPSS 13.0 software package (SPSS Inc., Chicago, USA). Correlation analyses were carried out using Spearman's rank correlation test. The P-values less than 0.05 were considered statistically significant.

Phylogenetic analysis of ZIKV based on polyprotein-coding region
To determine the phylogenetic relationship of different ZIKV strains, a phylogenetic tree was drawn (Fig 1). The results show that 46 strains of ZIKV can be divided into two genera (I, II) and the strains isolated from the same geographic regions cluster together (Fig 1). It can be seen that the members isolated from Africa, including Senegal, Central African Republic and Uganda, firstly cluster together and form a separate branch, and subsequently cluster with the members isolated from other countries all over the world (Fig 1).

Nucleotide composition analysis
The GC3s content is a useful indicator of the extent of the base composition bias, representing the frequency of the nucleotides G+C at the synonymous third codon position, excluding Met, Trp, and the termination codons. The mean value of the GC contents in the 46 tested strains of ZIKV is 50.98% (50.40-51.20%; SD, 0.216), while the average value of their GC3s contents is 51.53 (49.60-52.10%; SD, 0.685) ( Table 1).
An analysis of nucleotide composition at the third position of synonymous codons (G3s, A3s, U3s, C3s) indicates that the mean values of C3s (31.97%) and A3s (33.32%) are higher compared with those of G3s (31.87%) and U3s (24.95%) in ZIKV polyprotein-coding region (Table 1). Moreover, it was found that the G and A nucleotides are abundant with mean values of 29.16% and 27.50%, respectively, while the average values of U and C nucleotide were 21.52% and 21.82%, respectively (data not shown in Tables). The G and A contents are significantly higher compared with U and C contents (Student's t test, p<0.01). These results highlight that there is a GA-rich composition in ZIKV polyprotein-coding region.
The synonymous codon usage characteristics of the ZIKV polyproteincoding region ENc was used to quantify the codon usage bias of each gene [20]. ENc values can range from 20 to 61, and lower values of ENc represent higher levels of codon usage bias. To measure whether or not ZIKV strains show similar codon usage biases, the ENc values of 46 different strains were calculated. The ENc values of the ZIKV polyprotein-coding regions vary from 52.13 to 55.00, with a mean value of 53.32 and a standard deviation (SD) of 0.81, showing that the codon usage bias of ZIKV is low ( Table 1).
The CAI value is a universal measure of the synonymous codon usage of genes in different organisms and can be used to analyze the adaption of a species to its hosts [17]. CAI values can range from 0 to 1 and higher CAI values signify higher levels of codon usage bias. We found that, in relation to human, the CAI values of ZIKV polyprotein-coding regions range from 0.734 to 0.741, with an average value of 0.740 and a SD of 0.002 (Table 1).
This study on 46 ZIKV strains revealed that the codon usage bias in the polyprotein-coding region of ZIKV is low as the mean ENc value of ZIKV polyprotein-coding regions is 53.32 (>40). This result is analogous to those of previous studies, which found that some RNA viruses, such as hepatitis A virus, bovine viral diarrhea virus, SARS-coronavirus, Newcastle disease virus, Marburg virus, and swine fever virus, also show a weak codon usage bias [22,[34][35][36][37][38]. A possible explanation for this is that the low codon usage bias may be beneficial for the efficient transcription and translation of virus genes in host cells [22]. In addition, ZIKV shows the high CAI value (0.740) for H. sapiens, suggesting that natural selection from H. sapiens can affect the codon usage of ZIKV and the evolution of codon usage in ZIKV has made it to utilize the translation resource of H. sapiens more efficiently. This is similar to Marburg virus, which also has a higher CAI value for H. sapiens but shows low codon usage bias [37].

Relationships between the codon usage pattern of ZIKV and that of its hosts
To investigate the synonymous codon usage pattern, the RSCU values of 59 codons (excluding Met, Trp, and the termination codons) in ZIKV polyprotein-coding regions were calculated. Among 18 preferable codons, 13 have an end base of A or C, while only five have an end base of U or G; therefore, the codons with end bases of A and C are prone to be preferentially utilized in the ZIKV genome ( Table 2).
To determine if the codon usage pattern of ZIKV is influenced by that of its hosts, the codon usage pattern of ZIKV was compared with the codon usage patterns of its natural hosts, including H. sapiens, A. aegypti, and A. albopictus. We found that 47 of 59 synonymous codons between ZIKV and H. sapiens are equivalently selected while 40 or 30 of 59 synonymous codons between ZIKV and A. aegypti or A. albopictus, respectively, are similarly selected ( Table 2). In general, the similarity in the degree of codon usage between ZIKV and H. sapiens is higher than that between ZIKV and A. aegypti or A. albopictus. Specifically, CUG for leucine (Leu), AGC for serine (Ser), GCC for alanine (Ala), UAC for tryptophan (Tyr), CAC for histidine (His), AAC for asparagine (Asn), AAG for lysine (Lys), and UGC for cysteine (Cys) have high similarity between ZIKV and its natural hosts. Additionally, the RSCU values of several codons showed a strong discrepancy between ZIKV and its hosts, such as CUA for Leu, AUA for isoleucine (Ile), CCA for proline (Pro), and CGA/CGG/AGA for arginine (Arg).
These results suggest that the selection pressure from the hosts may influence the codon usage pattern of ZIKV, which may assist it in adapting to the cellular environment of the hosts and allow it to replicate efficiently in the hosts [24,31]. Interestingly, the role of the translational selection from H. sapiens in shaping the codon usage pattern of ZIKV is different from that of its insect hosts (A. aegypti and A. albopictus). Compared with the codon usage pattern of A. aegypti or A. albopictus, the codon usage pattern of ZIKV is more similar to that of H. sapiens. This discrepancy of similarity in the degree of codon usage between ZIKV and its hosts may be caused by the various defense mechanisms from different hosts against ZIKV infection. Indeed, a recent study indicated that skin immune cells, including fibroblasts, epidermal keratinocytes, and immature dendritic cells, are highly permissive to ZIKV infection and replication, which can lead to the activation of an antiviral innate immune response [39]. Another study found that although A. aegypti and A. albopictus are susceptible to ZIKV infection, they are both low-competent vectors for ZIKV [40]. It is presumed that the evolution of the flavivirus genome sequence involved in anti-host countermeasures may be faster than that of other flavivirus sequence [41]. This may be one reason why the codon usage pattern of ZIKV tends to show more similarities to that of H. sapiens.

Assessing effects of the overall codon usage of hosts on that of ZIKV
To determine how the overall codon usage of ZIKV's hosts has contributed to virus codon usage bias, the similarity index analysis was carried out. The results indicated that all of the average values of D (A,B) among three hosts are slightly low, suggesting that ZIKV has adapted to self-replicate efficiently with strong independence of overall codon usage of its hosts during

Relationship between dinucleotide biases and codon usage in ZIKV
Previous studies found that dinucleotide compositional constraints of genomes can affect the codon usage bias [33]. Therefore, we determined the relative abundance of 16 dinucleotides in ZIKV polyprotein-coding regions. The results show that the occurrences of dinucleotides in ZIKV are not randomly distributed and no dinucleotide is present at the expected frequency (Table 3). Specially, the dinucleotides UG and CA are over-represented (ρ xy > 1.23) while UA and CG are markedly under-represented (ρ xy < 0.78). These data is consistent with previous study, which suggested that the dinucleotides UA and CG are under-represented in many sequence sets [21]. Moreover, the analysis of RSCU values of the eight codons containing CG (UCG, CCG, ACG, GCG, CGU, CGC, CGA, and CGC) suggests that these codons are not preferentially used. Meanwhile, in case of UA containing codons, most of codons are not preferentially selected, except for UAC. Taken together, the composition of dinucleotides plays a role in the synonymous codon usage pattern of ZIKV. The relative abundance of dinucleotides has been shown to influence the codon usage in some RNA viruses [42]. In our study, we found the relative low abundances of CpG and UpA in ZIKV, which may be beneficial for the virus to escape the host anti-viral immune response and complete virus transcription reaction efficiently [43]. The unmethylated CpG can be recognized by the host innate immune system as a pathogen signature, and activates various immune response pathways [42,44]. UpA deficiency was proposed to avail virus by reducing the risk of nonsense mutations, minimizing the improper transcription and decreasing the opportunities of cleavage by RNase L [45].

Correspondence analysis and correlation analysis: compositional properties of the ZIKV polyprotein-coding region
The A, U, C, G, and GC contents were compared with the A3s, U3s, C3s, G3s, and GC3s contents, respectively ( Table 4). The results show that correlations in nucleotide compositions are complicated. Specifically, both the G and GC contents have a significant negative correlation with the content of A3s or U3s, as well as a significant positive correlation with the content of C3s, G3s, or GC3s. The A content has a significant negative correlation with the content of C3s, G3s or GC3s and a significant positive correlation with the content of A3s or U3s. The U content has a significant negative correlation with the content of C3s or GC3s as well as a significant positive correlation with A3s or U3s content, except for the insignificant correlation between U and G3s contents. The C content has a significant negative correlation with A3s, U3s, or C3s content as well as a significant positive correlation with G3s or GC3s content. These data shows that the nucleotide compositional constraint may also affect the codon usage of ZIKV. A correspondence analysis was performed to determine the main trends in the codon usage variation and the distribution of each gene along the continuous axes. The positions of each polyprotein-coding region defined by the first axis (Axis 1) and second axis (Axis 2) are shown in Fig 2. The first axis accounts for 72.93% of the total variation, and the second, third and fourth axes account for 8.99%, 6.33%, and 3.08%, respectively, of the total variation in synonymous codon usage. A correlation analysis also showed that, except G content, the Axis1 is positively correlated with the contents of A, U, A3s, U3s, whereas it is negatively correlated with the GC3s, GC, ENc, C, C3s and G3s (Table 5). Meanwhile, Axis2 is only negatively correlated with the G content (Table 5). Overall, these results suggest that mutation pressure from the base composition plays a role in constructing the codon usage pattern of ZIKV.

The effect of translational selection on the codon usage of ZIKV
A plot of the ENc values against the GC3s values was constructed to check the heterogeneity of codon usage [20]. If a gene is subject to the GC compositional constraints, it will lie on or near Analysis of Codon Usage Bias of ZIKV the theoretical fitting curve that represents random codon usage. In contrast, if a gene is subject to translational selection, it will lie considerably below the expected curve [46]. Here, the ENc value of each polyprotein-coding region of ZIKV was plotted against the corresponding GC3s content (Fig 3). The resulting points lie considerably below the solid curve, implying that, in addition to mutation pressure, other factors, such as translational selection, also influence the codon usage pattern of ZIKV. This result is generally similar to the related plot in previous study [34]. The base composition and codon usage bias of the ORFs of a species with an A/U-rich genome may be different from those species with G/C-rich genomes. Previous studies have employed a correlation between CAI values and ENc values to demonstrate the effect of mutation and translational selection on the codon usage bias [37,47]. If the correlation (r) between the two indices approaches -1, this suggests that the translational selection is preferred over mutation. Otherwise, if the r value approaches 0 (no correlation), mutation may be more influential than translational selection. Our results showed that the CAI value of ZIKV is significantly positively correlated with the ENc value (r = -0.749, P<0.01) (Fig 4). This result reflects  the influence of both translational selection and mutation pressure on the codon usage pattern of ZIKV. A significant correlation between the GC12 and GC3 values is regarded to indicate that the mutation pressure dominates over the translational selection pressure in shaping the codon usage bias [22,46]. If the correlation between GC12 and GC3 is significant, the mutation pressure is regarded as the main force forming the codon usage bias. To further determine the role of mutation pressure and translational selection in shaping the codon usage bias of ZIKV, a correlation analysis was performed to analyze the relationship between GC12 and GC3. There was no significant correlation observed between them (r = 0.25, P>0.05), suggesting that both translational selection and the mutation pressure are involved in shaping the codon usage pattern of ZIKV (Fig 5).
To determine whether the biased codon selection are restricted in highly biased coding sequences, the relationship between pyrimidines (C and U) and purines (A and G) contents in four-fold degenerate codon families (alanine, arginine, glycine, leucine, proline, serine, threonine and valine) are analyzed by PR2 bias plot. It can be seen that A and C are more frequently used than U and G in ZIKV in four-fold degenerate codon families (Fig 6). This result shows that the codon usage pattern of ZIKV is shaped by mutation pressure and other factors including translational selection.
To confirm whether translation selection from the hosts plays a role in shaping the codon usage pattern of ZIKV, the tAI values were calculated based on the tRNA copy numbers of H. sapiens. The results indicated that the tAI values of 46 ZIKV strains range from 0.329 to 0.347, with an average value of 0.344 and a SD of 0.004. Moreover, the positive correlation between tAI and CAI values (r = 0.457, P<0.01) in ZIKV highlights the importance of translational selection in the formation of synonymous codon usage pattern.
Compared with translational selection, mutation bias seems to have a stronger effect on the codon usage bias of some viruses [48,49]. However, for ZIKV, the translational selection pressure also takes part in shaping the codon usage bias. Our result is consistent with a previous study that showed that recent Asian lineage spread is linked to the codon usage adaptation of the NS1 protein to human housekeeping genes [50]. During the preparation of this manuscript, two papers were published employing some ZIKV strains to analyze the codon usage [51,52]. They concluded that mutation pressure is an important determinant of the codon usage bias of ZIKV mainly based on the result of a GC3s-ENc analysis [51]. The reasons that they do not mention the role of translational selection in the codon usage of ZIKV may be due to the lack of application of other codon usage analysis methods in their studies.

Effect of other factors on codon usage
GRAVY and AROMO may also be related to the codon usage pattern of viruses [53]. Our correlation analysis indicated that AROMO is positively correlated with GC3s, GC, and ENc, but it is negatively correlated with Axis 1. GRAVY showed a significant positive correlation with Axis 1, but it showed a significant negative correlation with GC, GC3s, and ENc, respectively (Table 6). Both GRAVY and AROMO do not show any correlation with Axis 2. These results indicated that the aromaticity and degree of protein hydrophobicity are linked to the codon usage variation in ZIKV, emphasizing the importance of natural translational selection on forming the codon usage pattern [34]. The involvement of aromaticity and hydrophobicity in the construction of codon usage bias has been revealed in some RNA viruses, such as bovine viral diarrhea virus, classical swine fever virus, and duck hepatitis A virus [34,35,54]. This study found that Axis 1 has a Analysis of Codon Usage Bias of ZIKV significant role in shaping the ZIKV codon usage pattern and is significantly correlated with aromaticity and hydrophobicity indices, implying that the aromaticity and hydrophobicity of proteins are related to the codon usage pattern of ZIKV. Aromaticity and hydrophobicity are known to play a role in peptide self-assembly and protein aggregation rates [55,56]. A recent study showed that the structure of ZIKV particles is thermally stable, and this feature may help the virus to survive in the harsh conditions of semen, saliva, and urine [57].
It has been reported that there is a significant correlation between the phylogroups of isolates and their geographic regions, and an obvious pattern of geographic clustering has been observed in ZIKV isolates [14]. To determine if geographic factors influence the evolution of ZIKV, a plot of Axis 1 and Axis 2 was drawn according to the geographic distribution of the tested ZIKV strains. The resulting coordinate spots are separated into three groups, classified Analysis of Codon Usage Bias of ZIKV as group I, II, and III (Fig 7). Some strains isolated in Uganda clustered together with the strains isolated from Senegal, and were classified as group I. Additionally, some strains isolated from Central African Republic also clustered together with the strains isolated from Senegal, and these were classified as group II. Most of the strains isolated, regardless of their isolation countries, tended to cluster together and were classified as group III. The codon usage pattern reflects the close relationship of ZIKV strains in different geographic regions.
To investigate if the ZIKV codon usage pattern displays changes over time, a plot of Axis 1 and Axis 2 was drawn according to the outbreak time of the ZIKV strains. The 46 ZIKV isolates were divided into three groups, classified as group I, II and III (Fig 8). Most of the strains isolated from 2010 to 2016 tended to cluster together in group III. The strains isolated from 1968 to 1997 clustered together in group II, while the strains isolated in 1947 and 2001 clustered in group I. Interestingly, the strains isolated in 1968 exist in both group II and group III. These results indicated that ZIKV strains isolated in different time intervals show genetic variation in their codon usage patterns.
Previous studies showed that the Dengue virus strains occurring in the same continental region are more closely related to one another, forming a cluster when plotted by their codon usage biases, indicating that the viruses from a geographical group can show similar codon usage biases [58]. Andrew et al found that the geographic origin of the strains responsible for the ZIKV epidemics that occurred on Yap island in 2007 and in Cambodia in 2010 most likely originated in Southeast Asia [14]. In this study, we further found that most of the American ZIKV strains isolated in recent years cluster with some Asian, Europe and Oceania strains, supporting the idea that a close evolutionary relationship exists among Asian, Europe, Oceania and American strains.

Conclusions
Our findings reveal that the codon usage bias of ZIKV is weak and that, in addition to mutation pressure, translational selection also influences the codon usage bias. Other factors, such as base composition, aromaticity, and hydrophobicity, also have an effect on the codon usage pattern. Importantly, there are similarities between the codon usage patterns of ZIKV and its natural hosts. This study not only provides an understanding about the variation in ZIKV codon usage patterns, but it also contributes to understanding the factors that drive ZIKV evolution.