Chikungunya virus (CHIKV) is an arthropod-borne virus of the family Togaviridae that is transmitted to humans by Aedes spp. mosquitoes. Its genome comprises a 12 kb single-strand positive-sense RNA. In the present study, we report the patterns of synonymous codon usage in 141 CHIKV genomes by calculating several codon usage indices and applying multivariate statistical methods. Relative synonymous codon usage (RSCU) analysis showed that the preferred synonymous codons were G/C- and A-ended. A comparative analysis of RSCU between CHIKV and its hosts showed that codon usage patterns of CHIKV are a mixture of coincidence and antagonism. Similarity index analysis showed that the overall codon usage patterns of CHIKV have been strongly influenced by Pan troglodytes and Aedes albopictus during evolution. The overall codon usage bias was low in CHIKV genomes, as inferred from the analysis of effective number of codons (ENC) and codon adaptation index (CAI). Our data suggested that although mutation pressure dominates codon usage in CHIKV, patterns of codon usage in CHIKV are also under the influence of natural selection from its hosts and geography. To the best of our knowledge, this is the first report describing codon usage analysis in CHIKV genomes. The findings from this study are expected to increase our understanding of factors involved in viral evolution, and fitness towards hosts and the environment.
Citation: Butt AM, Nasrullah I, Tong Y (2014) Genome-Wide Analysis of Codon Usage and Influencing Factors in Chikungunya Viruses. PLoS ONE 9(3): e90905. doi:10.1371/journal.pone.0090905
Editor: Fausto Baldanti, Fondazione IRCCS Policlinico San Matteo, Italy
Received: November 29, 2013; Accepted: February 6, 2014; Published: March 4, 2014
Copyright: © 2014 Butt et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This research was supported by grants from the National Hi-Tech Research and Development (863) Program of China (No. 2012AA022-003), China Mega-Project on Major Drug Development (No. 2011ZX09401-023), and China Mega-Project on Infectious Disease Prevention (No. 2013ZX10004-605, No. 2013ZX10004-607, No. 2013ZX10004-217, and No. 2011ZX10004-001). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Chikungunya virus (CHIKV), a member of the genus Alphavirus of the family Togaviridae, is a small (60–70 nm), enveloped, single-strand positive-sense RNA virus. The genome is approximately 12 kb in size and comprises two open reading frames (ORFs) encoding non-structural and structural proteins, respectively. The CHIKV genome is arranged in the order 5′-cap-nsP1-nsP2-nsP3-nsP4-(junction region)-C-E3-E2-6K-E1-poly(A)-3′. Since the first isolation of CHIKV from a febrile individual in Tanzania in 1953, CHIKV has caused several outbreaks in Asia, Africa, and Indian Ocean islands, emerging as a serious public health concern. CHIKV infection is characterized by abrupt onset of high fever, headache, rashes, arthralgia and myalgia. The typical clinical sign of the disease is polyarthralgia, a very painful condition that affects the joints and may persist for several months to years in some cases. Being an arthropod-borne virus, CHIKV is transmitted by mosquitoes of Aedes spp. It is generally accepted that CHIKV originated in Africa, where it is primarily maintained in a yellow fever-like zoonotic sylvatic cycle and depends upon non-human primates and arboreal, peridomestic mosquitoes as reservoir hosts. However, the spread of CHIKV in Asia and urban endemics are associated with a dengue-like “human-mosquito-human” direct transmission cycle, where A. aegypti and A. albopictus serve as primary transmission vectors and humans serve as hosts.
The genetic code comprises 64 codons that can be divided into 20 groups, where each group consists of one to six codons, and each group corresponds to each of the standard amino acids. Alternative codons within the same group coding for the same amino acid are often termed ‘synonymous’ codons, although their corresponding tRNAs might differ in their relative abundance in cells and in the speed by which they are recognized by the ribosome. This redundancy of the genetic code, in which most of the amino acids can be translated by more than one codon, represents a key step in modulating the efficiency and accuracy of protein production, while maintaining the same amino acid sequence of the protein. On the other hand, the synonymous codons are not chosen randomly both within and between genomes, which is referred to as codon usage bias ,. This phenomenon of synonymous codon usage bias has been studied in a wide range of organisms, from prokaryotes to eukaryotes and viruses –. Studies on codon usage have determined several factors that could influence codon usage patterns, including mutational pressure, natural or translational selection, secondary protein structure, replication and selective transcription, hydrophobicity and hydrophilicity of the protein and the external environment. Among these, the major factors responsible for codon usage variation among different organisms are considered to be compositional constraints under mutational pressure and natural selection , –.
Previous studies on codon usage in different viruses have highlighted mutational pressure as the major factor in shaping codon usage patterns compared with natural selection , –; however, as our understanding of codon usage increases, it appears that although mutational pressure is still a major driving force, it is certainly not the only one when considering different types of RNA and DNA viruses –. Considering their comparatively small genome size and other viral features, such as dependence on host’s machinery for key process including replication, protein synthesis and transmission in comparison with prokaryotic and eukaryotic genomes, the interplay of codon usage among viruses and their hosts is expected to affect overall viral survival, fitness, evasion from host’s immune system and evolution , . Therefore, knowledge of the codon usage in viruses can not only reveal information about molecular evolution, but also improve our understanding of the regulation of viral genes expression and aid vaccine design, where the efficient expression of viral proteins may be required to generate immunity. In the present study, we report the detailed codon usage data and analysis of various factors shaping the codon usage patterns in CHIKV genomes.
Results and Discussion
Nucleotide Composition Analysis of CHIKV Genomes
Codon usage bias, or preference for one type of codon over another, can be influenced greatly by the overall nucleotide composition of genomes. Therefore, we first analyzed the nucleotide composition of coding sequences from CHIKV genomes. As shown in Table 1, the mean A% (28.91) was the highest, followed by similar compositions of G% (25.75) and C% (25.19), with U% being the lowest (20.16). The mean GC and AU compositions were 50.91% and 49.06%, respectively. This appears to suggest that there might be an equal or almost equal distribution of A, U, G, and C nucleotides among codons of CHIKVs, with potentially more preference towards A-ended codons followed by G/C-ended codons. However, a clearer picture of overall nucleotide composition that could influence the codon usage preference in CHIKV genomes emerged from the analysis of the nucleotide composition of the third position of codons (A3, U3, G3, C3) and of GC1, GC1,2, GC3 and AU3 (Table 1). The mean C3 and G3 were the highest, followed by A3 and U3. The GC3 values ranged from 54.9% to 57.2%, with a mean of 55.86% and a standard deviation (SD) of 0.40, compared with AU3, whose values ranged from 42.8% to 45.1%, with a mean of 44.14% and an SD of 0.41. The GC1 values ranged from 50.6% to 53.8%, with a mean of 53.56% and an SD of 0.27. The GC1,2 values ranged from 48.2% to 48.7%, with an average of 48.45% and an SD of 0.07. Therefore, from the initial nucleotide composition analysis, it is expected that G/C-ended codons might be preferred over A/U-ended codons in CHIKV genomes.
Relative Synonymous Codon Usage (RSCU) Analysis of CHIKV
To determine the patterns of synonymous codon usage and to what extent G/C-ended codons might be preferred, we performed RSCU analysis and calculated the RSCU values. Among the 18 most abundantly used codons in CHIKV genomes, eleven (UUC, CUG, AUC, GUG, CCG, UAC, UGC, CAC, CAG, AAC and GAC) were G/C-ended (C-ended: 7; G-ended: 4) and the remaining seven (ACA, GCA, UCA, AGA, AAA, GAA, GGA) were A-ended codons; none of the preferred codons were U-ended (Figure 1A and Table 2). From RSCU analysis, we observed that CHIKV exhibits comparatively higher codon usage bias towards G/C-ended codons and less towards A-ended codons. However, it is also interesting to note that the mean GC% and AU% values are very similar (Table 1), yet the G/C-ended codons were used in a comparatively biased manner, indicating that the G/C content at the third position of the codons influenced the shaping of the overall synonymous codon usage patterns. The overall general trend of usage of the 59 synonymous codons was also relatively consistent among different genotypes of CHIKV, indicating that the evolutionary processes of the three genotypes of CHIKV are restricted by the synonymous codon usage pattern to some extent (Figure 1B and Table 2). Furthermore, analysis of over- and under-represented codons showed that codons with an RSCU>1.6 are infrequently observed in CHIKV genomes. The RSCU values of the majority of preferred and non-preferred codons fell between 0.6 and 1.6. We further divided the RSCU data into three groups: (A) codons with RSCU<0.6 (under-represented), (B) codons with RSCU values between 0.6 and 1.6 (unbiased/randomly represented), and (C) codons with RSCU values >1.6 (over-represented). Among the 59 codons, only CUG (Leu) and AGA (Arg) had an RSCU>1.6. The under-represented codons (RSCU<0.6) were CUU and CUC for Leu, GUU for Val, and CGU and CGG for Arg. The remaining 52 codons had RSCU values between 0.6 and 1.6 (Figure 1 and Table 2).
These findings suggested that despite being an RNA virus with a high mutation rate in its lifecycle, CHIKV has evolved to form a relatively stable genetic composition at some specific levels of synonymous codon usage. This was further confirmed by ENC and CAI analysis as discussed in coming sections. Combining nucleotide composition and RSCU analysis, we deduced that the selection for preferred codons has been mostly influenced by compositional constraints, which also accounts for the presence of mutational pressure. However, we suspect that the compositional constraints may not be the sole factor associated with codon usage patterns in CHIKV, because although the overall RSCU values could reveal the codon usage pattern for the genomes, it may hide the codon usage variation among different genes in a genome .
Figure 1. Comparison of relative synonymous codon usage (RSCU) patterns (A) between chikungunya virus (CHIKV), Homo sapiens (HS), Pan troglodytes (PT), Aedes aegypti (AG) and Aedes albopictus (AB), and (B) between the East Central South African (ECSA), Asian and West African (WA) genotypes of CHIKV.
Codon Usage Bias among CHIKV
To quantify the extent of variation in codon usage among different genomes of CHIKV arising from different geographical regions and genotypes, the ENC values for each genome were calculated. The ENC values among CHIKV genomes ranged from 54.55 to 56.41, with a mean of 55.56 and an SD of 0.34 (Table 1). An average value of 55.56 (ENC>40) represents stable ENC values and indicates a relatively conserved genomic composition among different CHIKV genomes. In general, there is an inverse relationship between ENC and gene expression; i.e., a lower ENC value indicates a higher codon usage preference and higher gene expression, and vice versa. Our results show that the overall codon usage bias and gene expression among different CHIKV genomes are low, with codon usage only slightly biased and mainly affected by base composition. Previous studies on codon usage analysis among other RNA viruses, such as bovine viral diarrhea virus (ENC: 50.91), classical swine fever virus (ENC: 51.7) and HCV (ENC: 52.62), have also reported low codon usage bias. The same is also true in the case of arthropod-borne RNA viruses, including West Nile virus (ENC: 53.81) and dengue virus (DENV) (ENC: 49.70 for DENV-1; 48.78 for DENV-2; 49.52 for DENV-3; and 50.81 for DENV-4). A possible explanation for the weak codon bias of RNA viruses is that it might be advantageous for efficient replication in host cells with potentially distinct codon preferences.
The codon adaptation index (CAI) is often used as a measure of the level of gene expression and to assess the adaptation of viral genes to their hosts. Highly expressed genes exhibit a strong bias for particular codons in many bacteria and small eukaryotes. In comparison with the ENC, which is another way of calculating codon usage bias and measures deviation from uniform codon usage (the null hypothesis), CAI measures the deviation of a given protein-coding gene sequence with respect to a reference set of genes. Here, we calculated the CAI values of coding sequences from CHIKV genomes. The CAI values ranged from 0.21 to 0.22, with a mean value of 0.22 and an SD of 0.001 (data not shown). The mean CAI value was low, indicating low codon usage bias and expression levels, which agreed with the ENC analysis.
Relationship between Codon Usage Patterns of CHIKV and its Hosts
Because viruses are parasitic organisms, their codon usage patterns can be expected to be affected by their hosts to some extent. For instance, the codon usage pattern of poliovirus is reported to be mostly coincident with that of its host, while the codon usage pattern of hepatitis A virus was reported to be antagonistic to that of its host. We therefore computed and compared the codon usage of CHIKV with that of its two hosts (Homo sapiens and Pan troglodytes) and its transmission vectors (A. aegypti and A. albopictus). The results showed that the codon usage patterns of CHIKV were a mixture of coincidence and antagonism to its hosts and vectors (Table 2). In detail, the preferred codons for 12 out of 18 amino acids were common between CHIKV and H. sapiens. These included UUC (Phe), CUG (Leu), AUC (Ile), GUG (Val), UAC (Tyr), AGA (Arg), UGC (Cys), CAC (His), CAG (Gln), AAC (Asn), AAG (Lys) and GAC (Asp). Furthermore, all common preferred codons between CHIKV and H. sapiens were G/C-ended (C-ended: 7; G-ended: 4), with the exception of an A-ended preferred codon for the amino acid Arg. Similarly, preferred codons for 10 out of 18 amino acids were common between CHIKV and P. troglodytes. In the case of the two transmission vectors, 10 out of 18 preferred codons were common among both mosquito species and CHIKV. It is also interesting to note that, except for the amino acid Arg, the remaining 10 highly preferred codons were the same among CHIKV, H. sapiens, A. aegypti and A. albopictus. Moreover, the preferred codon usage profiles of A. aegypti and A. albopictus were also very similar: 16 out of 18 preferred codons were common between them, with exceptions for the preferred codons for Asp and Gly (Table 2). These results indicated that selection pressures from hosts and vectors have influenced the codon usage pattern of CHIKV and the possible fitness of the virus to adjust among its dynamic range of hosts and vectors.
A mixture of coincidence and antagonism has also been reported previously in the case of HCV  and enterovirus 71 . It was suggested that the coincident portions of codon usage among viruses and their hosts could enable the corresponding amino acids to be translated efficiently, while the antagonistic portions of codon usage may enable viral proteins to be folded properly, although the translation efficiency of the corresponding amino acids might decrease .
Although the comparative analysis of individual RSCU values as given above is frequently employed as a method of estimating the effect of synonymous codons usage of the hosts on that of specific viruses, it has its limitations in revealing the effect of the overall codon usage of the hosts on the formation of codon usage patterns of the viruses. Therefore, we took advantage of a method proposed recently that estimates the similarity degree of the overall codon usage patterns comprehensively between viruses and their hosts by treating the 59 synonymous codons as 59 different spatial vectors. The advantage of this formula, as reported by the authors in the case of dengue viruses, is that the comparative overall codon usage takes the place of the direct estimation of each synonymous codon usage; thus, the new method avoids the situation that the variations of 59 synonymous codon usage confuse the correct estimation of the effect of the host on the virus for codon usage . The similarity index D(A,B) was therefore calculated for each genotype of CHIKV in relation to its hosts and vectors. The similarity index was found to be highest for A. albopictus vs. CHIKV group followed by P. troglodytes vs. CHIKV, A. aegypti vs. CHIKV and lowest in the case of H. sapiens vs. CHIKV (Figure 2), indicating that the effect of A. albopictus and P. troglodytes on the formation of the overall codon usage patterns of CHIKV is relatively higher than that of the A. aegypti and H. sapiens. Secondly, we computed the effect of transmission vectors on the formation of the overall codon usage patterns of three genotypes of CHIKV. A. aegypti had the strongest effect on the east central south African (ECSA) genotype, followed by West African (WA) and Asian genotypes. In the case of A. albopictus, the strongest effect was noted on the ECSA genotype, followed by Asian and WA genotypes. 
As for the effects of the two primates on the formation of the overall codon usage of CHIKV, the strongest effect of H. sapiens was on the Asian genotype, closely followed by the ECSA and WA genotypes. By contrast, P. troglodytes had its strongest and equal effect on the ECSA and Asian genotypes, followed by the WA genotype (Figure 2). Therefore, from the similarity index analysis, we observed that selection pressure from hosts and vectors has contributed to shaping the molecular evolution of CHIKV at the level of codon usage. The effect of the hosts was unevenly distributed among different genotypes, potentially indicating different evolutionary rates of CHIKV isolates. The calculation of the effects of primates and transmission vectors on the overall codon usage patterns of CHIKV showed that P. troglodytes and A. albopictus dominate the effects of H. sapiens and A. aegypti, respectively, on the formation of the overall codon usage patterns of CHIKV (Figure 2). The stronger effect of P. troglodytes than H. sapiens could also be attributed to the maintenance of CHIKV in a yellow fever-like zoonotic sylvatic cycle and its dependence upon non-human primates as reservoir hosts. Moreover, the similarity index of codon usage was also the highest between CHIKV and A. albopictus, as compared with A. aegypti, P. troglodytes and H. sapiens. The successful human-to-human transmission of CHIKV depends on Aedes mosquitoes; therefore, the stronger effect of A. albopictus on all three genotypes of CHIKV suggests that this vector might be a more efficient reservoir for viral replication and transmission compared with A. aegypti. These results are in agreement with recent studies showing more efficient dissemination and transmission of CHIKV by A. albopictus, which contribute to its ongoing re-emergence in a series of large-scale epidemics.
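The similarity index computation can be illustrated with a short sketch. This section does not reproduce the formula, so the code below assumes the cosine-based form commonly attributed to the cited method: R(A,B) is the cosine of the angle between the two 59-dimensional RSCU vectors, and D(A,B) = (1 − R(A,B))/2. The function name `similarity_index` and the input layout (two equal-length lists of RSCU values in the same codon order) are illustrative assumptions, not the authors' implementation.

```python
import math

def similarity_index(virus_rscu, host_rscu):
    """Return D(A,B) = (1 - R(A,B)) / 2 for two RSCU vectors.

    Assumed form (not stated in this section): R(A,B) is the cosine
    similarity between the virus and host RSCU vectors, which must
    list the same codons in the same order.
    """
    if len(virus_rscu) != len(host_rscu):
        raise ValueError("RSCU vectors must have the same length")
    dot = sum(a * b for a, b in zip(virus_rscu, host_rscu))
    norm = math.sqrt(sum(a * a for a in virus_rscu) *
                     sum(b * b for b in host_rscu))
    if norm == 0:
        raise ValueError("RSCU vectors must not be all zeros")
    r = dot / norm
    return (1.0 - r) / 2.0
```

Under this assumed form, two identical RSCU vectors give R = 1 and D = 0, while orthogonal vectors give D = 0.5; how the magnitude of D is read as a host effect should follow the cited method.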
Trends of Codon Usage Variation in CHIKV
Correspondence Analysis (COA).
Codon usage is multivariate by its very nature; therefore, it is necessary to analyze the data using multivariate statistical techniques, such as COA. To determine the trends in codon usage variation among different CHIKV genomes, we performed COA on the RSCU values, which were examined as a single dataset based on the RSCU value of each coding region (Figure 3). The first principal axis (f′1) accounted for 53.57% of the total variation, and the next three axes (f′2−f′4) accounted for 25.16%, 7.62%, and 2.06% of the total variation in synonymous codon usage, respectively. For further analysis, plots were reconstructed based on different geographical locations (Figure 4) and genotypes of CHIKV isolates (Figure 5). As expected, the CHIKV isolates belonging to the ECSA genotype were distributed across all planes of the axes. When these plots were assessed on a regional basis, it was found that different genotypes are circulating in a single country. This analysis showed that the three different genotypes of CHIKV might have a common ancestor. This further implies that geographical diversity and associated factors, such as the presence of favorable transmission vectors, climate features, host range and susceptibility, have also contributed to shaping the molecular evolution and codon usage in CHIKV, even though they appear to be less influential than mutational pressure (based on the current analysis).
Effect of mutational pressure in shaping the codon usage patterns in CHIKV.
Mutational pressure and natural selection are considered the two major factors that shape codon usage patterns. A general mutational pressure, which affects the whole genome, would certainly account for the majority of the codon usage among certain RNA viruses. To determine the extent of the influence of these two factors on CHIKV codon usage, we performed correlation analysis between different nucleotide constraints. A complex correlation was observed among different nucleotide constraints (Table 3). U3% had significant positive correlations with U% (r = 0.621, P<0.01) and G% (r = 0.185, P<0.05), whereas it had significant negative correlations with C% (r = −0.606, P<0.01), A% (r = −0.278, P<0.01) and GC% (r = −0.806, P<0.01). C3% had significant positive correlations with C% (r = 0.621, P<0.01), A% (r = 0.261, P<0.01) and GC% (r = 0.798, P<0.01), and negative correlations with U% (r = −0.5877, P<0.01) and G% (r = −0.217, P<0.01). A3% had positive correlations with A% (r = 0.625, P<0.01) and C% (r = 0.327, P<0.01), and negative correlations with U% (r = −0.373, P<0.01) and G% (r = −0.576, P<0.01), whereas no correlation was observed between A3% and GC%. G3% was positively correlated with G% (r = 0.658, P<0.01) and U% (r = 0.354, P<0.01), and negatively correlated with C% (r = −0.377, P<0.01) and A% (r = −0.610, P<0.01); the correlation with GC% was non-significant. In the case of GC3%, positive correlations were noted with C% (r = 0.498, P<0.01) and GC% (r = 0.852, P<0.01), and a negative correlation with U% (r = −0.480, P<0.01); the correlation with G% was non-significant. Finally, GC and GC12 were also compared with GC3, and highly significant positive correlations were observed (GC12 versus GC3: r = 0.28, P<0.01; GC versus GC3: r = 0.85, P<0.01), as shown in Figures 6A and 6B, respectively. Furthermore, a significant negative correlation between GC3 and ENC values was also observed (r = −0.756, P<0.01).
This analysis collectively indicates that mutational pressure is most likely responsible for the patterns of nucleotide composition and, therefore, codon usage patterns, because the effects were present at all codon positions.
In addition to correlation analysis, linear regression analysis was also performed to determine correlations between the first two principal axes (f′1 and f′2) and nucleotide constraints of CHIKV genomes. Again, several significant correlations were observed between the two principal axes and nucleotide contents (Table 4). f′1 showed significantly positive correlations with U3% (r = 0.31, P<0.01), G3% (r = 0.58, P<0.01), U% (r = 0.25, P<0.01) and C% (r = 0.51, P<0.01); however, it showed significantly negative correlations with A% (r = −0.54, P<0.01), G% (r = −0.29, P<0.01), A3% (r = −0.50, P<0.01), C3% (r = −0.35, P<0.01), GC3% (r = −0.24, P<0.01; Figure 7A) and GC% (r = −0.21, P<0.01). In the case of f′2, A3%, G3% and C% had non-significant correlations. The f′2 axis showed significantly positive correlations with C3% (r = 0.69, P<0.01), GC3% (r = 0.74, P<0.01; Figure 7B), GC% (r = 0.64, P<0.01), A% (r = 0.17, P<0.05) and G% (r = 0.39, P<0.01), and negative correlations with U3% (r = −0.66, P<0.01) and U% (r = −0.34, P<0.01) (Table 4). Our analysis shows that mutational pressure has played a major role in shaping the dynamics of codon usage patterns within CHIKV genomes.
Correlation analysis between ENC and GC3 values.
A plot of ENC versus GC3 (Nc plot) is widely used to study codon usage variation among genes in different organisms. It has been postulated that, in an ENC plot, genes whose codon choice is constrained only by a G3+C3 mutational bias will lie on or just below the continuous curve of the predicted ENC values. Although the nucleotide composition correlation analysis showed that codon usage in CHIKV genomes is mainly caused by compositional constraints or mutational pressure, we were interested to determine the possible influence of other factors, such as natural selection. Therefore, we constructed a corresponding relation distribution plot between the ENC and GC3 values. As shown in Figure 8, all points aggregated closely towards the right side under the expected ENC curve, indicating that, apart from mutation pressure, the codon usage patterns have also been influenced by other factors to some extent.
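The expected ENC curve referred to above is not written out in this section. As a minimal sketch, the following assumes the widely used expression attributed to Wright (1990), ENC_expected = 2 + s + 29 / (s² + (1 − s)²), where s is the GC3 content expressed as a fraction. The function name `expected_enc` is illustrative.

```python
def expected_enc(gc3):
    """Expected ENC under GC3 compositional constraint alone.

    Assumes Wright's (1990) curve; `gc3` is the G+C fraction at
    synonymous third codon positions, in the range 0.0 to 1.0.
    """
    if not 0.0 <= gc3 <= 1.0:
        raise ValueError("gc3 must be a fraction between 0 and 1")
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)
```

With the mean GC3 reported in Table 1 (55.86%, i.e. s = 0.5586), this expression gives an expected value of roughly 59.8, which is above the observed mean ENC of 55.56 and is consistent with the points lying under the curve.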
Relationship between dinucleotide and codon usage patterns in CHIKV.
It has been suggested that dinucleotide bias can affect overall codon usage bias in several organisms, including DNA and RNA viruses –. To study the possible effect of dinucleotides on codon usage in CHIKV genomes, we calculated the relative abundances of the 16 dinucleotides from the coding sequences of CHIKV. The occurrences of dinucleotides were not randomly distributed, and no dinucleotides were present at the expected frequencies (Table 5). Under-representation of CpG dinucleotides in different RNA and DNA viruses has been reported . In the case of CHIKV, the relative abundance of CpG showed deviation from the “normal range” (mean ± SD = 0.808±0.016) and was under-represented. Interestingly, GpC dinucleotides also deviated from the normal range and were instead slightly over-represented (mean ± SD = 1.001±0.007) (Table 5). The RSCU values of the eight codons containing CpG (CCG, GCG, UCG, ACG, CGC, CGG, CGU, and CGA) and the six codons containing GpC (GCU, GCC, GCA, UGC, AGC, GGC) were also analyzed to determine the possible effects of CpG and GpC representations on codon usage bias. In the case of CpG-containing codons, all codons were under-represented (RSCU<1.6) and were not preferred codons for their respective amino acid, except for CCG (RSCU = 1.26), a preferred codon for proline (Table 2). On the other hand, despite slight over-representation of the GpC dinucleotide, all GpC-containing codons were also under-represented (RSCU<1.6) and were not preferred codons for their respective amino acids, with two exceptions; GCA (Ala, RSCU = 1.43) and UGC (Cys, RSCU = 1.33) (Table 2). It has been proposed that CpG deficiency in pathogens is associated with the immunostimulatory properties of unmethylated CpGs, which are recognized by the host’s innate immune system as a pathogen signature . 
Recognition of unmethylated CpGs by Toll-like receptor 9 (TLR9), a type of intracellular pattern recognition receptor (PRR), leads to activation of several immune response pathways. The vertebrate immune system relies on unmethylated CpG recognition in DNA molecules as a signature of infection, and CpG under-representation in RNA viruses is exclusively observed in vertebrate viruses; therefore, it is reasonable to suggest that a TLR9-like mechanism exists in the vertebrate immune system that recognizes CpGs when in an RNA context (such as in the genomes of RNA viruses) and triggers immune responses.
In addition to the differential (over- and under-) representation of CpGs in different organisms, UpA under-representation also exists in several organisms, including vertebrates, invertebrates, plants and prokaryotes. The presence of TpA in two out of three canonical stop codons and in transcriptional regulatory motifs (e.g., the TATA box sequence) is believed to be responsible for its under-representation. Therefore, UpA under-representation is expected to reduce the risk of nonsense mutations and minimize improper transcription. In the case of CHIKV, the relative abundance of UpA also deviated from the “normal range” (mean ± SD = 0.859±0.022) and was under-represented, similarly to CpG. The six codons containing UpA (UUA, CUA, GUA, UAU, UAC and AUA) were also under-represented (RSCU<1.6) and were not preferred codons for their respective amino acids. The CpA (mean ± SD = 1.125±0.017) and UpG (mean ± SD = 1.275±0.022) dinucleotides were over-represented compared with the remaining 14 dinucleotides (Table 5). Similarly, the eight codons containing CpA (UCA, CCA, ACA, GCA, CAA, CAG, CAU and CAC) and five codons containing UpG (UUG, CUG, GUG, UGU and UGC) were also over-represented compared with the rest of the codons for their respective amino acids, and a majority of them were also preferential codons for their respective amino acids, based on RSCU analysis (Table 2). Over-representation of CpA and UpG in different organisms has been observed and is regarded as a consequence of the under-representation of CpG dinucleotides. One possible explanation is that methylated cytosines are prone to mutate into thymines through spontaneous deamination, resulting in the dinucleotide TpG and the subsequent presence of a CpA on the opposite strand after DNA replication. However, this theory cannot explain under-representation of CpGs in RNA viruses.
Moreover, under-representation of CpGs has also been observed in several vertebrate viruses, where it is independent of their genomic composition and replication cycles. Recently, two studies performed large-scale dinucleotide analyses in different viruses and suggested that the CpG usage of +ssRNA viruses is affected greatly by their hosts. As a result, most +ssRNA viruses mimic their hosts’ CpG usage and the existence of an RNA dinucleotide recognition system, probably linked to the innate immune system of the host, has also been proposed , .
Finally, the relative abundance of dinucleotides was also correlated with the first two principal axes. Among the 16 dinucleotides, 11 significantly (positive and negative) correlated with the first axis and 16 significantly (positive and negative) correlated with the second axis (Table 5). These observations indicated that the composition of dinucleotides determines the variation in synonymous codon usage. Therefore, from the present dinucleotide composition analysis, it is evident that selection pressure associated with (i) maintenance of efficient replication and transmission cycles among multiple hosts, and (ii) evolution of escape mechanisms to evade host antiviral responses, has contributed to shaping the overall synonymous codon usage in CHIKV.
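The relative dinucleotide abundances discussed in this section can be illustrated with a short sketch. The formula is not given in this part of the text, so the code assumes the standard odds ratio ρ(xy) = f(xy) / (f(x) · f(y)), where f(x) and f(y) are mononucleotide frequencies and f(xy) is the frequency of the overlapping dinucleotide xy. The function name `dinucleotide_odds` is illustrative, and thresholds for calling a dinucleotide over- or under-represented are deliberately left out.

```python
from collections import Counter
from itertools import product

def dinucleotide_odds(seq):
    """Observed/expected ratio for each of the 16 dinucleotides.

    Assumed form: rho_xy = f_xy / (f_x * f_y), counting overlapping
    dinucleotides along the sequence. T is treated as U.
    """
    s = seq.upper().replace("T", "U")
    n = len(s)
    mono = Counter(s)
    di = Counter(s[i:i + 2] for i in range(n - 1))
    ratios = {}
    for x, y in product("ACGU", repeat=2):
        expected = (mono[x] / n) * (mono[y] / n) if n else 0.0
        observed = di[x + y] / (n - 1) if n > 1 else 0.0
        # The ratio is undefined when either base is absent.
        ratios[x + y] = observed / expected if expected else float("nan")
    return ratios
```

A ratio near 1 means the dinucleotide occurs about as often as expected from base composition alone; values below or above 1 indicate depletion or enrichment, respectively.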
Effect of natural selection in shaping the codon usage patterns in CHIKV.
It has been suggested that if synonymous codon usage bias is affected by mutational pressure alone, then the frequency of nucleotides A and U/T should be equal to that of C and G at the synonymous codon third position. However, in the case of CHIKV genomes, variations in nucleotide base compositions were noted (Table 1), indicating that other factors, such as natural selection, could also influence overall synonymous codon usage bias. As the role of natural selection is also evident from previous codon usage analysis studies in several viruses, we were interested to determine to what extent natural selection might be involved in the codon usage patterns of CHIKV. For this purpose, we computed the GRAVY and aromaticity (ARO) values for each CHIKV isolate (Table S1), and a linear regression analysis was performed between GRAVY, ARO and the f′1, f′2, ENC, GC and GC3 values. The results showed that the correlation of GRAVY values with f′1 was not significant, whereas the correlations with f′2, ENC, GC3 and GC were highly significant. In the case of ARO, the opposite trend was observed: ARO values were significantly negatively correlated with f′1, and correlations with f′2, ENC, GC3 and GC were not significant (Table 6). These results indicated that, although natural selection has influenced codon usage of CHIKV genomes to some extent, its effect is much weaker than that of mutational pressure.
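As a minimal sketch of the two protein-level indices used above, GRAVY can be computed as the mean Kyte-Doolittle hydropathy value over all residues of the translated sequence, and aromaticity as the fraction of aromatic residues (Phe, Tyr, Trp). The function names `gravy` and `aromaticity` are illustrative, and skipping characters outside the 20 standard amino acids (for example a stop symbol) is an assumption of this sketch rather than something stated in the text.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def _standard_residues(protein):
    # Assumption: non-standard symbols (e.g. '*', 'X') are ignored.
    residues = [aa for aa in protein.upper() if aa in KD]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    return residues

def gravy(protein):
    """Mean Kyte-Doolittle hydropathy over the protein sequence."""
    residues = _standard_residues(protein)
    return sum(KD[aa] for aa in residues) / len(residues)

def aromaticity(protein):
    """Fraction of aromatic residues (F, Y, W) in the sequence."""
    residues = _standard_residues(protein)
    return sum(aa in "FYW" for aa in residues) / len(residues)
```

Positive GRAVY values indicate a more hydrophobic protein and negative values a more hydrophilic one; both indices would then be regressed against f′1, f′2, ENC, GC and GC3 as described in the text.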
Taken together, our analysis showed that overall codon usage in CHIKV is slightly biased, and that the major factor shaping codon usage is mutational pressure. In addition, contributions of other factors, including hosts, geography, dinucleotide composition and natural selection, are also evident from our analysis. Our data suggested that codon usage in CHIKV is undergoing an evolutionary process, probably reflecting a dynamic process of mutation and natural selection to re-adapt its codon usage to different environments and hosts. To the best of our knowledge, this is the first report of codon usage analysis in CHIKV and is expected to deepen our understanding of the mechanisms contributing towards codon usage and evolution of CHIKV.
Materials and Methods
The complete genome sequences of 141 CHIKV isolates (in FASTA format) were obtained from the National Center for Biotechnology Information (NCBI) GenBank database (http://www.ncbi.nlm.nih.gov). The accession numbers and other detailed information on the selected CHIKV genomes, such as isolation date, isolation place, host and genome size, were also retrieved (Table 7).
The following compositional properties were calculated for the CHIKV genomes: (i) the overall frequency of occurrence of the nucleotides (A %, C %, U/T %, and G %); (ii) the frequency of each nucleotide at the third site of the synonymous codons (A3%, C3%, U3% and G3%); (iii) the frequencies of occurrence of nucleotides G+C at the first (GC1), second (GC2), and third synonymous codon positions (GC3); (iv) the mean frequencies of nucleotide G+C at the first and the second position (GC1,2); and (v) the overall GC and AU content. The codons AUG and UGG are the only codons for Met and Trp, respectively, and the termination codons UAA, UAG and UGA do not encode any amino acids. These five codons are therefore not expected to exhibit any usage bias and were excluded from the analysis.
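The compositional properties listed above can be sketched as follows. This is a minimal illustration, not the software used in the study: the function name `nucleotide_composition` is hypothetical, and the choice to compute all position-specific values only from codons that have synonymous alternatives is an assumption of the sketch.

```python
from collections import Counter

# The five codons the text excludes: AUG (Met), UGG (Trp) and the stop codons.
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def nucleotide_composition(cds):
    """Return composition percentages for one in-frame coding sequence.

    U is stored as T, so the keys "T" and "T3" correspond to U% and U3%.
    """
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

    # Overall base frequencies use every complete codon.
    bases = Counter("".join(codons))
    total = sum(bases[b] for b in "ACGT")
    result = {b: 100.0 * bases[b] / total for b in "ACGT"}
    result["GC"] = result["G"] + result["C"]
    result["AU"] = result["A"] + result["T"]

    # Position-specific values use only codons with synonymous alternatives
    # (an assumption of this sketch).
    syn = [c for c in codons if c not in EXCLUDED]
    third = Counter(c[2] for c in syn)
    for b in "ACGT":
        result[b + "3"] = 100.0 * third[b] / len(syn)
    for pos in range(3):
        gc = sum(c[pos] in "GC" for c in syn)
        result["GC%d" % (pos + 1)] = 100.0 * gc / len(syn)
    result["GC12"] = (result["GC1"] + result["GC2"]) / 2
    return result
```

For a whole genome, the function would be applied to each open reading frame (or to their concatenation), since it assumes the input starts in frame.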
The RSCU values for all the coding sequences of CHIKV genomes were calculated to determine the characteristics of synonymous codon usage without the confounding influence of amino acid composition and the size of coding sequence of different gene samples, following a previously described method. The RSCU index was calculated as follows:

$$\mathrm{RSCU} = \frac{g_{ij}}{\sum_{j}^{n_i} g_{ij}} \, n_i$$

where $g_{ij}$ is the observed number of the ith codon for the jth amino acid, which has $n_i$ kinds of synonymous codons; the denominator sums the observed counts over all $n_i$ synonymous codons of that amino acid. RSCU values represent the ratio between the observed usage frequency of one codon in a gene sample and the expected usage frequency in the synonymous codon family given that all codons for the particular amino acid are used equally. The synonymous codons with RSCU values >1.0 have positive codon usage bias and were defined as abundant codons, while those with RSCU values <1.0 have negative codon usage bias and were defined as less-abundant codons. An RSCU value of 1.0 means that there is no codon usage bias for that amino acid and that the codons are chosen equally or randomly. Moreover, the synonymous codons with RSCU values >1.6 and <0.6 were treated as over-represented and under-represented codons, respectively.
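As an illustration of the RSCU definition, the following minimal sketch counts codons in one coding sequence and divides each count by the count expected under equal usage within its synonymous family. It assumes the standard genetic code; the name `rscu` and the decision to return NaN for amino acids absent from the sequence are choices of this sketch rather than details given in the text.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for the 59 codons that have synonyms."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))

    # Group codons by amino acid, skipping stops, Met and Trp.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # observed count * family size / family total
            values[c] = counts[c] * len(codons) / total if total else float("nan")
    return values
```

With this definition, a codon used exactly as often as expected under equal usage scores 1.0, and the values within a family sum to the family size.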
Influence of Overall Codon Usage of the Hosts on that of CHIKV
For the comparative analysis of codon usage between CHIKV and its vectors and hosts, codon usage data for two transmission vectors (A. aegypti, A. albopictus) and two hosts (H. sapiens, P. troglodytes) were obtained from the codon usage database (http://www.kazusa.or.jp/codon/). Zhou et al. recently proposed a method to determine the potential impact of the overall codon usage patterns of the hosts on the formation of the overall codon usage of viruses. Here, we applied the same approach to CHIKV, and the similarity index D(A,B) was calculated as follows:

$$R(A,B) = \frac{\sum_{i=1}^{59} a_i b_i}{\sqrt{\sum_{i=1}^{59} a_i^2 \cdot \sum_{i=1}^{59} b_i^2}}$$

$$D(A,B) = \frac{1 - R(A,B)}{2}$$

where R(A,B) is defined as the cosine of the included angle between the A and B spatial vectors and represents the degree of similarity between CHIKV and a specific host in terms of the overall codon usage pattern, $a_i$ is the RSCU value for a specific codon among the 59 synonymous codons of the CHIKV coding sequence, and $b_i$ is the RSCU value for the same codon of the host. D(A,B) represents the potential effect of the overall codon usage of the host on that of CHIKV, and its value ranges from zero to 1.0.
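The similarity index can be sketched in a few lines, assuming that R(A,B) is the cosine similarity of the two 59-dimensional RSCU vectors and that D(A,B) = (1 − R(A,B))/2. The function name `similarity_index` and the dictionary input format are assumptions of this sketch.

```python
import math


def similarity_index(virus_rscu, host_rscu):
    """Return (R, D) for two {codon: RSCU} dictionaries.

    R is the cosine of the angle between the two RSCU vectors and
    D = (1 - R) / 2. Only codons present in both inputs are compared.
    """
    codons = sorted(set(virus_rscu) & set(host_rscu))
    dot = sum(virus_rscu[c] * host_rscu[c] for c in codons)
    norm = math.sqrt(
        sum(virus_rscu[c] ** 2 for c in codons)
        * sum(host_rscu[c] ** 2 for c in codons)
    )
    r = dot / norm
    return r, (1.0 - r) / 2.0
```

Identical usage patterns give R = 1 and D = 0, so under this formulation a smaller R corresponds to a larger D.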
Measures of Relative Dinucleotides Abundance
The relative abundance of dinucleotides in the coding regions of CHIKV genomes was calculated using a previously described method. A comparison of actual and expected dinucleotide frequencies of the 16 dinucleotides in coding regions of CHIKV was also undertaken. The odds ratio was calculated using the following formula:

$$P_{xy} = \frac{f_{xy}}{f_x f_y}$$

where $f_x$ denotes the frequency of the nucleotide X, $f_y$ denotes the frequency of the nucleotide Y, $f_x f_y$ is the expected frequency of the dinucleotide XY and $f_{xy}$ is the observed frequency of the dinucleotide XY. The ratio was calculated for each of the 16 dinucleotides. As a conservative criterion, for $P_{xy}$ >1.23 (or <0.78), the XY pair is considered to be over-represented (or under-represented) in terms of relative abundance compared with a random association of mononucleotides.
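A minimal sketch of the odds ratio calculation is given below, assuming the ratio is observed dinucleotide frequency divided by the product of the two mononucleotide frequencies. Counting overlapping dinucleotides along the sequence is an assumption of the sketch, and the names `dinucleotide_odds` and `classify` are illustrative.

```python
from collections import Counter
from itertools import product


def dinucleotide_odds(seq):
    """Return {dinucleotide: observed / expected frequency}.

    Dinucleotides are counted at every overlapping position.
    Pairs involving a base that never occurs are left out.
    """
    seq = seq.upper().replace("U", "T")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1

    odds = {}
    for x, y in product("ACGT", repeat=2):
        fx, fy = mono[x] / n_mono, mono[y] / n_mono
        if fx and fy:
            odds[x + y] = (di[x + y] / n_di) / (fx * fy)
    return odds


def classify(p_xy):
    """Apply the conservative thresholds quoted in the text."""
    if p_xy > 1.23:
        return "over-represented"
    if p_xy < 0.78:
        return "under-represented"
    return "within expected range"
```

For example, `classify(dinucleotide_odds(seq)["CG"])` would report whether CpG is depleted in a given coding sequence `seq`, provided both C and G occur in it.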
The CAI is used as a quantitative method of predicting the expression level of a gene based on its codon sequence. The CAI value ranges from 0 to 1. The most frequently used codons have the highest relative adaptiveness values, and sequences with higher CAIs are preferred over those with lower CAIs.
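The idea of relative adaptiveness can be made concrete with a short sketch. It assumes the commonly used formulation in which each codon's weight is its count in a reference set divided by the count of the most frequent synonymous codon, and the CAI is the geometric mean of the weights along the sequence. The reference counts are a placeholder supplied by the caller, and skipping codons with zero weight is a simplification of this sketch, not a detail taken from the text.

```python
import math
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def relative_adaptiveness(ref_counts):
    """Weight of each codon relative to the top codon of its family."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":
            families[aa].append(codon)
    weights = {}
    for codons in families.values():
        top = max(ref_counts.get(c, 0) for c in codons)
        if top:
            for c in codons:
                weights[c] = ref_counts.get(c, 0) / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the scorable codons in cds."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # Codons without a positive weight are skipped (sketch simplification).
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))
```

A sequence built only from the most frequent codon of each family scores 1.0, which is the upper bound of the index.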
The ENC is used to quantify the absolute codon usage bias of the gene(s) of interest, irrespective of gene length and the number of amino acids. In this study, this measure was calculated to evaluate the degree of codon usage bias exhibited by the coding sequences of CHIKV. ENC values range from 20, for a gene showing extreme codon usage bias that uses only one of the possible synonymous codons for each amino acid, to 61, for a gene showing no bias that uses all possible synonymous codons equally for each amino acid. The larger the extent of codon preference in a gene, the smaller the ENC value. It is also generally accepted that genes have a significant codon bias when the ENC value is less than or equal to 35. The ENC was calculated using the following formula:

$$\mathrm{ENC} = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}$$

where $\bar{F}_k$ (k = 2, 3, 4, 6) is the mean of the $F_k$ values for the k-fold degenerate amino acids, which is estimated using the following formula:

$$F_k = \frac{nS - 1}{n - 1}$$

where n is the total number of occurrences of the codons for that amino acid and

$$S = \sum_{i=1}^{k} \left( \frac{n_i}{n} \right)^2$$

where $n_i$ is the total number of occurrences of the ith codon for that amino acid. Genes whose codon choice is constrained only by a mutation bias will lie on or just below the curve of the expected ENC values. Therefore, to elucidate the relationship between GC3 and ENC values, the expected ENC values for different GC3 values were calculated as follows:

$$\mathrm{ENC}_{\mathrm{expected}} = 2 + s + \frac{29}{s^2 + (1 - s)^2}$$

where s represents the given GC3 value, expressed as a proportion rather than a percentage.
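The ENC calculation and the expected curve can be sketched as follows, assuming the formulation ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 with F = (nS − 1)/(n − 1) per amino acid, and an expected value of 2 + s + 29/(s² + (1 − s)²) for a GC3 proportion s. Capping the result at 61 and raising an error when a degeneracy class is absent are choices of this sketch; the function names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def enc(cds):
    """Effective number of codons for one in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":
            families[aa].append(codon)

    # Collect the homozygosity F of each amino acid by degeneracy class k.
    f_values = defaultdict(list)
    for codons in families.values():
        n = sum(counts[c] for c in codons)
        if n > 1:
            s = sum((counts[c] / n) ** 2 for c in codons)
            f = (n * s - 1) / (n - 1)
            if f > 0:
                f_values[len(codons)].append(f)

    missing = [k for k in (2, 3, 4, 6) if k not in f_values]
    if missing:
        raise ValueError("no usable amino acids for degeneracy %s" % missing)

    mean_f = {k: sum(v) / len(v) for k, v in f_values.items()}
    value = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(value, 61.0)  # cap at the theoretical maximum


def expected_enc(s):
    """Expected ENC under mutational bias alone; s is GC3 as a proportion."""
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)
```

Plotting `enc(cds)` against GC3 for each genome, together with the curve from `expected_enc`, gives the kind of ENC-GC3 comparison described in the paragraph above.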
COA of Codon Usage
COA is a multivariate statistical method that is used to explore the relationships between variables and samples. In the present study, COA was used to analyze the major trends in codon usage patterns among CHIKV coding sequences. COA involves a mathematical procedure that transforms a set of correlated variables (RSCU values) into a smaller number of uncorrelated variables called principal components. To minimize the effect of amino acid composition on codon usage, each coding sequence was represented as a 59-dimensional vector, in which each dimension corresponded to the RSCU value of one sense codon. Only codons with synonymous alternatives were included; the codons AUG and UGG and the three stop codons were excluded.
Correlation analysis was carried out to identify the relationship between nucleotide composition and synonymous codon usage patterns of CHIKV, using Spearman's rank correlation. All statistical analyses were carried out using the statistical software SPSS 16.0 for Windows.
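For readers without access to SPSS, the rank correlation itself can be illustrated with a small sketch: values are replaced by their ranks (ties receive the average rank) and the Pearson correlation of the ranks is returned. This is not the SPSS procedure used in the study and it computes no significance test; the function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In this setting, `x` could hold the GC3 values of the genomes and `y` their coordinates on the first COA axis.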
Hydrophobicity (GRAVY) and aromaticity (ARO) indices in CHIKV genomes.
Conceived and designed the experiments: AMB YT. Performed the experiments: AMB IN. Analyzed the data: AMB IN YT. Contributed reagents/materials/analysis tools: AMB IN YT. Wrote the paper: AMB IN.
- 1. Strauss JH, Strauss EG (1994) The alphaviruses: gene expression, replication, and evolution. Microbiol Rev 58: 491–562.
- 2. Robinson MC (1955) An epidemic of virus disease in Southern Province, Tanganyika Territory, in 1952–53. I. Clinical features. Trans R Soc Trop Med Hyg 49: 28–32.
- 3. Schuffenecker I, Iteman I, Michault A, Murri S, Frangeul L, et al. (2006) Genome microevolution of chikungunya viruses causing the Indian Ocean outbreak. PLoS Med 3: e263.
- 4. Powers AM, Brault AC, Tesh RB, Weaver SC (2000) Re-emergence of Chikungunya and O’nyong-nyong viruses: evidence for distinct geographical lineages and distant evolutionary relationships. J Gen Virol 81: 471–479.
- 5. Kumar NP, Joseph R, Kamaraj T, Jambulingam P (2008) A226V mutation in virus during the 2007 chikungunya outbreak in Kerala, India. J Gen Virol 89: 1945–1948.
- 6. Theamboonlers A, Rianthavorn P, Praianantathavorn K, Wuttirattanakowit N, Poovorawan Y (2009) Clinical and molecular characterization of chikungunya virus in South Thailand. Jpn J Infect Dis 62: 303–305.
- 7. Jupp PG, McIntosh BM (1988) Chikungunya disease. In: Monath TP, editor. The Arboviruses: Epidemiology and Ecology. Boca Raton (Florida): CRC Press. 137–157.
- 8. Powers AM, Logue CH (2007) Changing patterns of chikungunya virus: re-emergence of a zoonotic arbovirus. J Gen Virol 88: 2363–2377.
- 9. Pialoux G, Gauzere BA, Jaureguiberry S, Strobel M (2007) Chikungunya, an epidemic arbovirosis. Lancet Infect Dis 7: 319–327.
- 10. Grantham R, Gautier C, Gouy M, Mercier R, Pave A (1980) Codon catalog usage and the genome hypothesis. Nucleic Acids Res 8: r49–r62.
- 11. Marin A, Bertranpetit J, Oliver JL, Medina JR (1989) Variation in G+C-content and codon choice: differences among synonymous codon groups in vertebrate genes. Nucleic Acids Res 17: 6181–6189.
- 12. Gu W, Zhou T, Ma J, Sun X, Lu Z (2004) Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales. Virus Res 101: 155–161.
- 13. Liu YS, Zhou JH, Chen HT, Ma LN, Pejsak Z, et al. (2011) The characteristics of the synonymous codon usage in enterovirus 71 virus and the effects of host on the virus in codon usage pattern. Infect Genet Evol 11: 1168–1173.
- 14. Ma JJ, Zhao F, Zhang J, Zhou JH, Ma LN, et al. (2013) Analysis of Synonymous Codon Usage in Dengue Viruses. Journal of Animal and Veterinary Advances 12: 88–98.
- 15. Moratorio G, Iriarte A, Moreno P, Musto H, Cristina J (2013) A detailed comparative analysis on the overall codon usage patterns in West Nile virus. Infect Genet Evol 14: 396–400.
- 16. Sharp PM, Cowe E, Higgins DG, Shields DC, Wolfe KH, et al. (1988) Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens; a review of the considerable within-species diversity. Nucleic Acids Res 16: 8207–8211.
- 17. Tao P, Dai L, Luo M, Tang F, Tien P, et al. (2009) Analysis of synonymous codon usage in classical swine fever virus. Virus Genes 38: 104–112.
- 18. Sharp PM, Li WH (1986) Codon usage in regulatory genes in Escherichia coli does not reflect selection for ‘rare’ codons. Nucleic Acids Res 14: 7737–7749.
- 19. Duret L, Mouchiroud D (1999) Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc Natl Acad Sci U S A 96: 4482–4487.
- 20. Van der Linden MG, de Farias ST (2006) Correlation between codon usage and thermostability. Extremophiles 10: 479–481.
- 21. Jenkins GM, Holmes EC (2003) The extent of codon usage bias in human RNA viruses and its evolutionary origin. Virus Res 92: 1–7.
- 22. Wang M, Zhang J, Zhou JH, Chen HT, Ma LN, et al. (2011) Analysis of codon usage in bovine viral diarrhea virus. Arch Virol 156: 153–160.
- 23. Wong EH, Smith DK, Rabadan R, Peiris M, Poon LL (2010) Codon usage bias and the evolution of influenza A viruses. Codon Usage Biases of Influenza Virus. BMC Evol Biol 10: 253.
- 24. Chen Y (2013) A comparison of synonymous codon usage bias patterns in DNA and RNA virus genomes: quantifying the relative importance of mutational pressure and natural selection. Biomed Res Int 2013: 406342.
- 25. Shi SL, Jiang YR, Liu YQ, Xia RX, Qin L (2013) Selective pressure dominates the synonymous codon usage in parvoviridae. Virus Genes 46: 10–19.
- 26. Zhang Z, Dai W, Wang Y, Lu C, Fan H (2013) Analysis of synonymous codon usage patterns in torque teno sus virus 1 (TTSuV1). Arch Virol 158: 145–154.
- 27. Zhang Z, Dai W, Dai D (2013) Synonymous Codon Usage in TTSuV2: Analysis and Comparison with TTSuV1. PLoS ONE 8: e81469.
- 28. Shackelton LA, Parrish CR, Holmes EC (2006) Evolutionary basis of codon usage and nucleotide composition bias in vertebrate DNA viruses. J Mol Evol 62: 551–563.
- 29. Hassan S, Mahalingam V, Kumar V (2009) Synonymous codon usage analysis of thirty two mycobacteriophage genomes. Adv Bioinformatics: 316936.
- 30. Wright F (1990) The ‘effective number of codons’ used in a gene. Gene 87: 23–29.
- 31. Hu JS, Wang QQ, Zhang J, Chen HT, Xu ZW, et al. (2011) The characteristic of codon usage pattern and its evolution of hepatitis C virus. Infect Genet Evol 11: 2098–2102.
- 32. Sharp PM, Li WH (1987) The codon Adaptation Index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15: 1281–1295.
- 33. Zhou H, Wang H, Huang LF, Naylor M, Clifford P (2005) Heterogeneity in codon usages of sobemovirus genes. Arch Virol 150: 1591–1605.
- 34. Mueller S, Papamichail D, Coleman JR, Skiena S, Wimmer E (2006) Reduction of the rate of poliovirus protein synthesis through large-scale codon deoptimization causes attenuation of viral virulence by lowering specific infectivity. J Virol 80: 9687–9696.
- 35. Sanchez G, Bosch A, Pinto RM (2003) Genome variability and capsid structural constraints of hepatitis A virus. J Virol 77: 452–459.
- 36. Zhou JH, Zhang J, Sun DJ, Ma Q, Chen HT, et al. (2013) The distribution of synonymous codon choice in the translation initiation region of dengue virus. PLoS One 8: e77239.
- 37. Tsetsarkin KA, Weaver SC (2011) Sequential adaptive mutations enhance efficient vector switching by Chikungunya virus and its epidemic emergence. PLoS Pathog 7: e1002412.
- 38. Tsetsarkin KA, Vanlandingham DL, McGee CE, Higgs S (2007) A single mutation in chikungunya virus affects vector specificity and epidemic potential. PLoS Pathog 3: e201.
- 39. Greenacre M (1984) Theory and Applications of Correspondence Analysis: Academic Pr. 364 p.
- 40. Tatarinova TV, Alexandrov NN, Bouck JB, Feldmann KA (2010) GC3 biology in corn, rice, sorghum and other grasses. BMC Genomics 11: 308.
- 41. Cheng X, Virk N, Chen W, Ji S, Ji S, et al. (2013) CpG usage in RNA viruses: data and hypotheses. PLoS One 8: e74109.
- 42. Chiusano ML, Alvarez-Valin F, Di Giulio M, D’Onofrio G, Ammirato G, et al. (2000) Second codon positions of genes and the secondary structures of proteins. Relationships and implications for the origin of the genetic code. Gene 261: 63–69.
- 43. Karlin S, Burge C (1995) Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 11: 283–290.
- 44. Dorn A, Kippenberger S (2008) Clinical application of CpG-, non-CpG-, and antisense oligodeoxynucleotides as immunomodulators. Curr Opin Mol Ther 10: 10–20.
- 45. Lobo FP, Mota BE, Pena SD, Azevedo V, Macedo AM, et al. (2009) Virus-host coevolution: common patterns of nucleotide motif usage in Flaviviridae and their hosts. PLoS One 4: e6282.
- 46. Karlin S, Mrazek J (1997) Compositional differences within and between eukaryotic genomes. Proc Natl Acad Sci U S A 94: 10227–10232.
- 47. Bird AP (1980) DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res 8: 1499–1504.
- 48. Greenbaum BD, Levine AJ, Bhanot G, Rabadan R (2008) Patterns of evolution and host gene mimicry in influenza and other RNA viruses. PLoS Pathog 4: e1000079.
- 49. Barrett JW, Sun Y, Nazarian SH, Belsito TA, Brunetti CR, et al. (2006) Optimization of codon usage of poxvirus genes allows for improved transient expression in mammalian cells. Virus Genes 33: 15–26.
- 50. Sharp PM, Li WH (1986) An evolutionary perspective on synonymous codon usage in unicellular organisms. J Mol Evol 24: 28–38.
- 51. Nakamura Y, Gojobori T, Ikemura T (2000) Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res 28: 292.
- 52. Comeron JM, Aguade M (1998) An evaluation of measures of synonymous codon usage bias. J Mol Evol 47: 268–274.