Gaining Insights into the Codon Usage Patterns of TP53 Gene across Eight Mammalian Species

The TP53 gene is known as the “guardian of the genome” because it plays a vital role in regulating the cell cycle, cell proliferation, DNA damage repair, initiation of programmed cell death and suppression of tumor growth. Non-uniform usage of synonymous codons for a specific amino acid during translation, known as codon usage bias (CUB), is a unique property of the genome and shows species-specific deviation. Analysis of codon usage bias together with the compositional dynamics of coding sequences has contributed to a better understanding of the molecular mechanism and evolution of a particular gene. In this study, the complete nucleotide coding sequences of the TP53 gene from eight different mammalian species were used for CUB analysis. Our results showed that the codon usage patterns of the TP53 gene across different mammalian species have been influenced by GC bias, particularly GC3, and that a moderate bias exists in the codon usage of the TP53 gene. Moreover, we observed that the most over-represented codon, CTG for leucine, has been strongly favored, whereas the ATA codon for isoleucine has been selected against, in the TP53 gene across all the mammalian species during the course of evolution.


Introduction
The TP53 gene encodes tumor protein p53, which is known as the "guardian of the genome" because it plays a vital role in maintaining genomic stability by preventing mutation in the genome [1]. p53 acts primarily as a transcription factor and is a key player in restricting tumor cell invasion through its ability to induce cell cycle arrest, DNA repair, senescence and apoptosis [2]. Mutation in p53 results in abnormal proliferation of cells that leads to tumor development, and so the TP53 gene is cataloged as a tumor suppressor gene [3].
The nucleus of a cell is the main storehouse of tumor protein p53, where it binds to DNA. When the DNA of a cell is damaged by external agents such as toxic chemicals, radiation, or exposure to sunlight or ultraviolet rays, p53 plays a crucial role in activating other genes and halting the cell cycle so that the damage can be repaired [4]. If DNA repair fails, p53 prevents the cell from dividing and signals to a wide variety of genes that contribute to TP53-mediated cell death, i.e., apoptosis [5].
Unequal usage of synonymous codons that encode the same amino acid during translation of a gene into protein is known as codon usage bias (CUB). Some codons in a synonymous group are used more frequently whereas others less frequently in the genome of an organism [6,7]. CUB is a unique property of the genome and it may vary between genes from the same genome or within a single gene [8,9].
The advent of whole genome sequencing in different organisms and the easily accessible nucleotide database of NCBI (GenBank) have drawn the attention of the scientific community to the study of CUB as a source of clues for understanding the molecular evolution of genes and for genome characterization.
Previously, several studies were conducted on synonymous codon usage bias in a wide variety of organisms including prokaryotes and eukaryotes [10-16], and to date the codon usage patterns of many organisms have been interpreted in different ways. Many genomic factors such as gene length, GC content, recombination rate, gene expression level and modulation of the genetic code are associated with CUB in different organisms [17-21]. In general, compositional constraints under natural selection or mutation pressure are considered the major factors in codon usage variation among different organisms [8,22-25]. Moreover, studies revealed that mutation pressure, natural or translational selection, secondary protein structure, replication and selective transcription, hydrophobicity and hydrophilicity of the protein and the external environment play a major role in the codon usage pattern of organisms [26]. In unicellular and multicellular organisms, it was observed that the use of preferred (optimal) synonymous codons, which correspond to abundant tRNA gene copy numbers, rises with gene expression level within the genome; this supports selection for high codon bias and is confirmed by the positive correlation between optimal codons and tRNA abundance [18,22,27]. Urrutia and Hurst (2003) reported a weak correlation between gene expression level and codon usage bias within the human genome, though it was not related to tRNA abundance [19]. However, Comeron (2004) observed that in the human genome, highly expressed genes show a preference for codons with the most abundant tRNA gene copy number compared to less highly expressed genes [28].
The study of codon usage bias acquires significance in biology not only in the context of understanding the process of evolution at molecular level but also in designing transgenes for increased expression, discovering new genes [29] based on nucleotide compositional dynamics, detecting lateral gene transfer and for analyzing the functional conservation of gene expression [30]. Codon usage bias may be superimposed on the effect of natural selection. The amount of protein produced from the mRNA transcript may vary significantly since the translational properties of alternate synonymous codons are not equivalent [31]. Several studies have further shown that codon usage bias is associated with highly expressed genes as some codons are used more often than others in the coding sequences [32]. Moreover, literature suggested that a gene can be epitomized not only by the sequence of its amino acid but also by its codon usage patterns shaped by the balance between mutational bias and natural selection [33]. As a consequence of selection pressure within a gene, differentiation in codon bias may arise between species of the same genus.
The present study was undertaken to perform a comparative analysis of codon bias and the compositional dynamics of codon usage patterns in the TP53 gene across eight different mammalian species using nucleotide composition (GC contents) and several genetic indices, namely the effective number of codons (ENC), relative synonymous codon usage (RSCU) and relative codon usage bias (RCBS). Our analysis gives a novel insight into the codon usage patterns of the TP53 gene that should facilitate a better understanding of the structural, functional and evolutionary significance of the gene among the mammalian species.

Results and Discussion
Codon usage patterns in TP53 genes across mammalian species
The correlation coefficient between codon usage and GC bias was analyzed using a heat map (Fig. 1) in order to find the relationship between codon usage variation and GC constraints among the selected coding sequences of TP53 genes. In our analysis, nearly all codons ending with a G/C base showed positive correlation with GC bias and nearly all A/T-ending codons showed negative correlation with GC bias. However, 8 G/C-ending codons (ATC, ACG, TAC, TTG, TCC, CAC, GTG, GGG) showed negative correlation with GC bias whereas 6 A/T-ending codons (AAT, ATT, TGT, CGA, GTA, GGA) showed positive correlation with AT bias, although these correlations were not statistically significant (p>0.05). Two G-ending codons, i.e. TCG for serine and CTG for leucine, showed strong positive correlation (p<0.01) with GC3s, indicating that codon usage has been influenced by GC bias due to GC3s. Interestingly, we observed that the codon ATA encoding isoleucine was not favored by natural selection in TP53 genes across mammalian species during the course of evolution. Thus, scanning the codon usage pattern provides a basis for the mechanism of synonymous codon usage bias and has both practical and theoretical significance for understanding molecular biology [34].

G/C-ending codons are favored by TP53 gene across mammalian species
We analyzed the nucleotide composition of the coding sequences of TP53 genes (Table 1), which revealed that the mean count of C (361.50) was the highest, followed by G (306.75), A (270.88) and T (227.50), among all the selected mammals. The mean percentages of GC and AT composition were 57.3% and 42.7% respectively. Thus, the overall nucleotide composition suggested that C and G occurred more frequently than A and T in the coding sequences of the TP53 gene across the mammalian species. The nucleotide composition at the third codon position (A3, T3, G3, C3) showed that the mean values of C3 and G3 were the highest, followed by T3 and A3. The GC3 values (range 58.7%-70.6%, mean = 65.1%, SD = 0.040) were compared with the AT3 values (range 29.4%-41.3%, mean = 34.9%, SD = 0.040) in the coding sequences of TP53 genes. The average GC content at the first and second codon positions (GC12) ranged from 52.6% to 54.9% with a mean of 53.4% and a standard deviation (SD) of 0.008. Therefore, nucleotide composition analysis suggested that GC-ending codons might be preferred over AT-ending codons in the coding sequences of TP53 genes across the selected mammalian species. Further, we calculated the frequency of optimal codons (Fop) for each amino acid as suggested by Lavner and Kotler (2005) [14]. The frequencies were then analyzed statistically to find the most and least frequently used codons. Our results showed that the most frequently used codon for each amino acid was G/C-ending (Fig. 2) in TP53 genes across mammalian species.

Relative synonymous codon usage in TP53 gene across mammals
The relative synonymous codon usage values of 59 codons for the TP53 gene across eight mammalian species were analyzed, excluding the codons ATG (methionine) and TGG (tryptophan). An RSCU value greater than 1.0 indicates that the codon is used more frequently than expected, and a value less than 1.0 indicates that it is used less frequently, for the corresponding amino acid. An RSCU value greater than 1.6 indicates an over-represented codon for the corresponding amino acid. The overall RSCU values in the selected coding sequences of the TP53 gene revealed that 25 of the 59 codons were frequently used and that the predominantly used codons were G/C-ending rather than A/T-ending (Table 2). In addition, C-ending codons were mostly favored over G-ending codons in the coding sequence of the TP53 gene among the selected mammalian species. Our results showed marked similarities to those reported by Dass et al. (2012) for the serotonin receptor gene family from different mammalian species [35]. Further, clustering analysis of RSCU values (Fig. 3) showed that the codons GCC, CGC (except Rattus norvegicus), ATC (except Tupaia chinensis), CTG, ACC (except Rattus norvegicus, Macaca mulatta) and GTG (except Felis catus) were over-represented (RSCU>1.6). The highest RSCU value was found for the codon CTG for leucine in all TP53 genes across mammalian species. The codon ATA had an RSCU value of zero, i.e. it was not used at all, which suggests that natural selection has not favored this codon in the TP53 gene across mammalian species.

Codon usage patterns of TP53 gene correspond to phylogeny of mammalian species
We performed a neighbor-joining tree analysis based on Kimura 2-parameter (K2P) distances of the coding sequences of the TP53 gene across mammalian species (Fig. 4). We observed that codon usage patterns in TP53 genes have significant similarities among closely related mammalian species. The TP53 gene of H. sapiens resembled that of M. mulatta; similarly, TP53 of F. catus resembled that of C. lupus, and TP53 of M. unguiculatus resembled that of R. norvegicus. Generally, genes with similar functions exhibit similar patterns of codon usage frequency [36]. Our analysis further suggested that the coding sequences of the TP53 gene share similar patterns of codon usage bias across the eight mammalian species.
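The K2P distance underlying the tree can be illustrated with a minimal Python sketch. It assumes two pre-aligned sequences of equal length and uses the standard Kimura (1980) formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function name `k2p_distance` and the toy sequences are illustrative only; the source does not state which software computed the distances or built the tree.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura 2-parameter distance between two aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = sites = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases (an assumption of this sketch)
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


if __name__ == "__main__":
    # Toy aligned fragments, not real TP53 sequences.
    print(k2p_distance("ATGGAGGAGCCG", "ATGGAAGAGCCT"))
```

A matrix of such pairwise distances is what a neighbor-joining algorithm would take as input.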

Selection pressure over TP53 gene across mammalian species
The ENC values of the coding sequences ranged from 52 to 59 with a mean of 55.5±2.33, indicating relatively small variation in the codon usage of the TP53 gene across the eight mammalian species. The GC3s values ranged from 0.59 to 0.71 with a mean of 0.65±0.040. A significant negative correlation (Pearson r = -0.979, p<0.01) was observed between ENC and GC3s. Moreover, a plot of ENC vs GC3s (Fig. 5) showed that comparatively lower ENC values were linked to higher GC3s values. All the selected coding sequences of the TP53 gene across the selected mammalian species had a higher predominance of G/C-ending codons. This suggested that GC3s values determined the codon usage pattern in the coding sequences of the TP53 gene [33]. Nabiyouni et al. (2013) reported that eukaryotic organisms with very high GC content have high GC3 composition while organisms with low GC content have low GC3 composition in the genome [37]. We also calculated GC3 skew values, which ranged from 0.000 to -0.094, indicating that the GC composition at the third codon position might have played an important role in the codon usage bias [38]. Negative GC skew was observed in all the coding sequences of the TP53 gene, which revealed the abundance of C over G [39]. In addition, lower values of the frequency of optimal codons (Fop) and the effective number of codons (ENC) along with higher GC contents suggested that a moderate bias exists in the usage of synonymous codons [33] for the TP53 gene in different mammalian species. Codon usage bias was more pronounced in the TP53 gene of M. unguiculatus than in the other mammalian species (Table 3).
The RCBS value of a gene can be used as an effective measure for predicting gene expression, and its value depends on the patterns of codon usage along with the nucleotide compositional bias of the gene [20]. The distribution of RCBS values for the TP53 gene across the eight mammalian species is shown in Fig. 6. The RCBS values ranged from 0.006 to 0.065 with a mean of 0.039 and a standard deviation (SD) of 0.021. In our analysis, the low mean RCBS value suggested a low codon bias for the TP53 gene, associated with a low expression level [20].

Fig. 2. Overall frequency of optimal and non-optimal codons used in TP53 genes among mammals. Red color coding represents the optimal codons used for the corresponding amino acid. doi:10.1371/journal.pone.0121709.g002
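The ENC versus GC3s relationship in this section is a Pearson correlation across the eight species. A minimal Python sketch of that coefficient follows; the source reports using SPSS, so this is only an illustration of the statistic, and the numbers in the usage example are made-up toy values, not the paper's per-species data (which are in Table 3).

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical toy values for illustration only.
    toy_enc = [58.0, 56.5, 55.0, 53.0]
    toy_gc3s = [0.60, 0.63, 0.66, 0.70]
    print(round(pearson_r(toy_enc, toy_gc3s), 3))
```

A value close to -1 on real data would correspond to the pattern described above, where lower ENC accompanies higher GC3s.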

Conclusions
In brief, our results showed that codon usage in the TP53 gene in mammals has been influenced by GC bias, mainly due to GC3s. The majority of frequently used codons were G/C-ending, among which C-ending codons were mostly favored over G-ending codons for the corresponding amino acid. The most over-represented codon was CTG, encoding leucine, in the TP53 gene of all the selected mammalian species. We further observed that the codon ATA encoding isoleucine was selected against in TP53 genes across the mammalian species under study during the course of evolution. The codon usage pattern for TP53 in H. sapiens resembled that of M. mulatta; similarly, F. catus resembled C. lupus and M. unguiculatus resembled R. norvegicus. Moderate codon bias was observed for the TP53 gene in different mammalian species. The codon usage patterns in the coding sequence of the TP53 gene across different mammalian species showed significant similarities, suggesting that the evolutionary pattern might be similar. According to Yang and Nielsen (2008), codon bias in mammals is mainly influenced by mutation bias, and selection on codon bias is weak for nearly neutral synonymous mutations [40]. From the work of Grantham et al. (1980-1981) on the "genome hypothesis" it was evident that species-specific genes share a similar spectrum of codon usage frequency [41,42]. The present study revealed that a specific gene with similar functions in closely related species exhibits similar patterns of codon bias across different mammals, in agreement with the previous work of Dass et al. (2012) [35]. To the best of our knowledge, this is the first report on the codon usage pattern of the TP53 gene across mammalian species. Since our analysis has given better insights into its codon usage, it may have theoretical value for further understanding the molecular evolution of the TP53 gene.

Sequence Data
The complete nucleotide coding sequences (cds) of the TP53 gene that had a proper start and stop codon, were devoid of any unknown bases (N) and had a length that was an exact multiple of three bases were retrieved from the National Center for Biotechnology Information (NCBI) GenBank database (http://www.ncbi.nlm.nih.gov). Finally, we selected eight coding sequences of the TP53 gene from different mammalian species that fulfilled the above-mentioned criteria and used them in our CUB analysis (Table 4).
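The screening criteria above can be expressed as a short check. The following Python sketch is an illustration only: it assumes ATG as the start codon and the standard stop codons TAA, TAG and TGA, and the function name `is_valid_cds` is hypothetical. It does not test for internal stop codons, which the source does not mention.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def is_valid_cds(seq: str) -> bool:
    """Return True if a coding sequence passes the four screening criteria."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:   # length must be a multiple of three
        return False
    if set(seq) - set("ACGT"):              # no unknown bases such as N
        return False
    if not seq.startswith("ATG"):           # proper start codon
        return False
    return seq[-3:] in STOP_CODONS          # proper stop codon


if __name__ == "__main__":
    # Toy sequences, not real TP53 records.
    toy = {
        "ok": "ATGGAGCCGTGA",
        "has_N": "ATGGANCCGTGA",
        "no_stop": "ATGGAGCCGCCG",
    }
    for name, sequence in toy.items():
        print(name, is_valid_cds(sequence))
```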

Prediction of Base Composition Bias
The overall frequency of the nucleotides G+C at the first (GC1), second (GC2) and third (GC3) positions of synonymous codons was calculated to quantify the extent of base composition bias. Moreover, we analyzed the skewness of AT, GC and GC3s for each coding sequence to estimate the base composition bias, particularly in relation to transcription processes.
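As a rough illustration of these composition measures, the Python sketch below computes GC content overall and at each codon position, plus a skew value. Two points are assumptions rather than statements of the authors' method: skew is taken in the common form (G - C)/(G + C), which is consistent with the reading elsewhere in the text that negative GC skew means more C than G, and the third-position values here use all codons, whereas GC3s would restrict the count to synonymous codons.

```python
from collections import Counter


def position_gc(seq: str) -> dict:
    """GC fraction overall and at codon positions 1, 2 and 3."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    result = {"GC": sum(seq.count(b) for b in "GC") / len(seq)}
    for pos in range(3):
        bases = [codon[pos] for codon in codons]
        result[f"GC{pos + 1}"] = sum(b in "GC" for b in bases) / len(bases)
    result["GC12"] = (result["GC1"] + result["GC2"]) / 2
    return result


def skew(seq: str, x: str, y: str) -> float:
    """Skew of base x over base y, (x - y) / (x + y); 0.0 if neither occurs."""
    counts = Counter(seq.upper())
    total = counts[x] + counts[y]
    return (counts[x] - counts[y]) / total if total else 0.0


def gc3_skew(seq: str) -> float:
    """GC skew restricted to third codon positions."""
    return skew(seq.upper()[2::3], "G", "C")


if __name__ == "__main__":
    toy = "ATGGCCGCGTAA"  # toy sequence, not real TP53 data
    print(position_gc(toy))
    print(skew(toy, "G", "C"), skew(toy, "A", "T"), gc3_skew(toy))
```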

Effective Number of Codons (ENC) Analysis
ENC is generally used to quantify the codon usage bias of a gene and is independent of gene length and the number of amino acids [43]. This measure was computed as per Wright (1990) to estimate the extent of CUB exhibited by the coding sequences of the TP53 gene across the selected mammalian species:
$$\mathrm{ENC} = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}$$
where $\bar{F}_k$ (k = 2, 3, 4 or 6) is the average of the $F_k$ values for k-fold degenerate amino acids. The F value denotes the probability that two randomly chosen codons for an amino acid are identical. ENC values range from 20, indicating strong codon bias in a gene using only one synonymous codon for each amino acid, to 61, indicating no bias in a gene using all synonymous codons equally for each amino acid [43].
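A minimal Python sketch of an ENC calculation in the spirit of Wright (1990) is given below, assuming the standard genetic code. For each amino acid with n observed codons and codon proportions p_i, the homozygosity is estimated as F = (n Σ p_i² - 1)/(n - 1). Several details are choices of this sketch rather than of the source: amino acids observed fewer than twice are skipped, a missing 3-fold class (isoleucine) is replaced by the mean of the 2-fold and 4-fold averages, and the result is capped at 61. It is not the authors' Perl implementation.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

# Group synonymous codons by amino acid, skipping stop codons.
FAMILIES = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        FAMILIES[_aa].append(_codon)


def enc(seq: str) -> float:
    """Effective number of codons of a coding sequence (sketch)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    f_by_class = defaultdict(list)
    for codons in FAMILIES.values():
        k = len(codons)                      # degeneracy class: 1, 2, 3, 4 or 6
        n = sum(counts[c] for c in codons)   # times this amino acid is encoded
        if k == 1 or n < 2:
            continue                         # Met/Trp, or too few observations
        s = sum((counts[c] / n) ** 2 for c in codons)
        f = (n * s - 1) / (n - 1)            # codon homozygosity
        if f > 0:
            f_by_class[k].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items()}
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2   # fallback when Ile is absent
    missing = [k for k in (2, 3, 4, 6) if k not in mean_f]
    if missing:
        raise ValueError(f"no usable amino acids in degeneracy classes {missing}")
    value = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(value, 61.0)


if __name__ == "__main__":
    # Synthetic extreme case: every amino acid encoded by a single codon.
    one_codon_each = "".join(codons[0] * 2 for codons in FAMILIES.values())
    print(enc(one_codon_each))
```

Short sequences make the per-amino-acid F estimates noisy, so values from a single gene are best read comparatively, as in the text.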

Frequency of Optimal Codon (Fop) Analysis
Fop is a measure of codon usage bias in a gene [44]. The Fop value is the ratio of the number of optimal codons used to the total number of synonymous codons [22]. It ranges from 0.36 for a gene showing uniform codon usage to 1 for a gene showing strong codon usage bias [45]. The Fop value for each selected coding sequence was calculated using the formula given by Lavner and Kotler (2005) [14].
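The ratio described above can be sketched in Python as follows. The set of optimal codons is an input here, because the formula of Lavner and Kotler (2005) is not reproduced in the text and the way the authors chose optimal codons is not shown; the set used in the example is purely hypothetical. Codons for methionine, tryptophan and stop are excluded because they have no synonymous alternatives, which is an assumption of this sketch.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def fop(seq: str, optimal_codons: set) -> float:
    """Fraction of synonymous codons in seq that belong to optimal_codons."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    # Keep only codons whose amino acid has more than one codon.
    informative = [c for c in codons if CODON_TABLE.get(c, "*") not in "*MW"]
    if not informative:
        raise ValueError("sequence contains no synonymous codons")
    return sum(c in optimal_codons for c in informative) / len(informative)


if __name__ == "__main__":
    hypothetical_optimal = {"CTG", "GCC"}   # illustrative set, not from the paper
    print(fop("ATGCTGCTGCTCGCCTAA", hypothetical_optimal))
```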

Relative Synonymous Codon Usage (RSCU) Analysis
RSCU is defined as the observed frequency of a codon divided by the frequency expected if all synonymous codons for that amino acid were used equally [46]. The RSCU values of codons for each of the selected coding sequences of the TP53 gene were calculated as follows:
$$\mathrm{RSCU}_{ij} = \frac{g_{ij}}{\frac{1}{n_j}\sum_{i=1}^{n_j} g_{ij}}$$
where $g_{ij}$ is the observed number of the ith codon for the jth amino acid, which has $n_j$ kinds of synonymous codons [26].
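The definition above translates directly into code. This Python sketch assumes the standard genetic code and reports values for the 59 codons that have synonymous alternatives. Returning 0.0 for every codon of an amino acid that is absent from the sequence is a convention of the sketch, not something the source specifies.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))

# Group synonymous codons by amino acid, skipping stop codons.
FAMILIES = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        FAMILIES[_aa].append(_codon)


def rscu(seq: str) -> dict:
    """RSCU value for each of the 59 codons with synonymous alternatives."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    values = {}
    for codons in FAMILIES.values():
        if len(codons) == 1:
            continue                      # ATG (Met) and TGG (Trp) are excluded
        total = sum(counts[c] for c in codons)
        for c in codons:
            # observed count divided by the count expected under equal usage
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values


if __name__ == "__main__":
    toy = "CTGCTGCTGCTC"                  # toy sequence, not real TP53 data
    values = rscu(toy)
    over_represented = sorted(c for c, v in values.items() if v > 1.6)
    print(values["CTG"], values["CTC"], over_represented)
```

With the thresholds used in this paper, a value above 1.0 marks a codon used more often than expected and a value above 1.6 marks an over-represented codon.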

Computation of Gene Expression
Gene expression was estimated through RCBS, which can be defined as the overall score of a gene indicating the influence of the relative codon bias (RCB) of each codon in the gene [20]. The RCBS value of each coding sequence of the TP53 gene was calculated as described in [47], where O_c is the observed number of counts of codon c in the query sequence, E[O_c] is the expected number of codon occurrences given the nucleotide distribution at the three codon positions (b1b2b3) [20], and O_tot is the total number of codons. [The equations did not survive in this copy; the only remaining fragment reads "log_w RCB_c) - 1".]
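Because the RCBS equations are not legible in this section, the Python sketch below should be read as an assumption rather than as the authors' formula. It uses a commonly cited form of the score: the RCB of a codon xyz is its observed frequency divided by the product of the frequencies of x, y and z at codon positions 1, 2 and 3, and RCBS is the geometric mean of the RCB values over all codons of the gene, minus 1. The geometric mean is computed through logarithms, which is at least compatible with the surviving "log ... RCB_c) - 1" fragment.

```python
import math
from collections import Counter


def rcbs(seq: str) -> float:
    """Relative codon bias score of a coding sequence (assumed formulation)."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    total = len(codons)
    if total == 0:
        raise ValueError("sequence contains no complete codon")
    codon_counts = Counter(codons)
    # Base counts at codon positions 1, 2 and 3.
    position_counts = [Counter(c[p] for c in codons) for p in range(3)]
    log_sum = 0.0
    for c in codons:
        observed = codon_counts[c] / total
        expected = math.prod(position_counts[p][c[p]] / total for p in range(3))
        log_sum += math.log(observed / expected)   # log RCB of this codon
    return math.exp(log_sum / total) - 1           # geometric mean minus 1


if __name__ == "__main__":
    print(rcbs("ATGGAGGAGCCGCAGTCAGATCCTAGC"))     # toy sequence, not TP53
```

Under this formulation a score near zero means codon frequencies are close to what position-specific base composition alone would predict.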

Software Used
The above-mentioned genetic indices were estimated with a PERL program developed by SC (corresponding author) to measure the CUB of the selected coding sequences of TP53 genes in different mammalian species. All statistical analyses were carried out using the SPSS software. The correlation coefficients of codons with GC3 and the RSCU values of codons among the eight mammalian species were clustered and displayed as heat maps using a hierarchical clustering method implemented in the NetWalker software [48].