Correlation between gene expression levels under drought stress and synonymous codon usage in rice plant by in-silico study

We studied the correlation of synonymous codon usage (SCU) on gene expression levels under drought stress in rice. Sixty genes related to drought stress (with high, intermediate and low expression) were selected from rice meta-analysis data and various codon usage indices such as the effective number of codon usage (ENC), codon adaptation index (CAI) and relative synonymous codon usage (RSCU) were calculated. We found that in genes highly expressing under drought 1) GC content was higher, 2) ENC value was lower, 3) the preferred codons of some amino acids changed and 4) the RSCU ratio of GC-end codons relative to AT-end codons for 18 amino acids increased significantly compared with those in other genes. We introduce ARSCU as the Average ratio of RSCUs of GC-end codons to AT-end codons in each gene that could significantly separate high-expression genes under drought from low-expression genes. ARSCU is calculated using the program ARSCU-Calculator developed by our group to help predicting expression level of rice genes under drought. An index above ARSCU threshold is expected to indicate that the gene under study may belong to the “high expression group under drought”. This information may be applied for codon optimization of genes for rice genetic engineering. To validate these findings, we further used 60 other genes (randomly selected subset of 43233 genes studied for their response to drought stress). ARSCU value was able to predict the level of expression at 88.33% of the cases. Using third set of 60 genes selected amongst high expressing genes not related to drought, only 31.65% of the genes showed ARSCU value of higher than the set threshold. This indicates that the phenomenon we described in this report may be unique for drought related genes. To justify the observed correlation between CUB and high expressing genes under drought, possible role of tRNA post transcriptional modification and tRFs was hypothesized as possible underlying biological mechanism.


Introduction
Abiotic stresses such as drought, salinity, low or high temperatures and other environmental extremes negatively affect crop growth and reduce crop yield on a global scale [1]. Among a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 RNA-derived fragments known as tRFs [38][39][40][41]. Production of tRFs is promoted by abiotic stress conditions to eventually control gene expression by transcriptional, post-transcriptional and translational regulation [42].
Hence, knowledge of the codon usage and codon-pair context patterns of plants and underlying evolutionary forces will be useful to understand the molecular mechanism of environmental adaptation and biological diversity of each species [43]. In addition, this information can be applied for codon optimizations of diverse sets of genes to be used in plant transformation programs. Here, we use genome sequence and genome-wide expression data of drought responsive genes to investigate the relationship between gene expression levels and CUB in rice.

Selection of genes with different expression levels under drought stress
Sixty genes associated with drought stress in rice were identified through a structured literature survey from meta-analysis data [44][45][46]. Then the genes were divided based on their expression levels under drought condition into three categories: high-, intermediate-and lowexpression genes. Complete coding DNA sequences of these genes were accessed from the NCBI nucleotide database [47]. In the present study, high-expression genes included NAC transcription factors, zinc finger proteins, AP2/ERF transcription factors, LEA proteins, phosphatidylethanolamine binding proteins, peptide transporters, dioxygenases, actin-binding proteins, protein phosphatases, dehydrins and membrane proteins family. These genes have been reported to show high expression under drought stress in different experiments (Table 1). In addition to the high expression, some of these genes showed enhanced drought tolerance in reported genetic engineering experiments [48][49][50][51][52][53][54][55][56][57] (Table 2).

Analysis of the nucleotide composition
The program CAIcal [58] was used to calculate the AT content and GC content at first, second and the third nucleotide positions, (AT1, AT2, AT3) and (GC1, GC2, GC3), respectively.

Relative Synonymous Codon Usage (RSCU) analysis
RSCU was calculated following Hastings and Emerson [59], as the ratio of the observed frequency of a codon to the expected frequency of the same codon within a synonymous codon group (with no bias) in the entire coding sequence of the gene concerned. RSCU can be equal to one when there is no bias, more than one when there is preference to use that specific codon and less than one when the codon is underused [60]. RSCU values can range from zero where the specific codon is not used at all to 2, 3, 4 and 6 when only one codon is used for encoding amino acids with 2, 3, 4 and 6 synonymous codons, respectively. An RSCU value >1 for each codon shows that this codon is preferred. Higher RSCU value indicates the presence of higher CUB. The RSCU of 60 genes associated with drought stress in rice were calculated using ACUA software [61] and the program CAIcal [58], excluding the stop codons and the two amino acids which are encoded by a single codon (Trp and Met).

ENC analysis
Effective number of codon usage (ENC) refers to the number of unique codons found in a gene. This value can be in a range of 20 (where there is an extreme bias towards the use of only one codon for each amino acid) to 61 (representing the use of all synonymous codons). If there are fewer than 60 synonymous codons used in a particular gene, then CUB may be present (considering the fact that Trp has only one codon). We used CAIcal to calculate ENC.

Codon Adaptation Index (CAI) analysis
CAI is another widely used method for evaluation of CUB. It measures the similarity between the frequency of synonymous codons used by a gene and that of a reference set [58]. The range of the CAI value is between 0 and 1. The rice codon usage from the Kazusa database [62] was used as a reference set. The program CAIcal [58] was used for CAI calculation.

Searching for tRFs data under drought stress
Data on tRFs under drought stress in rice were obtained from the PtRFdb (http://223.31.159.8/ PtRFdb/plant.php?txt=Oryza%20sativa&bs=1&type=all) [63] and their frequency was compared in the control and drought stress condition in different stages of plant growth. Codons related to each tRF were also compared with the rice preferred codons under drought stress.

Base composition analysis
In the present study, GC content was higher than AT content in all three gene-expression categories (high, intermediate and low-expression genes), as is expected for the rice genome. The means of GC content in categories 1, 2 and 3 were 70.9%, 56.95% and 54.2%, respectively. The results showed that the GC content at the three-codon positions was noticeably different. G or C at position three of each codon (GC3) was higher than that at position one (GC1), and it was lowest at position two (GC2) among most of the studied genes. The means of GC3 content in categories 1, 2 and 3 were 94%, 69.3% and 56.2%, respectively ( Table 1, Fig 1).

ENC analysis
The CUB for each gene was also calculated using ENC. The ENC value of the genes of the first group was between 27 and 34, the second group was between 38 and 56 and the third group was between 46 and 56 ( Table 1).

Analysis of CAI
The range of the CAI value in the first category was between 0.859 and 0.946; in the second and third categories were between 0.733 and 0.867 and between 0.729 and 0.894, respectively (Table 1). High expression genes showed higher CAI value as compared to the other categories.

RSCU analysis
The comparative study among the RSCU values of the studied genes in three expression categories under drought stress with that in general pattern of rice genes reveals the existence of CUB. In other words, when genes with different expression levels under drought in rice are compared with the standard rice codon usage table, a change in preferred codon of some amino acids of high expression genes and low expression genes under stress is observed. For example, alanine and serine are encoded by the most preferred codon GCG and TCG instead LOC_Os03g20680 Os LEA3-2

LEA protein
The transgenic rice showed significantly stronger growth performance than control under salinity or osmotic stress conditions and were able to recover after 20 days of drought stress. [43] LOC_Os05g46480 Os LEA3-1

LEA protein
Overexpression of an LEA gene in rice improved drought resistance under the field conditions without yield penalty. [44] LOC_Os01g66120 SNC2 NAC transcription factor Overexpression of SNAC2 improved the tolerance to PEG treatment. The transgenic plants had higher cell membrane stability and higher germination and growth rate than wild type during the cold and salt stresses respectively. [45] LOC_Os03g60080 OsNAC9 NAC transcription factor Overexpression of OsNAC9 gene increased rice drought tolerance. [46] LOC_Os12g39400 ZFP252 Zinc finger Overexpression of ZFP252 in rice increased the amount of free proline and soluble sugars enhanced rice tolerance to salt and drought stresses. [47] LOC_Os03g60560 ZFP182 Zinc finger Overexpression of ZFP182 significantly enhanced salt, cold and drought tolerance in transgenic rice. [48] LOC_Os05g10670 OsTZF1 Zinc finger gene family Overexpression of OsTZF1 in rice showed improved tolerance to high-salt and drought stresses. [49] LOC_Os08g31580 OsDRAP1 AP2/ERF transcription factor Overexpressing OsDRAP1 transgenic plants exhibited significantly improved drought tolerance. [50] LOC_Os03g57240 DST C2H2 zinc finger protein A zinc finger protein, DST, regulated drought and salt tolerance in rice via stomatal aperture control. [51] LOC_Os05g34830 OsNAC52 NAC transcription factor Overexpression of OsNAC52 activated the expression of downstream genes in transgenic Arabidopsis, resulting in enhanced tolerance to drought stresses but not growth retardation. [52] https://doi.org/10.1371/journal.pone.0237334.t002 of rice's most widely used codons GCC and TCC, respectively, in the high expression genes category. The third codon position changed to the nucleotide G in this category. Aspartic acid and proline are encoded by the most preferred codon GAT and CCA instead of rice standard codons for these two amino acids: GAC and CCG, respectively, in the intermediate expression genes category. The third codon position changed to the nucleotides A and T in this category (Fig 2). The heat-plot of different gene categories based on the RSCU index is shown in Fig 3. This plot separates distinctly the high expression genes under drought conditions from other categories so that the colors were divided into two distinct groups (Fig 3). These data may indicate that genes with higher expression under drought stress have acquired their own preferred codons in the course of evolution. Also, the heat-plots showed that G or C end codons are highly preferred in this set of genes for most amino acid codons, but the TTG codon was not preferred for leucine in high expression genes under drought stress.

Average RSCU (GC/AT): A new index for estimation of CUB
After calculating the RSCU index for each codon per gene, it was observed that the ratio of RSCU of codons with GC end to codons with AT end in each amino acid was significantly higher in high expression genes than that in the other categories. Even in some high expression

PLOS ONE
genes, the RSCU of codons with AT end was zero. Hence, an RSCU-based index was defined and called ARSCU. It measures the average ratio of RSCUs with GC-to AT-end codons for all amino acids in one gene, providing only one value for each gene. Our findings indicate that ARSCU could significantly separate highly expressed genes under drought from the others. In this study, the ARSCU (GC/AT) index ranged between 9.5 to 18.5, 0.7 to 8.5, and 0.7 to 4.5 for high, intermediate and low expression genes, respectively (Fig 4A).
ARSCU was calculated as: Where aa is amino acid, a is RSCU of GC end codons and b is RSCU of AT end codons (any a and b with a value of zero is arbitrarily assigned a value of 0.1).

ARSCU-calculator
A program was developed in an Excel file for calculating the ARSCU index (S1 Table). Input data for this program is the RSCU index of all amino acid codons in each gene. This program calculates only one ARSCU (GC/AT) for each gene (whilst every gene has 59 RSCU values). Since RSCU may be zero for some codons, any RSCU with a value of zero is arbitrarily assigned a value of 0.1.
Based on our findings, ARSCU values can be used for discrimination of genes with different expressions under drought conditions, as follows: • Genes with ARSCU above 13 (arbitrary threshold) are predicted to be high expression genes under drought conditions; • Genes with ARSCU ranging from 9 to 13 are predicted to be genes with high or intermediate expression; • Genes with ARSCU numbers less than 9 are predicted to be genes with low or intermediate expression.

Validation of discriminating power of ARSCU
For validation of discriminating power and accuracy of the ARSCU, 60 other rice genes were randomly selected from 43233 genes studied for their response to drought stress [44] and their ARSCU were calculated by the program (S2 Table). The results were correlated with the expression datafrom the published meta-analysis [44]. ARSCU calculator was able to predict genes with different expression level under drought stress in rice with an accuracy of 88.3%) Fig 4B). In order to compare the codon usage pattern of different highly expressing genes unrelated to drought with that of drought responsive high expression genes, third set of 60 other highly expressed genes not related to drought [64,65] were used. The ARSCU value for these genes was also calculated by the program) S3 Table). Only 31.65% of the genes (18.33% plus 13.33% were correctly categorized as high and intermediate expressing genes, respectively) similar to those obtained for drought responsive genes (Fig 4C). ARSCU as an index worked for

PLOS ONE
identification of drought related genes (with 88% accuracy) compared to only 32% accuracy on non-drought related genes.

tRFs under drought stress
Results of our search for tRFs under drought stress and control condition in rice, showed that: 1. In most cases, no tRF has been reported for the preferred codon under drought stress, which means the tRNA remains for longer time and participates in translation.
2. In few cases, tRF has been reported for tRNA related to preferred codon under drought stress. However, even in these cases the frequency of this tRF has decreased under drought stress compared with its frequency under normal condition.
3. Only in one case (Ala), tRF related to preferred codon has increased under drought stress. tRNA expression data is required in order to explain this observation. One mechanism may be over expression of the preferred tRNA in such amounts that even its break down to its tRF does not affect the over-expression of the gene in concern (Table 3).

Discussion
Drought tolerance in rice is governed by many genes with huge environmental interaction, with low heritability, and thus are difficult to study [66]. Development of GM plants with enhanced tolerance to drought is an important challenge in rice biotechnology research. For this purpose, further investigation of the mechanisms, pathways and genes involved in response to abiotic stress is essential. Expression of genes belonging to diverse functional and regulatory groups, such as transcription factors, protein kinases, and phosphatases, are influenced by abiotic stress conditions [67]. In addition to environmental conditions, gene expression levels correlate with multiple aspects of gene sequence and structure including SCU [18]. Here, we report the existence of strong correlation between SCU and gene expression levels under drought stress in rice which has not previously been reported.

Correlation between GC content and higher expression of rice genes under drought
Our analysis of codon usage patterns of studied genes in different expression categories under drought stress shows that most of the genes in the three categories have more GC content as expected from rice genes. In this study, GC percent analysis was able to separate 60 selected genes in rice into two major classes: 1) the high expression genes that were also high in GC and 2) the other genes with lower GC content. Higher GC content in rice genome has been reported with no particular attention paid for presence or absence of differences among genes with different levels of expression under drought [68]. Wang and Hickey [69] showed that the means of GC content of 14005 rice-coding sequences was 57.8%. Also they separated the 14,005 rice genes into two classes: High (67.4%) and Low GC genes (50.1%). The means of the high GC rice genes and low GC rice genes' GC3 were 80.4% and 52.7%, respectively in their study. Also, Yi et al. [33] showed higher expression is associated with higher GC3 in three closely-related species of the genus Misgurnus. Our results further confirm the previous reports in the field. However, the novelty of our findings may be the existence of a correlation between genes with different levels of expression under drought condition and their GC content. We showed that genes with high expression under drought conditions had overall 70.9% GC content. Furthermore, we showed that on average 94% of the high expressing genes contained a G or C at third position of their codons. Therefore, we have demonstrated that high expression genes under drought are more dependent on high GC percentages, especially in the third nucleotide position.

Correlation between SCU and higher expression of rice genes under drought
As we expected, the range of the CAI value was higher in the high expression genes compared with other categories. In addition, the ENC value for the high expression genes was less than 36, which is an indication of the presence of a relatively strong CUB. This value for the second and third categories of genes was above 38 and 46, respectively. Sharp et al. [60] reported that, highly expressed genes generally show a tendency of preferentially using a limited number of codons. Also, Liu et al. [70] showed a positive correlation between ENC value and gene expression level in the rice plant. In their report, genes with ENC �30 and ENC �55 were correlated with high and low expression genes, respectively. Alanine and serine in the high expression genes under drought stress preferred the GCG and TCG codons, respectively, whereas according to the common rice codon usage table, the preferred codons for alanine and serine were GCC and TCC, respectively. Our results about those amino acids encoded by six codons (leucine, arginine and serine) showed that among the C-or G-ending codons, using G or C in the first and second positions is also more preferred. For example, for the case of leucine, the use of the TTG codon, although ending with a G, is not preferred and the CTC and CTG codons are more preferred. Therefore, using TTG is not recommended for codon optimization for enhanced expression of the concerned gene under drought. Intermediate and low expression genes under drought stress end with A or T codons in some cases. This may further indicate the effect of more efficient codons on higher expression of genes under drought stress. For example, proline and aspartic acid preference the GAT and CCA codons in the intermediate expression genes under drought stress, whereas the preferred rice codon for these two amino acids is GAC and CCG that end with a G or C. Therefore, the use of A and T end codons for these amino acids has reduced the expression of these genes under drought stress. A number of studies of eukaryotes and prokaryotes have showed that groups or families of genes involved in stress responses systematically overuse and underuse specific "non-optimal" synonymous codons. Chionh et al. [71] showed the 48 genes in the DosR regulon in Mycobacterium bovis BCG, which are essential for survival under hypoxic stress, are enriched in G-and C-ending codons. Others showed that hypoxic stress increased the translation of proteins from genes enriched with non-optimal codons ACG (Thr), CTA (Leu), GCG (Ala) and GGA (Gly), whereas their synonymous partners ACC (Thr), CTT (Leu), GCT (Ala) and GGT (Gly) were all overrepresented in downregulated proteins in hypoxia [72]. High-expression genes under drought in rice use G/C-end codons more frequently. Therefore, we created a different codon usage table for these genes compared to the standard rice codon usage table (S4 Table).
Correlation of CUB with gene expression levels for different species including: Caenorhabditis, Drosophila, Arabidopsis [20], Tribolium castaneum [18], arthropods [32,72], Paeonia lactiflora [73] and Populus tremula [74] has been shown by several authors. Selection favors specific codons promoting the efficient and accurate translation of genes that are expressed at high levels.

ARSCU as a single value for prediction of rice genes expression under drought
Previous reports have used the RSCU number for different cases, but this number is calculated for the codons of each amino acid in the gene. In this report, we introduced the ARSCU GC/ AT as a single value/index for each gene with the potential to predict gene expression under drought stress. In addition, we developed a simple program in an Excel file for calculating the ARSCU index (ARSCU calculator). It is expected that using this program, we will be able to predict high expression genes in high throughput data with relatively high confidence. This program was developed based on calculations made on a set of genes expressed under drought stress conditions in the rice genome. We validated 1) the discriminating power of ARSCU and 2) its specificity for higher expressing genes under drought. This index was not able to predict highly expressed non-drought related genes. These findings may reflect the different codon usage pattern of the genes under drought stress compared with other highly expressed genes in rice. In other words, the observed phenomenon (strong correlation between ARSCU and high expressing genes) is restricted to higher expressing genes under drought stress and not every high expressing genes in rice.
It remains unclear if ARSCU will also be useful for prediction of gene expression in other plant species under drought or even other stressed conditions. We are currently attempting validation of this program for prediction of gene expression under drought for other cereals. It may require a different threshold definition than the one reported by this index (ARSCU) in this article. There is room for improvement of this program to be suitable for use with the codon usage features of other plant species. The prediction made by this program potentially indicates the extent to which such a codon composition can be expressed under drought. It is, however, clear that in addition to ORF, regulatory elements in particular the promoter and transcription factors affect gene expression and that these variables were not considered in this program. It is therefore expected that even in rice, genes with high ARSCUs may exhibit low expression under drought. The high ARSCUs in such genes only indicate their codon potential for high expression under stressed conditions. In the case of such genes, alteration of regulatory elements may materialize their expression potential.

Possible role of tRNA post transcriptional modification and tRFs justifying the correlation between CUB and high expressing genes under drought
We observed different codon usage pattern in high expression genes under drought stress in our research. This observation termed Modification Tunable Transcripts (MoTTs) was also reported as a model combining genome-wide CUB analytics and gene expression studies. This model implies that modification of tRNA drives the translational regulation of critical response proteins whose transcripts display a distinct codon bias [75].
Average transcripts not exhibiting CUB, do not need specific tRNA modifications, as they are efficiently translated under all conditions. In contrast, the translation of MoTTs in cells lacking tRNA specific modifications is severely reduced [37].
The number of genes that encode tRNA modification enzymes are far more abundant than the tRNA coding genes, which further highlights the importance of tRNA modification [76,77]. It has been long postulated that modified nucleosides on tRNA molecules may function as "biosensor" for environmental and physiological changes [78,79] as a fast module to regulate gene expression at translational level. In agreement with this hypothesis, the abundance of tRNA modified nucleosides do change in response to various stresses [34]. Stress-induced degradation of tRNA has been reported by several investigators [80,81].
We speculate that those transcripts that are more efficiently expressed due to the presence of CUB during the course of stress, can be more efficiently used by translational machinery to express proteins due to the presence of specific modifications. However, this require more direct evidence to draw a final conclusion.
Methylation sensitive degradation of tRNA as another potential underlying mechanism for these observations has also the merit to look at. Wang et al. [82] investigated the mechanism of changes in the composition and abundance of tRNA-modified nucleosides in response to drought, salt and cold stress, as well as in different tissues during the whole growth season in two model plants-O. sativa and Arabidopsis thaliana. They identified 22 and 20 candidate genes for methyl-transferases in rice and Arabidopsis, respectively. Based on bioinformatics analysis, nucleoside abundance assessments and gene expression profiling, they found four methylated nucleosides (Am, Cm, m 1 A and m 7 G) that are critical for stress response in rice and Arabidopsis.
However, the relationship of these modifications with CUB is not yet known. On the other hand, several articles have discussed tRNA modification of the animal species and its association with codon usage. Chan et al [74] showed a model for translational control mechanism for survival under oxidative stress in yeast. Exposure to H 2 O 2 leads to an elevation in the level of m 5 C at the wobble position of the leucine tRNA for translating the codon UUG on mRNA. This modification in tRNA enhanced the translation of the UUG-enriched RPL22A mRNA relative to its paralog RPL22B and leads to changes in ribosome composition in their study. This reprogramming of tRNA and ribosomes ultimately causes selective translation of proteins from genes enriched with the codon TTG. More comprehensive studies are needed to better correlate the expression of genes with specific codon preference under drought stress in rice with tRNA modifications.
The fact that in most cases, no tRF has been reported for the preferred codon under drought stress, may support our hypothesis that during the course of evolution plants with preferred codons for which the tRNA does not break down into tRF gained selective advantage under drought condition. This may be the result of preservation of already existing such preferred codon or even occurrence of the mutation creating the preferred codon. This may be considered for further studies with the objective of identifying the underlying biological mechanism of CUB observed for high expression genes under drought. Based on these observations we propose an illustrated model based on codon characteristics, tRNA modification and tRFs under drought stress in rice (Fig 5). In summary, the results of this paper suggests that codon optimization may play more significant role for improved gene expression under drought. Therefore, it is suggested to be considered in genetic engineering programs. It is expected that drought tolerant rice will be available from our ongoing projects on engineering rice for enhanced drought tolerance using several different strategies. The genes used in our projects have been optimized based on the findings reported in the current study.
Supporting information S1