Analysis and Prediction of Translation Rate Based on Sequence and Functional Features of the mRNA

Protein concentrations depend not only on the mRNA level, but also on the translation rate and the degradation rate. Prediction of mRNA's translation rate would provide valuable information for in-depth understanding of the translation mechanism and dynamic proteome. In this study, we developed a new computational model to predict the translation rate, featured by (1) integrating various sequence-derived and functional features, (2) applying the maximum relevance & minimum redundancy method and incremental feature selection to select features to optimize the prediction model, and (3) being able to predict the translation rate of RNA into high or low translation rate category. The prediction accuracies under rich and starvation condition were 68.8% and 70.0%, respectively, evaluated by jackknife cross-validation. It was found that the following features were correlated with translation rate: codon usage frequency, some gene ontology enrichment scores, number of RNA binding proteins known to bind its mRNA product, coding sequence length, protein abundance and 5′UTR free energy. These findings might provide useful information for understanding the mechanisms of translation and dynamic proteome. Our translation rate prediction model might become a high throughput tool for annotating the translation rate of mRNAs in large-scale.


Introduction
It is often assumed that genes with high mRNA levels also have high protein abundance. Thus, mRNA levels are used instead of protein abundance. However, the regulation of gene expression takes place at many levels, from transcription to translation and to the post-translational modification. Many studies either could not find the assumed correlation between mRNA level and protein abundance [1] or the correlation was very weak [2,3]. By estimation, only 20%-40% of protein abundance is determined by the concentration of its corresponding mRNA [4,5]. The reason for such weak correlation between protein and mRNA levels is that protein concentrations depend not only on the mRNA level, but also the translation rate and the degradation rate [6].
Translation is the third process of gene expression. In this stage, mRNA is decoded by the ribosome which binds to tRNAs with complementary anticodon sequences. The tRNAs carry specific amino acids that are synthesized into a polypeptide as the mRNA passes through the ribosome. Translation has three steps: initiation, elongation and termination [7]. Both empirical and theoretical studies showed that the bottleneck step in the translation process is the initiation of protein translation [8,9,10]. At the initiation step, the ribosome binds to the five prime untranslated region (59UTR) of mRNA and moves along the mRNA until it gets to the translation start site (TSS). After initiation is completed, the ribosome enters the elongation stage. At elongation step, the ribosome waits until it intercepts an appropriate tRNA whose anticodon complements the codon at the A site of ribosome. Once the correct tRNA is intercepted by the ribosome, the amino acid from the tRNA is transferred to the ribosome associated peptide chain, and the ribosome moves forward one codon. The waiting for the correct tRNA limits the elongation process [10,11]. Translational initiation rate determines protein production rate and elongation rate determines ribosome occupancy [8]. Therefore, ribosome density is propor-tional to translational initiation rate which determines protein production while it is inversely proportional to translational elongation rate.
The regulation of translation plays as important role as transcriptional regulation in the control of gene expression. Changes of the mRNAs translation rate have great influence on the actual protein abundance. Dysregulation of translation will result in various diseases, such as cancer and neurological disorders [12].
With ribosome-profiling technology, ribosome-protected mRNA fragments can be deep-sequenced and the translation rate can be monitored, but it is time-consuming, expensive and not helpful for understanding the translation mechanisms. Here we choose Saccharomyces cerevisiae, one of the most studied model organisms, to perform our study and predict the translation rate. We used the ribosome-profiling data from Ingolia's work [13] in which the read density of mRNA is measured by deep sequencing of ribosome-protected mRNA fragments under both rich and starvation conditions. According to Ingolia's work [13], the translation rate (or called as translation efficiency) is defined as the normalized read density of translation (footprints) divided by the normalized read density of transcription (mRNA). The ratio of ribosome footprints to mRNA fragments can roughly quantify the rate of protein synthesis [13] although further improvements could incorporate variations in the speed of elongation along the mRNA. Each mRNA is represented by various sequence-derived and functional features related to translation, such as codon usage frequencies, gene ontology enrichment scores, biochemical and physicochemical features, start codon features, coding sequence length, minimum free energy, 59UTR length, 39UTR length, number of transcription factors known to bind at the promoter region, number of RNA binding proteins known to bind its mRNA product, protein abundance, mRNA half life, protein half life and 59UTR free energy. With this dataset, an efficient computational model to predict the translation rate was constructed with Nearest Neighbor Algorithm (NNA) and crossvalidated. The prediction accuracies of jackknife cross-validation under rich and starvation condition were 68.8% and 70.0%, respectively. More specifically, to identify the most important features regulating translation rates under different conditions, we applied maximum relevance & minimum redundancy and incremental feature selection to select the important features for predicting the translation rate in rich and starvation conditions, respectively. Our results suggest that the following features are correlated with translation rate: codon usage frequency, some gene ontology enrichment scores, biochemical and physicochemical features of protein (such as amino acids composition, polarity, normalized Van Der Waals volume), number of RNA binding proteins known to bind its mRNA product, coding sequence length, protein abundance and 59UTR free energy. Our findings might provide clues for understanding the mechanisms of translation. The translation rate prediction model could become a useful tool for annotating the translation rate of mRNAs in large-scale.

Dataset
The ribosome-profiling data we used are from Ingolia's work [13] and publicly available at GEOs http://www.ncbi.nlm.nih. gov/geo/query/acc.cgi?acc = GSE13750. With ribosome-profiling technology, Ingolia et al. [13] deep-sequenced the ribosomeprotected mRNA fragments and monitored the genome-wide translation with subcodon resolution in Saccharomyces cerevisiae under both rich and starvation conditions. To get the translation rate, we divided the normalized read density of translation (footprints) by the normalized read density of transcription (mRNA) [13]. The ratio of ribosome footprints to mRNA fragments represents the translation rate and according to their values [13], we characterize the translation rates into two groups which are: (1) smaller than median or equal to median, (2) greater than median. Open Reading Frames (ORFs) in the former group have low translation rate, while the ORFs in the latter group have high translation rate. We characterized the translation rates in rich condition and starvation condition, respectively. The number of ORFs with low translation rates and high translation rates in rich condition and starvation condition can found in Table 1.

Feature Construction
Codon usage frequency features. We downloaded the ORF coding sequences from Saccharomyces Genome Database (SGD) [14] and calculated the codon relative frequencies with seqinR [15]. It was reported that highly expressed genes have different codon preference with low expressed gene and the pattern of codon usage can be used to predict the gene expression level in yeast [16]. It is highly possible that ORFs with different translation rate have different codon usage pattern, too. There were 4 3~6 4 codon usage frequency features.
Gene Ontology features. Proteins are produced to achieve their biological functions. As demand determines production, the translation rate of ORF is definitely correlated with its biological functions. The function of one protein can be better described in protein interaction network, i.e. the network context will give a comprehensive and robust description of its function. In this study, the network context we used was STRING [17]. The Gene Table 1. The number of ORFs with low translation rates and high translation rates in rich condition and starvation condition.

Starvation condition
The number of ORFs Ontology enrichment score of protein i on Gene Ontology term jwas defined as the -log 10 of the hypergeometric test p value [18] of its neighbors on STRING network and can be computed by equation (1): where N is the number of overall ORFs in yeast, M is the number of ORFs annotated to Gene Ontology term j, n is the number of ORFs in ORF set i which includes protein iand its neighbors on STRING network, m is the number of ORFs from ORF set i that are annotated to Gene Ontology term j. The larger the enrichment score of one Gene Ontology term, the more overrepresented this term is. There were 4148 Gene Ontology (GO) enrichment score features.
Biochemical and physicochemical features of proteins.
To encode proteins of different sequence lengths with a uniform dimensional vector, we adopted the notion of pseudo amino acid composition (PseAAC) [19,20]. Each protein sequence was represented by 132 biochemical and physicochemical features which can be categorized into seven groups: (1) amino acid composition [21,22], (2) solvent accessibility, (3) normalized van der Waals volume, (4) polarizability, (5) secondary structure, (6) hydrophobicity, and (7) polarity [23]. Except for amino acid composition, all the other six ones are generated by integrating the pseudo properties of amino acid in the protein sequence and each amino acid can be classified into two or three pseudo groups. For secondary structure, each amino acid can be predicted by SSpro [24] as: helix, strand or coil. For solvent accessibility, each amino acid is predicted by ACCpro [25] as: exposed or buried to solvent. In terms of hydrophobicity, there are three groups of amino acid: hydrophobic (C, V To generate the global protein features by integrating the local quantities of amino acid over the entire protein sequence, the following three quantities are calculated: C(composition), T(tran-(transition), and D(distribution). The detailed computational procedures and a well illustrated example can be found in our previous work [30]. Generally speaking, Crefers to the percent of each pseudo group in the sequence; Tto the frequencies with which one pseudo group changes to another; and Dto the relative position where the first, twenty-five percent, fifty-percent, seventyfive percent, and last of each kind of pseudo letters occur.
For polarity, secondary structure, polarizability, hydrophobicity and normalized van der Waals volume, each amino acid has three pseudo groups and would generate 21 protein features. For solvent accessibility, each amino acid has two pseudo groups and would only generate 7 protein features. Now for the amino acid composition we have 20 features; for solvent accessibility, 7 features; and for the other five properties, each has 21 features. Combining them together, each protein has 5 | 21 z 20 z 7 ð Þ132 features. The detailed explanation of each biochemical and physicochemical feature can be found in our previous work [30].
Start codon features. During the translation initiation, the 40S subunit of ribosome binds to a site upstream of start codon. It proceeds downstream until it encounters the start codon and form the initiation complex of translation. The start codon is typically AUG (or ATG in DNA) and related with translation initiation. We extracted sequences in untranslated region 3 bp upstream of the initial ATG and sequences in coding region 3 bp downstream of the initial ATG. We encoded the 6 bp DNA sequences up/downstream of start codon ATG binarily and each base pair was represented by a 4-dementional vector: A~1,0,0,0 f g ,T~0,1,0,0 f g , C~0,0,1,0 f g and G~0,0,0,1 f g . Coding sequence length. We calculated the coding sequence length of each ORF as a potential feature for translation rate prediction.
Free energy of 42 nucleotide cross translation start site. Kudla et al. [31] identified a region, from nucleotide (nt) -4 to +37 relative to translation start site, for which predicted folding energy can explain some of the of the variation to differences in protein levels. So we calculated the minimum free energy of 42 nucleotide (nt) -4 to +37 relative to translation start site, with Vienna [32].

Various parameters of untranslated regions from
Tuller's study. Tuller et al. [33] collected various properties of untranslated regions of the S. cerevisiae genome and we used the following 8 features from Tuller's study: 59UTR length, 39UTR length, Number of transcription factors known to bind at the promoter region, Number of RNA binding proteins known to bind its mRNA product, Protein abundance, mRNA half life [34], Protein half life and 59UTR free energy [35]. Unlike the above free energy, here the 59UTR free energy is calculated with 59-UTR 100 nt [33,35].

Feature space of ORF
As mentioned above, there are 64 codon usage frequency features, 4148 Gene Ontology (GO) enrichment score features, 132 biochemical and physicochemical features, 24 start codon features and 10 other features. The total features used in this study to represent an ORF sample would be 64z4148z132z4|6z10 ð Þ4378.

mRMR method
In this study, we used the Maximum Relevance and Minimum Redundancy (mRMR) feature selection method [36,37] to rank 4378 features of each ORF considering both their relevance to translation rates and the redundancy among them. The mRMR selected features have maximum relevance with the translation rates and meanwhile minimally redundant, i.e., maximally dissimilar to each other. Both relevance and redundancy are measured with mutual information (MI), which is defined as follows: where x and y are two vectors, p x,y ð Þ is the joint probabilistic density, p x ð Þ and p y ð Þ are the marginal probabilistic densities. Let V denotes the whole vector set containing all 4378 features, V a 5V ð Þ denotes the selected feature set with a feature vectors, and V b 5V ð Þ denotes the to-be-selected feature set with b feature vectors. The relevance R of a feature f in V b with the translation rate class c can be computed by equation (3): The redundancy D of a feature f in V b with all the features in V a can be computed by equation (4): To select a feature f j from V b with maximum relevance with translation rates and minimum redundancy with selected features in V a , the mRMR function which integrates equation (3) and equation (4) is defined: j~1,2,:: For a feature pool containing N~azb ð Þ features, feature evaluation will be executed in Nrounds. After these evaluations, a feature set S will be obtained: where each feature has an mRMR order, representing at which round the feature is selected. The smaller order means more important.

Nearest Neighbor Algorithm
To classify ORFs into different translation rate categories, the Nearest Neighbor Algorithm (NNA) was applied. Its basic idea is to predict a new ORF into its translation rate categories by comparing the features of this ORF with the features of those with known translation rate categories. The distance between two ORF vectors p x and p y is defined as [30,38]: where p x : p y is the inner product of p x and p y , and jjpjj is the module of vector p. p x and p y are consider to be more similar if D p x ,p y À Á is smaller. In NNA, an ORF with feature vectorp t will be predicted as having the same translation rate class as its nearest neighbor which has the smallest D p n ,p t ð Þ. That is D p n ,p t ð Þ~min D p 1 ,p t ð Þ,D p 2 ,p t ð Þ,:::,D p z ,p t ð Þ,:: where N represents the number of training ORFs with known translation rates.

Jackknife Cross-Validation Method
We used Jackknife Cross-Validation Method [38,39], one of the most objective methods, to evaluate the performance of prediction. During Jackknife Cross-Validation, each ORF in the dataset is tested in turn by the translation rate predictor, which is trained by the other ORFs in the data set. Each ORF is involved in training N{1 times and is tested exactly once. To evaluate the performance of the translation rate predictor, the prediction accuracy for the overall ORFs can be calculated as: where T high and T low stand for the number of correctly predicted ORFs with high and low translation rate, respectively; N high and N low are the number of ORFs with high and low translation rate, respectively.

Incremental Feature Selection (IFS)
When the mRMR step was completed, we obtained an ordered feature list but still do not know how many fore features in the list should be chosen. To determine the optimal number of features, Incremental Feature Selection (IFS) [30,38] was applied by constructing Nfeature subsets of the feature list S provided by mRMR. Starting from only the first feature S 1~f1 f g, the feature subset S i is defined as: by adding feature f i to the previous subset S i{1~f1 ,f 2 ,:::,f i{1 f g . For each feature subset S i i~1,:::,N ð Þ , we calculated the prediction accuracy elevated by Jackknife Cross-Validation. The prediction accuracies with different feature numbers form an IFS curve with feature numberi as its x-axis and the prediction accuracy as its y-axis.

The correlation between features and translation rate
To identify the direction of the correlation between features selected by mRMR and IFS with translation rate, we calculated the point-biserial correlation coefficient between them. The point biserial correlation [40] is a measure of association between a continuous variable and a binary variable. Assume that X is the selected feature which is a continuous variable and Y is the translation rate which is binary. The point biserial correlation is calculated as where X high is the mean of X with high translation rate, X low is the mean of X with low translation rate, p high is the proportion of X with high translation rate, sd X is the standard deviation of X . The point biserial correlation is positive when large values of X are associated with high translation rate and small values of X are associated with low translation rate.

Identification of relevant features and construct translation rate prediction model
Using mRMR method, we ranked and analyzed the top 500 relevant features to translation rate with Maximum Relevance Minimum Redundancy method. Each of them has the maximal relevance with translation rate and the minimal redundancy with other features. Then in Incremental Feature Selection (IFS) procedure, 500 prediction models were constructed using nearest neighbor algorithm with 1, 2, 3… 499 and 500 features respectively and tested by jackknife cross-validations as described above. The IFS results of rich and starvation condition were shown in Figure 1 (A) and Figure 1 (B), respectively. It can be ð8Þ seen from Figure 1 (A) that the translation rate prediction model of rich condition achieved the peak accuracy at 68.8% with 37 features. These 37 features formed the optimal feature set for translation rate prediction model of rich condition and are provided in Table S1. Similarly, in Figure 1 (B), the translation rate prediction model of starvation condition achieved the highest accuracy at 70.0% with 86 features. These 86 features formed the optimal feature set for translation rate prediction model of starvation condition and can be found in Table S2.

Factors correlated with translation rate
We compared the optimal 37-feature set of rich condition and the optimal 86-feature set of starvation condition and found there were 27 common features between them. These 27 common features are provided in Table 2. To identify what kinds of features are important for translation rate prediction, we calculated the numbers of each kind of features in the optimal feature set. Figure 2 shows the numbers of each kind of features in (A) the optimal 37-feature set of rich condition, (B) the optimal 86-feature set of starvation condition. As we can see from Figure 2, Table S1, Table S2 and Table 2, the following kinds of features are correlated with translation rate: (1) Codon usage frequency, (2) some Gene Ontology (GO) enrichment scores, (3) protein features (such as amino acids composition, polarity, normalized Van Der Waals volume) and (4) other features (such as Number of RNA binding proteins known to bind its mRNA product, Coding sequence length, Protein abundance and 59UTR free energy).

Discussion
In this study, we have developed a new computational method to predict the translation rate by integrating various sequencederived features and functional features. In rigorous jackknife cross-validation test, the predictor can achieve an overall prediction accuracy of 68.8% and 70.0% in rich and starvation conditions, respectively. We concluded that the following features are correlated with translation rate: codon usage frequency, some GO enrichment scores, protein features (such as amino acids composition, polarity, normalized Van Der Waals volume), number of RNA binding proteins known to bind its mRNA product, coding sequence length, protein abundance, and 59UTR free energy. The following elucidations on these features confirmed their informative and importance in understanding the translation rate and translation mechanism in large-scale.

Codon usage frequency
It has been reported by several studies that codon bias is the major factor for translation efficiency [31,41]. In this study, we analyzed the relationship between the codon usage frequencies of ORFs and their translation rate. Our analysis not only confirmed the strong correlation between codon usage frequencies and translation efficiency, but also showed that more usage of which codon will result in high translation efficiency. It was found that the ORFs with higher frequencies of the following codons (AAC, TCT, ACC, TCC, GCC, GCT, CCA) tend to have higher rate of protein synthesis; on the other hand, higher frequency of the codons (ATA, CGA, TGC, GTA, GGA, CTT, AGG, CGG, TAT) relates to lower translation efficiency.

Gene Ontology (GO) enrichment scores
We also analyzed 4148 Gene Ontology (GO) enrichment score features based on the STRING network context [17]. Interestingly, our analysis indicates that ORFs with different functions or subcellular locations will have different translation rate. The translation differences among different function groups have been mentioned before [42]. According to our analysis, in starvation condition, ORFs with cellular response function tend to have higher translation rate probably to improve the survival in this extreme condition. In starvation, high translation rate correlated with GO groups related to 'cellular response' (e.g. GO:0034605cellular response to heat, GO:0009409 -response to cold, GO:0009266 -response to temperature stimulus). An interesting contrast is the fact that the GO groups 'GO:0005737 -cytoplasm' and 'GO:0001950 -plasma membrane' are enriched with genes with high translation rate while the GO group 'GO:0005634nucleus' is enriched with genes with low translation rate. A possible explanation for this result is that in starvation condition in order to survive proteins in membrane and cytoplasm overexpress, and genes related to DNA duplication (replication in the nucleus) under-express.

Protein features
In our study, the protein features such as amino acids composition, polarity, normalized Van Der Waals volume were correlated with translation rate. The correlation between amino acid composition and protein abundance level has been reported [43] and it is possible that the amino acid composition may influence translation. The reason for the importance of protein features in translation efficiency prediction maybe that these features are strongly related to its function. And the translation difference among different function groups was mentioned in Ghaemmaghami's work [42].

Other features
There are additional features that are useful for translation rate prediction. 'Number of RNA binding proteins known to bind its mRNA product', 'Coding sequence length', 'Protein abundance' and '59UTR free energy' are examples of such features. Firstly, there are a number of RNA binding proteins known to influence mRNA translation rate by bind its mRNA. For instance, RNAbinding proteins HuR and PTB promote the translation of Hypoxia-Inducible Factor 1a [44]. Cytochrome c mRNA translation is controlled by TIA-1 and HuR [45]. Furthermore, the correlation between protein abundance and the level of gene expression has been intensively studied (mainly on yeast). It was suggested that the relatively weak correlation between protein and mRNA abundance is due to different rates of translation and protein degradation [46]. Here we found that the ORFs with higher protein abundance tend to have higher translation rate. Thus, it is possible that the relatively weak correlation between the mRNA levels and protein abundance can be partially explained by the fact that translation rate is an important determinant of protein abundance that can't be estimated from mRNA levels. The last factor is 59UTR free energy. It supports that previous studies that suggested that base-pairing potentials analysis of 59UTR in various prokaryotes indicated that 59UTR free energy is important for translation initiation [47].
Taken together, these sequence-derived and functional features are significantly-related to mRNA translation. Therefore, our prediction model might become a high throughput tool for annotating the translation rate of mRNAs. As a preliminary predictor of translation rate, the current model can only give the high or low categories of translation rate. When more in-depth understanding of translation is accumulated, the regression model might be tried to construct a more practical predictor which can directly estimate the translation rate.

Supporting Information
Table S1 The features for translation rate prediction in rich condition.