A Comprehensive, Quantitative, and Genome-Wide Model of Translation

Translation is still poorly characterised at the level of individual proteins and its role in regulation of gene expression has been constantly underestimated. To better understand the process of protein synthesis we developed a comprehensive and quantitative model of translation, characterising protein synthesis separately for individual genes. The main advantage of the model is that basing it on only a few datasets and general assumptions allows the calculation of many important translational parameters, which are extremely difficult to measure experimentally. In the model, each gene is attributed with a set of translational parameters, namely the absolute number of transcripts, ribosome density, mean codon translation time, total transcript translation time, total time required for translation initiation and elongation, translation initiation rate, mean mRNA lifetime, and absolute number of proteins produced by gene transcripts. Most parameters were calculated based on only one experimental dataset of genome-wide ribosome profiling. The model was implemented in Saccharomyces cerevisiae, and its results were compared with available data, yielding reasonably good correlations. The calculated coefficients were used to perform a global analysis of translation in yeast, revealing some interesting aspects of the process. We have shown that two commonly used measures of translation efficiency – ribosome density and number of protein molecules produced – are affected by two distinct factors. High values of both measures are caused, i.a., by very short times of translation initiation, however, the origins of initiation time reduction are completely different in both cases. The model is universal and can be applied to any organism, if the necessary input data are available. The model allows us to better integrate transcriptomic and proteomic data. A few other possibilities of the model utilisation are discussed concerning the example of the yeast system.


Introduction
The rate of translation differs for individual proteins, reflecting both the intrinsic capability of an mRNA molecule to be translated and the environmental factors affecting the efficiency of the translation process. The first is well characterised in other studies [1][2][3] that discuss mRNA features responsible for the regulation of translation (e.g., length of the 59 UTR, presence and location of mORFs, type and number of initiation codons, sequence context around the initiation codon, presence and location of mRNA secondary structure elements, codon usage, mRNA stability, and posttranscriptional modifications). However, the second describes the features of the environment in which translation occurs, namely the amounts of particular mRNA transcripts in a cell, the accessibility of the translation machinery elements required to initiate and accomplish protein synthesis (such as free ribosomes, tRNAs, and elongation factors), as well as growth conditions, which have been proven to evoke gene-specific translational control [4].
Although the general theoretical background of translation is known, the process of protein synthesis is still poorly characterised at the level of individual proteins. Experimental determination of absolute translation rates (i.e., in time units) is a tremendous task and we are not aware of any such research. Even though the factors specified above have been studied separately for some proteins, little is known about the extent to which they affect the process and how they cooperate to keep the synthesis rate at the required level. Another strategy to examine translation activity is to integrate genome-wide expression datasets from different sources [5][6][7][8]. However, it was shown [9] that these datasets cannot be used to predict translation rates at the level of individual proteins, as they suffer from large random errors and systematic shifts in reported values.
In practice, upon the development of techniques to examine transcriptome data experimentally (microarrays, Northern blotting, RNA-seq, etc.), the mRNA concentration has become a broadly used measure of protein abundance. Nevertheless, recent research indicates that there is only a partial correlation between mRNA and protein abundances [10][11][12][13][14][15][16]. It was shown that the mRNA transcription level can explain only 20-40% of the observed amounts of proteins [17,18], which leads to conclusion that the role of translation in regulation of gene expression has been constantly underestimated. Thus, a deeper insight into the process of translation is required to better integrate transcriptomic and proteomic data [19][20][21].
In this study, we developed a model to measure the absolute, translational activity at the level of individual genes. The model was implemented in Saccharomyces cerevisiae, however, it can be used to study translation in any other organism of known genome, but only if the following data are available: (i) a dataset of mRNA relative abundance and ribosome footprints; (ii) tRNAs decoding specificities; (iii) average cell volume; (iv) average number of active ribosomes in a cell; (v) average number of mRNA transcripts in a cell; and (vi) a dataset of mRNA half-lives (optionally).
In our calculations for the yeast system the first dataset came from one genome-wide experiment provided by Ingolia et al. [22], quantifying simultaneously mRNA abundance and ribosome footprints by means of deep sequencing. This method is thought to provide a far more precise measurement of transcript levels than other hybridisation or sequence-based approaches [23]. Based on this dataset, we determined the absolute time of translation, in SI units, for individual genes. The time is the sum of the time required to accomplish two main steps of protein translation: initiation and elongation. Analysing the initiation or elongation time alone provides quantitative information on the extent of translation regulation at these two steps separately. Moreover, by introducing mRNA concentrations into the model, one can calculate the relative rate of translation initiation, which does not depend on the transcriptional level of a corresponding gene. Assuming identical conditions for all mRNAs in the cell (i.e., equal amounts of available ribosomes, elongation factors, tRNAs, etc.), the measure will reflect the mRNA's intrinsic ability (in relation to other analysed mRNAs) to regulate the efficiency of translation initiation. Such a deep insight into the process of initiation is particularly important, as this step of protein synthesis is thought to be the main and rate-limiting target for translational control [24]. Furthermore, by combining our results with a dataset on mRNA stability [25], we calculated the absolute amounts of protein produced from each transcript during its lifespan.
We compared our results with direct experimental studies measuring the mRNA and protein levels of chosen genes. Good correlation with most of the experimental data was observed, and calculated mRNA and protein abundances did not differ significantly from those reported in vivo. In addition, other calculated parameters of translation, such as the overall rate of protein synthesis, were in agreement with earlier reports.
The calculated translational parameters were also used to study the general characteristics of the yeast translational system, revealing the diversity of strategies of gene expression regulation. For instance, we showed that two commonly used measures of translation efficiency -ribosome density and number of protein molecules produced -are affected by two distinct factors. We observed strong negative correlations between values of both measures and translation initiation time, however, the origins of initiation time reduction for most efficient transcripts are completely different. In case of elevated ribosome density, short initiation is caused mostly by mRNA instristic capability of being translated discussed at the beginning of this section. Contrary, in case of high number of protein molecules produced, short initiation is caused primarily by elevated mRNA concetrations.
Finally, we exemplified and discussed other possible ways of model utilisation, as the model may be of considerable help in examining gene expression regulation, protein-protein interactions, metabolic pathways, gene annotation, ribosome queuing, protein folding, and translation initiation. Additionally, the model provides an overall and quantitative picture of the translation process, crucial for better integration of transcriptomic and proteomic data from high-throughput experiments.

Results
The following translational parameters were attributed to the yeast genes (for derivation, see the Materials and Methods): L, length of the transcript coding sequence (CDS) in codons; x, absolute number of transcripts in a yeast cell; B, total amount of protein molecules produced from transcripts of particular type; g, ribosome density in number of ribosomes attached to a transcript per 100 codons; w, the absolute number of ribosomes on a transcript; T, total time of translation of one protein molecule from a given transcript; I, total time required for translation initiation; E, total time required for translation elongation; mean E, mean time required for elongation of one codon of a transcript; P, translation initiation frequency; Pz, relative rate of binding of free ribosomes to the 59 end of a transcript, proportional to the concentration of the transcript; Ps, relative rate of successful accomplishments of initiation once the ribosome-mRNA complex is formed (the obtained values of the parameter Ps ranged from 3.4e-4 to 65.9. For clarity, we decided to normalise them by the maximal reported value of Ps obtained for the gene YLL040C. The normalised values of Ps range from 0 to 1 and allow more intuitive comparison); h, estimated half-life of a transcript; and m, estimated mean lifetime of a transcript. Parameters T, I, E, mean E, h, and m are given in SI units.
We managed to attribute quantitative measures of translation to the majority of 4648 transcripts from the initial dataset. Four transcripts were rejected at the beginning of processing, as ribosome footprints were not observed on them. Further, 23 transcript had unrealistic, elevated g values (i.e., gw10). Assuming, that a ribosome covers ten codons, a transcript CDS built of 100 codons cannot contain more than ten ribosomes. Eventually, we eliminate transcripts at which queuing of the ribosomes may occur. Our simulation program yielded 130 transcripts suspected of queuing, plus 21 for which translation at the 59 end is so slow that the first attached ribosome prevents the attachment of the following ribosomes. Further calculations were performed for the most relevant transcripts, i.e., the remaining 4470 yeast genes, without ribosome queuing.
The values of parameters T, E, mean E, I, L, x, g, w, P, Pz, and normalised Ps were determined for all 4470 transcripts in the dataset, of which 4192 could also be attributed with additional

Author Summary
Translation is the production of proteins by decoding mRNA produced in transcription, and is a part of the overall process of gene expression. Although the general theoretical background of translation is known, the process is still poorly characterised at the level of individual proteins. In particular, the quantitative parameters of translation, such as time required to complete it or the number of protein molecules produced from a transcript during its lifetime, are extremely difficult to measure experimentally. To overcome this problem, we developed a computational model that, on the basis of only few datasets and general assumptions, measures quantitatively the translational activity at the level of individual genes. We discussed it concerning the example of the yeast system; however, it can be applied to any organism of known genome. We used the obtained results to study the general characteristics of the yeast translational system, revealing the diversity of strategies of gene expression regulation. We exemplified and discussed other possible ways of model utilisation, as it may help in examining protein-protein interactions, metabolic pathways, gene annotation, ribosome queueing, protein folding, and translation initiation. It also may be crucial for better integration of cell-wide, highthroughput experiments.
parameters h, m, and B. The calculation of all parameters except h, m, and B was entirely based on the results from one high-throughput experiment. Parameters h, m, and B engaged one additional dataset of mRNA relative half-lives. The general characteristics of the parameters are specified in Table 1. The values of parameters for  individual genes are provided in Supplementary Table S1.
The calculated parameters allow to study the process of translation at the level of individual genes. Figure 1 depicts the translation process in time on the example of protein YJL173C, a highly conserved subunit of Replication Protein A (RPA). Similar schematics may be constructed for the majority of yeast genes.

Correlations with existing data
Next, we checked if our calculations were in agreement with published data on protein and mRNA abundances. We compared our results with two previously published studies that provide information on transcript and protein copy number for numerous S. cerevisiae genes [13,26], by performing linear regression through the origin on the log-transformed values. The adjusted R 2 values, as well as the corresponding regression coefficients, were calculated for six pairs of datasets and the results are presented in Table 2. Scatter plots are presented in Figure 2. The results show that our model explains 84% of the variability in mRNA abundance and 97% of the variability in protein abundance reported by experimental studies. Such R 2 values are reasonably good, taking into account the differences in the particular yeast strains and laboratory protocols used, as well as the fact that our calculations are based on a few simplifications that can disrupt the final outcome. Moreover, R 2 values reported for our model do not stand out from those calculated for comparisons of two experimental datasets with each other, suggesting that the observed differences constitute the internal variability of the system, not a methodology error. To measure if our results suffer from systematic shift, we calculated the fold difference values for transcript and protein abundance comparisons with two experimental datasets (see Supplementary Figure S1). In general, our calculations slightly overestimate the transcript copy number and underestimate the protein copy number, in relation to published data. This is mainly caused by the assumption we made: that one yeast cell contains, on average, 36,000 transcripts. The transcript copy number used in both reference studies is originally taken from older research [27], which quantified the relative mRNA concentrations and transformed them into absolute copy number, assuming 15,000 as the total number of transcripts per cell. This estimation seems inadequate to us in the light of current discoveries, which are explained in the Materials and Methods.
Transcript copy number is also problematic due to the wide discrepancies in mRNA levels reported by different studies [28]. Above mentioned mRNA concentration dataset [27] was obtained in a serial analysis of gene expression (SAGE) experiment and it is likely that such concentration estimates have low precision for low abundance mRNAs [13,26]. On the other hand, it is hypothesized that SAGE is more accurate for abundant mRNAs when compared with other widely used technique: high-density oligonucleotide arrays (HDA) [26,28]. Thus, we decided to compare mRNA concentrations calculated in our model with results obtained in genome-wide HDA experiment [29]. We performed linear regression through the origin on log-transformed data on mRNA abundance for 3769 genes. Scatter plot and the distribution of fold difference values are presented in Figure 3. The obtained adjusted R 2 value was 0.30 (see Table 2), meaning that parameter x is able to explain only one third of the variability in mRNA abundance reported by this experiment [29]. This discrepancy is probably caused again by the experimental error. Parameter x reflects mRNA concentration obtained by means of deep-sequencing, technique considered to be far more precise in measuring mRNA levels than other hybridisation or sequence-based approaches [23]. However, it is likely, that it is less precise for low abundance mRNAs, which may be seen in Figure S1 provided by Ingolia et al. [22]. This would explain why parameter x better describes variability in mRNA concentrations obtained from SAGE than HDA experiments. In addition, we estimated that the cell-wide rate of translation for S. cerevisiae at 30 0 C is 5.5 amino acids (aa) per second, which corresponds to an average time of translation for one codon of 183 ms. This is in agreement with experimental studies, reporting rates of 8.8 aa/sec and 5.2 aa/sec for fast-growing and slowgrowing yeast cells, respectively [30]. It is worth noting that the obtained value is also within the range reported for proteins from other organisms, namely 6 aa/sec for human apolipoprotein [31], 0.74 aa/sec for rabbit hemoglobin [32], 5 aa/sec for chick ovalbumin [33], and an average translation rate of 7.3 aa/sec in cockerel liver [34].
Furthermore, it is reported in independent studies that the total amount of protein in a yeast cell varies from 4:9|10 {12 g [16] to 6:4|10 {12 g [35]. Based on known protein sequences and the molecular mass of particular amino acids, we can calculate the mass of each yeast protein. By multiplying this by the protein copy number B and summing over all expressed yeast proteins, we estimated that the total mass of proteins in a yeast cell is around 2:2|10 {12 g. Although this number is smaller than values reported previously, it is still consistent taking into account the fact that we excluded from the calculations all transcripts with ribosome density gw10, as our model cannot operate on such elevated values of this parameter. Most likely, gw10 results in very high level of translation, meaning that excluded transcripts would have large B values, if they could be counted by our model. Thus, excluding these transcripts strongly affects the final mass of proteins in a yeast cell, diminishing it noticeably. Moreover, we must not forget that calculated values of the parameter B reflect only the total amount of proteins produced from a given transcript, whereas the cell contains many other proteins produced in the past that are still present in the cell.

General features of the yeast gene expression system
Based on our results, we can draw the following conclusions concerning gene expression in S. cerevisiae: First, half of the genes produce less than 2.73 transcripts per cell. The distribution of the transcript copy number is skewed with a long right tail: only 55 genes have more than 100 mRNA copies. Unsurprisingly, the top 20 genes with the highest x values turned out to be either ribosomal proteins (18 genes) or enzymes engaged in glycolysis (genes YKL060C and YKL152C). One mRNA molecule is translated from 0.14 to 40,110 times, and the median is 257.9. Typically, one gene produces 677 protein copies; however, the most active genes may generate more than 2 million protein copies. Only six genes are common for the sets of the top 20 genes with the highest transcript levels and protein abundance. Among the 20 most highly produced proteins, there are 14 ribosomal proteins, two genes engaged in glycolysis (YCR012W, YKL060C), a highly expressed mitochondrial aminotransferase (YHR208W), alcohol dehydrogenase (YOL086C), and two cell wall proteins (YLR110C, YKL096W-A). There is only partial correlation between transcript and protein copy number and protein production does not necessarily follow the concentration of mRNA molecules (see Figure 4). We compared mRNA (x) and protein (B) abundance calculated in our model, by performing linear regression through the origin on log transformed data. Adjusted R 2 value calculated over the entire dataset (4192 genes with known B) was 0.59. This means that over 40% (in log space) of the variation in protein abundance cannot be explained by variation in mRNA abundance, suggesting some additional, posttranscriptional mechanisms of gene expression regulation.
Next, we analysed yeast genes for expression strategies applied to produce the highest number of protein molecules. We prepared two datasets: 200 genes with the highest B values (Bw24000) and 200 genes with the lowest B values (Bv36: 14). We compared the rest of the translation parameters between these two sets, performing a two-sided Mann-Whitney test. The mean value of most parameters differs between the two datasets in an intuitive manner: genes coding for highly abundant proteins usually produce more transcripts, which have a shorter time of translation (both E and I), as well as stronger resilience to degradation and are occupied by more ribosomes per 100 codons. All differences are statistically significant with p-value v0:001 (data not shown). Only one parameter appeared not to affect the number of proteins produced: the relative rate Ps of initiating translation once the ribosome attaches to the free 59 end of an mRNA molecule  (p-value~0:07). Moreover, the Spearman's correlation coefficient between parameters B and Ps for the entire dataset is very weak (r s~0 :09, p-value v0:001).
Analogously, we analysed two datasets of 200 genes with the highest and lowest g values (gw3:09 and gƒ0:169, respectively). According to the Mann-Whitney test, transcripts of higher ribosome density typically produce more protein molecules and have shorter times of translation (both E and I). All differences are statistically significant with p-value v0:001 (data not shown). In contrast to the result mentioned above, the shorter time I for genes of the highest ribosome density is here caused mainly by elevated Ps, while Pz has little influence, but the difference in Pz is still statistically significant (p-value v0:001). Nevertheless, no signif-icant correlation was observed between the parameters g and Pz measured over the entire dataset (p-value~0:17). The roles of Ps and Pz in modifying values of B and g are detailed in Supplementary Figure S2.
Furthermore, we studied, in detail, 20 genes from the set of 200 genes producing the highest number of proteins but with low transcriptional activity (xv4:33 for all of them). Interestingly, these genes are involved in many distinct biological processes, with the notable exception of ribosome formation. The mechanism of their regulation, deduced from the values of the translational parameters, is almost the same for all genes. For instance, two parameters seem to play the main role in sustaining the high protein synthesis rate: relatively long mean life-time of the mRNA molecule, reaching up to several dozens hours (the maximal observed mean lifetime of a yeast transcript is 61 hours), and about four times shorter time of translation initiation, caused mainly by relatively high Ps values. On average, the observed Ps is one order of magnitude higher than the median Ps for all yeast genes. The shorter pause between subsequent initiations results in elevated ribosome density g and increased protein production rate. On the other hand, the total time of translation, as well as the mean elongation time, are unexpectedly long (i.e., slightly above the median value of all yeast genes (see Table 3)). This indicated that in cases of long-lived mRNAs, high transcriptional rates and usage of frequent codons are not required to achieve a high rate of protein synthesis. This strategy of expression constitutes an interesting but still inscrutable example of translation regulation, and further research should be carried out. The plot shows the correlation between mRNA abundance (parameter x) and the number of protein molecules produced from a given gene (parameter B). We performed linear regression through the origin on log transformed data. Adjusted R 2 value calculated over the entire dataset (4192 genes of known B) was 0.59. This means that over 40% (in log space) of the variation in protein abundance cannot be explained by variation in mRNA abundance, suggesting some additional, posttranscriptional mechanisms of gene expression regulation. doi:10.1371/journal.pcbi.1000865.g004 Table 3. Translational parameters of 20 genes of low transcriptional activity and high protein production rate.   [29]. The axes were log transformed. Calculated R 2 value for the comparison is presented in Translation times and codon bias In Supplementary Table S2 we present times of translation of individual yeast codons at 30 0 C. We compared these values with codon optimality C opt calculated by [36]. The value of C opt measures whether the codon is preferred in highly expressed genes compared with all other codons encoding the same amino acid. C opt is calculated as the odds ratio of codon usage between highly and lowly expressed genes. Figure 5 shows that there is negative correlation between C opt value and translation time of a codon. However, while optimal codons have only short times of translation, non-optimal codons may be translated at both high and low rates. Linear regression model through the origin on log transformed values confirmed this conclusion: the obtained adjusted R 2 is only 0.15. This indicates, that translation speed may be the one, but not the only criterium for selection on codon bias. This is in agreement with other reports, discussed widely in the recent review [37]. Also, it has been shown that codon usage bias in yeast is associated with translation accuracy [38] and protein structure [36].

Translational parameters and protein interactions
Interacting proteins are often precisely co-expressed, presumably to maintain proper stoichiometry among interacting components [39]. For instance, it was shown that functionally associated proteins exhibit correlated mRNA expression profiles over a set of environmental conditions [40,41]. Other studies report the coevolution of codon usage of functionally linked genes [39,42] and show that codon usage is a strong predictor of protein-protein interactions [43]. Our model provides far more information on translation regulation than mRNA expression profiles or codon usage alone, thus we decided to examine calculated parameters in a set of well-known interacting proteins.
As a model, we chose the 20S proteasome complex, built of 28 proteins. There are 14 genes in the yeast genome coding for proteasome subunits a1-a7 and b1-b7, and each subunit is present in the complex in two copies [44]. Only subunit a3 is nonessential for the functionality of the complex and may be replaced by the a4 subunit under stress conditions to create a more active proteasomal isoform [45].
The analysis of the translational parameters (see Supplementary, Table S3) shows that the mean translation time (mean E) of all proteins is similar and ranges from 194 to 259 ms. As all interacting proteins are of similar length, the total time of elongation does not vary much; the biggest observed difference between two proteins was *24 s. However, the level of transcription is more variable and ranges from 3.99 to 22.61 transcripts per cell. There is a considerable divergence of ribosome density g (from 0.61 to 4.32), but regulation at the level of translation initiation (similar values of Pz and variability of Ps reaching two orders of magnitude) keeps the initiation time I at the same level for all 14 proteins. The biggest observed difference of I between two proteins equals *31 s. This results in congruent total times of translation T, the difference between maximal and minimal values is only two-fold with a mean value of *73 s. Nevertheless, the observed differences in the mean lifetimes of mRNA molecules are huge, reaching up to 278 min. In consequence, the number of protein molecules produced is strongly variable, ranging from 318 to 11,185 molecules per cell, and this is surprising as the stoichiometry of the 20S proteasome would rather suggest equal amounts of all subunits. Indeed, for four proteasome proteins, the value of the B parameter is almost the same, about 2,600 subunits of b2, b3, b4, and b5 per cell. Similar values, which do not exceed the range 2,600+1,000, were reported for subunits a1, a7, b6, and b7. Subunits a3, a4, a5, and a6 are produced to less than 1,100 copies, while the rate of protein synthesis of subunits a2 and b1 is 5,481 and 11,185 molecules per cell, respectively. To maintain the number of different subunits at the same level, the high translation rates of a2 and b1 may be balanced by post-translational regulation, presumably by elevated protein degradation. Conversely, the reduced translation rate of a3, a4, a5, and a6 may be compensated at the level of transcription, for instance by more frequent transcription initiations. In addition, the limited number of these subunits, as well as the relatively short life-time of their mRNAs, makes them ideal candidates for regulators of the abundance of proteasome complexes.

Discussion
The main advantage of the proposed model is that basing it on only few datasets and general assumptions allows the calculation of many important translational parameters, which are extremely difficult to measure experimentally. As a result, the majority of yeast genes may be attributed with quantitative rates of expression and protein synthesis. These data may be used to study both the general characteristics of the process of translation in yeast and the rates of protein production of individual genes. The model itself is general and universal and can be applied to other organisms if all of the necessary input datasets are available.
However, as with any theoretical model, this one also has some drawbacks. The quality of our calculations strongly depends on the quality of the input data. To study the example of S. cerevisiae, we carefully chose the dataset of ribosome profiles and made sure that data on mRNA abundance and ribosome footprints were obtained under the same experimental conditions. Similarly, all global parameters, such as the overall number of transcripts and  [36]. There is negative correlation between C opt value and translation time of a codon. However, while optimal codons (high C opt values) have only short times of translation, non-optimal codons may be translated at both high and low rates. Adjusted R 2 value obtained in linear regression model through the origin on log transformed values indicates, that translation speed may explain only 15% of variability in C opt values. doi:10.1371/journal.pcbi.1000865.g005 ribosomes in a cell, were determined with care and attention, after insightful analysis of the literature. To extend our model to the number of proteins produced, we decided to use an additional dataset of mRNA half-lives. The assumption that lies at the basis of mRNA half-life calculation is that in the steady state of mRNA turnover, the time required to synthesise an mRNA molecule equals the time to degrade it. Obviously, this is not true for many transcripts, as the cell cycle and environmental stimuli force changes in mRNA turnover. Additionally, we must not forget that the parameter B, calculated based on mRNA half-life, reflects only the total amount of protein molecules produced by the transcripts of a given gene. The protein degradation rate is not taken into account, and therefore, especially in case of short-lived proteins, the observed protein concentration will be smaller than estimated in this paper. This may be the cause of some of the discrepancies between the estimated protein abundances and those previously reported.
The true meaning of the B parameter is also important when analysing the set of 20 genes characterised by low levels of transcription and high levels of protein production rate. As their transcripts may be sustained in a cell for up to a few dozen hours, they may produce a large amount of protein in their lifetime, even if the translation is not very efficient. However, this does not necessarily indicate that all synthesised proteins are aggregated in a cell, and their number is constantly increasing. It is more likely that these proteins are systematically degraded and replaced by new ones produced from the same mRNA. Interestingly, genes regulated thusly would not be classified as highly expressed by any standard methods, as their transcripts are not present in the cell in many copies, and their mean time of elongation is about average, so no codon bias is suspected.
In addition, our model revealed some interesting aspects of global translation characteristics. In many studies ribosome density is used as the only measure of translational activity [6]. We have shown that high ribosome density is caused mainly by the elevated relative rate of translation initiation after forming of the ribosome-mRNA complex -Ps. In contrast, another measure of translation efficiency, protein production rate B, is affected mostly by the relative rate of finding an mRNA molecule by a free ribosome Pz, while the influence of Ps in this case is negligible. These results reflect the complexity of translation regulation, suggesting that any translational parameter, when considered separately, is not sufficient to fully characterise the process.
It has been stated before [46] that the regulation of gene expression is controlled at multiple stages, and no general rule exists describing how it works. In fact, the regulation of expression is different for each gene, and its main role is to produce the required amount of a given protein at the proper time. In contrast to the typically used methods of quantifying translation (i.e., codon bias and transcript abundance measurements), the proposed model does not concentrate only on one parameter of translation. In fact, it allows one to study, in depth, many strategies of gene expression, showing which parameters play the main role in which type of control. Furthermore, the model opens the prospect for new analysis of mRNA molecules. As mentioned before, the translation initiation rate depends on mRNA abundance and intrinsic features of the transcript. The calculated parameter Ps measures the relative efficiency of translation initiation, excluding the influence of mRNA concentration. Thus, for the first time, it provides a quantitative way to compare mRNA sequences from the same organism with respect to initial codon context, 59UTR secondary structure, mORF presence, and other mRNA features responsible for the efficiency of translation initiation.
Another possible application of the model is the analysis of the calculated translational parameters in the context of protein complexes, where proper stoichiometry among interacting components is maintained. As exemplified by the study of the yeast 20S proteasome, such analysis enables one to draw some interesting conclusions about the regulation of the individual proteins, as well as the entire complex. Moreover, we have shown that some parameters, in particular translation times I, E and T, are similar for all proteins of the complex. Possibly, the calculated model parameters, if properly integrated, could become a strong predictor of protein-protein interactions. It would be interesting to carry out a similar search for proteins participating in the same metabolic pathway, as functionally related proteins are usually co-expressed. In such a case, the analysis of the translational parameters pattern could become useful for the functional annotation of genes.
The model can also be used to study the elongation process in the context of ribosome queuing. It provides all the necessary tools to deeply analyse the strategies developed by living cells to avoid ribosome stacking on a translated mRNA molecule.
Additionally, clustered codons that pair to low-abundance tRNA isoacceptors cause local slow-down of the elongation rate. It has been hypothesised, that such slow-down might facilitate the co-translational folding of defined protein segments, by temporally separating their synthesis [47]. Recently, it has been proposed that discontinuous elongation of the peptide chain can control the efficiency and accuracy of the translation process [48]. Our model provides the measure of yeast codon elongation rates that may be used to better examine the co-translational folding. In contrast to the measure used in the aforementioned study, it is quantitative and more precise, as it takes into account the delay caused by near-and non-cognate aa-tRNAs.
Finally, the crucial coefficients of the model, i.e., the time of insertion of cognate aa-tRNAs and time delays caused by nearand non-cognate aa-tRNAs binding, can be calculated with respect to different temperatures. This provides the possibility to study the excess to which the temperature affects the efficiency of translation, provided that the ribosome footprints and mRNA concentrations are also measured at a few different temperatures.
In conclusion, although experimental confirmation is still required, this model constitutes an important tool for understanding the process of protein synthesis.

Theoretical model of translation
The molecular mechanism of translation was well characterised previously [49]. However, for the purpose of this research, we must consider the process both at the single transcript and genome-wide levels. Quantifying the process of protein biosynthesis engages vast array of data, some of which is incomplete or missing. Thus, the following assumptions and simplifications must be made: (i) the pools of all molecules participating in translation (mRNA, tRNA, ribosomes, translation factors, and so on) are constant, and molecules diffuse without restraint; (ii) all transcripts derived from the same gene have identical sequence, i.e., there is no alternative splicing and/or posttranscriptional modification; and (iii) the elongation process is never interrupted, and it always ends by producing a full-length protein molecule (note, that experimentally estimated procesivity of translation in yeast was 99.8-99.9% [50]). When these assumptions are satisfied, the model is as follows: Let X be the set of all transcripts present in the yeast cell at the moment of observation. We can make a partition of the set X into n subsets, each containing transcripts of identical sequence. Thus, n denotes the number of transcriptionally active genes in the cell. To each gene (subset), we attribute the index i~f1,2,3:::g and define x i as the number of transcripts in the i th subset. The variable x is reflected though by the transcriptional activity of a gene.
Let T i be the total observed time of synthesis of one protein molecule from a transcript belonging to the i th subset. We define it as: where I i denotes the time required for translation initiation, and E i is the total time of the elongation process. We define I i as the time interval from the point when the free 59 end of a transcript becomes available for ribosomes to the moment when a ribosome finds the initiator AUG codon and the entire complex enters into the elongation phase. The inverse of the initiation time I i is initiation frequency P i : If these frequencies are multiplied by a brief time interval dt, one obtains the probabilities that the initiation process will occur during time interval dt. We assume that the initiation of translation follows the scanning model [51], which postulates that the small ribosomal subunit enters at the 59 end of the mRNA and moves linearly, searching for the initiator AUG codon; once it finds it, the elongation process begins. We define Pz i as the relative binding rate of free ribosomes to the 59 end of the i th transcript, and assume it is proportional to the concentration of the transcript (see Eq.10). This means, that in our model the binding constants of ribosomes are the same for all mRNAs. Contrary, the process of 59UTR scanning by the ribosome is not straightforward, as there are many intrinsic features of mRNA molecules that can considerably delay or hasten the start of elongation (for review, see [1]). Sometimes, the ribosome detaches from the mRNA molecule before reaching the initial AUG, and the process must return to the point when a ribosome binds at the 59 end. To describe the efficiency of the scanning process by one numerical parameter, we normalised P i by the rate of binding of free ribosomes Pz i : The calculated parameter Ps i describes the rate of successful accomplishment of initiation on the i th transcript once the ribosome-mRNA complex is formed. Its value reflects the relative capability of an mRNA molecule to be translated, regardless its expression level. The rates Ps and Pz are calculated in relation to all studied transcripts, thus they can only be compared within one particular analysis.
The time E i (see Eq.1) is defined as a time interval from the recognition of the initiator AUG codon by the ribosome to the moment when the last peptide bond of a protein molecule is formed. Each elongation event consists of two main steps: (i) finding the correct tRNA molecule, and (ii) formation of the peptide bond and translocation. The time required for the first event is much larger than for the second. In fact, the second step is almost instantaneous [52]; thus, the times needed for transpeptidase and translocation reactions can be neglected, and time E may be simplified to: where e j is the time of translation of the j th codon, and L i is the number of codons in the coding sequence of the i th transcript. Translation times for all yeast codons, as well as the values of E and Pz, can be calculated on the basis of existing data (see below). These values can also be used to calculate times I and the rest of the model parameters T, P, and Ps, if the numbers of ribosomes attached to the mRNA molecules are known. Here, the reasoning is as follows: Let w i be the number of ribosomes attached to the i th transcript. We introduce the measure of ribosome density g, defined as the number of ribosomes attached to the transcript per 100 codons: One ribosome occupies ten codons of a mRNA molecule [53], and the E site of one ribosome can be immediately adjacent to the A site of another ribosome [54]. This means that the maximum possible value is g~10. Next, the attachment of a ribosome to the 59 end is possible only if it is not occupied by other ribosomes. Thus, the most efficient mRNA sequences should have g&10. Nevertheless, the majority of observed g values are much smaller, meaning that there are usually gaps of varying length between attached ribosomes. As the exact positions of ribosomes on a particular transcript cannot be deduced from the data, we must operate on the averaged gap lengths, defined as the quotient of the transcript length L and number of attached ribosomes w. The length of a gap measured in codons is meaningless, as each type of codon has a different translation time. However, the gap can be calculated as the sum of translation times of these codons, becoming an adequate measure of the time interval between individual translation initiation events on a given mRNA molecule. This time is actually a delay from the best possible initiation frequency and reflects the efficiency of the initiation process. In principle, this is the time I (see Eq. 1) expressed in the same time units as the translation times of particular codons: Note that due to unknown ribosome positions on a transcript, both the gap length and time of its translation are averaged. The rest of the parameters (T, P, and Ps) can be calculated based on I, as shown in Eq. 1, 2, and 3.

Calculating model parameters
The S. cerevisiae coding sequences used in our calculations were downloaded from the Saccharomyces Genome Database [55] (accessed 25 th June 2009). For each gene, we determined the values of T, I, E, P, Pz, and Ps on the basis of the recent research of Ingolia et al. [22], quantifying simultaneously mRNA abundance and ribosome footprints by means of deep sequencing. The study was done for the yeast strain BY4741 grown in YEPD at 30 0 C. In the first step Ingolia et al. performed deep sequencing on a DNA library that was generated from fragmented total mRNA in order to measure abundance of different yeast transcripts. Next, they applied a new ribosome-profiling strategy based on the deep sequencing of ribosome-protected fragments. This resulted in a dataset of 4,648 reliable transcripts (for the definition of ''reliability'', see Supplementary Materials of [22]) that was used as an input in our research. For each transcript in the dataset, the following values were attributed: c, raw count of mRNA-seq reads aligned to transcript coding sequence (CDS); C, density of mRNA-seq reads in reads per kilobase per million CDSaligned reads (RPKM); f , raw count of ribosome CDS-aligned footprints; and F , density of ribosome footprints in reads per kilobase per million CDS-aligned reads. Next, the relative numbers of reads counted in RPKM were transformed into the transcript copy numbers. Normally, for each transcript i, C i is defined as: where L i is the length of the transcript CDS in codons, and Nc is the sum of all mappable reads c [56]. Assuming uniform distribution of the mappable reads across the transcriptome coding sequences, the probability of observing c i reads on the i th transcript CDS of length L i in Nc attempts corresponds to the fraction of the transcriptome composed of the i th transcript: where M is the sum of all CDS of the transcriptome in base pairs. The meaning of x was explained in the previous section. We can substitute final RPKMs to get: Although the length of the entire transcriptome was estimated as 7|10 7 nucleotides [57], deriving M is more problematic, as little is known about the accurate boundaries of non-coding elements in transcript sequences [9]. There were some attempts to determine the length of UTRs on a global scale in yeast [58,59], but the results show that even the length of transcripts derived from the same gene of the same yeast strain cultured in the same growing conditions may vary considerably. This causes the discrepancies between reported transcript lengths by these two studies, making the analysis at the level of individual genes difficult and inaccurate.
To overcome this problem we use Pz, the relative rate of binding of free ribosomes to the 59end of a given transcript (see Eq.3). This rate corresponds to the fraction of transcript i in the set of all transcripts. By substituting x as shown in Eq. 9, we obtained the following relation: where NC is the sum of all densities C i of mRNA-seq reads. Thus: The next step was to calculate w i (the absolute number of translationally active ribosomes attached to the i th transcript), and g i (the measure of ribosome density, as defined in Eq.5). The dataset used provides information only on ribosome footprints aligned to the coding sequences. However, in practice, there were some exceptions to this rule, caused mostly by the presence of mORFs in the 59UTR sequences [22]. Due to the lack of data and aforementioned difficulties in determining exact transcript length, this fact is not taken into account in our analysis. Furthermore, we defined W as the number of all ribosomes in a yeast cell and p as the fraction of ribosomes participating in the process of translation at the moment of observation. In contrast to raw mRNA-seq reads, the distribution of ribosome footprints is not uniform across the transcriptome, due to differences in genes translational activity. Thus, the probability of observing a ribosome attached to the i th transcript corresponds to the fraction of all ribosome footprints Nf composed of the raw footprints in the i th transcript, f i . This probability is equal to the ratio of all ribosomes engaged in translation of transcripts of type i and the number of all occupied ribosomes in the cell: Thus, ribosome density for the i th transcript can be calculated as:

Global parameters estimation
Three parameters must be estimated to transform relative numbers of transcripts and ribosomes attached to them into absolute measures. These parameters are X , the total number of mRNA transcripts in a yeast cell; W , the total number of ribosomes in a yeast cell; and p, the fraction of ribosomes participating in the translation process at the moment of observation. There are many studies concerning the quantitative measurement of yeast cells, and we used the Bionumbers database [60] to extract these data.
Two reports provide an independent, yet coherent, estimation of the total number of ribosomes: 187,000+56,000 [9] and 200,000 [57] molecules per cell. In this study, we decided to set W to 200,000. The value of 85% was established for the parameter p, as stated in experimental studies [61,62]. The number of all transcripts in a cell is more problematic. Many contemporary studies assume that a yeast cell contains 15,000 mRNAs per cell on average [27,63], which is based on estimations done over 30 years ago [64]. Current research, based on more up-to-date techniques (e.g., in situ hybridisation or GATC-PCR) argues that the number should be at least doubled [65] or even quadrupled [62]. We decided to use the value of X situated between these estimates and equal to 36,000. This number was also confirmed by other studies [65].
Assuming W~200,000, p~0:85, and X~36,000, we obtained the mean ribosome density equal to 1.66 ribosomes per 100 codons. This is in agreement with experimental analysis, which reports that, on average, there is one ribosome per 156 nucleotides, corresponding to a density of 1.92 ribosomes per 100 codons [61]. Moreover, it was estimated that mRNA constitutes 5% of the total amount of RNA present in a cell, and the RNA:DNA ratio is 50:1 [57]. Assuming the yeast genome size of 2:8|10 7 nucleotides, the expected length of the entire transcriptome would be 7|10 7 nucleotides. Thus, the length of all transcribed coding sequences L CDS can be defined as: The meanings of X , L, and x are explained above. Thus, the calculated length of all coding sequences equals 3|10 7 nucleotides. This would suggest that non-coding elements constitute, on average, more than 50% of a transcript. In conclusion, it seems that the chosen parameter values generate reasonable measures of the global characteristics of the yeast cell.

Determining absolute times of translation
In the previous section, we calculated the values of L, x, w, g, and Pz for each gene. To determine the absolute times of translation I, E, and T, we need to know the times of translation for individual codons. To achieve this goal, we adapted a model proposed for Escherichia coli [66] to the yeast system. Here, we briefly present the model and all of the necessary changes we made. For a description of the derivation, see the original paper.
The transport mechanism in the cytoplasm is diffusion, thus the aa-tRNAs act as a random walker, and the ribosomes on mRNAs with vacant A sites are the targets. We assume a yeast cytoplasm volume V~42|10 {18 m 3 [67]. We divide it into N walker occupation sites, where: and l is a measure of the walker size. The values of l used previously [66] were determined separately for individual E. coli aa-tRNA molecules [68]. As we are not aware of any similar reports for S. cerevisiae, we decided to use l~14:5|10 {9 m for all yeast codons, which is the mean of the E. coli l values. The average time that elapses before the arrival of a walker j is defined as: where t j is the characteristic time of the j th walker, associated with its transition from one cellular occupation site to the other. It depends on the size of the walker l and its diffusion coefficient D j : The measures of D j were taken directly from [66]. As this value depends only on the accepted amino acid, we assumed that the difference in size between yeast and E.coli tRNA molecules is negligible. In Eq.16, p j stands for the probability that a tRNA-aa molecule of type j arrives at an open A site in the time interval t j and is proportional to the number of walker occupation sites containing the j th walker: We assume that the number of the molecules of the j th walker n j is proportional to the number of corresponding tRNA genes of type j, which is reasonable, as it was shown that in yeast the concentration of the various tRNA species is largely determined by tRNA gene copy number [69]. In particular, the calculated correlation coefficient between tRNA gene copy number and experimentally determined tRNA abundance for a subset of 21 tRNA species equaled 0.91. According to [57], the RNA-DNA ratio is 50:1 and tRNA constitutes 15% of the total amount of RNA in a yeast cell. Assuming a genome size of 2:8|10 7 nucleotides, the total cellular tRNA size is 2:1|10 8 nucleotides.
When divided by the average tRNA molecule size (74.5 nt) we obtain the number of tRNA molecules in a cell equal to 2,818,792. Next, this number was multiplied by the fraction of all tRNA genes composed of the tRNA genes of type j, yielding the absolute amount of particular tRNA molecules in a cell. Gene copy number and predicted decoding specificities of yeast tRNAs were taken from Table 1 of [69]. The values of all presented parameters for individual tRNAs are gathered in Supplementary Table S4. All 61 codons that code for the 20 amino acids have one or more aa-tRNAs and varying numbers of near-cognates. Nearcognates are defined as having a single mismatch in the codonanticodon loop in either the 2nd or 3rd position. Since some cognate tRNAs have a mismatch in the 3rd position, these tRNAs are excluded from the set of near-cognates [70]. The theoretical background of the model is based on the observation that the translation rate of a codon reflects the competition between its non-cognate, near-cognate and cognate aa-tRNAs [71], and that such nonspecific binding of the tRNAs to the ribosomal A site is rate-limiting to the elongation cycle for every codon [72]. The model of Fluitt et al [66] introduces two competition measures, C j and R j , being the quotients of the sum of arrival frequencies of near-cognates vs. cognates and non-cognates vs. cognates, respectively. For each codon, we determined its cognates, near-, and non-cognates (based on [69]) and calculated the competition measures C j and R j (see Supplementary Table S2).
According to [66], the average time to add an amino acid coded by the j th codon to the nascent peptide chain can be calculated as: where D cogn is the average time to insert an amino acid from a cognate aa-tRNA, and D near and D nonc are the average time delays caused by the binding attempts of near-and non-cognate tRNAs, respectively. Based on existing data and assumption that the activation energies for the various reactions do not vary much, Fluitt et al [66] showed how to calculate the values of D cogn , D near and D nonc at any given temperature. Table 4 contains these values for S. cerevisiae at 20, 24, 30, and 37 0 C. Next, we calculated translation rates e for all yeast codons at the four different temperatures (see Supplementary Table S2). However, as the main part of our analysis is based on the ribosome footprints measured at 30 0 C, in further calculations we use only the values of e estimated at this temperature. The last step was to calculate times E for individual S:cerevisiae genes, as described in Eq.4.

Ribosome queuing
It has been found that subsequent ribosomes are loaded onto the transcript sufficiently fast to make them interfere with each other, leading to ribosome queuing [73]. This phenomenon is usually caused by the presence of rare codons clusters in CDS, although other sequence features may also be very important [74]. Such elongation pauses may have distinct consequences, for instance ORF shifting or ribosome dissociation, often followed by decay of the mRNA and partly completed protein products [75]. Moreover, stalled ribosomes generate a false picture of a transcript translational activity, elevating the observed ribosome density in relation to the actual frequency of translation initiation events. For these reasons, we decided to reduce the dataset to the transcripts on which ribosome queuing does not occur. We wrote a simple program that simulates the ribosomes translocation along a transcript sequence. A ribosome moves from one codon to another only if it has spent a required amount of time for translation of the current codon (taken from Supplementary Table  S2) and the subsequent codon is vacant. The successive ribosome attempts to attach to the initial AUG codon after the elapse of time interval I, calculated as shown in Eq.6. The cumulative time of the movement is calculated for each ribosome separately. If this time is identical for each ribosome, translation is believed to pass without ribosome queuing. If the time is different, namely, the first ribosome moves faster than the rest, it means that some sequence features allow ribosome stacking under the assumed conditions (i.e., temperature and translational parameters). If the attachment of subsequent ribosomes is prevented by very slow translation of the first few codons, we consider it a particular case of ribosome queuing and reject all such transcripts.

Calculation of protein abundances
To enrich our dataset, we estimated the total number of proteins produced from a given transcript. Considering the mRNA molecules as a decaying quantity, we defined m i as the mean lifetime of the i th transcript expressed in time units: where h i is the half-life of the i th transcript. Assuming that each translation event happens independently, we calculated the abundance of the i th protein as the number of translation initiation events that happen during the life-time of the i th transcript multiplied by the its copy number: The dataset of mRNA relative half-lives is provided in the Supplementary Materials of [25]. In our calculations, we used the times t0 measured at exponential growth in YPD medium for 5,718 ORFs. It was determined experimentally by independent studies that the absolute mRNA half-life of the yeast gene YOR202W (HIS3) ranges from 7 (at 24 0 C) [76] to 11 min (at 30 0 C) [77]. Assuming the mean value of 9 min for this gene, we can quantify the half-lives for the rest of the genes in the dataset, as well as the values of m and B.

Calculations summary
Based on the presented reasoning, we calculated translational parameters for the majority of yeast genes. In particular, parameter L was calculated on the basis of yeast coding sequences downloaded from [55]. Parameter x was obtained from Eq.11, where values of C and NC were taken from the experimental study [22], and X was set to 36,000, as estimated by [65]. Parameter g was obtained from Eq.13, where values of f , Nf were taken from [22], W was set to 200,000, based on [57], and p was set to 0.85 as stated in [61,62]. Parameter w was calculated from Eq.5. Parameter E was calculated from Eq.4, based on yeast coding sequences downloaded from [55] and the values of translation times of codons e, calculated as shown in Eq. 19. The values of D cogn , D near and D nonc (at 30 0 C) used in Eq.19 were calculated as shown in [66], and the values of C j and R j were calculated separately for each codon as shown in [66], by substituing the number of its cognates, near-, and non-cognates tRNAs determined on the basis of [69]. Parameter I i was obtained from Eq.6, and P from Eq.2. Parameter Pz was obtained from Eq.10, where values of C and NC were taken from the experimental study [22]. Parameter Ps was obtained from Eq.3 and then normalised by its maximal value reported for the gene YLL040C. Total time of translation T was calculated as stated in Eq.1. Mean time required for elongation of one codon of the i th transcript (mean E) was calculated by dividing elongation time E by the length of this transcript in codons L. Parameter h was obtained on the basis of relative half-lives for yeast transcripts reported by [25] and mRNA half-life of the yeast gene YOR202W, assumed to be on average 9 min [76,77]. Parameter m was obtained from Eq.20, and B from Eq.21. The meaning of all variables was presented at the beginning of this section. Figure S1 The comparison of model parameters x and B with experimentally determined mRNA and protein abundances. Found at: doi:10.1371/journal.pcbi.1000865.s001 (0.26 MB PDF) Figure S2 The comparison of translational parameters between genes of high and low protein abundance, as well as between genes of high and low ribosome density. Found at: doi:10.1371/journal.pcbi.1000865.s002 (0.23 MB PDF) Table S1 A separate csv file containing calculated quantitative measures of translation for 4,621 yeast genes. There are 151 genes for which ribosome queuing was reported (parameter queue ! = 0, see below); the values of translational parameters of these genes may be irrelevant. Column descriptions: (gene) the systematic name of the yeast gene taken from Saccharomyces Genome Database; (L) length of the transcript CDS in codons; (x) absolute number of gene transcripts in a yeast cell; (b) absolute number of proteins produced from one molecule of a transcript during its lifespan; (B) total amount of protein molecules produced from transcripts of a particular type (B = b * x); (g) ribosome density in number of ribosomes attached to a transcript per 100 codons (g, = 10); (w) absolute number of ribosomes attached to one transcript; (P) translation initiation frequency (the inverse of I); (Pz) relative rate of binding of free ribosomes to the 59 end of a transcript; (Ps) relative rate of successful accomplishment of initiation once the ribosome-mRNA complex is formed; for clarity, normalised by the maximal observed value of Ps (65.88365), reported for gene YLL040C; (T) total time of translation of one protein molecule from a given transcript in ms (T = I + E); (I) total time (in ms) required for translation initiation, defined as a temporal interval from the point when the free 59 end of a transcript becomes available for ribosomes to the moment when a ribosome finds the initiation AUG codon and the entire complex starts the phase of elongation; (E) total time required for translation elongation of a transcript in ms; (mean_E) mean time required for elongation of one codon of a transcript in ms; (h) estimated half-life of a transcript in ms; (m) estimated mean lifetime of a transcript in ms; and (queue) ribosome queuing index estimated at 30 Celcius degree: value ''0'' -no ribosome queuing was observed for a transcript, value ''1'' -ribosome queuing was observed for a transcript, and value ''2'' the translation at the 59 end of a transcript is slow enough to delay the attachment of the successive ribosomes to the mRNA molecule.