Whole cell biophysical modeling of codon-tRNA competition reveals novel insights related to translation dynamics

The importance of mRNA translation models has been demonstrated across many fields of science and biotechnology. However, a whole cell model with codon resolution and biophysical dynamics is still lacking. We describe a whole cell model of translation for E. coli. The model simulates all major translation components in the cell: ribosomes, mRNAs and tRNAs. It also includes, for the first time, fundamental aspects of translation, such as competition for ribosomes and tRNAs at a codon resolution while considering tRNAs wobble interactions and tRNA recycling. The model uses parameters that are tightly inferred from large scale measurements of translation. Furthermore, we demonstrate a robust modelling approach which relies on state-of-the-art practices of translation modelling and also provides a framework for easy generalizations. This novel approach allows simulation of thousands of mRNAs that undergo translation in the same cell with common resources such as ribosomes and tRNAs in feasible time. Based on this model, we demonstrate, for the first time, the direct importance of competition for resources on translation and its accurate modelling. An effective supply-demand ratio (ESDR) measure, which is related to translation factors such as tRNAs, has been devised and utilized to show superior predictive power in complex scenarios of heterologous gene expression. The devised model is not only more accurate than the existing models, but, more importantly, provides a framework for analyzing complex whole cell translation problems and variables that haven't been explored before, making it important in various biomedical fields.

For mRNAs with abundance greater than 0.5, the value is rounded to the closest integer and the system is initiated with an integer number of these mRNAs, each acting as an individual mRNA object. For relatively low levels, this discretization error may be significant at the individual mRNA level, e.g. if the real abundance is 1.5, but it is rounded down to 1, there is a 50% difference in the representation of this mRNA. For the whole system simulation this error is negligible, but when further analysis is performed, we can insert an ad-hoc correction, assuming that total number of terminations of an mRNA type is proportional to its abundance.
When mRNAs with abundance lower than 0.5 are considered, a different approach is taken. Every such mRNA is represented by a single mRNA object labeled by 'Partial mRNA'. Such objects act as regular mRNA object with one difference: upon initiation, there is an additional randomization that allows initiation only in the fraction of times equal to the abundance. Such implementation results in partial consumption of resources (and consequently partial number of synthesized proteins), representing a fractional abundance. Clearly, this is an approximation because such effective reduction in initiation rate results in different ribosomal densities. However, because such mRNAs are rare by definition, and rare mRNA tend to have lower initiation rates, we expect that the approximation is reasonable.   (3,474) was divided in ration of 3/2 between GAU and CAU with respect to a gene copy number of 3 and 2, respectively.

Comment 3:
AUA is recognized by the anticodon 5'-LAU'-3', when L is Lysidine, a nucleoside that undergoes transformation from cytidine (carbonyl is replaced by the amino acid lysine). In (1) this anti codon is marked as "CAU".

Comment 4:
The tRNA count indicated as Met f1 and Met f2 correspond to the same tRNA and codon, so their count (1211 and 715, respectively) was combined. Also, even though the canonical start codon is AUG, other start codons exist (E. coli K-12 uses 83% AUG, 14% GUG, 3% UUG and few others (2). All these codons, if located at the start location, are recognized by fMet-tRNA.

Interaction coefficients optimization problem
When considering codon-tRNA interactions, we should consider both the canonical Watson-Crick (WC) pairs and other possible wobbling interactions. Section 2 lists all possible interactions in E. coli K-12 strains, along with the measured tRNA abundances. We define Interaction Coefficient (IC) as the chance, relative to WC pairs, that a binding between a codon (that currently resides at the A site of a ribosome) and an aminoacyl tRNA will take place, given that the tRNA is located in the vicinity of the codon. By this definition, the IC of WC crick pairs is 1, while other wobbling pairs will have an IC of less than 1 (but not less than zero).
Below we estimate the IC of all non-WC pairs by means of the tRNA Adaptation Index (tAI) (4).
Similarly to the definition of Reis et al., we define the tAI of a codon : Where is the number of tRNA molecules available in the cell, while is the IC between an anticodon whose first (from 5') nucleotide is and a codon whose third nucleotide is . This index is a measure of the extent to which a codon is adapted to the tRNA pool; e.g. a codon that is recognized by a relatively abundant tRNA pool will have a relatively higher score.
For a given set of IC values, a tAI for all codons can be calculated. Then, these values are normalized by the maximum tAI value. The tAI of a gene is defined as the geometrical mean of all normalized tAI scores of the ORF codons (including the start codon but excluding the stop codon).
In order the estimate the IC, we adapt the following strategy: we first find the set of IC that will maximize the correlation between the tAI and the TDR scores of the codons (5). Assuming there are several candidates (with similar correlation), we choose those that will maximize the correlation between tAI gene scores and their protein abundances of all E. coli genes. The idea is to avoid overfitting. The procedure is schematically described below: Schematic description of the optimization process of the interaction coefficients (IC). In step 1, a given set of IC is optimized to reach maximal correlation between TDR and tAI. In step 2, the best IC values are used to calculate the correlation between PA and tAI. The best IC set is chosen.
We estimate the IC vector, which contains the following six parameters: Use the following pipeline for optimization:   Table B: tAI values of the final IC set.

Detailed definition of ESDR and functional relations with timers
tRNA-codon recognition scheme that allows various interaction for each codon, with different interaction coefficients and tRNA abundances, induces complexity on the resource management of the model.

Ribosome pool:
Consider first the case of the ribosome pool. As there is only one type of ribosome, the rate of initiation events is related to the availability of free ribosomes (supply) and the current competition for these ribosomes (demand). Thus, upon initiation we can simply calculate the Ribosomal Supply-Demand Ratio (RSDR), and use this value to calculate the initiation timer.
tRNA pool: The case for tRNAs, however, is more complex. For a given codon, we assume that the penalty waiting time is tightly related to the effective competition on the resources. For a given codon at a given time, we check what are the available tRNA molecules that recognize this codon, and weight them by interaction coefficient . However, the demand for a resource must be also taken into account. We define the Effective Supply-Demand Ratio for codon as follows: where is the currently available pool of tRNAs of type , is the group of tRNAs that recognize the codon (i.e. tRNAs that satisfy ) and is the demand, the effective number of codons can be recognized by the tRNA of type . Note that both and are time dependent. Here is the group of codons that are recognized by tRNA (i.e. codons that satisfy ), is the total number of codons of type that currently require a tRNA and is normalized , as defined below. The power of , namely , represents a penalty related to tRNA recycling, with being a distance score of codon (defined below) and is a normalization factor to be estimated below. The rational is as follows: if two codons that are recognized by a given tRNA are adjacent to one another along the ORF, there is increased probability that the same tRNA will also deliver an amino-acid to the downstream codon, leading to effectively lower demand. It should be mentioned that the tRNA that exists the E ribosomal site is not charged with an amino acid, and it takes some time for the tRNA to undergo aminoacylation.
However, we still expect some allocation bias which is related to the distance between codons that share the same tRNA type. The parameter , which may be biophysically related to the aminoacylation time, is in fact a way to control the magnitude of the effect (extremely short aminoacylation times relative to decoding times will result is higher recycling rate, while extremely long aminoacylation times will result in no recycling at all). As will be described below, we estimate its value based on biophysical considerations.
The suggested penalty related to the average distance between codons. For each codon , we calculate the average distance, in codons, between instances of , across all ORFs in the transcriptome, normalized by the mRNA level. Formally, let ORF have a corresponding mRNA abundance of . Let be an ordered group of locations of codon in this ORF. If , the ORF is not considered. The distance score is defined as: In order to estimate we take the following approach: TDR, much like ESDR, reflects the effective competition among codons and tRNAs. Thus, both metrics are a measure of codon optimality (or adaptation to the tRNA pool) and should be correlated. We prefer using TDR over tAI because the former is estimated using experimental data and more reliably captures the dynamic information of translation elongation. ESDR itself is a dynamic metric, and its calculation requires a running wholecell simulation, making optimization very time consuming. As an alternative, for this optimization we use naïve ESDR. It is defined similarly to the ESDR, but instead of considering only currently available tRNAs and competing codons, we consider all relevant tRNAs and codons. the recycling mechanism to take place. Specifically, the value chosen is the one that maximizes the correlation.
Note that for high values the correlation is negative. This is related to the fact that more optimal codons are also more frequent, leading to a higher demand. The denominator of the ESDR is correlated stronger to TDR than the nominator, leading to inverse correlation. Accounting for effective tRNA recycling as described above helps to overcome this anomaly.

Normalization of : the normalized interaction coefficient is defined by
It is the interaction coefficient between codon and tRNA , normalized by the maximum affinity between the tRNA and any codon it recognizes. The normalization exists to avoid cases in which a given tRNA satisfies , leading to increased supply/demand ratio for this particular tRNA.
Choosing the tRNA: assuming that the timer for waiting for tRNA at codon has passed, we try to allocate a tRNA out of the currently free ones. The probability of choosing will be proportional to the affinity times the amount of the given tRNA, namely .

mRNA-Pool Interaction
The interaction between mRNAs and the cell resources is modelled through the timers that dictate ribosomal movement. The timer expresses the estimated time for the required resource to arrive, depending on pool availability. Timers are defined in the following cases: Initiation: the canonical prokaryotic initiation mechanism is relatively well understood. Local, mRNAspecific, factors include the distance of the Shine-Dalgarno sequence from the start codon, fold energy at this region, energy released upon initiator tRNA hybridization etc. These factors essentially determine local initiation time. We address the estimation of local initiation times in section 9. In addition, global factors such as the availability of ribosome, initiator tRNAs and various initiation factors may significantly alter the actual initiation time. While modeling only ribosome and tRNA pools, we define initiation timers using the following expression: Here is the local initiation time. is the ESDR of start codons (associated solely to fMet). The maximal possible timer is , for cases where both supply/demand ratios approach zero.
Finally, and are simply normalization factors that control the steepness of the relation. The value of significantly influences time penalty only when the system operates near depletion of resources, which is not the case in real cells. We assign a default value of 10. The value of is addressed below. The default value of is 1. We perform sensitivity tests to both and .
When such timer reaches zero and assuming a ribosome was allocated successfully, the first codon after the start codon will be located at the A site of the ribosome, starting the first elongation cycle.

Elongation:
Similarly to the initiation step, we refer to the canonical model of the translation elongation cycle. Our model assumes that the duration of each cycle can be viewed as sum of two components: (1) A constant asymptotic time, which includes all steps under the assumption of infinite resources. This component is codon-independent.
(2) A variable time, which reflects the limited availability of the tRNA pool (and thus codondependent). We assume that the variable part only depends on the ESDR of the codon being translated. In other words: Here is the asymptotic constant time, which is the only term in the case resources saturation ( Thus, Note that we treat as a normalization factor for ESDR for all codons. In order for this normalization to be stable, the value must reach steady state (otherwise, the normalization factor may fluctuate an induce non-physical artifacts to the model). We show that indeed reaches steady state.
To estimate the constant part, we follow (7)  Termination: We take a simple approach to termination, modeling it as a single step process of constant time. The time is estimated using ribosomal sequencing and described below.
Timer error estimation: typically, a timer will be assigned and only when it finishedthe associated action will take place if possible. For example, consider assignment of timer for tRNA allocation. As defined above, the penalty that is related to resources is smaller or equal to .
Assuming , we expect not more than 20 iterations until the timer ends. In this time, based on a simulation of native E. coli with nominal parameters (as estimated below), we expect change of not more than 20% and change of not more than 5%. With typical values being and lowest values of around 700 (for CCA), we expect an error of the timer value of around 3%.
Note that this represents a very extreme case.
The regime of severe resources depletion: When ESDR approaches zero, we assign timer value of (that were tailored based on experimental data) at most. By using this approach, we avoid cases in which a temporary low ESDR results in an extremely high timer (which is problematic if the depletion is resolved while the timer is still with high value). The potential problem with such approach is that for very low ESDR values, the timer is biased to the selection of and values, regardless of the discussed codon. In all the cases we tested, such limitation did not raise any problems (in part because severe resources depletion is not a realistic condition, while we used only empirical data). It should be emphasized, that any regime in which resources are not depletedthe suggested approach is a good approximation, as demonstrated by the high correlations with empirical data.

Detailed state machines
The state machines below are presented graphically and implemented in the mRNA.py file.  It should be noted that addressing codons in reverse order (starting from 3' to 5') is done for computational efficiency (allowing to avoid several conditions), but the expected error for such implementation (which appears also in previous studies) is very low. The induced error from such biased order of timer allocation is very small. As explained above (section 6, "Timer error estimation"), we can expect an error of less than 3% in the allocated timer over 17 iterations of 5ms, thus less than 0.2% in a single ΔT. This error represents, in part, inaccuracies in the demand that result from biased order of timer definitions. It may be argued that addressing codons in such order may result in incorrect allocation of resources. However, in additional to the small error, it should be further noted that such scenario is rare in the first place, since it required several codons on the same mRNA that require the same tRNA to start in finish decoding within ΔT=5ms. With tens of ribosomes translating a typical mRNA (~20 codons per ribosome, and hundreds of codons forming a typical ORF) that consists of tens of codon types, such scenario is indeed rare at the whole-cell landscape.

Nominal Parameters -Method
In this section we describe the way we chose nominal parameters for the mode, i.e. the set of parameters use by default for all reported results (unless otherwise mentioned).
We should distinguish between biological parameters and mathematical parameters. show that the model is robust for parameters change to the expected extent.
We took the following approach (schematically depicted below): 1. We chose a value of 5,100 for the number of mRNAs in the cell (this is a typical representative value, see the next sub-section).  It should be noted that ideally, for a given growth rate (resulted from a well-defined E. coli strain and growth medium), it should be possible conclude the number of ribosomes, tRNA and mRNA molecules. However, The growth rate is not the only explaining factor and moreover, all provided measurements are not synchronized in conditions. For this reason we adopted a more robust optimization method as described above.

Nominal Parameters -Values
Transcriptome: 3. Absolute number of mRNA molecules: From (10) we concluded a value of 7,800 for rapid growth in LB medium, and 2,400 for moderate growth rate in MM. However, explicit growth rate is not reported. From (11) follows that E. coli in similar mediums reaches a growth rate of 0.7 -2 doublings/h.
These values were also validated in (13) when modelling the rate of RNA chain elongation.
tRNA Pool: 1. Relative number of tRNA molecules: we used (1) to identify the relative amounts of different tRNA molecules in E. coli. See Table A for the values used.
2. Absolute number of tRNA molecules and ribosomes: (8) reports that the estimated number of tRNA molecules is 63,000 -669,000 for growth rates of 0.6 -2.0 doublings/hour. Values that are deduced from (1) indicate a similar range (when taking into account the provided ribosomes/tRNAs ratio and range of ribosome pool values from (8)).
Ribosomes Pool: reports that the estimated number of tRNA molecules is 6,800 -72,000 for growth rates of 0.6 -2.0 doublings/hour. These numbers show consistency for various estimation methods (BNID111527 (12)).

Initiation Time:
Translation initiation time is considered to be on the order of seconds in bacteria (14). This order of magnitude is with agreement with some specific reported values, such as 3 sec of the wild-type lacZ mRNA of E. coli (15). Also, assuming the previously mentioned distance of 14 -26 codons between ribosome, and assuming that elongation time is 5 -22 codons/sec (see "estimation of local initiation times"), we conclude that, on average, a new ribosome begins translating a given mRNA roughly every 1 -5 seconds. Unfortunately, more precise values at the mRNA-specific resolution are not available. We used the range of 0.5 -5.0 sec for parameters estimation.

Estimation of local initiation times
Below the method of initiation time estimation the for all E. coli genes.
Suppose we have types of mRNA molecules, where type has mRNA level of .
For each mRNA type we denote read count by , where is the read count at codon and is the number of codons in the ORF, including the start and stop codons. The total read count for this mRNA type: The total estimated read count per single mRNA molecule (of type ) is The total sum of reads for the entire transcriptome is: Suppose the total number of ribosomes at the pool is . At steady state, roughly of the ribosomes are active (8,16), i.e. are involved in a translation elongation process. We define a normalization factor that is data-specific (i.e. can be different for different ribosomal sequencing For a given average elongation rate and a ribosomal pool value , a range of initiation rates will be obtained using the method described above. These values must also agree with reported values.

Estimation of termination time
The

Parameter robustness tests
After choosing the nominal parameters for the system, each parameter was tested for robustness by performing a set of simulations with values around the chosen value. In the case of tRNA pool, since the selected value was in the saturation region of observed metrics, we focused on lower values. The test was qualitative, while making sure that the trends are as expected. We also made sure that no rapid changes were observed in the region of the selected values. , and . All parameters found to be robust from the standpoint of the chosen metrics.

Details of GFP heterologous expression modelling
We used the following GFP sequence (from the U57608.1 cloning vector): The assumption that this cloning vector is highly optimized for expression in E. coli, led to the following additional two assumptions: 1. The amount of GFP mRNA molecules was assumed be very high, e.g. 20% of all mRNA molecules. We thus introduced 1,275 GFP mRNAs (in additional to the 5,100 native RNAs).
Note that increasing this fraction further resulted in resources depletion.
2. The initiation rate had to be high in comparison to initiation rates of the native genes. We calculated the mean initiation rate of the 10 genes with the highest initiation rates and defined the result as the initiation rate of all GFP genes.

Initiation time estimation of Kudla et. al variants
In a heterologous expression simulation, it is required to define the number of mRNAs, the initiation time and the codon composition of each mRNA. In this case, all the heterologous mRNAs in a single simulation are from the same type, which is one of the variants analyzed by Kudla et al. where . This approach, that was used for both calibration and main simulations, forces the average number of mRNAs to be 20% of the total amount (as suggested in section 12), but maintains the relative variations of mRNA levels between variants.

Correlations between model predictions and Kudla et. al data
Below we show correlations between PA (protein abundance) and OD (optical density) (23), and nonexperimental values, such as FE (folding energy of the first third of the ORF) and prediction of the model.

Regressors for PA and OD, simple features
Based on 77 variants for which all data is available.

Correlations between PA (and OD) and ESDR per codon
The data is given in a separate excel spreadsheet. S2_File.xlsx

Regressors for PA and OD, based of ESDR features
We list the 4 combinations discussed in the main text. In each case, we list the 20 chosen features. It is not surprising that FE or CAI are chosen as the first feature, since these metric are tightly related to the translational efficiency of given gene. The other features are codon specific, thus they are expected to contribute when sufficient amount of them is combined together.
Objective: PA, features set: ESDR per codon, CAI and FE