Ribosome Traffic on mRNAs Maps to Gene Ontology: Genome-wide Quantification of Translation Initiation Rates and Polysome Size Regulation

To understand the complex relationship governing transcript abundance and the level of the encoded protein, we integrate genome-wide experimental data of ribosomal density on mRNAs with a novel stochastic model describing ribosome traffic dynamics during translation elongation. This analysis reveals that codon arrangement, rather than simply codon bias, has a key role in determining translational efficiency. It also reveals that translation output is governed both by initiation efficiency and elongation dynamics. By integrating genome-wide experimental data sets with simulation of ribosome traffic on all Saccharomyces cerevisiae ORFs, mRNA-specific translation initiation rates are for the first time estimated across the entire transcriptome. Our analysis identifies different classes of mRNAs characterised by their initiation rates, their ribosome traffic dynamics, and by their response to ribosome availability. Strikingly, this classification based on translational dynamics maps onto key gene ontological classifications, revealing evolutionary optimisation of translation responses to be strongly influenced by gene function.


Estimation of hopping rates in the two-state model
Assuming a spatially homogenous pool of tRNAs in the cell, the tRNA-capture rates k i are related to the inverse of the average arrival and recognition time of the cognate tRNA, and are determined as follows.
The total translation rate ω i of codon i is given by: where γ is the translocation rate, γ = 35s −1 (see main text). An initial estimate of the hopping rates k j of the codons translated by the tRNA j is where GCN j is the gene copy number of the tRNA of type j with j = 1, . . . , 41, and r is a proportionality constant such that ω i = 10 codons/s, · representing the average value over all the codons. That means, codons that are translated by the same tRNA j are initially assigned the same k i value. Therefore, effectively we would have only 41 1 different types of codons, corresponding to the 41 different types of tRNAs. However, experimental data suggest that the translation rates of codons using the G-U wobble are reduced by 39% compared to their G-C counterparts. Analogously, codons using the wobble I-C and codons using the wobble I-A are reduced by 36% relative to their I-U counterparts [1]. We reduce then the initially obtained rates accordingly. Furthermore, the average value of the total translation rate is calculated as it follows where n i is total number of codons in the cell of exactly type i, and n = 61 i=1 n i . Equating this last equation to 10s −1 , we obtain (numerically) the value for the proportionality constant r. Table S1 summarises the results. The termination rate is assumed to be fast and comparable to the translocation. Therefore, we set β = 35s −1 .
We are aware that additional factors can be taken into consideration to estimate the k i rates, for instance, the occupancy of the ribosomal A-site by other (non-cognate and nearcognate) tRNAs [2,3]. However, the main factor determining the arrival of the cognate tRNA is its abundance [12], and the overall conclusions of this work hold even considering further corrections in the estimation of the k i rates. To demonstrate this beyond doubt, we compared the k i rates with alternative transition rates that consider the occupancy time of the ribosomal A-site by near-and non-cognate tRNAs. To produce those rates we used tRNA abundance data from [4] and assumed biochemical reaction rates measured in vitro and in particular circumstances [5], as these were the only available data. Figure S1 shows a comparison of both sets of k i rates, from which it is apparent that the relationship among different k i rates does not change much in the alternative set of values.  Figure S1. Comparison between the k i transition rates (blue) an alternative choice of the transition rates considering the competition of nearand non-cognate tRNAs (orange) for all codon types.
We additionally confirmed that our classification of sequences into two distinct types of response to initiation rate, namely smooth and abrupt (see main text), is not affected by the additional consideration of non-cognate and near-cognate tRNAs. Moreover, as one can appreciate from Figure S2, the polysome size (density ρ) is comparable in both sets of rates, particularly around the expected value of the physiological initiation rate α ϕ (which median is 0.09 s −1 and mean 0.12 s −1 , see main text). For these reasons we did not to include this further level of complexity.  Table S1. Hopping rates k i considering supply (gene copy number of tRNAs) and wobble base-pairing.

Classification of sequences
In order to classify the mRNA sequences into abrupt or smooth we applied a clustering k-means based algorithm [17] to the density of particles in the excited state (ribosomes loaded with tRNAs before translocation to the next codon) versus the initiation rate α. This curve can be approximated by three consecutive straight lines. In the case of smooth sequences, the slope of the first line is larger than the slope of the second line, whereas in the case of abrupt sequences it is the other way around. In the idealistic case of a sequence composed of only the same fast codon except a slow codon in the bulk of the sequence, the density of ribosomes shows a discontinuity when α reaches the value of the hopping rate of the slow codon [6]. A real mRNA sequence is composed of many different codons, but if there is a slow codon or a cluster of slow codons in the bulk of the sequence it can act as a bottleneck for the traffic of ribosomes. Due to the inhomogeneities of the sequence (every codon is different) and to the finite length of the sequence, there is no real discontinuity in the density of particles, but beyond a certain value of the initiation rate α the density increases rather abruptly, until it reaches the saturation value. Therefore, we find three qualitatively different regions in the curve; in the first region the density increases slowly with α, in the second region the increase is very abrupt, and in the third region, the density shows a very shallow or no increase with α (see Figure S3b). In contrast, if the mRNA sequence does not contain any slow codons or cluster of slow codons in the bulk, but there are either no slow codons or they are at the beginning of the sequence, the density of particles increases smoothly with α. In this case, the curve can also be split in three regions; in the first region the density of loaded ribosomes increases steadily with α, in the second region the rate of increase decreases, until the saturation is reached in the third region (see Figure S3a). Hence, the k-means based algorithm iteratively assigns each point of the curve to one of the three regions, by performing a linear regression in each part. The iterative loop stops when there is no further change in the assignment of points to each of the three regions. By means of this algorithm we are able to classify 5,669 sequences out of all 6,293 ORFs. In the remaining sequences, the algorithm did not find three consecutive slopes to approximate the curve, but rather assigned different intermingled segments of the curve to the three different straight lines. In order to classify these sequences we applied another algorithm; we first performed a linear regression taking the first tenth of the points of the curve and kept adding consecutive points to the first region, every time updating the linear regression, until there was a large deviation from the initial slope. That point indicated the end of the first region of the curve. Then we continued with the third region of the curve, starting from the last point and going backwards. Like that, the second region, the most difficult to identify, was determined. We then calculated for each sequence a goodness of fit parameter between the simulated density curve and the fitted straight lines (normalised square root of the sum of squared differences) . In the first run of the second algorithm, the goodness of fit parameter for some sequences was too high (indicating a poor fit); hence, those sequences were reanalysed again by the second algorithm with changed parameters (the thresholds that limited the maximal allowed deviation from the initial first and third regions). Again, after this analysis some sequences presented a poor goodness of fit, and hence this procedure was reiterated 6 times. At the end, only 32 sequences could not be classified by means of these two algorithms.
In the final classification, we included an additional class for sequences on the boundaries between the two groups (for example, sequences initially classified as abrupt by the algorithms, but without pronounced traits of this class). Such sequences generally have an absolute relative difference between the slope of the second and the first part of the curve smaller than a pre-defined threshold, i.e. sequences with |b 2 − b 1 | /b 1 < 0.5, where b 1 and b 2 denote the slopes of the first and second part of the curve, respectively. We denote these sequences as hybrid. The histogram of the abruptness parameter (b 2 − b 1 )/b 1 for all sequences is presented in Fig. S4.

Codon arrangement as a key determinant of protein production rate
To show that codon arrangement is a key determinant of protein production rate J we randomised the codon configuration of a given mRNA in such a way that the codon composition is identical, as well as the encoded amino acid sequence. Applying this randomisation algorithm we generated 2,000 randomised surrogates of the original sequence and calculate the protein production rate J for each one, for a fixed value of the initiation rate α. We chose a large value of the initiation rate, namely α = 10 s −1 in order to isolate the elongation effects.
The values of the obtained protein production rates are different among the randomised population and are collected in an histogram (an example is shown in Figure S5). Similar distributions can be obtained for smaller initiation rates.
The fact that we obtain a broad distribution of J values clearly demonstrates that mRNAs with identical codon usage but different codon arrangement have different translation efficiency.

Quantification of initiation rates
By combining the genome-wide experimental data of polysome size by [15] with the results from our stochastic simulations, we were able to extract the value of the initiation rate α for each mRNA. This was done by equating the measured (experimental) values and the outcome of the simulations, Equation (1) of the manuscript, via a linear interpolation between the simulations points. The value of α was then determined as the value at which the simulated polysome size equated the experimental value.
We found a strong correlation between the ORF length and the estimated α ϕ (see Figure 3 in the main text). In Fig. S6 we emphasise that, because each mRNA has a different rate of change of the density upon changes of the initiation rate (given by its codon composition and arrangement), an mRNA with high ribosome density does not necessarily mean that it has a large initiation rate. In the example shown, in fact, we illustrate how the estimated initiation rate of a gene with a large density value (YAL008W, in blue) is smaller than the initiation rate of a gene with smaller amount of ribosomes per nucleotide (YBL087C, in green).  Figure S6. Procedure to estimate the physiological initiation rates α ϕ : the experimental [15] density ρ ϕ is used to extrapolate the physiological initiation rate α ϕ for each transcript from the predicted ribosome densities (full lines).
We interpret the correlation between α ϕ and the ORF length as the indication of a possible regulatory mechanism of the initiation of translation, likely due to the circularisation of smaller transcripts, that can control the recruitment and loading of ribosomes into the mRNAs, and lead to the known density-ORF length relationship [11] (see also Section 6 of this Supplementary Information). Furthermore, to demonstrate that comparable results can be obtained by using an independent dataset, we used the density measurements from [11] and repeated the procedure obtaining a new set of initiation rate values denoted by α Arava ϕ . We found again that the estimated values of the initiation rates are distributed rather than being a unique value for all the mRNAs, see Figure S7. The physiological α ϕ used in the main text (where we used the dataset from [15]) and this set of initiation rates α Arava ϕ are highly correlated (Spearman's rank=0.73, p-value< 10 −6 ), confirming that our results are general and not only valid for a specific dataset. 4.1. Pheromone treatment. We used the same procedure to estimate the set of initiation rates α t under pheromone treatment starting from the polysome measurements provided in [15]. We then compared the initiation rates under pheromone treatment to the physiological ones and defined, for each mRNA, the relative change as (α ϕ − α t )/α ϕ . Figure S8 shows the normalised histogram of the relative change across all genome.  Figure S7. (a) Distribution of the estimated initiation rates in S. cerevisiae using polysome size measurements reported in [11]. (b) Comparison between the physiological initiation rates estimated using data of polysome size by [15] (α ϕ ) and by [11] (α Arava ϕ ), Spearman's rank =0.73, p-value< 10 −6 .
Although in general we do not detect large changes in the initiation process under pheromone treatment, two genes playing a key role in the pheromone response pathway (SAG1 and HO) present a large variation in the initiation rate. Consistent with this observation, these two genes are known to be subject to alteration of their 5'UTRs under pheromone treatment [16].  Figure S8. Frequency of the relative change of the initiation rates between the physiological and the pheromone cases, (α ϕ −α t )/α ϕ . Triangles indicate the corresponding values for SAG1 and HO.

Energy of secondary structures
We computed the energy of secondary structures in the 5'UTRs (sequences from [7] with the RNAfold programme from the Vienna package [8]. The values of the free energies were then compared to the estimated initiation rates α ϕ ; we found a significant correlation (Spearman's rank = 0.25, p-value< 10 −6 ) between the initiation rates and the free energy of secondary structures, meaning that folding structures in the 5' leaders can affect the recruitment of ribosomes. Figure S9 shows the scatter plot between the estimates α ϕ and the absolute value of the free energy of secondary structures in the 5'UTRs.

ORF-length dependence of the initiation rate
In [14] the authors argue that a region of constant length with higher ribosome density close to the 5' end (caused by the ramp of slow codons) would lead to the observed larger ribosome densities on short transcripts [11,14], since that constant length high density region is a proportionately larger fraction of a small mRNA (see hypothesis A in Fig. S10). To test this argument, we have first determined the relative bottleneck position (position of the bottleneck divided by the length of the ORF) and we find, consistently with [13], slow regions at the start of the coding sequences (see Figure S11). We computed the bottleneck position (using the k i rates, section 1 of this Supplementary Information) of each sequence of the genome, following the method described in [13]. We find that the ramp of slow codons at the beginning of the sequences does not have a fixed length (hypothesis A in Fig. S10), but it is shorter for shorter sequences, so that the relative position of the bottleneck to the total ORF length is constant across the genome (hypothesis B in Fig. S10). This is shown in the joint histogram of relative bottleneck position and ORF length in Figure S11; if hypothesis A were true, we would have observed the peak of the distribution moving towards 1 upon increments of the ORF length. The reader should notice that a similar conclusion was found by Noval and Pilpel [9].
The excess of ribosomes queueing at the 5' end would cause the density-ORF length correlation observed in the experiments [11,14] only if hypothesis A were verified (same amount of ribosomes queueing would weight more for shorter mRNAs). Instead, in hypothesis B the amount of ribosomes queueing scales with the length of the transcript, and hence the excess of ribosomes at the 5' end cannot explain the experimental density-ORF length correlation.
We have also performed another test of hypothesis A, based on the stochastic dynamics of ribosomes; we compute the maximal density ρ max as defined in the 'Materials and Figure S11. Joint histogram of the relative bottleneck position and length L of the ORF. To compare different values of L we normalise the marginal distributions of the relative positions (the sum of the frequencies of bins with the same L is equal to 1). For any ORF length there is an accumulation of bottlenecks close to the 5' region (consistently with [13]). The relative position of the upstream bottlenecks rather than their absolute location, however, seems to be conserved.
Methods', which is determined by the relative position of the bottleneck, since after the bottleneck the density of ribosomes would be negligible. We find no significant correlation between ORF length and ρ max ( Figure S12), indicating that the position of the bottleneck does not depend on the transcript length; in other words, the average distance of the bottleneck from the beginning of the ORF is not constant across the genome.
Thus an elongation limited process (bottleneck) would not explain the negative correlation between densities and transcript lengths. We suggest instead, neglecting other possible sources of correlations such as mRNA degradation [10], that one possible reason for the negative correlation between coding sequence length and ribosome density is the relationship we detect here between initiation rates and ORF lengths. The negative correlation between initiation rates and ORF lengths may be a signature of ribosome recycling: due to the pseudo-circular conformation of mRNAs, the ends of short mRNA chains can relatively easily interact and enhance the ribosome re-initiation process, while long sequences are potentially less efficient recyclers of ribosomes [18,19], as already speculated in [11].

Computation of p-values
The p-values given along with the correlation coefficients were calculated as follows: after computing the correlation coefficient c(X, Y ) between the data sets X = {x i } and Y = {y i }, one of the data sets, say Y = {y i }, was shuffled 10 6 times, thereby generating 10 6 surrogates Y surr j , with j = 1, . . . , 10 6 . Then, the correlation coefficient was computed between X and each of the generated surrogates, leading to a set of 10 6 correlation coefficients values c(X, Y surr j ). These values were then ranked, and the proportion of these values above c(X, Y ) determined the p-value.
To avoid confusion, the measure of statistical significance provided by an hypergeometric function is instead denoted with a capital letter, P-value.

Software and simulations outcome
The programs used were written in C++ and are available on request. For all mRNAs, the profiles of density and current for different values of initiation rates, ρ(α) and J(α), are also available on request or can be downloaded from the authors' webpages or from http://abdn.ac.uk/mRNA-translation/.