Rhythmicity is linked to expression cost at the protein level but to expression precision at the mRNA level

Many genes have nycthemeral rhythms of expression, i.e. a 24-hours periodic variation, at either mRNA or protein level or both, and most rhythmic genes are tissue-specific. Here, we investigate and discuss the evolutionary origins of rhythms in gene expression. Our results suggest that rhythmicity of protein expression could have been favored by selection to minimize costs. Trends are consistent in bacteria, plants and animals, and are also supported by tissue-specific patterns in mouse. Unlike for protein level, cost cannot explain rhythm at the RNA level. We suggest that instead it allows to periodically reduce expression noise. Noise control had the strongest support in mouse, with limited evidence in other species. We have also found that genes under stronger purifying selection are rhythmically expressed at the mRNA level, and we propose that this is because they are noise sensitive genes. Finally, the adaptive role of rhythmic expression is supported by rhythmic genes being highly expressed yet tissue-specific. This provides a good evolutionary explanation for the observation that nycthemeral rhythms are often tissue-specific.


Detection of rhythmic gene expression
We mainly used the algorithm GeneCycle to detect rhythmic patterns in gene expression time-series (See Methods for more details). We checked that density distributions of p-values obtained from rhythm detection methods used in this paper produced expected left-skewed distributions (Figs 3 and 5). For each gene or protein with several data (several ProbIDs or transcripts), we combined p-values by Brown's method using the EmpiricalBrownsMethod R package. Fig 4 shows the density distribution of p-values obtained after this Brown's normalization for the transcriptomic data in Ostreococcus.
2 Gene expression level Figs 6 and 7 show that distributions of the mean (Fig 6) and the maximum (Fig 7) expression level calculated over time-points (See Methods) can be considered as normally distributed, so relevant for our statistical analysis that need these initial conditions.

Gene expression costs
To produce a given steady state protein level N p , the cell request energetic costs at transcriptional level (C RN A ) and at translational level (C p ): N RN A : abundance of transcripts L RN A : transcript length c nt : averaged nucleotide synthesis cost of the transcript N p : abundance of proteins L p : protein length c AA : averaged AA synthesis cost of the protein The number of ATP molecules consumed is in averaged around 30∼P for one amino-acid (AA) molecule produced (estimated in E. coli ) (1), against 49∼P for one single nucleotide (estimated in yeast and E. coli ) (1; 2). The median length is estimated around 1500 nucleotides for transcripts against 400 AA for proteins in yeast (1; 2). And, the abundance of transcripts is in average 1.2 molecules per cell (1), against 2,622 proteins per cell (yeast) (3) (9 and 6,384 molecules respectively found recently by Lahtvee et al. (4)). Thus, even whether the precursor synthesis per nucleotide is 1.6 fold costlier than for AA molecule and proteins are around 3.75 fold shorter than transcripts, N p /N RN A is on the order of 1,000, so the cost at the protein level is 160 times that at the RNA level (even higher at slower cellular growth-rate). Finally, protein costs are even more higher since chain elongation costs are larger than those of nucleotides polymerization since the nucleotide precursors are already activated molecules (rNTP) whereas chain elongation needs costs of charging tRNAs with AA (2).
Thus, because C RN A << C p , C p gives an estimate broadly representative of the expression costs per gene. Finally, since we compared these costs between genes, the other costs such as degradation or chain elongation costs, should not change our results. Degradation may cost essentially nothing (1) and its cost must be correlated with C p since longer or higher expressed proteins need more chain elongation activity and degradation rate (see next section), so does not modify the comparison of expression costs between genes.
Our results support the hypothesis also claimed by Wang et al. (5) that cycling expression of the more expensive genes is a conserved strategy for minimizing overall cellular energy usage. In this study, we provide new results based on relevant data. Indeed, data used by Wang et al. (5) for the calculation of costs appears to be biased, partly because: i. translation rates come from fibroblasts cells (6); and ii. there were errors in the estimation of protein levels resulting in a systematic underestimation of protein levels and derived translation rate constants (Cf Corrigendum (6)).
4 Expression costs, decay and half-lives

Costs of protein decay
Protein half-life is the time requested to reduce the protein amount by 50%. In terms of protein concentration, assuming constant protein production, protein half-life depends on protein degradation and cell dilution (cell's growth) (equation (3) and (4)). We assume that spontaneous degradation of proteins is negligible.
δ sdeg : spontaneous decay of unstable molecules δ deg : active protein degradation rate δ dil : cellular dilution Costs of protein decay are negligible enough to not be opposed by selection. Indeed, Lynch and Marinov (2015) and Wagner (2005) have shown that "degradation in a lysosome may cost essentially nothing, and amino-acid export back to the cytoplasm consumes ∼1 ATP for every 3 to 4 amino acids". Compared with the unique cost of producing one single nucleotide which consume 49∼P, protein decay costs becomes negligible comparatively to transcriptional costs, which are themselves negligible comparatively to translational costs. All the more, given that amino acids from degradation are reused and do not need to be produced by the cell, which therefore economizes around 30 ∼P per amino-acid.

Half-lives of rhythmic proteins
In rapidly growing bacterial cells, dilution is often more significant than degradation (7; 8), whereas in no-dividing and slowly dividing cells such as differentiated mammalian cells, protein half-life rely mostly on degradation because dilution is negligible (7). Half-life might be an inappropriate parameter regarding rhythmic proteins. Indeed, half-life depends on protein production rate and degradation rate which, for rhythmic proteins, vary over time. We think that protein half-life should be a constant property of proteins and, therefore, is difficult to understand for rhythmically expressed genes (although it is discussed by Lück et al. (9)).
Suppose we can deal with protein half-life property for all genes. Since higher expressed genes have longer halflives (they are more stable) (10; 11) and because highly expressed genes tend to be rhythmic, the expression costs for maintaining high level of these proteins should be reduced (due to their longer life-span). That is why, the cost per time-unit of maintaining the steady-state protein level might not be correlated with our estimation of expression cost (C p ). I.e., rhythmic proteins for which our results suggest that they are rhythmic because they are costlier (and costlier because they are highly expressed) would, in fact, require less energetic costs to be expressed than expected because these protein molecules tend to have longer half-lives (they live longer). On the other hand, models suggest that a longer half-life damps the amplitude of the periodic expression (9); as supported by observations showing that non-rhythmic proteins have much longer half-lives than rhythmic ones (in mouse fibroblast) (11).
Thus, highly expressed genes should be less rhythmic (due to amplitudes damped by longer half-lives) and therefore, less energy saved over 24 hours. Consequently, the rhythmic expression of costly proteins (as our results suggest) should not involve proteins with longer half-lives. Thus, rhythmic and highly expressed proteins remain costlier for the cell per time-unit than rhythmic and lower expressed proteins. Therefore, our results remain consistent even when considering protein half-lives.

Averaged AA synthesis cost and protein length
Before applying the t-tests, we checked that averaged AA synthesis costs and the length of proteins were normally distributed in both groups (rhythmic, first 15%, and non-rhythmic) (Figs 8 and 10). We also verified that quantile-quantile-plots showed comparable distributions between the theoretical distribution and the empirical distribution for both groups (Figs 9 and 11).
6 Gene expression noise 6.1 Transcriptional noise is the main source of the overall noise Relatively to translational noise, transcriptional noise is the main driver of the overall noise (12) and should give a good estimation of the output noise. Indeed, based on estimations of coefficient of variations (CV, cell-to-cell variations of protein level) for diverse transcription and translation rates in E. coli and S. cerevisiae, Hausser et al. (13) have shown that for a fixed transcriptional rate, CV is almost constant for diverse translational rates. Thus, changes in protein level have little to no impact on gene expression noise. The availability of mRNA molecules seems to drive the final noise. I.e., comparatively to the noise caused by the translational activity, the availability of low number molecules such as transcriptional factors (subject to the stochasticity of diffusion and binding in the cell environment) is the main factor of the output cell-to-cell variation in protein abundances.

Rhythmic proteins and expression noise
In the case of genes with constant mRNAs abundances and with rhythmic proteins, we expect that rhythmic expression at the protein level cannot originate from selection on expression noise. Indeed, since increasing transcription at constant protein abundance decreases expression noise (13; 14) ( Fig 1A), we should theoretically expect to improve expression precision (decrease noise) when protein level decreases for an equi-mRNA level, because the translational efficiency decreases (i.e. less proteins are produced per mRNA molecule). Thus, in this case, we expect the gene expression precision to be highest when the protein level is at the valley and lowest when the protein level is at its peak which we presume to be near to the optimal protein level ( Fig 1B: Theoretical plot). Based on estimations of coefficient of variations (CV, cellto-cell variations of protein level) for diverse transcription and translation rates in E. coli and S. cerevisiae, Hausser et al. (13) have shown that for a fixed transcriptional rate, CV is almost constant for diverse translational rates (13). Thus, the translational efficiency might not be the main driver of expression noise (Fig 1B: last plot). The availability of mRNA molecules drives the final noise, i.e. mainly the availability of transcriptional factors. Thus, this suggest that, in both cases, rhythmicity at protein level cannot be due to a selection on noise.
Finally, we remain attentive to the fact that all these considerations presume an immediacy between expression noise and biological functionality. The expression noise decreases with increasing transcription for an equi-protein level (13). In the case of rhythmic mRNAs abundance and constant protein level, the expression noise decreases when mRNAs level reaches its peak. B) Theoretical expression noise variation in the case of constant mRNAs abundance and rhythmic proteins. Theoretically, the expression noise is correlated with the translational efficiency. However, observations in E. coli and S. cerevisiae show that for a fixed transcriptional rate (that we assume to be correlated with mRNAs level), the expression noise (CV) is almost constant for diverse translational rates (13).

Noise level is based on a single timepoint scRNA dataset
Indeed, our noise estimation is reported at a single timepoint (unknown) for a cell population (scRNA). Since the peak times of rhythmic genes is largely distributed (Fig  2a), we expect the mean noise of a given time-point to be general to all time-points (Fig 2b).

Noise estimation
We compared several noise estimation methods and found that the F * polynomial degree of Barroso et al. (15) method was the best method across all datasets and especially the most efficient method in controlling for the effect of mean expression (S7 Table). To do this, we calculated the slope of the linear model which best fit the correlation between the noise and mean expression (Figs  12 and 13), as well as the R 2 and the Kendall correlation (τ ) , applied to normally distributed noise estimations. The linear model that explained the variance the least well (small R 2 ) was considered to be approximately the best expected value. Indeed, we expect a good noise estimation model to be independent of the mean gene expression (linear regression slope and Kendall's τ near to 0) and with a large range of noise values for every mean expression (small R 2 ). The F * polynomial degree of Barroso et al. (15) was the best method because it removed the correlation between mean expression and F * (smallest linear regression slope and Kendall's τ ) and keep a large range of noise values for every mean expression (smallest R 2 ) (S7 Table). If none of Kendall's correlation was not significantly uncorrelated (all Kendall's p-value<0.05), we used the polynomial degree which seem to maximally removed residual correlations (smallest Kendall's τ and smallest linear regression slope). Thus, we used F * : degree 3 for mouse liver (Kendall's τ = -0.0288, p-value = 1.6e-07, linear regression slope = 1.14e-13); degree 3 for mouse lung (Kendall's τ = -0.0545, p-value < 2.2e-16, linear regression slope = 3.40e-15); degree 3 for mouse muscle (Kendall's τ = -0.0795, p-value < 2.2e-16, linear regression slope = 9.76e-15); degree 4 for mouse heart (Kendall's τ = -0.0445, p-value < 2.2e-16, linear regression slope = -3.75e-13); degree 4 for mouse aorta (Kendall's τ = -0.0183, p-value = 4.70e-04, linear regression slope = 4.83e-14); degree 4 for mouse kidney (Kendall's τ = -0.0528, p-value < 2.2e-16, linear regression slope = 1.04e-13); and degree 4 for Arabidopsis roots (Kendall's τ = 0.0091, p-value = 0.1162, linear regression slope = -9.78e-15).   Mean expression SGE measure Figure 13: a-f ) Relationships between different stochastic gene expression (SGE) estimations and the mean gene expression, using discretive display with error-bars that represent the middle 95% of values. The 3 horizontal lines separate the values in 4 quartiles. The dots represent the median of the values. All parameters are log transformed. F * is the noise estimated by the method of Barroso et al. (15). F * min is the first polynomial degree which break the correlation between the noise with mean expression, and F * max is the next one.