Accounting for experimental noise reveals that transcription dominates control of steady-state protein levels in yeast

Cells respond to their environment by modulating protein levels through mRNA transcription and post-transcriptional control. Modest correlations between global steady-state mRNA and protein measurements have been interpreted as evidence that transcript levels determine roughly 40% of the variation in protein levels, indicating dominant post-transcriptional effects. However, the techniques underlying these conclusions, such as correlation and regression, yield biased results when data are noisy, missing systematically, and collinear (all properties of mRNA and protein measurements), which motivated us to revisit this subject. Noise-robust analyses of 25 studies of budding yeast reveal that mRNA levels explain roughly 80% of the variation in steady-state protein levels. Post-transcriptional regulation amplifies rather than competes with the transcriptional signal. Measurements are highly reproducible within but not between studies, and are distorted in part by between-study differences in gene expression. These results substantially revise current models of protein-level regulation and introduce multiple noise-aware approaches essential for proper analysis of many biological phenomena.


Introduction
Cellular protein levels reflect the balance of transcript levels, protein production by translation initiation and completion, and protein removal by degradation, secretion and dilution [1,2] (Figure 1A). The standard quantitative model for protein-level regulation is

$$\frac{dP_i}{dt} = \tau_i M_i - \delta_i P_i \qquad (1)$$

where $P_i$ is the cellular protein level (molecules per cell) of gene $i$, $M_i$ is the mRNA level, and $\tau_i$ and $\delta_i$ are the mRNA translation and net protein removal rates, respectively. At steady state, protein levels will be proportional to mRNA levels with proportionality constants of $\tau_i/\delta_i$, such that if rates of translation and removal did not vary by gene, steady-state mRNA and protein levels would correlate perfectly [1]. Consequently, the mRNA-protein correlation observed in global measurements of mRNA and protein levels has been intensely studied, and deviations from perfect correlation have been used to quantify the contribution of post-transcriptional processes to cellular protein levels [1][2][3][4][5][6]. The consensus emerging from these studies holds that, across organisms, transcriptional regulation explains 40-50% of the variation in steady-state protein levels, leaving half or more to be explained by post-transcriptional regulatory processes [2,4,[6][7][8][9]. Higher correlations are observed, generally for subsets of less than half the genome that are biased toward high-abundance mRNA and protein expression [1,6,10]. Low observed mRNA-protein correlations have motivated the search for alternate forms of regulation capable of accounting for the majority of protein-level variability [2,6,8]. Recent studies have indeed uncovered wide between-gene variation in post-transcriptional mechanisms such as translation rates [11] and protein degradation rates [2].
However, as frequently noted [1,4,6,12,13], noise in measurements can cause many of the observations attributed to post-transcriptional regulation. Here, noise encompasses variability due to cell-to-cell variation, growth conditions, sample preparation and other effects due to experimental design [14], and measurement biases and error [13]. Uncorrelated noise between mRNA and protein measurements will reduce the observed mRNA-protein correlation relative to the true value, while inflating the variation in measurements of translational efficiency and other post-transcriptional processes [15,16]. Empirically, disentangling noise effects from biological effects is critical for an accurate understanding of how cells regulate protein levels.
Rapid progress has been made in global measurement of transcript and protein levels by multiple methods, as underscored by recent high-coverage drafts of the human proteome [17,18]. These methods were largely pioneered in budding yeast, and have been replicated many times by different groups. Motivated by the ongoing and intense interest in the contribution of mRNA levels to protein levels, we were prompted to revisit the subject in this well-studied model eukaryote.

Results
We collected 38 measurements of mRNA levels and 20 measurements of protein levels from 14 and 11 separate studies respectively, each of haploid S. cerevisiae growing exponentially in shaken liquid rich medium with 2% glucose between 22 °C and 30 °C (Table S1). These data cover varying amounts of the genome and display a wide range of correlations between studies (Figure 1B, Pearson correlations on log-transformed values with zeros and missing values omitted). Although correlations of replicates within studies are quite high [6], with median r = 0.97 for mRNA and 0.93 for protein levels, between-study correlations are far more modest, r = 0.62 for mRNA measurements and 0.57 for protein measurements. That is, data from a typical mRNA study explain 39% of the variance in another study ($r^2 = 0.39$) and a typical protein study's results explain only 32% of the variance in another study, consistent with previous studies reporting wide variation between studies [9]. Strong outliers indicate high reproducibility for two pairs of studies (Figure 1B), but each such outlier is a correlation between separate studies done by the same research group, suggesting the presence of additional variability sources between groups. The high within-study reproducibility and low between-study reproducibility indicate the presence of large systematic errors between studies.
Correlations are modest even between studies using similar methods (e.g., r = 0.81 between two RNA-Seq datasets using Illumina instruments [11,19]). Comparing mRNA studies performed using similar or different methods on a shared set of 4,595 genes revealed little difference in reproducibility whether similar or different methods were used (Figure 1C, no t-test P < 0.05 for differences in correlation when comparing studies employing shared methods versus independent methods after false discovery rate correction).
Between-study correlations quantify the studies' mean ratio of true variance to total variance, termed the reliability [20,21] (see Methods). In turn, setting aside sampling error, the maximum observable correlation between any two datasets is equal to the geometric mean of their reliabilities. Because virtually all reported global mRNA-protein correlations involve mRNA and protein levels measured in separate studies, between-study reliabilities are the relevant quantity. The modest reliability values (setting aside those of the same group reporting two studies, which we exclude from this analysis) sharply limit the maximum observable mRNA-protein correlations. This limit has startling consequences: if steady-state mRNA and protein levels actually correlated perfectly (true r = 1.0), then given the median observed between-study correlations in Figure 1B, we would expect to observe mRNA-protein correlations of only $r = \sqrt{0.57 \times 0.62} \approx 0.60$. The data reveal a wide range of modest mRNA-protein correlations with a median of r = 0.54 (Figure 1C) quantified either by the Pearson correlation between log-transformed measurements or the nonparametric Spearman rank correlation (Figure S1; both measures produce similar results and we employ the former throughout). Coverage of the 5,887 verified protein-coding genes in yeast [22] also varies widely. The largest pair of datasets covers 4,367 genes and shows an mRNA-protein correlation of r = 0.618 ($r^2 = 0.38$, 38% of protein-level variance explained by mRNA levels), close to consensus values [6].
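The ceiling on observable correlations described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the study: the function name `max_observable_correlation` is hypothetical, and the rounded median between-study correlations quoted in the text are used as stand-ins for the reliabilities.

```python
import math

def max_observable_correlation(reliability_x, reliability_y, true_r=1.0):
    """Expected observed correlation, given the true correlation and the
    reliabilities (true variance / total variance) of the two measurements."""
    return true_r * math.sqrt(reliability_x * reliability_y)

# Assumption: the rounded medians quoted in the text (0.62 for mRNA,
# 0.57 for protein) stand in for the between-study reliabilities.
ceiling = max_observable_correlation(0.62, 0.57)
print(f"maximum observable mRNA-protein correlation: {ceiling:.2f}")
```

With these rounded inputs the result is about 0.59, close to the 0.60 quoted in the text, which presumably was computed from unrounded medians.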
Reduction of correlations by noise can be corrected using information from repeated measurements [16,21]. Quantitative corrections for correlation attenuation were first introduced more than a century ago by Spearman [16], are widely used in the social sciences [21,23,24], and have found recent applications in biology [20,[25][26][27]. Given two measurements each of variables X and Y, each with uncorrelated errors, the true correlation can be estimated using only correlations between the four measurements $X_1$, $X_2$, $Y_1$ and $Y_2$ (see SI Materials and Methods):

$$r_{\mathrm{true}} = \frac{\sqrt[4]{r_{X_1Y_1}\, r_{X_1Y_2}\, r_{X_2Y_1}\, r_{X_2Y_2}}}{\sqrt{r_{X_1X_2}\, r_{Y_1Y_2}}}$$

The correction reflects a simple intuition: the denominator quantifies the reliabilities of the measurements, which determine the maximum observable correlation, and the numerator quantifies the observed correlation using a geometric mean of four estimates and is divided by this maximum value to yield an estimate for the true value. In simulated data, this noise-corrected estimate accurately ascertains true correlations in the presence of noise far exceeding that apparent in most mRNA and protein data (Figure S2). The estimate is not itself a correlation coefficient, and may take values outside (−1, 1) due to sampling error [21] (cf. Figure S2B,C). Using Spearman's correction, we estimated mRNA-protein correlations for pairs of studies, obtaining a median corrected correlation of 0.92. Variability due to sampling error was large for small datasets as expected (cf. Figure S2) and decreased as dataset size increased, with estimates stabilizing for large datasets (> 3000 genes) at a mean of r = 0.88 ± 0.02 (Figure 1C). This value is echoed by consideration of the largest dataset with two mRNA [19,28] and two protein [29,30] measurements each. For these data, the four observed mRNA-protein correlations are r = 0.60, 0.63, 0.62 and 0.64, and the correlations between the two mRNA measurements and between the two protein measurements are $r_{\mathrm{mRNA}} = 0.86$ and $r_{\mathrm{protein}} = 0.57$ respectively, yielding the corrected estimate $r_{\mathrm{true}} = \sqrt[4]{0.60 \times 0.63 \times 0.62 \times 0.64} / \sqrt{0.86 \times 0.57} = 0.89$.
Extending these estimates to the full genome requires a more sophisticated approach. Measurements vary widely in coverage, are quantified on a range of scales arising from use of a diverse array of techniques, and cannot be assumed to have equal levels of noise. Even seemingly simple approaches to reduce noise, such as averaging measurements normalized to the same scale, are unworkable: only 16 proteins are detected by all 11 protein quantification studies, and these proteins are all highly abundant. Throwing out smaller datasets discards potentially valuable measurements, and it is unclear when to stop, since all datasets are incomplete to some degree.
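The worked Spearman correction for the largest four-measurement dataset can be reproduced as follows. This is an illustrative sketch only: the function name `spearman_corrected_correlation` is hypothetical, and because the text does not say which of the four observed correlations belongs to which pair of measurements, they are passed in arbitrary order (the geometric mean does not depend on it). The replicate correlations are the values quoted in the prose (0.86 and 0.57).

```python
import math

def spearman_corrected_correlation(r_x1y1, r_x1y2, r_x2y1, r_x2y2,
                                   r_x1x2, r_y1y2):
    """Noise-corrected correlation from two measurements each of X and Y.

    Numerator: geometric mean of the four observed cross-correlations.
    Denominator: geometric mean of the two replicate correlations, which
    estimates the maximum observable correlation.
    """
    numerator = (r_x1y1 * r_x1y2 * r_x2y1 * r_x2y2) ** 0.25
    denominator = math.sqrt(r_x1x2 * r_y1y2)
    return numerator / denominator

r_true = spearman_corrected_correlation(
    0.60, 0.63, 0.62, 0.64,   # observed mRNA-protein correlations
    r_x1x2=0.86,              # correlation between the two mRNA measurements
    r_y1y2=0.57,              # correlation between the two protein measurements
)
print(round(r_true, 2))  # → 0.89
```

Because the result is a ratio of estimates rather than a correlation coefficient, nothing constrains it to lie within (−1, 1) when the inputs are noisy.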
To address these challenges, we adapted structural equation modeling to admit nonrandomly missing data (see Methods). We introduce a structured covariance model (SCM) that explicitly accounts for structured noise arising from replicates and use of shared measurement techniques, explicitly estimates noise at multiple levels, and allows inferences of latent covariance relationships with imputation of missing data. The SCM (Fig. S3) recovers true correlations in simulated data when substantial data are missing nonrandomly (Fig. S2), and satisfies posterior predictive checks using real data (Fig. S4). Fitting the SCM yields estimates of mRNA and protein levels integrating all data (Figure 2A) and estimates a whole-genome steady-state mRNA-protein correlation of r = 0.91 across all 5,854 genes for which an mRNA transcript has been detected in at least one of the 38 mRNA quantitation experiments (Figure 1C). We emphasize that this method does not involve any attempt to maximize the mRNA-protein correlation or any assumptions about the strength of the correlation.
To evaluate accuracy of the SCM estimates, we scaled them to molecules per haploid cell using high-quality published values. Estimates of the number of mRNA molecules per cell range from 15,000 to 60,000 [31,32]. A more recent study argued that the earlier, lower estimate resulted from misestimation of mRNA mass per cell and average mRNA length, with 36,000 molecules per cell as a revised estimate also supported by independent measurements [33]. The higher estimate resulted from rescaling the lower estimate to match expression of five genes measured by single-molecule fluorescence in situ hybridization (FISH) [32]. We adopted 36,169 mRNA molecules per cell [33] and 4.95 pg total protein per haploid yeast cell [34], and compared the results to small-scale gold-standard independent measurements of absolute mRNA and protein levels not used in our analysis. (No gold-standard genome-scale measurements of mRNA or protein levels exist for yeast or any other organism.) SCM estimates of absolute mRNA levels matched FISH measurements well [32] (average difference of 1.2-fold between estimated and measured levels; Figure 2B), with one outlier estimate overshooting the FISH value by 1.7-fold. Notably, these results demonstrate that the FISH estimates are compatible with roughly 36,000 mRNA molecules per cell during exponential growth, and do not require the almost two-fold higher number advanced in the FISH study. Absolute protein levels for a set of 21 proteins differing up to 25,000-fold in cellular abundance have been measured using selected reaction monitoring (SRM) with spiked-in stable-isotope standards [35]. SCM estimates correlate better with these absolute levels (r = 0.93 between log-transformed values) than does any individual dataset, including the only study [30] which reports levels for all 21 proteins (r = 0.90) (Figure 2C, average difference of 1.2-fold between SRM measurement and SCM estimate). Relative protein levels estimated by integrating multiple datasets using an alternative approach in which noise is not modeled [9] correlate with absolute levels less well (r = 0.88). The structured covariance modeling approach thus estimates steady-state cellular mRNA and protein levels with an unmatched combination of completeness, precision, and accuracy.
To evaluate imputation of missing data, we focused on the 813 genes with a detected mRNA transcript but no protein detected in any of the 11 studies. Some of these genes encode well-studied proteins such as the proteasomal regulator Rpn4p and the cyclin Cln3p, indicating clear false negatives. Ribosome profiling [11] provides an estimate of mRNA translation rate, a contributor to steady-state protein level. At least one of two independent studies [11,36] detects ribosomes in the coding sequence of 542 of these 813 genes, suggesting active translation, and translation rate correlates with the imputed protein levels (Figure 2D, r = 0.39 and 0.41 with the two studies). Because the missing protein data correspond to genes at the detection limit of these ribosome profiling data (Figure 2D), we predict that many of the remaining genes will be found to produce proteins at low levels in exponential phase.
The structured covariance model provides direct estimates of dataset-specific noise levels, which allow us to inquire about the main sources of noise. Cell-to-cell variability and non-systematic instrument error cannot be dominant contributors, because the very high replicate correlations within studies, the vast majority of which are biological replicates, restrict the possible noise from these sources to less than 4% of the variance in mRNA levels and 6% for protein levels on average. We therefore examined the data for signs of systematic differences.
Because growth conditions perturb cell physiology, differences in cell culturing and harvesting may also contribute to noise. The 25 experiments in our dataset report culturing yeast cells to an optical density (OD, absorbance at 600 nm) of 0.36–1.0 or, when cell density was reported, from 0.3–4 × $10^7$ cells/mL. Budding yeast cells begin to deplete glucose and enter the diauxic shift at similar densities. Depletion of nutrients induces a stereotypic response in which instantaneous growth rate slows and, concomitantly, ribosomal protein gene expression is strongly repressed [37,38]. We reasoned that any differences arising from such transcriptional responses would introduce unintended variation, i.e., noise. This, in turn, would reduce the observed between-study mRNA-protein correlation.
To test for systematic gene regulatory responses as a cause of noise, we treated noise as if it were an experimental perturbation, and analyzed how gene expression depended upon the noise level. We calculated the slope in each gene's transcript level as a function of decreasing dataset noise quantified by the SCM-estimated signal-to-noise ratio. Many genes showed systematic increases or decreases in level with increasing noise (Figure 3A). GO process analysis on the top 100 genes by slope yielded "translation" and "cytoplasmic translation" as enriched terms ($P < 10^{-6}$), and ribosomal genes show systematically higher mRNA values in less-noisy datasets (Wilcoxon signed-rank test $P < 10^{-16}$) (Figure 3B). Because ribosomal proteins are highly abundant, we were concerned that some systematic regression toward the mean or other abundance-related effect might influence these results. As a control, we examined mRNA levels of genes encoding glycolytic enzymes, which have comparable abundance in yeast, but whose levels are not strongly responsive to cellular stress [38]. Glycolytic genes, exemplified by CDC19, showed no significant slope differences (P > 0.05). These results suggest systematic determinants of variability between experiments, consistent with nutrient depletion, which occurs under conditions virtually identical to those used to generate many of the analyzed samples.
Our results indicate that the true correlation between steady-state mRNA and protein levels in budding yeast is far higher than previously recognized, which might be taken as evidence that post-transcriptional regulation plays a minor role. Yet positive evidence exists for strong contributions from post-transcriptional regulatory processes, most prominently substantial per-gene variation in translational efficiency [11], prompting us to re-examine these results.
We focused first on the recent report that translation rates estimated by ribosome profiling explained more than twice as much protein-level variation as did measured mRNA levels [11]. We wondered whether these findings might reflect noisier mRNA measurements than translation-rate measurements. Consistent with this, correlations using SCM-integrated protein levels are substantially higher for both mRNA and translation rate (Figure 4A). Noise-corrected correlations indicate no significant difference in the predictive power of either measure for protein levels: both correlate with roughly r = 0.9 (Figure 4A).
Major contributions to protein levels from mechanisms other than mRNA level become obvious upon inspection of the data. The dynamic range of protein expression is much wider than that of mRNA levels [30]; in the SCM estimates, consistent with previous studies, the range of mRNA expression between genes at the 1st and the 99th percentile is 1,044-fold whereas the range of protein expression is 1,039,000-fold, a thousand times broader. A surprising consequence of the relative dynamic ranges of mRNA and protein expression, coupled with the strong correlation between mRNA and protein levels, is that absolute protein levels cannot be proportional to absolute mRNA levels at the genome scale. Equation 1 predicts that, given equal rates of translation and degradation, a gene with a thousand-fold higher mRNA level should have a thousand-fold higher protein level, but the data show that this estimate is too low by three orders of magnitude, indicating that rates of translation, degradation, or both must differ profoundly and systematically between genes.
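The dynamic-range argument above amounts to a short back-of-the-envelope calculation. The sketch below is illustrative and rests on a simplifying assumption not stated in the text: that protein level follows a pure power law of mRNA level across the 1st-to-99th percentile range, so that the exponent is the ratio of the log fold-ranges. Variable names are hypothetical.

```python
import math

mrna_fold_range = 1_044          # 1st-to-99th percentile range, mRNA
protein_fold_range = 1_039_000   # 1st-to-99th percentile range, protein

# Assumption: protein ~ mRNA ** exponent over this range.
implied_exponent = math.log10(protein_fold_range) / math.log10(mrna_fold_range)

# Under strict proportionality (exponent 1), the protein range would equal
# the mRNA range; the shortfall is the ratio of the two observed ranges.
shortfall = protein_fold_range / mrna_fold_range

print(f"implied exponent: {implied_exponent:.2f}")
print(f"proportional model falls short by about {shortfall:.0f}-fold")
```

The implied exponent comes out just under 2 and the shortfall just under a thousand-fold, matching the three orders of magnitude cited in the text.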
This simple analysis illustrates a fundamental asymmetry: although absence of post-transcriptional regulatory processes would produce a perfect mRNA-protein correlation [1], a perfect mRNA-protein correlation would not indicate a negligible post-transcriptional contribution to relative protein levels. In fact, contrary to the assumptions of some influential analyses, it is possible for mRNA levels and (for example) translation rates to each explain more than 50% of protein-level variation; all that is required is that these contributions not be independent.
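A small simulation makes the point about non-independent contributions concrete. This is a toy sketch under loudly stated assumptions, not a model fitted to the data: log mRNA levels are standard normal, the per-mRNA translation rate is constructed to rise with mRNA level plus independent scatter, removal rates are held constant, and all names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 5000

# Assumed toy model on the log scale (removal rate constant across genes):
#   log protein = log mRNA + log(translation rate per mRNA)
log_mrna = rng.normal(0.0, 1.0, n_genes)
# Translation rate per mRNA is correlated with mRNA level (not independent).
log_translation = log_mrna + 0.5 * rng.normal(0.0, 1.0, n_genes)
log_protein = log_mrna + log_translation

def r_squared(x, y):
    """Squared Pearson correlation: fraction of variance in y explained by x."""
    return np.corrcoef(x, y)[0, 1] ** 2

r2_mrna = r_squared(log_mrna, log_protein)
r2_translation = r_squared(log_translation, log_protein)
print(f"variance explained by mRNA level:       {r2_mrna:.2f}")
print(f"variance explained by translation rate: {r2_translation:.2f}")
```

In this construction both predictors explain well over half of the protein-level variance, which would be impossible if the two contributions were independent.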
As an example of a non-independent contribution, post-transcriptional processes can shape the dynamic range of protein levels compared to mRNA levels. Such a contribution can be quantified by the slope of the linear relationship between log-transformed protein and mRNA levels, which is the exponent relating the untransformed absolute levels.
Previous work has reported this slope to be roughly unity for smaller datasets using ordinary least squares (OLS) linear regression [10], a result we confirmed (Figure 4B). However, OLS regression assumes the independent variable is error-free [39,40] and thus it is improper to apply OLS regression to these data when the objective is to determine the functional relationship between variables [40]. As with correlations, error causes systematic underestimation of slopes, a phenomenon called regression dilution bias [39]. Indeed, the million-fold protein-level variation, compared to the thousand-fold mRNA-level variation, provides strong guidance that the actual slope is closer to 2 (protein levels are proportional to squared mRNA levels) than 1. Use of a noise-tolerant technique, ranged major-axis (RMA) regression [40], yielded substantially steeper slopes, with more-complete datasets producing larger slopes (Figure 4B). Also as with correlations, non-randomly missing data can cause underestimation of regression slopes due to restriction of range. We looked for this effect by analyzing datasets constructed using data from two of the largest studies [11,29], but only computing the RMA slope using genes with proteins detected in each of the smaller studies. Smaller artificial datasets yielded sharply reduced slopes (Figure 4C), confirming that missing data suffices to cause severe underestimation of the nonlinear relationship between mRNA and protein levels.
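Regression dilution bias can be illustrated with a brief simulation. The sketch below is not the ranged major-axis procedure used in the study; it only shows, under assumed toy parameters, how noise in the independent variable shrinks the OLS slope by the reliability of that variable. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_slope = 2.0

# Latent predictor with unit variance, and a response that depends on it exactly.
x_true = rng.normal(0.0, 1.0, n)
y = true_slope * x_true

# Observed predictor carries unit-variance noise, so its reliability is
# Var(true) / (Var(true) + Var(noise)) = 1 / (1 + 1) = 0.5.
x_obs = x_true + rng.normal(0.0, 1.0, n)
reliability = 1.0 / (1.0 + 1.0)

# np.polyfit with degree 1 returns [slope, intercept].
ols_slope = np.polyfit(x_obs, y, 1)[0]
print(f"true slope:                {true_slope:.2f}")
print(f"OLS slope on noisy x:      {ols_slope:.2f}")
print(f"true slope * reliability:  {true_slope * reliability:.2f}")
```

A true slope of 2 is recovered as roughly 1 when half the variance in the predictor is noise, which is the qualitative pattern the text describes for OLS fits to mRNA-protein data.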
The SCM approach, which accounts for both noise and missing data, yields an estimated slope of 2.2 (Figure 4B), consistent with the expectation derived from simple examination of the relative dynamic ranges. Residual noise unaccounted for by the model will tend to inflate this value, but all pairwise estimates exceed 1.0. Steady-state protein levels therefore reflect a dramatic amplification of the transcriptional signal: rather than competing with transcriptional regulation as often assumed, post-transcriptional regulation cooperates.
If translational regulation drives much of this cooperative amplification, as anticipated, then translation rate (the number of mRNAs multiplied by the translation rate per mRNA) must rise nonlinearly with mRNA level. This is visually clear from examination of the linear fit (slope = 1) compared to the RMA regression line (slope = 1.65, Figure 4D). Data from an independent study using a similar methodology show a slope of 1.70 (Figure 4E). Thus, most of the superlinear relationship between mRNA and protein levels can be attributed to translational regulation, likely at the level of translation initiation.

Discussion
Our results demonstrate that the widely accepted consensus that steady-state mRNA levels explain less than half (~40%) of the variation in protein levels is a significant underestimate; the true value, taking into account the reduction in correlation due to experimental noise, is closer to 80%.
Our study is restricted to a single well-studied growth condition for a single well-studied organism. The principles of accounting for noise, but not precise results, can and should be extrapolated to regulatory contributions in other settings and other organisms. An influential study on mouse fibroblasts measured mRNA and protein levels and degradation rates for thousands of genes [2], concluding that mRNA levels explained 41% of the variation in protein levels. However, a recent follow-up study concluded that, once effects of error and missing data were accounted for, mRNA levels explain 75% or more of the protein-level variation in these data [13]. Although translation rates were inferred to cause most protein-level variation in the original study, measured translation-rate variation is insufficient to explain the observed protein-level variation [13]. Our results support similar conclusions.
The strong correlation between steady-state mRNA and protein levels may seem to validate the use of mRNA levels as relatively faithful proxies of protein levels. We urge caution, as a tempting conclusion, that mRNA changes serve as faithful proxies for protein changes, does not follow. Attempts to infer the correlation between transcript and protein changes from steady-state mRNA-protein correlations confuse two distinct and complex phenomena. The genome-scale relationship between mRNA levels and protein levels is an evolved property of the organism, reflecting natural selection's tuning of each gene's transcriptional and post-transcriptional controls, not merely an input-output relationship between mRNA and protein. Two genes with steady-state mRNA levels differing by 10-fold may have 100-fold differences in protein levels due to evolved differences in their post-transcriptional regulation. This information does not indicate how the protein level for a gene will change if its transcript level is induced 10-fold in a cell, because no regulatory evolution is possible at this timescale.
A related consequence is that the number of proteins per mRNA, often treated as roughly constant, increases steeply with gene expression level. The increased density of ribosomes on high-expression transcripts suggests increased rates of translation initiation as a major contributor to this evolved nonlinearity. Consistent with this, recent work has shown that in yeast and a wide range of other organisms, the stability of mRNA structures in the 5' region weakens as expression level increases, favoring more efficient translation initiation [41].
Our results underscore the urgent need for genome-scale gold-standard measurements of absolute mRNA and protein levels to enable identification and correction of systematic errors in widely used gene-expression measurement techniques. That different groups have, as yet, been unable to reliably reproduce these bread-and-butter measurements using different methods implies that advantages can be gained in improved accuracy, rather than mere precision.

Reliability
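We wish to measure latent variables $\xi$ and $\eta$ but, due to noise, actually observe variables $X = \xi + \epsilon$ and $Y = \eta + \zeta$, where the random noise variables $\epsilon$ and $\zeta$ have zero mean and are uncorrelated with the latent variables and with each other. The reliability

$$\alpha_X = \frac{\mathrm{Var}(\xi)}{\mathrm{Var}(X)} = \frac{\mathrm{Var}(\xi)}{\mathrm{Var}(\xi) + \mathrm{Var}(\epsilon)}$$

quantifies the ratio of latent-variable variance to total (latent plus noise) variance in $X$. Given two random variables $X_1 = \xi + \epsilon_1$ and $X_2 = \xi + \epsilon_2$ representing replicate measurements of $\xi$, the latent (true) variance can be estimated by

$$\mathrm{Cov}(X_1, X_2) = \mathrm{Cov}(\xi + \epsilon_1,\ \xi + \epsilon_2) = \mathrm{Var}(\xi),$$

where the error terms vanish because they are uncorrelated. Thus, the expected correlation between replicates is

$$r_{X_1 X_2} = \frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{\mathrm{Var}(X_1)\,\mathrm{Var}(X_2)}} = \frac{\mathrm{Var}(\xi)}{\sqrt{\mathrm{Var}(X_1)\,\mathrm{Var}(X_2)}} = \sqrt{\alpha_{X_1}\,\alpha_{X_2}},$$

which is the geometric mean of the reliabilities of the two measurements.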

Spearman's correction
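We wish to measure the Pearson correlation coefficient between latent variables,

$$r_{\xi,\eta} = \frac{\mathrm{Cov}(\xi, \eta)}{\sqrt{\mathrm{Var}(\xi)\,\mathrm{Var}(\eta)}},$$

but, due to noise, actually observe

$$r_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\mathrm{Cov}(\xi, \eta)}{\sqrt{\left(\mathrm{Var}(\xi) + \mathrm{Var}(\epsilon)\right)\left(\mathrm{Var}(\eta) + \mathrm{Var}(\zeta)\right)}},$$

where $\epsilon$ and $\zeta$ denote the noise in $X$ and $Y$. Uncorrelated noise has no average effect on the numerator because errors cancel (see above), but the error terms in the denominator do not cancel. This effect additively inflates the variances in the denominator, biasing the observed correlations downward relative to the truth. Given the reliabilities $\alpha_X$ and $\alpha_Y$, Spearman's correction is given by

$$r_{\xi,\eta} = \frac{r_{X,Y}}{\sqrt{\alpha_X\,\alpha_Y}}$$

with equality in expectation. Given two measurements each of $X$ and $Y$, all with different unknown reliabilities, the true correlation can then be estimated using only correlations between measurements:

$$r_{\xi,\eta} = \frac{\sqrt{r_{X_1Y_1}\, r_{X_2Y_2}}}{\sqrt{r_{X_1X_2}\, r_{Y_1Y_2}}}.$$

We extend this estimate to

$$r_{\xi,\eta} = \frac{\sqrt[4]{r_{X_1Y_1}\, r_{X_2Y_2}\, r_{X_1Y_2}\, r_{X_2Y_1}}}{\sqrt{r_{X_1X_2}\, r_{Y_1Y_2}}},$$

which again has expected value $r_{\xi,\eta}$ and has the further desirable properties of exploiting all pairwise correlations and being independent of the choice of indices. In practice, each of the correlations contributing to Spearman's correction is replaced with a correlation estimated from the data, such that the result is also an estimate of the true correlation.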
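The behavior of the extended correction can be checked by simulation. The following sketch is illustrative only: the true correlation, noise levels, sample size and variable names are arbitrary assumptions, chosen so that the four measurements have different reliabilities as in the derivation above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
true_r = 0.9

# Latent variables with unit variance and correlation true_r.
xi = rng.normal(size=n)
eta = true_r * xi + np.sqrt(1.0 - true_r**2) * rng.normal(size=n)

# Two noisy measurements of each latent variable, with independent errors
# of different (assumed) magnitudes, hence different reliabilities.
x1 = xi + 0.3 * rng.normal(size=n)
x2 = xi + 0.8 * rng.normal(size=n)
y1 = eta + 0.5 * rng.normal(size=n)
y2 = eta + 1.0 * rng.normal(size=n)

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

# Geometric mean of the four cross-correlations over the geometric mean
# of the two replicate correlations.
numerator = (corr(x1, y1) * corr(x1, y2) * corr(x2, y1) * corr(x2, y2)) ** 0.25
denominator = np.sqrt(corr(x1, x2) * corr(y1, y2))
corrected = numerator / denominator

observed = corr(x1, y1)   # best single observed correlation in this setup
print(f"observed (x1, y1): {observed:.2f}")
print(f"corrected:         {corrected:.2f}")
print(f"true:              {true_r:.2f}")
```

Even the least noisy pair of measurements underestimates the true correlation, while the corrected estimate recovers it to within sampling error without any knowledge of the individual noise levels.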

Data collection
We gathered 16 data sets that measure mRNA expression and 11 that measure protein concentrations, mostly published, yielding a total of 58 high-throughput measurements of mRNA and protein levels from 5,854 genes in budding yeast. The measurements were taken using different technologies including custom and commercial microarrays, high-throughput sequencing and mass spectrometry. All yeast cultures were growing in rich media and sampled during the exponential growth phase. Details of the data sets are summarized in Table 1.

The structured covariance model (SCM)
The model has two components: an observation model $p(I_{i,j} \mid X_{i,j})$, which provides the probability of observing a value for mRNA/protein $i$ in replicate $j$, given the underlying mRNA/protein level, and a hierarchical model $p(X_{i,j} \mid \ldots)$ for the underlying mRNA/protein levels themselves. The full model is specified as

$$T_{i,t} \sim N_{N_T}(0, \tau_t) \qquad (10)$$

Random variables $L_{i,l}$ correspond to the true denoised protein ($l = 1$) and mRNA ($l = 2$) levels, for mRNAs and proteins $i = 1, \ldots, N$, and $L_i = [L_{i,1}, L_{i,2}]'$. The random variables $T_{i,t}$ and $E_{i,k}$ capture common technological variation and batch effects, respectively, $t = 1, \ldots, N_t$, $k = 1, \ldots, N_E$. $R_{i,j}$ are measurement noise for replicate $j = 1, \ldots, N_R$.
Both technology effects and batch effects between experiments are assumed to be independent, $\mathrm{Cov}(T_{i_1,t_1}, T_{i_2,t_2}) = 0$ if $t_1 \neq t_2$, and $\mathrm{Cov}(E_{i_1,k_1}, E_{i_2,k_2}) = 0$ if $k_1 \neq k_2$. Measurement noise is independent between replicates, $\mathrm{Cov}(R_{i_1,j_1}, R_{i_2,j_2}) = 0$ if $j_1 \neq j_2$. The parameter $\nu_j$ reflects replicate-specific bias common to all mRNAs/proteins. The coefficient $G_k$ is an experiment-specific scaling factor for the true underlying expression and abundance, and reflects the amount of post-transcriptional amplification.

Missing data model
Equation 13 models the probability that measurement $X_{i,j}$ is missing, $p(I_{i,j} = 0)$, as a logistic function of the value of the measurement. The parameters of the missing data mechanism, $\eta^0_k$ and $\eta^1_k$, are shared by all replicates within an experiment; they uniquely determine the probability that measurements are observed, conditional on $X_{i,j}$.
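A logistic missingness mechanism of the kind described above can be sketched as follows. The exact parameterization of Equation 13 is not reproduced in this text, so the form used here, with the probability of observation equal to the logistic function of `eta0 + eta1 * x`, is an assumption, as are the function name `prob_observed` and all numerical values.

```python
import numpy as np

def prob_observed(x, eta0, eta1):
    """Assumed form: P(I = 1 | x) = logistic(eta0 + eta1 * x).
    The probability that the value is missing is 1 minus this."""
    return 1.0 / (1.0 + np.exp(-(eta0 + eta1 * x)))

rng = np.random.default_rng(3)

# Hypothetical log-scale abundances for one experiment.
log_levels = rng.normal(0.0, 2.0, 10_000)

# With eta1 > 0, low-abundance values are preferentially missing.
p_obs = prob_observed(log_levels, eta0=0.0, eta1=1.5)
observed_mask = rng.random(log_levels.size) < p_obs

print(f"fraction observed:        {observed_mask.mean():.2f}")
print(f"mean of all values:       {log_levels.mean():.2f}")
print(f"mean of observed values:  {log_levels[observed_mask].mean():.2f}")
```

Under this assumed mechanism the observed values are biased toward high abundance, which is the kind of non-random missingness that restricts range and distorts correlations and slopes if left unmodeled.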

Prior specifications
To complete the model specifications we place priors on the latent correlation matrix, $\tau_t$, $\xi_k$, $\theta_j$, $\eta^0_k$ and $\eta^1_k$. We use either flat or weakly informative priors on all parameters so as to bias the inference as little as possible. For the parameters $\eta^0_k$ and $\eta^1_k$ of the logistic observation model we use a Cauchy prior centered at zero with scale 2.5 as suggested by [42]. We assume flat priors on the scaling factors, $G_k$, and the measurement bias parameters $\nu_j$. For the replicate and experiment variances $\theta_j$ and $\xi_k$ we use independent conjugate Inv-Gamma(3/2, 3/10) priors. Finally, for the estimand of interest, we assume the latent correlation matrix is a priori drawn from the set of correlation matrices with marginally uniform correlations [43].

Figure 1. C, Correlations between studies sharing the same quantification method or different methods (dark and light gray bars, respectively), using mRNA datasets with 5000 genes (4,595 genes quantified by all datasets). For example, the second column from the left shows the 18 correlations between each of three commercial microarray studies and six studies using custom microarrays or RNA-Seq. D, Large-scale datasets vary widely in coverage of 5,887 yeast coding sequences and in resulting estimates of the mRNA-protein correlation. Shown are all pairwise correlations between 14 mRNA and 11 protein datasets, with within-study replicates averaged if present. Correlations are shown between mRNA and protein levels reported without correction (dots); using Spearman's correction on pairs of datasets (binned, boxes show mean and bars indicate standard deviation); using Spearman's correction on the largest set of paired measurements (red box); and as estimated by structured covariance modeling for 5,854 genes with a detected mRNA or protein (red diamond). E, Correlations obtained for the largest set of paired measurements, two of mRNA and two of protein levels (N=3,418).
Figure 2. Integrated estimates of mRNA and protein levels using structured covariance modeling (SCM). A, Integrated estimates across 58 global measurements reveal a strong genome-wide dependence between steady-state protein and mRNA levels (r = 0.91). Light gray points and marginal density indicate genes with detected mRNA but no detected protein. B, Absolute mRNA level estimates versus single-molecule fluorescence in situ hybridization counts [32]. C, Absolute protein level estimates versus stable-isotope-standardized single reaction monitoring measurements [35]. Dotted lines in B and C show perfect agreement. D, Evidence for active translation of undetected proteins inferred from ribosome profiling; data from one [36] of two [11] studies. Dashed line shows ranged major-axis regression best fit. Marginal densities show ribosome density for all detected transcripts (medium gray), all transcripts with a detected protein (dark gray), and transcripts with no detected protein (light gray).
Figure 3. Cellular responses linked to growth are apparent in gene expression data. A, Gene expression varies systematically with noise; shown are normalized mRNA levels for genes encoding large ribosomal protein 23A (RPL23A), the glycolytic enzyme pyruvate kinase 1 (CDC19), and a proteasome lid subunit (RPN6). Lines show linear fits; slopes for RPL23A and RPN6 are significantly nonzero with P < 0.05. B, Expression of classes of genes changes systematically with noise. Box and whisker plots show all genes with at least 25 measurements (N=5,326), 133 ribosomal proteins, and 20 glycolytic enzymes. Wilcoxon signed-rank tests: ***, $P < 10^{-16}$; n.s., P > 0.05.

Figure 1. Quantification and consequences of noise on the correlation between measurements of steady-state mRNA and protein levels. A, Steady-state protein levels reflect the balance of mRNA translation and protein removal. B, Global measurements of mRNA and protein levels vary widely in reproducibility and coverage. Each point represents a pair of studies. Dots show between-study correlations (median shown by dashed line), a measure of reliability. Dotted line, median of within-study correlations. Blue dots show pairs of studies from the same research group. C, Correlations between studies sharing the same quantification method or different methods (dark and light gray bars, respectively), using mRNA datasets with 5000 genes (4,595 genes quantified by all datasets). For example, the second column from the left shows the 18 correlations between each of three commercial microarray studies and six studies using custom microarrays or RNA-Seq. D, Large-scale datasets vary widely in coverage of 5,887 yeast coding sequences and in resulting estimates of the mRNA-protein correlation. Shown are all pairwise correlations between 14 mRNA and 11 protein datasets, with within-study replicates averaged if present. Correlations are shown between mRNA and protein levels reported without correction (dots); using Spearman's correction on pairs of datasets (binned, boxes show mean and bars indicate standard deviation); using Spearman's correction on the largest set of paired measurements (red box); and as estimated by structured covariance modeling for 5,854 genes with a detected mRNA or protein (red diamond). E, Correlations obtained for the largest set of paired measurements, two of mRNA and two of protein levels (N=3,418).

Figure 4. Transcriptional and translational regulation act coherently to set protein levels. A, The correlation of mRNA (light gray) and rates of translation (dark gray) reported in the original ribosome-profiling study, using averaged mRNA and protein levels, and corrected for noise using Spearman's correction on the same set of genes (N=3,266). Diamond shows whole-proteome SCM estimate. B, The exponent relating protein and mRNA concentrations estimated by noise-blind (ordinary least squares) and noise-aware (ranged major-axis) regression analyses. Gray points, all pairs of datasets; black points, pairs of datasets with > 3500 measurements. Dotted line shows perfect agreement; dashed line marks SCM estimate. C, Missing data leads to underestimation of the mRNA-protein exponent. The exponent from two large mRNA and protein studies was computed after limiting analysis to only genes with proteins detected in each of the 11 protein studies. D, Ribosome density depends nonlinearly on mRNA level. Dashed line shows linear (slope = 1) fit. Solid gray line shows RMA regression fit. E, mRNA-ribosome-density exponents estimated from independent studies [11,36].

Table 1. List of mRNA data sets (above the midline) and protein concentration data sets (below the midline). The number of replicates in each data set is given after the technology name.