Post-transcriptional regulation across human tissues

Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level variability, and (ii) the factors determining the physiological variability of the same protein across different tissue types, i.e., across-tissues variability. We sought to estimate the contribution of transcript levels to these two orthogonal sources of variability, and found that scaled mRNA levels can account for most of the mean-level variability but not necessarily for across-tissues variability. The reliable quantification of the latter estimate is limited by substantial measurement noise. However, protein-to-mRNA ratios exhibit substantial across-tissues variability that is functionally concerted and reproducible across different datasets, suggesting extensive post-transcriptional regulation. These results caution against estimating protein fold-changes from mRNA fold-changes between different cell-types, and highlight the contribution of post-transcriptional regulation to shaping tissue-type-specific proteomes.


Introduction
The relative ease of measuring mRNA levels has facilitated numerous investigations of how cells regulate their gene expression across different pathological and physiological conditions (Sørlie et al, 2001; Slavov and Dawson, 2009; Spellman et al, 1998; Slavov et al, 2011, 2012; Djebali et al, 2012). However, often the relevant biological processes depend on protein levels, and mRNA levels are merely proxies for protein levels (Alberts et al, 2014). If a gene is regulated mostly transcriptionally, its mRNA level is a good proxy for its protein level. Conversely, post-transcriptional regulation (PTR) can set protein levels independently from mRNA levels, as in the cases of classical regulators of development (Kuersten and Goodwin, 2003), cell division (Hengst and Reed, 1996; Polymenis and Schmidt, 1997) and metabolism (Daran-Lapujade et al, 2007; Slavov et al, 2014a). Thus, understanding the relative contributions of transcriptional and post-transcriptional regulation is essential for understanding their trade-offs and the principles of biological regulation, as well as for assessing the feasibility of using mRNA levels as proxies for protein levels. Some studies of these relative contributions have concluded that protein levels depend mostly on mRNA levels (Li et al, 2014; Jovanovic et al, 2015; Csárdi et al, 2015), while other studies have concluded the opposite, i.e., that protein levels depend mostly on post-transcriptional regulation (Gygi et al, 1999; Smits et al, 2014; Schwanhäusser et al, 2011). These differing views arise because of differences in the systems, the methods, and the quantified protein variance. In particular, correlations between absolute levels of mRNA and protein conflate many sources of variation, including variability between the levels of different proteins, variability within the same protein across different conditions and cell-types, and variability due to measurement error and technological bias.
However, these different sources of variability have very different biological interpretations and implications. A major source of variability in protein and mRNA data arises from differences between the levels of mRNAs and proteins corresponding to different genes. That is, the mean levels (averaged across tissue-types) of different proteins and mRNAs vary widely. We refer to this source of variability as mean-level variability. This mean-level variability reflects the fact that some proteins, such as ribosomal proteins, are highly abundant across all profiled conditions while other proteins are orders of magnitude less abundant across all profiled conditions. Another principal source of variability in protein levels, intuitively orthogonal to the mean-level variability, is the variability within a protein across different physiological conditions or cell-types. This variability reflects normal physiological regulation, which we refer to as physiological variability, and is usually much smaller in magnitude. However, physiological variability is frequently the most relevant source of variability for understanding different phenotypes across cell-types and physiological conditions.
Here, we separately quantify the contributions of transcriptional and post-transcriptional regulation to the mean-level variability and to the physiological variability across human tissues.
Our results suggest that the physiological variability across human tissues is dominated by post-transcriptional regulation, while the mean-level variability is dominated by transcriptional regulation. These findings reconcile previous results in the literature (Gygi et al, 1999; Schwanhäusser et al, 2011; Li et al, 2014; Wilhelm et al, 2014; Jovanovic et al, 2015; Csárdi et al, 2015; Smits et al, 2014) and highlight the dominance of post-transcriptional regulation in determining the variability in the levels of a protein across cell-types and physiological conditions. We then suggest a simple and general approach for deconvolving the contributions of transcriptional and post-transcriptional regulation to measured protein levels.

Results
The correlation between absolute mRNA and protein levels conflates distinct sources of variability
We start by outlining the statistical concepts underpinning the common correlational analysis and depiction (Gygi et al, 1999; Schwanhäusser et al, 2011; Wilhelm et al, 2014; Csárdi et al, 2015) of estimated absolute protein and mRNA levels as displayed in Figure 1a. The magnitude of the correlation between the absolute mRNA and protein levels of different genes and across different tissue-types is used to estimate the level at which the protein levels are regulated (Wilhelm et al, 2014). If the physiological variability of a gene is dominated by transcriptional regulation, its protein-to-mRNA ratio in different tissue-types will be a gene-specific constant. Based on this idea, Wilhelm et al (2014) estimated these protein-to-mRNA ratios. They suggested that the median ratio for each gene can be used to scale its tissue-specific mRNA levels and that this "scaled mRNA" accurately predicts tissue-specific protein levels. Indeed, scaled mRNA levels explain a large fraction of the total variance ($R^2_T = 0.77$, across 6104 measured proteins, Figure 1a), as previously observed (Schwanhäusser et al, 2011; Wilhelm et al, 2014). However, $R^2_T$ quantifies the fraction of the total protein variance explained by mRNA levels between genes and across tissue-types; thus, it conflates the mean-level variability with the physiological variability. This conflation is shown graphically in Figure 1b for a subset of 100 genes measured across 12 tissues. The physiological variability is captured by the variability within the regression fits and the mean-level variability is captured by the variability between the regression fits.
The aggregation of distinct sources of variation, where different subgroups of the data show different trends, may lead to the effect known as Simpson's or amalgamation paradox in the statistical literature, which can lead to counter-intuitive results and incorrect conclusions (Blyth, 1972).
To illustrate Simpson's paradox in this context, we chose a subset of genes for which the scaled mRNA and measured protein levels are negatively correlated across tissues, and the mean-level variability spans the full dynamic range of the data. For this subset of genes, the overall (conflated/amalgamated) correlation is large and positive, despite the fact that all within-gene trends are negative. This counter-intuitive result is possible because the conflated correlation is dominated by the variability with the larger dynamic range, in this case the mean-level variability. This conceptual example, taken from the Wilhelm et al (2014) data, demonstrates that $R^2_T$ is not necessarily informative about the physiological variability, i.e., the protein variance explained by scaled mRNA within a gene ($R^2_P$). Thus the conflated correlation is not generally informative about the level (transcriptional or post-transcriptional) at which physiological variability is regulated. However, it is exactly the physiological variability across tissue-types that provides the biological identity of each tissue type. This physiological variability has a dynamic range of about 2 to 10 fold and is thus dwarfed by the $10^4$ fold dynamic range of abundances across different proteins.
To further demonstrate the implications of this vast difference in the dynamic ranges, we generate data from a simple model using the observed between-tissue variability in the protein-to-mRNA ratio. This protein/mRNA ratio has been referred to as a gene's "translational efficiency" because it reflects, in large part, its translational rate. Since this ratio also reflects other layers of regulation, such as protein degradation (Jovanovic et al, 2015), we will refer to it as a PTR ratio.
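The amalgamation effect described above can be reproduced with a small simulation. The sketch below is illustrative only: the gene count, tissue count, noise levels, and the helper names `pearson` and `simulate` are assumptions chosen for the example, not values or code from the analyzed dataset. It builds log-scale data in which gene means span about four orders of magnitude while every within-gene trend is negative.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate(n_genes=100, n_tissues=12, seed=0):
    """Toy log10 data (hypothetical parameters): gene means span ~4 orders
    of magnitude; within a gene, mRNA and protein deviations are opposed."""
    rng = random.Random(seed)
    mrna, protein = [], []
    for _ in range(n_genes):
        mean_level = rng.uniform(0.0, 4.0)   # mean-level variability
        m_row, p_row = [], []
        for _ in range(n_tissues):
            d = rng.gauss(0.0, 0.3)          # physiological variability
            m_row.append(mean_level + d)
            p_row.append(mean_level - d + rng.gauss(0.0, 0.05))
        mrna.append(m_row)
        protein.append(p_row)
    return mrna, protein

mrna, protein = simulate()

# Conflated correlation: pool all genes and tissues together.
pooled_r = pearson([v for row in mrna for v in row],
                   [v for row in protein for v in row])

# Physiological correlation: one value per gene, across tissues.
within_r = [pearson(m, p) for m, p in zip(mrna, protein)]

print(f"pooled r = {pooled_r:.2f}")
print(f"largest within-gene r = {max(within_r):.2f}")
```

The pooled correlation comes out large and positive even though each gene's own correlation is negative, because the pooled value is dominated by the between-gene spread.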

Physiological mRNA variability is a poor predictor of the physiological protein variability
Figure 1 illustrates the statistical problems with using the fraction of the total protein variance explained by scaled mRNA levels ($R^2_T$) as an indication of the extent to which mRNA changes contribute to protein changes across tissues (i.e., $R^2_P$). To investigate the significance of this conflation further, we next evaluated the differences between scaling mRNA with the median PTR ratio (as in Figure 1a) and scaling mRNA with the PTR ratio of a specific tissue. That is, instead of using the median PTR ratio, we can use the PTR ratio estimated from one tissue to scale the mRNA level from another tissue. For instance, we correlate the protein levels measured in the uterus to the uterine mRNA levels scaled by the prostate PTR ratio (Figure 2a). This correlation is lower than the correlation obtained when mRNA is scaled by the median PTR ratio (Figure 1a). This reduction underscores that the PTR ratio for a gene varies enough between tissue types to affect even the conflated variability $R^2_T$. Extending this analysis to more pairs of tissues (Figure 2b, c) yields very similar results: in all cases, the correlation is around 0.5, substantially smaller than the 0.9 correlation observed when mRNA is scaled by the median PTR ratio (Figure 1a).
Despite the very similar correlations between measured protein and scaled mRNA levels for all 3 comparisons in Figure 2a-c, the corresponding correlations between the protein and mRNA fold-changes differ substantially (Figure 2d-f). Unlike the buffering of mRNA variability observed across species and individuals (Khan et al, 2013; Battle et al, 2014), the protein levels generally vary much more across tissues than the mRNA levels (Supporting Figure 1); thus the protein and the mRNA fold-changes in Figure 2d-f are plotted on different scales. The fold-change comparisons in Figure 2d-f demonstrate that, in fact, the fraction of variance in protein fold-changes explained by mRNA fold-changes is usually small and depends strongly on the compared pair of tissues. For instance, the mRNA fold-changes between the uterus and prostate have essentially no predictive power for protein fold-changes across these tissues. At the same time, other tissues show a moderate fold-change correlation (e.g., prostate vs. kidney and uterus vs. kidney). The fraction of the variance in protein fold-changes that can be explained by mRNA fold-changes varies significantly across the three examples. However, $R^2_T$ remains high and constant in all three examples (Figure 2a-c), because it is dominated by the mean-level variability. This result underscores the general problem of variance conflation in the analysis of measured mRNA and protein levels.
Next, we sought to evaluate whether physiological variability of mRNAs can serve as a proxy for the physiological variability of proteins. For this analysis, we extend our results on fold-change correlations from Figure 2 to all pairwise combinations of tissue-types. The range of correlations in Figure 3a indicates that for some pairs of tissues, physiological variability in mRNA explains a significant fraction of the physiological protein variance, but for other tissue-type pairs it does not. This result indicates that for most genes, the observed variability across at least some tissues is due either to measurement noise or to post-transcriptional regulation. Before focusing on distinguishing between these two possibilities, we investigated whether the physiological variability of some genes is regulated primarily transcriptionally. If so, the protein fold-changes of such genes may be predicted reliably from mRNA fold-changes. To this end, we quantify the error in predicting protein fold-changes from mRNA fold-changes. The cumulative distribution of errors (Figure 3b) indicates that the protein fold-changes for fewer than 1000 genes can be estimated from mRNA fold-changes with less than 100% error. For over 30% of proteins, estimating protein levels using a single gene-specific PTR ratio results in over 1000% error; see Methods.

Coordinated post-transcriptional regulation of functional gene sets
The lack of correlation between protein and mRNA fold-changes can reflect large measurement noise rather than post-transcriptional regulation (Li et al, 2014; Franks et al, 2015). The noise contribution to the variability of the PTR ratios of a gene is independent of the function of the gene. Conversely, genes with similar functions are likely to be regulated similarly and thus have similar tissue-type-specific PTR ratios. Thus, we explored whether the across-tissues variability of the PTR ratios of functionally related genes reflects such tissue-type-specific and biological-function-specific post-transcriptional regulation.
For this analysis, we define the "relative PTR ratio" (rPTR) of a gene in a given tissue to be the PTR ratio in that tissue divided by the median PTR ratio of the gene across the other 11 tissues.
We evaluated the significance of rPTR variability for a gene-set in each tissue-type by comparing the corresponding gene-set rPTR distribution to the rPTR distribution for those same genes pooled across the other tissues (Figure 4); we use the KS-test to quantify the statistical significance of differences in the rPTR distributions; see Methods. Our results indicate that the genes from many GO terms (Consortium et al, 2004) have much higher rPTR in some tissues than in others. For example, the ribosomal proteins of the small subunit (40S) have high rPTR in kidney but low rPTR in stomach (Figure 4a-b). Some of these trends can account for fundamental physiological differences between tissue types. For example, the kidney is by far the most metabolically active (energy consuming) tissue among the 12 profiled tissues (Hall, 2010) and it has very high rPTR for many gene sets involved in energy production (Figure 4a). In this case, post-transcriptional regulation very likely plays a functional role in meeting the high energy demands of kidneys. Moreover, the fact that we observe a highly significant (posterior error probability $< 10^{-10}$) mode of rPTR (such as increased rPTR for mitochondrial genes and decreased rPTR for focal adhesion in kidney) indicates that at least some of the variability in post-transcriptional regulation across tissue-types reflects regulatory activity rather than measurement noise.

Quantifying post-transcriptional regulation across human tissues
The results in Figure 4 demonstrate that some of the physiological variability of protein levels is due to post-transcriptional regulation, not noise. To further quantify the fractions of physiological protein variability due to transcriptional regulation, post-transcriptional regulation, and noise, we need to take noise into account. Both RNA-seq (Marioni et al, 2008; Consortium et al, 2014) and mass-spectrometry (Schwanhäusser et al, 2011; Peng et al, 2012) have relatively large and systematic errors in estimating absolute levels of mRNAs and proteins, i.e., the ratios between different proteins/mRNAs. These errors originate from DNA sequencing GC-biases, and from variations in protein digestion and peptide ionization. However, relative quantification of the same gene across tissue-types by both methods is much more accurate, since systematic biases are minimized when taking ratios between the intensities/counts of the same peptide/DNA-sequence measured in different tissue types (Ong et al, 2002; Blagoev et al, 2004; Consortium et al, 2014; Jovanovic et al, 2015). It is this relative quantification that is used in estimating physiological variability, and thus noise levels are much smaller than the noise of absolute quantification.
To quantify the transcriptional and post-transcriptional contributions to physiological protein variability, we start by estimating the reliability of the measurements (Figure 5a, b). Reliability is defined as the fraction of the observed/empirical variance due to signal. Thus reliability increases with the signal strength and decreases with the noise level. For both protein and mRNA, we use independent estimates of their fold-changes between salivary and adrenal glands.
In the case of mRNA, these independent estimates correspond to replicate RNA-seq measurements (Figure 5a). For proteins, the independent estimates were derived from non-overlapping sets of peptides; that is, the fold-change of each protein with multiple quantified peptides in the salivary and adrenal glands was estimated from two non-overlapping sets of peptides (Figure 5b).
Taking into account the reliability of the measurements, we depict the upper bound for the fraction of physiological protein variability that can be explained by mRNA levels (i.e., transcriptional regulation) in Figure 5c. To account for any uncertainty in the reliability estimates, we depict the fraction of explained variance for a wide range of reliability estimates. At the reliability estimated for this dataset (Figure 5a, b), at most 30% of the physiological variability (the variability of protein levels across tissue types that is left after accounting for the measurement noise) can be explained by the mRNA levels. The remaining 70% is most likely due to post-transcriptional regulation. This result underscores the different modes of regulation for the mean-level variability (mostly transcriptional) and for the physiological variability (mostly post-transcriptional). Since the exact estimate depends on the noise/reliability levels, we show estimates for higher and lower levels of reliability. Even if measurement error is larger and most measured variance in both mRNA and protein levels is due to noise, not signal, i.e., reliability < 50%, transcriptional regulation still can explain at most about 50% of the physiological protein variability. Thus even in this extreme case, post-transcriptional regulation is likely a major determinant of physiological protein variability, and thus of tissue-type-specific proteomes.
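One common way to turn two independent estimates into a reliability is to correlate them. The sketch below assumes that both estimates share the same signal and carry independent noise of equal variance, in which case their Pearson correlation estimates the fraction of variance due to signal. The simulated data and the function name `reliability_from_replicates` are hypothetical, and this is not necessarily the exact estimator behind Figure 5a, b.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def reliability_from_replicates(est1, est2):
    """Reliability estimated as the correlation between two independent
    estimates of the same log fold-changes (assumes shared signal and
    independent, equal-variance noise)."""
    return pearson(est1, est2)

# Simulated log fold-changes: signal variance 1, noise variance 1,
# so the true reliability of each replicate is 1 / (1 + 1) = 0.5.
rng = random.Random(1)
signal = [rng.gauss(0.0, 1.0) for _ in range(5000)]
rep1 = [s + rng.gauss(0.0, 1.0) for s in signal]
rep2 = [s + rng.gauss(0.0, 1.0) for s in signal]

rel = reliability_from_replicates(rep1, rep2)
print(f"estimated reliability = {rel:.2f}")
```

For proteins, `rep1` and `rep2` would correspond to fold-changes computed from two non-overlapping peptide sets; for mRNA, to replicate RNA-seq runs.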

Discussion
Highly abundant proteins have highly abundant mRNAs. This dependence is consistently observed (Jovanovic et al, 2015; Csárdi et al, 2015; Gygi et al, 1999; Smits et al, 2014; Schwanhäusser et al, 2011) and dominates the explained variance in the estimates of absolute protein levels (Figure 2a-c, Figure 5c). This underscores the role of transcription in setting the full dynamic range of protein levels. In stark contrast, differences in the proteomes of distinct human tissues are poorly explained by transcriptional regulation (Figure 2d-f, Figure 5a-b). Rather, the mechanisms shaping the tissue-specific proteomes involve post-transcriptional regulation. This result underscores the role of translational regulation and of protein degradation in mediating physiological functions within the range of protein levels consistent with life.
The estimates of absolute protein levels are affected by technological biases and measurement error (Peng et al, 2012; Franks et al, 2015), which can contribute to overestimating post-transcriptional regulation. These biases can be difficult to estimate and influential (Csárdi et al, 2015), potentially leading to underestimates of the variance in protein levels explained by transcription.
However, such systematic biases do not affect the relative changes of protein levels and the estimates of physiological variability.Indeed, the strong enrichment of rPTR within gene sets (Figure 4) demonstrates a concerted regulation at the post-transcriptional level.It is thus unlikely that bias and measurement error alone explain the weak correlations between tissue-specific differences in mRNA and protein levels (Figure 2).
As with all analyses of empirical data, all results depend on the quality of the data and the estimates of their reliability. If the reliability of the data is significantly below 50%, the data would be consistent with mRNA levels accounting for most of the physiological variability, as an upper-limit estimate for the transcriptional contribution. In that case, however, the signal is dominated by noise, and thus the data cannot accurately quantify the contributions of different regulatory mechanisms. The strong functional enrichment for rPTR (Figure 4) and the error estimates by Wilhelm et al (2014) suggest that the physiological variance not explained by mRNA is likely due to post-transcriptional regulation, not to signal dwarfed by noise.
The correlations between the fold-changes of mRNAs and proteins in Figure 3 indicate that the relative contributions of transcriptional and post-transcriptional regulation can vary substantially depending on the tissues compared. Thus, the level of gene regulation depends at least to some extent on context. For example, transcriptional regulation contributes significantly to the dynamical responses of dendritic cells (Jovanovic et al, 2015) and to the differences between spleen and kidney (Figure 3a), but much less to the differences between spleen and thyroid gland (Figure 3a). All data, across all profiled tissues, suggest that post-transcriptional regulation contributes very substantially to the physiological variability of protein levels. The degree of this large contribution depends on the context. Indeed, if we only increase the levels for a set of mRNAs without any other changes, the corresponding protein levels must increase proportionally, as demonstrated by gene inductions (McIsaac et al, 2011). However, the differences across cell-types are not confined only to different mRNA levels. Rather, these differences include different RNA-binding proteins, alternative untranslated regions (UTRs) with known regulatory roles in protein synthesis, specialized ribosomes, and different protein degradation rates (Mauro and Edelman, 2002; Gebauer and Hentze, 2004; Rojas-Duran and Gilbert, 2012; Castello et al, 2012; Arribere and Gilbert, 2013; Slavov et al, 2014b; Katz et al, 2014). The more substantial these differences, the bigger the potential for post-transcriptional regulation. Thus cell-type differentiation and commitment may result in much more post-transcriptional regulation than observed during perturbations preserving the cellular identity.
Consistent with this possibility, mRNA fold-changes can account for less than 50% of the measured physiological variability; the remaining variability is likely due to substantial tissue-specific post-transcriptional regulation. In contrast, stimulating dendritic cells elicits a strong transcriptional response but no change in the cell-type, and thus less cell-type-specific post-transcriptional regulation (Jovanovic et al, 2015).
For these genes, about 8% of the mRNA measurements and about 40% of the protein measurements are missing.
Let $m_{ij}$ denote the log mRNA level for gene $i$ in condition $j$. Similarly, let $p_{ij}$ denote the corresponding log protein level. First, we normalize the columns of the data, for both protein and mRNA, to account for different amounts of total protein per sample. Any multiplicative factors on the raw scale correspond to additive constants on the log scale. Consequently, we normalize data from each tissue-type by minimizing the sum of squared differences between data from that tissue and the first tissue (chosen to serve as a baseline). Specifically, for all proteins and conditions $j > 1$, we normalize each measurement by setting
$$p^{n}_{ij} = p^{u}_{ij} - \frac{1}{n}\sum_{i'=1}^{n}\left(p^{u}_{i'j} - p^{u}_{i'1}\right),$$
where $p^{n}_{ij}$ and $p^{u}_{ij}$ represent the normalized and non-normalized protein measurements, respectively, and $n$ is the number of proteins quantified in both tissues. We conduct the same normalization for mRNA. This normalization corrects for any multiplicative differences in the raw mRNA or protein.
After normalization, we define $r_{ij} = p_{ij} - m_{ij}$ as the log PTR ratio of gene $i$ in condition $j$.
If the post-transcriptional regulation of the $i$th gene were not tissue-specific, then the $i$th PTR ratio would be independent of tissue-type and could be estimated as
$$\hat{r}_{i} = \operatorname{median}_{j}\left(p_{ij} - m_{ij}\right).$$
Then the log "scaled mRNA" (or mean protein level) can be defined as
$$\bar{p}_{ij} = m_{ij} + \hat{r}_{i}.$$
On the raw scale this amounts to scaling each mRNA by its median PTR ratio and represents an estimate of the mean protein level. The residual difference between the measured log protein level and the log mean protein level, $\tilde{r}_{ij} = p_{ij} - \bar{p}_{ij}$, consists of both tissue-specific post-transcriptional regulation and measurement noise.
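The baseline normalization described above reduces, on the log scale, to subtracting one additive offset per tissue. The sketch below is a minimal illustration under two stated assumptions: missing measurements are encoded as `None`, and the offset is computed only over genes observed in both the tissue and the baseline. The function name `normalize_to_baseline` and the toy values are hypothetical.

```python
def normalize_to_baseline(data):
    """data[i][j]: log level of gene i in tissue j, or None if missing.

    Each tissue j > 0 is shifted by the mean log difference to tissue 0,
    which is the least-squares additive offset over the genes observed
    in both tissues. Tissue 0 is the baseline and is left unchanged.
    """
    n_tissues = len(data[0])
    out = [list(row) for row in data]
    for j in range(1, n_tissues):
        diffs = [row[j] - row[0] for row in data
                 if row[j] is not None and row[0] is not None]
        if not diffs:
            continue  # no shared genes: leave this tissue as it is
        offset = sum(diffs) / len(diffs)
        for row in out:
            if row[j] is not None:
                row[j] -= offset
    return out

# Toy log protein levels for 4 genes in 2 tissues (illustrative values).
log_protein = [
    [1.0, 3.0],
    [2.0, 4.1],
    [3.0, None],   # missing in the second tissue
    [None, 5.0],   # missing in the baseline tissue
]
normalized = normalize_to_baseline(log_protein)
print(normalized)
```

The same function would be applied separately to the log mRNA matrix.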
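For a single gene, the scaled-mRNA construction above amounts to three short steps: compute the log PTR ratio in each tissue, take its median, and add that median to each log mRNA level. The sketch below assumes complete (non-missing) log-scale inputs; the function name `scaled_mrna` and the example numbers are hypothetical.

```python
from statistics import median

def scaled_mrna(log_protein, log_mrna):
    """Per-gene scaled mRNA on the log scale.

    Returns (median log PTR ratio, log scaled mRNA per tissue,
    residual = measured log protein - log scaled mRNA per tissue).
    """
    log_ptr = [p - m for p, m in zip(log_protein, log_mrna)]
    r_hat = median(log_ptr)                     # gene-specific constant
    p_bar = [m + r_hat for m in log_mrna]       # estimated mean protein level
    residual = [p - pb for p, pb in zip(log_protein, p_bar)]
    return r_hat, p_bar, residual

# One gene measured in three tissues (illustrative log-scale values).
log_mrna = [1.0, 1.5, 2.0]
log_protein = [3.0, 3.2, 4.5]
r_hat, p_bar, residual = scaled_mrna(log_protein, log_mrna)
print(r_hat, p_bar, residual)
```

The residuals mix tissue-specific post-transcriptional regulation with measurement noise, which is why they cannot be interpreted without a reliability estimate.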

Noise correction
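Measurement noise attenuates estimates of correlations between mRNA and protein levels (Franks et al, 2015). A simple way to quantify this attenuation of correlation due to measurement error is via Spearman's correction. Spearman's correction is based on the fact that the variance of the measured data can be decomposed into the sum of the variance of the noise and the variance of the signal. If the noise and the signal are independent, this decomposition and Spearman's correction are exact. Below is a simple proof that the observed empirical variance is the sum of the variance of the signal and the variance of the noise. Write each measurement as $x_i = e_i + n_i$, where
• $e_i$ is the expectation at the $i$th data point, with $\tilde{e}_i = e_i - \bar{e}$;
• $n_i$ is the noise at the $i$th data point, with $\tilde{n}_i = n_i - \bar{n}$.
The empirical variance is then
$$\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 = \frac{1}{N}\sum_{i=1}^{N}\tilde{e}_i^{\,2} + \frac{1}{N}\sum_{i=1}^{N}\tilde{n}_i^{\,2} + \frac{2}{N}\sum_{i=1}^{N}\tilde{e}_i\,\tilde{n}_i,$$
and the cross term vanishes in expectation when the noise is independent of the signal, leaving $\sigma^2_{x} = \sigma^2_{\text{signal}} + \sigma^2_{\text{noise}}$.
To use this additivity and make Spearman's correction, we need to estimate the "reliability" of the measurements, which is defined as the fraction of total measured variance due to signal rather than to noise:
$$\text{Reliability} = \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{signal}} + \sigma^2_{\text{noise}}}.$$
The noise-corrected correlation is then simply
$$\operatorname{Cor}_{\text{corrected}}(m, p) = \frac{\operatorname{Cor}_{\text{observed}}(m, p)}{\sqrt{\text{Reliability}_{m}\,\text{Reliability}_{p}}}.$$
We estimated the reliabilities of the mRNA and the protein measurements from independent estimates for the mRNA and the protein levels (Figure 5a, b). Given these estimates of mRNA and protein reliabilities, we computed the de-noised fraction of physiological protein variability explained by transcript levels using Equation 3. Figure 5c depicts the $R^2$ regions as a function of measurement reliabilities.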
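Spearman's correction as described above is a one-line computation once the two reliabilities are known. The sketch below applies it to the explained variance rather than the correlation, and caps the result at 1, since sampling error can push the corrected value above that bound. The function name `corrected_r2` and the input numbers are illustrative assumptions, not estimates from the analyzed data.

```python
def corrected_r2(r_observed, rel_mrna, rel_protein):
    """Noise-corrected fraction of variance explained.

    Spearman's correction divides the observed correlation by the
    square root of the product of the two reliabilities; squaring gives
    R^2 = r_observed^2 / (rel_mrna * rel_protein). The result is capped
    at 1 because noisy reliability estimates can overcorrect.
    """
    if not (0.0 < rel_mrna <= 1.0 and 0.0 < rel_protein <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    r2 = r_observed ** 2 / (rel_mrna * rel_protein)
    return min(r2, 1.0)

# Illustrative inputs: observed fold-change correlation of 0.4,
# mRNA reliability 0.8, protein reliability 0.5.
example = corrected_r2(0.4, 0.8, 0.5)
print(f"corrected R^2 = {example:.2f}")

# Sweep protein reliability to see how the upper bound moves.
for rel_p in (0.9, 0.7, 0.5, 0.3):
    print(rel_p, round(corrected_r2(0.4, 0.8, rel_p), 3))
```

Lower assumed reliabilities always raise the corrected value, which is why the corrected estimate is best read as an upper bound on the transcriptional contribution.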

Functional gene set analysis
To identify tissue-specific PTR for functional sets of genes, we analyzed the distributions of PTR ratios within functional gene-sets using the same methodology as Slavov and Botstein (2011). We restrict our attention to functional groups in the GO ontology (Consortium et al, 2004) for which at least 10 genes were quantified by Wilhelm et al (2014). Let $k$ index one of these approximately 1600 functional gene sets. First, for every gene in every tissue we estimate the relative PTR (rPTR) or, equivalently, the difference between the measured log protein level and the log mean protein level:
$$\tilde{r}_{ij} = \left(p_{ij} - m_{ij}\right) - \operatorname{median}_{j' \neq j}\left(p_{ij'} - m_{ij'}\right).$$
To exclude the possibility that $\tilde{r}_{ij} = 0$ exactly, we require that $j' \neq j$. When the estimated rPTR is larger than zero, the measured protein level in tissue $j$ is larger than the estimated mean protein level. Likewise, when this quantity is smaller than zero, the measured protein level is smaller than expected. Measured deviations from the mean protein level are due to both measurement noise and tissue-specific PTR. To eliminate the possibility that all of the variability in the rPTR ratios is due to measurement noise, we conduct a full gene set analysis.
For each of the gene sets we compute a vector of these estimated log ratios, so that a gene set is comprised of
$$G_{kj} = \left(\tilde{r}_{i_1 j}, \ldots, \tilde{r}_{i_{n_k} j}\right),$$
where $i_1$ to $i_{n_k}$ index the genes in set $k$ and $j$ indexes the tissue type.
Let $KS(G_1, G_2)$ be the function that returns the p-value of the Kolmogorov-Smirnov test on the distributions in sets $G_1$ and $G_2$. The KS-test is a test for a difference in distribution between two samples. Using this test, we identify gene sets that show systematic differences in PTR ratio in a particular tissue ($j$) relative to all other tissues.
Specifically, the p-value associated with gene set $k$ in condition $j$ is
$$\mathrm{pval}_{kj} = KS\Big(G_{kj},\ \bigcup_{j' \neq j} G_{kj'}\Big),$$
where the second argument pools the rPTR values of the same genes across all other tissues. To correct for multiple hypotheses testing, we computed the false discovery rate (FDR) for all gene sets in tissue $j$ (Storey, 2003). In Figure 4a-c, we present only the functional groups with FDR less than 2% and report their associated p-values. The significance of many of these groups, controlling for false discoveries, suggests that not all of the variability in rPTR is due to measurement noise.

The dynamic range of physiological variability is larger for proteins than for mRNAs. (a) Distributions of dynamic ranges for mRNAs and proteins quantified by the standard deviations computed on a log scale. Note that the log scale makes the standard deviation independent of scalar scaling on the linear scale. (b) Distribution of differences between the standard deviations of proteins and their corresponding mRNAs. The larger than zero median indicates that the physiological variability of most genes is larger at the protein level than at the mRNA level.
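The rPTR definition above uses a leave-one-out median: the reference for tissue j is the median log PTR ratio across the remaining tissues. The sketch below assumes one gene with complete log PTR ratios; the function name `relative_ptr` and the example values are hypothetical.

```python
from statistics import median

def relative_ptr(log_ptr):
    """Log rPTR per tissue for a single gene.

    log_ptr[j] is the log PTR ratio (log protein - log mRNA) in tissue j.
    For each tissue, subtract the median log PTR ratio of all OTHER
    tissues, so a tissue never serves as its own reference.
    """
    if len(log_ptr) < 2:
        raise ValueError("need at least two tissues")
    rptr = []
    for j, r in enumerate(log_ptr):
        others = log_ptr[:j] + log_ptr[j + 1:]
        rptr.append(r - median(others))
    return rptr

# Log PTR ratios of one gene in four tissues (illustrative values).
log_ptr = [2.0, 1.0, 3.0, 2.5]
rptr = relative_ptr(log_ptr)
print(rptr)
```

A positive entry means the measured protein level in that tissue exceeds what the gene's typical PTR ratio would predict from its mRNA level; a negative entry means it falls short.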
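To make the KS comparison concrete, the sketch below implements a two-sample Kolmogorov-Smirnov statistic with the standard asymptotic p-value approximation, using only the standard library. This is a simplified stand-in rather than the implementation used for the paper: it omits small-sample corrections, and in practice a library routine such as `scipy.stats.ks_2samp` would normally be preferred. The gene-set values and the names `ks_statistic` and `ks_pvalue` are hypothetical.

```python
import math

def ks_statistic(a, b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        while i < na and a[i] <= x:   # advance past ties in sample a
            i += 1
        while j < nb and b[j] <= x:   # advance past ties in sample b
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d

def ks_pvalue(a, b):
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    d = ks_statistic(a, b)
    n_eff = len(a) * len(b) / (len(a) + len(b))
    lam = d * math.sqrt(n_eff)
    if lam < 0.2:
        return 1.0  # the series below is numerically 1 in this range
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return max(0.0, min(1.0, 2.0 * total))

# Hypothetical gene set: rPTR values in one tissue versus the same genes
# pooled across the other tissues (values are illustrative only).
in_tissue = [1.0 + 0.01 * i for i in range(30)]
other_tissues = [0.01 * i for i in range(100)]

p_shifted = ks_pvalue(in_tissue, other_tissues)
p_same = ks_pvalue(other_tissues, other_tissues)
print(p_shifted, p_same)
```

A gene set whose rPTR distribution in one tissue is shifted relative to the pooled distribution yields a small p-value, while identical distributions yield a p-value of 1.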

Figure 1. The fraction of total protein variance explained by scaled mRNA levels is not informative about the physiological variance explained by scaled mRNA levels. (a) mRNA levels scaled by the median protein-to-mRNA ratio correlate strongly with measured protein levels ($R^2_T = 0.77$ over 6104 measured proteins in each of 12 different tissues). (b) A subset of 100 genes is used to illustrate an example of Simpson's paradox: regression lines reflect within-gene physiological variation across tissues. Despite the fact that the overall correlation between scaled mRNA and measured protein levels is large and positive ($R_T = 0.89$), for any single gene in this set, scaled mRNA is negatively correlated with measured protein levels ($R_P < 0$).