Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level variability, and (ii) the factors determining the physiological variability of the same protein across different tissue types, i.e., across-tissues variability. We sought to estimate the contribution of transcript levels to these two orthogonal sources of variability, and found that scaled mRNA levels can account for most of the mean-level variability but not necessarily for across-tissues variability. The reliable quantification of the latter estimate is limited by substantial measurement noise. However, protein-to-mRNA ratios exhibit substantial across-tissues variability that is functionally concerted and reproducible across different datasets, suggesting extensive post-transcriptional regulation. These results caution against estimating protein fold-changes from mRNA fold-changes between different cell-types, and highlight the contribution of post-transcriptional regulation to shaping tissue-type-specific proteomes.
The identity of human tissues depends on their protein levels. Are tissue protein levels set largely by corresponding mRNA levels or by other (post-transcriptional) regulatory mechanisms? We revisit this question based on statistical analysis of mRNA and protein levels measured across human tissues. We find that for any one gene, its protein levels across tissues are poorly predicted by its mRNA levels, suggesting tissue-specific post-transcriptional regulation. In contrast, the overall protein levels are well predicted by scaled mRNA levels. We show how these speciously contradictory findings are consistent with each other and represent the two sides of Simpson’s paradox.
Citation: Franks A, Airoldi E, Slavov N (2017) Post-transcriptional regulation across human tissues. PLoS Comput Biol 13(5): e1005535. https://doi.org/10.1371/journal.pcbi.1005535
Editor: Christine Vogel, NYU, UNITED STATES
Received: December 19, 2016; Accepted: April 26, 2017; Published: May 8, 2017
Copyright: © 2017 Franks et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was partially funded by a SPARC grant from the Broad Institute to NS and EA (https://www.broadinstitute.org/), the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery (wrf.washington.edu/) and the Moore/Sloan Data Science Environments Project at the University of Washington (msdse.org/), and NIGMS of the NIH under Award Number DP2GM123497 (https://www.nigms.nih.gov/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
The relative ease of measuring mRNA levels has facilitated numerous investigations of how cells regulate their gene expression across different pathological and physiological conditions [1–6]. However, the relevant biological processes often depend on protein levels, and mRNA levels are merely proxies for protein levels. If a gene is regulated mostly transcriptionally, its mRNA level is a good proxy for its protein level. Conversely, post-transcriptional regulation can set protein levels independently from mRNA levels, as in the cases of classical regulators of development, cell division [9, 10] and metabolism [11, 12]. Thus, understanding the relative contributions of transcriptional and post-transcriptional regulation is essential for understanding their trade-offs and the principles of biological regulation, as well as for assessing the feasibility of using mRNA levels as proxies for protein levels.
Previous studies have considered single cell-types and conditions in studying variation in absolute mRNA and protein levels genome-wide, often employing unicellular model organisms or mammalian cell cultures [13–19]. However, analyzing per-gene variation in relative mRNA and protein expression across different tissue-types in a multicellular organism presents a potentially different and critical problem, which cannot be properly addressed by examining only genome-scale correlations between mRNA and protein levels. Recent studies [20–22] have measured protein levels across human tissues, thus providing valuable datasets for analyzing the regulatory layers shaping tissue-type-specific proteomes. The absolute levels of proteins and mRNAs in these datasets correlate well, highlighting that highly abundant proteins have highly abundant mRNAs. Such correlations between the absolute levels of mRNA and protein conflate many sources of variation, including variability between the levels of different proteins, variability within the same protein across different conditions and cell-types, and the variability due to measurement error and technological bias.
However, these different sources of variability have very different biological interpretations and implications. A major source of variability in protein and mRNA data arises from differences between the levels of mRNAs and proteins corresponding to different genes. That is, the mean levels (averaged across tissue-types) of different proteins and mRNAs vary widely. We refer to this source of variability as mean-level variability. This mean-level variability reflects the fact that some proteins, such as ribosomal proteins, are highly abundant across all profiled tissues while other proteins, such as cell cycle and signaling regulators, are orders of magnitude less abundant across all profiled conditions. Another principal source of variability in protein levels, intuitively orthogonal to the mean-level variability, is the variability within a protein across different cell-types or physiological conditions, and we refer to it as across-tissues variability. The across-tissues variability is usually much smaller in magnitude, but may be the most relevant source of variability for understanding different phenotypes across cell-types and physiological conditions.
Here, we sought to separately quantify the contributions of transcriptional and post-transcriptional regulation to the mean-level variability and to the across-tissues variability across human tissues. Our results show that much of the mean-level protein variability can be explained well by mRNA levels while across-tissues protein variability is poorly explained by mRNA levels; much of the unexplained variance is due to measurement noise but some of it is reproducible across datasets and thus likely reflects post-transcriptional regulation. These results add to previous results in the literature [13–18, 20, 22] and suggest that the post-transcriptional regulation is a significant contributor to shaping tissue-type specific proteomes in human.
The correlation between absolute mRNA and protein levels conflates distinct sources of variability
We start by outlining the statistical concepts underpinning the common correlational analysis and depiction [13, 15, 17, 20] of estimated absolute protein and mRNA levels as displayed in Fig 1a and 1b. The correlation between the absolute mRNA and protein levels of different genes and across different tissue-types has been used to estimate the level at which the protein levels are regulated [20, 22].
(a) mRNA levels correlate with measured protein levels ($R_T = 0.33$ over all measured mRNAs and proteins across 12 different tissues). (b) Protein levels versus mRNA levels scaled by the median protein-to-mRNA ratio (PTR); the only change from panel (a) is the scaling of mRNAs, which considerably improves the correlation. (c) A subset of 100 genes is used to illustrate an example of Simpson’s paradox: regression lines reflect within-gene and across-tissues variability. Although the overall correlation between scaled mRNA and measured protein levels is large and positive ($R_T = 0.89$), for any single gene in this set, mRNA levels scaled by the median PTR ratio are not correlated with the corresponding measured protein levels ($R_P \approx 0$). (d) Cumulative distributions of across-tissues scaled mRNA-protein correlations ($R_P$) for 3 datasets [20–22]. The smooth curves correspond to all proteins quantified by shotgun proteomics, while the dashed curves correspond to a subset of proteins quantified in a small targeted dataset. The vertical lines show the corresponding overall (conflated) correlation between scaled mRNA levels and protein levels, $R_T$. See Methods and S1 Fig.
One measure reflecting the post-transcriptional regulation of a gene is its protein to mRNA ratio, which is sometimes referred to as a gene’s “translational efficiency”. Since this ratio also reflects other layers of regulation, such as protein degradation and noise, we will refer to it descriptively as the protein-to-mRNA (PTR) ratio. If the across-tissues variability of a gene is dominated by transcriptional regulation, its PTR in different tissue-types will be a gene-specific constant. Based on this idea, previous studies [20, 22] estimated these protein-to-mRNA ratios and suggested that the median PTR for each gene can be used to scale its tissue-specific mRNA levels and that this “scaled mRNA” accurately predicts tissue-specific protein levels.
Indeed, mRNA levels scaled by the corresponding median PTR explain a large fraction of the total protein variance ($R_T^2$, computed across all measured proteins; Fig 1a and 1b), as previously observed [15, 20, 22]. However, this high $R_T^2$ does not indicate concordance between the across-tissues variability of mRNAs and proteins. $R_T^2$ quantifies the fraction of the total protein variance explained by mRNA levels between genes and across tissue-types; thus, it conflates the mean-level variability with the across-tissues variability. This conflation is shown schematically in Fig 1c for a subset of 100 genes measured across 12 tissues. The across-tissues variability is captured by the variability within the regression fits, while the mean-level variability is captured by the variability between the regression fits.
Such aggregation of distinct sources of variability, where different subgroups of the data show different trends, may lead to counter-intuitive results and incorrect conclusions, and is known as Simpson’s paradox or the amalgamation paradox. To illustrate Simpson’s paradox in this context, we depicted a subset of genes for which the measured mRNA and protein levels are unrelated across tissues while the mean-level variability still spans the full dynamic range of the data (Fig 1c). For this subset of genes, the overall (conflated/amalgamated) correlation is large and positive, despite the fact that all across-tissues (within-gene) trends are close to zero. This counter-intuitive result is possible because the conflated correlation is dominated by the variability with the larger dynamic range, in this case the mean-level variability. This conceptual example using measured data demonstrates that $R_T^2$ is not necessarily informative about the across-tissues variability, i.e., the protein variance explained by scaled mRNA within a gene ($R_P^2$). Thus the conflated correlation is not generally informative about the level (transcriptional or post-transcriptional) at which across-tissues variability is regulated. This point is also illustrated in S1 Fig with data for all quantified genes: the correlations between scaled mRNA and measured protein levels are not informative about the correlations between the corresponding relative changes in protein and mRNA levels.
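The amalgamation effect described above can be reproduced with purely synthetic numbers. The following is a minimal sketch, not the analysis behind Fig 1c: the function name `simulate`, the range of mean levels, and the noise scale are all illustrative assumptions, and within each simulated gene the mRNA and protein values are drawn independently around a shared gene-specific mean.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_genes=100, n_tissues=12, seed=0):
    """Return (conflated correlation, median within-gene correlation).

    All parameters are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    all_mrna, all_protein, within_gene = [], [], []
    for _ in range(n_genes):
        # Gene-specific mean log10 level: spans four orders of magnitude.
        mean_level = rng.uniform(3.0, 7.0)
        # Across-tissues deviations of mRNA and protein are independent.
        mrna = [mean_level + rng.gauss(0, 0.3) for _ in range(n_tissues)]
        protein = [mean_level + rng.gauss(0, 0.3) for _ in range(n_tissues)]
        within_gene.append(pearson(mrna, protein))
        all_mrna.extend(mrna)
        all_protein.extend(protein)
    return pearson(all_mrna, all_protein), statistics.median(within_gene)


overall_r, median_within_r = simulate()
print(f"conflated correlation: {overall_r:.2f}")
print(f"median within-gene correlation: {median_within_r:.2f}")
```

In this sketch the conflated correlation is high because it is driven by the spread of the gene means, while the within-gene correlations scatter around zero by construction.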
To further illustrate this point with more datasets, Fig 1d displays the cumulative distributions of across-tissues mRNA-protein correlations ($R_P$) for all proteins quantified across the large shotgun datasets [20, 21], as well as the corresponding conflated correlations between scaled mRNA and protein levels ($R_T$). This depiction demonstrates that $R_T$ is not representative of $R_P$. To extend this analysis to protein levels measured by targeted MS, we plotted the distributions of across-tissues mRNA-protein correlations ($R_P$) for the subset of 33 genes quantified across all datasets [20–22]; see dotted curves in Fig 1d. These genes were selected to have larger variance across tissues and have relatively higher across-tissues correlations, especially in the data by [21, 22]. Nevertheless, all datasets include low and even negative across-tissues correlations ($R_P$) and very high conflated correlations between scaled mRNA and protein levels ($R_T$), Fig 1d. These results underscore the weak connection between $R_P$ and $R_T$ even for a carefully selected and measured subset of mRNAs and proteins.
The across-tissues variability has a dynamic range of about 2 to 10 fold and is thus dwarfed by the $10^3$ to $10^4$ fold dynamic range of abundances across different proteins. While across-tissues variability is smaller than mean-level variability, it is exactly the across-tissues variability that contributes to the biological identity of each tissue type, and we focus the rest of our analysis on factors regulating the across-tissues protein variability.
Estimates of transcriptional and post-transcriptional regulation across-tissues depend strongly on data reliability
Next, we sought to estimate the fractions of across-tissues protein variability due to transcriptional regulation and to post-transcriptional regulation. This estimate depends crucially on noise in the mRNA and protein data, from sample collection to measurement error. Both RNA-seq [24, 25] and mass-spectrometry [15, 26] have relatively large and systematic error in estimating absolute levels of mRNAs and proteins, i.e., the ratios between different proteins/mRNAs. These errors originate from DNA sequencing GC-biases, and variations in protein digestion and peptide ionization. However, relative quantification of the same gene across tissue-types by both methods can be much more accurate since systematic biases are minimized when taking ratios between the intensities/counts of the same peptide/DNA-sequence measured in different tissue types [18, 25, 27, 28]. It is this relative quantification that is used in estimating across-tissues variability, and we start by estimating the reliability of the relative quantification across human tissues, Fig 2a–2d. Reliability is defined as the fraction of the observed/empirical variance due to signal. Thus reliability is proportional to the signal strength and decreases with the noise levels.
(a) The within-study reliability of relative mRNA levels (reliability is defined as the fraction of the measured variance due to the signal) is estimated as the correlation between the mRNA levels measured in the twelve different tissues. Estimates for the levels of each transcript measured in different subjects were correlated (averaging across the 12 tissue-types) and the results for all analyzed transcripts displayed as a distribution for each RNA dataset [29, 30]. (b) The within-study reliability of relative protein levels is estimated as the correlation between the protein levels measured in 12 different tissues [20, 21]. Within each dataset, separate estimates for each protein were derived from non-overlapping sets of peptides and were correlated (averaging across the 12 tissue-types) and the results for all analyzed proteins displayed as a distribution; see Methods. (c) The across-study reliability of mRNA was estimated by correlating estimates as in (a), but with the two estimates coming from different studies [29, 30]. (d) The across-study reliability of proteins was estimated by correlating estimates as in (b), but with the two estimates coming from different studies [20, 21]. (e) The fraction of across-tissues protein variance that can be explained by mRNA levels is plotted as a function of the reliability of the estimates of mRNA and protein levels, given an empirical mRNA/protein correlation of 0.29. The red Xs correspond to two estimates of reliability of the mRNA and protein measurements computed from both independent mRNA and protein datasets.
To estimate the within-study reliability of mRNA levels, we took advantage of the fact that each mRNA dataset contains data from multiple subjects. We split the subjects in each dataset into two subsets, each containing measurements for all 12 tissues from several subjects. The levels of each mRNA were estimated from each subset by averaging across subjects, and the estimates from the two subsets were correlated (Fig 2a). These correlations provide estimates for the reliability of each mRNA, and their median provides a global estimate for the reliability of relative RNA measurement, not taking into account noise due to sample collection and processing.
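The split-half procedure for a single transcript can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for Fig 2a: the function name `split_half_reliability`, the even/odd assignment of subjects to the two subsets, and the synthetic input are all hypothetical choices.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(levels_by_subject):
    """Correlate tissue profiles estimated from two disjoint subject subsets.

    levels_by_subject: one list per subject, each holding the log levels of
    a single transcript in the same ordered set of tissues.
    """
    # Assumption: subjects are assigned to subsets by alternating index.
    half_a = levels_by_subject[0::2]
    half_b = levels_by_subject[1::2]
    # Average across subjects within each subset, tissue by tissue.
    profile_a = [statistics.fmean(column) for column in zip(*half_a)]
    profile_b = [statistics.fmean(column) for column in zip(*half_b)]
    return pearson(profile_a, profile_b)


# Synthetic transcript: a true tissue profile plus subject-level noise.
rng = random.Random(1)
true_profile = [0.3 * t for t in range(12)]
subjects = [[v + rng.gauss(0, 0.3) for v in true_profile] for _ in range(6)]
reliability = split_half_reliability(subjects)
print(f"split-half reliability: {reliability:.2f}")
```

Repeating this for every transcript and taking the median of the resulting correlations would give a global estimate of the kind described above.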
To estimate the within-study reliability of protein levels, we computed separate estimates of the relative protein levels within a dataset. For each protein, Estimate 1 was derived from 50% of the quantified peptides and Estimate 2 from the other 50%. Since much of the analytical noise related to protein digestion, chromatographic mobility and peptide ionization is peptide-specific, such non-overlapping sets of peptides provide mostly, albeit not completely, independent estimates for the relative protein levels. The correlations between the estimates for each protein (averaging across 12 tissues) are displayed as a distribution in Fig 2b.
In addition to the within-study measurement error, protein and mRNA estimates can be affected by study-dependent variables such as sample collection and data processing. To account for these factors, we estimated across-study reliability by comparing estimates for relative protein and mRNA levels derived from independent studies, Fig 2c and 2d. For each gene, we estimated the reliability of its mRNA levels by computing the empirical correlation between the mRNA abundances reported by ENCODE and by an independent RNA-seq study [29, 30]. The correlations in Fig 2c have a much broader distribution than the within-study correlations, indicating that much of the noise in mRNA estimates is study-dependent.
To estimate the across-study reliability of protein levels, we compared the protein levels estimated from the data published by the two shotgun proteomics studies [20, 21]. To quantify protein abundances, one study used iBAQ scores and the other used spectral counts. To ensure uniform processing of the two datasets, we downloaded the raw data and analyzed them with MaxQuant using identical settings, and estimated protein abundances in each dataset using iBAQ; see Methods. The corresponding estimates for each protein were correlated to estimate their reliability. Again, the correlations depicted in Fig 2d have a much broader distribution compared to the within-study protein correlations in Fig 2b, indicating that, as with mRNA, the vast majority of the noise is study-dependent. As a representative estimate of the reliability of protein levels, we use the median of the across-tissues correlations from Fig 2d.
The across-tissues correlations and the reliability of the measurements can be used to estimate the across-tissues variability in protein levels that can be explained by mRNA levels (i.e., transcriptional regulation) as shown in Fig 2e; see Methods. As the reliabilities of the protein and the mRNA estimates decrease, the noise sensitivity of the estimated transcriptional contribution increases. Although the average across-tissues mRNA-protein correlation was only 0.29 ($R^2 = 0.08$), the data are consistent with approximately 50% of the variance being explained by transcriptional regulation and approximately 50% coming from post-transcriptional regulation; see S2 Fig for reliability-corrected estimates for specific functional gene sets. However, the low reliability of the data and the large sampling variability preclude a reliable estimate. Thus, we next considered analyses that can provide estimates for the scope of post-transcriptional regulation even when the reliability of the data is low.
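To make the dependence on reliability concrete, the sketch below applies the classical (Spearman) correction for attenuation, in which the observed squared correlation is divided by the product of the two reliabilities. This is an assumption made for illustration; the exact correction used for Fig 2e is described in Methods and may differ. The function name `corrected_r2` and the reliability values of 0.41 are hypothetical and are not estimates from the data.

```python
def corrected_r2(r_observed, reliability_mrna, reliability_protein):
    """Noise-corrected fraction of variance explained.

    Assumes the classical correction for attenuation:
        R2_true = r_observed**2 / (reliability_mrna * reliability_protein)
    The result is capped at 1 because sampling noise in the reliabilities
    can otherwise push the estimate above its admissible range.
    """
    if not (0 < reliability_mrna <= 1 and 0 < reliability_protein <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return min(r_observed ** 2 / (reliability_mrna * reliability_protein), 1.0)


# Observed across-tissues correlation from the text; reliabilities are
# hypothetical values chosen only to illustrate the sensitivity.
for reliability in (1.0, 0.7, 0.41):
    estimate = corrected_r2(0.29, reliability, reliability)
    print(f"reliability {reliability:.2f} -> corrected R2 {estimate:.2f}")
```

Under this assumed correction, a small change in low reliabilities shifts the corrected estimate substantially, which is the noise sensitivity noted above.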
Coordinated post-transcriptional regulation of functional gene sets
The low reliability of estimates across datasets limits the reliability of estimates of transcriptional and post-transcriptional regulation for individual proteins, Fig 2. Thus, we focused on estimating the post-transcriptional regulation for sets of functionally related genes as defined by the gene ontology (GO). By considering such gene sets, we may be able to average out some of the measurement noise and see regulatory trends shared by functionally related genes. Indeed, some of the noise contributing to the across-tissues variability of a gene is likely independent of the function of the gene; see Methods. Conversely, genes with similar functions are likely to be regulated similarly and thus have similar tissue-type-specific PTR ratios. Thus, we explored whether the across-tissues variability of the PTR ratios of functionally related genes reflects such tissue-type-specific and biological-function-specific post-transcriptional regulation.
Since this analysis aims to quantify across-tissues variability, we define the “relative protein to mRNA ratio” (rPTR) of a gene in a given tissue to be the PTR ratio in that tissue divided by the median PTR ratio of the gene across the other 11 tissues. We evaluated the significance of rPTR variability for a gene-set in each tissue-type by comparing the corresponding gene-set rPTR distribution to the rPTR distribution for those same genes pooled across the other tissues (Fig 3); we use the KS-test to quantify the statistical significance of differences in the rPTR distributions; see Methods. The results indicate that the genes from many GO terms have substantially higher rPTR in some tissues than in others. For example the ribosomal proteins of the small subunit (40S) have high rPTR in kidney but low rPTR in stomach (Fig 3a–3c).
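The rPTR definition and the distribution comparison can be sketched on the log scale, where dividing by the median of the other tissues becomes a subtraction. This is a minimal illustration, not the pipeline behind Fig 3: the function names `log_rptr` and `ks_statistic` and the example values are hypothetical, and only the two-sample KS statistic (the largest gap between empirical distribution functions) is computed, without a p-value or FDR correction.

```python
import statistics


def log_rptr(log_ptr_by_tissue):
    """Log rPTR of one gene in each tissue.

    log_ptr_by_tissue: dict mapping tissue name to the gene's log10 PTR.
    Each value is compared with the median log PTR of the *other* tissues.
    """
    result = {}
    for tissue, value in log_ptr_by_tissue.items():
        others = [v for t, v in log_ptr_by_tissue.items() if t != tissue]
        result[tissue] = value - statistics.median(others)
    return result


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic (no p-value)."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)


# Hypothetical log10 PTR values for one gene in four tissues.
example = {"kidney": 1.0, "stomach": 0.2, "spleen": 0.5, "testis": 0.6}
rptr = log_rptr(example)
print(rptr)
```

For a gene set, the log rPTR values of its genes in one tissue would form one sample and the values pooled across the other tissues the second sample passed to `ks_statistic`.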
(a) mRNAs coding for the ribosomal proteins, NADH dehydrogenase and respiratory proteins have higher protein-to-mRNA ratios in kidney as compared to the median across the other 11 tissues (FDR < 1%). In contrast, mRNAs of genes functioning in Rac GTPase binding have lower protein-to-mRNA ratios (FDR < 1%). (b) The stomach also shows significant rPTR variation, with low rPTR for the ribosomal proteins and high rPTR for tRNA-aminoacylation (FDR < 1%). (c) Summary of rPTR variability, as depicted in panels (a-b), across all tissues and many gene ontology (GO) terms: metabolic pathways and functional gene-sets that show statistically significant (FDR < 1%) variability in the relative protein-to-mRNA ratios across the 12 tissue types. All data are displayed on a log10 scale, and functionally related gene-sets are marked with the same color. (d) The reproducibility of rPTR estimates across different studies is estimated as the correlation between the median rPTRs for GO terms showing significant enrichment as shown in panels (a-c). See Methods, S2 and S3 Figs.
While the strong functional enrichment of rPTR suggests functionally concerted post-transcriptional regulation, it can also reflect systematic dataset-specific measurement artifacts. To investigate this possibility, we obtained two estimates for rPTR from independent datasets, with Estimate 1 and Estimate 2 each based on a different pair of mRNA and protein datasets. These two estimates are reproducible (e.g., ρ = 0.7 to 0.8) for most tissues but less so for others (e.g., ρ = 0.14), as shown by the scatter plots between the median rPTR for GO terms in Fig 3d; S3 Fig shows the reproducibility for all tissues. The correlations between the two rPTR estimates remain statistically significant, albeit weaker (i.e., ρ = 0.1 to 0.4), when computed with all GO terms (not only those showing significant enrichment) as shown in S1 Table, as well as when computed between the rPTRs for all genes, S2 Table.
Consensus protein levels
Given the low reliability of protein estimates across studies shown in Fig 2, we sought to increase it by deriving consensus estimates. Indeed, by appropriately combining data from both protein studies, we can average out some of the noise, thus improving the reliability of the consensus estimates; see Methods. As expected for protein estimates with increased reliability, the consensus protein levels correlate better with mRNA levels than the corresponding protein levels estimated from either dataset alone, Fig 4a and 4b. We further validated our consensus estimates against 124 protein/tissue measurements from a targeted MS study. We computed the mean squared errors (MSE) between the protein levels estimated from the targeted study and the other three datasets using only protein/tissue measurements quantified in all datasets, facilitating fair comparison (Fig 4c). The MSE are lower for the consensus dataset than for either source dataset and are consistent with a 10% error reduction relative to one of them. In addition to increased reliability, the consensus dataset increased coverage, providing a more comprehensive quantification of protein levels across human tissues than either draft of the human proteome taken alone (Table 1).
We compiled a consensus protein dataset by merging data from the two shotgun proteomics studies [20, 21] as described in Methods. The relative protein levels estimated from each study and from the consensus dataset were correlated to mRNA levels from one RNA-seq dataset (a) or from the other (b). The correlations are shown as a function of the median correlation between the protein estimates from the two studies. The consensus dataset exhibits the highest correlations, suggesting that it has averaged out some of the noise in each dataset and provides a more reliable quantification of human tissue proteomes. (c) The two source datasets and the consensus dataset were evaluated by comparison to a targeted MS validation dataset quantifying 33 proteins over 5 tissues. The similarity for each dataset was quantified by the mean squared error (MSE) relative to the targeted MS validation data using 124 protein/tissue measurements that were observed in all datasets. The MSEs are reported for each of the five tissues and for all 5 tissues combined; they indicate that the consensus data have the best agreement with the validation dataset.
Highly abundant proteins have highly abundant mRNAs. This dependence is consistently observed [13–15, 17, 18] and dominates the explained variance in the estimates of absolute protein levels (Fig 1 and S1 Fig). This underscores the role of transcription for setting the full dynamic range of protein levels. In stark contrast, differences in the proteomes of distinct human tissues are poorly explained by transcriptional regulation, Fig 1. This is due to measurement noise (Fig 2) but also to post-transcriptional regulation. Indeed, large and partially reproducible rPTR ratios suggest that the mechanisms shaping tissue-specific proteomes involve post-transcriptional regulation, Fig 3. This result underscores the role of translational regulation and of protein degradation for mediating physiological functions within the range of protein levels consistent with life.
As with all analyses of empirical data, the results depend on the quality of the data and the estimates of their reliability. This dependence on data quality is particularly strong given that some conclusions rest on the failure of across-tissues mRNA variability to predict across-tissues protein variability. Such inference based on unaccounted-for variability is substantially weaker than measuring directly and accounting for all sources of variability. The low across-study reliability suggests that the signal is strongly contaminated by noise, especially systematic biases in sample collection and handling, and thus the data cannot accurately quantify the contributions of different regulatory mechanisms, Fig 2. Another limitation of the data is that isoforms of mRNAs and proteins are merged together, i.e., using razor proteins. This latter limitation is common to all approaches quantifying proteins and mRNAs from peptides/short-sequence reads. It stems from the limited ability of existing approaches to infer and distinctly quantify isoforms and proteoforms.
The strong enrichment of rPTR ratios within gene sets (Fig 3) demonstrates a functionally concerted regulation at the post-transcriptional level. Some of the rPTR trends can account for fundamental physiological differences between tissue types. For example, the kidney is the most metabolically active (energy consuming) tissue among the 12 profiled tissues and it has very high rPTR for many gene sets involved in energy production (Fig 3a). In this case, post-transcriptional regulation likely plays a functional role in meeting the high energy demands of kidneys. Quantifying and understanding mRNA and protein covariation in single cells is an important frontier of this analysis.
The rPTR patterns and the across-tissues correlations in S1 Fig indicate that the relative contributions of transcriptional and post-transcriptional regulation can vary substantially depending on the tissues compared. Thus, the level of gene regulation depends strongly on the context. For example, transcriptional regulation contributes significantly to the dynamical responses of dendritic cells and to the differences between kidney and prostate gland (S1h Fig) but less to the differences between kidney and liver (S1g Fig). All data, across all profiled tissues, suggest that post-transcriptional regulation contributes substantially to the across-tissues variability of protein levels. The degree of this contribution depends on the context.
Indeed, if we only increase the levels for a set of mRNAs without any other changes, the corresponding protein levels must increase proportionally, as demonstrated by gene inductions. However, the differences across cell-types are not confined only to different mRNA levels. Rather, these differences include different RNA-binding proteins, alternative untranslated regions (UTRs) with known regulatory roles in protein synthesis, specialized ribosomes [35–38], and different protein degradation rates [39–43]. The more substantial these differences, the bigger the potential for post-transcriptional regulation. Thus cell-type differentiation and commitment may result in much more post-transcriptional regulation than observed during perturbations preserving the cellular identity. Consistent with this possibility, tissue-type specific proteomes may be shaped by substantial post-transcriptional regulation; in contrast, cell stimulation that preserves the cell-type may elicit a strong transcriptional remodeling but weaker post-transcriptional remodeling.
We used RNA estimates based on RNA-seq from [29, 30] and protein estimates based on shotgun mass-spectrometry from [20, 21]. These large scale datasets contained N = 6104 genes measured in each of twelve different human tissues: adrenal gland, esophagus, kidney, ovary, pancreas, prostate, salivary gland, spleen, stomach, testis, thyroid gland, and uterus. For these genes, about 8% of the mRNA measurements and about 40% of the protein measurements are missing. The mRNA datasets contain measurements from multiple subjects, and the subjects were split into two subsets in estimating the within-study reliability in Fig 2a. We also used a small scale targeted dataset containing data for 33 proteins measured across 5 tissues. The datasets were collected by different groups and the measurements were derived from different subjects.
Searching raw MS data
Raw data from [21, 22] were searched by MaxQuant against a protein sequence database including all entries from a Human UniProt database from 2015 and known contaminants such as human keratins and common laboratory contaminants. MaxQuant searches were performed with trypsin specificity allowing up to two missed cleavages, with carbamidomethylation of cysteines as a fixed modification, and with methionine oxidation and protein N-terminal acetylation as variable modifications. All razor peptides were used for quantifying the proteins to which they were assigned by MaxQuant. False discovery rate (FDR) was set to 1% at both the protein and the peptide levels.
Scaling mRNA levels
Let $m_{it}$ denote the log mRNA level of gene i in tissue t, and let $p_{it}$ denote the corresponding log protein level. First, we normalize the columns of the data, for both protein and mRNA, to account for different amounts of total protein per sample. Any multiplicative factor on the raw scale corresponds to an additive constant on the log scale. Consequently, we normalize the data from each tissue-type by minimizing the absolute differences between the data from that tissue and the first tissue (arbitrarily chosen as a baseline). That is, for all t > 1, we define
$$p_{it} = \tilde{p}_{it} - \mu_t \quad \text{with} \quad \mu_t = \arg\min_{\mu} \sum_i \left| \tilde{p}_{it} - \mu - p_{i1} \right|,$$
where $p_{it}$ and $\tilde{p}_{it}$ represent the normalized and non-normalized protein measurements, respectively. For each t, the value of $\mu_t$ which minimizes the absolute difference is
$$\mu_t = \operatorname{median}_i \left( \tilde{p}_{it} - p_{i1} \right).$$
We use the same normalization for mRNA. This normalization, which corresponds to a location shift of the log abundances for each tissue, corrects for any multiplicative differences in the raw (unlogged) mRNA or protein. We normalize these measurements by aligning the medians rather than the means, as the median is more robust to outliers.
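A minimal sketch of this median-based normalization, in Python. The function name `normalize_to_baseline`, the dictionary layout, and the toy values are illustrative assumptions rather than the authors' code; missing measurements are represented by `None` and skipped when computing each shift.

```python
from statistics import median

def normalize_to_baseline(log_levels):
    """Shift each tissue's log levels so that the median difference
    from the baseline tissue (index 0) becomes zero.

    log_levels: dict mapping gene -> list of log abundances, one per tissue
                (None marks a missing measurement). Hypothetical layout.
    Returns (normalized levels, per-tissue shifts mu_t).
    """
    n_tissues = len(next(iter(log_levels.values())))
    shifts = [0.0]  # the baseline tissue is left unchanged
    for t in range(1, n_tissues):
        # Differences to the baseline, over genes observed in both tissues.
        diffs = [v[t] - v[0] for v in log_levels.values()
                 if v[t] is not None and v[0] is not None]
        # The median minimizes the sum of absolute differences.
        shifts.append(median(diffs))
    normalized = {
        gene: [None if x is None else x - shifts[t] for t, x in enumerate(v)]
        for gene, v in log_levels.items()
    }
    return normalized, shifts

# Toy data (made up): four genes, two tissues, one missing value.
toy = {
    "g1": [1.0, 3.0],
    "g2": [2.0, 4.5],
    "g3": [3.0, 5.0],
    "g4": [4.0, None],
}
normalized, shifts = normalize_to_baseline(toy)
```

The same function would be applied separately to the mRNA and the protein tables, since the text uses the same normalization for both.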
After normalization, we define $r_{it} = p_{it} - m_{it}$ as the log PTR ratio of gene i in tissue t. If the post-transcriptional regulation of the ith gene were not tissue-specific, then the ith PTR ratio would be independent of tissue-type and could be estimated as
$$\hat{r}_i = \operatorname{median}_t \left( r_{it} \right).$$
In such a situation, the log “scaled mRNA” (or mean protein level) can be defined as
$$\hat{p}_{it} = m_{it} + \hat{r}_i.$$
On the raw scale this amounts to scaling each mRNA by its median PTR ratio and represents an estimate of the mean protein level. The residual difference between the measured log protein level and the log mean protein level, $p_{it} - \hat{p}_{it} = r_{it} - \hat{r}_i$, which we call the log rPTR ratio, consists of both tissue-specific post-transcriptional regulation and measurement noise.
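The quantities defined above can be illustrated for a single gene with a short Python sketch. The function name and the toy numbers are hypothetical, and the median is taken over all tissues, as in this paragraph.

```python
from statistics import median

def scaled_mrna_and_rptr(log_mrna, log_protein):
    """For one gene, return the median log PTR, the log scaled mRNA per
    tissue, and the log rPTR per tissue (measured minus scaled mRNA)."""
    ptr = [p - m for p, m in zip(log_protein, log_mrna)]   # log PTR per tissue
    median_ptr = median(ptr)                               # tissue-independent PTR estimate
    scaled = [m + median_ptr for m in log_mrna]            # log scaled mRNA
    rptr = [p - s for p, s in zip(log_protein, scaled)]    # residual: log rPTR
    return median_ptr, scaled, rptr

# Toy values (made up) for one gene measured in three tissues.
log_mrna = [1.0, 2.0, 3.0]
log_protein = [4.0, 5.5, 5.0]
median_ptr, scaled, rptr = scaled_mrna_and_rptr(log_mrna, log_protein)
```

In this toy case the third tissue has a negative rPTR, meaning the measured protein is lower than the scaled mRNA would suggest; with real data such a residual mixes regulation and measurement noise.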
For each gene i, we compute the correlation between mRNA and protein across tissues. Unlike the between-gene correlations, which are consistently large after scaling for each tissue (Fig 1a), across-tissues correlations are highly variable between genes. Although this could be in part because true mRNA/protein correlations vary significantly between genes, much of the heterogeneity can be explained by sampling variability. There are only between 10 and 12 tissues in common across datasets (depending on which datasets are used) and for many genes the abundances are missing, which means that the empirical estimates of the across-tissues correlation for each gene are very noisy. To find a representative estimate of the across-tissues correlation we can take the median over all genes. As an alternative, if the correlation is roughly constant between genes, we can pool information to yield a representative estimate of this across-tissues correlation. For a gene i, we compute the Fisher transformation of the within-gene correlation $\hat{\rho}_i$:
$$z_i = \operatorname{arctanh}(\hat{\rho}_i) = \frac{1}{2} \ln \frac{1 + \hat{\rho}_i}{1 - \hat{\rho}_i}.$$
This Fisher transformation is approximately normally distributed:
$$z_i \sim \mathcal{N}\left( \operatorname{arctanh}(\rho), \; \frac{1}{N_i - 3} \right),$$
where $N_i$ is the number of observed mRNA-protein pairs for gene i (at most 12) and ρ corresponds to the population correlation. We compute the maximum likelihood estimate of the Fisher-transformed population correlation by weighting each observation by the inverse of its variance:
$$\hat{z} = \frac{\sum_i (N_i - 3) \, z_i}{\sum_i (N_i - 3)}.$$
We then transform this estimate back to the correlation scale, $\hat{\rho} = \tanh(\hat{z})$. Depending on the data sets used, with this method we estimate the population across-tissues mRNA/protein correlation to be between 0.21 () and 0.29 (). This correlation cannot be used as direct evidence for the relationship between mRNA and protein levels since both mRNA and protein datasets are unreliable due to measurement noise. This measurement noise attenuates the true correlation. Below we address this by directly estimating data reliability and correcting for noise.
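The pooling of per-gene correlations can be sketched as follows. This is an illustrative Python version that assumes the standard large-sample variance 1/(N − 3) of the Fisher-transformed correlation; the function name is hypothetical, and genes with three or fewer observed pairs (or a correlation of exactly ±1) are skipped because their weight or transform is undefined.

```python
import math

def pooled_correlation(correlations, n_pairs):
    """Inverse-variance weighted mean of Fisher-transformed correlations,
    transformed back to the correlation scale."""
    num = 0.0
    den = 0.0
    for rho, n in zip(correlations, n_pairs):
        if n <= 3 or abs(rho) >= 1.0:
            continue  # variance 1/(n - 3) undefined, or atanh infinite
        weight = n - 3                  # inverse of the variance 1/(n - 3)
        num += weight * math.atanh(rho)
        den += weight
    if den == 0.0:
        raise ValueError("no gene has enough observed pairs to pool")
    return math.tanh(num / den)

# Toy per-gene correlations and numbers of observed pairs (made up).
pooled = pooled_correlation([0.2, 0.6, 0.9], [12, 12, 3])
```

Genes observed in more tissues receive larger weights, so sparsely measured genes contribute little to the pooled estimate.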
Measurement noise attenuates estimates of correlations between mRNA and protein levels. A simple way to quantify this attenuation of correlation due to measurement error is via Spearman’s correction. Spearman’s correction is based on the fact that the variance of the measured data can be decomposed into the sum of the variance of the signal and the variance of the noise. If the noise and the signal are independent, this decomposition and Spearman’s correction are exact.
Note that it is simple to show that the empirical variance is the sum of the variance of the signal and the variance of the noise:
- ei—Expectation at the ith data point;
- ζi—Noise at the ith data point; 〈ζ〉 = 0
- xi—Observation at the ith data point; xi = ei + ζi.
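With the definitions in this list, the decomposition can be written out as the following sketch. It assumes, beyond what the list states, that the noise is uncorrelated with the expectation across data points, so that the cross term vanishes; the symbols $\sigma_x^2$, $\sigma_e^2$, and $\sigma_\zeta^2$ are introduced here for the variances of the observations, the signal, and the noise, and 〈·〉 denotes an average over data points.

```latex
\begin{align}
\sigma_x^2
  &= \left\langle \left( x_i - \langle x \rangle \right)^2 \right\rangle
   = \left\langle \left( e_i - \langle e \rangle + \zeta_i \right)^2 \right\rangle
   && \text{since } \langle x \rangle = \langle e \rangle \text{ when } \langle \zeta \rangle = 0 \\
  &= \left\langle \left( e_i - \langle e \rangle \right)^2 \right\rangle
   + 2 \left\langle \left( e_i - \langle e \rangle \right) \zeta_i \right\rangle
   + \left\langle \zeta_i^2 \right\rangle \\
  &= \sigma_e^2 + \sigma_\zeta^2
   && \text{if the cross term is zero.}
\end{align}
```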
Spearman’s correction is based on estimates of the “reliability” of the measurements, which is defined as the fraction of total measured variance due to signal rather than to noise:
$$\text{Reliability}(X) = \frac{\sigma_e^2}{\sigma_x^2} \qquad (1)$$
$$= \frac{\sigma_e^2}{\sigma_e^2 + \sigma_\zeta^2}, \qquad (2)$$
where $\sigma_e^2$, $\sigma_\zeta^2$, and $\sigma_x^2$ denote the variances of the signal, the noise, and the measured data, respectively. If X and Y are noisy measurements of two quantities, we can compute the noise-corrected correlation between them as
$$\text{Corr}_{\text{corrected}}(X, Y) = \frac{\text{Corr}(X, Y)}{\sqrt{\text{Reliability}(X) \, \text{Reliability}(Y)}}. \qquad (3)$$
In practice, reliabilities are not known but we can often estimate them. In this application, for both mRNA and protein we need measurements in which all steps, from sample collection to level estimation, are repeated independently. In order to estimate the mRNA reliabilities we use independent measurements from  and . For estimating protein reliabilities we use measurements from  and . Across-tissues reliabilities are computed per gene, whereas within-tissue reliabilities are computed per tissue across genes. If two independent measurements have the same reliability, it can be estimated by computing the correlation between the two measurements [17, 46, 47].
We estimated the approximate across-tissues protein reliability to be 0.21 and the across-tissues mRNA reliability to be 0.77. Given the estimated across-tissues mRNA/protein correlation of 0.29 (calculated using data from  and ) we estimated the noise-corrected fraction of across-tissues protein variance explained by mRNA to be approximately 50%, Fig 2. Note that if both mRNA or both protein datasets share biases, then the estimated reliabilities will be too large, thus deflating the inferred fraction of protein variance explained by mRNA. Moreover, because the reliabilities are low, sampling variability is large, missing data are prevalent, and mRNA/protein correlations likely vary by gene, there is uncertainty about this estimate.
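As a worked check of the numbers quoted above, Spearman’s correction can be applied in a few lines of Python. The function name is hypothetical; the inputs are the correlation and reliabilities reported in the text, and the squared corrected correlation is read as the fraction of variance explained.

```python
import math

def spearman_corrected_correlation(observed_corr, reliability_x, reliability_y):
    """Divide the observed correlation by the geometric mean of the
    two reliabilities (Spearman's correction for attenuation)."""
    if not (0.0 < reliability_x <= 1.0 and 0.0 < reliability_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return observed_corr / math.sqrt(reliability_x * reliability_y)

# Values reported in the text: correlation 0.29, mRNA reliability 0.77,
# protein reliability 0.21.
corrected = spearman_corrected_correlation(0.29, 0.77, 0.21)
variance_explained = corrected ** 2   # about 0.52, i.e. roughly 50%
```

Because estimated reliabilities are themselves noisy, the corrected correlation is not guaranteed to stay below 1, which is one reason the text treats this estimate as uncertain.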
Creating a consensus protein dataset
We use the two independent protein datasets to create a single consensus data set which is of arguably higher reliability than either dataset individually. To create this dataset, we take a weighted average of the two protein abundance datasets, by tissue. We compute the weights based on measurement reliabilities for each tissue in each of the two datasets.
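Assume we have two random variables, $X_1$ and $X_2$, corresponding to measurements of the same quantity (e.g. two independent protein measurements), with
$$X_{ji} = S_i + \epsilon_{ji}, \quad j \in \{1, 2\},$$
where $S_i$ is the signal, which is independent of $\epsilon_{ji}$, the measurement error for sample i. We have a third random variable, $Y$, corresponding to a different quantity (e.g. an mRNA measurement), that is typically positively correlated with $X_1$ and $X_2$ and has the same covariance with both, $\text{Cov}(X_1, Y) = \text{Cov}(X_2, Y)$. To create the consensus data set we first compute the reliability of $X_j$ for both datasets.
Because the measurement errors of the two datasets are independent, $\text{Cov}(X_1, X_2) = \sigma_S^2$, so that
$$\text{Corr}(X_1, X_2) = \frac{\sigma_S^2}{\sigma_{X_1} \sigma_{X_2}} \quad \text{and} \quad \frac{\text{Corr}(X_1, Y)}{\text{Corr}(X_2, Y)} = \frac{\sigma_{X_2}}{\sigma_{X_1}}.$$
Thus,
$$\text{Reliability}(X_1) = \frac{\sigma_S^2}{\sigma_{X_1}^2} = \text{Corr}(X_1, X_2) \, \frac{\text{Corr}(X_1, Y)}{\text{Corr}(X_2, Y)}.$$
Similarly, $\text{Reliability}(X_2) = \text{Corr}(X_1, X_2) \, \text{Corr}(X_2, Y) / \text{Corr}(X_1, Y)$. We use these facts and compute the empirical correlations between datasets to independently estimate the across-gene reliabilities for each tissue from each dataset. We then Fisher weight the protein abundances based on their reliabilities. That is, for each tissue t, the consensus dataset is the weighted average
$$p^{\text{consensus}}_{t} = \frac{w_{1t} \, p_{1t} + w_{2t} \, p_{2t}}{w_{1t} + w_{2t}},$$
where $p_{jt}$ denotes the protein levels in tissue t from dataset j and the weight $w_{jt}$ is determined by the reliability of dataset j in tissue t. When the reliabilities of $X_1$ and $X_2$ are close, each dataset is weighted equally. When one reliability dominates the other, that dataset contributes more to the aggregated dataset. We found that the full consensus data set has a higher median per gene correlation with mRNA than either of the protein datasets individually (0.34) and agreed more closely with validation data from  (Table 1).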
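The reliability estimate from three pairwise correlations, followed by a weighted average, can be sketched in Python. This is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the toy vectors are constructed so that the noise is exactly uncorrelated with the signal, and the weights are simply set equal to the estimated reliabilities because the exact Fisher weights are not reproduced here.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def reliabilities(x1, x2, y):
    """Estimate the reliabilities of x1 and x2 from their mutual correlation
    and their correlations with a third variable y (e.g. mRNA)."""
    c12, c1y, c2y = pearson(x1, x2), pearson(x1, y), pearson(x2, y)
    return c12 * c1y / c2y, c12 * c2y / c1y

def consensus(x1, x2, w1, w2):
    """Weighted average of two measurements of the same quantity."""
    return [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(x1, x2)]

# Toy data for one tissue (made up): signal [-2, -1, 0, 1, 2] plus noise that is
# uncorrelated with the signal and between datasets; mRNA is twice the signal.
protein_a = [-1.0, -2.0, 0.0, 0.0, 3.0]
protein_b = [-1.5, -0.5, -2.0, 1.5, 2.5]
mrna = [-4.0, -2.0, 0.0, 2.0, 4.0]

rel_a, rel_b = reliabilities(protein_a, protein_b, mrna)
# Assumption: weights equal to the estimated reliabilities (not the paper's Fisher weights).
merged = consensus(protein_a, protein_b, rel_a, rel_b)
```

In this constructed example the estimated reliabilities recover the true signal-to-total variance ratios (10/14 and 10/15), so the two datasets receive nearly equal weight.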
Functional gene set analysis
To identify tissue-specific rPTR for functional sets of genes, we analyzed the distributions of rPTR ratios within functional gene-sets using the same methodology as . We restrict our attention to functional groups in the GO ontology  for which at least 10 genes were quantified by . Let k index one of these approximately 1600 functional gene sets. First, for every gene in every tissue we estimate the relative PTR (rPTR) or, equivalently, the difference between the measured log protein level and the log mean protein level:
$$\text{rPTR}_{it} = r_{it} - \operatorname{median}_{t' \neq t} \left( r_{it'} \right).$$
To exclude the possibility that $\text{rPTR}_{it} = 0$ exactly, we require that t′ ≠ t. When the estimated rPTR is larger than zero, the measured protein level in tissue t is larger than the estimated mean protein level. Likewise, when this quantity is smaller than zero, the measured protein level is smaller than expected. Measured deviations from the mean protein level are due to both measurement noise and tissue-specific PTR. To eliminate the possibility that all of the variability in the rPTR ratios is due to measurement error, we conduct a full gene set analysis.
Let $\text{KS}(S_1, S_2)$ be the function that returns the p-value of the Kolmogorov-Smirnov test on the distributions in sets $S_1$ and $S_2$. The KS-test is a test for a difference in distribution between two samples. Using this test, we identify gene sets that show systematic differences in PTR ratio in a particular tissue (t) relative to all other tissues.
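A two-sample KS comparison of this kind can be sketched in pure Python. The implementation below is illustrative only: the function name and the toy rPTR values are hypothetical, the p-value uses the asymptotic Kolmogorov series with a common small-sample adjustment rather than an exact calculation, and comparing a gene set in tissue t against the same gene set in the other tissues is an assumption about how the two samples are formed.

```python
import math
from bisect import bisect_right

def ks_two_sample(sample_a, sample_b):
    """Return (D, approximate p-value) for the two-sample KS test."""
    a, b = sorted(sample_a), sorted(sample_b)
    n1, n2 = len(a), len(b)
    # D is the largest gap between the two empirical CDFs, checked at every data point.
    d = max(abs(bisect_right(a, x) / n1 - bisect_right(b, x) / n2) for x in a + b)
    n_eff = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:
        return d, 1.0  # the series converges slowly here; the p-value is essentially 1
    # Asymptotic Kolmogorov distribution (alternating series).
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                  for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Toy rPTR values (made up) for one gene set: tissue t versus the other tissues.
rptr_in_tissue = [0.9, 1.1, 0.8, 1.3, 1.0, 0.7, 1.2, 0.95, 1.05, 1.15]
rptr_elsewhere = [-0.2, 0.1, 0.0, -0.1, 0.3, 0.2, -0.3, 0.05, 0.15, -0.05]
d_stat, p_value = ks_two_sample(rptr_in_tissue, rptr_elsewhere)
```

A small p-value for a gene set indicates that its rPTR distribution in tissue t is shifted or reshaped relative to the comparison sample, which is harder to attribute to independent measurement noise alone.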
To correct for testing multiple hypotheses, we computed the false discovery rate (FDR) for all gene sets in tissue t. In Fig 3a–3c, we present only the functional groups with FDR less than 1% and report their associated p-values. Note that the test statistics for each gene set are positively correlated since the gene sets are not disjoint, but  prove that the Benjamini-Hochberg procedure applied to positively correlated test statistics is conservative. Thus, the significance of certain functional groups suggests that not all of the variability in rPTR is due to measurement noise. We also calculated rPTR using two pairs of measurements: one set of rPTR estimates was calculated using protein data from  and mRNA from  and the other was calculated using data from  and . rPTR of the significant sets was largely reproducible across estimates from independent datasets (Fig 3d) and less reproducible across genes (S2 Table). Note that when computing the per-tissue reliabilities for the construction of the consensus data set, we found that the lung and pancreas datasets from  were much less reliable than the data from . This could explain why the independent estimates of the rPTR ratios for these tissues were less reproducible.
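The Benjamini-Hochberg adjustment mentioned above can be sketched in Python as follows. The function name and the toy p-values are hypothetical, and this is the textbook step-up procedure rather than the authors' exact FDR computation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values (made up) for four gene sets in one tissue.
p_values = [0.001, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)
# Gene sets passing a 1% FDR threshold, as used for Fig 3a-3c.
significant = [i for i, q in enumerate(adjusted) if q < 0.01]
```

Only the first toy gene set survives the 1% threshold here; the procedure would be run separately for each tissue, over all gene sets tested in that tissue.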
S1 Table. Estimates of relative protein-to-RNA (rPTR) ratio for GO terms reproduce across different datasets.
Pearson correlations between two estimates of the median rPTR ratios for all GO terms indicate reproducible effects in all tissues. As in Fig 2, rPTR estimates are derived using independent data sources. The lower and upper estimates are the endpoints of the 95% confidence interval.
S2 Table. Estimates of relative protein-to-RNA (rPTR) ratio for genes reproduce across different datasets.
Correlations between the two estimates of rPTR ratios for all genes indicate reproducible effects in all tissues. The rPTR ratios were estimated independently from different datasets (as in Fig 2). The lower and upper estimates are the endpoints of the 95% confidence interval.
S1 Dataset. Consensus dataset of protein levels across human tissues.
A zip-archived comma-delimited text file with consensus estimates of protein levels across 13 human tissues: adrenal gland, colon, esophagus, kidney, liver, lung, ovary, pancreas, prostate, testis, spleen, stomach, and heart.
S2 Dataset. Peptide levels across human tissues.
A zip-archived comma-delimited text file with estimates of peptide levels across 13 human tissues: adrenal gland, colon, esophagus, kidney, liver, lung, ovary, pancreas, prostate, testis, spleen, stomach, and heart. This file contains all peptide levels (integrated precursors areas) estimated from the MaxQuant searches described in the Methods.
S1 Fig. The total protein variance explained by scaled mRNA levels is not indicative of the correlations between mRNA and protein fold-changes across the corresponding tissue pairs.
(a-c, top row) Protein versus mRNA in kidney, liver and prostate. (d-f, middle row) Protein versus scaled mRNA in kidney, liver and prostate. The only difference from the top row is that the mRNA was scaled by the median PTR. (g-i, bottom row) Protein fold changes versus the corresponding mRNA fold changes between the tissues indicated on the top. While scaled mRNA is predictive of the absolute protein levels, the accuracy of these predictions does not generally reflect the accuracy of protein fold-changes across tissues that are predicted from the corresponding mRNA fold-changes. RNA fold changes in (g-i, bottom row) were computed between the mRNA levels without PTR scaling.
S2 Fig. Fraction of across-tissues variability in protein levels explained by RNA variability for different functional gene sets.
(a) The distributions of across-tissues correlations for gene sets defined by the gene ontology are shown as boxplots. The reliability of RNA and protein are estimated as the correlations between estimates from different datasets. (b) For each gene set, the median RNA-protein correlation was corrected by the median reliabilities and the results shown as a boxplot. Differences between RNA-protein correlations for different gene-sets cannot be explained simply by differences in the reliabilities.
We thank M. Jovanovic, H. Specht, E. Wallace, J. Schmiedel, and D. A. Drummond for discussions and constructive comments.
Supplemental website: https://web.northeastern.edu/slavovlab/2016_PTR/
The code can be found at: https://github.com/afranks86/tissue-ptr
- Conceptualization: NS.
- Data curation: NS AF.
- Formal analysis: NS AF.
- Funding acquisition: NS AF EA.
- Investigation: NS AF.
- Methodology: NS AF.
- Project administration: NS AF EA.
- Resources: NS EA.
- Supervision: NS.
- Validation: NS AF.
- Writing – original draft: NS AF.
- Writing – review & editing: NS AF.
- 1. Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences. 2001;98(19):10869–10874.
- 2. Slavov N, Dawson KA. Correlation signature of the macroscopic states of the gene regulatory network in cancer. Proceedings of the National Academy of Sciences. 2009;106(11):4079–4084.
- 3. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, et al. Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular biology of the cell. 1998;9(12):3273–3297. pmid:9843569
- 4. Slavov N, Macinskas J, Caudy A, Botstein D. Metabolic cycling without cell division cycling in respiring yeast. Proceedings of the National Academy of Sciences of the United States of America. 2011;108(47):19090–19095.
- 5. Slavov N, Airoldi EM, van Oudenaarden A, Botstein D. A conserved cell growth cycle can account for the environmental stress responses of divergent eukaryotes. Molecular Biology of the Cell. 2012;23(10):1986–1997. pmid:22456505
- 6. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. Landscape of transcription in human cells. Nature. 2012;489(7414):101–108. pmid:22955620
- 7. Alberts B, Johnson A, Morgan JLD, Raff M, Roberts K, Walter P. Molecular Biology of the Cell. 6th ed. Garland; 2014.
- 8. Kuersten S, Goodwin EB. The power of the 3′ UTR: translational control and development. Nature Reviews Genetics. 2003;4(8):626–637. pmid:12897774
- 9. Hengst L, Reed SI. Translational control of p27Kip1 accumulation during the cell cycle. Science. 1996;271(5257):1861–1864. pmid:8596954
- 10. Polymenis M, Schmidt EV. Coupling of cell division to cell growth by translational control of the G1 cyclin CLN3 in yeast. Genes & development. 1997;11(19):2522.
- 11. Daran-Lapujade P, Rossell S, van Gulik WM, Luttik MA, de Groot MJ, Slijper M, et al. The fluxes through glycolytic enzymes in Saccharomyces cerevisiae are predominantly regulated at posttranscriptional levels. Proceedings of the National Academy of Sciences. 2007;104(40):15753–15758.
- 12. Slavov N, Budnik B, Schwab D, Airoldi E, van Oudenaarden A. Constant Growth Rate Can Be Supported by Decreasing Energy Flux and Increasing Aerobic Glycolysis. Cell Reports. 2014;7:705–714. pmid:24767987
- 13. Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance in yeast. Molecular and cellular biology. 1999;19(3):1720–1730. pmid:10022859
- 14. Smits AH, Lindeboom RG, Perino M, van Heeringen SJ, Veenstra GJC, Vermeulen M. Global absolute quantification reveals tight regulation of protein expression in single Xenopus eggs. Nucleic acids research. 2014;42(15):9880–9891. pmid:25056316
- 15. Schwanhäusser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, et al. Global quantification of mammalian gene expression control. Nature. 2011;473(7347):337–342. pmid:21593866
- 16. Li JJ, Bickel PJ, Biggin MD. System wide analyses have underestimated protein abundances and the importance of transcription in mammals. PeerJ. 2014;2:e270. pmid:24688849
- 17. Csárdi G, Franks A, Choi DS, Airoldi EM, Drummond DA. Accounting for experimental noise reveals that mRNA levels, amplified by post-transcriptional processes, largely determine steady-state protein levels in yeast. PLoS Genetics. 2015;11(5):e1005206. pmid:25950722
- 18. Jovanovic M, Rooney MS, Mertins P, Przybylski D, Chevrier N, Satija R, et al. Dynamic profiling of the protein life cycle in response to pathogens. Science. 2015;347(6226):1259038. pmid:25745177
- 19. Cheng Z, Teo G, Krueger S, Rock TM, Koh HW, Choi H, et al. Differential dynamics of the mammalian mRNA and protein expression response to misfolding stress. Molecular systems biology. 2016;12(1):855. pmid:26792871
- 20. Wilhelm M, Schlegl J, Hahne H, Gholami A, Lieberenz M, et al. Mass-spectrometry-based draft of the human proteome. Nature. 2014;509:582–587. pmid:24870543
- 21. Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, et al. A draft map of the human proteome. Nature. 2014;509(7502):575–581. pmid:24870542
- 22. Edfors F, Danielsson F, Hallström BM, Käll L, Lundberg E, Pontén F, et al. Gene specific correlation of RNA and protein levels in human cells and tissues. Molecular Systems Biology. 2016;12(10):883.
- 23. Blyth CR. On Simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association. 1972;67(338):364–366.
- 24. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research. 2008;18(9):1509–1517. pmid:18550803
- 25. Consortium SI, et al. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nature Biotechnology. 2014;32(9):903–914.
- 26. Peng M, Taouatas N, Cappadona S, van Breukelen B, Mohammed S, Scholten A, et al. Protease bias in absolute protein quantitation. Nature methods. 2012;9(6):524–525. pmid:22669647
- 27. Ong SE, Blagoev B, Kratchmarova I, Kristensen DB, Steen H, Pandey A, et al. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Molecular & cellular proteomics. 2002;1(5):376–386.
- 28. Blagoev B, Ong SE, Kratchmarova I, Mann M. Temporal analysis of phosphotyrosine-dependent signaling networks by quantitative proteomics. Nature biotechnology. 2004;22(9):1139–1145. pmid:15314609
- 29. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. Landscape of transcription in human cells. Nature. 2012;489(7414):101–108. pmid:22955620
- 30. Fagerberg L, Hallström BM, Oksvold P, Kampf C, Djureinovic D, Odeberg J, et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Molecular & Cellular Proteomics. 2014;13(2):397–406.
- 31. Consortium GO, et al. The Gene Ontology (GO) database and informatics resource. Nucleic acids research. 2004;32(suppl 1):D258–D261.
- 32. Hall JE. Guyton and Hall Textbook of Medical Physiology: Enhanced E-book. Elsevier Health Sciences; 2010.
- 33. Budnik B, Levy E, Slavov N. Mass-spectrometry of single mammalian cells quantifies proteome heterogeneity during cell differentiation. bioRxiv. 2017; https://doi.org/10.1101/102681
- 34. McIsaac RS, Silverman SJ, McClean MN, Gibney PA, Macinskas J, Hickman MJ, et al. Fast-acting and nearly gratuitous induction of gene expression and protein depletion in Saccharomyces cerevisiae. Molecular biology of the cell. 2011;22(22):4447–4459. pmid:21965290
- 35. Mauro VP, Edelman GM. The ribosome filter hypothesis. Proceedings of the National Academy of Sciences. 2002;99(19):12031–12036.
- 36. Mauro VP, Matsuda D. Translation regulation by ribosomes: Increased complexity and expanded scope. RNA biology. 2016;13(9):748–755. pmid:26513496
- 37. Slavov N, Semrau S, Airoldi E, Budnik B, van Oudenaarden A. Differential stoichiometry among core ribosomal proteins. Cell Reports. 2015;13:865–873. pmid:26565899
- 38. Preiss T. All Ribosomes Are Created Equal. Really? Trends in biochemical sciences. 2016;41(2):121–123. pmid:26682497
- 39. Gebauer F, Hentze MW. Molecular mechanisms of translational control. Nature reviews Molecular cell biology. 2004;5(10):827–835. pmid:15459663
- 40. Rojas-Duran MF, Gilbert WV. Alternative transcription start site selection leads to large differences in translation activity in yeast. RNA. 2012;18(12):2299–2305. pmid:23105001
- 41. Castello A, Fischer B, Eichelbaum K, Horos R, Beckmann BM, Strein C, et al. Insights into RNA biology from an atlas of mammalian mRNA–binding proteins. Cell. 2012;149(6):1393–1406. pmid:22658674
- 42. Arribere JA, Gilbert WV. Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing. Genome research. 2013;23(6):977–987. pmid:23580730
- 43. Katz Y, Li F, Lambert NJ, Sokol ES, Tam WL, Cheng AW, et al. Musashi proteins are post-transcriptional regulators of the epithelial-luminal cell state. eLife. 2014;3:e03915. pmid:25380226
- 44. Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature biotechnology. 2008;26(12):1367–1372. pmid:19029910
- 45. Franks AM, Csárdi G, Drummond DA, Airoldi EM. Estimating a structured covariance matrix from multi-lab measurements in high-throughput biology. Journal of the American Statistical Association. 2015;110(509):27–44. pmid:25954056
- 46. Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904;15:72–101.
- 47. Zimmerman D, Williams R. Properties of the Spearman correction for attenuation for normal and realistic non-normal distributions. Applied Psychological Measurement. 1997;21(3):253–270.
- 48. Slavov N, Botstein D. Coupling among growth rate response, metabolic cycle, and cell division cycle in yeast. Molecular Biology of the Cell. 2011;22(12):1997–2009. pmid:21525243
- 49. Storey JD. The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of statistics. 2003; p. 2013–2035.
- 50. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of statistics. 2001; p. 1165–1188.