mRNA 5′ terminal sequences drive 200-fold differences in expression through effects on synthesis, translation and decay

mRNA regulatory sequences control gene expression at multiple levels including translation initiation and mRNA decay. The 5′ terminal sequences of mRNAs have unique regulatory potential because of their proximity to key post-transcriptional regulators. Here we have systematically probed the function of 5′ terminal sequences in gene expression in human cells. Using a library of reporter mRNAs initiating with all possible 7-mer sequences at their 5′ ends, we find an unexpected impact on transcription that underlies 200-fold differences in mRNA expression. Library sequences that promote high levels of transcription mirrored those found in native mRNAs and define two basic classes with similarities to classic Initiator (Inr) and TCT core promoter motifs. By comparing transcription, translation and decay rates, we identify sequences that are optimized for both efficient transcription and growth-regulated translation and stability, including variants of terminal oligopyrimidine (TOP) motifs. We further show that 5′ sequences of endogenous mRNAs are enriched for multi-functional TCT/TOP hybrid sequences. Together, our results reveal how 5′ sequences define two general classes of mRNAs with distinct growth-responsive profiles of expression across synthesis, translation and decay.


Introduction
Genes are expressed in complex patterns that are post-transcriptionally controlled through instructions encoded within mRNAs. These instructions can occur throughout mRNAs, but 5′ terminal sequences have unique regulatory potential because they are adjacent to the mRNA cap structure. This structure is appended to mRNAs co-transcriptionally and then rapidly bound by nuclear cap-binding proteins that impact both splicing and export from the nucleus [1]. Once in the cytoplasm, the cap is bound by a different complement of factors that facilitate translation [1]. Finally, the removal of the cap is the penultimate and generally irreversible step of mRNA decay.
There are several known examples of 5′ terminal sequences with post-transcriptional regulatory functions. The best-studied of these are terminal oligopyrimidine (TOP) motifs, which are defined as a +1 C followed by a series of pyrimidines [2]. These motifs commonly occur in mRNAs encoding translation factors and render their translation and stability sensitive to growth signals through the mTOR Complex 1 pathway [3][4][5][6]. Regulation of these mRNAs is mediated through the translation repressor La-related protein 1 (Larp1), which binds TOP motifs and sequesters the mRNAs away from the cap-binding translation factor, eIF4F [7][8][9]. eIF4F also interacts directly with cap-adjacent nucleotides [10] and has intrinsically lower affinity for mRNAs initiating with a +1 C [11]. Finally, there is evidence that decapping efficiency is influenced by 5′ nucleotides: in yeast, the decapping protein Dcp2 preferentially cleaves mRNAs with A or G as the +1 nucleotide, potentially reducing the overall stability of these transcripts [12].
5′ terminal sequences are also constrained by requirements for efficient transcription start site selection and initiation. These sequence patterns are reflected in the "core promoter" sequence [13], a 100-150 nt region that encodes an array of elements that position the transcription machinery and ultimately trigger transcription initiation. There is no canonical sequence that absolutely defines the transcription start site (TSS), and transcription generally initiates at a spectrum of positions. Nonetheless, the most common TSS motif is the Initiator (Inr) sequence, which occurs in ~50% of genes [13]. The human Inr consensus sequence is BBCA(+1)BW [14] (B = G,C,T; W = A,T; +1 indicates the first nucleotide), although a global analysis of TSS data reported a minimal Inr of YR(+1) [15,16]. A less common initiation motif is the TCT motif [17][18][19]. Similar to TOP motifs, this element is most common in the promoters of translation-associated genes. The established motif, based primarily on analyses of endogenous ribosomal protein (RP) mRNA TSSs, is defined as YYC(+1)TTTYY, yielding mRNAs that initiate with a +1 C rather than the +1 A that characterizes Inr TSSs [18].
Aside from these examples, the full spectrum of 5′ sequence functions has not been determined. A powerful approach for identifying similar regulatory elements has been to use large-scale reporter libraries that systematically query the function of thousands of mRNA sequences [20][21][22][23][24]. However, these systems are generally designed with promoters that produce mRNAs with fixed 5′ terminal sequences, and so have not assessed functions of different 5′ sequences. To survey functions of 5′ sequences directly, we constructed a reporter library of mRNAs varying only in the first 7 nucleotides at the 5′ end. Library mRNAs were unexpectedly expressed at a wide range of levels that closely parallel observations for endogenous mRNAs and share features with Inr and TCT motifs. The translation and stability of library mRNAs are similar under basal conditions but diverge under conditions when cap-dependent translation is repressed. The 5′ sequences that define translationally-regulated mRNAs include TOP motifs and related sequences that partially overlap with TCT sequences that are optimal for transcription. We show that endogenous mRNAs are enriched for hybrid TOP/TCT sequences that combine the efficient transcription of TCT motifs and the translation regulation of TOP sequences, and that these patterns are evolutionarily conserved.

Design of a 5′ reporter system
To systematically test the functions of 5′ terminal sequences, we constructed a plasmid encoding a CMV promoter, a short 5′ UTR, a coding sequence (Renilla luciferase), and a constant 3′ UTR (Fig 1A). The CMV promoter preferentially initiates transcription at a specific site [9], allowing us to specify the 5′ ends of expressed mRNAs. A cassette of 7 random nucleotides was then positioned directly after the CMV promoter region. We chose a 7 nt sequence because many RNA-binding proteins interact with sequences of similar size and because it is short enough that all 16,384 7-mers could be robustly quantified by deep sequencing. Deep sequencing of the plasmid library identified 16,359 of 16,384 7-mer sequences with greater than 20 reads and no position-specific bias (S1 Fig). There is a slight overrepresentation of T and G nucleotides, which is likely an artifact introduced during the initial synthesis of the randomized 7 nt sequence. This minor bias was corrected for in downstream analysis (Methods). This plasmid library, which we refer to as the 5pseq library, was then packaged into lentivirus and used to stably infect HeLa cells.
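The library size and coverage figures above follow from simple enumeration. The sketch below is illustrative only: it lists every 7-mer over A/C/G/T and counts how many exceed a read threshold. The function names and the toy read counts are assumptions made for this example, not part of the published analysis.

```python
from itertools import product


def all_kmers(k=7, alphabet="ACGT"):
    """Return every k-mer over the given alphabet (4**k sequences for DNA)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def covered(counts, kmers, min_reads=20):
    """Count k-mers supported by more than min_reads reads."""
    return sum(1 for s in kmers if counts.get(s, 0) > min_reads)


kmers = all_kmers()

# Toy read counts, invented for illustration only.
toy_counts = {"AGTCTCT": 150, "CTCTTTC": 21, "TTTTTTT": 20}
n_covered = covered(toy_counts, kmers)

print(len(kmers))   # number of possible 7-mers
print(n_covered)    # 7-mers in the toy data with more than 20 reads
```

With real plasmid sequencing counts in place of `toy_counts`, `covered` would report the number of library members passing the read threshold.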
Fig 1 (partial caption). Extracted mRNA is reverse transcribed with a library-specific primer. A 5′ linker is then attached by splint-ligation, followed by PCR amplification with Illumina-compatible primers. (C) 5pseq mRNAs initiate at the expected TSS. Library mRNAs expressed in HeLa cells were processed as in (B) and analyzed to determine initiation sites within the 5pseq promoter region (excluding +1 G mRNAs). (D) Nucleotide frequencies at the 5′ ends of 5pseq mRNAs. Library mRNAs expressed in HeLa cells were processed as in (B) and used to determine the frequency of all 5′ 7-mer sequences initiating with A, C, or U. Frequencies of each 7-mer were then normalized to frequencies in the plasmid library and used to determine the frequency of each nucleotide at each position.
https://doi.org/10.1371/journal.pgen.1010532.g001
PLOS GENETICS mRNA 5' sequences are major determinants of gene expression
We next measured the expression levels of all library mRNAs. RNA was extracted from the infected HeLa cells, reverse transcribed with a library-specific primer, and then appended with a 5′ linker using splint-ligation (Fig 1B). Sequencing libraries were amplified and analyzed by deep sequencing, confirming preferential initiation at a single position (~48%) (Fig 1C). A downside of this approach is that commonly available reverse transcriptases often append a non-templated cytidine to the 3′ end of the synthesized cDNA [25]. This appears as an additional 5′ G in sequencing libraries. This addition could be corrected computationally for library mRNAs initiating with A, C or U. However, it was not possible to distinguish mRNAs containing bona fide +1 G sequences from those where a +1 G reflected the addition of a non-templated nucleotide during reverse transcription. For instance, an mRNA initiating at the +2 position with 5′ AACCTT might appear in sequencing results as an mRNA initiating at the +1 position with 5′ GAACCTT. This ambiguity makes counts of 7-mer sequences with +1 Gs unreliable, and so we chose to exclude them from further analysis. This is unfortunate because many endogenous mRNAs initiate transcription with a +1 G, and so this omission potentially excludes unique regulatory motifs from this analysis. Nonetheless, we detected 12,049 of the 12,288 possible sequences with +1 A, C or U with greater than 50 reads in all replicates. Aside from the +1 position, nucleotides were observed at roughly the expected frequencies amongst these library sequences (Fig 1D).
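The ambiguity created by the non-templated nucleotide can be illustrated with a small decision rule. This is a hypothetical sketch, not the correction used in the study. It assumes each read has already been trimmed to the nucleotides lying between the 5′ linker and the constant 5′ UTR, and the function name `call_five_prime_7mer` is invented for this example.

```python
def call_five_prime_7mer(upstream):
    """Return the 7-mer for a +1-initiated mRNA, or None if the call is ambiguous.

    `upstream` is assumed to hold the nucleotides between the 5' linker and
    the constant 5' UTR (a hypothetical read layout).
    """
    seq = upstream.upper()
    # Eight nucleotides with a leading G followed by A/C/T: treat the G as the
    # non-templated addition and keep the remaining 7-mer (an assumption; a
    # rare -1 initiation at a G would look identical).
    if len(seq) == 8 and seq[0] == "G" and seq[1] != "G":
        return seq[1:]
    # Seven nucleotides starting with A/C/T: a +1 initiation with no addition.
    if len(seq) == 7 and seq[0] != "G":
        return seq
    # Anything else, including every apparent +1 G, cannot be assigned safely.
    return None


print(call_five_prime_7mer("GACCTTAA"))  # leading G removed
print(call_five_prime_7mer("GAACCTT"))   # apparent +1 G: ambiguous
```

The second call reproduces the example from the text: a +2-initiated AACCTT transcript carrying an added G is indistinguishable from a genuine +1 G transcript, so it is discarded.

The 12,288 possible non-G sequences correspond to 3 × 4^6, that is, three allowed +1 nucleotides and six unconstrained positions.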

Library mRNAs are expressed across a 200-fold range
Unexpectedly, the expression levels of library mRNAs spanned more than a 200-fold range and followed a distinctly bimodal distribution (Fig 2A). Inspection of the underlying sequences revealed that high and low-expressed mRNAs showed significantly different preferences for the first (+1) nucleotide (Fig 2A). mRNAs initiating with an 'A' were uniformly well-expressed, while those initiating with a 'U' were poorly expressed. In contrast, mRNAs initiating with a +1 C spanned nearly the entire range (Fig 2A). Given that +1 G mRNAs are excluded from this analysis, the true impact of 5′ sequences on transcription efficiency may span an even greater range. Amongst the 10% most highly-expressed +1 A motifs, there was a notable depletion of A nucleotides throughout the remainder of the 7 nt sequence. The top 10% +1 C motifs also showed a decreased frequency of A nucleotides throughout the 7 nt sequence, but additionally required a +2 T/U for maximally efficient transcription (Fig 2B). Differences in expression level were not correlated with differences in library mRNA stability, which were confined to a narrow range of ~2-fold under basal conditions (S2A and S2B Fig). To test the importance of the +2 position for +1 C mRNAs, we generated a series of reporters with different +2 nucleotides. Expression analysis of these reporters confirmed the importance of a +2 T/U for maximal expression (Fig 2C). These results argue that TSS sequences are not all transcribed with equal efficiency and that they can significantly impact expression level.
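Expression levels of this kind are typically derived by comparing each sequence's share of RNA reads to its share of plasmid reads. The following is a minimal sketch under that assumption. The function names, the log2 scale and the toy counts are illustrative choices, and the 50-read cutoff simply mirrors the detection threshold mentioned earlier.

```python
import math


def normalized_expression(rna_counts, plasmid_counts, min_reads=50):
    """log2 of (RNA read fraction / plasmid read fraction) for each 7-mer."""
    rna_total = sum(rna_counts.values())
    dna_total = sum(plasmid_counts.values())
    expression = {}
    for kmer, n in rna_counts.items():
        dna = plasmid_counts.get(kmer, 0)
        if n <= min_reads or dna == 0:
            continue  # skip poorly sampled sequences
        expression[kmer] = math.log2((n / rna_total) / (dna / dna_total))
    return expression


def fold_range(log2_expression):
    """Fold difference between the highest and lowest expressed sequences."""
    values = log2_expression.values()
    return 2 ** (max(values) - min(values))


# Toy counts, invented for illustration only.
rna = {"AGTCTCT": 4000, "TAAAAAA": 100, "CAAAAAA": 10}
plasmid = {"AGTCTCT": 1000, "TAAAAAA": 1000, "CAAAAAA": 1000}

expr = normalized_expression(rna, plasmid)
print(sorted(expr))      # sequences that passed the read cutoff
print(fold_range(expr))  # fold difference across the retained sequences
```

Dividing by the plasmid fraction is what removes the synthesis bias in the input library, so that differences reflect expression and not starting abundance.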
Different promoters potentially prefer different TSS sequences. To test whether the observed TSS patterns occur beyond the CMV promoter, we examined a recently published analysis of thousands of human core promoters to identify other promoters that preferentially initiate transcription at a single position [26]. In this library, the core promoters for both KARS1 and SNHG1 drove relatively robust levels of transcription preferentially from a single A or C, respectively, mimicking the TSS selection within their respective endogenous promoters (Fig 2D). We then generated synthetic versions of these promoter sequences encoding an A, C, or T at the +1 TSS position. Both the KARS1 and SNHG1 promoters efficiently initiated transcription at either a +1 A or +1 C (Fig 2D). In contrast, replacing the +1 nucleotide with a T strongly disrupted initiation. This indicates that transcription from a +1 T is similarly disfavored in other contexts beyond the CMV promoter.

Library 5′ sequence expression predicts the expression of endogenous 5′ sequences
Next, we examined the potential significance of these findings for understanding differences in expression levels between endogenous mRNAs. To test this, we analyzed several previously reported TSS datasets prepared using cap-analysis of gene expression (CAGE), a strategy for determining mRNA 5′ terminal sequences by isolating and then sequencing RNA fragments with 5′ caps [27]. For each dataset, we selected reads that aligned to "promoter regions" of genes, which we defined as a 1 kb window surrounding the annotated transcription start site for each transcript. We then compared the frequencies of 5′ 3-mer sequences, reasoning that these account for most of the variation in expression observed in the 5pseq library. Strikingly, the relative expression of 3-mer TSS sequences in each of these CAGE datasets was significantly correlated with expression level in the 5pseq library (Fig 3A). In both CAGE and 5pseq datasets, +1 A mRNAs were collectively well-expressed, +1 U mRNAs were poorly expressed, and +1 C mRNAs spanned a broad range. The similarity between results from library and endogenous mRNAs argues that 5′ terminal sequences are important determinants of expression level amongst endogenous promoters as well. We note that +1 G sequences were excluded from this analysis for the technical reasons cited above. This leads to slight overestimates of the overall frequencies of all +1 A, C and T 3-mer sequences in both the 5pseq and CAGE datasets but does not impact their relative frequencies.
Fig 2 (partial caption). Expression levels of 5pseq library mRNAs. Library mRNAs were expressed in HeLa cells, extracted, and used to prepare sequencing libraries. Frequencies of all 7 nt sequences initiating with A, C or U were determined and normalized to frequencies within the input plasmid library. Library mRNAs were then binned by expression level and plotted as a histogram (upper panel) or density plot separately depicting expression levels of mRNAs initiating with the indicated nucleotide (lower panel). (B) Motifs in 5′ sequences of well-expressed +1 A and +1 C mRNAs. Nucleotide frequencies within the first 7 nucleotides of the top 5% +1 A and +1 C library mRNAs from (A). (C) The +2 U is required for maximal expression of +1 C mRNAs. Plasmids expressing mRNAs encoding Renilla luciferase and initiating with the indicated 5′ sequences were transfected into HEK-293T cells. Expression levels of each mRNA were determined 24 h later by qPCR. Significance by t-test between +2 U and each other construct, n = 3. (D) The TSS +1 nucleotide impacts initiation efficiency in endogenous core promoters. TSS frequencies for endogenous KARS1 and SNHG1 genes in HeLa cells (FANTOM5) or from synthetic core promoter regions of KARS1 or SNHG1 where the expected +1 nucleotide is an A, C or T.
https://doi.org/10.1371/journal.pgen.1010532.g002
Previous analyses of endogenous promoters have identified Inr and TCT motifs as important determinants of TSS selection [28]. These motifs include nucleotides upstream of the TSS but yield mRNAs with specific 5′ sequences. We therefore wondered whether these motifs are also linked to expression level in library mRNAs. Indeed, both Inr and TCT classes of 5′ sequences (ABW (B = C,G,T; W = A,T) for Inr and CTYTYY (Y = C,T) for TCT motifs) were expressed at significantly higher levels in both library and endogenous mRNAs (Fig 3B). Given the central role of the first two nucleotides in the expression of library mRNAs, we wondered whether these are the key predictive features of Inr and TCT motifs. Both AG and CT 5′ sequences were indeed expressed at significantly higher levels than average (Fig 3B). AG, in particular, was expressed at even higher levels than Inr sequences in both datasets, suggesting that this dinucleotide sequence is an optimal version of the Inr, consistent with previous analysis of CAGE data [15,16]. In contrast, the classical TCT 5′ sequence was a much stronger predictor of expression in CAGE data than in 5pseq library mRNAs (Fig 3B). This suggests that endogenous TCT sequences might also be optimized for functions beyond efficient transcription initiation, such as post-transcriptional control.
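The degenerate motif definitions given above (ABW for Inr, CTYTYY for TCT) can be applied to 5′ sequences by expanding the IUPAC codes into regular expressions. The sketch below is one illustrative way to do this. The helper names are assumptions made for this example, and testing Inr before TCT is an arbitrary choice that has no effect here because the two motifs begin with different nucleotides.

```python
import re

# IUPAC nucleotide codes used in the text, expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "[CGT]", "W": "[AT]", "Y": "[CT]", "R": "[AG]", "N": "[ACGT]",
}


def motif_regex(motif):
    """Compile a degenerate motif anchored at the +1 nucleotide."""
    return re.compile("^" + "".join(IUPAC[code] for code in motif))


INR = motif_regex("ABW")     # Inr-class 5' sequence
TCT = motif_regex("CTYTYY")  # TCT-class 5' sequence


def classify(seq):
    """Label a 5' sequence as 'Inr', 'TCT' or 'other' (U is read as T)."""
    dna = seq.upper().replace("U", "T")
    if INR.match(dna):
        return "Inr"
    if TCT.match(dna):
        return "TCT"
    return "other"


for s in ["AGTCCCC", "CUCUUUC", "AAACCCC", "TTTTTTT"]:
    print(s, classify(s))
```

Grouping library or CAGE sequences by these labels is enough to compare expression between motif classes.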

5′ sequences define distinct patterns of translation
To determine post-transcriptional functions of 5′ terminal sequences, we first examined translation differences between library mRNAs. Towards this end, extracts from HeLa cells stably expressing the 5pseq library were separated into sub-polysome and polysome-associated fractions by centrifugation through sucrose gradients (Fig 4A). Translation levels were estimated by calculating polysome/sub-polysome (P/SP) ratios for each library mRNA. Under basal conditions, library mRNAs showed small differences in translation rates, which varied across a 4-fold range (Fig 4B). mRNAs were similarly well translated regardless of the +1 nucleotide, although +1 A mRNAs were translated slightly better than +1 C or U mRNAs.
This narrow range of translation rates may reflect that under basal conditions, the 5′ ends of most mRNAs are bound by the eIF4F initiation complex and effectively sequestered from other 5′ binding proteins [29]. In contrast, growth-repressive conditions destabilize eIF4F and expose mRNA 5′ ends to other translation regulators, such as 4EHP, Larp1 or decapping proteins. To test whether the translation of mRNAs becomes more sensitive to 5′ sequences under growth-repressing conditions, we again measured P/SP ratios in cells treated with the mTOR inhibitor Torin 1, which triggers growth arrest and inhibits eIF4F (Fig 4A) [30]. Under these conditions, the translation differences between mRNAs expanded to a significantly greater range, reaching 16-fold for transcripts differing by only a handful of nucleotides (Fig 4B). In particular, mRNAs with different +1 nucleotides experienced significantly different changes in translation. +1 A mRNAs maintained the highest levels of translation, while +1 C mRNAs were significantly repressed (Fig 4B). +1 U mRNAs were slightly repressed, but much less than +1 C mRNAs. Moreover, while the well-transcribed AG and CU mRNAs were similarly translated under growth-promoting conditions, the translation of CU mRNAs was selectively repressed by mTOR inhibition (Fig 4C).
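The P/SP ratio and its change after mTOR inhibition can be written out as a short calculation. This is a sketch under stated assumptions: read counts are normalized to the total reads of each fraction, a pseudocount guards against division by zero, and the function names and toy numbers are invented for illustration.

```python
import math


def log2_p_sp(poly, sub, poly_total, sub_total, pseudocount=0.5):
    """log2 polysome/sub-polysome ratio from depth-normalized read counts."""
    p = (poly + pseudocount) / poly_total
    s = (sub + pseudocount) / sub_total
    return math.log2(p / s)


def torin_shift(control_ratio, torin_ratio):
    """Change in log2 P/SP after Torin 1; negative values indicate repression."""
    return torin_ratio - control_ratio


# Toy counts for one library mRNA, invented for illustration only.
control = log2_p_sp(400, 100, 1000, 1000, pseudocount=0)
torin = log2_p_sp(100, 400, 1000, 1000, pseudocount=0)
shift = torin_shift(control, torin)

print(control, torin, shift)
```

Ranking library mRNAs by `shift` would separate mTOR-sensitive sequences (strongly negative) from mTOR-resistant ones (near zero).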

Functional requirements of TOP sequences
A +1 C is the defining feature of TOP motifs [31]. Indeed, the most mTOR-sensitive sequences within the 5pseq library closely resembled classical TOP motifs, with increasing enrichment of C/U nucleotides at positions close to the 5′ terminus (Fig 4D). mTOR-resistant sequences were primarily distinguished by a +1 A and no other obvious features (Fig 4D). We showed previously that increasingly long series of C/U nucleotides within TOP motifs are correlated with greater repression [32]. Increasingly long series of C/U nucleotides also correlate with translation repression following mTOR inhibition in the 5pseq library (Fig 4E). In this case, the maximum degree of suppression occurs at approximately 5 nt, which matches the number of nucleotides bound by the TOP suppressor Larp1 [8]. Our previous analysis of endogenous TSSs indicated a maximum sequence of 7 nt, but this likely reflects the fact that these TSSs are heterogeneous, and longer C/U stretches ensure that a greater number of transcripts encode a maximal TOP motif. To systematically probe TOP motif requirements in the 5pseq library, we calculated the translation effect of varying each nucleotide in the canonical TOP sequence of CYYYYNN (Fig 4F). This confirmed the importance of the +1 C and the diminishing contribution of the next 4 nucleotides. Surprisingly, however, we found that the +3 position was particularly critical to TOP function. Replacement of a +3 C/U with a G was almost as disruptive as replacing the +1 C. In contrast, replacement of the +2 C/U with a purine was much less disruptive to translation regulation, but diminished expression level, as previously noted (Fig 2B). We confirmed these results with individual reporter constructs (Fig 4G).
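The two sequence features discussed above, the length of the pyrimidine run that begins with a +1 C and membership in the canonical CYYYYNN class, are simple to compute. The sketch below is illustrative, with function names chosen for this example.

```python
def leading_pyrimidine_run(seq):
    """Length of the uninterrupted C/U run starting at a +1 C (0 if +1 is not C)."""
    dna = seq.upper().replace("U", "T")
    if not dna or dna[0] != "C":
        return 0
    run = 0
    for nt in dna:
        if nt in "CT":
            run += 1
        else:
            break
    return run


def is_canonical_top(seq):
    """True if the sequence matches CYYYYNN, i.e. +1 C followed by four pyrimidines."""
    return len(seq) >= 5 and leading_pyrimidine_run(seq[:5]) == 5


for s in ["CTTTCAG", "CTGTCAG", "ATTTTTT", "CUCUUAG"]:
    print(s, leading_pyrimidine_run(s), is_canonical_top(s))
```

Binning library mRNAs by `leading_pyrimidine_run` gives the kind of grouping needed to relate C/U run length to mTOR-dependent repression. Note that a single purine at +3, as in `CTGTCAG`, truncates the run to 2.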

5′ sequences link mRNA translation and stability
Because translation functions of 5′ sequences in the 5pseq library were most evident under growth-restricting conditions, we wondered whether these conditions might also trigger changes in stability. To test this, we measured library mRNA stabilities in cells treated with the mTOR inhibitor Torin 1 by blocking transcription with Actinomycin D (Fig 5A). Under these conditions, mRNA half-lives expanded to an approximately 4-fold range (compare Fig 5A to S2A Fig). +1 A mRNAs were the most unstable, while +1 C mRNAs were most stable, and +1 U mRNAs were in between (Fig 5A). Actinomycin D can trigger artifacts that interfere with accurate measurements of mRNA stability and so we also measured the stabilities of several representative +1 A and +1 C mRNAs from the 5pseq library using a doxycycline-repressible version of the promoter (S3 Fig). Consistent with library results, +1 C mRNAs were slightly more stable than +1 A mRNAs under basal conditions and this difference was amplified under Torin 1-treated conditions (S3 Fig). We noticed that the stabilities of library mRNAs appeared inversely related to their translation status, i.e. decreased translation correlated with increased stability. To determine the extent of this relationship, we compared the stability and translation of library mRNAs in both control and growth-inhibited conditions. Under control conditions, translation and stability were effectively uncorrelated (Fig 5B). Under growth-inhibited conditions, however, a strong correlation between these properties emerged, following a pattern that was again dominated by the identity of the +1 nucleotide (Fig 5B). +1 C mRNAs were simultaneously translationally repressed and stabilized, while +1 A mRNAs were subjected to the opposite regulation (Fig 5B). This argues that the translation and stability functions of these mRNAs are linked by 5′ sequences.
Fig 4 (partial caption). Overview of 5pseq translation analysis. Extracts were prepared from HeLa cells stably expressing 5pseq library mRNAs and treated with vehicle (DMSO) or 250 nM Torin 1 for 2 h. Extracts were then centrifuged through 5-50% sucrose gradients and fractionated with constant monitoring of absorbance at 254 nm to obtain polysome profiles. RNA was extracted from the indicated polysome or sub-polysome fractions and used to prepare sequencing libraries. (B) Translation of 5pseq library mRNAs in control or Torin 1-treated conditions. Translation rates for library mRNAs isolated in (A) were estimated by Polysome/Sub-polysome ratios and depicted as density plots, separated by +1 nucleotide. (C) Box plots for translation rates of AG and CU mRNAs from (B). Significance by t-test. *p < 10^−10. (D) 5′ nucleotide frequencies of library mRNAs with mTOR-sensitive and mTOR-resistant translation. The frequencies of nucleotides at the first 7 positions of mRNAs with the 5% most and least mTOR-dependent change in translation from (B). (E) mTOR-sensitive translation increases with longer C/U 5′ sequences. Changes in translation following mTOR inhibition from (B) were determined for mRNAs with the indicated 5′ sequence motifs (R = A,G; Y = C,U). Significance by t-test comparison to mRNAs with R[7] 5′ sequences, *p < 0.01. (F) Sequence requirements of 5′ TOP motifs for translation and expression. The effect of individual nucleotide substitutions at the indicated 5′ positions in library mRNAs with 5′ CYYYYNN motifs on levels of mTOR-regulated translation from (B) and expression from Fig 2A. (G) The +3 C/U is key for TOP motif translation functions. Plasmids expressing mRNAs encoding Renilla luciferase and initiating with the indicated 5′ sequences were transfected into HEK-293T cells. Cells were incubated overnight, and then treated with vehicle (DMSO) or 250 nM Torin 1 for 2 h, and then analyzed for luciferase levels. Significance by two-way ANOVA, n = 3. *p < 10^−5.
https://doi.org/10.1371/journal.pgen.1010532.g004

Endogenous mRNAs are optimized with TCT/TOP hybrids
The results described above show overlapping but distinct sequence requirements when comparing efficient transcription from TCT TSSs and growth-dependent translation/stability regulation of TOP mRNAs. Both systems require a +1 C for efficient transcription and translation regulation. However, the transcription system shows stronger preferences for a +2 T/U nucleotide while the translation system is particularly sensitive to the +3 nucleotide. A comparison of the expression level and translation function of 5′ 3-mers shows four specific 3-mers that optimize the function of both systems: CTC, CTT, CCT, and CCC (Fig 6A). To test whether these hybrid TCT/TOP sequences are enriched in endogenous mRNAs, we analyzed CAGE reads from several cell lines and tissues. Remarkably, these four 3-mers were the most commonly used +1 C sequences in nearly all of the datasets (Fig 6B). This argues that the TSSs of endogenous mRNAs reflect selection for both transcription efficiency and translation regulation. We noted one exception in liver 5′ sequences, where CTA replaced CCC as the fourth-most common 3-mer. CTA and CTG are both strong expression motifs, but only weakly sensitive to translation regulation. Interestingly, closer inspection of the liver TSS dataset revealed that the high frequency of CTA is primarily driven by the high expression of albumin, which primarily initiates with CTA. The selective pressures driving this sequence are not clear but may reflect other properties of TCT promoters that are optimized for consistent high expression.

Evolutionary conservation of 5′ sequences
+1 C mRNAs are a unique class of transcripts that utilize specialized mechanisms for their transcription and post-transcriptional regulation. Aspects of this regulatory system have been reported in diverse species, including plants, flies, and throughout vertebrates. We therefore wondered how broadly the patterns of 5′ sequences described here are conserved. To test this, we extracted reads aligned to promoter regions of annotated genes in CAGE datasets for 5 eukaryotic species (Fig 6C) [17,27,33-36]. A limitation of this analysis is that different methods for TSS sequencing can affect the overall representation of specific sequences, even amongst datasets prepared with similar strategies. Nonetheless, we find that 5′ sequence usage is strikingly similar between humans, zebrafish, Drosophila and even Arabidopsis (Fig 6C). In contrast, there was much less similarity in TSS sequences between human and yeast datasets, which showed almost no expression of +1 C mRNAs (Fig 6C). Yeast TSSs, however, are similarly depleted of +1 T/U sequences.
In humans, the functional class most enriched for +1 C mRNAs is ribosomal protein (RP) genes, which almost all contain classical TOP motifs [31]. RP orthologues are highly conserved and so easily identified between species. To test whether this class of mRNAs utilizes the +1 C regulatory system across species, we compared RP gene TSSs between humans, zebrafish, flies, yeast and plants. 5′ sequence usage for RP genes in humans, zebrafish and Drosophila was strikingly similar, preferring the same TOP/TCT hybrids that optimized expression and translation regulation in the 5pseq library (Fig 6C). These three species all express homologues of both the TOP binding translation regulator Larp1 and the TCT transcription factor TRF2, which is strong evidence that these regulatory systems function similarly across these species. Arabidopsis showed slight enrichment of +1 C mRNAs in RP mRNAs, although the overall pattern of TSS usage is starkly different from that observed in flies and vertebrates. This is consistent with a recent observation [37] that Larp1 targets a distinct subset of mRNAs in plants. Yeast TSSs showed no enrichment for any +1 C 5′ sequences and yeast does not express homologs of Larp1 or TRF2. Taken together, these results suggest +1 C regulatory systems emerged soon after the transition to multicellularity, but the specific evolutionary history remains unclear.
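Usage of the four TCT/TOP hybrid 3-mers identified earlier (CTC, CTT, CCT and CCC) can be summarized for any TSS dataset by tallying reads per 5′ 3-mer among +1 C transcripts. The following sketch assumes the input is a mapping from 5′ sequence to read count. The function names and the toy counts are illustrative and do not come from the datasets analyzed here.

```python
from collections import Counter

# Hybrid 3-mers reported in the text as optimal for both expression and TOP regulation.
HYBRID_3MERS = {"CTC", "CTT", "CCT", "CCC"}


def three_mer_usage(tss_counts):
    """Sum read counts per 5' 3-mer for transcripts initiating with C."""
    usage = Counter()
    for seq, reads in tss_counts.items():
        dna = seq.upper().replace("U", "T")
        if len(dna) >= 3 and dna[0] == "C":
            usage[dna[:3]] += reads
    return usage


def hybrid_fraction(tss_counts):
    """Fraction of +1 C reads that begin with one of the hybrid 3-mers."""
    usage = three_mer_usage(tss_counts)
    total = sum(usage.values())
    if total == 0:
        return 0.0
    return sum(usage[k] for k in HYBRID_3MERS) / total


# Toy counts, invented for illustration only.
toy = {"CTCTTTC": 60, "CTATTTT": 20, "CCTTTCC": 20, "AGTCTCT": 500}

print(three_mer_usage(toy).most_common())
print(hybrid_fraction(toy))
```

Running the same tally on RP gene TSSs from each species would give a single comparable number per species, alongside the ranked 3-mer list.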

Discussion
In this study, we systematically examined the functions of mRNA 5′ terminal sequences, a region uniquely positioned to influence all fundamental phases of the mRNA life cycle. Our results show that these sequences, and +1 nucleotides in particular, define basic mRNA classes with functionally distinct patterns of transcription, translation and stability. We find that mRNAs initiating with AG or CU nucleotides are most efficiently transcribed. A +1 U is universally disfavored, consistent with its infrequency in endogenous mRNAs [38,39]. Post-transcriptionally, +1 A and +1 C mRNAs are similarly well-translated and stable under growth-promoting conditions but differ significantly when growth signals are interrupted. +1 A mRNAs remain well-translated but unstable while +1 C mRNAs are translationally-repressed but stabilized. 5′ sequences beyond the +1 position, including variations in classical TOP motifs, can modulate these post-transcriptional functions. Importantly, the 5′ sequence patterns that are optimized for transcription and post-transcriptional control are present in endogenous mRNAs and broadly conserved across cell types and species.
Previous studies of endogenous mRNAs have found that +1 A and +1 C mRNAs are produced by distinct Inr and TCT promoter elements, respectively [13]. This preference for initial nucleotides is also thought to require distinct configurations of the transcription machinery [18,40]. It was therefore surprising that the CMV, KARS1, and SNHG1 promoters were all capable of efficient transcription of both +1 A and +1 C mRNAs. How do these core promoter sequences recapitulate the specialized features of both Inr and TCT motifs? One possibility is that these promoters can recruit multiple configurations of the transcription machinery. A second possibility is that 5′ AG and CU sequences reflect intrinsic preferences of Pol II for initiating transcription, at least once the transcription machinery has been recruited to a specific location. This hypothesis is consistent with a recent analysis of TSS usage in zebrafish that identified thousands of endogenous promoters yielding mixtures of +1 A/G and +1 C mRNAs [41]. The strong correlation between 5pseq library results and endogenous TSS usage (Fig 3) also suggests that these preferences are common features of promoters throughout the genome.
Although individual 5′ sequences are transcribed with widely different efficiencies, their impact on the overall transcription of mRNAs likely varies with the specific promoter. A previous study of the CMV promoter and two endogenous core promoters (HBB and S100A4) used a high-throughput approach to test the function of each nucleotide on overall transcription output [42]. The most significant regions for each promoter were the TATA box and TSS regions. Of the three promoters considered, transcription from the CMV promoter was most sensitive to changes at the TSS. For instance, changing the TSS from +1 AGA to +1 CGA or TGA significantly decreased overall expression, consistent with our library results. Output from the HBB and S100A4 promoters was much less sensitive to substitutions in the TSS. This may reflect transcription initiation at alternative positions when the preferred TSS is disrupted, similar to what we observe with +1 T versions of the KARS1 and SNHG1 promoters (Fig 2D). Transcription output from promoters that narrowly restrict initiation to specific positions may therefore be most sensitive to specific TSS sequences, while transcription from more permissive promoters is primarily dictated by other features.
Beyond transcription, our results also reveal a striking global relationship between the stability and translation functions of 5′ sequences. Under growth-repressive conditions, +1 A mRNAs are globally well-translated but less stable, while +1 C mRNAs are translationally repressed but stabilized. For classical 5′ TOP sequences, this mechanism likely involves Larp1 [3,43]. Our finding that the same 5′ sequences necessary for translation regulation also impact stability implies that these mechanisms are tightly linked, at least within the context of the library mRNA used here. Unexpectedly, this inverse relationship between translation and stability under growth-repressive conditions extends across all 5pseq library sequences, not just TOP sequences (Fig 5B). Larp1 may generally recognize +1 C mRNAs, but it is unclear why +1 A mRNAs should behave similarly. One possibility is that these mRNAs are bound and stabilized by a Larp1-like protein that preferentially binds +1 A mRNAs, although we are currently unaware of any such protein.
The inverse relationship between the translation and stability of library mRNAs contrasts with the positive correlation observed in some other contexts. For instance, translation initiation rates are positively correlated with mRNA stability in growing yeast, potentially reflecting a competition between translation initiation and decay factors for the mRNA 5′ cap [44]. Translation elongation rates are also positively correlated with stability, such that mRNAs with high frequencies of inefficiently decoded codons are degraded more rapidly [45–47]. Current models suggest that inhibiting mTORC1 should decrease both translation initiation and elongation rates (by inhibiting eIF4F and activating EEF2K, respectively). It might therefore be expected that mRNA stabilities would globally decline, contrasting with our observations. Even so, mTORC1 inhibition in yeast was also found to globally increase mRNA stability and broaden the range, similar to what we find here [48]. It seems plausible that the global changes in the translation machinery that occur during growth-inhibitory conditions could upend the normal relationship between translation and stability that exists in growing cells. Such conditions might also trigger distinct mechanisms that globally alter mRNA stability, such as decapping or deadenylation. Further studies of global changes in mRNA decay dynamics between normal and stress conditions will likely shed light on these questions.
A final question is how the link between RNA translation and stability contributes to cellular function. Under growth-restrictive conditions, the inverse relationship between translation and stability may offer two advantages. Many +1 C mRNAs, which include 5′ TOP mRNAs, encode stable 'housekeeping' proteins that are most needed during phases of cell growth. The simultaneous translation repression and stabilization of these mRNAs allows cells to temporarily (and rapidly) reduce protein production without forfeiting the investment in mRNA synthesis. When permissive conditions return, cells are primed to resume production. Gentilella and colleagues recently proposed a similar "protective" model for Larp1 function that also involves the direct protection of small ribosomal subunits from degradation [49]. Additionally, the translation-stability link may be a mechanism for buffering the quantity of protein that is produced from each mRNA synthesized. In other words, a system that degrades mRNAs only when translated would define mRNA half-lives in terms of protein production rather than time, maintaining total protein production within a narrow range even amidst changing environmental conditions.
In summary, we find that different classes of 5′ sequences are linked to global patterns of transcription, translation and decay. This system provides a means for coordinating the expression of large classes of genes at multiple levels. Moreover, genes often initiate transcription at multiple TSSs, yielding mRNAs with a spectrum of 5′ sequences, including many that produce mixtures of +1 A/G and C/U mRNAs [41]. This may allow cells to fine-tune expression dynamics by producing mixtures of mRNAs with varying stabilities and translation in growth-promoting and inhibitory conditions. Importantly, these properties can also be altered by environmental or cellular cues that trigger small shifts in TSS locations [32].
Overall, the results described here are unlikely to have captured the full regulatory potential of 5′ sequences. First, we excluded +1 G 5′ sequences. Although +1 Gs are common in endogenous mRNAs and globally follow expression patterns that are similar to +1 A mRNAs in CAGE data (Fig 6C), specific +1 G 5′ sequences may possess unique functions that we have missed. For instance, some +1 G mRNAs may be transcribed, translated, or stabilized more efficiently than +1 A, C or U mRNAs. In this case, the analysis described here potentially underestimates the full range of effects that 5′ sequences have on these processes. Second, our study only examined the first 7 nt. This allowed for a deeper sampling of all possible sequences, but excluded longer motifs that might encode functional (e.g. cap-proximal uORFs) or structural features that are significant for endogenous mRNA regulation. Further investigation will be necessary to answer these questions.

Synthesis of 5pseq library
The 5pseq library plasmid was generated by introducing a SalI restriction site 25 nt downstream of the CMV promoter in the pCT3 plasmid (pCT3-TE2), a lentiviral plasmid based on pLJC1 (Addgene #87972) that encodes the mouse Eef2 5′ UTR and coding sequence for Renilla luciferase [9]. A DNA insert encoding a randomized 7 nt sequence adjacent to the CMV transcription start site was prepared using a two-step PCR amplification. First, primers TE117 and TE118 were used to amplify the promoter region of pCT3-TE2, while PCR with primers TE119 and TE120 was used to generate a dsDNA fragment containing the random 7 nt sequences flanked by part of the CMV promoter and part of the 5′ UTR (Table 1). These were then combined in a second PCR reaction to generate a 420 nt fragment containing the entire CMV promoter, TSS, and partially overlapping the 5′ UTR. After clean-up, Gibson Assembly (NEB) was used to insert the dsDNA fragment into pCT3-TE2 that had been digested with NdeI and SalI. The ligated product was then electroporated into Endura™ electro-competent cells (Lucigen #60242-1) in 2 separate reactions, grown in recovery media for 1 h, and then plated on agar plates with ampicillin. Dilutions were plated to estimate colony number. Plasmid was isolated from 17 × 10⁶ colonies on 16 plates by scraping colonies into 50 mL tubes, and then isolating DNA by maxi-prep (Qiagen). To assess library complexity, the TSS/promoter region was amplified by PCR using Illumina-compatible primers (TE127 and TE111) from 5 ng plasmid using Phusion HF polymerase (NEB). Sequencing results were analyzed using custom Python scripts to quantify the frequency of each 7 nt TSS sequence.
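The custom scripts used to assess library complexity are not reproduced here. As a minimal sketch of the counting step, the following assumes that each amplicon read contains a constant sequence immediately upstream of the random 7 nt; the `upstream_anchor` argument and both function names are hypothetical and stand in for whatever flanking sequence the actual analysis used.

```python
from collections import Counter

def count_7mers(reads, upstream_anchor, k=7):
    """Count the k-mer that immediately follows upstream_anchor in each read.

    upstream_anchor is a hypothetical constant sequence assumed to sit
    directly 5' of the randomized positions in the amplicon.
    """
    counts = Counter()
    for read in reads:
        pos = read.find(upstream_anchor)
        if pos == -1:
            continue  # anchor not found; read is skipped
        start = pos + len(upstream_anchor)
        kmer = read[start:start + k]
        # Keep only full-length k-mers made of unambiguous bases
        if len(kmer) == k and set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def library_coverage(counts, k=7):
    """Fraction of all 4**k possible k-mers observed at least once."""
    return len(counts) / 4 ** k
```

For a 7 nt random region there are 4^7 = 16,384 possible sequences, so `library_coverage` reports how many of these were detected in the plasmid pool.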

Viral infection of HeLa cells with 5pseq library
To prepare lentivirus for transducing the 5pseq library, HEK-293T cells were seeded on three 15 cm plates at 13.5 million cells per plate and incubated overnight. The following day, cells on each plate were transfected with a mixture of 10 μg library plasmid, 9 μg psPax2 packaging plasmid (Addgene #12260), and 1 μg VSV-G envelope plasmid (Addgene #8454) using 100 μL PEI in 1 mL serum-free DMEM and incubated at room temperature for 10 min. Transfection mix was added drop-wise to cells. After 24 h, media was replaced with 20 mL fresh DMEM + 10% FBS per plate, and cells were grown for an additional 24 h. To isolate virus, supernatant from cells was collected and centrifuged at 300 × g for 5 min, and then filtered through a 0.45 μm filter to remove cellular debris. For infection, 14.5 mL virus was combined with 4.5 million HeLa cells in 15.5 mL fresh DMEM and 120 μL polybrene (2 mg/mL), and then seeded on a 15 cm plate. After 24 h, media was replaced with 30 mL DMEM + 10% FBS supplemented with 0.4 μg/mL puromycin. After 48 h, media was replaced with fresh DMEM + 10% FBS. In the following days, cells were trypsinized and seeded on plates for subsequent experiments. To analyze library expression, total RNA was extracted using TRIzol and used to prepare Illumina sequencing libraries as described below.

Sequencing library construction
The preparation of sequencing libraries from cells for quantifying 5pseq library expression is similar to preparing CAGE libraries for endogenous mRNAs, whereby a double-stranded splint adapter is ligated to the 3′ end of library cDNA [50]. The first step is to prepare the 3′ adapter. Two different adapter sequences were used, one ending with NNNNNN (TE580) and another ending with GNNNNN (TE581) to accommodate the frequent addition of a non-templated 3′ C during reverse transcription. Solutions of oligos TE580, TE581 and TE582 were prepared at 200 μg/mL in 1 mM Tris-HCl pH 7.5. Oligos TE580 and TE581 were each mixed at 2 μg/mL, separately, with oligo TE582 at 400 ng/mL of each oligo in 100 mM NaCl, denatured at 95°C for 5 min, and then slowly cooled at 0.1°C/s to 11°C to anneal oligos. Annealed TE580/TE582 and TE581/TE582 were then combined at a 1:4 ratio and diluted to a final concentration of 200 ng/mL. Annealed adapters were then aliquoted and stored at -20°C.
To prepare sequencing libraries, total RNA was reverse transcribed using a primer specific for library mRNA (TE121) and the Protoscript II reverse transcriptase (NEB). Following reverse transcription, RNA was hydrolyzed by the addition of 100 mM NaOH, heating to 98°C for 20 min, and then pH neutralization by the addition of 100 mM HCl. cDNA was then cleaned up on silica columns (Zymo DNA Clean and Concentrator 5) and eluted in 6.5 μL of water. cDNA was then denatured at 65°C for 5 min, and then placed on ice for 2 min. 1.5 μL of the adapter mixture (200 ng/mL) was prewarmed at 37°C for 5 min, cooled on ice for 2 min, and then combined with 6 μL cDNA and 15 μL Mighty Mix T4 DNA ligase reaction mix (Takara) and incubated overnight at 16°C. cDNA was column-purified (Zymo DNA Clean and Concentrator 5) and eluted in 25.5 μL water.
PCR amplification of library mRNAs was performed in two steps. For the first step (PCR1), primers CT297 and TE128 were used to amplify library sequences from cDNA in a 50 μL reaction with Phusion polymerase in HF buffer (NEB) for 10 cycles. Amplified DNA was then isolated by column purification (Zymo DNA Clean and Concentrator 5) and eluted in 20 μL water. For the second step (PCR2), to determine the appropriate number of cycles, 2 μL of PCR1 product was used in three 20 μL reactions and amplified for 9, 11, or 13 cycles using oligos CT279 and CT297 with Phusion polymerase and HF reaction buffer (NEB). Each reaction was then analyzed by PAGE on a 12% TBE gel. The expected product is 203 nt. The number of cycles that yielded a single sharp band was used for the final PCR. For the final PCR, 2-8 μL of the initial PCR1 product, 5 μM each of the desired i5 and i7 Illumina dual index amplification primers, dNTPs, 1X HF reaction buffer and Phusion polymerase (NEB) were combined

Quantification of library expression
To quantify the expression level of each 7 nt 5′ sequence, Illumina sequencing results were processed in several steps. First, the total counts of the spike-in mRNA, if used, were quantified using a custom Python script and removed from the FASTQ file. Second, each read was searched for a seed sequence present in the constant region of the library mRNA 5′ UTR (AGCCGCCGCC). Reads containing the seed sequence were processed to extract the first 30 nt of the mRNA sequence, including any non-templated G nts, and alignment position within the library plasmid. The frequencies of each 5′ sequence were then reported. Third, 5′ reads sharing a common "base" 5′ sequence but with varying numbers of non-templated G nts appended to the 5′ end were identified and grouped together. Non-templated Gs were identified according to mismatches with the promoter region of the plasmid sequence. The counts for 5′ sequences within each group were summed to determine a final count for each common base sequence. We note that this can only be determined for reads where the TSS begins with a non-G nt or aligns upstream of the +1 position of the 7 random nt sequence. The reason is that the TSS of any read initiating with a G that aligns within the random 7 nt sequence cannot be definitively distinguished from a read initiating downstream that has been extended by non-templated Gs. For example, a sequencing result of 5′-GAACCTT could reflect an mRNA produced with that 5′ sequence from the +1 position of the promoter or, alternatively, an mRNA originally initiating with 5′-AACCTT from the +2 position with an additional 'G' appended as an artifact of reverse transcription. Because we could not reliably quantify the true counts of +1 G 7mers, we considered only reads that definitively initiated at the +1 position of the random 7 nt sequence with a non-G nt.
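The grouping and filtering logic described above can be illustrated with a simplified sketch. It is not the script used in this study: for brevity it assumes that a constant `anchor` sequence lies immediately 3′ of the random 7 nt (the actual seed sits at an unspecified position within the 5′ UTR constant region), and it treats any run of Gs upstream of the 7-mer as non-templated rather than checking mismatches against the plasmid. The function name is hypothetical.

```python
from collections import Counter

def collapse_nontemplated_g(reads, anchor, k=7):
    """Sum read counts per +1 k-mer, merging reads that differ only by
    extra 5' Gs, and skipping reads whose start site is ambiguous.

    Assumption (not from the source): `anchor` directly follows the
    random k nt, so the k-mer is the k nt immediately 5' of the anchor.
    """
    counts = Counter()
    for read in reads:
        pos = read.find(anchor)
        if pos < k:
            continue  # anchor missing, or fewer than k nt upstream of it
        kmer = read[pos - k:pos]
        prefix = read[:pos - k]
        if kmer[0] == "G":
            # A leading G could be templated (+1 G) or added during
            # reverse transcription to a downstream-initiating mRNA.
            continue
        if prefix.strip("G"):
            continue  # non-G bases upstream: initiation upstream of +1
        counts[kmer] += 1
    return counts
```

With this scheme, reads beginning AACCTTA, GAACCTTA and GGAACCTTA (each followed by the anchor) are pooled under AACCTTA, whereas a read beginning GAACCTT is discarded because its +1 position cannot be assigned.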

Analysis of 5pseq library translation
To measure translation rates of 5pseq library mRNAs, 13 million HeLa cells expressing the library were seeded on each of four 15 cm plates in DMEM supplemented with 10% IFS and antibiotics and incubated overnight. Cells were then treated with vehicle (DMSO) or 250 nM Torin 1 for 2 h, washed 3 times in cold PBS supplemented with 100 μg/mL cycloheximide, and then lysed in 1 mL polysome lysis buffer (20 mM Tris-HCl pH 7.4, 150 mM NaCl, 5 mM MgCl₂, 1 mM DTT, 100 μg/mL cycloheximide, 1% Triton-X100). Cells were incubated for 5 min on ice, and then centrifuged 5 min at 14,000 rpm in a benchtop centrifuge to remove insoluble material. At this point, as a control, extracts from cells expressing a single classical TOP and non-TOP reporter mRNA were added to library extracts. 300 μL of extract was then layered on top of a 5-50% sucrose gradient (20 mM Tris-HCl pH 7.4, 150 mM NaCl, 5 mM MgCl₂, 1 mM DTT, 100 μg/mL cycloheximide, 5 or 50% sucrose) prepared using a Biocomp Gradient Station, and centrifuged at 36,000 rpm for 1.5 h in a SW41-TI rotor. Each gradient was then fractionated using a Biocomp Gradient Station with constant monitoring at 254 nm and separated into sub-polysome and polysome fractions. Fractions were supplemented with 0.5% SDS. Volumes were adjusted to 5.5 mL with water, then 10 μL capped spike-in (50 fg/μL) was added to each fraction, followed by digestion with 55 μL of proteinase K (NEB, 20 mg/mL) for 30 min at 50°C. RNA was extracted with acid phenol, cleaned up with chloroform, and then precipitated with NaOAc and isopropanol. RNA was resuspended in 11 μL water, of which 10 μL was used for library construction.

Analysis of 5pseq library stability
To measure 5pseq library mRNA stability, 10 million library-expressing HeLa cells were seeded in each of eight 15 cm plates and incubated overnight. Cells were treated with vehicle (DMSO) or 250 nM Torin 1 for 2 h, and then treated with 2 μg/mL Actinomycin D for an additional 2 h or processed immediately. To extract RNA, cells were washed once in cold PBS, and then lysed in 2 mL TRIzol containing 0.75 fg/μL capped spike-in. RNA was isolated according to the manufacturer's instructions, resuspended in 21 μL water, and quantified by UV absorbance. Of each sample, 10 μL was used to prepare Illumina-compatible libraries as described in the Analysis of 5pseq Library Expression section.

Analysis of CAGE data
CAGE data was obtained from the sources listed in Table 1. To determine TSS frequencies, each dataset ( NCBI Refseq for sacCer3, Arabidopsis thaliana: TAIR10 genes) using SAMtools. The frequencies of all 7 nt TSS sequences in filtered reads were then determined using custom Python scripts. Ribosomal protein (RP) gene promoters for each species were identified based on gene name and manual curation. As with the transcriptome-wide TSS analysis, CAGE reads aligning to 1000 nt windows centered on the annotated TSSs for these transcripts were extracted using SAMtools and analyzed using custom Python scripts to quantify frequencies of 5′ 3-mers.

Translation and expression reporter assay
The indicated 5′ sequences were inserted into the library plasmid using Gibson assembly. HEK-293T cells were transfected with 100 ng pIS0 (Addgene #12178, encoding firefly luciferase), 100 ng of the Renilla reporter and 800 ng of empty vector (1 μg total plasmid DNA) using XtremeGENE 9. After 24 h, cells were divided in 12-well plates at 0.3 million cells/well and incubated for an additional 24 h. For translation assays, cells were treated as indicated, and analyzed using the Promega Dual-Luciferase Reporter Assay System according to the manufacturer's instructions. To measure expression, RNA was extracted using TRIzol, reverse transcribed using Protoscript II (NEB), and quantified using qPCR with primers for Renilla luciferase (forward: TCATGGCCTCGTGAAATCCCGT, reverse: GCATTGGAAAAGAATCCTGGGTCCG) and firefly luciferase (forward: GAGGCGAACTGTGTGTGAGA, reverse: GAGCCACCTGATAGCCTTTG). Levels of Renilla luciferase were normalized to levels of firefly luciferase using the ΔΔCt method [52].
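For reference, the ΔΔCt calculation can be written out as a short sketch. The function and argument names are illustrative only, and the formula assumes approximately 100% amplification efficiency for both primer pairs (a doubling of product per cycle), which is the standard assumption of the method rather than something verified here.

```python
def relative_expression(ct_target, ct_ref, ct_target_ctrl, ct_ref_ctrl):
    """Fold change of a target transcript by the delta-delta-Ct method.

    ct_target, ct_ref: Ct values for the target (e.g. Renilla) and the
        reference (e.g. firefly) in the sample of interest.
    ct_target_ctrl, ct_ref_ctrl: the same two values in the control sample.
    Assumes product doubles every cycle for both amplicons.
    """
    delta_ct_sample = ct_target - ct_ref
    delta_ct_control = ct_target_ctrl - ct_ref_ctrl
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)
```

For example, a sample with ΔCt = 2 compared against a control with ΔCt = 4 gives ΔΔCt = -2, corresponding to a 4-fold higher normalized level of the target.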

Design and analysis of transcription from KARS1 and SNHG1 promoters
The core promoter sequences for the human KARS1 and SNHG1 genes (chr11:62,855,864-62,855,926 and chr11:62,855,865-62,856,083, respectively) were identified in a previously reported library of core promoter sequences [53] and cloned into the pCT3-TE2 library vector using the NdeI and SalI restriction sites, replacing the CMV promoter region and positioning the expected transcription start site (based on CAGE analysis of the endogenous promoter) 97 nucleotides upstream of the start codon. This results in expression of an mRNA encoding ~21 nucleotides of the endogenous 5′ UTR followed by 76 nucleotides of the library vector. Versions of each construct encoding a +1 A, C or T at the expected +1 TSS position were produced. To map transcription start sites, HEK-293T cells were transiently transfected with 1 μg of each plasmid and incubated overnight. RNA-seq libraries for TSS analysis were prepared and sequenced as described above. Reads were then aligned to plasmid sequences using the bowtie2 short-read aligner to map transcription start sites [54]. Reads were soft-clipped during alignment, such that non-templated 5′ G nucleotides were removed from aligned reads. TSS plots of endogenous promoters from HeLa cells were obtained from analysis of previously published CAGE data from the FANTOM5 project, as described above [27].

Measurement of reporter mRNA stability using doxycycline-repressible constructs
The indicated library mRNA sequences were inserted into a vector encoding a doxycycline-repressible version of the CMV promoter (pCW-TTA, derived from pCW57.1, Addgene #41393) using Gibson Assembly (New England Biolabs). Transcription initiation at the expected TSS, which is identical to the preferred TSS of the constitutive CMV promoter used in the 5pseq library, was confirmed by 5′ RACE. HEK-293T cells were transiently transfected with 100 ng of each vector, 100 ng of pIS0 (Addgene #12178) encoding firefly luciferase, and 800 ng empty vector, incubated overnight, seeded in 12-well plates, and then incubated overnight again. Cells were then treated with vehicle (DMSO) or 250 nM Torin 1 for 30 min, and then treated with 1 μg/mL doxycycline for 0, 3 or 6 h. RNA was isolated from cells at each time point, reverse transcribed (Protoscript II) and analyzed by qPCR for levels of Renilla luciferase and GAPDH (forward: TTCTTTTGCGTCGCCAGCCGA, reverse: ACCAGGCGCCCAATACGACCA). Levels of Renilla luciferase were normalized to levels of GAPDH using the ΔΔCt method [52].
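The text above does not specify how decay rates were derived from the 0, 3 and 6 h measurements. One common approach, shown here only as an illustrative sketch, is to assume first-order decay after transcription shut-off and fit a straight line to the natural log of the normalized reporter level over time; the half-life is then ln(2) divided by the decay constant. The function name and inputs are hypothetical.

```python
import math

def half_life(times_h, rel_levels):
    """Estimate mRNA half-life (h) from a shut-off time course.

    Assumes first-order decay, N(t) = N0 * exp(-k * t), and fits
    ln(level) versus time by ordinary least squares.
    rel_levels must be positive normalized mRNA levels.
    """
    ys = [math.log(v) for v in rel_levels]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, ys))
    slope = sxy / sxx  # equals -k for a decaying transcript
    if slope >= 0:
        return math.inf  # no measurable decay over the time course
    return math.log(2) / -slope
```

A reporter whose normalized level falls from 1.0 to 0.5 to 0.25 at 0, 3 and 6 h would yield a half-life of 3 h under this model.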