Nuclear export is a limiting factor in eukaryotic mRNA metabolism

The eukaryotic mRNA life cycle includes transcription, nuclear mRNA export and degradation. To quantify all these processes simultaneously, we perform thiol-linked alkylation after metabolic labeling of RNA with 4-thiouridine (4sU), followed by sequencing of RNA (SLAM-seq) in the nuclear and cytosolic compartments of human cancer cells. We develop a model that reliably quantifies mRNA-specific synthesis, nuclear export, and nuclear and cytosolic degradation rates on a genome-wide scale. We find that nuclear degradation of polyadenylated mRNA is negligible and nuclear mRNA export is slow, while cytosolic mRNA degradation is comparatively fast. Consequently, an mRNA molecule generally spends most of its life in the nucleus. We also observe large differences in the nuclear export rates of different 3’UTR transcript isoforms. Furthermore, we identify genes whose expression is abruptly induced upon metabolic labeling. These transcripts are exported substantially faster than average mRNAs, suggesting the existence of alternative export pathways. Our results highlight nuclear mRNA export as a limiting factor in mRNA metabolism and gene regulation.

The conversion efficiency of 4sU was evaluated by absorption spectra before and after treatment with IAA.

Basic read alignment
Reads were mapped to human reference genome hg19 with slamdunk [1] (version 0. Mapping statistics of the control data sets, shown for both control samples and the nuclear and cytosolic fraction.Uniquely mapped reads: reads with exactly one alignment (bowtie2 and hisat2), or reads with exactly one reliable alignment (slamdunk).Ambiguous reads: reads with more than one alignment (bowtie2 and hisat2), or reads with more than one reliable alignment or with only unreliable alignments (slamdunk).Unmapped reads: reads with no alignment (slamdunk, bowtie2, hisat2).Mapping statistics for both time series and the nuclear and cytosolic fraction.Uniquely mapped reads: reads with exactly one reliable alignment.Ambiguous reads: reads with more than one reliable alignment or with only unreliable alignments.Unmapped reads: reads with no alignment.

Assignment of reads to 3'UTRs and annotation of 3'UTRs
The BED file defining hg19 annotated 3'UTRs was downloaded from UCSC [4].Overlapping 3'UTRs were merged, considering strand orientation.Uniquely mapped reads were assigned to a specific 3'UTR if their alignment overlapped with the UTR with at least one nucleotide position.Mapping orientation was taken into account during read assignment, as reads derived from (-) strand genomic regions map forwards and reads derived from (+) strand genomic regions map reversely as a consequence of the experimental setup To annotate 3'UTRs with their corresponding genes, the hg19 genome annotation was downloaded from NCBI RefSeq [5] (S4

Peak calling
To identify 3' transcript isoforms and transcripts with internal priming sites, we searched the genome for positions with a particularly high coverage of 3' ends of mapped reads ("peaks", to be defined below).By assigning only count to the 3' end position of a mapped read, we generated coverage profiles for the nuclear and cytosolic fractions of every measurement time point and each of the two replicate time series.Due biological variation and in the priming of the reverse transcription, the 3' poly-A site of a transcript might not be determined up to single nucleotide resolution.To obtain an (unbiased) idea on how spatially constrained such poly-A sites are in practice, we first binarized the coverage profiles in a ±1000bp window around each genomic position (0 = no count, 1 = at least one count at a given nucleotide position) and then averaged these profiles across all genomic positions that were covered at least once (Fig F and Fig G).The shape of these profiles indicated that counts typically aggregate in a ±50bp region around a central position, the closer the more.We therefore decided to define peak centers as the local coverage maxima of the genome-wide coverage profile that has been convolved with a triangular smoothing kernel of with ±50 (i.e., the function

Estimation of the labeling efficiency and new/total ratios
Recall the notation from the Methods.Fix a 3'UTR respectively a peak region onto which J reads were mapped.Given a read j = 1, ..., J, let T j be its number of T-positions in the genomic sequence of the read alignment, and let o j ∈ 0, 1, ..., T j be the number of positions at which we observe T>C transitions.Let h j ∈ {0, 1} be a hidden variable indicating whether read j originates from a pre-existing RNA (h j = 0), or a newly synthesized RNA (h j = 1).Let ρ ∈ [0, 1] be the proportion of newly synthesized RNAs in the RNA population from which read j was drawn.Let ∈ [0, 1] be the labeling efficiency, i.e., the probability by which a T>C conversion happens in a newly synthesized RNA.Let p 0 ∈ [0, 1] be the probability of a T>C sequencing error.Let ∈ [0, 1] be the probability of a C>non-C sequencing error, and denote by the probability of observing a T>C conversion in a newly synthesized read.Denote the parameters by Θ = (ρ, , p 0 , ).
To estimate sequencing error rates p 0 and , we calculated the single-nucleotide mismatch rates for the nuclear and the cytosolic compartment in the control samples.The T>C probability p 0 was estimated as the average of the T>C respectively A>G exchange rate of reads mapping to the (+) respectivley (-) strand.
The C>non-C probability was calculated as the average of the according, strand-specific base exchange rates.
From Equations ( 6) and ( 7), we have We employ an EM algorithm to estimate the ratio of newly synthesized RNAs ρ and the labeling efficientcy for each 3'UTR separately for each compartment and time point in each time series.To that end, we initiliased the labeliung efficiency l with p 0 and the new total RNA ratio ρwith the observed labeled total RNA ratio.

M-step.
Taking the partial derivative of Q with respect to ρ and equating this expression to zero yields Equating the partial derivative of Q with respect to to zero yields Solving this for p 1 (assuming 1 − − p 0 = 0) and then solving Equation (1) for yields  Each distribution represents one time point for the respective time series and compartment, with darker shades of a color encoding earlier and brighter shades encoding later time points.Efficiencies were estimated for all 3'UTRs with a coverage of at least 100 in the respective measurement.The black dots and text labels indicate the median labeling efficiencies obtained after discarding all 3'UTRs with an efficiency estimate smaller than 0.01 within a measurement.Text labels are ordered by increasing measurement time points.The x-axis is trimmed for better display.

Modeling of the labeling efficiency in the nucleus and the detection gap
General considerations on the labeling efficiency.The parameters of our model, such as RNA synthesis or RNA degradation, were so far assumed to be constant.While we have chosen minimally invasive procedures to keep these parameters unchanged, the assumption of constant labeling efficiency can be problematic, as discussed below.To understand this, remember that the time variable t, which appears in our dynamic equations, denotes the time after the start of 4sU labeling.When we say that a transcript has been synthesized at time t, we mean that this transcript has been polyadenylated and can be detected with our experimental method at time t.Let l(t) be the labeling efficiency at time t, i.e., the probability that a 4sU nucleotide instead of uridine is included in a nascent transcript at time t.Because it may take a while for the 4sU molecules to diffuse into the nucleus, we expect the labeling efficiency to increase monotonically from l(0) = 0 to a certain limit.
First, we verify the assumption implicit in our model that labeling efficiency is the same for each uridine position of the same transcript.The length of a primary human gene transcript (including introns) ranges between 0.5kb and 1, 000kb [6].As the elongation speed of RNA Polymerase II ranges between 0.5kb/min and 5kb/min [7], the synthesis of a primary transcript can take up to a few hours.The onset of transcription may therefore even precede the start of 4sU labeling in long genes, which seems inconsistent with the assumption of constant labeling efficiency.However, this apparent contradiction resolves if we considered that we are sequencing only the 3' ends of mRNAs, i.e., the last ˜80nt upstream of the poly(A) tail of a transcript.All considerations on labeling efficiency refer only to this 3' end, which is synthesized by RNA polymerase II within seconds.It therefore seems justified to assume that the labeling efficiency during 3'-end synthesis of a transcript is constant and equals l(t).Note that we still have neglected the time it takes until the poly(A) tail has been added after transcription termination.As a consequence, we will observe all processes described above with a delay.We call this delay the detection gap ∆, which we expect to be in the range of seconds to minutes.
Estimation of the labeling efficiency kinetics.Denote by RNA new x (t) the amount of (mature) RNA of transcript x in the nucleus that has been newly synthesized (i.e., is detecteble by RNA-seq) within t minutes after the start of 4sU labeling.Let RNA lab x (t) the amount of newly synthesized RNA that can be detected as labeled (by at least one T>C transition) within t minutes after labeling start.Our next goal is to estimate the fraction f x (t) = RNA lab x (t)/RNA new x (t) of newly synthesized RNA of a transcript x at time t which is recognized as labeled.This fraction will be used later as a correction factor in the estimation process.We assume that the labeling efficiency in the nucleus follows a delayed saturation kinetics, Here, max is the limit 4sU incorporation rate, γ models the diffusion speed of 4sU inside the cell, and ∆ is the detection gap.Let u x > 1 be the number of Uridines in the sequenced part of transcript x.The lengths of the sequenced reads are very similar, so without loss we may assume this number to be constant for each transcript of type x.Then, the probability of a transcript x that has been newly synthesized exactly at time t to be recognized as labeled is Assuming a constant synthesis rate µ x and a constant export+decay rate λ x for x, the ordinary differential equations for the nuclear RNA become They have the solutions Thus, the fraction f x (t) of newly synthesized RNAs that are detected as labeled/converted is It remains to estimate s max and γ.If t is the time after labeling start at which a transcript x has been synthesized, the distribution of t in the population of newly synthesized transcripts x at time t after labeling is We conclude that the average labeling efficiency measured for a single uridine in a newly synthesized transcript x at measurement time t is We have assessed how e x (t; λ x , γ, max ) varies with transcript half lives, respectively their degradation rate λ x (Fig Fig J).We found that for fixed time points t, e x is approximately independent of λ x , for λ x corresponding to transcript half lives above ∼ 100min (i.e., relative changes are small).The reason is that for sufficiently small λ x , A(t , t, λ x ) is approximately constant for all t ∈ (0, t] except for very small values of t .We therefore define the (degradation-rate independent) labeling efficiency ē(t; γ, max , ∆) := and keep in mind that ē(t; γ, max , ∆) ≈ e x (t; λ x , γ, max , ∆) for λ x small enough, say for a transcript half life greater than 100min.
In a first round of our estimation procedure, we assume that labeling efficiencies do not depend on the degradation rate of a transcript.Using our estimation procedure (S1 Appendix Section 5), we obtain preliminary labeling efficiency estimates êx (t) for each measurement time t and transcript x, together with preliminary half life estimates for each transcript x.We found that 98% of all transcripts have half life estimates above 100min.Since the median is a robust measure of location for skewed distributions, we use the median of all êx (t) for transcripts with a preliminary half life estimate >100min as an estimate of ē(t; γ, max ): ê(t) := median (ê x (t); prelim.half lifeλ x > 100min) (18) We additionally account for the detection gap by introducing a time shift of ∆ for our observations.The estimates of γ, s max and ∆ are then obtained by a least squares fit of ê(t) ∼ ē(t; γ, max , ∆) , t ∈ {15, 30, 45, 60, 90, 120, 180min} This yields (see Fig I) for replicate time series 1 (replicate time series 2).Contrary to our intuition, the estimated value for the detection gap was negative.We still keep our model, since it represents an excellent fit of labeling efficiency time curves.Labeling efficiency per nucleotide (nucleus, Replicate 1) x Efficiency [%] q q q q q q q max.efficiency = 7.8% diffusion half rate = 0. Labeling efficiency per nucleotide (nucleus, Replicate 2) x Efficiency [%] q q q q q q q max.efficiency = 7.9% diffusion half life = 0.056 min maturation gap = −18 min q q q q q q q 0 50 100 150 0 2 4 6 8

Labeling efficiency per nucleotide (cytosol)
Labeling time

Efficiency [%]
q q q q q q q Replicate 1 Replicate 2

Fig I.
Per nucleotide labeling efficiencies.The dots (red/dark red: replicate 1/replicate 2) show the median values for ê(t) for each measurement time point (top: nucleus, bottom: cytosol).The blue lines are the least squares fits for the parameters γ, max and ∆ in both replicate time series.The parameters of the labeling efficiency function ē(t − ∆; γ, max ) (Equation ( 16)) were fitted according to Equation (19).Per nucleotide conversion efficiency as a function of transcript half life q q q q q q q 45 min Per nucleotide conversion efficiency as a function of transcript half life q q q q q q q 116 min Per nucleotide labeling+conversion efficiency (e(t; λ x , γ, max , g), see Equation ( 15)) as a function of transcript halflife (log(2)/λ x , x-axis).This was done for a diffusion rate γ coresponding to a half live (log 2)/γ of 5min (left plot) and a half life of 15min (right plot), which is considered a realistic upper bound for γ.The maximum labeling efficiency was set to max = 7% (dashed gray horizontal lines).Each colored curve corresoponds to a certain labeling time t = 15, 30, ..., 180min (see the inset).The dashed colored horizontal lines indicate the corresponding values of ē(t; γ, max ).The colored dots on the curves indicate the half lives above which Approximation (17) has an error smaller than 5%.The dashed gray vertial line at 45min (left) respectively 116min (right) indicates the transcript half life above which Approximation (17) has an error smaller than 5% for all measurement time points.

Labeling
Necessity of modeling time-dependency of the labeling efficiency.So far, half life estimation procedures based on SLAM-Seq assumed a labeling efficiency that is constant over time [8].This means replacing the above labeling efficiency estimate f x (t) by an approximation g x (t) assuming constant per-uridine labeling efficiency ēx (t): We have assessed whether this assumption causes a relevant bias, or whether this simplification is justified.
Starting with the equation we replace f x (t) by g x (t) in the estimation process.This amounts to letting If we merely use the observation at measurement time point t, the biased estimate λx,t of λ x becomes In a post-hoc analysis, we fix the values γ and ŝmax to the values obtained from replicate time series 1, see Equations (20,20).We then calculated the the relative bias λx,t /λ x for all our measurement time points t, for the whole range of transcript half lives λ x , and for various uridine contents of the reads (Fig K ).The relative bias is generally negative for short half lives (except for very high uridine contents and very short labling times).The longer the labeling times, the more the bias is shifted towards negative values.Above a half life of ˜100min, the bias is asymptotically constant as a function of half life, and it ranges in the interval [−7%, 7%], with longer labeling times resulting in smaller biases.In conclusion, the estimation bias relatively small for the setting in our experiment.We therefore do correct for the time-dependency of the labeling efficiency, but we generally advise to perform an assessment of this bias in future SLAM-Seq experiments.

Effects of variance stabilization
We thoroughly assessed the effects of variance stabilization of the estimates new total RNA ratios as shown in Fig L. Variance was calculated as 1 2 (ρ 1 − ρ 2 ) 2 , where ρ i are the estimated new/total ratios from the two replicate time series, for all time points and all transcripts.Right: Same as left, with ρ i replaced by √ 4N i arcsin √ ρ i .

T>C conversion and mismatch statistics
Conversions originating from the labeling are expected to show as A>G on the (-) and T>C on the (+) strand.Expectedly and reassuringly, the labeling conversion rates clearly show an increasing trend over time, while the other rates stay constant and low.Also, the cytosolic rates increase less and slower than the nuclear rates, as the export of newly synthesized, labeled RNA transcripts from the nucleus to the cytosol requires time.No conversion correction: mismatch rates as observed in the primary alignment.SNP-sites correction: mismatch rates as observed in the primary alignment when excluding all nucleotide positions annotated as known SNP site.SNP and edit-site correction: conversions as observed in the primary alignment when excluding all nucleotide positions annotated as known SNP site or tagged as potential editing sites.A>G (T>C) conversions: A>G (T>C) conversions as observed in the read alignment.Other conversions: average of all non-A>G and non-T>C conversions observed in the read alignment.Conversions of (-) strand ((+) strand) 3'UTRs: conversions of all uniquely mapped reads which are assigned to an annotated (-) strand ((+) strand) 3'UTR.Conversions of other regions on the (-) strand ((+) strand): conversions of all uniquely mapped reads which are assigned to a non-3'UTR peak on the (-) strand ((+) strand) with a minimum coverage of 5 in the sample and compartment.Conversion statistics of the SLAM-seq labeling samples, shown for the reads assigned to non-3'UTR peaks.For each measurement, peaks with a coverage of less than 5 in that measurement were excluded from the calculations.Statistics are corrected for both annotated SNP sites and potential editing sites.A>G (T>C) conversions: A>G (T>C) conversions as observed in the read alignment.Other conversions: average of all non-A>G and non-T>C conversions observed in the read alignment.Conversions of the (-) strand ((+) strand): conversions of all uniquely mapped reads which are assigned to an annotated (-) strand ((+) strand) non-3'UTR peak.
Fig A. Absorption spectra of 4-thiouracil (4TU) and 4-thiouridine (4SU) before and after treatment with iodoacetamide (IAA) showing the shift in the absorption maxima of reactant and product.Data represent mean absorbance (line) ± s.d.(ribbon) of independent experiments (n = 3) comparing the effect of temperature on conversion efficiency in presence of 1 mM DTT.
Fig C.Mapping statistics of the control data sets, shown for both control samples and the nuclear and cytosolic fraction.Uniquely mapped reads: reads with exactly one alignment (bowtie2 and hisat2), or reads with exactly one reliable alignment (slamdunk).Ambiguous reads: reads with more than one alignment (bowtie2 and hisat2), or reads with more than one reliable alignment or with only unreliable alignments (slamdunk).Unmapped reads: reads with no alignment (slamdunk, bowtie2, hisat2).
Fig D.Mapping statistics for both time series and the nuclear and cytosolic fraction.Uniquely mapped reads: reads with exactly one reliable alignment.Ambiguous reads: reads with more than one reliable alignment or with only unreliable alignments.Unmapped reads: reads with no alignment.
Fig E. Assignment of uniquely mapped reads of the SLAM-seq labeling samples to hg19 annotated 3'UTRs.Overlapping 3'UTRs were previously merged considering strand orientation.(A) Absolute read counts.(B) Relative read counts.
Fig G).Peaks or peak regions are then the regions spanning -50bp to +50bp around a peak center.We identified a total of 98,102 peaks, 15,125 of which were measured with at least one count in each time series, time point and compartment, and 1,262,263 peaks for the reads not assigned to annotated 3'UTRs, 4481 of which were measured with at least one count in each time series, time point and compartment.
Fig F.Relative coverage profiles.Profiles were created separately for the reads assigned to annotated 3'UTRs (UTR-reads) and reads not assigned to such (non-UTR reads) as well as the nuclear and cytosolic fraction.The combined profile is based on the coverage of the nuclear and cyosolic fraction added up.

Fig H .
Fig H.Distribution of labeling efficiency estimates , calculated for each 3'UTR and each compartment, time point and time series.Each distribution represents one time point for the respective time series and compartment, with darker shades of a color encoding earlier and brighter shades encoding later time points.Efficiencies were estimated for all 3'UTRs with a coverage of at least 100 in the respective measurement.The black dots and text labels indicate the median labeling efficiencies obtained after discarding all 3'UTRs with an efficiency estimate smaller than 0.01 within a measurement.Text labels are ordered by increasing measurement time points.The x-axis is trimmed for better display.
[min]   4sU detected / U newly synthesized[%] [min]   4sU detected / U newly synthesized[%] Fig J.Per nucleotide labeling+conversion efficiency (e(t; λ x , γ, max , g), see Equation (15)) as a function of transcript halflife (log(2)/λ x , x-axis).This was done for a diffusion rate γ coresponding to a half live (log 2)/γ of 5min (left plot) and a half life of 15min (right plot), which is considered a realistic upper bound for γ.The maximum labeling efficiency was set to max = 7% (dashed gray horizontal lines).Each colored curve corresoponds to a certain labeling time t = 15, 30, ..., 180min (see the inset).The dashed colored horizontal lines indicate the corresponding values of ē(t; γ, max ).The colored dots on the curves indicate the half lives above which Approximation (17) has an error smaller than 5%.The dashed gray vertial line at 45min (left) respectively 116min (right) indicates the transcript half life above which Approximation (17) has an error smaller than 5% for all measurement time points.

Fig K .
Fig K. Assessment of half life estimation bias when assuming constant labeling efficiency.Each line shows the relative bias (in %) as a function of the (true) transcript half life.Line colors correspond to different labeling times (see insets).The caclulations were done for different uridine contents of a transcript: 5 (top left), 10 (top right), 20 (bottom left), 40 (bottom right) uridines per sequenced read.The time dependency of the labeling efficiency was modeled by Equation (9) with parameters γ = 0.069min −1 and max = 7.8% estimated from replicate time series 1.
Fig L.Impact of variance stabilizing transformation.Left: Variance of new/total mRNA ratios (y-axis) as a function of the mean (x-axis, binned in steps of 0.1), for the nuclear (red) and cytosolic (blue) fraction.Variance was calculated as1  2 (ρ 1 − ρ 2 ) 2 , where ρ i are the estimated new/total ratios from the two replicate time series, for all time points and all transcripts.Right: Same as left, with ρ i replaced by √ 4N i arcsin √ ρ i .
Fig M.Conversion statistics of the control samples.No conversion correction: mismatch rates as observed in the primary alignment.SNP-sites correction: mismatch rates as observed in the primary alignment when excluding all nucleotide positions annotated as known SNP site.SNP and edit-site correction: conversions as observed in the primary alignment when excluding all nucleotide positions annotated as known SNP site or tagged as potential editing sites.A>G (T>C) conversions: A>G (T>C) conversions as observed in the read alignment.Other conversions: average of all non-A>G and non-T>C conversions observed in the read alignment.Conversions of (-) strand ((+) strand) 3'UTRs: conversions of all uniquely mapped reads which are assigned to an annotated (-) strand ((+) strand) 3'UTR.Conversions of other regions on the (-) strand ((+) strand): conversions of all uniquely mapped reads which are assigned to a non-3'UTR peak on the (-) strand ((+) strand) with a minimum coverage of 5 in the sample and compartment.
Fig N. Conversion statistics of the SLAM-seq labeling samples, shown for the reads assigned to annotated 3'UTRs.Statistics are corrected for both annotated SNP sites and potential editing sites.A>G (T>C) conversions: A>G (T>C) conversions as observed in the read alignment.Other conversions: average of all non-A>G and non-T>C conversions observed in the read alignment.Conversions of the EM algorithm(-) strand ((+) strand): conversions of all uniquely mapped reads which are assigned to an annotated (-) strand ((+) strand) 3'UTR.
Fig O.Conversion statistics of the SLAM-seq labeling samples, shown for the reads assigned to non-3'UTR peaks.For each measurement, peaks with a coverage of less than 5 in that measurement were excluded from the calculations.Statistics are corrected for both annotated SNP sites and potential editing sites.A>G (T>C) conversions: A>G (T>C) conversions as observed in the read alignment.Other conversions: average of all non-A>G and non-T>C conversions observed in the read alignment.Conversions of the (-) strand ((+) strand): conversions of all uniquely mapped reads which are assigned to an annotated (-) strand ((+) strand) non-3'UTR peak.