Browse Subject Areas

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Nonsense-Mediated mRNA Decay Immunity Can Help Identify Human Polycistronic Transcripts

  • Guy Shahaf,

    Affiliation Bioinformatics Program, School of Computer Science, The Academic College of Tel Aviv-Yaffo, Tel Aviv, Israel

  • Dorit Shweiki

    Affiliation Bioinformatics Program, School of Computer Science, The Academic College of Tel Aviv-Yaffo, Tel Aviv, Israel

Nonsense-Mediated mRNA Decay Immunity Can Help Identify Human Polycistronic Transcripts

  • Guy Shahaf, 
  • Dorit Shweiki


Eukaryotic polycistronic transcription units are rare and only a few examples are known, mostly being the outcome of serendipitous discovery. We claim that nonsense-mediated mRNA decay (NMD) immune structure is a common characteristic of polycistronic transcripts, and that this immunity is an emergent property derived from all functional CDSs. The human RefSeq transcriptome was computationally screened for transcripts capable of eliciting NMD, and which contain an additional ORF(s) potentially capable of rescuing the transcript from NMD. Transcripts were further analyzed implementing domain-based strategies in order to estimate the potential of the candidate ORF to encode a functional protein. Consequently, we predict the existence of forty nine novel polycistronic transcripts.

Experimental verification was carried out utilizing two different types of analyses. First, five Gene Expression Omnibus (GEO) datasets from published NMD-inhibition studies were used, aiming to explore whether a given mRNA is indeed insensitive to NMD. All known bicistronic transcripts and eleven out of the twelve predicted genes that were analyzed, displayed NMD insensitivity using various NMD inhibitors. For three genes, a mixed expression pattern was observed presenting both NMD sensitivity and insensitivity in different cell types. Second, we used published global translation initiation sequencing data from HEK293 cells to verify the existence of translation initiation sites in our predicted polycistronic genes. In five of our genes, the predicted rescuing uORFs are indeed identified as translation initiation sites, and in two additional genes, one of two predicted rescuing uORF is verified. These results validate our computational analysis and reinforce the possibility that NMD-immune architecture is a parameter by which polycistronic genes can be identified. Moreover, we present evidence for NMD-mediated regulation controlling the production of one or more proteins encoded in the polycistronic transcript.


The vast majority of eukaryotic genes are considered monocistronic with a single transcription unit encoding for a single protein (alternatively-spliced variants included). Polycistronic transcription units (no trans-splicing involved; i.e., "eukaryotic operon") are rare in eukaryotes and specifically in mammals, and therefore little is known on how they differ from the monocistronic ones.

Genomically organized polycistronic units are known in several organisms (e.g., nematodes, Arabidopsis thaliana) yet those are trans-spliced and each monocistronic unit is translated independently [1]. Further, episodic occurrences of eukaryotic bicistronic transcripts, which do not undergo trans-splicing are documented (including STNA-STNB in Drosophila; GK-GPR in tomato and mammalian GDF-1-LASS1, SNRPN-SNURF, MTPN-LUZP6 and MFRP- C1QTNF5) [1], [2], [3], [4], [5].

Newly synthesized mRNAs are subjected to a pioneer round of translation in which premature termination codon (PTC) containing transcripts are identified and degraded in various degrees of efficiency via the Nonsense-mediated mRNA decay (NMD) mechanism [6], [7]. In mammals, NMD onset is primarily associated with the identification of un-removed exon-junction protein complexes (EJCs) in PTC-containing transcripts [8]. During the pioneer round event, previously deposited splicing-dependent EJCs, positioned ∼20–24 nucleotides upstream to the exon-exon junction, are detached and removed. It was demonstrated that translating ribosomes are responsible for the removal of the EJCs positioned within the coding region, during the pioneer round of translation [9], [10], [11]. Un-removed EJCs in prematurely translation-terminated transcripts trigger NMD degradation. By and large, PTCs elicit NMD if positioned more than 55 nucleotides upstream to the terminal exon-exon junction, known as the “55 nucleotide rule”. Stop codons positioned downstream to this site (in the penultimate or the terminal exon) fail to elicit NMD and are considered NMD immune [7], [12].

Seven polypeptides constitute the mammalian NMD core mechanism: up-frameshift protein 1 (UPF1), UPF2, UPF3 (comprised isoforms UPF3 and UPF3X) SMG1, SMG5, SMG6 and SMG7. UPF1 is the most conserved, essential protein, with RNA-dependent ATPase and 5′-3′ helicase activities [13], [14]. UPF1 was shown to directly interact with both cap-binding-protein CBP80 and translation termination factors eRF1 and/or eRF3, thus likely linking NMD and translation termination activities [15], [16]. In the event of premature termination, UPF1 and SMG1 interact with EJC-associated UPF2 and UPF3X. Consequent to UPF1/SMG1- EJC interaction, SMG1-mediated UPF1 phosphorylation occurs, triggering translational repression and NMD induced degradation [17], [18]. Until recently the common belief was that NMD is restricted to the pioneer round of translation and only to mRNAs which are associated with cap-binding-protein CBP80-CBP20 complex. Following the removal of the EJCs and the CBP80-CBP20 complex and its replacement by eIF4E, the transcript therefore becomes NMD immune, free to undergo multiple translation cycles [14], [19], [20], [21]. Recently, however, several lines of evidences indicated that NMD may also occur on eIF4E-bound transcripts, which are already being translated [22], [23].

Turning a polycistronic transcript into NMD-immune following the pioneer round of translation requires the removal of all EJCs subsequent to the translation of the functional CDS. In this manuscript we raise the hypothesis that human (and most likely mammalian) functional polycistronic transcripts share a distinctive NMD-immune architecture, which is an emergent property of all functional CDSs. Further, we argue that the definition of potentially NMD-eliciting transcripts (failing to fulfill the "55 nucleotide rule") ought to include the 5' UTR of the molecule, as occurs in many upstream open reading frame (uORF) containing transcripts. In non-polycistronic transcripts. uORFs, which are found in almost 50% of human genes, are mainly characterized by their negative regulatory effect [24], [25], [26], [27]. In other words, the upstream CDS in polycistronic transcripts and the regulatory uORF differ by their NMD-induction potential.

A computational-based approach was utilized to survey monocistronic and polycistronic transcript architecture and to predict the existence of novel polycistronic transcripts in the human transcriptome. We screened the human RefSeq dataset for potentially NMD-eliciting transcripts, according to the classic definition and our modified one. Further, we aimed to isolate those transcripts containing ORFs capable of "rescuing" the mRNA from its NMD-eliciting destiny, i.e., overlapping the exon junctions or positioned in their proximity (as detailed in the Methods section). We then applied domain-based strategies (see below) to predict the potential of the candidate ORF to encode a functional protein. Polycistronic (mainly bicistronic) transcript prediction is presented and discussed.


Known human polycistronic transcript architecture

Our main hypothesis was that polycistronic transcripts share a distinctive NMD-immune architecture, leading to the production of stable mRNA, maximally available for translation. We thus assessed the architecture of the known human polycistronic and bicistronic transcripts (Figure 1). In two bicistronic genes (LASS1-GDF1, and SNURF-SNRPN), NMD immunity is hypothesized to be contributed by both CDSs (LASS1-GDF1, all exon junctions are covered by the two CDSs; SNURF CDS ends 45 nucleotides upstream to the exon junction). In the MFRP-C1QTNF5 gene, three ORFs are responsible for NMD-immune architecture, yet only two of them are documented to encode the known proteins (the first encodes MFRP and the last - C1QTNF5). The MFRP-C1QTNF gene was found to be strongly expressed in human medulla oblongata [3]. Thus, we speculated that, either the middle ORF is also translated, contributing to the removal of the EJCs, or that EJC removal results from an upstream ribosome-dependent spatial effect, occurring during translation reinitiation of the C1QTNF5 CDS (with its AUG positioned only 44 nucleotides downstream to the exon junction; see discussion section). Finally, the MTPN stop codon in the MTPN-LUZP6 transcripts is positioned in the terminal exon solely responsible for transcript NMD-immunity, while LUZP6 is encoded by a cryptic ORF positioned in the 3′ UTR region, which uses IRES and a non-AUG translation initiation codon [5]. Hence, in the four known human bicistronic examples, NMD-immune architecture is demonstrated. Yet, given the very few documented genes available, more evidence is necessary.

Figure 1. Known human polycistronic transcripts architecture.

Exon junctions highlighted in bold, uncovered exon junction coordinates are indicated in bold; annotated CDS in turquoise; ORF in purple; CDS, ORF and transcript coordinates are indicated.

Novel polycistronic transcript prediction in the 3' UTR of penultimate or upstream NMD-eliciting transcripts

Based on the architecture of the known bicistronic transcripts, we devised a strategy for the identification of novel polycistronic genes. Polycistronic transcript search was limited to potentially NMD-eliciting transcripts with an annotated stop codon positioned in the penultimate or upstream exon. Transcripts in which the annotated stop codon is positioned in the terminal exon (similar to the MTPN-LUZP6 gene) were excluded from this study due to the following reasons: (i) the vast majority of the known mammalian bicistronic genes share an NMD-immune architecture contributed by all functional CDSs; (ii) no other criteria were indicative enough: our preliminary results show that ORF coding potential alone is insufficient to distinguish functional ORFs from non-functional ones (data not shown). Furthermore, comparative genomics per se seem to be inadequate based on the lack of evolutionary conservation in the known bicistronic genes.

In all, 30035 Refseq records were analyzed for potentially failing to fulfill the “55 nucleotides rule” and eliciting NMD, as detailed in the Methods section. Of these, 113 transcripts contained an annotated stop codon positioned 55 nucleotides or more upstream to the terminal exon-exon junction. Those were further analyzed for the existence of ORFs which are potentially capable of turning the transcript from NMD-eliciting into NMD-immune. Ninety three potential rescuing ORFs were identified in 68 transcripts.

The existence of a rescuing ORF overlapping an exon junction is far from sufficient in order to identify a polycistronic transcript. We therefore assessed potential functional ORFs based on the following criteria: (i) Existence of a translation initiation sequence. Two potential elements - Kozak-like sequence and internal ribosome entry sites (IRESs) in the 5' end can be considered. We avoided relying on IRES identification as a search criterion because its presence in cellular mRNAs is still debated [28], [29]. Over 85 reported cellular IRES-containing transcripts share long 5' UTRs, multiple uAUGs and a similar GC content, yet a considerable amount of genes fit this profile with no evidences for IRES existence [29], [30]. Furthermore, IRESs are characterized structurally, with no known consensus sequence and therefore in silico identification is problematic, and most studies focus on empirical data validation not on novel IRES prediction [30], [31], [32]. Indeed, when screening the known bicistronic transcripts for IRES sequences, utilizing UTRScan and IRSS, no IRES elements were identified (data not shown) [33], [34]. Still, all polycistronic candidates reported in this manuscript were computationally analyzed for IRES elements with no positive results. (ii) No significant similarity between the candidate ORF sequence and the annotated CDS (or CDSs of alternatively spliced isoforms of the same gene; lower than 50% similarity). A high degree of sequence similarity was assumed to indicate gene rearrangement rather than the existence of a functional ORF. (iii) The potential ORF encoded protein shares a significant similarity to other proteins in the protein database or contains functional domains according to InterProScan analysis (or both – see Methods). In addition, candidate polycistronic transcripts were screened for transcript architecture conservation in other organisms, utilizing BLAST analysis to GenBank databases.

Out of the 93 potential rescuing ORFs, 53 (39 transcripts) were discarded due to high homology between the rescuing ORF and the annotated CDS. The remaining ORFs were further analyzed according to the criteria elaborated above. Eight candidate bicistronic transcripts (6 genes) were identified, out of which two were discarded because the predicted protein was identified to contain only a signal peptide sequence, with no other known protein domains (See Methods section). From the remaining six transcripts, three novel (2 genes) and three known bicistronic transcripts (SNRPN, MFRP and LASS1; GI's: 29540557, 223633880 and 110349723, respectively) were identified (Table 1, only novel candidates are presented).

Table 1. Novel bicistronic transcript candidates followed 3′ UTR analysis of penultimate or upstream NMD-eliciting transcripts.

Annotated ATG exon position: implications for polycistronic transcripts prediction

Limiting the search for functional ORFs to the 3' UTR of the mRNA seems arbitrary. One CDS may indeed be more dominant over the other in terms of its expression level, yet it is not necessarily the first in the polycistronic transcript (e.g., SNURF-SNRPN). Similar to the strategy undertaken in the former stage, we needed to distinguish transcripts which contain a regulatory uORF from polycistronic ones in which the upstream CDS is still unknown. The upstream CDSs in polycistronic transcripts and regulatory uORFs differ first and foremost by their NMD-induction potential. Thus we performed a preliminary analysis aiming to identify potentially NMD-eliciting transcripts based on mRNA 5' screening. We analyzed the distribution of the annotated ATG exon position in human RefSeq transcripts and evaluated how many of them are potentially NMD-eliciting (unless a rescuing ORF will be revealed). NMD degradation induction relies on EJCs that remain after the pioneer round of translation. Since no known sequence-based parameters are available to indicate whether translation re-initiation will occur in sequential ORFs, our approach is applicable only for those cases in which the uORF/CDS and the annotated ATG are positioned in different exons and therefore at least one remaining EJC potentially exists. Transcripts for which the first exon contains the 5′ UTR and the annotated ATG, as well as potentially encoding ORF, were not included in our study as they require experimental evaluation of re-initiation and NMD-eliciting potential. We found that only 59% of the annotated ATGs are positioned in the first exon of the transcript and the rest are positioned in the second or downstream exons (Table 2).

Transcripts in which the annotated ATG is positioned in the second or downstream exons were analyzed for 5′ UTR ORF existence (12320 records; 41% of the Refseq transcriptome). Of these, 6118 transcripts (20.3% of total Refseq transcripts) contain no ORF in their 5′ UTR, i.e., the ribosomal 43S pre-initiation complex is assumed to scan the mRNA until the annotated ATG is reached (detaching pre-deposited EJCs on its way) [19], [20], [21]. These transcripts were therefore not considered as candidates for 5′ UTR-related NMD-induction. Another 4130 transcripts (13.8% of total Refseq transcripts), contain one or more ORFs in their 5′ UTR, yet are expected to display NMD-immunity due to their architecture which theoretically ensures removal of all EJCs according to the “55-nucleotide rule”, implemented on the 5′ UTR. Finally, 2063 records (6.9% of total Refseq transcripts), although they contain an ORF in their 5′ UTR, are predicted to have an NMD-eliciting architecture, since not all EJCs are expected to be removed after the pioneer round of translation, even if re-initiation occurs at the annotated ATG.

At this point we concluded the following: (i) a considerable portion of RefSeq transcripts contain two or more ORFs, are in NMD-immune architecture, and therefore have the potential to function as polycistronic mRNAs; (ii) NMD-eliciting potential in the human transcriptome is likely higher then so far evaluated, due to 5′ UTR-related NMD-eliciting architecture (2063 records; 6.9% of total Refseq transcripts).

5′ UTR-based novel polycistronic transcript prediction

After dividing the transcriptome into groups according to the annotated ATG position and the existence of rescuing uORFs, we turned to predict the 5' UTR-related novel polycistronic transcript potential. A total of 4130 transcripts (13.8% of Refseq transcriptome) constitute the dataset from which we aimed to differentiate transcripts with regulatory uORFs from those with functional upstream CDSs.

Two working assumptions guided this stage: (i) the first ATG identified by the 43S pre-initiation complex can be positioned in the second and downstream exon, and all EJCs deposited upstream to it are removed. Therefore no full exon-junctions coverage is required, and instead we screened for exon-junction coverage between the end of the first ORF identified and the annotated ATG. (ii) potential ORFs were analyzed only if the ORF was larger than 99 nucleotides. This cutoff value was set based on the size range of known polycistronic encoded proteins (59 to 580 amino acids, LUZP6 and MFRP, respectively) and the relatively small size of human uORFs (average length 51.5 nucleotides) [35]. ORF encoding potential and Kozak sequence recognition were carried out as described above and in Methods. Out of the 4130 candidate transcripts screened, 335 were identified to contain ORFs larger than 99 nucleotides with no significant similarity between the candidate ORF sequence and the annotated CDS. Of these, 81 transcripts: (i) contain a Kozak-like sequence in proximity to the candidate AUG and (ii) the potential ORF-encoded protein shares a significant similarity to other proteins in the protein database and/or contains functional domains according to InterProScan analysis. In 29 of these 81 transcripts, InterProScan analysis identified only a signal peptide sequence and/or transmembrane regions, and they were therefore discarded (See Methods section). The remaining 52 transcripts are considered polycistronic candidates, among which three are known transcripts (SNURF-SNRPN, LUZP6 and GDF1; GIs: 29540556, 190886450 and 110349791, respectively). An additional three undergo an unusual transcription pattern: leptin receptor (LEPR, GI: 310923183), which is reported to share the same promoter and the first two exons with the leptin receptor overlapping transcript (LEPROT) gene [36]; The IGF 2 read-through product (GI: 183603938); And the GPR75- ASB3 gene (G protein-coupled receptor 75-ankyrin repeat and SOCS box containing 3; GI: 188528701) read-through product [37] (Table 3, a detailed description in Table S1).

Table 3. Novel human polycistronic transcript candidates followed 5' UTR analysis.

Novel polycistronic transcript validation

We hypothesized that human polycistronic mRNAs share a unique configuration, in which functional CDSs are mutually organizes in an NMD-immune structure. This architecture was demonstrated in four known bicistronic genes (LASS1-GDF1, SNURF-SNRPN, MFRP-C1QTNF5 and MTPN-LUZP6), and was further used to predict the existence of 49 novel polycistronic transcripts. In order to validate our predictions, we screened the literature and databases for known cases of NMD inhibition and transcription initiation site detection.

NMD insensitivity of polycistronic transcripts.

mRNA expression datasets from Gene Identification by NMD inhibition (GINI) experiments, in which mRNA levels are compared in the presence and absence of NMD- inhibitors (emetine, caffeine and NMD-specific siRNA inhibitors) were utilized to identify polycistronic transcript NMD sensitivity. Five datasets representing a variety of cell types were downloaded from the Gene Expression Omnibus database (GEO) and analyzed. Tables 4 and 5 summarize the NMD sensitivity status of the known bicistronic (Table 4) and polycistronic predicted genes (Table 5; detailed information in Tables S2A and S2B) found in the different experiments. Overall, the known bicistronic genes display considerable, stable expression in the different cell types analyzed (Table 4, Table S2A). Fourteen of the predicted genes fulfilled our primary criterion, i.e. genes which all their documented transcripts seem polycistronic (see Methods section). Out of these, twelve are represented in the various datasets that were used for validation (C20orf203, ERVFRD-1, FRRS1, HMGB1, LOC401052, LOC442578, MGC119295, STAG3L3, TXNDC6, UTP14C, ZNF117 and ZNF841), mostly displaying NMD insensitivity (Table 5, Table S2B). ZNF841 exhibits NMD sensitivity in the sole experiment that monitored this gene, in mononuclear leukocytes taken from both healthy and prostate cancer patients (GSE24204). Three more genes display partial NMD sensitivity: The UTP14C gene displays a mixed expression pattern in different cell types, with NMD sensitivity in one experiment and insensitivity in three others. The FRRS1 gene exhibits NMD insensitivity in one experiment (GSE16170) using two treatments (Ago2 siRNA - targets CBP80/20-bound mRNAs and thus considered a regulator of NMD and UPF1 and Ago2 siRNA), and NMD sensitivity in another experiment, in mononuclear leukocytes of prostate cancer patients but not of healthy patients (GSE24204). Finally, the HMGB1 gene displays a mixed expression pattern, indicating NMD sensitivity in one out of 4 experiments. Yet in the same experiment, one HMGB1 probe produces an NMD sensitivity pattern while the other probe indicates NMD insensitivity (GSE29788). The probe which is responsible for the NMD sensitivity pattern identifies also an additional antisense transcript (AF83771), for which we have no CDS annotation, and thus may be an artifact. It may be argued that NMD inhibition is not be fully achieved in GINI experiments and thus lead to somewhat distorted results, yet several observations oppose this notion, including: different methods of NMD inhibition were undertaken in the various experiments analyzed; the mixed results of NMD sensitivity and insensitivity, which was limited to a narrow number of our genes and not for the entire array; and significant, high levels of expression displayed for only some of our genes. All these characteristics do not fit a general NMD-inhibition failure pattern.

Table 4. NMD sensitivity status of human bicistronic genes in published NMD-inhibition experiments.

Table 5. NMD sensitivity status of human polycistronic predicted genes in published NMD-inhibition experiments.

Translation initiation site (TIS) detection in predicted polycistronic transcripts.

NMD is not 100% efficient [6], [7]. Consequently, NMD insensitivity validation per se is not sufficient, as some transcripts may show detectable levels of seemingly stable mRNA even though NMD degradation does occur. A more direct verification is therefore desired, reinforcing our validation results. We therefore crosschecked our predicted transcript list with a dataset of experimentally verified TISs, produced by utilizing global translation initiation sequencing (GTI-seq) in the transcriptome of human embryonic cell line (HEK293) [38]. Out of the 49 polycistronic transcripts predicted in this study, 9 transcripts (5 genes) are listed in the Lee et al. TIS dataset, in the exact position and with the same ORF size as predicted in our study (Table 6). Additionally, two more transcripts, in which two uORFs are expected to rescue the transcript from NMD, are partially listed in this dataset with only one TIS verified.

Table 6. Human polycistronic transcripts found in Lee et al TIS dataset: Novel polycistronic transcripts candidates that were found in Lee et al TIS dataset with exact match both in ORF start position and length; missing rescuing ORF in brackets.


The function of NMD in quality control surveillance and as a gene expression regulatory apparatus is well documented. Diverse events contribute to PTC occurrence, with a key role for alternative splicing, nonsense mutations and SNP-related events [39], [40], [41], [42]. The regulatory aspect of NMD was demonstrated in several physiological settings including the regulation of selenoprotein mRNAs, splicing-factor gene expression, physiologically-related classes of transcripts in Hela cells and others [14], [43]. Additionally, a role for an exon-truncated class of alternative polyadenylation as an NMD-rescue regulatory mechanism was previously suggested by us [44].

The NMD mechanism is not 100% efficient. Up to 25% of the PTC-containing transcripts escape NMD degradation [6], [7]. An efficient re-initiation site within a short distance of a nonsense mutation and in the same exon, was demonstrated to elicit NMD rescue and immunity of the transcripts [45], [46]. At times, escape was shown to be associated with the regulatory mechanism responsible for the introduction of the PTC (e.g. ApoB48 and thrombopoietin translation control) [47], [48]. Overall, 5.8% protein isoforms of the SWISSPROT database are derived from PTC containing transcripts, indicating its regulatory effect [49]. Hence, the existence of a PTC in a given transcript does not necessarily indicate the transcript's destiny, but may rather hint on regulatory stratification.

The novelty of our work is in suggesting a yet undescribed connection between NMD and polycistronic gene architecture. In this study we hypothesized that human polycistronic mRNAs (and most likely mammalian ones in general) share a unique configuration, in which functional CDSs are mutually organized in an NMD-immune architecture. Indeed, we detected NMD-immune transcript architecture in the known human bicistronic genes. Further, we analyzed the human Refseq transcriptome, predicting the existence of 49 novel polycistronic transcripts. Bioinformatics-based analysis, indicating the existence of known protein domains in the predicted functional-ORFs encoded proteins or their similarity to known proteins, supports our assumption regarding the encoded proteins' potential to be produced and active. mRNA expression datasets from GINI experiments were utilized for experimental verification, aiming to explore whether the allegedly polycistronic transcripts are insensitive to NMD. All the known bicistronic transcripts and an additional eleven genes from our predicted gene list displayed NMD insensitivity for various NMD inhibitors, including emetine, caffeine and UPF1 and Ago2 siRNAs (other genes in our list were not sampled in those GINI experiments). The gene ZNF481, which was studied in a single experiment, displayed an NMD sensitive profile. An additional three of the eleven genes displayed a mixed expression pattern of both NMD sensitivity and insensitivity in different cell types, possibly manifesting time-dependent and process-dependent translation re-initiation regulation. Furthermore, some of the genes displayed significantly high levels of expression which do not fit the expression pattern of inefficient NMD. Hence, these results reinforce our view that an NMD-immune architecture is likely to play a role in polycistronic transcript expression regulation. Finally, we succeeded to find 9 predicted polycistronic transcripts (5 genes) in a translation initiation sites dataset, with the exact position and ORF size as predicted in our study (and 2 additional partial matches, for a total of 7 genes) [38]. No overlap between these gene lists of validated polycistronic transcripts was found. Yet, it is expected that different cell types will exhibit different patterns of polycistronic gene regulation, thus, one does not exclude the other. Altogether, 16 novel polycistronic genes are experimentally validated.

The rationale for novel polycistronic transcript prediction was based on distinguishing functional CDSs from regulatory ORFs by their potential to elicit NMD and to encode a functional protein. Many studies addressed the question of parameters affecting translation re-initiation following a uORF, including uORF size, length and lack of secondary structure of the intercistronic spacing sequences and the usage of conserved uATG [19], [25], [26], [50], [51], [52], [53]. Yet no sequence-based information is clear enough to pinpoint whether translation re-initiation will occur. Thus, and although many additional parameters may play a role, and other polycistronic models are likely to exist, no sequence-based models, other than PTC occurrence seem usable. Our work therefore provides one possible scenario arguing for the existence of cellular polycistronic transcripts with an NMD-immune architecture, permitting both mRNA stability and a regulatory mean to control the expression of all or some of the functional CDSs.

It was shown that following translation of a uORF and the release of the 60S subunit, the 40S subunit may remain on the mRNA and resume scanning for as far as 600 nucleotides, without re-initiating translation [54]. Consequently, theoretically, EJCs may be removed from downstream exon junctions in the absence of near-by translation re-initiation. Undoubtedly, this observation is of importance for searching for an NMD-immune architecture, yet the lack of additional knowledge on the conditions affecting 40S scanning and EJC removal prevents us from implementing this knowledge in our study.

Almost half of the human genes have uORFs in their 5′ UTRs, capable of reducing protein expression by 30 to 80% [24], [53], [55], [56], [57]; and though these findings are well documented and acknowledged, they are not assimilated into the vast majority of studies screening and evaluating the NMD fraction of the transcriptome in different contexts. In this study we challenged the classical definition of the "55 nucleotide rule", arguing that it should be "stretched" to the 5' UTR of the transcripts. Namely, we claim that the search for polycistronic-related functional ORFs should take place both in the 3' and 5' UTRs, upstream and downstream to the annotated CDS. Subsequently, we estimate the fraction size of NMD-eliciting transcripts in human Refseq transcriptome to be approximately 7.3%, significantly larger in comparison to its size (0.4%) when analyzing only the 3' UTR.

Moreover, we further argue that while evaluating the potential for NMD, both sides of the exon-exon junction (upstream and downstream) ought to be equally considered. EJCs deposited 20–24 nucleotides upstream to the exon junction are being pushed away and removed by the ribosome. The ribosome's spatial dimensions dictate a downstream EJC displacement even if the stop codon is positioned 50 to 55 nucleotides upstream to the exon junction. Based on this fact, we argue that in the event of translation re-initiation, EJC removal is likely to occur even if the ORF starts in close proximity downstream to the exon-exon junction, further changing NMD fraction size as estimated computationally. Indeed, yeast ribosome footprint experiments indicated the protection of 28 nucleotides upstream and downstream to the ATG (−12 to +15 nucleotides) [58]. Taking into account EJC size and the re-initiation-related sequence being scanned by the ribosome, a region larger than 12 nucleotides is likely to be covered. Possibly, this claim is demonstrated in the known human bicistronic gene MFRP-C1QTNF5, whose transcript contains 3 ORFs contributing to the NMD-immune architecture of the transcript. Yet the gene is recognized by its first and last encoded proteins. There is no evidence for the translation of the protein encoded in the central ORF. We think that it is possible that the last ORF is responsible for removal of the EJCs during translation, due to ribosomal spatial hindrance, even though its ATG is positioned 44 nucleotides downstream to the exon junction. Although the MFRP-C1QTNF5 gene potentially supports our view, the distance range downstream to the exon junction ought to be examined and established experimentally before being implemented in computational studies.

We assume that many functional-ORF encoded proteins are expressed in low levels or to a limited period of time, thus their detection is challenging. Moreover, we believe that seemingly NMD-eliciting, polycistronic transcripts are underrepresented in the mammalian genome annotation, partially due to the tendency to suppress NMD candidate mRNAs from nucleic databases (e.g., RefSeq database policy). NMD-eliciting transcripts seem to include remarkable regulatory features, waiting to be further studied. We therefore hope that our hypothesis will be further verified by experimentalists.


Human transcriptome dataset

35157 Human Refseq mRNAs records were downloaded from the RefSeq database (NCBI, Release dated 18/12/2010; [59]. 1286 records were discarded due to unavailable information on their exon positions or joint CDS in their annotation. Records with annotated stop codon coordinates positioned after the last exon were discarded for 3' analysis (14 records).

NMD-candidate identification

The remaining 33871 records were analyzed for their potential to elicit NMD. Stop codon and exon-intron partitioning of the mRNA molecule was retrieved based on the Refseq annotation.

The annotated stop codon was identified as PTC if the 3′ most nucleotide of the stop codon is positioned more than 55 nucleotides upstream of the terminal exon–exon junction. Transcripts in which the stop codon is positioned in the terminal exon or in the 55 nucleotides preceding the terminal exon junction were considered as NMD-immune. 113 records were identified to potentially elicit NMD.

ORF identification in 3' UTRs

"Rescuing ORFs", capable of turning NMD-eliciting transcripts into NMD-immune one, were searched screening the mRNA sequence in the three reading-frames (0, +1, +2) from 5' to 3' direction. In order for an ORF to be defined as a rescuing one, all exon-exon junctions, downstream to the annotated stop codon, are to be covered according to the "55 nucleotide rule". Namely, the ORF should cover the entire sequence length within the range of 55 nucleotides upstream to the first exon-exon junction-downstream to the annotated stop codon, and up to at least 55 nucleotides upstream to the terminal junction (or further downstream). For NMD-eliciting transcripts harboring an annotated stop codon in the penultimate exon in a distance >55 nucleotides, a minimal rescuing ORF should start and end in the penultimate exon in a distance smaller than 55 nucleotides upstream to the terminal exon. If more than one ORF was required in order to turn NMD-eliciting transcript into NMD-immune, they were considered as one unit.

Out of 113 NMD-eliciting transcripts, 68 NMD-immune transcripts contained one or more rescuing ORFs (a total of 93 ORFs). At this stage, all rescuing ORFs were considered, including ORFs which overlap one another in different or identical reading-frames. No overlap was allowed between the annotated CDS and rescuing ORFs.

ORF identification in 5' UTRs

Functional uORFs were searched for in the 5′ UTRs of RefSeq records for which the annotated CDS start codon is in the second or later exon. An ORF screen was carried out in the three reading-frames (0, +1, +2) from 5' to 3'. Records which followed our hypothesized polycistronic architecture, namely, exon-junctions coverage between the end of the first ORF identified and the annotated ATG and potential ORFs larger than 99 nucleotides, were further analyzed. Out of 4130 transcripts containing ORFs in their 5' UTR with an NMD-immune architecture, 335 transcripts were identified to contain ORFs larger than 99 nucleotides and with no significant similarity between the candidate ORF sequence and the annotated CDS, and were further evaluated for polycistronic and bicistronic candidacy.

Kozak sequence element

Initiation of translation by eukaryotic ribosomes is optimal at the ACCATGG consensus sequence [60]. Yet a purine in position −3 (relative to the A nucleotide in the ATG) followed by ATG at positions +1-+3 is sufficient for efficient translation initiation, thus this minimal sequence was identified as Kozak positive in this study.

BLAST analysis

BLASTN analysis was carried in order to rule out candidate ORF encoded proteins with significant similarity to the annotated protein of the transcript. ORF sequences were analyzed against the human RefSeq mRNA dataset utilizing BLASTN standalone application with default parameters [61]. No significant similarity between the candidate ORF sequence and the annotated CDS (or CDSs of alternatively spliced isoforms of the same gene) was allowed. A high degree of sequence similarity was hypothesized to indicate a 3′ UTR ORF, which is the result of a gene rearrangement event rather than the existence of a functional ORF. Parsing of the BLASTN results was based on a threshold of e-value greater than 1.00E-06 and 50% or more coverage between the CDS from transcripts with the same gene ID and the candidate ORF.

Functional ORF encoded protein characterization utilizing BLASTP

Candidate functional ORFs were predicted based on whether the potential ORF encoded protein shares a significant similarity to other proteins in the protein database. Candidate ORFs were translated to potential protein sequences and were analyzed for protein similarity utilizing BLASTP (standalone version) with default parameters against the non-redundant protein sequences (nr) database [61]. A positive ratio (number of positive hits/ORF length) > 0.5 or align length (alignment length/ORF length) > 0.8 was used to identify significant hits.

Functional ORF encoded protein characterization utilizing InterProScan

Further, candidate functional ORFs were predicted based on whether the potential ORF contains functional domains according to InterProScan analysis. The ORF-encoded proteins were screened for known protein domains utilizing InerProScan Web Services ( [62].

The signal peptide domain and the transmembrane domain were considered insufficient to predict a functional protein. Although the subcellular localization of proteins is widely studied with many tools, its prediction is not always accurate. Further, non-classical secretion pathways exist, assisting in the secretion of signal-peptide free proteins, contributing to the uncertainty of predicting subcellular localization [63]. Thus, identifying polycistronic transcript existence based on a signal peptide or transmembrane domain existence seems less reliable, and additional indications were required.

Polycistronic transcripts NMD-immunity validation

Five GINI experiments datasets (GSE1703, GSE16170, GSE20491, GSE24204 and GSE29788) in which mRNA levels are compared in the presence and absence of NMD- inhibitors of different sorts (i.e., the chemicals emetine and caffeine and NMD-specific siRNA inhibition; see Tables 4 and 5 for experiment treatment details), were downloaded from the GEO database [64]. Experiments were selected based on data enabling reanalysis, and at least partial overlap between the known and predicted bicistronic genes and probes represented in the array. mRNA expression results were used only if the probe/s available in the array identified the only gene of question and not additional gene family members. Further, since the vast majority of human genes undergo alternative splicing, we obviously preferred relying on probes which distinguish between monocistronic and polycistronic transcripts of the same gene, when such probes were available. For the predicted genes, we limited ourselves to genes which all their documented transcripts (according to RefSeq annotation), are polycistronic according to our prediction (one or more). If threshold expression levels are defined in the dataset annotation (published manuscripts included), those were taken under consideration, and otherwise all available results were included. In order to assess whether gene expression levels significantly differ between NMD-inhibition treatment and control cells, we implemented heteroscedastic two tailed T-Test analysis. For those experiments which specifically set a cut-off value defining the threshold of NMD sensitivity, we used the later (as indicated in Tables 4 and 5 and in Tables S2A and S2B).

Supporting Information

Table S1.

Novel polycistronic transcript candidates followed 5' UTR analysis.


Table S2.

Table S2A - Known human bicistronic genes in published GINI experiments. Table S2B - Predicted polycistronic genes in published GINI experiments.


Author Contributions

Conceived and designed the experiments: DS. Performed the experiments: GS. Analyzed the data: GS DS. Wrote the paper: DS.


  1. 1. Blumenthal T (2004) Operons in eukaryotes. Brief Funct Genomic Proteomic 3: 199–211.
  2. 2. Gray TA, Saitoh S, Nicholls RD (1999) An imprinted, mammalian bicistronic transcript encodes two independent proteins. Proc Natl Acad Sci U S A 96: 5616–5621.
  3. 3. Katoh M (2001) Molecular cloning and characterization of MFRP, a novel gene encoding a membrane-type Frizzled-related protein. Biochem Biophys Res Commun 282: 116–123.
  4. 4. Lee SJ (1991) Expression of growth/differentiation factor 1 in the nervous system: conservation of a bicistronic structure. Proc Natl Acad Sci U S A 88: 4250–4254.
  5. 5. Xiong Z, Liu E, Yan Y, Silver RT, Yang F, et al. (2006) An unconventional antigen translated by a novel internal ribosome entry site elicits antitumor humoral immune reactions. J Immunol 177: 4907–4916.
  6. 6. Anczukow O, Ware MD, Buisson M, Zetoune AB, Stoppa-Lyonnet D, et al. (2008) Does the nonsense-mediated mRNA decay mechanism prevent the synthesis of truncated BRCA1, CHK2, and p53 proteins? Human mutation 29: 65–73.
  7. 7. Isken O, Maquat LE (2007) Quality control of eukaryotic mRNA: safeguarding cells from abnormal mRNA function. Genes & development 21: 1833–1856.
  8. 8. Maquat LE (2005) Nonsense-mediated mRNA decay in mammals. J Cell Sci 118: 1773–1776.
  9. 9. Dostie J, Dreyfuss G (2002) Translation is required to remove Y14 from mRNAs in the cytoplasm. Current biology : CB 12: 1060–1067.
  10. 10. Lejeune F, Ishigaki Y, Li X, Maquat LE (2002) The exon junction complex is detected on CBP80-bound but not eIF4E-bound mRNA in mammalian cells: dynamics of mRNP remodeling. The EMBO journal 21: 3536–3545.
  11. 11. Sato H, Maquat LE (2009) Remodeling of the pioneer translation initiation complex involves translation and the karyopherin importin beta. Genes & development 23: 2537–2550.
  12. 12. Maquat LE (2004) Nonsense-mediated mRNA decay: splicing, translation and mRNP dynamics. Nat Rev Mol Cell Biol 5: 89–99.
  13. 13. Baker KE, Parker R (2004) Nonsense-mediated mRNA decay: terminating erroneous gene expression. Curr Opin Cell Biol 16: 293–299.
  14. 14. Isken O, Maquat LE (2008) The multiple lives of NMD factors: balancing roles in gene and genome regulation. Nat Rev Genet 9: 699–712.
  15. 15. Hosoda N, Kim YK, Lejeune F, Maquat LE (2005) CBP80 promotes interaction of Upf1 with Upf2 during nonsense-mediated mRNA decay in mammalian cells. Nat Struct Mol Biol 12: 893–901.
  16. 16. Kashima I, Yamashita A, Izumi N, Kataoka N, Morishita R, et al. (2006) Binding of a novel SMG-1-Upf1-eRF1-eRF3 complex (SURF) to the exon junction complex triggers Upf1 phosphorylation and nonsense-mediated mRNA decay. Genes Dev 20: 355–367.
  17. 17. Isken O, Kim YK, Hosoda N, Mayeur GL, Hershey JW, et al. (2008) Upf1 phosphorylation triggers translational repression during nonsense-mediated mRNA decay. Cell 133: 314–327.
  18. 18. Ohnishi T, Yamashita A, Kashima I, Schell T, Anders KR, et al. (2003) Phosphorylation of hUPF1 induces formation of mRNA surveillance complexes containing hSMG-5 and hSMG-7. Mol Cell 12: 1187–1200.
  19. 19. Jackson RJ, Hellen CU, Pestova TV (2010) The mechanism of eukaryotic translation initiation and principles of its regulation. Nat Rev Mol Cell Biol 11: 113–127.
  20. 20. Kozak M (1989) The scanning model for translation: an update. J Cell Biol 108: 229–241.
  21. 21. Sonenberg N, Hinnebusch AG (2009) Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell 136: 731–745.
  22. 22. Durand S, Lykke-Andersen J (2013) Nonsense-mediated mRNA decay occurs during eIF4F-dependent translation in human cells. Nature structural & molecular biology 20: 702–709.
  23. 23. Rufener SC, Muhlemann O (2013) eIF4E-bound mRNPs are substrates for nonsense-mediated mRNA decay in mammalian cells. Nature structural & molecular biology 20: 710–717.
  24. 24. Calvo SE, Pagliarini DJ, Mootha VK (2009) Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proc Natl Acad Sci U S A 106: 7507–7512.
  25. 25. Churbanov A, Rogozin IB, Babenko VN, Ali H, Koonin EV (2005) Evolutionary conservation suggests a regulatory function of AUG triplets in 5'-UTRs of eukaryotic genes. Nucleic Acids Res 33: 5512–5520.
  26. 26. Kozak M (2001) Constraints on reinitiation of translation in mammals. Nucleic Acids Res 29: 5226–5232.
  27. 27. Matsui M, Yachie N, Okada Y, Saito R, Tomita M (2007) Bioinformatic analysis of post-transcriptional regulation by uORF in human and mouse. FEBS Lett 581: 4184–4188.
  28. 28. Komar AA, Hatzoglou M (2005) Internal ribosome entry sites in cellular mRNAs: mystery of their existence. The Journal of biological chemistry 280: 23425–23428.
  29. 29. Kozak M (2005) A second look at cellular mRNA sequences said to function as internal ribosome entry sites. Nucleic acids research 33: 6593–6602.
  30. 30. Baird SD, Turcotte M, Korneluk RG, Holcik M (2006) Searching for IRES. RNA 12: 1755–1785.
  31. 31. Filbin ME, Kieft JS (2009) Toward a structural understanding of IRES RNA function. Curr Opin Struct Biol 19: 267–276.
  32. 32. Gilbert WV (2010) Alternative ways to think about cellular internal ribosome entry. J Biol Chem 285: 29033–29038.
  33. 33. Pesole G, Liuni S, Grillo G, Licciulli F, Mignone F, et al. (2002) UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Update 2002. Nucleic Acids Res 30: 335–340.
  34. 34. Wu TY, Hsieh CC, Hong JJ, Chen CY, Tsai YS (2009) IRSS: a web-based tool for automatic layout and analysis of IRES secondary structure prediction and searching system in silico. BMC Bioinformatics 10: 160.
  35. 35. Iacono M, Mignone F, Pesole G (2005) uAUG and uORFs in human and rodent 5'untranslated mRNAs. Gene 349: 97–105.
  36. 36. Bailleul B, Akerblom I, Strosberg AD (1997) The leptin receptor promoter controls expression of a second distinct protein. Nucleic acids research 25: 2752–2758.
  37. 37. Prakash T, Sharma VK, Adati N, Ozawa R, Kumar N, et al. (2010) Expression of conjoined genes: another mechanism for gene regulation in eukaryotes. PloS one 5: e13284.
  38. 38. Lee S, Liu B, Huang SX, Shen B, Qian SB (2012) Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proceedings of the National Academy of Sciences of the United States of America 109: E2424–2432.
  39. 39. Green RE, Lewis BP, Hillman RT, Blanchette M, Lareau LF, et al. (2003) Widespread predicted nonsense-mediated mRNA decay of alternatively-spliced transcripts of human normal and disease genes. Bioinformatics 19 Suppl 1i118–121.
  40. 40. Lewis BP, Green RE, Brenner SE (2003) Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans. Proc Natl Acad Sci U S A 100: 189–192.
  41. 41. Mort M, Ivanov D, Cooper DN, Chuzhanova NA (2008) A meta-analysis of nonsense mutations causing human genetic disease. Human mutation 29: 1037–1047.
  42. 42. Han A, Kim WY, Park SM (2007) SNP2NMD: a database of human single nucleotide polymorphisms causing nonsense-mediated mRNA decay. Bioinformatics 23: 397–399.
  43. 43. Mendell JT, Sharifi NA, Meyers JL, Martinez-Murillo F, Dietz HC (2004) Nonsense surveillance regulates expression of diverse classes of mammalian transcripts and mutes genomic noise. Nat Genet 36: 1073–1078.
  44. 44. Gilat R, Shweiki D (2007) A novel function for alternative polyadenylation as a rescue pathway from NMD surveillance. Biochemical and biophysical research communications 353: 487–492.
  45. 45. Buisson M, Anczukow O, Zetoune AB, Ware MD, Mazoyer S (2006) The 185delAG mutation (c.68_69delAG) in the BRCA1 gene triggers translation reinitiation at a downstream AUG codon. Human mutation 27: 1024–1029.
  46. 46. Zhang J, Maquat LE (1997) Evidence that translation reinitiation abrogates nonsense-mediated mRNA decay in mammalian cells. EMBO J 16: 826–833.
  47. 47. Chester A, Somasekaram A, Tzimina M, Jarmuz A, Gisbourne J, et al. (2003) The apolipoprotein B mRNA editing complex performs a multifunctional cycle and suppresses nonsense-mediated decay. The EMBO journal 22: 3971–3982.
  48. 48. Stockklausner C, Breit S, Neu-Yilik G, Echner N, Hentze MW, et al. (2006) The uORF-containing thrombopoietin mRNA escapes nonsense-mediated decay (NMD). Nucleic acids research 34: 2355–2363.
  49. 49. Hillman RT, Green RE, Brenner SE (2004) An unappreciated role for RNA surveillance. Genome biology 5: R8.
  50. 50. Hinnebusch AG (1997) Translational regulation of yeast GCN4. A window on factors that control initiator-trna binding to the ribosome. J Biol Chem 272: 21661–21664.
  51. 51. Poyry TA, Kaminski A, Jackson RJ (2004) What determines whether mammalian ribosomes resume scanning after translation of a short upstream open reading frame? Genes Dev 18: 62–75.
  52. 52. Hinnebusch AG (2006) eIF3: a versatile scaffold for translation initiation complexes. Trends Biochem Sci 31: 553–562.
  53. 53. Morris DR, Geballe AP (2000) Upstream open reading frames as regulators of mRNA translation. Mol Cell Biol 20: 8635–8642.
  54. 54. Miller PF, Hinnebusch AG (1989) Sequences that surround the stop codons of upstream open reading frames in GCN4 mRNA determine their distinct functions in translational control. Genes Dev 3: 1217–1225.
  55. 55. Kozak M (1991) Structural features in eukaryotic mRNAs that modulate the initiation of translation. J Biol Chem 266: 19867–19870.
  56. 56. Resch AM, Ogurtsov AY, Rogozin IB, Shabalina SA, Koonin EV (2009) Evolution of alternative and constitutive regions of mammalian 5'UTRs. BMC Genomics 10: 162.
  57. 57. Suzuki Y, Ishihara D, Sasaki M, Nakagawa H, Hata H, et al. (2000) Statistical analysis of the 5' untranslated region of human mRNA using "Oligo-Capped" cDNA libraries. Genomics 64: 286–297.
  58. 58. Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS (2009) Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324: 218–223.
  59. 59. Pruitt KD, Tatusova T, Maglott DR (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35: D61–65.
  60. 60. Kozak M (1986) Point mutations define a sequence flanking the AUG initiator codon that modulates translation by eukaryotic ribosomes. Cell 44: 283–292.
  61. 61. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of molecular biology 215: 403–410.
  62. 62. Zdobnov EM, Apweiler R (2001) InterProScan—an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17: 847–848.
  63. 63. Imai K, Nakai K (2010) Prediction of subcellular locations of proteins: where to proceed? Proteomics 10: 3970–3983.
  64. 64. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, et al. (2009) NCBI GEO: archive for high-throughput functional genomic data. Nucleic acids research 37: D885–890.