Reader Comments

Little Biological and Statistical Support for Hundreds of Selenoproteins in Mouse Pseudo-Messenger RNAs

Posted by PLOS_Genetics on 20 Feb 2008 at 10:54 GMT

Originally submitted as a Reader Response by Sergi Castellano (sc464@cornell.edu) on 6 June 2006:

Eukaryotic selenoproteins are difficult to find. Several computational methods for identifying selenoproteins exist; they share moderate to high sensitivity but low specificity, so experimental testing of predictions is necessary. Independent lines of evidence describe similar and small selenoproteomes, with 25 selenoproteins in humans and 24 in rodents [1,2]. Frith et al. tackle, in silico, the identification of disrupted mRNAs in mouse and describe hundreds of novel selenoproteins and other types of disrupted mRNAs [3]. This exciting result raises the number of selenoproteins in mammals ten-fold, a dramatic shift in our views of the use and importance of selenium. However, non-standard mRNAs are a double-edged sword. The deceptive nature of these predictions calls for caution and, unfortunately, examination of this work reveals biological and statistical flaws.

The FANTOM3 set of 102,801 mouse cDNAs, composed of 56,722 protein-coding cDNAs, 34,030 noncoding cDNAs and 12,049 artifacts, UTRs and immature cDNAs [4], was analyzed. These cDNAs were aligned to SWISS-PROT proteins of any species by three-frame translation and the best hit was kept. Several problems arise from this strategy. First, the protein-coding cDNAs that cannot be aligned to their own protein (SWISS-PROT contains less than half of the mouse proteome) align to proteins from distant species. Alignments between homologous but divergent proteins in the right frame, with the real stops shifted, result in internally aligned stops (shuffled terminal exons are especially insidious). Alignments between non-homologous proteins, in any frame, also overtranslate cDNAs. Second, noncoding cDNAs align to proteins in a meaningless frame, producing additional aligned stops. The same goes for the artifact cDNAs. These problems are exacerbated by the use of soft-masking, an E-value threshold that is not conservative for the number of searches performed, and a quality filter (comparison to the genome) unable to correct for mistranslations not due to sequencing errors. Overall, extended translation of cDNAs beyond the true stop and the construction of out-of-frame alignments with in-frame stops occur easily. Repetition of the experiment shows poorly and asymmetrically conserved alignments around UGA codons, a reflection of the aforementioned concerns.
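As a minimal illustration of why the choice of reading frame matters, the following Python sketch lists the stop codons that fall in-frame in each of the three forward frames of a cDNA. The function name and the toy sequence are hypothetical and are not part of the FANTOM3 or FASTA pipelines; the sketch only shows that the same nucleotides yield different sets of "internal" stops depending on the frame an alignment imposes.

```python
STOP_CODONS = {"TGA", "TAG", "TAA"}

def stops_by_frame(cdna):
    """Return {frame: [(position, codon), ...]} for the three forward frames.

    Positions are 0-based offsets of the first nucleotide of each stop codon.
    RNA input (U) is accepted and converted to DNA (T).
    """
    seq = cdna.upper().replace("U", "T")
    result = {}
    for frame in range(3):
        hits = []
        # Step through complete codons only, starting at the frame offset.
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                hits.append((pos, codon))
        result[frame] = hits
    return result

# Toy sequence (hypothetical): frame 0 reads ATG TGA AAA TAA.
print(stops_by_frame("ATGTGAAAATAA"))
```

In this toy case frame 0 contains a UGA followed by a UAA, while frames 1 and 2 contain no stops at all, so an alignment forced into the wrong frame would report a different pattern of disruptions.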

Nonetheless, 1,514 cDNAs were found with UGA disruptions only. This represents twice as many cases as those with UAG or UAA only. The authors consider this evidence for hundreds of selenoproteins, while it just reflects the stop codon usage bias in mouse cDNAs (about 50% UGAs). No enrichment of SECIS elements in the cDNAs with UGA is found, which is attributed to a presumed low search sensitivity. On the contrary, SECISEARCH is highly sensitive [1]. It is also suggested that an alternative signal could be in place, but all present evidence (SECIS conservation and SECIS-independent searches) suggests the opposite [1,2]. Finally, the NMD rule is redefined by allowing stop codons to be 55 nt upstream of any splice junction instead of the 3'-most splice junction. Subsequent analyses, not only of selenoproteins, are thus unreliable. Although more selenoproteins may exist, the false-positive issues and the lack of symmetrically conserved cysteine homologs and of experimental data make the claim of hundreds of selenoproteins unjustified to date.

References
1. Kryukov GV, Castellano S, Novoselov SV, Lobanov AV, Zehtab O, et al. (2003) Characterization of mammalian selenoproteomes. Science 300:1439-1443.
2. Castellano S, Novoselov SV, Kryukov GV, Lescure A, Blanco E, et al. (2004) Reconsidering the evolution of eukaryotic selenoproteins: a novel nonmammalian family with scattered phylogenetic distribution. EMBO Rep 5:71-77.
3. Frith MC, Wilming LG, Forrest A, Kawaji H, Tan SL, et al. (2006) Pseudo-Messenger RNA: Phantoms of the Transcriptome. PLoS Genet 2:e23.
4. Maeda N, Kasukawa T, Oyama R, Gough J, Frith M, et al. (2006) Transcript Annotation in FANTOM3: Mouse Gene Catalog Based on Physical cDNAs. PLoS Genet 2:e62.

RE: Little Biological and Statistical Support for Hundreds of Selenoproteins in Mouse Pseudo-Messenger RNAs

PLOS_Genetics replied to PLOS_Genetics on 20 Feb 2008 at 11:01 GMT

Originally submitted as a Reader Response by Martin Frith (martin@gsc.riken.jp) on 14 July 2006 (additional authors: Alistair Forrest, Lukasz Huminiecki, Piero Carninci, and Yoshihide Hayashizaki):

This response from Sergi Castellano questions our evidence for hundreds of mouse selenoproteins, and casts doubt on analyses underlying other parts of our study. Our article presents one line of statistical evidence for "potential" selenoproteins, and we agree that experimental evidence is needed to confirm this finding [1]. We do not agree that our article has "biological and statistical flaws." On the other hand, Castellano's letter pointed us to unexpected complexities in stop codon bias.

Castellano claims that we redefine the NMD rule by allowing stop codons to be 55 nt upstream of any splice junction instead of the 3'-most splice junction. These definitions are equivalent: if a stop codon is > 55 nt upstream of any splice junction, then it is necessarily > 55 nt upstream of the 3'-most splice junction and vice versa.
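The equivalence can be checked mechanically. The Python sketch below assumes that "any" is read as "at least one", that positions are transcript coordinates increasing in the 5' to 3' direction, and that the function names and the brute-force check are purely illustrative rather than code from the study.

```python
from itertools import combinations

def upstream_of_some_junction(stop_pos, junctions, min_dist=55):
    """True if the stop lies more than min_dist nt upstream of at least one junction."""
    return any(j - stop_pos > min_dist for j in junctions)

def upstream_of_last_junction(stop_pos, junctions, min_dist=55):
    """True if the stop lies more than min_dist nt upstream of the 3'-most junction."""
    if not junctions:
        return False
    return max(junctions) - stop_pos > min_dist

def definitions_agree(max_pos=120, max_junctions=3):
    """Brute-force comparison over small, hypothetical transcripts."""
    positions = range(0, max_pos, 10)
    for n in range(max_junctions + 1):
        for junctions in combinations(positions, n):
            for stop_pos in positions:
                a = upstream_of_some_junction(stop_pos, junctions)
                b = upstream_of_last_junction(stop_pos, junctions)
                if a != b:
                    return False
    return True

print(definitions_agree())
```

The two predicates agree because the 3'-most junction is the one farthest downstream: if any junction is more than 55 nt past the stop, the last one is too, and the last junction is itself one of the junctions.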

Castellano criticises our cDNA-protein alignments; we maintain that they accurately reflect homology (common ancestry). FASTX3 E-values are very accurate, provided repetitive sequences are masked [2]. Soft-masking does not affect this, since E-values are calculated before unmasking (FASTA3 documentation). Thus, ~99% of the alignments should reflect homology. Note that homology does not imply translatability. If ancestrally coding nucleotides become noncoding due to appearance of an upstream stop codon, we might expect asymmetric conservation around the stop codon as observed by Castellano. The whole point of our study is to explain the untranslatable cases.

Finally, Castellano suggests that the excess of pseudo-mRNAs with UGA disruptions only, compared to UAG only or UAA only, simply reflects stop codon usage bias. Our study included the following control: we counted the frequencies of UGA, UAG, and UAA disruptions in pseudo-mRNAs with more than one type of disruption. If the disruptions occurred randomly and independently in these proportions (37% UGA), the number of pseudo-mRNAs with UGA disruptions only would be about 300 fewer than observed. Following Castellano's suggestion, we counted stop codons used by a set of 8,113 reliably identified mouse proteins [3]. To our surprise, these frequencies are different (48% UGA), and using these frequencies, the number of pseudo-mRNAs with only UGA disruptions is no more than expected. Why are these frequencies different, and which is the more appropriate control?

Further statistics suggest a solution to this puzzle. The abundance of TGA trinucleotides in the mouse genome, relative to TAG and TAA, is 37%. (The abundance in nonrepetitive regions only is also 37%.) The frequency of UGA in noncoding frames of the 8,113 ORFs, on the other hand, is 64%, while the abundance of UGA in the 3'UTRs of these 8,113 ORFs is 39%. Thus, codon usage biases cause an excess of UGA trinucleotides in noncoding frames of protein-coding ORFs. Protein-coding ORFs commonly evolve by frameshift mutations near the end of the ORF, causing a new, previously out-of-frame stop codon to be used [3; Figure 5]. This process, together with the excess of UGA in noncoding frames of ORFs, may account for the previously unexplained overrepresentation of UGA stop codons in higher eukaryotes [4]. On the other hand, the stop codon disruptions in our pseudo-mRNAs do not originate from previously out-of-frame trinucleotides, because pseudo-mRNAs with frameshifts were excluded from our selenoprotein analysis. Thus, the UGA frequency in pseudo-mRNAs with more than one type of disruption reflects the genome average. All things considered, this (pseudo-mRNAs with more than one type of disruption) is the most appropriate control, and the excess of pseudo-mRNAs with UGA disruptions only remains unexplained by any simple null hypothesis.
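To make the sensitivity of this null hypothesis to the assumed UGA frequency concrete, here is a minimal Python sketch. It assumes the simplest independence model: each stop disruption is UGA with probability p, so a pseudo-mRNA with k disruptions is UGA-only with probability p^k. The function name and the disruption counts are hypothetical placeholders, not data from the study, and the sketch does not attempt to reproduce the figures quoted above.

```python
def expected_uga_only(disruption_counts, p_uga):
    """Expected number of transcripts whose stop disruptions are all UGA.

    disruption_counts: number of stop-codon disruptions in each transcript
                       (each must be at least 1).
    p_uga: assumed probability that a single disruption is UGA.
    """
    if not 0.0 <= p_uga <= 1.0:
        raise ValueError("p_uga must be between 0 and 1")
    total = 0.0
    for k in disruption_counts:
        if k < 1:
            raise ValueError("each transcript needs at least one disruption")
        # Under independence, all k disruptions are UGA with probability p^k.
        total += p_uga ** k
    return total

# Hypothetical disruption counts for a handful of transcripts.
counts = [1, 1, 1, 2, 2, 3]
for p in (0.37, 0.48):
    print(p, expected_uga_only(counts, p))
```

Because the expectation grows with p for every transcript, moving from a 37% to a 48% UGA frequency raises the expected number of UGA-only cases, which is why the choice of control decides whether the observed count looks excessive.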

References
1. Frith MC et al. (2006) PLoS Genet 2: e23
2. Pearson WR (1998) J Mol Biol 276:71-78
3. Frith MC et al. (2006) PLoS Genet 2: e52
4. Sun J et al. (2005) J Mol Evol 61:437-444