Computational identification of the selenocysteine tRNA (tRNASec) in genomes

Selenocysteine (Sec) is known as the 21st amino acid, a cysteine analogue with selenium replacing sulphur. Sec is inserted co-translationally in a small fraction of proteins called selenoproteins. In selenoprotein genes, the Sec specific tRNA (tRNASec) drives the recoding of highly specific UGA codons from stop signals to Sec. Although found in organisms from the three domains of life, Sec is not universal. Many species are completely devoid of selenoprotein genes and lack the ability to synthesize Sec. Since tRNASec is a key component in selenoprotein biosynthesis, its efficient identification in genomes is instrumental to characterize the utilization of Sec across lineages. Available tRNA prediction methods fail to accurately predict tRNASec, due to its unusual structural fold. Here, we present Secmarker, a method based on manually curated covariance models capturing the specific tRNASec structure in archaea, bacteria and eukaryotes. We exploited the non-universality of Sec to build a proper benchmark set for tRNASec predictions, which is not possible for the predictions of other tRNAs. We show that Secmarker greatly improves the accuracy of previously existing methods constituting a valuable tool to identify tRNASec genes, and to efficiently determine whether a genome contains selenoproteins. We used Secmarker to analyze a large set of fully sequenced genomes, and the results revealed new insights in the biology of tRNASec, led to the discovery of a novel bacterial selenoprotein family, and shed additional light on the phylogenetic distribution of selenoprotein containing genomes. Secmarker is freely accessible for download, or online analysis through a web server at http://secmarker.crg.cat.


A) Benchmark in eukaryotes
A total of 212 eukaryotic genomes (105 with Sec, and 107 without Sec) were analyzed. Secmarker predicted tRNA-Sec in 104 genomes from the positive set (105), achieving a sensitivity of 99%, compared to 94% and 61% for aragorn and tRNAscan-SE, respectively. There was a single Seccontaining genome without Secmarker predictions (false negative), that of the protist Phytophthora capsici (Fig 2). Like Secmarker, tRNAscan-SE did not predict any tRNA-Sec candidate. However, aragorn predicted two tRNA-Sec genes in the genome of P. capsici. Manual inspection of the aragorn candidates revealed that these did not fit the eukaryotic tRNA-Sec model (section 2 in this file). We concluded that the tRNA-Sec predictions by aragorn in the genome of P. capsici were actually false positives. Notwithstanding, since the rest of Sec machinery and a Sec-containing thioredoxin reductase gene were found in the genome [1], we expected tRNA-Sec to be present as well, but most likely missing from the genome assembly. We thus analyzed raw sequencing data available from P. capsici with Secmarker. We identified a tRNA-Sec supported by several genomic reads, containing the full sequence of the gene (section 3 in this file). This confirmed the presence of tRNA-Sec in P. capsici. Therefore, despite this genome was counted as a Secmarker false negative, this was due to incompleteness of the genome assembly available, rather than lack of accuracy of the program.
The specificity of Secmarker in eukaryotic genomes was 99%, compared to 62% and 49% for tRNAscan-SE and aragorn, respectively. Unlike Secmarker, the other two programs produced numerous false positive predictions in fungi and land plants (Fig 2A), both known to lack selenoproteins [2]. The only false positive by Secmarker was a bona fide prediction in the genome of Phytophthora ramorum. The presence of a good tRNA-Sec candidate for P. ramorum is in apparent contradiction with the the lack of selenoprotein genes and other Sec machinery factors in the genome of this species, as reported in [1]. Suspiciously, the other three genomes available from the genus Phytophthora, in contrast, exhibit the complete Sec machinery and at least one bona fide selenoprotein gene. Although selenoprotein extinctions have been reported in some species within otherwise Sec-containing genus [3], this could also reveal that the genome assembly analyzed in [1] was incomplete. Thus, we analyzed a more recent, more complete assembly, and found all the markers of Sec utilization, as well as a Sec-containing thioredoxin reductase gene with a SECIS element (section 4 in this file). We concluded, therefore, that P. ramorum was wrongly predicted to lack Sec in [1]; this species actually utilizes Sec and Secmarker and aragorn correctly identified the tRNA-Sec gene.
We investigated the overlap between the tRNA-Sec predictions by the different programs. Secmarker showed the highest fraction of predictions with support from at least another program (83.0%) compared to aragorn and tRNAscan-SE (39.7% and 30.6%, respectively). There were 35 predictions by Secmarker in Sec-containing genomes that did not overlap any prediction from the other two programs (Fig 2B). Only eight of them corresponded to the top scoring prediction in the genome. Such predictions were in six protists and in Daphnia pulex (discussed in the main text). The top scoring prediction in Monodelphis domestica was among the 35, but it showed pseudogene features. The remaining corresponded to genomes with multiple predictions, and the top scoring one was not among the 35. Some of them were likely to be Secmarker false positive predictions.

B) Benchmark in bacteria.
A total of 217 bacterial genomes were included in the test, with 42 of them (19%) predicted to utilize Sec. Secmarker and aragorn achieved perfect sensitivity (100%), while tRNAscan-SE missed tRNA-Sec in six Sec-containing bacterial genomes (sensitivity 86%). Specificity was higher for Secmarker (99%) than for aragorn (87%) and tRNAscan-SE (92%). Secmarker produced two false positives, both in the genus Burkholderia. The sequences of these two tRNA-Sec candidates are identical. Both of them were predicted also by aragorn, and missed by tRNAscan-SE. These two genomes were predicted to lack Sec due to the absence of the gene selD in the genome assembly [1]. However, we were able to identify the rest of the Sec machinery, as well as a Seccontaining FDH gene (section 5 this file). This supports that, although these two genomes appeared as false positives in our benchmark, Sec is actually used by these organisms and Secmarker and aragorn correctly identified their tRNA-Sec gene. There was a good overlap between the predictions produced by the three programs in bacterial genomes ( Fig 3B). All but one of the tRNA-Sec predictions in Sec-containing genomes were predicted by at least two programs, and 37 out of 48 were predicted by the three of them.

C) Benchmark in archaea.
Since archaeal genomes were scarcely represented in [1], we analyzed a larger number of genomes in this kingdom (Fig 4). We used the presence of the genes selD and EF-Sec identified by Selenoprofiles [4] to predict Sec utilization in a total of 213 archaeal genomes. We predicted 14 such genomes (7%) to use Sec. Secmarker obtained perfect sensitivity (100%), while aragorn sensitivity was 93% and tRNAscan-SE sensitivity was 57%, with one and six false negatives, respectively (Fig 4). RF01852 sensitivity, ran with the recommended parameters (http:// rfam.xfam.org/family/tRNA-Sec#tabview=tab9), was 92.9%, missing the tRNA-Sec in Lokiarchaeota. Perfect specificity was reached with Secmarker and tRNAscan-SE (100%), while aragorn obtained a specificity of 98%, with five false positives. Manual inspection of the false positives by aragorn revealed that their sequences did not fit the the archaeal tRNA-Sec model (section 2 in this document). With this, and the lack of other Sec machinery factors and selenoproteins in these genomes, we concluded that they are indeed false positives. In addition, we noticed that aragorn predicted a second tRNA-Sec in the genome of Methanococcus voltae A3. This second tRNA had no variable arm, therefore it was a clear false positive (section 2 in this document). That same tRNA was the only candidate predicted by tRNAscan-SE in the genome of M. voltae A3. tRNAscan-SE, therefore, missed the real tRNA-Sec in this genome, even though in our benchmark it was considered a true positive prediction. of

Benchmark with tRNAscan-SE 2.0
The benchmark of Secmarker included tRNAscan-SE v1.23. A newer version 2.0 is available through a web server (http://trna.ucsc.edu/tRNAscan-SE/). However, the program is not yet available for download (as of 11/2016). Due to limitation in the maximum input size, we cannot replicate the benchmark (641 genomes) through the web server. We thus decided to perform a more limited test, running tRNAscan-SE 2.0 with a small set of false positives and true positives obtained from our benchmark set. We retrieved the 102 SeC predictions obtained with tRNAscan-SE v1.23 in selenoproteinless genomes (considered here as false positives), and the 159 Secmarker top scoring predictions from selenoprotein-containing genomes (considered here as true positives). All predictions were extended 25 bases at both sides. The "sequence source" option in the tRNAscan-SE 2.0 web server was set as "Bacterial","Archaeal" or "Eukaryotic", according to the source organism. The rest of parameters were set to default values. The results with the newer tRNAscan-SE 2.0 did improve compared to version 1.23, specially in sensitivity (S1.1 Table). All but one of the prokaryotic true positives (TP) were identified, and the number of detected TP in eukaryotes more than doubled. Yet, Secmarker still appears to perform much better. tRNAscan-SE 2.0 missed 14 eukaryotic TP. Regarding the false positives (FP), the eukaryotic ones were reduced to 54, but the 14 prokaryotic were mispredicted again (13 of them belong to Mycoplasma, with a variation in the genetic code where UGA encodes for Trp). of

Benchmark with RF01852
The Rfam database provides a CM for tRNA-Sec (RF01852). In order to benchmark Secmarker (which includes three manually curated CM models) against the single RF01852 model, we downloaded the RF01852 and ran it on our set of genomes with cmsearch (Infernal v 1.1.1). The parameters recommended in the curation page (http://rfam.xfam.org/family/ RF01852#tabview=tab9) were used: 'cmsearch --nohmmonly -T 25.39'. The results were then parsed to exclude those hits with score lower than 47.0 ("gathering threshold"). The results were compared with Secmarker. RF01852 obtained a higher number of predictions, 5,133 vs 3,421. The overlap of the two methods was 3,293 (63%) from a total of 5,261 (the union of the two runs); 128 (2%) were predicted only by Secmarker, and 1,840 (35%) were predicted only by RF01852. We further analyzed all 5261 by running cmsearch with RF00005 (canonical tRNA). 98% of the results predicted only by RF01852 had a higher score when aligned to RF00005, suggesting they are in fact canonical tRNAs. That proportion was much lower for the results predicted only by Secmarker, 7%. Among the overlapping predictions, only 1% had a higher score with RF00005. Besides having better performance, Secmarker has the advantage of assigning the domain (bacteria, archaea or eukaryota). of