Current address: Rosetta Inpharmatics, a wholly owned subsidiary of Merck & Co., Seattle, Washington, United States of America
Conceived and designed the experiments: XX YJ GDS. Performed the experiments: XX YJ. Analyzed the data: XX YJ GDS. Contributed reagents/materials/analysis tools: XX YJ GDS. Wrote the paper: XX YJ GDS.
The authors have declared that no competing interests exist.
An increasing number of
RNA is remarkably versatile, acting not only as a messenger that transfers genetic information from DNA to protein but also as a critical structural component and catalytic enzyme in the cell. More intriguingly, RNA elements in messenger RNAs have been widely found in bacteria to control the expression of their downstream genes. The functions of these RNA elements are intrinsically linked to their secondary structures, which are usually conserved across multiple closely related species during evolution and often shared by genes in the same metabolic pathways. We developed a new computational approach that finds putative functional RNA elements by looking for conserved RNA secondary structures, in the orthologous RNA sequences from related species, that are distinguishable from random RNA secondary structures. We applied this approach to multiple
RNA is remarkably versatile
In the past few years, many
Experimental screenings
A number of algorithms have been developed for common RNA secondary structure prediction, such as RNAalifold
Studies have shown that for a single sequence RNA secondary structure alone is not sufficient to distinguish functional RNA from random sequence
In this paper, we present a new SVM based functional RNA identifier named RSSVM (
We examined the performance of RSSVM in identifying RNA regulatory motifs on 1686 positive and 1686 negative test sequence sets (see
The prediction results from different SVM models at the same
“▵” and “○” mark the results at
At any FPR, Dynalign+LIBSVM has significantly lower sensitivities than RSSVM and RNAz on all test sets and on test sets with low identities, especially in the range of low FPRs (FPR<0.05) (
QRNA does not provide a similar measurement of
We further evaluated the performance of RSSVM on test sets with different ranges of average sequence identities. We use
(A) At the overall FPR of 0.05. (B) At the more stringent overall FPR of 0.01 or 0.02. The lowest possible FPR that Dynalign+LIBSVM can achieve is 0.02.
At the more stringent overall FPR of 0.01 on all test sets, RSSVM (0.68) and RNAz (0.65) have almost the same overall prediction sensitivity (
Overall, RNAz, Dynalign+LIBSVM and QRNA perform best on sequence sets with high identities. RSSVM, however, performs consistently across identity ranges and is more sensitive on the low-identity sets at the same FPRs. Because their strengths differ, these programs can complement each other in identifying regulatory RNAs on sequences with a wide range of identities.
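The comparisons above hold the FPR fixed and then compare sensitivities. A minimal sketch of that bookkeeping follows, assuming each classifier outputs a score where higher means more RNA-like and that a sequence set is called positive when its score is strictly above the cutoff. The function names and the toy scores are illustrative assumptions; they are not taken from RSSVM, RNAz or any other program discussed here.

```python
def rate_above(scores, cutoff):
    """Fraction of scores that are strictly above the cutoff."""
    return sum(1 for s in scores if s > cutoff) / len(scores)


def cutoff_for_fpr(negative_scores, target_fpr):
    """Lowest cutoff among the observed negative scores whose FPR <= target_fpr."""
    if target_fpr < 0:
        raise ValueError("target_fpr must be non-negative")
    # Candidates are tried from loosest to strictest; the FPR only decreases.
    candidates = [float("-inf")] + sorted(set(negative_scores))
    for cutoff in candidates:
        if rate_above(negative_scores, cutoff) <= target_fpr:
            return cutoff
    # Not reached: the largest negative score always yields an FPR of 0.
    raise RuntimeError("no cutoff found")


def sensitivity_at_fpr(positive_scores, negative_scores, target_fpr):
    """Return (sensitivity, achieved FPR, cutoff) at the requested FPR level."""
    cutoff = cutoff_for_fpr(negative_scores, target_fpr)
    sensitivity = rate_above(positive_scores, cutoff)
    achieved_fpr = rate_above(negative_scores, cutoff)
    return sensitivity, achieved_fpr, cutoff


# Toy scores (hypothetical): five real sets and five shuffled sets.
positives = [0.99, 0.90, 0.80, 0.60, 0.25]
negatives = [0.05, 0.10, 0.20, 0.30, 0.95]
sens, fpr, cutoff = sensitivity_at_fpr(positives, negatives, target_fpr=0.2)
print(f"cutoff={cutoff} sensitivity={sens} fpr={fpr}")
```

Choosing the cutoff separately for each program, as sketched here, is what allows sensitivities to be compared at a common FPR even when the programs' score scales differ.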
Three major improvements may contribute to the better performance of RSSVM compared to RNAz in identifying regulatory RNAs, especially on test sets whose identities are lower than 70%. The first improvement is using the more accurately predicted common RNA secondary structures by RNA Sampler. The accuracy of predicted structures can be measured by the correlation coefficient of structure prediction (
In addition, because the common structures predicted by RNA Sampler are more accurate in general, they may provide insightful hints for inferring the functions of the predicted RNA motifs and guiding the design of experimental validation.
Because many known bacterial regulatory RNA sites are located in 5′-UTR sequences and are often conserved during evolution, we applied RSSVM, RNAz and QRNA to multiple
The total numbers of predicted regulatory RNA motifs by different approaches are listed in
Prediction category | RSSVM (FPR = 0.01) | RNAz (FPR = 0.01) | QRNA |
Total number of predicted regulatory RNAs | 166 | 109 | 112 |
False positives on shuffled sequences | 0 | 0 | 13 |
Matching known RNA motifs in Rfam | 17 | 16 | 11 |
Overlapping with predicted transcription terminators or attenuators | 72 | 49 | 40 |
Overlapping with predicted transcription terminators | 62 | 42 | 31 |
Overlapping with predicted transcription attenuators | 56 | 37 | 32 |
With literature support | 21 | 11 | 7 |
We searched all the orthologous UTRs with Infernal using all bacterial RNA motif models from Rfam; 19 known RNA motifs gave Infernal scores higher than 10 bits and occurred in at least two orthologous sequences of a UTR set. Six of the 19 RNA motifs have orthologous sequences from
Putative transcription terminators predicted by Rnall
Putative transcription attenuators predicted by a previous comparative genomics study
Numbers in the parentheses are the total numbers of known RNA motifs or predicted transcription terminators/attenuators in the 1002
Rank | GI | RSSVM | RNAz | QRNA | Gene Name | Gene Product | Matching RNA Family in Rfam | ||
1 | + | trpE | anthranilate synthase component I | RF00513 | Trp_leader | RNA element | |||
3 | + | SO1202 | conserved hypothetical protein | RF00005 | tRNA | tRNA | |||
4 | + | SO4727 | conserved hypothetical protein | RF00558 | L20_leader | RNA element | |||
6 | ppiD | peptidyl-prolyl | RF00506 | Thr_leader | RNA element | ||||
7 | + | thrA | aspartokinase I/homoserine dehydrogenase, threonine-sensitive | RF00506 | Thr_leader | RNA element | |||
8 | + | hisG | ATP phosphoribosyltransferase | RF00514 | His_leader | RNA element | |||
16 | rpsB | ribosomal protein S2 | RF00127 | t44 RNA | RNA gene | ||||
34 | + | SO1071 | conserved hypothetical protein | RF00080 | yybP-ykoY | Riboswitch | |||
39 | + | pheA | chorismate mutase/prephenate dehydratase | RF00513 | Trp_leader | RNA element | |||
64 | + | SO1007 | conserved hypothetical protein | RF00168 | Lysine | Riboswitch | |||
73 | 0.240 | + | Rne | ribonuclease E | RF00370 | sroD RNA | RNA gene | ||
93 | SO0547 | conserved hypothetical protein | RF00522 | PreQ1 | Riboswitch | ||||
100 | SO2715 | TonB-dependent receptor | RF00059 | TPP | Riboswitch | ||||
117 | + | lysC | aspartokinase III, lysine-sensitive | RF00168 | Lysine | Riboswitch | |||
120 | 0.420 | thiC | thiamin biosynthesis protein ThiC | RF00059 | TPP | Riboswitch | |||
125 | nadB | L-aspartate oxidase | RF00522 | PreQ1 | Riboswitch | ||||
133 | SO0774 | 5-formyltetrahydrofolate cyclo-ligase family protein | RF00013 | 6S RNA | RNA gene | ||||
195 | 0.903 | 0.014 | rpsO | ribosomal protein S15 | RF00114 | S15 leader | RNA element | ||
302 | 0.661 | + | SO0815 | TonB-dependent receptor C-terminal domain protein | RF00174 | Cobalamin | Riboswitch | ||
The rank is based on the
Bold fonts represent predictions above the
“+” represents QRNA predictions that fit the “RNA” model in at least two pairwise alignments.
The shuffled sequences were identified as “RNA” by QRNA.
The predictions by the three approaches overlap significantly with each other, as shown in the Venn diagram in
The numbers in the parentheses are of the predictions matching known RNA motifs.
The specificity, the fraction of correct predictions, is difficult to measure accurately because of the limited knowledge of RNA motifs in
Besides predictions that match Rfam motifs, we can also assess the accuracy of our predictions by comparing them to other independent types of predictions and to published reports of regulatory motifs or genes undergoing post-transcriptional regulation.
As transcription attenuation is a common regulatory mechanism for RNA motifs in the 5′-UTRs, we checked whether the orthologous UTR sets contain any putative rho-independent transcription terminators predicted by Rnall
We examined the leading genes of all the 166 operons with predicted regulatory RNAs by RSSVM (
Rank | GI | RSSVM | RNAz | QRNA (Q) Terminator (T) Attenuator (A) | Gene Name | Gene Product | Knowledge of Regulation | Reference |
5 | Q T - | ilvG | acetolactate synthase II, large subunit | Leader peptide, and transcription attenuator | ||||
17 | - T A | ldhA | D-lactate dehydrogenase | Possible post-transcriptional effect | ||||
23 | 0.105 | - T - | aspS | aspartyl-tRNA synthetase | tRNA synthetase leader | |||
25 | - - - | ilvI | acetolactate synthase III, large subunit | Leader peptide, and transcription attenuator | ||||
26 | - - - | flgB | flagellar basal-body rod protein FlgB | Putative GEMM element | ||||
27 | 0.241 | - - - | aroH | phospho-2-dehydro-3-deoxyheptonate aldolase, trp-sensitive | Possible transcription termination | |||
35 | Q T A | leuA | 2-isopropylmalate synthase | Leader peptide, and transcription attenuator | ||||
41 | 0.094 | - - - | pdhR | pyruvate dehydrogenase complex repressor | PdhR-box in | |||
52 | - T - | adhE | aldehyde-alcohol dehydrogenase | Stem-loop for occupying RBS in | ||||
55 | 0.196 | - - - | ahpC | Alkyl hydroperoxide reductase, C subunit | Post-transcriptionally regulated by CsrA in | |||
63 | 0.012 | Q T A | glnS | glutaminyl-tRNA synthetase | tRNA synthetase leader | |||
83 | 0.451 | - T A | SO1769 | glutamate decarboxylase, putative | Possible post-transcriptional regulation in S. oneidensis | |||
88 | Q T - | rpoB | DNA-directed RNA polymerase, beta subunit | Transcriptional attenuation | ||||
91 | - - - | rplJ | ribosomal protein L10 | Ribosomal protein leader | Rfam | |||
105 | 0.274 | Q T - | pflB | formate acetyltransferase | Possible post-transcriptional regulation | |||
106 | - - - | SO3896 | Outer membrane porin, putative | Post-transcriptional regulation in S. oneidensis | ||||
109 | Q T A | rpsL | ribosomal protein S12 | Ribosomal protein leader | ||||
112 | 0.179 | - T - | fliE | flagellar hook-basal body complex protein FliE | Putative GEMM element | |||
124 | 0.108 | - T - | secE | preprotein translocase, SecE subunit | RNaseIII sites in the leader sequence of SecE in E. coli | |||
147 | 0.456 | - T - | speA | biosynthetic arginine decarboxylase | Possible post-transcriptional regulation in S. oneidensis | |||
161 | Q - - | aroF | phospho-2-dehydro-3-deoxyheptonate aldolase, tyr-sensitive | Attenuator sensing tyr-tRNA | ||||
163 | 0.102 | - - - | rplU | ribosomal protein L21 | Ribosomal protein leader |
same as those in
One class of our predicted RNA motifs corresponds to known RNA regulatory motifs upstream of the operons involved in amino acid and vitamin biosynthesis, including
Another class of our predicted RNA motifs is located in the operons encoding ribosomal proteins and tRNA synthetases, such as
Some genes in our predictions are known to be regulated at the transcriptional level through binding of transcription factors (TFs) to their palindromic DNA binding sites, such as
Besides the known RNA motifs, our predictions also include some interesting candidate novel motifs. One interesting example is for gene
We use the predicted regulatory RNA motif in front of the
(A) Alternative terminator and anti-terminator stem-loop structures improved on the previously proposed structures. Base pairs in the red boxes are the positions where compensatory mutations are observed; blue lines are leucine codons enriched in the leader peptide coding region. (B) Structural alignment of the anti-antiterminator and terminator structure in five
In this paper, we present a new program, RSSVM, based on support vector machines for identifying putative
Compared with other RNA motif identification tools, such as RNAz, Dynalign+LIBSVM and QRNA, RSSVM is more sensitive in detecting functional RNAs at the same FPR, especially on sequences of low identities. The higher sensitivity of RSSVM, compared to that of RNAz and Dynalign+LIBSVM, may be attributed to the following three improvements in the SVM model: first, the common structures and alignments are generated by RNA Sampler, which provides more accurate structure predictions, does not require sequence alignments as input and works well on sequences of low identities; second, more distinctive features are used to represent the common RNA structures and alignments; third, the SVM model is trained with more universal functional RNA structures that cover a large number of RNA motif/gene families and a wide range of sequence identities. We tested a few alternative SVM models that have only one or two of these improvements, such as a modified RNAz re-trained on the RSSVM training sets, and a modified RNAz re-trained on the RSSVM training sets that also takes RNA Sampler's structural alignments instead of ClustalW alignments as input. On all test sets and on test sets with low identities, the sensitivities of these SVM models were higher than those of RNAz and similar to those of RSSVM when loose FPRs were allowed, and their sensitivities improved gradually in the stringent FPR range (FPR<0.02) as the improvements were added one at a time (
RNA Sampler and RSSVM run reasonably fast for genome-wide scans of regulatory RNAs. On average, RNA Sampler takes 125 seconds on a single-CPU workstation to predict the common structure of a set of 5 RNA sequences averaging 150 nt in length. For a project of similar size to the
The RNA classification of RSSVM is based on the common structures generated by RNA Sampler. These predicted common structures provide preliminary hints about the putative structures associated with the regulatory functions. As demonstrated in the RNA motifs for
There is always a trade-off between sensitivity and specificity (1 – false positive rate) in computational predictions. Using looser cutoffs (lower
Application of RSSVM to find RNA regulatory motifs/genes is not limited to the
To better serve the
We use the program, RNA Sampler
Support Vector Machines (SVMs) are supervised learning methods widely used for classification and regression. In these methods, labeled data are represented by vectors that are defined by various features, and support vector machines map the feature vectors to a higher dimensional space and construct a maximal separating hyperplane to classify the input data into binary categories. SVMs have been used in previous studies
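To make the idea of a maximal separating hyperplane concrete, the following is a minimal sketch of a linear soft-margin SVM trained with Pegasos-style stochastic subgradient steps. It is an illustration only: it does not reproduce LIBSVM or any kernel mapping to a higher dimensional space, the function names are hypothetical, and the two-dimensional feature values are toy numbers rather than RNA structure features.

```python
import random


def train_linear_svm(samples, labels, lam=0.01, epochs=50, seed=0):
    """Train a linear soft-margin SVM by stochastic subgradient descent.

    `labels` must be +1 or -1. A constant 1.0 is appended to every sample so
    that the bias is learned as an ordinary (regularised) weight.
    """
    rng = random.Random(seed)
    data = [list(x) + [1.0] for x in samples]
    w = [0.0] * len(data[0])
    order = list(range(len(data)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            x, y = data[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Shrink step from the L2 regulariser.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss step, taken only when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, sample):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, list(sample) + [1.0]))
    return 1 if score >= 0 else -1


# Toy, linearly separable data (hypothetical feature values).
train_x = [(2.0, 1.5), (1.5, 2.5), (2.5, 2.0),
           (-2.0, -1.5), (-1.5, -2.0), (-2.5, -2.5)]
train_y = [1, 1, 1, -1, -1, -1]
weights = train_linear_svm(train_x, train_y)
print([predict(weights, x) for x in train_x])
```

A kernel SVM replaces the dot products with kernel evaluations, which is how the mapping to a higher dimensional space is obtained in practice.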
We developed a new SVM classifier for detecting regulatory RNAs. Our SVM classifier differs from the previous ones in three major aspects: first, the recently developed program RNA Sampler is used to predict common RNA secondary structures and structural alignments on any set of homologous RNA sequences, and feature vectors based on such predictions are used to build the SVM classifier; second, a different set of feature parameters is used to represent the common RNA structures and structural alignments; third, the SVM classifier is trained on a larger number of bacterial RNA gene and motif families that cover a wider range of sequence lengths and identities than previous studies
To train the SVM classifier, both positive and negative training sets are needed. We use sequences of 112 known bacterial regulatory RNA families retrieved from the Rfam database
In total, we generated 8335 positive and 8335 negative training sequence sets. Using a similar procedure, we generated 1686 positive and 1686 negative test sequence sets that are not identical to any training set. The distributions of the sizes and average pairwise sequence identities of the training and test sets are shown in
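The average pairwise sequence identity of a sequence set can be computed as sketched below. This is a minimal illustration under stated assumptions: the sequences are already aligned to equal length with "-" as the gap character, and the identity of a pair is the number of matching columns divided by the number of columns in which at least one sequence has a base. The source does not state which identity definition was used, so the denominator and the function names are assumptions.

```python
from itertools import combinations


def pairwise_identity(seq_a, seq_b):
    """Identity of two aligned sequences, ignoring columns where both are gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def average_pairwise_identity(aligned_seqs):
    """Mean identity over all unordered pairs of aligned sequences."""
    pairs = list(combinations(aligned_seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (hypothetical sequences).
toy_alignment = ["ACGU", "ACGA", "AC-U"]
avg_identity = average_pairwise_identity(toy_alignment)
is_low_identity = avg_identity < 0.70  # the 70% boundary used for low-identity sets
print(round(avg_identity, 3), is_low_identity)
```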
In our SVM classifier, we use six features to represent the common RNA secondary structure and structural alignment. These features are: (1) The mean minimum free energy (MFE)
On each training sequence set, we ran RNA Sampler to generate the common structure and structural alignment and calculated the values of the six features described above. We implemented the SVM classifier for regulatory RNA structure detection using the core program LIBSVM
We predicted the common structures and structure alignments for all positive and negative test sets using RNA Sampler and classified the structures with the final SVM model. We also compared the performance of our SVM classifier on these test sets with that of other leading software for RNA motif identification, including RNAz
With the RNA secondary structure prediction algorithm, RNA Sampler, and the RNA motif identification algorithm, RSSVM, we can search for putative regulatory RNA structural motifs in any orthologous RNA sequence set. As shown in the flow chart in
The genomic sequences of
Since most of the known bacterial regulatory motifs are located in the mRNA leader sequences of the regulated transcription units (operons), we focused on finding conserved RNA regulatory motifs in the 5′-UTRs of orthologous transcription units (TU). Because our knowledge of the operon structures in
In total we obtained 1002 sets of orthologous mRNA leader sequences from the five
We scanned each orthologous UTR sequence set in three overlapping windows, −250∼−100, −200∼−50, and −150∼20 (1 corresponds to the translation start site). We first predicted the common RNA structure for each window and generated the corresponding structural alignment using RNA Sampler, and then provided the RNA Sampler output to the SVM classifier to predict whether the window contains a regulatory RNA structure. We also aligned sequences in the same window using ClustalW and applied RNAz and QRNA to predict the existence of RNA motifs. Because QRNA only takes two-sequence alignments as input, for each aligned sequence window, we split the multiple-sequence ClustalW alignment into pairwise alignments, each consisting of the sequence from the anchor species and a sequence from another species. We considered QRNA to have detected an RNA motif in a sequence window only if at least two of the pairwise alignments were predicted to contain an RNA motif by QRNA.
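The window scan and the QRNA decision rule described above can be sketched as follows. This is a hedged illustration, not the authors' pipeline: it assumes that position −1 is the base immediately upstream of the translation start and that there is no position 0, and it takes the per-pair QRNA labels ("RNA", "COD" or "OTH") as already computed inputs. All function names are hypothetical.

```python
# Window coordinates from the text; +1 is the first base of the start codon.
WINDOWS = [(-250, -100), (-200, -50), (-150, 20)]


def extract_window(upstream, coding, start, end):
    """Return the bases at positions start..end (inclusive), skipping position 0.

    Assumes -1 is the last base of `upstream` and +1 the first base of `coding`.
    Positions outside the supplied sequences are skipped.
    """
    bases = []
    for pos in range(start, end + 1):
        if pos == 0:
            continue
        if pos < 0:
            idx = len(upstream) + pos
            if idx >= 0:
                bases.append(upstream[idx])
        elif pos <= len(coding):
            bases.append(coding[pos - 1])
    return "".join(bases)


def pairwise_from_anchor(aligned, anchor):
    """Split a multiple alignment into anchor-vs-other pairs, dropping all-gap columns."""
    pairs = {}
    for name, row in aligned.items():
        if name == anchor:
            continue
        cols = [(a, b) for a, b in zip(aligned[anchor], row)
                if not (a == "-" and b == "-")]
        pairs[name] = ("".join(a for a, _ in cols), "".join(b for _, b in cols))
    return pairs


def window_has_rna_motif(pair_calls, min_rna_pairs=2):
    """Apply the rule: at least two pairwise alignments must be called RNA."""
    return sum(1 for call in pair_calls if call == "RNA") >= min_rna_pairs


# Toy inputs (hypothetical sequences and labels).
upstream = "A" * 100 + "C" * 150
coding = "AUG" + "G" * 27
windows = [extract_window(upstream, coding, s, e) for s, e in WINDOWS]
pairs = pairwise_from_anchor(
    {"anchor": "AC-GU", "sp2": "AC-GA", "sp3": "A-CGU"}, anchor="anchor")
print([len(w) for w in windows], window_has_rna_motif(["RNA", "OTH", "RNA"]))
```

Under these coordinate assumptions, the first two windows span 151 nt each and the third spans 170 nt, since it extends 20 nt into the coding region.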
Raw Data for ROC curves in
(0.57 MB XLS)
Raw Data for ROC curves in
(0.50 MB XLS)
Cumulative distribution of sequence identities of the sequence sets with predicted regulatory RNAs by RSSVM and/or RNAz.
(0.01 MB PDF)
The Receiver Operating Characteristic (ROC) curves of RSSVM, RNAz, retrained RNAz on ClustalW alignments, and retrained RNAz on RNA Sampler alignments. (A) On all test sets. (B) On test sets with sequence identities lower than 70%. We retrained RNAz using the same training sets for RSSVM.
(0.02 MB PDF)
The Receiver Operating Characteristic (ROC) curves of RSSVM and RNAz on real and shuffled sequence sets of eukaryotic RNAs from Rfam. The curves of both programs on all test sets (sequence identities range between 20–100%) and on test sets of low identities (<70%) are drawn separately.
(0.01 MB PDF)
Distribution of sequence identities and size of the sequence sets studied. (A) Training sets. (B) Test sets. (C)
(0.01 MB PDF)
Flow chart of the genome-wide identification of RNA regulatory motifs/genes using RNA Sampler and RSSVM.
(0.01 MB PDF)
The prediction sensitivities and false positive rates of RSSVM, RNAz, Dynalign+LIBSVM and QRNA on test sets with different sequence identities. Different P-value cutoffs were used to fairly compare prediction sensitivities of different SVM models at the same FPR level. Numbers in bold fonts are the best results given by all the programs for an identity range.
(0.01 MB PDF)
Top 166 predicted regulatory RNAs by RSSVM.
(0.06 MB PDF)
Comparisons between different RNA motif identification algorithms.
(0.01 MB PDF)
We thank Stefan Washietl for insightful discussion on the SVM model in RNAz. We also thank Zizhen Yao and Andrew Uzilov for discussions about using CMfinder and Dynalign+LIBSVM.