Sequence similarity governs generalizability of de novo deep learning models for RNA secondary structure prediction

doi:10.1371/journal.pcbi.1011047

Fig 1.

Illustrations of the SeqFold2D network and the Stralign dataset.

(A) The two-module architecture of the SeqFold2D models. An input RNA sequence of length L is first embedded via one-hot encoding and feed-forward layers to yield an L×C tensor. The first module consists of N blocks of either bidirectional Long Short-Term Memory (LSTM) or transformer encoders. The resultant L×C tensor is then transformed into the L×L×C pair representation via outer product before being fed to the second module of N blocks of residual 2D convolutional layers. The output block is made up of three feed-forward layers and predicts the PPM of dimension L×L. (B) The population distributions of eight RNA families at different sequence similarity levels for the Stralign dataset. The abbreviations are, rRNA: ribosomal RNA, tRNA: transfer RNA, Intron I: group I intron, tmRNA: transfer messenger RNA, SRP: signal recognition particle, and TERC: telomerase RNA component. The innermost ring shows the original Stralign dataset with a total of 37,149 sequences; the five under-represented families (counter-clockwise from Intron I to TERC) are scaled up for visibility, with the multiplier N shown as “N×” in the label (see Fig A in S1 Text for the unscaled version). The L600 ring shows the dataset after removing sequences longer than 600; the NR100 ring shows the cross-sequence level; and the NR80 ring shows the cross-cluster level. Note that the 16S rRNA NR80 has only 50 sequences and is barely visible.
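
The tensor shapes described in panel (A) can be traced with a minimal sketch. It assumes a four-letter alphabet (ACGU), skips the feed-forward embedding and the LSTM/transformer blocks, and assumes that the outer product is taken channel by channel, i.e., pair[i, j, c] = x[i, c] * x[j, c]. The caption does not spell out these details, and the function names `one_hot` and `pair_representation` are illustrative only, not taken from the SeqFold2D code.

```python
import numpy as np

ALPHABET = "ACGU"  # assumed token set; not specified in the caption


def one_hot(seq):
    """Return an L x 4 one-hot matrix for an RNA sequence."""
    idx = [ALPHABET.index(base) for base in seq]
    return np.eye(len(ALPHABET))[idx]


def pair_representation(x):
    """Turn an L x C tensor into an L x L x C pair tensor.

    Assumption: a channel-wise outer product,
    pair[i, j, c] = x[i, c] * x[j, c].
    """
    return np.einsum("ic,jc->ijc", x, x)


x = one_hot("GGAUCC")          # L = 6, C = 4 (no learned embedding here)
pair = pair_representation(x)  # L x L x C
print(x.shape, pair.shape)     # (6, 4) (6, 6, 4)
```

In the actual model, C would be the embedding width after the feed-forward layers, and the pair tensor would then pass through the 2D convolutional blocks to produce the L×L output.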

Fig 2.

The mean F1 scores on the TR (tan) and TS (blue) sets by SeqFold2D and selected DL and traditional models in two different dataset setups.

(A) Both TR and TS sets are from Stralign NR100. (B) TR from Stralign NR100 and TS from ArchiveII NR100. The models are sorted by their TS F1 scores. The names of the learning-based models are appended with the number of parameters, and a trailing asterisk indicates the use of post-processing. The F1 value is shown at the end of each bar. All learning-based models except SPOT-RNA are re-trained.
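
For readers unfamiliar with the metric, the sketch below shows one common way to compute a per-sequence F1 score from predicted and reference base pairs, followed by the mean over a set. It assumes base pairs are given as (i, j) index tuples and that a predicted pair counts as correct only on an exact match. The caption does not state the exact scoring convention, so this is an illustration and not the authors' evaluation code.

```python
def f1_score(pred_pairs, true_pairs):
    """F1 between predicted and reference base pairs (exact-match assumption)."""
    pred, true = set(pred_pairs), set(true_pairs)
    tp = len(pred & true)  # correctly predicted pairs
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


def mean_f1(examples):
    """Average the per-sequence F1 over (pred_pairs, true_pairs) examples."""
    scores = [f1_score(pred, true) for pred, true in examples]
    return sum(scores) / len(scores)


# Toy example: one of two predicted pairs matches one of two reference pairs.
toy = f1_score([(0, 5), (2, 3)], [(0, 5), (1, 4)])
print(toy)  # 0.5
```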

Fig 3.

Illustrations of the TR-TS gaps of the SeqFold2D-1.4M model.

(A) Stral-NR100 as TR and Archi-NR100 as TS. (B) Stral-NR80 as TR and Archi-Stral-NR80 as TS. The first pair of violins shows the F1 scores for the entire TR (left, tan) and TS (right, blue) set and the following pairs show the scores for each RNA family. Averaged scores are shown as dashed lines (white) and at the very top. The parentheses above show the sequence counts in numbers (for the entire set or families with <1% share) or in percentages (for families with >1% share). The families existing in one set only are shown as “nan” for the other set, e.g., 23S rRNA in Archi-NR100 only.

Fig 4.

Illustrations of the TR (left, tan) vs. TS (right, blue) performances for selected learning and physics-based models at the cross-family level with the Strive-NR80 dataset.

For each cross-family study, one RNA family is held out as the TS set and the remaining eight families are used for model development (TR and VL). Each panel/row here shows one such study labelled by the TS family name (B-E), while the first panel, (A) [Baseline], shows a baseline study with random splits of all families into the TR, VL, and TS subsets. Panel A is thus de facto a cross-cluster study with all subsets derived from the same parent dataset. For each panel, the average TR and TS scores are shown at the top and highlighted for the learning-based model with the highest TS score (physics-based models excluded). All learning-based models are retrained, with the numbers of parameters shown after their names. Despite our best re-training efforts, the scores of MXfold2 and UFold should be viewed as guides only, as we were unable to match their reported performances when using the same datasets. Still, given the inverse correlation between TR and TS performances, their TR-TS gaps are expected to be under-estimates.
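
The hold-one-family-out design described above can be sketched as follows. The record layout, a list of (family, sequence_id) tuples, and the function name `cross_family_splits` are hypothetical; how the development pool is further divided into TR and VL is not given in the caption and is left out.

```python
def cross_family_splits(records):
    """Yield (held_out_family, development_records, test_records).

    records: list of (family, sequence_id) tuples (hypothetical layout).
    Each family in turn becomes the TS set; all other families form the
    development pool (TR and VL together).
    """
    families = sorted({family for family, _ in records})
    for held_out in families:
        test = [r for r in records if r[0] == held_out]
        development = [r for r in records if r[0] != held_out]
        yield held_out, development, test


# Toy data with three families; the real dataset has nine.
toy_records = [
    ("tRNA", "seq1"), ("tRNA", "seq2"),
    ("SRP", "seq3"),
    ("tmRNA", "seq4"), ("tmRNA", "seq5"),
]
splits = list(cross_family_splits(toy_records))
for held_out, development, test in splits:
    print(held_out, len(development), len(test))
```

By construction, no family appears in both the development pool and the TS set of a given study, which is what distinguishes the cross-family studies from the random-split baseline.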

Fig 5.

Illustrations of the cross-family F1 scores and the PGscore distributions for all studies.

(A) The TS vs. TR F1 scores of the baseline cross-cluster study ([Baseline]) and all nine cross-family studies (labelled by the TS family name) with Strive-NR80. Four zones (I-IV) are delineated for easy reference. The diagonal line in zone III denotes the line of zero TR-TS gap, i.e., TR = TS. The dashed line in zone IV is a guide to the eye only. The cross-family TS scores of the three groups of models (physics-based, ML, and DL) are shown in three respective boxplots as annotated. (B) Boxplots of the PGscores from all learning-based models for each study at the specific TR-TS similarity level. The studies are, XSeq-I: the cross-sequence study with Stral-NR100, XSeq-II: cross-sequence with Stral-NR100 and Archi-NR100, XCls-I: cross-cluster with Strive-NR80, XCls-II: cross-cluster with Stral-NR80 and Archi-Stral-NR80, XCls-III: cross-cluster with bpRNA, XFam: all nine cross-family studies with Strive-NR80. The learning-based models included for each study are shown in Figs T-U in S1 Text.

Fig 6.

Illustrations of the F1-unseen vs. F1-seen correlations of the SeqFold2D-960K model.

Each PSA or PSSA program is shown in the same color in all panels. (A) The F1-unseen over F1-seen ratio as a function of the PSI or PSSI threshold. The horizontal dashed line marks the F1 ratio between the entire unseen and seen datasets. (B) The PCC value as a function of the identity threshold. (C) The distributions of the F1-unseen and F1-seen scores at the nominal PSI or PSSI threshold of 50%. (D) The distributions of the F1-unseen and F1-seen scores at the identity threshold of 80%. It is common to find no seen sequences above high thresholds for an unseen sequence, leading to many null F1-seen values that are absent in (D).
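
The quantities in panels (A), (B), and (D) can be illustrated with a rough sketch. It assumes that, for each unseen sequence, F1-seen is the mean F1 of the seen sequences whose identity to it is at or above the threshold, and that unseen sequences without any such neighbour are dropped as null. The caption does not define F1-seen in this detail, so the pairing rule, the data layout, and the helper names `pearson` and `seen_vs_unseen` are all assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def seen_vs_unseen(unseen, threshold):
    """Pair each unseen F1 with an F1-seen value at a given identity threshold.

    unseen: list of (f1_unseen, [(identity, f1_seen), ...]) entries
            (hypothetical layout).
    Assumption: F1-seen is the mean F1 of seen sequences with
    identity >= threshold; entries with no such neighbour are null and skipped.
    """
    pairs = []
    for f1_unseen, neighbours in unseen:
        above = [f1 for identity, f1 in neighbours if identity >= threshold]
        if above:
            pairs.append((f1_unseen, sum(above) / len(above)))
    return pairs


# Toy data: the third unseen sequence has no seen neighbour above 0.8.
toy_unseen = [
    (0.9, [(0.85, 0.95), (0.4, 0.5)]),
    (0.5, [(0.9, 0.6)]),
    (0.2, [(0.3, 0.7)]),
]
pairs_80 = seen_vs_unseen(toy_unseen, 0.8)
ratio_80 = sum(u for u, _ in pairs_80) / sum(s for _, s in pairs_80)
print(len(pairs_80), ratio_80)
```

With real data, `pearson` would be applied to the two columns of the paired values at each threshold. The toy example also shows why fewer points survive at high thresholds, as noted for panel (D).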
