A divide-and-conquer approach based on deep learning for long RNA secondary structure prediction: Focus on pseudoknots identification

doi:10.1371/journal.pone.0314837

Fig 1.

Example of a Pseudoknot.

The pseudoknot is formed from two interlaced stems displayed in red and gray respectively.

Fig 2.

DivideFold workflow.

The sequence is partitioned into several fragments, the secondary structure of each fragment is predicted, and the predicted structures are then recombined at the fragments' original positions in the sequence.
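
The recombination step can be sketched as follows. This is a minimal illustration, not the DivideFold code: it assumes each fragment is described by the list of original positions it covers and by base pairs given in fragment-local indices. The function name `recombine` and this data layout are hypothetical.

```python
def recombine(seq_len, fragments):
    """Map fragment-local base pairs back to positions in the full sequence.

    fragments: list of (positions, pairs), where positions[i] is the original
    index of the fragment's i-th nucleotide and pairs are (i, j) local indices.
    Returns a list where entry p is the partner of position p, or -1 if unpaired.
    """
    partner = [-1] * seq_len
    for positions, pairs in fragments:
        for i, j in pairs:
            a, b = positions[i], positions[j]
            partner[a] = b
            partner[b] = a
    return partner


# Toy example: a 10-nt sequence split into an outer fragment (both ends
# joined together) and an inner fragment.
outer = ([0, 1, 2, 7, 8, 9], [(0, 5), (1, 4)])  # local pairs -> (0, 9), (1, 8)
inner = ([3, 4, 5, 6], [(0, 3)])                # local pair  -> (3, 6)
structure = recombine(10, [outer, inner])
```

Storing the original positions with each fragment keeps the mapping trivial even when a fragment is discontinuous.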

Fig 3.

One partition iteration.

The partition process occurs recursively until all fragments are shorter than a chosen length. At each iteration, the left-most and right-most parts are combined to form a single fragment. Note that the red fragment may be further partitioned similarly at the next iteration.
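
A minimal sketch of this recursion is given below. The callback `choose_cuts` is a hypothetical stand-in for the divide model, and the toy `thirds` function only serves to make the example runnable; neither comes from the paper.

```python
def partition(positions, choose_cuts, max_len):
    """Recursively split `positions` until every fragment has at most max_len items.

    choose_cuts(positions) stands in for the divide model: it returns sorted,
    distinct cut indices strictly inside the fragment (hypothetical interface).
    """
    if len(positions) <= max_len:
        return [positions]
    cuts = choose_cuts(positions)
    if not cuts:  # no cut available: stop rather than recurse forever
        return [positions]
    bounds = [0] + list(cuts) + [len(positions)]
    parts = [positions[a:b] for a, b in zip(bounds, bounds[1:])]
    if len(parts) > 2:
        # The left-most and right-most parts are merged into one fragment.
        parts = [parts[0] + parts[-1]] + parts[1:-1]
    fragments = []
    for part in parts:
        fragments.extend(partition(part, choose_cuts, max_len))
    return fragments


def thirds(positions):
    """Toy cut chooser: always cut at one third and two thirds."""
    n = len(positions)
    return [n // 3, 2 * n // 3]


frags = partition(list(range(12)), thirds, max_len=5)
```

In this toy run the merged outer fragment is itself split again at the second iteration, which mirrors the behaviour described for the red fragment.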

Fig 4.

Pseudoknot conservation during partition.

This is an example of a partition iteration where the RNA sequence includes five pseudoknots, displayed in red. Four of them can be recovered because each occurs within a single fragment, continuous or not. However, the pseudoknot drawn as a dashed line spans two different fragments (in green and yellow) and will be lost.
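
The recoverability condition can be checked with a few lines of code. The sketch below assumes a pseudoknot is given as a list of base pairs and that it is recoverable only when all paired positions fall in the same fragment; the helper names and the toy fragments are hypothetical.

```python
def fragment_of(fragments):
    """Build a position -> fragment index lookup from lists of positions."""
    return {p: k for k, frag in enumerate(fragments) for p in frag}


def is_recoverable(pairs, fragments):
    """True if every base pair of the motif lies inside one shared fragment."""
    lookup = fragment_of(fragments)
    owners = {lookup[p] for pair in pairs for p in pair}
    return len(owners) == 1


# Fragment 0 is discontinuous (both ends of the sequence); 1 and 2 are inner.
fragments = [[0, 1, 2, 3, 16, 17, 18, 19], list(range(4, 10)), list(range(10, 16))]

# Crossing pairs (i < k < j < l) form a pseudoknot.
kept = [(0, 17), (2, 19)]  # both pairs inside the discontinuous fragment 0
lost = [(5, 11), (8, 14)]  # each pair spans fragments 1 and 2

kept_ok = is_recoverable(kept, fragments)
lost_ok = is_recoverable(lost, fragments)
```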

Fig 5.

Illustration of the motif insertion.

The insertion of the motif UC*GA in a sequence is shown here. The possible configurations of the motif are found inside the sequence using regular expressions. Then, at each nucleotide in the sequence, the configurations that occur at that position are counted, yielding a feature vector. This process is repeated for each motif to be inserted, each time yielding a different feature for the sequence.
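
One possible reading of this procedure is sketched below. It rests on assumptions not confirmed by the caption: that `*` stands for a gap of any length between the two halves of the motif, that every valid (left, right) pairing counts as one configuration, and that only the nucleotides matched by the two halves are counted. The function `motif_feature` and its `max_gap` parameter are hypothetical.

```python
import re


def motif_feature(seq, left, right, max_gap=None):
    """Count, per nucleotide, the motif configurations covering that position.

    Assumption: the motif `left*right` means `left`, then a gap of any length
    (optionally capped by max_gap), then `right`; only the nucleotides matched
    by `left` and `right` are counted, not the gap.
    """
    # Lookahead patterns report overlapping occurrences too.
    lefts = [m.start() for m in re.finditer(f"(?={re.escape(left)})", seq)]
    rights = [m.start() for m in re.finditer(f"(?={re.escape(right)})", seq)]
    feature = [0] * len(seq)
    for i in lefts:
        for j in rights:
            gap = j - (i + len(left))
            if gap < 0 or (max_gap is not None and gap > max_gap):
                continue
            covered = list(range(i, i + len(left))) + list(range(j, j + len(right)))
            for p in covered:
                feature[p] += 1
    return feature


# UC occurs at positions 1 and 6, GA at positions 4 and 9.
feature = motif_feature("AUCGGAUCAGA", "UC", "GA")
```

Calling the function once per motif yields one feature vector per motif, each with the same length as the sequence.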

Fig 6.

Architecture of the divide model.

A 1D dilated CNN is used, followed by a fully connected layer with a sigmoid activation. Convolutional layers use a kernel size of 3, dilation rates decreasing in powers of 2, and ReLU activations.
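
To give a sense of how dilation widens the context seen by each output position, the sketch below computes the receptive field of a stack of stride-1 convolutions with kernel size 3. The number of layers and the dilation list `[8, 4, 2, 1]` are illustrative assumptions; the caption does not state them.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 1D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def same_padding(kernel_size, dilation):
    """Padding per side that keeps the output length equal to the input length."""
    return dilation * (kernel_size - 1) // 2


dilations = [8, 4, 2, 1]            # hypothetical: powers of 2, decreasing
rf = receptive_field(3, dilations)  # 1 + 2 * (8 + 4 + 2 + 1)
pads = [same_padding(3, d) for d in dilations]
```

Under these assumed values, each output position would depend on a window of 31 nucleotides.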

Fig 7.

Cut points prediction.

The Bradyrhizobium 16S ribosomal RNA is given here as an example. The divide model predicts cutting probabilities, and cut points are then selected using a peak detection algorithm. The selected cut points are shown as blue dots, and the ideal positions as black bars. The structure of the RNA is displayed for reference. No base pairs are broken here.
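
A simple peak detection over a probability profile might look like the sketch below. The caption does not say which algorithm is used, so this local-maximum rule, together with the `min_height` and `min_distance` parameters, is an assumption made for illustration.

```python
def select_cut_points(probs, min_height=0.5, min_distance=1):
    """Pick local maxima of a probability profile as cut points.

    min_height and min_distance are hypothetical parameters: peaks below
    min_height are ignored, and of two peaks closer than min_distance the
    higher one wins.
    """
    peaks = [
        i for i in range(1, len(probs) - 1)
        if probs[i] >= min_height and probs[i - 1] < probs[i] >= probs[i + 1]
    ]
    selected = []
    # Visit peaks from highest to lowest so that stronger peaks take priority.
    for i in sorted(peaks, key=lambda i: probs[i], reverse=True):
        if all(abs(i - j) >= min_distance for j in selected):
            selected.append(i)
    return sorted(selected)


probs = [0.1, 0.2, 0.9, 0.3, 0.1, 0.6, 0.7, 0.2, 0.1, 0.4, 0.1]
cuts = select_cut_points(probs, min_height=0.5, min_distance=3)
```

Raising `min_distance` trades the number of cut points against the minimum spacing between them.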

Fig 8.

Structure prediction models evaluation by sequence length.

The F-score for secondary structure prediction including pseudoknots is shown as a function of input sequence length for RNAs shorter than 1,000 nt, for IPknot [16,17], ProbKnot [15], KnotFold [33], pKiss [8,9] and UFold [27] on the Test dataset. The computation times are displayed for reference on a log scale.
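
For reference, the F-score of a predicted structure is the harmonic mean of precision and recall over base pairs. The sketch below assumes exact matching of base pairs, with no tolerance for shifted positions; whether the evaluation here uses such a tolerance is not stated in the caption.

```python
def f_score(predicted, reference):
    """Harmonic mean of precision and recall over exactly matching base pairs."""
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    tp = len(pred & ref)  # correctly predicted base pairs
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = [(0, 9), (1, 8), (2, 7)]
predicted = [(0, 9), (1, 8), (3, 6), (4, 5)]
score = f_score(predicted, reference)  # precision 2/4, recall 2/3
```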

Fig 9.

Compression and break rates with respect to maximum fragment length.

Mean compression rate and break rate of our divide model for different values of the maximum fragment length hyperparameter, for sequences longer than 1,200 nucleotides on the Test dataset. The F-score for secondary structure prediction is shown for reference.

Fig 10.

Secondary structure F-score for DivideFold with respect to maximum fragment length, by sequence length.

The F-score for secondary structure prediction including pseudoknots is shown as a function of input sequence length for RNAs longer than 1,200 nt, for DivideFold using KnotFold as the structure prediction model, with different values of the maximum fragment length, on the Test dataset.

Table 1.

Pseudoknot prediction performance.

Fig 11.

Pseudoknot prediction performance by sequence length.

The F-score for pseudoknot prediction is shown as a function of input sequence length for RNAs longer than 1,000 nt. Confidence intervals are shown as black bars. The performance of DivideFold, IPknot [16,17], ProbKnot [15] and KnotFold [33] is reported here on the Test dataset. A missing bar indicates that the performance is too close to zero to be visible.

Table 2.

Secondary structure prediction performance including pseudoknots.

Fig 12.

Secondary structure prediction performance including pseudoknots by sequence length.

The F-score is shown depending on the input sequence length for RNAs longer than 1,000 nt. The performance of DivideFold, IPknot [16,17], ProbKnot [15] and KnotFold [33] is reported here on the Test dataset.

Fig 13.

Computation time by sequence length.

The computation time is displayed on a log scale as a function of input sequence length for DivideFold, IPknot [16,17], ProbKnot [15] and KnotFold [33], for RNAs longer than 1,000 nt on the Test dataset.

Fig 14.

Secondary structure of the 16S ribosomal RNA [49].

The final cut points chosen by our divide model are shown as red dots. The pseudoknots are displayed in purple.

Table 3.

Results for pseudoknot prediction on the 16S ribosomal RNA.

Table 4.

Results for secondary structure prediction including pseudoknots on the 16S ribosomal RNA.

Fig 15.

Pseudoknot prediction performance by sequence length for RNAs between 500 nt and 1,000 nt.

The F-score is shown for pseudoknot prediction depending on the input sequence length. The performance of DivideFold, IPknot [16,17], ProbKnot [15], KnotFold [33], pKiss [8,9] and UFold [27] is reported here for RNAs between 500 and 1,000 nt on the Test dataset.

Fig 16.

Secondary structure prediction performance including pseudoknots by sequence length for RNAs between 500 nt and 1,000 nt.

The F-score is shown depending on the input sequence length. The performance of DivideFold, IPknot [16,17], ProbKnot [15], KnotFold [33], pKiss [8,9] and UFold [27] is reported here for RNAs between 500 and 1,000 nt on the Test dataset.

Table 5.

Mean F-score comparison for pseudoknot prediction between the Test and bpRNA-NF-15.0 datasets.

Table 6.

Mean F-score comparison for secondary structure prediction including pseudoknots between the Test and bpRNA-NF-15.0 datasets.
