Fig 1.
Model architectures and training strategy.
The model input consists of terminator sequences. Pre-trained models are first trained on inverse-folding-based data and then on terminator sequences. These pre-training data capture the structure of terminators, but not their specific sequence properties. The input data are either one-hot encoded and fed into a 1D-CNN, or matrix encoded and passed into a 2D-CNN. Both CNN architectures are followed by a fully connected layer and a single output neuron.
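The one-hot input mentioned in this caption can be illustrated with a minimal Python sketch. The function name `one_hot_encode`, the A/C/G/U channel order, the mapping of T to U, and the all-zero row for unknown symbols are assumptions made for this example; the caption does not specify them, and the matrix encoding is not sketched because its layout is not described here.

```python
# Hypothetical channel order; the ordering used in the study is not stated.
ALPHABET = "ACGU"


def one_hot_encode(sequence):
    """Return one 4-element row per nucleotide, with a single 1 per known base."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    # Assumption: DNA input is treated like RNA by mapping T to U.
    for base in sequence.upper().replace("T", "U"):
        row = [0] * len(ALPHABET)
        if base in index:  # assumption: unknown symbols (e.g. N) stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded


example = one_hot_encode("GCAUN")
```

The result is a list with one row per position and four channels per row, which is the kind of layout a 1D convolution slides along.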
Fig 2.
Impact of sequence and structure on terminator and tRNA recognition.
(A) Intrinsic terminators comprise five sections: the hairpin structure in the center consists of a stem and a loop and is framed by an A-rich zone (A-tail) on the 5’-end and a longer U-rich zone (U-tail) on the 3’-end. The terminator data used in this study additionally contain the genomic sequences adjacent to the terminator (left pad and right pad). (B) Impact of the terminator sections, shown as relative activation impact on the CNN models (left) and relative detection impact on ARNold (right). Random mutations were introduced in each of the 7 sections of the transcription terminators. The relative activation impact on the models is calculated from the difference between the model outputs for the original sequences and for sequences in which half of the nucleotides in a section are randomly mutated. The relative detection impact for ARNold is calculated for the same mutated sequences and is estimated by averaging the binary outputs across the mutation data set. (C) Impact of the base pairings in the terminator stem for a growing number of mutated base pairs, shown as relative activation impact on the CNN models and relative detection impact on ARNold. The relative activation impact is calculated from the difference between the model outputs for mutations that retain the pairing state in the stem structure and mutations that disrupt it. The relative detection impact for ARNold is calculated for the same mutated sequences and is estimated by averaging the binary outputs across the mutation data set. (D) Relative activation impact of the base pairings in the stems of tRNAs on the CNN models, for a growing number of mutated base pairs. The relative activation impact is calculated from the difference between the model outputs for mutations that retain the pairing state in the stem structure and mutations that disrupt it. (B), (C): For k = 1, …, 10 and n ∈ {93, 84, 102, 91, 94, 92, 93, 113, 99, 92}. (D): For k = 1, …, 10 and n ∈ {198, 203, 194, 202, 201, 201, 199, 201, 194, 201}.
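The section-wise mutation experiment in panel (B) can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure used in the study: the helper names `mutate_section` and `relative_activation_impact` are hypothetical, half of the positions is rounded down, each mutated position is forced to change to a different nucleotide, and the impact is normalised by the mean original output. The caption only states that the impact is derived from the difference between the outputs for original and mutated sequences.

```python
import random

NUCLEOTIDES = "ACGU"


def mutate_section(sequence, start, end, fraction=0.5, rng=None):
    """Mutate `fraction` of the positions in sequence[start:end] (rounded down)."""
    rng = rng or random.Random()
    positions = list(range(start, end))
    n_mutations = int(fraction * len(positions))
    seq = list(sequence)
    for pos in rng.sample(positions, n_mutations):
        # Assumption: a mutation always changes the nucleotide.
        alternatives = [b for b in NUCLEOTIDES if b != seq[pos]]
        seq[pos] = rng.choice(alternatives)
    return "".join(seq)


def relative_activation_impact(model, originals, mutated):
    """Drop in mean model output after mutation, relative to the original mean.

    Assumed normalisation; undefined if the mean original output is zero.
    """
    mean_original = sum(model(s) for s in originals) / len(originals)
    mean_mutated = sum(model(s) for s in mutated) / len(mutated)
    return (mean_original - mean_mutated) / mean_original


# Toy usage: `model` is any callable mapping a sequence to a score.
toy_sequence = "AAAAGGGGCCCCUUUU"
toy_mutant = mutate_section(toy_sequence, 4, 12, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the mutated data set reproducible, so the same mutated sequences can be scored by several models or tools.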
Fig 3.
F1-score (A), area under the precision-recall curve (B) and precision-recall curves (C, D) of the one-hot CNN and matrix CNN with and without pre-training. The performance of ARNold on the same validation data is indicated in grey. The p-value of the Wilcoxon rank-sum test between each pair of models x and y is indicated as a coloured dot above model x and as asterisks above model y, with *: p ≤ 0.05, **: p ≤ 0.005, ***: p ≤ 0.001.
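For readers less familiar with the quantities in this caption, the following Python sketch shows how precision, recall and F1-score follow from binary labels and model scores, and how a p-value maps onto the asterisk notation given above. The function names and the decision threshold of 0.5 are assumptions for illustration; the caption does not state the threshold used in the study.

```python
def precision_recall_f1(labels, scores, threshold=0.5):
    """Compute precision, recall and F1 for scores binarised at `threshold`."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def significance_stars(p_value):
    """Map a p-value to the asterisk notation used in the caption."""
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.005:
        return "**"
    if p_value <= 0.05:
        return "*"
    return ""
```

Sweeping `threshold` over the range of scores and recording each precision and recall pair yields the points of a precision-recall curve.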
Fig 4.
Transcriptome annotation of intrinsic terminators.
(A) Average model output of the one-hot CNN and matrix CNN with and without pre-training, relative to the position of transcription termination sites identified with SEnd-seq. (B) Average area under the precision-recall curve for a transcriptome-wide search for transcription terminators in E. coli. The distance to transcription termination sites identified with SEnd-seq is used as ground truth. The distance threshold up to which a predicted terminator is attributed to a nearby termination site is varied and shown on the x-axis. (C) Precision-recall curves for all models at a distance threshold of 35 nt, compared with the precision and recall of ARNold. (D) Area under the precision-recall curve at a distance threshold of 35 nt. The p-value of the Wilcoxon rank-sum test between each pair of models x and y is indicated as a coloured dot above model x and as asterisks above model y, with *: p ≤ 0.05, **: p ≤ 0.005, ***: p ≤ 0.001. N = 10 for each model type in A and B.
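The distance-threshold evaluation in panels (B) to (D) can be illustrated with a small Python sketch. The matching rule used here is an assumption: a predicted position counts as correct if any termination site lies within `max_distance` nucleotides, and a site counts as recovered if any prediction lies within that distance, so several predictions may map to the same site. The function name `precision_recall_at_distance` is hypothetical, and the exact attribution rule of the study is not given in the caption.

```python
def precision_recall_at_distance(predicted, sites, max_distance):
    """Precision and recall of predicted positions against termination sites.

    `predicted` and `sites` are lists of genomic coordinates (in nt).
    """
    # Predictions with at least one site within the distance threshold.
    true_positives = sum(
        1 for p in predicted if any(abs(p - s) <= max_distance for s in sites)
    )
    # Sites with at least one prediction within the distance threshold.
    recovered_sites = sum(
        1 for s in sites if any(abs(p - s) <= max_distance for p in predicted)
    )
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = recovered_sites / len(sites) if sites else 0.0
    return precision, recall


# Toy coordinates, not taken from the study.
toy_precision, toy_recall = precision_recall_at_distance(
    predicted=[100, 140, 500], sites=[110, 300], max_distance=35
)
```

Repeating the call for a range of `max_distance` values gives the kind of threshold sweep shown on the x-axis of panel (B).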