Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task

doi:10.1371/journal.pcbi.1011526

Fig 1.

Overview of problem setting and computational method.

(A) Summary of messenger RNA functional regions and known elements regulating translation. See [48] for a review of known regulatory elements. (B) Neural network sequence-to-sequence architecture. We designed LFNet (left) to apply a learned filter matrix W to a 1D short-time Fourier transform (spectrogram) of the hidden representations, enabling frequency-domain filtering of the 3-base periodicity present in coding sequences. We trained this architecture for multiple problem settings: bioseq2class outputs a classification token, bioseq2seq also predicts the protein translation, and bioseq2start predicts the position of the start codon for mRNAs.
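
The frequency-domain filtering idea in panel B can be illustrated with a minimal single-channel sketch. It is not the authors' LFNet implementation: the function names, the window and hop sizes, and the hand-set filter weights are all assumptions made for illustration, whereas the caption describes a learned matrix W applied to hidden representations. The sketch shows only that a signal repeating every 3 positions concentrates its energy in the frequency bin matching 3-base periodicity, and that an elementwise filter can keep that bin.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def stft(signal, window, hop):
    """Short-time Fourier transform: one DFT per rectangular window."""
    return [
        dft(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, hop)
    ]


def apply_filter(spectrogram, weights):
    """Multiply every frame elementwise by one weight per frequency bin."""
    return [[w * c for w, c in zip(weights, frame)] for frame in spectrogram]


# Toy signal with exact 3-position periodicity (hypothetical stand-in for
# one hidden dimension of a coding sequence representation).
toy_signal = [1.0, 0.0, 0.0] * 8
spec = stft(toy_signal, window=12, hop=6)

# With a window of 12, bin k corresponds to a period of 12 / k positions,
# so bins 4 and 8 (its mirror image) carry the 3-position periodicity.
keep_period_3 = [1.0 if k in (4, 8) else 0.0 for k in range(12)]
filtered = apply_filter(spec, keep_period_3)
```

In this toy setting the hand-set weights act as a band-pass filter; a learned complex-valued filter could also shift the phase of each bin.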

Fig 2.

Comparison of training tasks and neural architectures.

Model names are shortened by dropping the “bioseq2” prefix. (A) F1 score across five replicates of bioseq2seq, bioseq2seq-wt, bioseq2class, and bioseq2start using both LFNet and CNN architectures. (B) Analysis of CDS detection abilities by bioseq2seq variants. Rate at which the predicted protein sequence aligns better to the CDS than to alternative ORFs (left), and alignment percent identity with the CDS (right).

Table 1.

Classification performance.

Several versions of bioseq2seq, bioseq2start and bioseq2class models were trained on a dataset consisting of sequences from eight mammalian species. Our lowest and highest performing models are shown in this table alongside several top-performing machine learning models also trained on our dataset for comparison. For bioseq2seq-wt (LFN), predictions were made using the leading ‘classification’ token 〈PC〉 or 〈NC〉 of the first beam, terminating inference before the peptide prediction. For our models and RNAsamba, multiple replicates were trained with different random seeds. Evaluation metrics were calculated with 〈PC〉 as the positive class and listed as mean ± std. dev. where multiple replicates are available.
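
As a rough illustration of how the reported metrics could be computed, the sketch below scores predictions with 〈PC〉 as the positive class and summarizes replicates as mean ± std. dev. The function names and the toy labels are hypothetical, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, stdev


def f1_score(y_true, y_pred, positive="<PC>"):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize(scores):
    """Mean and sample standard deviation across replicate scores."""
    return mean(scores), stdev(scores)


# Toy labels for one replicate (hypothetical data, not from the paper).
truth = ["<PC>", "<PC>", "<NC>", "<NC>"]
preds = ["<PC>", "<NC>", "<PC>", "<NC>"]
toy_f1 = f1_score(truth, preds)

# Toy replicate scores, again purely illustrative.
avg, sd = summarize([0.5, 0.7])
```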

Table 2.

Results on twenty-two validated micropeptides, separated into transcripts where the longest ORF corresponds to the true CDS (“Longest ORF”, n = 10), and those where it does not (“Not Longest ORF”, n = 12).

Each model was evaluated on classification accuracy (percentage predicted to be coding) and the percentage of the CDSs that were identified.

Fig 3.

Frequency-domain content in model representations.

LFNet filters from selected layers, with complex filter weights visualized in terms of magnitude (bioseq2seq-wt in panel A, bioseq2class in B) and phase (bioseq2seq-wt in C, bioseq2class in D). For each layer heatmap, the x-axis represents the hidden embedding dimension, and the y-axis refers to a discrete frequency bin, with annotations for the equivalent nucleotide periodicity. Both model types learned weights with a pronounced structure around 3-nt periodicity, visible most clearly in the phase for bioseq2seq-wt and in the magnitude for bioseq2class. (E) A nucleotide-resolution metagene consisting of average encoder-decoder attention scores from mRNAs aligned relative to their start codons. Attention distributions for this plot were taken from head 5 of the lower bioseq2seq-wt (LFN) decoder layer, which primarily attends to the start codon and places attention downstream of the start in a periodic fashion. (F) The equivalent plot for the same attention head applied to lncRNAs aligned relative to the start of the longest ORF.

Fig 4.

Predicted mutation effects by model type on a subset of testing data.

Model names are shortened by dropping the “bioseq2” prefix. (A) Inter-replicate agreement according to Pearson correlation of saturated in silico mutagenesis (ISM) ΔS scores, i.e. the difference in log(P(〈PC〉)/P(〈NC〉)) between single-nucleotide variants and their wild-type sequence. Correlation of ISM scores is computed pairwise across replicates and averaged into a single value per transcript. (B) Metagene plots of ISM in which the absolute value of ΔS was averaged within each of 25 positional bins and across all three possible mutations in each position, with mRNAs and lncRNAs depicted separately for each of bioseq2seq-wt (LFN), bioseq2seq-wt (CNN), and bioseq2class (LFN). Vertical dashed lines denote the first and last bin of the CDS for mRNAs and the longest ORF for lncRNAs. Metagenes from all five replicates are shown, with the best-performing model colored using the darkest hue. (C) Changes in coding score for mutations that introduce a premature stop codon, in fifty-codon bins along the length of the CDS. (D) Changes in score for mRNAs from nucleotide substitutions that knock out a start codon. (E) Changes in score relative to wildtype for mRNAs shuffled within each functional region. UTRs were shuffled to preserve dinucleotide frequencies. Codon shuffling excluded the start and stop codons to preserve CDS length.
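
The ΔS definition in panel A can be written out as a short sketch. The scoring function here is a toy stand-in, not one of the trained models: `toy_score` and its probabilities are hypothetical, and any function returning (P(〈PC〉), P(〈NC〉)) for a sequence could be passed in instead. The sketch enumerates all three alternative bases at every position and records the change in log odds relative to the wild type.

```python
import math

BASES = "ACGT"


def log_odds(p_pc, p_nc):
    """Coding score S = log(P(<PC>) / P(<NC>))."""
    return math.log(p_pc / p_nc)


def saturated_ism(seq, score_fn):
    """Delta S for every single-nucleotide variant of `seq`.

    Returns a dict keyed by (position, alternative base).
    """
    wild_type = log_odds(*score_fn(seq))
    deltas = {}
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue
            variant = seq[:i] + alt + seq[i + 1:]
            deltas[(i, alt)] = log_odds(*score_fn(variant)) - wild_type
    return deltas


def toy_score(seq):
    """Hypothetical scorer: calls a sequence coding if it contains ATG."""
    p_pc = 0.9 if "ATG" in seq else 0.2
    return p_pc, 1.0 - p_pc


deltas = saturated_ism("CATGC", toy_score)
```

With this toy scorer, substitutions that destroy the ATG give a strongly negative ΔS, while substitutions in the flanking bases give zero.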

Fig 5.

Detailed analysis of in silico mutagenesis (ISM) on the full test set.

(A) Plots of ISM metagenes for selected amino acids lysine (left) and glycine (right). Mean ΔS is shown for 25 positional bins across mRNA CDS regions with mutations listed based on the resulting codon. The red line represents the average across all missense/nonsynonymous mutations. For amino acids with more than two codons, the blue dashed line depicts the average synonymous mutation for comparison. (B) Mean ISM for synonymous point mutations by codon position and nucleotide. X’s denote substitutions which do not exist as synonymous changes. (C) An example protein-coding transcript with NCBI accession NM_001206605.1. Signed ISM scores for the transcript are depicted as a heatmap and the RNA sequence is portrayed with characters scaled according to the ↑ PC importance strategy, i.e. regions with highly negative ISM weights depicted in dark blue. The subregions shown are windows around the start codon, the position of maximum importance, and the stop codon. (D) Same as panel C with an example long noncoding RNA with NCBI accession NR_109777.1. The endogenous sequence is scaled according to ↑ NC, or highly positive ISM values drawn in dark red. (E) mRNA motifs discovered in our test set with STREME using ISM importance values from bioseq2seq to determine sequence regions in which to search for enriched signals. Annotations denote the importance and control strategy for each trial, with boldfaced annotations signifying that importance values were not masked and ordinary typeface indicating that feature importance at start and stop codons and nonsense mutations were excluded. Motifs are positioned near the regions in which they were enriched. (F) Same as panel E showing discovered lncRNA motifs.
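
The positional binning behind the metagenes in panel A can be sketched as follows. This is an assumption about the general procedure rather than the authors' code: the function name and the rule for assigning a position to a bin are illustrative, and the caption specifies only that scores are averaged within 25 positional bins across the CDS.

```python
def metagene(scores, n_bins=25):
    """Average per-position scores into `n_bins` equal-width positional bins.

    Position i of a region with n positions is assigned to bin
    floor(i * n_bins / n). Bins that receive no positions are None.
    """
    n = len(scores)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, value in enumerate(scores):
        b = i * n_bins // n
        sums[b] += value
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy per-position scores for a short region (hypothetical values).
binned = metagene([1.0, 1.0, 3.0, 3.0], n_bins=2)
```

Scaling positions to a fixed number of bins lets regions of different lengths be averaged on a common axis.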

Fig 6.

Gradient-based approximation performance.

(A) Summary results from tuning the β hyperparameter for MDIG alongside baseline methods. The intra-replicate agreement according to Pearson correlation of each gradient-based approximation with ISM is summarized using the median across transcripts as a point estimate. (B) Scatter plot of ΔS for all possible synonymous point mutations, i.e. every wildtype>variant pair differing at one position, comparing MDIG (x-axis) with ISM (y-axis) on the test set. (C) mRNA motifs discovered in our training set with STREME using MDIG importance values from bioseq2seq to determine sequence regions in which to search for enriched signals. Results from unmasked importance are shown above the transcript diagram and those from the masked trials are shown below. (D) lncRNA motifs discovered in the training set using MDIG importance values from bioseq2seq, depicted in the same manner as panel C.
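
The agreement summary in panel A, a per-transcript Pearson correlation reduced to a median, can be sketched as below. The dictionary layout, transcript identifiers, and score values are hypothetical and stand in for flattened per-mutation scores from a gradient-based approximation and from ISM.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def median_agreement(approx_by_transcript, ism_by_transcript):
    """Median across transcripts of the approximation-vs-ISM correlation."""
    return median(
        pearson(approx_by_transcript[tx], ism_by_transcript[tx])
        for tx in ism_by_transcript
    )


# Toy scores keyed by hypothetical transcript identifiers.
approx = {"tx1": [1.0, 2.0, 3.0], "tx2": [1.0, 2.0, 3.0], "tx3": [3.0, 2.0, 1.0]}
ism = {"tx1": [2.0, 4.0, 6.0], "tx2": [1.0, 2.0, 3.0], "tx3": [1.0, 2.0, 3.0]}
agreement = median_agreement(approx, ism)
```

A median is less sensitive than a mean to the occasional transcript where the approximation is badly anticorrelated, as with the third toy transcript here.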
