Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task
Fig 5
Detailed analysis of in silico mutagenesis (ISM) on the full test set.
(A) Plots of ISM metagenes for selected amino acids lysine (left) and glycine (right). Mean ΔS is shown for 25 positional bins across mRNA CDS regions, with mutations listed according to the resulting codon. The red line represents the average across all missense/nonsynonymous mutations. For amino acids with more than two codons, the blue dashed line depicts the average synonymous mutation for comparison. (B) Mean ISM for synonymous point mutations by codon position and nucleotide. X’s denote substitutions that do not exist as synonymous changes. (C) An example protein-coding transcript with NCBI accession NM_001206605.1. Signed ISM scores for the transcript are depicted as a heatmap, and the RNA sequence is portrayed with characters scaled according to the ↑ PC importance strategy, i.e., regions with highly negative ISM weights are depicted in dark blue. The subregions shown are windows around the start codon, the position of maximum importance, and the stop codon. (D) Same as panel C for an example long noncoding RNA with NCBI accession NR_109777.1. The endogenous sequence is scaled according to ↑ NC, i.e., highly positive ISM values are drawn in dark red. (E) mRNA motifs discovered in our test set with STREME, using ISM importance values from bioseq2seq to determine the sequence regions in which to search for enriched signals. Annotations denote the importance and control strategy for each trial: boldfaced annotations signify that importance values were not masked, and ordinary typeface indicates that importance values at start and stop codons and for nonsense mutations were excluded. Motifs are positioned near the regions in which they were enriched. (F) Same as panel E, showing discovered lncRNA motifs.
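The positional binning behind a metagene plot such as panel A can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `metagene_bins` and `average_metagene` are hypothetical, and the sketch assumes that each transcript supplies one ΔS value per CDS nucleotide and that bins are equal-width in relative CDS position. The grouping by resulting codon described in the caption is omitted.

```python
import numpy as np


def metagene_bins(delta_s, n_bins=25):
    """Average per-position scores into n_bins equal-width positional bins.

    delta_s: 1D sequence with one score per CDS position (an assumption
    made for this sketch). Bins that receive no positions are NaN.
    """
    delta_s = np.asarray(delta_s, dtype=float)
    n = delta_s.size
    # Integer arithmetic maps position i to bin floor(i * n_bins / n) exactly.
    bin_idx = (np.arange(n) * n_bins) // n
    sums = np.bincount(bin_idx, weights=delta_s, minlength=n_bins)
    counts = np.bincount(bin_idx, minlength=n_bins)
    means = np.full(n_bins, np.nan)
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty]
    return means


def average_metagene(per_transcript_scores, n_bins=25):
    """Average binned profiles across transcripts, ignoring empty bins."""
    profiles = np.stack([metagene_bins(s, n_bins) for s in per_transcript_scores])
    return np.nanmean(profiles, axis=0)


if __name__ == "__main__":
    # Two toy "transcripts" of different CDS lengths with constant scores.
    toy = [np.ones(50), 3.0 * np.ones(100)]
    print(average_metagene(toy))
```

Binning by relative rather than absolute position is what lets CDS regions of different lengths contribute to the same 25-point profile.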