Fig 1.
Self-supervised deep learning workflow for characterizing biosynthetic gene cluster (BGC) properties.
Schematic of the workflow for characterizing BGCs with BiGCARP, a self-supervised deep neural network. We curate a dataset of annotated BGCs from antiSMASH for training BiGCARP. We then use ESM-1b [14], a protein masked language model, to obtain pretrained embeddings of the protein family (Pfam) domains in our dataset and to test whether initializing with these pretrained embeddings improves the quality of the learned representations. By representing BGCs as chains of Pfam domains, we train a self-supervised masked language model on these domains to characterize BGC properties in microbial genomes. We leverage the learned representations to detect BGCs in microbial genomes and to predict their natural product class.
Fig 2.
BiGCARP architecture with validation performance curves on the self-supervised dataset.
(a) We use the masked language model objective described in [33] to train BiGCARP to reconstruct the BGC product class and Pfam sequence on our self-supervised dataset, which contains around 127,000 BGC Pfam sequences. BiGCARP is a dilated 1D-convolutional neural network masked language model based on CARP [32] and ByteNet [34]. (b) Validation loss (cross-entropy) and accuracy for BiGCARP with different initial Pfam embeddings.
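As a rough illustration of the masked language model objective in (a), the sketch below corrupts a tokenized BGC by hiding a random subset of tokens and keeps the originals as reconstruction targets. The mask token name, the masking rate, the choice to prepend the product class as the first token, and the example Pfam identifiers are all assumptions made for illustration; the scheme actually used is the one described in [33].

```python
import random

MASK = "<mask>"  # hypothetical mask token name, not taken from the paper


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token the model should reconstruct.
    The mask_rate default is an assumption, not the paper's setting.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    rng = random.Random(seed)
    # Always mask at least one position so there is something to predict.
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = MASK
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets


# Illustrative input: an assumed product-class token followed by Pfam IDs.
bgc = ["NRPS", "PF00501", "PF00668", "PF00550", "PF00975"]
corrupted, targets = mask_tokens(bgc, mask_rate=0.4, seed=1)
```

A model trained with this objective receives `corrupted` as input and is scored with cross-entropy only at the positions listed in `targets`.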
Fig 3.
Relevant representations of Pfam domains are encoded in learned ESM-1b embeddings.
Uniform manifold approximation and projection (UMAP) visualization of learned representations of Pfam domains from BiGCARP with different initial Pfam embeddings.
Table 1.
Pretraining results, including the exponentiated cross-entropy (ECE) metric on the pretraining test set and area under the receiver operating characteristic curve (AUROC) for BGC start locations and domains on the 9-genomes validation set.
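For readers unfamiliar with the two metrics named in this caption, the following minimal sketch shows one common way to compute them. It assumes ECE is the exponential of the mean per-token cross-entropy (in nats) and computes AUROC by its pairwise-ranking definition; the function names and inputs are hypothetical and are not taken from the paper's code.

```python
import math


def exponentiated_cross_entropy(true_token_probs):
    """exp of the mean negative log-probability given to the true tokens.

    A model that guesses uniformly over V tokens scores V; a perfect
    model scores 1.
    """
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of examples,
    which is fine for a sketch but slow for large evaluation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Under these assumptions, lower ECE and higher AUROC both indicate a better model, and an AUROC of 0.5 corresponds to random ranking.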
Table 2.
Domain AUROC and average precision after supervised training on the DeepBGC training set.
Table 3.
Product classification results on MIBiG.
Table 4.
Area under the receiver operating characteristic curve (AUROC) for BGC start locations and domains on 773 bacterial genomes released after the antiSMASH 3.0 database.