Deep self-supervised learning for biosynthetic gene cluster detection and product classification

doi:10.1371/journal.pcbi.1011162

Fig 1.

Self-supervised deep learning workflow for characterizing biosynthetic gene cluster (BGC) properties.

Schematic of the workflow for characterizing BGCs with BiGCARP, a self-supervised deep neural network. We curate a dataset of annotated BGCs from antiSMASH for training BiGCARP. We then use ESM-1b [14], a protein masked language model, to obtain pretrained embeddings of the protein family (Pfam) domains in our dataset and to test whether these pretrained embeddings improve the quality of the learned representations. Representing BGCs as chains of Pfam domains, we train a self-supervised masked language model on these chains to characterize BGC properties in microbial genomes. We leverage the learned representations to detect BGCs in microbial genomes and to predict their natural product class.

Fig 2.

BiGCARP architecture with validation performance curves on the self-supervised dataset.

(a) We use the masked language model objective described in [33] to train BiGCARP to reconstruct the BGC product class and Pfam sequence on our self-supervised dataset, which contains around 127,000 BGC Pfam sequences. BiGCARP is a dilated 1D-convolutional neural network masked language model based on CARP [32] and ByteNet [34]. (b) Validation loss (cross-entropy) and accuracy for BiGCARP with different initial Pfam embeddings.
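The masking step of a masked language model objective like the one in panel (a) can be sketched as follows. This is a minimal illustration, not the BiGCARP implementation: the function name `mask_tokens`, the 15% mask rate, the `<mask>` symbol, and placing the product class as the first token are all assumptions, and the Pfam accessions are illustrative placeholders.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (corrupted, targets), where targets maps each masked
    position to the original token the model would have to reconstruct.
    The mask rate is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted[i] = MASK
            targets[i] = tok
    if not targets:
        # Guarantee at least one prediction target per sequence.
        i = rng.randrange(len(tokens))
        corrupted[i] = MASK
        targets[i] = tokens[i]
    return corrupted, targets


# Assumed layout: product class token followed by the chain of Pfam domains.
example = ["Polyketide", "PF00109", "PF02801", "PF00698", "PF00550"]
corrupted, targets = mask_tokens(example)
```

Training would then minimize the cross-entropy between the model's predictions at the masked positions and the entries of `targets`.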

Fig 3.

Relevant representations of Pfam domains are encoded in learned ESM-1b embeddings.

Uniform manifold approximation and projection (UMAP) visualization of learned representations of Pfam domains from BiGCARP with different initial Pfam embeddings.

Table 1.

Pretraining results, including the exponentiated cross entropy (ECE) metric on the pretraining test set and area under the receiver operating characteristic curve (AUROC) for BGC start locations and domains on the 9-genomes validation set.
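As a reading aid for the ECE metric, the sketch below computes an exponentiated cross-entropy from the probabilities a model assigns to the correct tokens. It assumes the natural logarithm and a plain mean over the scored positions; the function name and inputs are hypothetical and the caption does not specify these details.

```python
import math


def exponentiated_cross_entropy(true_token_probs):
    """exp(mean negative log-likelihood) over the scored positions.

    true_token_probs: probabilities the model assigned to the correct
    token at each scored position (assumed to be in (0, 1]).
    """
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model guessing uniformly among 4 tokens scores about 4;
# a model that is always certain and correct scores 1.
uniform_ece = exponentiated_cross_entropy([0.25, 0.25, 0.25])
perfect_ece = exponentiated_cross_entropy([1.0, 1.0])
```

Under this definition the value can be read as the effective number of tokens the model is choosing among, so lower is better and the minimum is 1.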

Table 2.

Domain AUROC and average precision after supervised training on the DeepBGC training set.

Table 3.

Product classification results on MIBiG.

Table 4.

Area under the receiver operating characteristic curve (AUROC) for BGC start locations and domains on 773 bacterial genomes released after the antiSMASH 3.0 database.