Fig 1.
Overview of the data collection and preprocessing steps at the sequence and the structure level.
(a) Data collection combined raw sequences from INPHARED with InterPro carbohydrate catalytic domains. Protein sequences were processed into multiple sequence alignments (MSAs), while the sequences associated with the InterPro domains were used to construct profile Hidden Markov Models (HMMs). HHblits was then used to scan the constructed HMMs against the MSAs, followed by additional clustering and affinity propagation to improve the quality of the collected protein sequence data. (b) The remaining proteins were further filtered at the structure level using ESMFold, Foldseek and the CAZy database to obtain a final training dataset containing sequences with the folds n-bladed β-propeller, right-handed β-helix and triple helix. Icon attributions: Font Awesome Free 5.2.0 by @fontawesome (CC BY 4.0); Database by Delapouite (CC BY 3.0). No changes were made to the icons.
Fig 2.
Polysaccharide-degrading (PD) folds identified in the set of proteins from the INPHARED database.
Four types of PD fold were identified: (a) the right-handed β-helix, (b) the n-bladed β-propeller, (c) the TIM β/α-barrel and (d) the α/α toroid. The PD fold is colored red, while the remainder of the protein is shown in gold.
Fig 3.
DepoScope combines two deep learning models that perform token classification and binary classification, respectively. The token classification model is a fine-tuned ESM-2 model that receives protein sequences as input, transformed into tokens (one per amino acid). During fine-tuning, the model learns to classify each token as part of a PD domain or not, using four distinct labels: “none”, “right-handed β-helix”, “n-bladed β-propeller” and “triple helix”. The output of this first model additionally serves as input to the second model, which combines two convolutional layers and a dense layer to produce a binary output: the prediction, for the entire sequence, of whether or not the protein is a depolymerase.
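The two-stage design can be sketched as follows. This is a minimal illustrative sketch, not the authors' exact implementation: the embedding dimension, kernel sizes, channel counts, and pooling strategy are assumptions, and the ESM-2 backbone is replaced by a stand-in tensor of per-residue embeddings.

```python
import torch
import torch.nn as nn

NUM_LABELS = 4  # "none", "right-handed β-helix", "n-bladed β-propeller", "triple helix"


class TokenClassifierHead(nn.Module):
    """Per-residue classification head on top of ESM-2 embeddings (stage 1).

    The embedding dimension of 320 matches the smallest ESM-2 model; the
    fine-tuned backbone itself is omitted here.
    """

    def __init__(self, embed_dim: int = 320):
        super().__init__()
        self.linear = nn.Linear(embed_dim, NUM_LABELS)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, embed_dim) -> (batch, seq_len, NUM_LABELS)
        return self.linear(embeddings)


class SequenceClassifier(nn.Module):
    """Binary classifier over the per-token label logits (stage 2):
    two convolutional layers followed by a dense layer, as in the figure.
    Layer sizes are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(NUM_LABELS, 16, kernel_size=5, padding=2)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=5, padding=2)
        self.dense = nn.Linear(32, 1)

    def forward(self, token_logits: torch.Tensor) -> torch.Tensor:
        # token_logits: (batch, seq_len, NUM_LABELS)
        x = token_logits.transpose(1, 2)      # (batch, NUM_LABELS, seq_len)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = x.mean(dim=2)                     # global average pool over residues
        return torch.sigmoid(self.dense(x))   # (batch, 1): depolymerase probability


if __name__ == "__main__":
    embeddings = torch.randn(2, 100, 320)     # stand-in for ESM-2 per-residue embeddings
    token_logits = TokenClassifierHead()(embeddings)
    prob = SequenceClassifier()(token_logits)
    print(token_logits.shape, prob.shape)
```

The key coupling is that stage 2 consumes only the per-token label logits of stage 1, so the sequence-level decision is driven by the predicted PD-domain layout rather than the raw embeddings.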
Fig 4.
Confusion matrices of the token classification task for the T6, T12 and T30 configurations of the fine-tuned ESM-2 model.
The models predicted four labels: “none”, “right-handed β-helix”, “n-bladed β-propeller” and “triple helix”.
Table 1.
Performance of the fine-tuned models ESM-2 6L, ESM-2 12L and ESM-2 30L on the token classification task for the evaluation dataset (best results in bold).
Table 2.
Benchmark results for the three tested models.