Prediction of virus-host associations using protein language models and multiple instance learning

doi:10.1371/journal.pcbi.1012597

Fig 1.

A diagrammatic representation of the EvoMIL method.

(A) Viral protein sequences and virus-host associations are collected from the VHDB [10]. For each host, equal numbers of positive and negative viruses are collected, and embeddings of the viral protein sequences are obtained from the pre-trained transformer model [7]; these embeddings serve as features for host prediction with attention-based MIL. (B) Viral protein sequences are split into sub-sequences, which are passed to the pre-trained transformer model to obtain the corresponding embeddings. (C) Each virus is a set of protein sequences with a single host label, and an attention-based MIL model is trained on the protein embeddings of the viruses in each host dataset. Finally, the model predicts the host label for each virus and assigns each protein an instance weight that represents its importance for that virus.
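
The attention-based MIL step in panel C can be illustrated with a minimal sketch of attention pooling over one bag, where a virus is the bag and its protein embeddings are the instances. This follows the standard attention-pooling formulation and is not taken from the EvoMIL code; the function name `attention_mil_pool`, the parameters `V` and `w`, and the toy 2-dimensional embeddings are all assumptions made for illustration. In practice the parameters would be learned and the embeddings would come from the pre-trained transformer.

```python
import math

def attention_mil_pool(instances, V, w):
    """Attention pooling over one bag (virus) of instance (protein) embeddings.

    instances: list of K embedding vectors, each of length D
    V: L x D projection matrix (list of rows); w: vector of length L
    Both V and w are hypothetical parameters; a real model would learn them.
    Returns (bag_embedding, attention_weights).
    """
    # Unnormalised score for each instance: w^T tanh(V h_k)
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Softmax over the instances (shifted by the max for numerical stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Bag embedding: attention-weighted sum of the instance embeddings
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights

# Toy bag: three "proteins" with made-up 2-dimensional embeddings
proteins = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 0.0]
bag, weights = attention_mil_pool(proteins, V, w)
```

The returned `weights` sum to one across the proteins of a virus, which is what allows them to be read as per-protein importance scores; the bag embedding would then be passed to a classifier that predicts the host label.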

Fig 2.

Performance of binary classification tasks.

Heatmaps of AUC, accuracy, F1 score, sensitivity, specificity, and precision for the 15 prokaryotic (A) and 5 eukaryotic (C) host binary classifiers, with negative samples selected by strategy 1; ROC curves for the 15 prokaryotic (B) and 5 eukaryotic (D) hosts, corresponding to heatmaps A and C; AUC values for different taxonomies on prokaryotic (E) and eukaryotic (F) hosts, with negative samples selected by strategy 2.
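
For reference, the six metrics named in this caption can be computed from binary labels, predictions and scores as in the sketch below. It uses the textbook definitions; the helper names `binary_metrics` and `roc_auc` and the toy data are assumptions for illustration, not the authors' evaluation code.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision and F1 from 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }

def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 positive and 3 negative viruses for one host (made-up values)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```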

Fig 3.

Performance of multi-class classifications on ESM-1b and k-mer features.

A and B show the AUC and accuracy, respectively, for prokaryotic and eukaryotic hosts using four feature sets (ESM-1b, AA_2, PC_3 and DNA_5); these values are the same as those presented in Table 1. C and D show the results obtained by testing the trained models on prokaryotic and eukaryotic hosts associated with 5 to 30 viruses, using the same four feature sets.
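
The k-mer baselines can be pictured with a generic k-mer frequency sketch. The caption does not define the feature names, so reading DNA_5 as nucleotide 5-mers and AA_2 as amino-acid 2-mers is an assumption here, as are the function name `kmer_frequencies` and the toy sequences. PC_3 is not sketched because the grouping of residues it relies on is not given in this section.

```python
from itertools import product

def kmer_frequencies(sequence, k, alphabet):
    """Normalised k-mer frequency vector over a fixed alphabet.

    The vector has len(alphabet) ** k entries, ordered as itertools.product
    enumerates the k-mers. Windows containing symbols outside the alphabet
    (for example ambiguous residues) are skipped.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocabulary)}
    counts = [0] * len(vocabulary)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in index:
            counts[index[kmer]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)

DNA = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy sequences, chosen only to keep the example small
dna_vec = kmer_frequencies("ACGTA", 2, DNA)         # 4 ** 2 = 16 entries
aa_vec = kmer_frequencies("MKVLA", 2, AMINO_ACIDS)  # 20 ** 2 = 400 entries
```

Under this reading, a nucleotide 5-mer vector would have 4 ** 5 = 1024 entries and an amino-acid 2-mer vector 400 entries, both of fixed length regardless of sequence length.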

Table 1.

AUC, accuracy and F1 score of multi-class MIL using ESM-1b and k-mer features.

Fig 4.

Taxonomic trees aligned with the log2 accuracy ratio between ESM-1b and k-mer features.

The figure shows the taxonomic trees of 22 prokaryotic (A) and 36 eukaryotic (B) hosts. Each host is aligned with a bar plot showing the accuracy ratio, with its standard deviation over 5-fold cross-validation, between ESM-1b and each of AA_2, PC_3, and DNA_5. The taxonomic tree aligned with the accuracies of ESM-1b and the k-mer features is shown in S4 Fig.
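
One way to obtain such a bar and its error bar for a single host is sketched below. It assumes the ratio is taken per fold as ESM-1b accuracy divided by k-mer accuracy, and that the log2 values are then summarised across the five folds; the caption does not state the direction of the ratio or the order of averaging, so both are assumptions, as are the function name `log2_accuracy_ratio` and the accuracy values.

```python
import math
import statistics

def log2_accuracy_ratio(acc_a, acc_b):
    """Per-fold log2(acc_a / acc_b), summarised as (mean, standard deviation).

    acc_a and acc_b hold one accuracy per cross-validation fold, in the same
    fold order. Accuracies in acc_b must be non-zero.
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("both methods need one accuracy per fold")
    log_ratios = [math.log2(a / b) for a, b in zip(acc_a, acc_b)]
    # Population standard deviation across folds
    return statistics.mean(log_ratios), statistics.pstdev(log_ratios)

# Made-up per-fold accuracies for one host
esm_acc = [0.80, 0.80, 0.80, 0.80, 0.80]
kmer_acc = [0.40, 0.40, 0.40, 0.40, 0.40]
mean_lr, sd_lr = log2_accuracy_ratio(esm_acc, kmer_acc)
```

With this convention a positive value means ESM-1b is more accurate than the k-mer feature for that host, zero means equal accuracy, and a value of 1 corresponds to a twofold difference.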

Fig 5.

Confusion matrices for prokaryotic hosts (A) and eukaryotic hosts (B) based on EvoMIL.

The confusion matrices in A and B show the performance of the EvoMIL model on 22 prokaryotic hosts and 36 eukaryotic hosts, respectively. Each matrix is constructed by evaluating the model’s predictions on a test set comprising 20% of the dataset, with the EvoMIL model trained on the remaining 80%. The plots show how accurately the model predicts the host species of the tested viruses.
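
A confusion matrix of this kind can be built from test-set labels and predictions as in the following sketch. The convention that rows are true hosts and columns are predicted hosts is an assumption, since the caption does not state the orientation; the helper names and the two placeholder host labels are likewise illustrative.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with rows = true label and columns = predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def row_normalise(matrix):
    """Convert counts to per-true-label fractions (empty rows stay at zero)."""
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in matrix]

# Toy test set with two placeholder hosts
labels = ["host_a", "host_b"]
y_true = ["host_a", "host_a", "host_b", "host_b", "host_b"]
y_pred = ["host_a", "host_b", "host_b", "host_b", "host_a"]

cm = confusion_matrix(y_true, y_pred, labels)
fractions = row_normalise(cm)
# Overall accuracy is the diagonal sum divided by the number of test viruses
accuracy = sum(cm[i][i] for i in range(len(labels))) / len(y_true)
```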

Fig 6.

Comparison of EvoMIL and other host prediction approaches on an independent test dataset.

The y-axis presents the number of correct predictions (coloured bar) and incorrect predictions (grey bar) for each tool (x-axis) on the chosen benchmarking test dataset (S7 Table). The plot shows the percentage of correct and incorrect host species predictions on the test dataset; the source prediction results are available in S8 Table.

Fig 7.

Ranking of weights for the top 5 proteins and for all proteins of viruses associated with E. coli and H. sapiens.

The top four bar plots show the protein weights obtained for E. coli from binary classification models (A, B) and multi-class classification models (C, D). Similarly, the bottom four bar plots show the protein weights obtained for H. sapiens from binary classification models (E, F) and multi-class classification models (G, H). Within each pair of plots, the left subplot shows the top 5 ranked protein weights, while the right subplot displays all protein weights sorted in descending order.

Fig 8.

Bar plots of GO annotations for viruses associated with E. coli (top) and H. sapiens (bottom).

Panels show the number of occurrences of each GO annotation among the top 5 ranked proteins of each virus associated with E. coli (A, B) and H. sapiens (C, D). The protein weights in A and C are obtained from binary models, whereas those in B and D are obtained from multi-class classification models.
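
The counting behind such panels can be sketched as follows: rank the proteins of each virus by weight, keep the top k, and tally their GO annotations across viruses. The data layout (a dictionary of viruses, each with a list of proteins carrying a `weight` and `go_terms`), the function name `top_k_go_counts`, and the example annotations are all hypothetical; the example uses k = 2 only to keep the toy data small.

```python
from collections import Counter

def top_k_go_counts(viruses, k=5):
    """Tally GO annotations over the k highest-weighted proteins of each virus."""
    counts = Counter()
    for proteins in viruses.values():
        # Rank this virus's proteins by attention weight, highest first
        ranked = sorted(proteins, key=lambda p: p["weight"], reverse=True)
        for protein in ranked[:k]:
            counts.update(protein["go_terms"])
    return counts

# Made-up viruses, weights and annotations
viruses = {
    "virus_1": [
        {"id": "p1", "weight": 0.6, "go_terms": ["DNA binding"]},
        {"id": "p2", "weight": 0.3, "go_terms": ["virion assembly", "DNA binding"]},
        {"id": "p3", "weight": 0.1, "go_terms": ["lysis"]},
    ],
    "virus_2": [
        {"id": "q1", "weight": 0.2, "go_terms": ["lysis"]},
        {"id": "q2", "weight": 0.8, "go_terms": ["DNA binding"]},
    ],
}
go_counts = top_k_go_counts(viruses, k=2)
```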
