Peer Review History
| Original Submission (June 15, 2023) |
|---|
|
Dear Dr Yuan,

Thank you very much for submitting your manuscript "Prediction of virus-host associations using protein language models and multiple instance learning" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time.
Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Fuhai Li
Academic Editor
PLOS Computational Biology

Rob De Boer
Section Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors propose a method named EvoMIL for both prokaryotic and eukaryotic virus host prediction. By integrating a pre-trained protein embedding module and a well-designed weighting module (attention-based multiple instance learning), EvoMIL is expected to leverage the different proteins in virus genomes. The idea of EvoMIL is interesting, and the results show that EvoMIL performs well on the dataset created by the authors. However, there are still some major concerns about the work, especially the experimental design.

Major concerns:

1. Virus host prediction is a hot research topic, and numerous tools have been developed to address this problem. However, this manuscript lacks benchmark experiments that demonstrate the competitiveness of EvoMIL in comparison to state-of-the-art methods. As a newly designed method, the authors should cite these existing tools and provide a comprehensive comparison. A review of many tools was provided here: https://doi.org/10.1093/bioinformatics/btac239 Some of these tools probably have better/newer versions. Although these focus only on prokaryotic viruses, the authors can still compare performance on that part.

2. In addition to the existing tools, the authors also discuss the potential use of alignment-based methods for host prediction. Considering that EvoMIL aims to assess the significance of viral proteins in host prediction, I suggest that the authors include BLASTp results as a fundamental benchmark method in their experiments. For instance, in the binary classification task, it would be valuable to examine whether BLASTp can yield better alignment results for positive pairs compared to negative pairs. This would provide insights into the performance of EvoMIL.

3. Since novel viruses are typically reconstructed from metagenomic data, it is possible that the viral sequences obtained may not be as complete as those downloaded from curated databases. Consequently, certain proteins may be missing from the assembled viral contigs. As EvoMIL is a "protein-based" method, I wonder whether it can still perform effectively on short contigs.

4. As described in the manuscript, EvoMIL is currently designed to handle host prediction for a specific set of 15 prokaryotes and 5 eukaryotes, and cannot perform well on other microbes. This means that if a dataset includes viruses that do not infect these specific microbes, EvoMIL may not be able to provide accurate predictions. To address this limitation, I would suggest that the authors consider introducing a class labeled 'Other' to accommodate users who may wish to apply EvoMIL to datasets containing viruses that infect organisms outside the predefined set. This would allow for more flexibility and usability of EvoMIL in a broader range of scenarios.

5. The creation of the negative set: the authors provided two strategies to create the negative set: 1) the hosts of positive viruses and negative viruses should not be in the same genus; 2) the natural hosts of positive viruses and negative viruses are in the same taxonomic group (from genus to phylum). The authors claim that strategy 2 introduces challenging cases to evaluate the tool. This is only partially correct. My concern is that two viruses respectively infecting rodent and human (both class Mammalia) can be very different. If the similarity between them is very low, the case is then not that challenging. In a typical bioinformatics scenario, creating the dataset by sequence similarity is preferred. A visualization of the similarity between the positive set and the negative set could help mitigate this concern.

6. Number of eukaryotic hosts: in the first experiment (binary classification), the authors created only 5 eukaryotic datasets, because they require at least 125 viruses in the positive set. This cutoff is a bit stringent. How will the method perform when the cutoff is 50, as for the prokaryotic datasets? And when the tool is designed on 36 eukaryotes, why not show them all?

Other detailed comments can be found below.

(1) The data distribution. Compared to the emerging novel viruses, the VHDB database is not complete. Although using it as a reference is acceptable, the authors should show the virus taxonomy distribution to the audience.

(2) Model structure. Although the attention deep MIL module is from an existing work, the authors still need to elucidate the details of the module. In Fig 1, the structures of the neural network and the final classifier are not mentioned. The learning architecture of AA_2, PC_4, and DNA_5 is not stated.

(3) Problem formulation. In the second experiment (multi-class task), is the data multi-label or single-label? This is not stated. The definitions of "multi-class classification" and "multi-label classification" differ: while the former outputs only one label, the latter may output multiple labels for one query, and a number of viruses have the ability to infect multiple hosts. Without stating this, the problem formulation is incomplete.

(4) Fig 3 and Table 5. The caption of Fig 3 is not clear. Which experiment does it belong to? It seems that Fig 3 (a, c) and Table 1 collectively cover the 22 prokaryotes and 36 eukaryotes, while Fig 3 (b, d) covers the small dataset. The organization is interleaved, and it is hard to follow the logic.

(5) Number of prokaryotic hosts. The authors only included 22 prokaryotes, which is rather limited.

(6) The accuracy of multi-class classification. The tool is finally compiled as a multi-class classifier among 36 eukaryotic hosts, and according to Table 1 the accuracy on the test set is around 49.4%. This is not a satisfactory result; I suppose an alignment-based method like BLASTn can achieve this as well.

(7) Fig 5 and Fig S2. In Fig 5, is the true label on the left axis? This is not stated. While Fig 5 shows the error cases in each host group and at different ranks by heatmap, Fig S2 uses the confusion matrix directly; they are somewhat redundant. I recommend the authors use Fig S2 in the main part instead of Fig 5, because Fig 5 does not reflect the error trend. In Fig S2, most of the area is white, because there is a strong signal at Homo-Homo (212). To better show the result, I strongly recommend the authors conduct row normalization by the actual classes.

(8) Balanced binary dataset. In the part on the balanced binary dataset, the authors define four vectors: V, Vpos, Vneg, and Vpro. V is of length P, and Vpro is of length T, but T is not defined, so Vpro is confusing as well. In line 450, there is a typo: "Using this method". When we look at the virus world from an individual host, like human, most of the viruses in the world do not infect human. So I think it is better to include all the negative data, instead of just using a small part of it.

(9) The UMAP section. The authors want to know "whether the embeddings of these important proteins identified by attention-based MIL contain any underlying clustering structures." But the clustering structures can only show the successful application of the ESM-1b embedding, without any further benefit to the host-prediction part. As long as the number of proteins is large enough, there will certainly be clusters in UMAP. However, ESM-1b is not a contribution of this work, so I don't understand why they put this here. In line 324, please try to explain the difference.
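The row normalization Reviewer #1 recommends for Fig S2 simply divides each row of the confusion matrix by its row total, so each cell becomes the fraction of a true class assigned to each predicted label and a dominant class such as Homo-Homo no longer washes out the colour scale. A minimal sketch (the matrix values and function name here are illustrative, not from the manuscript):

```python
import numpy as np

def row_normalize(cm):
    """Divide each row of a confusion matrix by its row sum.

    Rows are true classes, columns are predicted classes, so each
    normalized cell is the fraction of a true class predicted as a
    given label. Empty rows are left as zeros to avoid 0/0.
    """
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

# A dominant class (e.g. 212 Homo-Homo hits) no longer hides the rest:
cm = np.array([[212, 8],
               [5, 15]])
norm = row_normalize(cm)
```

Plotting `norm` instead of `cm` makes the per-class error trend visible regardless of class size.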
Reviewer #2:

>>> "Overall, ESM-1b demonstrated superior performance compared to the k-mer features, through 5-fold cross-validation on both prokaryotic and eukaryotic hosts. Furthermore, we split multi-class datasets into 80% training and 20% test sets, then train and test our model without cross-validation, and compare the accuracy for each host to evaluate the prediction performance of ESM-1b and k-mer features on multi-class classification."

I don't understand why this would be desirable if you already have the results of a 5-fold cross-validation. I'd remove this part and make Figure 4 be based on the results from the cross-validation (i.e., have the log2(esm/kmer) values computed for each fold separately, and then provide standard-deviation markers on the bar plots). Also, for this figure please provide guidelines or better spacing between each set of bars, to make it easier to read which set of bars corresponds to each species.

>>> "Similarly, ESM-1b outperforms the other feature sets in 16 out of 36 eukaryotes."

Does this mean that your model failed to outperform a very simple baseline in most of the eukaryote datasets?

>>> Please provide the methodological details of how the features for AA 2, PC 3 and DNA 5 were created, for reproducibility.

>>> It would be nice if Figure 5 had a taxonomy tree, similarly to Figure 4.

>>> Please give Figure 6 a white background, to save ink if someone ever prints this.

>>> "Again, the UMAP plots demonstrate that different protein functions are clustered into separate locations, while GO terms with similar functions tend to form sub-clusters."

I disagree with this statement; it seems that there are many proteins with the same GO term that are separated into different clusters (e.g. "Virion Assembly" in subfigures A, B and "Viral Assembly" in subfigures C, D). If you want to make this claim, please run a clustering algorithm (k-means will suffice, but a hierarchical clustering algorithm would give you better working space to make claims) and see to which degree certain functions do or do not cluster together.

>>> "For example, the indices of GO annotations of the top 5 ranked proteins (Fig 7 A) are the same as the indices of all proteins (Fig 7 B), meaning that we can obtain GO annotations although only selecting top 5 ranked proteins."

This sentence is a bit confusing, although one can understand it after referring to the image. It would be good if you could take a bit more time to make this sentence clearer.

Reviewer #3: In this paper the authors present an efficient model able to carry out virus-host association starting from the protein set of the virus. This is a novel method, and all the presented data and analyses support the claims. Some details are missing about some theoretical concepts, methods, or technical issues. The code and the trained models are available at a GitHub link in the paper, and the original data are collected from the Virus-Host Database. Nevertheless, the descriptions of the algorithm, the dataset, and the architectures used are not satisfactory for the sake of reproducibility. The paper is written in good, clear English, but some concepts are hardly accessible to non-specialists. In conclusion, I suggest extending some parts of the paper in order to make it more accessible to non-specialists and also to make the experiments easily reproducible. You can find below some questions that could be useful for this task:

- Can you explain in a clearer and more detailed way the so-called "strategy 1" and "strategy 2"?
- What is the aim of your binary and multi-class classification models?
- Can you explain and describe the k-mer features a little more?
- Can you add a few lines about UMAP, Gene Ontology, GO annotation, and InterProScan?
- Can you describe the deep learning architectures used, and the input data dimensions, in order to make your experiments reproducible?

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: None
Reviewer #2: Yes
Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: Yes: Daniele Baggi

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool.
If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols |
| Revision 1 |
|
Dear Dr Yuan,

Thank you very much for submitting your manuscript "Prediction of virus-host associations using protein language models and multiple instance learning" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time.
Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Fuhai Li
Academic Editor
PLOS Computational Biology

Rob De Boer
Section Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed some of my comments, which I believe has improved the manuscript. However, I still think the authors should add BLAST (or BLASTp) as a benchmark method. This is important to examine the difficulty of the problem. Although the authors mentioned there is no published method for this, it is quite feasible. First, if the contigs contain multiple proteins, a simple majority vote can be applied; the authors can choose a variant of the majority-vote strategy for this. Second, the authors can use BLASTn and do not need to rely on proteins. This experiment does not need a very complicated design, but it will provide useful insights.

Reviewer #2: The authors seem to have addressed most of my points about the paper. Here are a few more suggestions and a typo:

- There is no explicit definition of dsDNA.
- "predictiton" -> "prediction".
- Figure 4 is better now; it would also be good to have a non-ratio version in the supplementary material.
- "Here, we extract k-mer features from sequences corresponding to DNA, amino acids (AA) and their physio-chemical properties (PC) [13]. (...) To extract PC k-mers from protein sequences we first re-label each amino acid as one of seven groups based on its physio-chemical properties: ({AGV}, {C}, {FILP}, {MSTY}, {HNQW}, {DE}, and {KR}), [30]" Although you cite [13] above as the citation for all three characteristics, you only cite [30] at the end of the second highlighted sentence; it would be good to have a more explicit citation mentioning that the physio-chemical property grouping is from [30]. Not necessary, but it would aid clarity.
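The physio-chemical grouping quoted above fully specifies a feature-extraction recipe: re-label each amino acid by its group, then count k-mers over the resulting 7-letter group alphabet. A hedged sketch of one possible implementation (the group symbols '0'-'6' and the function names are our own illustrative choices, not from the manuscript):

```python
from collections import Counter
from itertools import product

# The seven physio-chemical groups quoted from the manuscript, each
# assigned an arbitrary single-character group symbol '0'..'6'.
GROUPS = ["AGV", "C", "FILP", "MSTY", "HNQW", "DE", "KR"]
AA_TO_GROUP = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}

def pc_kmer_counts(protein: str, k: int = 3) -> Counter:
    """Re-label a protein by physio-chemical group, then count k-mers."""
    relabeled = "".join(AA_TO_GROUP[aa] for aa in protein if aa in AA_TO_GROUP)
    return Counter(relabeled[i:i + k] for i in range(len(relabeled) - k + 1))

def pc_kmer_vector(protein: str, k: int = 3) -> list:
    """Fixed-length feature vector over all 7**k possible group k-mers."""
    counts = pc_kmer_counts(protein, k)
    return [counts["".join(p)] for p in product("0123456", repeat=k)]

# "ACD" maps to groups 0 (A), 1 (C), 5 (D), yielding the single 3-mer "015".
vec = pc_kmer_vector("ACD")
```

The resulting 7^k-dimensional count vectors could then serve as the PC features, in place of k-mers over the raw 20-letter amino-acid alphabet.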
[13] doi:10.1371/journal.pcbi.1007894
[30] doi:10.1073/pnas.0607879104

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: None
Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No
|
| Revision 2 |
|
Dear Dr Yuan,

We are pleased to inform you that your manuscript 'Prediction of virus-host associations using protein language models and multiple instance learning' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be coordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Fuhai Li
Academic Editor
PLOS Computational Biology

Rob De Boer
Section Editor
PLOS Computational Biology

Feilim Mac Gabhann
Editor-in-Chief
PLOS Computational Biology

Jason Papin
Editor-in-Chief
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: No further comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
Reviewer #1: None

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No |
| Formally Accepted |
|
PCOMPBIOL-D-23-00938R2

Prediction of virus-host associations using protein language models and multiple instance learning

Dear Dr Yuan,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol |
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.