Peer Review History

Original Submission: March 30, 2020
Decision Letter - Mihaela Pertea, Editor

Dear Mr. Cantu Alessio Robles,

Thank you very much for submitting your manuscript "PhANNs, a fast and accurate tool and web server to classify phage structural proteins" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Mihaela Pertea

Software Editor

PLOS Computational Biology


***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Cantu et al. present PhANNs, a tool for the classification of phage structural proteins. This tool is greatly needed as the number of phage genomes detected in metagenomic studies continues to increase and the majority of their genes cannot be annotated. Having now tested the tool, my comments can be divided into (1) the tool/code repo and (2) the manuscript. Overall, the manuscript is well written and clear.

Tool:

1. Question: Both the Webserver and GitHub repo rely on this manually-curated database. As is a concern frequently with new tools, how do the authors plan to keep this database up to date?

2. Question: It looks like there is a way for the user/community to update the model file and/or create their own. I assume the code for building the database and model is in the model_training folder, but there is no README describing the files there.

3. The web service results are presented as a table. It is great to be able to download it, but there are no details on how to interpret it. What numbers are being reported? Also, why are some cells highlighted in the web results? Does the tool automatically pick the highest score, or is there some way it distinguishes a good prediction? This information should be provided on the webpage, either with the results or on the submission page. The web service is an attractive option for less computationally inclined users, but it lacks documentation.

4. As a test of how well the tool works with non-phage sequences, I fed it some bacterial protein sequences and was surprised to find predictions other than "Other". Some had scores of 10.00, but without knowing what the reported number is and what its range is, is that good? I am imagining a scenario where a user has a contig they think is viral but are not sure about.

Manuscript:

1. Details of the prediction models are not thoroughly discussed. As this is the computational contribution of this work, additional details and testing would strengthen the paper. Relatedly, I would consider moving Table S2 into the main text: there is no other mention in the main text of the codes used for the model names or of what the individual models consider, which makes interpretation of Figure 2 challenging. Alternatively, the Figure 2 legend should explain what each of the model names listed on the x-axis means.

2. While the authors acknowledge that there are other tools for phage function prediction, the manuscript lacks a comparison to them. For instance, Galiez et al. report better accuracy than PhANNs for capsid and tail predictions (as indicated in Table 1). And while PhANNs considers more than just these two types of phage proteins, a comparison of how PhANNs performs relative to these other solutions would convince the reader that they should use PhANNs, or perhaps that they should use another tool specifically for capsid genes in addition to PhANNs. How do PhANNs predictions compare to those of these other tools? Is it classifying the same proteins?

Reviewer #2: The manuscript by Cantu et al. presents a method for classifying proteins into 11 categories: 10 phage structural types and one catch-all ‘other’ group. The method applies neural network training and used a large number of protein sequences to generate 12 models. The source code is freely available under the MIT license for anonymous download on GitHub, and a web server applies one of those models (tetra_sc_tri_p) to user data. This work extends previous work by the same group and is meant to fill a need in the field for quick classification of unstudied phage proteins that are unlikely to be anything besides phage structural proteins. There are multiple concerns, however, around the generation of the training set and the validation of the resulting models.
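For concreteness, the model-name codes (e.g., tetra_sc_tri_p) presumably denote the sequence-derived feature sets fed to the networks, such as tetramer, side-chain-class, and tripeptide frequencies. A minimal sketch of one such feature, tripeptide frequencies, under that assumption (the function name is illustrative and not taken from the PhANNs codebase):

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALL_TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]

def tripeptide_frequencies(seq):
    """Return an 8000-dimensional tripeptide frequency vector.

    Illustrative only: the real PhANNs pipeline may normalize,
    window, or combine features differently than shown here.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = max(sum(counts.values()), 1)  # guard against very short sequences
    return [counts[k] / total for k in ALL_TRIPEPTIDES]

vec = tripeptide_frequencies("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(vec))  # 8000 features (20^3 possible tripeptides)
```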

GENERAL COMMENTS

1. The training data sets for the neural network models were collected from the NCBI database using the ncbi_get_structural.py script, which retrieves sequences from the NCBI protein database based solely on how they are named by the sequence depositors (a sketch of this kind of name-based retrieval follows this list). This is problematic, as (a) many phage protein annotations in the database are incorrect, and (b) even if they are “correct”, there is no enforceable naming convention between annotators. For example, the term “minor tail protein” could refer to baseplate components, tail needles, tail fibers, tail tape measure proteins or head-tail joining proteins. Likewise, some groups may use alternate names for the major capsid protein, such as “coat” or “MCP”, and some names may simply be misspelled. It should also be noted that the ncbi_get_structural.py script contains a typo at line 145 (“head-tail joinning”) that would have interfered with data retrieval.

2. Manual curation is mentioned in the manuscript (lines 25, 83, 93, Table 2), but no curation criteria for class inclusion or exclusion are provided. Did the curation involve any validation that the proteins in a given class (e.g., tail fiber) actually had their named functions? Some categories, like baseplate, decreased substantially after manual curation (Table 2), but we do not know why.

3. The biological or structural definition of each of the ten classes is not given in the manuscript, so it is difficult to tell what structural components belong to each class. While terms such as “major capsid” and “portal” are generally well-agreed upon (and will often contain those terms in their name), other categories are more ambiguous. Aside from the ambiguity in what constitutes a “minor tail protein” described above, it is not clear what qualifies as “major tail” (could be tail tube, tail sheath or both), a “tail shaft”, or what the structural distinction is between “collar” and “head-tail joining” proteins. The authors are free to define the classes as they wish, but detailed definitions are required, especially if the intent is for these annotations to be applied to novel phage genomes.

4. The manuscript describes validation of the neural network against sets of database proteins, but there is no validation of the tool against sets of known, experimentally verified proteins, or against clusters of structural proteins that are supported by external bioinformatic evidence (e.g., conserved domains, COGs, high-quality UniProt/SwissProt annotations). The authors describe this as a tool for the annotation of novel phage genomes, so there must be some evidence that the tool “works” in predicting known proteins. This validation should also include a description of the scoring criteria in the tool output, which appears to score each protein for each category on a scale of 0 to 10 (see the aggregation sketch after this list). It is not clear how to interpret these outputs or what constitutes a “good” score.
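Regarding general comment 1, name-based retrieval of this kind typically reduces to a free-text Entrez query, as in the sketch below. This is illustrative Python using Biopython, not the actual ncbi_get_structural.py code, and the query term is hypothetical; it shows why a misspelled or ambiguous name silently changes which records end up in the training set:

```python
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI requires a contact address

def fetch_proteins_by_name(term, retmax=100):
    """Retrieve protein FASTA records matching a free-text name query.

    Because retrieval depends entirely on depositor-chosen names, a
    misspelled term (e.g., "head-tail joinning") silently returns a
    different record set than the correctly spelled one.
    """
    handle = Entrez.esearch(db="protein", term=term, retmax=retmax)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return ""
    handle = Entrez.efetch(db="protein", id=",".join(ids),
                           rettype="fasta", retmode="text")
    fasta = handle.read()
    handle.close()
    return fasta

records = fetch_proteins_by_name('"major capsid protein" AND phage')
```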
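Regarding general comment 4, a 0-to-10 range would be consistent with summing per-class probabilities across a ten-model ensemble (e.g., one network per cross-validation fold), though whether PhANNs does exactly this should be stated in the manuscript. A minimal sketch of that aggregation, with hypothetical trained model objects:

```python
import numpy as np

def ensemble_scores(models, features):
    """Sum per-class probabilities over an ensemble of classifiers.

    Assumption: the reported 0-10 score is the sum of softmax outputs
    from ten per-fold networks. `models` is a list of hypothetical
    trained classifiers, each with a predict_proba method returning an
    array of shape (n_samples, n_classes).
    """
    # Each model contributes a value in [0, 1] per class, so the sum
    # over ten models lies in [0, 10]; a score near 10 means all folds
    # agree strongly, while scores spread thinly across classes signal
    # an uncertain prediction.
    return np.sum([m.predict_proba(features) for m in models], axis=0)
```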

SPECIFIC COMMENTS

Line 43 – ‘by lysing specific components of microbiomes’ is unclear. Is it meant that specific taxa are targeted? Phage can also interact with bacteria in ways other than lysing them.

Lines 60-62 – the argument that phage therapy has increased demand for phage annotation is supported, but the implication here seems to be that annotation of structural proteins will provide higher confidence in using phages for therapy. Is there a specific connection (positive or negative) between structural proteins and phages used for therapy? What does “provisional” mean in this context? Is this just another word for “inaccurate”?

Figs. 2, 3 and 4 are missing y-axis labels. The bars in Figs. 2 and 3 are very narrow and can only be viewed when zoomed in to >200%.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

Revision 1

Attachment
Submitted filename: PhANNs_reviewers_responses.docx
Decision Letter - Mihaela Pertea, Editor

Dear Mr. Cantu Alessio Robles,

We are pleased to inform you that your manuscript 'PhANNs, a fast and accurate tool and web server to classify phage structural proteins' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Mihaela Pertea

Software Editor

PLOS Computational Biology


***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed all of the reviewer comments in their revised manuscript. The GitHub repo and tool are both easy to understand and work well. The additional figures and tables provide further support for the utility and power of this tool.

Reviewer #2: I thank the authors for their attention to the reviewer comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Catherine Putonti

Reviewer #2: No

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.