Peer Review History

Original Submission: November 30, 2020
Decision Letter - Paula Jauregui, PhD, Editor

Dear Dr. Mollentze,

Thank you for submitting your manuscript entitled "Identifying and prioritizing potential human-infecting viruses from their genome sequences" for consideration as a Research Article by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I am writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Dec 17 2020 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Paula

---

Paula Jauregui, PhD,

Associate Editor

PLOS Biology

Revision 1
Decision Letter - Paula Jauregui, PhD, Editor

Dear Dr. Mollentze,

Thank you very much for submitting your manuscript "Identifying and prioritizing potential human-infecting viruses from their genome sequences" for consideration as a Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by several independent reviewers. I apologize for the time you had to wait while your manuscript was under review.

You will see that both reviewers think that this is an interesting study, but they have several concerns that need to be addressed. In particular, reviewer #2 thinks that you should make the methods more accessible for non-experts and clearly state the advance of your manuscript with respect to previous work, indicating whether 70% of viruses categorized correctly is a good result. Reviewer #1 questions the usefulness of the model, and has several questions about your methods. Please address all the reviewers' concerns.

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers. Unfortunately, depending on the revisions, we may need to add a reviewer with AI/ML expertise, which was difficult to secure in this round, but in order to avoid additional delays we have decided to make a decision now.

We expect to receive your revised manuscript within 3 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Paula

---

Paula Jauregui, PhD,

Associate Editor,

pjaureguionieva@plos.org,

PLOS Biology

*****************************************************

REVIEWS:

Reviewer #1: Genomics and evolution of pathogens.

Reviewer #2: Epidemiology and evolution, mathematical models.

Reviewer #1: In "Identifying and prioritizing potential human-infecting viruses from their genome sequences" Mollentze et al. use machine learning approaches to examine the potential for using various genomic features of viral genomes to predict whether the virus will be able to infect humans. Although I cannot evaluate the details of the machine learning approach used, the manuscript is well written and overall, the analysis appears sound.

The main contribution of this study is demonstrating that there is relevant information within viral genomes for predicting the potential of a virus to infect humans, and that this includes additional information beyond that provided by broad-level taxonomy and phylogenetic similarity. There are two primary, potential implications of this: 1) perhaps viral genomes could be used to prioritize new viruses with zoonotic potential and 2) the genome features that are informative about the potential to infect humans could help us to better understand the adaptation of viruses to certain host species.

The first of these is the primary focus of this manuscript. However, in practice, I question just how useful this model will be for this type of prioritization given that 25% of the 'negative' training set ("No known human infections") is still predicted to be of high or very high priority. Similarly, >25% of human infecting viruses are categorized as low or medium priority. Given the high number of new viruses that are regularly being identified, and assuming that most will not be capable of human infection, it's not clear to me that using the output of this model will significantly improve our ability to prioritize high risk viruses. The results certainly don't seem compatible with the implications of this statement: "The performance of our models, while imperfect, means that many potential zoonoses can be identified immediately after virus discovery and genome sequencing." Say 10% of new viruses characterized from animal hosts have the potential to infect humans (probably an overestimate), then only ~23% of the viruses selected by the model to be high or very high priority would actually be capable of infecting humans. Sure, the model will identify some, but how can users distinguish between the true and false positives?
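The arithmetic behind an estimate of this kind can be sketched with Bayes' rule. The following is a minimal illustration and not the reviewer's own calculation: the function name is hypothetical, the 25% false positive rate and 10% prevalence are taken from the paragraph above, and the sensitivity values are assumptions, since the review only states that more than 25% of human-infecting viruses fall in the low or medium categories (that is, sensitivity below 75%).

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Fraction of flagged viruses that can truly infect humans (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Assumed inputs: 10% of new viruses can infect humans, and 25% of
# non-human-infecting viruses are still ranked high or very high priority.
prevalence = 0.10
false_positive_rate = 0.25

# Upper bound on sensitivity implied by the review (exactly 75% flagged).
ppv_upper = positive_predictive_value(prevalence, 0.75, false_positive_rate)

# A somewhat lower, purely illustrative sensitivity of 70%.
ppv_lower = positive_predictive_value(prevalence, 0.70, false_positive_rate)

print(f"PPV at 75% sensitivity: {ppv_upper:.3f}")
print(f"PPV at 70% sensitivity: {ppv_lower:.3f}")
```

Under these assumptions roughly a quarter or fewer of the flagged viruses would be true positives, which is in line with the figure quoted in the review. The result is sensitive to the assumed prevalence: a lower prevalence lowers the fraction further.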

I think it would have been more interesting to see a deeper exploration of the genomic features shown to be associated with human infecting viruses.

Other concerns:

1. The authors clearly describe how they determine which viruses they consider to be able to infect humans, but they don't clearly describe where they obtained the list of 861 viral species used or how human infecting viruses were categorized as primarily human viruses or zoonotic. These aspects of the methods need to be clarified.

2. For the novel viruses, it is important to demonstrate that these are really new species (and genomes) and not simply re-categorization of existing/known strains.

3. It is specified that for the novel viruses tested, viruses were only included if they were "from families known to contain species that infect animals." Was this same criterion also used for the 861 species included in the training/testing set? If not, it seems that this is a biased subset to use for such testing.

4. ~28% of the novel genomes (71/256) belong to Anelloviridae (Torque Teno viruses), and as noted in the manuscript, no viruses from this group were included in the training/testing set. This seems strange to me. These species certainly existed prior to the latest ICTV release.

5. One confusing aspect of the decision to only include one representative virus from each ICTV species is that certain species are composed of several different strains, only some of which have been associated with human infections. For example, Betacoronavirus 1 includes the human endemic virus HCoV-OC43 and also several strains only associated with infections of other mammals. Why was this not taken into account when choosing the representative genome for each species? I would also like to see some discussion of the potential impact of this on model predictions.

6. Of the 43 novel viruses predicted to be unlikely to infect humans, have any been shown to cause human infections?

7. 14/19 of the unknown viruses predicted to be in the very high priority category were Torque Teno viruses (anelloviruses). In total, 71 of these viruses were included in the novel set. What proportion of these 71 were isolated from humans? What is the distribution of known human-infecting anelloviruses in the various prediction categories?

8. The conclusion that SARS2 ranked "considerably higher" than SARS1 in the analysis shown in Fig 3C seems potentially a bit misleading. The distributions shown are largely overlapping. Is there a significant difference between these distributions? Given the relatively subtle difference between SARS1 and 2, and the fact that several other coronaviruses, not known to cause human infections, are also ranked similarly to SARS2 in Fig 3C, I think it is important that this statement in the abstract is tempered: "…could have identified the exceptional risk of SARS-CoV-2 prior to the emergence of the first SARS-related coronavirus in humans."

9. For Fig. 3A, please provide some indication of from which host species each of the viruses was isolated.

10. For Fig 3C, please provide the meaning of the circles and error bars in the legend.

11. Fig. S3. The meaning of the dark grey is not clear. Please describe.

12. The analysis shown in Fig. S9 should include the coronaviruses from animals that are most closely related to SARS-CoV-2. For example: RaTG13, CoVZXC21, CoVZC45 and the pangolin-isolated viruses from 2017 and 2019.

Reviewer #2: Review of Mollentze et al, PLoS Biol, Feb 2021

- - -

I really liked this paper, and I think it could potentially be published in PLoS Biology. The introduction and results read really well. The discussion needs more context of what's been done in the past. The methods are all AI, and this needs a lot more explanation as many readers of PLoS Biology, all emerging disease ecologists, and even most bioinformaticians will not be familiar with these approaches. It's not enough to cite a paper in the methods and say that some algorithmic/statistical approach was taken. The reader needs to know exactly what this approach does so they can put it in context of the paper's biological assumptions and conclusions.

I don't work in AI or machine learning. Please keep this in mind as you read the comments below as some may sound elementary. However, the paper does need to reach a biology audience with no background in AI/ML.

MAJOR POINTS

1. The most important comment to address is how the probability p was calculated (the probability shown in Figure 3, and maybe in Figure 2A). I tried to follow this and didn't have time to go to the source material. Is p the mean across the best-performing 10% of models as in the Fig 3 caption? If this is the right definition, then if you want to categorize a new virus, which model do you choose from these best 10%? Is p the output probability described on line 334? Line 372 suggests that p comes from a logistic regression approach, but maybe this is a different p.

2. line 111: what are these 100 or 1000 models? Are these different parameterizations of the same statistical learning algorithm, each parameterization corresponding to a different training set? Is it common in machine-learning papers to label these as different "models"? Line 312 says a "range of models" but this seems to refer to the nine "models" or "feature set combinations" in Figure 1A. Is this correct?

3. How many features are there? Figure 1 shows very clearly how good the prediction is when certain feature groups are included or excluded. But, how many features are in the "Similarity to ISGs" feature set, for example? If a virus is 10kb long, do we choose some length maximum, or some number of windows to look at? And then do we take these fragments and find best local alignments to all N=2054 ISGs? Do we take the score from this local alignment as a similarity score and input it into the ML algorithm? Please just tell me using a back-of-the-envelope calculation (if possible) how many ISG features there are for a 13KB influenza A genome. Later in the paper you settle on N=125 features to use, and I had no idea whether the total number of starting features was nine (Figure 1A), one thousand, or one million.

4. What is a SHAP value? In simple language, made understandable to a non-AI moderately quantitative scientist.

5. What is a "per-virus effect size" of a feature? Is this the effect size in Figure 2? If not, please define both of these.

6. Page 6: 70% of viruses categorized correctly. Is this good? Should we be aiming for something like 90% or 95%? Is this normal for AI methods on complex problems like this one? Is it better than what other approaches have achieved? I feel like this point needs some serious discussion. You can't isolate a novel rare virus in the middle of a rain forest and tell the local animal health department that you're 70% sure it would infect humans. If all approaches in the past have yielded 70% predictability, then maybe the current paper is not much of an advance.

7. In general, how much of an advance is this over previous work done in this area? I see that some of this previous work is described in the introduction, but more detail (either in intro or discussion) would be helpful for the reader to understand the need for a better approach. At some point during the 2007-2012 period, many scientific groups were receiving funding for viral discovery, viral chatter, identification of potential pandemic viruses, identification of zoonotic viruses (see Mark Woolhouse, Nathan Wolfe, Peter Daszak, Paul Kellam, Peter Simmonds). Did this work go anywhere? Did all the deep sequencing approaches work for either virus discovery or virus characterization? What were the biggest drawbacks of the goals set out during this period? Were all the nice-looking diagrams in Nature and Science showing different risk levels of zoonotic jumps correctly put together? Did any of these groups have a 70% accurate prediction method for novel viruses? I am not one of the five authors listed above, and I have no interest or stake in the success/failure of these past projects. This is a genuine question.

MINOR POINTS

8. Abstract - "prior to the emergence of" .. SARS1 or SARS2? Maybe it's simpler to say here that your approach "is able to correctly classify SARS-CoV-2 as having zoonotic potential based on genomic data alone"

9. Abstract - "developed machine learning algorithms" .. did the authors develop new algorithms, adapt existing ones, use standard algorithms inside R-packages?

10. Introduction, second para, "Empirical and theoretical lines of evidence suggest such signals might exist [8,9]." - just a couple more sentences of detail here please on what these lines of evidence are.

11. line 105: tested independently? Or tested separately?

12. line 106: "Combining all genome composition features.." -- I would write "combining both types of genome composition features.."

13. What is the definition of the "priority categories" (lines 115-116)?

14. line 163-164: "such that increased similarity to human genomes did not always increase the likelihood of infecting humans" .. this sounds pretty important and interesting. Can you say more here? Do you know what types of similarity do and don't correspond to more likely human infection? Do you know why?

15. lines 180-182. What are these 55 viruses? Sarbecoviruses? Betacoronaviruses? Does this list include both SARS1 and SARS2?

16. line 186, SARS1 or SARS2?

17. Fig S10. Is the x-axis here "CTG codon usage" (i.e. proportion of Leucines coded for by CTG/CUG), or is it some other bias metric?

18. line 317 "class-stratified", i.e. "stratified by virus taxonomic class", yes?

Revision 2

Attachments
Attachment
Submitted filename: Zoonotic_rank_response_to_reviewers.V5.docx
Decision Letter - Paula Jauregui, PhD, Editor

Dear Dr Mollentze,

Thank you for submitting your revised Research Article entitled "Identifying and prioritizing potential human-infecting viruses from their genome sequences" for publication in PLOS Biology. I have now obtained advice from the original reviewers and have discussed their comments with the Academic Editor. 

Based on the reviews, we will probably accept this manuscript for publication, provided you satisfactorily address the remaining points raised by the reviewers. Please also make sure to address the following data and other policy-related requests.

Reviewer #2 wants you to further clarify the methods to make them more accessible to general biologists, as you explain them in the rebuttal letter.

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797 

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication. 

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figures 1ABCD, 2BC, S6, S8, S12, S13.

NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

-  a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

-  a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

-  a track-changes file indicating any changes that you have made to the manuscript. 

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information  

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Paula

---

Paula Jauregui, PhD,

Associate Editor,

pjaureguionieva@plos.org,

PLOS Biology

------------------------------------------------------------------------

Reviewer remarks:

Reviewer #1: I am satisfied with the changes the authors have made to the manuscript.

Reviewer #2: Thanks to the authors for making a number of improvements to the ms. The referee response doc is clear, but the manuscript less so. The numbers of features are now included in the methods which helps in understanding how much data went into the training. I would recommend putting these numbers into the results section as well -- in parentheses, e.g. (N=146) -- so the reader can follow how much data has gone into each fit/model/parameterization.

1. The SHAP value is still barely explained or not explained. This made pages 9 and 10 of the new ms very hard to follow. It also made most of Figure 2 very hard to understand. I can't tell if this is my shortcoming in this area, or if this is just an opaque measure that will be hard for all biologists to understand. Is a paragraph on the SHAP value and some equations/methods appropriate here? Other biologists will be reading this.

Some other points

2. Line 80, exert not assert

3. Line 90, "we first build statistical models that assign a probability of zoonotic occurrence"

4. Line 109, I would remove the word "predictably"

5. Line 289, remove 95%

6. lines 329-332 do not make sense. SARS is the human virus, so you can't call these the 62 human and animal genomes of SARS. Please use individual lineages or sarbecovirus as a classification here.

7. Likewise, "within SARS coronavirus" on line 369 does not make sense. You probably should use lineage or sub-genus here again.

Revision 3

Attachments
Attachment
Submitted filename: round3_response_to_reviewers_V2.docx
Decision Letter - Paula Jauregui, PhD, Editor

Dear Dr Mollentze,

On behalf of my colleagues and the Academic Editor, Jason Ladner, I am pleased to say that we can in principle offer to publish your Research Article "Identifying and prioritizing potential human-infecting viruses from their genome sequences" in PLOS Biology, provided you address any remaining formatting and reporting issues. These will be detailed in an email that will follow this letter and that you will usually receive within 2-3 business days, during which time no action is required from you. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have made the required changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

Sincerely, 

Paula Jauregui

---

Paula Jauregui, PhD 

Associate Editor 

PLOS Biology

pjaureguionieva@plos.org

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.