Peer Review History

Original Submission - March 3, 2021
Decision Letter - Paula Jauregui, PhD, Editor

Dear Dr. Blackwell,

Thank you for submitting your manuscript entitled "Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences" for consideration as a Methods and Resources by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I am writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Mar 14 2021 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Paula

---

Paula Jauregui, PhD,

Associate Editor

PLOS Biology

Revision 1
Decision Letter - Paula Jauregui, PhD, Editor

Dear Dr. Blackwell,

Thank you very much for submitting your manuscript "Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences" for consideration as a Methods and Resources at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by several independent reviewers.

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

In particular, reviewer #2 thinks that the article in its current form may leave a reader with an impression that this is yet another bacterial genome database, as you have not sufficiently explained the pros and cons of using this resource as opposed to another database like NCBI, GTDB or PATRIC. This reviewer suggests including an additional set of analyses using minHash distances at various thresholds to compare what proportion of the genomic space provided by ENA is not covered in other databases like GTDB and PATRIC, and vice versa. Please also address the rest of the reviewers' concerns.

We expect to receive your revised manuscript within 3 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Paula

---

Paula Jauregui, PhD

Associate Editor

PLOS Biology

pjaureguionieva@plos.org

*****************************************************

REVIEWS:

Reviewer #1: Computational epidemiology.

Reviewer #2: Evolution of human pathogens.

Reviewer #1: Blackwell et al. describe a curated resource of bacterial genomes that they have generated from ENA records. They process this data using a particular workflow with state-of-the-art, appropriate methods and analyse it with respect to taxonomic composition and AMR gene occurrence.

Overall, while the technical part is well done, it seems to me that more exciting things and more in-depth analyses of individual questions might be learned from this data resource than what is presented now in the article. As their main conclusion, in the abstract the authors state: "Whilst these archives are rich in data, considerable processing is required before biological questions can be addressed…. An analysis on this scale revealed the uneven species composition in the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The over-represented species tend to be acute/common human pathogens."

That most data in public genome databases comes from very few, cultivable bacteria, most of which are human pathogens is a well-known fact, see, for instance, https://pubmed.ncbi.nlm.nih.gov/11864374/ ; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4361730/; https://pubmed.ncbi.nlm.nih.gov/18341921/

Furthermore, I do not think that it is as complicated or time-consuming to come to this conclusion as the authors think using other approaches. One could, for instance, go to the NCBI taxonomy site, select the filter "has genome sequences", display 5 levels and sort the results locally by number of genomes.
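As an illustration of the kind of local tally the reviewer describes, the following is a minimal sketch, assuming one already has a plain list with one species label per genome (the function name and input format are hypothetical, and nothing here queries the NCBI site). It counts how many of the most abundant species are needed to account for a given fraction of all genomes.

```python
from collections import Counter


def species_needed(genome_species, fraction=0.9):
    """Return how many top-ranked species cover `fraction` of all genomes.

    `genome_species` is a hypothetical input: one species label per genome.
    """
    counts = Counter(genome_species)
    total = sum(counts.values())
    running = 0
    # Walk species from most to least abundant, accumulating genome counts.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= fraction * total:
            return rank
    return 0  # empty input
```

Applied to a species-per-genome table, a call such as `species_needed(labels, 0.9)` would give the figure corresponding to the "20 of 2,336 species" statement quoted above.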

Reviewer #2: The article by Blackwell and colleagues describes a novel resource: a large database of >660k bacterial genomes created from short-read data downloaded from European Nucleotide Archive in 2018. The genomes underwent a consistent assembly pipeline with an extensive quality control and a number of extra post-processing steps, including kmer indexing and minhash sketching. In addition, the authors have provided various analyses aimed at describing the genetic diversity, sequencing trends and the distribution of antimicrobial resistance genetic markers.

I find the article well written, methodology sound and the analyses interesting. Most importantly, the study offers both an exceptional new methodology and a fantastic resource for the entire field of biology. In my opinion the article deserves to be published in PLoS Biology.

My major and only reservation is that the article in its current form may leave a reader with an impression that this is yet another bacterial genome database, as the authors have not sufficiently explained the pros and cons of using this resource as opposed to another database like NCBI, GTDB or PATRIC. In fact the authors have stated directly that both NCBI and GTDB contain an order of magnitude greater number of bacterial species than the ENA archive. However, it is unclear whether the large number of assemblies provided with this resource is due to oversampling of major epidemiological lineages or due to inclusion of multiple novel lineages of the species already present. The ENA archive could also contain some species which are not present in other databases. To clarify this, I would suggest including an additional set of analyses using minHash distances at various thresholds to compare what proportion of the genomic space provided by ENA is not covered in other databases like GTDB and PATRIC, and vice versa. A large increase in the number of genomes of major sequence types (eg, ST131 in E. coli or ST258 in Klebsiella) would be a great contribution for many phylodynamic studies which explicitly try to estimate parameters of bacterial evolution, or for the studies of the evolution of bacterial accessory genomes.
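The comparison suggested here can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline or any particular tool's API: sequences are plain strings, sketches are bottom-s MinHash sketches built with a standard-library hash, and the distance is taken simply as 1 minus the estimated Jaccard index (tools such as Mash apply a further transformation). All function names and parameters are hypothetical.

```python
import hashlib


def kmers(seq, k):
    """Set of all k-mers in a sequence string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=5, size=64):
    """Bottom-s MinHash sketch: the `size` smallest 64-bit k-mer hashes."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(km.encode(), digest_size=8).digest(), "big"
        )
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size=64):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both sketches.
    """
    union = sorted(set(a) | set(b))[:size]
    if not union:
        return 0.0
    sa, sb = set(a), set(b)
    shared = sum(1 for h in union if h in sa and h in sb)
    return shared / len(union)


def uncovered_fraction(query_sketches, reference_sketches, max_distance):
    """Fraction of query genomes with no reference within `max_distance`.

    Distance is 1 - estimated Jaccard (an assumption of this sketch).
    """
    if not query_sketches:
        return 0.0
    uncovered = 0
    for q in query_sketches:
        nearest = min(
            (1.0 - jaccard_estimate(q, r) for r in reference_sketches),
            default=1.0,
        )
        if nearest > max_distance:
            uncovered += 1
    return uncovered / len(query_sketches)
```

Calling `uncovered_fraction` with the two collections swapped gives the "vice versa" direction, and repeating the call over a range of `max_distance` values gives the proportion of uncovered genomic space at each threshold.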

Minor comments:

> Based on Figure 2 there seems to be a correlation between species abundance and the number of AMR genes. Is that driven by the clinical interest (problematic strains are more likely to be sequenced) or by a stronger signal to find AMR with enough genomes?

> It is unclear whether there is any chance of linking non-genetic metadata (eg, ecology or sampling date) with the ENA strains. If not (as I expect), it should be mentioned as one drawback of using a resource like this.

> I would really love to see a few basic summary plots of the assembly statistics (like the number of contigs, N50, CheckM parameters etc.), as one of the supplementary figures

> l. 116-117: this sentence is unclear; not ideal for what exactly, species identification?

> l. 542: I found y-axis title sub-optimal, maybe remove "covered"?

> Figure 1: some colours are hard to distinguish (eg, C. coli and C. difficile), I'd suggest changing the colour scale or showing genera, not species.

> Figure 2: please mark the most abundant species (say top 20). Also, why are Actinobacteria not shown? Please explain in the legend.

Revision 2

Attachments
Submitted filename: Response_to_Reviewers.docx
Decision Letter - Paula Jauregui, PhD, Editor

Dear Dr. Blackwell,

Thank you for submitting your revised Methods and Resources entitled "Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences" for publication in PLOS Biology. I have now discussed your revision with the Academic Editor. 

We will probably accept this manuscript for publication, provided you satisfactorily address the following data and other policy-related requests.

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797 

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication. 

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figures 2AB, 3ABC, Supplementary Figures 1ABCD, 2ABCD, 3, 4AB, 5AC, 6.

NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

-  a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

-  a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

-  a track-changes file indicating any changes that you have made to the manuscript. 

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information  

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Paula

---

Paula Jauregui, PhD,

Associate Editor,

pjaureguionieva@plos.org,

PLOS Biology

Revision 3

Attachments
Submitted filename: Response_to_Reviewers.docx
Decision Letter - Paula Jauregui, PhD, Editor

Dear Dr Blackwell,

I'm handling your manuscript on behalf of my colleague Dr Jauregui, who is out of the office for two weeks. I note that you have re-submitted your paper, BUT on looking at the file inventory, it seems that you may have forgotten to upload the latest versions of your files. My understanding is that Dr Jauregui sent the decision letter on Sept 12th, but the uploaded files all date from Aug 31st or before, and therefore don't contain the requested changes.

Please look at Dr Jauregui's previous decision letter for all the details, but essentially her sole requests were that you:

a) provide the underlying numerical values for Figures 2AB, 3ABC, S1ABCD, S2ABCD, S3, S4AB, S5AC, S6 (either as supplementary data files or as depositions in Figshare, Dryad, Github, etc.).

b) cite the location of the data clearly in each relevant main or supplementary Fig legend (e.g. "the data underlying this Figure may be found in S1 Data" or "the data underlying this Figure may be found in https://github.com/....").

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland G Roberts PhD

Senior Editor

PLOS Biology

rroberts@plos.org

on behalf of

Paula Jauregui, PhD,

Editor,

pjaureguionieva@plos.org,

PLOS Biology

Revision 4

Attachments
Submitted filename: Response_to_Reviewers.docx
Decision Letter - Paula Jauregui, PhD, Editor

Dear Grace,

On behalf of my colleagues and the Academic Editor, William Hanage, I'm pleased to say that we can in principle offer to publish your Methods and Resources "Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences" in PLOS Biology, provided you address any remaining formatting and reporting issues. These will be detailed in an email that will follow this letter and that you will usually receive within 2-3 business days, during which time no action is required from you. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have made the required changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

Sincerely, 

Roli

Roland G Roberts PhD

Senior Editor

PLOS Biology

on behalf of

Paula Jauregui, PhD 

Senior Editor 

PLOS Biology

pjaureguionieva@plos.org

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.