Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa

Renato Giliberti; Sara Cavaliere; Italia Elisa Mauriello; Danilo Ercolini; Edoardo Pasolli

doi:10.1371/journal.pcbi.1010066

Peer Review History

Original Submission
October 13, 2021
8 Dec 2021 Decision Letter - Ilya Ioshikhes, Editor, Luis Pedro Coelho, Editor

Dear Dr Pasolli,

Thank you very much for submitting your manuscript "Host phenotype classification from human metagenomes is more driven from presence than abundance of microbial taxa" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

As you can see from the reviews, the feedback is generally positive and, in our judgement, none of the raised issues present a fundamental objection to the soundness and relevance of the work. We nonetheless ask that the authors consider all the points raised to improve the manuscript.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Luis Pedro Coelho
Associate Editor
PLOS Computational Biology

Ilya Ioshikhes
Deputy Editor
PLOS Computational Biology

*********************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: General: Giliberti and colleagues conduct a machine learning meta-analysis including different metagenomic datasets, using the information about presence and absence of bacterial species as input features for modelling rather than their relative abundance. The machine learning analyses are performed in a technically thorough manner and uncover a surprising finding, i.e. that the presence/absence information alone seems to be as predictive for various disease states as relative abundances. However, the biological relevance of the presented results is not fully clear and should be discussed in more detail.

Major:

1. The authors discuss the implications for practical applications in diagnostic tests. However, it remains unclear to this referee whether detection of bacteria present rather than their quantification (by a cheap targeted approach such as qPCR) would make a diagnostic routine easier to implement in practice. The authors might want to discuss for which diagnostic assays this could matter.

2. We feel that some of the chosen visualizations are not optimal to convey differences between various approaches in terms of AUC differences. A better alternative might be scatter plots with the original AUC on the x-axis and the AUC from the model trained on presence/absence data (or with different threshold, different taxonomic levels, or different machine learning algorithms) on the y-axis.

3. The included datasets (and samples within a dataset) vary considerably in terms of their mean sequencing depth, which likely affects the detection of taxa with lower relative abundance. For this reason data is often downsampled (rarefied) before presence/absence calls and derived richness estimates are calculated. This issue is partially addressed by the authors' use of different threshold values above which the authors call a taxon "present", but this aspect might need further clarification. A more explicit description of how their thresholding strategy relates to downsampling datasets should be added and, in the best case, both approaches could be compared empirically (on a few datasets).

4. The analysis of statistically significant taxa in both approaches (lines 330 and following, also SFig. 4) feels disconnected from the rest of the manuscript and generally underdeveloped. Could the authors compare the p-values from the Wilcoxon and Fisher tests, maybe again via scatter plots (on -log10 scale of the p-values)? How well do the p-values correlate? Are there taxa for which the two tests disagree consistently across datasets? What does it look like when abundance fold change and prevalence difference across groups are compared? Is there a clear correlation? In general, this analysis could be greatly expanded to bolster the biological interpretation of the results. In this context, it is also important to note that there are many CRC datasets in the set of included studies, potentially biasing the disease-associations reported in lines 339 and following. Instead, the authors could try to get a single significance value for each disease (including all available studies of this disease) and compare again.

5. Does the disease which is to be predicted have any influence on loss of/retaining accuracy after converting features to presence/absence information? One could imagine that CRC, relying more on rare biomarkers, could be less affected by downgrading features to presence/absence information than for example IBD, which is characterized by stronger community shifts in highly abundant and prevalent taxa. This analysis could contribute to developing a clearer understanding of the biological relevance of the presented analysis.

Minor:

1. The LODO analysis for CRC studies could be moved from the supplement to the main manuscript, especially if the display items are streamlined via scatter plots.

2. Are the different models trained on the same cross-validation splits? If not, adopting this approach could remove the random noise introduced by repeated CV splitting.

Reviewer #2: Giliberti et al. present a very well written paper comparing the predictive performance of quantitative vs qualitative taxonomic profiles. The breadth of the work is comprehensive in the range of phenotypes (IBD, CRC, T2D, etc) being analyzed and the data types (16S rRNA and Metagenomics). The core question being answered is an interesting one and this paper will likely be of interest to the broad microbiome community. However, there are three main issues that would first need to be addressed. The first is a logistical issue concerned with the availability of the source code used to generate the data in this paper. The second issue is concerned with the use of the AUC of the ROC to evaluate classification tasks on unbalanced datasets. The last issue is concerned with the choice of test used to compute differential abundance.

The code used to generate the data and figures included in this paper needs to be made available and well documented. Please note the versions of the software dependencies used, how to run these scripts, and when and where the datasets were downloaded from. My recommendation is to host this repository on GitHub or another source code sharing platform.

Using the AUC of the ROC for evaluating the performance of a classification task for unbalanced datasets is likely to yield misleading results. My suggestion would be to use this metric and the area under the precision-recall curve (AUPRC). Alternatively using any of the other metrics that were already measured for this dataset would also be appropriate. These metrics need to be also evaluated with the statistical testing framework used for the AUC of the ROC.

Lastly, performing differential abundance comparisons using Mann-Whitney's U test is inappropriate due to the compositional nature of the data. The authors should update these results to use a different test and/or normalization strategy.

Beyond these issues, I would recommend that the authors rethink the title of their paper; as it stands, the classification performance is shown to be comparable between both cases. The title makes an implication about qualitative profiles outperforming quantitative profiles. Furthermore the use of the term "metagenome" in the title is misleading since amplicon sequencing data is also presented in this paper.

Reviewer #3: This paper proposes a meta-analysis of 25 publicly available datasets from metagenomic studies and 30 public datasets from 16S rRNA studies. The main question investigated is whether presence/absence data could be a valid and efficient indicator for classifying samples. The conclusion is that by degrading abundance data to binary (presence/absence) data, the classification performance could be maintained without decreasing AUC in many parameter settings. This is an interesting paper for host-phenotype classification by only considering binary (presence/absence) data instead of abundance data, because it suggests that it may be possible to design microbe-based diagnostic tasks in the future based on detecting the presence of a set of microbial taxa rather than on complex abundance estimation by sequencing technology. The writing of the paper is very clear; however, there are several concerns that may be useful to consider, or at least valuable to discuss:

(1) The paper used Random forest, SVM, Lasso and Elastic Net for comparison. It is possible to adopt some updated classification approaches, for example:
https://pubmed.ncbi.nlm.nih.gov/32396115/
https://pubmed.ncbi.nlm.nih.gov/32657370/

(2) The paper designed many experiments using a lot of datasets. It would be useful to provide a website link for downloading the processed datasets, for comparison and to reproduce the results.

(3) Traditionally, classification methods including Random forest, SVM, Lasso and Elastic Net are designed for continuous data instead of binary data (presence/absence data). When the data form is changed, the algorithm may not be adapted to it. The authors may explain why these methods could work on both continuous data (abundance data) and binary data (presence/absence data).

******

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: No
Reviewer #3: Yes

******

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: Yes: Xingpeng Jiang

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

https://doi.org/10.1371/journal.pcbi.1010066.r001
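Editorial illustration of Reviewer #1's third major point in the decision letter above, which contrasts two ways of making presence/absence calls under uneven sequencing depth: thresholding relative abundance, and rarefying counts before calling any non-zero taxon present. The following is a minimal Python sketch of that contrast and is not the authors' code. The function names, the toy read counts, the 0.1% threshold and the rarefaction depth of 1,000 reads are all assumptions chosen for illustration.

```python
import random


def relative_abundance(counts):
    """Convert raw read counts for one sample into relative abundances."""
    total = sum(counts)
    return [c / total for c in counts]


def presence_absence(rel_abund, threshold=0.0):
    """Call a taxon present (1) when its relative abundance exceeds `threshold`."""
    return [1 if x > threshold else 0 for x in rel_abund]


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from one sample's count vector."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand counts into one entry per read, labelled by taxon index.
    pool = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    rarefied = [0] * len(counts)
    for taxon in rng.sample(pool, depth):
        rarefied[taxon] += 1
    return rarefied


# Hypothetical sample: two dominant taxa and two rare ones (10,000 reads).
counts = [6000, 3900, 95, 5]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Approach 1: threshold on relative abundance (0.1% is an arbitrary value).
calls_threshold = presence_absence(relative_abundance(counts), threshold=0.001)

# Approach 2: rarefy to a common depth, then call any non-zero count present.
rarefied = rarefy(counts, depth=1000, rng=rng)
calls_rarefied = presence_absence(relative_abundance(rarefied), threshold=0.0)

print("threshold calls:", calls_threshold)
print("rarefied counts:", rarefied)
print("rarefied calls: ", calls_rarefied)
```

With these toy numbers the threshold removes only the rarest taxon, whereas the rarefied calls for the two rare taxa depend on the random draw. This is one way to see why the two strategies need not agree on low-abundance taxa and why an empirical comparison was requested.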
Revision 1
17 Mar 2022 Author Response
Attachment
Submitted filename: Presence absence meta analysis - Rebuttal letter.pdf
https://doi.org/10.1371/journal.pcbi.1010066.r002
20 Mar 2022 Decision Letter - Ilya Ioshikhes, Editor, Luis Pedro Coelho, Editor

Dear Dr Pasolli,

Thank you very much for submitting your manuscript "Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa" for consideration at PLOS Computational Biology. We are likely to accept this manuscript for publication, providing that you modify the manuscript according to the recommendations below:

We consider that the scientific questions have been addressed in this revision. However, before accepting the manuscript, we ask that the authors revise the figures, particularly the supplemental ones, for improved readability and accessibility:

- The fonts and markers are often too small to read comfortably even though there appears to be space available. This is particularly true in the supplemental figures, but some of the main figures could also be improved.
- We also ask that Figs S7 and S8 be redone using a colorblind-safe color scheme rather than relying on a red/green distinction.
- We ask the authors to consider pointing to Figs. S1 and S2 from the caption of Fig 1.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Luis Pedro Coelho
Associate Editor
PLOS Computational Biology

Ilya Ioshikhes
Deputy Editor
PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

https://doi.org/10.1371/journal.pcbi.1010066.r003
Revision 2
25 Mar 2022 Author Response
Attachment
Submitted filename: Presence absence meta analysis - Rebuttal letter.pdf
https://doi.org/10.1371/journal.pcbi.1010066.r004
29 Mar 2022 Decision Letter - Ilya Ioshikhes, Editor, Luis Pedro Coelho, Editor

Dear Dr Pasolli,

We are pleased to inform you that your manuscript 'Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Luis Pedro Coelho
Associate Editor
PLOS Computational Biology

Ilya Ioshikhes
Deputy Editor
PLOS Computational Biology

***********************************************************

https://doi.org/10.1371/journal.pcbi.1010066.r005
Formally Accepted
18 Apr 2022 Acceptance Letter - Ilya Ioshikhes, Editor, Luis Pedro Coelho, Editor

PCOMPBIOL-D-21-01860R2
Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa

Dear Dr Pasolli,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Livia Horvath
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

https://doi.org/10.1371/journal.pcbi.1010066.r006

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.