BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo

Hongliang Li; Bin Liu

doi:10.1371/journal.pcbi.1011214

Peer Review History

Original Submission
October 3, 2022
16 Dec 2022 Decision Letter - William Stafford Noble, Editor, Maxwell Wing Libbrecht, Editor

Dear Prof. Liu,

Thank you very much for submitting your manuscript "BioSeq-Diabolo: biological sequence similarity analysis using Diabolo" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

As you will see from the reports below, the reviewers note the good results and usability of the code and web server. However, they identify many significant issues. In particular, the manuscript lacks a description of the methods (particularly of the use of LTR, the main contribution) and the evaluation experiments. Please ensure that the reader can evaluate from the text whether the evaluation experiments support the desired conclusions, including a description of what information is held out from which parts of the approach to avoid data leakage and, more generally, what biological problem these experiments are simulating. Also, it seems that some performance numbers are copied from other papers; please reproduce the results or describe how you ensured that the evaluation regime is exactly reproduced. Without substantial revisions such that the manuscript precisely describes its methods and accurate evaluation methodology, we will be unlikely to send the paper back to review.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.
When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,
Maxwell Wing Libbrecht, Ph.D.
Academic Editor
PLOS Computational Biology

Lucy Houghton
Staff
PLOS Computational Biology

*********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: Li and Liu introduced a new method called BioSeq-Diabolo to calculate biological sequence similarities. Biological sequence similarity analysis is an important task in bioinformatics, and efficient methods are desired. Generally, BioSeq-Diabolo is interesting, because it uses techniques derived from the field of natural language processing, bringing new techniques and concepts to solve this task. The corresponding webserver has been constructed, and the standalone package has been released as well. I have tested both the webserver and the package, and they worked well as described. I believe that they will be particularly interesting for researchers who are working in the related fields. I have the following comments to improve the presentation of this manuscript.

1. From the webserver, I understand that there are several pipeline software tools, as listed in the right part. However, the relationships between these software tools and BioSeq-Diabolo should be explained on the webserver site as well. As a result, the users will be able to use these software tools for their own tasks and aims.
2. The information or references of the datasets provided in the Download section (http://bliulab.net/BioSeq-Diabolo/download/) should be given.
3. I have downloaded the stand-alone package from http://bliulab.net/BioSeq-Diabolo/static/download/Sesica.tar.gz, and I felt this package is easy to use, especially for large dataset analysis. However, I failed to find the files for the examples reported in the main text. I suggest that Li and Liu provide this information to help users reproduce the reported experiments.
4. Figure 2 is confusing. More information should be provided in its legend.
5. The structure shown in figure 4 is clear, and is very similar to the Diabolo, but the input and output shown in the flowchart in the top part are not clear. The two important components should be shown in different colors.
6. In the current study, Li and Liu used four performance measures to evaluate their performance. Did they follow other studies, and why?
7. For the methods provided in BioSeq-Diabolo, it is difficult for users to select the best one for their own tasks. Did Li and Liu consider this problem? It would be more useful to add a function to automatically select the optimized methods for different tasks. Furthermore, there are several parameters for the different methods. A parameter optimization function should also be added.
8. The result visualization part is elegant and clear, but I cannot find the corresponding figures when using the standalone package. Can Li and Liu add these visualization functions to the standalone package as well?
9. This manuscript presented a very useful tool for biological sequence similarity analysis, but more discussion of its potential applications should be given in the conclusion section in order to attract the readers of PLOS Computational Biology.

Reviewer #2: The authors established the web server and the stand-alone package for biological sequence similarity analysis called BioSeq-Diabolo.
Its performance was well evaluated for different biological sequence similarity analysis problems. BioSeq-Diabolo is easily combined with their previously established software tools, and it is the first comprehensive platform incorporating various methods from natural language processing (NLP). It is reasonable to conduct the biological sequence similarity analysis based on the NLP methods considering the similarities between the biological sequences and the natural language sentences. The following revisions are suggested:

1) The heterogeneous biological sequence similarity analysis shown in BioSeq-Diabolo will contribute to many problems in bioinformatics, such as non-coding RNA and disease association prediction. In the current version of the manuscript, it is not clear how to combine the heterogeneous features (such as non-coding RNAs and diseases) to make the prediction. More explanations are needed.
2) The information shown in table 1 clearly indicates that BioSeq-Diabolo and their previous method BLM are complementary, dealing with different problems. For example, Diabolo is for biological sequence similarity analysis, and BLM is for classification or sequence labelling problems in bioinformatics. More explanations of their relationships are needed in both the main text and the web site of Diabolo.
3) Although figure 1 shows the similarities between sequence similarity analysis and semantics analysis, these similarities are not so clear. More explanations and discussions are needed, especially for the similarities among the individual steps shown in this figure.
4) In section 3.2, the concept of embedding is not clear. Does it mean extracting the features of the biological sequences?
5) Learning to Rank is a method to combine different methods. This method seems to be better than the other unsupervised ones. As discussed in the corresponding section, it has been successfully applied to solve many problems in bioinformatics. The authors are suggested to show more details of Learning to Rank, because it is an important step in Diabolo.
6) Writing problems: the last column in Table 1, “Sequence similarities analysing” -> “Sequence similarity analysis”.
7) The web server of Diabolo is a big advantage, and it is easy to use. I only need to feed the features, and the web server will output the desired analysis results. I validated it with both the examples provided by the authors and my own data, and I got the desired results. So I think the web server will be popular with other users as well, but some improvements are desired. For term explanation, the authors are suggested to add the corresponding references of the terms, which will help the user to get the correct information. The main buttons (“Server”, “Tutorial” and “Document”) are suggested to be clearer and bigger.
8) More discussions of the advantages and disadvantages of Diabolo are suggested to be given in the conclusion part.

Reviewer #3: In this work, by using the LTR integration method to analyze the similarity of different biological sequences (DNA, RNA, protein, disease, etc.), a unified platform for systematically analyzing the similarity of homogeneous and heterogeneous biological sequences has been realized. The text is logically coherent, and the server functions provided are complete, powerful and useful. Here are some comments and suggestions to further enhance this study.

Major

1) In the three tests, the descriptions of the evaluation indicators are missing, and they don't appear in the supplementary material either.
2) It is said that "The users only need to input the embeddings of the biological sequence data. BioSeq-Diabolo will intelligently identify the task, and then accurately analyse the biological sequence similarities based on biological language semantics." However, how BioSeq-Diabolo intelligently identifies tasks is not explained in the paper.
3) In the protein function annotation task, the official evaluation metrics in CAFA are F-max and S_min, which are missing in this work.
4) Lack of running time reports and performance evaluation of models under large-scale data. Estimation of time consumption is important for a user-oriented method.

Minor

1. Figure 1, a, the final protein should be "1D3B" instead of "D3B".
2. Figure 4, "Learning to Rank" part, "Predicted diseases list for topic": "topic" is used in the NLP background. Here it should be "Predicted diseases list for RNAs". The same problem exists in the figure on the server (http://bliulab.net/BioSeq-Diabolo/server/).

******

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: None
Reviewer #2: Yes
Reviewer #3: Yes

******

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

https://doi.org/10.1371/journal.pcbi.1011214.r001
Revision 1
4 Mar 2023 Author Response
Attachment, submitted filename: Reponse letter.docx
https://doi.org/10.1371/journal.pcbi.1011214.r002
16 Mar 2023 Decision Letter - William Stafford Noble, Editor, Maxwell Wing Libbrecht, Editor

Dear Prof. Liu,

Thank you very much for submitting your manuscript "BioSeq-Diabolo: biological sequence similarity analysis using Diabolo" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board. We would like to invite the resubmission of a significantly-revised version that takes into account the comments below.

Thank you for your revised manuscript. Upon initial reading, it seems that a key comment from the reviewers (e.g. R3-C1) remains unaddressed. It seems that descriptions of AUC and AUPR in general were added, but there is still no description of the experiments themselves. Please ensure that the reader can fully understand and reproduce each experiment from the text (main or supplementary), including the source of the data, all preprocessing performed, how fields from the source database were mapped to machine learning concepts (e.g. how true label y_i is defined for each task), the division of data into train and test sets, etc. This issue must be addressed before the manuscript can be re-sent for review. Also, when referencing the supplemental material, please include a reference to the specific section in question.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,
Maxwell Wing Libbrecht, Ph.D.
Academic Editor
PLOS Computational Biology

Lucy Houghton
Staff
PLOS Computational Biology
https://doi.org/10.1371/journal.pcbi.1011214.r003
Revision 2
21 Mar 2023 Author Response
Attachment, submitted filename: Reponse letter.docx
https://doi.org/10.1371/journal.pcbi.1011214.r004
18 Apr 2023 Decision Letter - William Stafford Noble, Editor, Maxwell Wing Libbrecht, Editor

Dear Prof. Liu,

Thank you very much for submitting your manuscript "BioSeq-Diabolo: biological sequence similarity analysis using Diabolo" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,
Maxwell Wing Libbrecht, Ph.D.
Academic Editor
PLOS Computational Biology

Lucy Houghton
Staff
PLOS Computational Biology

*********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: It is revised well.

Reviewer #2: The authors have improved their work and addressed my comments. Happy to recommend this version for publication in PLoS CB.

Reviewer #3: The author has actively responded to our comments and resolved most of them. Here are some additional suggestions to further enhance this study.

1. The author added "The details of the reported experiments" section in the supplementary material to further explain the details of the experiment. But as far as the protein function annotation problem is concerned, the experimental implementation has the following problem: generally speaking, when evaluating protein function annotation, most researchers evaluate MFO, BPO, and CCO separately, instead of only evaluating CCO.
2. I think the training time should not be the focus of the evaluation. Instead, the experiment should focus on comparing the time consumption of different approaches. That is, when the number of samples is the same, the time required for different methods to get the result from scratch.
3. "BioSeq-Diabolo will intelligently identify the specific tasks (homogeneous or heterogeneous biological sequence similarity analysis tasks) based on the input embeddings of the biological sequences and the parameters provided by the users." This section needs to specify how the specific tasks are identified.
4. In addition, users only need to input biological sequence data, but generating different input embeddings from biological sequence data requires different models. How does BioSeq-Diabolo select the corresponding model?
5. The layout of Figure 4 is suitable, but many subplots are too small to be seen clearly. You can try to regroup the subplots. The theme of Diabolo does not need to be presented on this diagram, as it hinders the normal layout of the diagram, and it can be reflected in your LOGO.

******

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: None
Reviewer #2: Yes
Reviewer #3: Yes

******

Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

https://doi.org/10.1371/journal.pcbi.1011214.r005
Revision 3
19 May 2023 Author Response
Attachment, submitted filename: Response Letter_v1.docx
https://doi.org/10.1371/journal.pcbi.1011214.r006
24 May 2023 Decision Letter - William Stafford Noble, Editor, Maxwell Wing Libbrecht, Editor

Dear Prof. Liu,

We are pleased to inform you that your manuscript 'BioSeq-Diabolo: biological sequence similarity analysis using Diabolo' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,
Maxwell Wing Libbrecht, Ph.D.
Academic Editor
PLOS Computational Biology

Lucy Houghton
Staff
PLOS Computational Biology

https://doi.org/10.1371/journal.pcbi.1011214.r007
Formally Accepted
12 Jun 2023 Acceptance Letter - William Stafford Noble, Editor, Maxwell Wing Libbrecht, Editor

PCOMPBIOL-D-22-01464R3
BioSeq-Diabolo: biological sequence similarity analysis using Diabolo

Dear Dr Liu,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,
Zsofi Zombor
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

https://doi.org/10.1371/journal.pcbi.1011214.r008

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.