Peer Review History

Original Submission: October 16, 2023
Decision Letter - Li Chen, Editor

PONE-D-23-33766

iDNA-ITLM: An interpretable and transferable learning model for identifying DNA methylation

PLOS ONE

Dear Dr. Ren,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 19 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Li Chen

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript by Yu et al. reports a novel data augmentation strategy, which involves the continuous self-replication of a short DNA sequence into a longer one, followed by its embedding into a high-dimensional matrix to expand the receptive field. This approach, combined with the existing BERT model, is used for predicting DNA methylation sites. The model performs well across 17 datasets from different species, and it can also be extended to predict RNA methylation.

Generally, the manuscript is well written, and the conclusions are supported by the presented results. However, there are some issues that need to be addressed to provide more substantial evidence of the model's superiority and to elucidate the entire training process.

Major:

1. The selling point of this paper is the data enhancement part. Did you compare the model's performance with and without the data enhancement strategy? Adding this experiment would make the results more convincing.

2. How did you train your model? As you used pretrained BERT model, did you retrain it using your data or freeze the parameters of it? Please provide more information to clarify your training process.

3. For transfer learning part, did you fine-tune the pre-trained DNA model using RNA sequences or train it from scratch? Did you retrain the whole model or only retrain a portion of the model? Please clarify your transfer learning process.

Minor:

1. Check for typos throughout the text, e.g., the title of the MLM section.

2. Please provide higher quality figures. When zoomed in, the figures lose clarity and appear blurry.

3. The formatting of the formulas and equations is inconsistent (Italic? Bold?) and the layout is messy. Please modify.

4. For Figure 6, I guess the x-axis and y-axis labels are 'train data' and 'test data'? Please add axis labels.

5. In the transfer learning part, you claimed you did transfer learning without altering the model's parameters. I guess you're referring to hyperparameters here?

Reviewer #2: This manuscript by Xia et al. introduces a novel data enhancement strategy for identifying DNA methylation sites. The authors demonstrate that the proposed method outperforms baseline methods, including iDNA-ABT, iDNA-ABF, iDNA-MS, and MM-6mAPred. While the paper is well-organized, there are concerns about the clarity in describing methods and results.

Comments:

1. The paper contains numerous grammar errors, confusing sentences/subtitles, and misreferences (e.g., Figure 1B referred to as Fig 2B). A careful review of the paper for these issues is necessary before submission.

2. The motivation behind the self-replicating operation of the sequence is unclear. Does the model benefit from this operation, or is it solely for converting the sequence to a matrix? The authors should clarify this aspect.

3. Figure 1 is confusing. Does Fig 1B involve data enhancement through self-replicating? If so, how is a sequence converted to a matrix with different rows? Subfigures CDEF can be consolidated into a single figure. Additionally, the species information in the prediction in Figure 1H needs clarification—are these species details part of the training data or the model? The authors need to re-draw this overview figure to make it clear.

4. How was the training / testing split performed? Why are they equal?

5. The motivation for presenting the logo plot of methylation in Figure 2 is not clear. The authors should provide more context. Otherwise, I would suggest removing Figure 2.

6. For Figure 5, if the datasets are uncorrelated, a bar plot is suggested instead of a line plot.

7. In the transfer learning section, authors should include the performance of baseline methods (directly trained on RNA dataset without fine-tuning)

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Revision 1

Dear Reviewers,

Thank you very much for your comments and professional advice, which have helped us improve the academic rigor of our article. Based on your suggestions and requests, we have made the corresponding modifications to the revised manuscript. We would like to present the details as follows:

Reviewer 1

The manuscript by Yu et al. reports a novel data augmentation strategy, which involves the continuous self-replication of a short DNA sequence into a longer one, followed by its embedding into a high-dimensional matrix to expand the receptive field. This approach, combined with the existing BERT model, is used for predicting DNA methylation sites. The model performs well across 17 datasets from different species, and it can also be extended to predict RNA methylation.

Generally, the manuscript is well written, and the conclusions are supported by the presented results. However, there are some issues that need to be addressed to provide more substantial evidence of the model's superiority and to elucidate the entire training process.

Major:

1. The selling point of this paper is the data enhancement part. Did you compare the model's performance with and without the data enhancement strategy? Adding this experiment would make the results more convincing.

The author’s answer: We have added the experiment comparing the performance of the model with and without data enhancement in Section 4.2 of the article. (Page 15)

2. How did you train your model? As you used pretrained BERT model, did you retrain it using your data or freeze the parameters of it? Please provide more information to clarify your training process.

The author’s answer: When using the BERT model, we retrained it on DNA methylation sequence data from one or several species. During retraining, we fine-tuned relevant parameters, such as the learning rate and embedding, to optimize the model's performance. Once the model's parameters were well trained, this set of parameters could be applied to the identification of methylation sites in other DNA and RNA species with equally good performance.

3. For transfer learning part, did you fine-tune the pre-trained DNA model using RNA sequences or train it from scratch? Did you retrain the whole model or only retrain a portion of the model? Please clarify your transfer learning process.

The author’s answer: In the transfer learning section, we did not fine-tune the pre-trained DNA model using RNA sequences. Instead, we input RNA methylation sequences into the model proposed in this paper, using the fine-tuned parameter settings that optimized the model during the DNA methylation site recognition stage. That is, methylation sequences from different RNA species are trained directly in the model, without fine-tuning parameters for each species, and still achieve good performance.

Minor:

1. Check for typos throughout the text, e.g., the title of the MLM section.

The author’s answer: Thank you for pointing out this issue. We have corrected the typos and modified the section title to a consistent format. (Page 7)

2. Please provide higher quality figures. When zoomed in, the figures lose clarity and appear blurry.

The author’s answer: Thank you for pointing out this issue. All figures in the article are exported at 300 dpi. We assume the distortion you refer to upon enlargement concerns Figure 2. This figure is a visual representation of the matrix data converted from sequence data after data augmentation; unlike the other figures, it does not possess distinctive features, so it appears distorted when enlarged.

3. The formatting of the formulas and equations is inconsistent (Italic? Bold?) and the layout is messy. Please modify.

The author’s answer: Thank you for pointing out this issue. We have standardized the formatting of the equations in the article, uniformly setting them in Cambria Math 12. (Pages 9-10)

4. For Figure 6, I guess the x-axis and y-axis labels are 'train data' and 'test data'? Please add axis labels.

The author’s answer: Yes, the x-axis and y-axis labels of Figure 6 are ‘train data’ and ‘test data’. We have added the corresponding axis labels. (Page 18)

5. In the transfer learning part, you claimed you did transfer learning without altering the model's parameters. I guess you're referring to hyperparameters here?

The author’s answer: Yes, the model's parameters referred to here are the hyperparameters.

Reviewer 2

This manuscript by Xia et al. introduces a novel data enhancement strategy for identifying DNA methylation sites. The authors demonstrate that the proposed method outperforms baseline methods, including iDNA-ABT, iDNA-ABF, iDNA-MS, and MM-6mAPred. While the paper is well-organized, there are concerns about the clarity in describing methods and results.

Comments:

1. The paper contains numerous grammar errors, confusing sentences/subtitles, and misreferences (e.g., Figure 1B referred to as Fig 2B). A careful review of the paper for these issues is necessary before submission.

The author’s answer: Thank you for pointing out this issue. We have corrected the relevant errors in the article. (Page 4)

2. The motivation behind the self-replicating operation of the sequence is unclear. Does the model benefit from this operation, or is it solely for converting the sequence to a matrix? The authors should clarify this aspect.

The author’s answer: The motivation for the self-replication of the sequence is that the DNA methylation sequence is relatively short, only 41 bp. This led us to think of the receptive field in image feature extraction, where a larger receptive field allows richer features to be extracted. Therefore, the DNA methylation sequence is self-replicated and embedded into a large-scale matrix, thereby increasing the sequence's receptive field and facilitating feature extraction. This motivation is explained in the Introduction. The model does benefit from the self-replication of the DNA sequence, and we have supplemented the article with experiments before and after data augmentation. (Pages 3, 14-15)
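To make the replicate-then-embed idea above concrete, here is a minimal sketch of how a short sequence could be tiled into a large matrix. This is our own illustration only: the 64x64 target size, the integer base encoding, and the function name are assumptions, not the authors' actual implementation.

```python
import numpy as np

# Hypothetical base-to-index encoding; the real embedding module is learned.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def self_replicate_to_matrix(seq: str, size: int = 64) -> np.ndarray:
    """Continuously self-replicate a short sequence until it fills a
    size x size matrix, enlarging the 'receptive field' of the input."""
    needed = size * size
    repeats = -(-needed // len(seq))      # ceiling division
    long_seq = (seq * repeats)[:needed]   # continuous self-replication
    indices = np.fromiter((BASE_INDEX[b] for b in long_seq), dtype=np.int64)
    return indices.reshape(size, size)    # embed into a large-scale matrix

# A toy 41 bp sequence, matching the stated methylation sequence length.
mat = self_replicate_to_matrix("ACGT" * 10 + "A")
print(mat.shape)  # (64, 64)
```

Because the replicated copies are concatenated end to end, the matrix repeats the original 41 bp pattern with period 41, so a convolutional model scanning the matrix sees the same motif in many spatial contexts.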

3. Figure 1 is confusing. Does Fig 1B involve data enhancement through self-replicating? If so, how is a sequence converted to a matrix with different rows? Subfigures CDEF can be consolidated into a single figure. Additionally, the species information in the prediction in Figure 1H needs clarification—are these species details part of the training data or the model? The authors need to re-draw this overview figure to make it clear.

The author’s answer: ① Yes, Figure 1B includes data self-replication to achieve data augmentation; the embedding module then converts the data from sequences to matrices. Each row of sequence data is transformed into a matrix after self-replication and the embedding module. The relevant details are explained thoroughly in Section 2.3, the embedding module. (Page 5)

② The species information in Figure 1H indicates that, after the model is trained on the training data, which consists of DNA methylation sequence data from different species, the species category can be determined directly.

③ To make Figure 1 clearer, we have made slight adjustments to it. (Page 4)

4. How was the training / testing split performed? Why are they equal?

The author’s answer: The training and test sets have an equal number of samples because this is how they are set up in the iDNA-MS database: for all species, the training set and the test set contain the same number of samples. This facilitates cross-validation, making the experimental design simpler and more consistent. Other papers, such as iDNA-ABF and iDNA-MS, also set up their training and test sets in this manner.

5. The motivation for presenting the logo plot of methylation in Figure 2 is not clear. The authors should provide more context. Otherwise, I would suggest removing Figure 2.

The author’s answer: We utilized the WebLogo tool to illustrate the DNA methylation sequence patterns of different species, as shown in Figure 2. From the figure, it is noticeable that there are significant differences in the positions and extents of base enrichment for various methylation types within the genome. This provides a theoretical basis for further research into models for DNA methylation site recognition. It demonstrates that sequences are not random or devoid of distinct features; rather, DNA sequences from different species exhibit unique characteristics. This concept is also elaborated upon in the article. (Page 11)

6. For Figure 5, if the datasets are uncorrelated, a bar plot is suggested instead of a line plot.

The author’s answer: The datasets used by the model in Figure 5 are correlated, as they all come from the same source dataset. Therefore, we used a line plot. (Page 16)

7. In the transfer learning section, authors should include the performance of baseline methods (directly trained on RNA dataset without fine-tuning)

The author’s answer: Thank you for pointing out this issue. We have added these experiments. (Pages 21-22)

Attachments
Attachment
Submitted filename: Respond to Reviewers.docx
Decision Letter - Li Chen, Editor

iDNA-ITLM: An interpretable and transferable learning model for identifying DNA methylation

PONE-D-23-33766R1

Dear Dr. Ren,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Li Chen

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

Formally Accepted
Acceptance Letter - Li Chen, Editor

PONE-D-23-33766R1

PLOS ONE

Dear Dr. Ren,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Li Chen

Academic Editor

PLOS ONE

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.