How large is the universe of RNA-like motifs? A clustering analysis of RNA graph motifs using topological descriptors

Rui Wang; Tamar Schlick

doi:10.1371/journal.pcbi.1013230

Peer Review History

Original Submission
January 10, 2025
18 Mar 2025 Decision Letter - Ilya Ioshikhes, Editor, Mingfu Shao, Editor

PCOMPBIOL-D-25-00058
How Large is the Universe of RNA-Like Motifs? A Clustering Analysis of RNA Graph Motifs Using Topological Descriptors
PLOS Computational Biology

Dear Dr. Wang,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days (by May 18, 2025, 11:59 PM). If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org.

When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.
* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission.
Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Mingfu Shao, Ph.D.
Academic Editor
PLOS Computational Biology

Ilya Ioshikhes
Section Editor
PLOS Computational Biology

Journal Requirements:

1) Please ensure that the CRediT author contributions listed for every co-author are completed accurately and in full. At this stage, the following authors require contributions: Rui Wang. Please ensure that the full contributions of each author are acknowledged in the "Add/Edit/Remove Authors" section of our submission form. The list of CRediT author contributions may be found here: https://journals.plos.org/ploscompbiol/s/authorship#loc-author-contributions

2) We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type 'LaTeX Source File' and leave your .pdf version as the item type 'Manuscript'.

3) Please provide an Author Summary. This should appear in your manuscript between the Abstract (if applicable) and the Introduction, and should be 150-200 words long. The aim should be to make your findings accessible to a wide audience that includes both scientists and non-scientists. Sample summaries can be found on our website under Submission Guidelines: https://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-parts-of-a-submission

4) Please upload all main figures as separate Figure files in .tif or .eps format. For more information about how to convert and format your figure files please see our guidelines: https://journals.plos.org/ploscompbiol/s/figures

5) Thank you for stating "The code and data for the feature and clustering algorithms are available at the public repository PSGRNA-Clustering. The RNA inverse folding using dual graph representations package is available at Dual-RAG-IF." Please provide direct links in the online submission form to access the datasets.

6) Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.
   1) State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Reviewers' comments:

Reviewer's Responses to Questions

Reviewer #1: This work explores the space of all possible dual graphs (a coarse-grained representation of RNA structures) with the number of vertices ranging from 4 to 9. These structures are clustered into two groups which are later labeled as RNA-like or not-RNA-like using a handful of known RNA structures. Several commonly used clustering algorithms were tried using statistical features generated from the eigenvalues of the graph Laplacians on a sequence of subgraphs (thresholded by edge weights). The study suggests that ~46% of the graphs could correspond to RNA structures. It also provides a landscape for designing novel RNA motifs.

I mainly have questions on (1) the reliability of the main conclusion and (2) utilizing the problem setup (small number of features used with interpretable methods) to include interpretations of the model.

1. According to the description in 2.2.1, it seems that “persistence” is never used. A simpler description without unnecessary jargon would make it more readable, especially for readers who are not familiar with the concept of persistence. For example, “exploring the eigenvalues of combinatorial Laplacians at different frames of a filtration.”

2. Related to the previous point, there is also no need to include the parameter $q$ and the $q$th-order persistent Laplacian, as only standard graph Laplacians are used.

3. The interpretation of the main result could use more clarification. I have the following questions:

   (1) It is hard to tell from the scatter plots in Figure 3 whether there is separation between the two clusters. The method of choice, K-means, works well on datasets with clear separation between clusters. For example, if the dataset is drawn from a 1D Gaussian, K-means with k=2 will just divide the dataset into two, one part below the mean and one above the mean. The ~50% RNA-like and the significantly different percentage from GMM make me wonder if the ~50% is due to applying K-means to a unimodal distribution. It would be helpful to also visualize the distribution as density plots and potentially in different embeddings, such as UMAP, etc.

   (2) Conceptually, it is possible for the ground truth to be, for example, 20% of RNA-like structures. However, the current validation cannot distinguish between the two (vs. the predicted 46%). It would be helpful to include the additional metrics (like homogeneity scores) in the performance table in the main manuscript for a more comprehensive assessment of the performance. Also, is there any resource, direct or indirect, on known negative samples? If such results exist, including specificity could be very helpful to clarify this point. In addition, if some negative samples are available, it is also interesting to explore semi-supervised learning, which may outperform the current approach where separation of data and label assignment are done in two sequential steps.

4. I find the interpretation in 3.3 interesting. Why is Betti-1 used here but not in the clustering analysis (2.2.1)? It would be interesting to see whether the patterns identified from a few example graphs (Figure 4) apply to the entire dataset. Can you visualize the average Betti curves across all graphs of each cluster? Some statistical tests for distinguishing the curves from the two groups are useful to formally state the observation. Additionally, interpretations of the 18 features should be included and presented in a similar manner.

5. Can the authors verify if the subpanels of Figure 3 are correct? At least for the top left panel, it looks like much more than 139 points (total possible graphs for V4&5).

6. Section 2.2.1 states 18 features while 3.1.1 states 19 features.

7. I appreciate the property summary of dual graphs in 2.1.1. Can you elaborate more on how the listed properties (non-surjective, non-injective) affect the implication of the clustering analysis results?

Lastly, I am curious about why the performance is generally worse on smaller graphs. Could including higher-order Laplacians lead to improvements?

Reviewer #2: This study employs clustering analysis using graph data mining techniques to represent RNA structures through an "RNA-Like" graph-topological framework. However, several concerns arise:

1. Graph Representation of RNA Structures

   A. The authors treat RNA structures as graphs, with double-stranded helical stem regions represented as nodes. However, considering all base-pairing regions as equivalent vertices may introduce bias in the graphical representation of RNA structures. For instance, a short 5-nt stem-loop structure may have a completely different function from a 100-nt long double helix in living cells. Yet, both structures could share an identical graphical representation under this approach, potentially oversimplifying key structural distinctions.

   B. It is commendable that the authors categorized graph topologies into three groups: existing, RNA-Like, and non-RNA-Like, based partly on prior structural knowledge. However, RNA structures exhibit high dynamic variability. Less frequently observed conformations may be misclassified as non-RNA-Like due to overfitting to a limited existing dataset. Additionally, current RNA secondary and tertiary structure prediction tools have inherent technical limitations, which may further affect the accuracy of the graphical representation. Clarifying these limitations in the manuscript is important.

2. Distinction Between ‘RNA-Like’ and ‘Non-RNA-Like’ Graph Classes

   A. While the authors discuss that biological RNA topologies are more likely to contain distinguishing subgraphs, a more comprehensive explanation and comparison between RNA-Like and non-RNA-Like graph classes is necessary. This should incorporate insights from clustering, persistent subgraph generation (PSG), or other data mining approaches.

   B. If persistent Betti numbers (Betti 0 and Betti 1) can be mapped back to specific biological or structural scenarios, the interpretation will be more intuitive. For instance, if one category consistently retains more connected components, this should be explicitly analyzed and explained.

********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes

******

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility: To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

https://doi.org/10.1371/journal.pcbi.1013230.r001
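Editorial illustration (not part of the review or the manuscript): Reviewer #1's point 3(1), that K-means with k=2 applied to a single 1D Gaussian simply splits the data near the mean into two roughly equal halves, can be checked with a minimal sketch. The function name `two_means_1d`, the sample size, and the random seed are assumptions chosen for this example only; the manuscript's actual features and clustering code are not reproduced here.

```python
import random
import statistics


def two_means_1d(values, iterations=100):
    """Lloyd's algorithm with k=2 on scalar data.

    Returns the two cluster centers and the decision boundary
    (the midpoint between the centers).
    """
    # Deterministic initial centers: the smallest and largest sample.
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        boundary = (lo + hi) / 2.0
        # Assignment step: each point goes to its nearest center.
        left = [v for v in values if v <= boundary]
        right = [v for v in values if v > boundary]
        # Update step: each center moves to the mean of its points.
        new_lo, new_hi = statistics.fmean(left), statistics.fmean(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # assignments no longer change
        lo, hi = new_lo, new_hi
    return (lo, hi), (lo + hi) / 2.0


# Hypothetical unimodal data: one standard Gaussian, no real cluster structure.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]

centers, boundary = two_means_1d(samples)
fraction_upper = sum(v > boundary for v in samples) / len(samples)
print(f"centers={centers}, boundary={boundary:.3f}, "
      f"upper-cluster fraction={fraction_upper:.3f}")
```

Under these assumptions the boundary lands close to the sample mean and each cluster receives about half of the points, even though the data contain no real cluster structure. This is why a near-50% split alone does not establish that two groups exist, and why the reviewer asks for density plots and additional metrics.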
Revision 1
13 May 2025 Author Response Attachments Attachment Submitted filename: ReplyReviewers.pdf https://doi.org/10.1371/journal.pcbi.1013230.r002
12 Jun 2025 Decision Letter - Ilya Ioshikhes, Editor, Mingfu Shao, Editor

Dear Dr. Wang,

We are pleased to inform you that your manuscript 'How Large is the Universe of RNA-Like Motifs? A Clustering Analysis of RNA Graph Motifs Using Topological Descriptors' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Mingfu Shao, Ph.D.
Academic Editor
PLOS Computational Biology

Ilya Ioshikhes
Section Editor
PLOS Computational Biology

*********************************************************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed all my comments and I believe this substantially improved manuscript is suitable for publication. Just one more optional comment: I still think it is confusing and unnecessary to include the concept of persistence, as it is not used at all in this work. This to me is the same as calling an ODE a special case of a PDE with one independent variable throughout a paper that only discusses ODEs. The generalization to persistence cases can be mentioned in the discussion section but it is not necessary to use this term throughout the manuscript. That being said, this does not affect the main biochemical message and I respect whichever the authors choose.

Reviewer #2: The authors have addressed my comments.

******

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

******

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: Yes: Yiliang Ding

https://doi.org/10.1371/journal.pcbi.1013230.r003
Formally Accepted
Acceptance Letter - Ilya Ioshikhes, Editor, Mingfu Shao, Editor

PCOMPBIOL-D-25-00058R1
How Large is the Universe of RNA-Like Motifs? A Clustering Analysis of RNA Graph Motifs Using Topological Descriptors

Dear Dr Wang,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

https://doi.org/10.1371/journal.pcbi.1013230.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.