Deciphering the RRM-RNA recognition code: A computational analysis

Joel Roca-Martínez; Hrishikesh Dhondge; Michael Sattler; Wim F. Vranken

doi:10.1371/journal.pcbi.1010859

Peer Review History

Original Submission: July 12, 2022
24 Sep 2022 Decision Letter - Shi-Jie Chen, Editor, Nir Ben-Tal, Editor Dear Prof. Dr. Vranken, Thank you very much for submitting your manuscript "Deciphering the RRM-RNA recognition code: A computational analysis" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. 
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Shi-Jie Chen Academic Editor PLOS Computational Biology Nir Ben-Tal Section Editor PLOS Computational Biology ********************* Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Roca-Martinez and coworkers have performed a computational analysis of RNA recognition motifs (RRMs) and RRM-RNA complexes in an effort to develop a scoring method for predicting and evaluating the probability of interaction between canonical RRMs and single-stranded RNA. The authors use available sequence and structure information to obtain individual scoring matrices for commonly observed interacting positions in RRM and RNA sequences (identified through multiple-sequence alignments), describing the preference of different nucleobase types to interact with different residue types. While the question of understanding the physicochemical underpinnings of RNA-protein interactions and predicting and sculpting their sequence determinants is extremely timely and important, the comments below should be addressed in detail before the suitability of the manuscript for publication can be adequately assessed. Major comments 1. The description of the RRM-RNA scoring approach (p. 6), the very heart of the manuscript, is unclear and sloppy. Equation 2 is not consistent with the text and a proper explanation of the symbols and indices used is missing (denominator in the first term different from text, fn not explained, index i in denominator in the first term different from index J in the text etc.). Also, the explanations are given in an incomplete way e.g. 
the sentence “…is related to the number of times adenines interact with any other amino acid residue in position beta1-1” is missing the crucial qualifier “adenines at position 1”. This makes it hard to comprehend how the scores were actually calculated. 2. More importantly, the motivation and the physical foundation of the scoring function are not adequately explained. Centrally, the scores do not consider the frequency of amino-acid residues observed at a specific position (see e.g. the second term in Eq. 2), which makes the score asymmetric with respect to nucleotides and residues. In the example given on p. 6, the score should depend on the frequency of arginines interacting with adenines as well as with other nucleotides, but this is not included. 3. Also, the scoring function resembles the standard quasi-chemical approach for defining knowledge-based potentials (Miyazawa, S. and Jernigan, R.L., Macromolecules, 18, 534-552 (1985)), but with important differences. Namely, the authors here normalize the number of occurrences of a given event (e.g. presence of a nucleotide at a given site interacting with a given residue) by the number of all events other than that event (e.g. the number of interactions of that nucleotide with all the other residues, except the one in question) and not by the total number of all events (e.g. the number of interactions of that nucleotide with all residues). Why is this? The authors are motivated by the GOR method for analyzing secondary structural propensities, but it is not clear that the same formalism is applicable here: in the GOR method one analyzes the linkage between an object (amino acid) and its property, while here one analyzes the propensity of two objects (amino acid and nucleotide) to co-occur in the same context (i.e. contact). This is related to the asymmetry discussed in point 2. 4. 
The rather extensive literature on contact-based statistical potentials for nucleic-acid/protein interactions should be adequately cited and discussed (see, for example, Donald et al., Nucleic Acids Res., 2007, 35, 1039–1047, or Tuszynska et al., BMC Bioinformatics, 2011, 12, 348, and others). 5. The authors refer to their randomized test set as a negative test set (p. 7). As there is no guarantee that many members of this set are not actual binders, the naming should be changed to something like “background set” or “randomized set”, but certainly not “negative set”. More critically, randomization was only done on the side of the RNA sequences (change of 1 nucleotide in the sequence) and not on the side of the RRM sequences. This relates to the asymmetry of the whole approach as discussed above and must be properly defended. 6. Defining clusters as all complexes that have a certain similarity score with at least 25% of complexes in the cluster uses quite a low cutoff (p. 5). Of course, if one increases the cutoff, one risks not having sufficient samples for adequate statistics. The authors should defend the choice of their cutoff by providing quantitative evidence that it does not overly impact the scores, i.e. the qualitative features of their method. 7. For the validation of their scoring method the authors analyze two experimentally studied examples, while extensive data on RRM binding motifs obtained by different experimental methods exist and are not used. See for example the RNAcompete results (Ray, D., Kazan, H., Cook, K. et al., A compendium of RNA-binding motifs for decoding gene regulation, Nature, 499, 172–177 (2013)) or the ATtRACT database (PMID: 27055826). The authors should validate their results on as extensive a set of experimental data as possible. Minor comments 1. On p. 
9, the authors state that “The unbiased number of observed contacts in the training set that is used to calculate the scores is also shown in the preference matrices (Figure 8 B,C), below each of the scores”. However, these numbers are not integers, so it is unclear what they actually refer to. 2. Numbers in Figure 2 are not fully consistent with the text (1263 instead of 1259, 20 instead of 19; p. 3 and p. 4). 3. It is stated that the "alignment for the RRM-RNA structures" is available in Dataset S6 (p. 4). However, Dataset S6 contains the RRM-RNA similarity matrix. 4. In the caption of Table 1 (p. 28), the authors state that “The symbols reflect the score change after the E87N mutation”; however, there are no symbols. Reviewer #2: The paper describes the construction of a statistics-based potential for scoring RRM domain interactions with specific RNA sequences. The authors use a structural model of binding to identify contacting residues and then create sequence-based interaction statistics. The authors acknowledge that the method is limited to already known binding modes of RNA to RRM domains, or ones very close to those. The authors validate the approach on a leave-out training set as well as a few novel cases. The problem is still open, but there is progress. I think the paper can be published. It would be interesting to compare the approach with an AF-like approach for modeling protein-RNA interactions (see the link below). Also, can protein-RNA models from this approach be used to create alignments? https://www.biorxiv.org/content/10.1101/2022.09.09.507333v1.full.pdf Reviewer #3: The manuscript by Roca-Martinez et al. on RNA-recognition motifs discusses a scoring method to estimate binding between an RRM and a single-stranded RNA; the method aims to predict RRM-binding RNA sequence motifs from the RRM protein sequence. The authors adopt a simpler statistical approach, rather than the deep learning employed in several existing methods, for better interpretability. 
Interesting results on the discrimination of high-affinity RNAs with the UAG core motif from lower-affinity RNAs are reported. While the reported method serves a useful purpose towards the overall task of deciphering the RNA recognition code of RRMs, there are a number of significant issues: 1. The score described by Equation 2 is not explained and the physical underpinning cannot be found. The two summed log terms are equivalent to the logarithm of the product of two ratios. But it is not clear what this means, and why does it make sense? Does it model some empirical binding? Conservation? Not clear. Furthermore, why does the denominator in the first term come to be f_n - N_i,R_J? This is not understandable. 2. There are numerous places where the method development depends on visual inspection. This raises serious issues of reproducibility. 3. The model appears to be rather restrictive: it works only for cluster 0, and no binding mode change can occur. 4. The negative test should be strengthened and should include other entries if possible. Other issues: 1. p.4. "the number of unique positions between both nucleotides" It is not clear if the authors meant positions only in A, only in B, or both? 2. Will the results be sensitive to the specific threshold of 5 Å? Minor issues 1. Figures seem to be jumping around in order, which makes it difficult to go back and forth. ****** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. 
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ****** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?** For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. 
Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols https://doi.org/10.1371/journal.pcbi.1010859.r001
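Reviewer #1's major comment 3 above contrasts two ways of normalizing a contact count: dividing by the count of all other events (as the reviewer describes the authors' approach) versus dividing by the total count of all events (as in the quasi-chemical approach). The following is a minimal sketch of that distinction only. It is not the authors' Equation 2, which is not reproduced in this record; the function names and the toy counts are hypothetical and chosen purely for illustration.

```python
import math


def log_ratio_vs_rest(n_event: int, n_total: int) -> float:
    """Log of the event count over the count of all OTHER events.

    Hypothetical helper: mirrors the 'all events other than that event'
    normalization described in Reviewer #1's comment 3.
    """
    if not 0 < n_event < n_total:
        raise ValueError("need 0 < n_event < n_total")
    return math.log(n_event / (n_total - n_event))


def log_ratio_vs_total(n_event: int, n_total: int) -> float:
    """Log of the event count over the count of ALL events (a plain frequency)."""
    if not 0 < n_event <= n_total:
        raise ValueError("need 0 < n_event <= n_total")
    return math.log(n_event / n_total)


# Hypothetical counts of one nucleotide type contacting three residue types
# at a single aligned position (toy numbers, not taken from the manuscript).
toy_counts = {"Arg": 30, "Lys": 10, "Phe": 60}
total = sum(toy_counts.values())

for residue, n in toy_counts.items():
    print(
        f"{residue}: vs rest {log_ratio_vs_rest(n, total):+.3f}, "
        f"vs total {log_ratio_vs_total(n, total):+.3f}"
    )
```

The two quantities differ by -log(1 - n/N), so the gap grows as a single event dominates the counts. This may be one way to see why the reviewer asks for the choice of denominator to be justified.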
Revision 1
9 Nov 2022 Author Response Attachments Attachment Submitted filename: response_to_reviewers.pdf https://doi.org/10.1371/journal.pcbi.1010859.r002
7 Jan 2023 Decision Letter - Shi-Jie Chen, Editor, Nir Ben-Tal, Editor Dear PhD Vranken, We are pleased to inform you that your manuscript 'Deciphering the RRM-RNA recognition code: A computational analysis' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Shi-Jie Chen Academic Editor PLOS Computational Biology Nir Ben-Tal Section Editor PLOS Computational Biology ********************************************************* Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The authors have significantly revised the manuscript and have adequately addressed all of my concerns from the first round. Reviewer #3: The authors have made changes to improve the manuscript. However, a number of important issues have not been addressed adequately. 
I still find it difficult to understand the origin of Eqn 2, and the physical basis remains unclear. Furthermore, while I appreciate why the authors are using visual inspection for verification, the issue of reproducibility largely remains. ****** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #3: Yes ****** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?** For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #3: No https://doi.org/10.1371/journal.pcbi.1010859.r003
Formally Accepted
17 Jan 2023 Acceptance Letter - Shi-Jie Chen, Editor, Nir Ben-Tal, Editor PCOMPBIOL-D-22-01066R1 Deciphering the RRM-RNA recognition code: A computational analysis Dear Dr Vranken, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Zsofia Freund PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol https://doi.org/10.1371/journal.pcbi.1010859.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.