Peer Review History

Original Submission: October 30, 2023
Decision Letter - Nir Ben-Tal, Editor, Rachel Kolodny, Editor

Dear Yang,

Thank you very much for submitting your manuscript "Benchmarking uncertainty quantification for protein engineering" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, we apologize for the long time it took, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Rachel Kolodny

Academic Editor

PLOS Computational Biology

Nir Ben-Tal

Section Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This paper is about quantifying the ability of different protein sequence-based models to accurately predict uncertainty across different protein design data regimes and modalities. The authors tried different sequence representations, model architectures, and uncertainty quantification methods across the FLIP benchmark tasks. Although the amount of experimentation in this work is impressive, the paper is unfortunately written in a way that mostly lists those dense results, making it difficult to extract insight or grasp the relationships between them. Moreover, some of the paper's claims would benefit from stronger statistical support.

**Major comments**

The paper is results-dense and would benefit from some curation of metrics and/or models to make it easier to digest. For example, I suspect that RMSE and correlation are highly correlated, as are the calibration plot and the uncertainty correlation, yet each has its own distinct figure. Some of these results could be moved to the supplementary material.

The figures are hard to read and interpret: they tend to contain too much information and use space inefficiently, repeating the same axes and legends until they fill the panel. For example, the bar plots in Figure 4 are hard to read, especially panel B, where bars for models with close-to-zero correlation appear to be missing.

Can the authors comment on the apparent strong relationship between the coverage and width metrics for the same task across different models? It suggests that the models are not quantifying uncertainty better on a per-point basis, but are simply more confident overall. A simple re-scaling of the uncertainties on a calibration set drawn from the training data might make those trends and claims vanish.
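The re-scaling the reviewer alludes to can be illustrated with a minimal sketch: fit a single multiplicative factor on a held-out calibration set so that the scaled uncertainties match the observed error magnitudes. This assumes Gaussian predictive uncertainties and uses synthetic data; it is not the paper's implementation, only an illustration of the reviewer's suggestion.

```python
import numpy as np

def fit_scale(y_cal, mu_cal, sigma_cal):
    """Fit one multiplicative factor s so that s * sigma matches the
    observed errors on a calibration set (via the mean squared z-score).
    s > 1 means the model was overconfident; s < 1, underconfident."""
    z2 = ((y_cal - mu_cal) / sigma_cal) ** 2
    return np.sqrt(z2.mean())

# Hypothetical calibration-set predictions: true error std is ~1,
# but the model claims std 0.5 (i.e., it is overconfident).
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=500)
mu = y + rng.normal(0.0, 1.0, size=500)
sigma = np.full(500, 0.5)

s = fit_scale(y, mu, sigma)       # should come out near 2
sigma_rescaled = s * sigma        # apply the same factor to test-set sigmas
```

Because the correction is a single global scalar, it changes coverage and width for every model uniformly without improving per-point uncertainty ranking, which is exactly why it would expose the trend the reviewer describes.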

Claims that some models are better than others on a task are hard to support without statistical testing. The fact that different starting seeds for training the CNN give similar performance across Figure 4 suggests that some of the trends in the previous figures could also vanish. Likewise, the active learning experiments with the random acquisition function indicate that simply re-sampling the training data could be used to estimate confidence intervals on the metrics and support claims about relative model performance. Can the authors re-sample their training data to obtain better estimates of their metrics? Can they also use at least 3-4 different starting points in BO and active learning, so that the initial 10% does not dictate the conclusions?
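The confidence intervals the reviewer asks for can be obtained with a standard percentile bootstrap over the evaluation data, sketched below on synthetic predictions (the metric, data, and function names are illustrative, not from the manuscript):

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) confidence interval for any
    metric, by resampling (label, prediction) pairs with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = metric(y_true[idx], y_pred[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

rmse = lambda yt, yp: np.sqrt(np.mean((yt - yp) ** 2))

# Hypothetical predictions whose errors have std ~0.3.
rng = np.random.default_rng(1)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.3, size=200)
lo, hi = bootstrap_ci(y_true, y_pred, rmse)  # CI around ~0.3
```

Overlapping intervals for two models would then signal that a claimed ranking between them is not statistically supported.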

Can the authors comment on the expected relationship between the different metrics? It is not clear why the miscalibration curve should be plotted against the RMSE.

**Minor comments**

The authors do cite their uncertainty metrics, but given how central they are to the paper, can they describe them in more detail?

I might have missed these details, but what data are used for active learning and BO? Is it the designed AAV set or just the random one? Would it make more sense to start from the sampled set and then sample from the designed one? (Or maybe that is already the case.)

In the active learning experiments, do the models become more confident as they receive more data?

A summary table of the rankings of selected models across tasks would be helpful, since the paper has so many results that it is hard to keep track of performance across the different regimes and modalities. Ideally, these rankings would be derived from robust statistical testing.

Some curves in Figure 6 are not visible because they are hidden behind others.

Reviewer #2: This paper considers the problem of uncertainty prediction for the problem of protein engineering. The paper reports extensive experimentation, with the goal to evaluate different types of predictors, and specifically evaluate their uncertainty predictions. The experiment is conducted in three different types of tasks, with varying degree of domain shift between the training data and test data. The uncertainty prediction is evaluated using different metrics, and through two downstream applications, namely, active learning and Bayesian optimization. The results of the experiments show that:

1. There is no single method that performs better across all tasks.

2. Current uncertainty predictors are not always useful for downstream tasks like active learning (where they are useful in some cases), and Bayesian optimization (where a greedy, uncertainty-agnostic approach performs better).

3. Representing proteins using ESM language model embeddings rather than one-hot encoding improves results in some cases, but not all.

The main conclusion is that uncertainty predictors cannot be assumed to work out of the box for protein engineering tasks and need to be further developed, and/or carefully evaluated per task.

Strengths:

1. The paper performs an extensive and thorough evaluation of the methods under different varying conditions, and gives both a detailed description of each setup and the big picture of the status of current methods.

2. The conclusion is a useful and practical contribution to the community that often considers relying on such predictions of uncertainty.

Weaknesses:

1. The paper does not present a novel model or evaluation methodology and is a relatively straightforward implementation of various experiments. In my view, this should not prevent the paper from being accepted, as the experimentation is thorough and therefore valuable to the community.

2. The results do not show a clear winning method that can directly inform practitioners on ways to improve their research on downstream tasks. However, as I mentioned above, there is value in empirically demonstrating the limitations of current models, which should serve as a warning for practitioners of downstream tasks and an invitation to researchers to pursue further work on uncertainty prediction.

3. A few issues were not clear to me; see the questions below. I believe that fixing those issues would make the paper publishable.

Questions:

1. The representation and CNN architecture are not clear to me. What dimension exactly is being averaged in the ESM embeddings? In the end, what is the dimension of the protein representation, both for ESM and for one-hot encoding? Is it constant, or does it vary with sequence length? Why do you need a CNN to process this representation, as opposed to a fully connected MLP? Is there some local invariance or smoothness property that should be captured by convolutions? If so, over what dimension?

2. It is not clear from the description in Section 4.6 whether all methods use the same number of samples in evaluation. If this is the case, it should be stated; if not, it should be discussed and justified.

3. Some methods were not fully described; for example, it was mentioned that the evidential CNN uses a loss termed L^R, without elaborating.

4. Section 4.7 is confusing in listing the different metrics. First, it describes MAE and R^2 metrics, which I failed to see in any of the experiment results. Second, it states that there are four uncertainty metrics, but I could only count three (ρ_unc, coverage, AUCE). Third, the description of the coverage states that the range is 4σ/R rather than 4σ; however, in Figure 3 these are shown on different axes (coverage vs. width/R). I'm not sure what's going on there.
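For context on the point of confusion, the two quantities the reviewer contrasts can be sketched as follows, assuming Gaussian predictive uncertainties: coverage counts targets inside a ±2σ (i.e., 4σ-wide) band, while width is that band's size normalized by the label range R. This is a plausible reading of the definitions under discussion, not the paper's verified code.

```python
import numpy as np

def coverage_and_width(y_true, mu, sigma, y_range):
    """Coverage: fraction of targets inside the +/- 2*sigma interval.
    Width: mean band width 4*sigma, normalized by the label range R."""
    inside = np.abs(y_true - mu) <= 2.0 * sigma
    return inside.mean(), (4.0 * sigma / y_range).mean()

# A perfectly calibrated Gaussian predictor should give coverage ~0.954.
rng = np.random.default_rng(2)
mu = rng.uniform(-2.0, 2.0, size=2000)
sigma = np.full(2000, 1.0)
y_true = mu + rng.normal(0.0, 1.0, size=2000)
cov, width = coverage_and_width(y_true, mu, sigma, y_range=8.0)
```

Under this reading, coverage is dimensionless by construction while only the width needs the 1/R normalization, which would explain the separate axes in Figure 3.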

Some general remarks:

1. The fact that the linear model and GP perform better than the CNN suggests that there may not be enough training data for a deep learning method to work well. Are there larger datasets that could be tested?

2. I'm not sure there is a way to do this better than with the thorough experimental setup already presented, but in the evaluation through downstream tasks (active learning and Bayesian optimization) there is still some conflation between the quality of the predictions and the quality of the uncertainty estimates. It might be beneficial to compare against a predictor whose uncertainty is computed exactly from ground-truth data; this could serve as a kind of upper bound on performance.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Revision 1

Attachments
Attachment
Submitted filename: protein_uq_plos_compbio-reviewer_responses.pdf
Decision Letter - Nir Ben-Tal, Editor, Rachel Kolodny, Editor

Dear Yang,

Thank you very much for submitting your manuscript "Benchmarking uncertainty quantification for protein engineering" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we would like to accept this manuscript for publication, provided you make the small changes the reviewer requested.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Rachel Kolodny

Academic Editor

PLOS Computational Biology

Nir Ben-Tal

Section Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The author has addressed most of my comments.

Regarding the statistical evidence discussion, I do understand that creating new splits is its own endeavor, and this paper is not about FLIP but about uncertainty quantification. However, since the authors already did the hard work of training 5 models per split with different seeds, can they also report the standard deviation for all metrics in the supplementary table? This would give the reader some context on how different models perform relative to each other in a given setting.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?


Reviewer #1: Yes

**********


Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No


References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Revision 2

Attachments
Attachment
Submitted filename: protein_uq_plos_compbio-reviewer_responses2.pdf
Decision Letter - Nir Ben-Tal, Editor, Rachel Kolodny, Editor

Dear Yang,

We are pleased to inform you that your manuscript 'Benchmarking uncertainty quantification for protein engineering' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Rachel Kolodny

Academic Editor

PLOS Computational Biology

Nir Ben-Tal

Section Editor

PLOS Computational Biology

Feilim Mac Gabhann

Editor-in-Chief

PLOS Computational Biology

Jason Papin

Editor-in-Chief

PLOS Computational Biology

***********************************************************

Formally Accepted
Acceptance Letter - Nir Ben-Tal, Editor, Rachel Kolodny, Editor

PCOMPBIOL-D-23-01757R2

Benchmarking uncertainty quantification for protein engineering

Dear Dr Yang,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Dorothy Lannert

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.