STREAK: A supervised cell surface receptor abundance estimation strategy for single cell RNA-sequencing data using feature selection and thresholded gene set scoring

Azka Javaid; Hildreth Robert Frost

doi:10.1371/journal.pcbi.1011413

Peer Review History

Original Submission
December 6, 2022
21 Feb 2023 Decision Letter - Mark Alber, Editor, Elena Papaleo, Editor Dear Ms. Javaid, Thank you very much for submitting your manuscript "STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations. In particular, all the reviewers gave useful comments to improve the overall presentation and results which could strengthen the work. Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. 
Please don't hesitate to contact us if you have any questions or comments. Sincerely, Elena Papaleo, PhD Academic Editor PLOS Computational Biology Mark Alber Section Editor PLOS Computational Biology ********************* A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: In this work, the authors present a supervised method to estimate receptor abundance on individual cells from scRNA-seq data, which is trained on joint scRNA-seq/CITE-seq data. The authors benchmark their method against other state-of-the-art methods designed to address the same problem and show that theirs works better. The problem the authors are trying to solve is relevant given the recent interest in single-cell multiomics technologies, and the method is well explained and makes sense. However, I found the presentation of the actual results confusing: • For example, several plots rely on the “proportion of abundance profiles that have the highest Spearman rank correlation” metric, which is a hard description to parse. It turns out that this is the number of receptors with the highest correlation with CITE-seq data compared to the other estimation methods, which explains trends that are otherwise puzzling, like some of the bars going down with increasing numbers of cells in Fig 2. Also, the main text uses the word “proportion” but the plots use “frequency”, and actually show absolute numbers in their y axes. If this metric is to be used, it should be explained better and with more consistent terminology. • Furthermore, figs. 3-6 show that in several cases, even though STREAK has higher performance, the difference is not that large. This makes it questionable to use a metric based on which method has the highest correlation.
In other words, there is a loss of information of the actual correlation values in figs. 2, 7, 8, and 9 that I think is important. • The inclusion of the different numbers of cells in Figs 2, 7, and 9 is interesting in principle, but no insight is extracted or discussed in the text. Fig. 8 is the exception, where the different numbers of cells used for training have a clear purpose. But even then, the point of using different numbers of cells for their test dataset is unclear. • Figs 3-6 contain tables with every estimation for every receptor in every dataset, which makes it hard to parse or extract any insights from them. • Finally, the plot text (axis and tick labels) is too small, which makes it impossible to read on printed paper and requires a lot of zooming on a computer. I would recommend that the authors use correlation vs correlation scatter plots, with each point corresponding to a receptor and the x and y axes showing the Spearman correlations between CITE-seq and each one of the estimation methods under study. The position of the dots can be compared with the x=y diagonal to see which method is overall better for most receptors, and numerical information about the correlation values is not lost. This makes for a more succinct comparison of correlation performance metrics, and maybe the number of figures and panels can be reduced. For an example, see Fig. 1b of https://www.nature.com/articles/s41592-021-01252-x. Finally, while the selection of competing methods against which this work is compared is adequate, the authors avoid comparing against PIKE-R2P because, they claim, the code was not available. As of this writing the code seems to be available at https://github.com/JieZheng-ShanghaiTech/PIKE-R2P. Therefore, I would ask the authors to perform this comparison.
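The distinction Reviewer #1 draws above, between counting the receptors for which a method has the highest Spearman correlation and looking at the per-receptor correlation values themselves, can be illustrated with a minimal Python sketch. This is not code from the manuscript or from the STREAK package. The data are synthetic, and the names `method_a`, `method_b`, and `receptor_1` to `receptor_4` are hypothetical placeholders. The Spearman correlation is computed here as the Pearson correlation of average ranks, using only the standard library.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def per_receptor_correlations(cite, estimates):
    """Correlate each method's estimate with the CITE-seq values, per receptor."""
    return {
        method: {r: spearman(cite[r], est[r]) for r in cite}
        for method, est in estimates.items()
    }


def win_counts(corr):
    """Count, per method, the receptors on which it has the highest correlation."""
    methods = list(corr)
    receptors = list(next(iter(corr.values())))
    counts = {m: 0 for m in methods}
    for r in receptors:
        best = max(methods, key=lambda m: corr[m][r])
        counts[best] += 1
    return counts


# Synthetic stand-in data (hypothetical): two methods with similar noise levels.
random.seed(0)
n_cells = 200
receptors = ["receptor_1", "receptor_2", "receptor_3", "receptor_4"]
cite = {r: [random.gauss(0, 1) for _ in range(n_cells)] for r in receptors}
estimates = {
    "method_a": {r: [v + random.gauss(0, 0.5) for v in cite[r]] for r in receptors},
    "method_b": {r: [v + random.gauss(0, 0.6) for v in cite[r]] for r in receptors},
}

corr = per_receptor_correlations(cite, estimates)
counts = win_counts(corr)

# One (x, y) point per receptor, as in a correlation vs correlation scatter plot.
pairs = [(r, corr["method_a"][r], corr["method_b"][r]) for r in receptors]

print("win counts:", counts)
for r, rho_a, rho_b in pairs:
    print(f"{r}: method_a={rho_a:.3f} method_b={rho_b:.3f} diff={rho_a - rho_b:+.3f}")
```

The win counts collapse each receptor to a single winner, while the paired values keep the size of the difference. Plotting the pairs against the x=y diagonal would give the kind of scatter plot the reviewer describes.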
Reviewer #2: I read this paper entitled “STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring” with great interest, as it proposes a new solution to the challenges associated with estimating receptor abundance for scRNA-seq target data. The authors have developed a new supervised receptor abundance estimation method called STREAK for single-cell transcriptomics data. This method is an improvement over their previous unsupervised method SPECK, as it leverages associations learned from joint scRNA-seq/CITE-seq training data and uses a thresholded gene set scoring mechanism. The authors evaluate STREAK against both unsupervised and supervised methods on three joint scRNA-seq/CITE-seq datasets and conclude that it outperforms other techniques and provides a more biologically interpretable and transparent statistical model. I commend the authors for the introduction of a new and improved method for cell surface receptor abundance estimation. The use of joint scRNA-seq/CITE-seq training data to leverage associations and improve the accuracy of the estimation is one of the strengths. In addition, the authors have carried out a detailed evaluation of the method on multiple datasets and demonstrated its superiority over other techniques. I have the following questions/suggestions for the authors: 1. How logical and fair is the comparison with other techniques given that those methods may have different assumptions and limitations, particularly the comparison between supervised and unsupervised techniques? 2. The evaluation of the method is limited to three joint scRNA-seq/CITE-seq datasets representing two different tissue types and using two evaluation strategies. In my opinion, a wider range of data and a larger number of datasets should be used to validate the generalizability and robustness of the method.
Can we use or expand the same method to other tissue types and bigger datasets? 3. I would appreciate it if the authors could prepare and highlight the comparison of different methods in tabular format as a main table. Reviewer #3: In the manuscript "STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring", the authors report on a new supervised receptor abundance estimation method. The manuscript and the tool are of definite interest and I have only minor comments: 1. Would it be possible to apply this tool to generate accurate estimates of cell surface receptor abundance for tissue-specific single cell data beyond blood / immune cells? 2. The authors evaluated STREAK against the Random Forest (RF) algorithm, but RF heavily depends on feature selection. Can the authors evaluate the data with at least one more machine learning algorithm, such as SVM or GBM, to confirm STREAK's superiority over other machine learning methods? 3. The authors say that they did not compare STREAK against PIKE-R2P as it is not on CRAN, but the R code is available on GitHub. The comparison should be conducted. 4. The CDF in the manuscript uses a gamma distribution, but single cell data usually follow a negative binomial distribution. Is there an advantage to using a gamma distribution? ******** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository.
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No: Package can be found by searching on CRAN. I would recommend a direct link and a copy on GitHub or similar if possible. Reviewer #2: Yes Reviewer #3: Yes ****** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.
Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. https://doi.org/10.1371/journal.pcbi.1011413.r001
Revision 1
22 Apr 2023 Author Response - Attachment (submitted filename: STREAK-CommentIntegration-AJ-HRF.pdf) https://doi.org/10.1371/journal.pcbi.1011413.r002
20 Jun 2023 Decision Letter - Mark Alber, Editor, Elena Papaleo, Editor Dear Ms. Javaid, Thank you very much for submitting your manuscript "STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations from Reviewer 1. Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. 
Sincerely, Elena Papaleo, PhD Academic Editor PLOS Computational Biology Mark Alber Section Editor PLOS Computational Biology ********************* A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Thanks to the authors for addressing my comments. I still have one important concern and a few presentation issues: • The authors mention that they were unable to run PIKE-R2P from its GitHub repo. However, the main text still says “We do not perform comparison against PIKE-R2P since its implementation does not currently exist on the Comprehensive R Archive Network (CRAN) or on alternative code repository platforms”, which is untrue. The authors should update this statement and describe, briefly but precisely, what issues they had with the software. • From the description of their algorithm in the methods section, it seems that the VAM method only uses X_R but not A. This is important since the authors claim that one advantage of their method is that the gene weights can be interpreted, or manually tuned to reflect biological knowledge. Can the authors clarify where A is used? • Minor clarification: In step 2 of their summary of VAM, M[,k] is supposed to be a column matrix with dimensions (m2, 1) but the matrix product is X_k^T (I_g sigma^2)^-1 X_k, which is (m2, g)^T (g, g) (m2, g) = (g, m2)(g, g)(m2, g), which doesn’t make sense for a matrix product. Is this wrong, or is there a typo? • While I appreciate that the bar plots have been simplified, I’m concerned that some of them do not show all the bars that they should. For example, the MALT dataset in Figure 2 does not include bars for the RNA and RF methods. From looking at the supplementary figures, it seems that their value is zero, but this is not obvious from the figures.
Maybe the authors can modify the plots to make this more explicit, or at the very least include a note in the figure legend saying that all bars unshown correspond to a value of zero. • The authors say that heatmaps use bold text to indicate receptors where STREAK estimates are higher than other methods. But this is 1) really hard to see, 2) from what I can tell there are several receptors in which STREAK performs best that are not bolded. Example: CD11b-2 in Figure 9a. Can you clarify? • Minor: In Figure 13, the y axis is frequency instead of proportion. Reviewer #2: I have carefully reviewed the revised manuscript titled "STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring" along with responses to my previous comments, and I recommend accepting it for publication in PLOS Computational Biology. I appreciate the thoroughness with which the authors addressed my concerns and incorporated the suggested revisions. The revised manuscript now provides a clear and comprehensive description of the STREAK method for estimating cell surface receptor abundance in single-cell RNA-sequencing data. Reviewer #3: No further comments ******** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. 
participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ****** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: Yes: Dr. Vikas Sharma Reviewer #3: No https://doi.org/10.1371/journal.pcbi.1011413.r003
Revision 2
27 Jun 2023 Author Response - Attachment (submitted filename: STREAK-CommentIntegration-AJ-HRF-June26.pdf) https://doi.org/10.1371/journal.pcbi.1011413.r004
7 Aug 2023 Decision Letter - Mark Alber, Editor, Elena Papaleo, Editor Dear Ms. Javaid, We are pleased to inform you that your manuscript 'STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Elena Papaleo, PhD Academic Editor PLOS Computational Biology Mark Alber Section Editor PLOS Computational Biology *********************************************************** https://doi.org/10.1371/journal.pcbi.1011413.r005
Formally Accepted
17 Aug 2023 Acceptance Letter - Mark Alber, Editor, Elena Papaleo, Editor PCOMPBIOL-D-22-01790R2 STREAK: A Supervised Cell Surface Receptor Abundance Estimation Strategy for Single Cell RNA-Sequencing Data using Feature Selection and Thresholded Gene Set Scoring Dear Dr Javaid, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Zsofi Zombor PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol https://doi.org/10.1371/journal.pcbi.1011413.r006

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.