Haisu: Hierarchically supervised nonlinear dimensionality reduction

Kevin Christopher VanHorn; Murat Can Çobanoğlu

doi:10.1371/journal.pcbi.1010351

Peer Review History

Original Submission
December 7, 2021
6 Apr 2022 Decision Letter - Dina Schneidman-Duhovny, Editor
Dear Dr Cobanoglu,
Thank you very much for submitting your manuscript "Haisu: Hierarchically Supervised Nonlinear Dimensionality Reduction" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.
We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.
When you are ready to resubmit, please upload the following:
[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).
Important additional instructions are given below your reviewer comments.
Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.
Thank you again for your submission.
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely,
Dina Schneidman
Software Editor
PLOS Computational Biology
*********************
Reviewer's Responses to Questions
Comments to the Authors:
Please note here if the review is uploaded as an attachment.
Reviewer #1: The authors proposed an interesting approach for conducting nonlinear dimensionality reduction by incorporating hierarchical labels and tested its performance on three single-cell RNA-sequencing datasets (embedded with commonly-used packages like t-SNE, UMAP, and PHATE). The concept of the proposed technique is generally straightforward and well described, for example in investigating the embedding effect of TA1 cells without labels and in explaining how to determine the hierarchical distancing factor (str), and the main findings are validated with multiple strategies. The authors modify the premier visualization-centric techniques in this field, which has significant merit and, if fully validated, may become a popular tool for handling the overwhelming and rapidly increasing amount of single-cell sequencing data. However, I do have some concerns before considering publication:
1. If the hierarchical label information is inserted into the step of “feature selection”, will the pipeline generate the same or similar results? Moreover, if the user-defined hierarchical information (called a prior by the authors in the manuscript) is applied for dimensionality reduction, would it lead to some leakage of labeling information into subsequent unsupervised clustering?
2. One of the major advantages of performing single-cell sequencing is to break through the current hierarchical understanding of cell differentiation. Here it is applied as prior knowledge; would it become an obstacle to uncovering new insights?
3. It seems that the selection criteria for hierarchy relationships in the three datasets are different; will this treatment cause biased outcomes? Also, is there any principled guidance to help determine the most appropriate hierarchy labels for independent datasets? Line 339 on Page 16: “However, the effect of the hierarchy can easily be modified for a desired effect.” How can users be helped to avoid this trap?
4. Line 354 on Page 17: “We found its effect to be highly dependent on the characteristics of the dataset and the size/structure of the input hierarchy.” I wonder which characteristics of a new dataset will affect the size/structure of the input hierarchy, and how.
5. Regarding the definition of the distance modification function in Results (Lines 109-112 on Page 7): is it in effect equivalent to the equation in Lines 421-424 on Page 18?
Reviewer #2: Real world data often lies in low-dimensional manifolds embedded in high-dimensional spaces, with high codimensions (manifold hypothesis). It is generally a hard task to obtain projections from high-dimensional spaces to visualizable spaces, e.g. of dimension 2 or 3, such that the properties that characterize the low-dimensional embedding are preserved. There are several techniques employed for dimensionality reduction, typically divided into linear and nonlinear classes. The latter class includes t-SNE, UMAP and PHATE. In the article under consideration, the authors propose a method that modifies the distances used in t-SNE, UMAP and PHATE by means of user-defined hierarchies. This method results in a modification of the projection that reduces the dimensionality, parametrized by a strength factor decided by the user. The main focus of this approach is on biomedical data, and the authors provide experiments on single-cell RNA sequencing datasets. The article is of value, and it should be considered for publication in the journal PLOS Computational Biology.
However, the article currently needs significant revisions, as described below. The main issue lies in the fact that the method is overall not well explained, and some natural questions regarding the relation between t-SNE, UMAP, PHATE and the corresponding modified versions proposed in this article are not properly discussed. Also, the organization of the article does not seem to be ideal, and it would be greatly beneficial to restructure it.
-- The subsection named “Hierarchically Supervised NLDR” is not working at the moment. There are several issues with the exposition. For instance, the setting is not well explained, and it can only be guessed by the reader. There is a high dimensional input dataset, but it seems that the input should also consist of a hierarchy graph. This is mentioned at the end of the first paragraph (line 100), but it should be manifest from the beginning. It is not clear what “variable distance function” means here. It seems that variable here means “not constant”. However, distances cannot be constant, as this would violate the triangular axiom. So, one might deduce that variable here means something else, and the authors are trying to convey that the distance satisfies this property of being variable. Also, the term “distance modification” appearing in line 98 is unclear at this point. It is clarified (implicitly) towards the end of the manuscript, in the Methods section, but it should not be left unclear for so long. The graph “G” seems to be the hierarchy graph mentioned above. This should be explicitly stated. Similarly, the term “strength” in line 106 does not correspond to any canonical notion in the computational sciences. The meaning can be deduced afterwards from line 110, but again, this is not helping the reader. More generally, in this first subsection, the relation between theta_{ij} and the methodology is completely unclear.
It is only understood that it modifies the distance, but it is not clear how until the last section of the article. This subsection could be merged with the last section for clarity, and rewritten. If the authors prefer to discuss the technicalities after showing the results, this could be done too. A brief general description without using equations could be added as the first subsection of the article, and a detailed account (current first subsection and last section merged together and better exposed) could appear at the end. The current situation lies somewhere in between.
In line 115, the use of s.t. (such that) does not seem to follow the traditional practice in mathematics. It seems rather that the authors meant to say “so that”. But s.t. means “such that”, which is also indicated by a vertical bar in sentences using the formalism of mathematical logic.
In line 121, it seems that the quantity “m” could be obtained via min, max or other types of aggregation. The authors say that the probability could be taken to be min, max or other types of aggregation. This does not seem to be correct. Also, “m” seems to depend on i,j, so this dependence should be explicitly shown in the notation because “m” is not a parameter common to all x_i, x_j.
-- The Benchmark subsection is overall well written. However, the tests that have been performed to evaluate the ability of this method to preserve the structure of the embeddings generated by t-SNE, UMAP, PHATE do not seem to be complete. The results regarding the relative positioning of points found here (line 269 on) are encouraging, but the aforementioned methods were motivated by deep analytical, geometrical and topological reasons which are not considered here. It would be reasonable to expect that, as the strength coefficient increases, the perturbation of the distances increasingly alters the geometric structure until the latter is lost.
If on the one hand it would be unreasonable to expect a full understanding of these mechanisms in this article, it is on the other hand good to provide at least a perspective on this matter, as it can be an important factor in the fine tuning of the parameters for specific applications.
-- There is an issue with considering the “Effect of HAISU on t-SNE, UMAP, and PHATE”, as in the last section of the article. Until now, no general understanding of the method has been provided to the reader. So, as far as the reader knows, up to now, HAISU is a method that alters t-SNE, UMAP, and PHATE to produce better embeddings. In fact, the only explanation regarding the way the coefficients theta_{ij} described at the beginning of the article influence the projection is provided in this section. In other words, this is the actual definition of the method, not the analysis of the effect of the method on t-SNE, UMAP, and PHATE. The method is not considered in generality, even though some experiments on PCA are also provided to compare to linear methods. One could only guess how the definition goes in the case of PCA. From the code one might get a clearer understanding of how to apply HAISU in general, but it is not ideal.
-- The right hand sides of the equations in lines 419 and 422 seem identical. It seems that there is a factor of theta_{ij} in line 419 that should not appear.
******
Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository.
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: No:
******
PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Zheng Wang
Reviewer #2: No
Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.
Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.
Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols
https://doi.org/10.1371/journal.pcbi.1010351.r001
Revision 1
9 Jun 2022 Author Response
Attachment (submitted filename): ResponseToReviewers.docx
https://doi.org/10.1371/journal.pcbi.1010351.r002
3 Jul 2022 Decision Letter - Dina Schneidman-Duhovny, Editor
Dear Dr Cobanoglu,
We are pleased to inform you that your manuscript 'Haisu: Hierarchically Supervised Nonlinear Dimensionality Reduction' has been provisionally accepted for publication in PLOS Computational Biology.
Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.
Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.
IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.
Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.
Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.
Best regards,
Dina Schneidman
Software Editor
PLOS Computational Biology
*********************************************************
Reviewer's Responses to Questions
Comments to the Authors:
Please note here if the review is uploaded as an attachment.
Reviewer #1: The authors have addressed my concerns.
******
Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
******
PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Zheng Wang
https://doi.org/10.1371/journal.pcbi.1010351.r003
Formally Accepted
15 Jul 2022 Acceptance Letter - Dina Schneidman-Duhovny, Editor
PCOMPBIOL-D-21-02208R1
Haisu: Hierarchically Supervised Nonlinear Dimensionality Reduction
Dear Dr Cobanoglu,
I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.
The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.
Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.
Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!
With kind regards,
Zsofia Freund
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
https://doi.org/10.1371/journal.pcbi.1010351.r004
