A variational autoencoder trained with priors from canonical pathways increases the interpretability of transcriptome data

Bin Liu; Bodo Rosenhahn; Thomas Illig; David S. DeLuca

doi:10.1371/journal.pcbi.1011198

Peer Review History

Original Submission
May 19, 2023
5 Aug 2023 Decision Letter - Mark Alber, Editor, Greg Tucker-Kellogg, Editor

Dear Dr. DeLuca,

Thank you very much for submitting your manuscript "A variational autoencoder trained with priors from canonical pathways increases the interpretability of transcriptome data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We would like you to respond to the specific feedback on major limitations that should be addressed before resubmission. The most detailed comments (from reviewer 3) include issues regarding the validation and benchmarking of performance, as well as issues of pathway representation from the selected pathways used in the development of the VAE. These issues should be addressed directly. Reviewers #1 and #2 raise other significant issues that need to be addressed.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Greg Tucker-Kellogg, PhD
Guest Editor
PLOS Computational Biology

Mark Alber
Section Editor
PLOS Computational Biology

*********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: Here, the authors propose the use of autoencoder variations to increase interpretability of deep learning in transcriptomics data. More specifically, the novelty of this paper is the addition of canonical biological pathways as priors to the beta-variational autoencoder model. The authors reasoned that the ability to tune the prior weights with the beta hyperparameter ties the latent space to the curated biological pathways and hence provides better biological interpretability. While the conceptual premise seems interesting, there are a few aspects I would like to see further explored:

- Table 2 showed that the introduction of beta drastically increases the reconstruction loss and reduces the KL divergence.
However, the relationship between the beta hyperparameter and the two losses warrants more investigation. For example, how does the beta hyperparameter scale with respect to each loss? This may allow for a more fine-tuned approach to selecting the optimal amount of trade-off mentioned in the paper.

- The authors should expand on the classification metrics as well as provide statistical tests for these metrics. Even though the authors attempted to justify the worse model accuracies with the poorer reconstruction losses, it remains unclear why the beta-prior models are underperforming compared to the simple counterparts. This argues against the premise and warrants more investigation. Providing more classification metrics may provide an insight into why the models are worse off. This also ties into the previous point: if the reconstruction loss scales with the beta hyperparameter, it will be evident in the classification metrics.

- Even though the argument for having biologically-relevant priors is logically sound, it requires more testing to determine if the effect is true. A good sanity check would be to introduce random biological sets and see if it reduces the model’s classification performance. Another potential check would be to see if there are specific genes that are heavily influencing the models. For example, by only using subsets of the gene sets, one can test for the robustness of the biological priors.

Other comments:

- While I agree that gene-level features are sufficient, I would like to see the effect of different normalization methods on the gene-level features. It is well known that different normalization methods can affect the downstream RNA-seq data analysis. As such, the authors should consider different normalization approaches.

- While the premise of the paper is to prove the utility of biologically-informed priors, it might be useful to see if different pathway definitions from different databases may result in different results.
- The scope of disease classification tasks should be expanded to include other types of cancers and diseases to test the statistical robustness of the models. Different physiological conditions manifest as different statistical properties of the data distributions observed. As such, it will be prudent to find out what type of statistical distribution causes the model to perform better or worse.

- A good sanity check would be to see if the same pathways are derived when doing conventional differential gene expression analysis with the same classification groups. This may provide an insight into the advantages of using a machine-learning approach shown in this paper.

Reviewer #2: The authors proposed autoencoders to compute latent representations for transcriptome signals. Five autoencoders (simpleAE, simpleVAE, Beta-SimpleVAE, PriorVAE, and Beta-PriorVAE) were compared for the assessment. The problem in the manuscript is interesting, but there are several major concerns before publication.

- The approaches need to be clarified. The data processing and pipeline for the transcript-level input, gene-level input, and community-level input are not clear. I strongly encourage the authors to improve the method.

- What is the definition of community-level input? Why is it important? And for the community detection, why is a KNN-based graph the best solution? Why not network inference methods?

- Is the autoencoder trained on each pathway or on the whole gene level?

- What is the justification for the chosen number of nodes and number of layers?

- I am wondering if there are very similar works published before against which to compare the performance directly, rather than a simple comparison with simple autoencoders.

Reviewer #3:

Summary

This study develops an innovative approach to identifying biological processes undergoing perturbation from transcriptome data using variational autoencoders.
The study achieves this through the incorporation of biological priors to direct the VAE networks to learn representations of transcriptomes that are based on biological concepts and could in principle be easier to interpret biologically. The authors show that both a simple fully-connected VAE and the novel prior-informed VAE can learn reduced representations of transcriptomes from high dimensions to 50 latent dimensions. These representations retain meaningful biological information that enables accurate reconstruction. While the fully connected VAE outperforms the prior-informed VAE in disease and organ classification tasks, it lacks directly interpretable latent dimensions. The prior-informed (beta-weighted) VAE not only solves the benchmark tasks but also provides semantically accurate latent features that map to biological pathways.

Overall, the work presented is innovative conceptually and methodologically. The main drawbacks include a lack of rigor in benchmarking performance of the autoencoders, as only a single metric (precision) is used to compare the models without consideration of the tradeoff to recall. There is a need to assess the biological properties of the reconstructed transcriptomes related to the input datasets, especially an understanding of any relationships between reconstruction error and transcript levels of specific genes, pathways or tissue source/disease state of the input samples. The ability of the prior-informed VAE to identify biological pathways that discriminate tissue source or disease state should also be compared to pathways that would be identified using conventional differential gene expression analysis and pathway enrichment. The validation of the prior-informed VAE's ability to enhance interpretability is weakened by the use of a set of 50 pathways as the source of priors for the latent space, as the pathways are not comprehensive or representative of a wide range of biological processes.
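
The concern about reporting precision without recall can be made concrete with a small sketch. The counts below are invented purely for illustration and are not taken from the manuscript, and the helper name `precision_recall` is hypothetical. The example shows how a classifier can reach 100% precision while still missing most positive samples.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    A ratio with an empty denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts (an assumption for illustration only):
# 10 diseased samples are flagged, all correctly, but 90 are missed.
precision, recall = precision_recall(tp=10, fp=0, fn=90)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this hypothetical case the precision is perfect even though only one in ten positive samples is recovered, which is why recall and the precision-recall curve are requested alongside precision.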
Strengths

- The approach to leverage VAE to identify differentially expressed biological pathways is innovative.

- The approach addresses some key limitations of conventional approaches for differential gene expression analysis, such as GSEA, that require lists of well defined differentially expressed genes.

- Unlike previous differential gene expression analysis approaches such as limma, DESeq2 and Seurat, the use of the prior-informed VAE does not assume linearity across samples.

- The approach developed using autoencoders to enhance interpretability is distinct from other approaches that have been developed in the field using VAE. Unlike prior approaches that correlate phenotypes to latent features post-hoc or constrain the networks using known gene-pathway associations, the prior-informed VAE in this study enables the networks to leverage biological priors while still giving them freedom to learn relationships among genes.

Major Limitations

- The authors need to perform a more detailed analysis of the samples where a given input transcriptome was not the one most correlated with its reconstructed output, to understand the underlying mechanisms. This is to ensure future users can understand contexts in which the reconstruction may become unreliable.

- There is a need to assess any biases in reconstruction error at the gene and pathway level. For example, are there some genes or pathways whose transcript levels are more prone to higher error in reconstruction?

- There is a need to assess reconstruction performance based on highly vs lowly expressed genes, as biases in these gene groups could impact biological outcomes typically based on gene expression.

- The tissue source/treatment of the inputs needs to be fully described in order to have a better understanding of the biological relevance of the results.
Labeling the samples with intuitive names will help in determining whether the observed correlations across tissues also capture tissue relationships.

- A comparison of the classification performance using differential gene expression or expression profiles of each sample directly would also be informative and useful to the field to show any advantages and disadvantages.

- Assessment of the performance needs to include recall and the area under the precision-recall curve (AUPRC), along with the associated curves. Without knowledge of the recall in relation to the precision, the utility of the high precision scores is low (including in the case where precision was 100%).

- The results of differential analysis of features across conditions need to be compared to conventional approaches, e.g. GSEA. For example, when comparing adenocarcinoma vs. healthy tissue, what are the differentially expressed pathways obtained using typical methods (e.g. GSEA)?

- The pathways captured by the 50 dimensions are not comprehensive, so it seems to be a big biological flaw to try to reduce the transcriptome dimensionality to these 50 pathways and use those results for interpretability. It would be better to use more comprehensive ontologies (e.g. GO) or pathways (KEGG), or to modify the prior-informed VAE to include a dimension for "Unknown pathways".

Minor Limitations

- It would be useful to show how well the network-based clustering that was performed to identify representative genes is representative of the full transcriptome.

- It would be insightful if the heatmaps of the inputs and outputs were visualized separately in addition to as shown in Fig. 2. A separate visualization figure for inputs vs. outputs would help to demonstrate whether tissue correlations based on reconstructed data recapitulate the tissue correlations based on inputs. This could also be visualized using tSNE.

- Supplementary Fig. 1 is missing heatmaps for simple VAE and beta-simple VAE transcript associations.

******

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No: There is a link to github repository but the page is not functional

******

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Geoffrey H. Siwo

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool.
If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please n

https://doi.org/10.1371/journal.pcbi.1011198.r001
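
Reviewer #1's question about how beta scales against the two losses refers to the weighted objective commonly used for beta-VAEs, in which the total loss is the reconstruction loss plus beta times the KL divergence between the approximate posterior and the prior. The following is a minimal sketch of that trade-off, assuming one-dimensional Gaussian posteriors and priors per latent dimension. The manuscript's exact loss and its construction of pathway-based priors may differ, and the function names and numbers here are hypothetical.

```python
import math


def gaussian_kl(mu_q: float, sigma_q: float, mu_p: float, sigma_p: float) -> float:
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for one dimension."""
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
        - 0.5
    )


def beta_weighted_loss(recon_loss: float, kl_terms: list[float], beta: float) -> float:
    """Total loss = reconstruction loss + beta * (sum of per-dimension KL terms)."""
    return recon_loss + beta * sum(kl_terms)


# Hypothetical values (assumptions for illustration only): two latent
# dimensions, one whose posterior mean is shifted away from the prior mean
# and one that matches the prior exactly.
kl_terms = [
    gaussian_kl(mu_q=1.0, sigma_q=1.0, mu_p=0.0, sigma_p=1.0),
    gaussian_kl(mu_q=0.0, sigma_q=1.0, mu_p=0.0, sigma_p=1.0),
]

# Sweep beta to see how strongly the KL term weighs against reconstruction.
for beta in (0.0, 1.0, 4.0):
    print(beta, beta_weighted_loss(recon_loss=2.0, kl_terms=kl_terms, beta=beta))
```

For a fixed reconstruction loss the KL penalty grows linearly in beta. During training, however, both terms shift as the model adapts, so the empirical scaling the reviewer asks about would have to be measured by sweeping beta on real data.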
Revision 1
10 Mar 2024 Author Response

Attachment
Submitted filename: Response the reviewers.pdf

https://doi.org/10.1371/journal.pcbi.1011198.r002
11 Jun 2024 Decision Letter - Mark Alber, Editor

Dear Dr. DeLuca,

We are pleased to inform you that your manuscript 'A variational autoencoder trained with priors from canonical pathways increases the interpretability of transcriptome data' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Mark Alber, Ph.D.
Section Editor
PLOS Computational Biology

*********************************************************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: I am satisfied with the new revision.

******

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

******

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

https://doi.org/10.1371/journal.pcbi.1011198.r003
Formally Accepted
26 Jun 2024 Acceptance Letter - Mark Alber, Editor

PCOMPBIOL-D-23-00803R1

A variational autoencoder trained with priors from canonical pathways increases the interpretability of transcriptome data

Dear Dr DeLuca,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Livia Horvath
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

https://doi.org/10.1371/journal.pcbi.1011198.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.