A practical application of generative adversarial networks for RNA-seq analysis to predict the molecular progress of Alzheimer's disease

Jinhee Park; Hyerin Kim; Jaekwang Kim; Mookyung Cheon

doi:10.1371/journal.pcbi.1008099

Peer Review History

Original Submission
December 14, 2019
29 Feb 2020 Decision Letter - Hugues Berry, Editor, William Stafford Noble, Editor

Dear Dr. Cheon,

Thank you very much for submitting your manuscript "A practical application of generative adversarial networks for RNA-seq analysis to predict the molecular progress of Alzheimer's disease" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Your revision must address all the issues raised by the reviewers, including the points related to reproducibility, robustness, and software and data file documentation. Moreover, please be sure to improve the presentation of GANs to a non-expert audience, and to justify their application for gene expression classification.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.
Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Hugues Berry
Associate Editor
PLOS Computational Biology

William Noble
Deputy Editor
PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: This paper presents an application of generative adversarial networks (GANs) to predict the molecular progress of Alzheimer’s disease. It is a topic of interest to researchers in this related field, but the paper needs very significant improvement.
1. In general, there is a lack of explanation of the GAN method used in this study. The authors should provide more detailed information about GANs to make the model clearer for the reader.
2. The results are a little hard to follow for someone who is not an expert in this field or who has not closely read the methods.
3. Are the results robust? For example, can different data augmentation methods or parameters lead to large changes in the results?

Reviewer #2: Summary: The authors present an in silico study of the molecular mechanism of Alzheimer's disease progression from wild type to disease state using publicly available mouse model RNAseq data.
They raise the problem of lacking sufficient experimentally generated data to study various stages of the disease progression and propose to address it by producing synthetic data using generative adversarial networks (GANs), which have become popular deep learning methods in fields such as image analysis and are being adapted for biological studies. In the manuscript, the authors first demonstrate the data generation process, then analyse the introduced disease transition curves in the context of disease-related pathways. The authors perform pathway enrichment analysis using interpolated data of up- and down-regulated genes and suggest the hypothesis of mutual regulation between amyloid beta production and cholesterol biosynthesis. The manuscript is supported by the analysis code provided via the GitHub repository https://github.com/KBRI-Neuroinformatics/GAN-for-bulk-RNAseq.

I recommend that the paper be accepted after the authors have addressed the issues stated in the major and minor comments sections. My major concern is the reproducibility of the results demonstrated in the article. Major and minor comments are provided below.

Major comments:
1. It is not currently mentioned whether a power analysis and gene effect sizes were taken into account when identifying significantly differentially expressed genes.
2. It is also not clear whether the 1208 DEGs were differentially expressed only between 7MWT vs 7MAD or whether another comparison was used to identify these genes.
3. It is not quite clear what causality the authors intend to discover on p.4 line 96 and how it is reflected in the results part.
4. P.5 line 112; p.19 line 430. Please clarify whether, during the augmentation procedure, the interpolation was performed between samples belonging to one group or from different groups. The Figure 1 caption is missing an explanation of what s1...s2 is.
5. Please clarify the choice of data generated at the 100k epoch for the comparison with the original augmented data instead of data generated at the 25k epoch, when convergence has already been reached.
6. Model performance evaluation metrics such as precision, recall and F1 score are not presented.
7. Please explain why only 846 samples were generated by GANs.
8. P.8 line 186 - p.9 line 193. It is quite challenging to evaluate the claims that the authors make about the gene-specific features due to the construction of the sentences. Figure S3, which is referred to as S3A and S3B, in fact does not have A and B parts, a legend or a caption.
9. It is also not clear how collective behavior in the gene association network is connected with the weight parameters of the last layer in the generator.
10. P.9 line 209. It is not clear why only transition curves from 7M WT to 7M AD were selected for further analysis. Also, the caption of Figure 3(A-B) is misleading, i.e. it does not state that only transition curves from 7M WT to 7M AD were plotted. Please clarify what is meant by “original data points for all months” in Figure S4.
11. For transition curve classification, the authors assume the existence of 6 predefined disease trajectories. It is not clear whether this assumption is based on previously confirmed domain knowledge or whether the authors propose their own hypothesis. It is also not clear what the TYROBP causal network is and how it was identified.

Software:
1. The software is not sufficiently documented to fully reproduce the analysis. I encourage the authors to add user guidelines and requirements (e.g. python 3, etc.) to the README file.
2. Instead of referring to the original dataset GSE104775 from the dedicated repository, the authors provide their own composed data file GSE104775_RLD_OurDEG_ADWT_bM100_augmentation_rev.npz, which is not accompanied by a description of its content. I suggest that the authors add a corresponding description of the data file deposited at the GitHub repository.
3. The file GAN_for_bulkRNAseq.ipynb does not provide the differential expression analysis that is the first step of the analysis described in the manuscript. I encourage the authors to add the missing analysis scripts so that the results can be fully reproduced, starting from downloading the raw data, preprocessing and differential expression analysis.
4. Analyses of the GSE90693 and GSE125583 data sets used for the validation of biological findings are missing from the project repository.

Minor comments:
1. In order to increase the reproducibility of the results, I suggest that the authors share the software on GitHub with a specified release version, accompanied by a code DOI. A DOI can be obtained from a DOI-providing repository such as Zenodo or similar. This would provide permanent access to a usable instance of the published code. Code with an assigned DOI may be formally cited in future publications.
2. Finally, I noticed that the study source code at https://github.com/KBRI-Neuroinformatics/GAN-for-bulk-RNAseq was not available under an open source license. An open source license is essential to allow other researchers to modify and reuse the code. I recommend that the authors release the software under a permissive open source license, such as the MIT License. See https://help.github.com/en/github/creating-cloning-and-archiving-repositories/licensing-a-repository.
3. I recommend that the authors carefully read their text and pay attention to punctuation. I found a few typos where the references were placed after the end of the sentence, e.g. p.3 line 44, 53, 54, 56; p.4 line 78, 90, 105, etc. A period is missing at the end of the sentence on p. 9 line 193. Typo on p.13 line 27.
4. I did not quite understand the meaning of the sentence on p. 1 line 20-24 in the context of the information given in the abstract.

Reviewer #3: The authors used generative adversarial networks (GANs) to capture the gene expression transition patterns from wild-type samples (WT) to Alzheimer’s disease (AD) samples.
To train the deep neural net, the authors performed data augmentation and selected differentially expressed genes as the features. Evaluation methods, including Pearson correlations between real and fake samples, the distribution of the fake samples compared to real samples, and t-SNE clustering after training convergence, were used to demonstrate the validity of the augmented samples. The authors used the latent space interpolation of the GANs and showed 6 patterns of AD progression, which provide new insights in dissecting biological (AD) pathways. In general, the manuscript is well organized and the study is rigorous, but we have a few concerns:
1. It would be more convincing to show additional t-SNE clustering results for the original 36 samples and for the augmented samples, with two questions in mind: (a) Does either of the t-SNE results show clusters similar to those in Figure 2C? (b) If the answer to question (a) is yes, then for every cluster, how many of the genes overlap with those obtained when using all samples, including real, augmented and GAN-generated samples?
2. By performing the analysis mentioned above, I would be either convinced or skeptical about the necessity of using GANs to find the patterns, though GANs can make the transition plots.
3. In the paper, the authors put forward the current problem of limited availability of biological samples. But it is not their deep learning model that solved this problem; what actually deals with it is the data augmentation step in their method. The paper may therefore make a misleading claim that the deep learning model solved the limited sample size problem.
4. The authors built a GAN model to simulate the gene expression data of 846 real expression samples. After showing that their fake data simulate the real data well, they tried to interpret the generative network. From their learned model, by using the latent space interpolation method in GANs, they were able to classify 6 patterns of gene expression change from the WT to the AD state.
Then they did downstream analysis, such as pathway analysis, to learn the pathological progress of Alzheimer’s disease. But why GANs? Traditional bioinformatics tools could also classify gene expression change patterns, since the authors have time series data for both WT and AD mice. Here the authors did not show that GANs are better than traditional bioinformatics methods, or that using GANs is a necessity.

Have all data underlying the figures and results presented in the manuscript been provided?
Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.
Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

https://doi.org/10.1371/journal.pcbi.1008099.r001
Revision 1
27 May 2020 Author Response
Attachment (submitted filename): Response_to_reviewers.docx
https://doi.org/10.1371/journal.pcbi.1008099.r002
28 Jun 2020 Decision Letter - Hugues Berry, Editor, William Stafford Noble, Editor

Dear Dr. Cheon,

We are pleased to inform you that your manuscript 'A practical application of generative adversarial networks for RNA-seq analysis to predict the molecular progress of Alzheimer's disease' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Hugues Berry
Associate Editor
PLOS Computational Biology

William Noble
Deputy Editor
PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: The issues have been addressed.

Reviewer #2: Manuscript: The revised version of the manuscript has been substantially improved over the previous submission.
In the sections that I am qualified to evaluate, I did not find any issues. The provided approach is potentially of genuine utility. Software: Inclusion of the scripts for all steps of the analysis and user guidelines for the software clearly increases the value of the work for the research community and contributes to research reproducibility.

Reviewer #3: All concerns addressed.

Have all data underlying the figures and results presented in the manuscript been provided?
Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: None
Reviewer #2: Yes
Reviewer #3: None

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

https://doi.org/10.1371/journal.pcbi.1008099.r003
Formally Accepted
16 Jul 2020 Acceptance Letter - Hugues Berry, Editor, William Stafford Noble, Editor

PCOMPBIOL-D-19-02173R1
A practical application of generative adversarial networks for RNA-seq analysis to predict the molecular progress of Alzheimer's disease

Dear Dr Cheon,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Sarah Hammond
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

https://doi.org/10.1371/journal.pcbi.1008099.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.