Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks

Peter K. Koo; Antonio Majdandzic; Matthew Ploenzke; Praveen Anand; Steffan B. Paul

doi:10.1371/journal.pcbi.1008925

Peer Review History

Original Submission August 18, 2020
10 Nov 2020 Decision Letter - Weixiong Zhang, Editor, Roger Dimitri Kouyos, Editor
Dear Dr. Koo,
Thank you very much for submitting your manuscript "Global Importance Analysis: A Method to Quantify Importance of Genomic Features in Deep Neural Networks" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.
When you are ready to resubmit, please upload the following:
[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).
Important additional instructions are given below your reviewer comments.
Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.
Thank you again for your submission.
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely,
Roger Dimitri Kouyos
Associate Editor
PLOS Computational Biology
Weixiong Zhang
Deputy Editor
PLOS Computational Biology
*********************
Reviewer's Responses to Questions
Comments to the Authors: Please note here if the review is uploaded as an attachment.
Reviewer #1: Koo et al. present a novel algorithm (Global Importance Analysis) to quantify the effect size of putative patterns in genomic data on model predictions. Further, they developed a new convolutional network to predict RNA-protein interactions (ResidualBind). I would recommend this work for publication with minor revisions.
1. You claim ResidualBind outperforms other methods on predicting RNA-protein interactions. Moreover, you mention that the predictions of other methods improve significantly when adjusting for secondary structures, while ResidualBind does not benefit. However, your description does not make clear whether you compare ResidualBind to other methods while they are adjusted for secondary structures; for example, ThermoNet might outperform ResidualBind in an adjusted setting. If this is the case and ResidualBind already includes the effects of secondary structures, it must miss effects the other methods account for. Could you theorise what these effects could be? If not, please mention (for the ones it applies to) that they are already adjusted for secondary structures.
2. In case this provides additional information, could you report the Pearson's correlation for secondary structures (PU/PHIME), stratified by the probability ("high vs low") of forming those structures (e.g. median as a cut-off, or 1st vs 4th quartile)?
3. To further prove the performance boost of ResidualBind and its ability to recognise secondary structures without explicitly teaching them, you could show the performance difference of ResidualBind vs previous methods using the 2009 set. However, I leave it to the discretion of the authors.
4. As I understand it, you compare ResidualBind with other methods while preprocessing all of them with the clip-transformation. While this might be outside of the scope of this work, did you test for the effect of log-transformation on the other methods?
5. Could you clarify if the 2009-RNAcompete dataset was preprocessed the same way as the 2013-RNAcompete dataset?
6. (Line 305) Add the table number. I assume you mean the supplementary table. Additionally, please provide the source for the CISBP-RNA database.
Reviewer #2: In this paper the authors presented (1) a residual CNN model (ResidualBind) for predicting RNA-protein binding from sequence and (2) a model interpretability approach (Global Importance Analysis) that evaluates the importance of a putative sequence feature by sampling a group of sequences from a synthetic distribution and comparing the mean prediction scores of the sequences with and without the putative feature embedded over the population. The authors claimed that this approach enables quantitative hypothesis testing and can be used for studying model-learned global interactions among patterns and functions such as an additive function or sequence context. Although the benchmarking result on RBP predictions for ResidualBind is promising, certain claims regarding the Global Importance Analysis (GIA) are not convincing enough and are only supported with ad-hoc synthetic data (which may lack generality) and limited comparison against existing interpretation methods along the same line.
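The procedure Reviewer #2 summarizes above (sample a population of synthetic sequences, embed the putative feature, and compare mean predictions with and without it) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_model` is a hypothetical stand-in for a trained network, the uniform background is only one of several possible sampling choices, and the sequence length of 41 and embedding position of 18 are illustrative assumptions (the position echoes the "location 18-24" the reviewer refers to).

```python
import numpy as np

ALPHABET = np.array(list("ACGU"))


def sample_background(n, length, rng):
    """Draw n synthetic RNA sequences with uniform nucleotide frequencies."""
    return ALPHABET[rng.integers(0, 4, size=(n, length))]


def embed(seqs, motif, position):
    """Return a copy of seqs with motif written in at a fixed position."""
    out = seqs.copy()
    out[:, position:position + len(motif)] = list(motif)
    return out


def toy_model(seqs, target="UGCAUG"):
    """Hypothetical stand-in for a trained network: counts copies of target."""
    return np.array([float("".join(row).count(target)) for row in seqs])


def global_importance(model, motif, position, n=1000, length=41, seed=0):
    """Mean prediction with the motif embedded minus mean prediction without."""
    rng = np.random.default_rng(seed)
    background = sample_background(n, length, rng)
    with_motif = embed(background, motif, position)
    return model(with_motif).mean() - model(background).mean()


# Embedding the pattern the toy model responds to gives a large effect size;
# embedding an unrelated pattern gives an effect size near zero.
score = global_importance(toy_model, "UGCAUG", position=18)
control = global_importance(toy_model, "AAAAAA", position=18)
print(f"motif: {score:.3f}, control: {control:.3f}")
```

In this sketch the embedding position and the background sampler are explicit arguments, which are exactly the design choices the reviewer's first major concern asks the authors to vary and justify.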
Major concerns:
1) Although the concept of being global and finding summary statistics over a population of sequences can be interesting, I'm concerned about the fully synthetic setup and the ad-hoc way the sequences are generated. It seems that the calculation of GIA is highly dependent on the selection of the embedding position i; however, no principled guidance on how to decide i is provided, nor is a rationale given for the choice of location 18-24. Given that the motif can appear in any location in natural sequences, forcing it to be at a fixed position seems like an unnatural design choice, not to mention the case where there might be position-specific patterns and an ad-hoc selected embedding position may easily break that pattern. On the other hand, modeling the contextual distribution is also nontrivial. Although the authors have noticed that any uncaptured distribution modes or distribution mismatch could lead to misleading interpretation (especially when there are non-linear dependencies and interaction logic), they have not provided principled and concrete guidance on how to deal with this problem. Listed options such as using a PWM or dinucleotide shuffling can easily fall into the undesired scenario, and again no rationale is provided for why a PWM is chosen for the analysis on ResidualBind. I would be more convinced if the authors at least tried multiple design choices and empirically analyzed their effect on the result.
2) The authors mentioned multiple times that GIA's global analysis over a population enables the study of feature interactions; however, this claim is not well supported by the experiment. The example on counting the number of repetitive motifs does not involve higher-order interactions such as XOR logic or epistatic interaction, and the spacing example is too simple.
Although ResidualBind is trained with real experimental data, these synthetic examples (repeating motifs and spacing) may never be present in the training data, so it is not fair to say that ResidualBind 'learnt' to do counting of motifs or spacing. Given that ResidualBind uses mean-pooling instead of max-pooling, it is not surprising to me that simply repeating a motif multiple times will result in higher activation, and one may observe a similar correlation by looking at the activation instead of the GIA score. More evidence is needed to show that this is something the model learnt and not an artifact of the CNN architecture, and that GIA is necessary for the discovery of such patterns.
3) Although the methodology appears to be novel, each of its individual components overlaps with many existing studies, and the literature review and benchmark comparison with prior methods are insufficient. For example, the contrast with sequences without an embedded motif is very similar to integrated gradients [1] and DeepLIFT [2], which compare to reference sequences. The use of multiple sequence samples to study distributional (instead of individual) feature importance was introduced in Max-Entropy [3]. Moreover, there exist multiple studies that explicitly examine feature interactions in genomic neural networks with population-level analysis [4][5][6], and they involve more comprehensive examples of interactions. The authors stated that attribution-based methods cannot provide the effect size of an extended pattern, which is not necessarily true, as the attribution scores of individual nucleotides can easily be summed to give an overall contribution score, which has been shown to be effective in several studies.
It would be more convincing if the authors could clarify the distinctions between this work and the existing literature and give the rationale for why their design choices are superior to these existing methods, as well as a quantitative comparison to existing baselines other than in-silico mutagenesis.
[1] Axiomatic Attribution for Deep Networks, ICML, Sundararajan et al., 2017
[2] DeepLIFT: Learning Important Features Through Propagating Activation Differences, ICML, 2017
[3] Maximum entropy methods for extracting the learned features of deep neural networks, PLOS Computational Biology, 2017
[4] Discovering epistatic feature interactions from neural network models of regulatory DNA sequences, Bioinformatics, Greenside et al., 2018
[5] Visualizing complex feature interactions and feature sharing in genomic deep neural networks, BMC Bioinformatics, Liu et al., 2019
[6] Deep learning at base-resolution reveals motif syntax of the cis-regulatory code, bioRxiv, Avsec et al., 2020
4) It seems that the application of GIA requires prior knowledge about the location and the sequence that needs to be analyzed. Although it is helpful for validating putative features, it may not be very useful in general cases where the underlying mechanism and feature syntax are unknown. The authors gave one example of ab initio motif discovery; however, this is only applicable to a single motif of known length and is not easily generalizable to other complex features and interactions.
Minor concerns
1) The authors referred to a bioRxiv preprint which appears to overlap largely with the current submission and should be considered a prior version of the same submission. It would be better to remove this reference, as it is confusing.
2) For the motif visualization of the top GIA k-mer with in-silico mutagenesis, it is unclear whether the L2-norm is taken w.r.t. the GIA score or the change in GIA score after introducing a mutation.
It would make more sense if it is the latter, as an L2 norm of the GIA scores of all variants does not carry information specific to the referenced wild type and thus does not contain information about that nucleotide's sensitivity. Also, the scale of global importance values in Fig 2e (-4,0) is different from that of Fig 2f (0,4).
3) There are several typos/LaTeX compilation errors in the text, e.g. line 222 p-value?0.01, and line 305 missing table.
**********
Have all data underlying the figures and results presented in the manuscript been provided?
Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: Yes
Reviewer #2: Yes
**********
PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.
Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.
Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods
https://doi.org/10.1371/journal.pcbi.1008925.r001
Revision 1
3 Feb 2021 Author Response Attachments Attachment Submitted filename: Response.pdf https://doi.org/10.1371/journal.pcbi.1008925.r002
10 Mar 2021 Decision Letter - Weixiong Zhang, Editor, Roger Dimitri Kouyos, Editor
Dear Dr. Koo,
Thank you very much for submitting your manuscript "Global Importance Analysis: An Interpretability Method to Quantify Importance of Genomic Features in Deep Neural Networks" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.
Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.
When you are ready to resubmit, please upload the following:
[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).
Important additional instructions are given below your reviewer comments.
Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely,
Roger Dimitri Kouyos
Associate Editor
PLOS Computational Biology
Weixiong Zhang
Deputy Editor
PLOS Computational Biology
*********************
A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK]
Reviewer's Responses to Questions
Comments to the Authors: Please note here if the review is uploaded as an attachment.
Reviewer #1: The authors addressed all my concerns. I recommend the manuscript for publication.
Reviewer #2: I appreciate that the authors have improved the manuscript based on prior reviews, and most of the concerns are addressed. Specifically, the authors added clarifications in the discussion that GIA mainly serves as a tool for downstream analysis of other interpretability methods. This is important, as it helps the audience identify which parts of the practice in this paper are “artificially chosen” and need customization in practice. Thus, it might be better if the authors could make this distinction clear even earlier in the paper (perhaps in the introduction or methods) whenever ‘ad-hoc’ design choices are used. It is good that the authors included 6 additional strategies for sampling synthetic sequences and showed more extensive results using max-pooling. I also appreciate the more comprehensive literature review of pre-existing attribution-based and interaction interpretability tools, although the description of [39] is a bit misleading. As far as I know, DeepResolve uses multiple optimization samples, not just “one optimization run”, to explore diverse patterns in the optimization landscape, and thus the limitation the authors mentioned could be largely alleviated. It would be better if the authors considered modifying the corresponding paragraph for correctness.
**********
Have all data underlying the figures and results presented in the manuscript been provided?
Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: Yes
Reviewer #2: Yes
**********
PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No
Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods
References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.
https://doi.org/10.1371/journal.pcbi.1008925.r003
Revision 2
23 Mar 2021 Author Response Attachments Attachment Submitted filename: Response.pdf https://doi.org/10.1371/journal.pcbi.1008925.r004
30 Mar 2021 Decision Letter - Weixiong Zhang, Editor, Roger Dimitri Kouyos, Editor
Dear Dr. Koo,
We are pleased to inform you that your manuscript 'Global Importance Analysis: An Interpretability Method to Quantify Importance of Genomic Features in Deep Neural Networks' has been provisionally accepted for publication in PLOS Computational Biology.
Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.
IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.
Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.
Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.
Best regards,
Roger Dimitri Kouyos
Associate Editor
PLOS Computational Biology
Weixiong Zhang
Deputy Editor
PLOS Computational Biology
*********************
https://doi.org/10.1371/journal.pcbi.1008925.r005
Formally Accepted
20 Apr 2021 Acceptance Letter - Weixiong Zhang, Editor, Roger Dimitri Kouyos, Editor
PCOMPBIOL-D-20-01489R2
Global Importance Analysis: An Interpretability Method to Quantify Importance of Genomic Features in Deep Neural Networks
Dear Dr Koo,
I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.
The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.
Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.
Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!
With kind regards,
Andrea Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
https://doi.org/10.1371/journal.pcbi.1008925.r006

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.