Peer Review History
Original Submission: March 6, 2024
Dear Dr Germain,

Thank you very much for submitting your manuscript "On the identification of differentially-active transcription factors from ATAC-seq data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. As you will see, the reviews are encouraging, but the reviewers raised several important issues.

In light of the reviews (below this email), we would like to invite the resubmission of a significantly revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Shaun Mahony
Academic Editor
PLOS Computational Biology

Sushmita Roy
Section Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: The study of Gerbaldo et al. describes a thorough benchmarking of methods for inferring the activity of transcription factors from differential chromatin accessibility between conditions, such as protein knockout or overexpression. The overall design and structure of the study are very careful, and the execution of the analysis seems very convincing, from exhaustive supplementary figures to the organization of the underlying data and code. I strongly support the publication of the manuscript, as, despite the narrow topic and heavily technical nature of the study, this is a great practical example of performing a comprehensive benchmarking study, which includes simulations, previously published data, and a newly generated experimental dataset. I consider the manuscript to be mostly publication-ready, although introducing minor edits could be useful here and there. Please find more details below. Of note, I am not a native English speaker, so some of my suggestions might not be relevant.

Sincerely yours,
Ivan V. Kulakovskiy
General

Throughout the text, you use the term 'motif' ambiguously: (1) to denote the model of the DNA pattern describing the TF-DNA binding specificity, and (2) to denote particular pattern occurrences in the sequences of ATAC-seq peaks. I strongly suggest differentiating the respective objects explicitly: please use 'motif' or 'motif model' to describe the pattern (e.g. the position weight matrix), and 'motif hits', 'motif occurrences', or 'motif matches' when talking about the particular 'words' yielded by sequence scanning (motif finding, pattern matching). This ambiguity is a general issue in the literature and would be great to avoid. In particular, it makes it difficult to properly understand the description of the 'network score' (line 237).

Another general problem comes with 'data not shown' statements: e.g., it is unclear what you mean by 'the best results' (line 127) or what the 'other technical sources of variation' are (lines 413-414). Please consider briefly describing the underlying data, e.g. with a few examples.

Suggested additions

While the technical results are described in a very detailed way, the Discussion could include a special section dedicated to interpreting the observed effects. First, regarding the NFKB1 TRAFTAC data: should we treat the low-to-inconsistent effect for the NFKB1 motif as a typical outcome, given what is known about its cellular role? Finding the members of the AP-1 complex among the most significant TFs linked with the differential chromatin accessibility is not surprising, and a brief comparison to the other benchmarked datasets could be informative, i.e., whether those motifs are commonly detected at the top of the list. Exploring the composition of the typical 'false positives' could also be informative in this context.

Similarly, it would be interesting to hear your interpretation of the striking differences between close family members, such as GATA1-GATA2 and RUNX1-RUNX2. Is it a direct consequence of a limited overall magnitude of change in chromatin accessibility for one of two similar TFs? Could it be linked to a difference in protein expression? This might be helpful to highlight and explain the 'difficult' datasets, especially for pairs of TFs sharing similar binding specificities and, thus, similarly affected by any issues related to motif scanning.

Motif scanning

Using FIMO with default settings is appropriate as an off-the-shelf approach, but exploring whether stricter or weaker thresholds would affect the 1-2 top-scoring methods might be informative, e.g., using the NFKB TRAFTAC data as a brief case study. This might be particularly interesting given monaLisa's solid performance and internal motif-finding procedure.

Minor comments regarding the text

The following word choices seemed strange to me, and I suggest rechecking and, if necessary, rephrasing the respective statements and/or providing more details:

- 'analytic variations' // Author summary
- 'the impacts of a novel method' // Author summary
- 'regulate a given transcriptional signature' // line 4
- 'less biased' // line 21
- 'peak samples' // line 68
- 'each rule' // lines 146-147
- 'we had some difficulties running the code' // lines 197-198; please provide some details
- 'Benjamini, Hochberg, and Yekutieli' // lines 247-248; do you mean "BH" (aka "FDR") or "BY"? The two corrections can differ substantially (see the sketch after this review).

Minor comments regarding Figures and Figure labels

Figure 1: Please provide an extra (supplementary?) table summarizing all variants of methods assessed in the benchmark (rows in panel A). The current naming scheme is not uniform and is hard to track throughout the manuscript. The FDR on panel B (Y-axis) can also be mistaken for adjusted P-values. Please also spell out the plot labels, i.e. "Adj. P-value" instead of "p.adjust".

Figure 2: Please rephrase 'true motif ranks' to reduce ambiguity. It can be read as 'ranks of true motifs' or 'true ranks of various motifs', which have dramatically different meanings in the context of your benchmarking setup.

Figure 5: 'simes' should probably be capitalized.
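To make the BH/BY distinction above concrete, here is a minimal sketch, using made-up p-values rather than values from the manuscript, of how the two corrections diverge: BY scales the BH criterion by a harmonic-number penalty, which makes it valid under arbitrary dependence between tests but noticeably more conservative.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Illustrative p-values only; not taken from the manuscript.
pvals = np.array([0.001, 0.008, 0.02, 0.04, 0.15, 0.5, 0.9])

# Benjamini-Hochberg: controls the FDR under independent
# (or positively dependent) tests.
_, p_bh, _, _ = multipletests(pvals, method="fdr_bh")

# Benjamini-Yekutieli: scales the BH criterion by sum(1/i for i in 1..m),
# valid under arbitrary dependence but more conservative.
_, p_by, _, _ = multipletests(pvals, method="fdr_by")

for p, bh, by in zip(pvals, p_bh, p_by):
    print(f"raw={p:.3f}  BH={bh:.3f}  BY={by:.3f}")
```

Which of the two is used changes which TFs pass a given significance cutoff, so it is worth stating explicitly in the manuscript.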
Reviewer #2: The manuscript titled "On the Identification of Differentially Active Transcription Factors from ATAC-seq Data" delves into the performance evaluation of TF enrichment analysis methods. The authors aimed to determine the optimal pipeline for TF enrichment by benchmarking background selection, variability normalization, incorporation of factors in multivariate linear models for differential analysis, and the combination of multiple tools to adjust p-values. It's noteworthy that quantile normalization yielded the best results for z-score normalization.

TF analysis presents a challenging question to address within a single manuscript, and it's not surprising that this study leaves several questions unanswered. Key steps in TF enrichment analysis include motif selection, definition of binding sites, fragment counting, background correction, count normalization, model selection, and enrichment analysis. While the authors covered most of these steps, there's a lack of thorough discussion of motifs. This oversight led to some expected results, as noted in lines 587-595. The authors should explain why they chose HOCOMOCO over other databases and address how motifs generated from different methods could introduce bias in TF binding site prediction. Additionally, merging similar motifs is a crucial step in motif cleaning that should not be overlooked.

It's important to recognize that there is no ground truth for TF enrichment analysis. The authors attempted to create a method for generating "true TFs," which could be a significant advancement in the field. Further discussion of the "true TFs" in the manuscript, including aspects such as the distance distribution relative to transcription start sites and co-localization among the binding sites of "true TFs," would greatly enhance the depth of the analysis.

Would it be beneficial to consolidate the software used in the comparison into a summarized table for clarity? Additionally, it's intriguing to note that none of the footprint-based methods yielded results for the "true TF." I wonder if the footprint-based methods could potentially compensate for biases in chromatin accessibility-based methods. Could it be feasible to include the footprint for the NFkB motif under both conditions in the section discussing TRAFTAC-mediated TF degradation?

Here are some minor points that need to be addressed:

1. Detailed parameters are missing in lines 62-72.
2. I couldn't find a description of the read shift in the Methods, although the authors may have mentioned it.
3. If possible, could you add the TFEA (https://doi.org/10.1038/s42003-021-02153-7) package to the comparison?
4. Would it be better to describe fold-changes by a formula in line 268? (One common convention is sketched after this review.)
5. Many figure legends are missing, such as Figure S2D and Figure S4.
6. How did the authors conclude that the two metrics show strong agreement in Figure S2D?
7. There are some typos, such as "FigS2B" in line 371 and "FigS3C-E" in line 426.
8. There seems to be a figure repeated in Fig4AB and FigS7B.
9. It's challenging to see the red circles for NFKB1 and NFKB2 in Fig5A.
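Regarding minor point 4: one common convention for an explicit fold-change formula (an illustrative assumption, not necessarily the definition used in the manuscript) is the log2 ratio of mean normalized accessibility, with a small pseudocount $c$ guarding against zero counts:

```latex
\log_2 \mathrm{FC}_i \;=\; \log_2\!\left( \frac{\bar{x}_i^{\mathrm{treated}} + c}{\bar{x}_i^{\mathrm{control}} + c} \right)
```

where $\bar{x}_i$ denotes the mean normalized fragment count in peak $i$ under the indicated condition.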
Reviewer #3: Review

The authors present a very nice benchmarking study for testing computational tools for identifying TF motif activity based on ATAC-seq data. They carefully selected a set of studies that specifically perturb only one TF and then measure ATAC-seq data. To further validate their findings, they simulated the removal of a TF by using ChIP-seq data of the TF to predict where it binds, and then specifically down-sampled the ATAC-seq signal at these loci (a toy illustration of this down-sampling idea follows this review). They have done a great job in assembling a very thoughtful ground truth, from both simulated and real data.

Major comments:

- While the authors tested the whole range of methods on the real data, they only tested a couple of methods on their simulated data. The simulated data, however, are much better controlled in terms of what effects are expected, so it would be important to show all methods on the simulated data. One caveat of the real data - especially the K/O datasets that had the TF down for 72h - is that there may be lots of indirect effects as well, so testing the methods on the simulated data would be a fairer way to compare all methods, because there the authors know exactly what to expect.
- The authors mention they ranked the TFs by fold-change; did they consider absolute fold-change? This is particularly important for TFs that may act as repressors in the real datasets (not in the simulated ones), because one would expect a loss of accessibility for repressors (as shown e.g. in the diffTF paper). Maybe the reason Runx2 is not detected by any method is that it is a repressor and accessibility is lost upon its deletion?
- diffTF should not be used in permutation mode for these small sample sizes; this is completely meaningless. It is also meaningless to increase the number of permutations to 1000, as that just means many permutations will be exactly the same.
- The authors should emphasise better that chromVAR only performs well when combined with their additional extension using the limma model; it should be very clear that chromVAR on its own (I'm assuming that's chromVAR::differentialDeviations) performs quite badly. This is important for readers to understand, so that they do not use chromVAR with its defaults.
- Many TFs have similar motifs; the authors should consider this, especially for the precision measurement in Fig 1B, but maybe also for the rank-based analysis. I could envision that an analysis similar to the network-based analysis (which is very cool, by the way!) could be done for TFs with similar motifs.

Minor comments:

- The references are not sorted according to appearance.
- The ordering in Figure 1A is not very clear; how were the methods ordered? It seems some of the well-performing methods are lower than some of the more poorly performing methods. If it is based on the sum of the ranks, maybe Runx2 should be taken out to avoid skewing based on arbitrarily high ranks?
- I'm wondering whether it is necessary to have all the variations of one method in the main figure, or whether it would be sufficient to show the default and the best-performing variant of each method in the main figure and the rest in the supplement. It would make the main Figure 1 easier to read - but this is of course a decision the authors should take.
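As a toy illustration of the down-sampling strategy summarized above, one way to simulate the removal of a TF is to binomially thin the fragment counts of peaks overlapping its ChIP-seq binding sites. The counts, the overlap flags, and the 50% effect size below are all invented for illustration and do not reproduce the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: fragment counts per ATAC-seq peak, plus flags marking peaks
# that overlap the perturbed TF's ChIP-seq binding sites (both invented).
n_peaks = 10_000
counts = rng.negative_binomial(5, 0.1, size=n_peaks)
is_bound = rng.random(n_peaks) < 0.05

# Simulate the TF's removal by binomial thinning at bound peaks: each
# fragment is kept with probability `keep`, which preserves the count
# nature of the data (unlike simply scaling the signal).
keep = 0.5
perturbed = counts.copy()
perturbed[is_bound] = rng.binomial(counts[is_bound], keep)

print("bound peaks, before:", counts[is_bound][:5])
print("bound peaks, after: ", perturbed[is_bound][:5])
```

Benchmarked methods can then be run on `perturbed` versus `counts`, where the truly affected TF is known by construction.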
**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians, and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g., participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: IVAN V KULAKOVSKIY
Reviewer #2: No
Reviewer #3: Yes: Judith Zaugg

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols
Revision 1
Dear Dr Germain,

We are pleased to inform you that your manuscript 'On the identification of differentially-active transcription factors from ATAC-seq data' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted, you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Shaun Mahony
Academic Editor
PLOS Computational Biology

Sushmita Roy
Section Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors carefully and thoroughly addressed the issues of the original manuscript, and I consider the current version fully suitable for publication.

Reviewer #2: The authors have done a very nice job in addressing this reviewer's comments.

Reviewer #3: The authors have addressed my comments very nicely. In the meantime, we have replied to the error message posted for diffTF on the simulated data. Well done,
Judith Zaugg

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians, and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g., participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: IVAN V KULAKOVSKIY
Reviewer #2: No
Reviewer #3: Yes: Judith Zaugg
Formally Accepted
PCOMPBIOL-D-24-00394R1
On the identification of differentially-active transcription factors from ATAC-seq data

Dear Dr Germain,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Dorothy Lannert
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.