Peer Review History
| Original Submission: July 6, 2022 |
|---|
|
Dear Dr Miraldi,

Thank you very much for submitting your manuscript "maxATAC: genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. 
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Teresa M. Przytycka
Academic Editor
PLOS Computational Biology

William Noble
Section Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have presented a computational framework (maxATAC), based on deep convolutional neural networks, for predicting transcription factor (TF) binding from ATAC-seq profiles. The authors demonstrated TF binding site prediction from ATAC-seq in new cell types. For benchmarking, the authors curated an extensive dataset of existing cell-type-specific ChIP-seq and ATAC-seq datasets. Data were manually verified using the annotations of each experiment. Their models performed well on both ATAC-seq and scATAC-seq. Quality-controlled, processed datasets are also available to download. Overall, it is a well-written manuscript with an apt description of the method.

Minor comment: The deep learning approach used in this work is not novel (e.g., Tianqi Yang et al., Bioinformatics 2022; Laiyi Fu et al., Science Advances 2020). However, this work is unique in its capability for trans-cell-type TFBS prediction and in providing community access to a large, quality-controlled, processed, ready-to-use curated dataset to advance gene regulatory network research on ATAC-seq data.

Reviewer #2: This manuscript curates a good-quality dataset of ATAC-seq and TF ChIP-seq data, and presents a method named maxATAC for the prediction of TF binding sites. TF binding site prediction is an important problem in gene regulatory analysis and has broad applications in many fields. The curated dataset will be useful for future method development. The manuscript writing is clear and easy to follow. However, I have one concern about the validation of the method. 
1. There are several methods available for TF binding site prediction from chromatin accessibility data, for example the methods presented in the ENCODE-DREAM challenge. The reason the authors did not compare maxATAC with those methods is the difference in input data, and using different data as input while comparing AUPR is not a fair comparison. I agree that there are some differences between ATAC-seq and DNase-seq; however, in most cases these two data types are highly correlated. The authors could train those methods on the curated data they collected and perform a direct comparison.

2. It is not clear how the authors calculate the AUPR used to evaluate motif scanning. I assume they slide the motif-matching score across the genome to calculate the AUPR. Another way to perform the comparison is to slide the sum (or product) of the motif-matching score and the ATAC-seq signal (RPKM, or openness) to calculate the AUPR.

Reviewer #3: The Cazares et al. manuscript presents a well-curated benchmark dataset that pairs ATAC-seq and published transcription factor (TF) ChIP-seq in 20 cell types for an unprecedented number of TFs (127 TFs with ChIP-seq data in at least 2 cell types; 74 of them with data in at least 3 cell types). The authors generated their own OMNI-ATAC-seq for some cell lines to assemble the resource. In addition to the dataset, the paper proposes a deep learning model that predicts trans-cell-type TF occupancy based on a dilated CNN. The maxATAC model was carefully evaluated using both bulk and single-cell ATAC-seq data on various cell types and TFs, in a held-out-chromosome and held-out-cell-type manner, and was shown to reach comparable (though perhaps slightly weaker) performance to several current state-of-the-art methods for cross-cell-type TF occupancy prediction developed for DNase-seq. The maxATAC approach also had superior performance compared to traditional motif scanning or ChIP-seq signal averaging across training cell types for most TFs. 
Overall, this manuscript is well-written and the results are nicely presented, with the limitations of the study helpfully described in the Discussion. While there is limited technical or conceptual novelty relative to previous neural network models that tackle the same or related problems, the benchmark dataset establishes a useful resource, and the maxATAC performance results lay down a useful baseline for future deep learning methods. More exploration of the central question of cross-cell-type generalizability of TF occupancy prediction, wider method comparison, and better interpretation of the trained models would strengthen the paper.

Major points:

1. The key question for the paper is the extent of generalizability of TF binding models to new cell types, where potentially new co-factors or members of the TF complex may be expressed and alter the sequence recognition code. The authors present various analyses to explain the variability in maxATAC's performance over TFs, but the issue of the cell-type-specificity of the underlying sequence signal is not fully developed. For example, in Figure S5, the authors show the maxATAC auPR relative to the training ChIP-seq signal auPR (as a log odds ratio) vs. Jaccard distance over the training ChIP-seq samples. This tells us that when a factor binds nearly the same sites in all cell types (e.g., CTCF), it is hard to outperform training ChIP-seq signal, whereas the model can outperform this baseline when there are cell-type-specific sites. However, in absolute terms, maxATAC has the highest auPR on CTCF test data out of all factors, because CTCF also has a highly conserved binding motif across cell types. So it would be helpful to address how the model performs on unseen cell types for TFs with cell-type-specific binding motifs as compared to highly conserved binding motifs.

2. As a related issue, a model interpretation analysis, e.g. using feature attribution to identify cell-type-specific vs. 
conserved motifs, might also help investigate the issue of generalizability. In particular, is the model actually learning co-factor binding signals in addition to the TF motif? Can the authors show that the model is learning multiple modes of binding for TFs with cell-type-specific binding motifs/patterns?

3. The major method comparison was made between maxATAC and two baseline methods (motif scanning and ChIP-seq average signal), but since there are also some published ATAC-seq-based TF "footprinting" approaches such as HINT-ATAC and TOBIAS, i.e. methods that try to model the ATAC-seq signal to better localize the potential TF binding site, it would be interesting to see a comparison with them. It would also be interesting to understand whether the maxATAC model is learning a "footprinting" method (a relatively protected region within a peak) or simply finding peak summits.

4. It would also be interesting to compare the deep learning method to "shallow learning", e.g. a kernel method like gkm-SVM combined with a simple kernel on the ATAC-seq signal.

5. There are also some improvements that could be made to make the presentation of this paper clearer. Various model and training details (an explicit explanation of the output of the model, the ground truth used for training, and details of the encoding and 32-bp resolution) are not clear until the Methods, and some details are obscure even there. It would be helpful to include a figure and overview of the model in the main text. The training set-up might be described in the ENCODE-DREAM challenge, but a self-contained presentation would be helpful here. Also, the wording "a suite of models" and "a collection of models" is a bit misleading: it is not true that there is a collection of models with different structures, as is finally clarified in the model description section; rather, there are models for different TFs (all with the same architecture). 
Minor points:
- Fig S1A: the figure caption should use "cell-line-specific" and "cell-type-specific" consistently.
- Line 198: should be "sequence-based".

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g., participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. 
If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols |
| Revision 1 |
|
Dear Dr Miraldi,

We are pleased to inform you that your manuscript 'maxATAC: genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be coordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Teresa M. Przytycka
Academic Editor
PLOS Computational Biology

William Noble
Section Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:
Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors addressed my comments.

Reviewer #2: I do not have further comments.

Reviewer #3: The revised paper has sufficiently addressed most of the issues described in the major and minor comments of the previous review. 
For comments 1 and 2, the authors have added model interpretation results using TFMoDISco, which provide an interesting investigation of some limitations in cross-cell-type generalization due to cell-type-specific binding modes. Note that it might be helpful to highlight GATA3 and CREM in Fig. 5C for better visibility.

Comment 3 has to some extent been addressed with the comparison to TOBIAS and the neural-network activation-maximization plot, and the authors suggest that the model is indeed learning a "footprint" region (although this is hard to disambiguate from simply a peak-summit region).

The authors mention two papers (DeepSEA and a 2016 paper from the Gifford lab) to justify not performing the comparison with gkm-SVM requested in comment 4. This is fine, though please note that there are issues with some of the early CNN literature in genomics; for example, the original DeepSEA results may largely reflect GC-content bias, and later papers regress out GC content prior to training models.

For comment 5, the presentation of the maxATAC methodology is now much clearer with the additional information provided in Figure 2.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g., participant privacy or use of data from a third party), those must be specified. 
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Hatice Ulku Osmanbeyoglu
Reviewer #2: No
Reviewer #3: No |
| Formally Accepted |
|
PCOMPBIOL-D-22-01037R1

maxATAC: genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks

Dear Dr Miraldi,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofi Zombor
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol |
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.