A framework for integrating directed and undirected annotations to build explanatory models of cis-eQTL data

David Lamparter; Rajat Bhatnagar; Katja Hebestreit; T. Grant Belgard; Alice Zhang; Victor Hanson-Smith

doi:10.1371/journal.pcbi.1007770

Peer Review History

Original Submission June 28, 2019
11 Sep 2019 Decision Letter - Weixiong Zhang, Editor, Roger Pique-Regi, Editor Dear Dr Hanson-Smith, Thank you very much for submitting your manuscript 'A framework for integrating directed and undirected annotations to build explanatory models of cis-eQTL data' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts. In addition, when you are ready to resubmit, please be prepared to provide the following: (1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors. (2) A copy of your manuscript with the changes highlighted (encouraged). 
We encourage authors, if possible, to show clearly where changes have been made to their manuscript, e.g. by highlighting text. (3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution. Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are: - Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition). - Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video. - Funding information in the 'Financial Disclosure' box in the online system. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here. 
We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us. Sincerely, Roger Pique-Regi, Ph.D. Guest Editor PLOS Computational Biology Weixiong Zhang Deputy Editor PLOS Computational Biology A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] This is an important and interesting problem; however, all three reviewers agree that the authors should better explain what gain in accuracy and in biological understanding the directed annotations provide. To this end, the authors should use clearer mathematical notation and provide a more accurate explanation and intuition for their modeling choices. Additional literature on this problem should also be covered, and additional methods should be used for comparison. These comparisons, using simulated and/or real datasets suggested by the reviewers, should further highlight any advantage the directed annotations provide in terms of accuracy or biological insight. The reviewers also note that the code needs to be improved and that all the data should be made available. Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Summary - This paper describes a framework to understand gene expression in different genomic datasets (e.g. tissue type). This framework can use sequence-specific directed and undirected annotations. It can predict what types of genomic annotations drive gene expression in different conditions. - The method is tested using multiple eQTL datasets, Expecto mutation predictions as directed annotations, and distance from TSS as undirected annotations. 
It is tested with a full gene expression and genotype dataset as well as with summary statistics and estimated linkage matrices. The annotations that it prioritizes appear to be biologically relevant. - The novelty of the method is that it is the first to use directed annotations to determine the types of annotations that are important in different genomic datasets, and that it can weigh the contribution of these directed annotations based on distance to TSS. Some previous methods can predict directed effects of mutations, and others can determine what types of undirected annotations drive gene expression in a dataset, including distance to TSS. But none does both. - The framework is interesting and perhaps important for the future of the field, but this paper does not offer any comparison to existing undirected annotation tools. The lack of comparison makes it difficult to determine the utility of using directed annotations, especially at the higher computational burden. It also makes it difficult to determine how much biological insight the framework can add to the field. - The current framework may offer significant methodological insight and enable future researchers to build on its use of directed annotations. However, the lack of comparison with existing methods presents a serious problem, and the biological findings reported do not represent any significant biological insight. Major issues 1. I can imagine the utility of using directed annotations to interpret gene expression, but I don’t think that utility has been shown. The method prioritizes certain genomic annotations that make biological sense. But there’s no comparison to the annotations that are prioritized by existing tools that use undirected annotations only. There’s no comparison to a simple enrichment test of genomic annotations of eQTL variants, or a more sophisticated enrichment test using e.g. DAP/torus. 
The introduction also does not include a discussion of current approaches that use undirected annotations. 2. This paper appears to rely heavily on the Expecto algorithm (ref 7). Expecto is used to generate the directed annotations by calculating the difference in predicted epigenetic signals [e.g. DNase, histone marks] between the reference and alternative alleles of a SNP. These values are then entered into the BAGEA directed annotation matrix. Expecto can already use those values to predict the effect of a given mutation on gene expression (although I don’t think it takes TSS distance into account). The current method’s utility is that using the Expecto annotations, it can learn which of the annotations are relevant for genetic effects in a given dataset on a genome-wide basis. While this seems like it could be useful and interesting, it’s difficult to judge without a metric for comparison (major issue #1). 3. The prioritized annotations per dataset do not appear to offer major biological insight. The findings provided are a good “sanity check” that the expected annotations are prioritized, and there are a couple minor biological insights (e.g. bone-marrow-derived mesenchymal stem cells vs. fibroblasts, DNase1 peaks vs. hotspots). It would add to the paper if the methodology could be used to create new hypotheses about a new or unknown dataset or biological question. Minor issues 1. The authors should provide more context for Sj/MSEdir and highlight the expected values possible based on the cis genetic variation. What is the maximum expression variance that their method could have been expected to explain? The finding that “6.6% of the total genetic variance in cis was explained by the externally-fitted directed component mu_j for genes in the top quartile” helped me understand the usefulness and biological relevance of the method. 
I’m also curious how much genetic variance was explained by only the variants that end up being used for prediction, as 98.4% of the entries in the directed matrix were set to zero. 2. The method introduces an intuitive way to learn the pattern of the importance of distance to TSS. However, I wonder how much this pattern varies between genes and between datasets, and I don’t see any data on this. If the pattern doesn’t vary much, as I imagine it might not, why not just keep Fv fixed instead of learned? 3. It would be interesting to know if there are differences between genes with low MSE and high MSE. Is there anything about the high MSE genes that makes it easier to predict their gene expression using this model? 4. There are no comments about the finding that some traditionally activating annotations (e.g. Fig 3C CD3 DNase peaks) have negative effect sizes for certain datasets. I can speculate myself, but I think it would be worth discussing. 5. The introduction begins by explaining the importance of understanding the effects of rare variants, but the method as implemented only uses common variants (MAF > 2.5% in EUR). 6. It would be useful to provide a list of all the Roadmap annotations used (or at least the cell types), so that readers know which annotations were not prioritized. Perhaps a full list of Figure 5A could be included in the supplement. 7. Figure 1B seems more complicated than necessary. Maybe a grid would be a better tool for representation. 8. It is difficult to follow the trend lines and see the different layers in Figures 2A and 4B, especially in Figure 4B. 9. Including the tissue sample size or number of eQTLs per tissue might add more meaning to Figure 5A. 10. Since it looks like there are a couple outliers, it might be more informative to sort the x-axis of Figure 5B by median instead of by maximum. And it would be interesting to know which tissues those outliers came from. 11. 
It appears the GTEx data were analyzed using a EUR-only LD matrix, when the sample is ~85% EUR. 12. The method does not use INDEL information (I believe). 13. I tried to download and test the software myself, but the install script would not work; it failed on “devtools::use_rcpp()” with the message “Error: 'use_rcpp' is not an exported object from 'namespace:devtools'”. I did not investigate further. Reviewer #2: The authors present a model connecting genomic annotations, predicted effects of variants on those annotations, and eQTLs. The key innovation is that the model can consider both directed and undirected annotations: in particular variant effect predictions (e.g. from deep learning models) and distance from TSS respectively. The model is constructed and initially tested for the scenario where genotypes and gene expression are available, but an extension to using summary statistics and reference panel LD is presented that seems to recapitulate the full data results well. The underlying model is a "soft sparse" Bayesian regression with ARD prior on coefficients (this gives a Student-t distribution on each coefficient; the authors should check that and mention it if so). The annotations are incorporated into the mean of the prior. The vector of directed annotations for a SNP, v, is dot producted with a learnt coefficient vector w. Similarly, the undirected annotations f are dot producted with nu. These two scalars are multiplied together to give the prior mean for the SNP effect on expression (I think it might be helpful to give this explanation in the text since the matrix version is a bit more obtuse). The prior variance can optionally be stochastically dependent on a third binary annotation matrix C. Priors (and hyperpriors) are set up on all variables and VBEM is used for learning. An affine transform of the "LD matrix" X'X in principle needs inverting each iteration, which is costly. They pre-compute a low rank approximation (explained e.g. 
99% of variance) and then use the matrix inversion lemma subsequently. The paper is very clearly presented. There are a small number of typos I've annotated on the attached manuscript. I wish PLOS wouldn't tell authors to upload low quality bitmaps and put the figures at the end, but that's a review of the journal not the paper (for future reference they will let you submit a pdf with inline figs at first submission). The model itself is very reasonable. The main aspect one could argue about is whether this approach of "soft sparsity" is competitive with spike-and-slab priors as used in Bayesian fine-mapping approaches such as enloc or Matt Stephens's recent SuSiE method. The results themselves are of course not overwhelming, with only a small fraction of genetic variance (let alone total variance) explained by the model, even for the most well predicted genes. I would be curious to know what are the reason(s) driving this: variant effect predictors aren't good enough? insufficient sample size for training BAGEA (seems unlikely)? Chromatin state is only in proxy cell types? While the small pve is what it is, I would appreciate some attempt to benchmark vs existing eQTL/annotation approaches like eQTNMiner, enloc or stratified LD score regression. I realize this is non-trivial since existing approaches can't predict gene expression, and there is no ground truth for what annotations should be important for eQTLs in a given cell type. That leaves simulations, which are likely to bias towards which model is most similar to the simulation, and fine mapping. Fine mapping might be a reasonable way to compare (although I realize the method is not specifically tailored to this). You could think of downsampling individuals and see how well you recover the credible set of variants from the full data. One thing I would emphasize is that this model could be applied to very rare, even de novo, variants. 
This distinguishes the approach from both existing eQTL and eQTL+annotation methods. Maybe I missed it but do you have a run time analysis? This is relevant for practitioners. Is there a good reason the undirected annotations are binary? The math looks like it would all go through just fine with non-negative annotations. Reviewer #3: Lamparter et al. propose a new computational method, BAGEA, to model cis-eQTL data using genomic annotations. They argue that the new computational model can identify epigenetic marks relevant to expression biology. For the most part, the paper is clearly written. However, I have some concerns about some critical model assumptions, evaluations of the proposed model, and the analysis of the real eQTL data. Major comments: 1. Model assumptions. There is generally a lack of discussion on the intuition and motivation of modeling decisions. First, a unique perspective of the proposed method is its ability to model directed genomic annotations. However, it is unclear how important the directions of the annotations enforced in the model are. In particular, are the omega parameters constrained (e.g., in the motivating example of binding affinity?). If yes, how? If not, I fail to understand how the concept of directed genomic annotations differs from a quantitative annotation. Relating to the same point, should the variance term alpha in Eqn (2) depend on the mean term too? Because a potentially large independent variance term can make the directional effect meaningless. Second, the joint modeling of directed and undirected annotations in Eqn (2) is not intuitive. It is unclear to me why it is a reasonable strategy for the model to consider only the interactions between the two distinct categories. 
The particular model formulation also causes confusion: consider a single directed annotation and a single undirected annotation, would the absence of the undirected annotation (i.e., 0) completely remove the effect of the quantitative information from the directed annotation? 2. Evaluation of the model. The authors choose to evaluate the proposed model based on its predictive performance. However, their results show extremely inaccurate predictions at the absolute scales (1.5% variance explained). It is well known that gene expressions are generally difficult to predict, but other available genotype predictive approaches, e.g., the simple elastic net algorithm implemented in PrediXcan method, show much higher heritability based on similar datasets (See https://www.ncbi.nlm.nih.gov/pubmed/27835642). Can the authors explain the discrepancy? On the surface, this seems to undermine the main point of the paper: by introducing an annotation-informed prior, I'd expect at least similar, if not much better, predictive performance (an annotation-uninformed model should be included as a special case of the proposed model by setting appropriate nu, omega and alpha parameters). 3. The relevance of genomic annotations. More details should be provided on this topic. Although some computational difficulty is acknowledged in the discussion, the adopted training and testing protocol should be evaluated in a simulation setting. It is also unclear from the paper whether an individual annotation is assessed marginally or under the control of other competing annotations. Minor comments: 1. Undirected annotations are not formally defined in the way that directed annotations are. 2. The GitHub page for the software needs to be improved. It does not have proper documentation or example datasets. I strongly encourage the authors to provide the datasets (i.e., the summary statistics from eQTL data and genomic annotations) analyzed in this paper online for reproducibility purposes. 
******** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: No: I did not receive a spreadsheet form of the numerical data underlying graphs/summary statistics. Reviewer #2: Yes Reviewer #3: None ******** PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: Yes: David A Knowles Reviewer #3: No Attachments Attachment Submitted filename: PCOMPBIOL-D-19-01078_annotated.pdf https://doi.org/10.1371/journal.pcbi.1007770.r001
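Reviewer #2's verbal description of the prior mean, and Reviewer #3's question about a zero undirected term, can be made concrete with a minimal sketch. It assumes only what Reviewer #2's summary states: the prior mean of a SNP's effect is the product of two dot products, the directed annotation vector v with learnt weights w, and the undirected annotation vector f with weights nu. The function names and toy numbers are hypothetical and are not taken from the BAGEA code or the manuscript.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def prior_mean(v: Sequence[float], w: Sequence[float],
               f: Sequence[float], nu: Sequence[float]) -> float:
    """Prior mean of one SNP's effect, as summarised by Reviewer #2.

    v  : directed annotations for the SNP (signed, quantitative)
    w  : learnt weights for the directed annotations
    f  : undirected annotations for the SNP (e.g. TSS-distance bins)
    nu : learnt weights for the undirected annotations
    """
    directed = dot(v, w)      # signed scalar from the directed annotations
    undirected = dot(f, nu)   # scaling scalar from the undirected annotations
    return directed * undirected


# Hypothetical toy values, chosen only for illustration.
v = [0.5, -0.25]
w = [2.0, 1.0]
nu = [0.5, 0.25, 0.125]

print(prior_mean(v, w, [1, 0, 0], nu))  # SNP falls in the first distance bin
print(prior_mean(v, w, [0, 0, 0], nu))  # no undirected annotation is active
```

Under this reading, an all-zero undirected vector f gives a prior mean of exactly zero whatever the directed annotations are, which is the behaviour Reviewer #3 asks the authors to clarify.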
Revision 1
2 Dec 2019 Author Response Attachments Attachment Submitted filename: cover letter _revisions.pdf https://doi.org/10.1371/journal.pcbi.1007770.r002
30 Jan 2020 Decision Letter - Weixiong Zhang, Editor, Roger Pique-Regi, Editor Dear Dr. Hanson-Smith, Thank you very much for submitting your manuscript "A framework for integrating directed and undirected annotations to build explanatory models of cis-eQTL data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations. Please make sure to address the remaining minor comments from Reviewer 1. As well as asking for clarification on some further issues, they have highlighted a few spelling mistakes/typos - please also take this opportunity to read through carefully for any others. Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. 
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Roger Pique-Regi, Ph.D. Guest Editor PLOS Computational Biology Weixiong Zhang Deputy Editor PLOS Computational Biology ********************* A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Thank you to the authors for thoroughly addressing the previous comments and adding the requested analyses. A few new, minor comments: 1. In "Joint modeling of cis-eQTLs and Directed Annotations..." you mentioned a cell line rationale for the prioritized TF annotations. Were the expected TFs also prioritized, or did the effects appear to be mostly driven by cell lines? 2. Line 352, "annotation" sp. 3. Was torus run using all eQTLs (all p-values)? Or were the same variants used for both methods (p<10e-7)? 4. Could comment #3 possibly explain why BAGEA showed no effect for many non-zero torus effects (Fig S9)? It's also possible that I'm misunderstanding the text -- I'm not entirely sure that I understand the last sentence of the paragraph (lines 360/361). 5. Fig 4. caption partially sp. 6. Fig 4A: What's going on with Adipose Omentum -- why is the RM MSE at 1? Apologies if I missed this discussion in the text. 7. When discussing DNase1 data processing, it might be helpful to mention that DNase1 hotspots are usually wider than DNase1 peak calls (I believe). Reviewer #2: I've read the other reviews, author feedback and changes. I was already quite positive about the paper but I think with the additions now made it should certainly be accepted. 
****** Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: None ******** PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: Yes: David A Knowles Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. 
Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods https://doi.org/10.1371/journal.pcbi.1007770.r003
Revision 2
20 Feb 2020 Author Response Attachments Attachment Submitted filename: cover letter _rebuttal_second_round_pretty.pdf https://doi.org/10.1371/journal.pcbi.1007770.r004
3 Mar 2020 Decision Letter - Weixiong Zhang, Editor, Roger Pique-Regi, Editor Dear Dr. Hanson-Smith, We are pleased to inform you that your manuscript 'A framework for integrating directed and undirected annotations to build explanatory models of cis-eQTL data' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS. Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. Best regards, Roger Pique-Regi, Ph.D. Guest Editor PLOS Computational Biology Weixiong Zhang Deputy Editor PLOS Computational Biology *********************************************************** https://doi.org/10.1371/journal.pcbi.1007770.r005
Formally Accepted
29 Apr 2020 Acceptance Letter - Weixiong Zhang, Editor, Roger Pique-Regi, Editor PCOMPBIOL-D-19-01078R2 A framework for integrating directed and undirected annotations to build explanatory models of cis-eQTL data Dear Dr Hanson-Smith, I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work! With kind regards, Bailey Hanna PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol https://doi.org/10.1371/journal.pcbi.1007770.r006

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.