Peer Review History
| Original Submission: August 23, 2019 |
|---|
|
Dear Dr Zaykin,

Thank you very much for submitting your manuscript 'DOT: Gene-set analysis by combining decorrelated association statistics' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and, in this case, also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible, to show clearly where changes have been made to their manuscript, e.g. by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).
- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.
- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here.

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us. 
Sincerely,

Jennifer Listgarten, Associate Editor, PLOS Computational Biology
Thomas Lengauer, Methods Editor, PLOS Computational Biology

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK]

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: In this paper, the authors present a new summary-statistics-based method for testing a group of common SNPs in aggregate for association to a phenotype. Unlike previous approaches, the authors' test statistic explicitly (and exactly) removes correlation between the individual SNPs' summary statistics. I generally like this paper and appreciate the authors' precision and rigor in deriving and presenting their method. Their theoretical results concerning the power of their test as well as others are also a valuable contribution. So I generally feel this is a very solid contribution to the field. In the long term, I would suggest that the authors consider applications of their framework beyond set-testing, since my impression is that the growing number of highly significant associations between *individual* SNPs and phenotypes will eventually cause set-testing to decline as an approach in the common-variant realm. But this is beyond the scope of this paper, and for now there remains a substantial community of users of set tests who could benefit from the approach described by the authors.

Regarding the technical substance of the paper, I have the following major comments:

- I'm unclear on the phenomenon whereby TQ tests don't experience an increase in power as more SNPs are added to the model, e.g., in Setting 1. Looking at the authors' model, in which the variance of the environmental noise, epsilon, is set at 1, it would seem that the more SNPs I add to the model with non-trivial effects, the more phenotypic variance is produced by the genetics. In the limit of infinitely many SNPs and constant-magnitude environmental noise, then, the phenotype should be deterministically set by genotype. It would seem unintuitive that in this situation the TQ tests wouldn't have full power. What am I missing? Are the authors scaling something somewhere?
- Relatedly, it would help if the authors included in their methods section more detailed descriptions of their simulation setups, especially including sample size and the proportion of phenotypic variance explained by genotype for each simulation (including the simulations with real genotypes).
- I don't know if the proportion of variance explained by genotype is high in the authors' simulations. But if it is, do they expect their results to generalize to settings where this is not the case? For real traits, any one set of tens to hundreds of contiguous SNPs typically explains only a very small proportion (on the order of 1%, usually even less than that) of phenotypic variance, so I'd be interested to see if this is the case in the simulations here. Sometimes it's okay to simulate small sections of genome explaining high proportions of phenotypic variance as long as sample size is lowered in some corresponding way, but if this is the case here the authors should explain and perhaps use their theory to justify.
- How do the authors expect their statistic to behave in the presence of near-perfect LD? It seems they don't regularize their LD matrix, which surprised me. I would be interested to see power results under a simulation setting where two SNPs, only one of which is causal and contains 75% of the causal signal in the locus, have a) 99% correlation and b) 100% correlation.
- For the simulations with real genotypes, how was the 100kb region on chromosome 17 chosen? Do the authors expect the simulation results to generalize to other regions of the genome as well? If they are unsure, is it computationally feasible to do simulations where random sets of contiguous SNPs are chosen from the whole genome?
- How were the genes ESR1, FGFR2, RAD51B, and TOX3 chosen by the authors for demonstration of their method? Does this set include all the genes found in the Min et al. paper to have association with breast cancer? Would it be possible to test a larger set of genes chosen more systematically so that readers can have a sense for whether the authors' approach should in general be preferred over other approaches? Or perhaps to test a few genes chosen by authors of other set-testing methods papers?
- Do the authors think it would make sense to compare (either in simulation or in practice) to the gene-level test in de Leeuw 2016 PLOS Comp Bio, since that method also provides a way to test the SNPs surrounding an individual gene for association while accounting for correlation between variants in order to boost power? Relatedly: ACAT seems to be a method intended primarily for testing of rare variants in sequence data; could it be that this makes it an inappropriate comparison point?
- I liked the way the authors argued for their particular choice of pseudoinverse by suggesting that exchangeability of SNPs should be preserved by this operation. Kudos!

I also have the following minor comments:

- It seems that the claims about the scaling of power as a function of L are for fixed rho > 0, because when rho = 0 the tests considered are equivalent. The authors may want to clarify this.
- In the definition of r_ij on page 2, should there be a square root in the denominator?
- On page 3 there is a typo in "This general idea is straightforward and HAVE been used..." (emphasis mine)
- What was the sample size of the breast cancer data set that the authors analyzed?
- In Equations 4 and 5, rho_ij appears on both sides of the equations.
- The derivation of the covariance matrix of the vector of summary statistics can be carried out without the delta method but under the assumption of Gaussian genotypes (which is justifiable for large sample size and MAF bounded away from zero). See Proposition 2 in the supplement of Reshef et al. 2018 Nat Genet. The authors may wish to comment on whether these two derivations give different results, and if so, why.
- For the results in Table 6: 1) which set of genotypes were the phenotypes simulated from? 2) Which set of genotypes was used as the reference panel? The only genotypes I saw mentioned were 1000 Genomes, but two distinct sets of genotypes are required for the described analysis.

Reviewer #2: Zaykin et al. propose DOT, a new method for gene-based association testing. There is demand for a gene- (or set-) based method, so a method that improves upon previous methods would be of much interest and (with easy-to-use software) could become highly used. Zaykin et al. perform many simulations to show that DOT has the potential to improve on a state-of-the-art method, VEGAS (and also ACAT, a method I am not familiar with). They also have a real-data example, but this is very limited. While I am not convinced from this draft alone, I believe that by including an extra simulation method, and a more convincing application, DOT could be a useful addition to the field.

Major points

Reading the method (and apologies that I did not understand all the details), DOT appears similar to methods which first compute principal components for each gene (i.e., eigendecompose the SNP-SNP correlation matrix), then regress the phenotype on these (consider the following paper, or derivatives: https://onlinelibrary.wiley.com/doi/pdf/10.1002/gepi.20219). 
Thus I require convincing that this method is different from, or an improvement on, those.

The format of the paper makes it challenging to read. Usually methods would come before results. However, if the journal requires such a style, then you must give some brief details at the start of the results.

I consider there to be insufficient detail of the simulations. For example, I can't see the sample size, and rho was hard to find. Is it the case for all simulations that all L SNPs are assigned effects, or just the first one?

It is good you compare with VEGAS (TQ?). But to my knowledge, the most common methods are SKAT or MAGMA, and my preferred is FaST-LMM-Set, so I would ideally like at least one of these considered (or a statement with justification that these are very similar to VEGAS).

The application is very limited. While I appreciate there is justification for the choice, unfortunately it looks odd to consider only four handpicked loci, rather than perform a genome-wide analysis. I believe you should provide odds ratios for the SNPs in Table 8 (ideally from a multi-SNP analysis, and perhaps also those from single-SNP analyses).

Minor points

I applaud the range of simulations, and also the consideration of situations where DOT is not well suited.

I also like the insight into how DOT has the potential to gain power (when there is a wide spectrum of effect sizes, which is thought likely to be the case with complex traits).

In the simulations, it is hard to understand the effect sizes. Can you instead report them in terms of heritability, ideally both the (average) phenotypic variance explained by the gene/region and the (average) variance explained by the most significant individual SNP?

The tables (and I think figures) require captions. In general, these should give a full description (or, if the same, say "see Table 1", etc.), rather than relying on the reader to parse through the main text.

Good that a GitHub page is provided with software (although I have not tested it).

Please provide a summary of run time for a decent-sized analysis. 
Very minor points

Intro: "It is important to distinguish situations ..." I suggest you replace the second "in which" with "from those" or something similar.

I would prefer if you provided more thresholds when testing the false positive rates (e.g., show not just alpha = 1e-4, but also, say, 0.05, and maybe a few others, in the supplement if necessary).

It is good you can accommodate covariates, but is this feature used in the application?

Signed, Doug Speed

Reviewer #3: In this manuscript, the authors combined single-SNP summary statistics in order to conduct a joint analysis of a set of SNPs without accessing the original genotype-phenotype datasets. To develop an efficient overall summary statistic, the authors used a decorrelation trick to simplify the correlation structure of the vector of single-SNP summary statistics. The latter are correlated by construction. Thus, by rotating this vector over the eigenvectors of its corresponding correlation matrix, one can simplify its correlation structure. Although decorrelating a response vector is not a new concept (it has been used with kinship matrices several times in linear mixed models in the presence of familial data, e.g., FaST-LMM), the theoretical and analytical development of the DOT p-values in this manuscript is relevant in the context of summary-statistic association.

Major and minor comments are detailed in a PDF file attached to this review.

**********

Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). 
If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Doug Speed

Reviewer #3: No
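The decorrelation trick that Reviewer #3 describes (rotating the vector of correlated single-SNP Z-scores onto the eigenvectors of its correlation matrix) can be illustrated in a few lines. This is only a sketch of the general idea with a made-up toy LD matrix, not the authors' DOT implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy LD (correlation) matrix for L = 3 SNPs; values are illustrative only.
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])

# A vector of correlated single-SNP Z-scores, drawn here under the null.
Z = rng.multivariate_normal(np.zeros(3), Sigma)

# Eigendecompose Sigma, then rotate Z onto the eigenvectors and scale
# each component by 1/sqrt(eigenvalue) so it has unit variance.
vals, vecs = np.linalg.eigh(Sigma)
X = (vecs / np.sqrt(vals)).T @ Z  # decorrelated statistics

# The decorrelated statistics are independent standard normals under the
# null, so their squares can be summed into a chi-squared statistic with
# L degrees of freedom.
stat = np.sum(X**2)
```

The key property is that the transform A = D^(-1/2) V^T satisfies A Sigma A^T = I, so the rotated statistics are uncorrelated and can be combined without double-counting the LD-induced correlation.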
|
| Revision 1 |
|
Dear Dr Zaykin,

Thank you very much for submitting your manuscript "DOT: Gene-set analysis by combining decorrelated association statistics" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, provided that you modify the manuscript according to the review recommendations, and in particular those of reviewer #1.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. 
Sincerely,

Jennifer Listgarten, Associate Editor, PLOS Computational Biology
Thomas Lengauer, Methods Editor, PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK]

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: Overall the authors have addressed my theoretical and methods-related concerns quite well in this revision. However, I still have serious reservations about the authors' analysis of real data, which analyses a very small set of genes that were not chosen systematically. I previously wrote: "Would it be possible to test a larger set of genes chosen more systematically so that readers can have a sense for whether the authors' approach should in general be preferred over other approaches? Or perhaps to test a few genes chosen by authors of other set testing methods papers?" The authors did not perform this analysis, and so I still do not know whether their method is more powerful than existing methods beyond the very small set of genes they have analyzed. (The addition in revision of a second phenotype, cleft lip, analyzed in the same way as the first phenotype, did not give me a better global sense for why people should use this method.)

My understanding of what the authors have shown is that:

a) DOT assigns lower p-values than other methods do to the 4 selected breast cancer genes. This seems weak to me, first because lower p-values don't necessarily correspond to higher power (a method can give very low p-values on 1% of alternatives but fail to reject the null the rest of the time), and second because these genes have already been prioritized by other methods, suggesting that their connection to breast cancer is not a new discovery enabled by DOT. 
For example, these genes seem from the text to harbor previously reported risk SNPs. Am I missing something?

b) DOT can point at new SNPs associated with breast cancer and cleft lip at these known loci (Tables 10 and 12). But the authors also state (appropriately) that since these results don't come with p-values they should be interpreted with caution, and they also state that they cannot conclude that these SNPs are causal, only that they are additional proxy SNPs. So I'm unsure what we can confidently learn from these results.

I personally don't find (a) or (b) to be strong reasons that practitioners should use DOT. Overall, I see two ways forward:

1. The authors can carry out a systematic analysis of the performance of their method on real data. For example, they could run the method on a larger set of genes (e.g., all protein-coding genes, or all genes expressed in breast tissue, or a set of genes benchmarked in other set-testing papers). This would allow the authors to say things like "in a systematic analysis, our method identified X genes to be in loci that are significantly associated with breast cancer, while competing methods identified only Y such genes." I think this would make a much stronger case for the use of this method. And if it's not true, then that is important for potential users to know, even if it doesn't preclude publication of the paper.

2. Alternatively, recognizing that they have performed extensive revisions already, the authors can add a statement explaining that the genome-wide performance of their method is as yet uncharacterized and would be important to assess in future work.

I suppose it would be okay to publish the paper in case 2, but my opinion is that I would be less excited about it. Not answering the central question of whether DOT is more powerful than other methods in practice on real data is not consistent with the otherwise high level of statistical rigor in this potentially interesting paper. 
Minor comments:

- Just above Table 1, you have a typo: "the column labeled \\hat\\gamma provide the average noncentrality value" ("provide" should be "provides").
- In the sentence “Different combinations of sample size, the number of causal SNPs, their individual effect sizes and LD patterns among them, resulted in total proportion of phenotypic variance explained...", whose addition I appreciate in this revision, sample size should not be enumerated as one of the parameters that affects the total proportion of phenotypic variance explained.
- On page 10, you cite "Min et al. [27, 28]" but neither of refs. 27 or 28 has Min as the last name of a first author in your bibliography.
- In your response to R1.1.6, you state that eqns 22 and 27 in Reshef et al. 2018 are derived under the null, but this is not true: Eq 22 defines the computation of summary statistics from data (regardless of model), and Equation 27 includes a parameter beta which can be non-zero. A question therefore remains about the relationship between your derivation and the derivation that assumes Gaussian genotypes. (Fine if you want to drop this issue.)

Reviewer #2: The authors have made a careful response and I am happy with the changes.

Reviewer #3: No additional comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. 
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Doug Speed

Reviewer #3: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods |
| Revision 2 |
|
Dear Dr Zaykin,

We are pleased to inform you that your manuscript 'DOT: Gene-set analysis by combining decorrelated association statistics' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted, you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be coordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Jennifer Listgarten, Associate Editor, PLOS Computational Biology
Thomas Lengauer, Methods Editor, PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: I thank the authors for their revision, and I am happy to recommend acceptance given the clarifications the authors made about their analysis of real data. Setting aside this one point of disagreement, I feel this is very high quality work and I commend the authors on their valuable contribution to the field. 
**********

Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No |
| Formally Accepted |
|
PCOMPBIOL-D-19-01433R2
DOT: Gene-set analysis by combining decorrelated association statistics

Dear Dr Zaykin,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol |
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.