Accurate genetic and environmental covariance estimation with composite likelihood in genome-wide association studies

Boran Gao; Can Yang; Jin Liu; Xiang Zhou

doi:10.1371/journal.pgen.1009293

Peer Review History

Original Submission June 5, 2020
10 Aug 2020 Decision Letter - Michael P. Epstein, Editor, Hua Tang, Editor Dear Dr Gao, Thank you very much for submitting your Research Article entitled 'Accurate genetic and environmental covariance estimation with composite likelihood in genome-wide association studies' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time. Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org. If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist. To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines. 
Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process. To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder. [LINK] We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions. Yours sincerely, Michael P. Epstein Associate Editor PLOS Genetics Hua Tang Section Editor: Natural Variation PLOS Genetics Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: This is a great paper with contributions to an important problem. Overall, the manuscript is well-structured and clearly written. 
Although I have some concerns and questions about the technical details of GECKO and some analyses presented in the paper, all my comments should be addressable in principle. Major comments: 1. It was not immediately obvious to me if GECKO assumes the design matrix (X) to be fixed or random. Methods for genetic correlation estimation (and frankly, heritability estimation as well) have not been consistent on this issue. My recollection is that the GREML approach and GNOVA assume a fixed design while LDSC is based on a random design. While the field seems to have accepted that genetic correlation estimation is robust to these two different designs (at least empirically), whether this will affect the estimation of environmental covariance is unclear. Under a random design, the intercept of LDSC is the phenotypic covariance of two traits. Under a fixed design, I am under the impression that the intercept may actually be the environmental covariance instead. When the goal is to estimate the genetic covariance, such a difference does not matter since the slope in LDSC is always a good estimator for rho_g. But the interpretation of the intercept is completely different under the two model assumptions. It will be important to clarify the assumptions in GECKO and provide some discussion of whether the environmental covariance is robust to different assumptions. 2. The field has not paid as much attention to the environmental covariance as to the genetic covariance. One reason is that the environmental covariance is more susceptible to technical artifacts (such as insufficiently controlled confounding) in the GWAS summary statistics. In fact, this was a highlight in genetic covariance estimation since if there are any technical correlations between two sets of GWAS, such correlations will most likely be independent of LD. The first LDSC paper on contrasting confounding from polygenicity is a good example of this issue (in the single-trait context). 
Suppose there is unadjusted population stratification in a GWAS; such inflation is randomly applied to SNPs in the genome while true polygenic effects are stronger in genomic regions with stronger LD. Therefore, all the technical bias will be absorbed into the intercept term of single-trait LDSC. The multi-trait extension of this problem is that technical correlation of two traits caused by population stratification will show up in the intercept of LDSC. I have been using LDSC to illustrate the problem since it's a famous method, but I don't think GECKO is immune to the exact same issue. If you run two GWASs without adjusting for PCs, and apply GECKO to their sumstats, I expect the "environmental covariance" estimates to be severely biased while the genetic covariance estimates will remain solid. If the authors can come up with an approach to address this problem, it will improve this manuscript. Otherwise, at least treat this as a serious problem and list it as a limitation. 3. Only ~600k SNPs were used in the real data analyses, which seems low. I wonder whether the somewhat heuristic composite likelihood approach in GECKO is still effective when there are dense SNPs with extensive LD in the analysis. This is important considering that both simulation and real data analysis in this paper were based on relatively low numbers of SNPs. LDSC also requires trimming of sumstats by taking the overlap with HAPMAP SNPs while GNOVA uses all available SNPs. After all, if there are few SNPs with minimal LD in the data, genetic and environmental covariance estimation becomes a much simpler problem. The tricky issue really is LD here. 4. It is helpful that GECKO can partition genetic covariance by annotation. It will be helpful if the authors can provide some discussion of whether the procedure can be easily generalized to overlapping annotations. 5. 
The inflated type-I error of GNOVA illustrated in Table 1 is striking and also inconsistent with the original publication of GNOVA. However, these results are also inconsistent with what is shown in Figure 2. There, when the true genetic covariance is 0, all three approaches seem to have comparable type-I error rates. Why the contradiction? Minor comments: 6. Figure 2's panels are labeled differently between the main text and the actual plot. See line 403 on page 24. I think the authors meant Fig2D and Fig2E instead of panels B and D. 7. In the real data analysis, is there any trait pair showing significant genetic and environmental covariance but with opposite signs? How should these results be interpreted? 8. Fig1E seems to suggest that LDSC had noisy estimates of genetic covariance but it contradicts Fig1F. 9. In the simulation corresponding to Fig2C, if we apply GECKO, should we expect to see near-null results with a "power" of 5%? In practice, it may not be obvious if two GWASs have overlapping samples. Consequently, it may not be easy (or even possible) to distinguish a lack of environmental covariance from a lack of sample overlap. It may be helpful to add some discussion of this issue. 10. There have been some recent developments in genetic covariance estimation. For example, HDL (https://www.nature.com/articles/s41588-020-0653-y) uses a likelihood-based approach to improve the efficiency of estimation and SUPERGNOVA (https://www.biorxiv.org/content/10.1101/2020.05.08.084475v1.abstract) de-correlates the summary statistics first and then applies standard regression analysis. In addition, trans-ethnic estimation is an important topic and a clear future direction (https://www.biorxiv.org/content/10.1101/803452v3). Since these are very recent papers, it wouldn't be fair to ask for comparisons between GECKO and these methods. But some discussion of these new developments will improve the manuscript. 
Reviewer #2: Review: Accurate genetic and environmental covariance estimation with composite likelihood in genome-wide association studies Summary: Gao et al. introduce a novel statistical approach, GECKO, to estimate genetic and environmental covariance between complex traits using only GWAS summary statistics with a computationally efficient composite likelihood estimator. The approach is motivated by the lack of likelihood-based approaches in summary-statistic-based genetic covariance estimation (e.g., LDSC, GNOVA both use MoM estimators). The authors performed extensive simulations to validate their approach and demonstrate its improved performance. They then followed with application to real data as an illustration. The idea is interesting, and I found the manuscript extremely well written and straightforward to follow. I detail my main comments below. Major Comments: 1. The authors performed a considerable number of simulations demonstrating the mostly improved performance of GECKO with respect to LDSC and GNOVA. However, all simulations were performed essentially under the correct generative model, with no inspection of performance under model misspecification. A few scenarios come to mind: a) LD patterns match exactly between the focal population and LD score computation. To investigate this point further, can the authors showcase the performance of all three methods when 1000G LD information is used to compute LD scores? b) How does GECKO (and LDSC, GNOVA) perform under misspecification of n_s? c) How does stratification affect the performance of GECKO? A straightforward simulation would be to generate both phenotypes as a function of several leading PCs, in addition to the remaining genetic + environmental model. Minor Comments: 1. MoM estimators are unbiased, and it is not unexpected to see estimates largely agree across all three methods. However, as the authors note, GECKO’s development was partly motivated by improving statistical power. 
The use of MSE ratios is interesting to get a relative sense of performance, but I still find it difficult to judge what my expectation should be here (e.g., is 1.05 really all that different?). I appreciated the authors' decision to include power calculations in the non-stratified setting given its clear interpretation but noticed their absence in the functional stratified simulations. 2. Replace “” with “x” in estimates/p-values. Reviewer #3: Gao et al. presented GECKO, a statistical method that aims to estimate both genetic and environmental covariances between different traits using GWAS summary statistics. The key idea of GECKO is to approximate a complicated joint likelihood with a relatively simple composite likelihood, which holds the potential to be computationally fast if the weight parameters in the composite likelihood are correctly specified. The authors evaluated the performance of GECKO by simulations as well as the analysis of 22 traits collected from five large scale GWASs. I have the following concerns about this paper. Main comments: 1. Simulations: While the authors have simulated >200 scenarios, these mostly reflect the small step size used to evaluate the impact of genetic covariance (-0.4 to 0.4) or environmental covariance (-0.4 to 0.4), whereas the other key parameter, e.g., heritability, which is also extremely important, was set at constant values. The values used in the simulations do not seem to reflect those seen in real data that the authors showed later on. Overall, I feel that the current parameter values used in simulations are too different from what one would expect from real data, which makes it difficult to judge whether these simulation results are useful. To get a better sense of the performance of GECKO, I think it is important to simulate data that are more representative of real data. I would suggest that the authors do the following: a) Estimate heritability for the 22 traits collected from real GWASs. 
Estimate genetic covariances and environmental covariances using these real data. Use these estimates as a guideline to determine the range of parameter values for simulations. b) The data were simulated from the true model. What if the model is mis-specified? In particular, what if the weights are not optimal? The authors should illustrate that GECKO is robust to misspecification of the model and weights. c) Evaluation of type I error at 0.05 level seems to be too liberal. I am curious to see if GECKO still has controlled type I error rate at a more stringent significance level, e.g. 0.005. 2. Real data analysis: a) Please show the heritability for the 22 traits. This will help readers interpret the results and connect with simulations. b) I am surprised that all estimated covariances are very small. However, even for such small covariances, they are highly significant. For example, the positive genetic covariance between BMI and WBC (covariance = 7.91 × 10^-3; p value = 2.24 × 10^-12). What is the biological significance of such a small but highly significant covariance? Can you also show estimation error of the estimated covariances? What surprises me is that all of the estimated covariances are extremely small. I am not sure that these results are biologically significant. The authors should provide better interpretation of the results and show that these estimates are reliable. c) Since 223 pairs of traits were tested, the multiple testing issue should be considered. I am also curious to see the performance of GECKO at a more stringent significance threshold. Considering correlation among the 223 pairs, 100 independent tests might be a good start. With this, using Bonferroni, the threshold will be 0.05/100=0.0005. 3. The methods section should include more details on how the weights are estimated as this is the key to GECKO. 4. In terms of comparison, shouldn’t you also compare with mvLMM? 5. Can you comment on the computing time of GECKO? 
How does it compare with the other methods with respect to computing speed? 6. Can you comment on how to use GECKO for admixed populations? Minor comments: 1. Line 404: How did you compare the power of different methods at fixed type I error rate? 2. Can you comment on why the relative powers of different methods are different across study designs? What is the intuition behind this difference? 3. For real data analysis, it would still be interesting to include GNOVA’s estimates for covariance. Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No https://doi.org/10.1371/journal.pgen.1009293.r001
Revision 1
23 Oct 2020 Author Response Attachments Attachment Submitted filename: Point_to_Point_Response.docx https://doi.org/10.1371/journal.pgen.1009293.r002
2 Dec 2020 Decision Letter - Michael P. Epstein, Editor, Hua Tang, Editor Dear Dr Gao, We are pleased to inform you that your manuscript entitled "Accurate genetic and environmental covariance estimation with composite likelihood in genome-wide association studies" has been editorially accepted for publication in PLOS Genetics. Congratulations! Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made. Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org. In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. 
If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date. Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics! Yours sincerely, Michael P. Epstein Associate Editor PLOS Genetics Hua Tang Section Editor: Natural Variation PLOS Genetics www.plosgenetics.org Twitter: @PLOSGenetics ---------------------------------------------------- Comments from the reviewers (if applicable): Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The authors did an excellent job revising the manuscript and performing new analyses. All my previous comments have been sufficiently addressed and I have no additional comment or concern. Reviewer #2: The authors have addressed all my comments. Reviewer #3: The authors have appropriately addressed my previous concerns. I don't have any additional comments. Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. 
If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No ---------------------------------------------------- Data Deposition If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website. The following link will take you to the Dryad record for your article, so you won't have to re-enter its bibliographic information, and can upload your files directly: http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-00904R1 More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support. Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present. ---------------------------------------------------- Press Queries If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. 
This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org. https://doi.org/10.1371/journal.pgen.1009293.r003
Formally Accepted
29 Dec 2020 Acceptance Letter - Michael P. Epstein, Editor, Hua Tang, Editor PGENETICS-D-20-00904R1 Accurate genetic and environmental covariance estimation with composite likelihood in genome-wide association studies Dear Dr Gao, We are pleased to inform you that your manuscript entitled "Accurate genetic and environmental covariance estimation with composite likelihood in genome-wide association studies" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course. The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work! With kind regards, Melanie Wincott PLOS Genetics On behalf of: The PLOS Genetics Team Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom plosgenetics@plos.org | +44 (0) 1223-442823 plosgenetics.org | Twitter: @PLOSGenetics https://doi.org/10.1371/journal.pgen.1009293.r004

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.