Peer Review History

Original Submission: July 12, 2019
Decision Letter - Heather J Cordell, Editor, Scott M. Williams, Editor

Dear Dr Bhatnagar,

Thank you very much for submitting your Research Article entitled 'Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors demonstrate a method for genome-wide association that combines mixed-model based control of population structure or genetic relatedness with multi-variate regressions. The idea is to increase power for GWAS applications by including multiple markers at once while accounting for structure. Previous tools for this application are somewhat limited, so this method aims to provide a more comprehensive and computationally efficient tool. They include a well-structured R package to implement their method.

This is an important topic, and their software is much faster and easier to use than existing tools. However, I have several comments on their presentation, particularly the comparisons with existing methods and some limitations of the simulations:

Major issues:

1. I think that the novelty of the method is not made clear in the paper. The model itself appears virtually identical to that of the lmmlasso R package (Schelldorfer et al 2011) except limited to one random effect – the objective function appears identical, and those authors also derived a block coordinate gradient descent algorithm, though I don’t know if the two algorithms are identical. The addition here seems mostly in terms of computational efficiency and flexibility (combinations of L1 and L2 penalties vs just L1 for Schelldorfer). Using the SVD to rotate and diagonalize the LMM so that the random effect covariance matrix is diagonal is a very useful addition and makes the software very fast, but this is not really emphasized in the methods. Also, the BSLMM model (Zhou et al 2013 PLOS Genetics) is also effectively a variable selection LMM, though it’s Bayesian instead of frequentist and requires MCMC so is slower. But its performance could be compared to ggmix and the two-step LASSO presented in this paper.

2. While this method is much faster than lmmlasso (and presumably BSLMM), how does it scale to large numbers of markers? Typical genomics datasets today contain >1e5-1e6 markers. The datasets used here seem to contain 10K-50K. There is some discussion about limitations of scaling to large N, but not large p.

3. The first set of simulations includes population structure and admixture, but all 10K markers are simulated in linkage equilibrium. This is an ideal situation for LASSO. In real data, nearby markers will be partially correlated. How well does this method select the correct marker when there are other correlated markers? What does it do when the true causal variant is not in the data, but imperfectly correlated markers are (the typical GWAS setting)? Does it tend to select the nearest marker, or does it select a set of nearby markers? I can’t tell the extent of the LD among markers in the GAW20 and mouse datasets, or whether this is addressed there. Also, the simulation result that including the causal markers in the kinship matrix has little impact is encouraging. But the Yang et al 2014 paper, which discusses the impact of including causal markers in the estimated kinship matrix, suggests that a substantial impact is only likely when N/M is large (i.e. as many or more individuals as markers). In the simulations presented here, N/M is <0.1, which is probably small enough that proximal contamination is not an issue. If N/M were ~1, then the impact might be greater. Note: M is the effective number of markers after accounting for LD among markers.
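
The rotation mentioned in point 1 can be illustrated with a minimal numerical sketch. This is not the authors' implementation: the kinship-like matrix `K`, the variance components `sigma2_g` and `sigma2_e`, and the parametrization V = sigma2_g * K + sigma2_e * I are all assumptions chosen for illustration, and NumPy is assumed to be available. The sketch checks that rotating by the eigenvectors of K makes the covariance diagonal, so that generalized least squares reduces to weighted least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3

# Hypothetical kinship-like matrix: symmetric positive definite.
G = rng.normal(size=(n, 20))
K = G @ G.T / 20

# Hypothetical variance components (illustrative values only).
sigma2_g, sigma2_e = 0.6, 0.4
V = sigma2_g * K + sigma2_e * np.eye(n)

# Eigendecomposition K = U diag(d) U^T (coincides with the SVD for symmetric PSD K).
d, U = np.linalg.eigh(K)

# Rotating by U^T turns V into a diagonal matrix with entries sigma2_g * d + sigma2_e.
V_rot = U.T @ V @ U
diag_expected = sigma2_g * d + sigma2_e
off_diag_max = np.abs(V_rot - np.diag(np.diag(V_rot))).max()

# Simulated design matrix and response, rotated the same way.
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
X_rot, y_rot = U.T @ X, U.T @ y

# Generalized least squares on the original scale (needs an n x n inverse).
V_inv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)

# Weighted least squares on the rotated scale (only per-observation weights).
w = 1.0 / diag_expected
beta_wls = np.linalg.solve(X_rot.T @ (w[:, None] * X_rot), X_rot.T @ (w * y_rot))
```

After the rotation, each observation carries its own scalar weight, which is what allows coordinate-wise penalized updates without repeatedly inverting an N x N matrix.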

Minor issues

• 97: Wang et al 2011 does not treat the variance components as fixed, but iteratively estimates them along with beta.

• 142: How much of the total variance is accounted for by causal SNPs vs the background in the simulations?

• 355: X_1,…,X_N should each be transposed

• Fig 5: I don’t think that the red line is a p-value threshold. Maybe “significance threshold”? How can this be <100 if there were 200 bootstraps and significance required >50% inclusion?

• I believe Eq (30) is wrong. The term V^{-1} shouldn’t be present in the likelihood if b is included.

Reviewer #2: I have a few comments and suggestions:

Page 2, Line 33: Is this true: better sensitivity and specificity? What are you using to define sensitivity?

Page 7, line 143-163. I found the definitions of the different X matrices confusing. I think you can simplify to indexes: e.g. causal for the list of causal SNP indexes, kinship for the list of SNP indexes in the kinship matrix, etc. I’d also refer to the SNPs as covariates and not label them as fixed, and state when the causal set is in the kinship set.

Page 7, line 159: Did you try a larger number of SNPs? In GWAS the number is much larger and PRS methods use much more than 50 independent SNPs for estimation.

Page 10, line 183: if I understand your sparsity estimate correctly, with a causal rate of 1%, setting all B to zero would give a value of 99%?

Page 11, 188: I am curious why you only reported the ‘optimal’ value of the penalty parameter. Is your method outperforming in terms of sparsity because it just does a better job of selecting a sparse model? Your false positive rate is lower but the true positive rate is much lower. If one is searching for the set of true causal variants, they are usually willing to take the tradeoff of better sensitivity for weeding through the false positives. I would prefer to see curves of FP versus TP rates with the values at the optimal tuning parameter marked. In practice I found AIC/BIC somewhat conservative compared to CV or controlling for error rate via phenotype permutation.

Page 11, Line 196: I am sorry if I missed this somewhere, but how was the model tuned for the lasso and twostep in the training data? Did ggmix use the GIC in the materials and methods?

Page 14, Line 243 typo- methods

Page 12: Figure 3. This data might be better summarized in a table that could include the additional data in the supplementary files.

The math is a bit beyond my abilities, but I have previously read a paper that suggests maximizing the log likelihood, $-\frac{1}{2}\left[\ln|V| + (Y-\beta X)^{T} V^{-1} (Y-\beta X)\right]$, subject to the L1/L2 norm penalty to control for relatedness in penalized regression methods for genetic data (where V is the variance covariance matrix). Is this essentially what you are doing?

It would have been interesting to use the set of related individuals in the UK Biobank on a few traits where PRS works well.

Reviewer #3: The authors present a penalized multi-variate regression model, ggmix, that jointly models multiple genotypes in a mixed-model setup that incorporates the kinship or the genetic relationship matrix (GRM). It is an important problem in statistical genetics and I appreciate that the authors developed a comprehensive algorithm that simultaneously handles population structure and variable selection. The methods are well described and the paper was easy to follow, but I have some concerns about the experiments and evaluation metric. Overall, I think the paper presents an important problem, but the results are not convincing.

First, in the simulation results, the ‘correct sparsity’ measure is basically the ‘accuracy’ measure for the binary classification problem of whether the regression coefficients are zero or non-zero. This ‘accuracy’ measure is often misleading when the class distribution is imbalanced. For example, if 99% of coefficients are zero, you can get 99% accuracy by just classifying everything as zero, which is clearly not a good model. I suggest adopting a different measure, such as MCC. I can see that twostep and LASSO both maintain FPR at 0.05, which is why ‘correct sparsity’ is around 0.95. At the same time, we can see that LASSO and twostep achieve higher TPR. Also, from a slightly different point of view, in Figure 3(D), comparing TPR at different values of FPR is not a fair comparison. To compare different methods in the context of TPR and FPR, either AUC or TPR at the same FPR should be considered.
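
The imbalance argument above can be made concrete with a small sketch comparing accuracy with the Matthews correlation coefficient (MCC). The numbers are hypothetical (1000 coefficients, 10 truly non-zero) and the two "models" are invented selections, not results from the manuscript. Returning 0 when MCC is undefined is a common convention, assumed here.

```python
import math

def confusion_counts(truth, selected):
    """Count TP, FP, TN, FN for two equal-length lists of booleans."""
    tp = sum(1 for t, s in zip(truth, selected) if t and s)
    fp = sum(1 for t, s in zip(truth, selected) if not t and s)
    tn = sum(1 for t, s in zip(truth, selected) if not t and not s)
    fn = sum(1 for t, s in zip(truth, selected) if t and not s)
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Fraction of coefficients correctly classified as zero or non-zero."""
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 by convention when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical truth: the first 10 of 1000 coefficients are non-zero (1% causal).
truth = [i < 10 for i in range(1000)]

# Hypothetical model A: selects nothing at all.
null_model = [False] * 1000

# Hypothetical model B: selects 8 of the 10 causal markers plus 20 null markers.
sparse_model = [i < 8 or 10 <= i < 30 for i in range(1000)]

acc_null = accuracy(*confusion_counts(truth, null_model))
mcc_null = mcc(*confusion_counts(truth, null_model))
acc_sparse = accuracy(*confusion_counts(truth, sparse_model))
mcc_sparse = mcc(*confusion_counts(truth, sparse_model))
```

In this toy setting the empty model has the higher accuracy (0.99 versus 0.978) but an MCC of 0, while the model that recovers 8 of the 10 causal markers has an MCC of about 0.47.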

The Intro is slightly misleading, especially in lines 107-109, because I first thought that ggmix removes causal (i.e. selected) variables from the relationship matrix. I later realized that the loss of power due to causal SNPs being included in the GRM still happens in ggmix, but that you aim to minimize the loss by joint modeling. It is not clear why this would be the case. Is there any theoretical or simulation-based evidence that joint modeling achieves higher power in such a case?

In the mouse data, the model parameters were optimized so that the two loci are picked up, and then the evaluation metric is based on whether these two loci are picked up, which is circular.

In the discussion, it was mentioned that a leave-one-chromosome-out approach is possible, but has not been tried. What would be the compelling reason to model all chromosomes together in the proposed problem, especially when the model is still additive and trans-interaction terms are not directly modeled?

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Revision 1

Attachments
Attachment
Submitted filename: response_to_reviewers.pdf
Decision Letter - Heather J Cordell, Editor, Scott M. Williams, Editor

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Bhatnagar,

Thank you very much for submitting your Research Article entitled 'Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved.

We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have significantly revised their manuscript. I think that the inclusion of BSLMM in the comparisons is useful. However, I have a couple of additional concerns.

1. Are you sure you’re extracting the model fit from BSLMM correctly? The specifics of use are not described in the methods. It’s a Bayesian model, so gives a probability of inclusion of each SNP. From the prefix.param.txt file, you should use the gamma column to report the number of SNPs that cross a posterior inclusion probability threshold. If you count how many betas are != 0, that will likely be large. But this is not an accurate estimate of the number of markers included in the model. You can also get the posterior on the number of SNPs included from the prefix.gamma.txt file. If you’re counting SNPs that are included, what posterior probability threshold are you using for inclusion? If you’re reporting the % of SNPs included, are you reporting the posterior mean?

2. Also, the authors didn’t apply BSLMM to several of the analyses. There is a write.plink function in the snpStats package that could be used to write GEMMA-compatible input files from R. Also, the PhenotypeSimulator package seems to have a writeStandardOutput function that can write BIMBAM or GEMMA output. Especially if the extraction of estimates needs to be revised and BSLMM in fact works more similarly to the other methods, I would recommend applying it to all analyses (Biobank may be too large).

3. I am still a bit confused about when tuning parameters were set based on TPR / FPR and when based on GIC or CV or other direct methods. In real data, it’s generally not possible to set based on TPR/FPR, but I think most of your comparisons now are done that way. I understand that the goal is to show that your model can work well, and so comparing to the truth is useful. But I think more clarity is needed about when you’re demonstrating the true performance of the model vs when you’re demonstrating how the model would actually be run by a practitioner who did not know any true positive effects going in.

4. 412-414. Is this statement true? Schelldorfer et al 2011 used what they called a "Block Coordinate Gradient Descent method" for their penalized LMM

5. 501: I think that this is underselling your method. The main difference is that your method is orders of magnitude faster than lmmlasso, so it’s actually usable. A secondary difference is that it is limited to only one random effect.

6. 288: how do you get a p-value from ggmix?
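
The distinction raised in point 1 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the excerpt is invented, the tab-separated column layout (chr, rs, ps, n_miss, alpha, beta, gamma) is assumed to match the BSLMM prefix.param.txt file, and the 0.5 inclusion threshold is a hypothetical choice.

```python
import csv
import io

# Invented excerpt in the assumed layout of a BSLMM prefix.param.txt file.
# The gamma column is taken to be the posterior inclusion probability (PIP).
param_txt = (
    "chr\trs\tps\tn_miss\talpha\tbeta\tgamma\n"
    "1\trs1\t100\t0\t0.0010\t0.520\t0.93\n"
    "1\trs2\t200\t0\t0.0004\t0.010\t0.02\n"
    "1\trs3\t300\t0\t0.0007\t-0.300\t0.61\n"
    "1\trs4\t400\t0\t0.0002\t0.002\t0.01\n"
    "1\trs5\t500\t0\t0.0001\t0.000\t0.00\n"
)

rows = list(csv.DictReader(io.StringIO(param_txt), delimiter="\t"))

# Naive count: every SNP with a non-zero beta. This tends to overstate model size.
n_nonzero_beta = sum(1 for r in rows if float(r["beta"]) != 0.0)

# Count SNPs whose PIP crosses a (hypothetical) threshold.
pip_threshold = 0.5
selected = [r["rs"] for r in rows if float(r["gamma"]) >= pip_threshold]
n_selected = len(selected)

# The sum of the PIPs is the posterior expected number of included SNPs.
expected_size = sum(float(r["gamma"]) for r in rows)
```

With these invented values, four SNPs have non-zero betas but only two cross the threshold, which is the discrepancy the reviewer is asking about.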

Reviewer #3: The authors addressed most of my major concerns in this revision. I have one more comment:

The authors first state they could not run BSLMM on simulated data and mouse data because of data format issues. Converting genotypes to PLINK format should be straightforward and should not be an issue. I understand that converting a large amount of simulated data might be practically difficult, but I cannot understand why mouse microsatellite data cannot be converted to PLINK format. I do not think you need additional experiments with BSLMM but please remove the statement that converting the data to PLINK format is not possible.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

Revision 2

Attachments
Attachment
Submitted filename: ggmix_response_to_reviewers.pdf
Decision Letter - Heather J Cordell, Editor, Scott M. Williams, Editor

Dear Dr Bhatnagar,

We are pleased to inform you that your manuscript entitled "Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I am satisfied with the changes made by the authors

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-19-01153R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Formally Accepted
Acceptance Letter - Heather J Cordell, Editor, Scott M. Williams, Editor

PGENETICS-D-19-01153R2

Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models

Dear Dr Bhatnagar,

We are pleased to inform you that your manuscript entitled "Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Open letter on the publication of peer review reports

PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.

We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.

Learn more at ASAPbio.