A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank

Junyang Qian; Yosuke Tanigawa; Wenfei Du; Matthew Aguirre; Chris Chang; Robert Tibshirani; Manuel A. Rivas; Trevor Hastie

doi:10.1371/journal.pgen.1009141

Peer Review History

Original Submission
January 21, 2020
20 Mar 2020 Decision Letter - Gregory P. Copenhaver, Editor, Xiaofeng Zhu, Editor
Dear Dr Hastie,
Thank you very much for submitting your Research Article entitled 'A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Multivariate Genome-wide Predictive Modeling with Application to the UK Biobank' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. That is, we will consider a revised manuscript that robustly demonstrates marked improvement over the existing approaches (i.e. Bayesian approaches, BLUP and polygenic risk scores), as Reviewer 1 pointed out. We cannot, of course, promise publication at that time.
Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.
If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org. If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.
To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines. Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process. To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder. [LINK] We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions. Yours sincerely, Xiaofeng Zhu Associate Editor PLOS Genetics Gregory P. Copenhaver Editor-in-Chief PLOS Genetics Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. 
Reviewer #1: Review of Qian et al
Summary: This paper describes an efficient algorithm for fitting the lasso regression model to large data sets, along with an implementation (R package snpnet) and application to the UK Biobank data to obtain predictors (effectively performing genomic prediction, or computing polygenic risk scores, PRS) for several different phenotypes. The paper compares prediction accuracy with some other simpler methods.
The lasso is, in general, a widely studied, and also quite widely used method. As such an algorithm and implementation for very large datasets are of potential interest to a general audience both inside and outside genetics. However, for readers of PLOS Genetics the interest is going to stand or fall on the application: is Lasso a good method to do genomic prediction? I am skeptical of this: the Lasso has never been the method of choice for genomic prediction in smaller data sets, with the field generally preferring other large-scale regression methods, including very simple methods (eg "ridge regression", usually known as BLUP in the quantitative genetics literature) or very computationally intensive methods (Bayesian regression, usually fit via MCMC). The Elastic Net is also sometimes used. But the Lasso, rarely. While I will keep an open mind on whether this could change for biobank-sized data, the current paper is unconvincing on this because none of the comparisons are with state-of-the-art approaches to this problem.
Overall then I think the main contribution of this paper is the algorithmic ideas, whose main appeal is their simplicity and generality: I like the fact that the design allows the algorithm to maximally exploit previous implementations, rather than having to reimplement the coordinate ascent steps for example. However, unless the resulting method is really competitive with state of the art for genomic prediction then this seems better suited to another journal.
Detailed comments:
1. Comparisons with other methods: The methods used here do not seem to represent a reasonable selection of state-of-the-art approaches to forming predictors for genetic data, on which there is a large literature. Historically genomic prediction has been done using multiple linear regression fit either using very simple methods (eg "ridge regression", usually known as BLUP in the quantitative genetics literature) or very computationally intensive methods (Bayesian regression, usually fit via MCMC). More recently, motivated by the difficulty of accessing/sharing genotype data, as well as computational considerations, a literature has sprung up around methods that attempt to build predictors based on summary statistics only (and LD from a reference panel). For example, Ge et al and Lloyd-Jones et al (https://www.nature.com/articles/s41467-019-09718-5 and https://www.nature.com/articles/s41467-019-12653-0) are recent examples and include comparisons with other methods, the latter specifically on the UK Biobank data with some of the same phenotypes considered here. To take a quick example, in Fig 2 of Lloyd-Jones et al, looking at R2 for BMI, the performance among the methods they consider ranges from 0.1 to 0.126. In this paper (Table 3) Lasso achieves 0.103. I realize these numbers are not directly comparable, being based on different protocols (CV splits etc) for analyzing the UK Biobank data, but it illustrates my concern that Lasso may not be competitive with the best existing methods.
2. Algorithmic description: I found much of the algorithmic description in the overview overly long and hard to follow. The basic idea seems rather simple (which is a good thing!) but the presentation seems to obscure the simplicity rather than highlighting it. The formal presentation of Algorithm 1 in section 4 helps a lot, and I suggest it should be moved to the overview section. This should allow the text in the algorithmic overview to be shortened, since many of the words seem to be repeating, in less precise terms, what is given in Algorithm 1. Also:
- the algorithm and text did not seem to address what happens if the "checking" step fails. That is, in step 5 of Algorithm 1, what if no lambda satisfies the KKT conditions? Or is this guaranteed not to happen?
- How is M chosen? Does it matter?
- the algorithm seems to rely on the fact that marginal screening is going to be effective at identifying the correct variables to add in. In some cases with complex correlations among variables this may not be true - one can construct problems where the best pair of variables to include are not among the marginally strongest. How does the algorithm cope with that kind of situation? Is it guaranteed to converge in practice?
3. Standardization: The question of whether or not to standardize variables is usually phrased in terms of modeling assumptions -- if rare SNPs have bigger effects than common SNPs then standardization could be appropriate and improve predictive performance. The paper suggests that standardization will produce worse performance but this is not obvious a priori - it should be shown empirically.
4. Implementation: The software implementation does not appear to be quite ready for widespread distribution (e.g. the R package on github has no man pages, and I could not find a minimal working example).
Other:
- the use of the term "multivariate" in the context of a multiple regression with univariate outcome is rather confusing. From the title I expected the paper to deal with multivariate outcomes. Better to stick to "multiple regression", or perhaps "multi-SNP regression" if you prefer.
- references to heritability were also confusing. E.g. the abstract refers to "state-of-the-art heritability estimation", when the goal here seems not to be heritability estimation but building a predictor, which are different things. Heritability provides an upper bound on prediction accuracy from genetic data, but building a predictor is not the same as "estimating heritability", and most approaches to estimating heritability do not explicitly build predictors. I think you can (and probably should) write the whole paper without mentioning heritability, and focussing entirely on PRS and prediction accuracy.
- the presentation of results is much longer than it need be. The main results for different phenotypes and methods could probably be shown in a single figure (e.g. Lloyd-Jones Fig 2). Many of the other figures did not seem essential to the main story.
Refs:
Ge, T., Chen, C., Ni, Y. et al. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 10, 1776 (2019). https://doi.org/10.1038/s41467-019-09718-5
Lloyd-Jones, L.R., Zeng, J., Sidorenko, J. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat Commun 10, 5086 (2019). https://doi.org/10.1038/s41467-019-12653-0
Reviewer #2: How to build the best predictive model using large-scale genetic data is important in health and disease studies. This paper provides a true regression approach for this problem, an important alternative to the polygenic risk scores. The results from analysis of the UK Biobank are convincing and interesting. The algorithm seems to be quite reasonable. I only have a few minor comments:
(1) Since the Lasso results in biased estimates of the regression coefficients, do the authors think that by performing further debiased estimation, one can further improve the prediction performance?
(2) Since a very large number of SNPs are selected for each of the data examples, would the consistency results still hold? Lasso theory requires that the model has to be very sparse.
(3) Why does univariate screening + Lasso not perform as well as fitting Lasso using all the SNPs? Does this mean that the univariate screening as proposed by Jianqing Fan and others does not really work in the settings considered in this paper?
Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: None
Reviewer #2: Yes
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Matthew Stephens
Reviewer #2: No
https://doi.org/10.1371/journal.pgen.1009141.r001
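Editorial illustration of Reviewer #1's questions on the "checking" step and on marginal screening. This is a minimal NumPy sketch, not the snpnet implementation and not the authors' Algorithm 1. It assumes the lasso objective (1/2n)||y - Xb||^2 + lam*||b||_1 at a single fixed lam, for which a variable left at zero must satisfy |x_j' r|/n <= lam. The function names and the batch size `m` (standing in for the reviewer's M) are hypothetical. The loop screens variables by correlation with the current residual, fits on the screened subset, and enlarges the subset until no excluded variable violates the condition.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Plain coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * (new - beta[j])
            beta[j] = new
    return beta

def kkt_violations(X, y, subset, beta_subset, lam, tol=1e-8):
    """Indices outside `subset` whose |x_j' r|/n exceeds lam."""
    n, p = X.shape
    resid = y - X[:, subset] @ beta_subset
    grad = np.abs(X.T @ resid) / n
    outside = np.setdiff1d(np.arange(p), subset)
    return outside[grad[outside] > lam + tol]

def screen_and_fit(X, y, lam, m=5):
    """Screen by residual correlation, fit on the subset, check, repeat."""
    n, p = X.shape
    subset = np.array([], dtype=int)
    beta_s = np.zeros(0)
    while subset.size < p:
        resid = y - X[:, subset] @ beta_s
        score = np.abs(X.T @ resid) / n
        score[subset] = -np.inf          # never re-add a variable
        k = min(m, p - subset.size)
        new = np.argsort(score)[::-1][:k]
        subset = np.sort(np.concatenate([subset, new]))
        beta_s = lasso_cd(X[:, subset], y, lam)
        if kkt_violations(X, y, subset, beta_s, lam).size == 0:
            break                        # check passed: solution is valid
    beta = np.zeros(p)
    beta[subset] = beta_s
    return beta
```

In this sketch a failed check only enlarges the subset, and in the worst case the subset grows to all variables, so the loop always terminates. A variable missed by marginal screening costs extra rounds but does not change the final fit.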
Revision 1
20 May 2020 Author Response Attachments Attachment Submitted filename: response_to_reviewers.pdf https://doi.org/10.1371/journal.pgen.1009141.r002
13 Jul 2020 Decision Letter - Gregory P. Copenhaver, Editor, Xiaofeng Zhu, Editor
Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.
Dear Dr Hastie,
Thank you very much for submitting your Research Article entitled 'A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved. We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by reviewers.
Reviewer #1 raised important issues regarding the results of SBayesR. This issue will need to be fully resolved in the revision, and the editors agree that it may be very helpful for you to reach out to the authors of SBayesR during your revision process, but we leave that up to you to decide.
In addition we ask that you:
1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.
2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive.
If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images. We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org. If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist. While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission. PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process. 
To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder. [LINK]
Please let us know if you have any questions while making these revisions.
Yours sincerely,
Xiaofeng Zhu, Associate Editor, PLOS Genetics
Gregory P. Copenhaver, Editor-in-Chief, PLOS Genetics
Reviewer's Responses to Questions
Comments to the Authors: Please note here if the review is uploaded as an attachment.
Reviewer #1: The authors have done a thorough job responding to my comments, and I believe the whole paper is much improved. Just one new substantive issue has arisen during this revision: the results reported for the SBayesR method are very poor, and seem to strongly contradict the original publication on this method. Indeed, it is a bit hard to believe that it performs quite so poorly, and the reasons for its poor performance need to be understood and either corrected or explained. For example, for height, SBayesR does no better than just Age + Sex in predicting height - so it essentially has a 0% R2 when you consider the genetic component only. In contrast, Lloyd-Jones et al report that SBayesR achieved an R2 of >35% for height in the UK Biobank. Something is clearly wrong, either with the SBayesR software or with the way it has been applied. (Other traits show a similar pattern, but the height result is particularly striking.) Of course, I don't know what the problem is, but I suggest a first step would be to ask the SBayesR authors if they have suggestions, and/or get their original code and see if you can reproduce their published results.
Other items: in the SBayesR analysis I noticed you excluded the MHC. Maybe this is recommended by the SBayesR software, but it seems likely to hurt R2 and AUC for many traits as the MHC has a strong effect on many traits. To make results comparable across methods it seems necessary to either exclude or include MHC for all methods. (It seems unlikely that this issue explains the poor performance on height noted above.)
Reviewer #2: My previous comments were minor and the authors have addressed these comments.
Have all data underlying the figures and results presented in the manuscript been provided? Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.
Reviewer #1: No: UK Biobank data can't be provided
Reviewer #2: Yes
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: Yes: Matthew Stephens
Reviewer #2: No
https://doi.org/10.1371/journal.pgen.1009141.r003
Revision 2
17 Aug 2020 Author Response Attachments Attachment Submitted filename: response_to_reviewers_v2.pdf https://doi.org/10.1371/journal.pgen.1009141.r004
4 Sep 2020 Decision Letter - Gregory P. Copenhaver, Editor, Xiaofeng Zhu, Editor Dear Dr Hastie, We are pleased to inform you that your manuscript entitled "A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank" has been editorially accepted for publication in PLOS Genetics. Congratulations! Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made. Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org. In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. 
If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date. Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics! Yours sincerely, Xiaofeng Zhu Associate Editor PLOS Genetics Gregory P. Copenhaver Editor-in-Chief PLOS Genetics www.plosgenetics.org Twitter: @PLOSGenetics ---------------------------------------------------- Comments from the reviewers (if applicable): ---------------------------------------------------- Data Deposition If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website. The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-00068R2 More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support. 
Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present. ---------------------------------------------------- Press Queries If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org. https://doi.org/10.1371/journal.pgen.1009141.r005
Formally Accepted
13 Oct 2020 Acceptance Letter - Gregory P. Copenhaver, Editor, Xiaofeng Zhu, Editor
PGENETICS-D-20-00068R2
A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank
Dear Dr Hastie,
We are pleased to inform you that your manuscript entitled "A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.
The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.
Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.
Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!
With kind regards,
Matt Lyles
PLOS Genetics
On behalf of: The PLOS Genetics Team
Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
plosgenetics@plos.org | +44 (0) 1223-442823
plosgenetics.org | Twitter: @PLOSGenetics
https://doi.org/10.1371/journal.pgen.1009141.r006
