Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models.

Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding due to population structure. Linear mixed effects models (LMMs) can account for correlations due to relatedness but have not been applicable in high-dimensional (HD) settings where the number of fixed-effect predictors greatly exceeds the number of samples. False positives or false negatives can result from two-stage approaches, in which the residuals estimated from a null model adjusted for the subjects' relationship structure are subsequently used as the response in a standard penalized regression model. To overcome these challenges, we develop a general penalized LMM with a single random effect, called ggmix, for simultaneous SNP selection and adjustment for population structure in high-dimensional prediction models. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection that is highly scalable, computationally efficient, and has theoretical guarantees of convergence. Through simulations and three real data examples, we show that ggmix leads to more parsimonious models with better prediction accuracy than the two-stage approach or principal component adjustment. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. ggmix can be used to construct polygenic risk scores and to select instrumental variables in Mendelian randomization studies. Our algorithms are implemented in an R package available on CRAN (https://cran.r-project.org/package=ggmix).

Thank you for this question. We have now clarified in line 305 that we used all SNPs with a positive posterior inclusion probability to predict the response, which is the default setting of GEMMA. We additionally tried imposing stricter posterior inclusion probability thresholds (0.05, 0.10 and 0.50) in order to improve feature selection. These thresholds, however, resulted in overly sparse models, as most SNPs had a low posterior probability. We have reported these additional results in Table 2.

Thank you for your suggestions. We realized that the data format can indeed be converted, and we have removed the statement that such conversion was not possible. However, based on our understanding, BSLMM was not specifically developed for feature selection, and we found the results were highly dependent on the choice of the arbitrary posterior probability threshold. We posit it would require extensive rationale and experimentation to arrive at a reasonable threshold for each analysis, which is beyond the scope of our study. Therefore, we restricted such analyses to comparing the prediction accuracy in the UK Biobank and GAW20 analyses.

Yes, we absolutely agree with this comment. In the originally submitted version of our manuscript, all of our simulation studies were performed from a practitioner's point of view. As was pointed out by several reviewers, and in your comment, there are pros and cons to this approach. We have added the following text to the manuscript to clarify (Lines 185-189): Note that in practice, this approach to selecting the tuning parameter is generally not possible since we do not know the underlying true model in advance. For real data, we suggest an information criterion approach described in Section 5.3.8 or a sample splitting approach such as the one we used for the UK Biobank analysis shown in Section 3.2.1.
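To make the thresholding discussed above concrete, here is a minimal sketch in Python. The PIP values are made up for illustration and merely stand in for the posterior inclusion probabilities that GEMMA/BSLMM report; this is not GEMMA's actual interface.

```python
import numpy as np

# Hypothetical posterior inclusion probabilities (PIPs) for 8 SNPs,
# standing in for the values a BSLMM fit would report.
pip = np.array([0.92, 0.01, 0.30, 0.00, 0.07, 0.55, 0.02, 0.11])

def select_snps(pip, threshold=0.0):
    """Return indices of SNPs whose PIP strictly exceeds the threshold.

    threshold=0.0 mimics the default of keeping every SNP with a positive
    PIP; stricter thresholds (0.05, 0.10, 0.50) yield much sparser models,
    since most SNPs typically have low posterior probability.
    """
    return np.flatnonzero(pip > threshold)

print(select_snps(pip))        # every SNP with a positive PIP
print(select_snps(pip, 0.50))  # overly sparse selection
```

The sensitivity of the selected set to the threshold is exactly why we found the results highly dependent on this arbitrary cutoff.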

412-41: Is this statement true? Schelldorfer et al. 2011 used what they called a "Block Coordinate Gradient Descent method" for their penalized LMM.
To our understanding, we believe this statement to be true. We agree that Schelldorfer et al. 2011 also developed a Block Coordinate Gradient Descent algorithm for solving their objective function. The key difference between our algorithm and theirs is the rotation of the response vector Y and the design matrix X by the eigenvectors of the kinship matrix, resulting in a diagonal covariance matrix. This leads to optimizing two variance parameters instead of q* parameters, as shown in Algorithm 1 of Schelldorfer et al. 2011. It is for this reason we wrote: ... in the specific context of fitting a penalized LMM for population structure correction with theoretical guarantees of convergence.
5. 501: I think that this is underselling your method. The main difference is that your method is orders of magnitude faster than lmmlasso, so it's actually usable. A secondary difference is that it is limited to only one random effect.
Thank you for highlighting this. We have changed the text as follows: We note here that the main difference between the proposed model and lmmlasso is that we rotate the response vector Y and the design matrix X by the eigenvectors of the kinship matrix. This results in a diagonal covariance matrix, making our method orders of magnitude faster and usable for high-dimensional genetic data. A secondary difference is that we limit ourselves to a single unpenalized random effect.
6. 288: how do you get a p-value from ggmix?
We did not report any p-values based on ggmix. If you were referring to the p-value in line 280, that was obtained by running FaST-LMM to confirm previous findings in the literature.

Reviewer 3
The authors addressed most of my major concerns in this revision. I have one more comment:

Thank you for pointing this out. We had issues with the data format but have now realized it can be converted to PLINK format, and we have thus removed the statement. Meanwhile, as BSLMM was not specifically developed for feature selection, we did not apply it to the mouse microsatellite data analysis, to avoid relying on additional arbitrary selection criteria.