A General Framework for Variable Selection in Linear Mixed Models with Applications to Genetic Studies with Structured Populations

Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding by population structure. Linear mixed-effects models (LMMs) can account for correlations due to relatedness but have not been applicable in high-dimensional (HD) settings where the number of fixed effect predictors greatly exceeds the number of samples. False positives can result from two-stage approaches, where the residuals estimated from a null model adjusted for the subjects' relationship structure are subsequently used as the response in a standard penalized regression model. To overcome these challenges, we develop a general penalized LMM framework called ggmix that simultaneously, in one step, selects variables and estimates their effects, while accounting for between-individual correlations. Our method can accommodate several sparsity-inducing penalties such as the lasso, elastic net and group lasso, and also readily handles prior annotation information in the form of weights. We develop a blockwise coordinate descent algorithm which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Through simulations, we show that ggmix leads to correct Type 1 error control and improved variance component estimation compared to the two-stage approach or principal component adjustment. ggmix is also robust to different kinship structures and heritability proportions. Our algorithms are available in an R package (https://github.com/greenwoodlab).



Introduction
Genome-wide association studies (GWAS) have become the standard method for analyzing genetic datasets owing to their success in identifying thousands of genetic variants associated with complex diseases (https://www.genome.gov/gwastudies/). Despite these impressive findings, the discovered markers have only been able to explain a small proportion of the phenotypic variance; this is known as the missing heritability problem [1]. One plausible explanation is that there are many causal variants that each explain a small amount of variation with small effect sizes [2]. Methods such as GWAS, which test each variant or single nucleotide polymorphism (SNP) independently, may miss these true associations due to the stringent significance thresholds required to reduce the number of false positives [1]. Another major issue to overcome is that of confounding due to geographic population structure, family and/or cryptic relatedness, which can lead to spurious associations [3]. For example, there may be subpopulations within a study that differ with respect to their genotype frequencies at a particular locus due to geographical location or their ancestry. This heterogeneity in genotype frequency can cause correlations with other loci and consequently mimic the signal of association even though there is no biological association [4,5]. Studies that separate their sample by ethnicity to address this confounding suffer from a loss in statistical power.
To address the first problem, multivariable regression methods have been proposed which simultaneously fit many SNPs in a single model [6,7]. Indeed, the power to detect an association for a given SNP may be increased when other causal SNPs have been accounted for. Conversely, a stronger signal from a causal SNP may weaken false signals when modeled jointly [6].
Solutions for confounding by population structure have also received significant attention in the literature [8,9,10,11]. There are two main approaches to account for the relatedness between subjects: 1) the principal component (PC) adjustment method and 2) the linear mixed model (LMM). The PC adjustment method includes the top PCs of genome-wide SNP genotypes as additional covariates in the model [12]. The LMM uses an estimated covariance matrix from the individuals' genotypes and includes this information in the form of a random effect [3].
While these problems have been addressed in isolation, there has been relatively little progress towards addressing them jointly at a large scale. Region-based tests of association have been developed where a linear combination of p variants is regressed on the response variable in a mixed model framework [13]. In case-control data, a stepwise logistic-regression procedure was used to evaluate the relative importance of variants within a small genetic region [14]. These methods however are not applicable in the high-dimensional setting, i.e., when the number of variables p is much larger than the sample size n, as is often the case in genetic studies where millions of variants are measured on thousands of individuals.

There has been recent interest in using penalized linear mixed models, which place a constraint on the magnitude of the effect sizes while controlling for confounding factors such as population structure. For example, the LMM-lasso [15] places a Laplace prior on all main effects, while the adaptive mixed lasso [16] uses the ℓ1 penalty [17] with adaptively chosen weights [18] to allow for differential shrinkage amongst the variables in the model. Another method applied a combination of both the lasso and group lasso penalties in order to select variants within a gene most associated with the response [19]. However, these methods are normally performed in two steps. First, the variance components are estimated once from an LMM with a single random effect. These LMMs normally use the estimated covariance matrix from the individuals' genotypes to account for the relatedness but assume no SNP main effects (i.e. a null model). The residuals from this null model with a single random effect can be treated as independent observations because the relatedness has been effectively removed from the original response. In the second step, these residuals are used as the response in any high-dimensional model that assumes uncorrelated errors. This approach has both computational and practical advantages since existing penalized regression software such as glmnet [20] and gglasso [21], which assume independent observations, can be applied directly to the residuals. However, recent work has shown that there can be a loss in power if a causal variant is included in the calculation of the covariance matrix, as its effect will have been removed in the first step [13,22].
In this paper we develop a general penalized LMM framework called ggmix that simultaneously selects variables and estimates their effects, accounting for between-individual correlations. Our method can accommodate several sparsity-inducing penalties such as the lasso [17], elastic net [23] and group lasso [24]. ggmix also readily handles prior annotation information in the form of a penalty factor, which can be useful, for example, when dealing with rare variants. We develop a blockwise coordinate descent algorithm which is highly scalable and has theoretical guarantees of convergence to a stationary point. All of our algorithms are implemented in the ggmix R package hosted on GitHub with extensive documentation (http://sahirbhatnagar.com/ggmix/). We provide a brief demonstration of the ggmix package in Appendix C.
The rest of the paper is organized as follows. Section 2 describes the ggmix model. Section 3 contains the optimization procedure and the algorithm used to fit the ggmix model. In Section 4, we compare the performance of our proposed approach and demonstrate the scenarios where it can be advantageous to use over existing methods through simulation studies. Section 5 discusses some limitations and future directions.

Model Set-up
Let i = 1, . . . , N be a grouping index, j = 1, . . . , n_i the observation index within a group, and N_T = ∑_{i=1}^{N} n_i the total number of observations. For each group let y_i = (y_{i1}, . . . , y_{in_i}) be the observed vector of responses or phenotypes, X_i an n_i × (p + 1) design matrix (with a column of 1s for the intercept), b_i a group-specific random effect vector of length n_i and ε_i = (ε_{i1}, . . . , ε_{in_i}) the individual error terms. Denote the stacked vectors Y = (y_1, . . . , y_N), b = (b_1, . . . , b_N), ε = (ε_1, . . . , ε_N), the stacked matrix X = (X_1, . . . , X_N), and let β = (β_0, β_1, . . . , β_p) be the vector of fixed effects regression coefficients corresponding to X. We consider the following linear mixed model with a single random effect [25]:

Y = Xβ + b + ε   (1)

where the random effect b and the error term ε are assigned the distributions

b ∼ N(0, ησ²Φ),   ε ∼ N(0, (1 − η)σ²I)   (2)

Here, Φ_{N_T × N_T} is a known positive semi-definite and symmetric covariance or kinship matrix calculated from SNPs sampled across the genome, I_{N_T × N_T} is the identity matrix, and the parameters σ² and η ∈ [0, 1] determine how the variance is divided between b and ε. Note that η is also the narrow-sense heritability (h²), defined as the proportion of phenotypic variance attributable to additive genetic factors [1]. The joint density of Y is therefore multivariate normal:

Y | (β, η, σ²) ∼ N(Xβ, ησ²Φ + (1 − η)σ²I)   (3)

The LMM-Lasso method [15] considers an alternative but equivalent parameterization given by:

Y | (β, δ, σ²_g) ∼ N(Xβ, σ²_g(Φ + δI))   (4)

where δ = σ²_e / σ²_g, σ²_g is the genetic variance and σ²_e is the residual variance. We instead consider the parameterization in (3), since maximization is easier over the compact set η ∈ [0, 1] than over the unbounded interval δ ∈ [0, ∞) [25]. We define the complete parameter vector as Θ := (β, η, σ²). The negative log-likelihood for (3) is given by

−ℓ(Θ) ∝ (N_T / 2) log(σ²) + (1/2) log(det(V)) + (1 / (2σ²)) (Y − Xβ)^T V^{-1} (Y − Xβ)   (5)

where V = ηΦ + (1 − η)I and det(V) is the determinant of V.
Let Φ = UDU^T be the eigen (spectral) decomposition of the kinship matrix Φ, where U_{N_T × N_T} is an orthonormal matrix of eigenvectors (i.e. UU^T = I) and D_{N_T × N_T} is a diagonal matrix of eigenvalues Λ_1, . . . , Λ_{N_T}. The covariance matrix V can then be written as

V = ηΦ + (1 − η)I = U(ηD + (1 − η)I)U^T   (6)

where

ηD + (1 − η)I = diag{1 + η(Λ_1 − 1), . . . , 1 + η(Λ_{N_T} − 1)}   (7)

Since (7) is a diagonal matrix, its inverse is also a diagonal matrix:

(ηD + (1 − η)I)^{-1} = diag{1 / (1 + η(Λ_1 − 1)), . . . , 1 / (1 + η(Λ_{N_T} − 1))}   (8)

From (6) and (8), log(det(V)) simplifies to

log(det(V)) = ∑_{i=1}^{N_T} log(1 + η(Λ_i − 1))   (9)

since det(U) = 1. It also follows from (6) that

V^{-1} = U(ηD + (1 − η)I)^{-1}U^T   (10)

since for an orthonormal matrix U^{-1} = U^T, so that the quadratic form in (5) becomes

(Y − Xβ)^T V^{-1} (Y − Xβ) = (U^T(Y − Xβ))^T (ηD + (1 − η)I)^{-1} (U^T(Y − Xβ))   (11)

Defining the rotated data

Ỹ := U^T Y,   X̃ := U^T X   (12)

and substituting (9), (10) and (11) into (5), the negative log-likelihood becomes

−ℓ(Θ) ∝ (N_T / 2) log(σ²) + (1/2) ∑_{i=1}^{N_T} log(1 + η(Λ_i − 1)) + (1 / (2σ²)) ∑_{i=1}^{N_T} (Ỹ_i − ∑_{j=0}^{p} X̃_{ij} β_j)² / (1 + η(Λ_i − 1))   (13)

where Ỹ_i denotes the i-th element of Ỹ, X̃_{ij} is the (i, j)-th entry of X̃, and the 0-th column of X̃ is U^T 1, with 1 a column vector of N_T ones.
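To make the rotation concrete, the following minimal R sketch computes the spectral decomposition and the transformed data in (12); it assumes a kinship matrix Phi, response Y and design matrix X are already in memory (all object names are illustrative, not part of the ggmix API):

# spectral decomposition of the kinship matrix: Phi = U D U^T
ed <- eigen(Phi, symmetric = TRUE)
U <- ed$vectors       # orthonormal eigenvectors
Lambda <- ed$values   # eigenvalues Lambda_1, ..., Lambda_{N_T}

# rotate the response and design matrix as in (12)
Y_tilde <- crossprod(U, Y)   # U^T Y
X_tilde <- crossprod(U, X)   # U^T X

# after rotation, observation i has variance sigma^2 * (1 + eta * (Lambda_i - 1)),
# giving the precision weights that reappear in the coordinate updates below
eta <- 0.5   # illustrative value of the heritability
w <- 1 / (1 + eta * (Lambda - 1))

After the rotation, the model reduces (for fixed η) to a heteroscedastic linear regression with known weights, which is what makes the coordinate-wise updates in Section 3 tractable.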

Penalized Maximum Likelihood Estimator
We define the (p + 3)-dimensional parameter vector Θ := (β_0, β_1, . . . , β_p, η, σ²). In what follows, p + 2 and p + 3 are the indices in Θ for η and σ², respectively. In light of our goal to select variables associated with the response in high-dimensional data, we propose to place a constraint on the magnitude of the regression coefficients. This can be achieved by adding a penalty term to the likelihood function (13). The penalty term is a necessary constraint because in our applications the sample size is much smaller than the number of predictors. We define the following objective function:

Q_λ(Θ) = f(Θ) + λ ∑_{j=1}^{p} v_j P_j(β_j)   (14)

where f(Θ) := −ℓ(Θ) is defined in (13), P_j(·) is a penalty term on the fixed effects regression coefficients β_1, . . . , β_p (we do not penalize the intercept) controlled by the nonnegative regularization parameter λ, and v_j is the penalty factor for the j-th covariate. These penalty factors serve as a way of allowing parameters to be penalized differently; for example, setting v_j = 0 leaves β_j unpenalized. Note that we do not penalize η or σ². An estimate of the regression parameters Θ̂_λ is obtained by

Θ̂_λ = arg min_Θ Q_λ(Θ)   (15)

This is the general set-up for our model. In Section 3 we provide more specific details on how we solve (15).

Computational Algorithm
We use a general purpose block coordinate gradient descent algorithm (CGD) [26] to solve (15).
At each iteration, we cycle through the coordinates and minimize the objective function with respect to one coordinate only. For continuously differentiable f(·) and convex and block-separable P(·) (i.e. P(β) = ∑_i P_i(β_i)), Tseng and Yun [26] show that the solution generated by the CGD method is a stationary point of Q_λ(·) if the coordinates are updated in a Gauss-Seidel manner, i.e. Q_λ(·) is minimized with respect to one parameter while holding all others fixed. The CGD algorithm has been successfully applied in fixed effects models (e.g. [27], [20]) and linear mixed models with an ℓ1 penalty [28]. In the next section we provide some brief details about Algorithm 1. A more thorough treatment of the algorithm is given in Appendix A.
We emphasize here that previously developed methods such as the LMM-lasso [15] use a two-stage fitting procedure without any convergence details. From a practical point of view, there is currently no implementation that provides a principled way of determining the sequence of tuning parameters to fit, nor a procedure that automatically selects the optimal value of λ. To our knowledge, we are the first to develop a CGD algorithm in the specific context of fitting a penalized LMM for population structure correction with theoretical guarantees of convergence. Furthermore, we develop a principled method for automatic tuning parameter selection and provide an easy-to-use software implementation in order to promote wider uptake of these more complex methods by applied practitioners.

Updates for the β parameter
Recall that the part of the objective function that depends on β has the form

Q_λ(β) = (1 / (2σ²)) ∑_{i=1}^{N_T} w_i (Ỹ_i − ∑_{j=0}^{p} X̃_{ij} β_j)² + λ ∑_{j=1}^{p} v_j |β_j|   (16)

where

w_i = 1 / (1 + η(Λ_i − 1))   (17)

Conditional on η^(k) and σ²^(k), it can be shown that the solution for β_j, j = 1, . . . , p, is given by

β_j^(k+1) = S_{λ v_j σ²} ( ∑_{i=1}^{N_T} w_i X̃_{ij} (Ỹ_i − ∑_{ℓ≠j} X̃_{iℓ} β_ℓ) ) / ( ∑_{i=1}^{N_T} w_i X̃_{ij}² )   (18)

where S_λ(x) = sign(x)(|x| − λ)_+ is the soft-thresholding operator and (x)_+ = max(x, 0). We provide the full derivation in Appendix A.1.2.
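The update (18) is exactly a weighted-lasso coordinate update on the rotated data. Below is a minimal R sketch, reusing the illustrative objects X_tilde, Y_tilde and w from the earlier sketch; the function name update_beta_j is ours, not part of the package, and for simplicity all columns are treated as penalized:

# soft-thresholding operator: S_lambda(x) = sign(x) * (|x| - lambda)_+
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

# one coordinate update for beta_j, holding all other coordinates fixed, as in (18);
# v is the vector of penalty factors and sigma2 the current value of sigma^2
update_beta_j <- function(j, beta, X_tilde, Y_tilde, w, lambda, v, sigma2) {
  r_j <- Y_tilde - X_tilde[, -j, drop = FALSE] %*% beta[-j]   # partial residual
  num <- soft_threshold(sum(w * X_tilde[, j] * r_j), lambda * v[j] * sigma2)
  num / sum(w * X_tilde[, j]^2)
}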

Updates for the η parameter
Given β^(k+1) and σ²^(k), solving for η^(k+1) becomes a univariate optimization problem:

η^(k+1) = arg min_{η ∈ [0,1]} { (1/2) ∑_{i=1}^{N_T} log(1 + η(Λ_i − 1)) + (1 / (2σ²^(k))) ∑_{i=1}^{N_T} (Ỹ_i − ∑_{j=0}^{p} X̃_{ij} β_j^(k+1))² / (1 + η(Λ_i − 1)) }   (19)

We use a bound-constrained optimization algorithm [29] implemented in the optim function in R, and set the lower and upper bounds to 0.01 and 0.99, respectively.
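A minimal sketch of this step in R, with the rotated residuals held fixed; the paper does not specify which of optim's bounded methods is used, so "L-BFGS-B" below is an assumption:

# negative log-likelihood (13) as a function of eta only, with beta and sigma2
# held at their current values (terms constant in eta are dropped)
nll_eta <- function(eta, resid2, Lambda, sigma2) {
  d <- 1 + eta * (Lambda - 1)
  0.5 * sum(log(d)) + sum(resid2 / d) / (2 * sigma2)
}

resid2 <- as.vector(Y_tilde - X_tilde %*% beta)^2   # squared rotated residuals
opt <- optim(par = 0.5, fn = nll_eta,
             resid2 = resid2, Lambda = Lambda, sigma2 = sigma2,
             method = "L-BFGS-B", lower = 0.01, upper = 0.99)
eta_new <- opt$par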

Regularization path
In this section we describe how to determine the sequence of tuning parameters λ at which to fit the model. Recall that our objective function has the form

Q_λ(Θ) = f(Θ) + λ ∑_{j=1}^{p} v_j |β_j|   (22)

The Karush-Kuhn-Tucker (KKT) optimality conditions for (22) are given by:

∂f/∂β_j + λ v_j γ_j = 0 for j = 1, . . . , p;   ∂f/∂β_0 = 0;   ∂f/∂η = 0;   ∂f/∂σ² = 0   (23)

The equations in (23) are equivalent to

(1/σ²) ∑_{i=1}^{N_T} w_i X̃_{ij} (Ỹ_i − ∑_{ℓ=0}^{p} X̃_{iℓ} β_ℓ) = λ v_j γ_j,   j = 1, . . . , p   (24)

where w_i is given by (17) and γ ∈ R^p is the subgradient of the ℓ1 norm evaluated at (β̂_1, . . . , β̂_p), i.e., γ_j = sign(β̂_j) if β̂_j ≠ 0 and γ_j ∈ [−1, 1] otherwise.

Therefore Θ̂ is a solution to (15) if and only if Θ̂ satisfies (24) for some γ. We can determine a decreasing sequence of tuning parameters by starting at a maximal value λ = λ_max for which β̂_j = 0 for j = 1, . . . , p. In this case, the KKT conditions in (24) are equivalent to

| (1/σ²) ∑_{i=1}^{N_T} w_i X̃_{ij} (Ỹ_i − X̃_{i0} β_0) | ≤ λ v_j,   j = 1, . . . , p   (25)

We can solve the KKT system of equations in (25) for the null model (intercept, η and σ² only) and take λ_max to be the smallest λ for which (25) holds for all j. Following Friedman et al. [20], we choose τ λ_max to be the smallest value of the tuning parameters, λ_min, and construct a sequence of K values decreasing from λ_max to λ_min on the log scale.
The defaults are set to K = 100, τ = 0.01 if n < p and τ = 0.001 if n ≥ p.
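Concretely, assuming lambda_max has already been computed from (25), and with n and p denoting the sample size and number of SNPs, the sequence can be constructed as follows (a sketch; variable names are illustrative):

K <- 100                              # number of tuning parameters
tau <- if (n < p) 0.01 else 0.001     # ratio lambda_min / lambda_max
lambda_min <- tau * lambda_max
# K values decreasing from lambda_max to lambda_min, equally spaced on the log scale
lambda_seq <- exp(seq(log(lambda_max), log(lambda_min), length.out = K))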

Warm Starts
The way in which we have derived the sequence of tuning parameters using the KKT conditions allows us to implement warm starts. That is, the solution Θ̂ at λ_k is used as the initial value Θ^(0) at λ_{k+1}. This strategy leads to computational speedups and has been implemented in the ggmix R package.
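Schematically, the warm-start loop looks as follows, where ggmix_fit_one is a hypothetical single-λ solver (not an exported function of the package) that accepts initial parameter values:

theta <- theta_null   # e.g., the null-model fit used to compute lambda_max
fits <- vector("list", length(lambda_seq))
for (k in seq_along(lambda_seq)) {
  # the solution at lambda_k initializes the solver at the next, smaller lambda
  fits[[k]] <- ggmix_fit_one(X_tilde, Y_tilde, Lambda,
                             lambda = lambda_seq[k], init = theta)
  theta <- fits[[k]]$theta
}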

Prediction of the random effects
We use an empirical Bayes approach (e.g. [30]) to predict the random effects b. Let the maximum a posteriori (MAP) estimate be defined as

b̂ := arg max_b f(b | Y, β̂, η̂, σ̂²)   (27)

where, by Bayes' rule, f(b | Y, β, η, σ²) can be expressed as

f(b | Y, β, η, σ²) ∝ f(Y | b, β, η, σ²) f(b | η, σ²)   (28)

Solving (27) is equivalent to minimizing the exponent in (28):

(1 / (2(1 − η)σ²)) (Y − Xβ − b)^T (Y − Xβ − b) + (1 / (2ησ²)) b^T Φ^{-1} b   (29)

Taking the derivative of (29) with respect to b and setting it to 0, we get:

b̂ = η̂ Φ (η̂Φ + (1 − η̂)I)^{-1} (Y − Xβ̂) = η̂ Φ V̂^{-1} (Y − Xβ̂)   (30)

where V^{-1} is given by (10), and (β̂, η̂) are the estimates obtained from Algorithm 1.
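The closed form (30) is straightforward to evaluate; a sketch assuming beta_hat and eta_hat are the estimates from the fitted model and U, Lambda, Phi are as in the earlier sketches:

# V_hat^{-1} = U diag(1 / (1 + eta_hat * (Lambda - 1))) U^T, from (10);
# dividing the rows of t(U) by the eigenvalue terms applies the diagonal inverse
Vinv <- U %*% (t(U) / (1 + eta_hat * (Lambda - 1)))

# b_hat = eta_hat * Phi %*% V_hat^{-1} %*% (Y - X beta_hat), as in (30)
b_hat <- eta_hat * Phi %*% (Vinv %*% (Y - X %*% beta_hat))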

Choice of the optimal tuning parameter
In order to choose the optimal value of the tuning parameter λ, we use the generalized information criterion (GIC) [31]:

GIC(λ) = −2 ℓ(Θ̂_λ) + a_n · df_λ   (31)

where df_λ is the number of non-zero elements in β̂_λ [32] plus two (representing the variance parameters η and σ²). Several authors have used this criterion for variable selection in mixed models with a_n = log N_T [28,33], which corresponds to the BIC. We instead choose the high-dimensional BIC [34], given by a_n = log(log(N_T)) · log(p). This is the default choice in our ggmix R package, though the interface is flexible and allows the user to select their own choice of a_n.
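A sketch of the criterion in R; nll_hat denotes the negative log-likelihood (13) evaluated at the estimates for a given λ, and the first element of beta_hat is assumed to be the (unpenalized) intercept:

# GIC(lambda) = 2 * nll_hat + a_n * df, with df = #nonzero betas + 2
gic_value <- function(nll_hat, beta_hat, n, p,
                      an = log(log(n)) * log(p)) {   # HD-BIC default
  df <- sum(beta_hat[-1] != 0) + 2   # +2 for eta and sigma^2
  2 * nll_hat + an * df
}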

Simulation Study
To assess the performance of ggmix, we simulated random genotypes from the BN-PSD admixture model using the bnpsd package [35,36]. We used a block diagonal kinship structure with 5 subpopulations. In Figure 1, we plot an estimated kinship matrix (Φ), based on a single simulated dataset, in the form of a heatmap. Each block represents a subpopulation, and a darker color indicates a closer genetic relationship.  In Figure 2 we plot the first two principal component scores calculated from the block diagonal kinship matrix in Figure 1, and color each point by subpopulation membership.

[Figure 1: Empirical kinship matrix with block structure.]
We can see that the PCs can identify the subpopulations which is why including them as additional covariates in a regression model has been considered a reasonable approach to control for confounding.

[Figure 2: First two principal component scores computed from the kinship matrix in Figure 1, colored by subpopulation membership.]

For other parameters in our simulation study, we define the following quantities:

• c: percentage of causal SNPs.
• X^(fixed): n × p_fixed matrix of SNPs that will be included as fixed effects in our model.
• X^(causal): n × (c · p_fixed) matrix of SNPs that will be truly associated with the simulated phenotype.
• X^(other): n × p_other matrix of SNPs that will be used in the construction of the kinship matrix. Some of these X^(other) SNPs, in conjunction with some of the SNPs in X^(causal), will be used in the construction of the kinship matrix; we alter the balance between these two contributors via the proportion of causal SNPs used to calculate the kinship.
• X^(kinship): n × k matrix of SNPs used to construct the kinship matrix.
• β_j: effect size for the j-th SNP, simulated from a Uniform(0.3, 0.7) distribution for the causal SNPs and set to zero otherwise.

We simulate data from model (3), with nonzero fixed effects only for the causal SNPs. In addition to these parameters, we also varied the amount of overlap between the causal SNPs and the SNPs used to generate the kinship matrix. We considered two main scenarios:

1. None of the causal SNPs are included in the calculation of the kinship matrix: X^(kinship) = X^(other).
2. All of the causal SNPs are included in the calculation of the kinship matrix: X^(kinship) = [X^(other), X^(causal)].
Both kinship matrices are meant to contrast the model behavior when the causal SNPs are included in both the main effects and random effects versus when the causal SNPs are only included in the main effects. These scenarios are motivated by the current standard of practice in GWAS where the candidate marker is excluded from the calculation of the kinship matrix [8]. This approach becomes much more difficult to apply in large-scale multivariable models where there is likely to be overlap between the variables in the design matrix and kinship matrix.
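For reference, the generative model can be sketched in a few lines of R; the genotype matrix X (e.g., drawn with the bnpsd package) and its kinship matrix Phi are assumed to exist, and all names and parameter values are illustrative:

library(MASS)   # for mvrnorm

p_fixed <- ncol(X); n <- nrow(X)
c_frac <- 0.01                          # proportion of causal SNPs ("c" above)
n_causal <- ceiling(c_frac * p_fixed)
eta <- 0.3; sigma2 <- 1                 # heritability and total variance

beta <- numeric(p_fixed)
causal <- sample(p_fixed, n_causal)
beta[causal] <- runif(n_causal, 0.3, 0.7)   # effect sizes for the causal SNPs

# polygenic random effect and independent errors, as in model (3)
b <- mvrnorm(1, mu = rep(0, n), Sigma = eta * sigma2 * Phi)
eps <- rnorm(n, sd = sqrt((1 - eta) * sigma2))
y <- as.vector(X %*% beta + b + eps)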

We compare ggmix to the lasso and the twostep method. For the lasso, we include the first 10 principal components of the estimated kinship as unpenalized predictors in the design matrix. For the twostep method, we first fit an intercept-only model with a single random effect using the average information restricted maximum likelihood (AIREML) algorithm [37] as implemented in the gaston R package [38]. The residuals from this model are then used as the response in a regular lasso model. Note that in the twostep method, we have removed the kinship effect in the first step and therefore do not need to make any further adjustments when fitting the penalized model. We fit the lasso using the default settings in the glmnet package [20] and select the optimal value of the regularization parameter using 10-fold cross-validation.
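A sketch of the two comparator pipelines follows; the calls use the documented interfaces of gaston and glmnet, but component names such as BLUP_omega should be verified against the package versions used, and the intercept is removed via mean(y) only for simplicity:

library(gaston)
library(glmnet)

# twostep: (1) intercept-only LMM fit by AIREML, (2) lasso on the residuals
null_fit <- lmm.aireml(Y = y, K = Phi, verbose = FALSE)
res <- y - null_fit$BLUP_omega - mean(y)    # remove fitted random effect (and center)
cv_twostep <- cv.glmnet(X, res, alpha = 1)  # 10-fold CV by default

# lasso + PC adjustment: top 10 PCs of the kinship as unpenalized covariates
PCs <- eigen(Phi, symmetric = TRUE)$vectors[, 1:10]
pf <- c(rep(0, 10), rep(1, ncol(X)))        # penalty.factor of 0 => unpenalized
cv_pc <- cv.glmnet(cbind(PCs, X), y, penalty.factor = pf)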
Let λ̂ be the estimated value of the optimal regularization parameter selected via cross-validation, Ŝ_λ̂ = {j : β̂_j(λ̂) ≠ 0} the index set of SNPs selected by a method, and S_0 = {j : β_j ≠ 0} the index set of truly causal SNPs. We evaluate the methods based on correct sparsity, defined as the proportion of coefficients whose zero/nonzero status is correctly estimated:

correct sparsity = (1/p) ∑_{j=1}^{p} 1[ β̂_j = 0 if β_j = 0, and β̂_j ≠ 0 if β_j ≠ 0 ]

We also compare the model error (‖Xβ − Xβ̂_λ̂‖²), the true positive rate (|Ŝ_λ̂ ∩ S_0| / |S_0|), the false positive rate (|Ŝ_λ̂ \ S_0| / |{j : j ∉ S_0}|), and the variance components for the random effect and error term. The following estimator is used for the error variance of the lasso [39]:

σ̂² = (1 / (n − ŝ_λ̂)) ∑_{i=1}^{n} (y_i − x_i^T β̂_λ̂)²

where ŝ_λ̂ = |Ŝ_λ̂| is the number of nonzero estimated coefficients.
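In R, these metrics can be computed directly; beta is the true coefficient vector and beta_hat the estimate at λ̂, both excluding the intercept (names illustrative):

S0 <- which(beta != 0)          # truly causal SNPs
S_hat <- which(beta_hat != 0)   # SNPs selected by the method

correct_sparsity <- mean((beta == 0) == (beta_hat == 0))
tpr <- length(intersect(S_hat, S0)) / length(S0)
fpr <- length(setdiff(S_hat, S0)) / (length(beta) - length(S0))
model_error <- sum((X %*% beta - X %*% beta_hat)^2)

# error variance estimator for the lasso [39]
sigma2_hat <- sum((y - X %*% beta_hat)^2) / (nrow(X) - length(S_hat))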

Results
We first plot the correct sparsity results for the null model (c = 0) and the model with 1% causal SNPs (c = 0.01) in Figures 3 and 4. [...] while the error variance is inflated. Both the lasso and twostep methods have better signal recovery than ggmix; however, this signal is spread across many variables, leading to many Type 1 errors.

Discussion
We develop a general penalized LMM framework for population structure correction that simultaneously selects variables and estimates their effects in one step, while accounting for between-individual correlations. Our CGD algorithm is computationally efficient and has theoretical guarantees of convergence. We provide an easy-to-use software implementation of our algorithm along with a principled method for automatic tuning parameter selection. Through simulation studies, we show that existing approaches such as a two-stage approach or the lasso with a principal component adjustment lead to a large number of false positives. Our proposed method has excellent Type 1 error control and is robust to the inclusion of causal SNPs in the kinship matrix. This feature is important since, in practice, the kinship matrix is constructed from a random sample of SNPs across the genome, some of which are likely to be causal.
Although we derive a CGD algorithm for the ℓ 1 penalty, our approach can also be easily extended to other penalties such as the elastic net and group lasso with the same guarantees of convergence.
A limitation of ggmix is that it first requires computing the covariance matrix and its spectral decomposition, a computation that scales poorly with the number of individuals and can become prohibitive for large cohorts such as the UK Biobank, which contains genetic information on half a million individuals. When the matrix of genotypes used to construct the covariance matrix is low rank, there are additional computational speedups that can be implemented. While this has been developed for the univariate case [8], to our knowledge it has not been explored in the multivariable case. We are currently developing a low-rank version of the penalized LMM developed here, which reduces the time complexity of this step. While the predominant motivation for our approach has been association testing, we believe that there are other applications in which it can be used as well. For example, in the most recent Genetic Analysis Workshop 20 (GAW20), the causal modeling group investigated causal relationships between DNA methylation (exposure) within some genes and the change in high-density lipoproteins, ∆HDL (outcome), using Mendelian randomization (MR) [41].
Penalized regression methods could be used to select SNPs strongly associated with the exposure in order to be used as an instrumental variable (IV). However, since the GAW20 data consisted of families, two-step methods were used, which could have resulted in a large number of false positives. ggmix is an alternative approach that could be used for selecting the IV while accounting for the familial structure of the data. Our method is also suitable for fine-mapping SNP association signals in genomic regions, where the goal is to pinpoint individual variants most likely to impact the underlying biological mechanisms of disease [42].

A Block Coordinate Descent Algorithm
We use a general purpose block coordinate gradient descent algorithm (CGD) [26] to solve (15). At each iteration, the algorithm approximates the negative log-likelihood f(·) in Q_λ(·) by a strictly convex quadratic function and then applies block coordinate descent to generate a descent direction, followed by an inexact line search along this direction [26]. For continuously differentiable f(·) and convex and block-separable P(·) (i.e. P(β) = ∑_i P_i(β_i)), Tseng and Yun [26] show that the solution generated by the CGD method is a stationary point of Q_λ(·) if the coordinates are updated in a Gauss-Seidel manner, i.e. Q_λ(·) is minimized with respect to one parameter while holding all others fixed. The CGD algorithm can thus be run in parallel and is therefore suited for large p settings. It has been successfully applied in fixed effects models (e.g. [27], [20]) and in mixed models with an ℓ1 penalty [28]. Following Tseng and Yun [26], the CGD algorithm is given by Algorithm 2.
Below we detail the specifics of Algorithm 2 for the ℓ1 penalty.

Algorithm 2: Coordinate Gradient Descent Algorithm to solve (15)
Set the iteration counter k ← 0 and choose initial values for the parameter vector Θ^(0);
repeat
    Approximate the Hessian ∇²f(Θ^(k)) by a symmetric matrix H^(k);
    Compute a descent direction d^(k) by cycling through the coordinates in a Gauss-Seidel manner;
    Choose a step size α^(k) by a line search given by the Armijo rule;
    Update Θ^(k+1) ← Θ^(k) + α^(k) d^(k);
    k ← k + 1;
until the convergence criterion is satisfied;

A.1 ℓ1 penalty
The objective function is given by

Q_λ(Θ) = f(Θ) + λ ∑_{j=1}^{p} v_j |β_j|

A.1.1 Descent Direction
For simplicity, we remove the iteration counter (k) from the derivation below. The descent direction for the j-th coordinate is given by

d_j = arg min_d { ∇_j f · d + (1/2) H_jj d² + λ v_j |β_j + d| }   (40)

Writing the subgradient optimality condition for (40),

∇_j f + H_jj d + λ v_j u = 0,   where u ∈ ∂|β_j + d|   (41)

We consider each of the three cases in (41) below.

Case 1: β_j + d > 0, so that u = 1 and d = −(∇_j f + λ v_j)/H_jj. Since λ > 0 and H_jj > 0, this case holds when H_jj β_j − ∇_j f > λ v_j. The solution can be written compactly as d = −(∇_j f + λ v_j)/H_jj.

Case 2: β_j + d < 0, so that u = −1 and d = −(∇_j f − λ v_j)/H_jj. Since λ > 0 and H_jj > 0, this case holds when H_jj β_j − ∇_j f < −λ v_j. Again, the solution can be written compactly as d = −(∇_j f − λ v_j)/H_jj.

Case 3: β_j + d = 0, so that d = −β_j with u ∈ [−1, 1]. For −1 ≤ u ≤ 1, λ > 0 and H_jj > 0, this case holds when |H_jj β_j − ∇_j f| ≤ λ v_j.

All three cases lead to the same compact solution for (40). Therefore the descent direction for the coordinates {β_1, . . . , β_p} under the ℓ1 penalty is given by

d_j = S_{λ v_j}(H_jj β_j − ∇_j f) / H_jj − β_j   (43)

with d_0 = −∇_0 f / H_00 for the unpenalized intercept.

A.1.2 Solution for the β parameter
If the Hessian ∇²f(Θ^(k)) is positive definite, then H^(k) defined in (33) is equal to ∇²f(Θ^(k)). Using α_init = 1, the Armijo rule inequality is satisfied at the first trial step, so the update for the β parameter is given by

β_j^(k+1) = β_j^(k) + d_j   (44)

Substituting the descent direction given by (43) into (44), we get

β_j^(k+1) = S_{λ v_j}( H_jj β_j^(k) − ∇_j f ) / H_jj   (45)

We can further simplify this expression. Let w_i = 1/(1 + η(Λ_i − 1)) as in (17). Re-write the part of the negative log-likelihood in (13) that depends on β as

f(β) = (1 / (2σ²)) ∑_{i=1}^{N_T} w_i ( Ỹ_i − ∑_{j=0}^{p} X̃_{ij} β_j )²   (47)

The gradient and Hessian are given by

∇_j f = −(1/σ²) ∑_{i=1}^{N_T} w_i X̃_{ij} ( Ỹ_i − ∑_{ℓ=0}^{p} X̃_{iℓ} β_ℓ )   (48)

H_jj = (1/σ²) ∑_{i=1}^{N_T} w_i X̃_{ij}²   (49)

Substituting (48) and (49) into H_jj β_j − ∇_j f gives

H_jj β_j − ∇_j f = (1/σ²) ∑_{i=1}^{N_T} w_i X̃_{ij} ( Ỹ_i − ∑_{ℓ≠j} X̃_{iℓ} β_ℓ )   (50, 51)

Finally, substituting (50) and (51) into (45), we get

β_j^(k+1) = S_{λ v_j σ²}( ∑_{i=1}^{N_T} w_i X̃_{ij} ( Ỹ_i − ∑_{ℓ≠j} X̃_{iℓ} β_ℓ ) ) / ( ∑_{i=1}^{N_T} w_i X̃_{ij}² )   (52)

where S_λ(x) = sign(x)(|x| − λ)_+ is the soft-thresholding operator and (x)_+ = max(x, 0).

B Number of Active Variables for Null Model

[Figure: number of active variables selected under the null model across the tuning parameter sequence; a variable is active if its estimated coefficient is non-zero.]

C The ggmix Package

In this section we briefly introduce the freely available and open source ggmix package in R.
More comprehensive documentation is available at https://sahirbhatnagar.com/ggmix.
Note that this entire section is reproducible; the code and text are combined in an .Rnw file and compiled using knitr [43].
There are three basic inputs that ggmix needs:

• x: a matrix of covariates (e.g. SNPs) of dimension n × p, where n is the sample size and p the number of covariates
• y: a continuous response vector of length n
• kinship: an n × n kinship matrix (alternatively, its eigendecomposition can be supplied)

C.2 Fit the linear mixed model with Lasso Penalty
We will use the most basic call to the main function of this package, which is called ggmix.

## [1] "lassofullrank" "ggmix_fit"
We can see the solution path for each variable by calling the plot method for objects of class ggmix_fit. Estimated coefficients at particular values of the tuning parameter can be extracted with the coef method; here, s specifies the value(s) of λ at which the extraction is made. The function uses linear interpolation to make predictions for values of s that do not coincide with the lambda sequence used in the fitting algorithm.
We can also get predictions (Xβ̂) using the predict method for objects of class ggmix_fit:
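For illustration, the accessor calls look as follows; the values of s are arbitrary, and the data element names follow the earlier (assumed) ggmix call:

plot(fit)                                   # coefficient paths versus lambda
coef(fit, s = c(0.1, 0.05))                 # coefficients at two values of lambda
predict(fit, newx = admixed$x, s = 0.05)    # fitted values X %*% beta_hat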

C.3 Find the Optimal Value of the Tuning Parameter
We use the Generalized Information Criterion (GIC) to select the optimal value for λ. The default is a_n = log(log(n)) * log(p), which corresponds to a high-dimensional BIC (HD-BIC):
# pass the fitted object from ggmix to the gic function:
hdbic <- gic(fit)
class(hdbic)
## [1] "ggmix_gic"     "lassofullrank" "ggmix_fit"

# we can also fit the BIC by specifying the an argument
bicfit <- gic(fit, an = log(length(admixed$y)))

We can plot the HDBIC values against log(λ) using the plot method for objects of class ggmix_gic.

[Figure: HDBIC and BIC values plotted against log(λ), with the number of active variables shown along the top axis.]

The optimal value of λ according to the HDBIC, i.e., the λ that leads to the minimum HDBIC, can be extracted from the hdbic object. We can also extract just the nonzero coefficients, which also include the estimated variance components η̂ and σ̂². We can also make predictions from the hdbic object, which by default will use the model corresponding to the optimal tuning parameter.

C.5 Extracting Random Effects
The user can compute the random effects using the provided ranef method for objects of class ggmix_gic. This command will compute the estimated random effects for each subject using the parameters of the selected model:
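For example (hdbic is the object selected by the GIC above):

b_hat <- ranef(hdbic)   # predicted random effect for each subject
head(b_hat)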

C.6 Diagnostic Plots
We can also plot some standard diagnostic plots, such as the observed versus predicted response, QQ-plots of the residuals and random effects, and the Tukey-Anscombe plot. These can be plotted using the plot method on a ggmix_gic object as shown below.