Analysis of Genome-Wide Association Studies with Multiple Outcomes Using Penalization

Genome-wide association studies have been extensively conducted, searching for markers for biologically meaningful outcomes and phenotypes. Penalization methods have been adopted in the analysis of the joint effects of a large number of SNPs (single nucleotide polymorphisms) and marker identification. This study is partly motivated by the analysis of heterogeneous stock mice dataset, in which multiple correlated phenotypes and a large number of SNPs are available. Existing penalization methods designed to analyze a single response variable cannot accommodate the correlation among multiple response variables. With multiple response variables sharing the same set of markers, joint modeling is first employed to accommodate the correlation. The group Lasso approach is adopted to select markers associated with all the outcome variables. An efficient computational algorithm is developed. Simulation study and analysis of the heterogeneous stock mice dataset show that the proposed method can outperform existing penalization methods.


Introduction
This study has been partly motivated by the analysis of the genetic architecture of complex traits in heterogeneous stock mice from Wellcome Trust Center. This data resource, which also includes pedigree information, was based on an advanced intercross mating among 8 inbred strains over 50 generations of random mating [1,2], since the use of pseudorandom breeding for over 50 generations should result in an average distance between recombinants of v2 cM. The average linkage disequilibrium (LD), as measured by R 2 between adjacent markers, is 0.62 [3]. As with many complex mammal diseases, clinical risk factors and environmental exposures have failed to provide a comprehensive description of immunological disorders. The laboratory mouse is a key model organism for understanding gene function in mammals. Valdar et al. [1,4] conducted a genome-wide association study and gene-environment interaction modeling to search for genetic markers possibly correlated to phenotypes. We analyze the CD4/ CD8 ratio and CD4:CD3 in this study. CD4/CD8 ratio, which is also known as the T-Lymphocyte Helper/Suppressor Profile, is a basic laboratory test in which the percentage of CD3-positive lymphocytes in the blood positive for CD4 (T helper cells) and CD8 (a class of regulatory T cells) are counted and compared. CD4:CD3 is another clinical index for immunological diseases. Both indices are related to the diagnosis of immunological diseases. Since the indices, CD4/CD8 ratio and CD4:CD3, are highly correlated and mechanisms behind them are related, the potentially associated genetic markers are expected to be very similar. Thus it may be more powerful to analyze the phenotypes simultaneously.
GWAS data have high dimensionality. Conventional statistical approaches analyze one SNP at a time and then adjust for multiple comparisons. Such approaches are easy to implement, however, they may contradict the fact that the development and progression of complex diseases and traits are caused by the aggregated effects of multiple SNPs. They may miss SNPs with weak marginal but strong joint effects. In the analysis of joint effects of a large number of SNPs, regularized estimation is needed. In addition, it is expected that only a subset of profiled SNPs are associated with the response variables. Thus, marker selection is needed along with estimation.
With high-dimensional data, penalization has been extensively applied for regularized estimation and variable selection. Commonly used penalization methods include Lasso, elastic net, bridge, SCAD, MCP and others. Such methods can effectively analyze data with a single response variable with interchangeable covariate effects. When there exists hierarchical structure among covariates, for example the ''pathway, SNP-within-pathway'' twolevel structure, the ''group'' version of the aforementioned penalization methods have been proposed. The group penalty is usually a composite penalty. For example with group SCAD [5], the outer is the SCAD penalty, and the inner is the ridge penalty. We note that such group penalization methods are mainly used for the analysis of data with a single response variable.
In this study, our goal is to analyze data with multiple correlated response variables and conduct marker selection. In classic statistical analysis with a small number of covariates, data with multiple response variables can be accommodated under the framework of multivariate analysis of variance (MANOVA) [6] and multivariate analysis of covariance (MANCOVA). However, such methods cannot accommodate high dimensional covariates. It is possible to first apply existing penalization methods, for example Lasso, analyze each response variable separately, and then combine the analysis results using meta-analysis methods. However, such an approach ignores the correlation among response variables and hence can be less informative. Yuan and Ekici [7] introduced a nuclear norm approach encouraging the sparsity among singular values which at the same time gives shrinkage coefficient estimates and thus conducts dimension reduction and coefficient estimation simultaneously in multivariate linear models. Chen et al. [8] proposed an approach for solving reduced rank multivariate stochastic regression models.
In the heterogeneous stock mice dataset, there are multiple continuously distributed, highly correlated response variables. Under a joint modeling framework, we propose first transforming multiresponse data into uni-response data following the same distribution. Then a group Lasso approach is applied to the transformed uniresponse data. With two responses, the effect of one SNP needs to be represented by two regression coefficients, which naturally form a ''group''. We emphasize that, unlike other group penalization studies in which one group usually corresponds to multiple covariates, here one group corresponds to a single covariate for multiple responses.

Analysis of multi-response data
Consider data with multiple correlated response variables. With data like the heterogeneous stock mice from Wellcome Trust Center, it is reasonable to assume that multiple responses share a certain common genetic basis, particularly the same set of susceptible SNPs. However, we note that although the response variables are correlated, they are not identical. With the inherent heterogeneity, it is not sensible to reinforce the same model with the same regression coefficients for different response variables.
Let M(w1) be the number of response variables, n be the number of subjects, and p be the number of SNPs. Denote y 1 , . . . ,y M as the response variables and x as the n|p covariate matrix. For m~1, . . . ,M, assume that y m is associated with x via the linear model is the regression coefficient corresponding to the mth response variable. We first transform the original data frame. For simplicity of notation, we use the same symbol y but with different subscripts for the new response variable. Although the proposed method can accommodate different covariates for distinct response variables, we assume that the same set of covariates are measured for all responses. Let y i be the length-M vector of response variables for the ith subject, and y~(y The regression coefficient B and corresponding model have the following features. First, only the first four response-associated SNPs have nonzero regression coefficients (i.e. the model is sparse). Thus marker identification amounts to identifying SNPs with non-zero regression coefficients. This strategy has been commonly used in regularized marker selection. Second, as the two response variables share the same susceptible SNPs, there is a natural grouping structure with the transformed covariates. For example, the first two regression coefficients/covariates correspond to the first SNP. Thus, they form a group of size two and should be selected at the same time. Motivated by the heterogeneous stock mice dataset, we describe the proposed approach for studies with quantitative traits under linear models. The proposed approach can be extended to other types of response variables and other statistical models, as long as the joint modeling of response variables can be conducted. In a study with M response variables, the least square loss function for transformed data can be written as where S is the covariance matrix for residuals.

Penalized estimation and marker selection
is the coefficients for the M responses at the jth locus. We defineB B as the minimizer of the penalized least squares loss function: Hereỹ y i~S DD is the L 2 norm, and d j is the number of levels at the jth locus (equals to M under the present setting). Note that prior to the transformation, we assume that the response follows a multivariate normal distribution. In contrast, after transformation, each element in the new responseỹ y follows a univariate normal distribution. We center the response and make the grand mean equal to zero.
The proposed penalty has been motivated by the following considerations. For a given SNP locus, we treat its regression coefficients for M response variables as a group, so that we can evaluate its overall effects. The within-group penalty has an L 2 norm, and the group-level penalty has an L 1 norm. Thus, the proposed penalty may have the following main properties. First, it can conduct group-level selection. Second, if a group is selected, then all members within that group are selected with non-zero estimates. But the magnitudes of regression coefficients may differ. On the other hand, if a group is not selected, all of its members are set to be zero. Such properties fit the goal of the proposed analysis.
As discussed in [9], we need to orthogonalize the transformed covariates block-wise in order to achieve computational efficiency. Write S j~R 0 j R j for an upper triangular matrix R j via Cholesky decomposition. Assume that S j is invertible. Let V j~Ũ U j R {1 j and b j~Rj b j , then the penalized least-squares in expression (1) becomes If we centerỹ y, there is no need to fit for intercept for (2). Computational algorithm. We use the group cyclical coordinate descent (GCD) algorithm. The GCD algorithm is a natural extension of the coordinate descent algorithm [10]. It optimizes a target function with respect to a single group parameter at a time and iteratively cycles through all group parameters until convergence. It is particularly suitable for problems as the present one which has a simple closed-form solution with a single group but lacks one with multiple groups.
The GCD algorithm proceeds as follows. For a given l, . Initialize the vector of residuals r~ỹ y{ P p j~1 V jb b (0) j and s~0. 2. For j~1, . . . ,p, repeat the following steps: Step 2 until convergence.
Breheny and Huang [11] discussed the convergence of coordinate descent algorithms for SCAD and MCP. We now consider the GCD for group Lasso. For any given l, starting from an initial estimate b (0) , the GCD algorithm generates a sequence of Since the sequence fQ(b s ; l) : s~1,2 . . .g is non-increasing and bounded below by 0, it always converges. The following proposition is concerned about the convergence of fb (s) : s~1,2 . . .g. Proposition 1 For any fixed l, the GCD updates fb (s) : s~1,2 . . .g converge to a global minimizer of the group Lasso criterion Q(l) and satisfy the inequality This proposition can be proved following the arguments of [12] who established the convergence of the coordinate descent algorithm for concave penalized selection methods including the Lasso.
Choice of tuning parameter. There are various methods that can be applied, including for example AIC, BIC, crossvalidation, and generalized cross-validation. Chen and Chen [13] developed a family of extended Bayesian information criteria (EBIC) to overcome the overly liberal selection problem caused by the small-n-large-p situation. Furthermore, Chen and Chen [14] established the consistency of EBIC under the generalized linear models in the small-n-large-p situation. For group Lasso, Yuan and Lin [15] proposed an approximation of the degree of freedom (DF). Here, we apply EBIC with an approximated DF to select the tuning parameter l. The EBIC is defined as: where RSS l is the residual sum of squares under a fixed l. The DF for group Lasso [15] is defined as: where p j is the number of predictors in the jth group andb b LS j is the least-square estimate for the jth group obtained by fitting group j only.
Note that when p j~1 for j~1, . . . ,p, group Lasso becomes Lasso, and its DF is the number of non-zero parameters selected. Therefore, one can take Lasso as a special case of group Lasso, and so does the DF in expression (4).
Significance level for the selected SNPs. With penalization methods, the relevance of a covariate usually is determined by whether its regression coefficient is nonzero. As secondary analysis, it may also be of interest to compute the p value. However, it should be noted that it is usually insensible to use both estimation magnitude (zero or nonzero) and significance level for selection.
Here, we use a multi-split method modified from the one proposed by Meinshausen et al. [17] to obtain p-values. With linear regression, we use F -test for each group to evaluate whether there are elements in this group with significant effects. This procedure puts us in a position to obtain p-values at the group level. It is simulation-based and can adjust for multiple comparisons. The multi-split method proceeds as follows: 1. Randomly split data into two disjoint sets of equal size: D in and D out . 2. Fit data in D in with the proposed method. Denote the set of selected groups by S. 3. ComputeP P j , p-value for group j(~1, . . . ,p), as follows: (a) If group j is in set S, setP P j equal to the p-value from the F-test in the regular linear regression where group j is the only group.
where DSD is the size of set S.
This procedure is repeated B times for each group. Let It is shown in [17] thatQ Q j (p) is an asymptotically correct p-value, adjusted for multiplicity. The authors also proposed an adaptive version that selects a suitable value of quartile based on data: where p 0 is chosen to be 0.05. It is shown that fQ j ,j~1, . . . ,pg, can be used for both FWER (family-wise error rate) and FDR (false discovery rate) control [17].

Simulation studies
In simulation, we consider six different scenarios, each with 500 subjects and 5,000 or 10,000 SNPs. For each subject, we simulate two response variables. The correlation between the two responses is set to be 0.1, 0.5 or 0.9, representing weak, moderate and strong correlations. For each response variable, there are twelve SNPs with nonzero effects. Those twelve SNPs can be grouped into three clusters. Among each cluster, the correlation between two SNPs is 0.2. The correlation among SNPs not associated with response is set to be 0.2. Response-associated and noisy SNPs are independent. More specifically, the genotypes are first generated from multivariate normal distributions and then categorized into 0, 1 or 2. To mimic a SNP with equal allele frequency, we categorize genotype in a way similar to [16]. The genotype is set to be 0, 1 or 2 depending on whether x ij v{c, {cƒx ij ƒc or x ij wc, where c is the 3rd-quartile of x. For the first response variable, the regression coefficient is  The two response variables depend on the same genotypic data and are correlated through the residuals. Clustering structure exists in this simulation.  To better gauge performance of the proposed approach, we also consider the following alternative approach. We first analyze each response variable separately using Lasso, and then combine the results by examining the overlapped SNPs. For both approaches, we apply the EBIC method described in the previous section to select the tuning parameter l. We evaluate the number of SNPs identified, number of true positives, false discovery rate (FDR) and false negative rate (FNR). In addition, estimation performance is also evaluated using SSE (sum of squared error).
Results based on 100 replicates are summarized in Table 1. Note that the true response-associated SNPs are 25-28, 41-44 and 57-60 for both responses. In total, there are 24 SNPs associated with the two responses. Table 1 shows that under all simulation scenarios, the proposed approach is able to identify almost all of the true positives, significantly more than the individual-dataset approach. The price is a few more false positives. With the proposed approach, the highest FDR is 0.18, which can be acceptable in practice. Under all scenarios, the proposed approach has significantly smaller SSEs. Taking both marker identification and estimation into consideration, we conclude that the proposed approach provides a competitive alterative to the existing individual-dataset approach. For one simulated dataset, p-values evaluated by the multi-split method for the selected groups are presented in Table 2. It can be seen that many true positives have significant p-values, while all false positives have insignificant pvalues.
With the proposed approach, it is assumed that the multiple responses of interest have exact the same set of important SNPs. Such an assumption is reasonable under some settings but too restricted under others. To get a more comprehensive understanding of the proposed approach, we also conduct simulation where the two sets of important SNPs are partially matched. In Table 3, we consider the simulation setting where 25% of the important SNPs are not matched. In Table 4, we consider the scenario with 50% unmatched important SNPs. Under both simulation scenarios, the proposed approach identifies more true positives. However, the model sizes and FDRs are much larger. Such an observation is reasonable: for a SNP associated with a single response variable, when it is identified using the proposed approach, this SNP is automatically identified for the response variable it is not associated with, creating one false positive. Thus with the proposed approach and partially matched important SNP sets, identifying more true positives inevitably leads to much larger model sizes. It is interesting to note that under all simulation scenarios, the proposed approach has significantly smaller SSEs.
Here we focus on the scenario with two response variables to match the data analyzed in the next section. It is possible to conduct analysis with three or more responses, which may have higher computational cost.

Application to heterogeneous stock mice dataset
The heterogeneous stock mice dataset is described in the Introduction section. We refer to the original publication for more detailed descriptions [1,2,4]. This dataset includes fully phenotypic records on 2,202 mice, and each was genotyped for 13,459 SNP markers. In joint modeling, SNPs with missingness cannot be included. Thus, we implement fastPHASE to impute the missingness in SNPs [18]. After deleting observations with missing phenotypes and alleles with minor allele frequency less than 0.05, there are 1,514 mice and 9,991 SNP markers in 19 autosomes. We analyze the data using three different approaches: the traditional one-SNP-at-a-time approach, analysis of individual response using Lasso, and the proposed approach. In Figure 1, we show the absolute values of b estimates from the single-SNP analysis on both CD4/CD8 ratio and CD4:CD3. Here single-SNP analysis is conducted using a Bonferroni approach with overall p-value 0.05. In Figure 2, we show the DbD from Lasso on both phenotypes and DDbDD from the proposed method. In Figure 1, one can see that the signal to noise ratio is weak, and it is difficult to tell the real associated signals from background. In contrast, the signal to noise ratio is strong, and a small number of SNPs are selected by using the Lasso and proposed method. When analyzing each response separately using Lasso and multiple responses using the proposed method, we use the method described in the previous section to select the tuning parameter l. We use the multi-split method to evaluate the significance of selected SNPs. In Figure 2, the larger dots stand for the selected SNPs with significant p-values. In Table 5, the total number of significant SNPs is summarized in the parenthesis for the Lasso on both phenotypes and the proposed method. Detailed information on the selected SNPs by the proposed method and individual Lasso methods on both CD4/ CD8 ratio and CD4:CD3 is presented in Table 6, Table 7 and Table 8, respectively. Note that there is no one-to-one correspondence between the magnitude of estimates and significance level. Such an observation is not uncommon in regression analysis. In addition, the proposed penalization approach is based on Lasso, which is known to shrink estimates towards zero. Another observation is that SNPs in high LD may have very different estimates, which is also ''as expected''. In single-response analysis, Lasso has the tendency to select one out of a set of highly correlated covariates. Thus, it is possible or even likely that out of the SNPs with high LD, one may have a large estimate while others have very small or zero estimates. The numbers of selected SNPs and overlaps among the proposed method, the Lasso method and single-SNP analysis are presented in Table 5. We see that the single-SNP analysis selects a large number of SNPs. This may be due to the fact that the selection of assayed SNPs is not totally random.
With our limited knowledge on susceptibility SNPs for immunology, we are not able to objectively evaluate the biological implications of identified SNPs. As an alternative, we consider the following evaluation of prediction performance, which may provide partial information on identification performance. (a) Randomly split the sample into five parts with equal sizes; (b) Analyze four parts using the proposed approach; (c) Use the obtained model and make prediction for subjects in the left-out part; (d) Repeat Steps (b) and (c) over all five parts. For comparison, the same approach is also used to evaluate the individual Lasso approach. The prediction mean squared errors are 1.66 for the proposed approach and 2.33 for the combined Lasso. By jointly analyzing two responses, the proposed approach has better prediction performance.  Table 5. Number of SNPs identified, and overlap of SNPs among the proposed method, the Lasso and single-SNP analysis for heterogeneous stock mice dataset.

Discussion
In the study of complex diseases, it is not uncommon that a single trait cannot provide a comprehensive description, and multiple traits need to be measured. In this article, we analyze data with multiple response variables under the assumption that they have the same set of important SNPs. A penalization approach is developed for marker selection. The proposed approach can accommodate the joint effects of multiple SNPs and be more informative than single-SNP analysis. Compared with the existing approaches that analyze different traits separately, it can more effectively accommodate the correlation among traits and hence be more efficient in marker selection. Numerical studies, including simulation and analysis of the heterogeneous stock mice dataset, show satisfactory performance of the proposed approach.
The heterogeneous stock mice data have two continuous response variables with marginally normal distributions. With other types of response variables, there is a rich literature on joint modeling, which can be adopted to couple with the proposed marker selection. The proposed approach is based on the group Lasso penalty. We expect that other ''group-type'' penalties, such as group SCAD or group bridge, can be applied. The group Lasso is selected because of its relatively low computational cost, which is especially desirable with high-throughput data. In our numerical study, we focus on the scenario where the MAFs are not very low. When the MAFs are low, our unpublished numerical study suggests that penalization methods may not perform well because the covariate design matrix is ''overly sparse''. Using penalization methods with rare variants is still being explored. Analysis of the heterogeneous stock mice data shows that the proposed approach can identify SNPs missed by single-response analysis. In addition, it has improved prediction performance. Therefore, the proposed method provides a useful alternative to the current analysis of multivariate traits in GWAS.