Analysis of Genome-Wide Association Studies with Multiple Outcomes Using Penalization

  • Jin Liu, 
  • Jian Huang, 
  • Shuangge Ma

Abstract

Genome-wide association studies have been extensively conducted, searching for markers for biologically meaningful outcomes and phenotypes. Penalization methods have been adopted in the analysis of the joint effects of a large number of SNPs (single nucleotide polymorphisms) and marker identification. This study is partly motivated by the analysis of a heterogeneous stock mice dataset, in which multiple correlated phenotypes and a large number of SNPs are available. Existing penalization methods designed to analyze a single response variable cannot accommodate the correlation among multiple response variables. With multiple response variables sharing the same set of markers, joint modeling is first employed to accommodate the correlation. The group Lasso approach is then adopted to select markers associated with all the outcome variables. An efficient computational algorithm is developed. A simulation study and analysis of the heterogeneous stock mice dataset show that the proposed method can outperform existing penalization methods.

Introduction

This study has been partly motivated by the analysis of the genetic architecture of complex traits in heterogeneous stock mice from the Wellcome Trust Center. This data resource, which also includes pedigree information, is based on an advanced intercross among 8 inbred strains followed by over 50 generations of pseudorandom mating [1], [2], which should result in an average distance between recombinants of 2 cM. The average linkage disequilibrium (LD), as measured by $r^2$ between adjacent markers, is 0.62 [3]. As with many complex mammalian diseases, clinical risk factors and environmental exposures have failed to provide a comprehensive description of immunological disorders. The laboratory mouse is a key model organism for understanding gene function in mammals. Valdar et al. [1], [4] conducted a genome-wide association study and gene-environment interaction modeling to search for genetic markers possibly associated with phenotypes. We analyze the CD4/CD8 ratio and CD4∶CD3 in this study. The CD4/CD8 ratio, which is also known as the T-Lymphocyte Helper/Suppressor Profile, is a basic laboratory test in which the percentages of CD3-positive lymphocytes in the blood that are positive for CD4 (T helper cells) and CD8 (a class of regulatory T cells) are counted and compared. CD4∶CD3 is another clinical index for immunological diseases. Both indices are related to the diagnosis of immunological diseases. Since the two indices are highly correlated and the mechanisms behind them are related, the potentially associated genetic markers are expected to be very similar. Thus it may be more powerful to analyze the phenotypes simultaneously.

GWAS data have high dimensionality. Conventional statistical approaches analyze one SNP at a time and then adjust for multiple comparisons. Such approaches are easy to implement. However, they may contradict the fact that the development and progression of complex diseases and traits are caused by the aggregated effects of multiple SNPs, and they may miss SNPs with weak marginal but strong joint effects. In the analysis of joint effects of a large number of SNPs, regularized estimation is needed. In addition, it is expected that only a subset of profiled SNPs are associated with the response variables. Thus, marker selection is needed along with estimation.

With high-dimensional data, penalization has been extensively applied for regularized estimation and variable selection. Commonly used penalization methods include Lasso, elastic net, bridge, SCAD, MCP and others. Such methods can effectively analyze data with a single response variable and interchangeable covariate effects. When there exists a hierarchical structure among covariates, for example the “pathway, SNP-within-pathway” two-level structure, “group” versions of the aforementioned penalization methods have been proposed. The group penalty is usually a composite penalty. For example, with group SCAD [5], the outer penalty is SCAD and the inner penalty is ridge. We note that such group penalization methods are mainly used for the analysis of data with a single response variable.

In this study, our goal is to analyze data with multiple correlated response variables and conduct marker selection. In classic statistical analysis with a small number of covariates, data with multiple response variables can be accommodated under the framework of multivariate analysis of variance (MANOVA) [6] and multivariate analysis of covariance (MANCOVA). However, such methods cannot accommodate high dimensional covariates. It is possible to first apply existing penalization methods, for example Lasso, analyze each response variable separately, and then combine the analysis results using meta-analysis methods. However, such an approach ignores the correlation among response variables and hence can be less informative. Yuan and Ekici [7] introduced a nuclear norm approach encouraging the sparsity among singular values which at the same time gives shrinkage coefficient estimates and thus conducts dimension reduction and coefficient estimation simultaneously in multivariate linear models. Chen et al. [8] proposed an approach for solving reduced rank multivariate stochastic regression models.

In the heterogeneous stock mice dataset, there are multiple continuously distributed, highly correlated response variables. Under a joint modeling framework, we propose first transforming multi-response data into uni-response data following the same distribution. Then a group Lasso approach is applied to the transformed uni-response data. With two responses, the effect of one SNP needs to be represented by two regression coefficients, which naturally form a “group”. We emphasize that, unlike other group penalization studies in which one group usually corresponds to multiple covariates, here one group corresponds to a single covariate for multiple responses.

Materials and Methods

Analysis of multi-response data

Consider data with multiple correlated response variables. With data like the heterogeneous stock mice from Wellcome Trust Center, it is reasonable to assume that multiple responses share a certain common genetic basis, particularly the same set of susceptibility SNPs. However, we note that although the response variables are correlated, they are not identical. With the inherent heterogeneity, it is not sensible to enforce the same model with the same regression coefficients for different response variables.

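Let $m$ be the number of response variables, $n$ be the number of subjects, and $p$ be the number of SNPs. Denote $Y = (y^{(1)}, \ldots, y^{(m)})$ as the $n \times m$ matrix of response variables and $X$ as the $n \times p$ covariate matrix. For $k = 1, \ldots, m$, assume that $y^{(k)}$ is associated with $X$ via the linear model $y^{(k)} = X\beta^{(k)} + \epsilon^{(k)}$, where $\beta^{(k)}$ is the regression coefficient vector corresponding to the $k$th response variable. We first transform the original data frame. For simplicity of notation, we use the same symbol but with different subscripts for the new response variable. Although the proposed method can accommodate different covariates for distinct response variables, we assume that the same set of covariates is measured for all responses. Let $y_i = (y_{i1}, \ldots, y_{im})'$ be the length-$m$ vector of response variables for the $i$th subject, and $y = (y_1', \ldots, y_n')'$. Covariates for the $i$th subject have the form $X_i = (X_{i1}, \ldots, X_{ip})$, where $X_{ij} = x_{ij} I_m$ and $I_m$ is the $m \times m$ identity matrix. The regression coefficient vector is then $\beta = (\beta_1', \ldots, \beta_p')'$, where $\beta_j = (\beta_{j1}, \ldots, \beta_{jm})'$ collects the coefficients of the $j$th SNP for the $m$ responses, so that $y_i = X_i\beta + \epsilon_i$.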
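The stacking described above can be sketched in a few lines of Python. This is only an illustration, under the assumption that the covariate block for subject $i$ and SNP $j$ is $x_{ij} I_m$; the function name `stack_multiresponse` and the toy data are hypothetical and not part of the original analysis.

```python
def stack_multiresponse(Y, X):
    """Stack an n x m response matrix and an n x p genotype matrix into long form.

    Returns (y, Xs): y has length n*m (subject by subject), and Xs is an
    (n*m) x (p*m) design whose columns j*m .. j*m+m-1 form the group of SNP j.
    """
    n, m, p = len(Y), len(Y[0]), len(X[0])
    y, Xs = [], []
    for i in range(n):
        for k in range(m):
            y.append(Y[i][k])
            row = [0.0] * (p * m)
            for j in range(p):
                # Row k of the block x_ij * I_m: only response k is loaded.
                row[j * m + k] = float(X[i][j])
            Xs.append(row)
    return y, Xs


# Toy data: n = 2 subjects, m = 2 responses, p = 3 SNPs coded 0/1/2.
Y = [[1.0, 2.0], [3.0, 4.0]]
X = [[0, 1, 2], [2, 0, 1]]
y, Xs = stack_multiresponse(Y, X)
```

Each SNP contributes `m` adjacent columns to the stacked design, and it is this set of columns that is later selected or dropped as a unit.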

To better illustrate the basic features of the model settings here, consider a dataset with $m = 2$ response variables and $p$ SNPs. Assume that only the first four SNPs are associated with responses. Writing $\beta_{jk}$ for the nonzero coefficient of SNP $j$ for response $k$, the coefficients may look like
$$\beta^{(1)} = (\beta_{11}, \beta_{21}, \beta_{31}, \beta_{41}, 0, \ldots, 0)', \qquad \beta^{(2)} = (\beta_{12}, \beta_{22}, \beta_{32}, \beta_{42}, 0, \ldots, 0)',$$
and correspondingly,
$$\beta = (\beta_{11}, \beta_{12}, \beta_{21}, \beta_{22}, \beta_{31}, \beta_{32}, \beta_{41}, \beta_{42}, 0, \ldots, 0)'.$$
The regression coefficient and corresponding model have the following features. First, only the first four response-associated SNPs have nonzero regression coefficients (i.e. the model is sparse). Thus marker identification amounts to identifying SNPs with non-zero regression coefficients. This strategy has been commonly used in regularized marker selection. Second, as the two response variables share the same susceptibility SNPs, there is a natural grouping structure with the transformed covariates. For example, the first two regression coefficients/covariates correspond to the first SNP. Thus, they form a group of size two and should be selected at the same time.

Motivated by the heterogeneous stock mice dataset, we describe the proposed approach for studies with quantitative traits under linear models. The proposed approach can be extended to other types of response variables and other statistical models, as long as the joint modeling of response variables can be conducted. In a study with $m$ response variables, the least square loss function for transformed data can be written as
$$\frac{1}{2n}\sum_{i=1}^{n} (y_i - X_i\beta)'\,\Sigma^{-1}\,(y_i - X_i\beta),$$
where $y_i$ and $X_i$ are the response vector and covariate block of the $i$th subject and $\Sigma$ is the $m \times m$ covariance matrix for residuals.

Penalized estimation and marker selection

Penalized estimation.

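From definition, $\beta_j$ is the vector of coefficients for the $m$ responses at the $j$th locus. We define $\hat\beta$ as the minimizer of the penalized least squares loss function:
$$\frac{1}{2n}\,\|\tilde y - \tilde X\beta\|_2^2 + \lambda\sum_{j=1}^{p}\sqrt{d_j}\,\|\beta_j\|_2. \tag{1}$$
Here $\tilde y = (\tilde y_1', \ldots, \tilde y_n')'$, $\tilde y_i = \Sigma^{-1/2} y_i$, $\tilde X = (\tilde X_1', \ldots, \tilde X_n')'$, $\tilde X_i = \Sigma^{-1/2} X_i$, $\beta = (\beta_1', \ldots, \beta_p')'$, $\Sigma$ is the covariance matrix for residuals, $\lambda \ge 0$ is the tuning parameter, $\|\cdot\|_2$ is the $L_2$ norm, and $d_j$ is the number of levels at the $j$th locus (equal to $m$ under the present setting). Note that prior to the transformation, we assume that the response follows a multivariate normal distribution. In contrast, after transformation, each element in the new response follows a univariate normal distribution. We center the response and make the grand mean equal to zero.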

The proposed penalty has been motivated by the following considerations. For a given SNP locus, we treat its regression coefficients for the $m$ response variables as a group, so that we can evaluate its overall effects. The within-group penalty has an $L_2$ norm, and the group-level penalty has an $L_1$ norm. Thus, the proposed penalty has the following main properties. First, it can conduct group-level selection. Second, if a group is selected, then all members within that group are selected with non-zero estimates, although the magnitudes of the regression coefficients may differ. On the other hand, if a group is not selected, all of its members are set to zero. Such properties fit the goal of the proposed analysis.

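As discussed in [9], we need to orthogonalize the transformed covariates block-wise in order to achieve computational efficiency. Let $\tilde X_j$ denote the columns of the transformed covariate matrix corresponding to the $j$th locus. Write $\tilde X_j'\tilde X_j / n = R_j'R_j$ for an upper triangular matrix $R_j$ via Cholesky decomposition. Assume that $R_j$ is invertible. Let $Z_j = \tilde X_j R_j^{-1}$ and $b_j = R_j\beta_j$, so that $Z_j'Z_j/n$ is the identity matrix and $\tilde X_j\beta_j = Z_j b_j$. The penalized least-squares criterion in expression (1), with the penalty applied to the orthogonalized coefficients, then becomes
$$\frac{1}{2n}\,\Big\|\tilde y - \sum_{j=1}^{p} Z_j b_j\Big\|_2^2 + \lambda\sum_{j=1}^{p}\sqrt{d_j}\,\|b_j\|_2. \tag{2}$$
If we center the response, there is no need to fit an intercept in (2).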
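The block-wise orthogonalization can be illustrated with a minimal pure-Python sketch. The helper names (`gram`, `cholesky_upper`, `solve_upper_right`) and the toy block `Xj` are hypothetical; the sketch only assumes that a group's covariate block has full column rank, so that its Gram matrix has an invertible Cholesky factor.

```python
import math


def gram(Xj):
    """Return Xj'Xj / n for a block stored as a list of rows."""
    n, d = len(Xj), len(Xj[0])
    return [[sum(row[a] * row[b] for row in Xj) / n for b in range(d)]
            for a in range(d)]


def cholesky_upper(A):
    """Return the upper-triangular R with R'R = A for a small SPD matrix A."""
    d = len(A)
    L = [[0.0] * d for _ in range(d)]
    for r in range(d):
        for c in range(r + 1):
            s = A[r][c] - sum(L[r][k] * L[c][k] for k in range(c))
            L[r][c] = math.sqrt(s) if r == c else s / L[c][c]
    return [[L[c][r] for c in range(d)] for r in range(d)]  # R is L transposed


def solve_upper_right(Xj, R):
    """Return Z = Xj R^{-1} by solving Z R = Xj one row at a time."""
    d = len(R)
    Z = []
    for row in Xj:
        z = [0.0] * d
        for c in range(d):
            z[c] = (row[c] - sum(z[k] * R[k][c] for k in range(c))) / R[c][c]
        Z.append(z)
    return Z


# Toy group of two correlated columns observed on n = 4 rows.
Xj = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [2.0, 1.0]]
R = cholesky_upper(gram(Xj))
Z = solve_upper_right(Xj, R)
```

After the transformation the Gram matrix of `Z` is the identity, which is what gives each group update a closed form in the descent algorithm.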

Computational algorithm.

We use the group cyclical coordinate descent (GCD) algorithm. The GCD algorithm is a natural extension of the coordinate descent algorithm [10]. It optimizes a target function with respect to a single group parameter at a time and iteratively cycles through all group parameters until convergence. It is particularly suitable for problems such as the present one, which has a simple closed-form solution for a single group but lacks one for multiple groups.

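The GCD algorithm proceeds as follows. For a given $\lambda$,

  1. Let $b^{(0)} = (b_1^{(0)\prime}, \ldots, b_p^{(0)\prime})'$ be the initial estimate. A sensible initial estimate is zero (component-wise). Initialize the vector of residuals $r = \tilde y - \sum_{j=1}^{p} Z_j b_j^{(0)}$, where $Z_j$ is the orthogonalized covariate block of the $j$th group, and set the iteration counter $s = 0$.
  2. For $j = 1, \ldots, p$, repeat the following steps:
    1. Calculate the least-square estimates with respect to $b_j$: $z_j = Z_j' r / n + b_j^{(s)}$.
    2. Compute $$b_j^{(s+1)} = \left(1 - \frac{\lambda\sqrt{d_j}}{\|z_j\|_2}\right)_{+} z_j, \tag{3}$$ where $d_j$ is the size of the $j$th group and $(a)_+ = \max(a, 0)$.
    3. Update $r \leftarrow r - Z_j\,(b_j^{(s+1)} - b_j^{(s)})$.
  3. Set $s \leftarrow s + 1$ and iterate Step 2 until convergence.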
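The steps above can be sketched in pure Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes orthonormalized groups ($Z_j'Z_j/n = I$) and a loss scaled by $1/(2n)$, and the function name `group_descent`, the tolerance, and the toy design are hypothetical.

```python
import math


def group_descent(Z, y, lam, max_iter=100, tol=1e-8):
    """Group coordinate descent for
    (1/2n)||y - sum_j Z_j b_j||^2 + lam * sum_j sqrt(d_j) * ||b_j||_2.

    Z is a list of groups; Z[j][i] is the length-d_j covariate row of
    observation i in group j. Each group is assumed to satisfy Z_j'Z_j/n = I.
    """
    n = len(y)
    b = [[0.0] * len(Zj[0]) for Zj in Z]  # Step 1: zero initial estimate
    r = list(y)                           # residuals when b = 0
    for _ in range(max_iter):
        max_change = 0.0
        for j, Zj in enumerate(Z):        # Step 2: cycle through the groups
            d = len(b[j])
            # (a) group-wise least-squares estimate z_j = Z_j'r/n + b_j
            z = [sum(Zj[i][c] * r[i] for i in range(n)) / n + b[j][c]
                 for c in range(d)]
            # (b) group soft-thresholding, as in expression (3)
            norm = math.sqrt(sum(v * v for v in z))
            shrink = max(0.0, 1.0 - lam * math.sqrt(d) / norm) if norm > 0 else 0.0
            new = [shrink * v for v in z]
            # (c) update the residuals, then store the new group estimate
            for i in range(n):
                r[i] -= sum(Zj[i][c] * (new[c] - b[j][c]) for c in range(d))
            max_change = max(max_change,
                             max(abs(new[c] - b[j][c]) for c in range(d)))
            b[j] = new
        if max_change < tol:              # Step 3: stop at convergence
            break
    return b


# Toy design: four mutually orthogonal +/-1 columns split into two groups.
h1, h2 = [1, 1, 1, 1], [1, -1, 1, -1]
h3, h4 = [1, 1, -1, -1], [1, -1, -1, 1]
Z = [[[h2[i], h3[i]] for i in range(4)],
     [[h4[i], h1[i]] for i in range(4)]]
y = [2 * h2[i] + h3[i] + 0.1 * h4[i] for i in range(4)]
b = group_descent(Z, y, lam=0.5)
```

In this toy example the first group carries a strong signal and is kept with both coefficients shrunk by a common factor, while the weak second group is set to zero as a whole.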

Breheny and Huang [11] discussed the convergence of coordinate descent algorithms for SCAD and MCP. We now consider the GCD for group Lasso. For any given $\lambda$, starting from an initial estimate $b^{(0)}$, the GCD algorithm generates a sequence of updates $b^{(s)}$, $s = 1, 2, \ldots$, where $b^{(s)}$ is the estimate after the $s$th cycle through all groups. Let $Q(b)$ denote the penalized criterion in expression (2). Since the sequence $\{Q(b^{(s)})\}$ is non-increasing and bounded below by 0, it always converges. The following proposition concerns the convergence of $\{b^{(s)}\}$.

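Proposition 1 For any fixed $\lambda$, the GCD updates $b^{(s)}$ converge to a global minimizer of the group Lasso criterion $Q$ and satisfy the inequality
$$Q(b^{(s)}) - Q(b^{(s+1)}) \ge \delta\,\|b^{(s+1)} - b^{(s)}\|_2^2$$
for a constant $\delta > 0$.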

This proposition can be proved following the arguments of [12] who established the convergence of the coordinate descent algorithm for concave penalized selection methods including the Lasso.

Choice of tuning parameter.

There are various methods that can be applied, including for example AIC, BIC, cross-validation, and generalized cross-validation. Chen and Chen [13] developed a family of extended Bayesian information criteria (EBIC) to overcome the overly liberal selection problem caused by the small-$n$-large-$p$ situation. Furthermore, Chen and Chen [14] established the consistency of EBIC under generalized linear models in the small-$n$-large-$p$ situation. For group Lasso, Yuan and Lin [15] proposed an approximation of the degrees of freedom (DF). Here, we apply EBIC with an approximated DF to select the tuning parameter $\lambda$. The EBIC is defined as:
$$\mathrm{EBIC}(\lambda) = n\log\left(\mathrm{RSS}_\lambda / n\right) + \mathrm{DF}_\lambda\,(\log n + 2\kappa\log p),$$
where $\mathrm{RSS}_\lambda$ is the residual sum of squares under a fixed $\lambda$ and $\kappa \in [0, 1]$ is the EBIC constant of [13]. The DF for group Lasso [15] is defined as:
$$\mathrm{DF}_\lambda = \sum_{j=1}^{p} I\big(\|\hat\beta_j\|_2 > 0\big) + \sum_{j=1}^{p} \frac{\|\hat\beta_j\|_2}{\|\hat\beta_j^{LS}\|_2}\,(d_j - 1), \tag{4}$$
where $d_j$ is the number of predictors in the $j$th group and $\hat\beta_j^{LS}$ is the least-square estimate for the $j$th group obtained by fitting group $j$ only.

Note that when $d_j = 1$ for $j = 1, \ldots, p$, group Lasso becomes Lasso, and its DF is the number of non-zero parameters selected. Therefore, Lasso can be taken as a special case of group Lasso, and the same holds for the DF in expression (4).
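As a rough illustration of how the tuning criterion could be evaluated, the sketch below computes the Yuan and Lin DF approximation and one common form of EBIC. The exact EBIC expression, the constant `kappa`, and the function names `group_lasso_df` and `ebic` are assumptions of this sketch, not details taken from the original analysis.

```python
import math


def group_lasso_df(b_hat, b_ls):
    """Approximate group Lasso degrees of freedom in the style of Yuan and Lin.

    b_hat[j] is the penalized estimate of group j; b_ls[j] is the
    least-squares estimate obtained by fitting group j alone.
    """
    df = 0.0
    for bj, lsj in zip(b_hat, b_ls):
        norm_hat = math.sqrt(sum(v * v for v in bj))
        norm_ls = math.sqrt(sum(v * v for v in lsj))
        if norm_hat > 0:
            # One unit for the selected group plus a shrinkage-weighted
            # share of its remaining d_j - 1 coefficients.
            df += 1.0 + (norm_hat / norm_ls) * (len(bj) - 1)
    return df


def ebic(rss, n, df, p, kappa=1.0):
    """One common EBIC form: n*log(RSS/n) + df*(log n + 2*kappa*log p)."""
    return n * math.log(rss / n) + df * (math.log(n) + 2.0 * kappa * math.log(p))


# Two groups of size two; the second group is not selected.
df_groups = group_lasso_df([[3.0, 4.0], [0.0, 0.0]], [[6.0, 8.0], [1.0, 1.0]])
# With groups of size one, the DF reduces to the number of nonzero estimates.
df_lasso = group_lasso_df([[2.0], [0.0], [1.0]], [[3.0], [1.0], [2.0]])
```

In practice the criterion would be evaluated over a grid of $\lambda$ values and the minimizer retained.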

Significance level for the selected SNPs.

With penalization methods, the relevance of a covariate is usually determined by whether its regression coefficient is nonzero. As a secondary analysis, it may also be of interest to compute the $p$-value. However, it should be noted that it is usually not sensible to use both estimation magnitude (zero or nonzero) and significance level for selection.

Here, we use a multi-split method modified from the one proposed by Meinshausen et al. [17] to obtain $p$-values. With linear regression, we use an $F$-test for each group to evaluate whether there are elements in this group with significant effects. This procedure allows us to obtain $p$-values at the group level. It is simulation-based and can adjust for multiple comparisons. The multi-split method proceeds as follows:

  1. Randomly split data into two disjoint sets of equal size: $D_{in}$ and $D_{out}$.
  2. Fit data in $D_{in}$ with the proposed method. Denote the set of selected groups by $S$.
  3. Compute $\tilde P_j$, the $p$-value for group $j$, as follows:
    1. If group $j$ is in set $S$, set $\tilde P_j$ equal to the $p$-value from the $F$-test in the regular linear regression where group $j$ is the only group.
    2. If group $j$ is not in set $S$, set $\tilde P_j = 1$.
  4. Define the adjusted $p$-value as $P_j = \min\{\tilde P_j\,|S|,\ 1\}$, where $|S|$ is the size of set $S$.

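This procedure is repeated $B$ times for each group. Let $P_j^{(b)}$ denote the adjusted $p$-value for group $j$ in the $b$th iteration. For $\gamma \in (0, 1)$, let $q_\gamma$ be the $\gamma$-quantile of $\{P_j^{(b)}/\gamma;\ b = 1, \ldots, B\}$. Define $Q_j(\gamma) = \min\{1, q_\gamma\}$. It is shown in [17] that $Q_j(\gamma)$ is an asymptotically correct $p$-value, adjusted for multiplicity. The authors also proposed an adaptive version that selects a suitable value of the quantile based on data:
$$Q_j = \min\Big\{1,\ (1 - \log\gamma_{\min}) \inf_{\gamma \in (\gamma_{\min}, 1)} Q_j(\gamma)\Big\},$$
where $\gamma_{\min}$ is chosen to be 0.05. It is shown that $Q_j$, $j = 1, \ldots, p$, can be used for both FWER (family-wise error rate) and FDR (false discovery rate) control [17].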
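The aggregation over splits can be sketched as below. This is an illustrative implementation under one specific assumption: the empirical $\gamma$-quantile is taken as the smallest ordered value with at least a fraction $\gamma$ of the values at or below it, so the infimum only needs to be checked at $\gamma = k/B$. The function names `adjust_for_selection` and `aggregate_pvalues` are hypothetical.

```python
import math


def adjust_for_selection(p_raw, n_selected):
    """Step 4 of the procedure: multiply by the model size and cap at 1."""
    return min(p_raw * n_selected, 1.0)


def aggregate_pvalues(pvals, gamma_min=0.05):
    """Adaptive aggregation of the B adjusted p-values of one group.

    With the empirical quantile defined as the k-th ordered value for gamma in
    ((k-1)/B, k/B], the ratio quantile/gamma is smallest at gamma = k/B on each
    such interval, so only those B candidate levels are evaluated.
    """
    B = len(pvals)
    ordered = sorted(pvals)
    best = 1.0  # Q_j(gamma) is capped at 1
    for k in range(1, B + 1):
        gamma = k / B
        if gamma <= gamma_min:
            continue
        best = min(best, ordered[k - 1] / gamma)
    return min(1.0, (1.0 - math.log(gamma_min)) * best)


# A group never selected (all adjusted p-values equal to 1) stays at 1.
never_selected = aggregate_pvalues([1.0] * 10)
# A group that is consistently significant across ten splits.
always_small = aggregate_pvalues([0.001] * 10)
```

The factor $1 - \log\gamma_{\min}$ is the price paid for searching over quantile levels instead of fixing one in advance.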

Results

Simulation studies

In simulation, we consider six different scenarios, each with 500 subjects and 5,000 or 10,000 SNPs. For each subject, we simulate two response variables. The correlation between the two responses is set to be 0.1, 0.5 or 0.9, representing weak, moderate and strong correlations. For each response variable, there are twelve SNPs with nonzero effects. Those twelve SNPs can be grouped into three clusters. Within each cluster, the correlation between two SNPs is 0.2. The correlation among SNPs not associated with the response is set to be 0.2. Response-associated and noisy SNPs are independent. More specifically, the genotypes are first generated from multivariate normal distributions and then categorized into 0, 1 or 2. To mimic a SNP with equal allele frequency, we categorize genotypes in a way similar to [16]. The genotype is set to be 0, 1 or 2 depending on whether $u < -c$, $-c \le u \le c$, or $u > c$, where $u$ is the underlying normal variable and $-c$ is the first quartile of $N(0, 1)$. The regression coefficient vectors for the first and second response variables are nonzero only at the twelve response-associated SNPs. The two response variables depend on the same genotypic data and are correlated through the residuals. Clustering structure exists in this simulation.
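The genotype-generation step can be illustrated with the following sketch. It assumes an equicorrelated latent normal vector built from one shared factor, which is only one of several ways to obtain a pairwise correlation of 0.2; the function name `simulate_genotypes` and the seed are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_genotypes(n, p, rho, seed=0):
    """Simulate an n x p matrix of genotypes coded 0/1/2.

    Latent variables are standard normal with pairwise correlation rho
    (through one shared factor) and are cut at -c and c, where -c is the
    first quartile of N(0, 1), giving expected frequencies 1/4, 1/2, 1/4.
    """
    rng = random.Random(seed)
    c = -NormalDist().inv_cdf(0.25)
    G = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        row = []
        for _ in range(p):
            u = math_sqrt(rho) * shared + math_sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            row.append(0 if u < -c else (2 if u > c else 1))
        G.append(row)
    return G


def math_sqrt(v):
    """Square root helper kept local so the sketch needs no extra imports."""
    return v ** 0.5


G = simulate_genotypes(n=2000, p=5, rho=0.2, seed=1)
share_het = sum(g == 1 for row in G for g in row) / (2000 * 5)
```

Note that the correlation of 0.2 applies to the latent normal variables; the correlation between the categorized genotypes is somewhat smaller.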

To better gauge the performance of the proposed approach, we also consider the following alternative approach. We first analyze each response variable separately using Lasso, and then combine the results by examining the overlapping SNPs. For both approaches, we apply the EBIC method described in the previous section to select the tuning parameter $\lambda$. We evaluate the number of SNPs identified, the number of true positives, the false discovery rate (FDR) and the false negative rate (FNR). In addition, estimation performance is evaluated using the SSE (sum of squared errors).
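The identification measures can be computed along the following lines. The definitions used here (FDR as the share of selected SNPs that are false, FNR as the share of true SNPs that are missed) are assumptions of this sketch, since the exact definitions are not spelled out in the text; the function name `selection_metrics` and the selected set are hypothetical.

```python
def selection_metrics(selected, truth):
    """Summarize marker identification for one simulated dataset."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)   # true positives
    fp = len(selected - truth)   # false positives
    fn = len(truth - selected)   # missed true SNPs
    fdr = fp / len(selected) if selected else 0.0
    fnr = fn / len(truth) if truth else 0.0
    return {"n_selected": len(selected), "tp": tp, "fdr": fdr, "fnr": fnr}


# True SNP positions 25-28, 41-44 and 57-60, with a hypothetical selected set.
truth = list(range(25, 29)) + list(range(41, 45)) + list(range(57, 61))
result = selection_metrics([25, 26, 27, 41, 42, 43, 44, 57, 58, 300], truth)
```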

Results based on 100 replicates are summarized in Table 1. Note that the true response-associated SNPs are 25–28, 41–44 and 57–60 for both responses. In total, there are 24 SNP-response associations across the two responses. Table 1 shows that under all simulation scenarios, the proposed approach is able to identify almost all of the true positives, significantly more than the individual-dataset approach. The price is a few more false positives. With the proposed approach, the highest FDR is 0.18, which can be acceptable in practice. Under all scenarios, the proposed approach has significantly smaller SSEs. Taking both marker identification and estimation into consideration, we conclude that the proposed approach provides a competitive alternative to the existing individual-dataset approach. For one simulated dataset, $p$-values evaluated by the multi-split method for the selected groups are presented in Table 2. It can be seen that many true positives have significant $p$-values, while all false positives have insignificant $p$-values.

Table 1. Simulation studies: the numbers are mean (standard deviation) based on 100 replicates.

https://doi.org/10.1371/journal.pone.0051198.t001

Table 2. Multi-split $p$-values for simulated data with all matched non-zero $\beta$s and response correlation $\rho$ = 0.9.

https://doi.org/10.1371/journal.pone.0051198.t002

With the proposed approach, it is assumed that the multiple responses of interest have exactly the same set of important SNPs. Such an assumption is reasonable under some settings but too restrictive under others. To get a more comprehensive understanding of the proposed approach, we also conduct simulation where the two sets of important SNPs are partially matched. In Table 3, we consider the simulation setting where 25% of the important SNPs are not matched. In Table 4, we consider the scenario with 50% unmatched important SNPs. Under both simulation scenarios, the proposed approach identifies more true positives. However, the model sizes and FDRs are much larger. Such an observation is reasonable: for a SNP associated with a single response variable, when it is identified using the proposed approach, this SNP is automatically identified for the response variable it is not associated with, creating one false positive. Thus with the proposed approach and partially matched important SNP sets, identifying more true positives inevitably leads to much larger model sizes. It is interesting to note that under all simulation scenarios, the proposed approach has significantly smaller SSEs.

Table 3. Simulation studies: the numbers are mean (standard deviation) based on 100 replicates.

https://doi.org/10.1371/journal.pone.0051198.t003

Table 4. Simulation studies: the numbers are mean (standard deviation) based on 100 replicates.

https://doi.org/10.1371/journal.pone.0051198.t004

Here we focus on the scenario with two response variables to match the data analyzed in the next section. It is possible to conduct analysis with three or more responses, which may have higher computational cost.

Application to heterogeneous stock mice dataset

The heterogeneous stock mice dataset is described in the Introduction section. We refer to the original publications for more detailed descriptions [1], [2], [4]. This dataset includes full phenotypic records on 2,202 mice, each genotyped for 13,459 SNP markers. In joint modeling, SNPs with missingness cannot be included. Thus, we use fastPHASE to impute the missing SNP genotypes [18]. After deleting observations with missing phenotypes and SNPs with minor allele frequency less than 0.05, there are 1,514 mice and 9,991 SNP markers on 19 autosomes. We analyze the data using three different approaches: the traditional one-SNP-at-a-time approach, analysis of individual responses using Lasso, and the proposed approach. In Figure 1, we show the absolute values of estimates from the single-SNP analysis on both CD4/CD8 ratio and CD4∶CD3. Here single-SNP analysis is conducted using a Bonferroni approach with overall significance level 0.05. In Figure 2, we show the absolute values of estimates from Lasso on both phenotypes and from the proposed method. In Figure 1, one can see that the signal to noise ratio is weak, and it is difficult to tell the truly associated signals from the background. In contrast, in Figure 2 the signal to noise ratio is strong, and a small number of SNPs are selected by the Lasso and the proposed method. When analyzing each response separately using Lasso and multiple responses using the proposed method, we use the method described in the previous section to select the tuning parameter $\lambda$. We use the multi-split method to evaluate the significance of selected SNPs. In Figure 2, the larger dots stand for the selected SNPs with significant $p$-values. In Table 5, the total number of significant SNPs is given in parentheses for the Lasso on both phenotypes and the proposed method. Detailed information on the SNPs selected by the proposed method and by the individual Lasso methods on CD4/CD8 ratio and CD4∶CD3 is presented in Table 6, Table 7 and Table 8, respectively.
Note that there is no one-to-one correspondence between the magnitude of estimates and significance level. Such an observation is not uncommon in regression analysis. In addition, the proposed penalization approach is based on Lasso, which is known to shrink estimates towards zero. Another observation is that SNPs in high LD may have very different estimates, which is also “as expected”. In single-response analysis, Lasso has the tendency to select one out of a set of highly correlated covariates. Thus, it is possible or even likely that out of the SNPs with high LD, one may have a large estimate while others have very small or zero estimates. The numbers of selected SNPs and overlaps among the proposed method, the Lasso method and single-SNP analysis are presented in Table 5. We see that the single-SNP analysis selects a large number of SNPs. This may be due to the fact that the selection of assayed SNPs is not totally random.

Figure 1. Absolute values of estimates from the simple linear regression on CD4/CD8 ratio and CD4∶CD3.

https://doi.org/10.1371/journal.pone.0051198.g001

Figure 2. Absolute values of estimates from Lasso on CD4/CD8 ratio and CD4∶CD3 and estimates for the proposed method. Smaller dots represent SNPs selected by the Lasso/proposed method with insignificant multi-split $p$-values. Larger dots represent SNPs with significant $p$-values.

https://doi.org/10.1371/journal.pone.0051198.g002

Table 5. Number of SNPs identified, and overlap of SNPs among the proposed method, the Lasso and single-SNP analysis for heterogeneous stock mice dataset.

https://doi.org/10.1371/journal.pone.0051198.t005

Table 6. SNPs selected by the proposed method on both phenotypes CD4/CD8 ratio and CD4∶CD3.

https://doi.org/10.1371/journal.pone.0051198.t006

With our limited knowledge on susceptibility SNPs for immunology, we are not able to objectively evaluate the biological implications of identified SNPs. As an alternative, we consider the following evaluation of prediction performance, which may provide partial information on identification performance. (a) Randomly split the sample into five parts with equal sizes; (b) Analyze four parts using the proposed approach; (c) Use the obtained model and make prediction for subjects in the left-out part; (d) Repeat Steps (b) and (c) over all five parts. For comparison, the same approach is also used to evaluate the individual Lasso approach. The prediction mean squared errors are 1.66 for the proposed approach and 2.33 for the combined Lasso. By jointly analyzing two responses, the proposed approach has better prediction performance.
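Steps (a) to (d) can be sketched generically as follows. This is a simplified, single-response illustration: `fit` and `predict` are hypothetical callables standing in for whichever method is being evaluated, and the fold assignment and seed are assumptions of the sketch.

```python
import random


def five_fold_pmse(y, fit, predict, k=5, seed=0):
    """Cross-validated prediction mean squared error.

    fit(train_idx) returns a fitted model from the training indices;
    predict(model, i) returns the prediction for observation i.
    """
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)          # (a) random split into k parts
    folds = [idx[f::k] for f in range(k)]
    sq_err = 0.0
    for f in range(k):                        # (d) repeat over all parts
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        model = fit(train)                    # (b) fit on the other parts
        sq_err += sum((y[i] - predict(model, i)) ** 2 for i in test)  # (c)
    return sq_err / len(y)


# Toy check with a mean-only model on a constant response.
y = [1.0] * 10
pmse = five_fold_pmse(
    y,
    fit=lambda train: sum(y[i] for i in train) / len(train),
    predict=lambda model, i: model,
)
```

For the two-response setting, the squared error would be accumulated over both responses of each left-out subject.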

Discussion

In the study of complex diseases, it is not uncommon that a single trait cannot provide a comprehensive description, and multiple traits need to be measured. In this article, we analyze data with multiple response variables under the assumption that they have the same set of important SNPs. A penalization approach is developed for marker selection. The proposed approach can accommodate the joint effects of multiple SNPs and be more informative than single-SNP analysis. Compared with the existing approaches that analyze different traits separately, it can more effectively accommodate the correlation among traits and hence be more efficient in marker selection. Numerical studies, including simulation and analysis of the heterogeneous stock mice dataset, show satisfactory performance of the proposed approach.

The heterogeneous stock mice data have two continuous response variables with marginally normal distributions. With other types of response variables, there is a rich literature on joint modeling, which can be adopted to couple with the proposed marker selection. The proposed approach is based on the group Lasso penalty. We expect that other “group-type” penalties, such as group SCAD or group bridge, can be applied. The group Lasso is selected because of its relatively low computational cost, which is especially desirable with high-throughput data. In our numerical study, we focus on the scenario where the MAFs are not very low. When the MAFs are low, our unpublished numerical study suggests that penalization methods may not perform well because the covariate design matrix is “overly sparse”. Using penalization methods with rare variants is still being explored. Analysis of the heterogeneous stock mice data shows that the proposed approach can identify SNPs missed by single-response analysis. In addition, it has improved prediction performance. Therefore, the proposed method provides a useful alternative to the current analysis of multivariate traits in GWAS.

Acknowledgments

We would like to thank the associate editor and referees for their careful review and insightful comments.

Author Contributions

Conceived and designed the experiments: JL JH SM. Performed the experiments: JL. Analyzed the data: JL. Contributed reagents/materials/analysis tools: JL JH SM. Wrote the paper: JL JH SM.

References

  1. Valdar W, Solberg LC, Gauguier D, Burnett S, Klenerman P, et al. (2006) Genome-wide genetic association of complex traits in heterogeneous stock mice. Nature Genetics 38: 879–887.
  2. Solberg LC, Valdar W, Gauguier D, Nunez G, Taylor A, et al. (2006) A protocol for high-throughput phenotyping, suitable for quantitative trait analysis in mice. Mamm Genome 17: 129–146.
  3. Lorenz A, Chao S, Asoro F, Heffner E, Hayashi T, et al. (2011) Genomic selection in plant breeding: knowledge and prospects. Advances in Agronomy 110: 77–123.
  4. Valdar W, Solberg LC, Gauguier D, Cookson W, Rawlins J, et al. (2006) Genetic and environmental effects on complex traits in mice. Genetics 174: 959–984.
  5. Wang L, Chen G, Li H (2007) Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics 23: 1486–1494.
  6. Stevens J (2002) Applied multivariate statistics for the social sciences. Mahwah, NJ: Lawrence Erlbaum.
  7. Yuan M, Ekici A (2007) Dimension reduction and coefficient estimation in multivariate linear regression. J R Statist Soc B 69: 329–346.
  8. Chen K, Chan K, Stenseth N (2012) Reduced rank stochastic regression with a sparse singular value decomposition. J R Statist Soc B 74: 203–221.
  9. Huang J, Wei F, Ma S (2011) Semiparametric regression pursuit. Accepted for publication by Statistica Sinica.
  10. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33: 1–22.
  11. Breheny P, Huang J (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Statist 5: 232–253.
  12. Mazumder R, Friedman J, Hastie T (2011) SparseNet: Coordinate descent with non-convex penalties. J Amer Statist Assoc 106: 1125–1138.
  13. Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95: 759–771.
  14. Chen J, Chen Z (2012) Extended BIC for small-n-large-P sparse GLM. Statistica Sinica 22: 555–574.
  15. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Statist Soc B 68: 49–67.
  16. Wu T, Chen Y, Hastie T, Sobel E, Lange K (2009) Genomewide association analysis by LASSO penalized logistic regression. Bioinformatics 25(6): 714–721.
  17. Meinshausen N, Meier L, Bühlmann P (2009) P-values for high-dimensional regression. J Am Stat Assoc 104(488): 1671–1681.
  18. Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am J of Human Genetics 78: 629–644.