
Conceived and designed the experiments: SK EPX. Performed the experiments: SK. Analyzed the data: SK EPX. Contributed reagents/materials/analysis tools: SK EPX. Wrote the paper: SK EPX.

The authors have declared that no competing interests exist.

Many complex disease syndromes, such as asthma, consist of a large number of highly related, rather than independent, clinical or molecular phenotypes. This raises a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. In this study, we propose a new statistical framework called graph-guided fused lasso (GFlasso) to directly and effectively incorporate the correlation structure of multiple quantitative traits, such as clinical metrics and gene expression levels, in association analysis. Our approach explicitly represents the correlation information among the quantitative traits as a quantitative trait network (QTN) and then leverages this network to encode structured regularization functions in a multivariate regression model over the genotypes and traits. The result is that the genetic markers that jointly influence subgroups of highly correlated traits can be detected with high sensitivity and specificity. While most of the traditional methods examined each phenotype independently and combined the results afterwards, our approach analyzes all of the traits jointly in a single statistical framework. This allows our method to borrow information across correlated phenotypes to discover the genetic markers that perturb a subset of the correlated traits synergistically. Using simulated datasets based on HapMap consortium data and an asthma dataset, we compared the performance of our method with other methods based on single-marker analysis and regression-based methods that do not use any of the relational information in the traits. We found that our method showed increased power in detecting causal variants affecting correlated traits.
Our results showed that, when correlation patterns among traits in a QTN are considered explicitly and directly during a structured multivariate genome association analysis using our proposed methods, the power of detecting true causal SNPs with possibly pleiotropic effects increased significantly without compromising performance on non-pleiotropic SNPs.

An association study examines a phenotype against genotypic variations over a large set of individuals in order to find the genetic variants that give rise to the variation in the phenotype. Many complex disease syndromes consist of a large number of highly related clinical phenotypes, and patient cohorts are routinely surveyed for a large number of traits, such as hundreds of clinical phenotypes and genome-wide profiles of thousands of gene expression levels, many of which are correlated. However, most of the conventional approaches for association mapping or eQTL analysis consider a single phenotype at a time instead of taking advantage of the relatedness of traits by analyzing them jointly. Assuming that a group of tightly correlated traits may share a common genetic basis, in this paper we present a new framework for association analysis that searches for genetic variations influencing a group of correlated traits. We explicitly represent the correlation information in multiple quantitative traits as a quantitative trait network and directly incorporate this network information to scan the genome for association. Our results on simulated and asthma data show that our approach has a significant advantage in detecting associations when a genetic marker synergistically perturbs a group of traits.

Many complex disease syndromes, such as diabetes, asthma, and cancer, consist of a large number of highly related, rather than independent, clinical phenotypes. Differences between these syndromes involve a complex interplay of a large number of genomic variations that perturb the function of disease-related genes in the context of a regulatory network, rather than each gene individually

Numerous recent studies have shown that it is often more informative to map intermediate steps in disease processes, such as various disease-related clinical traits or expression levels of genes of interest, rather than merely the binary case/control disease status, to genetic marker loci

In several recent attempts on expression quantitative trait locus (eQTL) mapping, a significant focus has been placed on identifying modules of co-expressed genes and the genotype markers that perturb the whole module rather than a single gene. For example, a genotype variation in a putative transcription factor is likely to affect the expression levels of all of the genes regulated by this common transcription factor. Under this scenario, once a group of genes are mapped to a common locus in the genome, it is possible to examine whether the locus harbors a transcription factor that targets the group of genes jointly in order to understand the functional relationship between the genotype marker and the gene module (e.g.,

Nodes in the QTN represent clinical traits related to asthma. Each pair of nodes is connected with an edge if the corresponding two traits are highly correlated. The thicknesses of edges indicate the strength of correlation. We are interested in identifying SNPs that are associated with a subnetwork of clinical traits.

Recent advances in high-throughput sequencing and molecular profiling technologies have made it both affordable and efficient to observe DNA sequence variations over millions of genomic loci, to measure the abundance of transcripts of virtually all known coding sequences, and to measure a wide range of clinical traits in various disease populations

In a different approach to eQTL mapping, a module network

We believe that explicitly incorporating the molecular and/or clinical phenotype network as a trait correlation structure while searching for genetic associations can significantly increase the power of detecting pleiotropic effects. In this article, we present a new statistical approach, called

Our proposed approach is based on a regularized multivariate regression formalism, treating genotype markers as inputs and traits as outputs. To ensure interpretable and consistent recovery of the usually “sparse” causal (or “truly” relevant) variations among a large number of candidate polymorphic loci, we use a linear regression formalism with an

(A) In lasso, each phenotype represented as a circle is independently mapped to SNPs for association. (B) In graph-constrained fused lasso (

The problem of estimating the regression coefficients in GFlasso involves solving a convex program, for which a globally optimal solution can be obtained efficiently by drawing on the large body of existing work on fast algorithms for convex optimization. In this article, we develop a fast coordinate-descent algorithm to estimate the regression coefficients under GFlasso, from which markers relevant to the (possibly multiple) traits in question can be identified from the non-zero elements in the estimated regression coefficients. The results on two datasets, one simulated from HapMap SNP markers and the other collected from the SARP asthma patients, show that our method has significantly greater power, with fewer false positives, in detecting pleiotropic effects of markers than other methods that do not exploit the correlation structure in traits.

To capture correlated genome associations to a QTN, we employ a multivariate linear regression model as the basic model for trait responses given inputs of genome variations such as SNPs, with the addition of a sparsity-biasing regularizer to encourage selection of truly relevant SNPs in the presence of many irrelevant ones. Then, we introduce an additional regularizer of fusion penalty to encourage the sharing of association patterns from a common SNP to multiple inter-related traits.

There is a large literature on multivariate linear regression in statistics

In a standard regression approach for a single-trait association analysis, we assume a linear relationship between the covariates (SNPs) and each response (trait) parameterized by a set of regression coefficients, and estimate the parameters by optimizing a loss function defined on SNP-trait samples given the parameters. Based on the magnitudes of estimated regression coefficients, we draw conclusions on which SNPs are most significantly associated with the given trait. When data are available for multiple traits, we can apply this single-trait approach to each trait separately as we detail below.
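The single-trait approach described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `fit_single_trait_ols`, the column-centering step (used instead of an explicit intercept), and the use of ordinary least squares as the loss are choices made for this example, not details taken from the study.

```python
import numpy as np


def fit_single_trait_ols(X, Y):
    """Fit one least-squares regression per trait, each trait on its own.

    X: (N, J) genotype matrix (e.g. minor-allele counts 0/1/2).
    Y: (N, K) trait matrix.
    Returns a (J, K) matrix; column k holds the coefficients for trait k.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)  # centering removes the need for an intercept
    Yc = Y - Y.mean(axis=0)
    B = np.zeros((X.shape[1], Y.shape[1]))
    for k in range(Y.shape[1]):  # traits are analyzed independently
        B[:, k] = np.linalg.lstsq(Xc, Yc[:, k], rcond=None)[0]
    return B
```

Ranking SNPs for trait `k` then amounts to sorting `np.abs(B[:, k])`. Because each column is fitted in isolation, nothing learned for one trait informs the estimate for a correlated trait.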

Let

In a typical genome-wide association mapping, one examines a large number of marker loci with the goal of identifying only a small number of markers associated with the given phenotype. A naive application of the method in Eqn 2 to association mapping with large

The lasso for multiple-trait association mapping defined in Eqn 3 is equivalent to solving a set of

In order to estimate the association strengths jointly for multiple correlated traits while maintaining sparsity, we introduce another penalty term called graph-guided fusion penalty into the lasso framework. This novel penalty makes use of the complex correlation pattern among the traits represented as a QTN, and encourages the traits which appear highly correlated in the QTN to be influenced by a common set of genetic markers. Thus, the GFlasso estimate of the regression coefficients reveals joint associations of each SNP with the correlated traits in the entire subnetwork as well as associations with each individual trait.
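To make the role of the graph-guided fusion penalty concrete, the sketch below builds a QTN by thresholding pairwise trait correlations and then evaluates a penalized objective of the kind described here: squared-error loss, plus a lasso penalty, plus a fusion term summed over the edges of the QTN. The exact form is an assumption made for illustration. In particular, the sign adjustment by the edge correlation, the unweighted edges, and the names `threshold_qtn`, `gflasso_objective`, `lam`, and `gamma` are not taken from the text above.

```python
import numpy as np


def threshold_qtn(Y, rho):
    """Return QTN edges (m, l, r_ml) for trait pairs with |correlation| > rho."""
    R = np.corrcoef(np.asarray(Y, dtype=float), rowvar=False)
    K = R.shape[0]
    return [(m, l, float(R[m, l]))
            for m in range(K) for l in range(m + 1, K)
            if abs(R[m, l]) > rho]


def gflasso_objective(X, Y, B, edges, lam, gamma):
    """Squared-error loss + lasso penalty + graph-guided fusion penalty.

    B is (J, K): row j holds SNP j's coefficients across the K traits.
    The fusion term (assumed form) penalizes, for every edge (m, l), the
    difference between column m and the sign-adjusted column l of B.
    """
    loss = np.sum((Y - X @ B) ** 2)
    lasso = lam * np.sum(np.abs(B))
    fusion = gamma * sum(
        np.sum(np.abs(B[:, m] - np.sign(r) * B[:, l])) for m, l, r in edges
    )
    return float(loss + lasso + fusion)
```

With `gamma = 0` the objective reduces to a lasso fitted to each trait separately. Increasing `gamma` makes it costly for a SNP to have different effects on two traits joined by an edge, which is what pushes connected traits toward a shared set of markers.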

We assume that a QTN, denoted by

Below, we first introduce

Given a QTN, it is reasonable to assume that if two traits are highly correlated and connected with an edge in the QTN, their variations across individuals are more likely to be explained by genetic variations at the same loci. In

The idea of using a fusion penalty was first proposed in the classical regression problem for a univariate response (i.e., single output) and high-dimensional covariates, to fuse the regression coefficients of two adjacent covariates when the covariates are assumed to be ordered such as in time

When applied locally to a pair of regression coefficients for each edge

When this edge-level fusion penalty is applied to all of the edges in the entire QTN as in the graph-guided fusion penalty, the overall effect is that

Although, in principle, the graph-guided fusion penalty has a smoothing effect on the rows of

Now, we describe an enhanced version of

Compared to

The optimization problems in Eqn 4 and Eqn 5 are convex, and can be formulated as a quadratic programming problem using an approach similar to that used for solving the fused lasso problem

In this section, we describe a procedure for obtaining estimates of the regression coefficients in

This coordinate-descent procedure finds the optimal

The coordinate-descent algorithm for

We determine the initial values

Our goal is to find values for

We performed a simulation study to evaluate the power of the proposed GFlasso methods, and compared the results with those from single-marker/single-trait regression analyses as well as other multivariate regression methods.

We simulated genotype data of 50 SNPs for 250 individuals based on the HapMap data

Given the simulated genotype, we set the number of phenotypes to 10, and simulated the matrix of true regression coefficients by first choosing SNP-trait pairs with true associations and assigning values for the strengths of associations for the selected pairs as we describe below. We assumed three groups of correlated traits of sizes 3, 3, and 4. Three causal SNPs were randomly selected for the first group of traits, and four causal SNPs were selected for each of the other two groups, so that the shared relevant SNPs induce correlation among the traits within each cluster. In addition, we assumed another causal SNP for traits in both of the first two clusters in order to model the situation of a higher-level correlation structure across two subnetworks. Finally, we assumed one additional causal SNP for all of the phenotypes. In our simulation study, we assumed that shared causal SNPs are the only factors that induce correlations among traits, although in general there might be other genotypic effects or environmental factors that influence the correlation structure among traits.
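The layout of the true coefficient matrix described above can be sketched as follows. Several details are assumptions made for this illustration: the function name `simulate_true_coefficients`, the choice to draw all causal SNPs without replacement (so that no SNP plays two roles), the ordering of traits into contiguous groups, and the use of a single common effect size `strength`.

```python
import numpy as np


def simulate_true_coefficients(n_snps=50, strength=0.5, seed=0):
    """Build a (n_snps, 10) coefficient matrix with the group structure above."""
    rng = np.random.default_rng(seed)
    groups = [range(0, 3), range(3, 6), range(6, 10)]  # trait groups of sizes 3, 3, 4
    n_causal = [3, 4, 4]                               # causal SNPs per group
    # 11 group-specific SNPs + 1 shared by groups 1 and 2 + 1 shared by all traits.
    # Drawing without replacement keeps the 13 SNPs distinct (an assumption).
    snps = rng.choice(n_snps, size=sum(n_causal) + 2, replace=False)

    B = np.zeros((n_snps, 10))
    pos = 0
    for traits, n in zip(groups, n_causal):
        for j in snps[pos:pos + n]:
            B[j, list(traits)] = strength
        pos += n
    B[snps[pos], 0:6] = strength    # one SNP affecting both of the first two groups
    B[snps[pos + 1], :] = strength  # one SNP affecting all ten traits
    return B
```

Under these assumptions the matrix has 13 non-zero rows and 53 non-zero entries (9 + 12 + 16 from the group-specific SNPs, 6 from the SNP shared by the first two groups, and 10 from the SNP shared by all traits).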

Once the SNP-trait pairs with true association were selected, we considered the following two cases of association strengths for these pairs, while setting the rest of the regression coefficients to 0.

The regression coefficients for all of the SNP-trait pairs with true association were set to the same value. This corresponds to the situation where the basic assumption of the fusion penalty holds, and each SNP has the same effect across the traits in each subnetwork.

The regression coefficients for the SNP-trait pairs with true associations were set to different values randomly generated from a uniform distribution over an interval

Then, we simulated phenotype data based on the linear regression model with noise distributed as

We compared the results from the GFlasso methods with those from other methods given below:

We used (

These methods do not take into account the correlation structure in traits. We used a validation set to select the regularization parameter. The absolute values of the regression coefficients were used as a measure of association strength.
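The validation-based choice of the regularization parameter can be illustrated for the single-trait lasso. The sketch below uses a generic textbook cyclic coordinate-descent solver with soft-thresholding. It is not the GFlasso algorithm, and the names `lasso_cd` and `select_lambda`, the fixed iteration count, and the mean-squared-error validation criterion are assumptions made for this example.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it to zero if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    X = np.asarray(X, dtype=float)
    resid = np.asarray(y, dtype=float).copy()  # residual y - X b, with b = 0
    b = np.zeros(X.shape[1])
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue                       # skip monomorphic columns
            resid += X[:, j] * b[j]            # remove coordinate j's contribution
            b[j] = soft_threshold(X[:, j] @ resid, lam) / col_sq[j]
            resid -= X[:, j] * b[j]            # add the updated contribution back
    return b


def select_lambda(X_train, y_train, X_val, y_val, lambdas):
    """Return the lambda with the lowest validation mean squared error."""
    errors = [
        float(np.mean((y_val - X_val @ lasso_cd(X_train, y_train, lam)) ** 2))
        for lam in lambdas
    ]
    return lambdas[int(np.argmin(errors))], errors
```

After selecting the parameter, `np.abs(lasso_cd(X_train, y_train, best_lam))` gives the per-SNP association strengths for one trait.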

This method first transforms the output variables (traits) into a smaller number of variables that explain most of the variability in the original data, performs a standard multivariate regression on each of the transformed output separately, and then transforms the estimated regression coefficients back into the original space
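The three steps of this PCA-based approach (transform the traits, regress each transformed trait on the SNPs, map the coefficients back) can be sketched as below. The function name `pca_output_regression`, the use of unpenalized least squares for the per-component regressions, and the centering of both matrices are assumptions made for this illustration.

```python
import numpy as np


def pca_output_regression(X, Y, n_components):
    """Regress principal components of the traits on SNPs, then map back.

    X: (N, J) genotypes, Y: (N, K) traits.
    Returns a (J, K) coefficient matrix in the original trait space.
    """
    Xc = np.asarray(X, dtype=float) - np.mean(X, axis=0)
    Yc = np.asarray(Y, dtype=float) - np.mean(Y, axis=0)
    # Principal directions of the traits are the right singular vectors of Yc.
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    V = Vt[:n_components].T          # (K, q) loading matrix
    Z = Yc @ V                       # (N, q) transformed traits
    # lstsq with a matrix right-hand side solves one regression per column of Z.
    A = np.linalg.lstsq(Xc, Z, rcond=None)[0]   # (J, q)
    return A @ V.T                   # back-transform to (J, K)
```

When `n_components` equals the number of traits, the back-transformed coefficients coincide with a plain least-squares fit to each trait. With fewer components, the coefficient matrix has rank at most `n_components`.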

For methods that require a specification of the values of the regularization parameters such as ridge regression, lasso, and the GFlasso methods, we used

As an illustrative example of the behaviors of the different methods, a graphical display of the QTN and the estimated QTL sets

Association strength 0.8 and threshold

Next, we systematically and quantitatively evaluated the performance of the association methods based on two criteria, sensitivity/specificity on the uncovered QTL sets

First, we varied the sample size of the dataset to see how the sample size affects the performance of the different methods for association analysis. We used datasets of sizes 50, 100, 150, 200, and 250, with association strength fixed at 0.5 for all associated SNP-trait pairs (Case 1), and we set the threshold

Panels show (A)

We examined how varying the signal-to-noise ratio affects the performance of the different methods. We simulated datasets with regression coefficients set to 0.3, 0.5, 0.8, and 1.0, respectively, with sample size

Panels show results for association strength (A) 0.3, (B) 0.5, (C) 0.8, and (D) 1.0. The sample size was 100, and the threshold

Next, we examined the sensitivity of the GFlasso methods to how the trait correlation network is generated, by varying the threshold

Panels show the threshold (A)

Given the SNP-trait pairs that the association methods found as associated, and the corresponding regression coefficients, we computed prediction errors to see if these SNPs with non-zero regression coefficients had a predictive power for traits of previously unseen individuals.
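A minimal sketch of such a prediction-error computation is given below. Mean squared error over all held-out individuals and traits is an assumed choice of error measure, and the function name `prediction_error` and the optional `intercept` argument are introduced only for this example.

```python
import numpy as np


def prediction_error(X_test, Y_test, B, intercept=None):
    """Mean squared error of trait predictions for previously unseen individuals.

    X_test: (N, J) genotypes, Y_test: (N, K) traits,
    B: (J, K) estimated coefficients (zero rows/entries mean 'not associated').
    """
    pred = np.asarray(X_test, dtype=float) @ np.asarray(B, dtype=float)
    if intercept is not None:
        pred = pred + intercept      # per-trait offsets, shape (K,)
    return float(np.mean((np.asarray(Y_test, dtype=float) - pred) ** 2))
```

Only SNP-trait pairs with non-zero entries in `B` contribute to the predictions, so a method that selects spurious SNPs tends to be penalized on held-out data.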

Panels show the prediction errors when the threshold

Since the fusion penalty tends to fuse the regression coefficients to be the same value within a densely connected subgraph, one may suspect that the bias introduced by this penalty can reduce the power when the true association strengths of a SNP to different traits are not the same within each subgraph. In order to examine how the performance is affected in this case, we considered the situation where the association strengths of each causal SNP are not uniform across traits within each subnetwork, but vary within the interval of

The association strength of a causal SNP is not uniform across correlated phenotypes that the SNP is associated with, and varies within the intervals of [0.3, 0.6] or [0.6, 0.9]. Panels show (A) association strength = [0.3, 0.6] when the threshold

The scalability of our methods can be assessed from

(A) We varied the number of SNPs with the number of phenotypes fixed at 250. (B) We varied the number of phenotypes with the number of SNPs fixed at 50. The QTNs were obtained using threshold

We applied our methods to a dataset collected from 543 asthma patients as a part of the Severe Asthma Research Program (SARP)

Before searching for associations between SNPs and traits, we examined the correlation structure in the 53 clinical traits in question. We first computed the pairwise correlations between these traits as depicted in

(A) The correlation matrix of 53 asthma-related clinical traits. A pixel at row

A comprehensive comparison of QTL mapping using GFlasso and other methods is presented in

As can be seen from

In addition, the results from the single-marker analyses in

Because of the fusion penalty, the regression coefficients estimated by our methods formed a block structure, where each block corresponds to a SNP associated with several correlated traits. It is clear that the horizontal bars in

In order to see how the threshold

Threshold | Number of edges | Number of nonzero regression coefficients | | |
 | | Lasso | | |
0.3 | 421 | | 105 | 106 | 108
0.5 | 165 | 125 | 108 | 107 | 107
0.7 | 71 | | 105 | 105 | 110
0.9 | 11 | | 125 | 123 | 123

In summary, the GFlasso methods identified the previously known causal SNP (SNP Q551R) as significantly associated with the lung physiology traits, while maintaining an overall sparse pattern in the estimated regression coefficients to reduce false positives. Because GFlasso estimates form a block structure for a SNP jointly associated with a set of correlated traits, the results led to an interesting new hypothesis: two additional SNPs (rs3024660 and rs3024622) upstream of SNP Q551R may be jointly influencing the same set of lung physiology traits as SNP Q551R. This hypothesis may be validated in a future follow-up study.

When multiple phenotypes are involved in association mapping, it is important to combine the information across phenotypes and make use of the full information available in data in order to achieve the maximum power. Most of the previous approaches either considered each phenotype separately, or used relatively primitive types of phenotype correlation structures such as surrogate phenotypes transformed through PCA or the mean values of subgroups of phenotypes found by clustering algorithms. Networks or graphs have been extensively studied as a representation of the correlation structure of phenotypes such as gene expression or clinical traits because they provide a flexible and explicit form of representation for capturing dependencies

In this article, we proposed a new family of regression methods called GFlasso that directly incorporates the correlation structure represented as a QTN and uses this information to guide the estimation process. These methods considered a multitude of phenotypes jointly, and estimated a joint association model in a single statistical framework. Often, we are interested in detecting genetic variations that perturb a sub-module of phenotypes rather than a single phenotype, and GFlasso achieved this through a fusion penalty, in addition to the lasso penalty, that encourages parsimony in the estimated model. The fusion penalty locally fused two regression coefficients for a pair of correlated phenotypes, and this effect propagated through edges of the QTN, effectively applying fusion to all of the phenotypes within each subgraph.

The fusion penalty in GFlasso introduced a bias toward a shared QTL having a similar amount of influence over the set of correlated traits, in order to increase the power for detecting weak signals and reduce false positives. The simulation results showed that the benefit of information sharing due to the fusion penalty outweighed the risk of low-variance bias on fused regression coefficients, even when in reality the magnitudes of the coefficients can be highly variable. Perhaps a more effective approach and a promising future direction would be to encourage each SNP marker to be jointly relevant or irrelevant to the subset of correlated traits, but still allow the marker to have a different amount of influence on each of the traits. This could reduce the bias introduced by the fusion penalty and further improve the performance of GFlasso, since the only information shared across correlated traits would be the sparsity pattern, not the magnitudes of the regression coefficients.

We have used a simple scheme of a thresholded correlation graph for learning the QTN of phenotypes to be used in GFlasso. Many different types of network-learning algorithms have been developed previously. For example, graphical Gaussian models (GGMs)

In this study, we assumed that the graph structure of a QTN is available from a pre-processing step. One of the possible extensions of the proposed method is to learn the QTN and the regression coefficients jointly by combining GFlasso with the graphical lasso

For any new multivariate genetic-association method, a natural question is whether the new method can scale to a genome-scale analysis. The current implementation of GFlasso leaves this to be determined by a user-specified tradeoff between power and computation time. As shown in

Finally, it is important to point out that GFlasso currently considers only dependencies among phenotypes, and does not assume any dependencies among the markers. Since recombination breaks chromosomes during meiosis at non-random sites, segments of chromosomes rather than individual nucleotides are inherited as a unit from ancestors to descendants, creating lower diversity in observed haplotypes than would be expected if each allele were inherited independently. Thus, SNPs in high linkage disequilibrium (LD) are likely to be jointly associated with a phenotype in a regression-based penetrance function. In our future research, we plan to apply the same idea of the graph-guided fusion penalty for phenotypes to incorporate the LD structure among genotypes. It is straightforward to introduce another fusion penalty for correlated markers based on the genotype correlation graph, and to weight each term in the penalty using values that reflect the recombination rates and distances between each pair of genetic markers. This would allow a genome-phenome association analysis for identifying a block of correlated markers influencing a set of correlated phenotypes.

Software for our proposed method is available at

We thank Sally Wenzel, M.D. for providing the SARP dataset, and for her helpful discussions on the asthma association study. We also thank Ross Curtis and Kyung-Ah Sohn for their help with the implementation of the software and for helpful discussions on the simulation study.