An Effective Method to Identify Heritable Components from Multivariate Phenotypes

Multivariate phenotypes may be characterized collectively by a variety of low level traits, such as in the diagnosis of a disease that relies on multiple disease indicators. Such multivariate phenotypes are often used in genetic association studies. If highly heritable components of a multivariate phenotype can be identified, it can maximize the likelihood of finding genetic associations. Existing methods for phenotype refinement perform unsupervised cluster analysis on low-level traits and hence do not assess heritability. Existing heritable component analytics either cannot utilize general pedigrees or have to estimate the entire covariance matrix of low-level traits from limited samples, which leads to inaccurate estimates and is often computationally prohibitive. It is also difficult for these methods to exclude fixed effects from other covariates such as age, sex and race, in order to identify truly heritable components. We propose to search for a combination of low-level traits and directly maximize the heritability of this combined trait. A quadratic optimization problem is thus derived where the objective function is formulated by decomposing the traditional maximum likelihood method for estimating the heritability of a quantitative trait. The proposed approach can generate linearly-combined traits of high heritability that has been corrected for the fixed effects of covariates. The effectiveness of the proposed approach is demonstrated in simulations and by a case study of cocaine dependence. Our approach was computationally efficient and derived traits of higher heritability than those by other methods. Additional association analysis with the derived cocaine-use trait identified genetic markers that were replicated in an independent sample, further confirming the utility and advantage of the proposed approach.


Introduction
Identifying genetic variation that underlies complex phenotypes has important implications for genetics and biology [1,2]. The power of most gene discovery studies is positively associated with the heritability of the trait [3]. Higher heritability of a trait implies that the trait varies due to stronger genetic influence. Thus, there is greater chance to detect its genetic causative variants. The narrow sense heritability h 2 is defined by the percentage of phenotypic variance that is due to additive genetic effects. The broad sense heritability H 2 is defined by the proportion of phenotypic variance due to all genetic variation.
Many complex phenotypes comprise a variety of low level traits (or phenotypic features) that are often highly variable. Association analysis of such a complex phenotype is impeded by this phenotypic heterogeneity [4]. For example, the diagnosis of drug dependence is determined by various patterns of drug use, their effects, and related behaviors [5]. A binary multivariate trait defined by the diagnosis of cocaine dependence, which partitions the population into cases (subjects with the disorder) and controls (subjects without the disorder), cannot differentiate the heterogeneous manifestations of the disease. Because of this, the success of identifying genetic variants is limited when using this binary trait in association analysis [6,7]. Identifying highly heritable components of the disease could permit the detection of genetic variants that are not detectable using the standard diagnosis-based traits [8][9][10][11][12]. Efforts have been made to enhance the binary trait by capturing more phenotypic variation, such as defining a multivariate trait as symptom count [7]. However, this kind of multivariate trait can have low heritability and may thus be sub-optimal for association analysis.
Heritable component analysis methods identify principal components of the data, i.e., linear combinations of low level traits, that are heritable [13][14][15][16]. All current methods decompose the identification of heritable components into solving two separate subproblems in sequence. They first estimate two covariance matrices of the low-level traits: S a , the variance due to additive genetic effects taking into account the relationships of individuals (family structure); and S, the covariance matrix due to effects other than additive genetic effects. If there are d low level traits in x, this means that two d-by-d matrices need to be estimated from the sample. Once the two covariance matrices are computed, a generalized eigenproblem is solved to identify the combination coefficients w so that the ratio of w > S a w/w > Sw is maximized, leading to high heritability for the combined trait w > x.
A few methods have been developed in the literature to estimate the two covariance matrices. In [14,15], the two matrices are estimated based on the genetic effect of a single quantitative-trait locus to all the low level traits. This method has limited utility when the variancecovariance of the low level traits is due to multiple genetic loci (which is often the case for complex phenotypes). In [13,16,17], the two covariance matrices are estimated from family pedigrees of the sample. The approach used in [13] takes only siblings in a family, so it is inadequate to handle general (complex) pedigrees. The two approaches in [16] and [17] can handle general pedigrees. The first one derives an analytic formula for the covariance matrices based on Analysis of Variance (ANOVA). Although reducing the computational cost, the analytic formula is unable to take into account the fixed effects from covariates such as sex, age or race, which is also a problem for the method in [13]. Currently, the most comprehensive approach is a maximum likelihood method [17] that can estimate the fixed effects and covariance matrices together, but this method is computationally prohibitive as discussed in [16]. Even when d = 20 low level traits are used, this method can run for days, and as observed in our experiments, the method may not converge. It requires very large sample to obtain reliable estimates of two covariance matrices and d combination coefficients, totally 2d 2 + d parameters, from a sample.
We show that, to obtain highly heritable components of a multivariate trait, the estimation of two covariance matrices is unnecessary. We propose an optimization approach that directly identifies a linear combination of low level traits whose estimated heritability is maximized. This optimization problem is formulated by decomposing the maximum likelihood method for estimating trait heritability. An sequential quadratic programming algorithm is developed to optimize the problem. We then extend the basic formulation to correct fixed effects of covariates in the component analysis. Because we do not estimate any covariance matrix, our approach is computationally much more efficient than those in [13,17]. The proposed approach is validated in both simulations and a case study on cocaine dependence. The effectiveness of the approach is demonstrated not only by the higher cross-validated heritability of the derived traits than the existing methods but also by a follow-up association study that compares the utility of the derived traits with the commonly used phenotype. Specifically, a highly heritable multivariate trait was derived for cocaine dependence. More statistically significant associations were found for this trait than for a symptom-count phenotype.

Methods
We first introduce the standard methods for heritability estimation, and then derive our formulation that maximizes the heritability of a linearly-combined trait. An efficient algorithm is developed to optimize the formulation. At last, we extend the approach to take into consideration the fixed effects of covariates.

Background: Heritability Estimation
To estimate the heritability of a quantitative trait y, the well established maximum likelihood method is based on linear mixture models [3,18]. The method assumes that the phenotype y i of a family i follows a multivariate normal distribution with covariance O i and separate means for male and female family members, μ m and μ f , respectively. Separate means are used for males and females based on the general observation that males and females present differences in quantitative traits, such as height and weight. The (j, k)-th entry of O i is the phenotypic covariance of two family members j and k, given by where s 2 a and s 2 d are the variance components due to additive and dominant genetic effects, respectively, and s 2 e denotes the variance component due to environmental factors. Eq (1) can be extended to include other effects, such as an epistatic genetic effect s 2 I . The quantity F i jk is the kinship coefficient between members j and k. It is the probability that two alleles randomly drawn from j and k at a genetic locus are identical by descent (IBD), i.e., that these two alleles are identical copies of the same ancestral allele. An allele is one of the alternative forms at a genetic locus. As the human genome is diploid, each individual has two copies of an allele that may differ at a genetic locus. The quantity D i jk is the probability that members j and k share both alleles at a genetic locus. Both matrices F i and Δ i can be calculated from the family pedigrees [3]. Example entries of F and Δ between selected family members are illustrated in Table 1 where random mating is assumed. The parameter G i jk is an environmental indicator that encodes whether j and k live together (G i jk ¼ 1) or apart (G i jk ¼ 0). The narrow sense heritability is given by h 2 ¼ s 2 a =s 2 p where s 2 p is the total variance in y, i.e., s 2 p ¼ s 2 a þ s 2 d þ s 2 e , while the broad sense heritability is given by H 2 ¼ ðs 2 a þ s 2 d Þ=s 2 p . In this paper, we target at quantitative traits with higher narrow sense heritability, which we henceforth simply refer to as heritability. However, our formulation can be easily modified to derive a quantitative trait of high H 2 .
The five parameters, μ m , μ f , s 2 a , s 2 d and s 2 e , are estimated by maximizing the log likelihood of the trait values over all sample families [18]. The log likelihood is computed by where μ i denotes a vector of the means μ m and μ f for male or female members, respectively, in the family i. The gradient and Hessian of Eq (2) with respect to μ m , μ f , s 2 a , s 2 d and s 2 e can be calculated, and a Newton-Raphson algorithm or a scoring method [18] can be applied to maximize the log likelihood Eq (2).
The heritability of a quantitative trait y is often estimated with correction for the effects of covariates z, such as age, sex, or race. These covariate effects are modeled as fixed effects on y. Thus, a linear regression model y = z > v + can be built where v indicates the combination weights for the covariates. The heritability of the residual is then estimated using the described maximum likelihood method and treated as the heritability of y after adjusting for covariate effects.

Proposed Quadratic Optimization
In heritability estimation, a trait is given, and we search for the values of s 2 a , s 2 d and s 2 e that maximize the likelihood of observing the trait values and compute the heritability as s 2 a =ðs 2 a þ s 2 d þ s 2 e Þ. In our study, we solve the inverse problem that a trait must be derived so that its heritability is maximized when estimated by the above maximum likelihood method.
For a given set of d phenotypic features x, we find a linearly combined trait y : y = x > w. If a trait y has the highest possible heritability, the covariance of y among any family members in family i, covðy i j ; y i k Þ, should be due to the additive effect s 2 a only, and s 2 d ¼ s 2 e ¼ 0. In other words, for such a trait, the covariance matrix of the phenotype y i of a family i relies only on the additive effect parameter s 2 a and the kinship matrix F i , i.e., Ω i ¼ 2s 2 a F i . Thus s 2 a is equal to the total variance s 2 p of y. We need to search for the values of w that maximize the likelihood of observing s 2 d ¼ s 2 e ¼ 0, or in other words, that maximize the likelihood of Ω i ¼ 2s 2 a F i . Let X i be the data matrix on the d features (as columns) for the subjects (as rows) in family i. Then the trait values of the family members form a vector y i ¼ X > i w. Because y is homogeneously dependent on the unknown w, w can be scaled so that the sample variance of y is 1, which implies that s 2 p ¼ 1 (and hence s 2 a ¼ 1). Substituting the values of O i , y i and s 2 a into the log likelihood in Eq (2) yields the following maximization problem: which is equivalent to the following minimization problem after eliminating constants (for By stacking the H i matrices of different families in columns, we get another matrix H. Similarly, we can form a matrix X, so the trait values of all subjects y = X > w. The sample variance of the trait y is, by definition, (1/n)(y − μ) > (y − μ). It is equal to (1/n)(X > w − μ) > (X > w − μ) = (1/n)β > H H > β. Then, the condition of s 2 p ¼ 1 corresponds to a constraint on β: (1/n)β > H H > β = 1, which can be rewritten as β > H H > β − n = 0.
As a matter of fact, μ m and μ f are not free parameters, as they are determined once w is determined. They are equal to the sample means of the trait, i.e., Mean(x > w), for male and female, respectively. Let x m and x f be the two means of the data vector x respectively over male and female samples. Then, x > m w ¼ m m and x > f w ¼ m f . These equations give two additional constraints. Let a m ¼ ½x > m ; À1; 0 > , a f ¼ ½x > f ; 0; À1 > , then the two constraints on β state that a > m b ¼ 0 and a > f b ¼ 0. Imposing all of these constraints on Eq (4) yields an optimization problem where a quadratic objective needs to be minimized subject to a quadratic constraint and two linear equality constraints as follows: According to statistical learning theory [19], optimizing only the empirical heritability on the training sample as in Eq (4) will lead to the so-called overfitting problem, which means that the resultant model y = x > w has low validation heritability despite a high training heritability. To enhance the generalizability of the derived model to new samples, a regularization condition on w, R(w), is required to control the complexity of the model. The objective function in Eq (4) thus becomes where λ is a pre-specified tuning parameter for balancing the two terms in the objective function, and R(w) can be realized in different forms and be application-specific. For example, R(w) can be implemented with the ℓ 1 vector norm: jjwjj 1 ¼ P d j¼1 jw j j, which is known to create shrinkage effects on w as shown in the Least Absolute Shrinkage and Selection Operator (LASSO) method [20]. When features in x are clustered in multiple groups and sparsity in the level of each feature group is desirable, R(w) can be implemented by the ℓ 2,1 vector norm as used in the group LASSO [21] and defined by jjwjj the indices of the features in the group ℓ. Specifically, we develop an algorithm in the next section to solve the following optimization problem with the ℓ 1 norm regularization condition Note that Problem Eq (4) is a special case of Problem Eq (6) when λ = 0. Hence, a solver for Problem Eq (6) can also solve Problem Eq (4).

Solving the Proposed Optimization Problem
The objective function in Eq (6) is not differentiable because of the ℓ 1 norm regularization condition. However, by a widely used change-of-variables strategy, we can convert it into an equivalent differentiable form so gradient based solvers can be used. We introduce two sets of variables u ! 0 and v ! 0 both of length equal to that of w. We set w = u − v, which gives Stacking all K i 's in columns leads to the full matrix K. The quadratic constraint in Eq (6) then becomes the linear constraints in Eq (6) We have bound constraints on the new variables, i.e., u ! 0 and v ! 0. We hence design a matrix J = [I 2d × 2d , [0] 2d , [0] 2d ] where I 2d × 2d is the identity matrix of dimension 2d × 2d, and [0] 2d is a column vector of all zero entries with length of 2d. Then, the bound constraints can be written as Jγ ! 0. Overall, with the new variables u and v, Eq (6) can be re-written as follows where e = 2d + 3 is the total number of constraints in the problem. It can be proved mathematically that the optimal solution of Eq (7) is identical to the optimal solution of Eq (6) in the sense that the optimal w = u − v. Note that the regularization condition in Eq (7), At optimality of Eq (7), either u j = 0 or v j = 0 for the jth feature because otherwise they would not be optimal. If both u j > 0 and v j > 0 and assume u j ! v j , then we have another solution, (ũ j ¼ u j À v j ,ṽ j ¼ 0), that achieves lower objective value than (u j , v j ) because the first term of f remains the same whereas the second term of f is reduced by 2v j . Thus, at optimality, the regularizer of Eq (7) P 2d (7) is not a convex problem due to the quadratic equality constraint g 1 , we can solve it efficiently by the framework of sequential quadratic programming (SQP) [22]. A SQP algorithm solves the optimization problem iteratively. At each iteration, it approximates the original problem by a convex quadratic program, for which a solution can be easily computed. A quadratic program is defined as a minimization of a quadratic objective function subject to linear constraints. To form the approximate subproblem, the Lagrangian function of Eq (7) is used: where α contains all Lagrange multipliers and k indexes the constraints. We use the secondorder Taylor expansion to approximate the Lagrangian which forms the quadratic objective function of the subproblem, and use the first-order expansions to approximate the original constraints which form linear constraints for the subproblem. The gradients of the objective function f and the constraints g i:i = 1:e with respect to γ can be calculated as follows: 2d is a column vector of all ones with length of 2d. The Hessian of L with respect to γ is calculated as: The subproblem at each iteration is formulated based on the current iterates γ t and Lagrange multipliers α t . At the iteration t + 1, the search directions for both γ and α can be computed by solving the following quadratic program where p is the search direction of γ, along which the objective function f can be decreased. Let p t be the solution to this subproblem andq t be the corresponding optimal Lagrange multipliers ofp t , the search direction of α is calculated asq t À a t . Then, a line search method, such as those described in [22], can be used to determine a step size of moving along the directions. Then, γ and α are updated as follows: Algorithm 1 summarizes the SQP algorithm that we developed to solve Eq (7), and hence Eq (6).
Algorithm 1 A sequential quadratic programming approach to solving Eq (7) and μ m , μ f equal to the sample male and female means of the obtained trait when w = 1.
3. Evaluate f, rf, rg k and r 2 gg L with the current γ and α. 4. Solve Eq (10) to obtainp andq. 5. Perform line search to find the learning step size s. 6. Update γ and α as in Eq (11). Repeat 3-6 until γ reaches a fixed point.

Correction for Covariates
As discussed in the background section, the heritability of a quantitative trait y with effects from covariates z is equal to the heritability of the residual of the linear model y = z > v + . Therefore, our objective here is to findŵ andv that optimize the heritability estimate of : = x > w − z > v, as y = x > w. Let Z p × n be the data matrix on z of length p for the n subjects, the residual is calculated for all the subjects as = X > w − Z > v.
Given the data Z and y, a linear regression model y = z > v + is typically obtained through a least squares method which has an analytical solution,v ¼ ðZZ > Þ À1 Zy.
which can be pre-calculated from data, the calculation of can be rewritten as = M > w. Then, the objective of optimizing the heritability of can be translated to finding the optimal w that gives an of highest estimate of heritability. Comparing to the problem of finding w that gives a trait y = x > w with highest possible heritability, the only difference we have here is that the design matrix has been changed from X to M for the parameters w. Therefore, we can use the same SQP algorithm (Algorithm 1) to find the w that optimizes the heritability of . An interesting observation in our derivation is that correcting a quantitative trait to account for covariant effects is equivalent to correcting the data matrix that used to derive the trait.

Algorithm Evaluation
The proposed approach was first validated in simulations where we compared it with the current two-step approaches, i.e., estimating the two covariance matrices from pedigrees first and then solving an eigenproblem. We compared with all the three different methods that can be used to estimate the variance-covariance matrices, which were referred to, respectively, as Ott [13], Anova [16] and ML [17]. The following Results section provides the details of the simulation and empirical evidence showing the superior performance of our algorithm. After validated in simulations, the proposed approach was then used in a case study to analyze a real-world dataset that was aggregated from genetic studies of cocaine dependence (CD) [7,23]. Our algorithm was able to derive a quantitative trait with higher heritability than that of commonly used CD phenotypes. To show how our approach could help genetic association analysis, we compared the utility of the derived trait against the symptom-count phenotype as traits in association analysis and replicated the findings on a separate sample. The narrow sense heritability of all of the tested traits in this study was estimated by the widely-used polygenic function in the Sequential Oligogenic Linkage Analysis Routines (SOLAR) program [24].
Ethics statement. The Semi-Structured Assessment for Drug Dependence and Alcoholism (SSADDA) dataset [7] was used in both our simulations and the case study to evaluate the proposed algorithm. The SSADDA subjects were recruited from multiple sites, including the University of Connecticut Health Center, Yale University School of Medicine, the University of Pennsylvania School of Medicine, McLean Hospital and the Medical University of South Carolina. All subjects gave written, informed consent to participate, using procedures approved by the institutional review board (IRB) at each participating site. Readers can consult with [7] for the details of subject recruitment in those studies. The SSADDA data were de-identified and the analyses in this present study were approved by the University of Connecticut IRB Protocol H15-045 and the University of Pennsylvania IRB Protocals 804787 and 812856.

Results
This section provides the details of the simulation process and the case study of CD together with the empirical evidence showing the superior performance of our approach.

Simulations
In order to make our synthetic data more realistic but with known patterns, we created the synthetic data based on family structures in the SSADDA dataset. In this dataset, there were totally 6810 subjects, of which 1915 were from small nuclear families and the remaining subjects were unrelated individuals. Based on the family structures in this data, we synthesized two quantitative traits following the same assumptions used in the maximum likelihood method for heritability estimation [18].
Experimental data and procedure. The values of the first trait y 1 were randomly drawn for each family from a multivariate Gaussian distribution : N(μ, O), where μ and O were determined as follows. The dimension of μ was determined by the number of subjects in the family, such that each dimension corresponded to an individual in the family. The value of μ used in the simulations may vary between families according to the gender of the members. Precisely, if a family member is male, μ was set to μ m ; otherwise it was set to μ f . The covariance matrix O was given by the following equation: where F and Δ were composed according to Hence, 80% of the phenotypic variance was due to additive genetic effects, and the ideal heritability is 0.8 according to Eq (13). By the random nature of the simulation, the actual heritability of the simulated trait may vary a little. In order to evaluate if our approach can correct for fixed effects of covariates, we created another quantitative trait y 2 by adding effects from age and race to y 1 . Let c 1 and c 2 measure the effects of age and race respectively on y 2 , we calculated y 2 as follows: y 2 = y 1 + c 1 × age + c 2 × race. The values of the two c's were arbitrarily set to c 1 = 1.1 and c 2 = 0.7, (which can certainly be set to any other non-zero values). Using SOLAR, we estimated the heritability of y 1 with sex as covariate (h 2 = 0.796) and the heritability of y 2 with sex, age, race as covariates (h 2 = 0.797).
We next simulated data of phenotypic features for the two quantitative traits. For each trait, we synthesized a dataset consisting of d = 10 relevant phenotypic features. We first specified the weights w of the features; then we generated data for these features as follows. For each subject, we randomly picked d − 1 features and drew their values randomly from the standard multivariate Gaussian distribution. Assume that the k-th feature is the remaining feature. Its value for subject i was computed by ðy i À P d j¼1;j6 ¼k w j x i j Þ=w k (where w k 6 ¼ 0 because these 10 features were created with non-zero weights in the linear model). This procedure guaranteed that the trait y ¼ P d j¼1 w j x j , and because the feature computed from the values of other features varied randomly among subjects, every feature had a portion of randomly-drawn data.
In practice, a multivariate trait may not depend on all of the considered phenotypic features. In order to test if our approach can identify the relevant features, we created four other datasets for each of the traits, respectively, consisting of d = 20, 30, 40 and 50 features where only the first 10 of them were created following the above procedure, thus relevant to the simulated traits. The other features were all randomly drawn from standard Gaussian distribution and assigned a weight of 0. By simulating the data in this way, there was at least one linear combination of the features in each dataset that led to the simulated traits of high heritability. If our approach is to work, it should find this linear combination which is considered as the groundtruth model. There is a likelihood that another linear combination could give even higher heritability due to the random nature of the data, but this likelihood is small. In our experiments, none of the algorithms could locate any other combinations with higher heritability than the implanted one.
In practice, a multivariate trait may also depend on some features that are not observed. In our simulations, it implied that some of the ten relevant features might be absent. Therein, we further explored how our approach performed in the situation where the data was incomplete by randomly removing relevant features. We experimented with removing one to five relevant features incrementally. Note that in this sensitivity test, there was no longer a groundtruth model for the algorithm to test against because the implanted linear model had been broken with missing features. In this case, if our approach is to work, it should find a combination that leads to a heritability estimate no lower than that of the original features and that derived by other known methods.
The three previous methods evaluated in our comparison all used a regularization condition in their eigenproblem, so they also had a tuning parameter λ. In the experiments with each dataset, the parameters λ of all methods were tuned in the same three-fold cross validation process. More specifically, for each dataset, we randomly split the sample into three groups, and each group had the same amount of unrelated individuals and families with multiple members whenever it was possible. Samples in each group were used in one of the three folds, respectively, as the validation data to test the heritability of the trait derived by a method from the rest of the samples. We repeated this three-fold cross validation with 10 random splits for each choice of λ on each dataset. The choices of λ were pre-specified to the range of [0, 50] with a step size of 1. For each method, the choice of λ that gave the best cross validated heritability was used in the subsequent analysis.
In the experiments with the trait y 1 , all methods did not use covariate data as the trait was not simulated with fixed effects. In the experiments with the trait y 2 , because Ott [13] and Anova [16] could not take into account any covariate, we compared our approach with only the maximum likelihood method (ML) [17] with sex, age and race as covariates for fair comparison. The ML software package, downloaded from http://www.genetics.ucla.edu/software/ mendel, had the default maximum number of iterations equal to 200, and we also experimented with 500 and 1000. We observed that the ML method could not reach convergence in the experiments with even 20 phenotypic features within a reasonable time limit (two days). Due to this computational hurdle, the ML method could not be applied to datasets with over 20 features.
Observations from simulations. We first examined the algorithmic behavior of the proposed approach. Fig 1 shows box plots of three-fold cross validated heritability (average values over the 10 trials and standard deviations) of the linear models derived by our approach for the simulated trait y 1 from the five datasets. We observed that the proposed method was able to Identify Heritable Components recover the linearly-combined traits with a relatively wide range of λ choices. From Fig 1, when λ = 1, 1, 13, 18, and 18 respectively for the five datasets, the best validation heritability was obtained. This observation shows that when the underlying model gets sparse, larger λ is favorable to prevent overfitting by removing irrelevant features. We had similar observations in the experiments with y 2 as shown in Fig 2. Fig 2 reports the same box plots for the simulated trait y 2 . The validation heritability of the derived traits were high (with a small decrease when more irrelevant features were experimented), which demonstrated that the proposed approach could effectively correct for covariates in finding heritable components.
We next examined the comparison of our approach against the state of the art. To be more thorough, we compared all four methods using four different metrics including validated heritablity, sum of squared residuals to the simulated trait y 1 or y 2 (SE(trait)), squared difference between the learned weightsŵ and the true weights w, i.e., jjw Àŵjj 2 (SE(w)), as well as the computation cost. Table 2 shows the cross validated heritability of the traits derived by each of the methods in the two sets of experiments with y 1 and y 2 . The performance was reported with the best λ choice of each method. It is clear that the traits derived by our approach always achieved the highest heritability. Table 3 compares the values of SE(trait), SE(w), and the computation time in seconds. In particular, the computation cost was measured by running each of the methods on the full datasets when the best λ value was used. Across all the datasets, our approach obtained the smallest errors as measured by SE(trait) and SE(w). Because Anova used analytic formula to compute covariance matrices, and Ott used a single locus in the covariance estimation, both methods required slightly less computation cost than our approach. However, they were limited only to the situations that had no confounding factors (covariates or other loci) in the heritability calculation. Between the two comprehensive methods, our approach was significantly more efficient than the ML method in computation, making the heritable component analysis with a large number of phenotypic features feasible.
Our approach identified multivariate traits of much higher heritability than the commonly used traits. We compared the heritability of the traits derived by our approach against that of commonly-used features. We used the traits derived by our approach from the cross validation process when the best λ values were used. As shown in Fig 3 (without covariates) and Fig 4  (with covariates), the validation heritability of the derived traits were significantly higher than that of individual features and the average of them.
Without loss of generality, we used the 20 feature dataset that we synthesized for y 1 to evaluate if our approach could still find heritable components when the groundtruth models were broken. The results are reported in Fig 5 where we compared the heritability of our derived Table 3. Comparison of the methods on the sum of squared residuals (SE(trait)), squared difference of the true weights and the learned weights (SE(w)), and the computation time (in seconds) in the experiments without covariates (results are presented in rows from 3 to 7) and with covariates (results are presented in rows from 8 to 12). The "-" sign indicates that those experiments were infeasible due to prohibitive computation cost. The "*" sign indicates that the corresponding methods were not tested due to the limitation of the methods that could not handle covariates. The computation time reported for the ML method was measured when the maximum number of iterations was set to 200.
doi:10.1371/journal.pone.0144418.t003  traits against the maximum heritability that other methods could reach and that of the original features. Clearly, the traits derived by our approach achieved much higher heritability.

A Case Study: Cocaine Use and Related Behaviors
We applied the proposed approach to a genetic study of cocaine use and related behaviors. Two independent sets of samples were used in our analysis: the SSADDA dataset [7], which was used for discovery; and the Study of Addiction: Genetics and Environment (SAGE) dataset [25], which was used for replication of the SSADDA findings. The SAGE data were aggregated from multiple NIH-funded projects [26] by NIH's dbGap system. We downloaded the data from the dbGap public domain [25] through dbGap accession number phs000092.v1.p. The SSADDA sample included 4895 unrelated individuals which were used in our analysis to help estimate the total phenotypic variance even though they had no effect on the covariance estimates. The SAGE dataset consisted of 58 individuals from nuclear families and 1603 unrelated individuals. The two datasets contained samples from two populations: African American (AA) and European American (EA).
All subjects were reported to have used cocaine in their lifetime, and were assessed on the following 13 features of cocaine use and related behaviors: • F1-tolerance to cocaine; • F2-withdrawal from cocaine; • F3-using cocaine in larger amounts or over longer period than intended; • F4-persistent desire or unsuccessful efforts to cut down or control cocaine use; • F5-great amount of time spent in activities necessary to obtain, use or recover from the effects of cocaine; • F6-gave up or reduced important social, occupational, or recreational activities because of cocaine use; • F7-cocaine use despite knowledge of persistent or recurrent physical or psychological problems likely to have been caused or exacerbated by cocaine; • F8-number of cocaine symptom endorsed; • F9-age when first used cocaine; • F10-age when last used cocaine; • F11-age when first being diagnosed with DSM4 cocaine dependence; • F12-age when last being diagnosed with DSM4 cocaine dependence; • F13-transition time in years between the first time cocaine use and the first cocaine dependence diagnosis.
Features F1-F7 were binary variables that took a value of "yes = 1" or "no = 0", and F8-F13 were continuous variables, which we normalized to the range of [0, 1] in the analysis. The majority of the 6810 subjects interviewed with the SSADDA, were genotyped on an Illumina microarray for 988,306 autosomal single-nucleotide polymorphisms (SNPs). Genotypes for additional 37,427,733 SNPs were imputed using IMPUTE2 [27] from genotyped SNPs and 1000 Genomes reference panel released in June 2011 (http://www.1000genomes.org). Both subjects and SNPs were undergone stringent quality control (readers can consult with [7] for details). After data cleaning, there were a total of 4,845 subjects (2674 AAs, 2171 EAs) and 30,078,279 SNPs (695,308 genotyped) remained for analysis. Top three ancestral principal components were computed using 145,472 SNPs that were common to discovery samples and the Hapmap panel. All of the 1661 SAGE subjects (640 AAs, 1021 EAs) in the replication dataset were genotyped for 1,072,657 SNPs. We derived a multivariate trait based on the 13 features of cocaine use and related behaviors. This trait was derived from the SSADDA data by Algorithm 1 with a correction for the fixed effects of age and race. Three-fold cross validation was performed to find the optimal λ, which was subsequently used to find a linearly combined trait from the 13 features based on the entire SSADDA data. The heritability of the derived trait was estimated and compared to that of individual quantitative features in the data, including the cocaine symptom count (F8). The feature F8 was recognized as a better trait than the binary trait induced by the diagnosis of cocaine dependence in a recent genomewide association study (GWAS) [7]. We compared the utility of the derived trait and the symptom count as traits in an association analysis. Association tests were performed on the SSADDA sample for both traits and separately for EAs and AAs to identify significant genetic markers at p < 5 × 10 −6 . We then computed the derived trait for the subjects in the replication SAGE sample. The markers identified from the SSADDA data were tested using the replication subjects. All tests included age, sex and the first three ancestral principal components as covariates. The association test results on discovery and replication data were combined by performing meta analysis using Metal [28]. Genomewide associations were identified from the meta analysis. Note that the heritability of the derived trait was not estimated on the SAGE data because 97% of the SAGE subjects were unrelated individuals. Identify Heritable Components on average in the cross validation. We hence used λ = 2 in Algorithm 1, and derived a linear combination of the features from the entire SSADDA data. The heritability of the derived trait and all individual quantitative features was estimated using SOLAR and reported in Table 4. The quantitative trait derived by our approach has substantially higher heritability than that of all other traits.
Using a regularization condition based on the sparsity-favoring ℓ 1 vector norm created shrinkage effects on our model. In other words, our approach selected parsimonious features to use in the linear combination. Fig 7 shows the combination weights of the features obtained in our model. Five of the 13 features had weight of 0, thus were not used by the model. The feature-age when first used cocaine received the largest positive weight and therefore had the strongest impact on the derived trait. The other four important features were F11-age onset of DSM4 CD diagnosis, F4-persistent desire or unsuccessful efforts to cut down or control cocaine  use, F5-great amount of time spent in activities necessary to obtain, use or recover from the effects of cocaine, and F3-using cocaine in larger amounts or over longer period than intended.
Features F6, F1 and F2 had some but limited effect on the derived trait. We identified three SNPs for the AA population and four SNPs for the EA population that passed our p-value threshold (5 × 10 −6 ) in the genomewide association tests with the discovery sample. These SNPs are listed in Table 5. In recent GWAS of substance use disorders, meta analysis was commonly used to identify genomewide significant associations, e.g., [7,23,29]. Following the same strategy in [7], we identified significant markers from the meta analysis results. Another recent study that used the same 1000 Genomes reference panel identified that 10 −8 is an appropriate p-value threshold for use in a GWAS that employs imputed SNPs [30]. Based on this threshold, the markers rs833936 and rs7224135 in Table 5 were significantly associated with the derived trait at the genomewide level, respectively for AAs and EAs, but not with the commonly-used cocaine symptom count. The other five markers in Table 5 were nominally significantly (1 × 10 −8 < meta p-value < 5 × 10 −6 ) associated with the derived trait only. In other words, using the standard phenotype in association tests would not discover these SNPs that are associated with a specific subtype (a quantitative subphenotype) of cocaine dependence. The marker rs833936 is located at the TXNIP gene which may act as an oxidative stress mediator when its expression is suppressed by synaptic activity in brain [31]. Two markers rs11079045 and rs7224135 are located at the PTRF gene which has been identified to be associated with cocaine abuse in an early transcriptional change study [32]. The EFEMP1 gene has not been reported in the genetic analysis of cocaine dependence. Since all the identified SNP markers have not been thoroughly studied in genetics of cocaine dependence, our findings may promote subsequent investigations for these genes as well as subtypes of cocaine dependence. The proposed heritable component analysis for multivariate phenotypes may provide a new strategy to improve genomewide association studies of complex disorders.

Discussion and Conclusion
In this paper, we have proposed a quadratic optimization formulation that is capable of identifying highly heritable components of complex phenotypes. The multivariate trait is derived as a linear function y = x > w of lower level traits x by explicitly maximizing its heritability. Specifically, we search for the optimal w that maximizes the likelihood of observing a high value of heritability. This is equivalent to finding the best w, so that the projected trait x > w will be best aligned with the kinship matrix F of the pedigree. An efficient algorithm based on sequential quadratic programming has been developed to optimize the proposed formulation. The algorithm is extended to allow the correction for covariate effects when deriving a heritable component.
Our simulation study provides evidence of the effectiveness of the proposed approach as a means to find highly heritable components of multivariate phenotypes. Then a case study on the phenotypes of cocaine use and dependence was conducted. A quantitative trait was identified based on thirteen cocaine use symptoms and behaviors. The trait had a heritability estimate of 0.7 (with p = 4.36 × 10 −22 , std = 0.06), which was much higher than a standard cocaine-use phenotype, e.g., the symptom-count trait, with heritability of 0.41. The subsequent phenotype-genotype association study demonstrated important utility of the derived trait for use in association analysis. Our results show that seven SNPs were significantly or nominally significantly associated with the derived subphenotype, but were not associated with the symptom count phenotype. Two out of the seven associated SNPs reached genome-wide significant level after correction for multi-testing following the procedure in [7,30].
Our formulation has a hyper-parameter λ. Using a hyper-parameter is common in machine learning algorithms such as support vector machines [33]. As a hyper-parameter, λ is not determined by solving the formulation itself and instead needs to be pre-specified. Both our simulation study and our case study showed that our formulation is fairly robust to the value of λ when it is chosen from a reasonably wide range. In real-world applications, hyper-parameters are often determined by a cross-validation process, which was used in our experiments.
Discovering heritable components of a multivariate phenotype can also improve genomic prediction [34]. If a trait is highly heritable, a model that is based on genomic markers to predict the trait value can achieve high accuracy [35]. In agricultural science, heritability of the breeding trait is considered to be one of the most important factors for the performance of a breeding program. Breeding programs targeted at conceptual but economically important phenotypes, such as feed efficiency or heat tolerance of animals, are confronted with a wide variety of available measures for the phenotype [36,37]. Residual body weight gain, residual feed intake, or relative growth rate are feed efficiency measures for dairy cattle with heritability ranging from 0.28 to 0.45 [36,38]. Each of these measures forms a multivariate trait that is defined by a linear function of low level traits, such as body weight, diet and feed energy intake, and days in milk. Our new algorithm can help the identification of more heritable measures for conceptual phenotypes of animal or plant.
There are limitations of our proposed technique. The non-convex quadratic optimization formulation requires a complex solver, such as sequential quadratic programming. For a sample that contains millions of subjects, it may become computationally prohibitive. More efficient solvers or approximations may be needed to scale up the proposed approach. In some applications, complex grouping structures may exist in the data between different lower level traits. A formulation that takes into account the special data structure may be more useful in producing biologically and clinically meaningful traits. As discussed in the paper, alternative regularization conditions exist, including some that may deal well with complex data structures, such as the one based on ℓ 2,1 vector norm. Algorithms that can solve the formulations with alternative regularization terms need to be developed. Additional empirical studies across different disciplines are needed to evaluate the capability and effectiveness of the proposed approach. and Harvard Medical School, and David Oslin, M.D., of the University of Pennsylvania Perelman School of Medicine oversaw study recruitment at their respective sites. Funding support for that work was provided by NIH.
Funding support for the Study of Addiction: Genetics and Environment (SAGE) was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01 HG004422). SAGE is one of the genome-wide association studies funded as part of the Gene Environment Association Studies (GENEVA) under GEI. The dataset used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/ study.cgi?study_id = phs000092.v1.p1 through dbGaP accession number phs000092.v1.p.