A gene based approach to test genetic association based on an optimally weighted combination of multiple traits

There is increasing evidence showing that pleiotropy is a widespread phenomenon in complex diseases for which multiple correlated traits are often measured. Joint analysis of multiple traits could increase statistical power by aggregating multiple weak effects. Existing methods for multiple trait association tests usually study each of the multiple traits separately and then combine the univariate test statistics or combine p-values of the univariate tests for identifying disease associated genetic variants. However, ignoring correlation between phenotypes may cause power loss. Additionally, the genetic variants in one gene (including common and rare variants) are often viewed as a whole that affects the underlying disease since the basic functional unit of inheritance is a gene rather than a genetic variant. Thus, results from gene level association tests can be more readily integrated with downstream functional and pathogenic investigation, whereas many existing methods for multiple trait association tests only focus on testing a single common variant rather than a gene. In this article, we propose a statistical method by Testing an Optimally Weighted Combination of Multiple traits (TOW-CM) to test the association between multiple traits and multiple variants in a genomic region (a gene or pathway). We investigate the performance of the proposed method through extensive simulation studies. Our simulation studies show that the proposed method has correct type I error rates and is either the most powerful test or comparable with the most powerful tests. Additionally, we illustrate the usefulness of TOW-CM based on a COPDGene study.


Introduction
Complex diseases are often characterized by many correlated phenotypes which can better reflect their underlying mechanism. For example, hypertension can be characterized by systolic and diastolic blood pressure [1]; metabolic syndrome is evaluated by four component traits: high-density lipoprotein (HDL) cholesterol, plasma glucose and Type 2 diabetes, PLOS ONE | https://doi.org/10.1371/journal.pone.0220914 August 9, 2019 1 / 17 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 abdominal obesity, and diastolic blood pressure [2]; and a person's cognitive ability is usually measured by tests in domains including memory, intelligence, language, executive function, and visual-spatial function [3]. Also, more and more large cohort studies have collected or are collecting a broad array of correlated phenotypes to reveal the genetic components of many complex human diseases. Therefore, by jointly analyzing these correlated traits, we can not only gain more power by aggregating multiple weak effects, but also understand the genetic architecture of the disease of interest [4]. Even though genome-wide association studies (GWASs) have been remarkably successful in identifying genetic variants associated with complex traits and diseases, the majority of the identified genetic variants only explain a small fraction of total heritability [5]. Furthuer, a gene is the basic functional unit of inheritance whereas the GWAS are primarily focused on the paradigm of single common variant. However, most published GWASs only analyzed each individual phenotype separately, although results on related phenotypes may be reported together. Large-scale GWAS of complex traits have consistently demonstrated that, with few exceptions, common variants have moderate-to-small effects. Therefore, it is important to identify appropriate methods that fully utilize information in multivariate phenotypes to detect novel genes in genetic association studies.
In GWAS, several methods have been developed for multivariate phenotypes association analysis [3] to test association between multivariate continuous phenotypes and a single common variant. To our knowledge, current multivariate phenotypes association methods can be roughly classified into two categories: univariate analysis and multivariate analysis. Univariate analysis methods perform an association test for each trait individually and then combine the univariate test statistics or combine the p-values of the univariate tests [6][7][8][9]. Even though such methods are computationally efficient, they neglect the omnipresent correlation between individual phenotypes and may reduce the power compared to multivariate analysis. Multivariate analysis methods jointly analyze more than one phenotype in a unified framework and test for the association between multiple phenotypes and genetic variants. Multivariate analysis methods include multivariate analysis of variance (MANOVA) [10], linear mixed effect models (LMM) [11], and generalized estimating equations (GEE) [12]. Another special approach is to consider reducing the dimension of the multivariate phenotypes by using dimension reduction techniques. The common method for dimensionality reduction is principal component analysis (PCA) [13] which essentially finds the combination of these phenotypes and assumes that the transformed phenotypes are independent. The limitation of this method is that it can not properly account for the variation of phenotypes or genotypes. It is also hard to interpret the meaning of principle components of the multivariate phenotypes, especially in practice.
Recent studies show that complex diseases are caused by both common and rare variants [14][15][16][17][18][19][20]. Gene-based analysis requires statistical methods that are fundamentally different from association statistics used for testing common variants. It is essential to develop a novel statistical method to test the association between multiple traits and multiple variants (common and/or rare variants). In this article, we develop a statistical method to test the association between multiple traits and genetic variants (rare and/or common) in a genomic region by Testing the association between an Optimally Weighted combination of Multiple traits (TOW-CM) and the genomic region. TOW-CM is based on the score test under a linear model, in which the weighted combination of phenotypes is obtained by maximizing the score test statistic over weights. The weights at which the score test statistic reaches its maximum are called the optimal weights. We also use extensive simulation studies to compare the performance of TOW-CM with MANOVA [10], multi-trait sequence kernel association test MSKAT [21] and minimum p-value [22]. Simulation studies demonstrate that, in all the simulation scenarios, TOW-CM is either the most powerful test or comparable to the most

Methods
We consider a sample with n unrelated individuals. Each individual has K (potentially correlated) traits and has been genotyped at M variants in a considered region (a gene or a pathway). Denote y ik as the k th trait value of the i th individual and x im as the genotype score in additive coding of the i th individual at the m th variant. Let Y = (Y 1 , � � �, Y K ) denote the random vector of K traits and X = (X 1 , � � �, X M ) denote the random variable of the genotype score at M variants for these n individuals where Y k = (y 1k , � � �, y nk ) T and X m = (x 1m , � � �, x nm ) T . Consider a linear combination of Y denoted as Yw ¼ P K k¼1 w k Y k , where w = (w 1 , � � �, w K ) T . We model the relationship between the combination of multiple continuous traits with the M genetic variants in the considered region using the linear model where β 0 is the intercept and β = (β 1 , � � �, β M ) T is the corresponding vector of coefficients. To test the association between the combination of the multiple traits and the M genetic variants is equivalent to test the null hypothesis H 0 : β = 0 under Eq (1). We use the score test statistic to test H 0 : β = 0 under Eq (1). Let P ¼ I n À 1 n 1 n 1 T n and then the test statistic is: where U = (PX) 0 PYw and V ¼ 1 n ðPYwÞ 0 PYwðPXÞ 0 PX. The score test can be rewritten as a function of w: where P = P 0 and PP 0 = P. We propose to maximize S(w) to get the optimal weight and then define the statistic to evaluate the association between the optimally weighted combination of the target traits and test genetic variants. When D = Y 0 PY is positive definite, maximizing S(w) is equivalent to maximizing where L is the lower triangular matrix obtained from the Cholesky decomposition of D = LL T . However, the matrix of D is usually not full rank because of existing correlation between multiple traits. If the matrix D is semi-positive define matrix, we introduce a ridge parameter λ 0 , for which we suggest the choice l 0 ¼ ffi ffi ffi ffi ffi ffi ffi ffi 1=n p , where n is the number of individuals in the testing data, and modify the adjustment to mitigate the effect of the non-positive matrix D in order to avoid the instability: and c be the eigenvector corresponding to the largest eigenvalue of the matrix C, then S(w) is maximized when L 0 (w) equals c. Hence Eq (4) is maximized when w o = L −T c. In a special case, if all the traits we consider are independent and M = 1, we can get an analytical weight referred to [22]: for the k th phenotype, k = 1, 2, 3, . . ., K. The Eq (5) is equivalent to w k ¼ CorrðY k ;XÞ ffi ffi ffi ffi ffi ffi ffi ffi ffi where the numerator is the correlation coefficient between the k th phenotype Y k and the genotypic variant X and the denominator can be viewed as the variance of the k th phenotype Y k . It means that w k has same direction with the correlation between the phenotype Y k and the genotypic variant X, and puts big weight to the k th trait when it has strong association with the genotypic variant and/or it has low variance. We define the statistic to test an optimally weighted combination of multiple traits (TOW-CM), We use permutation methods to evaluate P-values of T. The TOW-CM method can also be extended to incorporate covariates. Suppose that there are p covariates. Let z il denote l th covariate of the i th individual. We adjust both trait value y ik and genotypic score x im for the covariates by applying linear regressions. That is, Letỹ ik andx im denote the residuals of y ik and x im , respectively. We incorporate the covariate effects in TOW-CM by replacing y ik and x im in Eq (6) byỹ ik andx im . With covariates, the statistic of TOW-CM is defined as:

Comparison of tests
We compared the performance of our method (TOW-CM) with the following methods: 1) Multivariate Analysis of Variance (MANOVA) [10]; 2) Multi-trait Sequence Kernel Association Test (MSKAT) [21]; 3) Minimum p-value based on the p-values of the individual trait TOW [22] (denoted as minP).

Simulation
In simulation studies, we use the empirical Mini-Exome genotype data including genotypes of 697 unrelated individuals on 3205 genes obtained from Genetic Analysis Workshop 17 (GAW17). Two differen type of variants (Common variants: minor allele frequency (MAF)> 0.05 and Rare variants: MAF<0.05) are chosen from a super gene (Sgene) including four genes: ELAVL4 (gene1), MSH4 (gene2), PDE4B (gene3), and ADAMTS4 (gene4). The pattern of the allele frequency distribution of the Sgene is similar as the 3205 genes' [22]. In our simulation studies, we generate genotypes based on the genotypes of 697 individuals in these four genes. The genotypes are extracted from the sequence alignment files provided by the 1,000 Genomes Project for their pilot3 study (http://www.1000genomes.org). To generate the genotype of an individual, we generate two haplotypes according to the haplotype frequencies.
We test K = 4 related traits with a compound-symmetry correlation matrix and consider two covariates: a standard normal covariate z 1 and a binary covariate z 2 with P(z 2 = 1) = 0.5. We generate trait values based on genotypes by using the following models: where � = (� 1 , � 2 , � 3 , � 4 ) is zero-mean normal with variances 1 and correlation ρ. We set the magnitude of correlation |ρ| to 0.2, 0.5, and 0.8, and the signs of symmetric location of covariate matrix are randomly chosen from (-1,1). η = (η 1 , η 2 , η 3 , η 4 ) are contributions from a set of genotypic variants, which are simulated as follows.
For type I error, phenotypes are generated under the null model i.e. η = 0. To evaluate power, we randomly choose one common variant and n c (20%) rare variants as casual variants. We assume that all the n c rare causal variants have the same heritability and the heritability of the common causal variant is twice of the heritability of rare causal variants. That is, we model the genotypic variants' contribution to disease risk as 4 where x c and x j denote the common variant and rare variant, respectively. β c and β kj represent the corresponding effect size. Let h and h k denote the heritability of all the causal variants for all the K traits and for the k th trait, respectively. We generate K random numbers t 1 , � � �, t K from a uniform distribution between 0 and 1, and the heritability of k th trait denotes where R denotes the ratio of the heritability of rare causal variants to the heritability of the common causal variant.
For power comparisons, we conducted simulations under the four scenarios: each time only the first L traits are associated with the set of variants, L = 1, 2, 3, 4, respectively. Intuitively, in the first scenario (L = 1), when only the first trait is associated with the variants set, the minP method (it equals to test the first trait alone) may have good performance. However, we will show that by simultaneously testing correlated null traits, our proposed method (TOW-CM) could actually improve the detection power compared to test the first trait alone. When there are multiple correlated traits that are associated with the rare variants set, the proposed TOW-CM would offer vastly improved detection power than the minimum p-value based approach. In each scenario, we also consider different percentage of risk variants for rare variants. Table 1 summarizes the estimated type I error rates of our method TOW-CM with other three comparable methods under different significance levels and different magnitude of trait correlation |ρ|. The type I error rates are evaluated using 10000 replicated samples and the P-values are estimated using 10000 permutations for TOW-CM and minP. For the 10000 replicated samples, the 95% confidence intervals (CIs) for the estimated type I error rates of nominal levels 0.05, 0.01, and 0.001 are (0.046, 0.054), (0.008, 0.012), and (0.0004, 0.0016), respectively. From this table, we can see that all of the estimated type I error rates are either within 95% CIs or close to the bound of the corresponding 95% CIs, which indicate that the type I error rates of all methods are controlled under all considered scenarios.

Simulation results
In Power is a function of the total heritability based on three cases (all causal are risk variants, 90% causal are risk variants, and 80% causal are risk variants) for each specific scenario L. These figures show that TOW-CM is consistently the most powerful test among the four tests, and MANOVA is the second most powerful test when genotypes of genetic variants have impact on more than 1 traits. MSKAT is consistently less powerful than the other two multivariate tests (TOW-CM and MANOVA) most likely because there are only 8% variants with MAF in the range of (0.01,0.035) in Sgene which the simulations are based on. Similar to SKAT, MSKAT will lose power when the MAF of causal variants are not in the range (0.01,0.035) [23]. The minP method is consistently less powerful than TOW-CM and MAN-OVA because they ignore the traits' dependence by directly using minimum of the P-values of testing the four single traits. Overall, we can see that they suffer power loss when the correlations among traits increase. Test genetic association with multiple traits using optimally weighted combination Test genetic association with multiple traits using optimally weighted combination An interesting scenario is one in which only the first trait is associated with the variants set and all the others are null traits (L = 1). Stephens [24] and Wu et al. [25] have reported that joint testing by incorporating a correlated null trait could improve the power for testing association of a common variant. When only the first trait is associated with the variants set, minP is either the most powerful test or has similar power to the most powerful test especially in the case of both causal variants under weak traits correlation (|ρ| = 0.2). The TOW-CM and MAN-OVA statistic could benefit from increased traits correlations, and offer vastly improved power Test genetic association with multiple traits using optimally weighted combination Test genetic association with multiple traits using optimally weighted combination Test genetic association with multiple traits using optimally weighted combination Test genetic association with multiple traits using optimally weighted combination by incorporating strongly correlated null traits. Thus, our results verify the conclusion of [24] and [25].
Overall, we can see that the proposed TOW-CM is an attractive approach that provides good power in most of the scenarios.

Application to the COPDGene
Chronic obstructive pulmonary disease (COPD) is one of the most common lung diseases characterized by long term poor airflow and is a major public health problem [26]. The COPDGene Study is a multi-center genetic and epidemiologic investigation dedicated to studying COPD [27]. Participants in the COPDGene Study gave consent for the use of data collected during the study in downstream analyses. This study is sufficiently large and appropriately designed for analysis of COPD. In this study, we consider more than 5000 non-Hispanic Whites (NHW) participants where the participants have completed a detailed protocol, including questionnaires, pre-and post-bronchodilator spirometry, high-resolution CT scanning of the chest, exercise capacity (assessed by six-minute walk distance), and blood samples for genotyping. The participants were genotyped using the Illumina OmniExpress platform. The genotype data have gone through standard quality-control procedures for genome-wide association analysis detailed at http://www.copdgene.org/sites/default/files/GWAS_QC_ Methodology_20121115.pdf.
Based on the literature studies of COPD [28,29], we selected 7 key quantitative COPDrelated phenotypes, including FEV1 (% predicted FEV1), Emphysema (Emph), Emphysema Distribution (EmphDist), Gas Trapping (GasTrap), Airway Wall Area (Pi10), Exacerbation frequency (ExacerFreq), Six-minute walk distance (6MWD), and 4 covariates, including BMI, Age, Pack-Years (PackYear) and Sex. EmphDist is the ratio of emphysema at -950 HU in the upper 1/3 of lung fields compared to the lower 1/3 of lung fields where we did a log transformation on EmphDist in the following analysis, referred to [28]. In the analysis, participants with missing data in any of these phenotypes were excluded.
To evaluate the performance of our proposed method on a real data set, we applied all of the 4 methods (TOW-CM, MANOVA, MSKAT and minP) to the COPD associated genes or genes containing significant single-nucleotide polymorphisms (SNPs) in NHW population with COPD-related phenotypes [30]. In the analysis, we first removed the missing data in any genotypic variants and then adjusted each of the 7 phenotypes for the 4 covariates using linear models. In the analysis, participants with missing data in any of the 11 variables were excluded. Therefore, a complete set of 5,430 individuals across 50 genes were used in the following analyses. In order to compare these methods, we adopted the commonly used 10 7 permutations for TOW-CM and minP methods. For this verification study, we use 0.05 as the significance level for MANOVA, MSKAT and TOW-CM methods and use Bonferroni corrected significance level 0.05/7 = 7.14 × 10 −3 for minP methods since this method perform association tests across each trait, respectively. The results are summarized in Table 2. From Table 2, we can see that TOW-CM identified 14 genes, minP identified 14 genes, MANOVA identified 12 genes and MSKAT identified 4 genes. Among these four methods, TOW-CM identified the most significant genes where all of these 14 genes had previously been reported to be in association with COPD by eligible studies [7,30], among which 5 genes (LOC105377462,CHRNA3, CHRNA5, HYKK,IREB2) are statistically significant if we use a more stringent cut-off 1.00 × 10 −3 for a multiple testing issue with 50 genes in total. Because the MAFs of most variants are not in the range of (0.01,0.035) which is a range favoring MSKAT, MSKAT performs worse than the other three comparable methods (Yang et al. 2017). TOW-CM and minP perform better than MANOVA, which is perhaps because only a proportion of phenotypes are associated with COPD. The method minP missed some genes in comparision to our method TOW-CM, it may because the method minP ignores the correlation between these seven phenotypes.

Discussion
GWAS have identified many variants with each variant affecting multiple phenotypes, which suggests that pleiotropic effects on human complex phenotypes may be widespread. Also, recent studies have shown that complex diseases are caused by both common and rare variants [14,16,19]. Therefore, statistical methods that can jointly analyze multiple phenotypes for common or/and rare variants have advantages over analyzing each phenotype individually or only considering for common variants (GWAS). In this article, we propose TOW-CM method to perform multivariate analysis for multiple phenotypes in association studies based on the following reasons: (1) complex diseases are usually measured by multiple correlated phenotypes in genetic association studies; (2) there is increasing evidence showing that studying multiple correlated phenotypes jointly may increase power for detecting disease associated genetic variants, and (3) there is a shortage of gene-based approaches for multiple traits. Simulation results show that TOW-CM has correct type I error rates and is consistently more powerful in comparision to the other three tests. The real data analysis results show that TOW-CM has excellent performance in identifying genes associated with complex disease with multiple correlated phenotypes such as COPD.
One disadvantage of TOW-CM is that the test statistic does not have an asymptotic distribution and a permutation procedure is needed to estimate its P-value, which is time consuming compared to the methods whose test statistics have asymptotic distributions. To save time when applying TOW-CM to genetic association studies, we can use the "step-up" procedure [31] to determine the number of permutations, which can show evidence of association based on a small number of permutations first (e.g. 1,000) and then a large number of permutations are used to test the selected potentially significant genes. Specifically, for the analysis of real data, the computation time of p-value estimation of TOW-CM for all genes was about three days using our R program on 50 Dell PowerEdge C6320 servers. Each server has two 2.4GHz Intel Xeon E5-2680 v4 fourteen-core processors and 600 MB average memory. We also uploaded the R program on GitHub, https://github.com/Jianjun-CN/TOW-CM/blob/master/ R%20Code Furthermore, TOW-CM method can not only be used for gene-based association studies, but also can be extended to transcriptome-wide association study (TWAS), which needs further investigations. Test genetic association with multiple traits using optimally weighted combination 17 Simulated Exome Data Set was supported in part by NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project (www.1000genomes.org). This research used data generated by the COPDGene study, which was supported by NIH grants U01HL089856 and U01HL089897. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer Ingelheim, Novartis, and Sunovion.
A superior high-performance computing infrastructure at University of North Texas, was used in obtaining results presented in this publication.