A novel method to test associations between a weighted combination of phenotypes and genetic variants

  • Huanhuan Zhu,

    Roles Formal analysis, Methodology, Writing – original draft

    Affiliation Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, United States of America

  • Shuanglin Zhang,

    Roles Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, United States of America

  • Qiuying Sha

    Roles Methodology, Project administration, Writing – review & editing

    qsha@mtu.edu

    Affiliation Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, United States of America

Abstract

Many complex diseases like diabetes, hypertension, metabolic syndrome, et cetera, are measured by multiple correlated phenotypes. However, most genome-wide association studies (GWAS) focus on one phenotype of interest or study multiple phenotypes separately for identifying genetic variants associated with complex diseases. Analyzing one phenotype or the related phenotypes separately may lose power due to ignoring the information obtained by combining phenotypes, such as the correlation between phenotypes. In order to increase statistical power to detect genetic variants associated with complex diseases, we develop a novel method to test a weighted combination of multiple phenotypes (WCmulP). We perform extensive simulation studies as well as real data (COPDGene) analysis to evaluate the performance of the proposed method. Our simulation results show that WCmulP has correct type I error rates and is either the most powerful test or comparable to the most powerful test among the methods we compared. WCmulP also has an outstanding performance for identifying single-nucleotide polymorphisms (SNPs) associated with COPD-related phenotypes.

Introduction

Genome-wide association studies (GWAS) aim to discover genetic variants associated with complex diseases [1, 2]. In GWAS, researchers often collect data on multiple correlated phenotypes to get a better understanding of the complex disease [3]. Several diseases illustrate this. In type 2 diabetes (T2D) studies, data are usually collected on a number of risk factors and diabetes-related quantitative phenotypes. Hypertension is measured by systolic blood pressure (SBP) and diastolic blood pressure (DBP) [2], and the correlation coefficient between SBP and DBP was greater than 0.5 in 95% of patients [4]. The metabolic syndrome refers to the co-occurrence of insulin resistance, obesity, atherogenic dyslipidemia, and hypertension, and these factors are associated and share underlying mediators, pathways, and mechanisms [5]. The correlations between multiple phenotypes can be leveraged to improve the power of genetic association tests to identify markers associated with one or more of the phenotypes [6]. The standard approach to analyzing these multiple correlated phenotypes is to perform single-phenotype analyses separately and report the findings for each phenotype [1]. However, analyzing one phenotype at a time incurs a multiple-testing penalty and therefore loses power, especially in GWAS [3]. Recently, the joint analysis of multiple phenotypes has become popular because it can increase statistical power over analyzing phenotypes separately in detecting genetic variants [3, 6].

There are three commonly used strategies to detect genetic associations between a genetic variant and multiple correlated phenotypes. The first one is combining test statistics (or p-values) from univariate analysis. This strategy first tests the association between each phenotype and a genetic variant individually and then combines the univariate results, i.e. test statistics or p-values, using different approaches. O’Brien’s method [7], the sample splitting and cross-validation method [3], the Trait-based Association Test that uses Extended Simes procedure (TATES) [8], the Unified Score-Based Association Test (USAT) [9], Fisher’s Combination [10], and Adaptive Fisher’s Combination (AFC) [11] belong to this strategy. This strategy is simple and is especially useful for analyzing different types of phenotypes such as continuous, dichotomous, and survival phenotypes [2]. The second one is data reduction. This strategy derives a single or a few new phenotypes that are linear combinations of the original phenotypes. Existing methods include projection-based techniques and canonical correlation analysis (CCA). Projection-based approaches include principal components analysis (PCA) and principal component of heritability (PCH), where principal components (PCs) are built to maximize either the phenotypic variance or the heritability [2, 6, 12, 13]. CCA finds the linear combination of phenotypes that explains the largest possible amount of the correlation between the genetic variant and all phenotypes [14]. Data reduction approaches are in general only applicable when all phenotypes are continuous and approximately normally distributed [2]. The third strategy is regression models, which include mixed effect models [15–17], the generalized estimating equation (GEE) [18, 19], and reverse regression methods [1, 20, 21]. The linear mixed effects model (LME) and the generalized linear mixed effects model (GLMM) are two commonly used mixed effects models, where fixed effects are used for the genetic variant and random effects are used to account for phenotypic correlations. The GEE methods collapse the random effects and random residual errors in marginal regression models, which are a class of models different from mixed effect models. The reverse regression methods take genotypes as the response variable and multiple phenotypes as predictors, such as the proportional odds logistic regression for the joint model of multiple phenotypes (MultiPhen) [1]. Regression approaches are able to deal with a mixture of continuous, dichotomous, and survival phenotypes, but they are complicated and little software is available to implement them [2].

In this article, we develop a novel allele-based method for testing association between multiple phenotypes and a genetic variant. First, we take the allele at the genetic variant as the response variable and the multiple phenotypes as predictors. Then, we present a new multivariate method that we refer to as WCmulP (Weighted Combination of multiple Phenotypes), inspired by the TOW (Test for testing the effect of an Optimally Weighted combination of variants) procedure proposed by Sha et al. [22] for rare variant association studies and the allele-based approach proposed by Majumdar et al. [23]. For each of the independent individuals, WCmulP linearly combines the multiple phenotypes into “one phenotype” by using the optimal weights proposed by Sha et al. [22]. We then use the score test based on the logistic model to test the association between the genetic variant and the linear combination of phenotypes. Using extensive simulation studies, we compare the performance of WCmulP with some of the existing methods: MultiPhen [1], O’Brien’s method [7], TATES [8], CCA [14], and SHet [24]. Our results show that, in all of the simulation scenarios, WCmulP is either the most powerful test or comparable to the most powerful tests among the methods we compared. Finally, we evaluate the performance of our proposed method using a real data set, the COPDGene study from dbGaP.

Methods

We consider a sample of n unrelated individuals, each with K possibly correlated phenotypes. Let $Y_{i,k}$ denote the kth phenotype of the ith individual. We propose to use an allele-based logistic regression model to test the association between a variant of interest and multiple phenotypes. For a genetic variant with two alleles, we use $x_{2i-1}$ and $x_{2i}$ to denote the coding of the two alleles of the ith individual, so that $x_1$ and $x_2$ code the two alleles of the first individual, $x_3$ and $x_4$ code the two alleles of the second individual, and so on. For a variant with two alleles A and a, if the genotype of the ith individual is AA, we define $x_{2i-1} = x_{2i} = 1$; if the genotype is aa, we define $x_{2i-1} = x_{2i} = 0$; and if the genotype is Aa, we define $x_{2i-1} = 1$ and $x_{2i} = 0$. We define the kth phenotype corresponding to the two alleles $x_{2i-1}$ and $x_{2i}$ of the ith individual as $y_{2i-1,k}$ and $y_{2i,k}$, where $y_{2i-1,k} = y_{2i,k} = Y_{i,k}$. Hence, the total number of observations in the allele-based data is 2n. We model the relationship between alleles and multiple phenotypes using the inverse logistic regression model

$$\operatorname{logit}(\pi_j) = \alpha + \beta_1 y_{j,1} + \cdots + \beta_K y_{j,K}, \quad j = 1,\ldots,2n, \tag{1}$$

where $\pi_j = \Pr(x_j = 1 \mid Y_j)$ with $Y_j = (y_{j,1},\ldots,y_{j,K})^T$, α is the intercept, and $\beta = (\beta_1,\ldots,\beta_K)^T$ is a K-dimensional vector of parameters. Testing the association between multiple phenotypes and the variant is equivalent to testing the null hypothesis H0: β = 0 under Eq (1). We use the score test statistic given by Sha et al. [25] to test H0: β = 0 under Eq (1). The test statistic is

$$S = U^T V^{-1} U, \quad U = \sum_{j=1}^{2n}(x_j-\bar{x})(Y_j-\bar{Y}), \quad V = \frac{1}{2n}\sum_{j=1}^{2n}(x_j-\bar{x})^2 \sum_{j=1}^{2n}(Y_j-\bar{Y})(Y_j-\bar{Y})^T, \tag{2}$$

where $\bar{x} = \frac{1}{2n}\sum_{j=1}^{2n} x_j$, $\bar{Y} = (\bar{y}_1,\ldots,\bar{y}_K)^T$, and $\bar{y}_k = \frac{1}{2n}\sum_{j=1}^{2n} y_{j,k}$ for k = 1,…,K. The test statistic S asymptotically follows a chi-square distribution with K degrees of freedom.
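The allele-based coding and the score test can be illustrated with a short sketch. It assumes that genotypes are stored as counts of allele A (0, 1, or 2) and that the score statistic takes the usual logistic-regression form S = UᵀV⁻¹U, with U the sum over alleles of (x_j − x̄)(Y_j − Ȳ) and V equal to x̄(1 − x̄) times the centred cross-product matrix of the phenotypes. The function names `expand_to_alleles` and `score_statistic` are illustrative and do not come from any published software.

```python
import numpy as np

def expand_to_alleles(genotypes, phenotypes):
    """Turn n individuals into 2n allele-level observations.

    genotypes: length-n counts of allele A (0 = aa, 1 = Aa, 2 = AA).
    phenotypes: (n, K) array of phenotype values.
    """
    g = np.asarray(genotypes)
    Y = np.asarray(phenotypes, dtype=float)
    x = np.zeros(2 * g.shape[0])
    x[0::2] = g >= 1   # first allele of each individual: 1 for AA and Aa
    x[1::2] = g == 2   # second allele of each individual: 1 only for AA
    y = np.repeat(Y, 2, axis=0)  # both alleles inherit the individual's phenotypes
    return x, y

def score_statistic(x, y):
    """Score statistic for H0: beta = 0 in the allele-based logistic model."""
    xc = x - x.mean()
    yc = y - y.mean(axis=0)
    U = yc.T @ xc
    # For 0/1 codes, sum(xc**2) / (2n) equals xbar * (1 - xbar).
    V = (xc @ xc) / x.shape[0] * (yc.T @ yc)
    return float(U @ np.linalg.solve(V, U))
```

For a single phenotype (K = 1) this statistic reduces to 2n·r², where r is the sample correlation between the allele codes and the duplicated phenotype values.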

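When K is large, the score test may lose power due to the large degrees of freedom. To overcome this problem, we combine the K phenotypes into one variable by using a linear combination of phenotypes, $y_j^w = \sum_{k=1}^{K} w_k y_{j,k}$, where $w_1,\ldots,w_K$ are the weights. With the linear combination of phenotypes $y_j^w$, the score test statistic in Eq (2) becomes

$$S(w_1,\ldots,w_K) = \frac{2n\left[\sum_{j=1}^{2n}(x_j-\bar{x})(y_j^w-\bar{y}^w)\right]^2}{\sum_{j=1}^{2n}(x_j-\bar{x})^2 \sum_{j=1}^{2n}(y_j^w-\bar{y}^w)^2}, \tag{3}$$

where $\bar{x}$ and $\bar{y}^w$ are the means of $x_j$ and $y_j^w$ over the 2n allele-level observations.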

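We propose to use the optimal weights proposed by Sha et al. [22], that is,

$$w_k^o = \frac{\sum_{j=1}^{2n}(x_j-\bar{x})(y_{j,k}-\bar{y}_k)}{\sum_{j=1}^{2n}(y_{j,k}-\bar{y}_k)^2}$$

for k = 1,2,…,K, where $\bar{y}_k$ is the mean of the kth phenotype. These optimal weights maximize $S(w_1,\ldots,w_K)$ in Eq (3). With this optimally weighted combination of phenotypes, $y_j^o = \sum_{k=1}^{K} w_k^o y_{j,k}$, the test statistic given in Eq (3) becomes

$$S(w_1^o,\ldots,w_K^o) = \frac{2n\sum_{j=1}^{2n}(x_j-\bar{x})(y_j^o-\bar{y}^o)}{\sum_{j=1}^{2n}(x_j-\bar{x})^2}, \tag{4}$$

where $\bar{y}^o = \frac{1}{2n}\sum_{j=1}^{2n} y_j^o$. From Eq (2) to Eq (4), we reduce the dimension of the phenotypes from multivariate ($y_{j,k}$, k = 1,…,K) to univariate ($y_j^o$) with optimal weights such that Eq (4) is the maximum of Eq (3). Since $w_1^o,\ldots,w_K^o$ are data-driven weights, $S(w_1^o,\ldots,w_K^o)$ does not follow a chi-square distribution. We use a permutation procedure to evaluate the p-value of $S(w_1^o,\ldots,w_K^o)$. In each permutation, we randomly shuffle the genotypes and keep the phenotypes unchanged. Since $\sum_{j=1}^{2n}(x_j-\bar{x})^2$ does not change under permutation, the test statistic is equivalent to

$$T = \sum_{j=1}^{2n}(x_j-\bar{x})(y_j^o-\bar{y}^o). \tag{5}$$

This test statistic T is our proposed test statistic to test the effect of the Weighted Combination of multiple Phenotypes (WCmulP).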
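The permutation procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the weight for phenotype k is the centred cross-product of allele codes and phenotype k divided by the centred sum of squares of phenotype k (the TOW weights of [22] with the roles of genotype and phenotype exchanged), that T is the centred cross-product of the allele codes and the weighted phenotype combination, and that weights are recomputed in every permutation. The function names and the (count + 1)/(B + 1) p-value convention are assumptions made here.

```python
import numpy as np

def wcmulp_statistic(genotypes, phenotypes):
    """T = sum_j (x_j - xbar)(y_j^o - ybar^o) on the 2n allele-level observations."""
    g = np.asarray(genotypes)
    # Allele codes x_1, x_2, x_3, ...: (1,1) for AA, (1,0) for Aa, (0,0) for aa.
    x = np.column_stack([g >= 1, g == 2]).astype(float).ravel()
    y = np.repeat(np.asarray(phenotypes, dtype=float), 2, axis=0)
    xc = x - x.mean()
    yc = y - y.mean(axis=0)
    weights = (yc.T @ xc) / np.sum(yc ** 2, axis=0)  # data-driven weights
    combined = yc @ weights                          # centred weighted combination
    return float(xc @ combined)

def wcmulp_pvalue(genotypes, phenotypes, n_perm=1000, seed=0):
    """Permutation p-value: shuffle genotypes across individuals, keep phenotypes fixed."""
    rng = np.random.default_rng(seed)
    g = np.asarray(genotypes)
    t_obs = wcmulp_statistic(g, phenotypes)
    exceed = sum(
        wcmulp_statistic(rng.permutation(g), phenotypes) >= t_obs
        for _ in range(n_perm)
    )
    # Assumed convention: add one to numerator and denominator to avoid p = 0.
    return (exceed + 1) / (n_perm + 1)
```

Because the weights depend on the genotypes, they are recomputed inside `wcmulp_statistic` for every shuffled genotype vector, which is why no chi-square reference distribution is used.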

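The WCmulP method can also be extended to incorporate covariates. Suppose that there are p covariates. Let $Z_{i,l}$ denote the lth covariate of the ith individual. We define the lth covariate corresponding to the two alleles $x_{2i-1}$ and $x_{2i}$ of the ith individual as $z_{2i-1,l}$ and $z_{2i,l}$, where $z_{2i-1,l} = z_{2i,l} = Z_{i,l}$. We then adjust the phenotype value $y_{j,k}$ for the covariates by applying linear regressions. That is, $y_{j,k} = a_{0,k} + a_{1,k} z_{j,1} + \cdots + a_{p,k} z_{j,p} + \varepsilon_{j,k}$. Let $\tilde{y}_{j,k}$ denote the residuals of $y_{j,k}$ in the linear regression. We incorporate the covariate effects in WCmulP by replacing $y_{j,k}$ in Eq (5) by $\tilde{y}_{j,k}$. With covariates, the statistic of WCmulP is defined as

$$T = \sum_{j=1}^{2n}(x_j-\bar{x})(\tilde{y}_j^o-\bar{\tilde{y}}^o),$$

where $\tilde{y}_j^o$ is the optimally weighted combination of the residuals $\tilde{y}_{j,1},\ldots,\tilde{y}_{j,K}$ and $\bar{\tilde{y}}^o$ is its mean.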
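The covariate adjustment step can be sketched with ordinary least squares. The helper name `residualize` is hypothetical. The fit is done on the n individual-level rows, which gives the same coefficients as fitting on the 2n duplicated allele-level rows, so the residuals can simply be duplicated afterwards.

```python
import numpy as np

def residualize(phenotypes, covariates):
    """Residuals of each phenotype after linear regression on the covariates.

    phenotypes: (n, K) array; covariates: (n, p) array. An intercept is added.
    """
    Y = np.asarray(phenotypes, dtype=float)
    Z = np.asarray(covariates, dtype=float)
    design = np.column_stack([np.ones(Z.shape[0]), Z])
    # One least-squares fit handles all K phenotypes at once (one column each).
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return Y - design @ coef
```

The returned residuals would then take the place of the raw phenotypes when the weights and the statistic T are computed.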

Comparison of methods

We compare the power of the proposed WCmulP with that of the following methods:

Score (Score test): the test statistic of Score is given by Eq (2).

OB (O’Brien’s method) [7]: the test statistic of OB, $e^T \Sigma^{-1} T_{uni}$, is a linear combination of univariate test statistics, and it is the most powerful test among the class of test statistics that are linear combinations of $T_{uni}$, where $T_{uni}$ is the vector of the univariate test statistics, Σ is the covariance matrix of $T_{uni}$, and $e = (1,1,\ldots,1)^T$ is a vector of 1’s with length K (the number of phenotypes).

MultiPhen (Joint model of Multiple Phenotypes) [1]: it uses the proportional odds logistic regression to model the genotype data as ordinal response and phenotypes as predictors. A likelihood ratio test is used to test the null hypothesis.

TATES (Trait-based Association Test that uses Extended Simes procedure) [8]: it combines univariate p-values to acquire one phenotype-based p-value, while correcting for correlations between phenotypes. The TATES p-value is given by $P_{TATES} = \min_{1 \le k \le K}\left( m_e\, p_{(k)} / m_{e(k)} \right)$, where $p_{(k)}$ is the kth (k = 1,…,K) p-value after sorting in ascending order, and $m_e$ and $m_{e(k)}$ are the effective numbers of independent p-values among all K phenotypes and among the k phenotypes with the smallest p-values, respectively. The effective numbers can be calculated from the correlation matrix of p-values.

CCA (Canonical Correlation Analysis) [14]: it extracts the linear combination of phenotypes that maximizes the correlations between linear combinations of phenotypes and genotypes at the variant of interest. The test is based on Wilks’ lambda and the corresponding F-approximation.

SHet (Test for Heterogeneous genetic effects) [24]: the test statistic of SHet, $S_{Het}$, is based on $S_{Hom}$, which is the most powerful test statistic when the genetic effect is homogeneous. Both $S_{Hom}$ and $S_{Het}$ are quadratic combinations of the univariate test statistics. The test statistic of SHom is $S_{Hom} = \left[e^T (RW)^{-1} T_{uni}\right]^2 / \left[e^T (WRW)^{-1} e\right]$, where R is the correlation matrix of $T_{uni}$, W is a diagonal matrix of weights for the univariate test statistics, and e is a vector of 1’s with length K (the number of phenotypes). $S_{Het}$ can be viewed as the maximum of the $S_{Hom}$ statistics over different thresholds. More specifically, given a threshold, only the test statistics with absolute values greater than the threshold are used, and the submatrices of R and W corresponding to the selected test statistics are used. The p-values of SHet can be evaluated by simulation.
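Two of the combination rules in this list are simple enough to sketch directly. The code below is illustrative only: the function names are hypothetical, the OB standardization assumes that the univariate statistics have covariance Σ under the null hypothesis (so that the linear combination has variance eᵀΣ⁻¹e), and the TATES helper takes the effective numbers as inputs rather than computing them from the correlation matrix of p-values.

```python
import numpy as np

def obrien_statistic(t_uni, sigma):
    """Return e^T Sigma^{-1} T_uni and its unit-variance standardized form."""
    t = np.asarray(t_uni, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    weights = np.linalg.solve(sigma, np.ones(t.size))  # Sigma^{-1} e
    raw = float(weights @ t)                           # e^T Sigma^{-1} T_uni (Sigma symmetric)
    z = raw / np.sqrt(weights.sum())                   # e^T Sigma^{-1} e = sum of the weights
    return raw, float(z)

def tates_pvalue(pvalues, m_e_all, m_e_top):
    """min over k of m_e * p_(k) / m_e(k).

    m_e_all: effective number of independent p-values among all K phenotypes.
    m_e_top: m_e_top[k-1] is the effective number among the k smallest p-values.
    """
    ordered = np.sort(np.asarray(pvalues, dtype=float))
    return float(np.min(m_e_all * ordered / np.asarray(m_e_top, dtype=float)))
```

With univariate statistics of equal size and opposite sign, for example `obrien_statistic([2.0, -2.0], np.eye(2))`, the linear combination is zero, which illustrates why a linear combination such as OB can lose power when effects point in opposite directions.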

Simulation studies

Our simulations are similar to those of Wang et al. [13]. To evaluate the type I error rates and powers of our method, we simulate genotype-phenotype data sets for n unrelated individuals with a total of K phenotypes under a variety of simulation scenarios. Specifically, genotype data at a genetic variant are simulated according to the minor allele frequency (MAF) under the assumption of Hardy-Weinberg equilibrium. We generate K phenotypes by the factor model

$$y = \lambda x + c\gamma f + \sqrt{1-c^2}\,\varepsilon, \tag{6}$$

where $y = (y_1,\ldots,y_K)^T$; x is the genotype score at the variant of interest; $\lambda = (\lambda_1,\ldots,\lambda_K)^T$ is the vector of effect sizes of the genetic variant on the K phenotypes; $f = (f_1,\ldots,f_R)^T \sim MVN(0,\Sigma)$, $\Sigma = (1-\rho)I + \rho A$, R is the number of factors, A is a matrix with all elements equal to 1, I is the identity matrix, and ρ is the correlation between $f_i$ and $f_j$ for $i \ne j$; γ is a K by R matrix; c is a constant; and $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_K)^T$ is a vector of residuals, where $\varepsilon_1,\ldots,\varepsilon_K$ are independent and $\varepsilon_k \sim N(0,1)$ for k = 1,…,K. Based on Eq (6), we consider the following six models.

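Model 1: There is only one factor and the genotype has an impact on all traits with the same effect size. That is, R = 1, $\lambda = (\beta,\ldots,\beta)^T$, and $\gamma = (1,\ldots,1)^T$.

Model 2: There are two factors and the genotype has an impact on both factors with opposite effects. That is, R = 2 and $\gamma = \mathrm{bdiag}(D_1,D_2)$, where $D_i$ is a column vector of K/2 ones for i = 1,2 and “bdiag” indicates the block diagonal matrix; the elements of λ have opposite signs for the phenotypes of the two factors.

Model 3: There are two factors and the genotype has an impact on one factor. That is, R = 2 and $\gamma = \mathrm{bdiag}(D_1,D_2)$, where $D_i$ is a column vector of K/2 ones for i = 1,2; the elements of λ are nonzero only for the phenotypes of one factor.

Model 4: There are four factors and the genotype has an impact on one factor. That is, R = 4 and $\gamma = \mathrm{bdiag}(D_1,D_2,D_3,D_4)$, where $D_i$ is a column vector of K/4 ones for i = 1,…,4; the elements of λ are nonzero only for the phenotypes of one factor.

Model 5: There are four factors and the genotype has an impact on two factors with opposite effects. That is, R = 4 and $\gamma = \mathrm{bdiag}(D_1,D_2,D_3,D_4)$, where $D_i$ is a column vector of K/4 ones for i = 1,…,4; the elements of λ are nonzero for the phenotypes of two factors, with opposite signs.

Model 6: There are four factors and the genotype has an impact on three factors with effects of different directions. That is, R = 4 and $\gamma = \mathrm{bdiag}(D_1,D_2,D_3,D_4)$, where $D_i$ is a column vector of K/4 ones for i = 1,…,4; the elements of λ are nonzero for the phenotypes of three factors, with different signs.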

In the six models, the within-factor correlation is c² and the between-factor correlation is ρc². Table A in S1 File gives the structures of γ and cov(y|x) for different numbers of factors (R = 1, 2, and 4) when the number of phenotypes is 8.
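The simplest scenario (Model 1, one shared factor) can be simulated in a few lines. This sketch assumes that the factor model has the form y = λx + cγf + √(1 − c²)·ε, which gives each phenotype unit residual variance and reproduces the stated within-factor correlation c². The function name `simulate_model1` and its parameters are hypothetical.

```python
import numpy as np

def simulate_model1(n, K, maf, beta, c2, seed=0):
    """Simulate genotypes under HWE and K phenotypes sharing one factor (Model 1)."""
    rng = np.random.default_rng(seed)
    # Under Hardy-Weinberg equilibrium the minor-allele count is Binomial(2, MAF).
    x = rng.binomial(2, maf, size=n)
    f = rng.normal(size=(n, 1))      # single factor, R = 1, gamma = (1, ..., 1)^T
    eps = rng.normal(size=(n, K))    # independent N(0, 1) residuals
    # lambda = (beta, ..., beta)^T: the same genetic effect on every phenotype.
    y = beta * x[:, None] + np.sqrt(c2) * f + np.sqrt(1.0 - c2) * eps
    return x, y
```

Setting `beta=0` generates phenotypes independent of the genotype, which is the setting used to check type I error rates. The other five models would additionally need the block structure of γ and the corresponding effect-size vector λ.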

We also generate phenotypes with covariate effects. Following Sha et al. [22] and Sun et al. [26], we add two covariates, z1 and z2, to Eq (6), each entering all K phenotypes through e, where z1 is a continuous random variable generated from a standard normal distribution, z2 is a binary random variable taking values 0 and 1 with probability 0.5 each, and e is a K-dimensional vector with all elements equal to 1. To evaluate type I error rates and powers, we consider n = 1,000 unrelated individuals, MAF = 0.3, and K = 8 or 16 phenotypes. To evaluate the type I error rates of all methods, we generate all phenotypes independently of genotypes by setting β = 0. We evaluate type I error rates at significance levels α = 0.001 and 0.01 for all methods. To evaluate powers, we either vary β (with within-factor correlation c² = 0.5 and between-factor correlation ρc² = 0.1) or vary the within-factor correlation c² over 0.3, 0.5, …, 0.9 (with between-factor correlation ρc² = 0.1 and β = 0.1).

Simulation results

To evaluate the type I error rates of WCmulP and the other six methods, we consider different numbers of phenotypes, different significance levels, and different numbers of factors. In each simulation scenario, the p-values of WCmulP and SHet are estimated using 10,000 permutations, and the p-values of Score, MultiPhen, TATES, CCA and OB are estimated using their asymptotic distributions. The type I error rates of the seven methods are evaluated using 10,000 replicated samples. For 10,000 replicated samples, the 95% confidence intervals (CIs) for type I error rates at nominal levels 0.001 and 0.01 are (0.00038, 0.00162) and (0.008, 0.012), respectively. The estimated type I error rates of WCmulP and the other six methods are summarized in Table 1 (K = 8) and Table 2 (K = 16). From these tables, we can see that all estimated type I error rates of WCmulP fall within the 95% CIs, which indicates that the proposed WCmulP is a valid test. The estimated type I error rates of SHet, Score, MultiPhen, TATES, CCA and OB are also not significantly different from the nominal levels.
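The quoted intervals are consistent with the usual normal approximation to a binomial proportion, α ± 1.96·√(α(1 − α)/N) with N replicates. The short sketch below assumes that this approximation is the one intended; the function name is illustrative.

```python
import math

def type1_error_ci(alpha, n_rep, z=1.96):
    """Normal-approximation 95% interval for an estimated type I error rate."""
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_rep)
    return alpha - half_width, alpha + half_width
```

Calling `type1_error_ci(0.001, 10000)` and `type1_error_ci(0.01, 10000)` gives intervals that agree with those above after rounding.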

Table 1. Estimated type I error rates for the seven methods under three simulation settings.

The number of phenotypes is K = 8, c² = 0.5, ρc² = 0.1, and MAF = 0.3. The p-values of WCmulP and SHet are evaluated using 10,000 permutations. The type I error rates of all seven methods are evaluated using 10,000 replicated samples at a significance level of α. R is the number of factors.

https://doi.org/10.1371/journal.pone.0190788.t001

Table 2. Estimated type I error rates for the seven methods under three simulation settings.

The number of phenotypes is K = 16, c² = 0.5, ρc² = 0.1, and MAF = 0.3. The p-values of WCmulP and SHet are evaluated using 10,000 permutations. The type I error rates of all seven methods are evaluated using 10,000 replicated samples at a significance level of α.

https://doi.org/10.1371/journal.pone.0190788.t002

For power comparisons, we consider power as a function of the genetic effect β (Figs 1 and 2) and power as a function of the within-factor correlation c² (Figs 3 and 4). In each simulation scenario, the p-values of WCmulP and SHet are estimated using 1,000 permutations and the p-values of Score, MultiPhen, TATES, CCA and OB are estimated using their asymptotic distributions. The powers of the seven methods are evaluated using 1,000 replicated samples at a significance level of 0.01.

Fig 1. Power comparisons of the seven methods as a function of β for the six models.

The total number of phenotypes is K = 8, c² = 0.5, ρc² = 0.1, and MAF = 0.3. The p-values of WCmulP and SHet are evaluated using 1,000 permutations. The power of all of the seven methods is evaluated using 1,000 replicated samples at a significance level of 0.01.

https://doi.org/10.1371/journal.pone.0190788.g001

Fig 2. Power comparisons of the seven methods as a function of β for the six models.

The total number of phenotypes is K = 16, c² = 0.5, ρc² = 0.1, and MAF = 0.3. The p-values of WCmulP and SHet are evaluated using 1,000 permutations. The power of all of the seven methods is evaluated using 1,000 replicated samples at a significance level of 0.01.

https://doi.org/10.1371/journal.pone.0190788.g002

Fig 3. Power comparisons of the seven methods as a function of c² for the six models.

The total number of phenotypes is K = 8, ρc² = 0.1, β = 0.1, and MAF = 0.3. The p-values of WCmulP and SHet are evaluated using 1,000 permutations; the p-values of the other methods are evaluated using their asymptotic distributions. The power of all of the seven methods is evaluated using 1,000 replicated samples at a significance level of 0.01.

https://doi.org/10.1371/journal.pone.0190788.g003

Fig 4. Power comparisons of the seven methods as a function of c² for the six models.

The total number of phenotypes is K = 16, ρc² = 0.1, β = 0.1, and MAF = 0.3. The p-values of WCmulP and SHet are evaluated using 1,000 permutations; the p-values of the other methods are evaluated using their asymptotic distributions. The power of all of the seven methods is evaluated using 1,000 replicated samples at a significance level of 0.01.

https://doi.org/10.1371/journal.pone.0190788.g004

Our simulation results show that:

  1. As expected, the powers of all methods increase as the genetic effect β increases in each model (Figs 1 and 2).
  2. WCmulP is either the most powerful test or comparable to the most powerful tests in all six models (Figs 1–4).
  3. As the number of phenotypes increases from K = 8 to K = 16, the power advantage of WCmulP over the other methods becomes more pronounced.
  4. SHet, Score, MultiPhen, and CCA have similar performance in all six models; we refer to these four tests as group 1.
  5. OB is the most powerful test when the genetic effects are homogeneous (model 1). However, OB loses substantial power when genetic effects are heterogeneous, especially when the genetic effects have opposite directions (models 2, 5–6) or when the genetic variant impacts only a small portion of phenotypes (model 4). This phenomenon was also observed by Zhu et al. [27].
  6. Power comparisons of TATES with tests in group 1 depend on the models. In general, TATES is more powerful than tests in group 1 when the genetic variant impacts only a portion of the phenotypes (models 3 and 4).
  7. In general, as the within-factor correlation c² increases, the powers of all methods decrease (Figs 3 and 4). TATES is relatively robust to c² because it essentially depends only on the phenotype that has the strongest association with the genetic variant, as explained in Zhu et al. [27].

We also considered using principal components (PCs) of the phenotypes instead of the original phenotypes for power comparisons; the results are given in Figures A-D in S1 File. We exclude PCs that explain less than 10⁻⁶ of the total variation. Using PCs of the phenotypes, we observe that: (1) WCmulP, Score, MultiPhen, and CCA have very similar powers in all six models (Figures A-D in S1 File). We refer to these tests as group s1. The tests in group s1 are either the most powerful tests or comparable to the most powerful one; (2) SHet is less powerful than the tests in group s1; (3) OB is the least powerful method in all six models because PCs likely have effects with different directions; (4) TATES becomes the most powerful method when the genetic variant has effects on all phenotypes with the same absolute value of effect sizes (models 1 and 2) because, in this case, one of the PCs may capture most of the association information.

We also compared the powers using a lower significance level of 5×10⁻⁵ (Figure E in S1 File). Figure E in S1 File shows that the pattern of the power comparisons at significance level 5×10⁻⁵ is similar to that at significance level 0.01 (Fig 1).

Real data analysis

Chronic obstructive pulmonary disease (COPD) refers to a group of diseases that cause airflow blockage and breathing-related problems. The Genetic Epidemiology of COPD Study (COPDGene) is a multicenter observational study designed to identify genetic factors associated with COPD, to define and characterize disease-related phenotypes, and to assess the association of disease-related phenotypes with the identified susceptibility genes [28]. 10,192 participants (including 6,784 non-Hispanic Whites (NHW) and 3,408 African-Americans (AA)) are included in COPDGene. We selected 7 key quantitative COPD-related phenotypes and 4 covariates that are the same as those in Liang et al. [11]. The detailed description of these 7 phenotypes is in Table 3, and their correlation structure is given in Figure F in S1 File. The four covariates include Body Mass Index, Age, Pack-Years (one pack-year is defined as smoking one pack per day for one year), and gender. A set of 5,430 NHW across 630,860 SNPs were used in the analysis after excluding subjects with missing data in any of the 11 variables.

We apply WCmulP and the other six methods to both the original 7 phenotypes (Table 4) and the principal components (PCs) of the phenotypes (Table B in S1 File). PCs that explain less than 10⁻⁶ of the total variation are excluded. In this way, one PC is excluded and 6 PCs are left. Using the first few PCs is also a dimension reduction method. Thus, using PCs of the phenotypes, WCmulP uses two dimension reduction methods: using the first few PCs and the weighted combination of those PCs. To identify SNPs significantly associated with the 7 COPD-related phenotypes and the top 6 PCs of the phenotypes, we use the genome-wide significance threshold of 5 × 10⁻⁸. In total, 16 SNPs are significant under at least one method (Table 4 and Table B in S1 File). These 16 SNPs have been reported to be associated with COPD-related phenotypes by previous studies [29–42]. From Table 4, we can see that MultiPhen identified the largest number of SNPs, 14 SNPs; WCmulP, SHet, Score, and CCA identified 13 SNPs; TATES identified 9 SNPs; and OB did not identify any SNPs, likely because the true genetic effects of each SNP are heterogeneous across phenotypes. From Table B in S1 File, we can see that using PCs of the phenotypes, WCmulP identified all of the 16 SNPs; MultiPhen identified 15 SNPs; SHet, Score, and CCA identified 13 SNPs; TATES identified 4 SNPs; and OB identified 3 SNPs. In summary, the number of SNPs identified by WCmulP is comparable to the largest number of SNPs identified by other tests; and using PCs of phenotypes, WCmulP is the only method that identified all 16 SNPs. The results of the real data analysis are consistent with our simulation results.

Table 4. Significant SNPs and the corresponding p-values in the analysis of COPDGene.

The p-values of WCmulP are evaluated using 10⁹ permutations; the p-values of SHet are evaluated using 10⁸ permutations. The p-values of Score, MultiPhen, CCA, TATES, and OB are evaluated using asymptotic distributions. The grayed-out p-values indicate p-values > 5 × 10⁻⁸.

https://doi.org/10.1371/journal.pone.0190788.t004

Discussion

In this article, we developed WCmulP to perform multivariate analysis of multiple phenotypes in association studies for the following reasons: (1) complex diseases are usually measured by multiple correlated phenotypes in genetic association studies; and (2) there is increasing evidence that studying multiple correlated phenotypes jointly may increase power for detecting genetic variants that are associated with complex diseases. Our results show that WCmulP has correct type I error rates and is either the most powerful test or comparable to the most powerful tests among the seven tests we considered. None of the other methods showed consistently good performance across the simulation scenarios. OB is the most powerful test when the genetic effects are homogeneous, while it loses power dramatically when genetic effects are heterogeneous, especially when the genetic effects have opposite directions. SHet, Score, MultiPhen, and CCA have similar powers and they are less powerful than WCmulP in most scenarios. TATES is more powerful only when the genetic variant affects a portion of phenotypes. In addition, in the real data analysis, WCmulP identified 13 (out of 16) significant SNPs, 1 SNP fewer than the largest number of identified SNPs; using PCs of phenotypes, WCmulP is the only method that identified all 16 SNPs. The real data analysis results show that WCmulP performs well in identifying SNPs associated with complex diseases that have multiple correlated phenotypes, such as COPD.

In the context of association studies, it is important to correct for population stratification (PS). PS refers to allele frequency differences between populations that are unrelated to the outcome of interest and are due to systematic ancestry differences. PS can cause seriously confounded associations if not adjusted for properly [43, 44]. The principal component analysis (PCA) method [45–49] and the linear mixed model (LMM) approach [50–52] have been used to adjust for population stratification. There are also other methods such as multidimensional scaling (MDS) [53], the robust PCA based on resampling by half means (RPCA-RHM) [54], and the robust PCA based on projection pursuit (RPCA-PP) [54], which extend the PCA approach. PCA identifies several top principal components of the genotype data matrix and uses them as covariates in the association analysis. We propose to use PCA to control for PS in our proposed method when samples from different populations are involved. However, its performance needs further investigation.

One disadvantage of WCmulP is that the test statistic does not have an asymptotic distribution and a permutation procedure is needed to calculate its p-value, which is time consuming compared with methods whose test statistics have asymptotic distributions. The running time of WCmulP with 1,000 permutations on a data set with 5,000 individuals and 20 phenotypes on a laptop with 4 Intel(R) Cores(TM) i7-4790 CPU @ 3.6GHz and 4 GB memory is no more than 0.15s. To perform GWAS, we can first select genetic variants that show evidence of association based on a small number of permutations (e.g. 1,000), and then use a large number of permutations to test the selected genetic variants [21]. Furthermore, WCmulP cannot be used for rare variant association studies, although recent studies have shown that complex diseases are caused by both common and rare variants [50, 55–58]. Extending WCmulP to rare variant association studies is left for future work.
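The two-stage use of permutations described above can be sketched generically. Everything in this block is an illustrative assumption rather than the authors' procedure: the function names, the screening cutoff, the default permutation counts, and the (count + 1)/(B + 1) p-value convention. The `statistic` argument is any function that maps a genotype vector to a test statistic with the phenotypes held fixed.

```python
import random

def permutation_pvalue(statistic, genotypes, n_perm, rng):
    """Permutation p-value of statistic(genotypes), shuffling genotypes only."""
    observed = statistic(genotypes)
    shuffled = list(genotypes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

def two_stage_pvalue(statistic, genotypes, n_screen=1000, n_full=10**6,
                     screen_cutoff=0.01, seed=0):
    """Screen with few permutations; spend many only on promising variants."""
    rng = random.Random(seed)
    p_screen = permutation_pvalue(statistic, genotypes, n_screen, rng)
    if p_screen > screen_cutoff:
        return p_screen  # no evidence at the screening stage, stop early
    return permutation_pvalue(statistic, genotypes, n_full, rng)
```

Most variants in a genome-wide scan would stop after the screening stage, so the expensive second stage is run only for the small set of variants that pass the cutoff.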

In our simulation studies, the number of phenotypes varied from 8 to 16, and the methods rely on all observations having fully observed phenotypes. However, in real data analysis, as the number of phenotypes increases, the chance that an observation is missing at least one phenotype increases rapidly, especially in epidemiological and clinical research [59, 60]. There are several approaches to handling missing phenotypes: deletion-based methods, simple replacement methods, and imputation methods [59]. The most commonly used approach is the deletion-based method, in which observations with missing values are removed from the analysis [59]. However, removing observations with missing values reduces the sample size and thus results in power loss [60]. Simple replacement methods replace each missing value with a plausible value for that variable, such as the sample mean [8, 59]. Mean substitution is a simple, unconditional method that does not depend on other variables, but it may result in biased estimates when data are not missing completely at random [59]. Imputation is a more sophisticated approach that fills in missing values with predicted values using model-based methods or conditional imputation, including multiple imputation (MI), multivariate normal imputation (MVNI), and fully conditional specification (FCS) [59, 61–66]. In MI, multiple completed copies of the incomplete dataset are generated, with missing values replaced by values drawn from a posterior distribution according to a suitable imputation model that utilizes the rest of the data [59, 61]. MVNI fits a joint imputation model to all the variables containing missing values under the assumption that the variables follow a multivariate normal distribution [62, 63]. For each variable with missing values, FCS fits a separate univariate regression model and iteratively cycles through these models [64–66]. In our real data analysis, we removed 1,354 of the 6,784 samples because they had missing phenotypes or covariates. An alternative approach is to use mean substitution or imputation to fill in the missing values.
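As a minimal sketch of the two simplest options discussed above, the following compares deletion of incomplete observations with mean substitution on a small made-up phenotype matrix. The function names and the toy data are illustrative assumptions, and `prob_incomplete` further assumes that each phenotype is missing independently at a common rate, which need not hold in real studies.

```python
import numpy as np


def prob_incomplete(miss_rate, n_phenotypes):
    """Chance that an observation misses at least one phenotype, assuming
    independent missingness at a common per-phenotype rate (an assumption)."""
    return 1.0 - (1.0 - miss_rate) ** n_phenotypes


def complete_case(phenotypes):
    """Deletion-based method: keep only rows with no missing phenotype."""
    keep = ~np.isnan(phenotypes).any(axis=1)
    return phenotypes[keep], keep


def mean_substitute(phenotypes):
    """Simple replacement: fill each missing value with its column's sample mean."""
    col_means = np.nanmean(phenotypes, axis=0)
    return np.where(np.isnan(phenotypes), col_means, phenotypes)


# Toy data: 4 individuals, 3 phenotypes, NaN marks a missing value.
Y = np.array([
    [1.0, 2.0, 3.0],
    [np.nan, 2.5, 3.5],
    [1.4, np.nan, 2.9],
    [0.8, 1.9, 3.1],
])

kept, keep_mask = complete_case(Y)   # drops the two incomplete rows
filled = mean_substitute(Y)          # keeps all four rows

print(kept.shape, filled.shape)
print(prob_incomplete(0.05, 8), prob_incomplete(0.05, 16))
```

Deletion keeps only fully observed individuals and so shrinks the sample, whereas mean substitution retains every individual but ignores the correlation between phenotypes that model-based imputation would exploit.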

Acknowledgments

Research reported in this article was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R15HG008209. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

This research used data generated by the COPDGene study, which was supported by NIH grants U01HL089856 and U01HL089897. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer Ingelheim, Novartis, and Sunovion.

Superior, a high-performance computing infrastructure at Michigan Technological University, was used in obtaining results presented in this publication.

The authors have no conflicts of interest to declare.

References

  1. O'Reilly PF, Hoggart CJ, Pomyen Y, Calboli FC, Elliott P, Jarvelin MR, et al. MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PloS one. 2012; 7(5):e34861. pmid:22567092; PubMed Central PMCID: PMC3342314.
  2. Yang Q, Wang Y. Methods for analyzing multivariate phenotypes in genetic association studies. J Probab Stat. 2012; 2012:652569. pmid:24748889; PubMed Central PMCID: PMC3989935.
  3. Yang Q, Wu H, Guo CY, Fox CS. Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genetic epidemiology. 2010; 34(5):444–54. pmid:20583287; PubMed Central PMCID: PMC3090041.
  4. Gavish B, Ben-Dov IZ, Bursztyn M. Linear relationship between systolic and diastolic blood pressure monitored over 24 h: assessment and correlates. Journal of hypertension. 2008; 26(2):199–209. pmid:18192832.
  5. Huang PL. A comprehensive definition for metabolic syndrome. Disease models and mechanisms. 2009; 2(5–6):231–7. pmid:19407331
  6. Aschard H, Vilhjalmsson BJ, Greliche N, Morange PE, Tregouet DA, Kraft P. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. American journal of human genetics. 2014; 94(5):662–76. pmid:24746957; PubMed Central PMCID: PMC4067564.
  7. O'Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984; 40(4):1079–87. pmid:6534410.
  8. van der Sluis S, Posthuma D, Dolan CV. TATES: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS genetics. 2013; 9(1):e1003235. pmid:23359524; PubMed Central PMCID: PMC3554627.
  9. Ray D, Pankow JS, Basu S. USAT: A Unified Score-Based Association Test for Multiple Phenotype-Genotype Analysis. Genetic epidemiology. 2016; 40(1):20–34. pmid:26638693; PubMed Central PMCID: PMC4785800.
  10. Yang JJ, Li J, Williams LK, Buu A. An efficient genome-wide association test for multivariate phenotypes based on the Fisher combination function. BMC bioinformatics. 2016; 17:19. pmid:26729364; PubMed Central PMCID: PMC4704475.
  11. Liang X, Wang Z, Sha Q, Zhang S. An Adaptive Fisher’s Combination Method for Joint Analysis of Multiple Phenotypes in Association Studies. Scientific reports. 2016; 6:34323. pmid:27694844
  12. Klei L, Luca D, Devlin B, Roeder K. Pleiotropy and principal components of heritability combine to increase power for association analysis. Genetic epidemiology. 2008; 32(1):9–19. pmid:17922480.
  13. Wang Z, Sha Q, Zhang S. Joint Analysis of Multiple Traits Using "Optimal" Maximum Heritability Test. PloS one. 2016; 11(3):e0150975. pmid:26950849
  14. Ferreira MA, Purcell SM. A multivariate test of association. Bioinformatics. 2009; 25(1):132–3. pmid:19019849.
  15. Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature methods. 2014; 11(4):407–9. pmid:24531419; PubMed Central PMCID: PMC4211878.
  16. Korte A, Vilhjalmsson BJ, Segura V, Platt A, Long Q, Nordborg M. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature genetics. 2012; 44(9):1066–71. pmid:22902788; PubMed Central PMCID: PMC3432668.
  17. Casale FP, Rakitsch B, Lippert C, Stegle O. Efficient set tests for the genetic analysis of correlated traits. Nature methods. 2015; 12(8):755–8. pmid:26076425.
  18. Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986; 42(1):121–30. pmid:3719049.
  19. Zhang Y, Xu Z, Shen X, Pan W, Alzheimer's Disease Neuroimaging I. Testing for association with multiple traits in generalized estimation equations, with application to neuroimaging data. Neuroimage. 2014; 96:309–25. pmid:24704269; PubMed Central PMCID: PMC4043944.
  20. Yan T, Li Q, Li Y, Li Z, Zheng G. Genetic association with multiple traits in the presence of population stratification. Genetic epidemiology. 2013; 37(6):571–80. pmid:23740720.
  21. Wang Z, Wang X, Sha Q, Zhang S. Joint analysis of multiple traits in rare variant association studies. Annals of human genetics. 2016; 80(3):162–71. pmid:26990300
  22. Sha Q, Wang X, Wang X, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genetic epidemiology. 2012; 36(6):561–71. pmid:22714994.
  23. Majumdar A, Witte JS, Ghosh S. Semiparametric Allelic Tests for Mapping Multiple Phenotypes: Binomial Regression and Mahalanobis Distance. Genetic epidemiology. 2015; 39(8):635–50. pmid:26493781; PubMed Central PMCID: PMC4958458.
  24. Zhu X, Feng T, Tayo BO, Liang J, Young JH, Franceschini N, et al. Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. American journal of human genetics. 2015; 96(1):21–36. pmid:25500260; PubMed Central PMCID: PMC4289691.
  25. Sha Q, Zhang Z, Zhang S. An improved score test for genetic association studies. Genetic epidemiology. 2011; 35(5):350–9. pmid:21484862.
  26. Sun J, Oualkacha K, Forgetta V, Zheng H-F, Brent Richards J, Ciampi A, et al. A method for analyzing multiple continuous phenotypes in rare variant association studies allowing for flexible correlations in variant effects. European journal of human genetics. 2016; 24(9):1344–51. pmid:26860061
  27. Zhu H, Zhang S, Sha Q. Power Comparisons of Methods for Joint Association Analysis of Multiple Phenotypes. Human heredity. 2015; 80(3):144–52. pmid:27344597.
  28. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, et al. Genetic Epidemiology of COPD (COPDGene) Study Design. COPD. 2010; 7(1):32–43. pmid:20214461
  29. Pillai SG, Ge D, Zhu G, Kong X, Shianna KV, Need AC, et al. A Genome-Wide Association Study in Chronic Obstructive Pulmonary Disease (COPD): Identification of Two Major Susceptibility Loci. PLoS genetics. 2009; 5(3):e1000421. pmid:19300482
  30. Wilk JB, Chen TH, Gottlieb DJ, Walter RE, Nagle MW, Brandler BJ, et al. A genome-wide association study of pulmonary function measures in the Framingham Heart Study. PLoS genetics. 2009; 5(3):e1000429. pmid:19300500; PubMed Central PMCID: PMC2652834.
  31. Wilk JB, Shrine NRG, Loehr LR, Zhao JH, Manichaikul A, Lopez LM, et al. Genome-Wide Association Studies Identify CHRNA5/3 and HTR4 in the Development of Airflow Obstruction. American journal of respiratory and critical care medicine. 2012; 186(7):622–32. pmid:22837378
  32. Cho MH, Boutaoui N, Klanderman BJ, Sylvia JS, Ziniti JP, Hersh CP, et al. Variants in FAM13A are associated with chronic obstructive pulmonary disease. Nature genetics. 2010; 42(3):200–2. pmid:20173748
  33. Cho MH, Castaldi PJ, Wan ES, Siedlinski M, Hersh CP, Demeo DL, et al. A genome-wide association study of COPD identifies a susceptibility locus on chromosome 19q13. Human molecular genetics. 2012; 21(4):947–57. pmid:22080838
  34. Cho MH, McDonald M-LN, Zhou X, Mattheisen M, Castaldi PJ, Hersh CP, et al. Risk loci for chronic obstructive pulmonary disease: a genome-wide association study and meta-analysis. The lancet respiratory medicine. 2014; 2(3):214–25. pmid:24621683
  35. Hancock DB, Eijgelsheim M, Wilk JB, Gharib SA, Loehr LR, Marciante KD, et al. Meta-analyses of genome-wide association studies identify multiple novel loci associated with pulmonary function. Nature genetics. 2010; 42(1):45–52. pmid:20010835
  36. Young RP, Whittington CF, Hopkins RJ, Hay BA, Epton MJ, Black PN, et al. Chromosome 4q31 locus in COPD is also associated with lung cancer. The European respiratory journal. 2010; 36(6):1375–82. pmid:21119205.
  37. Li X, Howard TD, Moore WC, Ampleford EJ, Li H, Busse WW, et al. Importance of hedgehog interacting protein and other lung function genes in asthma. Journal of allergy and clinical immunology. 2011; 127(6):1457–65. pmid:21397937
  38. Zhang J, Summah H, Zhu YG, Qu JM. Nicotinic acetylcholine receptor variants associated with susceptibility to chronic obstructive pulmonary disease: a meta-analysis. Respiratory research. 2011; 12:158. pmid:22176972; PubMed Central PMCID: PMC3283485.
  39. Cui K, Ge X, Ma H. Four SNPs in the CHRNA3/5 Alpha-Neuronal Nicotinic Acetylcholine Receptor Subunit Locus Are Associated with COPD Risk Based on Meta-Analyses. PloS one. 2014; 9(7):e102324. pmid:25051068
  40. Zhu AZX, Zhou Q, Cox LS, David SP, Ahluwalia JS, Benowitz NL, et al. Association of CHRNA5-A3-B4 SNP rs2036527 with smoking cessation therapy response in African American smokers. Clinical pharmacology and therapeutics. 2014; 96(2):256–65. pmid:24733007
  41. Lutz SM, Cho MH, Young K, Hersh CP, Castaldi PJ, McDonald M-L, et al. A genome-wide association study identifies risk loci for spirometric measures among smokers of European and African ancestry. BMC genetics. 2015; 16:138. pmid:26634245
  42. Lee JH, Cho MH, Hersh CP, McDonald M-LN, Wells JM, Dransfield MT, et al. IREB2 and GALC Are Associated with Pulmonary Artery Enlargement in Chronic Obstructive Pulmonary Disease. American journal of respiratory cell and molecular biology. 2015; 52(3):365–76. pmid:25101718
  43. Knowler WC, Williams RC, Pettitt DJ, Steinberg AG. Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. American journal of human genetics. 1988; 43(4):520–6. pmid:3177389; PubMed Central PMCID: PMC1715499.
  44. Lander ES, Schork NJ. Genetic dissection of complex traits. Science. 1994; 265(5181):2037–48. pmid:8091226.
  45. Chen HS, Zhu X, Zhao H, Zhang S. Qualitative semi-parametric test for genetic associations in case-control designs under structured populations. Annals of human genetics. 2003; 67(Pt 3):250–64. pmid:12914577.
  46. Zhang S, Zhu X, Zhao H. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genetic epidemiology. 2003; 24(1):44–56. pmid:12508255.
  47. Zhu X, Zhang S, Zhao H, Cooper RS. Association mapping, using a mixture model for complex traits. Genetic epidemiology. 2002; 23(2):181–96. pmid:12214310.
  48. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics. 2006; 38(8):904–9. pmid:16862161.
  49. Bauchet M, McEvoy B, Pearson LN, Quillen EE, Sarkisian T, Hovhannesyan K, et al. Measuring European population stratification with microarray genotype data. American journal of human genetics. 2007; 80(5):948–56. pmid:17436249; PubMed Central PMCID: PMC1852743.
  50. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nature genetics. 2010; 42(4):348–54. pmid:20208533; PubMed Central PMCID: PMC3092069.
  51. Zhang Z, Ersoz E, Lai CQ, Todhunter RJ, Tiwari HK, Gore MA, et al. Mixed linear model approach adapted for genome-wide association studies. Nature genetics. 2010; 42(4):355–60. pmid:20208535; PubMed Central PMCID: PMC2931336.
  52. Hoffman GE. Correcting for Population Structure and Kinship Using the Linear Mixed Model: Theory and Extensions. PloS one. 2013; 8(10):e75707. pmid:24204578
  53. Li Q, Yu K. Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genetic epidemiology. 2008; 32(3):215–26. pmid:18161052.
  54. Liu L, Zhang D, Liu H, Arendt C. Robust methods for population stratification in genome wide association studies. BMC bioinformatics. 2013; 14:132. pmid:23601181; PubMed Central PMCID: PMC3637636.
  55. Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nature genetics. 2008; 40(6):695–701. pmid:18509313; PubMed Central PMCID: PMC2527050.
  56. Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-common variant…or not? Hum Mol Genet. 2002; 11(20):2417–23. pmid:12351577.
  57. Teer JK, Mullikin JC. Exome sequencing: the sweet spot before whole genomes. Hum Mol Genet. 2010; 19(R2):R145–51. pmid:20705737; PubMed Central PMCID: PMC2953745.
  58. Walsh T, King MC. Ten genes for inherited breast cancer. Cancer Cell. 2007; 11(2):103–5. pmid:17292821.
  59. Ali AM, Dawson SJ, Blows FM, Provenzano E, Ellis IO, Baglietto L, et al. Comparison of methods for handling missing data on immunohistochemical markers in survival analysis of breast cancer. British journal of cancer. 2011; 104(4):693–9. pmid:21266980; PubMed Central PMCID: PMC3049587.
  60. Dahl A, Iotchkova V, Baud A, Johansson A, Gyllensten U, Soranzo N, et al. A multiple-phenotype imputation method for genetic studies. Nat Genet. 2016; 48(4):466–72. pmid:26901065; PubMed Central PMCID: PMC4817234.
  61. De Silva AP, Moreno-Betancur M, De Livera AM, Lee KJ, Simpson JA. A comparison of multiple imputation methods for handling missing values in longitudinal data in the presence of a time-varying covariate with a non-linear association with time: a simulation study. BMC medical research methodology. 2017; 17(1):114. pmid:28743256; PubMed Central PMCID: PMC5526258.
  62. Schafer JL. Analysis of incomplete multivariate data: CRC press; 1997.
  63. Carlin J. Multiple imputation: a perspective and historical overview. Handbook of Missing Data. 2015.
  64. Raghunathan TE, Lepkowski JM, Van Hoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey methodology. 2001; 27(1):85–96.
  65. Van Buuren S, Brand JP, Groothuis-Oudshoorn C, Rubin DB. Fully conditional specification in multivariate imputation. Journal of statistical computation and simulation. 2006; 76(12):1049–64.
  66. Carpenter J, Kenward M. Multiple imputation and its application: John Wiley & Sons; 2012.