A novel method for multiple phenotype association studies based on genotype and phenotype network

Abstract

Joint analysis of multiple correlated phenotypes for genome-wide association studies (GWAS) can identify and interpret pleiotropic loci, which are essential to understand pleiotropy in diseases and complex traits. Meanwhile, constructing a network based on associations between phenotypes and genotypes provides new insight for analyzing multiple phenotypes, because it can explore whether phenotypes and genotypes might be related to each other at a higher level of cellular and organismal organization. In this paper, we first develop a bipartite signed network by linking phenotypes and genotypes into a Genotype and Phenotype Network (GPN). The GPN can be constructed from a mixture of quantitative and qualitative phenotypes and is applicable to binary phenotypes with extremely unbalanced case-control ratios in large-scale biobank datasets. We then apply a powerful community detection method to partition phenotypes into disjoint network modules based on GPN. Finally, we jointly test the association between multiple phenotypes in a network module and a single nucleotide polymorphism (SNP). Simulations and analyses of 72 complex traits in the UK Biobank show that multiple phenotype association tests based on network modules detected by GPN are much more powerful than those without considering network modules. The newly proposed GPN provides new insight into the genetic architecture among different types of phenotypes. Multiple phenotype association studies based on GPN are improved by incorporating the genetic information into the phenotype clustering. Notably, it might broaden the understanding of genetic architecture that exists between diagnoses, genes, and pleiotropy.

Author summary

Biological pleiotropy refers to a SNP or gene that has a direct biological influence on more than one phenotypic trait, which can offer significant insights in understanding the complex genotype-phenotype relationships. Network analyses provide an integrative approach to characterize complex genomic associations by linking phenotypes and genotypes into a Genotype and Phenotype Network (GPN). Jointly analyzing multiple phenotypes and incorporating the genetic information into the phenotype clustering may increase the statistical power to discover the cross-phenotype association and pleiotropy. We evaluate our proposed multiple phenotype association tests based on network modules detected by GPN for 72 EHR-derived phenotypes in the diseases of the musculoskeletal system and connective tissue in the UK Biobank. From the post-GWAS analyses, we observe that the test based on GPN can identify more significantly enriched biological pathways than that without considering the network modules. Meanwhile, some of the uniquely identified SNPs by the test based on GPN are also colocalized in the eQTL study of the gene expression in the Muscle Skeletal tissue.

Introduction

Genome-wide association studies (GWAS) have successfully identified thousands of single nucleotide polymorphisms (SNPs) genetically associated with a wide range of complex human diseases and traits [1,2]. Over the past decade, more than 10,000 associations between SNPs and diseases/traits have been discovered [3]. Although GWAS have emerged as a common and powerful tool to detect the complexity of the genotype-phenotype associations, a common limitation of GWAS is that they focus on only a single phenotype at a time [4–7]. Joint analysis of multiple correlated phenotypes for GWAS may provide more power to identify and interpret pleiotropic loci, which are essential to understand pleiotropy in diseases and complex traits [4,8,9]. In brief, biological pleiotropy refers to a SNP or gene that has a direct biological influence on more than one phenotypic trait [10]. Biological pleiotropy can offer significant insights in understanding the complex genotype-phenotype relationships [2]. Therefore, multiple phenotypes are usually collected in many GWAS cohorts and jointly analyzing multiple phenotypes may increase statistical power to discover the cross-phenotype associations and pleiotropy [10–13].

Many statistical methods have been developed to jointly test the association between a SNP and multiple correlated phenotypes [14]. The most widely used methods for multiple phenotype association studies can be roughly classified into three categories: 1) statistical tests based on combining either the univariate test statistics or p-values, such as O’Brien’s method [15], adaptive Fisher’s combination (AFC) [16], aSPU [17], and others [18]; 2) multivariate analyses based on regression methods, such as multivariate analysis of variance (MANOVA) [19], reverse regression methods (MultiPhen) [20], linear mixed effect models (LMM) [21], and generalized estimating equations (GEE) [22]; and 3) dimension reduction methods, such as clustering linear combination (CLC) [12], canonical correlation analysis (CCA) [23], and principal components analysis (PCA) [24,25]. However, most phenotypes are influenced by many SNPs that act in concert to alter cellular function [26], and the above-mentioned methods are based only on phenotypic correlation without considering the genetic correlation among phenotypes. Therefore, these methods may lose statistical power to detect true pleiotropic effects compared with methods based on the genetic architecture among complex diseases. To address this issue, numerous algorithms to investigate the genetic correlation among complex traits and diseases have been developed [27–29]. Many of these algorithms are used in conjunction with linkage disequilibrium (LD) information by using GWAS summary association data [28]. For example, cross-trait LD score regression has been developed to estimate genetic and phenotypic correlation; it requires only GWAS summary statistics and is not biased by overlapping samples [27].

In 2007, a conceptually different approach based on the human disease network was developed, exploring whether human complex traits and the corresponding genotypes might be related to each other at a higher level of cellular and organismal organization [30]. Network analyses provide an integrative approach to characterize complex genomic associations [31]. Over the past decade, network methodologies, particularly Weighted Gene Co-expression Network Analysis (WGCNA) [32], have become increasingly popular in genetic association studies. This popularity is due to their effectiveness in identifying complex patterns of gene expression and clarifying the relationships among genes. Researchers have applied WGCNA to unravel the genetic underpinnings of complex traits, focusing primarily on gene-gene interactions [33,34]. However, constructing networks that map the associations between phenotypes and genotypes can provide fresh perspectives, enabling the simultaneous analysis of multiple phenotypes and SNPs. Notably, it might broaden the understanding of genetic architecture that exists between diagnoses, genes, and pleiotropy [8]. Modules detected from human disease networks are useful in providing insights pertaining to biological functionality [35]. Therefore, community detection methods play a key role in understanding the global and local structures of disease interaction and in shedding light on association connections that may not be easily visible in the network topology [36]. Many community detection methods have been carried over from social networks to human disease networks, such as Louvain’s method [8], which uses modularity as a measure, and core module identification, which identifies small and structurally well-defined communities [35]. However, most community detection methods have been developed for unsigned networks [37–43].

To date, many biobanks, such as the UK Biobank [44], aggregate data across tens of thousands of phenotypes and provide a great opportunity to construct the human disease network and perform joint analyses of multiple correlated phenotypes. The electronic health record (EHR)-driven genomic research (EDGR) workflow is the most popular way to analyze multiple diagnosis codes in biobank data; at its core is the use of EHR data for genomic research in the investigation of population-wide genomic characterization [45]. In most EHR systems, the whole phenome can be divided into numerous phenotypic categories according to the first few characters of the International Classification of Disease (ICD) billing codes [46]. However, the ICD-based categories are based on the underlying cause of death rather than on the shared genetic architecture among all complex diseases and traits. Meanwhile, the phenotypes in large biobanks usually have extremely unbalanced case-control ratios. Therefore, linking phenotypes, especially EHR-derived phenotypes, with genotypes in a network is also very important to examine the genetic architecture of complex diseases and traits.

Results

Overview of methods

In this paper, we develop a bipartite signed network by linking phenotypes and genotypes into a Genotype and Phenotype Network (GPN; Fig 1A). The GPN can be constructed from a mixture of quantitative and qualitative phenotypes and is applicable to phenotypes with extremely unbalanced case-control ratios in large-scale biobank datasets, since the saddlepoint approximation [47] is used to test the association between a genotype and a phenotype with an extremely unbalanced case-control ratio. After projecting genotypes into phenotypes, the genetic correlation of phenotypes can be calculated based on the shared associations among all genotypes (Fig 1B). We then apply a powerful community detection method to partition phenotypes into disjoint network modules using the hierarchical clustering method, with the number of modules determined by perturbation (Fig 1C) [48]. The phenotypes in each network module share the same genetic information. After partitioning phenotypes into disjoint network modules, a statistical method for multiple phenotype association studies can be applied to test the association between the phenotypes in each module and a SNP; a Bonferroni correction can then be used to test whether all phenotypes are associated with a SNP (Fig 1D). To jointly analyze the association between multiple phenotypes in each module and a SNP, we use six multiple phenotype association tests: ceCLC [49], CLC [12], HCLC [50], MultiPhen [20], O’Brien [15], and Omnibus [12]. The advantage of the association test based on network modules detected by GPN is that phenotypes in a network module are highly correlated based on the genetic architecture; therefore, the association test is more powerful to identify pleiotropic SNPs.
After we obtain the GWAS signals from the previous steps, post-GWAS analyses can be applied to understand the high level of biological mechanism, such as pathway/tissue enrichment analysis and colocalization of GWAS signals and eQTL analysis in the specific disease-associated tissue (Fig 1E–1G). The construction of GPN, community detection method, and six multiple phenotype association tests have been implemented in R, which is an open-source software and publicly available on GitHub: https://github.com/xueweic/GPN.
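The first two steps of Fig 1 can be illustrated with a toy example. The sketch below stores a small signed bipartite network as a SNP-by-phenotype matrix and projects it onto phenotypes. Using z-score-like values as edge weights and cosine similarity as the projection are assumptions made purely for illustration, and all numbers are hypothetical; the weighting implemented in the GPN package may differ.

```python
import math

# Hypothetical signed association strengths (z-score-like values).
# Rows are SNPs, columns are phenotypes. A positive sign marks a risk
# minor allele, a negative sign a protective minor allele.
phenotypes = ["P1", "P2", "P3"]
gpn = [
    [4.0, 3.5, -0.2],
    [-2.0, -1.8, 0.1],
    [0.3, 0.1, 5.0],
    [1.5, 1.2, -3.0],
]

def column(matrix, j):
    """Return column j of a row-major matrix as a list."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def project_to_phenotypes(matrix):
    """One-mode projection: phenotype-by-phenotype similarity over shared SNPs."""
    k = len(matrix[0])
    return [[cosine(column(matrix, i), column(matrix, j)) for j in range(k)]
            for i in range(k)]

ppn = project_to_phenotypes(gpn)
for name, row in zip(phenotypes, ppn):
    print(name, [round(x, 3) for x in row])
```

In this toy matrix, P1 and P2 share the direction of effect at every SNP and therefore end up strongly positively connected, while P3 is negatively connected to both; a community detection step would then group P1 and P2 together.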

Fig 1. Overview of the method.

a. Construction of GPN. Each phenotype (yellow square) and each SNP form a directed edge which represents the strength of the association, where the red dashed line indicates that the minor allele of the SNP is a protective allele to the phenotype, and the blue dashed line indicates that the minor allele of the SNP is a risk allele to the phenotype. b. Construction of PPN, which is the one-mode projection of GPN on phenotypes. c. The community detection method is used to partition phenotypes into disjoint network modules. d. Multiple phenotype association tests based on the network modules detected by GPN. e. GWAS signals are identified by a multiple phenotype association test with or without considering network modules. f. Functional enrichment analysis based on the detected GWAS signals and the publicly available functional database. g. Colocalization of GWAS signals and eQTL analysis. (All networks are generated by an open source software platform, Cytoscape 0.9.2, which can be accessed via https://cytoscape.org/ [51]; Other figures are generated by an open source software, R 4.2.2, which can be accessed via https://www.r-project.org/).

https://doi.org/10.1371/journal.pgen.1011245.g001

Simulation studies

We first use extensive simulation studies to validate multiple phenotype association studies based on the newly proposed GPN. In the simulation studies, we assess the type I error rate and power with different numbers of phenotypes (60, 80, and 100), different types of phenotypes along with different sample sizes: (i) mixture phenotypes are half quantitative and half qualitative with balanced case-control ratios for sample sizes of 2,000 and 4,000, and (ii) binary phenotypes are all qualitative but with extremely unbalanced case-control ratios for sample sizes of 10,000 and 20,000. Similar to the simulation models introduced in Sha et al. [12], we generate six different models (see Data Simulation).

Type I error rates.

Tables A-F in S1 Text summarize the estimated type I error rates of six multiple phenotype association tests for mixture phenotypes under models 1–6, respectively. “N.O.” represents the type I error rates of multiple phenotype association tests calculated without considering network modules; “NET” represents the type I error rates of the tests evaluated by considering network modules detected by GPN. Based on 500 Monte-Carlo (MC) runs, which is the same as 10⁶ replicates, the 95% confidence intervals (CIs) for type I error rates divided by nominal significance levels 0.001 and 0.0001 are (0.938, 1.062) and (0.804, 1.196), respectively. The bold-faced values indicate values beyond the upper bounds of the 95% CIs. We can see that almost all of the estimated type I error rates of the ceCLC, CLC, HCLC, and Omnibus tests are within the 95% CIs. However, O’Brien in NET has inflated type I error rates under model 6. MultiPhen has inflated type I error rates for the sample size of 2,000. If the sample size is 4,000, MultiPhen in N.O. also has inflated type I error rates, but MultiPhen in NET can control type I error rates at the significance level of 0.0001. Tables G-L in S1 Text summarize the estimated type I error rates of six multiple phenotype association tests for binary phenotypes with extremely unbalanced case-control ratios under models 1–6. Similar to Tables A-F in S1 Text, ceCLC, CLC, HCLC, and Omnibus have correct type I error rates in almost all simulation settings. However, O’Brien in NET has inflated type I error rates and MultiPhen has inflated type I error rates in all scenarios. If there are no clusters of phenotypes, we also see that only MultiPhen has inflated type I error rates and the other five multiple phenotype association tests have well-controlled type I error rates (Table M in S1 Text).
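The confidence intervals quoted above are consistent with a normal approximation to the binomial proportion. The following sketch assumes that approximation (the text does not state which interval was used) and rescales the bounds by the nominal level; the function name is hypothetical.

```python
import math

def type1_ratio_ci(alpha, n_replicates, z=1.96):
    """95% CI for (estimated type I error rate / nominal level alpha),
    assuming a normal approximation to the binomial proportion."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_replicates)
    return (alpha - z * se) / alpha, (alpha + z * se) / alpha

for alpha in (0.001, 0.0001):
    lo, hi = type1_ratio_ci(alpha, 10**6)
    print(f"alpha={alpha}: ({lo:.3f}, {hi:.3f})")
```

With 10⁶ replicates this gives (0.938, 1.062) for the 0.001 level and (0.804, 1.196) for the 0.0001 level, matching the intervals reported in the text.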

Power comparisons.

For power comparisons, we consider 100 causal SNPs for models 1–4 and 200 causal SNPs for models 5–6 (see Data Simulation). In each of the simulation models, the power is evaluated using 10 MC runs which is the same as 1,000 replicates for models 1–4 and 2,000 replicates for models 5–6. Meanwhile, the power is evaluated at the Bonferroni corrected significance level of 0.05 based on the number of causal SNPs in each MC run.
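As a minimal sketch of this evaluation, empirical power can be read as the fraction of causal-SNP tests whose p-value falls below 0.05 divided by the number of causal SNPs. The helper name and the p-values below are hypothetical and serve only to show the arithmetic.

```python
def empirical_power(p_values, n_causal, family_alpha=0.05):
    """Fraction of tests rejected at the Bonferroni-corrected level
    family_alpha / n_causal."""
    threshold = family_alpha / n_causal
    rejected = sum(1 for p in p_values if p < threshold)
    return rejected / len(p_values)

# Hypothetical p-values for four causal SNPs in a setting with 100 causal SNPs,
# so the per-test threshold is 0.05 / 100 = 5e-4.
example_p = [1e-5, 1e-3, 4e-4, 0.2]
print(empirical_power(example_p, n_causal=100))
```

Here two of the four hypothetical p-values fall below 5e-4, so the estimated power is 0.5.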

Fig 2 (Fig A in S1 Text) shows the power of six multiple phenotype association tests under six simulation models for different effect sizes with a total of 80 mixture phenotypes and a sample size of 4,000 (2,000). From Figs 2 and A in S1 Text, we can see that: (i) All tests in NET (filled by the dashed line) are much more powerful than those in N.O., indicating that tests based on network modules detected by GPN are more powerful than the tests without considering network modules. Since the community detection method can partition phenotypes into different network modules based on shared genetic architecture, phenotypes are clustered in the same module if they have higher genetic correlations. In particular, the power of O’Brien [15] increases substantially when a SNP affects phenotypes in different directions. (ii) ceCLC is more powerful than the other tests in both N.O. and NET under the six simulation models. (iii) As the sample size increases, the power of all multiple phenotype association tests increases. We also perform power comparisons for a total of 60 and 100 mixture phenotypes with sample sizes of 2,000 and 4,000 for different effect sizes under the six simulation models (Figs B-E in S1 Text). We observe that the patterns of the power are similar to those observed in Figs 2 and A in S1 Text.

Fig 2. Power comparisons of the six tests as a function of effect size β under the six models.

The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 80 and the sample size is 4,000. The power of all of the six tests is evaluated using 10 MC runs. (Figure is generated by an open source software, R 4.2.2, which can be accessed via https://www.r-project.org/).

https://doi.org/10.1371/journal.pgen.1011245.g002

To mimic phenotypes in the UK Biobank, we also consider the case with all phenotypes being binary with extremely unbalanced case-control ratios. The phenotypes are generated based on extremely unbalanced case-control ratios which are randomly selected from the set of case-control ratios with cases greater than 200 from UK Biobank ICD-10 code level 3 phenotypes (see Real Dataset; case-control ratios belong to [0.000658,0.03937]). In this simulation, we consider a total of 60, 80, and 100 phenotypes along with two sample sizes, 10,000 and 20,000. Figs F-K in S1 Text show the power comparisons of the six tests under six simulation models. Fig L in S1 Text shows the power comparisons of the six tests under the models that mimic real data cluster structures. The patterns of power comparisons for binary phenotypes and for the models that mimic real data cluster structure are similar to those observed in Figs 2 and A-E in S1 Text.

Real data analysis based on UK Biobank

Furthermore, we apply the newly proposed multiple phenotype association test based on network modules detected by GPN to a set of diseases of the musculoskeletal system and connective tissue across more than 300,000 individuals from the UK Biobank.

Network module detection.

We construct GPN based on 72 EHR-derived phenotypes in the diseases of the musculoskeletal system and connective tissue with 288,647 SNPs in autosomal chromosomes in the UK Biobank. Due to all phenotypes in our analysis being extremely unbalanced, the strength of the association between phenotype and genotype is calculated by the saddlepoint approximation [47]. After the construction of GPN, we apply a powerful community detection method and these 72 phenotypes are partitioned into 8 disjoint network modules (Fig 3). There are 2–37 phenotypes in each module.

Fig 3. The network modules detected by the powerful community detection method based on GPN.

The blocks with different colors indicate different modules, where the values in the legend represent the number of phenotypes in each network module. The labels of phenotypes are listed in the form of ICD-10 codes and the corresponding diseases can be found in the UK Biobank. A connection between two phenotypes indicates that the absolute value of the weight is greater than 40. (The graph was prepared by Cytoscape 0.9.2, which can be accessed via https://cytoscape.org/)

https://doi.org/10.1371/journal.pgen.1011245.g003

We can see that the network modules are not consistent with the ICD-based categories which are based on the underlying cause of death rather than the shared genetic architecture among all complex diseases. For example, Fig 3 shows three phenotypes, M32.9 Systemic lupus erythematosus, M35.0 Sicca syndrome, and M65.3 Trigger finger, are detected in network module III (in red). However, these three phenotypes do not belong to the same ICD-category (Data-Field 41202 in UK Biobank), where M35.0 is one of the diseases in the other systemic involvement of connective tissue (M35) and M65.3 belongs to the synovitis and tenosynovitis (M65). To investigate the genetic correlation among these three phenotypes, we use the saddlepoint approximation to test the association between each phenotype and each SNP. As shown in Fig M in S1 Text, the Manhattan plots for the three phenotypes in network module III (M32.9, M35.0, and M65.3) have a similar pattern. Although the synovitis and tenosynovitis (M65.9) and M65.3 belong to the same ICD code category (M65), the Manhattan plot of M65.9 shows that there are no SNPs significantly associated with this phenotype and the genetic correlation between M65.9 and M65.3 is not strong. Therefore, we can conclude that the community detection method based on our proposed GPN can partition phenotypes into different categories based on the shared genetic architecture.

Furthermore, we apply the hierarchical clustering method to compare the genetic correlation of phenotypes obtained by our proposed GPN with that estimated by LDSC [27]. Figs N-O in S1 Text show the dendrograms of the hierarchical clustering method based on the genetic correlation of phenotypes obtained by GPN, and on the phenotypic or genetic correlation estimated by LDSC, respectively. In Fig N in S1 Text, the clustering results for the phenotypic correlation estimated by LDSC are similar to those for the genetic correlation based on GPN, but GPN can separately identify two highly genetically correlated phenotypes, ankylosing spondylitis (M45) and ankylosing spondylitis with site unspecified (M45.X9). However, the clustering results for the genetic correlation estimated by LDSC are different from those obtained by GPN. Some phenotypes in the same UK Biobank level 1 category can be clustered in the same group by GPN but not by LDSC (Fig O in S1 Text).

Interpretation of the association test.

We apply five multiple phenotype association tests (ceCLC, CLC, HCLC, O’Brien, and Omnibus) to test the association between 72 EHR-derived phenotypes and each of 288,647 SNPs in the UK Biobank. MultiPhen is not considered here since it has inflated type I error rates, especially for the phenotypes with extremely unbalanced case-control ratios.

First, we apply the five tests in N.O. to test the association between 72 phenotypes and each SNP. We use the commonly used genome-wide significance level 5×10⁻⁸. Fig 4A shows the Venn diagram of the number of SNPs identified by the five tests. There are 11 SNPs identified by all five tests. ceCLC identifies 647 SNPs, including 32 unique SNPs not identified by the other four tests. Among the 32 novel SNPs, two SNPs, rs13107325 (p-value = 4.6×10⁻¹⁰) and rs443198 (p-value = 1.73×10⁻¹¹), are significantly associated with at least one of the 72 phenotypes reported in the GWAS catalog (Table N in S1 Text). rs13107325 is reported to be associated with osteoarthritis (M19.9) [52] and rotator cuff syndrome (M75.1) [53]. Meanwhile, rs13107325 is mapped to gene SLC39A8, which is also reported to be significantly associated with multisite chronic pain (M25.5) [54]. rs443198 is mapped to gene NOTCH4, which is associated with systemic sclerosis (M34) [55]. Moreover, the mapped gene NOTCH4 is one of the most important genes reported to be associated with multiple diseases in the disease category of the musculoskeletal system and connective tissue, such as rheumatoid arthritis (M06.9) [56], psoriatic arthritis (M07.3) [57], Takayasu arteritis (M31.4) [58], systemic lupus erythematosus (M32.9) [59], and appendicular lean mass (M62.9) [60]. We map these 32 unique SNPs to genes with 20 kb upstream and 20 kb downstream regions. There are 27 out of 32 SNPs with corresponding mapped genes associated with 14 phenotypes reported in the GWAS catalog (Table N in S1 Text). These 14 phenotypes and corresponding ICD-10 codes are summarized in Table O in S1 Text.
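The positional mapping with 20 kb flanking regions can be sketched as a simple interval check. The gene names and coordinates below are hypothetical placeholders rather than real annotations, and the function is only an illustration of the windowing rule, not the annotation tool used in the study.

```python
WINDOW = 20_000  # 20 kb upstream and downstream of each gene

# Hypothetical gene annotations: (name, chromosome, start, end).
genes = [
    ("GENE_A", "1", 100_000, 150_000),
    ("GENE_B", "1", 180_000, 200_000),
]

def map_snp_to_genes(chrom, pos, gene_table, window=WINDOW):
    """Return the genes whose extended region [start - window, end + window]
    on the same chromosome contains the SNP position."""
    return [name for name, g_chrom, start, end in gene_table
            if g_chrom == chrom and start - window <= pos <= end + window]

print(map_snp_to_genes("1", 165_000, genes))
print(map_snp_to_genes("1", 75_000, genes))
```

A SNP lying between two nearby genes can map to both of them, which is why the number of mapped genes need not equal the number of SNPs.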

Fig 4. The Venn diagram of the number of SNPs identified by ceCLC, CLC, HCLC, O’Brien, and Omnibus in N.O. (a) and in NET (b).

The number below each method indicates the total number of SNPs identified by the corresponding method. (Figure is generated by an open source software, R 4.2.2, which can be accessed via https://www.r-project.org/).

https://doi.org/10.1371/journal.pgen.1011245.g004

Next, we test the associations between phenotypes in each of the eight network modules detected by the GPN and each SNP. Then, we adjust the p-value of each method for testing the association between a SNP and all of the 72 phenotypes by Bonferroni correction. We adopt the commonly used genome-wide significance level 5×10⁻⁸. Fig 4B shows that all tests identify more SNPs than in N.O. ceCLC in NET identifies 980 SNPs, compared with 647 SNPs in N.O. Meanwhile, there are 950 SNPs identified by HCLC, 949 SNPs by CLC, and 891 SNPs by Omnibus, where the corresponding results in N.O. are 354 SNPs, 808 SNPs, and 634 SNPs, respectively. In particular, the number of SNPs identified by O’Brien increases substantially, from only 57 SNPs in N.O. to 948 SNPs in NET. As shown in Fig 4B, there are 807 overlapping SNPs identified by all five tests in NET, which is much larger than the 11 overlapping SNPs identified in N.O.
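One way to read the Bonferroni step is that the overall p-value for a SNP is the smallest module-level p-value multiplied by the number of modules, capped at 1. The sketch below assumes that reading; the function name and the module p-values are hypothetical.

```python
GENOME_WIDE_ALPHA = 5e-8

def overall_p_value(module_p_values):
    """Bonferroni-adjusted p-value for association with any module:
    K * min(p_1, ..., p_K), capped at 1."""
    k = len(module_p_values)
    return min(1.0, k * min(module_p_values))

# Hypothetical p-values for one SNP tested against eight network modules.
module_p = [0.31, 0.04, 1e-9, 0.77, 0.52, 0.09, 0.63, 0.18]
p_all = overall_p_value(module_p)
print(p_all, p_all < GENOME_WIDE_ALPHA)
```

The position of the smallest module-level p-value also indicates which module drives the signal, which is how a SNP can be attributed to a specific module such as module III.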

To compare the difference between the tests in N.O. and in NET, we summarize the number of overlapping SNPs identified by each method in N.O. and NET in Fig P in S1 Text. We observe that most SNPs identified in N.O. can be identified in NET. Meanwhile, tests in NET can identify many more SNPs than those in N.O. As mentioned previously, the advantage of the tests based on the network modules detected by GPN is that we can identify potential pleiotropic SNPs and also interpret which network modules a SNP affects based on the shared genetic architecture. Notably, we also investigate the smallest p-value obtained across the eight phenotypic modules for each of the 980 SNPs identified by ceCLC. For example, 396 SNPs have the smallest p-values for testing the association with network module III. Based on the results of the univariate score test corrected for saddlepoint approximation (SPAtest) (Fig M in S1 Text), 104 SNPs are significantly associated with at least one phenotype in module III. All of these 104 SNPs can be identified by ceCLC, HCLC, and Omnibus in NET and 103 SNPs can be identified by CLC and O’Brien in NET. The results show that the tests based on network modules can detect potential pleiotropic loci which cannot be detected by the univariate test. Fig Q in S1 Text shows the QQ plots and inflation factors in each of the eight network modules for six tests in the real data analysis. We see that the inflation factors for all approaches are close to 1.
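For readers unfamiliar with the inflation factor, a common definition is the median of the 1-degree-of-freedom chi-square statistics implied by the p-values, divided by the median of that distribution under the null (about 0.4549). The sketch below assumes this standard definition, which the text does not spell out, and uses evenly spaced p-values as a stand-in for well-calibrated test results.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def inflation_factor(p_values):
    """Genomic inflation factor: median implied 1-df chi-square statistic
    divided by its expected median under the null. Requires 0 < p <= 1."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to z = Phi^{-1}(p / 2), chi2 = z^2.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

# Evenly spaced p-values mimic a well-calibrated test, so lambda is about 1.
uniform_p = [i / 1000 for i in range(1, 1000)]
lam = inflation_factor(uniform_p)
print(round(lam, 3))
```

Values well above 1 would suggest systematic inflation of the test statistics, whereas values near 1, as reported here, suggest that the tests are well calibrated.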

Pathway enrichment analysis.

Because ceCLC is more powerful than the other four tests in simulations and also identifies more SNPs in the real data analysis, we only perform the post-GWAS analyses on the SNPs identified by ceCLC. There are 191 mapped genes containing at least one of the 647 SNPs identified by ceCLC in N.O. and 252 mapped genes containing at least one of the 980 SNPs identified by ceCLC in NET. In this study, pathways are considered significantly enriched by those genes if the false discovery rate (FDR) is < 0.05.

From the pathway enrichment analyses, we observe that ceCLC based on the network modules identifies more significantly enriched pathways than ceCLC without considering network modules. Fig 5 shows that 16 pathways are significantly enriched by 191 mapped genes in N.O. and 29 pathways are significantly enriched by 252 mapped genes in NET, where all of the 16 pathways identified in N.O. are also identified in NET. Two pathways identified in N.O. and NET, rheumatoid arthritis (hsa05323; FDR = 8.72×10⁻³ in N.O. and FDR = 6.48×10⁻⁸ in NET) and systemic lupus erythematosus (hsa05322; FDR = 4.25×10⁻¹⁹ in N.O. and FDR = 1.02×10⁻⁴⁰ in NET) shown in Fig 5, are related to the diseases of the musculoskeletal system and connective tissue. For example, osteoarthritis (M19.9) and rheumatoid arthritis (M06.9) are related to the rheumatoid arthritis pathway. Meanwhile, the pathway related to at least one of the 72 phenotypes, hematopoietic cell lineage (hsa04640; FDR = 1.08×10⁻⁵), is only identified in NET. Notably, the DBGET system (https://www.genome.jp/dbget-bin/www_bget?hsa05322) reports that there are two pathways related to systemic lupus erythematosus: antigen processing and presentation (hsa04612; FDR = 4.83×10⁻³ in N.O. and FDR = 2.82×10⁻¹⁶ in NET), identified in both N.O. and NET, and cell adhesion molecule (hsa04514; FDR = 1.04×10⁻⁵), only identified in NET.

Fig 5. The results for the pathway enrichment analysis based on the genes identified by ceCLC and the KEGG database in N.O. (a) and NET (b).

The red marked pathways denote the pathways related to the diseases of the musculoskeletal system and connective tissue. There are 191 genes in N.O. and 252 genes in NET that are applied to the pathway enrichment analysis. (Figure is generated by an open source software, R 4.2.2, which can be accessed via https://www.r-project.org/).

https://doi.org/10.1371/journal.pgen.1011245.g005

Meanwhile, the above five pathways related to the diseases of the musculoskeletal system and connective tissue contain more enriched genes identified by ceCLC in NET than in N.O. For example, 43 SNPs within six mapped genes identified by ceCLC in N.O. are enriched in the rheumatoid arthritis pathway, including ATP6V1G2, HLA-DRA, LTB, TNF, HLA-DRB1, and HLA-DQA1; and 111 SNPs within 12 mapped genes in NET are enriched in this pathway, including HLA-DMA, HLA-DMB, ATP6V1G2, HLA-DRA, LTB, HLA-DOA, TNF, HLA-DOB, HLA-DQA2, HLA-DRB1, HLA-DQA1, and HLA-DQB1. Compared with the results of ceCLC in N.O., the test based on network modules identifies six more enriched genes, in particular gene HLA-DMB (including rs241458; p-value = 7.09×10⁻⁹) and gene HLA-DOA (including rs3097646; p-value = 5.50×10⁻⁹), which have not been reported in the GWAS catalog.

Tissue enrichment analysis.

To further investigate the biological mechanism, we use FUMA [61] to annotate 191 mapped genes in N.O. and 252 mapped genes in NET in terms of biological context. Because these mapped genes are associated with at least one phenotype in the diseases of the musculoskeletal system and connective tissue, we can test whether these mapped genes are enriched in the relevant tissue based on FUMA. Fig R in S1 Text shows the ordered enriched tissues based on the mapped genes identified by ceCLC in N.O. and NET. We observe that the mapped genes identified by ceCLC in N.O. are most enriched in brain-related tissue (Fig R(a) in S1 Text). In contrast, Fig R(b) in S1 Text shows that the mapped genes identified by ceCLC in NET are significantly enriched in the Muscle-Skeletal tissue with p-value < 0.05. The construction of GPN benefits multiple phenotype association studies by clustering the related phenotypes based on the genetic information. Notably, the identified SNPs are more likely to be within the same relevant biological context.

Colocalization of GWAS and eQTL analysis.

We perform the colocalization analysis on the 33 unique SNPs identified by ceCLC (Table N in S1 Text; one SNP in NET and 32 SNPs in N.O.) and all SNP-gene association pairs in the Muscle Skeletal tissue reported in GTEx. Fig S in S1 Text shows the colocalization signals for the SNPs uniquely identified by ceCLC that are selected to be the lead SNPs in the colocalization analysis. NET identifies one unique SNP, rs4148866, which is mapped to gene ABCB9. Although gene ABCB9 has no reported associations with any diseases of the musculoskeletal system and connective tissue in the GWAS Catalog, the Bayesian posterior probability that the significant SNPs identified by ceCLC and gene expression in the Muscle Skeletal tissue share a variant (PPH4) is 98.4%. A high PPH4 value indicates that gene ABCB9 and the Muscle Skeletal tissue play an important role in the disease mechanism, because the same variant is responsible for the GWAS locus and also affects gene expression [62]. Among the 32 unique SNPs identified by ceCLC in N.O., two SNPs, rs34333163 and rs6916921, are selected to be the lead SNPs (Fig S in S1 Text). Both are reported in the GWAS Catalog to be associated with at least one of the diseases in the musculoskeletal system. However, the PPH4 values for the corresponding genes SLC38A8 and ATP6V1G2 are lower than 50%.

Discussion

In this paper, we propose a novel method for multiple phenotype association studies based on a genotype and phenotype network. We construct a bipartite signed network, GPN, that links genotypes with phenotypes using the evidence of their associations. To understand pleiotropy in diseases and complex traits and to explore the genetic correlation among phenotypes, we project genotypes onto phenotypes based on the GPN. We also apply a powerful community detection method to detect network modules based on the shared genetic architecture. In contrast to previous community detection methods for disease networks, the applied method benefits from exploring the biological functional interactions of diseases based on the signed network. Furthermore, we apply several multiple phenotype association tests to test the association between the phenotypes in each network module and a SNP. Extensive simulation studies show that all multiple phenotype association tests based on network modules have correct type I error rates, provided that the corresponding test is valid for testing the association between a SNP and phenotypes without considering network modules. Most tests in NET are much more powerful than those in N.O. We also evaluate the performance of the association tests based on network modules detected by GPN on a set of 72 EHR-derived phenotypes in the diseases of the musculoskeletal system and connective tissue across more than 300,000 samples from the UK Biobank. Compared with the tests in N.O., all tests based on network modules identify more potentially pleiotropic SNPs, and ceCLC identifies more SNPs than the other methods.

In addition, the construction of GPN does not require access to individual-level genotype and phenotype data; it requires only the association evidence between each genotype and each phenotype. Therefore, when individual-level data are not available, this evidence can be obtained from GWAS summary statistics, such as the effect sizes (odds ratios for binary phenotypes) and the corresponding p-values. GPN can also be applied to omics studies, for example by constructing a GPN that incorporates expression quantitative trait loci (eQTLs) and gene expression. However, the sample sizes of many omics studies are not very large. We have therefore extended our simulation analysis to the same six simulation models with 60 phenotypes and a sample size of 1,000. We observe results similar to those from simulations with larger sample sizes: the tests in NET are much more powerful than those in N.O. (Fig T in S1 Text). Meanwhile, the simulation studies show that the community detection method can correctly partition phenotypes into disjoint network modules based on the shared genetic architecture. Since the determination of the number of network modules in the community detection method is independent of the association tests [48], we only need to perform the perturbation procedure once in real data analyses. In our real data analysis with 72 phenotypes and 288,647 SNPs, it takes only 1.5 hours with 1,000 perturbations to obtain the optimal number of network modules on a macOS machine (2.7 GHz Quad-Core Intel Core i7, 16 GB memory).

In this paper, the multiple phenotype association test based on the network module uses association information twice. We first use association information to detect communities and cluster phenotypes into different groups; we then use the association information to perform the multiple phenotype association test. One may ask whether using the association information twice inflates the type I error rates of the multiple phenotype association test. However, the community detection uses association information between all SNPs and all phenotypes, while the multiple phenotype association test considers only one SNP. Based on our simulation studies, the first use of association information affects the multiple phenotype association test only slightly and is not enough to affect the type I error rates.

In summary, the proposed GPN provides a new insight to investigate the genetic correlation among phenotypes. In particular, when the phenotypes have extremely unbalanced case-control ratios, the weight of an edge in the signed bipartite network can be calculated based on the saddlepoint approximation. The power of multiple phenotype association tests based on network modules detected by GPN is improved by incorporating genetic information into the phenotypic clustering. Therefore, the proposed method can be applied to large-scale data across multiple related traits and diseases (e.g., biobank data sets).

Methods

Consider a sample with n unrelated individuals, indexed by i = 1,⋯,n. Suppose each individual has a total of K phenotypes and M SNPs. Let $Y = (y_{ik})$ be an $n \times K$ matrix of K phenotypes, where $y_{ik}$ denotes the phenotype value of the ith individual for the kth phenotype. The phenotypes can be quantitative or qualitative, including binary phenotypes with extremely unbalanced case-control ratios. Let $G = (g_{im})$ be an $n \times M$ matrix of genotypes, where $g_{im}$ represents the genotypic score of the ith individual at the mth SNP, that is, the number of minor alleles that the ith individual carries at the SNP.

Construction of the genotype and phenotype network

We first introduce a signed bipartite genotype and phenotype network (GPN) (Fig 1A). The weight of an edge represents the strength of the association between its two nodes (one phenotype and one genotype). The association has two possible directions, positive and negative. The adjacency matrix of GPN is a $K \times M$ matrix $T = (T_{km})$, where $T_{km}$ represents the strength of the association between the kth phenotype and the mth SNP. To calculate the adjacency matrix $T$, we consider both the strengths and the directions of the associations. We first consider the case with no covariates. The strength of the association $T_{km}$ can be estimated from the score test statistic $S_{km}$ and its p-value $p_{km}$ under the generalized linear models [63]

$$g\big(E(y_{ik} \mid g_{im})\big) = \beta_{0km} + \beta_{km}\, g_{im}, \qquad k = 1,\dots,K,\; m = 1,\dots,M.$$

Here, $E(y_{ik} \mid g_{im})$ is the conditional mean of the phenotype given the genotype and $g(\cdot)$ is a monotonic link function. Two commonly used link functions are the identity link for quantitative traits and the logit link for binary traits. If there are p covariates for the ith individual, $x_{i1},\dots,x_{ip}$, we adjust genotype and phenotype for the covariates using the following linear models proposed by Price et al. [64] and Sha et al. [65],

$$y_{ik} = \alpha_{0k} + \alpha_{1k} x_{i1} + \cdots + \alpha_{pk} x_{ip} + \varepsilon_{ik}, \qquad g_{im} = \gamma_{0m} + \gamma_{1m} x_{i1} + \cdots + \gamma_{pm} x_{ip} + \tau_{im},$$

where $\varepsilon_k = (\varepsilon_{1k},\dots,\varepsilon_{nk})^T$ and $\tau_m = (\tau_{1m},\dots,\tau_{nm})^T$ denote the error terms of the kth phenotype and the mth SNP, respectively. We use the residuals of the respective linear model to replace the original genotypes and phenotypes.
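The covariate adjustment described above amounts to replacing each phenotype and each SNP by its least-squares residual. The following is a minimal sketch, assuming an intercept is included in each regression; the function name `residualize` and the toy data are hypothetical and are not taken from the authors' software.

```python
import numpy as np

def residualize(values: np.ndarray, covariates: np.ndarray) -> np.ndarray:
    """Residuals of each column of `values` (n x q) after ordinary least
    squares on `covariates` (n x p) plus an intercept."""
    n = covariates.shape[0]
    design = np.column_stack([np.ones(n), covariates])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef

# Toy data (assumed sizes): 200 individuals, 3 covariates, 2 phenotypes, 4 SNPs.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, 2)) + rng.normal(size=(n, 2))
G = rng.binomial(2, 0.3, size=(n, 4)).astype(float)

Y_adj = residualize(Y, X)   # adjusted phenotypes
G_adj = residualize(G, X)   # adjusted genotypes
```

The adjusted matrices can then be used in place of the original phenotypes and genotypes in the association step.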

For quantitative traits or binary traits with fairly balanced case-control ratios, we can use the normal approximation of the standardized score statistic $S_{km}$ to calculate the p-value $p_{km}$ under the null hypothesis that the kth phenotype and the mth SNP have no association. Dey et al. [47] pointed out that a normal approximation of $S_{km}$ has inflated type I error rates for binary traits with unbalanced case-control ratios. Therefore, we use the saddlepoint approximation to calculate the p-value $p_{km}$ for phenotypes with unbalanced, especially extremely unbalanced, case-control ratios [47]. We define the (k,m)th element of the adjacency matrix of GPN, $T_{km}$, as

$$T_{km} = \mathrm{sign}(S_{km})\,\sqrt{F_{\chi^2_1}^{-1}(1 - p_{km})},$$

where $F_{\chi^2_1}(\cdot)$ denotes the CDF of $\chi^2_1$. That is, we use $\mathrm{sign}(S_{km})$ to define the direction of the association and $\sqrt{F_{\chi^2_1}^{-1}(1 - p_{km})}$ to define the strength of the association. $T_{km} > 0$ and $T_{km} < 0$ represent the two directions of the association between the kth phenotype and the mth SNP. If $T_{km} > 0$, the minor allele of the mth SNP is a risk allele for the kth phenotype; if $T_{km} < 0$, the minor allele of the mth SNP is a protective allele for the kth phenotype.
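As an illustration of how an edge weight could be computed, the sketch below assumes that the weight is the sign of the score statistic times the square root of the chi-square (1 df) quantile of 1 − p. The score statistic uses a standard no-covariate form with a normal approximation, which is an assumption; the saddlepoint approximation of Dey et al. [47] is not implemented here. The function names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2, norm

def score_z(y: np.ndarray, g: np.ndarray) -> float:
    """Standardized score statistic for one phenotype and one SNP
    (no covariates; assumed standard form)."""
    yc = y - y.mean()
    gc = g - g.mean()
    s = np.sum(gc * yc)                      # score statistic
    var_s = np.sum(gc ** 2) * np.mean(yc ** 2)   # its estimated null variance
    return float(s / np.sqrt(var_s))

def edge_weight(direction: float, p_value: float) -> float:
    """sign(direction) * sqrt(F^{-1}(1 - p)), with F the chi-square(1) CDF.
    chi2.isf(p, 1) equals F^{-1}(1 - p) and is numerically stable for small p."""
    return float(np.sign(direction) * np.sqrt(chi2.isf(p_value, df=1)))

# Toy example: one quantitative phenotype and one SNP.
rng = np.random.default_rng(1)
g = rng.binomial(2, 0.3, size=500).astype(float)
y = 0.2 * g + rng.normal(size=500)

z = score_z(y, g)
p = 2.0 * norm.sf(abs(z))     # two-sided normal-approximation p-value
t = edge_weight(z, p)         # signed edge weight
```

Under the normal approximation the transform simply returns the standardized statistic; its value lies in accepting a p-value from any source (for example, a saddlepoint approximation or GWAS summary statistics) together with a direction.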

Although a bipartite network may give the most complete representation of a particular network, it is often convenient to work with just one type of node, that is, phenotypes or genotypes. The Phenotype and Phenotype Network (PPN) is the one-mode projection of GPN on phenotypes. In PPN, nodes represent only phenotypes (Fig 1B). Let $W = (W_{kl})$ denote the adjacency matrix of the PPN, in which each edge has a positive or negative weight. We define $W_{kl}$ as the weight of the edge connecting the kth and lth phenotypes.

Here, $W_{kl}$ is the genetic correlation between the kth and lth phenotypes computed from the association strengths $T_{km}$ and $T_{lm}$ across the SNPs m = 1,⋯,M. Thus, the PPN is also a signed network.
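To make the projection concrete, the sketch below computes a signed phenotype-by-phenotype similarity from the rows of T. The exact formula for the edge weight is not reproduced in this text, so the uncentered correlation (cosine similarity) used here is an assumption; a centered Pearson correlation of the rows would be another reasonable reading. The function name is hypothetical.

```python
import numpy as np

def project_to_ppn(T: np.ndarray) -> np.ndarray:
    """K x K signed similarity between the rows of T (K phenotypes x M SNPs),
    computed as cosine similarity. Assumes no row of T is identically zero."""
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    U = T / norms
    return U @ U.T

# Toy adjacency matrix: phenotype 2 shares the direction of phenotype 1,
# phenotype 3 has the opposite direction at every SNP.
T_toy = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0],
                  [-1.0, -2.0, -3.0]])
W_toy = project_to_ppn(T_toy)
```

Because the entries of T carry a sign, the projected weights can be negative, which is what makes the projected network signed.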

Community detection method

We apply a powerful community detection method to partition K phenotypes into disjoint network modules using the Ward hierarchical clustering method with a similarity matrix defined by the genetic correlation matrix $W$ [48]. The number of network modules is determined by the following perturbation procedure [66]. In detail, we first use the Ward hierarchical clustering method to group the K phenotypes into $k_0$ ($k_0 = 1,\dots,K-1$) clusters and build the $K \times K$ connectivity matrix $\Pi^{(k_0)} = (\pi^{(k_0)}_{kl})$, with the (k,l)th element given by

$$\pi^{(k_0)}_{kl} = \begin{cases} 1, & \text{if the kth and lth phenotypes belong to the same cluster,} \\ 0, & \text{otherwise.} \end{cases}$$
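The clustering step and the connectivity matrix can be sketched as follows. This is an illustration under stated assumptions: the similarity matrix is converted to a dissimilarity as 1 − W before Ward linkage (the conversion used by the authors is not specified here), and the function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ward_labels(W: np.ndarray, k0: int) -> np.ndarray:
    """Cluster phenotypes into k0 groups with Ward linkage on 1 - W (assumed)."""
    D = 1.0 - W
    D = (D + D.T) / 2.0            # enforce exact symmetry
    np.fill_diagonal(D, 0.0)       # a phenotype is at distance 0 from itself
    Z = linkage(squareform(D, checks=False), method="ward")
    return fcluster(Z, t=k0, criterion="maxclust")

def connectivity(labels) -> np.ndarray:
    """K x K matrix with entry 1 if two phenotypes share a cluster, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

# Toy similarity matrix with two obvious modules: {0, 1} and {2, 3}.
W_toy = np.array([[1.0, 0.9, -0.2, -0.2],
                  [0.9, 1.0, -0.2, -0.2],
                  [-0.2, -0.2, 1.0, 0.9],
                  [-0.2, -0.2, 0.9, 1.0]])
labels = ward_labels(W_toy, k0=2)
conn = connectivity(labels)
```

Ward linkage formally assumes Euclidean distances, so applying it to 1 − W should be read as a heuristic rather than as the authors' exact procedure.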

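Then, we generate B perturbed data sets. The bth perturbed data set is generated by $T^{(b)}_{km} = T_{km} + \varepsilon_{km}$, where $\varepsilon_{km} \sim N(0,\sigma^2)$, $\sigma^2 = \mathrm{median}(\mathrm{var}(T_1),\dots,\mathrm{var}(T_M))$, and $T_m = (T_{1m},\dots,T_{Km})$. We denote the connectivity matrix of the $k_0$-cluster solution based on the bth perturbed data set by $\Pi^{(k_0,b)}$. For each $k_0$, the B perturbed connectivity matrices are summarized, the empirical CDF of the elements of the resulting matrix is computed, and the area under the curve of this CDF is recorded. The optimal number of network modules, C, is then selected by comparing these areas across $k_0 = 1,\dots,K-1$ [66].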
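The perturbation step can be sketched as below. The noise variance follows the definition in the text (the median over SNPs of the variance of each column of T); the use of the sample variance (`ddof=1`), the stand-in clustering function, and all names are assumptions. The rule that turns the perturbed connectivity matrices into the optimal number of modules is deliberately not reproduced; only the averaging over perturbations is shown.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def perturb(T: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """One perturbed copy of T (K x M): T + eps with eps ~ N(0, sigma^2),
    sigma^2 = median over SNPs of var(T_m) (sample variance assumed)."""
    sigma2 = np.median(np.var(T, axis=0, ddof=1))
    return T + rng.normal(0.0, np.sqrt(sigma2), size=T.shape)

def ward_on_rows(T: np.ndarray, k0: int) -> np.ndarray:
    """Stand-in clustering: cosine similarity of rows, then Ward on 1 - W."""
    U = T / np.linalg.norm(T, axis=1, keepdims=True)
    D = np.clip(1.0 - U @ U.T, 0.0, None)
    D = (D + D.T) / 2.0
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="ward")
    return fcluster(Z, t=k0, criterion="maxclust")

def mean_connectivity(T, k0, B, rng, cluster_fn=ward_on_rows) -> np.ndarray:
    """Average K x K connectivity matrix over B perturbed data sets."""
    K = T.shape[0]
    total = np.zeros((K, K))
    for _ in range(B):
        labels = np.asarray(cluster_fn(perturb(T, rng), k0))
        total += labels[:, None] == labels[None, :]
    return total / B

rng = np.random.default_rng(1)
T_toy = rng.normal(size=(6, 50))      # 6 phenotypes, 50 SNPs (toy values)
P_bar = mean_connectivity(T_toy, k0=2, B=20, rng=rng)
```

Each entry of the averaged matrix is the fraction of perturbed data sets in which two phenotypes fall into the same cluster, so values near 0 or 1 indicate a stable partition.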

We can use the identified C network modules to further investigate the associations between phenotypes in each network module and SNPs.

Multiple phenotype association tests

After we obtain C network modules for the phenotypes, we apply a multiple phenotype association test to test the association between the phenotypes in each of the C network modules and a SNP. Any multiple phenotype association test can be applied here. In this article, we apply six commonly used multiple phenotype association tests to each network module: ceCLC [49], CLC [12], HCLC [50], MultiPhen [20], O’Brien [15], and Omnibus [12] (see details in Text A in S1 Text). A Bonferroni correction is then used to adjust for multiple testing across the C network modules when testing whether the phenotypes in the C network modules are associated with a SNP.
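The Bonferroni step across modules can be illustrated with a short sketch. It assumes that one p-value per network module is available for a given SNP from whichever multiple phenotype association test is used; the function name and the numbers are hypothetical.

```python
def module_adjusted_pvalue(module_pvalues):
    """Bonferroni-adjusted p-value for one SNP across C network modules:
    C times the smallest module-level p-value, capped at 1."""
    c = len(module_pvalues)
    return min(1.0, c * min(module_pvalues))

# Toy example: three modules tested against the same SNP.
p_overall = module_adjusted_pvalue([0.2, 0.001, 0.5])
p_capped = module_adjusted_pvalue([0.6, 0.9])
```

Equivalently, the SNP is declared significant at level α if the smallest module-level p-value is below α/C.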

Data simulation

We conduct comprehensive simulation studies to evaluate the type I error rates and powers of multiple phenotype association tests based on network modules detected by GPN and compare them to the powers of the corresponding tests without considering network modules. To evaluate the performance of our proposed method, we consider different types of phenotypes: (i) mixture phenotypes: half quantitative and half qualitative with balanced case-control ratios, and (ii) binary phenotypes: all qualitative but with extremely unbalanced case-control ratios. We generate N individuals with M SNPs and K phenotypes. The genotypes at M SNPs are generated according to the minor allele frequency (MAF) under Hardy-Weinberg Equilibrium (HWE). Below, we first describe how to generate quantitative phenotypes. Suppose that there are C phenotypic categories and k = K/C phenotypes in each phenotypic category. Let $Y_c = (y_{c1},\dots,y_{ck})$ denote the phenotypes in the cth category. Similar to Sha et al. [12], we generate k quantitative phenotypes in each category using the following factor model,

$$Y_c = G B_c + C_0\, f_c \mathbf{1}_k^T + \sqrt{1 - C_0^2}\; E_c,$$

where $G = (G_1,\dots,G_M)$ is the $N \times M$ matrix of genotypes at the M SNPs, each generated from a binomial(2, MAF) distribution; $B_c$ is an $M \times k$ matrix of effect sizes of the M SNPs on the k phenotypes in the cth phenotypic category; $E_c \sim MVN_k(0,\Sigma)$ is an $N \times k$ matrix of error terms with $\Sigma = (\sigma_{ij})$, where $\sigma_{ij} = \rho^{|i-j|}$ and $\rho$ is a constant between 0 and 1; $f_c$ is a factor vector in $f = (f_1,\dots,f_C)$, which follows $MVN_C(0,\Sigma_f)$, where $\Sigma_f = (1-\rho_f) I_C + \rho_f J_C$, $\rho_f = \mathrm{corr}(f_i, f_j)$ if $i \neq j$, $J_C$ is a $C \times C$ matrix with all elements equal to 1, and $I_C$ is the identity matrix; $\mathbf{1}_k$ is the k-vector of ones; and $C_0$ is a constant that represents a proportion. Therefore, the correlation between the ith phenotype and the jth phenotype within each category is $C_0^2 + (1 - C_0^2)\rho^{|i-j|}$ and the between-category correlation is $C_0^2 \rho_f$. In our simulation studies, we use $\rho = 0.3$ and $C_0^2 = 0.5$; therefore, the maximum correlation between two phenotypes within the same category is 0.5 + 0.5 × 0.3 = 0.65. We also assign $\rho_f = 0.6$ to ensure that the between-category correlation is 0.5 × 0.6 = 0.3.
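The simulation design can be sketched as follows. This is a minimal illustration, assuming a factor model of the form Y_c = G B_c + c0 f_c 1' + sqrt(1 − c0²) E_c with unit-variance phenotypes; the parameter values c0² = 0.5 and ρ_f = 0.6 are the ones that reproduce the stated correlations of 0.65 and 0.3 under that assumption. All function names are hypothetical, and the genetic term is optional so that the null case can be generated.

```python
import numpy as np

def theoretical_correlations(rho, c0_sq, rho_f, lag=1):
    """Within-category correlation at |i - j| = lag, and between-category
    correlation, under the assumed factor model."""
    within = c0_sq + (1.0 - c0_sq) * rho ** lag
    between = c0_sq * rho_f
    return within, between

def simulate_genotypes(N, M, rng, maf_low=0.05, maf_high=0.5):
    """Genotypes under HWE: binomial(2, MAF) with MAF ~ U(maf_low, maf_high)."""
    maf = rng.uniform(maf_low, maf_high, size=M)
    return rng.binomial(2, maf, size=(N, M)), maf

def simulate_phenotypes(N, C, k, rho, c0_sq, rho_f, rng, genetic=None):
    """N x (C*k) quantitative phenotypes; `genetic` is an optional N x (C*k)
    matrix holding the genetic term G @ B_c for all categories."""
    idx = np.arange(k)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])             # AR(1) errors
    Sigma_f = (1.0 - rho_f) * np.eye(C) + rho_f * np.ones((C, C))  # factor corr.
    f = rng.multivariate_normal(np.zeros(C), Sigma_f, size=N)      # N x C
    blocks = []
    for c in range(C):
        E_c = rng.multivariate_normal(np.zeros(k), Sigma, size=N)  # N x k
        blocks.append(np.sqrt(c0_sq) * f[:, [c]] + np.sqrt(1.0 - c0_sq) * E_c)
    Y = np.hstack(blocks)
    return Y if genetic is None else Y + genetic

rng = np.random.default_rng(2024)
within, between = theoretical_correlations(rho=0.3, c0_sq=0.5, rho_f=0.6)
G_toy, maf = simulate_genotypes(N=100, M=5, rng=rng)
Y_null = simulate_phenotypes(N=20000, C=3, k=4, rho=0.3, c0_sq=0.5,
                             rho_f=0.6, rng=rng)
R = np.corrcoef(Y_null, rowvar=False)   # empirical phenotype correlations
```

In the null data the empirical correlation between adjacent phenotypes in one category should be close to `within`, and the correlation between phenotypes in different categories close to `between`.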

To generate a qualitative disease affection status, we use a liability threshold model based on a quantitative phenotype and its case-control ratio. Let $n_a$ and $n_c$ denote the number of affected individuals and the number of non-affected individuals, respectively. For a given case-control ratio r and sample size N, $n_c = N/(r+1)$ and $n_a = rN/(r+1)$. An individual is defined to be affected if the individual’s phenotype value is among the $n_a$ largest values of that phenotype. For each phenotype, the case-control ratio is randomly chosen from a set S. The set S contains all case-control ratios with the number of cases greater than 200 from UK Biobank ICD-10 code level 3 phenotypes.
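The liability threshold step can be sketched as below. The rounding of the number of cases to the nearest integer is an assumption, and the function name is hypothetical.

```python
import numpy as np

def to_case_control(liability: np.ndarray, r: float) -> np.ndarray:
    """Binary status from a quantitative liability: the n_a = r*N/(r+1)
    individuals with the largest liabilities are cases (1), the rest are
    controls (0)."""
    N = liability.shape[0]
    n_a = int(round(r * N / (r + 1.0)))
    order = np.argsort(liability)          # ascending order of liability
    status = np.zeros(N, dtype=int)
    status[order[N - n_a:]] = 1            # top n_a values become cases
    return status

rng = np.random.default_rng(7)
liab = rng.normal(size=1000)
status = to_case_control(liab, r=0.25)     # 200 cases and 800 controls
```

With r = 0.25 and N = 1,000, the formulas in the text give 200 cases and 800 controls, which is what the sketch produces.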

Based on the factor model, we consider different numbers of phenotypes, 60, 80, and 100, and different sample sizes. For mixture phenotypes, the sample sizes are 2,000 and 4,000; for binary phenotypes, the sample sizes are 10,000 and 20,000. We consider six simulation models (Text B and Table P in S1 Text) with M = 2,000 and MAF ∼ U(0.05, 0.5). The calculations of the type I error rates and power of the multiple phenotype association tests in N.O. and in NET are summarized in Text C in S1 Text.

Real dataset

The UK Biobank is a population-based cohort study with a wide variety of genetic and phenotypic information [67]. It includes ~500K people from all around the United Kingdom who were aged between 40 and 69 when recruited in 2006–2010 [44,68]. Following the genotype and phenotype preprocessing introduced in Liang et al. [50], 288,647 SNPs and 72 EHR-derived phenotypes in the diseases of the musculoskeletal system and connective tissue for 322,607 individuals are kept in our real data analysis [69] (Text D and Fig U in S1 Text). Among the 72 phenotypes, lumbar and other intervertebral disk disorders with myelopathy (M51.0) has the smallest case-control ratio, 0.000658, with 212 cases and 322,395 controls; Gonarthrosis (M17.9) has the largest case-control ratio, 0.03937, with 12,218 cases and 310,389 controls. Therefore, all of the phenotypes considered in our analysis have extremely unbalanced case-control ratios. Furthermore, each phenotype is adjusted for 13 covariates: age, sex, genotyping array, and the first 10 genetic principal components (PCs) [65]. The analysis is performed based on the adjusted phenotypes.

Correlation analysis

To compare the genetic and phenotypic correlations among the 72 EHR-derived phenotypes, we apply cross-trait LDSC regression [27] to obtain the genetic correlation and the phenotypic correlation, which can provide useful etiological insights [27]. GWAS summary statistics are generated from the associations between phenotypes and genotypes, which are calculated by the saddlepoint approximation. We use the precomputed LD scores of European individuals in the 1000 Genomes project for high-quality HapMap3 SNPs (‘eur_w_ld_chr’). For the phenotypic correlation, we consider 70 phenotypes, excluding M79.6 (Enthesopathy of lower limb) and M67.8 (Other specified disorders of synovium and tendon), since the heritabilities of these two phenotypes estimated by LDSC are out of bounds. For the genetic correlation, we consider only 52 phenotypes, excluding 20 phenotypes whose heritabilities are not significantly different from zero. We apply the hierarchical clustering method to compare the correlations of phenotypes obtained by our proposed GPN and by LDSC.

Post-GWAS analyses

Pathway enrichment analysis.

To better understand the biological functions behind the SNPs identified by one multiple phenotype association test, we identify the pathways in which the identified SNPs are involved. We use the functional annotation tool named Database for Annotation, Visualization, and Integrated Discovery bioinformatics resource (DAVID: https://david.ncifcrf.gov/) [70,71] for the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis. A mapped gene used in the pathway enrichment analysis is a gene that includes at least one identified SNP within a 20 kb window region. The biological pathways with FDR < 0.05 and enriched gene count > 2 are considered statistically significant [72].
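The two selection rules in this paragraph can be sketched as simple filters. The interpretation of the 20 kb window as a symmetric flank on both sides of the gene body is an assumption, and the gene names, coordinates, pathway names, and field names below are invented for illustration; they are not DAVID output.

```python
def map_snps_to_genes(snps, genes, window=20_000):
    """Genes whose body, extended by `window` bp on each side (assumed),
    contains at least one identified SNP.
    snps: iterable of (chrom, pos); genes: dict name -> (chrom, start, end)."""
    mapped = set()
    for name, (chrom, start, end) in genes.items():
        if any(c == chrom and start - window <= pos <= end + window
               for c, pos in snps):
            mapped.add(name)
    return mapped

def significant_pathways(rows, fdr_cut=0.05, min_count=2):
    """Pathways with FDR < fdr_cut and enriched gene count > min_count."""
    return [r["pathway"] for r in rows
            if r["fdr"] < fdr_cut and r["count"] > min_count]

# Hypothetical toy inputs.
genes = {"GENE_A": ("1", 100_000, 150_000), "GENE_B": ("1", 400_000, 450_000)}
snps = [("1", 165_000), ("2", 420_000)]
mapped = map_snps_to_genes(snps, genes)

rows = [{"pathway": "pathway_1", "fdr": 0.01, "count": 5},
        {"pathway": "pathway_2", "fdr": 0.01, "count": 2},
        {"pathway": "pathway_3", "fdr": 0.20, "count": 9}]
kept = significant_pathways(rows)
```

In the toy data, the first SNP lies 15 kb downstream of GENE_A and maps to it, while the second SNP is on a different chromosome from GENE_B and maps to nothing.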

Tissue enrichment analysis.

To prioritize and interpret the GWAS signals and identify lead SNPs, tissue enrichment analyses are performed using the Functional Mapping and Annotation (FUMA: https://fuma.ctglab.nl/) [61] platform with the GWAS signals from one multiple phenotype association test in N.O. and in NET, respectively. FUMA first performs a genic aggregation analysis of GWAS association signals to calculate gene-wise association signals using MAGMA, a commonly used tool for generalized gene-set analysis of GWAS summary statistics [73]. It then tests whether tissues and cell types are enriched for expression of the genes with gene-wise association signals. For tissue enrichment analysis, we use 30 general tissue types in the GTEx v8 reference set (https://gtexportal.org/home/).

Colocalization analysis.

As most associated variants are noncoding, it is expected that they influence disease risk through altering gene expression or splicing [74]. Colocalization analysis is a way to identify whether a GWAS SNP and a gene expression QTL are colocalized. We perform colocalization analysis using the ‘coloc’ package in R [62], a Bayesian statistical methodology that tests pairwise colocalization of eQTLs with the unique SNPs identified by ceCLC in NET and N.O. from the UK Biobank dataset. The SNP-gene associations in the Muscle Skeletal tissue are downloaded from GTEx v7. We use the default setting of the prior probabilities, $p_1 = p_2 = 10^{-4}$ and $p_{12} = 10^{-5}$, for a causal variant in an eQTL or a GWAS SNP and a shared causal variant between eQTL and GWAS SNP, respectively.
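To show how the priors enter the posterior probability PPH4, the sketch below re-expresses in Python the combination of per-SNP log approximate Bayes factors used in the published colocalization framework [62]. It is not the code of the R package; the function name and the toy Bayes factors are hypothetical, and in practice the Bayes factors would be derived from the GWAS and eQTL summary statistics.

```python
import numpy as np
from scipy.special import logsumexp

def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-SNP log approximate Bayes
    factors of two traits measured on the same SNPs in one region."""
    l1 = np.asarray(labf1, dtype=float)
    l2 = np.asarray(labf2, dtype=float)
    s1, s2, s12 = logsumexp(l1), logsumexp(l2), logsumexp(l1 + l2)
    a = s1 + s2
    # log of the sum over pairs of distinct SNPs: log(exp(a) - exp(s12))
    with np.errstate(divide="ignore"):
        log_pairs = a + np.log1p(-np.exp(min(s12 - a, 0.0)))
    log_h = np.array([
        0.0,                                    # H0: no causal variant
        np.log(p1) + s1,                        # H1: trait 1 only
        np.log(p2) + s2,                        # H2: trait 2 only
        np.log(p1) + np.log(p2) + log_pairs,    # H3: two distinct variants
        np.log(p12) + s12,                      # H4: one shared variant
    ])
    return np.exp(log_h - logsumexp(log_h))

# Toy region with 50 SNPs: both traits peak at the same SNP ...
shared1 = np.zeros(50); shared1[10] = 12.0
shared2 = np.zeros(50); shared2[10] = 12.0
pp_shared = coloc_posteriors(shared1, shared2)

# ... versus peaks at two different SNPs.
other = np.zeros(50); other[30] = 12.0
pp_distinct = coloc_posteriors(shared1, other)
```

The last element of the returned vector is PPH4; it is close to 1 when both signals peak at the same SNP and small when the peaks fall on different SNPs.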

Supporting information

S1 Text. Supplemental Texts, Tables, and Figures.

Text A. Details of the six multiple phenotype association tests.
Text B. Details of simulation models.
Text C. Comparison of methods in the simulation studies.
Text D. The preprocessing of genotypes and phenotypes in the UK Biobank.
Table A-F. The estimated type I error rates of the six multiple phenotype association tests divided by the nominal significance level for 60, 80, and 100 mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) under models 1–6. The type I error rates are evaluated using 500 MC runs (equivalent to 10^6 replicates).
Table G-L. The estimated type I error rates of the six multiple phenotype association tests divided by the nominal significance level for 60, 80, and 100 binary phenotypes (with extremely unbalanced case-control ratios) under models 1–6. The type I error rates are evaluated using 500 MC runs (equivalent to 10^6 replicates).
Table M. The estimated type I error rates of the six multiple phenotype association tests divided by the nominal significance level for 60, 80, and 100 mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) under null simulation (no clusters/categories of phenotypes). The type I error rates are evaluated using 500 MC runs (equivalent to 10^6 replicates).
Table N. 33 unique SNPs identified by ceCLC for testing the association in NET (one SNP) or in N.O. (32 SNPs).
Table O. ICD-10 codes and names of the 14 reported diseases shown in Table N.
Table P. Simulation settings for the six models.
Fig A. Power comparisons of the six tests as a function of effect size β under six models. The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 80 and the sample size is 2,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig B. Power comparisons of the six tests as a function of effect size β under six models. The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 60 and the sample size is 2,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig C. Power comparisons of the six tests as a function of effect size β under six models. The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 60 and the sample size is 4,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig D. Power comparisons of the six tests as a function of effect size β under six models. The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 100 and the sample size is 2,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig E. Power comparisons of the six tests as a function of effect size β under six models. The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 100 and the sample size is 4,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig F. Power comparisons of the six tests as a function of effect size β under six models. The number of binary phenotypes (with extremely unbalanced case-control ratios) is 80 and the sample size is 20,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig G. Power comparisons of the six tests as a function of effect size β under six models. The number of binary phenotypes (with extremely unbalanced case-control ratios) is 80 and the sample size is 10,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig H. Power comparisons of the six tests as a function of effect size β under six models. The number of binary phenotypes (with extremely unbalanced case-control ratios) is 60 and the sample size is 10,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig I. Power comparisons of the six tests as a function of effect size β under six models. The number of binary phenotypes (with extremely unbalanced case-control ratios) is 60 and the sample size is 20,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig J. Power comparisons of the six tests as a function of effect size β under six models. The number of binary phenotypes (with extremely unbalanced case-control ratios) is 100 and the sample size is 10,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig K. Power comparisons of the six tests as a function of effect size β under six models. The number of binary phenotypes (with extremely unbalanced case-control ratios) is 100 and the sample size is 20,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig L. Power comparisons of the six tests as a function of sample size under the six models. The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 52 in the UK Biobank. The power of all of the six tests is evaluated using 10 MC runs.
Fig M. The Manhattan plots of four different diseases based on the saddlepoint approximation. Systemic lupus erythematosus (M32.9), Sicca syndrome (M35.0), and Trigger finger (M65.3) are detected in Module III by our proposed GPN. Both Trigger finger (M65.3) and Synovitis and tenosynovitis (M65.9) are classified into the same ICD-codes category (M65). The horizontal red dashed line represents the commonly used genome-wide significance threshold of 5×10^−8.
Fig N. Dendrogram of hierarchical clustering method based on the genetic correlation of phenotypes obtained by GPN and the phenotypic correlation estimated by LDSC, respectively.
Fig O. Dendrogram of hierarchical clustering method based on the genetic correlation of phenotypes obtained by GPN and the genetic correlation estimated by LDSC, respectively.
Fig P. The Venn diagrams of the number of significant SNPs identified by ceCLC, CLC, HCLC, O’Brien, and Omnibus in N.O. and NET.
Fig Q. The QQ plots and inflation factors λ in each of 8 network modules for different tests in the real data analysis.
Fig R. Tissue expression analysis for mapped genes identified by ceCLC in N.O. (a) and NET (b), respectively.
Fig S. Colocalization signals. Lead SNPs are selected for colocalization analysis when the top associated SNP identified by ceCLC was also associated with gene expression in the Muscle Skeletal tissue.
Fig T. Power comparisons of the six tests under six models. The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 60 and the sample size is 1,000. The power of all of the six tests is evaluated using 10 MC runs.
Fig U. Flow chart of UK Biobank data preprocessing. Pre-process on phenotype: i. Select White British subjects (White British); ii. Remove individuals who are marked as outliers for heterozygosity or missing rates (Low Heterozygosity); iii. Exclude individuals who have been identified to have ten or more third-degree relatives or closer (Not Three-degree Relatives); iv. Remove individuals having very similar ancestry based on a principal component analysis of the genotypes (Similar Ancestry); v. Remove individuals based on removal by the UK Biobank (Removal by the UK Biobank). Quality controls (QCs) on genotype: Filter out genetic variants with i. Missing rate larger than 5% (“--geno 0.05”), ii. Hardy-Weinberg equilibrium exact test p-values less than 10^−6 (“--hwe 1e-6”), iii. Minor allele frequency (MAF) less than 5% (“--maf 0.05”). We also filter out individuals with iv. Missing rate larger than 5% (“--mind 0.05”), v. Individuals without sex (“--no-sex”).

https://doi.org/10.1371/journal.pgen.1011245.s001

(DOCX)

Acknowledgments

Part of this research has been conducted using the UK Biobank Resource under application number 102999 and the NHGRI-EBI GWAS Catalog. High-Performance Computing Shared Facility (Superior) at Michigan Technological University was used in obtaining results presented in this publication. Some parts of this work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges-2 system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).

References

  1. 1. Fine RS, Pers TH, Amariuta T, Raychaudhuri S, Hirschhorn JN. Benchmarker: an unbiased, association-data-driven strategy to evaluate gene prioritization algorithms. The American Journal of Human Genetics. 2019;104(6):1025–39. pmid:31056107
  2. 2. Li R, Duan R, Kember RL, Rader DJ, Damrauer SM, Moore JH, et al. A regression framework to uncover pleiotropy in large-scale electronic health record data. Journal of the American Medical Informatics Association. 2019;26(10):1083–90. pmid:31529123
  3. 3. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics. 2017;101(1):5–22. pmid:28686856
  4. 4. Bush WS, Oetjens MT, Crawford DC. Unravelling the human genome–phenome relationship using phenome-wide association studies. Nature Reviews Genetics. 2016;17(3):129–45. pmid:26875678
  5. 5. Pendergrass SA, Brown-Gentry K, Dudek S, Frase A, Torstenson ES, Goodloe R, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet. 2013;9(1):e1003087. pmid:23382687
  6. 6. Denny JC, Bastarache L, Roden DM. Phenome-wide association studies as a tool to advance precision medicine. Annual review of genomics and human genetics. 2016;17:353–73. pmid:27147087
  7. 7. Pendergrass SA, Dudek SM, Crawford DC, Ritchie MD. Visually integrating and exploring high throughput phenome-wide association study (PheWAS) results using PheWAS-view. BioData mining. 2012;5(1):1–11.
  8. 8. Verma A, Bang L, Miller JE, Zhang Y, Lee MTM, Zhang Y, et al. Human-disease phenotype map derived from PheWAS across 38,682 individuals. The American Journal of Human Genetics. 2019;104(1):55–64. pmid:30598166
  9. 9. Lee CH, Shi H, Pasaniuc B, Eskin E, Han B. PLEIO: a method to map and interpret pleiotropic loci with GWAS summary statistics. The American Journal of Human Genetics. 2021;108(1):36–48. pmid:33352115
  10. 10. Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics. 2013;14(7):483–95. pmid:23752797
  11. 11. Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature methods. 2014;11(4):407–9. pmid:24531419
  12. 12. Sha Q, Wang Z, Zhang X, Zhang S. A clustering linear combination approach to jointly analyze multiple phenotypes for GWAS. Bioinformatics. 2019;35(8):1373–9. pmid:30239574
  13. 13. Stephens M. A unified framework for association analysis with multiple related phenotypes. PloS one. 2013;8(7):e65245. pmid:23861737
  14. 14. Yang Q, Wang Y. Methods for analyzing multivariate phenotypes in genetic association studies. Journal of probability and statistics. 2012;2012. pmid:24748889
  15. 15. O’Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984:1079–87. pmid:6534410
  16. 16. Liang X, Wang Z, Sha Q, Zhang S. An adaptive Fisher’s combination method for joint analysis of multiple phenotypes in association studies. Scientific reports. 2016;6(1):1–10.
  17. 17. Kim J, Bai Y, Pan W. An adaptive association test for multiple phenotypes with GWAS summary statistics. Genetic epidemiology. 2015;39(8):651–63. pmid:26493956
  18. 18. Yang JJ, Li J, Williams LK, Buu A. An efficient genome-wide association test for multivariate phenotypes based on the Fisher combination function. BMC bioinformatics. 2016;17(1):1–11. pmid:26729364
  19. 19. Cole DA, Maxwell SE, Arvey R, Salas E. How the power of MANOVA can both increase and decrease as a function of the intercorrelations among the dependent variables. Psychological bulletin. 1994;115(3):465.
  20. 20. O’Reilly PF, Hoggart CJ, Pomyen Y, Calboli FC, Elliott P, Jarvelin M-R, et al. MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PloS one. 2012;7(5):e34861. pmid:22567092
  21. 21. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982:963–74. pmid:7168798
  22. 22. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
  23. 23. Tang CS, Ferreira MA. A gene-based test of association using canonical correlation analysis. Bioinformatics. 2012;28(6):845–50. pmid:22296789
  24. 24. Aschard H, Vilhjálmsson BJ, Greliche N, Morange P-E, Trégouët D-A, Kraft P. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. The American Journal of Human Genetics. 2014;94(5):662–76. pmid:24746957
  25. 25. Wang Z, Sha Q, Zhang S. Joint analysis of multiple traits using" optimal" maximum heritability test. PloS one. 2016;11(3):e0150975. pmid:26950849
  26. Hawkins RD, Hon GC, Ren B. Next-generation genomics: an integrative approach. Nature Reviews Genetics. 2010;11(7):476–86. pmid:20531367
  27. Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh P-R, et al. An atlas of genetic correlations across human diseases and traits. Nature genetics. 2015;47(11):1236. pmid:26414676
  28. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics. 2017;18(2):117. pmid:27840428
  29. O’Connor LJ, Price AL. Distinguishing genetic correlation from causation across 52 diseases and complex traits. Nature genetics. 2018;50(12):1728–34. pmid:30374074
  30. Goh K-I, Cusick ME, Valle D, Childs B, Vidal M, Barabási A-L. The human disease network. Proceedings of the National Academy of Sciences. 2007;104(21):8685–90. pmid:17502601
  31. Gaynor SM, Fagny M, Lin X, Platig J, Quackenbush J. Connectivity in eQTL networks dictates reproducibility and genomic properties. Cell Reports Methods. 2022;2(5):100218. pmid:35637906
  32. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statistical applications in genetics and molecular biology. 2005;4(1). pmid:16646834
  33. Gao C, Kim J, Pan W, Alzheimer’s Disease Neuroimaging Initiative. Adaptive testing of SNP-brain functional connectivity association via a modular network analysis. Pacific Symposium on Biocomputing 2017. World Scientific; 2017.
  34. Zhu L, Lei J, Devlin B, Roeder K. Testing high-dimensional covariance matrices, with application to detecting schizophrenia risk genes. The annals of applied statistics. 2017;11(3):1810. pmid:29081874
  35. Tripathi B, Parthasarathy S, Sinha H, Raman K, Ravindran B. Adapting community detection algorithms for disease module identification in heterogeneous biological networks. Frontiers in genetics. 2019;10:164. pmid:30918511
  36. Newman M. Networks. Oxford University Press; 2018.
  37. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment. 2008;2008(10):P10008.
  38. Fortunato S, Barthelemy M. Resolution limit in community detection. Proceedings of the national academy of sciences. 2007;104(1):36–41. pmid:17190818
  39. Clauset A, Newman ME, Moore C. Finding community structure in very large networks. Physical review E. 2004;70(6):066111. pmid:15697438
  40. Newman ME, Girvan M. Finding and evaluating community structure in networks. Physical review E. 2004;69(2):026113. pmid:14995526
  41. Newman ME. Communities, modules and large-scale structure in networks. Nature physics. 2012;8(1):25–31.
  42. Fortunato S, Hric D. Community detection in networks: A user guide. Physics reports. 2016;659:1–44.
  43. Barber MJ. Modularity and community detection in bipartite networks. Physical Review E. 2007;76(6):066102. pmid:18233893
  44. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. Plos med. 2015;12(3):e1001779. pmid:25826379
  45. Kohane IS. Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics. 2011;12(6):417–28. pmid:21587298
  46. Pendergrass SA, Crawford DC. Using electronic health records to generate phenotypes for research. Current protocols in human genetics. 2019;100(1):e80. pmid:30516347
  47. Dey R, Schmidt EM, Abecasis GR, Lee S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. The American Journal of Human Genetics. 2017;101(1):37–49. pmid:28602423
  48. Xie H, Cao X, Zhang S, Sha Q. Joint analysis of multiple phenotypes for extremely unbalanced case-control association studies. Genetic Epidemiology. 2023. pmid:36691904
  49. Wang M, Zhang S, Sha Q. A computationally efficient clustering linear combination approach to jointly analyze multiple phenotypes for GWAS. PloS one. 2022;17(4):e0260911. pmid:35482827
  50. Liang X, Cao X, Sha Q, Zhang S. HCLC-FC: A novel statistical method for phenome-wide association studies. Plos one. 2022;17(11):e0276646. pmid:36350801
  51. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research. 2003;13(11):2498–504. pmid:14597658
  52. Tachmazidou I, Hatzikotoulas K, Southam L, Esparza-Gordillo J, Haberland V, Zheng J, et al. Identification of new therapeutic targets for osteoarthritis through genome-wide analyses of UK Biobank data. Nature genetics. 2019;51(2):230–6. pmid:30664745
  53. Kim SK, Nguyen C, Jones KB, Tashjian RZ. A Genome Wide Association Study For Shoulder Impingement and Rotator Cuff Disease. Journal of Shoulder and Elbow Surgery. 2021. pmid:33482370
  54. Johnston KJ, Adams MJ, Nicholl BI, Ward J, Strawbridge RJ, Ferguson A, et al. Genome-wide association study of multisite chronic pain in UK Biobank. PLoS genetics. 2019;15(6):e1008164. pmid:31194737
  55. Gorlova O, Martin J-E, Rueda B, Koeleman BP, Ying J, Teruel M, et al. Identification of novel genetic markers associated with clinical phenotypes of systemic sclerosis through a genome-wide association strategy. PLoS Genet. 2011;7(7):e1002178. pmid:21779181
  56. Terao C, Yamada R, Ohmura K, Takahashi M, Kawaguchi T, Kochi Y, et al. The human AIRE gene at chromosome 21q22 is a genetic determinant for the predisposition to rheumatoid arthritis in Japanese population. Human molecular genetics. 2011;20(13):2680–5. pmid:21505073
  57. Aterido A, Cañete JD, Tornero J, Ferrándiz C, Pinto JA, Gratacós J, et al. Genetic variation at the glycosaminoglycan metabolism pathway contributes to the risk of psoriatic arthritis but not psoriasis. Annals of the Rheumatic diseases. 2019;78(3):355–64.
  58. Renauer PA, Saruhan-Direskeneli G, Coit P, Adler A, Aksu K, Keser G, et al. Identification of susceptibility loci in IL6, RPS9/LILRB3, and an intergenic locus on chromosome 21q22 in Takayasu arteritis in a genome-wide association study. Arthritis & rheumatology. 2015;67(5):1361–8. pmid:25604533
  59. Chung SA, Brown EE, Williams AH, Ramos PS, Berthier CC, Bhangale T, et al. Lupus nephritis susceptibility loci in women with systemic lupus erythematosus. Journal of the American Society of Nephrology. 2014;25(12):2859–70. pmid:24925725
  60. Cordero AIH, Gonzales NM, Parker CC, Sokolof G, Vandenbergh DJ, Cheng R, et al. Genome-wide associations reveal human-mouse genetic convergence and modifiers of myogenesis, CPNE1 and STC2. The American Journal of Human Genetics. 2019;105(6):1222–36. pmid:31761296
  61. Watanabe K, Taskesen E, Van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nature communications. 2017;8(1):1–11.
  62. Hormozdiari F, Van De Bunt M, Segre AV, Li X, Joo JWJ, Bilow M, et al. Colocalization of GWAS and eQTL signals detects target genes. The American Journal of Human Genetics. 2016;99(6):1245–60. pmid:27866706
  63. Sha Q, Zhang Z, Zhang S. Joint analysis for genome-wide association studies in family-based designs. PloS One. 2011;6(7):e21957. pmid:21799758
  64. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics. 2006;38(8):904–9. pmid:16862161
  65. Sha Q, Wang X, Wang X, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genetic epidemiology. 2012;36(6):561–71. pmid:22714994
  66. Nguyen T, Tagett R, Diaz D, Draghici S. A novel approach for data integration and disease subtyping. Genome research. 2017;27(12):2025–39. pmid:29066617
  67. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. pmid:30305743
  68. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. Genome-wide genetic data on ~500,000 UK Biobank participants. BioRxiv. 2017:166298.
  69. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4(1):s13742-015-0047-8.
  70. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research. 2009;37(1):1–13. pmid:19033363
  71. Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols. 2009;4(1):44. pmid:19131956
  72. Cao X, Liang X, Zhang S, Sha Q. Gene selection by incorporating genetic networks into case-control association studies. European Journal of Human Genetics. 2022:1–8. pmid:36529820
  73. de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLoS computational biology. 2015;11(4):e1004219. pmid:25885710
  74. Mountjoy E, Schmidt EM, Carmona M, Schwartzentruber J, Peat G, Miranda A, et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nature Genetics. 2021;53(11):1527–33. pmid:34711957