A novel method for multiple phenotype association studies based on genotype and phenotype network

Joint analysis of multiple correlated phenotypes for genome-wide association studies (GWAS) can identify and interpret pleiotropic loci which are essential to understand pleiotropy in diseases and complex traits. Meanwhile, constructing a network based on associations between phenotypes and genotypes provides a new insight to analyze multiple phenotypes, which can explore whether phenotypes and genotypes might be related to each other at a higher level of cellular and organismal organization. In this paper, we first develop a bipartite signed network by linking phenotypes and genotypes into a Genotype and Phenotype Network (GPN). The GPN can be constructed by a mixture of quantitative and qualitative phenotypes and is applicable to binary phenotypes with extremely unbalanced case-control ratios in large-scale biobank datasets. We then apply a powerful community detection method to partition phenotypes into disjoint network modules based on GPN. Finally, we jointly test the association between multiple phenotypes in a network module and a single nucleotide polymorphism (SNP). Simulations and analyses of 72 complex traits in the UK Biobank show that multiple phenotype association tests based on network modules detected by GPN are much more powerful than those without considering network modules. The newly proposed GPN provides a new insight to investigate the genetic architecture among different types of phenotypes. Multiple phenotypes association studies based on GPN are improved by incorporating the genetic information into the phenotype clustering. Notably, it might broaden the understanding of genetic architecture that exists between diagnoses, genes, and pleiotropy.


Introduction
Genome-wide association studies (GWAS) have successfully identified thousands of single nucleotide polymorphisms (SNPs) genetically associated with a wide range of complex human diseases and traits [1,2].Over the past decade, more than 10,000 associations between SNPs and diseases/traits have been discovered [3].Although GWAS have emerged as a common and powerful tool to detect the complexity of the genotype-phenotype associations, a common limitation of GWAS is that they focus on only a single phenotype at a time [4][5][6][7].Joint analysis of multiple correlated phenotypes for GWAS may provide more power to identify and interpret pleiotropic loci, which are essential to understand pleiotropy in diseases and complex traits [4,8,9].In brief, biological pleiotropy refers to a SNP or gene that has a direct biological influence on more than one phenotypic trait [10].Biological pleiotropy can offer significant insights in understanding the complex genotype-phenotype relationships [2].Therefore, multiple phenotypes are usually collected in many GWAS cohorts and jointly analyzing multiple phenotypes may increase statistical power to discover the cross-phenotype associations and pleiotropy [10][11][12][13].
Many statistical methods have been developed to jointly test the association between a SNP and multiple correlated phenotypes [14].The most widely used methods for multiple phenotype association studies can be roughly classified into three categories: 1) statistical tests based on combining either the univariate test statistics or p-values, such as O'Brien's method [15], adaptive Fisher's combination (AFC) [16], aSPU [17], and others [18]; 2) multivariate analyses based on regression methods, such as multivariate analysis of variance (MANOVA) [19], reverse regression methods (MultiPhen) [20], linear mixed effect models (LMM) [21], and generalized estimating equations (GEE) [22]; and 3) dimension reduction methods, such as clustering linear combination (CLC) [12], canonical correlation analysis (CCA) [23], and principal components analysis (PCA) [24,25].However, most phenotypes are influenced by many SNPs that act in concert to alter cellular function [26], the above mentioned methods are only based on phenotypic correlation without considering the genetic correlation among phenotypes.Therefore, these methods may loss statistical power to detect the true pleiotropic effects comparing the methods based on genetic architecture among complex diseases.To address this issue, numerous types of algorithms to investigate the genetic correlation among complex traits and diseases have been developed [27][28][29].Many of these algorithms are often in conjunction with linkage disequilibrium (LD) information by using GWAS summary association data [28].For example, cross-trait LD score regression has been developed to estimate genetic and phenotypic correlation that requires only GWAS summary statistics and is not biased by overlapping samples [27].
Cytoscape: an open source software platform for visualizing complex networks which can be accessed via https://cytoscape.org/[71].
In 2007, a conceptually different approach based on the human disease network had been developed, exploring whether human complex traits and the corresponding genotypes might be related to each other at a higher level of cellular and organismal organization [30].Network analyses provide an integrative approach to characterize complex genomic associations [31].Over the past decade, network methodologies, particularly Weighted Gene Co-expression Network Analysis (WGCNA) [32], have become increasingly popular in genetic association studies.This popularity is due to their effectiveness in identifying complex patterns of gene expression and clarifying the relationships among genes.Researchers have applied WGCNA to unravel the genetic underpinnings of complex traits, focusing primarily on gene-gene interactions [33,34].However, constructing networks that map the associations between phenotypes and genotypes can provide fresh perspectives, enabling the simultaneous analysis of multiple phenotypes and SNPs.Notably, it might broaden the understanding of genetic architecture that exists between diagnoses, genes, and pleiotropy [8].Modules detected from human disease networks are useful in providing insights pertaining to biological functionality [35].Therefore, community detection methods play a key role in understanding the global and local structures of disease interaction, in shedding light on association connections that may not be easily visible in the network topology [36].Many community detection methods have been applied from social networks to human disease networks, such as Louvain's method [8] with modularity as a measure and core module identification to identify small and structurally well-defined communities [35].However, most community detection methods have been developed for unsigned networks [37][38][39][40][41][42][43].
To date, many biobanks, such as the UK Biobank [44], aggregate data across tens of thousands of phenotypes and provide a great opportunity to construct the human disease network and perform joint analyses of multiple correlated phenotypes.The electronic health record (EHR)-driven genomic research (EDGR) workflow is the most popular way to analyze multiple diagnosis codes in Biobank data, at its core, which is the use of EHR data for genomic research in the investigation of population-wide genomic characterization [45].In most EHR systems, the whole phenome can be divided into numerous phenotypic categories according to the first few characters of the International Classification of Disease (ICD) billing codes [46].However, the ICD-based categories are based on the underlying cause of death rather than on the shared genetic architecture among all complex diseases and traits.Meanwhile, the phenotypes in large biobanks usually have extremely unbalanced case-control ratios.Therefore, linking phenotypes, especially EHR-derived phenotypes, with genotypes in a network is also very important to examine the genetic architecture of complex diseases and traits.

Overview of methods
In this paper, we develop a bipartite signed network by linking phenotypes and genotypes into a Genotype and Phenotype Network (GPN; Fig 1A).The GPN can be constructed by a mixture of quantitative and qualitative phenotypes and is applicable to phenotypes with extremely unbalanced case-control ratios for large-scale biobank datasets since the saddlepoint approximation [47] is used to test the association between genotype and phenotype with extremely unbalanced case-control ratio.After projecting genotypes into phenotypes, the genetic correlation of phenotypes can be calculated based on the shared associations among all genotypes (Fig 1B).We then apply a powerful community detection method to partition phenotypes into disjoint network modules using the hierarchical clustering method and the number of modules is determined by perturbation (Fig 1C) [48].The phenotypes in each network module share the same genetic information.After partitioning phenotypes into disjoint network modules, a statistical method for multiple phenotype association studies can be applied to test the association between phenotypes in each module and a SNP, then a Bonferroni correction can be used to test if all phenotypes are associated with a SNP (Fig 1D).To jointly analyze the association between multiple phenotypes in each module with a SNP, we use six multiple phenotype association tests, including ceCLC [49], CLC [12], HCLC [50], MultiPhen [20], O'Brien [15], and Omnibus [12].The advantage of the association test based on network modules detected by GPN is that phenotypes in a network module are highly correlated based on the genetic architecture, therefore, the association test is more powerful to identify pleiotropic SNPs.After we obtain the GWAS signals from the previous steps, post-GWAS analyses can be applied to understand the high level of biological mechanism, such as pathway/tissue enrichment analysis and colocalization of GWAS signals and eQTL analysis in the specific diseaseassociated tissue (Fig 1E-1G).The construction of GPN, community detection method, and six multiple phenotype association tests have been implemented in R, which is an open-source software and publicly available on GitHub: https://github.com/xueweic/GPN.

Simulation studies
We first use extensive simulation studies to validate multiple phenotype association studies based on the newly proposed GPN.In the simulation studies, we assess the type I error rate and power with different numbers of phenotypes (60, 80, and 100), different types of phenotypes along with different sample sizes: (i) mixture phenotypes are half quantitative and half qualitative with balanced case-control ratios for sample sizes of 2,000 and 4,000, and (ii) binary phenotypes are all qualitative but with extremely unbalanced case-control ratios for sample sizes of 10,000 and 20,000.Similar to the simulation models introduced in Sha et al. [12], we generate six different models (see Data Simulation).
Type I error rates.Tables A-F in S1 Text summarize the estimated type I error rates of six multiple phenotype association tests for mixture phenotypes under models 1-6, respectively."N.O." represents the type I error rates of multiple phenotype association tests being calculated without considering network modules; "NET" presents the type I error rates of the tests being evaluated by considering network modules detected by GPN.Based on 500 Monte-Carlo (MC) runs which is the same as 10 6 replicates, the 95% confidence intervals (Cis) for type I error rates divided by nominal significance levels 0.001 and 0.0001 are (0.938, 1.062) and (0.804,1.196), respectively.The bold-faced values indicate that the values are beyond the upper bounds of the 95% Cis.We can see that almost all of the estimated type I error rates of ceCLC, CLC, HCLC, and Omnibus tests are within 95% Cis.However, O'Brien in NET has inflated type I error rates under model 6.MultiPhen has inflated type I error rates for the sample size of 2,000.If the sample size is 4000, MultiPhen in N.O.also inflates type I error rates, but MultiPhen in NET can control type I error rates for the significance level is 0.0001.Tables G-L in S1 Text summarize the estimated type I error rates of six multiple phenotype association tests for binary phenotypes with extremely unbalanced case-control ratios under models 1-6.Similar to Tables A-F in S1 Text, ceCLC, CLC, HCLC, and Omnibus have corrected type I error rates at almost all simulation settings.However, O'Brien in NET has inflated type I error rates and MultiPhen has inflated type I error rates at all scenarios.If there is no clusters of the phenotypes, we also see that only MultiPhen has inflated type I error rates and other five multiple phenotype association tests have well-controlled type I error rates (Table M in S1 Text).
Power comparisons.For power comparisons, we consider 100 causal SNPs for models 1-4 and 200 causal SNPs for models 5-6 (see Data Simulation).In each of the simulation models, the power is evaluated using 10 MC runs which is the same as 1,000 replicates for models 1-4 and 2,000 replicates for models 5-6.Meanwhile, the power is evaluated at the Bonferroni corrected significance level of 0.05 based on the number of causal SNPs in each MC run.O'Brien [15] increases a lot in the case of a SNP affecting phenotypes in different directions.(ii) ceCLC is more powerful than other tests in both N.O.and NET under the six simulation models.(iii) As sample size increases, the power of all multiple phenotype association tests increases.We also perform power comparisons for a total of 60 and 100 mixture phenotypes with 2,000 and 4,000 sample sizes for different effect sizes under the six simulation models (Figs B-E in S1 Text), respectively.We observe that the patterns of the power are similar to those observed in Figs 2 and A in S1 Text.
To mimic phenotypes in the UK Biobank, we also consider the case with all phenotypes being binary with extremely unbalanced case-control ratios.The phenotypes are generated based on extremely unbalanced case-control ratios which are randomly selected from the set of case-control ratios with cases greater than 200 from UK Biobank ICD-10 code level 3 phenotypes (see Real Dataset; case-control ratios belong to [0.000658,0.03937]).In this simulation, we consider a total of 60, 80, and 100 phenotypes along with two sample sizes, 10,000 and 20,000.

Real data analysis based on UK Biobank
Furthermore, we apply the newly proposed multiple phenotype association test based on network modules detected by GPN to a set of diseases of the musculoskeletal system and connective tissue across more than 300,000 individuals from the UK Biobank.
Network module detection.We construct GPN based on 72 EHR-derived phenotypes in the diseases of the musculoskeletal system and connective tissue with 288,647 SNPs in autosomal chromosomes in the UK Biobank.Due to all phenotypes in our analysis being extremely unbalanced, the strength of the association between phenotype and genotype is calculated by the saddlepoint approximation [47].After the construction of GPN, we apply a powerful community detection method and these 72 phenotypes are partitioned into 8 disjoint network modules (Fig 3).There are 2-37 phenotypes in each module.
We can see that the network modules are not consistent with the ICD-based categories which are based on the underlying cause of death rather than the shared genetic architecture among all complex diseases.For example, Fig 3 shows three phenotypes, M32.9 Systemic lupus erythematosus, M35.0 Sicca syndrome, and M65.3 Trigger finger, are detected in network module III (in red).However, these three phenotypes do not belong to the same ICDcategory (Data-Field 41202 in UK Biobank), where M35.0 is one of the diseases in the other systemic involvement of connective tissue (M35) and M65.3 belongs to the synovitis and tenosynovitis (M65).To investigate the genetic correlation among these three phenotypes, we use the saddlepoint approximation to test the association between each phenotype and each SNP.As shown in Fig M in S1 Text, the Manhattan plots for the three phenotypes in network module III (M32.9,M35.0, and M65.3) have a similar pattern.Although the synovitis and tenosynovitis (M65.9) and M65.3 belong to the same ICD code category (M65), the Manhattan plot of M65.9 shows that there are no SNPs significantly associated with this phenotype and the genetic correlation between M65.9 and M65.3 is not strong.Therefore, we can conclude that the community detection method based on our proposed GPN can partition phenotypes into different categories based on the shared genetic architecture.
Furthermore, we apply the hierarchical clustering method to compare the genetic correlation of phenotypes obtained by our proposed GPN and that estimated by LDSC [27].), are significantly associated with at least one of the 72 phenotypes reported in the GWAS catalog (Table N in S1 Text).rs13107325 is reported to be associated with osteoarthritis (M19.9)[52] and rotator cuff syndrome (M75.1)[53].Meanwhile, rs13107325 is mapped to gene SLC39A8 that is also reported to be significantly associated with multisite chronic pain (M25.5)[54].rs443198 is mapped to gene NOTCH4 which is associated with systemic sclerosis (M34) [55].Moreover, the mapped gene NOTCH4 is one of the most important genes reported to be associated with multiple diseases in the disease category of the musculoskeletal system and connective tissue, such as rheumatoid arthritis (M06.9)[56], psoriatic arthritis (M07.3)[57], Takayasu arteritis (M31.4)[58], systemic lupus erythematosus (M32.9)[59], and appendicular lean mass (M62.9)[60].We map these 32 unique SNPs into genes with 20 kb upstream and 20 kb downstream regions.There are 27 out of 32 SNPs with corresponding mapped genes associated with 14 phenotypes reported in the GWAS catalog (Table N in S1 Text).These 14 phenotypes and corresponding ICD-10 codes are summarized in Table O in S1 Text.
Next, we test the associations between phenotypes in each of the eight network modules detected by the GPN and each SNP.Then, we adjust the p-value of each method for testing the association between a SNP and all of the 72 phenotypes by Bonferroni correction.We adopt the commonly used genome-wide significance level 5×10  Pathway enrichment analysis.ceCLC is more powerful than the other four tests in simulations and also can identify more SNPs in real data analysis, therefore, we only perform the post-GWAS analyses of the SNPs identified by ceCLC.There are 191 mapped genes containing at least one of the 647 SNPs identified by ceCLC in N.O.and 252 mapped genes containing at least one of the 980 SNPs identified by ceCLC in NET.In this study, significantly enriched pathways are identified by those genes with false discovery rate (FDR) < 0.05.
From the pathway enrichment analyses, we observe that ceCLC based on the network modules identifies more significantly enriched pathways than that without considering network modules.Tissue enrichment analysis.To further investigate the biological mechanism, we use FUMA [61] to annotate 191 mapped genes in N.O.and 252 mapped genes in NET in terms of biological context.Due to these mapped genes associated with at least one phenotype in the Colocalization of GWAS and eQTL analysis.We perform the colocalization analysis on the 33 unique SNPs identified by ceCLC (Table N  uniquely identified SNPs by ceCLC that are selected to be the lead SNPs in the colocalization analysis.NET identifies one unique SNP, rs4148866, which is mapped to gene ABCB9.Even if gene ABCB9 has no reported associations with any diseases of the musculoskeletal system and connective tissue in the GWAS Catalog, the Bayesian posterior probability of colocalization analysis for shared variant of significant SNPs identified by ceCLC and gene expression in the Muscle Skeletal tissue (PP H4 ) is 98.4%.The higher value of PP H4 indicates that gene ABCB9 and Muscle Skeletal tissue play an important role in the disease mechanism due to the same variant responsible for a GWAS locus and also affecting gene expression [62].Among 32 unique SNPs identified by ceCLC in N.O., there are two SNPs, rs34333163 and rs6916921, selected to be the lead SNPs (Fig S in S1 Text).Both of them are reported in the GWAS Catalog that have associations with at least one of the diseases in the musculoskeletal system.However, the PP H4 values for the corresponding genes SLC38A8 and ATP6V1G2 are lower than 50%.

Discussion
In this paper, we propose a novel method for multiple phenotype association studies based on genotype and phenotype network.The construction of a bipartite signed network, GPN, is to link genotypes with phenotypes using the evidence of associations.To understand pleiotropy in diseases and complex traits and explore the genetic correlation among phenotypes, we project genotypes into phenotypes based on the GPN.We also apply a powerful community detection method to detect the network modules based on the shared genetic architecture.In contrast to previous community detection methods for disease networks, the applied method benefits from exploring the biological functionality interactions of diseases based on the signed network.Furthermore, we apply several multiple phenotype association tests to test the association between phenotypes in each network module and a SNP.Extensive simulation studies show that all multiple phenotype association tests based on network modules have corrected type I error rates if the corresponding test is a valid test for testing the association between a SNP and phenotypes without considering network modules.Most tests in NET are much more powerful than those in N.O.Meanwhile, we evaluate the performance of the association tests based on network modules detected by GPN through a set of 72 EHR-derived phenotypes in the diseases of the musculoskeletal system and connective tissue across more than 300,000 samples from the UK Biobank.Compared with the tests in N.O., all tests based on network modules can identify more potentially pleiotropic SNPs and ceCLC can identify more SNPs than other methods.
In addition, the construction of GPN does not require access to individual-level genotypes and phenotypes data, which only requires association evidence between each genotype and each phenotype.Therefore, when individual-level data are not available, this evidence can be obtained from GWAS summary statistics, such as the effect sizes (odds ratios for binary phenotypes) and corresponding p-values.The development of GPN can also be applied to omics studies, such as constructing a GPN that incorporates expression Quantitative Trait Locus (eQTLs) and gene expressions.However, in the context of numerous omics studies, the sample sizes are not very large.We have broadened our simulation analysis to include the same six simulation models, specifically targeting scenarios with the number of phenotypes of 60 and the sample size of 1,000.We observe similar results as simulations with larger sample sizes: the tests in NET are much more powerful than those in N.O (Fig T in S1 Text).Meanwhile, the simulation studies show that the powerful network community detection method can correctly partition phenotypes into several disjoint network modules based on the shared genetic architecture.Since the determination of the number of network modules in community detection method is independent of the association tests [48], we only need to perform the perturbation procedure once in real data analyses.In our real data analysis with 72 phenotypes and 288,647 SNPs, it only takes 1.5 hour with 1,000 perturbations to obtain the optimal number of network modules on a macOS (2.7 GHz Quad-Core Intel Core i7, 16 GB memory).
In this paper, the multiple phenotype association test based on the network module uses association information twice.We first use association information to detect communities and to cluster phenotypes into different groups, then we use the association information to perform the multiple phenotype association test.One may doubt whether the multiple phenotype association test has inflated type I error rates by using the association information twice.However, the community detection uses association information between all SNPs and all phenotypes, while the multiple phenotype association test only considers one SNP.Based on our simulation studies, the first time use of association information only affects the multiple phenotype association test slightly and is not enough to affect the type I error rates.
In summary, the proposed GPN provides a new insight to investigate the genetic correlation among phenotypes.Especially when the phenotypes have extremely unbalanced case-control ratios, the weight of an edge in the signed bipartite network can be calculated based on the saddlepoint approximation.The power of multiple phenotype association tests based on network modules detected by GPN are improved by incorporating the genetic information into the phenotypic clustering.Therefore, the proposed method can be applied to large-scale data across multiple related traits and diseases (i.e., biobanks data set, etc.).

Methods
Consider a sample with n unrelated individuals, indexed by i = 1,� � �,n.Suppose each individual has a total of K phenotypes and M SNPs.Let Y = (y ik ) be an n×K matrix of K phenotypes, where y ik denotes the phenotype value of the i th individual for the k th phenotype.The phenotypes can be both quantitative and qualitative, especially for phenotypes with extremely unbalanced case-control ratios.Let G = (g im ) be an n×M matrix of genotypes, where g im represents the genotypic score of the i th individual at the m th SNP which is the number of minor alleles that the i th individual carries at the SNP.

Construction of the genotype and phenotype network
We first introduce a signed bipartite genotype and phenotype network (GPN) (Fig 1A).The weight of an edge represents the strength of the association between the two nodes (one is the phenotype and the other one is the genotype).The strength of the association has two directions, positive and negative.The adjacency matrix of GPN is a K×M matrix T = (T km ), where T km represents the strength of the association between the k th phenotype and the m th SNP.To calculate the adjacency matrix T, we consider both the strengths and the directions of the associations.We first consider that there are no covariates.The strength of the association T km can be estimated by the score test statistic S km ¼ X n i¼1 ðy ik À � y k Þg im and its p-value p km under the generalized linear models gðEðy ik jg im ÞÞ Here, � y k ¼ X n i¼1 y ik =n and g( ) is a monotonic link function.Two commonly used link functions are the identity link for quantitative traits and the logit link for binary traits.If there are p covariates for the i th individual, x i1 ,� � �,x ip , we adjust genotype and phenotype for the covariates using the following linear models proposed by Price et al. [64] and Sha et al. [65], where ε k = (ε 1k ,� � �,ε nk ) T and τ m = (τ 1m ,� � �,τ nm ) T denote the error terms of the k th phenotype and the m th SNP, respectively.We use the residuals of the respective linear model to replace the original genotypes and phenotypes.
For quantitative traits or binary traits with fairly balanced case-control ratios, we can use the normal approximation of S km � Nð0; s 2 km Þ to calculate p-value p km under the null hypothesis that the k th phenotype and the m th SNP have no association, where Dey et al. [47] pointed out that a normal approximation of S km has inflated type I error rates for binary traits with unbalanced case-control ratios.Therefore, we use saddlepoint approximation to calculate the p-value p km for the phenotypes with unbalanced, especially extremely unbalanced case-control ratios [47].We define the (k,m) th element of the adjacency matrix of GPN, T km , as That is, we use sign(S km ) to define the direction of the association and use F À 1 Chi ð1 À p km Þ to define the strength of the association.T km >0 and T km <0 represent two directions of the association between the k th phenotype and the m th SNP.If T km >0, the minor allele of the m th SNP is a protective allele to the k th phenotype; if T km <0, the minor allele of the m th SNP is a risk allele to the k th phenotype.
Although a bipartite network may give the most complete representation of a particular network, it is often convenient to work with just one type of nodes, that is, phenotypes or genotypes.The Phenotype and Phenotype Network (PPN) is the one-mode projection of GPN on phenotypes.In PPN, nodes only represent phenotypes (Fig 1B).Let W = (W kl ) denote the adjacency matrix of the PPN in which each edge has a positive or negative weight.We define W kl as the weight of the edge connecting the k th and l th phenotypes, which is given by ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Here, W kl is the genetic correlation between the k th and l th phenotypes based on the association strengths T km for k = 1,� � �,K and m = 1,� � �,M.Thus, the PPN is also a signed network.

Community detection method
We apply a powerful community detection method to partition K phenotypes into disjoint network modules using the Ward hierarchical clustering method with a similarity matrix defined by the genetic correlation matrix W [48].The number of network modules is determined by the following perturbation procedure [66].In details, we first use the Ward hierarchical clustering method to group the K phenotypes into k 0 (k 0 = 1,� � �,K−1) clusters and build the K×K connectivity matrix C k 0 with the (k,l) th element of matrix C k 0 given by C k 0 ðk; lÞ ¼ 1; if phenotype k and phenotype l are in the same cluster 0; otherwise : ( Then, we generate B perturbed data sets.The b th perturbed data set is generated by , where ε km *N(0,σ 2 ), σ 2 = median(var(T 1 ),� � �,var(T M )), and T m = (T 1m ,� � �, T Km ).We denote the connectivity matrix of k 0 cluster based on the b th perturbed data set by the empirical CDF of the elements of D k 0 , and AF k 0 denotes the area under the curve of F k 0 , where Then, the optimal number of network modules is given by We can use the identified C network modules to further investigate the associations between phenotypes in each network module and SNPs.

Multiple phenotype association tests
After we obtain C network modules for the phenotypes, we apply a multiple phenotype association test to identify the association between phenotypes in each of the C network modules and a SNP.Any multiple phenotype association test can be applied here.In this article, we apply six commonly used multiple phenotype association tests to each network module, including ceCLC [49], CLC [12], HCLC [50], MultiPhen [20], O'Brien [15], and Omnibus [12] (see details in Text A in S1 Text), then a Bonferroni correction is used to adjust for multiple testing for the C network modules to test if all phenotypes in the C network modules associated with a SNP.

Data simulation
We conduct comprehensive simulation studies to evaluate the type I error rates and powers of multiple phenotype association tests based on network modules detected by GPN and compare them to the powers of the corresponding tests without considering network modules.To evaluate the performance of our proposed method, we consider different types of phenotypes: (i) mixture phenotypes: half quantitative and half qualitative with balanced case-control ratios, and (ii) binary phenotypes: all qualitative but with extremely unbalanced case-control ratios.We generate N individuals with M SNPs and K phenotypes.The genotypes at M SNPs are generated according to the minor allele frequency (MAF) under Hardy-Weinberg Equilibrium (HWE).Below, we first describe how to generate quantitative phenotypes.Suppose that there are C phenotypic categories and k = K/C phenotypes in each phenotypic category.Let Y c = (y c1 ,� � �,y ck ) denote the phenotypes in the c th category.Similar to Sha et al. [12], we generate k quantitative phenotypes in each category using the following factor model, where G = (G 1 ,� � �,G M ) is the matrix of M SNPs with dimension N×M which are generated from a binomial(2,MAF) distribution for each SNP; B c is an M×k matrix of effect sizes of M SNPs on k phenotypes in the c th phenotypic category; E c *MVN k (0,S) is an N×k matrix of error term with S = (σ ij ), where σ ij = ρ |i−j| and ρ is a constant between 0 to 1; f c is a factor vector is a C×C matrix with all elements of 1, and I c is the identity matrix; C 0 is a constant number which represents a proportion.Therefore, the correlation between the i th phenotype and the j th phenotype within each category is c 2 0 þ ð1 À c 2 0 Þr jiÀ jj and the between-category correlation is c 2 0 r f .In our simulation studies, we use ρ = 0.3 and c 2 0 ¼ 0:5, therefore, the maximum correlation between two phenotypes within the same category is 0.65.We also assign r f ¼ 0:3=c 2 0 to ensure that the between-category correlation is 0.3.
To generate a qualitative disease affection status, we use a liability threshold model based on a quantitative phenotype and its case-control ratio.Let n a and n C denote the number of affected individuals and the number of non-affected individuals.For a given case-control ratio r and sample size N, n c = N/(r+1) and n a = rN/(r+1).An individual is defined to be affected if the individual's phenotype is in the top n a of all phenotypes.For each phenotype, the case-control ratio is randomly chosen from a set S. The set S contains all case-control ratios with the number of cases greater than 200 from UK Biobank ICD-10 code level 3 phenotypes.
Based on the factor model, we consider different numbers of phenotypes, 60, 80, and 100, and different sample sizes.For mixture phenotypes, the sample sizes are 2,000 and 4,000; for binary phenotypes, the sample sizes are 10,000 and 20,000.We consider six simulation models (Text B and Table P in S1 Text) with M = 2,000 and MAF*U(0.05,0.5).The calculations of the type I error rates and power of multiple phenotype association test in N.O.and in NET are summarized in Text C in S1 Text.

Real dataset
The UK Biobank is a population-based cohort study with a wide variety of genetic and phenotypic information [67].It includes ~500K people from all around the United Kingdom who were aged between 40 and 69 when recruited in 2006-2010 [44,68].Following the genotype and phenotype preprocess introduced in Liang et al. [50], there are 288,647 SNPs and 72 EHRderived phenotypes in the diseases of the musculoskeletal system and connective tissue for 322,607 individuals are kept in our real data analysis [69] (Text D and Fig U in S1 Text).Among the 72 phenotypes, lumbar and other intervertebral disk disorders with myelopathy (M51.0) has the smallest case-control ratio 0.000658 with 212 cases and 322,395 controls; Gonarthrosis (M17.9) has the largest case-control ratio 0.03937 with 12,218 cases and 310,389 controls.Therefore, all of the phenotypes we considered in our analysis have extremely unbalanced case-control ratios.Furthermore, each phenotype is adjusted by 13 covariates, including age, sex, genotyping array, and the first 10 genetic principal components (PCs) [65].The analysis is performed based on the adjusted phenotypes.

Correlation analysis
To compare the genetic and phenotypic correlations among the 72 EHR-derived phenotypes, we apply cross-triat LDSC regression [27] to obtain the genetic correlation and phenotypic correlation which can provide useful etiological insights [27].GWAS summary statistics are generated from the association between phenotype and genotype which are calculated by the saddlepoint approximation.We use the precomputed LD scores of European individuals in the 1000 Genomes project for high-quality HapMap3 SNPs ('eur_w_ld_chr').For the phenotypic correlation, we consider 70 phenotypes excluding M79.6 (Enthesopathy of lower limb) and M67.8 (Other specified disorders of synovium and tendon), since the heritabilities of these two phenotypes estimated by LDSC are out of bounds.For the genetic correlation, we only consider 52 phenotypes exlcuding 20 phenotypes, where the heritabilities of these phenotypes are not significantly different from zero.We apply the K-means hierarchical clustering method to compare the correlations of phenotypes obtained by our proposed GPN and LDSC.

Post-GWAS analyses
Pathway enrichment analysis.To better understand the biological functions behind the SNPs identified by one multiple phenotype association test, we identify the pathways in which the identified SNPs are involved.We use the functional annotation tool named Database for Annotation, Visualization, and Integrated Discovery bioinformatics resource (DAVID: https:// david.ncifcrf.gov/)[70,71] for the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis.A mapped gene used in the pathway enrichment analysis denotes the gene that includes at least one identified SNPs with a 20kb window region.The biological pathways with FDR < 0.05 and enriched gene count > 2 are considered statistically significant [72].
Tissue enrichment analysis.To prioritize and interpret the GWAS signals and identify lead SNPs, tissue enrichment analyses are performed using the Functional Mapping and Annotation (FUMA: https://fuma.ctglab.nl/)[61] platform and the GWAS signals from one multiple phenotype association test in N.O.and in NET, respectively.FUMA first performs a genic aggregation analysis of GWAS association signals to calculate gene-wise association signals using MAGMA, which is a commonly used generalized gene-set analysis of GWAS summary statistics [73].Then, it subsequently tests whether tissues and cell types are enriched for expression of the genes with gene-wise association signals.For tissue enrichment analysis, we use 30 general tissue types in GTEx v8 reference set (https://gtexportal.org/home/).
Colocalization analysis.As most associated variants are noncoding, it is expected that they influence disease risk through altering gene expression or splicing [74].The colocalization analysis is a way to identify the association of a GWAS SNP and a gene expression QTL that are colocalized.We perform colocalization analysis using the 'coloc' package in R [62], a Bayesian statistical methodology that tests pairwise colocalization of eQTLs with unique identified SNPs by ceCLC in NET and N.O.from the UK Biobank dataset.The SNP-gene associations in the Muscle Skeletal tissue are downloaded from GTEx v7.We use the default setting of the prior probabilities, p 1 = p 2 = 10 −4 and p 12 = 10 −5 , for a causal variant in an eQTL or a GWAS SNP and a shared causal variant between eQTL and GWAS SNP, respectively.Quality controls (QCs) on genotype: Filter out genetic variants, with i. Missing rate larger than 5% ("-mind 0.05"), ii.Hardy-Weinberg equilibrium exact test p-values less than 10 −6 ("-hwe 1e-6"), iii.Minor allele frequency (MAF) less than 5% ("-maf 0.05").We also filter out individuals, with iv.Missing rate larger than 5% ("-mind 0.05") v. Individuals without sex ("-no-sex"). (DOCX)

Fig 1 .
Fig 1. Overview of the method.a. Construction of GPN.Each phenotype (yellow square) and each SNP form a directed edge which represents the strength of the association, where the red dashed line indicates that the minor allele of the SNP is a protective allele to the phenotype, and the blue dashed line indicates that the minor allele of the SNP is a risk allele to the phenotype.b.Construction of PPN, which is the one-mode projection of GPN on phenotypes.c.The community detection method is used to partition phenotypes into disjoint network modules.d.Multiple phenotype association tests based on the network modules detected by GPN.e. GWAS signals are identified by a multiple phenotype association test with or without considering network modules.f.Functional enrichment analysis based on the detected GWAS signals and the publicly available functional database.g.Colocalization of GWAS signals and eQTL analysis.(All networks are generated by an open source software platform, Cytoscape 0.9.2, which can be accessed via https://cytoscape.org/[51]; Other figures are generated by an open source software, R 4.2.2, which can be accessed via https://www.r-project.org/).https://doi.org/10.1371/journal.pgen.1011245.g001

Fig 2 (
Fig 2 (Fig A in S1 Text) shows the power of six multiple phenotype association tests under six simulation models for different effect sizes with a total of 80 mixture phenotypes and a sample size of 4,000 (2,000).From Figs 2 and A in S1 Text, we can see that: (i) All tests in NET (filled by the dashed line) are much more powerful than those in N.O., indicating that tests based on network modules detected by GPN are more powerful than the tests without considering network modules.Since the community detection method can partition phenotypes into different network modules based on shared genetic architecture, the phenotypes can be clustered in the same module if they have higher genetic correlations.In particular, the power of

Fig 2 .
Fig 2. Power comparisons of the six tests as a function of effect size β under the six models.The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 80 and the sample size is 4,000.The power of all of the six tests is evaluated using 10 MC runs.(Figure is generated by an open source software, R 4.2.2, which can be accessed via https://www.r-project.org/).https://doi.org/10.1371/journal.pgen.1011245.g002 Figs F-K in S1 Text show the power comparisons of the six tests under six simulation models.Fig L in S1 Text shows the power comparisons of the six tests under the models that mimic real data cluster structures.The patterns of power comparisons for binary phenotypes and for the models that mimic real data cluster structure are similar to those observed in Figs 2 and A-E in S1 Text.
Figs N-O in S1 Text show that dendrograms of hierarchical clustering method based on the genetic correlation of phenotypes obtained by GPN, and the phenotypic or genetic correlation estimated by LDSC, respectively.In Fig N in S1 Text, the cluster results of the phenotypic correlation estimated by LDSC are similar to that of the genetic correlation based on GPN, but GPN can separately identify two highly genetic correlated phenotypes, ankylosing spondylitis (M45) and ankylosing spondylitis with site unspecified (M45.X9).However, the cluster results of the genetic correlation estimated by LDSC are different from those obtained by GPN.Some phenotypes in the same UK Biobank level 1 category can be clustered in the same group by GPN but not by LDSC (Fig O in S1 Text).Interpretation of the association test.We apply five multiple phenotype association tests (ceCLC, CLC, HCLC, O'Brien, and Omnibus) to test the association between 72 EHR-derived phenotypes and each of 288,647 SNPs in the UK Biobank.MultiPhen is not considered here

Fig 3 .
Fig 3.The network modules detected by the powerful community detection method based on GPN.The blocks with different color indicate different modules, where the values in the legend represent the number of phenotypes in each network module.The labels of phenotypes are listed in the form of ICD-10 code and the corresponding diseases can be found in the UK Biobank.The connection between two phenotypes represents the absolutely value of the weight greater than 40.(The graph was prepared by Cytoscape 0.9.2, which can be accessed via https://cytoscape.org/)https://doi.org/10.1371/journal.pgen.1011245.g003 −8 .Fig 4B shows that all tests can identify more SNPs comparing with the number of SNPs identified in N.O.ceCLC in NET identifies 980 SNPs, where 647 SNPs are identified in N.O.Meanwhile, there are 950 SNPs identified by HCLC, 949 SNPs by CLC, and 891 SNPs by Omnibus, where the corresponding

Fig 4 .
Fig 4. The Venn diagram of the number of SNPs identified by ceCLC, CLC, HCLC, O'Brien, and Omnibus in N.O.(a) and in NET (b).The number below each method indicates the total number of SNPs identified by the corresponding method.(Figure is generated by an open source software, R 4.2.2, which can be accessed via https://www.r-project.org/).https://doi.org/10.1371/journal.pgen.1011245.g004

Fig 5
shows that 16 pathways are significantly enriched by 191 mapped genes in N. O. and 29 pathways are significantly enriched by 252 mapped genes in NET, where all of the 16 pathways identified in N.O.are also identified in NET.Two pathways identified in N.O.and NET, rheumatoid arthritis (hsa05323; FDR = 8.72×10 −3 in N.O.and FDR = 6.48×10 −8 in NET) and systemic lupus erythematosus (hsa05322; FDR = 4.25×10 −19 in N.O.and FDR = 1.02×10 −40 in NET) showed in Fig 5,are related to the diseases of the musculoskeletal system and connective tissue.For example, osteopetrosis (M19.9) and rheumatoid arthritis (M06.9) are related to the rheumatoid arthritis pathway.Meanwhile, the pathway related to at least one of the 72 phenotypes, hematopoietic cell lineage (hsa04640; FDR = 1.08×10 −5 ), is only identified in NET.Notably, DBGET system (https://www.genome.jp/dbget-bin/www_bget?hsa05322) reports that there are two pathways related to systemic lupus erythematosus: antigen processing and presentation (hsa04612; FDR = 4.83×10 −3 in N.O.and FDR = 2.82×10 −16 in NET) identified in both N.O.and NET and cell adhesion molecule (hsa04514; FDR = 1.04×10 −5 ) only identified in NET.Meanwhile, the above five pathways related to the diseases of the musculoskeletal system and connective tissue contain more enriched genes identified by ceCLC in NET than the enriched genes identified in N.O.For example, 43 SNPs within six mapped genes identified by ceCLC in N.O.are enriched in rheumatoid arthritis pathway, including ATP6V1G2, HLA--DRA, LTB, TNF, HLA-DRB1, and HLA-DQA1; and 111 SNPs within 12 mapped genes in NET are enriched in this pathway, including HLA-DMA, HLA-DMB, ATP6V1G2, HLA-DRA, LTB, HLA-DOA, TNF, HLA-DOB, HLA-DQA2, HLA-DRB1, HLA-DQA1, and HLA-DQB1.Compared with the results of ceCLC in N.O., the test based on network modules identifies six more enriched genes, especially, gene HLA-DMB (including rs241458; p-value = 7.09×10 −9 ) and gene HLA-DOA (including rs3097646; p-value = 5.50×10 −9 ) that have not been reported in the GWAS catalog.

Fig 5 .
Fig 5.The results for the pathway enrichment analysis based on the genes identified by ceCLC and the KEGG database in N. O. (a) and NET (b).The red marked pathways denote the pathways related to the diseases of the musculoskeletal system and connective tissue.There are 191 genes in N.O.and 252 genes in NET that are applied to the pathway enrichment analysis.(Figure is generated by an open source software, R 4.2.2, which can be accessed via https://www.r-project.org/).https://doi.org/10.1371/journal.pgen.1011245.g005 in N.O.(32 SNPs).
Fig F. Power comparisons of the six tests as a function of effect size β under the six models.The number of binary phenotypes (with extremely unbalanced case-control ratios) is 80 and the sample size is 20,000.The power of all of the six tests is evaluated using 10 MC runs.Fig G. Power comparisons of the six tests as a function of effect size β under six models.The number of binary phenotypes (with extremely unbalanced case-control ratios) is 80 and the sample size is 10,000.The power of all of the six tests is evaluated using 10 MC runs.Fig H. Power comparisons of the six tests as a function of effect size β under six models.The number of binary phenotypes (with extremely unbalanced case-control ratios) is 60 and the sample size is 10,000.The power of all of the six tests is evaluated using 10 MC runs.Fig I. Power comparisons of the six tests as a function of effect size β under six models.The number of binary phenotypes (with extremely unbalanced case-control ratios) is 60 and the sample size is 20,000.The power of all of the six tests is evaluated using 10 MC runs.Fig J. Power comparisons of the six tests as a function of effect size β under six models.The number of binary phenotypes (with extremely unbalanced case-control ratios) is 100 and the sample size is 10,000.The power of all of the six tests is evaluated using 10 MC runs.Fig K. Power comparisons of the six tests as a function of effect size β under six models.The number of binary phenotypes (with extremely unbalanced case-control ratios) 100 and the sample size is 20,000.The power of all of the six tests is evaluated using 10 MC runs.Fig L. Power comparisons of the six tests as a function of sample size under the six models.The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 52 in the UK Biobank.The power of all of the six tests is evaluated using 10 MC runs.Fig M. The Manhattan plots of four different diseases based on the saddlepoint approximation.Systemic lupus erythematosus (M32.9),Sicca syndrome (M35.0), and Trigger finger (M65.3) are detected in Module III by our proposed GPN.Both Trigger finger (M65.3) and Synovitis and tenosynovitis (M65.9) are classified into the same ICD-codes category (M65).The horizontal red dashed line represents the threshold for commonly used genome-wide significance level 5×10 −8 .Fig N. Dendrogram of hierarchical clustering method based on the genetic correlation of phenotypes obtained by GPN and the phenotypic correlation estimated by LDSC, respectively.Fig O. Dendrogram of hierarchical clustering method based on the genetic correlation of phenotypes obtained by GPN and the genetic correlation estimated by LDSC, respectively.Fig P. The Venn diagrams of the number of significant SNPs identified by ceCLC, CLC, HCLC, O'Brien, and Omnibus in N.O.and NET.Fig Q.The QQ plots and inflation factors λ in each of 8 network modules for different tests in the real data analysis.Fig R. Tissue expression analysis for mapped genes identified by ceCLC in N.O.(a) and NET (b), respectively.Fig S. Colocalization signals.Lead SNPs are selected for colocalization analysis when the top associated SNP identified by ceCLC was also associated with gene expression in the Muscle Skeletal tissue.Fig T. Power comparisons of the six tests under six models.The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 60 and the sample size is 1,000.The power of all of the six tests is evaluated using 10 MC runs.Fig U. Flow chart of UK Biobank data preprocessing.Pre-process on phenotype: i. Select White British subjects (White British); ii.Remove individuals who are marked as outliers for heterozygosity or missing rates (Low Heterozygosity); iii.Exclude individuals who have been identified to have ten or more third-degree relatives or closer (Not Three-degree Relatives); iv.Remove individuals having very similar ancestry based on a principal component analysis of the genotypes (Similar Ancestry); v. Remove individuals based on removal by the UK Biobank (Removal by the UK Biobank).
in S1 Text; one SNP in NET and 32 SNPs in N.O.) and all SNP-gene association pairs in the Muscle Skeletal tissue reported in GTEx.Fig S in S1 Text shows the colocalization signals with the

Table O .
ICD-10 codes and names of the14 reported diseases shown in TableN.TableP.Simulation settings for the six models with λ1 ¼ bð1; � � � ; 1Þ Power comparisons of the six tests as a function of effect size β under six models.The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 80 and the sample size is 2,000.The power of all of the six tests is evaluated using 10 MC runs.Fig B. Power comparisons of the six tests as a function of effect size β under six models.The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 60 and the sample size is 2,000.The power of all of the six tests is evaluated using 10 MC runs.Fig C. Power comparisons of the six tests as a function of effect size β under six models.The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 60 and the sample size is 4,000.The power of all of the six tests is evaluated using 10 MC runs.Fig D. Power comparisons of the six tests as a function of effect size β under six models.The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 100 and the sample size is 2,000.The power of all of the six tests is evaluated using 10 MC runs.Fig E. Power comparisons of the six tests as a function of effect size β under six models.The number of mixture phenotypes (half continuous phenotypes and half binary phenotypes with balanced case-control ratios) is 100 and the sample size is 4,000.The power of all of the six tests is evaluated using 10 MC runs.
T .Fig A.