HCLC-FC: A novel statistical method for phenome-wide association studies

The emergence of genetic data coupled to longitudinal electronic medical records (EMRs) offers the possibility of phenome-wide association studies (PheWAS). In PheWAS, the whole phenome can be divided into numerous phenotypic categories according to the genetic architecture across phenotypes. Currently, statistical analyses for PheWAS are mainly univariate analyses, which test the association between one genetic variant and one phenotype at a time. In this article, we derived a novel and powerful multivariate method for PheWAS. The proposed method involves three steps. In the first step, we apply the bottom-up hierarchical clustering method to partition a large number of phenotypes into disjoint clusters within each phenotypic category. In the second step, the clustering linear combination method is used to combine test statistics within each category based on the phenotypic clusters and obtain p-values from each phenotypic category. In the third step, we propose a new false discovery rate (FDR) control approach. We perform extensive simulation studies to compare the performance of our method with that of other existing methods. The results show that our proposed method controls FDR very well and outperforms other methods we compared with. We also apply the proposed approach to a set of EMR-based phenotypes across more than 300,000 samples from the UK Biobank. We find that the proposed approach not only can well-control FDR at a nominal level but also successfully identify 1,244 significant SNPs that are reported to be associated with some phenotypes in the GWAS catalog. Our open-access tools and instructions on how to implement HCLC-FC are available at https://github.com/XiaoyuLiang/HCLCFC.


Introduction
Genome-wide association studies (GWAS) have emerged as a common and powerful tool for investigating the genetic architecture of human disease over the last ten years [1,2]. Over the last decade, numerous disease-and trait-associated common SNPs have been successfully identified by using statistical methods of GWAS.
To date, many software packages, such as PLINK, Gen/ProbABEL, MaCH, SNPTEST, and FaST-LMM, have been developed to support GWAS [3][4][5][6][7][8][9]. However, GWAS suffer from important shortcomings. First of all, GWAS usually focus on a pre-defined and limited phenotypic domain and ignore the potential power gained through the use of intermediate phenotypes that may more closely reflect a gene's mechanism, as well as the association between genetic variation and multiple phenotypes [10,11]. Moreover, it is difficult to reach the threshold of statistical significance by GWAS due to the burden of multiple comparisons conducted; only those associations with a p-value less than 5 × 10 −8 are considered statistically significant. GWAS have difficulty in explaining a significant portion of the predicted phenotypic heritability even though a significant number of SNPs are identified [12]. Lastly, the genotype-phenotype association is assessed for millions of SNPs one by one, false-positive results may easily arise due to large-scale multiple testing. Therefore, large sample size is needed to achieve the optimal statistical power and minimize spurious associations. Furthermore, the replication of the significant loci in independent populations is necessary according to the GWAS criteria [13].
Recently, large-scale DNA databanks linked to longitudinal electronic medical records (EMRs) offer the possibility of phenome-wide association studies (PheWAS) and have been proposed as an approach for rapidly generating large, diverse cohorts for the discovery and replication of genotype-phenotype associations [14][15][16]. In most EMR systems, the whole phenome can be classified into numerous phenotypic categories according to genotypic and phenotypic information such as phenotype similarity [15], genetic architecture [17], and disease network [18]. As a complementary approach to GWAS, PheWAS investigate the association between SNPs and a diverse range of phenotypes. By utilizing all available phenotypic information and all genetic variants in the estimation of associations between genotype and phenotype, a broader picture of the relationship between genetic variation and networks of phenotypes is possible [17]. In summary, GWAS use a phenotype-to-genotype strategy, beginning with a specific phenotype or disease; PheWAS reverse this paradigm by using a genotype-to-phenotype approach, starting with a genotype to test for associations over a wide spectrum of human phenotypes [12].
We are motivated primarily by PheWAS, which aim to assess associations between SNPs and a diverse range of phenotypes. Many of the issues that arise in this setting also occur elsewhere, for example, in clinical trials, the outcomes of cardiovascular risk may include hospitalization, stroke, heart failure, myocardial infarction, cardiac arrest, disability, and death [19]. Therefore, the statistical framework and results given here have a potential for wider application.
Several statistical methods for genetic association studies based on multiple phenotypes have been developed. The traditional Multivariate Analysis of Variance (MANOVA) [20] can take into account multiple continuous phenotypes to essentially test whether or not the independent genetic variant simultaneously explains a statistically significant amount of variance in multiple phenotypes. By performing ordinal regression analysis (proportional odds logistic regression), the joint model of Multiple Phenotypes (MultiPhen) [21] was developed using a reversed analysis by considering a genetic variant of interest as an ordinal response variable and the correlated phenotypes as predictors. A limitation of these multivariate approaches is that their performance depends on the specific configuration of phenotypic correlation structure. To address the limitation of some of the multivariate approaches, the Trait-based Association Test that uses Extended Simes procedure (TATES) [22] was developed to combine pvalues obtained in standard univariate GWAS while correcting for the observed correlational structure between phenotypes. However, TATES essentially only depends on the phenotype that has the strongest association with the variant. Thus, MANOVA and MultiPhen are more powerful than TATES when genotypes impact on all phenotypes or on a large proportion of phenotypes because TATES may lose information in this scenario, while TATES is more powerful than MANOVA and MultiPhen when genotypes impact on one or very few phenotypes [23].
In 2019, Sha et al.
[24] developed the Clustering Linear Combination (CLC) method that combines univariate test statistics for jointly analyzing multiple phenotypes in association analysis. CLC has been theoretically proved to be the most powerful test among all tests with certain quadratic forms if the phenotypes are clustered correctly. It is not only robust to different signs of means of individual statistics but also reduces the degrees of freedom of the test statistics. Therefore, the CLC method can be applied to PheWAS. However, due to the unknown number of clusters for a given data, the final test statistic of the CLC method is the minimum p-value among all p-values of the test statistics obtained from each possible number of clusters [25], and a simulation procedure is used to estimate the p-value of the final test statistic which would be time-consuming, especially in the PheWAS setting.
In this article, we derive a novel and powerful multivariate method, which we referred to as HCLC-FC (Hierarchical Clustering Linear Combination with False discovery rate Control) to test the association between a genetic variant with a large number of phenotypes. The HCLC-FC method is applicable to PheWAS. In PheWAS, the whole phenome can be classified into numerous phenotypic categories according to genotypic and phenotypic information, and each category contains a certain number of phenotypes. The proposed method (HCLC-FC) involves three steps. In the first step, we use the bottom-up Hierarchical Clustering Method (HCM) [26] to partition a large number of phenotypes into disjoint clusters within each category. In the second step, we apply the CLC method to combine test statistics within each phenotypic category based on the phenotypic clusters and obtain p-values from each phenotypic category. In the third step, we develop a false discovery rate (FDR) control approach based on a large-scale association testing procedure with theoretical guarantees for FDR control under flexible correlation structures [10]. Using extensive simulation studies, we evaluate the performance of the proposed method and compare the power of the proposed method with the powers of three commonly used methods in association studies using multiple phenotypes. These three methods include MANOVA [20], MultiPhen [21], and TATES [22]. Our simulation studies show that the proposed method outperforms the other three methods for different within-group and between-group phenotypic correlation structures we consider. Furthermore, the existing methods using our proposed FDR control procedure can control FDR efficiently. We also evaluate the performance of HCLC-FC through a set of 1,869 EMRbased phenotypes based on the International Classification of Diseases, 10 th Revision (ICD-10 code, Data-Field 41202), across more than 300,000 samples from the UK Biobank, where these phenotypes can be classified into 260 ICD-10 level 1 blocks. The real data analysis results show that HCLC-FC can well control the type I error rate and can identify 1,244 SNPs that have previously been reported in the GWAS catalog.

Statistical methods
Consider a sample with n unrelated individuals for a PheWAS, indexed by i = 1,2,. . .,n. Each individual has the phenome with K phenotypes. The K phenotypes can be divided into M phe- We assume that there are no covariates. If there are covariates, such as, gender, age, BMI, and top principal components to adjust for population stratification, we adjust both phenotype and genotype values for the covariates using the method applied by Price et al. [27] and Sha et al. [28]. That is, if there are p covariates, z i1 , . . ., z ip , for the i th individual, we adjust both phenotype and genotype values for the covariates through linear models In this article, we derived a novel and powerful multivariate method for PheWAS, which is referred to as HCLC-FC. The proposed method (HCLC-FC) involves three steps. In the first step, we use the bottom-up HCM [26] to partition K m phenotypes into L m disjoint clusters within each category, where m = 1, . . ., M. In the second step, we apply the CLC [24] to combine test statistics within each category. The CLC test statistic with L m clusters follows a chisquare distribution with L m degrees of freedom. We then obtain the p-value of the CLC test statistic for each phenotypic category. In the third step, we propose an FDR control approach based on the method proposed by Cai et al. [10]. FDR is widely used to claim significance for high-dimensional correlated data. However, most of the existing methods of FDR cannot accurately estimate FDR due to different directions of genetic effects on different phenotypes. Recently, Cai et al. (2019) developed a method to evaluate FDR that works well for PheWAS if only a single phenotype is considered at a time. However, Cai's method is based on test statistics which are difficult to extend to test statistics for multiple phenotypes. Instead of using test statistics, we propose a new approach to evaluate FDR which is based on p-values and does not depend on test statistics. In the following sections, we give a detailed approach for each step.
Step 1: HCM to partition phenotypes in each phenotype category. For the m th phenotypic category, we partition K m phenotypes into L m disjoint clusters. Denote D m = 1 − S m with entries d m ll � as the dissimilarity matrix, where S m is K m × K m similarity matrix of Y m for the m th phenotypic category and d m ll � is the dissimilarity value between l th and l �th phenotypes. The HCM is based on the agglomerative clustering algorithm. In agglomerative clustering, all the phenotypes are a cluster of their own, and we merged pairs of clusters until they form a single cluster. In each iteration, we merge two clusters that have the smallest value of the average dissimilarity d m ll � between all phenotypes in two clusters and define the smallest average dissimilarity h b as the height of the b th iteration. The established principle in Bühlmann et al. [29] is used to determine the number of clusters for each phenotypic category. That is, the number of clus- Step 2: CLC to test the association between phenotypes in each category and a genetic variant. For each phenotypic category, we apply the CLC method [24] to combine test statistics among the L m clusters. We use T mk to denote the score test statistic to test the null hypothesis H 0mk : β 1mk = 0 (the k th phenotype in the m th phenotypic category is not associated with the genetic variant) under the generalized linear model y imk = β 0mk + β 1mk x i + ε imk , where k = 1, . . ., K m . So T mk is given by . . . ; T mK m Þ T be the test statistic vector that contains score test statistics for each phenotype in the m th phenotypic category and let B m be a K m × L m matrix with the indicator entry b kl = 1 if the k th phenotype belongs to the l th cluster and b kl = 0 otherwise. Then the CLC test statistic for the L m clusters in the m th phenotypic category is given by follows a chi-square distribution with L m degrees of freedom. We denote p m as the p-value of T L m CLC .
Step 3: Threshold for FDR-controlling. The method proposed by Cai et al. [10] is based on test statistics which are hard to extend to other test statistics. Therefore, in this step, we develop a new approach to evaluate FDR which is based on p-values. In the second step, the pvalue for the test statistic in the m th category for m = 1, . . ., M can be obtained. In this step, we propose a new multiple testing FDR controlling procedure by thresholding the p-values {p m : m = 1, . . ., M}. Under the null hypothesis, each p m follows a uniform distribution U(0,1). Let t, 0 � t � 1, be a rejection threshold so that H 0m is rejected if and only if p m � t. For any given threshold t, 0 � t � 1, the false discovery proportion (FDP) based on a random sample is given by To maximize the power of the test or equivalently the rejection rate among H 1 while maintaining an FDP level of α, the optimal threshold t is b t 0 ¼ supft : FDPðtÞ � ag. The key to empirically controlling the FDP is to find a good estimate of the numerator Using the idea in Cai et al. [10], we estimate the numerator by where m 0 is the number of categories under the null hypothesis and we can use M to estimate m 0 due to the sparsity in the number of alternative hypotheses in many real data applications, and For a given nominal FDR level α 2 (0,1), we reject H 0i whenever

Comparison of methods
We compare the performance of the proposed method HCLC-FC with those of MultiPhen [21], MANOVA [20], and TATES [22]. To evaluate the FDR-controlling performance, MAN-OVA, MultiPhen, and TATES are first applied to each category. Then, we apply the third step of HCLC-FC to the three methods to control FDR, which are referred to as MANOVA-FC, MultiPhen-FC, and TATES-FC. That is, we not only compare the performance of different methods for joint analysis of multiple phenotypes but also compare the performance of different methods with the newly developed FDR-controlling process.
In the following sections, we will estimate the FDR and power of each method. FDR is estimated by FDP and the estimated FDR is d where B is the number of replications, H a is the alternative hypothesis, p m is the p-value of the test statistic for the m th phenotypic category, m = 1, . . ., M, and b t is the threshold estimated by HCLC-FC in step 3. The power of each method is the probability of correctly rejecting H 0 ; it is esti-

Simulation study
To where x i is the genotype score at the variant of interest; λ m = (λ m1 ,. . ., λ mK � ) T is the vector of effect sizes of the genetic variant on phenotypes in the m th category; r f is a constant to define the phenotypic correlation between phenotypic categories, A is an M × M matrix with elements of 1, and I is an and ρ e is constant to define the phenotypic correlation within each phenotypic category.
Based on Eq (1), we consider the following six models. In these six models, the correlation between the h th and h �th phenotypes within each category is c 2 þ ð1 À c 2 Þr jhÀ h � j e , and between categories is c 2 ρ f . We set M = 100 for Model 1-3 and M = 50 for Model 4-6.

Model 1:
There are M = 100 categories and genotypes impact on only one category. Let β is a constant that is used to define the effect size.

Model 2:
There are M = 100 categories and genotypes impact on two categories. Let

Model 3:
There are M = 100 categories and genotypes impact on three categories. Let l 1 ¼

Simulation results
In our simulation studies, we estimate the p-values of all test statistics using their asymptotic distributions. We first set ρ f = 0.2, ρ e = 0.3, c 2 = 0.5, and K = 1,000, 2,000 for comparing the performance of different methods for joint analysis of multiple phenotypes, in other words, we consider the proposed FDR-controlling method and compare the performance of HCLC-FC, MANOVA-FC, MultiPhen-FC, and TATES-FC. For FDR evaluation, we consider different numbers of phenotypes, different sample sizes, different values of effect size, and different models.
The estimated FDRs of the four methods are summarized in Tables 1 and 2. From these tables, we can see that all methods using our FDR control procedure control their respective targeted error rates very well, which indicates applying our new FDR-controlling procedure to the entire collection of hypotheses can control the rate of FD of associated genetic variants as well as the expected value of the average proportion of FD of phenotypic categories influenced by such variants.
To compare the power of HCLC-FC with that of MANOVA-FC, MultiPhen-FC, and TATES-FC, we consider different numbers of phenotypes, different sample sizes, different models, and different genetic effect sizes. The power of the four tests at an FDR level of 5% for 1,000 phenotypes and 2,000 phenotypes are shown in Figs 1 and 2, respectively. According to the power comparison results, we summarize the following conclusions. (1) HCLC-FC Table 1. The estimated FDR of the four tests under the six models for 1,000 phenotypes (K = 1,000). MAF is 0.3. The sample size (n) is 2,000. ρ f = 0.2, ρ e = 0.3, and c 2 = 0.5. β is the effect size. FDR is evaluated using 200 replicated samples at a nominal FDR level of 5%. All estimated FDR are within the 95% confidence interval (0.0198, 0.0802).  In addition to considering power as a function of genetic effect size, we further evaluate power with varying the correlation between phenotypic categories ρ f (S1 Fig in S1 File), the correlation within each phenotypic category ρ e (S2 Fig in S1 File), the constant c 2 in the model (S3 Fig in S1 File), and the MAF (S4 Fig in S1 File). These figures show that 1) The powers of HCLC-FC, MANOVA-FC, and MultiPhen-FC slightly decrease with the increasing correlation between phenotypic categories. HCLC-FC outperforms MANOVA-FC, MultiPhen-FC, and TATES-FC consistently for different correlations between phenotypic categories no matter the effect sizes show no groups (Model 1 and 4) or show some groups (Model 2, 3, 5, and 6) (S1 Fig in S1 File).

Model
2) The powers of HCLC-FC, MANOVA-FC, and MultiPhen-FC considerably decrease as the within-category correlation increases, while the power of TATE-FC does not change too much as the correlation increases (S2 Fig in S1 File). HCLC-FC is the most powerful test for correlation less than or equal to 0.4. For strong correlation within category structures (within-category correlation � 0.6), TATES-FC outperforms other methods when the effect sizes show no groups (Models 1 and 4) or show some groups and genotype impact on multiple categories (Models 3 and 6). The power gap is much larger when the phenotypes are highly correlated and show no groups (Models 1 and 4). The reason is that the p-value of TATES equals the smallest weighted p-value, so TATES is expected to outperform multivariate approaches as the phenotype correlations increase; The power of MANOVA-FC and Multi-Phen-FC are nearly identical. 3) For power as a function of c 2 (S3 Fig in S1 File), HCLC-FC is either the most powerful test (Model 1, 2, 4, 5, and 6) or comparable with the most powerful  (S4 Fig in S1 File).
One of the important steps of our method, HCLC-FC, is the third step, the FDR controlling procedure. To date, many methods have been developed to address multiple test correction. Here, we compare the performance of using our proposed FDR controlling procedure in step 3 of HCLC-FC with some existing FDR controlling approaches, namely the spectral decompo-  across all six models. In contrast, the tests using NySi, NyBo, JiSi, and JiBo suffer FDR inflation, and the inflation is especially severe when the number of categories is large (Model 1, 2, and 3).

Real data applications
The UK Biobank is a population-based cohort study with a wide variety of genetic and phenotypic information [ . The preprocess of genotype is achieved by quality control (QC) which is performed on both genotypic variants and samples using PLINK 1.9 [37] (https://www.cog-genomics. org/plink/1.9/). We summarize the QC procedures in S6 Fig in S1 File. In QC, we filter out genetic variants with variant-based missing rates larger than 5%, p-values of Hardy-Weinberg equilibrium exact test less than 10 −6 , and MAF less than 5%. We also filter out individuals with sample-based genotype missing rates larger than 5% and individuals without sex. After QC, there are 250,850 SNPs and 466,501 individuals remaining in the following analysis.

PLOS ONE
In this study, we define phenotypes using ICD-10 codes, a standardized coding system for defining disease status as well as for billing purposes [38]. After truncating each full ICD-10 code to the UK Biobank ICD-10 level 2 code (https://biobank.ndph.ox.ac.uk/showcase/field. cgi?id=41202), we generate a total of 1,869 unique phenotypes with the names of these phenotypes being the unique truncated ICD codes. For each individual, we denote the EMR-based phenotype for that individual as "1" if a corresponding truncated ICD code ever appears, otherwise, we denote the EMR-based phenotype as "0". To ensure the individuals in our analysis are from the same ancestry, we first restrict individuals to the individuals who self-report themselves from a white British ancestry and have very similar ancestry based on a principal component (PC) analysis of genotypes [34]. To avoid the low quality of phenotype data, we exclude individuals who are marked as outliers for heterozygosity or missing rates and have been identified to have ten or more third-degree relatives or closer. Finally, we also exclude individuals that are recommended for removal by the UK Biobank. After preprocessing the phenotype data, there are 337,285 individuals left (details described in S6 Fig in S1 File). It is worth noting that some individuals violate multiple criteria, therefore, the total number of individuals we start with minus the number of individuals that need to be removed does not necessarily equal the number of individuals we keep.
There are 260 blocks based on the UK Biobank ICD-10 level 1 code, therefore, 1,869 phenotypes from the UK Biobank ICD-10 level 2 can be classified into 260 blocks (M = 260). We further limit SNPs of interest to those SNPs reaching the genome-wide significance threshold 5 × 10 −8 . On Oct. 21 st , 2019, the GWAS catalog (https://www.ebi.ac.uk/gwas/) contains a total of 90,428 data entries covering 3,153 publications of 61,613 SNPs which contains 29,297 significant SNPs. Among 250,850 SNPs obtained from the UK Biobank after QC, there are 3,267 SNPs matched with those significant SNPs in GWAS Catalog. After preprocessing procedures, individuals with both genotype and phenotype information are used in our study. There is a total of 322,607 individuals across 3,267 common SNPs and 1,869 case-control phenotypes which are classified into 260 blocks. Furthermore, we adjust each phenotype by thirteen covariates, including age, sex, genotyping array, and the first 10 PCs [28].
Based on the results shown in Tables 1 and 2, we know that HCLC-FC, MultiPhen-FC, MANOVA-FC, and TATES-FC can control targeted FDR under all of the simulation models. However, in the UK Biobank data, most of the phenotypes have extremely unbalanced casecontrol ratios, where the case-control ratios of 1,869 phenotypes are ranged from 3.10 × 10 −6 to 1.87 × 10 −1 . Meanwhile, many widely used approaches for joint analysis of multiple phenotypes produce inflated type I error rates for such extremely unbalanced case-control phenotypes [39]. Notably, our proposed FDR control method assumes that the p-value of the test statistic in the m th category, p m , for m = 1, . . ., M, follows a uniform distribution U(0,1). Therefore, we first evaluate the distributions of the p-values under the null hypothesis for each of the four methods based on the UK Biobank data by permutation procedures. For each of the four tests, we randomly permute genotypes for each of the 3,267 SNPs. After permutation, 3,267 SNPs have no association with each of the 260 phenotypic blocks. Therefore, we consider 260 blocks and 3,267 SNPs as 260 × 3,267 = 849,420 replicated samples. For each replicated sample, we apply four tests for testing the association between each permuted SNP and each phenotypic block.
S7 Fig in S1 File shows the histogram of p-values and QQ plot for uniform distribution for each method based on 849,420 replicated samples. The red dashed line in the histogram represents the theoretical frequency (849,420/25 � 33,977) for the standard uniform distribution. The frequencies of the p-values of the HCLC method are the only ones that approach the theoretical frequency. We also calculate the genomic inflation factor (λ) and show the observed and expected p-values from the standard uniform distribution in quantile-quantile (QQ) plots for each method. In general, the genomic inflation factor λ should be close to 1 if the p-values fall within the standard uniform distribution [40]. In the QQ plots in S7 Fig in S1 File, our proposed HCLC method forms a line that's roughly straight and λ = 0.99, indicating that the p-values based on 849,420 replicated samples come from the standard uniform distribution. In contrast, λ = 0.58 for MultiPhen and λ for MANOVA, where the sample quantiles of these methods deviate from the theoretical quantile. Even though the genomic inflation factor of TATES is equal to 0.97 which is pretty satisfactory, the sample quantiles fluctuate around the theoretical quantiles slightly which is not as good as our proposed HCLC method. Here are the possible reasons why the other three methods do not satisfy the uniform distribution assumption of p-values, and only HCLC works. The main assumption of MANOVA is that phenotypes should be continuous. However, all of the phenotypes in the analysis are binary phenotypes that violate the main assumption of MANOVA. MultiPhen uses the likelihood ratio test statistic based on the proportional odds logistic regression and TATES uses the extended Simes procedure to integrate the p-values from the score test statistics for the univariate association tests. It has been shown that the commonly used methods, such as the likelihood ratio test and score test, can inflate type I error rates for unbalanced case-control studies [34] that may result in the non-uniform distribution of the p-values of these two methods under the null hypothesis. Even though our proposed method, HCLC, uses the score test statistic to test the association between each phenotype and a SNP, it then uses the CLC test statistics to combine the individual statistics linearly within each cluster and combine the between-cluster terms in a quadratic form [28]. Our real data analysis shows that CLC is robust to unbalanced case-control studies.
Since MultiPhen is very time-consuming for real data analysis, we apply the other three methods, HCLC, MANOVA, and TATES, to test the association between each of the 3,267 SNPs and each of the 260 phenotypic blocks. S8 Fig in S1 File shows the number of SNPs identified by the three methods. Although MANOVA-FC and TATES-FC identified more SNPs than HCLC-FC, they violate the uniform distribution assumption of p-values. Therefore, in the following, we focus on the SNPs identified by HCLC-FC.
There is a total of 3,267 significant SNPs related to different phenotypes in the GWAS Catalog. If a SNP is associated with at least one phenotype in a block, we define this block as a SNPrelated phenotypic block in the GWAS Catalog. By controlling the FDR at the 5% level, HCLC-FC identifies 1,244 out of 3,267 SNPs that are significantly associated with at least one phenotypic block. Table 3 lists the top nine SNPs identified by the HCLC-FC method. We use SNP rs3129716 as an example. rs3129716 is mapped to genes HLA-DQB1 and MTCO3P1. By controlling the FDR at 5%, the FDR threshold is 5.96 × 10 −3 . Using this threshold, HCLC-FC identifies 28 phenotypic blocks significantly associated with this SNP. Based on the GWAS catalog, 17 out of 28 phenotypic blocks (bold-faced) are reported to be significantly associated with this SNP.
To visualize the associations between SNPs and phenotypic blocks identified by our proposed HCLC-FC method, we use two sets of phenotypic blocks, the diseases of the circulatory system (I00-I99) and the malignant neoplasms (C00-C97) as examples. Fig 3 and S9 Fig in S1 File are used to showcase interconnections among phenotypes due to shared genetic associations. Fig 3 shows the associations between SNPs (red circle) and the diseases of the circulatory system phenotypic blocks (I00-I99; blue square) identified by the HCLC-FC method. There are a total of nine phenotypic blocks. S9 Fig in S1 File shows the associations between SNPs (red circle) and the set of malignant neoplasms phenotypic blocks (C00-C97; blue square). There are a total of 15 phenotypic blocks. From these two figures, we can see that many SNPs are associated with one phenotypic block while some SNPs are associated with multiple phenotypic blocks, which supports our hypothesis that some SNPs are associated with at least one phenotypic block. For example, Fig 3 shows that 108 SNPs are associated with the phenotypic block hypertensive diseases (I10-I15) and 17 out of 108 SNPs are associated with both hypertensive diseases (I10-I15) and Ischaemic heart diseases (I20-I25).

Discussion
GWAS have become a very effective research tool to investigate associations between genetic variation and a disease/phenotype. In spite of the success of GWAS in identifying thousands of reproducible associations between genetic variants and complex diseases, in general, the association between genetic variants and a single phenotype is usually weak. It is increasingly recognized that joint analysis of multiple phenotypes can be potentially more powerful than the univariate analysis and can shed new light on underlying biological mechanisms of complex diseases. As a complementary approach to GWAS, PheWAS analyze many phenotypes with a genetic variant and combine both the exploration of phenotypic structure and genotypic variation [11].
Similar to the widely used GWAS approaches, existing methods for PheWAS largely focus on the association between a single genetic variant with a large number of candidate phenotypes and test the association between one genetic variant and one phenotype at a time. In this paper, we develop a novel and powerful multivariate method, HCLC-FC, to test the association between a genetic variant with multiple phenotypes in each phenotypic category. HCLC-FC Table 3. The top nine SNPs that are associated with multiple phenotypic blocks identified by the HCLC-FC method based on the UK Biobank data. The information of the phenotypic blocks can be found at https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=41202. The bold-faced blocks indicate the associations with the corresponding SNP reported in the GWAS catalog. The number under the rs-number of SNP represents the total number of phenotypic blocks identified. FDR threshold is calculated at a nominal FDR level of 5%.

SNPs
Mapped

PLOS ONE
involves three steps. In the first step, we use the bottom-up hierarchical clustering method [26] to partition a large number of phenotypes into disjoint clusters within each category. In the second step, we apply the clustering linear combination method [24] to combine test statistics within each category based on the phenotypic clusters and obtain a p-value from each phenotypic category. In the third step, we propose a large-scale association testing procedure with theoretical guarantees for FDR control under flexible correlation structures. We perform extensive simulation studies to compare the performance of HCLC-FC with that of other existing methods. The results show that the existing methods using our proposed FDR control procedure can control FDR at a nominal level, and our proposed HCLC-FC method outperforms the other three methods we compare under the six models for different within-group and between-group phenotypic correlation structures. Finally, we also evaluate the performance of HCLC-FC through a set of 1869 case-control phenotypes based on ICD-10 code across more than 300,000 samples from the UK Biobank, where these phenotypes can be classified into 260 ICD-10 level 1 blocks. The real data analysis results show that HCLC-FC not only can well- control type I error rates but also can identify 1,244 SNPs that have previously been reported to be associated with some phenotypes in the GWAS catalog.
As we all know, over the last decades, biobanks have been extremely prevalent in medical research [41] and enable access to a large collection of high-quality biological or medical data and tissue samples, which contain thousands of diseases/traits and a large sample size [42]. However, in biobanks, case-control ratios of most phenotypes are extremely unbalanced. Dey et al. [39] pointed out that a normal approximation of the score test statistic has inflated type I error rates for phenotypes with unbalanced case-control ratios. They proposed a score-testbased single-variant test that estimates the distribution of the test statistic by using the saddlepoint approximation (SPA) [39] to control type I error rates and to adjust for covariates even in an extremely unbalanced case-control setting. Based on SPA, the Scalable and Accurate Implementation of GEneralized mixed model (SAIGE) was proposed to analyze large biobank data, controlling for both unbalanced case-control ratio and sample relatedness [43]. It uses SPA [39] to calibrate unbalanced case-control ratios in score tests based on logistic mixed models.
To extend our method to phenotypes with extreme unbalanced case-control ratios, we can apply the SPA method to adjust the score test statistics T mk in Step 2. The adjusted test statistic of T mk is given by signðT mk Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi where F Chi (.) denotes the cumulative distribution function of chi-squared distribution with one degree of freedom and p SPA mk is the p-value of T mk calculated using SPA. Then, we can use the adjusted test statistic to calculate T L m CLC . To extend our method to adjust for both case-control imbalance and family relatedness, we can apply the SAIGE method in Step 2 [43]. Instead of fitting the linear regression model, we can fit the null logistic mixed model to estimate the variance component and other model parameters, then test the association between each genetic variant and phenotype by applying SPA to the score test statistics. Finally, the adjusted test statistic can be used to calculate T L m CLC . However, the performance of these approaches for phenotypes with extreme unbalanced case-control ratios needs further evaluations.
TreeWAS is another approach that was developed for identifying cross-disease components of genetic risk across hospital classification codes within a hierarchical ontology in the UK Biobank [44]. It is based on a Bayesian approach that can estimate a Bayes factor statistic for the evidence that genetic coefficients are nonzero for at least one node and also estimate the marginal posterior probability of each node with a nonzero genetic coefficient. The Bayes factor supports the evaluation of evidence in favor of a null hypothesis, rather than only allowing the null to be rejected or not rejected. However, calculating the Bayes factor based on the estimation of the marginal likelihoods of each model requires complicated and extensive time-consuming operations. Moreover, TreeWAS is based on the tree-structured disease and diagnostic ontologies that are built into the systematized coding of medical conditions from the biobank [45]. Therefore, it is developed within two sources of tree-structured phenotypic data sets from the UK Biobank, one is the hospitalization episode statistics data that are coded by ICD-10 codes, and the other one is the self-reported diagnoses that are coded using UK Biobank classification tree [44]. However, our proposed method, HCLC-FC, is not limited to biobank data; it is suitable to be applied to data sets with multiple phenotypes from electronic health records, epidemiological studies, and clinical trial data [46].
Given the extensibility of our method, there are some natural avenues for future work. 1) Our study has been mainly focused on testing the association between one genetic variant with a large number of phenotypes to identify cross-disease components of genetic risk. Future work is needed to extend the current single variant test to gene-or region-based multiple variant tests to improve the power to identify disease susceptibility genes. For example, some existing methods that are developed to test an optimally weighted combination of common and/or rare variants with multiple phenotypes can be used to each phenotypic category [47] in the second step of HCLC-FC. Then, our FDR-control procedure can be used to calculate the rejection threshold. In addition to utilizing existing methods, developing new methods that make use of both gene-or region-based SNPs and a large number of phenotypes is also a direction of our future research. 2) HCLC-FC needs individual-level data for the analysis. We can extend this methodology to use GWAS summary statistics by estimating the dissimilarity matrix using the cross-trait linkage disequilibrium (LD) score regression that requires only GWAS summary statistics [48,49], then use this dissimilarity matrix to perform the HCM in the first step. However, the performance of these aforementioned extensions need to be evaluated carefully. We would like to pursue these important extensions in our future studies.
Despite the limitations of HCLC-FC, HCLC-FC has several important advantages over other existing methods for association studies using multiple phenotypes. First, it clusters phenotypes within each phenotypic category, which reduces the degrees of freedom of the association tests and has the potential to increase statistical power. Second, it is computationally fast and easy to implement. The CLC approach [24] uses a simulation procedure to estimate the pvalue of the final test statistic. HCLC-FC has an asymptotic distribution which avoids the computational burden of permutations. Third, the newly developed FDR controlling process is based on p-values and does not depend on test statistics. Therefore, it is more general and can be applied to other multiple testing procedures to control FDR. Fourth, HCLC-FC can be used for both continuous and discontinuous phenotypes. It can be applied to data sets with multiple phenotypes from electronic health records, epidemiological studies, and clinical trial data.