Multivariate analysis of genomics data to identify potential pleiotropic genes for type 2 diabetes, obesity and dyslipidemia using Meta-CCA and gene-based approach

Previous studies have demonstrated the genetic correlations between type 2 diabetes, obesity and dyslipidemia, and indicated that many genes have pleiotropic effects on them. However, these pleiotropic genes have not been well-defined. It is essential to identify pleiotropic genes using systematic approaches because systematically analyzing correlated traits is an effective way to enhance their statistical power. To identify potential pleiotropic genes for these three disorders, we performed a systematic analysis by incorporating GWAS (genome-wide associated study) datasets of six correlated traits related to type 2 diabetes, obesity and dyslipidemia using Meta-CCA (meta-analysis using canonical correlation analysis). Meta-CCA is an emerging method to systematically identify potential pleiotropic genes using GWAS summary statistics of multiple correlated traits. 2,720 genes were identified as significant genes after multiple testing (Bonferroni corrected p value < 0.05). Further, to refine the identified genes, we tested their relationship to the six correlated traits using VEGAS-2 (versatile gene-based association study-2). Only the genes significantly associated (Bonferroni corrected p value < 0.05) with more than one trait were kept. Finally, 25 genes (including two confirmed pleiotropic genes and eleven novel pleiotropic genes) were identified as potential pleiotropic genes. They were enriched in 5 pathways including the statin pathway and the PPAR (peroxisome proliferator-activated receptor) Alpha pathway. In summary, our study identified potential pleiotropic genes and pathways of type 2 diabetes, obesity and dyslipidemia, which may shed light on the common biological etiology and pathogenesis of these three disorders and provide promising insights for new therapies.

Introduction Inspired by this, we performed the current work to identify potential pleiotropic genes for type 2 diabetes, obesity, and dyslipidemia using Meta-CCA. We used six correlated quantitative traits reported to be established related factors for type 2 diabetes, obesity or dyslipidemia, including FG, FI, BMI, WHR, HDL and triglyceride (TG). Interestingly, our findings indicated that five pathways and 25 genes (including two confirmed pleiotropic genes and eleven novel pleiotropic genes) were potential pleiotropic genes for type 2 diabetes, obesity, and dyslipidemia. These findings yielded some genetic basis for the common biological etiology and pathogenesis, and thus provided promising insights for a potential common therapy for the three disorders.

GWAS datasets and processing
Step1: Annotation genes to SNPs, and SNP prune. The large-scale GWAS datasets of the six correlated traits in the present study were downloaded from http://diagram-consortium. org/2015_ENGAGE_1KG/ [20,21]. The glycemic traits (FG and FI) data [20] were derived from a meta-analysis of 13 original GWAS studies, including FG data for 46,694 individuals and FI data for 24,245 individuals. The obesity-related data (BMI and WHR) [20] were derived from a meta-analysis of 22 original GWAS studies, including BMI data for 87,048 individuals and WHR data for 54,572 individuals. The lipids (HDL and TG) data [21] were derived from a meta-analysis of 22 original GWAS studies, including 62,166 individuals for both HDL and TG. The large-scale GWAS datasets were collected by the European Network for Genetic and Genomic Epidemiology (ENGAGE) Consortium, and all samples were from individuals of European ancestry (The details were shown in Table 1). The large-scale GWAS datasets were the largest datasets that included all six correlated traits and contained all the information needed to conduct the analyses in the Meta-CCA framework in one ethnicity (Caucasians). We selected the overlapped SNPs (9,411,134 SNPs) of the six traits to perform the multivariate analysis.
The analytical workflow of our study is presented in Fig 1. Firstly, we completed the gene annotation according to the 1000 Genome datasets using PLINK1.9. We downloaded the reference data, which contained 26,291 genes from the website: https://www.cog-genomics.org/ static/bin/plink/glist-hg19. We recognized the transcript including all SNPs (both exonic and intronic SNPs) in the region as genes. We selected the overlapped SNPs between the six traits and the reference data in our study. After the gene annotation of SNPs, we pruned the SNPs for each gene using the parameter r 2 = 0.01 [22]. r is the Pearson correlation coefficient between any of the two SNPs in one region. The purpose of SNP pruning is to reduce potential biases caused by the linkage disequilibrium (LD) among SNPs [22].  For the present work, the GWAS datasets we used provided not only the p value of the SNPs but also the regression coefficient β and the standard error (SE). We normalized the regression coefficient β before conducting the Meta-CCA using the following equation (n is the sample size in the corresponding GWAS dataset of each trait):

Statistical analysis
Step 2: Meta-CCA analysis. The present work was conducted by the Meta-CCA R package, and each gene as a unit for Meta-CCA analysis. We recognized the transcript including all SNPs (both exonic and intronic SNPs) in the region as genes. The program of Meta-CCA required three basic data inputs: the genotypic correlation structures between SNPs (∑XX), the correlation coefficients between SNPs and traits (∑XY), and the phenotypic correlation structures between traits (∑YY) ( Table 1) [19]. In Meta-CCA, ∑XX was estimated using a reference SNP dataset such as the HapMap data or the 1000 Genomes data representing the study population. It is better if ∑XX was estimated from the target population or the same ethnicity from the database used [19]. As all of the participants in our study were Caucasians, we calculated the ∑XX based on the 1000 Genomes data and downloaded the reference data from 1000 Genomes Project (Phase 3 European (CEU) population reference data (https://www.coggenomics.org/static/bin/plink/glist-hg19). ∑XY was estimated by the normalized regression coefficient β: G and P are the number of genotypic and phenotypic variables, respectively. And ∑YY was estimated based on the Phenotypic Pearson Correlation Coefficients (YY), which was shown in Table 1 [23]. In the original study, all of the participants had the clinical data of the six traits (FG, FI, BMI, WHR, HDL and TG). The phenotypic Pearson correlation coefficients between the six traits were calculated from the same corresponding individuals [23]. The details of the exact procedures of Meta-CCA program were described in Anna Cichonska's paper [19]. After Meta-CCA, we obtained the output data of the relationship between the genes and the six traits. We used a p value < 0.05 (after Bonferroni correction) as the nominal significance threshold.
Step 3: Gene-based association analysis. To refine the identified genes by Meta-CCA, we tested their specific relationships with the six traits respectively using VEGAS-2 (Versatile Gene-based Association Study-2) [24,25], a gene-based algorithm widely used for gene-based p value calculation using GWAS summary statistics. VEGAS-2, an approach provides the correlation of all the SNPs in one gene region for one single trait, also has lower false positive rates compared with other gene-based approaches [26,27]. We selected the potential pleiotropic genes significantly associated with more than one trait (P value< 0.05/2720, Bonferroni correction) [28] after obtaining the gene-based p-value of each gene for the six traits using VEGAS-2.

Functional annotation and gene enrichment analyses
Step 4: Pathway and GO (Gene Ontology) term enrichment analyses of the potential pleiotropic genes. To explore the biological functions of the identified potential pleiotropic genes, we performed pathway enrichment analyses and GO enrichment analyses for the potential pleiotropic genes using Enrichr (a web server tool for gene set enrichment analysis: http:// amp.pharm.mssm.edu/Enrichr/) [29]. The Pathway and GO Term enrichment analyses also provide a better understanding of the polygenic associations and the potential mechanisms of the biological process. With this program, we used hypergeometric tests and Fisher's exact tests for the statistical analysis. Benjamini-Hochberg corrected p value <0.05 in the enrichment analysis is used as the threshold for significance.

Results
After gene annotation and SNP pruning, 21,209 genes were left to conduct the Meta-CCA analysis in our study. The number of SNPs in each gene ranged from 1 to 280; the average was 13. We used the threshold of 0.05/21209 (Bonferroni correction) as our target alpha level for the Meta-CCA analysis [28]. For the Meta-CCA analysis, 2,720 genes with the p value < 0.05/ 21209 were identified as potential pleiotropic genes for the six correlated quantitative traits. After Meta-CCA analysis, we tested the 2,720 genes' relationship to the six traits using VEGAS-2. Only the genes significantly associated (Bonferroni corrected p value < 0.05) with more than one trait were kept [24,25]. There were 31, 0, 75, 1, 225 and 185 significant genes (Bonferroni corrected p value < 0.05) for FG, FI, BMI, WHR, HDL, and TG respectively. By screening the genes based on the results of gene-based p value, a total of 25 associated genes related to more than one trait in VEGAS-2 analysis were identified as potential pleiotropic genes for type 2 diabetes, obesity and dyslipidemia. The details are shown in Table 2.
Interestingly, four of the top five significant genes (GALNT2, SNX17, CETP, LIPC) were regarded as dyslipidemia associated genes in the original GWAS study [21]. In particular, two (GALNT2, SNX17) were also suggested to be associated with type 2 diabetes and obesity in previous studies [35, 36]. All 25 potential pleiotropic genes were identified as the associated genes/loci (with at least one SNP p value < 5 Ã 10 −8 ) for TG in the original GWAS study, while eight of these 25 genes (GALNT2, GCKR, LPL, FADS1, LIPC, CETP, APOA5, ZPR1) were reported to be TG associated genes in the original GWAS study after validation [21]. Specifically, two of these 16 genes (FADS1, GCKR) have been identified as susceptibility candidate genes for type 2 diabetes in early GWAS studies [37, 47,51].
GO enrichment analyses (conforming to the up-to-date 2017 database) [29, 52] revealed that the biological functions of these pleiotropic genes were mainly involved in the metabolism of lipids. For the GO biological process, the top five significant GO terms were Triglyceride homeostasis (GO:0070328), Cellular triglyceride homeostasis (GO:0035356), Positive regulation of lipoprotein lipase activity (GO:0051006), Cholesterol homeostasis (GO:0042632) and Reverse cholesterol transport (GO:0043691). For the GO cellular component, the top five significant GO terms were Very-low-density lipoprotein particle (GO:0034361), Spherical highdensity lipoprotein particle (GO:0034366), Early endosome (GO:0005769), Early endosome   Table 4. GO Term is the gene collection of different arborescence types [53]. Therefore, some GO Term is the branch of others. As a result, there is a considerable overlap of genes between related GOterms such as "Triglyceride homeostasis" and "Cellular triglyceride homeostasis". In summary, our present work identified twenty-five potential pleiotropic genes as well as the enriched pathways and GO terms of potential pleiotropic genes for type 2 diabetes, obesity and dyslipidemia.

Discussion
The present study, the first systemically multivariate analysis of genomics data for type 2 diabetes, obesity and dyslipidemia jointly using Meta-CCA, identified potential pleiotropic genes as well as enriched pathways and GO terms. Importantly, two of the 25 identified genes (GCKR, FADS1) were reported to be associated with type 2 diabetes, obesity and dyslipidemia in different prior studies [21,37,38,46,47], validated by our present study. Other significant genes, excluding the genes that were reported to be associated with hyperglycemia, obesity and dyslipidemia in other types of previous studies, might be novel pleiotropic candidate genes (such as LIPC-AS1, IFT172, KRTCAP3, CSGALNACT1, EIF2B4, GTF3C2, ZNF513, NRBP1, FNDC4, TMEM258, and CLPTM1) for the three disorders. We preformed the functional proteinprotein interaction network analysis for the potential pleiotropic candidate genes by using STRING 10.5 (https://string-db.org/cgi/input.pl). Fig 2 shows that there are interactions between most of the potential pleiotropic candidate genes. The results not only revealed some of the shared genetic components but also provided novel insights for exploring the potential common biological pathogenesis of these three disorders.
Many genes and pathways have pleiotropic effects on more than one disease, a common phenomenon supported by this study of type 2 diabetes, obesity and dyslipidemia. Recently, animal experiments and cross-sectional population-based studies have shown evidence of large shared gene components. Pleiotropic genes have been successfully identified in bivariate analyses of type 2 diabetes with obesity, type 2 diabetes with dyslipidemia, and obesity with dyslipidemia. However, multivariate analysis had not previously been conducted for these three disorders simultaneously. Systemically exploring the pleiotropic genes and their effects on these three disorders is essential, and is possible because of the accessibility of the GWAS summary statistics. The advantages of Meta-CCA are listed as follows. Firstly, Meta-CCA can detect correlations between multiple variants and multiple traits based on GWAS summary statistics [19], which might provide richer clues for finding novel gene targets in multivariate analyses compared to the univariate and bivariate analysis [18]. For example, TMEM258, a gene for adipose tissue regulation, was not identified by any previous GWAS studies for type 2 diabetes and obesity, but was one of the novel pleiotropic candidate genes for type 2 diabetes, obesity and dyslipidemia identified in this study. Secondly and notably, Meta-CCA can identify novel candidates, since some of the associations become detectable only when multiple variants and multiple traits are tested jointly [19]. For example, CETP, a well-known gene for dyslipidemia, was not identified by any previous GWAS studies for type 2 diabetes, but it was one of the novel findings in our study. Last but not least, Meta-CCA is a cost-effective analytical method based on the data of GWAS summary statistics, which provides an enlarged larger effective sample size to detect potential pleiotropic genes for multivariate traits. Meta-CCA and similar types of analyses are an emerging and powerful tool for detection of pleiotropic genes of multiple correlated traits using GWAS summary statistics.
Among the 25 potential pleiotropic genes, two genes, GCKR, and FADS1, were suggested to be pleiotropic genes for type 2 diabetes, obesity and dyslipidemia based on the results of previous studies [21,37,38,46,47]. GCKR, located in 2p23.3, which encodes the protein belonging to the glucosidase regulatory subfamily, which in turn inhibits glucosidase by binding to enzymes in pancreatic islet cells and liver. A recent study found that the variants of GCKR were associated with obesity in postmenopausal women [38]. FADS1, located in 11q12.2, encodes a protein that is a member of fatty acid desaturases. FADS1 was reported to be related to type 2 diabetes by a previous GWAS study, but the mechanism was still unknown [51].
From a biochemistry point of view, eight (APOA5, APOA1, APOC2, CETP, LPL, LIPC, GCKR, GALNT2) of the twenty-five potential pleiotropic genes were involved in important metabolic routes. Details are summarized in Fig 3. APOA5, APOA1, and APOC2 encode lipoproteins which mainly ferry TG, HDL, and VLDL (very low density lipoprotein), respectively. CETP encodes cholesteryl ester-transfer protein, which transfers HDL into VLDL and IDL (intermediate density lipoprotein) by involving the transportation of cholesteryl ester. LPL is a Fig 2. The nodes represent proteins which were encoded by corresponding genes, edges represent the proteinprotein association, line color represents types of interaction evidence (e.g., text mining, co-expression and so on). All of the interacting proteins with an interaction score ! 0.15 (based on previous study). https://doi.org/10.1371/journal.pone.0201173.g002 The identification of potential pleiotropic genes for type 2 diabetes, obesity and dyslipidemia PLOS ONE | https://doi.org/10.1371/journal.pone.0201173 August 15, 2018 lipoprotein lipase which plays a critical role in lipid metabolism such as transferring VLDL into IDL. The function of the protein hepatic triglyceride lipase encoded by LIPC is important in catabolism of lipids, including transferring IDL into LDL. GALNT2 and GCKR are involved in the metabolism of glucose as mentioned above.
Our results, as previously described, have identified 25 genes and five pathways associated with type 2 diabetes, obesity, and dyslipidemia. Interestingly, all 25 genes were identified as the associated genes/loci for TG in the original GWAS study [21], though just eight of these 25 genes were refined in the validation stage [21]. All five pathways were associated with the metabolism of lipids. Specifically, two types of lipid-lowering drugs successfully targeted the statin pathway and PPAR Alpha pathway respectively, suggesting that abnormal plasma levels of lipids play a critical role in the common biological pathogenesis of the three disorders. Some drugs targeted on the statin pathway have been used successfully in therapy for type 2 diabetes patients with dyslipidemia. Another significant pathway is the PPAR Alpha pathway. The PPAR pathway family, which includes the PPAR Alpha pathway, the PPAR Beta pathway, and the PPAR Gamma pathway, plays a key role in substance metabolism (including glucose metabolism, lipid metabolism, and protein metabolism). Specifically, PPAR Alpha was a core The identification of potential pleiotropic genes for type 2 diabetes, obesity and dyslipidemia factor for fatty acid oxidation in liver, which was activated by ligands or drugs such as fibrates, resulting in a decrease in serum level of TG [54]. PPAR Gamma was also an important factor for the etiology of IR [55]. Drugs targeting PPAR Gamma, such as thiazolidinedione, were effective in the control of IR [56].
Our present study also indicated that TG played an important role in these three disorders, as all 25 potential pleiotropic genes were identified as associated genes/loci for TG in the original GWAS study (though eight of these 25 genes were reported to be TG associated genes after validation) [21]. For type 2 diabetes, hypertriglyceridemia is the most common type of dyslipidemia [57], which is mainly induced by IR and impairment in insulin secretion. Further, genomic studies [57,58] have indicated that hypertriglyceridemia has a higher genetic correlation with type 2 diabetes than other types of dyslipidemia. For obesity, most of the plasma TG is determined by the level of VLDL-TG (the balance between synthesis and clearance of VLDL-TG), and the synthesis of VLDL-TG is associated with total fat mass and liver fat [59]. Thus, the large amount of fat mass in obese patients leads to increasing synthesis of VLDL-TG, but the clearance of VLDL-TG remains unchanged. Hypertriglyceridemia is a principal characteristic of dyslipidemia and is linked to many other types of dyslipidemia such as decreased HDL level and increased small dense LDL level [60]. Above all, the metabolism of TG seems to play a core role in the common biological pathogenesis of these three disorders.
Our study not only provides a better understanding of the shared genetic background for the three disorders, but also produced a list of potential novel pleiotropic candidate genes for follow-up study in further biological experiments. Some of the 25 pleiotropic genes (LIPC-AS1, IFT172, KRTCAP3, CSGALNACT1, EIF2B4, GTF3C2, ZNF513, NRBP1, FNDC4, TMEM258, and CLPTM1) were first reported to be associated with type 2 diabetes, obesity and dyslipidemia. The findings of our present work were not completely consistent with the findings in previous GWAS studies or other types of systematically analysis studies in type 2 diabetes and other metabolic related diseases (The details of the overlapped identified genes and the novel potential pleiotropic candidate genes were shown in Table 2). The reason for the different findings in type 2 diabetes might be the using of different datasets and different methods in these studies. For example, one published work that using integrative omics data shown that 15 SNPs and the corresponding genes were associated with type 2 diabetes [61]. However, these genes were not identified by our present work. Our present work did not identify all the genes that were identified in GWASs or other types of studies, as it was only a supplementary study to identify the potential pleiotropic genes for chronic complex diseases. We hope that the potential novel pleiotropic candidate genes can provide some clues for molecular biologists performing future functional validation studies to determine whether the findings truly have pathophysiological significance for type 2 diabetes, obesity and dyslipidemia.

Conclusion
In this study, we identified and assessed some potential pleiotropic genes and pathways for type 2 diabetes, obesity and dyslipidemia using novel Meta-CCA analysis. The findings validated two previously identified pleiotropic genes (GCKR, FADS1) for these three disorders and highlighted another eleven significant genes (LIPC-AS1, IFT172, KRTCAP3, CSGALNACT1, EIF2B4, GTF3C2, ZNF513, NRBP1, FNDC4, TMEM258, and CLPTM1) as potential novel pleiotropic candidate genes for the three disorders. Further, the potential pleiotropic genes were significantly enriched in five pathways including the statin pathway and PPAR Alpha pathway.
In conclusion, our findings may yield novel insights into exploring the common biological pathogenesis of these three disorders, which ultimately may lead to the development of effective drug therapies.