Principal Component Analysis Characterizes Shared Pathogenetics from Genome-Wide Association Studies

Genome-wide association studies (GWASs) have recently revealed many genetic associations that are shared between different diseases. We propose a method, disPCA, for genome-wide characterization of shared and distinct risk factors between and within disease classes. It flips the conventional GWAS paradigm by analyzing the diseases themselves, across GWAS datasets, to explore their “shared pathogenetics”. The method applies principal component analysis (PCA) to gene-level significance scores across all genes and across GWASs, thereby revealing shared pathogenetics between diseases in an unsupervised fashion. Importantly, it adjusts for potential sources of heterogeneity present between GWAS which can confound investigation of shared disease etiology. We applied disPCA to 31 GWASs, including autoimmune diseases, cancers, psychiatric disorders, and neurological disorders. The leading principal components separate these disease classes, as well as inflammatory bowel diseases from other autoimmune diseases. Generally, distinct diseases from the same class tend to be less separated, which is in line with their increased shared etiology. Enrichment analysis of genes contributing to leading principal components revealed pathways that are implicated in the immune system, while also pointing to pathways that have yet to be explored before in this context. Our results point to the potential of disPCA in going beyond epidemiological findings of the co-occurrence of distinct diseases, to highlighting novel genes and pathways that unsupervised learning suggest to be key players in the variability across diseases.


Introduction
Comorbidity studies show that some distinct diseases tend to cooccur [1][2][3][4][5][6], pointing to a shared genetic and/or environmental component. In the era of genome-wide association studies (GWASs), direct evidence of shared genetic risk factors of diseases comes to light [7]. For example, while it has been previously shown that rheumatoid arthritis and type-1 diabetes co-occur [1], GWASs have identified 12 genes associated with both diseases [8][9][10][11][12][13][14][15][16]. More broadly, disease genes obtained from the Online Mendelian Inheritance in Man [17] were used to assemble the Human Disease Network (HDN) [18,19], a visual representation of genetic similarity between diseases. Pleiotropy of complex diseases and traits has also been explored by searching genome-wide for variants implicated in more than one disease [16,20,21]. Such studies promise to reveal shared genes and offer an expanded understanding from a genetic standpoint of why some diseases tend to co-occur.
Methods for exploring shared genetic risk factors between diseases belong to two main categories (see also recent review [7]). The first category of methods focuses on finding individual variants that are associated with a pair or more of diseases being investigated. In one set of such methods, a GWAS is carried out on a pooled set of individuals with different diseases [10,16,20,21], or by analyzing information for multiple diseases available for the same individuals [22,23]. Alternatively, and based only on summary statistics of the association test for each single nucleotide polymorphism (SNP), one can simply combine p-values from several GWASs using Fisher's method [24]. The CPMA (crossphenotype meta-analysis) statistic [25] is another statistic that tests whether a SNP is associated to more than one phenotype. In addition, methods such as the conditional false discovery rate or mixed-models for multiple traits have used known pleiotropy between diseases or traits to increase power [26,27]. Studies employing these methods have found shared associations between pairs of diseases such as Crohn's disease and celiac disease [16], other autoimmune disease pairs [20,21], bipolar disorder and schizophrenia [26] and multiple sclerosis and schizophrenia [28]. They have additionally shown that SNPs associated with one autoimmune disease are likely to be associated to other (though not all) autoimmune phenotypes [25].
The second category of methods focuses on using shared variants to learn about the genetic similarity between diseases. One method employed by Sirota et al. utilizes the correlation between association signals across many SNPs to assess the similarity between pairs of diseases and showed that there are likely two distinct autoimmune classes where a risk allele for one class may be protective in another [29]. Similar methods based on classifier [30] and linear mixed model approaches [27,31] have also been proposed for assessing the shared genetic variation between two diseases.
These exciting new methods are powerful for studying shared genetic risk variants between diseases. At the same time, overcoming some of their limitations can improve the study of shared pathogenesis using data from multiple GWASs. First, some methods have focused on analysis of individual SNPs. Though well suited for scenarios of a single causal SNP in a locus, such methods would suffer a reduction in power when several causal SNPs exist or if different SNPs tag the same underlying causal variant, which is especially relevant for diseases with rare causal variants [32,33] and when the different GWASs are across different populations [34] or have used different genotyping arrays. Second, when considering the correlation between association statistics of different studies, it might be beneficial to not consider all variants equally (as is the case in [29]), whether or not they play a role in disease susceptibility. Third, most methods assume as known which diseases share pathogenesis, and while the shared pathogenesis of autoimmune disease has been well established [25,29], it is worthwhile to study shared pathogenesis of other disease classes [6,35,36]. And fourth, while some approaches perform well for two correlated traits or diseases, extending the analysis to more than two traits can become difficult [27].
In this study, we present a novel method, disPCA, which uses principal component analysis (PCA) to learn about the shared genetic risk of distinct diseases. PCA maps data from the original axes into new axes in principal component (PC) space via a stretch and rotation of the original axes. Each new axis or PC captures the maximal level of variation in the data not captured by previous PCs. Thus, each PC can potentially capture a different, orthogonal story told by the data. Our method is based on summary level statistics from GWASs of different diseases. We combine data from individual SNPs into gene-based statistics via several p-value combination methods. PCA is applied to a matrix across genes and GWAS datasets, with entries representing the strength of association between a gene and the disease studied in a dataset. Thus, disPCA reveals principal components that are linear combinations of all genes, weighed in accordance with their role in differentiating between the different GWASs. It can be applied to study multiple diseases without prior knowledge of their shared pathogenesis, thereby overcoming all the limitations of existing methods outlined above. disPCA also accounts for potential confounders due to methodological differences between studies, such as in genotyping array, which can otherwise lead to these differences being captured by the PCA.
Equipped with this novel method and with data from 31 GWAS datasets, we considered the level of shared pathogenesis between diseases and classes of diseases from all genes, which we term shared pathogenetics. Diseases with more similar underlying genetics are more likely to be located closer together in PC space. As PCA is a non-parametric method, it makes no assumptions regarding which diseases are more similar and does not aim to model it, thereby allowing discovery of new relationships between diseases by examining the top PCs. Each PC captures a different combination of genes that distinguish well between some diseases, or the remaining variation between diseases. No separation between diseases along a PC indicates that they tend to share the pathogenetics underlying that PC. By studying the set of genes underlying each PC for enrichment in specific pathways, we further assessed the function and relationship of genes that separate different disease clusters in PC space.

disPCA
We developed a method, disPCA, for studying the relationship between diseases based on their level of disease risk genes shared. The method works on the gene-level by first combining information from all SNPs in and around each gene. Considering gene-level statistics compensates for different tag SNPs being associated in different datasets even in cases where they capture the same causal variant. It also aggregates information across multiple tag SNPs in each dataset, as well as allows for different underlying causal variants in the same gene being associated with the risk of different diseases. To be widely applicable, disPCA is based solely on the p-values of association of each SNP with the disease under study. Importantly, all SNPs and consequently all genes are considered, rather than focusing on genes that meet a genome-wide significance level of association with a disease. We apply PCA to many different GWASs to axiomatically find and assign importance to genes based on their contribution to distinguishing between diseases and disease classes. The ensuing distance between different disease datasets in PC space inversely corresponds to their level of shared pathogenetics.

Gene-level significance levels
For each protein-coding gene from the HGNC database [37], we mapped all SNPs that are in the gene or within 0.01 cM from it (genetic distances were determined via the Oxford genetic map based on HapMap2 data [38,39]). We discarded all SNPs that were not mapped to within 0.01 cM of any gene. If a SNP lay between two genes, it was assigned to the closer gene. For each GWAS dataset, we determined the significance of association of each gene with the assayed disease using the following simulation procedure. Let the observed p-value of a gene be the minimum p-value of the n SNPs mapped to the gene. We compared the observed p-value to that of 100,000 groups of n consecutive SNPs chosen in random. Based on these groups, we assign a new p-value to each gene as the proportion of groups for which the observed minimum p-value for that gene is less significant than that of the group. This random sampling procedure may be biased in regions of high linkage disequilibrium (LD) when mapping SNPs to genes using genetic distance (e.g. consecutive SNPs in regions of high LD will be more

Author Summary
Epidemiological studies have revealed distinct diseases that tend to co-occur in individuals. As genome-wide association studies (GWASs) have increased in numbers, more evidence regarding the genetic nature of this shared disease etiology is revealed. Here, we present a novel method that utilizes principal component analysis (PCA) to explore the relationships and shared pathogenesis between distinct diseases and disease classes. PCA groups and distinguishes between data points by uncovering hidden axes of variation. Applying PCA to 31 GWASs of autoimmune diseases, cancers, psychiatric disorders, neurological disorders, other diseases and body mass index, we report several findings. Diseases of similar classes are located near each other, supporting the genetic component of shared disease etiology. Genes that contributed to distinguishing between diseases are enriched for various pathways including those related to the immune system. These results further our knowledge of the genetic component of shared pathogenesis, highlight possible pathways involved and provide new guidelines for future genetic association studies. correlated than those in regions of lower LD). However, for any given gene, these will equally affect each of the datasets. To validate this, we also applied disPCA to p-values obtained from mapping SNPs to genes using physical distance: a SNP was mapped to a gene if it was in the gene or within 10 kb of it. Comparing these results to results based on mapping via genetic coordinates revealed the same clustering of diseases ( Figure S1). Furthermore, in studying the loading of each gene, namely their contribution to each PC, we found that the genes with the top 50 average loadings on the first two PCs were significantly correlated (r.0.67, p-value,8.4610 28 , Table S1). Thus, in the main text we present results based on mapping by genetic distance as described above.
To consider information from beyond only the most significant SNP in a gene, we also implemented the truncated tail strength [40] and the truncated product methods [41] to combine p-values in each gene in replacement of the minimum p-value, and followed a similar procedure for assigning new gene-level p-values. For the analyses presented in the following, results from all methods were similar though results with the minimum p-value approach clusters similar diseases better ( Figure S2, S3). We thus only report in the main text results from the minimum p-value approach. Code to carry out this procedure is publicly available at http://keinanlab.cb.bscb.cornell.edu/content/tools-data.

PCA implementation and confounders
Assume a matrix Z, a d6g matrix of the 2log 10 gene-level pvalues, where d is the number of GWAS datasets, and g is the number of genes present in all datasets. We center the matrix by subtracting the column means from each column. Thus the centered matrix B has entries: To obtain the PCs of matrix B, we must find the eigenvectors and eigenvalues of its covariance matrix BB T . Let v i be a vector of length d and let l i be a scalar. v i is the eigenvector and l the eigenvalue of BB T if the following is satisfied: The principal components of B are the normalized eigenvectors of its covariance matrix, BB T , where the eigenvectors are ordered such that the largest eigenvalue corresponds to the first principal component. Each eigenvector is additionally orthogonal to all other eigenvectors. Thus, from (2), we can decompose BB T as follows: Where the columns of U contain the principal components and g is a diagonal matrix with entries equal to the eigenvalues of B's covariance matrix. One can similarly construct the singular value decomposition (SVD) of B. The SVD of B can be written as: where V is a d6d matrix, D is a d6g diagonal matrix, and W is a g6g matrix. V and W contain the left and right singular vectors of B, respectively, and D contains the singular values of B in its diagonal. Substituting equation (4) for B in equation (3), we find that Thus, the principal components of B, the eigenvectors of its covariance matrix, are equivalent to the left singular vectors of B. In addition, the eigenvalues of B are equivalent to the square of its singular values.
We applied SVD to the matrix B using the R [42] implementation of PCA/SVD (prcomp), with no scaling of the data. Due to the heterogeneity of the GWAS datasets (Table S2), variation uncovered by PCA can also reflect differences in features such as genotyping array, association method, and sample size, rather than underlying disease risk genes. To ensure that these features did not influence our results, we first tested each gene for association with each of these features. Let z i = Z i,? be the vector corresponding to the association statistic for gene i across the d datasets. We considered a linear regression of z i as a function of the covariates: z i~a zb i,1 C 1 zb i,2 C 2 zb i,3 C 3 z", where C 1 , C 2 , C 3 are vectors of length d that represent the genotyping array, association method and the log 10 of the sample size respectively, in each of the studies (Table S2). Testing the significance of regression coefficients can reveal genes that are associated with any of these potential confounders. In our following analysis, 19 genes were significantly associated with association method. However, genes not significantly associated to the above confounders may similarly have an effect. Hence, we also applied SVD (as described above) to the residualized matrix, namely . We found that applying SVD to R results in the top PCs capturing a higher fraction of the variance of the data than when applied to the original matrix Z, though results are qualitatively similar between the two. We thus present results derived from the residualized matrix R. Resulting distances between datasets were assessed visually by plotting datasets in PC space. To quantify the clustering of datasets, we additionally applied hierarchical clustering in R [42] (hclust) to the Euclidean distance between pairs of datasets across the first two PCs.

Simulation study
We simulated a matrix Z for two disease classes, each with 5 diseases (A 1 ,A 2 ,A 3 ,A 4 ,A 5 ,B 1 ,B 2 ,B 3 ,B 4 ,B 5 ) and 10,000 genes. In general, under the null hypothesis of a region containing no risk variant and assuming no confounding factors (e.g. population stratification), p-values should be uniformly distributed between 0 and 1. On the other hand, associated risk variants should be enriched for smaller p-values. We thus considered three sets of genes. The pvalues for the first set of genes was drawn from the U(0,1) distribution for all diseases, thus no pleiotropy was captured in this set of genes. The second set of genes was distributed U(0,0.05) for the first disease class (A 1 ,…,A 5 ) and distributed U(0,1) for the second disease class (B 1 ,…,B 5 ). Finally the third set of genes was distributed U(0,0.05) for the following diseases: A 1 , A 2 , B 1 , B 2 and distributed U(0,1) for all other diseases. Thus the second set of genes simulates pleiotropy between diseases in disease class A, while the last set of genes simulates pleiotropy between diseases in both disease classes.

Disease and pathway enrichment analysis
Disease enrichment analysis was completed using the online tool WebGestalt [43,44] to query the PharmGKB [45] database. WebGestalt tests for enrichment of a category of genes in the observed set of genes using the hypergeometric test [43]. Bonferroni correction for multiple tests was applied and all reported p-values are following this correction. We restricted analysis to categories that contained a minimum of 5 genes in our analysis with the largest 50 weightings in the top two PCs. For gene categories with overlapping or the same set of genes, we list the most significant category. To reduce biases introduced by the clustering of genes with similar function, we filtered our list of genes with the top 50 loadings on the top two PCs by removing the latter gene out of a pair of genes within 0.1 cM of each other. We then applied WebGestalt to this filtered subset of genes.
Pathway enrichment analysis was completed using the Gene Set Enrichment Analysis (GSEA) tool [46]. GSEA sorts genes according to a score, which here is the weighting of a gene in the PC under study. It then assesses whether genes belonging to a certain category (e.g. pathway) are non-randomly distributed in the sorted list. As input to GSEA, we utilized the weights of genes in the top two PCs. GSEA carried out 10,000 gene-set permutations to determine FDR (false discovery rate) q-values. We queried the BioCarta and KEGG pathway databases. We restricted analysis to categories that contained a minimum of 5 genes in our analysis. Throughout we present enrichment analysis only for the top two PCs, though other PCs are available and can be assayed for further insight into the diseases studied. We considered an FDR of 0.25, suggested by GSEA [46] (GSEA manual online), though this entails that 1 in 4 of our results are false positives on average. As above, to reduce biases introduced by the clustering of genes with similar function, we filtered our full list of genes by removing the latter gene out of a pair of genes within 0.1 cM of each other and reanalyzed this subset of genes (n = 5,298) with GSEA.

Testing for non-random distribution of p-values
We followed a similar approach to that implemented in Zhernakova et al. 2011 [21] while applying it to genes instead of individual SNPs to test for non-random distribution of association values. For each disease pair we retained all k genes that were nominally significant in one disease (p-value,0.01). We then tested the null hypothesis of a uniform distribution of p-values in the second disease using Fisher's method for combining p-values: where p i is the p-value for association of gene i in the second disease. Nearby genes in linkage disequilibrium may violate the independency assumption in Fisher's method. We thus performed a separate analysis after removing the latter of the two genes that were within 0.1 cM of each other and nominally significant in one disease.

Application of disPCA to 31 GWAS datasets
We analyzed a total of 31 GWAS datasets [10, that spanned different types of cancers, autoimmune diseases, neurological disorders, psychiatric disorders, type-2 diabetes (T2D), ischemic stroke and body mass index (BMI) ( Table S2). Datasets were publicly available, obtained from dbGaP or obtained via collaborations. These datasets had non-overlapping samples and were of European ancestry only. For Wellcome Trust Case Control (WT) related datasets, we distributed controls between the five datasets such that none had overlapping samples. For WT type-1 diabetes, rheumatoid arthritis and Crohn's disease, we obtained further controls from the WT hypertension, cardiovascular disease and bipolar disorder case data [10]. After obtaining gene-level association statistics for 14,018-17,438 autosomal genes for each dataset, we limited our analysis to the 11,927 genes that overlapped all studies. Nineteen of these genes were significantly associated with association method after multiple-testing correction (see above).

Replication of disPCA
We tested the replicability of disPCA when applied to real GWASs using six datasets for which we had access to the original data [10,57,60,61,74,75]. Each dataset was split into independent subsets of equal size (+/2 two samples). We then used PLINK's logistic regression [77] to evaluate association of each SNP to disease risk. We additionally incorporated covariates derived from EIGENSOFT into the regression analysis [78] to control for population structure. We randomly chose one subset of each of the six datasets for one disPCA analysis, and the rest for another. Hence, these two analyses consist of independent samples.

Results
We first applied disPCA to a simulated dataset (Materials and Methods). We varied the number of genes that have correlated association results across simulated datasets, thereby varying the level of pleiotropy between the simulated diseases. disPCA clearly clustered pleiotropic diseases when diseases shared at least 40 shared genes with p-values randomly distributed below 0.05 in each disease (Figure 1a-b, S4, S5, S6). This can be seen both visually via PCA plots, and via hierarchical clustering based on the Euclidean distance between datasets in the presented space of the first two principal components (PCs) (Figure 1, S4, S5, S6). When diseases are indeed clustered by their simulated pleiotropy according to disPCA (Figure 1b), the first two PCs explain a similar fraction of the variance (Figure 1c), which may increase or decrease depending on the number of genes contributing to pleiotropy ( Figure S7). We next examined the contribution of each gene to each PC as captured by its absolute ''loading''. Considering the first two PCs in this disPCA analysis, genes with p-values,0.05 (Materials and Methods) are also enriched for larger absolute loadings, stressing their role in differentiating between the simulated disease classes (Figure 1d-e).
We next applied disPCA to empirical data from GWAS datasets. First, we considered only diseases for which we had two datasets: autoimmune diseases (for which we had the most pairs of datasets) and a pair of schizophrenia datasets (as schizophrenia has a high heritability [79]). We observed that datasets of the same diseases were generally clustered together (Figure 2-3). We additionally observed that Crohn's disease is separated from other autoimmune diseases. This result is consistent with previous reports that inflammatory bowel disorders (IBDs) are distinct from other autoimmune disorders [29]. As in the simulated scenarios, the variance explained by each PC was similar (Figure 2b), and the results suggest that less than a hundred genes contribute to the similarity between each pair of datasets (Figure 3c-d).
To test the replicability of the results, we further divided each of the six datasets, for which we had the raw data, into two subsets consisting of the same or similar number of cases and controls (Materials and Methods). We then performed two disPCA analyses, one based on a randomly chosen subset of each of the six datasets, and another based on the remaining subset of each dataset. We found that both independent sets produced the same clustering of diseases ( Figure S8, S9). Loadings for 50 genes with the largest average loading across the two disPCA analyses of PC1 and PC2 were also significantly correlated across the two (r.0.44, p-value,1.2610 23 , Table S3). These results point to disPCA capturing some of the same pleiotropy in both cases, and further support the replicability of its results.
We applied disPCA to a final set of 31 datasets (Table S2), including autoimmune diseases, cancers, obesity-related diseases and traits, psychiatric disorders and neurological disorders. The first two PCs capture visually-interpretable separation of diseases. PC1 for the most part splits the two systemic lupus erythematosus (SLE) and the one dataset of celiac disease from all other datasets ( Figure 4). Independent of that separation, PC2 splits autoimmune diseases (in purple) from other diseases, and within autoimmune diseases, inflammatory bowel disorders (Crohn's disease and ulcerative colitis) are clustered together (Figures 4-5). Schizophrenia, major depressive disorder, cancers, T2D and neurological disorders lie on the negative end of PC2, while attention deficit hyperactivity disorder (ADHD), and some autoimmune diseases that are not well separated on this PC from other diseases, lie near the origin. PCs beyond the first two explain almost the same fraction of the variance (Figure 4b) and hence merit further investigation (see Discussion).
As disPCA teases out the important genes of shared and distinct pathogenetics across disease datasets, we next investigated which genes strongly contribute to each PC based on their absolute loadings. Specifically, we retrieved the genes with the top 50 absolute loadings for each of the top two PCs underlying Figure 4 and tested their disease enrichment (Materials and Methods). The top genes underlying the first PC were significantly enriched for genes associated with lupus and autoimmune related diseases, while genes underlying the second PC were mostly enriched for association to IBD (Table 1). These enrichment results are consistent with the separation of studies across each of these 2 PCs with PC1 mostly separating studies of SLE and celiac diseases, and PC2 mostly separating studies of IBD from other diseases. The results were largely unchanged following filtering genes that were within 0.1 cM of each other to account for linkage disequilibrium and for similar genes being co-located to each other, such as gene families (Table 1) (Materials and Methods).
Though the results of the disease enrichment analysis support that disPCA extracts biologically relevant signals, the arbitrary cutoff of the 50 top genes goes against the potential of PCs being linear combinations of all genes. We thus used GSEA [46], which supports analyzing a pre-ranked list of all genes, to perform pathway enrichment of each PC. GSEA assesses whether genes belonging to a certain pathway are non-randomly distributed in the list of pre-ranked genes. We ranked all genes by the absolute loading in the PC under study. Results of this pathway analysis revealed enrichment for immune related pathways on the first 2 PCs (Table 2) at an FDR of 0.25. The top two pathways enriched on PC1 were the antigen processing and presentation and the intestinal immune network IgA production pathways, which are crucial immune-related pathways. In particular, intestinal IgA antibodies may have a role in celiac disease [80] and inflammatory bowel disease [81,82]. On PC2, the most significant pathway was the NOD-like receptor signaling pathway. NOD-like receptors have been associated to Crohn's disease, while other immunerelated genes likely interacting with NOD2 have been associated to ulcerative colitis [83]. Other immune system pathways were enriched, including the Fc epsilon RI signaling pathway that is related to the antibody IgE, which induces inflammatory response [84]. Two enriched pathways are related to neurons (i.e. the neurotrophin signaling pathway and the Trk-A pathway). In particular, the neurotrophic factor BDNF (brain-derived neurotrophic factor), which is a part of the neurotrophin pathway, has been previously associated to Alzheimer's, Parkinson's disease and depression [85][86][87]. More recently, an intronic variant in this gene has also been associated to BMI [88]. The contribution of genes in these pathways to PC2 may explain the separation of neurological, psychiatric and BMI studies along that PC. As above, we reran GSEA after filtering genes that were within 0.1 cM of each other (Materials and Methods). The top two pathways on the first PC remained significant, while only the top pathway in PC2 remained significant (Table S4). This is likely due to the contribution to enrichment of several genes that are co-located, which should hence not necessarily be discounted.
Many autoimmune diseases share associations from the HLA region. We thus reran disPCA after removing all genes in and around the HLA region, and found a slightly different visual PCA map ( Figure 6). SLE and celiac disease were no longer distinguished from other autoimmune diseases and instead lay near the origin. PC1 now differentiated IBD from other diseases, and PC2 separated some autoimmune diseases from the rest on one extreme, and schizophrenia from the rest on the other. This was further supported by clustering results on the first two PCs ( Figure S10). A GSEA analysis of the PC loadings retained the NOD-like receptor signaling pathway on PC1 instead of PC2 ( Table 3). Analysis of PC2 loadings revealed additional immune  (Table S2)  related pathways that were not enriched in our previous analysis that included the HLA region.
Results such as PC1 in the main analysis clustering schizophrenia close to some autoimmune diseases ( Figure 4) prompted us to further explore the shared pathogenetics between diseases by testing for the non-random distribution of gene-based p-values in one disease based on their nominal significance in another disease (Materials and Methods). Generally, the results show that association statistics are non-randomly distributed when considering most pairs of autoimmune diseases, i.e. testing for non-random distribution in one autoimmune disease dataset based on significance in another autoimmune disease dataset (Figure 7). As a control, we tested for non-random distribution for a random set of genes and found that no disease pair was significant for non-random distribution ( Figure S11). Our results reported a similar story as observed via disPCA. Genes nominally significant in rheumatoid arthritis, type-1 diabetes and ankyolosing spondylitis were non-randomly distributed in SLE and vice versa. We also found that genes nominally significant for one schizophrenia study were non-randomly distributed in a number of autoimmune diseases (Figure 7). These signals remained even after genes within 0.1 cM of another gene were removed ( Figure S12) (Materials and Methods).

Discussion
In this study we introduced a new method, disPCA, to explore the shared pathogenetics of various diseases and disease classes based on GWAS data. PCA has been widely used in population and medical genetics. Applied to genome-wide genotyping data, it can recapitulate population structure such as revealing European geography [89], has been used as a tool to assess and correct for population stratification in GWAS [78,90] and has also been proposed as a tool for reducing the dimensionality of multiple phenotypes for association analysis [91]. Our disPCA method considers PCA on a different type of matrix, whereby different GWASs are studied in the space of all genes. It can group GWASs of different diseases together based on gene-level association statistics, while accounting for biases due to heterogeneity in sample size, association method, genotyping array and other confounders between studies. This implementation of PCA assigns weights to each gene and each PC in a manner that maximizes the variation between diseases. Hence, the higher the level of shared pathogenetics between diseases, the closer they will be in PC space. This is in contrast to methods that considered the correlation between diseases across all SNPs [29]. In fact, when we consider such correlations in our data, it is generally very low, even when considering it on the gene rather than on the SNP-level and even when the same disease is studied. For example, the correlation coefficient between the 2log10 p-values of the two Crohn's disease studies is 0.048, and it is 0.063 and 0.031 between ulcerative colitis and each of the two Crohn's disease studies. More generally, the highest correlation between pairs of datasets of the same disease was obtained for schizophrenia (0.13, p-value = 2.2610 216 ) while the lowest was obtained for type-2 diabetes (0.0031, p-val- ue = 0.73). These results show that there is less power when aggregating information across all genes and that disPCA is able to tease out and weigh the suitable set of genes underlying shared pathogenetics.
Though disPCA is designed to uncover shared disease etiology between diseases, other sources of correlation between datasets can also contribute to its results. Potential confounders include shared samples between datasets, technical artifacts, and population structure (if risk factors vary across ancestry). We accounted for technical artifacts introduced by the genotyping array, association method and sample size by regressing out variation in the data attributed to these sources (Materials and Methods). To minimize the impact of population structure and shared samples, we only applied disPCA to studies of individuals of European ancestry and datasets that had no overlapping case or control data. Though we cannot account for other potential confounders that are unknown, our results strongly suggest that the remaining correlation between studies represent shared disease etiology.
We applied disPCA to data from 31 GWASs that cover a range of diseases in four main classes: autoimmune diseases, cancers, neurological disorders and psychiatric disorders. We additionally analyzed GWASs on T2D, BMI and ischemic stroke. We first observed that different studies of the same diseases tend to lie closer together on the lead PCs ( Figure 2). This is in support of studies of the same disease replicating many of the same signals of associations when samples are of similar ancestry. We additionally find that disPCA positions diseases within the same class closer together (Figure 4). This was especially the case for the major types of IBDs (i.e. Crohn's disease and ulcerative colitis), which clustered close together ( Figure 5). This points to distinct etiology shared between IBDs, that is not shared between IBDs and most other autoimmune diseases. Indeed, it has recently been suggested that IBD is at least in part a primary immunodeficiency disorder [92,93]. Between the different disease classes, the main 2 PCs in disPCA found overlap between non-autoimmune diseases and traits, as well as pointed to a potential connection between schizophrenia and some autoimmune diseases.
Using the weightings of genes on each of the leading PCs, we performed disease and pathway enrichment analysis. We found that PC1, which mainly splits some autoimmune disorders from other autoimmune disorders, is significantly enriched for genes associated to immune and autoimmune disorders. PC2, which splits IBD studies from studies of other diseases, is significantly enriched for genes in some inflammatory related pathways and genes associated with IBD. Further results in PC2 highlighted neuron-related pathways that can be in line with evidence that Table 1. Disease enrichment analysis for disPCA (Figure 1).

PC
Disease P-value* P-value (distance pruned)* abnormal neurotrophins levels in the brain have been associated to schizophrenia [94,95]. Excluding the HLA region revealed significant enrichment for genes in other immune-related pathways. Though the specific analysis presented in this paper focused on the top two PCs, further PCs estimated by disPCA can be examined. For example, PC4 of disPCA on all GWASs distinguishes rheumatoid arthritis from other diseases ( Figure  S13). Pathway enrichment analysis highlighted the calcineurin pathway (FDR = 0.182), which is involved in T-cell activation. Additionally, though schizophrenia and vitiligo datasets are further apart on the first two PCs, each pair of datasets is clustered closer together on PC3 and PC4. Altogether, these results support the validity of the enrichment analysis based on disPCA. The analysis in turn also raises new hypotheses of disease etiology by pointing to additional pathways and enrichment for other diseases that were not previously observed. Prompted by the results of disPCA, we further explored shared pathogenetics by testing for the non-random distribution of association statistics between pairs of disease studies (Figure 7). Autoimmune diseases show non-random distribution of association statistics with one another. Interestingly, genes nominally associated with one of the schizophrenia studies were nonrandomly distributed in studies of several autoimmune diseases (i.e. ankyolosing spondylitis, systemic lupus erythematosus, and T1D), in support of the above disPCA results. Interestingly, this relationship was only observed for one of the two schizophrenia studies we analyzed, which may be due to a number of factors, including high number of risk factors for schizophrenia, with  different ones being associated in different studies. If indeed autoimmune diseases and schizophrenia share disease etiology, then just as one would not include individuals with ulcerative colitis as controls for a Crohn's disease GWAS since they both are IBDs, one should also be wary of including individuals with autoimmune disorders in a schizophrenia GWAS (and vice versa) as doing so may decrease power in loci implicated in both diseases. Lack of power due to such or other reasons might also underlie our lack of observation of significant shared etiology between the second schizophrenia dataset and autoimmune diseases. Finally, we make a few recommendations for future applications of disPCA to additional studies: (1) Biases can be introduced when studies share sample data; (2) As disPCA maximizes variance across diseases, genes that are implicated in all analyzed diseases will not contribute to the lead PC as they do not distinguish diseases from each other; (3) While here we only focused on using the strength of association and on gene-level signals, the method itself is highly flexible. One can further utilize the direction of association (protective versus deleterious), the heritability at each locus [96], an analysis at the pathway-level or in linkage disequilibrium blocks, include other non-genic functional elements, and/or environmental risk factors; (4) disPCA can be used to generate new hypotheses, which can then be tested by conducting more focused association studies in independent data or by using its output to better combine different diseases in an independent meta-analysis. New hypotheses can also be generated with regard to the genes that contribute to comorbidity between diseases. In conclusion, disPCA offers users a unique general overview of the disease landscape by studying their distinct and shared pathogenetics and flagging pathways and genes for further investigation. disPCA's flexibility and computational efficiency proves itself as an excellent tool to be applied to additional diseases and disease classes to further our knowledge of shared pathogenetics. Figure S1 Clustering dendrogram of datasets of the same diseases using physical distance mapping. SNPs were mapped to genes if they were within 10 kb of the gene. Clustering analysis of resulting disPCA revealed the same clusters as disPCA with genetic coordinates (Figure 3). (TIFF) Figure S2 Clustering dendrogram of datasets of the same diseases with the truncated product method. Similar to Figure 3, with the truncated product method used to combine SNP p-values per gene. (TIFF) Figure S3 Clustering dendrogram of datasets of the same diseases with truncated tail strength method. Similar to Figure 3, with the truncated tail strength method used to combine SNP p-values per gene. (TIFF) Figure S4 Simulated diseases with ten nominally significant genes. A) Similar to Figure 1 in main text with only ten Figure 7. Non-random distribution of genes for all analyzed datasets from Figure 4. Genes nominally significant for diseases on the y-axis were tested for non-random distribution in diseases on the x-axis (Materials and Methods), with 2log 10 presented on the color scale on the right. White entries denote p-values,1610 217 . The most significant results are for pairs of similar diseases and between pairs of autoimmune diseases. In addition, pairs between some autoimmune diseases and schizophrenia also display significant results. doi:10.1371/journal.pcbi.1003820.g007 nominally significant genes for each set of pleiotropic diseases (Materials and Methods). Clustering of the diseases sets is not observed. B) Clustering dendrogram as similarly presented in Figure 1b. C) The portion of variance explained by each PC is displayed. D-E) The loadings for PC1 and PC2 are displayed. (TIFF) Figure S5 Simulated diseases with twenty nominally significant genes. A) Similar to Figure 1 with twenty nominally significant genes for each set of pleiotropic diseases. As in Figure  S2, diseases are not clearly clustering according to the sets though nominally significant genes are enriched for larger absolute loadings (Materials and Methods). B) Clustering dendrogram as similarly presented in Figure 1b Figure 5, with clustering analysis of distance between datasets based on the disPCA between all diseases and traits presented in Table S2 after removing the HLA and surrounding regions. (TIFF) Figure S11 Non-random distribution of randomly chosen genes. A random subset of genes were chosen to be tested for non-random distribution in diseases on the x-axis, with 2log 10 presented on the color scale on the right. White entries denote pvalues,1610 217 . (TIFF) Figure S12 Non-random distribution for distance pruned set of genes. Genes were filtered such that no two genes were within 0.1 cM of another. The remaining subset of genes was then tested for non-random distribution in diseases on the x-axis. The 2log 10 of the p-value is presented on the color scale and white entries denote p-values,1610 217 . Results are largely similar to the original without filtering of nearby genes. (TIFF) Figure S13 PC3 and PC4 of all diseases disPCA. Similar to Figure 4 with data being presented for PC3 and PC4. A) PC1 accounts for 4.18% of the variance, while PC2 accounts for 4.08%. PC1 clusters schizophrenia and vitiligo datasets together on the two extremes, while PC2 separates rheumatoid arthritis from other diseases and traits. B) The portion of variance explained by each PC is displayed. C) The weightings for genes on PC1 are displayed and ordered according to their weights. D) Similar to (C) where loadings are for PC2. (TIFF)

Supporting Information
Table S1 Comparison of loadings between disPCA with mapping based on physical or genetic coordinates. Loadings for the top 50 genes ranked by either a physical or genetic coordinates based disPCA were compared. 'Correlation' denotes the Pearson's correlation coefficient with its significance denoted in the 'p-value' column. Rows denoted by 'mean(PC1,PC2)' indicate the correlation between the 50 genes with the largest average loading of PC1 and PC2. (DOC)