A “Candidate-Interactome” Aggregate Analysis of Genome-Wide Association Data in Multiple Sclerosis

Though difficult, the study of gene-environment interactions in multifactorial diseases is crucial for interpreting the relevance of non-heritable factors and prevents from overlooking genetic associations with small but measurable effects. We propose a “candidate interactome” (i.e. a group of genes whose products are known to physically interact with environmental factors that may be relevant for disease pathogenesis) analysis of genome-wide association data in multiple sclerosis. We looked for statistical enrichment of associations among interactomes that, at the current state of knowledge, may be representative of gene-environment interactions of potential, uncertain or unlikely relevance for multiple sclerosis pathogenesis: Epstein-Barr virus, human immunodeficiency virus, hepatitis B virus, hepatitis C virus, cytomegalovirus, HHV8-Kaposi sarcoma, H1N1-influenza, JC virus, human innate immunity interactome for type I interferon, autoimmune regulator, vitamin D receptor, aryl hydrocarbon receptor and a panel of proteins targeted by 70 innate immune-modulating viral open reading frames from 30 viral species. Interactomes were either obtained from the literature or were manually curated. The P values of all single nucleotide polymorphism mapping to a given interactome were obtained from the last genome-wide association study of the International Multiple Sclerosis Genetics Consortium & the Wellcome Trust Case Control Consortium, 2. The interaction between genotype and Epstein Barr virus emerges as relevant for multiple sclerosis etiology. However, in line with recent data on the coexistence of common and unique strategies used by viruses to perturb the human molecular system, also other viruses have a similar potential, though probably less relevant in epidemiological terms.


Introduction
As in other multifactorial diseases, genome-wide association studies (GWAS) are providing important data about diseaseassociated loci in multiple sclerosis (MS) [1]. In parallel, seroepidemiological studies are reinforcing the evidence that nonheritable factors such as Epstein-Barr virus (EBV) and vitamin D are associated with disease pathogenesis [2].
However, the effect size of the gene variants identified so far in MS appears small. It is therefore important (but difficult: Sawcer and Wason, 2012) [3] to establish if and in which cases (including those gene variants with small but measurable effect size that do not reach the significance threshold of GWAS) the interaction with nonheritable factors may help understand their true impact on disease pathogenesis [4]. Furthermore, as far as the seroepidemiological associations are concerned, their causal relevance and underlying pathogenetic mechanisms become clearer if interpreted in the light of genetic data.
As an attempt to consider, beyond the statistical paradigms of GWAS analysis, which gene-environment interactions may associate with the development of MS, we performed an interrogation of GWAS data [1] through a ''candidate interactome'' approach, investigating statistical enrichment of associations in genes whose products ''interact'' with putative environmental risk factors in MS.
We elected to center the analysis on viral interactomes, based on the classical hypothesis of a viral etiology of MS. Importantly, we examined only direct interactions between viral and human proteins as it has recently been shown that these are the interactions that are more likely to be of primary importance for the phenotypic impact of a virus in ''virally implicated diseases'' [5]. The chosen interactomes reflect the compromise between informative size and potential relevance for MS. In detail, EBV was chosen as main association to be verified against phylogenetically related or unrelated viruses. Given the profound influence of EBV on the immune response, and the preponderance of (auto)immune-mediated mechanisms in the pathogenesis of the disease, we added two interactomes of immunological relevance, human innate immunity interactome for type I interferon (hu-IFN) and autoimmune regulator (AIRE). Finally, we included the vitamin D receptor (VDR) and the aryl hydrocarbon receptor (AHR) interactomes to evaluate, on the same grounds, also part of the molecular interactions that compose other established or emerging ''environmental'' associations.
As reference to gather gene and single nucleotide polymorphism (SNP) details from their HUGO Gene Nomenclature Committee (HGNC) Ids and rsids, we employed a local copy of the Ensembl Human databases (version 66, databases core and variation, including SNPs coming from the 1000 Genome project); the annotation adopted for the whole analysis was GRCh37-p6, that includes the release 6 patches (Genome Reference Consortium: human assembly data -GRCh37.p6 -Genome Assembly. http: // www.ncbi.nlm.nih.gov/genome/assembly/304538/).
The genotypic p-values of association for each tested SNP were obtained from the International Multiple Sclerosis Genetics Consortium & Wellcome Trust Case Control Consortium,2 study. All SNPs which did not pass quality checks in the International Multiple Sclerosis Genetics Consortium & the Wellcome Trust Case Control Consortium,2 study were filtered out from the original data. We used ALIGATOR [14,15] to evaluate how single genes get summed to provide total contribution of candidate interactomes (Table S1). The idea behind ALIGATOR's strategy is to evaluate gene category significance by means of an empirical approach, comparing each interactome with the null hypothesis, built using random permutations of the data. Such method begins its analyses by evaluating the Gene Ontology (GO) category association in each interactome provided: (i) each SNP with a pvalue stronger than the P-CUT parameter is associated to the gene within 20 kb; then the most representative SNP for each gene is selected; (ii) LD filter of SNPs that have an r2#0.2 and those that are farther than 1000 kb; (iii) count the number of genes significant in each GO category. This is the real observed data.
A non parametric bootstrap approach was used to generate a null hypothesis as follows: (i) build 5000 random interactomes (of the same size of the one under analysis, this procedure is repeated for each interactome); (ii) obtain category-specific p-values by comparing each random interactome with the remaining 4999 built; (iii) elect one of the interactomes in (i) as simulated observed data; (iv) randomly sample interactomes in (i) to generate categoryspecific p-values; (v) repeat (iv) to simulate 1000 simulated studies. The GO category association distribution in the real observed data is then compared with the null hypothesis: (i) generate an expected number of significant genes in each category, using the simulated studies; (ii) compare the number of significant categories in the real observed data with (i). ALIGATOR parameters that we used are those of its reference paper [14]. p-value cut-off was taken at 0.05, only the SNPs with marginal p-value less than this cut-off were employed (p-value cut-offs were also taken at 0.005 and 0.03 for the re-analysis of interactomes that resulted associated at 0.05, see results). Furthermore, to limit the uncertainties introduced by combined SNP effects in the MHC extended region (that is the haplotype set with the strongest signal in our analysis), we computed two different statistical evaluations for each interactome, one including and the other one excluding SNPs coming from such region (we considered as belonging to the extended MHC region all those SNPs that participate in at least one of the following haplotypes: HSCHR6_MHC_APD, HSCHR6_MHC_COX, HSCHR6_MHC_DBB, HSCHR6_MHC_MANN, HSCHR6_MHC_MCF, HSCHR6_MHC_QBL, HSCHR6_MHC_SSTO according to GRC data). In both cases we used Ensembl API [16] and BioPerl [17] (version 1.2.3) to gather all SNP information, haplotype participation, genes position and size [18]; such annotated information was then fed into ALIGATOR together with the interactomes.
Ingenuity Pathway Analysis (IPA) was employed twice: (i) before the ALIGATOR statistics, to characterize the composition of our interactomes (Table S2), and (ii) on the genes with nominally significant evidence of association [1] that ALIGATOR took as representative of each interactome-SNP relation (Table S3). In both cases we performed the IPA-''core analysis'', and we restricted the settings to show only molecular and functional associations. Afterwards, we used IPA-''comparative analysis'' to produce the p-value of association between each functional class and all our interactomes. IPA knowledge base (ie, the input data used by IPA) was set to the following criteria in every analyses: consider only molecules and/or relationships where the species in object was human (or it was a chemical), and the datum was experimentally observed. Since IPA-''comparative analysis'' provides p-value ranges associated to functional classes, we took as reference the value used by IPA to fill its reports, namely the best p-value for that class.

Results
We performed a ''candidate interactome'' (i.e. a group of genes whose products are known to physically interact with environmental factors that may be relevant for disease pathogenesis) analysis of genome-wide association data in multiple sclerosis.
We obtained 13 interactomes, 7 from the literature (as such) and 6 by manually selecting those interactions that were reported by two independent sources or were confirmed by the same source with distinct experimental approaches. In all cases we considered only physical-direct interactions (Table S2,Table 1).
Preliminarily to the enrichment of association analysis, we used IPA to obtain a sense of the cellular signaling pathways that are targeted by each interactome. A classification for molecular and cellular functions showed a comparable distribution of components in most interactomes except for VDR, HBV, VIRORF and hu-IFN where a relative enrichment of some functional pathways (cell signaling, cellular growth and proliferation, cellular development, cell cycle, cell death and survival, protein synthesis, RNA post-transcriptional modification, gene expression) was present (Figure 1).
We investigated statistical enrichment of associations within each one of the above interactomes (Table 1). The analyses were performed with and without considering SNPs falling in the MHC extended region. In both cases the interactomes of EBV, HIV and HBV reached significance. To verify the sensitivity of our results with respect to a choice (SNPs p-value cut-off at 0.05) that is not obvious based on the literature published so far, we evaluated different cut-offs (p,0.005 and p,0.03) on the three interactomes that were MS-associated at p,0.05. These analyses supported the consistency of the results (Table S4).
We then performed the same IPA classification as in Figure 1  (Figure 2) on the MS-associated genes within the EBV, HIV and HBV interactomes (Table S3). The aim was to verify whether the associations emerging from the three interactomes implied new and MS-specific perturbations and whether these perturbations are virus-specific or shared by the three pathogens. The comparison between pre-and post-match distribution of the functional classes ( Figure 3) showed that the MS-associated interactomes did not reflect a clear cut involvement of specific pathways though, in the case of EBV, an enrichment of some biological functions (cellular function and maintenance, cell morphology, cellular assembly and organization, energy production) was present. On the other hand the most frequent changes for HBV and HIV could be in accord with the post-match reduction of the interactome sizes.

Discussion
Of the 13 interactomes, 3 show a statistical enrichment of associations. In line with the epidemiological and immunological literature, the EBV interactome is among these. The lack of significant associations with the hu-IFN and AIRE interactomes suggests, though does not exclude, that the result is not an effect of the immunological connotation of the EBV interactome. The absence of associations with the interactomes of phylogenetically related viruses (CMV and HHV8, both herpesviruses with the latter that shares the same site of latency as EBV and belongs to the same subfamily of gamma-herpesviridae) reinforces the specificity of the EBV result. The fact that a portion of the genetic predisposition to MS may be attributable to variants in genes that interact with EBV may be complementary to another our finding showing that EBV genomic variants significantly associate with MS (unpublished data): the two results suggest a model of genetic jigsaw puzzle, whereby both host and virus polymorphisms affect MS susceptibility and, through complex epistatic interactions, eventually lead to disease development.
The associations with the HBV and HIV interactomes were unexpected. Overall, epidemiological data do not support a role of these viruses in the pathogenesis of MS though some controversy still holds concerning the safety of HBV vaccination [19][20][21][22][23]. Interestingly, Gregory et al. (2012) [24] demonstrated that in the TNFRSF1A gene, which is part of the HBV interactome, the MSassociated variant directs increased expression of a soluble tumor necrosis factor receptor 1.
Concerning HIV, the lack of epidemiological association seems more established. However, demyelination is a feature of HIV encephalomyelopathy [25] and cases of difficult differential diagnoses or association between the two conditions are described in the literature [26,27]. All this considered, it might not be surprising that some molecular interactions that take place between HIV and host may predispose to demyelination. Other viruses, sharing homology with HIV may possess better paraphernalia and be more prone to cause MS. The HERV-W family has long been associated with MS [28] and HERV-W/Env, whose expression is associated with MS [29], is able to complement an env-defective HIV strain [30] suggesting a certain degree of functional kinship.
Apart from any conjectures about the data on HBV and HIV interactomes, it remains true, as recently demonstrated by Pichlmair and colleagues (2012) [12], that viruses use unique but also common strategies to perturb the human molecular network. Our pathway analyses do not suggest, in fact, any specific cellular signaling target for the three viruses in MS, perhaps with some exceptions as far as the EBV interactome is concerned. Though preliminary, this acquisition may be in accord with the largely accepted view that, alongside the risk associated with EBV infection, there can be a more general risk of developing MS linked to a variety of other infections [31,32].  The VDR interactome does not show significant enrichment of associations. The result does by no means diminishes the importance of the epidemiological association between vitamin D and MS: its causal relevance is already supported by data that are starting to explain the molecular basis of this association, upstream [33][34][35]1] and downstream the interactions between the VDR and its protein cofactors [36,37].
Current approaches for gene set analysis are in their early stage of development and there are still potential sources of bias or discrepancy among different methods, including those used in our study. As the reproducibility of the techniques increases, and new facilities [38] and methods become available to identify interactions that still escape detection, new lists will become available for matching with GWAS data. In parallel, also the assessment of human genetic variation will become more comprehensive [39]. Hence, the ''candidate interactome'' approach may become an increasingly meaningful strategy to interpret genetic data in the light of acquisitions from epidemiology and pathophysiology. Notably, this approach appears to be complementary to other studies, which look for statistical enrichment of associations in an unbiased way, and may disclose unexpected pathways in MS susceptibility [40].  (Table S3) computed by ALIGATOR (Association LIst Go AnnoTatOR) first flow process. These p-values were obtained through a Comparative Core-Analysis in IPA (Ingenuity Pathway Analysis). The functional components identified at the molecular and cellular level are presented row-wise; the interactome sub-sets are presented column-wise. Each cell in position (i,j) contains a number that represents in 2log notation the strength of the association between the functional class i and the interactome; this information is also color-matched with a color gradient that moves from white (2log[p] = 0.0, p = 1) to crimson (2log[p] = 14, p,10 214 ). Two hierarchical cluster analyses were employed to group functional classes that share similar patterns of associations across all interactome sub-sets (left-side clustering), and to group interactome sub-sets that share similar functional compositions (topchart clustering). doi:10.1371/journal.pone.0063300.g002 At present, our results support a causal role of the interaction between EBV and the products of MS-associated gene variants. Other viruses may be involved, through common and unique mechanisms of molecular perturbation.

Supporting Information
Table S1 ALIGATOR settings.