Protein-Protein Interaction Analysis Highlights Additional Loci of Interest for Multiple Sclerosis

Genetic factors play an important role in determining the risk of multiple sclerosis (MS). The strongest genetic association in MS is located within the major histocompatibility complex (MHC) class II region, but more than 50 MS loci of modest effect located outside the MHC have now been identified. However, the relevant candidate genes that underlie these associations and their functions are largely unknown. We conducted a protein-protein interaction (PPI) analysis of gene products coded in loci recently reported to be MS associated at the genome-wide significance level and in loci suggestive of MS association. Our aim was to identify which suggestive regions are more likely to be truly associated, which genes are most implicated in the PPI network, and what their expression profiles are. SNPs from three recent independent association studies were divided into significant and suggestive depending on the strength of the statistical association. Using the Disease Association Protein-Protein Link Evaluator tool we found that the number of direct interactions among gene products was significantly higher than expected by chance when considering both significant regions alone (p<0.0002) and significant plus suggestive regions (p<0.007). The number of genes involved in the network more than expected by chance was 43. Of these, 23 were located within suggestive regions and many of them directly interacted with proteins coded within significant regions. These included genes such as SYK, IL-6, CSF2RB, FCRL3, EIF4EBP2 and CHST12. Using the gene portal BioGPS, we tested the expression of these genes in 24 different tissues and cell types and found the highest values among immune-related cells as compared to non-immune tissues (p<0.001). A gene ontology analysis confirmed the immune-related functions of these genes.
In conclusion, loci currently suggestive of MS association interact with, and have similar expression profiles and functions to, those significantly associated, suggesting that further common variants associated with MS remain to be found.


Introduction
Multiple sclerosis (MS) is the most common inflammatory disease of the central nervous system (CNS) affecting young adults [1]. It is widely acknowledged that genetic factors play an important role in determining the risk of MS [2]. Several epidemiological studies have demonstrated an increased frequency of MS among biological relatives of affected individuals [3,4]. Family-based and association studies have shown that the strongest genetic association in MS is located within the major histocompatibility complex (MHC) class II region [5]. In particular, the HLA-DRB1*1501 allele confers an approximate odds ratio of 3 [6]. However, during the last few years genome-wide association studies (GWAS) have identified many other MS-associated loci of modest effect located outside the MHC (now more than 50) [7][8][9][10][11].
Despite the recent advances in the understanding of the genetic architecture of MS, several questions remain to be answered. For example, due to stringent correction criteria many genetic variants fail to reach genome-wide significance but can still be considered as suggestive of genetic association. Furthermore, once a SNP is found to be associated with a particular disease, the relevant candidate gene (or genes) mediating the association is usually unknown.
Analysis of protein-protein interaction (PPI) networks is increasingly recognized as an important tool to characterize the underlying biology of genes associated with complex diseases, in particular immune-mediated ones [12,13]. It is logical to hypothesize that genes which are truly associated with the same trait will be involved in similar biological processes. For example, Rossin et al. found that proteins encoded in genomic regions associated with rheumatoid arthritis and Crohn's disease physically interact more than would be expected by chance and that the genes encoding these proteins are highly expressed in immune tissues [12]. Studying such PPIs can ultimately elucidate which suggestive regions are more likely to be truly associated and greatly aid the identification of the genes that mediate the GWAS findings.
We conducted a PPI analysis of gene products coded in loci recently reported to be MS associated or suggestive of MS association. Our aim was to identify which suggestive regions are more likely to be truly associated, which genes are most implicated in the MS PPI network, and what their expression profiles and functions are.

Methods
Three recent independent association studies were considered for our analysis [14][15][16]. In Sawcer et al. and Patsopoulos et al., SNPs were divided into significant and suggestive depending on the strength of the statistical association [14,15]. From Sawcer et al. we defined as suggestive those SNPs with p-values in the discovery phase of less than 1×10⁻⁴, and as significant those that either replicated previous GWAS findings or had a replication p<0.05 and a combined p<5×10⁻⁷ [14]. In Patsopoulos et al., significant SNPs were defined as either those with p<5×10⁻⁸ or those replicating previously identified associated SNPs. Suggestive SNPs were those with p-values between 5×10⁻⁸ and 1×10⁻⁶ [15]. We also included in the analyses the top 82 SNPs (with a log p value >4.91) from Wang et al. [16]. All SNPs from this study were considered as suggestive, because the study was not designed to meet currently accepted criteria for genome-wide significance. After removing duplicate SNPs, 67 significant and 133 suggestive SNPs were obtained.
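The thresholds above can be illustrated with a minimal sketch. The function names, the inclusive upper boundary at 1×10⁻⁶ and the rule that a "significant" label takes precedence when the same SNP appears in more than one study are assumptions made for illustration; the text does not state how such cases were resolved.

```python
from typing import Dict, Iterable, Optional, Tuple


def classify_patsopoulos(p_value: float, replicates_known_snp: bool = False) -> Optional[str]:
    """Label a SNP with the thresholds quoted for Patsopoulos et al."""
    if replicates_known_snp or p_value < 5e-8:
        return "significant"
    if p_value <= 1e-6:  # between 5e-8 and 1e-6 (inclusive bound is an assumption)
        return "suggestive"
    return None  # not retained


def merge_snps(records: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Remove duplicate SNPs across studies.

    Assumption: a SNP labelled 'significant' in any study stays significant.
    """
    merged: Dict[str, str] = {}
    for snp_id, label in records:
        if label is None:
            continue
        if merged.get(snp_id) != "significant":
            merged[snp_id] = label
    return merged


# Hypothetical SNP identifiers, for illustration only.
example = merge_snps([
    ("rsA", "suggestive"),
    ("rsA", "significant"),
    ("rsB", classify_patsopoulos(5e-7)),
    ("rsC", classify_patsopoulos(1e-3)),
])
```

In this toy input, rsA ends up significant, rsB suggestive, and rsC is dropped.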
Protein-protein interaction assessment was conducted using the Disease Association Protein-Protein Link Evaluator (DAPPLE) tool [12]. This bioinformatics tool investigates physical interactions among gene products encoded within given genomic regions by building a PPI network. Interactions are extracted from the InWeb database, which combines data from a variety of public PPI sources including MINT, BIND, IntAct and KEGG, and defines high-confidence interactions as those seen in multiple independent experiments. The region around a given SNP is extended to the genomic interval defined by SNPs in moderate linkage disequilibrium (r² ≥ 0.5) and then to the nearest recombination hotspots [12]. Connections can be direct (two proteins are physically linked to each other) or indirect (the interaction is mediated by a common interactor). The extent of the PPI network is assessed using the following parameters: the number of direct interactions between proteins from different loci, the mean associated protein direct and indirect connectivities (the mean number of distinct loci a protein is directly or indirectly connected to) and the mean common interactor connectivity (the average number of proteins in separate loci bound by common interactors) [12]. The non-randomness of the network and the significance of the interaction parameters are tested using a permutation method that compares the original network with thousands of networks created by randomly re-assigning the protein names while keeping the overall structure (size and number of interactions) of the original network. Genes that participate in the network more than expected by chance are defined as genes to prioritize (corrected p<0.05) [12]. Expression data were gathered from BioGPS, an online gene annotation database that reports individual gene expression levels for a number of human tissues and cell types [17].
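The permutation idea described above can be sketched on a toy network. This is not DAPPLE's implementation: the function names, the toy edges and the locus assignments are hypothetical, and only the count of direct interactions between loci is tested. Protein names are shuffled over a fixed set of edges, so the size and number of interactions of the network are unchanged.

```python
import random


def count_direct(edges, locus_of):
    """Count edges linking two associated proteins that lie in different loci."""
    return sum(
        1
        for a, b in edges
        if a in locus_of and b in locus_of and locus_of[a] != locus_of[b]
    )


def permutation_p(edges, locus_of, n_perm=1000, seed=0):
    """Empirical p-value for the observed number of direct interactions."""
    rng = random.Random(seed)
    nodes = sorted({name for edge in edges for name in edge})
    observed = count_direct(edges, locus_of)
    hits = 0
    for _ in range(n_perm):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(nodes, shuffled))  # re-assign protein names
        permuted = [(relabel[a], relabel[b]) for a, b in edges]
        if count_direct(permuted, locus_of) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p = 0.
    return observed, (hits + 1) / (n_perm + 1)


# Toy network: A, B and C are associated proteins from three different loci.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("X1", "X2"), ("X2", "X3"), ("X3", "X4"), ("X4", "X5"), ("X5", "X6")]
locus_of = {"A": "locus1", "B": "locus2", "C": "locus3"}
observed, p_value = permutation_p(edges, locus_of, n_perm=2000, seed=1)
```

The same scheme extends to the other network parameters by replacing `count_direct` with the statistic of interest.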
Analyses were performed using non-parametric tests (Kruskal-Wallis and Mann-Whitney tests). Gene ontology terms were investigated using The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7, an online tool that is able to identify the functional categories and biological processes which are most represented within a list of genes [18,19].

DAPPLE analysis of significant SNPs
Our first aim was to assess the extent of PPIs among genes located within genomic regions with definite association with MS susceptibility. We therefore submitted to DAPPLE the 67 SNPs with genome-wide significant association with MS risk. A total of 75 proteins participated in the direct network, with 104 direct interactions (expected direct interactions = 61, p<0.0002) (Table 1, Figure 1 and Table S1). The mean associated protein direct connectivity was 2.7 (expected = 1.7, p<0.0002). The mean associated protein indirect connectivity was 52.2 (expected = 43.8, p = 0.04) and the mean common interactor connectivity was 4.5 (expected = 3.9, p = 0.0002). The total number of genes implicated in the network was 215 (Table S1). The total number of genes that had more connections than expected by chance (genes to prioritize) was 22 and included previously reported putative candidate genes such as IL-12A, SOCS-1, CBLB, MALT-1, IL-22RA, MAPK-1 and IL-7R.

DAPPLE analysis of significant plus suggestive SNPs
When suggestive SNPs were included in the analysis, the number of proteins participating in the network increased from 75 to 189 and the number of direct interactions from 104 to 281 (expected direct interactions = 242, p<0.007) (Table 1, Figure 2 and Table S2). The mean associated protein direct connectivity was also higher than expected (observed = 2.9, expected = 2.4, p = 0.0008). The mean associated protein indirect connectivity was 93 (expected = 91, p = 0.34). The mean common interactor connectivity was 5.05 (expected = 4.8, p = 0.05). The total number of genes analyzed was 445 (Table S2), while the genes to prioritize numbered 43, of which 23 were located within suggestive regions. These included genes such as SYK, IL-6, CSF2RB, FCRL3, EIF4EBP2 and CHST12 (Table 2).

Tissue-specific expression and gene ontology terms of candidate genes
To further investigate the nature of our findings we assessed in which tissues these genes were most highly expressed. We used the gene portal BioGPS, which contains gene expression data on a variety of human tissues and cell types [17]. For our analysis we considered 10 immune cell types and 14 non-immune tissues. We submitted the full list of candidate genes (n = 43) obtained from the significant plus suggestive DAPPLE analysis and for each gene we obtained an expression value for every tissue or cell type tested. Because background characteristics differ between probe sets, a direct comparison of expression across different genes was not possible. Therefore, we standardized the expression values of each single gene across tissues and used the resulting z-values for all subsequent analyses. Figure 3 shows the standardized expression values in the 24 tissues and cell types tested. Expression appeared particularly high in whole blood as well as in most immune-related cell types (in particular B cells, plasmacytoid dendritic cells (pDCs), natural killer (NK) cells, and CD4+ and CD8+ T cells). An independent-samples Kruskal-Wallis test confirmed that gene expression was significantly different across tissues (p<0.001). When tissues were divided into immune and non-immune, expression was significantly different between the two groups (p<0.001) (Figure 4). When compared to average expression across tissues, candidate genes were significantly overexpressed in B-lymphoblasts, pDCs, monocytes, B cells, NK cells, CD4+ T cells (p<0.001), CD34+ hematopoietic cells (p = 0.001) and CD8+ T cells (p = 0.003). Expression patterns were similar for significantly and suggestively associated loci.
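The per-gene standardization can be sketched as follows. The gene names, tissues and expression values are invented for illustration, the use of the population standard deviation is an assumption, and the hand-written Mann-Whitney U statistic stands in for a statistics package (no p-value is computed here).

```python
from statistics import mean, pstdev


def zscore_by_gene(expr):
    """Standardize each gene across tissues: {gene: {tissue: raw}} -> {gene: {tissue: z}}."""
    out = {}
    for gene, by_tissue in expr.items():
        values = list(by_tissue.values())
        mu, sd = mean(values), pstdev(values)  # population SD is an assumption
        out[gene] = {t: (v - mu) / sd if sd > 0 else 0.0 for t, v in by_tissue.items()}
    return out


def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y (ties count one half)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Hypothetical raw expression values on very different scales per probe set.
expr = {
    "G1": {"B cells": 100.0, "NK cells": 80.0, "liver": 10.0, "heart": 12.0},
    "G2": {"B cells": 900.0, "NK cells": 1000.0, "liver": 200.0, "heart": 150.0},
}
immune = {"B cells", "NK cells"}

z = zscore_by_gene(expr)
immune_z = [v for g in z.values() for t, v in g.items() if t in immune]
other_z = [v for g in z.values() for t, v in g.items() if t not in immune]
u = mann_whitney_u(immune_z, other_z)
```

After standardization the two genes are on a common scale, so their values can be pooled into immune and non-immune groups even though the raw intensities differ by an order of magnitude.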
We further confirmed the immunological nature of these candidate genes using DAVID [18,19], a bioinformatics tool that identifies the biological processes in which a group of genes is involved. Candidate genes were significantly enriched for immune-related processes such as regulation of leukocyte activation (p = 3.10×10⁻⁸), regulation of T cell proliferation (p = 3.25×10⁻⁸), positive regulation of immune system processes (p = 7.7×10⁻⁷), regulation of protein kinase cascade (p = 5.46×10⁻⁴) and regulation of cytokine production (p = 0.001459) (see Table S3 for the full list). GO enrichment was similar for significantly and suggestively associated loci.
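The logic behind a term-enrichment p-value can be sketched with a plain hypergeometric tail. This is a simplified stand-in and not DAVID's own calculation (DAVID reports a modified Fisher exact score); the counts in the example are hypothetical and are not taken from Table S3.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 10 of 43 candidate genes carry a term that annotates
# 300 of 20,000 background genes.
p_term = hypergeom_enrichment_p(k=10, n=43, K=300, N=20000)
```

With these invented counts fewer than one annotated gene would be expected by chance, so observing ten gives a very small p-value.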

Discussion
We showed that gene products coded in loci strongly associated with MS risk substantially interact with each other. The numbers of both direct and indirect interactions were significantly higher than would be expected by chance alone. When the PPI analysis was extended to suggestive SNPs, we found an increased number of total proteins participating in the network and of direct interactions (Figures 1 and 2). The only parameter that did not reach significance was the number of indirect interactions. This finding could be explained by a lack of real MS association among several of the suggestive SNPs.
However, including suggestive SNPs in the PPI analysis increased the number of genes to prioritize from 22 to 43. Interestingly, more than half of these genes (n = 23) were located within suggestive regions and many of them directly interacted with proteins coded within significant regions (e.g. CSF2RB-CBLB, IL6-IL2RA, MAP3K14-NFKB1, SYK-STAT3; see Table S2). Taken together, the suggestive statistical evidence of genetic association and the functional evidence of protein-protein interaction support the hypothesis that these genes could play an important role in the pathogenesis of MS.
We validated our results by examining the tissue-specific expression of these candidate genes. Using the BioGPS database we were able to show that the suggestively associated genes identified by DAPPLE were predominantly and specifically expressed in immune cells as compared to other tissues. A gene ontology analysis also confirmed the immune-related functions of these genes. More generally, these findings provide additional support for the immunological nature of MS [20]. Notably, candidate gene expression was particularly high among CD8+ and CD4+ T cells, B cells, NK cells and pDCs. Interestingly, all these cell types have been implicated in the pathogenesis of MS.
Several immune-specific genes are located within MS suggestive regions. For example, a SNP located near the gene encoding the Spleen Tyrosine Kinase (SYK) was found to be suggestive of association in Sawcer et al. Notably, SYK was particularly highly expressed in B cells, DCs, monocytes, CD33+ myeloid cells and NK cells. This protein has a central role in adaptive immune receptor signalling through phosphorylation of the immunoreceptor tyrosine-based activation motifs (ITAMs) [21]. SYK-mediated ITAM phosphorylation activates signalling intermediates such as NF-κB, JNK and PYK2 that ultimately lead to lymphocyte activation [22]. ITAM signals mediated by SYK can also induce expansion of NK cells [23]. Interestingly, the SYK inhibitor R788 (fostamatinib) has beneficial effects in patients affected by rheumatoid arthritis when compared to placebo [24].
CSF2RB is another gene particularly highly expressed in B cells, DCs, monocytes, CD33+ myeloid cells and NK cells. It codes for the β-subunit (βc) of the granulocyte-macrophage colony-stimulating factor (GM-CSF), IL-3 and IL-5 receptors, which are expressed by peripheral leucocytes and blood DCs [25]. This gene appears to play an important role in allergic inflammation [26]. Interestingly, associations between CSF2RB and schizophrenia [27] and bipolar disorder [28] have recently been found.
EIF4EBP2 encodes the Eukaryotic Translation Initiation Factor 4E Binding Protein 2. The members of this family of proteins (4EBPs) can inhibit translation initiation by binding eIF4E [29]. 4EBPs regulate cell proliferation through interaction with the mTORC1 pathway [30]. In addition, EIF4EBP1 knock-out mice showed type I IFN overproduction in pDCs [31]. We found over-expression of EIF4EBP2 in pDCs, CD4 cells, CD8 cells and NK cells.
CHST12 encodes the carbohydrate (chondroitin 4-O) sulfotransferase 2, a protein located in the membrane of the Golgi apparatus which is implicated in chondroitin and dermatan sulphate (DS) synthesis in different tissues [32]. DS proteoglycans participate in various biological events such as extracellular matrix assembly, cell adhesion, migration and proliferation [33]. We found high expression of CHST12 in pDCs, CD4 cells, CD8 cells and NK cells.
To conclude, a number of proteins coded by genes located within MS-associated genomic regions are implicated in the same PPI networks. The extent of this interaction substantially increases when genomic regions with suggestive evidence of association are included in the analysis. This suggests that at least some of these suggestive GWAS hits represent truly associated loci, and thus that further common variants associated with MS remain to be found. Finally, we further confirmed the immunological nature of MS and showed that a single cell type cannot explain the complexity of this disease. Future functional studies should investigate how and in which cell types the suggestive candidate genes are acting. This will improve our knowledge of this complex disease and hopefully provide future strategies for disease prevention and treatment.

Author Contributions
Conceived and designed the experiments: GR GD SVR. Performed the experiments: GR GD. Analyzed the data: GR GD SVR. Contributed reagents/materials/analysis tools: GR GD SVR. Wrote the paper: GR. Critical revision of the manuscript for important intellectual content: GCE GG SS SVR.