Genome-Wide Association Study of Multiple Sclerosis Confirms a Novel Locus at 5p13.1

Multiple Sclerosis (MS) is the most common progressive and disabling neurological condition affecting young adults in the world today. From a genetic point of view, MS is a complex disorder resulting from the combination of genetic and non-genetic factors. We aimed to identify previously unidentified loci conducting a new GWAS of Multiple Sclerosis (MS) in a sample of 296 MS cases and 801 controls from the Spanish population. Meta-analysis of our data in combination with previous GWAS was done. A total of 17 GWAS-significant SNPs, corresponding to three different loci were identified:HLA, IL2RA, and 5p13.1. All three have been previously reported as GWAS-significant. We confirmed our observation in 5p13.1 for rs9292777 using two additional independent Spanish samples to make a total of 4912 MS cases and 7498 controls (ORpooled = 0.84; 95%CI: 0.80–0.89; p = 1.36×10-9). This SNP differs from the one reported within this locus in a recent GWAS. Although it is unclear whether both signals are tapping the same genetic association, it seems clear that this locus plays an important role in the pathogenesis of MS.


Introduction
Multiple Sclerosis (MS) is the most common progressive and disabling neurological condition affecting young adults in the world today. The overall prevalence of MS is about 70 per 100,000 individuals (range from 2 to 150) [1]. Pathogenetically, MS is considered an inflammatory and autoimmune disease with secondary demyelination during the first years of disease progress, which often presents clinically with relapsing/remitting symptoms. After this phase, axonal loss occurs contributing to a secondary progressive course of the disease [2].
From a genetic point of view, MS is a complex disorder resulting from the combination of genetic and non-genetic factors [3]. Human leukocyte antigen (HLA) at 6p21 has long been recognized as the strongest locus increasing risk to MS in most populations [4]. More recently, at least two independent signals within the HLA chromosomal region have been reported [5]. Other genetic factors involved in MS have remained elusive until the arrival of Genome-Wide Association studies (GWAS) which facilitated the identification of new susceptibility loci (The MSGene Database. http://www.msgene.org/).
We aimed to identify previously unidentified loci conducting a new GWAS of Multiple Sclerosis (MS) in a sample of 296 MS cases and 801 controls from the Spanish population. Meta-analysis of our data in combination with previous GWAS was also planned. ( Figure S1)

Results
We analysed this new Spanish GWAS (named as the Macarena dataset) independently, but also in combination with the IMSGC/ WTCCC and the Gene MSA datasets, looking for consistent genetic factors associated in all datasets. Briefly, we genotyped 296 MS patients and 801 controls from Spain using the Affymetrix 250K Nsp I array as previously described [6]. After an extensive quality control, we performed principal components analysis (PCA) [7] and selected for analyses 286 MS cases and 767 controls described below. We also analyzed a dataset (named hereafter IMSGC/WT) based on 931 MS cases from the International Multiple Sclerosis Genetic Consortium (IMSGC) [8] and 2938 controls from the Wellcome Trust Case Control Consortium (WTCCC), who had been genotyped using Affymetrix 500K array [9]. After extensive quality control and PCA pruning, similar to what we applied to the Macarena dataset, a total 886 cases and 2933 controls were considered for downstream analyses. Furthermore, the Gene MSA study [10] was incorporated as a third study group. Given that the Gene MSA series had been genotyped using the Sentrix HumanHap550 BeadChip platform from Illumina, we imputed ,2.5 million markers from HapMap2 to this dataset using PLINK [11]. Genotypes on associated SNPs were also corroborated with a second imputation methodology (Mach v 1.0) [12]. Effective sample size after PCA and genotype quality control for Gene MSA was 955 MS cases and 858 controls.
We then performed a meta-analysis of these three GWAS datasets on the 130,903 shared SNPs ( Figure 1 and Figure 2), and selected markers associated with a p-value below 0.005 in the random effects model for further analyses. While this threshold is much less stringent than the common GWAS threshold or that of a Bonferroni correction, we thought to promote a wider list of SNPs as in a two-stage analysis given the availability of an ''insilico'' replication. A total of 1,064 SNPs meeting this criterion were indeed analyzed ''in silico'' in a recent GWAS performed on 882 MS cases and 872 controls from the Sardinian population [13]. After integrating these data in a new meta-analysis, we found that 17 markers reached the bonferroni significance of 3.82610-7 ( Table 1). Most of these markers (15 of them) were located within the HLA region, and the remaining two markers were located within the IL2RA gene (rs12722489, p = 2.16610-7) and the 5p13.1 regulatory region near PTGER4 (rs9292777, p = 9.84610-9), respectively. Both HLA and IL2RA are well established MS risk loci (www.msgene.org). PTGER4 has been previously found to be associated with Crohn's disease [9,14,15,16], reported as ''suggestive signal'' in a meta-analysis of MS GWAS [17], and also reported as genome-wide significant for MS in a recent GWAS by the International Multiple Sclerosis Genetics Consortium/Well-come Trust Case Control Consortium. The marker reported in this study (rs4613763) [18] is about 45 kb from our marker, with a D9 of 1 and r 2 of 0.13, so they may not be necessarily tapping the same genetic association. Indeed a conditional analysis shows independent effects of rs9292777 in two of the three GWAS datasets (rs4613763 could not be imputed in Macarena dataset). Thus, after conditioning on rs4613763 the p-value of rs9292777 went from the original 0.0003 to 0.0039 in IMSGC/WTCCC and from 0.0019 to 0.0012 in Gene MSA. The corresponding p-values of rs4613763 before and after conditioning on rs9292777 in the same two GWAS went from 0.00186 to 0.064 and from 0.826 to 0.335 respectively.
To further validate this marker we considered two independent samples from the Spanish population, namely the HCSC and the IPBLN samples. They included 646 and 1257 MS patients, and 746 and 1322 controls respectively.. These healthy controls were matched to cases by age, gender, ethnicity and place of recruitment [19]. The selected SNP rs9292777 was imputed in the HCSC sample and genotyped in the IPBLN sample, using Taqman technology as described elsewhere. The respective ORs for rs9292777 were 0.92 (95% CI: 0.79-1.07, p = 0.29) and 0.87 (95% CI: 0.78-0.98, p = 0.02).
All available evidence from the GWAS datasets and the validation samples were combined in a final meta-analysis that included a total of 4912 MS cases and 7498 controls. Overall, the C allele of rs9292777 was associated with an OR of 0.84 (95%CI: 0.80-0.89) corresponding to a p-value of 1.36610-9, below the 5610-8 Genome-wide significant cutoff. (Figure 3) To further determine if the MS associated SNPs at the 5p13.1 locus could have a functional effect on the transcription of the genes located in the region, we assessed eQTLs in the chr5: 39885067-40883829 (NCBI37/hg19) locus by spearman rank correlation test for association between SNP genotypes of the region and mRNA levels of 5 genes in the region (TTC33, PRKAA1, PTGER4, CARD6, RPL37). The expression of the genes was assayed by Illumina Human-6 v2 Expression BeadChip with RNA samples from 109 lymphoblastoid cell lines from Utah residents with Northern and Western European ancestry from the CEPH collection (HapMap CEU population). Since the CEU samples were sequenced by the 1000 Genomes project, we downloaded the variant information from the 1000 Genomes database. 2593 SNPs and 671 insertion/deletion variants with a MAF.0.05 were correlated with the expression of the 5 genes. Significant thresholds for each population were assigned through 1000 permutations. At the 0.05 permutation threshold, we observed 2 SNPs that were significantly associated with PTGER4 expression (rs7725052 and rs7714574) (Table S1). These 2 SNPs are in almost total linkage disequilibrium (LD)and are located 193-187 Kbp from the PTGR4 gene. Other eQTLs in the region reported by other authors were also explored. Several SNPs located in position chr5:39,904,511-40,111,728 (NCBI37/hg19) have been reported to be eQTLs for PTGER4 by Zeller et al [20]. These eQTLs are within 605 to 811 Kbp from the PTGR4 gene and are different SNPs than the eQTLs found by us in the CEU samples. Expression QTLs for the RPL37 gene in the region chr5:40,723,806-40,790,598 (NCBI37/hg19) were also reported by Stranger et al. [21] and by Dixon et al. [22] as shown in Figure 4. Since none of the MS associated SNPs were eQTLs for any of the genes in the region ( Figure S2), we analyzed their LD with the associated variants. There is no LD between the distal PTGER4 eQTLs, the RPL37 eQTLs and the SNPs reported associated with Crohn by GWAS or the variants associated with MS ( Figure S3). However, the PTGER4 eQTLs obtained in the CEU population were in LD with the rs9292777 (r 2 = 0.23). A     lower LD was observed for the rs4613763 (r 2 = 0.14) variant. None of the PTGER4 proximal eQTLs were genotyped in any of the GWAS analyzed in the present work. We could impute data for rs7725052 in the Macarena and GeneMSA datasets. The ORs were 0.8403 and 0.8968 respectively.

Discussion
We have analysed a total 130,903 markers using data from three GWAS and conducted several replications to detect a new marker associated with MS. While the failure to use imputation methods to increase the number of markers to be assessed might have restricted the power of our study (i.e. novel signals of markers that were not genotyped in the original GWAS would be missed), by no means it invalidates our findings. Our main finding, rs9292777 SNP, is located within an intergenic region on 5p13.1 [23]. As mentioned, this chromosomal region has been unequivocally associated (i.e. meta-GWAS significance, p,5e-8) with two autoimmune bowel diseases i.e. Crohn's disease and ulcerative colitis [9,14,15,16]. Furthermore, a recent meta-GWAS presented preliminary evidence that this 5p13.1 locus could be associated to MS [17]. Specifically, the authors observed a signal for MS located at rs6896969 (p = 2610-7). According to the 1000 genomes project data, and employing the SNAP software, rs6896969 and rs9292777 are 13522 bp away and can be considered reciprocal proxies (r 2 = 0.901, D9 = 1) [24]. Therefore our results extend previous observations and confirm unequivocally the protective role of rs9292777, located on the 5p13.1 gene desert, for MS phenotype. eQTL analysis of the 5p13.1 region revealed that some SNPs in LD with the MS associated variant rs9292777 correlated with the expression levels of the PTGER4 gene. This expression association has been previously suggested by Lebouille et al. who observed that polymorphisms within this gene desert could modulate the expression levels of PTGER4 gene, located 270 kb centromeric to rs9292777 [14]. PTGER4 encodes the prostaglandin receptor EP4 which is a strong candidate susceptibility gene for inflammatory bowel diseases. In fact, ptger4 knock-out mice develop severe colitis upon dextran sodium sulphate treatment [14]. Most importantly, recent experiments suggested that ptger4 knock-out mice are protected against experimental autoimmune encephalomyelitis (EAE) signs compared with control mice [25]. Notably, EAE has proven to be a valuable model in elucidating pathogenesis as well as identifying candidate therapies for MS [25]. So, combining genetic and molecular data, PTGER4 pharmacological modulation appears as an excellent strategy for new drug development in human autoimmune diseases including MS.

Samples
In the present study we have analyzed three independent genomewide datasets: Macarena, IMSGC/WT and GeneMSA datasets and three validation collections: Sardinian, HCSC and IPBLN samples.
The Macarena dataset consisted of a genomewide genotyping effort (Affymetrix GeneChip Human Mapping 250K Nsp Array) including 293 definite MS cases and 801 controls, all natives of Spain [6]. MS cases were recruited during routine hospital visits to the MS unit at Virgen Macarena Universitary Hospital and diagnosed according to Poser and MacDonald criteria [26,27]. Oligoclonal IgG intrathecal secrection in CSF and/or at least 3 out of 4 MRI Barkhof criteria was also considered necessary for inclusion in the study. All participants gave written informed consent under Macarena Hospital Institutional Review Board. DNA extraction from blood, PCR amplification, and genotyping was performed at Neocodex (Seville, Spain).
The second dataset (IMSGC/WT) consisted of MS cases from the International Multiple Sclerosis Genetics Consortium (IMSGC) and controls from the Wellcome Trust case-control Consortium (WTCCC). We employed the NINDS dbGaP dataset that includes genomewide genotypes (Affymetrix GeneChip Human Mapping 500K Array Set) for 931 individuals with MS from the IMSGC [8], all drawn from the UK and the USA. These MS cases were matched to a control dataset (n = 2938) from the WTCCC, deposited in the European Genotype Archive (EGA) and genotyped with the same chip [9]. All WTCCC controls are .99% of European origin living in the UK.
Although these downloaded datasets had already undergone quality control (QC) and stratification filters by the original authors, we performed an extra layer of QC analyses on them (described below).
In order to validate the most significant results (top 1064 SNPs with p-value,0,005) of a meta-analysis of the 3 GWAS datasets, we obtained aggregated frequency data and performed a first insilico replication in an independent dataset of 882 MS patients and 872 controls from Sardinia, Italy [13].
The top novel SNP associated with MS was genotyped or imputed in two independent validation samples from Spain. The first sample (HCSC) includes 557 Spanish MS patients and 799 ethnically matched controls, mostly blood donors and staff, who were consecutively recruited from the Hospital Clínico San Carlos (Madrid). Patients were diagnosed based on the Poser criteria and included in the study after informed consent. Clinical features in the MS cohort from Madrid (35% men and 65% women): mean age at onset of 29 years and age of the patients ranging from 16 to 79 years, 36% of them were carriers of the major susceptibility factor for MS (HLA-DRB1*1501). The mean Expanded Disability Status Scale (EDSS) was 3.5 and approximately 10% of the patients had father, mother or a sibling with MS. The clinical forms distributed as follows: 7% Primary Progressive, 10% Secondary Progressive and 75% Relapsing-Remitting. The Ethics Committee of this Hospital approved the study.
The second sample (Instituto de Parasitología y Biomedicina Lopez-Neyra, IPBLN) comprises 1282 MS cases and 1322 controls from 4 hospitals and a blood bank of Andalucía, specifically, Hospital Virgen Macarena of Sevilla, Hospital Carlos Haya of Málaga, Hospital Clínico Virgen de las Nieves of Granada and the Blood Bank of Granada. The mean age at sample collection of the cases was 29.84 (10.66 sd) years and mean age of controls at interview was 33.43 (12.19 sd) years. The percentage of females was 68% for cases and 68% for controls. All MS cases were classified as RR (relapsing-remitting) or SP (secondary progressive). All patients were ascertained to have definite MS according to the Poser or MacDonald criteria [26,27]. The study was approved by the Ethics Committees of each of the hospitals participating in the study and written informed consent was obtained from all participants. Genotyping was performed using the TaqmanH SNP Genotyping Assay (Applied Biosystems) C_30119807_10 (rs9292777) Quality Control (QC) analyses. We performed an extensive quality control on these datasets, using Affymetrix Genotyping Console software (http://www.affymetrix.com) and Plink [11]. Only individuals with a sample call rate above 93% were later re-called with the Bayesian Robust Linear Model with Malalanobis (BRLMM) distance algorithm, ran with default parameters, which improves call rates in most samples. Self-reported sex was compared to sex assigned by chromosome X genotypes, and discrepancies were resolved or samples removed. The program Graphical Representation of Relationships (GRR) [29] was used to check sample relatedness and to correct potential sample mixups, duplications, or contaminations. SNPs were selected to have a call rate above 95% (in each case, control, and combined group, within each dataset), and a minor allele frequency above 1% (again in each case, control, and combined group, within each dataset). SNPs that deviated grossly from Hardy-Weinberg equilibrium (p-value,10 24 ) in control samples were also removed. We also removed SNPs with a significantly different rate of missingness (p-value,5610 24 ) between case and control samples within each dataset.
To ensure all SNPs across all datasets were typed according to the same DNA strand, each dataset was normalized using HapMap as the reference set. We merged each study with the HapMap CEU samples and compared genotype calls. SNP calls were flipped (if typed on opposite strand) or removed (if strand could not be undoubtedly assigned) as necessary. We also removed SNPs that were significantly associated with ''study status''. That is, we labeled individuals from each study as cases and HapMap CEU individuals as controls, and removed SNPs with p-values smaller than 10-6 in a test for association.
Principal Components (PC) analysis. Principal components analysis was carried out with EIGENSOFT [7,30], to evaluate population admixture within each population, and to identify individuals as outliers. We ran the SMARTPCA program with default parameters, excluding chromosome X markers. To minimize the effect of LD in the analysis, we also excluded markers in high LD (with the indep-pairwise option in Plink) and long-range LD regions reported previously or detected in our population. Individuals identified as outliers (6 standard deviations or more along one of the top ten principal components) were removed from all subsequent analyses. PC analysis was run within each dataset, and also together with other HapMap European and worldwide populations to detect individuals of different ethnicities. Overall, we removed 7 cases and 34 controls from the Macarena study, 45 cases and 5 controls from the IMSGC/WTCCC dataset, and 25 cases and 42 controls from the GeneMSA project.

Imputation
Because two of the three datasets applied Affymetrix genotyping technology, we imputed the Affymetrix SNPs on all the participants in the GeneMSA study. Using the original Illumina genotypes for the GeneMSA participants, and HapMap reference genotypes (60 unrelated CEU samples), we imputed all of the Affymetrix 500K SNPs using two different methodologies: Plink [11] and MaCH [31]. Genotype calls with high quality scores (info.0.8 in Plink and r 2 .0.3 in MaCH) and consistently called with both methodologies were used in subsequent association analyses.
After all these quality control and preparatory steps, the Macarena study kept 195,035 SNPs for 286 cases and 767 controls; the IMSGC/WTCCC dataset kept 312,869 SNPs for 886 cases and 2933 controls; and the GeneMSA dataset contained 2,649,462 SNPs in 955 cases and 858 controls. The three datasets had 130,903 SNPs in common, which were then used in association analyses. Additionally, rs9292777 was imputed in the HCSC sample from surrounding SNPs available on this sample using the same methodology described above.
Association Analysis. Single-locus allelic (1 df) association analysis within each independent GWAS sample, and of the combined sample, was carried out using Plink [11]. The genomic inflation factor was also estimated, with Plink, for each dataset (Macarena:1.13; GenMSA:1.02; IMSGC/WT:1.16). We also carried out a meta-analysis of all three datasets using Plink. The top 1064 most significant SNPs (p,0.005) resulting from this meta-analysis were selected for replication using aggregated frequency data from an Italian sample, and performed a metaanalysis of these SNPs in the four populations. The resulting novel SNP (rs9292777) was imputed and genotyped respectively in two independent samples ascertained from the Spanish population. A final meta-analysis and Forest plot for this marker including estimates from the GWAS datasets, plus the validation samples was done with Stata 10.0 (College Station, TX) metan command.

Expression association analysis and multiple test correction
We used Spearman's correlation to test for associations in cis between SNP genotypes and probe expression levels in each population. This method has been previously shown to produce robust results and avoids the effect of outliers in gene expression values [32]. We analyzed the SNPs from the 1000 Genomes project included between chr5: 39885067-40883829 (Build 37). Significance thresholds for each gene were assigned after 1000 permutations of expression values relative to genotypes. Computation of the Spearman correlation test has been performed with Genetranssoc (http://bios.ugr.es/Genetranassoc/), a c++ software with an analogous implementation of the Spearman coefficient in the statistical package R which also computes statistical significance using permutation tests.Reported eQTL were assessed by http://eqtl.uchicago.edu/cgi-bin/gbrowse/eqtl/. Figure S1 Samples used in the different analyses. This flow-chart describes the samples used in the different analyses. It starts with three initial GWAS which after extensive quality control yielded a total of 2,127 cases and 4,558 controls. Metaanalysis using 130,903 SNPs common to these samples was done and SNPs with a p value below 0.001 in this analysis were requested from the Sardinian GWAS. A new meta-analysis was done adding these new data. Only previously unreported SNPs with a p-value below 3.82610-7 in this analysis were chosen for a final validation. Only one marker (rs929777) met the criteria and was analysed in a final meta-analysis that included the previous four samples and two new validation samples (HSCS and IPBLN).