Genome-Wide Association Study Identifies Novel Restless Legs Syndrome Susceptibility Loci on 2p14 and 16q12.1

Restless legs syndrome (RLS) is a sensorimotor disorder with an age-dependent prevalence of up to 10% in the general population above 65 years of age. Affected individuals suffer from uncomfortable sensations and an urge to move in the lower limbs that occurs mainly in resting situations during the evening or at night. Moving the legs or walking leads to an improvement of symptoms. Concomitantly, patients report sleep disturbances with consequences such as reduced daytime functioning. We conducted a genome-wide association study (GWA) for RLS in 922 cases and 1,526 controls (using 301,406 SNPs) followed by a replication of 76 candidate SNPs in 3,935 cases and 5,754 controls, all of European ancestry. Herein, we identified six RLS susceptibility loci of genome-wide significance, two of them novel: an intergenic region on chromosome 2p14 (rs6747972, P = 9.03 × 10⁻¹¹, OR = 1.23) and a locus on 16q12.1 (rs3104767, P = 9.4 × 10⁻¹⁹, OR = 1.35) in a linkage disequilibrium block of 140 kb containing the 5′-end of TOX3 and the adjacent non-coding RNA BC034767.


Introduction
Restless legs syndrome (RLS) is a common neurological disorder with a prevalence of up to 10 %, which increases with age [1]. Affected individuals suffer from an urge to move due to uncomfortable sensations in the lower limbs present in the evening or at night. The symptoms occur during rest and relaxation, with walking or moving the extremity leading to prompt relief. Consequently, initiation and maintenance of sleep become defective [1]. RLS has been associated with iron deficiency, and is pharmacologically responsive to dopaminergic substitution. Increased cardiovascular events, depression, and anxiety count among the known co-morbidities [1].
Genome-wide association studies (GWAs) identified genetic risk factors within MEIS1, BTBD9, PTPRD, and a locus encompassing MAP2K5 and SKOR1 [2][3][4]. To identify additional RLS susceptibility loci, we undertook an enlarged GWA in a German case-control population, followed by replication in independent case-control samples originating from Europe, the United States of America, and Canada. In doing so, we identified six RLS susceptibility loci with genome-wide significance in the joint analysis, two of them novel: an intergenic region on chromosome 2p14 and a locus on 16q12.1 in close proximity to TOX3 and the adjacent non-coding RNA BC034767.

Results/Discussion
We enlarged our previously reported [2,4] GWA sample to 954 German RLS cases and 1,814 German population-based controls from the KORA-S3/F3 survey and genotyped them on Affymetrix 5.0 (cases) and 6.0 (controls) arrays. To correct for population stratification, as a first step, we performed a multidimensional scaling (MDS) analysis, leading to the exclusion of 18 controls as outliers. In a second step, we conducted a variance components analysis to identify any residual substructure in the remaining samples, resulting in an inflation factor λ of 1.025 (Figures S1 and S2). The first four axes of variation from the MDS analysis were included as covariates in the association analysis of the genome-wide stage and all P-values were corrected for the observed λ.
Prior to statistical analysis, genotyping data was subjected to extensive quality control. We excluded a total of 302 DNA samples due to a genotyping call rate <98%. For individual SNP quality control, we adopted a stringent protocol in order to account for the complexity of an analysis combining 5.0 and 6.0 arrays. We excluded SNPs with a minor allele frequency (MAF) <5%, a call rate <98%, or a significant deviation from Hardy-Weinberg Equilibrium (HWE) in controls (P < 0.00001). In addition, we dropped SNPs likely to be false-positive associations due to differential clustering between 5.0 and 6.0 arrays by adding a second set of cases of an unrelated phenotype and discarding SNPs showing association in this setup (see Materials and Methods). Finally, we tested 301,406 SNPs for association in 922 cases and 1,526 controls. Based on a threshold level of a nominal λ-corrected P_GWA < 10⁻⁴, a total of 47 SNPs distributed over 26 loci were selected for follow-up in the replication study (Figure 1, Table S1).
We genotyped these 47 SNPs together with 29 adjacent SNPs in strong linkage disequilibrium (LD, r² = 0.5-0.9) using the Sequenom iPLEX platform in seven case-control populations of European descent, comprising a total of 3,935 cases and 5,754 controls. Eleven SNPs with a call rate <95%, MAF <5%, and P < 0.00001 for deviation from HWE in controls as well as 432 samples with a genotyping call rate <90% were excluded. A set of 47 SNPs, genotyped in 186 samples on both platforms (Affymetrix and Sequenom), was used to calculate an average concordance rate of 99.24%.
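The cross-platform concordance rate mentioned above can be illustrated with a minimal Python sketch. It assumes that concordance is the fraction of genotype pairs that agree among pairs called on both platforms, and that allele order within a genotype is irrelevant; the function name `concordance_rate` and the toy genotypes are hypothetical and are not taken from the study data.

```python
from typing import Optional, Sequence


def concordance_rate(calls_a: Sequence[Optional[str]],
                     calls_b: Sequence[Optional[str]]) -> float:
    """Fraction of genotype pairs that agree, skipping pairs with a missing call."""
    if len(calls_a) != len(calls_b):
        raise ValueError("call lists must have equal length")
    compared = 0
    matching = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # no call on at least one platform
        compared += 1
        # "AG" and "GA" are treated as the same genotype (assumption).
        if sorted(a) == sorted(b):
            matching += 1
    if compared == 0:
        raise ValueError("no genotype pairs with calls on both platforms")
    return matching / compared


# Toy example: one sample-SNP pair is discordant, one has a missing call.
affymetrix = ["AG", "GG", "AA", None, "AG"]
sequenom = ["GA", "GG", "AG", "AA", "AG"]
print(f"{concordance_rate(affymetrix, sequenom):.2%}")
```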
The combined analysis of all replication samples confirmed the known four susceptibility loci and, in addition, identified two novel association signals on chromosomes 2p14 and 16q12.1 (Table 1). To address possible population stratification within the combined replication sample, we performed a fixed-effects meta-analysis. For four of the replication case-control populations, we included λ inflation factors which were available from a genomic controls experiment in a previous study in these populations [4]. These were used to correct the estimates for the standard error. Joint analysis of GWA and all replication samples showed genome-wide significance for these two novel loci as well as for the known RLS loci in MEIS1, BTBD9, PTPRD, and MAP2K5/SKOR1 with a nominal λ-corrected P_JOINT < 5 × 10⁻⁸ (Table 1). Depending on the variable power to detect the effects, the separate analyses of individual subsamples in the replication either confirmed the association after correction for multiple testing or yielded nominally significant results (Tables S2 and S3). The differing relevance of the risk loci in the individual samples is illustrated in forest plots (Figure 2). There was no evidence of epistasis between any of the six risk loci (P_Bonferroni > 0.45).
The association signal on 2p14 (rs6747972: nominal λ-corrected P_JOINT = 9.03 × 10⁻¹¹, odds ratio (OR) = 1.23) is located in an LD block of 120 kb within an intergenic region 1.3 Mb downstream of MEIS1 (Figure 3). Assuming a long-range regulatory function of the SNP-containing region, in silico analysis for clusters of highly conserved non-coding elements using the ANCORA browser (http://ancora.genereg.net) identified MEIS1 as well as ETAA1 as potential target genes [5,6].
The second locus on chromosome 16q12.1 (rs3104767: nominal λ-corrected P_JOINT = 9.4 × 10⁻¹⁹, OR = 1.35) is located within an LD block of 140 kb (Figure 3), which contains the 5′ UTR of TOX3 (synonyms TNRC9 and CAGF9) and the non-coding RNA BC034767 (synonym LOC643714). TOX3 is a member of the high mobility box group family of non-histone chromatin proteins which interacts with CREB and CBP and plays a critical role in mediating calcium-dependent transcription in neurons [7]. GWAs have identified susceptibility variants for breast cancer in the identical region [8]. The best-associated breast cancer SNP, rs3803662, is in low LD (r² < 0.1, HapMap CEU data) with rs3104767, but showed association to RLS (λ-corrected nominal P_GWA = 7.29 × 10⁻⁷). However, logistic regression analysis conditioned on rs3104767 demonstrated that this association is dependent on rs3104767 (rs3803662: P_GWA/conditioned = 0.2883).
BC034767 is represented in GenBank by two identical mRNA transcripts, BC034767 and BC029912. According to the gene model information of the UCSC and Ensembl genome browsers (http://genome.ucsc.edu and http://www.ensembl.org/index.html), these mRNAs are predicted to be non-coding. Additional in silico analysis using the Coding Potential Calculator (http://cpc.cbi.pku.edu.cn) supported this by attributing only a weak coding potential to this RNA, suggesting a regulatory function instead [9]. We also searched for rare alleles with strong effects and performed a mutation screening by sequencing all coding and non-coding exons of TOX3 and BC034767 in 188 German RLS cases (Table S4). In TOX3, a total of nine variants not listed in dbSNP (Build 130) were found, three of which are non-synonymous. Only one of these is also annotated in the 1000 Genomes project (November 2010 data release). Three additional new variants were located in putative exons 1 and 2 of BC034767. Analysis of the frequency of these variants as well as all known non-synonymous, frameshift, and splice-site coding SNPs in TOX3 in a subset of one of the replication samples (726 cases and 735 controls from the GER1 sample) did not reveal any association to RLS. For a power of >80%, however, variants with an OR above 4.5 and a MAF ≥ 0.01 would be required. For even lower MAFs, ORs ≥ 10 would be necessary for sufficient power. Furthermore, the described CAG repeat within exon 7 of TOX3 was not polymorphic as shown by fragment analysis in 100 population-based controls.

Author Summary
Restless legs syndrome (RLS) is one of the most common neurological disorders. Patients with RLS suffer from an urge to move the legs and unpleasant sensations located mostly deep in the calf. Symptoms mainly occur in resting situations in the evening or at night. As a consequence, initiation and maintenance of sleep become defective. Here, we performed a genome-wide association study to identify common genetic variants increasing the risk for disease. The genome-wide phase included 922 cases and 1,526 controls, and candidate SNPs were replicated in 3,935 cases and 5,754 controls, all of European ancestry. We identified two new RLS-associated loci: an intergenic region on chromosome 2p14 and a locus on 16q12.1 in a linkage disequilibrium block containing the 5′-end of TOX3 and the adjacent non-coding RNA BC034767. TOX3 has been implicated in the development of breast cancer. The physiologic role of TOX3 and BC034767 in the central nervous system and a possible involvement of these two genes in RLS pathogenesis remain to be established.

According to publicly available expression data (http://genome.ucsc.edu), in humans, BC034767 is expressed in the testes only, while TOX3 expression has been shown in the salivary glands, the trachea, and in the CNS. Detailed in-depth real time PCR profiling of TOX3 showed high expression levels in the frontal and occipital cortex, the cerebellum, and the retina [10]. To assess a putative eQTL function of rs6747972 or rs3104767, we studied the SNP-genotype-dependent expression of TOX3 and BC034767 as well as of genes known to directly interact with TOX3 (CREB-1/CREBBP/CITED1) and potential target genes of long-range regulatory elements at the locus on chromosome 2 (MEIS1/ETAA1) in RNA expression microarray data from peripheral blood in 323 general population controls [11]. No differential genotype-dependent expression variation was found.
To assess the potential for genetic risk prediction, we split our GWA sample into a training and a test set and determined classifiers for case-control status in the training set to predict case-control status in the test set. Training and test set were independent of each other, not only with respect to included individuals but also with respect to the genotyping procedure, as we used genotypes generated on different genotyping platforms. As training set, we used those cases of the current GWA which had been genotyped on 500K arrays in a previous GWA and the corresponding control set [2], in total, 326 cases and 1,498 controls. The test set comprised 583 cases and 1,526 controls, genotyped on 5.0/6.0 arrays as part of the current study. Prior to the analysis, we removed the six known risk loci and performed LD-pruning to limit the analysis to SNPs not in LD with each other. In the end, a total of 76,532 SNPs were included in the pruned dataset. We conducted logistic regression with age and sex as covariates. Based on these association results, the sum score of SNPs showing the most significant effects (i.e. the number of risk alleles over all SNPs) weighted by the ln(OR) of these effects was chosen as predictor variable in the test set. We then varied the P-value threshold for SNPs included in the sum score. For a P-value <0.6, we observed a maximum area under the curve (AUC) of 63.9% and an explained genetic variance of 6.6% (Nagelkerke's R²), values comparable to estimates obtained for other complex diseases such as breast cancer or diabetes (Table S5) [12][13][14]. Inclusion of the six known risk loci in this analysis resulted in a maximum AUC of 64.2% and an explained genetic variance of 6.8%.
Additionally, we performed risk prediction in the combined GWA and replication sample including only the six established RLS risk loci. For this purpose, we used the weighted risk allele score resulting in ORs of up to 8.6 (95% CI: 2.46-46.25) and an AUC of 65.1% (Figures S3 and S4).
By increasing the size of our discovery sample, we have identified two new RLS susceptibility loci. The top six loci show effect sizes between 1.22 and 1.77 and risk allele frequencies between 19 and 82%, and reveal genes in neuronal transcription pathways not previously suspected to be involved in the disorder.

Study population and phenotype assessment
Ethics statement. Written informed consent was obtained from each participant in the respective language. The study has   (Finland (FIN), n = 141). In all patients, diagnosis was based upon the diagnostic criteria of the International RLS Study Group [1] as assessed in a personal interview conducted by an RLS expert. A positive family history was based on the report of at least one additional family member affected by RLS. We excluded patients with secondary RLS due to uremia, dialysis, or anemia due to iron deficiency. The presence of secondary RLS was determined by clinical interview, physical and neurological examination, blood chemistry, and nerve conduction studies whenever deemed clinically necessary.
In addition, 1,104 participants (GER2) of the "Course of RLS (COR-) Study", a prospective cohort study on the natural course of disease in members of the German RLS patient organizations, were included as an additional replication sample. After providing informed consent, study participants sent their blood for DNA extraction to the Institute of Human Genetics, Munich, Germany. A limited validation of the RLS diagnosis among the majority of members was achieved through a diagnostic questionnaire. Five percent had also received a standardized physical examination and interview in one of the specialized RLS centers in Germany prior to recruitment. To avoid doublets, we checked these subjects against those recruited through other German RLS centers and excluded samples with identical birth date and sex. 556 cases (US) were recruited in the United States at Departments of Neurology at Universities in Baltimore, Miami, Houston, and Palo Alto. Diagnosis of RLS was made as mentioned above.
285 cases (CA) were recruited and diagnosed as above in Montréal, Canada. All subjects were exclusively of French-Canadian ancestry as defined by having four grandparents of French-Canadian origin.
Detailed demographic data of all samples are provided in Table S6.
Control populations (GWA and replication phase). Controls for German and Austrian cases were of European descent and recruited from the KORA S3/F3 and S4 surveys, general population-based controls from southern Germany. KORA procedures and samples have been described [15]. For the GWA phase, we included 1,814 subjects from S3/F3, and, for the replication stage, 1,471 subjects from S4.
For replication of the GER2 sample, we used controls from the Dortmund Health Study (DHS), a population-based survey conducted in the city of Dortmund with the aim of determining the prevalence of chronic diseases and their risk factors in the general population. Sampling for the study was done randomly from the city's population register stratified by five-year age group and gender [16]. 597 subjects selected at random from the Czech blood and bone marrow donor registry served as Czech controls [17]. French controls included 768 parents of multiple sclerosis patients recruited from the French Group of Multiple Sclerosis Genetics Study (REFGENSEP) [18]. Finnish controls comprised 360 participants of the National FINRISK Study, a cross-sectional population survey on coronary risk factors collected every five years. The current study contains individuals recruited in 2002. Detailed description of the FINRISK cohorts can be found at www.nationalbiobanks.fi.
French-Canadian controls were 285 unrelated individuals recruited at the same hospital as the cases.
1,200 participants of the Wisconsin Sleep Cohort (WSC), an ongoing longitudinal study on the causes, consequences, and natural course of disease of sleep disorders, functioned as US controls [19].
None of the controls were phenotyped for RLS. All studies were approved by the institutional review boards in Germany, Austria, Czech Republic, France, Finland, the US, and Canada. Written informed consent was obtained from each participant. Detailed demographic data of all samples are provided in Table S6.

Genotyping
GWA. Genotyping was performed on Affymetrix Genome-Wide Human SNP Arrays 5.0 (cases) and 6.0 (controls) following the manufacturer's protocol. The case sample included 628 cases from previous GWAs [2,4] and 326 new cases. After genotype calling using the BRLMM-P clustering algorithm [20], a total of 475,976 overlapping SNPs on both Affymetrix arrays were subjected to quality control. We added 655 cases of a different phenotype unrelated to RLS, genotyped on 5.0 arrays, to the analysis and excluded those SNPs which showed a significant difference of allele frequencies in cases (RLS and unrelated phenotype on 5.0) and controls (6.0) (n = 92). Thereby, we filtered out SNPs likely to be false-positive associations. We excluded SNPs with a minor allele frequency (MAF) <5% (n = 88,582), a call rate <98% (n = 65,906) or a significant deviation from Hardy-Weinberg Equilibrium (HWE) in controls (P < 0.00001) (n = 20,060). Cluster plots of the GWA genotyping data for the best-associated SNPs in Table 1 are shown in Figure S5. Genotypes of these SNPs are available in Table S7.
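The per-SNP filters described above (MAF, call rate, HWE in controls) can be sketched in Python as follows. This is only an illustration under stated assumptions: the HWE P-value is approximated with a 1-df chi-square goodness-of-fit test rather than whatever test was actually used, MAF and call rate are computed over all supplied genotype counts, and the function names `hwe_chi2_p` and `passes_qc` as well as the toy counts are hypothetical.

```python
import math
from typing import Tuple

Counts = Tuple[int, int, int]  # genotype counts (AA, AB, BB)


def hwe_chi2_p(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Approximate HWE P-value from a 1-df chi-square goodness-of-fit test."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a 1-df chi-square distribution.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(all_counts: Counts, control_counts: Counts, n_missing: int,
              maf_min: float = 0.05, call_min: float = 0.98,
              hwe_p_min: float = 1e-5) -> bool:
    """Apply the MAF, call-rate, and HWE (controls only) thresholds to one SNP."""
    n_aa, n_ab, n_bb = all_counts
    n_called = n_aa + n_ab + n_bb
    call_rate = n_called / (n_called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1.0 - freq_a)
    return (maf >= maf_min
            and call_rate >= call_min
            and hwe_chi2_p(*control_counts) >= hwe_p_min)


# Toy SNP in perfect Hardy-Weinberg proportions with a high call rate.
print(passes_qc((600, 1200, 600), (400, 800, 400), n_missing=10))
```

Applying all three thresholds in a single pass is a design convenience of the sketch; the per-filter counts reported in the text suggest the filters were tallied separately.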
Replication. We selected all SNPs with a λ-corrected P_nominal < 10⁻⁴ in the GWA for replication. These SNPs clustered in 26 loci (defined as the best associated SNP ±150 kb of flanking sequence). We genotyped a total of three SNPs in each of the 26 regions. These were either further associated neighbouring SNPs with a λ-corrected P_nominal < 10⁻³ or, in case of singleton SNPs, additional neighbouring SNPs from HapMap with the highest possible r² (at least >0.5) with the best-associated SNP. We also genotyped the best-associated SNPs identified in the previous GWAs [2,4].
Genotyping was performed on the MassARRAY system using MALDI-TOF mass spectrometry with the iPLEX Gold chemistry (Sequenom Inc, San Diego, CA, USA). Primers were designed using AssayDesign 3.1.2.2 with iPLEX Gold default parameters. Automated genotype calling was done with SpectroTYPER 3.4. Genotype clustering was visually checked by an experienced evaluator. SNPs with a call rate <95%, MAF <5%, and P < 0.00001 for deviations from HWE in controls were excluded. DNA samples with a call rate <90% were also excluded.

Population stratification analysis
GWA. To identify and correct for population stratification, we performed an MDS analysis as implemented in PLINK 1.07 (http://pngu.mgh.harvard.edu/~purcell/plink, [21]) on the IBS matrix of our discovery sample. After excluding outliers by plotting the main axes of variation against each other, we performed logistic regression with age, sex, and the values of the MDS components as covariates. Using the Genomic Control approach [22], we obtained an inflation factor λ of 1.11.
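As an illustration of the Genomic Control step, the following Python sketch estimates an inflation factor from a list of association P-values. It assumes the common median-based estimator (median observed 1-df chi-square statistic divided by the median of the 1-df chi-square distribution, about 0.455); whether the study used exactly this estimator is not stated here, and the function names are hypothetical.

```python
from statistics import NormalDist, median
from typing import Iterable

_STD_NORMAL = NormalDist()
# Median of a 1-df chi-square distribution (approximately 0.4549).
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p: float) -> float:
    """Convert a two-sided P-value into the corresponding 1-df chi-square statistic."""
    if not 0.0 < p <= 1.0:
        raise ValueError("P-value must lie in (0, 1]")
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2


def inflation_factor(p_values: Iterable[float]) -> float:
    """Median-based genomic inflation factor (assumed estimator)."""
    return median(p_to_chi2(p) for p in p_values) / _CHI2_1DF_MEDIAN


# Toy data: evenly spread P-values behave like a null distribution.
null_like = [i / 10 for i in range(1, 10)]
print(round(inflation_factor(null_like), 3))
```

A value close to 1 indicates little inflation of the test statistics; values above 1 point to residual stratification or relatedness.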
Additionally, we performed a variance components analysis using the EMMAX software (http://genetics.cs.ucla.edu/emmax, [23]) and, again, calculated the inflation factor with Genomic Control, now resulting in a λ of 1.025. EMMAX uses a mixed linear model and corrects not only for population stratification but also for hidden relatedness. We, therefore, decided to base correction for population substructure on the EMMAX results.
Replication. Correction for population stratification was performed for the German, Czech, and the Canadian subsamples. The λ-values of 1.1032, 1.2286, and 1.2637 were derived from a previous Genomic Control experiment within the same samples using 176 intergenic or intronic SNPs [4]. Here, we had applied the expanded Genomic Control method GCF developed by Devlin and Roeder [24]. In the meta-analysis of all replication samples, the λ-corrected standard errors were included for the German, Czech, and Canadian samples. For the other replication samples from France, Finland, and the USA, no such data were available and, therefore, no correction factor was included in the analysis.

Statistical analysis
Statistical analysis was performed using PLINK 1.07 (http://pngu.mgh.harvard.edu/~purcell/plink, [21]). In the GWA sample, we applied logistic regression with age, sex, and the first four axes of variation resulting from an MDS analysis as covariates.
P-values were λ-corrected with the λ of 1.025 from the EMMAX analysis. In the individual analysis of the single replication samples, we tested for association using logistic regression and correcting for gender and age as well as for population stratification where possible (see Population Stratification). Each replication sample was Bonferroni-corrected using the number of SNPs which passed quality control for the respective sample.
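The λ-correction of a P-value can be sketched as follows, assuming the standard Genomic Control procedure: the P-value is converted to a 1-df chi-square statistic, divided by λ, and converted back. The function name `gc_correct_p` is hypothetical, and the choice to leave statistics untouched when λ is below 1 is a common convention rather than something stated in the text.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def gc_correct_p(p: float, lam: float) -> float:
    """Return the lambda-corrected two-sided P-value (1-df test assumed)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("P-value must lie in (0, 1]")
    lam = max(lam, 1.0)  # convention (assumption): never deflate the statistic
    chi2 = _STD_NORMAL.inv_cdf(p / 2.0) ** 2
    chi2_corrected = chi2 / lam
    # Survival function of a 1-df chi-square distribution.
    return math.erfc(math.sqrt(chi2_corrected / 2.0))


# With lambda = 1.025, a nominal P of 1e-4 becomes slightly less significant.
print(gc_correct_p(1e-4, 1.025))
```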
For the combined analysis of all replication samples, we performed a fixed-effects inverse-variance meta-analysis. Where available, we used λ-corrected standard errors in this analysis. Bonferroni-correction was performed for 74 SNPs, i.e. the number of SNPs which passed quality control in at least one replication sample.
For the joint analysis of the GWA and the replication samples, we also used a fixed-effects inverse-variance meta-analysis and again included λ-corrected values as far as possible. For the conditioned analysis, the SNP to be conditioned on was included as an additional covariate in the logistic regression analysis as implemented in PLINK.
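A fixed-effects inverse-variance meta-analysis with optional λ-corrected standard errors can be sketched in Python as below. The sketch assumes that each study contributes an effect estimate β = ln(OR) with its standard error, and that λ-correction multiplies the standard error by √λ (the usual Genomic Control adjustment); the function name and the example numbers are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def fixed_effects_meta(betas: Sequence[float], ses: Sequence[float],
                       lambdas: Optional[Sequence[float]] = None
                       ) -> Tuple[float, float, float]:
    """Return pooled beta, pooled standard error, and two-sided P-value."""
    if lambdas is None:
        lambdas = [1.0] * len(betas)  # no correction where lambda is unavailable
    if not (len(betas) == len(ses) == len(lambdas)) or not betas:
        raise ValueError("inputs must be non-empty and of equal length")
    weights = []
    for se, lam in zip(ses, lambdas):
        adjusted_se = se * math.sqrt(max(lam, 1.0))  # assumed form of correction
        weights.append(1.0 / adjusted_se ** 2)
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value


# Hypothetical per-study ln(OR) estimates; lambda is known for the first two only.
beta, se, p = fixed_effects_meta([0.30, 0.25, 0.35], [0.08, 0.10, 0.12],
                                 lambdas=[1.10, 1.23, 1.0])
print(f"pooled OR = {math.exp(beta):.2f}, SE = {se:.3f}, P = {p:.2e}")
```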
Interaction analysis was performed using the --epistasis option in PLINK. Significance was determined via Bonferroni-correction (i.e. 0.05/28, as 28 SNP combinations were tested for interaction).

Power calculation
Power calculation was performed using the CaTS power calculator [25] using a prevalence set of 0.08 and an additive genetic model (Table S3). The significance level was set at 0.05/74 for replication stage analysis and at 0.05/301,406 for genome-wide significance in the joint analysis of GWA and replication. For the rare variants association study, the significance level was set at 0.05/12.
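The Bonferroni-corrected significance levels quoted in the Methods can be reproduced with a few lines of Python. The test counts (74, 301,406, 12, and 28) are taken from the text; the helper name `bonferroni_threshold` and the dictionary labels are merely illustrative.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("number of tests must be at least 1")
    return alpha / n_tests


thresholds = {
    "replication stage (74 SNPs)": bonferroni_threshold(74),
    "joint analysis (301,406 SNPs)": bonferroni_threshold(301_406),
    "rare variants (12 tests)": bonferroni_threshold(12),
    "interaction (28 SNP combinations)": bonferroni_threshold(28),
}

for label, value in thresholds.items():
    print(f"{label}: {value:.3g}")
```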

Mutation screening of TOX3 and BC034767
All coding and non-coding exons including adjacent splice sites of TOX3 (reference sequence NM_001146188) and BC034767 (reference sequence IMAGE 5172237) were screened for mutations in 188 German RLS cases.
Mutation screening was performed with high resolution melting curve analysis using the LightScanner technology and standard protocols (IDAHO Technology Inc.). DNAs were analyzed in doublets. Samples with aberrant melting pattern were sequenced using BigDyeTerminator chemistry 3.1 (ABI) on an ABI 3730 sequencer. Sequence analysis was performed with the Staden package [26]. Primers were designed using ExonPrimer (http://ihg.gsf.de) or Primer3plus (www.bioinformatics.nl/cgi-bin/primer3plus/primer3plus.cgi). All identified variants were then genotyped in 735 RLS cases and 735 controls of the general population (KORA cohort) on the MassARRAY system, as described above.
In addition, fragment analysis of exon 7 of TOX3 was performed to screen for polymorphic CAG trinucleotide repeats. DNA of 100 controls (50 females, 50 males) was pooled and analyzed on an ABI 3730 sequencer with LIZ-500 (ABI) as a standard. Primers were designed using Primer3plus; the forward primer contains FAM for detection. Analysis was performed using GeneMapper v3.5.

Expression analyses
Associations between MEIS1/ETAA1 RNA expression and rs6747972 and between TOX3/BC034767/CREB-1/CREBBP/CITED1 expression and rs3104767 were assessed using genome-wide SNP data (Affymetrix 6.0 chip) in conjunction with microarray data for human blood samples (n = 323 general population controls from the KORA cohort, Illumina Human WG6 v2 Expression BeadChip) [11]. A linear regression model conditioned on expression and controlling for age and sex was used to test for association.

Prediction of genetic risk
Based on the performance of P-value-threshold selected SNPs in a training and a test sample. As training sample, we used those GWA cases which had also been genotyped for our previous study [2]. We also included the control samples from this study. As a first quality control step, we carried out an association analysis comparing the Affymetrix 500K genotypes of these GWA cases to the Affymetrix 5.0 genotypes of the same cases. Significant P-values would indicate systematic differences in the genotyping between the different chips. For further analysis, we only used those 259,302 SNPs with P-values >0.10. We performed a second quality control step in which IDs with a call rate below 98% and SNPs with a call rate below 98%, a MAF lower than 5%, or a P-value for deviation from HWE <0.00001 were removed.
Further, we excluded the four already known risk loci as well as the two newly identified loci and performed LD-pruning to limit the analysis to SNPs not in LD with each other. This was performed using a window-size of 50 SNPs. In each step, this window was shifted 5 SNPs. We used a threshold of 2 for the VIF (variance inflation factor). 76,532 SNPs, 326 cases, and 1,498 controls were included in the final training dataset. We conducted logistic regression with age and sex as covariates. Based on these association results, the sum score of SNPs showing the most significant effects (i.e. the number of risk alleles over all SNPs) weighted by the ln(OR) of these effects was chosen as predictor variable in the test set, comprising the remaining 583 cases of the GWA sample and 1,526 controls. None of these cases/controls were included in the training sample, i.e. the test sample constitutes a completely independent sample. Based on this sum score, we calculated the ROC curve and Nagelkerke's R² to measure the explained variance.
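The ln(OR)-weighted sum score and its evaluation by the area under the ROC curve can be sketched in Python as follows. This is a minimal illustration, not the original analysis code: the AUC is computed through its rank-based interpretation (the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half), and the odds ratios and genotypes below are made-up toy values.

```python
import math
from typing import Sequence


def weighted_score(risk_allele_counts: Sequence[int],
                   odds_ratios: Sequence[float]) -> float:
    """Sum of risk-allele counts (0, 1, or 2 per SNP) weighted by ln(OR)."""
    if len(risk_allele_counts) != len(odds_ratios):
        raise ValueError("one odds ratio is needed per SNP")
    return sum(n * math.log(o) for n, o in zip(risk_allele_counts, odds_ratios))


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical training-set odds ratios for three SNPs and toy test genotypes.
ors = [1.5, 1.2, 1.3]
cases = [weighted_score(g, ors) for g in ([2, 1, 1], [1, 2, 2], [2, 2, 0])]
controls = [weighted_score(g, ors) for g in ([0, 1, 0], [1, 0, 1], [2, 1, 1])]
print(f"AUC = {auc(cases, controls):.3f}")
```

The pairwise loop is quadratic in sample size, which is acceptable for a sketch; a rank-based implementation would be preferable for samples of the size analysed here.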
Based on a weighted risk allele score. To evaluate the predictive value in our sample, we calculated a weighted sum score of risk alleles in the combined GWA and replication sample. To this end, we used one SNP from each RLS risk region and also included markers from the two newly identified regions on chromosome 16q12 and 2p14 (MEIS1: rs2300478, 2p14: rs6747972, BTBD9: rs9296249, PTPRD: rs1975197, MAP2K5: rs11635424, TOX3/BC034767: rs3104767). At each SNP, the number of risk alleles was weighted with the corresponding ln(OR) for this SNP. The corresponding distribution of the score in cases and controls is illustrated in Figure S3. Employing this score for risk prediction resulted in an AUC of 0.651 (Figure S4).

Figure S5 Cluster plots of GWA genotyping for the six risk loci. For the best-associated SNPs at each risk locus, cluster plots were generated for cases and controls. Intensities of the A and B allele (based on the Affymetrix annotation of the SNPs) are given on the x- and y-axes and the respective genotypes are indicated in blue, green, and orange. (PDF)

Table S1 GWA results for SNPs with λ-corrected P_GWA < 10⁻⁴ and additional SNPs selected for replication. A star (*) indicates SNPs which had been identified in previous RLS GWAs [2][3][4]. P-values of the GWA phase are given as λ-corrected nominal P-values. Two different methods for λ correction were applied, multi-dimensional-scaling (MDS)-analysis using PLINK and variance components (VC)-analysis using the EMMAX software with the P-values listed in the respective columns "MDS λ-corrected P_GWA" and "VC λ-corrected P_GWA". The selection of SNPs for replication was based on the MDS λ-corrected P-values. r²-values based on HapMap CEU data are given for those SNPs which were selected for replication based on their LD with the best-associated SNP in each region. Genomic position and gene annotation refer to the hg18 genome. (DOC)

Table S2 Replication stage association results for individual replication samples. P-values are derived from logistic regression and correcting for gender and age as well as for population stratification where possible (see Materials and Methods). Each replication sample was Bonferroni-corrected using the number of SNPs which passed quality control for the respective sample. The OR refers to the minor allele. NA: SNP could not be analysed due to failing quality control in the respective sample. (DOC)

Table S3 Power analysis for GWA, replication and joint analysis of GWA and replication. Power calculation was performed using the CaTS power calculator [25] using a prevalence set of 0.08 and an additive genetic model. The significance level α was set at 0.05/74 for replication stage analysis and at 0.05/301,406 for genome-wide significance in the joint analysis of GWA and replication. (DOC)

Table S4 Results of TOX3 and BC034767 mutation screening. * "A" refers to the mutant allele, "B" to the reference allele. Position refers to hg18 genome annotation. Codon numbering refers to the reference sequence NM_001146188. Data of the 1000 Genomes project was obtained from the November 2010 release via the 1000 Genomes browser (http://browser.1000genomes.org/index.html). (DOC)

Table S5 Prediction of genetic risk; training- and test-set approach. Inclusion threshold P-values were derived from a logistic regression with age and sex as covariates in the training sample. # SNPs indicates the number of SNPs passing the inclusion threshold. Based on these association results, the sum score of SNPs showing the most significant effects (i.e. the number of risk alleles over all SNPs) weighted by the ln(OR) of these effects was chosen as predictor variable in the test set. Based on this sum score, an AUC and Nagelkerke's R² were calculated. (DOC)

Table S7 Genotype data of GWA samples. Genotypes of the GWA samples are given for the eight best-associated SNPs (see Table 1). SNP alleles are ACGT-coded. Phenotype information includes gender (1 = male, 2 = female) and disease status (1 = unaffected, 2 = affected). (XLS)