Genetic Structure of Europeans: A View from the North–East

Using principal component (PC) analysis, we studied the genetic constitution of 3,112 individuals from Europe as portrayed by more than 270,000 single nucleotide polymorphisms (SNPs) genotyped with the Illumina Infinium platform. In cohorts where the sample size was >100, one hundred randomly chosen samples were used for analysis to minimize the sample size effect, resulting in a total of 1,564 samples. This analysis revealed that the genetic structure of the European population correlates closely with geography. The first two PCs highlight the genetic diversity corresponding to the northwest to southeast gradient and position the populations according to their approximate geographic origin. The resulting genetic map forms a triangular structure with a) Finland, b) the Baltic region, Poland and Western Russia, and c) Italy as its vertexes, and with d) Central- and Western Europe in its centre. Inter- and intra- population genetic differences were quantified by the inflation factor lambda (λ) (ranging from 1.00 to 4.21), fixation index (Fst) (ranging from 0.000 to 0.023), and by the number of markers exhibiting significant allele frequency differences in pair-wise population comparisons. The estimated lambda was used to assess the real diminishing impact to association statistics when two distinct populations are merged directly in an analysis. When the PC analysis was confined to the 1,019 Estonian individuals (0.1% of the Estonian population), a fine structure emerged that correlated with the geography of individual counties. With at least two cohorts available from several countries, genetic substructures were investigated in Czech, Finnish, German, Estonian and Italian populations. Together with previously published data, our results allow the creation of a comprehensive European genetic map that will greatly facilitate inter-population genetic studies including genome wide association studies (GWAS).


Introduction
Over the last few years, the number of genome-wide association studies (GWAS) has increased markedly and, in concert, these efforts have led to the identification of a large number of new susceptibility loci for common multi-factorial disorders [1]. The underlying technology is developing rapidly and is currently moving from the use of high density SNP arrays towards medical re-sequencing of large genomic regions. Given this development, the availability of thoroughly phenotyped patient and control samples is becoming even more important. Furthermore, due to the small effect sizes that characterize susceptibility genes for multi-factorial traits, potentially successful GWAS rely on large sample numbers, with additional pressure put on the quality of samples [2]. In reality, however, there will be only very few cohorts comprising 10,000 or even more samples (www.p3gconsortium.org). Exceptions include, for example, the DeCODE studies in Iceland (www.decode.com) and the EPIC (European Prospective Investigation into Cancer and Nutrition) cohort (http://epic.iarc.fr). Collaborations involving diverse sample collections are therefore essential and efforts in this field are promising, for example the establishment of the Biobanking and BioMolecular Resource Infrastructure (www.bbmri.eu). With cohorts from different countries or even from different sites within the same country being used for genetic epidemiological research, the problem of confounding by population stratification has to be addressed. Fortunately, with the vast amount of the genome-wide data available, the actual extent and relevance of population genetic differences can be clarified with high confidence for most commonly used SNP sets.
Confounding by population stratification has been extensively studied in the past [3]. Heterogeneity between studied samples can give false-positive results in association studies, as the association with the trait may be the result of systematic ancestry differences in allele frequencies between groups [4]. Three main approaches have been proposed so far to capture population genetic differences analytically, namely a) Bayesian clustering [5], b) principal component (PC) analysis [6] and c) multidimensional scaling (MDS) analysis based upon genome-wide identity-by-state (IBS) distances [7]. With the recent availability of high density SNP data, PC and MDS methodologies have become increasingly popular because they require less computing power and have higher discriminatory power than Bayesian analysis for closely related (e.g. European) populations [8]. Therefore, PC analysis is more widely used in the literature. Examples of its recent use are provided by the analysis of high density microarray SNP data at either a global level [9,10] or, in greater detail, for selected European populations [11][12][13][14][15] or within a single country [16][17][18].
In Europe, PC analysis has revealed the strongest genetic differentiation between the northwest and southeast of the continent. The first PC accounts for approximately twice as much of the genetic variation as PC2 [12,13,15]. In addition, Price et al. (2008) have shown in their study of US Americans of European descent that the consideration of three clusters of individuals, which roughly corresponded to Northwest Europe, Southeast Europe and Ashkenazi Jewish ancestry, may be sufficient to correct for most of the population stratification affecting genetic association studies. However, the extent to which the results of PC analysis reflect the true underlying genetic map of Europe is critically dependent upon the choice of populations analyzed. Optimal coverage of European populations has not been achieved so far and still represents a goal for future collaborative studies. At present, however, it appears essential that the peripheral populations of Europe or those with a strong founder effect in particular must not be left out of studies aiming at the construction of a continent-wide genetic map.
Here, we present an analysis of more than 270,000 SNPs, genotyped with the Illumina 318K/370CNV chips, on 3,112 individuals across 16 European countries (comprising 19 different samples). Our focus has been on the Baltic region and Eastern Europe since these regions have not been studied in much detail before. The results suggest that geographically adjacent populations overlap partly according to the PC analysis, forming four subgroups. Consideration of the inflation factor lambda (λ) [19] further indicates that the loss of power would be minimal when performing and adjusting genetic association studies within these groups.

Results
In order to investigate in detail the genetic structure of the Baltic countries and neighbouring North-Eastern Europe, whole genome genotyping was undertaken for over 1,000 Estonians and additional individuals from Bulgaria, the Czech Republic, Hungary, Latvia, Lithuania, Poland and Russia, using the Illumina Human370CNV chip. In addition, raw genotyping data were obtained from Scandinavia and other Western and Northern European countries (Table 1). From samples with >100 individuals available (Table 1), a sub-set of 100 individuals was chosen at random for subsequent analyses in order to minimize sample size effects. In all instances, the inflation factor λ as computed for the complete data set versus the random sub-set was close to unity, indicating that the latter sets were representative of the entire samples. In total, genotypes of 273,464 SNPs from 1,564 individuals were included in the statistical analyses.
The HapMap data were used to evaluate our results and to show the genetic distance from other continents. The HapMap data included four populations: CEU (U.S. Utah residents with ancestry from Northern and Western Europe), YRI (the Yoruba people of Ibadan, Nigeria), CHB (Han Chinese from Beijing), and JPT (Japanese from Tokyo), a total of 203 individuals.

Minor allele frequency (MAF)
PLINK was used to compute the minor allele frequencies (MAF) using all the 273,464 SNPs that passed the quality control (QC) procedures. Since the Estonian biobank sample (www.geenivaramu.ee) has been part of several previous GWAS, it was interesting to compare the MAF spectrum seen particularly in Estonia with that of other populations. The correlation coefficient r² obtained varied markedly, from 0.9247 for Latvia and 0.8913 for Finland (Helsinki) to 0.7312 for Southern Italy. In order to examine the extent and likely impact of MAF differences between the studied populations in general, we next examined LD structure, undertook PCA, and calculated fixation indexes Fst and inflation factors λ (see below).
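The MAF comparison above can be illustrated with a minimal sketch. It assumes that r² is the squared Pearson correlation of per-SNP MAF values between two populations; the text does not state the exact procedure, and the function name and input values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-SNP MAF values in two populations (illustration only).
maf_population_a = [0.10, 0.25, 0.40, 0.05]
maf_population_b = [0.12, 0.22, 0.45, 0.07]
maf_r2 = r_squared(maf_population_a, maf_population_b)
```

In practice the two lists would hold the MAF of the same SNPs, in the same order, for the two populations being compared.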

Linkage disequilibrium (LD) structure
Pair-wise LD between SNPs was measured by means of the r² statistic (see Methods). Genome-wide, average r² ranged from 0.24 to 0.28 at smaller distances (5 kb), and decreased to between 0.05 and 0.07 at larger distances (100 kb), depending upon population. Above 75 kb the cohorts started to diverge reflecting the LD extinction towards the north (Figure 1), although the difference was not statistically significant (one-tailed t-test; a p-value ≤ 0.05 was considered statistically significant).

Principal component (PC) and multidimensional scaling (MDS) analysis
PC analysis has been used in most previous studies of the European genetic structure. Here, PC analyses were performed using EIGENSOFT with default parameters. In total, 1,564 individuals plus 203 HapMap members and 266,356 autosomal SNPs were used as the input dataset. After the removal of outliers, 1,539 individuals (or 1,742 including the HapMap members) remained (Table S1). The first PC explains 8.7% of the genetic variance and the second PC explains 4.9%; all other PCs explained much smaller fractions, demonstrating that Europe is genetically quite uniform. If we add African and Asian HapMap populations to the European samples, the first two PCs describe 36.6% and 23.8% of the genetic variance (Figure 2). At a more detailed level, however, several distinct regions can be distinguished within Europe: 1) Finland, 2) the Baltic region (Estonia, Latvia and Lithuania), Western Russia and Poland, 3) Central and Western Europe, and 4) Italy, with the southern Italians being more "distant" (Figure 2). PC analysis of the 1,026 Estonians revealed the fine-structure of this population, with the first two PCs describing 1.9% and 1.5% of the genetic variance, respectively. The spread of Estonian individuals is relatively wide as the subregions overlap at the individual level, but the median values of the PCs, calculated for each county, show a remarkable correlation with the regional map of Estonia (Figures 2 and S1). PC analysis of genome-wide SNP genotypes is therefore capable of highlighting both global and minute intra-population genetic differences (Figure 2). As expected, MDS analyses of the data with PLINK yielded a scatter plot of the first two dimensions that looked very similar to that generated by PC analyses (Figure S2).
The twenty-two most variable SNPs (11 SNPs for the first PC and 11 SNPs for the second), presented as default output of the EIGENSOFT analysis, are listed in Table S3. These SNPs have significantly different allele frequencies between the studied populations and correspond to the largest eigenvalues of the first two PCs explaining the most variance.

Fixation index (Fst)
Pair-wise Fst values between samples were calculated using EIGENSOFT. Fst values indicate how much of the genetic variability between individuals from different populations is due to population affiliation. In our study, Fst was found to correlate considerably with geographic distances (r² = 0.382, p-value ≪ 0.01). Values ranged from ≤0.001 for neighbouring populations to 0.023 for Southern Italy and a young subisolate of Finland (Kuusamo) (Table S2). The Fst distances between the HapMap CEU sample and the other samples also correlated with geographic distance (r² = 0.291, p-value < 0.01). The German population sample showed zero Fst with the CEU sample whereas the Finns from Kuusamo and the southern Italians were most different from them (Fst = 0.013 and 0.008, respectively) (Table S2). Pair-wise Fst values for CEU and either Latvians, Lithuanians, Estonians or western Russians were intermediate (0.006, 0.005, 0.004 and 0.004, respectively).
Two or more samples were available from several countries, which allowed us to measure the intra-population variability. Using Barrier 2.2 software, we also correlated genetic and geographic distances, as measured by pair-wise Fst and by the great-circle coordinates of the capital or the city where an individual population sample had been recruited, respectively. The results overlapped with previous findings in that the first barrier was seen between Finland and all other samples, a second barrier separated Southern Italy from the remainder, a third was found between Western Russia, Poland and Lithuania on the one hand, and Bulgaria on the other, a fourth was seen between Kuusamo and Helsinki, and a fifth was between the Baltic region and Poland on the one hand, and Sweden on the other (Figure S4). Table 2 lists the pair-wise inflation factor λ between the studied samples. The inflation factor λ was calculated with the Genomic Control method [19]. We assumed λ to be constant across the genome, and λ was estimated as the median of the observed chi-square statistics divided by the median of the central chi-square distribution with 1 degree of freedom (i.e. 0.456). This factor was found to range from unity (between samples from the same country) to 4.21 (between Spain and the Kuusamo region). The overall average λ value was 1.82; in separate clusters it amounted to 1.23 (Baltic Region, Western Russia and Poland), 1.54 (Italy and Spain), 1.22 (Central and Western Europe), and 1.86 (Finland), respectively. The correlation coefficient between geographic distance and λ was r² = 0.386 (p-value ≪ 0.01). This value is probably an underestimate of the European-wide relationship due to the inclusion of the Kuusamo and Geneva samples: one is an isolate and the other is a highly heterogeneous international metropolis.
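To make the meaning of Fst concrete, the following minimal sketch computes a textbook Wright-style Fst for two populations from allele frequencies, as a ratio of summed heterozygosity terms over SNPs. This is an illustrative simplification that assumes equal weighting of the two populations and ignores sample-size corrections; it is not the estimator implemented in EIGENSOFT, and the function name is hypothetical.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Textbook Fst = (H_T - H_S) / H_T, summed over biallelic SNPs.

    freqs_a, freqs_b: allele frequencies of the same SNPs in two populations.
    Assumption of this sketch: both populations are weighted equally and no
    correction for finite sample size is applied.
    """
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_mean = (p_a + p_b) / 2.0
        # Expected heterozygosity of the pooled population.
        h_total = 2.0 * p_mean * (1.0 - p_mean)
        # Mean expected heterozygosity within the two populations.
        h_within = (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical single SNP with frequencies 0.4 and 0.6 in the two populations.
example_fst = fst_two_populations([0.4], [0.6])
```

Identical allele frequencies in both populations give a value of zero, which matches the interpretation of Fst as the share of variability attributable to population affiliation.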
The λ values between CEU and the other samples (Table 2) were smaller than those obtained using the Northern German sample as a reference, chosen as the nearest to the origin of the CEU sample, and the correlation between geography and λ with CEU was only r² = 0.251 (p-value 0.017). Both results probably reflect the higher genetic variability in the CEU sample.

Inflation factor lambda (λ)
The high level of genetic homogeneity in Europe was again highlighted by the λ values calculated between the four HapMap samples (data not shown), which ranged from 21.56 (YRI vs JPT) via 13.27 (CEU vs CHB) to 1.77 between CHB and JPT. The λ value between the African and European samples was slightly smaller than that between the African and Asian samples.
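The λ estimate described above (median of the observed chi-square statistics divided by 0.456) can be sketched in a few lines. The function names and the example statistics are hypothetical; in the study the calculation was performed with HelixTree.

```python
from statistics import median

# Median of the central chi-square distribution with 1 degree of freedom,
# using the value quoted in the text.
CHI2_1DF_MEDIAN = 0.456


def inflation_factor(chi2_stats):
    """Genomic Control lambda: observed median chi-square / expected median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def gc_correct(chi2_stat, lam):
    """Apply the Genomic Control correction by dividing a statistic by lambda."""
    return chi2_stat / lam


# Hypothetical per-SNP chi-square statistics from one pair-wise comparison.
example_stats = [0.10, 0.30, 0.684, 1.20, 5.00]
example_lambda = inflation_factor(example_stats)
```

With real data, the list would contain one 1-df association statistic per SNP from the comparison of two population samples.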

Marker-wise significance test
A marker-wise significance test was performed in order to assess the allelic distribution in pair-wise comparisons of the studied cohorts (the CEU sample was not included) (Table 2). After applying  1 SNPs, respectively. The total number of loci that had a "significant SNP" was 2,263. In order to reduce the number of loci and identify the meaningful hits, only the loci which had at least two significant hits in at least two pair-wise comparisons were considered, thereby decreasing the total number to 594 loci. Only 18 of those arose from comparisons between populations other than Italy or Finland (Table S4).

Discussion
Studies of mitochondrial DNA (mtDNA) have suggested substantial genetic homogeneity of European populations [20], with only a few geographic or linguistic isolates appearing to be genetic isolates as well [21]. On the other hand, analyses of the Y chromosome [22,23] and of autosomal diversity [24] have shown a general gradient of genetic similarity running from the southeast to the northwest of the continent.
In the present study using autosomal SNPs and high density genotyping, we have focused on the genetic structure of the Baltic, Finnish and other North-Eastern European populations, while populations from Western and Southern Europe were included mainly for comparison (Figure 2). Overall, the samples under investigation have a large geographic coverage, ranging from Spain and Italy, through the Baltic to Finland and Western Russia. Previous studies have focused upon the genetic structure in Central and Western Europe [11][12][13], Northern Europe [17,25] or studied US Americans of European and Ashkenazi-Jewish descent [14,15].
Genome-wide analyses presented here have revealed, as expected, more extensive LD in isolated populations than in outbred populations. It can be presumed that the average r² value, particularly at larger inter-marker distances, reflects the extent of panmixia in a population. Indeed, the Kuusamo sample, a population isolate that was established from a small number of founders only 300 years ago, had the highest r² irrespective of distance in our and previous studies [26]. At the other end of the scale was Geneva, one of the most cosmopolitan cities in Europe, which yielded the lowest r² values. Thus, our data corroborate earlier suggestions that the amount of LD that persists over time is markedly reduced in more admixed populations [27]. Surprisingly, the Polish cohort showed an LD pattern similar to that of the Kuusamo population, which probably reflects the homogeneity of the Polish population. Here the similarity could be attributed to a founder effect or admixture, as the Polish sample comes from West Pomerania, a region that was repopulated after the Second World War, following the expulsion of the German population, with people from Eastern Poland and also some Ukrainians. Small sample size (n = 45) does not provide a sufficient explanation for this finding because the Hungarian and Bulgarian samples were also similar in size (Table 1), but gave LD patterns distinct from the Polish and Kuusamo samples (Figure 1). PC analysis yielded a genetic map where the first two PCs highlight the genetic diversity corresponding to the Northwest to Southeast gradient and position the populations according to their approximate geographic origin. Our genetic map shows a slightly different tendency from previously published ones in that the scatter plot takes the form of a triangle, with the Finnish, Baltic and Italian samples as its vertexes, and with Central Europe residing in its centre.
The two PCs explain 8% and 4% of the genetic variability in the samples, which is almost twice as much as in previous European-based studies. This increase is likely due to the fact that the geographic coverage in our study has been broader and that our data captured more genetic variability (Figure 2).
Interestingly, PC analysis was also capable of highlighting intra-population differences, such as between the two Finnish and the two Italian samples, respectively. A low level of intra-population differentiation in Germany has been reported previously [18], and was confirmed here. In addition, we detected intra-population differences within the Czech and Estonian samples (Figure S3). In the case of the Czech Republic, two samples were available: Prague and Moravia. Although their pair-wise Fst was virtually zero, the median values of the PCs for the two sample sets are different. This is explicable by the fact that Moravia has a long shared history with the remainder of the Czech Republic, but is nevertheless separated from the rest of the country by the Czech-Moravian highlands, which in the past hindered stronger intermixing.
Estonia is a small country with no geographic barriers and its Estonian population is merely one million. In order to study the genetic structure of Estonia in more detail, all Estonian individuals were grouped here by their county of birth. Then, PCA was performed and the mean values of the first two PCs of the counties were plotted onto the Estonian regional map (Figure 2). Surprisingly, the resulting genetic map correlates almost perfectly with the geographic map, although Estonia is only 43,400 km² in size, and the mean area of a county only 2,900 km². Thus, fine-scale genetic differences can be revealed by PC analysis, and the results can be useful for the identification of distant relatives.
Barrier analysis revealed genetic barriers between Finland, Italy and other countries, as has been described before [12].
Interestingly, barriers could be demonstrated within Finland (between Helsinki and Kuusamo) and Italy (between the northern and southern parts). Another barrier emerged between the Eastern Baltic region and Sweden, but not between the Eastern Baltic region and Poland (Figure S4). The barrier between Bulgaria and Western Russia, Poland and Lithuania may have arisen due to the fact that several populations are missing in between those countries. It has been shown previously that the populations of central European background are less differentiated genetically, whereas the Finns exhibit a more homogeneous population structure with decreased genetic diversity [17,25].
In GWAS using large numbers of markers, multiple testing correction becomes an important issue, and a genome-wide significance threshold of p < 5×10⁻⁷ has been proposed [16]. At the same time, adjustment for population stratification can decrease the necessary level of nominal significance even further. This can be illustrated, for example, by adopting the Genomic Control approach [19], where the factor λ by which the chi-squared statistic is inflated by confounding is first estimated from the null loci, and correction is then applied by dividing the actual association chi-square statistic by λ. Figure 3 illustrates the effect that this procedure would have by showing, for each possible λ, the highest p-value that stays below 0.05 after correction. Two scenarios are presented: 1) tests with 1 degree of freedom (Allelic, Additive, Dominant and Recessive) and 2) tests with 2 degrees of freedom (Genotypic). When λ = 1.5 (which would be common if patients and controls came from different European countries) (Table 2), the original p-value must be approximately three times lower than 0.05. For geographically distant samples, the necessary reduction may be by a factor of up to 500, as would be the case with Kuusamo and Southern Italy. Interestingly, λ values with respect to other samples are smaller for CEU (originating mostly from Northern Germany, the Netherlands and Belgium [12]) than for Northern Germany. This is probably due to the higher genetic variability in the CEU sample, whose ancestry is a mixture of several different populations; the CEU sample is therefore a better reference for the European population than a single population. It should be pointed out that any adjustment for stratification inflates the multiple testing correction so that, if genetically distant case and control samples are compared in an association study, the genome-wide significance threshold would in some cases be as low as p < 1×10⁻¹⁰.
From our results, conclusions can be drawn as to which European populations can be combined in GWAS, considering the pair-wise calculations of inflation factor λ and Fst values, although meta-analyses may often be a more appropriate option [28,29].
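The relationship between λ and the required nominal p-value can be reproduced with a short sketch. It assumes only the procedure described above (the chi-square statistic is divided by λ before significance is assessed) and uses closed-form chi-square tail probabilities for 1 and 2 degrees of freedom; the function names are hypothetical and this is not the code behind Figure 3.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def chi2_quantile(p, df):
    """Chi-square value whose upper-tail probability is p (df 1 or 2 only)."""
    if df == 1:
        # A 1-df chi-square variable is the square of a standard normal one.
        return _STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2
    if df == 2:
        # A 2-df chi-square variable is exponential with mean 2.
        return -2.0 * log(p)
    raise ValueError("this sketch supports df = 1 or df = 2 only")


def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-square value (df 1 or 2 only)."""
    if df == 1:
        return 2.0 * (1.0 - _STD_NORMAL.cdf(sqrt(x)))
    if df == 2:
        return exp(-x / 2.0)
    raise ValueError("this sketch supports df = 1 or df = 2 only")


def max_uncorrected_p(lam, alpha=0.05, df=1):
    """Largest uncorrected p-value still at alpha after dividing by lambda."""
    # The corrected statistic must reach the alpha quantile, so the
    # uncorrected statistic must reach lambda times that quantile.
    return chi2_upper_tail(lam * chi2_quantile(alpha, df), df)


p_needed_1df = max_uncorrected_p(1.5, df=1)
p_needed_2df = max_uncorrected_p(1.5, df=2)
```

For λ = 1.5 and a 1-df test this gives a p-value of about 0.016, roughly one third of 0.05, in line with the statement above.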
The marker-wise significance test for allelic differences in pair-wise comparisons between the studied samples resulted in 2,263 loci. As our sample included some genetically and geographically distant cohorts (Finns and Italians) in which a strong founder effect and isolation-driven genetic drift have changed the respective allele frequencies, only loci that were present in non-Italian and non-Finnish comparisons were considered. This step decreased the number of significantly different loci to 18 (Table S4). Four genes were within the LCT locus (a haplotype block covering more than 1 Mb [30]), and it has been shown that the LCT region differentiates between European populations [11], but also within a given population [16]. Three of the genetically most variable SNPs revealed by PC analysis represented loci that were also present in the previously mentioned list of 18 loci.
In conclusion, we have described the European genetic structure by three different measures: the inflation factor λ, Fst and PC. As a result, according to the first two PCs, individuals from the same geographic origin cluster together and form a genetic map where four areas could be identified: 1) Central and Western Europe, 2) the Baltic countries, Poland and Western Russia, 3) Finland, and 4) Italy. If not corrected for, the inter-population differences would affect the significance of disease-gene associations. A detailed description of the European population structure has consequences and implications for the design of future GWAS, particularly regarding sample size and choice of controls. As a matter of fact, the knowledge of genetic distances between different populations is helpful in defining which biobanks could sensibly contribute samples and data to GWAS.

Ethics Statement
The study was approved by the Ethics Review Committee on Human Research of the University of Tartu (166/T -21, 17.12.2007). Written informed consent for participation was obtained from all study subjects.

Samples
Samples are described in detail in the supplementary methods section (Text S1). The 3,112 individuals studied represent a total of 19 cohorts (the Czech Republic samples were used as one in all analyses except for the inter-population structure analyses) from 16 countries: Austria (Vienna), Bulgaria (entire country), Czech Republic (Prague, Moravia and Silesia), Estonia (entire country), Finland (Helsinki, and a young internal subisolate of Kuusamo), France (Paris), Germany (Schleswig-Holstein, Augsburg region), Hungary (entire country), Italy (Borbera Valley, Region of Apulia), Latvia (Riga), Lithuania (entire country), Poland (West-Pomerania), Russia (Andreapol district of the Tver region), Spain (entire country), Sweden (Stockholm) and Switzerland (Geneva) (Table 1).
The HapMap data used in our study comprised four populations, namely CEU (U.S. Utah residents with ancestry from Northern and Western Europe), YRI (the Yoruba people of Ibadan, Nigeria), CHB (unrelated individuals from Beijing, China), and JPT (unrelated individuals from Tokyo, Japan). HumanHap300 (v1-0.0) genotypes were downloaded from Illumina iControlDB 1.1.2 (www.illumina.com/pages.ilmn?ID=231), comprising a total of 203 individuals. For the CEU and YRI samples, only parents were used.

Genotyping
For the samples from Bulgaria, Czech Republic, Estonia, Hungary, Latvia, Lithuania, Poland and Russia, genotyping was performed at the Estonian Biocentre (Tartu, Estonia) according to the manufacturer's instructions, using the Illumina Human370CNV-duo chips.
Additional raw genotyping data were obtained for the samples from Austria, Finland, Southern Germany (Augsburg region) and Italy for Illumina Human370CNV-duo, from France, Northern Germany (Schleswig-Holstein), Spain and Sweden for HumanHap300-duo, and from Switzerland for HumanHap550 data.
Systematic quality control (QC) was applied to all genotypes generated at the Estonian Biocentre. Duplicates from the Estonian sample were used to assess genotyping reproducibility, i.e. every 40th individual was duplicated and the mean discordance per SNP between pairs of individuals was found to be less than 1 in 5000 (0.02%). The per individual call rate had to be at least 95% for individuals to be included into subsequent analyses. The number of individuals before and after QC is shown in detail in Table 1.
Only the genotypes for those 311,226 SNPs that were typed in all 3,378 individuals were included in subsequent computational analyses. Closely related individuals were identified using estimation of the proportion of the genome shared identical by descent (IBD), and the relative with the lower call rate was removed. Inbreeding coefficient F was assessed in order to detect potential DNA contamination. SNPs found to be out of Hardy-Weinberg equilibrium at p < 10⁻⁵, or missing more than 1% of genotypes, or with a minor allele frequency < 0.01 were removed from the dataset [16]. The total rate of genotyping calls in the remaining individuals was 0.995. After QC, 273,454 SNPs remained (from 3,112 individuals), including 203 HapMap individuals that increased the overall sample size to 3,315. All QC procedures were conducted with Illumina's BeadStudio (www.illumina.com) and the PLINK software [7].
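The per-SNP filters listed above can be expressed as a small predicate. This is a minimal sketch under stated assumptions: genotype counts and a precomputed Hardy-Weinberg p-value are taken as inputs, and the function name and argument layout are hypothetical. In the study these filters were applied with BeadStudio and PLINK.

```python
def snp_passes_qc(n_aa, n_ab, n_bb, n_missing, hwe_p,
                  hwe_min_p=1e-5, max_missing=0.01, min_maf=0.01):
    """Return True if a SNP survives the three per-SNP filters in the text.

    n_aa, n_ab, n_bb: genotype counts among called individuals.
    n_missing: number of individuals without a genotype call.
    hwe_p: Hardy-Weinberg p-value computed elsewhere (an input here).
    """
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return False
    missing_rate = n_missing / (n_called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1.0 - freq_a)
    # Remove if HWE p < 1e-5, missingness > 1%, or MAF < 0.01.
    return hwe_p >= hwe_min_p and missing_rate <= max_missing and maf >= min_maf


# Hypothetical SNP: 1,000 called genotypes, none missing, HWE p-value 0.5.
example_keep = snp_passes_qc(400, 480, 120, 0, 0.5)
```

The default thresholds mirror the values given in the text; they are exposed as keyword arguments so that other cut-offs can be tried.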

Statistical analysis
Pair-wise LD was measured by r² for all SNPs less than 100 kb apart using the Haploview software [31]. A custom Perl script was used to categorize r² according to inter-marker distance (0-5 kb, 5-10 kb etc.) and mean r² was calculated for each category. The significance of the mean r² values between cohorts was tested with the one-tailed t-test and a p-value ≤ 0.05 was considered statistically significant.
Principal component (PC) analysis was performed and Fst determined between samples using EIGENSOFT [6] on three sets of samples: 1) HapMap+Europe, 2) Europe, and 3) Estonia alone with individual counties. All analyses were performed with the default parameters. Multidimensional scaling (MDS) analyses were performed with the PLINK software. The marker set was filtered according to pair-wise LD (r² cut-off = 0.2) in order to remove correlated markers. The number of remaining markers was 68,201. All PC values and MDS dimensions were multiplied by −1 to render scatter plots more similar to the geographic distribution of individual origin.
Geographic barriers were computed with the Barrier v2.2 software [32]. For the geographic positioning of samples, the great-circle coordinates of the respective capital of the country of origin, or of the city where an individual population sample had been recruited, were used. The Fst pair-wise comparison matrix for genetic and geographic distance was used in barrier analyses. The geographic location of the CEU sample was approximated by Northern Germany (as shown in the Lao et al. 2008 paper). The geographic distances between the above mentioned cities were used to calculate the correlation coefficient between geography and statistics such as Fst and the inflation factor λ. Statistical tests were performed in R v2.8.1 (www.R-project.org).
Trend tests were performed in order to identify markers with significant pair-wise allele frequency differences between populations. The resulting p-values were subjected to Bonferroni correction and the significance threshold was set at p < 0.05, although the multiple testing which arises from the pair-wise comparisons was not taken into account. The "inflation factor" λ of the Genomic Control method [19] was calculated using HelixTree (Golden Helix, Inc. Bozeman, MT, USA, HelixTree® Software; www.goldenhelix.com).
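The binning step performed by the custom Perl script can be sketched as follows. The original script is not available in the text, so this Python version is an assumption about its behaviour: pairs are grouped into 5 kb distance categories below 100 kb and r² is averaged per category. The function name and the example pairs are hypothetical.

```python
from collections import defaultdict


def mean_r2_by_distance(pairs, bin_size_bp=5000, max_distance_bp=100000):
    """Average r2 per inter-marker distance category.

    pairs: iterable of (distance_bp, r2) tuples for SNP pairs.
    Returns a dict mapping the lower edge of each bin (in bp) to mean r2.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance_bp, r2 in pairs:
        # Only pairs less than max_distance_bp apart are considered.
        if distance_bp < 0 or distance_bp >= max_distance_bp:
            continue
        bin_start = (distance_bp // bin_size_bp) * bin_size_bp
        sums[bin_start] += r2
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Hypothetical SNP pairs: (distance in bp, r2).
example_pairs = [(1200, 0.30), (4800, 0.20), (7500, 0.10), (150000, 0.01)]
example_means = mean_r2_by_distance(example_pairs)
```

Here the first two pairs fall into the 0-5 kb category, the third into the 5-10 kb category, and the last pair is dropped because it exceeds 100 kb.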
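Geographic distances between sampling locations can be obtained from latitude and longitude with a great-circle formula. The sketch below uses the haversine formula and a mean Earth radius of 6,371 km; both are assumptions, since the text does not state which formula was used, and the coordinates shown are approximate values for illustration only.

```python
from math import asin, cos, radians, sin, sqrt

# Mean Earth radius in km (an assumption of this sketch).
EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    # Haversine formula.
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates of Tallinn and Helsinki (illustrative values).
tallinn_helsinki_km = great_circle_km(59.437, 24.754, 60.170, 24.938)
```

A matrix of such pair-wise distances could then be correlated with the corresponding Fst or λ values.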
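The Bonferroni step can be sketched as below. It assumes that correction is applied for the number of SNPs tested within one pair-wise comparison, consistent with the statement that multiple pair-wise comparisons were not taken into account; the function names and the example p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold equivalent to a corrected level alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_indices(p_values, alpha=0.05):
    """Indices of tests whose Bonferroni-corrected p-value is below alpha."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]


# Per-SNP threshold for 273,464 tests at a corrected level of 0.05.
per_snp_threshold = bonferroni_threshold(0.05, 273464)

# Hypothetical p-values from four trend tests.
example_hits = significant_indices([0.5, 1e-4, 0.02, 1e-9])
```

For 273,464 SNPs the per-SNP threshold is about 1.8×10⁻⁷.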

Table S4
Top eighteen genetically most variable loci from the pair-wise cohort association analysis. The locus was described by at least two SNPs and was present in at least two pair-wise cohort analyses.