A Genome-Wide Analysis of Populations from European Russia Reveals a New Pole of Genetic Diversity in Northern Europe

Several studies have examined the fine-scale structure of human genetic variation in Europe. However, the European sets analyzed represent mainly northern, western, central, and southern Europe. Here, we report an analysis of approximately 166,000 single nucleotide polymorphisms in populations from eastern (northeastern) Europe: four Russian populations from European Russia, and three populations from the northernmost Finno-Ugric ethnicities (Veps and two contrasting groups of Komi people). These were compared with several reference European samples, including Finns, Estonians, Latvians, Poles, Czechs, Germans, and Italians. The results demonstrated genetic heterogeneity among the populations living in the region studied. Russians from the central part of European Russia (Tver, Murom, and Kursk) exhibited similarities with populations from central–eastern Europe, and were distant from the Russian sample from northern Russia (Mezen district, Archangelsk region). Komi samples, especially Izhemski Komi, were significantly different from all other populations studied. These can be considered a second pole of genetic diversity in northern Europe (in addition to the pole occupied by Finns), as they had a distinct ancestry component. Russians from Mezen and the Finnic-speaking Veps were positioned between the two poles, but differed from each other in the proportions of Komi and Finnic ancestries. In general, our data provide a more complete genetic map of Europe, accounting for the diversity in its most eastern (northeastern) populations.


Introduction
Identifying and understanding patterns of genetic variation within and between populations has long been the major focus of studies in human population genetics. Over the last decade, our ability to investigate population structure has been significantly enhanced by the advances in high-throughput genotyping technologies, as these allow simultaneous genotyping of hundreds of thousands of polymorphic markers. Compared with the previous methodology used in human population genetics, they enabled a new level of accuracy and power without the constraint of having to use only a few loci as a proxy for the entire genome [1,2].
To date, a number of studies have examined the fine-scale structure of human genetic variation at a global, continental, regional, single-country, or even subpopulation level [3][4][5][6][7][8][9][10][11]. European ancestry is the best studied of these, and the strongest genetic differentiation has been found between the north and south of the continent. The identified European population substructure correlated well with geography [4][5][6][12]. Although these studies included many population samples, they mainly represented northern, western, central, and southern Europe, while populations from Eastern Europe, particularly from the European part of Russia, were less represented. The region is inhabited by ethnic Russians as well as different indigenous Finno-Ugric groups. In this study, we report an analysis of 165,872 single nucleotide polymorphisms (SNPs) in four Russian populations from European Russia, as well as in populations from two of the northernmost Finno-Ugric ethnic groups: Veps and Komi.
Russians are the largest ethnic group among the European populations: more than 80 million individuals live in an area that covers more than a third of continental Europe [13]. A recent study of genetic diversity in Europe performed by Nelis et al. [5] resulted in a genetic map of the continent that had a triangular structure and showed that Russians were forming one of its vertexes, together with Polish and Baltic samples. However, the Russian population included in that study originated from a single region of the European part of Russia (Tver), even though, in the context of existing genetic data (i.e., Y-chromosome and several autosomal polymorphisms) [14][15][16], European Russians could be subdivided into at least two groups: central-southern and northern Russians.
In order to study the genetic structure of the European Russians in greater detail, we combined genome-wide SNP data from the Tver sample mentioned above with the genotypes of three new Russian samples from southern (Kursk), eastern (Murom), and northern (Mezen) regions of European Russia (Figure 1). Taking into account the well-documented impact of Finno-Ugric communities on the ethnogenesis of Russians [17], the genotypes of Veps and Komi were also included in our analysis. An additional reason for including Veps and Komi was the scarcity of data on the fine-scale genetic structure of Finno-Ugrians, who were mainly represented by Finns, Saami, Estonians, and Hungarians [2,5,11]. The Finnic-speaking Veps (also called Vepsians, or Ves in ancient times) are one of the oldest peoples of northern Europe still found in northwest Russia (Figure 1). Veps were first mentioned in historical chronicles in the middle of the 6th century [18]. It has been proposed that Veps tribes inhabited the territories between Lakes Onega, Ladoga, and Beloe as early as the first half of the first millennium [18]. In contrast to the scarce Veps, the Komi (Komi-Zyryan) people, who belong to a different linguistic branch of the Finno-Ugric family, the Permian branch, are more numerous [13,19]. They occupy the northeastern-most part of Europe and consist of several ethnographic groups, formed during the 8th-19th centuries [19]. We included samples from two geographically and socioeconomically distant Komi groups: the Izhemski Komi and Priluzski Komi [20]. Finally, to place genetic variation into the geographical context of continental Europe, we also included genotypic data from several reference populations (Figure 1). The results demonstrated similarity between Russian populations from the central part of European Russia, as well as their proximity to populations from central-eastern Europe.
They also showed that the genetic distinctiveness of Russians from northern Russia resulted from their admixture with Finno-Ugric populations, among which a special impact should be attributed to the Komi people. This was manifested by a distinct ancestry component that differentiated Komi from all other European populations studied.

Samples
The used research protocols and forms of informed consent have been approved by the Ethic Commission of the Medico-Genetic Scientific Centre of the Russian Academy of Medical Sciences (an approval was signed by the Head of the Ethic Commission, Prof. L.F. Kurilo). Written informed consent for participation was obtained from all subjects included in the study.
Blood samples were collected in EDTA-coated vacutainers after recording genealogical information and obtaining informed consent from each individual. Inclusion in the study required that all individuals belong to the native ethnic group of the region studied (i.e., they belonged to at least the third generation living in a particular geographic region) and were healthy and unrelated. DNA was isolated from peripheral leukocytes according to standard techniques using proteinase K treatment and phenol-chloroform extraction [21]. Among the 615 individuals genotyped, 384 were Russians from Archangelsk (Mezen district, n = 96), Vladimir (Murom district, n = 96), Kursk (Kursk and Oktyabrsky districts, n = 96), and Tver (Andreapol district, n = 96) regions; 81 were Veps from the Babaevo district of Vologodsky region; and 150 were Komi from the Izhemski (Izhemski Komi, n = 79) and Priluzski (Priluzski Komi, n = 71) districts of the Komi Republic. DNA samples were genotyped using different versions of Illumina BeadChips: Human370CNV-Duo (Tver and Murom), Human660W-Quad (Kursk), and HumanOmniExpress (Mezen, Veps, and Komi), according to the manufacturer's protocol (Illumina Inc., USA). All samples were subjected to the same quality control procedures using the SNP and Variation Suite v.7.4.0 software package (Golden Helix, Bozeman, MT, USA). Only SNPs from the 22 autosomal chromosomes with minor allele frequency > 1%, Hardy-Weinberg equilibrium P > 0.00001, and genotyping success rate > 95% were accepted. Cryptic relatedness was tested with the same software, and from each detected pair of relatives (PI > 0.2), only one individual was chosen at random for the subsequent analyses. These steps resulted in the retention of 165,872 autosomal SNPs in 603 individuals. To investigate population genetic structure, we also included genotypes of several populations described by Nelis et al. [5]: Finns (samples from Helsinki (n = 100) and Kuusamo (n = 84)), Estonians (n = 100), Latvians (n = 95), Poles (n = 48), Czechs (n = 94), and Germans (n = 100). In addition, we used freely available genotype data from the HapMap 3 project (Italians from Tuscany (n = 88) and Han Chinese from Beijing (n = 78)) [22], as well as from the Human Genome Diversity Panel (HGDP; Russians (n = 25)) [23]. After filtering and removing all non-overlapping SNPs, a subset of 128,844 autosomal SNPs included genotypes available for all populations (except Chinese). Because background linkage disequilibrium (LD) can induce biases in principal component analysis (PCA) [24] and structure analyses [25], both marker sets (165,872 and 128,844 SNPs) were further thinned by excluding SNPs in strong LD (pairwise genotypic correlation r² > 0.2) using a window of 200 SNPs (sliding the window by 25 SNPs at a time), which yielded 59,318 and 52,808 SNPs, respectively.
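The LD thinning step described above can be sketched as follows. This is a minimal illustration, not the procedure implemented in SNP and Variation Suite: the function names `r_squared` and `ld_prune`, the rule of keeping the earlier SNP of a correlated pair, and the assumption of complete 0/1/2 genotype coding without missing data are all hypothetical choices made for the example.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coded)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic SNP: correlation is undefined, treated here as 0.
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_prune(genotypes, window=200, step=25, r2_max=0.2):
    """Return indices of SNPs kept after greedy sliding-window LD pruning.

    genotypes: list of per-SNP lists ordered along the chromosome, each
    holding one 0/1/2 genotype per individual (assumed: no missing data).
    """
    n_snps = len(genotypes)
    removed = set()
    start = 0
    while start < n_snps:
        in_window = [i for i in range(start, min(start + window, n_snps))
                     if i not in removed]
        for i, j in combinations(in_window, 2):
            if i in removed or j in removed:
                continue
            if r_squared(genotypes[i], genotypes[j]) > r2_max:
                # Assumed tie-breaking rule: keep the earlier SNP of the pair.
                removed.add(j)
        start += step  # slide the window by `step` SNPs
    return [i for i in range(n_snps) if i not in removed]


# Toy data: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated with SNP 0.
toy = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2], [1, 0, 1, 1, 2, 1]]
kept = ld_prune(toy)
```

With the defaults taken from the text (window of 200 SNPs, step of 25, r² threshold of 0.2), the toy call keeps SNPs 0 and 2 and drops the duplicated SNP 1.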

Statistical Analysis
In order to explore the genetic structure of the populations from European Russia, several forms of analyses were performed. We started with principal component analysis (PCA), a widely used method for identifying and visualizing patterns of population structure [26]. It was carried out using the Genotypic Principal Components Analysis module of SNP and Variation Suite v.7.4.0. To obtain non-overestimated eigenvectors [27], we first ran the software using an outlier removal procedure, in which individuals with values that were greater than six standard deviations from the mean along any of the top 10 eigenvectors (principal components) were identified and removed.
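The outlier criterion described above (more than six standard deviations from the mean along any of the top 10 eigenvectors) can be illustrated with a short sketch. It assumes the principal-component scores have already been computed elsewhere, and it performs a single pass; whether the software used by the authors iterates the procedure is not stated here. The function name `flag_outliers` and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def flag_outliers(scores, n_sd=6.0, top=10):
    """Return sorted indices of individuals lying more than n_sd standard
    deviations from the mean on any of the first `top` components.

    scores: list of per-individual lists of principal-component scores.
    """
    n_comp = min(top, len(scores[0]))
    outliers = set()
    for c in range(n_comp):
        column = [row[c] for row in scores]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            continue  # no variation on this component, nothing to flag
        for i, value in enumerate(column):
            if abs(value - mu) > n_sd * sd:
                outliers.add(i)
    return sorted(outliers)


# Toy scores: 99 individuals at the origin and one far away on PC1.
toy_scores = [[0.0, 0.0] for _ in range(99)] + [[100.0, 0.0]]
flagged = flag_outliers(toy_scores)
```

In the toy data only the last individual (index 99) exceeds the six-SD threshold. Flagged individuals would then be removed before the eigenvectors are recomputed.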
Genetic differentiation among the populations was quantified by estimating pairwise Wright's fixation indices (F_ST) using the SMARTPCA program in the EIGENSOFT software package (v.4.2). Allele frequency differences in pairs of populations were evaluated using trend tests. The resulting P values were subjected to Bonferroni correction and the significance threshold was set at P = 0.05 (the Bonferroni-adjusted P was equal to 3 × 10⁻⁷).
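The Bonferroni-adjusted threshold quoted above can be checked with a one-line calculation. The sketch assumes that the number of tests equals the 165,872 autosomal SNPs retained after quality control; the helper name `bonferroni_threshold` is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Assumption: one trend test per retained autosomal SNP.
threshold = bonferroni_threshold(0.05, 165_872)
print(f"{threshold:.2e}")  # prints 3.01e-07
```

The result is consistent with the reported value of approximately 3 × 10⁻⁷.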
Next, the population structure was examined using the ADMIXTURE software package (v.1.22), which, in contrast to PCA, implements a model-based clustering algorithm for estimating individual ancestry proportions [25]. This approach assumes that the genome of each subject originates from K unknown ancestral populations and estimates the proportions of the genome derived from each of these populations. To identify putative ancestral clusters within the samples, we ran the software assuming 2-12 subpopulations on separate runs, using default parameters. Each run was repeated at least three times to assess the stability of the clustering patterns. To validate the results, a cross-validation procedure was used [28].
Finally, to assess the potential effect of population demographics on the population structure, the runs of homozygosity (ROH) and the extent of pairwise linkage disequilibrium (LD) were examined in the populations studied. ROH in the individuals were identified using SNP and Variation Suite v.7.4.0. ROH was defined as a sequence of at least 25 consecutive homozygous SNPs spanning at least 1500 kb, with a maximum gap of 100 kb between adjacent SNPs and a minimum density of 1 SNP per 50 kb [29]. Taking into account the limited number of SNPs tested, we also used another definition of ROH, in which the limitations on the maximum distance between SNPs and the minimum density of SNPs were excluded [30,31]. For comparative purposes, the results obtained were summarized by calculating the mean number of ROH and the mean cumulative length of ROH per individual for each population. The extent of pairwise linkage disequilibrium (LD) was calculated as the genotype correlation (r²) between marker pairs located less than 100 kb apart using the PLINK v. 1.07.29 software [32]. A custom Perl script was applied to categorize the r² values according to intermarker distances (0-5 kb, 5-10 kb, etc.) and a mean r² was calculated for each category.
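The two ROH definitions given above can be expressed as a simple scan along one chromosome. The sketch below is illustrative only and is not the algorithm of SNP and Variation Suite: the function name `find_roh` is hypothetical, heterozygous calls are assumed to end a run with no tolerance, missing genotypes are ignored, and density is interpreted as the number of SNPs divided by the run length.

```python
def find_roh(positions, is_hom, min_snps=25, min_kb=1500.0,
             max_gap_kb=100.0, min_snps_per_kb=1 / 50):
    """Return (start_bp, end_bp, n_snps) tuples for runs of homozygosity.

    positions: SNP coordinates in base pairs, sorted, for one chromosome.
    is_hom: booleans, True where the individual is homozygous.
    Pass max_gap_kb=None and min_snps_per_kb=None for the relaxed
    definition without gap and density limits.
    """
    runs, current = [], []

    def close(run):
        # Keep the run only if it passes the SNP-count, length, and density rules.
        if len(run) < min_snps:
            return
        length_kb = (positions[run[-1]] - positions[run[0]]) / 1000.0
        if length_kb < min_kb:
            return
        if min_snps_per_kb is not None and len(run) / length_kb < min_snps_per_kb:
            return
        runs.append((positions[run[0]], positions[run[-1]], len(run)))

    for i, hom in enumerate(is_hom):
        gap_too_big = bool(
            max_gap_kb is not None and current
            and (positions[i] - positions[current[-1]]) / 1000.0 > max_gap_kb
        )
        if not hom or gap_too_big:
            close(current)
            current = [i] if hom else []  # a homozygous SNP starts a new run
        else:
            current.append(i)
    close(current)
    return runs


# Toy chromosome: 39 homozygous SNPs spaced 40 kb apart (1,520 kb in total).
dense = [i * 40_000 for i in range(39)]
strict_runs = find_roh(dense, [True] * 39)

# 30 homozygous SNPs spaced 60 kb apart fail the density rule but pass
# the relaxed definition.
sparse = [i * 60_000 for i in range(30)]
relaxed_runs = find_roh(sparse, [True] * 30, max_gap_kb=None, min_snps_per_kb=None)
```

Per-population summaries such as the mean number of ROH and the mean cumulative ROH length per individual would then be computed from the lists returned for each individual.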

Results
To probe population structure, we first analyzed our data sets using a model-free ancestry PCA. In Figure 2 we plotted the first two principal components (PC) that had the highest eigenvalues ( Figure S1). The plot demonstrated the presence of significant differences between Russian populations from the central part of the Russian Plain (i.e., populations from the Kursk, Murom, and Tver regions), which formed a single cluster on the PC plot, and the Russian population from the northern Archangelsk region (Mezen Russians). Mezen Russians exhibited closer relationships with the population of Veps. The samples of Izhemski and Priluzski Komi were located distantly, not only from Veps and Russians, but also from each other.
The lack of separation between populations from the Kursk, Murom, and Tver regions in the PC plot was consistent with the results of the assessment of population differentiation via the calculation of pairwise F_ST statistics, in which F_ST values were not greater than 0.001 (Table 1). The pairwise F_ST value between these populations and Mezen Russians was 0.006. The same F_ST value characterized the genetic relationships between Mezen Russians and Veps. This finding correlated with the population substructure observed in a plot of PC3 versus PC4, in which Mezen Russians and Veps were clearly separated from each other along PC4 (Figure S2). The highest pairwise F_ST estimates were obtained from comparisons that included Komi samples.
None of the SNPs analyzed showed significant (P < 3 × 10⁻⁷) differences in allele frequencies between populations from the Kursk, Murom, and Tver regions, but 144 to 172 SNPs in each of these populations differed from those of Russians from the Mezen region. The highest number of SNPs with large differences in allele frequencies was found between Izhemski Komi and populations from the Kursk, Murom, and Tver regions (Table 1).
To understand the place of Russians, Komi, and Veps on the genetic canvas of Europe, we combined their genotypes with the genotypic data of several European populations (Finns, Estonians, Latvians, Poles, Czechs, and Germans, as well as Italians, who are the most distant from our populations [5]). The results of PCA performed on this extended set of samples are shown in Figure 3, and may be described as having an "airplane"-like structure with the two wings represented by the Finnish (upper left) and Komi (lower left) samples. A comparison of the resulting genetic map with the results presented by Nelis et al. [5] shows that the populations from one of the vertices of the latter are now located at the intersection formed by the two genetic "wings". Russians from Murom, Kursk, and Tver were also placed at this intersection. However, Russians from Mezen were located outside this intersection. This population, together with the Finnic-speaking Veps, was located in the space between the Finnish and Komi "wings" on the chart. Taking into account the genetic differences found between Mezen Russians and the other Russian populations studied here, the Russian sample from the HGDP set was also included in the analysis. The HGDP Russians were also from the Archangelsk region (Kargopol district), but their location is geographically closer to the sampled populations from central regions of European Russia (Figure 1). This is reflected in their intermediate position on the PC plot (Figure 3) and lower pairwise F_ST values (0.004 against Mezen and 0.002 against the Russians from Kursk, Murom, and Tver regions) (Table S1, Figure S3).
To further explore the population structure, a model-based structure-like analysis using the ADMIXTURE software was performed [25]. This analysis considers the genome of each individual as having originated from several hypothetical ancestral populations, the number of which (K) could be specified. In addition to populations already used in PCA, a Chinese sample was included to check for the potential presence of East Asian admixture. We ran ADMIXTURE at K = 2 to 12. At K = 2, only the population groups corresponding to Europe and Asia were separated (Figure 4). Subtle variations detected in this analysis were possibly due to the differences in the proportion of East Asian ancestry, which was present in all European populations included in this study, but had a higher average contribution in Komi samples. Subcontinental patterns appeared at K = 3: one ancestry component was the most abundant in Izhemski Komi and Finns from Kuusamo (red) and a different component (blue) was at the maximum in the Italian population (Figure 4). At K = 4, the red component has diverged into two parts and distinguished Finns (purple) from Komi (red). These results match closely with the population structure revealed by the PCA, where they corresponded to the genetic ''wings'' described in Figure 3. Mezen Russians and Veps exhibited the highest proportions of both red and purple ancestry components, differing only in their ratios, which were the opposite of each other (henceforth, we will refer to these crucial components as Komi and Finnic).  Figure S4). The situation in which a new ancestry component introduced for the next K value differentiated only a single population was considered as being less informative for the hierarchical comparisons of populations [33,34]. Therefore, although the lowest cross-validation errors were observed at K = 7 ( Figure S5), our further discussion will focus on the results of clustering at K = 5.
To explore the potential effect of population demographics on the population structures identified, ROH were compared across populations. ROH may indicate prolonged isolation and a reduced population size [29,35]. Here, we analyzed ROH longer than 1,500 kb as being the most informative [29]. Using the thresholds for SNP density along a ROH tract (≥ 1 SNP per 50 kb, with a gap size ≤ 100 kb), the total number of ROH in 16 populations (the Chinese sample was not included) was 1,298, with a mean population number of ROH (nROH) of 0.20-2.68 per individual. The population average of the cumulative ROH length (cROH) per individual ranged from 0.43 to 6.31 Mb (Table 2). The use of the alternative definition of ROH, which allows the screening of ROH across various SNPs, resulted in an increase in both the number and length of ROH, which ranged between 6.77

Discussion
In this study, we used genome-wide SNP data to analyze the population genetic structure of Russians, Veps, and Komi. The samples under investigation covered territories in northeastern Europe that had not been included in previous analyses.
The results obtained revealed no substantial genetic stratification within Russians from central-southern regions of European Russia (i.e., samples from the Kursk, Murom, and Tver regions). These three populations were clustered in close proximity to other populations from central-eastern Europe. In contrast, a sample from the northern Archangelsk region of Russia, Mezen Russians, was clearly distant from those of Kursk, Murom, and Tver. These data are in good agreement with earlier data obtained for polymorphisms of the Y-chromosome [14,15,36,37] and several autosomal loci [16,38,39]. It has been proposed that the genetic specificity of northern Russians is because of admixture with Finno-Ugric populations. The results of our ADMIXTURE analysis suggest that, although they descended historically from the Novgorod Russians, Mezen Russians admixed significantly with both Finnic-speaking and Komi populations (Komi belongs to the different linguistic branch of the Finno-Ugric family, the Permian branch). The estimated proportion of Komi ancestry in Mezen Russians was higher than the Finnic proportion. This might be explained by either a more extensive or a later admixture with Komi people. The existing anthropological data favor the latter explanation, proposing a two-staged inclusion of Finno-Ugric elements during the ethnogenesis of Northern Russians, in which Komi elements were included last [40]. Both the Komi and Finnic ancestry components occurred at lower proportions in other Russians, as well as in the populations of Poles, Czechs, Germans, and Italians, which are geographically distant from Finns and Komi. The proportions of Komi and Finnic components were also low in Latvians, but not in Estonians, among whom the proportion of Finnic ancestry was relatively high.
The Veps were another population that exhibited an increased percentage of both Komi and Finnic ancestries. The high level of Finnic ancestry is evidently characteristic of this population, as they belong to the same linguistic community, the Finnic-speaking community, as Finns do. The higher level of Komi ancestry in this population compared with that of Finns and Estonians could result from admixture of Veps (Ves) with Komi (ancient Permians) in the 11th-14th centuries, when Komi lived westward of their current territory and were the neighbors of Veps [41].
As for the Komi themselves, it has been proposed [42] that their ethnogenesis was influenced by Finnic (e.g., Veps or "Chud") and Russian people. The evaluation of the impact of Finnic people in the context of Finnic ancestry revealed that the corresponding component was not represented at a high proportion in the Komi samples studied. The impact of Russians on the ethnogenesis of Komi seems to be indicated by the yellow component. It was abundant in Priluzski Komi (Figure 4), which is in good agreement with the population history of this region, the basin of the Luza River, where Russian people resided as far back as the 13th-14th centuries [41]. In contrast to the Priluzski Komi, the Komi component was overrepresented in the ancestry of the Izhemski Komi, accounting for more than 80% of the total ancestry (86% at K = 5). Historical records show that the first mention of the current center of Izhemski Komi, the Izhma village, occurred at the end of the 16th century and that Izhma was founded mainly by a group of Vimski Komi. Later, some Russian and Nenets families joined them [41,43]. Nenets were not studied here. Although the ADMIXTURE components depend on the samples included, a minimal influence of the genetics of Nenets on the results of clustering can be proposed. Here, we can refer to both the existing data on the absence of (or very limited) genetic relationships between the Nenets and the populations listed (including Komi) [15,44,45], and the results of our analyses, which indicate the genetic isolation of the Izhemski Komi. Evidence of the latter came both from pairwise F_ST values, which were the same between Izhemski Komi and Priluzski Komi, who share the same ethnic territory, as between Izhemski Komi and the geographically distant Finns from Helsinki, and from the higher ROH estimates in the Izhemski Komi.
Both nROH and cROH have been shown to be higher in northern Europeans compared to their southern counterparts, which is consistent with the smaller effective population size and lower population density in northern Europe [35]. In our study, all northern samples (Mezen Russians, Veps, and both Komi samples) were also characterized by higher nROH and cROH compared to Russians from the central part of the Russian Plain and most of the European populations tested. However, the Izhemski Komi had the highest nROH and cROH, comparable to the values calculated in the sample from Kuusamo, the known Finnish isolate [46]. Similar to the Finns from Kuusamo, the Izhemski Komi exhibited elevated LD. Taking into account the history of the Komi people, the recorded genetic distinction of the Izhemski Komi can be due to the increased stability of their community life reinforced by the advanced type of traditional economy, including reindeer breeding [47]. Reindeer breeding was adopted by this group from the Nenets and currently differentiates the Izhemski Komi from the other Komi groups.
In summary, we report the results of the first genome-wide autosomal SNP-based study of the population structure of European Russia, in which samples of Russians, Veps, and Komi were analyzed. The data obtained strongly support the results of earlier genetic studies, based either on Y-chromosome polymorphisms or on a limited number of autosomal markers, and suggest a genetic distinction of the northern Russian populations. Here, we were able to show clearly that this distinction is attributable to admixture with Finno-Ugric populations. The second important finding of our work was the context of that admixture. Our data on Komi population structure led us to consider this group as the second pole of genetic diversity in northern Europe (in addition to the pole occupied by Finns). Although we understand that the picture of the genetic structure of populations from European Russia obtained is still sparse, we propose that populations (ethnic groups) located between those two poles will have different proportions of Komi and Finnic ancestries (e.g., Veps and Mezen Russians).
The samples of Poles and Russians from the HGDP were not included because of their smaller sample size. The Italian sample was also excluded (its merging with other samples resulted in a significant decrease in the number of SNPs).