The Genetic Legacy of the Expansion of Turkic-Speaking Nomads across Eurasia

The Turkic peoples represent a diverse collection of ethnic groups defined by the Turkic languages. These groups have dispersed across a vast area, including Siberia, Northwest China, Central Asia, East Europe, the Caucasus, Anatolia, the Middle East, and Afghanistan. The origin and early dispersal history of the Turkic peoples is disputed, with candidates for their ancient homeland ranging from the Transcaspian steppe to Manchuria in Northeast Asia. Previous genetic studies have not identified a clear-cut unifying genetic signal for the Turkic peoples, which lends support for language replacement rather than demic diffusion as the model for the Turkic language’s expansion. We addressed the genetic origin of 373 individuals from 22 Turkic-speaking populations, representing their current geographic range, by analyzing genome-wide high-density genotype data. In agreement with the elite dominance model of language expansion most of the Turkic peoples studied genetically resemble their geographic neighbors. However, western Turkic peoples sampled across West Eurasia shared an excess of long chromosomal tracts that are identical by descent (IBD) with populations from present-day South Siberia and Mongolia (SSM), an area where historians center a series of early Turkic and non-Turkic steppe polities. While SSM matching IBD tracts (> 1cM) are also observed in non-Turkic populations, Turkic peoples demonstrate a higher percentage of such tracts (p-values ≤ 0.01) compared to their non-Turkic neighbors. Finally, we used the ALDER method and inferred admixture dates (~9th–17th centuries) that overlap with the Turkic migrations of the 5th–16th centuries. Thus, our results indicate historical admixture among Turkic peoples, and the recent shared ancestry with modern populations in SSM supports one of the hypothesized homelands for their nomadic Turkic and related Mongolic ancestors.


Introduction
Linguistic relatedness is frequently used to inform genetic studies [1] and here we take this path to reconstruct aspects of a major and relatively recent demographic event, the expansion of nomadic Turkic-speaking peoples, who reshaped much of the West Eurasian ethno-linguistic landscape in the last two millennia. Modern Turkic-speaking populations are a largely settled people; they number over 170 million across Eurasia and, following a period of migrations spanning the~5th-16th centuries, have a wide geographic dispersal, encompassing Eastern Europe, Middle East, Northern Caucasus, Central Asia, Southern Siberia, Northern China, and Northeastern Siberia [2][3][4].
The extant variety of Turkic languages spoken over this vast geographic span reflects only the recent (2100-2300 years) history of divergence, which includes a major split into Oghur (or Bolgar) and Common Turkic [5,6]. This period was preceded by early Ancient Turkic, for which there is no historical data, and a long-lasting proto-Turkic stage, provided there was a Turkic-Mongolian linguistic unity (protolanguage) around 4500-4000 BCE [7,8].
The earliest Turkic ruled polities (between the 6th and 9th centuries) were centered in what is now Mongolia, northern China, and southern Siberia. Accordingly, this region has been put forward as the point of origin for the dispersal of Turkic-speaking pastoral nomads [3,4]. We designate it here as an "Inner Asian Homeland" (IAH) and note at least two issues with this working hypothesis. First, the same approximate area was earlier dominated by the Xiongnu Empire (Hsiung-nu) (200 BCE-100 CE) and later by the short-lived Xianbei (Hsien-pi) Confederation (100-200 CE) and Rouran State (aka Juan-juan or Asian Avar) (400-500 CE). These steppe polities were likely established by non-Turkic-speaking peoples and presumably united ethnically diverse tribes. It is only in the second half of the 6th century that Turkic-speaking peoples gained control of the region and formed the rapidly expanding Göktürk Khaganate, succeeded soon by numerous khanates and khaganates extending from northeastern China to the Pontic-Caspian steppes in Europe [2][3][4]. Secondly, Göktürks represent the earliest known ethnic unit whereby Turkic peoples appear under the name Turk. Yet, Turkic-speaking peoples appear in written historical sources before that time, namely when Oghuric Turkic-speaking tribes appear in the Northern Pontic steppes in the 5th century, much earlier than the rise of Göktürk Khaganate in the IAH [9]. Thus, the early stages of Turkic dispersal remain poorly understood and our knowledge about their ancient habitat remains a working hypothesis.
Previous studies based on Y chromosome, mitochondrial DNA (mtDNA), and autosomal markers show that while the Turkic peoples from West Asia (Anatolian Turks and Azeris) and Eastern Europe (Gagauzes, Tatars, Chuvashes, and Bashkirs) are generally genetically similar to their geographic neighbors, they do display a minor share of both mtDNA and Y haplogroups otherwise characteristic of East Asia [10][11][12][13][14][15]. Expectedly, the Central Asian Turkic speakers (Kyrgyz, Kazakhs, Uzbeks, and Turkmens), share more of their uniparental gene pool (9-76% of Y chromosome and over 30% of mtDNA lineages) with East Asian and Siberian populations [16,17]. In this regard, they differ from their southern non-Turkic neighbors, including Tajiks, Iranians, and different ethnic groups in Pakistan, except Hazara. However, these studies do not aim to identify the precise geographic source and the time of arrival or admixture of the East Eurasian genes among the contemporary Turkic-speaking peoples. The "eastern" mtDNA and even more so Y-chromosome lineages (given the resolution available to the studies at the time) lack the geographic specificity to explicitly distinguish between regions within Northeast Asia and Siberia, and/or Turkic and non-Turkic speakers of the region [18,19].
Several studies using genome-wide SNP panel data describe the genetic structure of populations in Eurasia and although some include different Turkic populations [15,[20][21][22][23][24], they do not focus on elucidating the demographic past of the Turkic-speaking continuum. In cases where more than one geographic neighbor is available for comparison, Turkic-speaking peoples are genetically close to their non-Turkic geographic neighbors in Anatolia [22,25], the Caucasus [15], and Siberia [21,23]. A recent survey of worldwide populations revealed a recent (13th-14th century) admixture signal among the three Turkic populations (Turks, Uzbeks, and Uygurs) and one non-Turkic population (Lezgins) with Mongolas (from northern China), the Daurs (speaking Mongolic language), and Hazaras (of Mongol origin) [26]. This study also showed evidence for admixture (dating to the pre-Mongol period of 440-1080 CE) among non-Turkic (except Chuvashes) East European and Balkan populations with the source group related to modern Oroqens, Mongolas, and Yakuts. This is the first genetic evidence of historical gene flow from a North Chinese and Siberian source into some north and central Eurasian populations, but it is not clear whether this admixture signal applies to other Turkic populations across West Eurasia.
Here we ask whether it is possible to identify explicit genetic signal(s) shared by all Turkic peoples that have likely descended from putative prehistoric nomadic Turks. Specifically, we test whether different Turkic peoples share genetic heritage that can be traced back to the hypothesized IAH. More specifically, we ask whether this shared ancestry occurred within an historical time frame, testified by an excess of long chromosomal tracts identical by descent between Turkic-speaking peoples across West Eurasia and those inhabiting the IAH. To address these questions we used a genome-wide high-density genotyping array to generate data on Turkic-speaking peoples representing all major branches of the language family ( Fig 1B). shown with light blue, light green, dark green, light brown, and yellow circles, depending on the region. Turkic-speaking populations are shown with red circles regardless of the region of sampling. Full population names are given in S1 Table Panel To characterize the population structure of Turkic-speaking populations in the context of their  geographic neighbors across Eurasia, we genotyped 322 new samples from 38 Eurasian populations and combined it with previously published data (see S1 Table and Material and Methods for details) to yield a total dataset of 1,444 samples genotyped at 515,841 markers. The novel samples introduced in this study geographically cover previously underrepresented regions like Eastern Europe (Volga-Ural region), Central Asia, Siberia, and the Middle East. We used a STRUCTURE-like [27] approach implemented in the program ADMIXTURE [28] to explore the genetic structure in the Eurasian populations by inferring the most likely number of genetic clusters and mixing proportions consistent with the observed genotype data (from K = 3 through K = 14 groups) (S1 Fig). As shown in previous studies [15,20,29] East Asian populations commonly contained alleles that find membership in two general clusters, shown here as k6 and k8, in a model assuming K = 8 "ancestral" populations ( Fig 2). Geographically, the spread zones of these two components (clusters) were centered on Siberia and East Asia, respectively. Their combined prevalence declined as one moves west from East Asia (correlation with longitude, p = 8.8×10 −16 , R = 0.77, 95% CI: 0.66-0.85). Overall, alleles from the Turkic populations sampled across West Eurasia showed membership in the same set of West Eurasian genetic clusters, k1-k4, as did their geographic neighbors. In addition, the Volga-Uralic Turkic peoples (Chuvashes, Tatars, and Bashkirs) also displayed membership in the k5 cluster, which contained the Siberian Uralic-speaking populations (Nganasans and Nenets) and extended to some of the European Uralic speakers (Maris, Udmurts, and Komis). However, in most cases the Turkic peoples showed a higher combined presence of the "eastern components" k6 and k8 than did their geographic neighbors.

Three-population test
The "eastern components" k6 and k8 inferred among Turkic-and non-Turkic peoples across West Eurasia, as well as the "western components" k1, k2, and k3 present among Siberian populations can originate through gene flow episodes in opposite directions in the past and this population mixture history can be statistically tested using f 3 -statistics [30,31]. In order to evaluate the admixture scenarios suggested by the ADMIXTURE analysis, we tested all possible three population combinations in our dataset using the three-population test (f 3 -statistics) [30,31]. We reported only population trios f 3 (target, source1, source2) with the most negative f 3statistics (S2 Table) and considered populations to be significantly admixed when their Z-score was smaller than 1.64 (i.e. p-value was less than 0.05, for a one-tailed test). Our three-population tests showed that almost all the West Eurasian Turkic peoples (15 out of 16) and their non-Turkic neighbors (49 out of 61) (see S2 Table for geographic subdivision) were admixed with East Asian-and Siberian-related populations. Similarly, all the Siberian Turkic populations, as well as some (11 out of 27) East Eurasian non-Turkic populations showed an admixture signal with West Eurasian-related populations. In interpreting f 3 -statistics results, it is important to point out that the reported source populations do not necessarily represent the true admixing populations [31]. Although the exact source populations were uncertain, significantly negative f 3 -statistics provided strong evidence for admixture in most of the Turkic and non-Turkic populations in our dataset. In order to test whether these admixture signals resulted from recent gene flow events, we next explored the distribution of long chromosomal tracts shared between populations in our dataset.

Geographic distribution of recent shared ancestry
A recent study shows that even a pair of unrelated individuals from the opposite ends of Europe share hundreds of chromosomal tracts of IBD from common ancestors that lived over the past 3,000 years. The amount of such recent ancestry declines exponentially with geographic distance between population pairs, and such a distance-dependent pattern can be distorted due to population expansion or gene flow [32]. We observed a reasonably high correlation (Pearson's correlation coefficient = 0.77, 95% CI: 0.76-0.79, p < 2.2×10 -16 ) between the rate of IBD sharing decay and geographic distance in our set of Eurasian populations. This distance-dependent pattern is likely shaped by both isolation-by-distance and gene flow: many of the populations are admixed (the negative f3-statistics in S2 Table) and there is a longitude dependent decrease in the prevalence of "eastern components" k6 and k8. Some populations might stand out in this distance dependent pattern due to isolation, greater gene flow, or genetic drift. For example, when we removed the West Eurasian Turkic populations (sampled in the Middle East, Caucasus, Eastern Europe, and Central Asia) from our dataset, we observed better correlation between IBD sharing decay and geographic distances between populations (Pearson's correlation coefficient = 0.83, 95% CI: 0.82-0.85, p < 2.2×10 −16 ). To identify populations for which IBD sharing with Turkic populations departs from a distance-dependent decay pattern, we first computed IBD sharing (the average length of genome IBD measured in centiMorgans) for each of the 12 western Turkic populations with all other populations in the dataset (S3 Table) and then subtracted the same statistic computed for their geographic neighbors (see the Materials and Methods section for details and S2 Fig for a schematic representation of this analysis). When the differences were overlaid for all 12 Turkic populations, we detected an unusually high signal of accumulated IBD sharing (samples indicated by a "plus symbol" on Northeast Siberia, except the two samples in Eastern Europe (Maris) and the North Caucasus (Kalmyks). In principle, when we compare the IBD sharing pattern in this way between neighboring Turkic and non-Turkic populations, we might observe a high IBD sharing signal with some Siberian populations due to drift in one of the populations compared, but chances that such random signals would correlate between multiple Turkic populations and accumulate in a single region is negligible. Indeed the null hypothesis for this analysis assumed no systematic difference between any of the Turkic populations and their respective geographic neighbors. Therefore, the null hypothesis predicted that random differences accumulated across the entire geographic range of the western Turkic populations. To demonstrate this null expectation, we replaced each of the western Turkic populations by populations randomly drawn from the sets of respective non-Turkic neighbors, and repeated this subtraction/accumulation analysis, as shown in S2   and Nenets) that, when examined closely, suggest an interesting finding consistent with our ADMIXTURE results. These two Siberian populations, Nganasans and Nenets (S3A, S3B, S3E, S3I and S3J Fig), speak Uralic languages and demonstrated a high accumulated signal only when our tested sets contained the western Uralic speakers (Maris, Komis, Vepsas, and Udmurts). This was in line with our ADMIXTURE results (Fig 2), as the k5 ancestry component was shared specifically between these western Uralic speakers and the two Siberian Uralic-speaking Nganasans and Nenets. We now return to the overall difference between the accumulated IBD sharing signal under the null hypothesis (see S3 Fig) and that observed for the set of western Turkic populations (Fig 3). Some of the populations in SSM and Northeast Siberia demonstrated a strong IBD sharing signal with the western Turkic populations and this pattern most likely indicates recent gene flow from Siberia. To narrow down the source of this gene flow it is important to know which of the Siberian populations are indigenous to their current locations. We show in the Discussion section that only Tuvans, Buryats, and Mongols from the SSM area are indigenous to their current locations (at least within the known historical time) and therefore this area is the best candidate for the source of recent gene flow into the western Turkic populations. It should be noted that this east-west directionality is implied by the fact that 12 populations sampled across different West Eurasian locations are unlikely to show a correlated signal of high IBD sharing with a single region unless they received gene flow from it. Indeed, when we repeated our analysis by randomly choosing non-Turkic populations (S2 Fig), we could not reproduce a similar correlated signal.
Our previous analysis suggests that the western Turkic populations (Fig 3 and S3 Table) experienced stronger gene flow from the SSM area than their non-Turkic neighbors, but it is not clear whether this signal is statistically significant. To test this, we computed IBD sharing between the group of SSM populations (Tuvans, Mongols, and Buryats, as well as a known migrant population, Evenkis) and each of the western Turkic populations. Then, for each of the western Turkic populations, we pooled their non-Turkic neighbors, and generated 10,000 permuted samples to see whether a comparable amount of IBD sharing (observed in tested Turkic populations) with the four Siberian populations is obtained by chance. IBD sharing was estimated separately for different classes of chromosomal tracts (1-2 cM, 2-3 cM, 3-4 cM, etc.), and permutation tests were performed. In most of the cases, higher IBD sharing between the western Turkic populations (compared to non-Turkic neighbors) and the Siberian populations was statistically significant ( We conclude that the recent gene flow from the SSM area inferred in our previous analysis was not restricted to the western Turkic peoples, and the higher IBD sharing is evidence that Turkic populations are distinct from their non-Turkic neighbors. A spatial pattern in IBD sharing was noted when IBD tracts of different length classes were considered separately. For segment classes of 1-2 cM and 2-3 cM, higher IBD sharing is statistically significant for most Turkic speakers, except Gagauzes and Chuvashes (and Tatars in the case of 2-3 cM). For longer IBD tracts of 3-4 cM, statistical evidence for higher IBD sharing becomes weaker in some Middle Eastern and Caucasus (Azeris, Kumyks, and Balkars) samples. By weaker evidence, we mean that a statistically significant excess of IBD sharing was restricted to a subset of the four candidate ancestors tested. In the Volga-Ural region, for the same class of segments (3-4 cM), only Bashkirs continued to show strong evidence for gene flow, while Tatars and Chuvashes do not. For these two Turkic populations, not all tests were statistically significant because the background group, from which permuted samples are drawn, contained the Finnic speaking Mari population, which shows comparable levels of Asian admixture ( Fig  2) and IBD sharing (S4 Fig). When we considered even longer segments (4-5 cM and 5-6 cM), we no longer observed a systematic excess of IBD sharing for Turkic peoples in the Middle East, the Caucasus, or in the Volga-Ural region. In contrast, populations closer to the SSM area (Uzbeks, Kazakhs, Kyrgyz, and Uygurs, and also Bashkirs from the Volga-Ural region) still demonstrated a statistically significant excess of IBD sharing. This spatial pattern can be partly explained by a relative rarity of longer IBD tracts compared to shorter ones and recurrent gene flow events into populations closer to the SSM area.
Dating the age of Asian admixture using the ALDER and SPCO methods According to historical records, the Turkic migrations took place largely during~5th-16th centuries (little is known about earlier periods) and partly overlap with the Mongol expansion. Pairwise IBD sharing based on 1-2 cM long segments. For each population ordered along the xaxis, IBD sharing is computed with three SSM populations (Tuvans, Buryats, Mongols) and Evenkis. Each Turkic-speaking population (shown in red) is grouped with its respective geographic neighbors using parentheses. The grouped geographic neighbors were pooled and used to perform a permutation test as described in the M&M section. Red numbers under the Turkic population name indicate how many SSM populations demonstrate a statistically significant excess of IBD sharing with a given Turkic population. Note that, for example, Bashkirs, Tatars, and Chuvashes share their geographic neighbors. Assuming 30 years per generation, the common Siberian ancestors of various Turkic peoples lived prior to and during this migration period between 20 and 53 generations ago. The expected length of a single-path IBD tract inherited from a common ancestor that lived~20-53 generations ago ranges between 2.5 cM and 0.94 centiMorgans (see Methods for details). Taking into account that multi-path IBD tracts will be on average longer [33], the IBD sharing signal at 1-5 cM detected between the western Turkic peoples and the SSM area populations may be due to historical Turkic and Mongolic expansions from the SSM area. It is possible to approximately outline the age of common ancestors directly from the distribution of shared IBD tracts [32], but such an inference would be too coarse for our purposes. Here we use two different methods implemented in ALDER [34] and SPCO [35] to infer the age of Siberian/Asian admixture among Turkic peoples. The admixture dates for all the analyzed Turkic peoples (Fig 5  and S4 Table) fell within the historical time frame (5th-17th century) that overlaps with the period of nomadic migrations triggered by Turkic (6th-16th centuries CE) and Mongol expansions (13th century) [2,3]. However, individual admixture dates estimated using the two methods overlap only partially and were discordant for most populations (Fig 5). Therefore, we simulated a series of admixture events spanning a target historical period and compared how the two methods performed (see Material and Methods for details). The dates inferred by ALDER tended to be closer to simulated true values, while SPCO consistently estimated slightly older dates (Fig 6). Importantly, the SPCO-inferred dates for our real dataset (Fig 5) also tended to be older, and we therefore suspect bias in our SPCO estimates. From here onward we discuss only ALDER-inferred dates.
The admixture-induced linkage disequilibrium signal in an admixed population is "restored" using two surrogates (reference populations) and when several such pairs are tested, ALDER allows the comparison of their genetic proximity to true mixing populations based on the amplitude of the weighted LD curve [36]. It should be noted, however, that inferences made in this way might be misleading when the reference population genetically related to the true mixing population underwent admixture itself. In this case, another less related population would be inferred as a better surrogate for the true mixing population. We considered all possible combinations of populations in our dataset as references for Turkic speaking populations, and report the set of "best" pairs (S4 Table) demonstrating the highest amplitude of the weighted LD curve. Interpretation of these results, however, is complicated since some of the references are admixed themselves according to our three-population tests (S2 Table). We cannot exclude the possibility that some of the references truly related to the historical ancestors yielded a lower amplitude because they were admixed. For example, the SSM populations that demonstrated the signature of recent gene flow (excess of IBD tracts) into western Turkic populations showed lower amplitude of the weighted admixture LD compared to non-admixed references that we reported in S4 Table. The SSM populations are significantly admixed (See S2 Table), and this likely happened during the Turkic migration period (S4 Table).
Although we report a single admixture date for each population, we note that it is likely that the contemporary Turkic peoples were established through several migration waves [2][3][4]37]. Indeed, Turkic peoples closer to the SSM area (those from the Volga-Ural region and Central Asia) showed younger dates compared to more distant populations like Anatolian Turks, Iranian Azeris, and the North Caucasus Balkars. Only Nogais, the former steppe belt nomadic people, and Kumyks inhabiting northern slopes of the Caucasus stand out from this spatial pattern.

Discussion
Our ADMIXTURE analysis (Fig 2) revealed that Turkic-speaking populations scattered across Eurasia tend to share most of their genetic ancestry with their current geographic non-Turkic neighbors. This is particularly obvious for Turkic peoples in Anatolia, Iran, the Caucasus, and Eastern Europe, but more difficult to determine for northeastern Siberian Turkic speakers, Yakuts and Dolgans, for which non-Turkic reference populations are absent. We also found that a higher proportion of Asian genetic components distinguishes the Turkic speakers all over West Eurasia from their immediate non-Turkic neighbors. These results support the model that expansion of the Turkic language family outside its presumed East Eurasian core area occurred primarily through language replacement, perhaps by the elite dominance scenario, that is, intrusive Turkic nomads imposed their language on indigenous peoples due to advantages in military and/or social organization.
When the Turkic peoples settled across West Eurasia are compared with their non-Turkic neighbors, they demonstrate higher IBD sharing with populations from SSM and Northeast Siberia (Fig 3). There are, however, two non-Siberian populations that also demonstrate high IBD sharing with the tested Turkic peoples, the Kalmyks and Maris. These exceptions need careful consideration in light of historical data and previously published studies. For example, the Mongol-speaking Kalmyks migrated into North Caucasus from Dzhungaria (the northwestern province of China at the Mongolian border) only in the 17th century [37], while Maris stand out from other geographic neighbors due to unusually high recent admixture with Bashkirs: they demonstrate higher IBD sharing with Bashkirs for all IBD tract length classes (from 1-2 cM up to 11-12 cM) compared to other populations in the region (p < 0.05). This might be explained by the fact that we collected Maris samples in the Republic of Bashkortostan, where they seemingly intermarried with Bashkirs to some extent. Finally, some of the Siberian populations are in fact migrants in their current locations. For example, Yakuts, Evenkis, and Dolgans largely stem from the Lake Baikal region, which is essentially the SSM area [23]. It turns out that most of the populations showing a high signal of IBD sharing with the western Turkic populations originated from the SSM area or had admixture with one of the tested Turkic populations. The only exception is the Nganasans; they demonstrate unusually high IBD sharing with both western Turkic peoples (Fig 3) and randomly chosen non-Turkic populations (S3 Fig). Taking into account that SSM area populations (Tuvans, Mongols, and Buryats) can be reliably considered indigenous to their locations, and that other Siberian and non-Siberian populations (demonstrating high IBD sharing with western Turkic peoples in Fig 3) all have SSM origins, we suggest that ancestral populations from this area contributed recent gene flow into western Turkic peoples. We note that SSM matching IBD tracts were also observed with considerable frequency among some non-Turkic peoples, such as Adyghe, Maris, North Ossetians, and Udmurts (Fig 4 and S4 Fig) suggesting that gene flow from the SSM area also contributed to non-Turkic populations. In this regard, alternative explanations not related to Turkic and Mongolic migrations cannot be excluded, but these historical events remain the most likely scenario, since the high proportion of SSM matching tracts is a unifying hallmark of many western Turkic peoples and such a correlated signal of sharing with Siberian populations is not observed for any other group of populations (S3 Fig). Thus, it is likely that migrants of SSM origin interacted with many of the ancestors of contemporary West Eurasian populations, but it was the stronger interaction (reflected in higher IBD sharing) with migrant SSM ancestors that drove Turkicization. We performed a permutation test for each western Turkic population and the observed excess of IBD sharing (compared to non-Turkic neighbors) with the SSM area populations was statistically significant (Fig 4 and S4 Fig).
Another important outcome of our IBD sharing analysis is the finding that two of the three SSM populations that we consider "source populations" or modern proxies for source populations are both Mongolic-speaking. This observation can be explained in several ways. For example, one may surmise that the Mongol conquests, starting in the 13th century, were accompanied by their demographic expansion over the territories already occupied, in part, by Turkic speakers, and this led to admixture between Turkic and Mongolic speakers. Alternatively, it is also probable that the ancestors of Turkic and Mongolic tribes stem from the same or nearly the same area and underwent numerous episodes of admixture before their respective expansions. The latter explanation is indirectly testified by a complex, long-lasting stratigraphy of Mongolian loan words in Turkic languages and vice versa [38]. The first explanation is unlikely from a historical perspective since although Mongolic conquests were launched by Genghis Khan troops in the early 13th century, it is well known that they did not involve massive re-settlements of Mongols over the conquered territories. Instead, the Mongol war machine was progressively augmented by various Turkic tribes as they expanded, and in this way Turkic peoples eventually reinforced their expansion over the Eurasian steppe and beyond [39]. Therefore, we prefer the second explanation, although we cannot entirely exclude the Mongol contribution, especially in light of admixture dates that overlap with the Mongol expansion period.
Finally, our IBD sharing analysis suggested that the SSM area is the source of recent gene flow. This area is one of the hypothesized homelands for Turkic peoples and linguistically related Mongols. While the presence of the Mongol empire over this territory is well recorded, historical sources alone are insufficient to unambiguously associate this area with the Turkic homeland for several reasons: some of the Turkic groups speaking the Oghuric branch of Turkic were attested westerly in the Pontic-Caspian steppes in the mid-late 5th century CE. This is geographically distant from the SSM area, and temporarily much earlier than the Göktürk Empire was established in the SSM area. Thus, our study provides the first genetic evidence supporting one of the previously hypothesized IAHs to be near Mongolia and South Siberia.
The gene flow from the SSM area that we inferred based on our IBD sharing analysis should also be detected using an alternative approach such as ALDER, which is based on the analysis of linkage disequilibrium (LD) patterns due to admixture. Using the ALDER method, we tested all possible combinations of reference populations in our dataset. LD decay patterns observed among western Turkic populations were consistent with admixture between West Eurasian and East Asian/Siberian populations (see detected reference populations in S4 Table).
Admixture dating with the set of East Asian/Siberian populations (S4 Table) inferred admixture events ranging between 816 CE for Chuvashes and 1657 CE for Nogais. We chose these reference populations based on the highest LD curve amplitudes, as suggested by the authors of the method. It is notable that all the SSM populations that were inferred to be the source of SSM gene flow were either filtered out by an ALDER pre-test procedure because of the shared admixture signal with the tested Turkic populations or had a lower amplitude of the weighted LD curve compared to the non-admixed references. Indeed, as we show, SSM populations and the two Northeastern Siberian populations all demonstrated a statistically significant admixture signal between the same set of West Eurasian and East Asian populations as western Turkic peoples do (S2 Table and S4 Table). Therefore, the set of reference populations reported in S4 Table that demonstrate the highest LD curve amplitude, in fact represent the set of non-admixed reference populations (S2 Table) that passed ALDER's filtering procedure. This filter removes any reference population that shows shared admixture signal with the tested population. It was important for our study that the range of ALDER-inferred admixture dates overlaps with the major Turkic migrations and later Mongolic expansion (Fig 5), both of which are known to trigger nomadic migrations to Medieval Central Asia, the Middle East and Europe. In addition, when linguistic classification and regional context is taken into account, we found parallels with large-scale historical events. For example, the present-day Tatars, Bashkirs, Kazakhs, Uzbeks, and Kyrgyz span from the Volga basin to the Tien-Shan Mountains in Central Asia, yet (Fig 5) showed evidence of recent admixture ranging from the 13th to the 14th centuries. These peoples speak Turkic languages of the Kipchak-Karluk branch and their admixture ages postdate the presumed migrations of the ancestral Kipchak Turks from the Irtysh and Ob regions in the 11th century [37]. There are exceptions, like the Balkars, Kumyks, and Nogais in Northern Caucasus, who showed either earlier dates of admixture (8th century) or much later admixture between the 15th century (Kumyks) and 17th century (Nogais).
Chuvashes, the only extant Oghur speakers showed an older admixture date (9th century) than their Kipchak-speaking neighbors in the Volga region. According to historical sources, when the Onogur-Bolgar Empire (northern Black Sea steppes) fell apart in the 7th century, some of its remnants migrated northward along the right bank of the Volga river and established what later came to be known as Volga Bolgars, of which the first written knowledge appears in Muslim sources only around the end of the 9th century [40]. Thus, the admixture signal for Chuvashes is close to the supposed arrival time of Oghur speakers in the Volga region.
Differences in admixture dates for the three Oghuz speaking populations (Azeris, Turks, and Turkmens) were notable and their geographical locations suggest a possible explanation. Anatolian Turks and Azeris, whose Central Asian ancestors crossed the Iranian plateau and became largely inaccessible to subsequent gene flow with other Turkic speakers, both have evidence of earlier admixture events (12th and 9th centuries, respectively) than Turkmens. Turkmens, remaining in Central Asia, showed considerably more recent admixture dating to the 14th century, consistent with other Central Asian Turkic populations and most likely due to admixture with more recent, perhaps recurrent, waves of migrants in the region from SSM.
In summary, our collection of samples, which covered the full extent of the current distribution of Turkic peoples, shows that most Turkic peoples share considerable proportion of their genome with their geographic neighbors, supporting the elite dominance model for Turkic language dispersal. We also showed that almost all the western Turkic peoples retained in their genome shared ancestry that we trace back to the SSM region. In this way, we provide genetic evidence for the Inner Asian Homeland (IAH) of the pioneer carriers of Turkic language, hypothesized earlier by others on the basis of historical data. Furthermore, because Turkic peoples have preserved SSM ancestry tracts in their genomes, we were able to perform admixture dating and the estimated dates are in good agreement with the historical period of Turkic migrations and overlapping Mongols expansion. Finally, much remains to be learned about the demographic consequences of this complex historical event and further studies will allow the disentangling of multiple signals of admixture in the human genome and fine scale mapping of the geographic origins of individual chromosomal tracts.

Ethics statement
All subjects signed personal informed consents and ethical committees of the institutions involved approved the study.

Samples, genotyping, and quality control
In total, 322 individuals from 38 populations were genotyped on different Illumina SNP arrays (all targeting > 500,000 SNPs) according to manufacturers' specifications. Our data was combined with published data from Li et al. [20], Rasmussen et al. [21], Behar et al. [22], Yunusbayev et al. [15], Metspalu et al. [29], Fedorova et al. [23], Raghavan et al. [41], Behar et al. [42], and covered all the Turkic-speaking populations (373 individuals from 22 samples) from key regions across Eurasia and their geographic neighbors (see details about sample source in S1 Table). Individuals with more than 1.5% missing genotypes were removed from the combined dataset. Only markers with a 97% genotyping rate and minor allele frequency (MAF) > 1% were retained. The absence of cryptic relatedness corresponding to first and second degree relatives in our dataset was confirmed using King [43]. The filtering steps resulted in a dataset of 1,444 individuals remaining for downstream analyses. It is important to note that in our dataset there are 312,524 SNPs that are common for Human1M-Duo and 650k, 610k, and 550k Illumina BeadChips. Different analyses have different requirements regarding marker density and we therefore prepared two datasets. For Admixture, three-population test, and ALDER analyses that require minimum background LD, LD pruning on the combined 1M-Duo and 650k, 610k, and 550k dataset was performed. The marker set was thinned by excluding SNPs in strong LD (pairwise genotypic correlation r 2 > 0.4) in a window of 1,000 SNPs, sliding the window by 150 SNPs at a time. This resulted in a dataset of 174,187 SNPs. Another dataset with a dense marker set for IBD sharing and wavelet transform admixture dating analyses was prepared. For this, the 1M-Duo genotyped samples (S1 Table) were excluded to increase the SNP overlap among remaining samples up to 515,841 markers. Genetic distances between SNPs in centiMorgans were incorporated from the genetic map generated by the HapMap project [44].

Admixture analysis
We inferred population structure in our dataset using a model-based clustering method implemented in the ADMIXTURE software. Because the ADMIXTURE algorithm expects SNPs to be unlinked, we used an LD pruned marker set of 174,187 SNPs. We ran ADMIXTURE assuming 3 to 14 (K = 3 to K = 14) genetic clusters or "ancestral populations" (see S1 Fig) in 100 replicates and assessed convergence between individual runs. For low values of K, all runs arrive at the same or very similar log-likelihood scores (LLs), whereas runs using higher K values have more variable LLs. Relying on the low level of variation in LLs (LLs < 1) within a fraction (10%) of runs with the highest LLs, we assume that the global log-likelihood maximum was reached at K = 3 to K = 11 and K = 13 to K = 14 (S6 Fig). ADMIXTURE provides an assessment of the "best" K by computing a cross-validation index (CV), which estimates the predictive accuracy of the model at a given K. In our setting clustering solutions at K = 5, 6, 7, and 8 showed better predictive power than other K values (S7 Fig). In choosing which model(s) (K) to discuss further, we draw from the CV results and restrict ourselves to the Ks that likely converged at the global LL maximum for the particular model. In addition, we acknowledge that clustering solutions at different Ks may reflect the hierarchical nature of human population structure. In this study, we are interested in distinguishing regional groupings, like Northeast Asia and Europe, and possible admixture between such groups. In sum, we found that the clustering solution that met our selection criteria best was K = 8.

Correlation between longitude and proportion of "eastern" genetic components
Pearson's correlation coefficient between longitude and the proportion of genetic clusters k6 and k8 was computed using the cor.test() function in the R package.

Three-population test (f 3 -statistics)
We constructed all possible sets of target and source populations in our dataset f 3 (target, source1, source2) and then excluded combinations that contained matching source and target populations. Altogether, 97548 combinations were prepared and subjected to the three-population test implemented in the qp3Pop software [31]. This test produces a negative f 3 -statistic together with a corresponding negative Z-score when the target population tested is admixed with a source related to the given source populations. We considered the target population to be significantly admixed when its Z-score was less than -1.64 (i.e. p-value < 0.05 for a onetailed test) (S2 Table). Assuming that the Z-score is approximately normally distributed, we performed the Bonferroni correction as follows: α was divided by the number of tests preformed α = 0.05/97548 = 5.125682 Ã 10 −7 and then a quantile function for the corrected α was computed using the qnorm() function in the R package. The Z-scores that were significant after Bonferroni correction (Z-score < -4.89) were denoted with two asterisks in S2 Table. IBD sharing analysis We used the fastIBD algorithm [45] implemented in the BEAGLE 3.3 software to detect extended chromosomal tracts (> 1 cM in length) that are IBD between pairs of individuals. We ran the fastIBD algorithm ten times with different random seeds and called IBD tracts using a modified post-processing tool 'plus-process-fibd.py'. The original post-processing tool developed by the BEAGLE authors was modified by [32]. They added an algorithm that minimizes the number of spurious breaks and gaps introduced into long segments due to low marker density [32].
Patterns of IBD sharing between modern populations can bear information about historical events [32]. According to known history, the Turkic migrations took place roughly 600-1600 years ago. Assuming a human generation time of 30 years, the immigrant chromosomes left during Turkic migrations have passed through 600 / 30 = 20 and 1600 / 30 = 53 generations/ meiosis. The mean length of a single-path IBD tract passed through~20-53 generations, is expected to be 100 cM / (2 Ã 20) = 2.5 cM and 100 cM / (2 Ã 53) = 0.94 cM [33]. We provide these estimates only as reference, and note that the true IBD tract length distribution for a given historical event is influenced by past demography. The fastIBD method that we use to search for IBD tracts has sufficient power (~0.7-0.9) to detect chromosomal tracts of 2-3 cM in length, and importantly, close to zero false discovery rate [45]. The power to detect IBD tracts of 1 cM in length varies from 0.2 to 0.5 depending on the chosen fastIBD score threshold. We used a fastIBD score threshold of 1e-10, which, in the trade-off between power-loss and minimizing false discovery rate favors the latter, keeping it close to zero. These parameter settings fit our purposes since we are interested in estimating the relative amount of IBD sharing between populations rather than the total amount of IBD sharing.

Correlation between geographic distance and IBD sharing
Chromosomal tracts that were IBD between two populations were first sorted into bins (classes) based on their length: 1-2, 2-3, 3-4, 4-5 cM, and then the total length was divided within each bin (class) by sample size to obtain the average IBD sharing for each population pair tested. From here onward, we refer to this statistic as IBD sharing. It was shown previously that IBD sharing between populations decays exponentially with distance between samples [32]. To test whether IBD sharing between populations in our dataset is consistent with this distance dependent pattern, we first converted the IBD sharing statistic between populations into an IBD sharing distance using-ln(IBD sharing statistics). For each pair of populations, we then calculated geodesic distances in kilometers using the ''distonearth" R function [46]. Geographic coordinates for populations in our dataset were calculated using the central point from multiple sampling locations, or when such coordinates were not available, using coordinates for the country center (where sample was collected). Geographic coordinates for HGDP populations were computed as a central point in a range of longitude and latitude values given in [47]. After obtaining matrixes of geodesic and IBD sharing distances between populations, values were standardized in each of these matrixes using the maximum values observed. The Pearson's correlation coefficient between the obtained standardized distance matrixes was computed using the cor.test() function in the R package.

Identifying deviations from the distance-dependent decay pattern in IBD sharing
Even if there is a statistically significant correlation between IBD sharing and geographic distance in the data, spatial patterns of recent ancestry between real populations are unlikely to follow a distance-dependent decay pattern ideally. One way to detect the departure from the expected distance-dependent decay pattern is to compute parameters that describe the relationship between IBD sharing and distance in a population set where you do not expect any deviation. These parameters can then be used to compute the expected range of IBD sharing for a given pair of populations at a given distance and report deviations, if any. In our dataset, it was difficult to define a population subset that was completely devoid of samples with departures from the expected distance-dependent decay pattern. Therefore, a comparative approach was used, in which the IBD sharing pattern in our dataset with Turkic populations and without them was compared. Because departures from the distance-dependent decay pattern may already exist in the dataset without Turkic peoples, this test determines whether Turkic peoples have more extreme departure due to, for example, a long range migration from Asia, as suggested by our ADMIXTURE analysis. Therefore, the test dataset only included western Turkic populations from the Middle East, Eastern Europe, the Caucasus, and Central Asia. The overall goal is to find systematic (correlated) differences between these western Turkic populations and their geographic neighbors. The null hypothesis was that differences between western Turkic populations and their geographic neighbors are random. To perform this analysis, sets of geographic neighbors were defined for each of the 12 western Turkic populations (See S3  Table; the same sets are used for permutation test described below; See Fig 4) and IBD sharing that these populations demonstrate with other samples in our dataset was computed. A total of 63 Eurasian populations were included in our dataset (excluding 12 western Turkic populations). Thus, for each set of geographic neighbors, a vector of 63 ordered IBD sharing values with other populations in the dataset was obtained, including self-comparisons. For each of the western Turkic populations, the same vector of IBD sharing values was computed (the Turkic population versus 63 Eurasian populations). After performing element-wise subtraction of values in the "vector for geographic neighbors" from values in the "vector for Turkic population", differences in IBD sharing were obtained. Provided that western Turkic populations have no systematic difference with their geographic neighbors, differences are expected to occur at random; that is, a vector of 63 random values is expected (positive when greater than average, negative when lower than average, and zero when equal to average). By contrast, if some of the 63 populations in the dataset, for example, Romanians, systematically demonstrate higher IBD sharing with all the tested Turkic populations, positive values are expected for Romanians. Such nonrandom signals can be detected by computing an element-wise sum over all the vectors (for all the 12 Turkic populations) containing differences. If they are random, there will be no accumulation. We call this procedure "subtraction and accumulation" throughout the text. Element-wise addition of vectors would accumulate positive IBD sharing values for Romanians (versus 12 Turkic populations) and when such values are plotted on a geographic map, a very high signal of (due to accumulation) IBD sharing should be observed for Romanians, compared to the other 63 samples. See the schematic representation of this "subtraction and accumulation" procedure in S2 Fig. An accumulated value (IBD sharing signal) for a given population was considered high when it exceeded the 0.90 sample quantile point. Finally, this "subtraction and accumulation" procedure was repeated multiple times by replacing each of the 12 Turkic populations with randomly chosen non-Turkic neighbors from respective sets of geographic neighbors (see S3 Fig for subset of results). Doing so demonstrates the kind of results expected when the "subtraction and accumulation" procedure is done with population sets that do not have systematic differences in IBD sharing.

Permutation test
A permutation test was designed to verify whether the excess of IBD sharing that western Turkic populations demonstrate with SSM populations is statistically significant. IBD sharing (as described previously) between a western Turkic-speaking population and each of the three SSM populations (Tuvans, Buryats, Mongols) that show high accumulated IBD sharing was calculated. A permutation procedure was then used to test whether observed excess of IBD sharing that a given Turkic population demonstrates can be expected by chance among its non-Turkic neighbors. For each Turkic population, their geographic neighbors were pooled and 10,000 random samples of the same size were generated (as for the Turkic population tested). For each random sample, IBD sharing with the three SSM populations and Evenkis was calculated. Obtained IBD sharing statistics from permuted samples were compared to that of Turkic populations from the same region and the number of tests showing equal or higher values was divided by the total number of permutations to obtain a p-value.

Admixture dating using ALDER
We used ALDER [34] to test whether the gene flow from SSM area suggested by our IBD sharing analysis left detectable trace in LD pattern among the Turkic populations, and date this admixture signal. ALDER has a functionality to perform a statistical test for the presence of admixture and dating admixture signal. We tested all possible combinations of populations in our dataset to be reference group for Turkic populations and report a pair that successfully passed all the pre-test steps and has significant p-value for admixture. By choosing the pair of reference populations with the highest LD curve amplitude we report ALDER-inferred admixture date for the admixed population.

Dating admixture using wavelet transform method
For each Turkic-speaking population, two parental populations were selected, so that one represented local ancestry and another represented immigrant SSM ancestry. Based on our IBD sharing analysis, Tuvans were used to represent the SSM parental population for all Turkicspeaking populations. Alternative SSM ancestors were also tested (See S5 Table). The parental population that represented local ancestry was chosen among geographic neighbors. The wavelet transform method as implemented in the StepPCO software was used [35]. This software projects an admixed population, in this case the Turkic-speaking population, along a principal component separating the two chosen ancestors, the Tuvans, representing the SSM ancestor and a geographic neighbor (see S5 Table). At this step, PC coordinates for each projected Turkic population were used to infer the proportion of admixture contributed by each parental population. Given that x1 and x2 are the average PC coordinates for the first and second parental populations and x3 is the average PC coordinate for the admixed population, the proportion of ancestry contributed by the first parental population is α = (x2-x3) / (x2-x1) [48]. Accordingly, the proportion of ancestry contributed by the second parental population is (1 -α). Admixture proportions inferred at this stage were used later to find a matching simulated dataset with the same admixture proportion. The StepPCO software infers which chromosomal tracts have local or immigrant SSM ancestry and uses observed length distributions of the ancestry tracts to compute a genome-wide wavelet transform (WT) coefficient [35]. These WT coefficients for each Turkic-speaking population are compared with those of simulated samples that have matching admixture proportions and for which the admixture history is known. Samples with known admixture history were generated using a forward simulation in the SPCO software. In this forward simulation, a population with effective population size of 1,000 individuals at time T 0 receives migrants from another population and grows for 300 generations until it reaches an effective population size of 10,000 individuals. We modeled different amounts of migrants, replacing 5%, 10%, 15%, 20%, . . ., 75% of the recipient population. Each admixture scenario was repeated 100 times and a random sample of 20 individuals was drawn to compute WT coefficients. The WT coefficients from these 100 independent runs were used to construct a 95% confidence interval for a given admixture scenario. Simulated WT coefficients with 95% confidence interval were plotted against the known number of generations since admixture (S5

Coalescent simulations
To test how different dating methods recover signals of admixture known to have occurred during historical periods, gene flow events occurring 5, 10, 20, 30, 40, 50, and 60 generations ago were simulated in 100 repetitions. 22 chromosomes were simulated with lengths and variation in recombination rates for each based on HapMap Phase 2 build 36 genetic maps [44]. The MaCS coalescent simulator [49] was used to generate a series of historical gene flow events between two parental populations whose demographic parameters imitate that of Asian and European populations. Demographic parameters (split time, growth rate, and bottlenecks) for the two simulated parental populations were taken from the study by Schaffner et al. [50]. In this study, a series of population genetic statistics were used to fit the demographic histories of simulated populations to those observed for African, Asian, and European populations. Here, these best-fitting demographic parameters were used to simulate samples that imitate sequences drawn from Asian and European populations. Each "historical time" gene flow event represents a mixture of the two parental populations (Asian and European) at equal proportions. From each simulated admixed population a sample of 30 sequences were drawn to construct 15 genotypes that were then subjected to the same data preparation and quality control steps as for real data during admixture dating. The only difference was that we applied the LD pruning step before ALDER and SPCO dating (while we performed no LD pruning with the real dataset before SPCO analysis) in order to make the data preparation steps identical. Before doing so ALDER and SPCO performance was tested using 100 simulated datasets with and without LD pruning. Root-mean-square error (RMSE) between the true and estimated values and bias (difference between sums of true and estimated values divided by the number of observations) were computed to compare the accuracy of ALDER and SPCO inferences with and without LD pruning. While LD pruning slightly improved the accuracy of the ALDER estimates for older admixture events, it had a minimal effect on the accuracy of the SPCO estimates. Thus, in case of ALDER dating, LD pruning slightly improved the accuracy of estimates (RMSE = 2.5, bias = -0.19) compared to the analysis without LD pruning (RMSE = 5.5, bias = 0.64), See also S8A Fig. When the same datasets were analyzed with the SPCO method, we observed minimal differences in accuracy (See also S8B Fig) when analyzing the datasets with LD pruning (RMSE = 7.6 and bias = 5.5) and without LD pruning (RMSE = 8.0 and bias = 6.1). This comparative analysis also showed that the ALDER method recovers true dates with less bias than the SPCO method under default settings. We note, however, that the SPCO estimates were generally close to the true dates for older admixture events.

Accession numbers
The genome-wide SNP data generated for this study can be accessed through the data repository of the Estonian Biocentre (www.ebc.ee/free_data), and the raw Illumina genotype data through the data repository of the National Center for Biotechnology Information-Gene Expression Omnibus (NCBI-GEO) under accession number GSE66157.
Supporting Information S1 Fig. Population structure inferred using a structure-like approach assembled in AD-MIXTURE (at K = 2 to K11, K13, and K14). Each individual is represented by a vertical (100%) stacked column indicating the proportions of ancestry in K constructed ancestral populations. The models (K) shown here likely each converged to a global likelihood maximum as > 10% of runs (100 replicates in total for each K) with the highest log-likelihood (LL) converged to essentially to the same solution with a log-likelihood difference of > 1 LL units. We plotted the runs with the highest LL at each K. . For each population ordered along the x-axis, IBD sharing is computed with four populations (Tuvans, Buryats, Mongols (Mongolia), and Evenkis) from the SSM area. Each Turkic-speaking population (shown in red) is grouped with its respective geographic neighbors using parentheses. The grouped geographic neighbors were pooled and used to perform a permutation test as described in the M&M section. The red number under the Turkic population name shows how many SSM populations demonstrate a statistically significant excess of IBD sharing with a given Turkic population. Overlapping parentheses show Turkic-speaking populations with shared geographic neighbors. (PDF)  Table. Three-population test results. For each target population, we computed f3-statistics using all possible pairs of source populations, and report the pair that produces the most negative f3-statistic. One asterisk indicates that the f3-statistic is significant at p = 0.05 (one-tailed test), i.e. Z-score = 1.64. Two asterisks indicate that the negative f3-statistic is significant after Bonferroni correction for multiple hypothesis testing; that is, the Z-score is more than 4.89 standard errors below zero. (XLS) S3 Table. Population samples used in the "subtraction-accumulation" IBD sharing analysis.