The authors have read the journal's policy and have the following competing interests: GH and EG are employees of Gene by Gene, Ltd., which offers direct-to-consumer genetic testing. However, this does not alter the authors' adherence to PLOS ONE policies on sharing data and materials.
Conceived and designed the experiments: OB MZ EB. Performed the experiments: AA MC KD RS MK YY MH. Analyzed the data: OB MZ AA MC VZ OU GH EG. Contributed reagents/materials/analysis tools: MZ VZ SK ZS YY PN ZZ EP PAZ LY AD EB. Wrote the paper: OB CTS. Study initiation: OB. Read and approved the final version of the paper: OB MZ AA MC VZ OU GH ZS EG KD RS MK SK YY PN ZZ EP MH PAZ LY AD CTS EB.
Y-chromosomal haplogroup G1 is a minor component of the overall gene pool of South-West and Central Asia but reaches up to 80% frequency in some populations scattered within this area. We have genotyped the G1-defining marker M285 in 27 Eurasian populations (n = 5,346), analyzed 367 M285-positive samples using 17 Y-STRs, and sequenced ~11 Mb of the Y-chromosome in 20 of these samples to an average coverage of 67X. This allowed detailed phylogenetic reconstruction. We identified five branches, all with high geographical specificity: G1-L1323 in Kazakhs, the closely related G1-GG1 in Mongols, G1-GG265 in Armenians and its distant brother clade G1-GG162 in Bashkirs, and G1-GG362 in West Indians. The haplotype diversity, which decreased from West Iran to Central Asia, allows us to hypothesize that this rare haplogroup could have been carried by the expansion of Iranic speakers northwards to the Eurasian steppe and via founder effects became a predominant genetic component of some populations, including the Argyn tribe of the Kazakhs. The remarkable agreement between genetic and genealogical trees of Argyns allowed us to calibrate the molecular clock using a historical date (1405 AD) of the most recent common genealogical ancestor. The mutation rate for Y-chromosomal sequence data obtained was 0.78×10⁻⁹ per bp per year, falling within the range of published rates. The mutation rate for Y-chromosomal STRs was 0.0022 per locus per generation, very close to the so-called genealogical rate. The “clan-based” approach to estimating the mutation rate provides a third, middle way between direct father-to-son comparisons and using archeologically known migrations, whose dates are subject to revision and of uncertain relationship to genetic events.
Despite multiple studies of the phylogeography of individual Y-chromosomal haplogroups, haplogroup G1-M285 has not received attention so far. This is partly explained by its relatively low frequency in its main area of distribution in South-West Asia [
Population | Sample size | G1 samples | G1 frequency | Latitude | N/S | Longitude | E/W | Country | Region | Reference |
Adyghe | 154 | 1 | 0.006 | 44,92 | N | 39,25 | E | Russian Federation | Adygea | [ |
Armenians from Ararat Valley | 110 | 2 | 0.020 | 40,15 | N | 44,18 | E | Armenia | Ararat Valley | [ |
Armenians from Erzurum | 99 | 3 | 0.030 | 39,54 | N | 41,16 | E | Turkey | Erzurum | this study |
Armenians from Gardman | 96 | 1 | 0.010 | 40,41 | N | 46,21 | E | Azerbaijan | Gardman | [ |
Armenians from Iran | 34 | 1 | 0.030 | 35,42 | N | 51,25 | E | Iran | Tehran | [ |
Armenians (diaspora sampled in Krasnodar region) | 155 | 19 | 0.123 | 40,99 | N | 39,71 | E | Turkey | Trabzon | this study |
Armenians Hamshenis | 90 | 38 | 0.422 | 41,01 | N | 39,72 | E | Turkey | Trabzon | this study |
Chechens | 283 | 1 | 0.003 | 43,25 | N | 45,82 | E | Russian Federation | Chechnya | [ |
Azeri | 21 | 1 | 0.050 | 38,68 | N | 47,38 | E | Iran | | [ |
Georgians | 64 | 1 | 0.016 | 42,14 | N | 43,57 | E | Georgia | | this study |
Iranians (Gilan) | 91 | 3 | 0.033 | 36,96 | N | 49,62 | E | Iran | Gilan | [ |
Iranians (Kordestan) | 25 | 1 | 0.040 | 35,09 | N | 47,23 | E | Iran | Kordestan | [ |
Iranians (south-east) | 358 | 18 | 0.051 | 29,72 | N | 56,11 | E | Iran | | [ |
Kabardinians | 371 | 2 | 0.005 | 43,41 | N | 43,32 | E | Russian Federation | Kabardino-Balkaria | this study; [ |
Saudi Arabians | 157 | 1 | 0.006 | 24,70 | N | 46,70 | E | Saudi Arabia | | [ |
Turks (North-Eastern) | 80 | 5 | 0.063 | 40,80 | N | 38,60 | E | Turkey | | [ |
United Arab Emirates | 163 | 4 | 0.025 | 24,28 | N | 54,22 | E | United Arab Emirates | | [ |
Jordanians | 286 | 3 | 0.011 | 30,92 | N | 36,29 | E | Jordan | | this study |
Lebanese | 1425 | 12 | 0.008 | 33,84 | N | 35,81 | E | Lebanon | | this study |
Syrians | 566 | 3 | 0.005 | 35,09 | N | 38,47 | E | Syria | | this study |
Assyrian | 39 | 2 | 0.051 | 37,90 | N | 45,69 | E | Iran | Azarbaijan Gharbi | [ |
Persian | 44 | 1 | 0.023 | 29,37 | N | 52,32 | E | Iran | Fars | [ |
Bandari | 131 | 4 | 0.031 | 27,18 | N | 56,27 | E | Iran | Hormozgan | [ |
Persian | 59 | 1 | 0.017 | 36,29 | N | 59,60 | E | Iran | Khorosan | [ |
Kurd | 59 | 2 | 0.034 | 35,64 | N | 46,87 | E | Iran | Kurdestan | [ |
Lur | 50 | 1 | 0.020 | 33,48 | N | 48,35 | E | Iran | Lurestan | [ |
Mazandarani | 72 | 3 | 0.042 | 36,56 | N | 53,05 | E | Iran | Mazandaran | [ |
Baluch | 24 | 1 | 0.042 | 28,53 | N | 64,25 | E | Iran | Balouchestan | [ |
China (Inner Mongolia and Ningxia) | 151 | 2 | 0.016 | 37,53 | N | 105,91 | E | China | Ningxia; Inner Mongolia | [ |
Kazakhs (Kerbulaksky) | 134 | 2 | 0.015 | 44,33 | N | 78,43 | E | Kazakhstan | Kerbulak, Almaty | this study |
Kazakhs (Katonkaragaysky) | 130 | 2 | 0.015 | 49,17 | N | 85,60 | E | Kazakhstan | Katonkaragay, East Kazakhstan | this study |
Kazakhs (Zharminsky) | 101 | 3 | 0.030 | 49,80 | N | 81,27 | E | Kazakhstan | Zharma, East Kazakhstan | this study |
Kazakhs (Moiynkumsky) | 108 | 6 | 0.056 | 44,42 | N | 71,59 | E | Kazakhstan | Moiynkum, Jambyl | this study |
Kazakhs (Karkaralinsky) | 178 | 94 | 0.528 | 49,40 | N | 75,47 | E | Kazakhstan | Karkaraly, Karagandy | this study |
Kazakhs (Amangeldinsky) | 141 | 36 | 0.255 | 52,35 | N | 65,04 | E | Kazakhstan | Amangeldi, Kostanay | this study |
Kazakhs (Akzharsky) | 90 | 50 | 0.556 | 53,31 | N | 71,36 | E | Kazakhstan | Akzhar, North Kazakhstan | this study |
Kazakhs (Magzhan Zhumabaev) | 87 | 30 | 0.345 | 54,45 | N | 70,26 | E | Kazakhstan | Magzhan Zhumabaev, North Kazakhstan | this study |
Kazakhs (Arysky) | 118 | 8 | 0.068 | 42,43 | N | 68,80 | E | Kazakhstan | Arysky, South Kazakhstan | this study |
Kazakhs Madjar | 45 | 39 | 0.867 | 49,56 | N | 64,00 | E | Kazakhstan | Taush, Torgay area | [ |
Kirghiz (Pamirs) | 106 | 1 | 0.009 | 38,15 | N | 73,95 | E | Tajikistan | Gorno-Badakhshan Autonomous Province | this study |
Mongols Khalkh (Setsen khan) | 68 | 1 | 0.015 | 48,00 | N | 113,00 | E | Mongolia | historical aimak Setsen | this study |
Mongols Dariganga | 73 | 4 | 0.055 | 47,13 | N | 114,47 | E | Mongolia | Dornod and Sükhbaatar Provinces | this study |
Mongols Uuld | 41 | 1 | 0.024 | 48,95 | N | 91,16 | E | Mongolia | Bayan-Ölgii Province | this study |
Mongol-SouthEast | 23 | 1 | 0.040 | 45,87 | N | 113,04 | E | Mongolia | | [ |
Tajiks from Afghanistan | 56 | 1 | 0.020 | 35,94 | N | 69,96 | E | Afghanistan | | [ |
Tajiks Mountain | 85 | 1 | 0.012 | 39,37 | N | 68,52 | E | Tajikistan | Aininsky district | this study |
Tajiks-Badakhshan from Afghanistan | 37 | 1 | 0.030 | 37,11 | N | 70,84 | E | Afghanistan | | [ |
Tajiks-Takhar from Afghanistan | 35 | 1 | 0.030 | 36,70 | N | 69,45 | E | Afghanistan | | [ |
Pashtun-Baghlan | 34 | 1 | 0.030 | 36,29 | N | 68,29 | E | Afghanistan | | [ |
Brahui | 25 | 1 | 0.040 | 29,02 | N | 62,84 | E | Pakistan | | [ |
Gujarat | 185 | 2 | 0.011 | 22,78 | N | 71,90 | E | India | Gujarat | [ |
Lingayat | 101 | 1 | 0.010 | 12,97 | N | 77,56 | E | India | Karnataka | [ |
Pakistan (south) | 91 | 1 | 0.011 | 26,35 | N | 68,00 | E | Pakistan | | [ |
Bashkirs (Ancient tribes) | 87 | 1 | 0.011 | 52,59 | N | 58,06 | E | Russian Federation | Bashkortostan Republic | this study |
Bashkirs (Kipchak tribes) | 125 | 15 | 0.120 | 52,40 | N | 56,33 | E | Russian Federation | Bashkortostan Republic | this study |
Crimean Tatars | 323 | 2 | 0.006 | 45,00 | N | 34,00 | E | Crimea | | this study |
Italians | 193 | 4 | 0.020 | 42,05 | N | 13,42 | E | Italy | different regions | [ |
Russians (Ryazan) | 195 | 2 | 0.010 | 53,93 | N | 40,68 | E | Russian Federation | Ryazan region | this study |
Russians (Vologda) | 121 | 2 | 0.017 | 59,38 | N | 39,15 | E | Russian Federation | Vologda region | [ |
Ukrainians (Rovno) | 100 | 1 | 0.010 | 51,32 | N | 26,58 | E | Ukraine | Rovno region | this study |
A) Area populated by Iranic speakers in the middle of the first millennium BC. States whose languages belonged to the Iranic and Armenian linguistic groups are shown in red (modified from [
These details of haplogroup G1 phylogeography have been hard to resolve, because existing methods allowed only slow progress in discovering phylogenetically informative SNPs. Fortunately, in recent years the possibility of full resequencing of the Y-chromosome [
Within the last decade, there has been significant uncertainty in dating Y-chromosomal haplogroups due to a three-fold difference between so-called “genealogical” and “evolutionary” mutation rates of Y-STRs. The former rates were repeatedly obtained in a set of studies [
Migration of Iranic-speaking populations between the Central Asian steppes and South-West Asian uplands is an important issue in human population history, directly related to the much-debated problem of the homeland and early migrations of Indo-Europeans. Followers of the Kurgan theory propose that the carriers of Iranic languages expanded from the Eurasian steppe southward to present-day Iran, from which region these languages received their name (
This study presents a deep phylogeographic analysis of haplogroup G1 by combining traditional approaches with the new powerful options emerging from complete sequencing of the Y-chromosome. We set out to provide a new independent estimate of the mutation rate using the tight links between haplogroups and clans typical in patrilineal nomadic societies. In addition, we aimed to find which direction of the ancient migration of Iranic speakers better fits the haplogroup G1 phylogenetic pattern.
We genotyped the commonly-used SNP M285 which defines haplogroup G1 (YCC, 2002) in multiple Eurasian populations using the TaqMan technique (Applied Biosystems) and identified 367 M285-derived samples in 27 populations. All these samples were then genotyped at 17 Y-chromosomal STRs using the Y-filer genotyping kit (Applied Biosystems). All sample donors gave their written informed consent (the study was approved by the Ethics Committee of the Research Centre for Medical Genetics, Russian Academy of Medical Sciences). Data available from the literature were also incorporated (
Then we selected 19 samples for high-throughput sequencing of the Y-chromosome. To capture maximum phylogenetic diversity and thus increase the cost-effectiveness of the analyses, we applied three criteria for selecting samples. The geographic criterion led to samples from both steppe and mountain parts of the haplogroup’s area being included, particularly from populations where G1 frequency is high. The phylogenetic criterion led to samples from all clusters revealed on the STR network being included and represented by at least two samples for full sequencing, because STR-clusters might reflect real phylogenetic branches and a single sample would not allow us to distinguish phylogenetically-informative SNPs from private ones. The third criterion could be applied only to those populations where paternal clan structure is present: it led to representatives from different clans being included because members of the same clan have a high probability of sharing almost identical paternal lineages. As an outgroup for the 19 G1 samples, we also sequenced one sample from its brother haplogroup, G2.
Y-chromosomal genotyping was performed using a custom enrichment design created for the commercially available “BigY” product offered by Gene By Gene, Ltd. In total, the design targets around 20 million base pairs with 67,000 capture probes, sequenced on the Illumina HiSeq platform. This design captured 11,383,697 bp within the non-recombining male-specific Y-chromosome, consistent with regions genotyped by previous Y sequencing studies [
In addition to genotyping each sample, we wanted to ensure that the SNP positions examined in this study were adequately covered across all samples. This is a concern because, when no variant is reported, many variant-calling methods in high-throughput sequencing leave it ambiguous whether coverage was insufficient to genotype or the position has a legitimate homozygous reference genotype. To discern such cases, each BigY sample was given a “confidence” region list determined by genotype quality scores for each base. The genotype quality is computed as the probability that the genotype is correct, expressed as a Phred score. This probability is derived from AEngine’s proprietary statistical model considering characteristics of read coverage, individual read mapping qualities, and base sequencing quality scored by the HiSeq. A base position is appended to the confidence regions for that sample if its genotype quality score is above 3.02. Thus, if no variant occurs at a base within the confidence regions for a sample, it can be assumed that the sample has the reference genotype at that position. Variant calls were produced and handled as Variant Call Format (VCF) files, according to the established field standards (
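The confidence-region logic described above can be sketched in a few lines. This is only an illustration, not AEngine's proprietary model: it assumes the standard Phred convention (Q = −10·log10 of the error probability), under which the 3.02 threshold corresponds to a call that is roughly 50% likely to be correct. The function names and the toy per-base quality values are hypothetical.

```python
def phred_to_p_correct(q):
    """Probability that a call is correct, assuming the standard Phred scale."""
    return 1.0 - 10.0 ** (-q / 10.0)


def confidence_regions(gq_by_pos, threshold=3.02):
    """Merge consecutive positions with quality above the threshold into closed intervals."""
    regions = []
    for pos in sorted(gq_by_pos):
        if gq_by_pos[pos] <= threshold:
            continue  # too uncertain: leave this base outside the confidence regions
        if regions and pos == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], pos)  # extend the current interval
        else:
            regions.append((pos, pos))  # start a new interval
    return regions


def classify_position(pos, variant_positions, regions):
    """Distinguish a variant, an inferred reference genotype, and a no-call."""
    if pos in variant_positions:
        return "variant"
    if any(start <= pos <= end for start, end in regions):
        return "reference"
    return "no-call"


# Toy data (hypothetical): genotype quality per base position for one sample.
toy_gq = {100: 40.0, 101: 35.0, 102: 2.0, 103: 50.0}
toy_regions = confidence_regions(toy_gq)
toy_variants = {101}
```

With this toy input, position 102 falls outside every confidence region, so the absence of a variant there is treated as a no-call rather than as a reference genotype.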
To estimate the potential sequencing error rate, we applied a phylogenetic approach. We checked whether we found all SNPs in the BigY captured region which are known to be phylogenetically located between haplogroups A0 and G (
The frequency distribution map of haplogroup G1-M285 was created using data reported here for the first time (27 populations,
The black points represent the populations analyzed. Abbreviations in the statistical legend indicate the following: K, number of the populations studied; MIN and MAX, the minimal and maximum frequencies on the map.
An analysis of molecular variance (AMOVA) was performed using Arlequin [
Reduced median networks [
Haplotype diversity was calculated according to [
The BigY output VCF files (
The parsimony trees were constructed from this dataset using TNT [
The same dataset was also subjected to analysis with BEAST software [
In an additional analysis we included two G1 samples from the 1000 Genomes Project (NA20858 and NA20870, Gujarati Indians sampled in Houston, Texas (GIH), 2-4X average coverage). Data were handled in the same way, although the lower coverage of the 1000 Genomes samples halved the number of SNP calls and the filtered dataset consisted of 22 samples and 393 SNPs (
We genotyped the haplogroup G1-specific marker M285 [
The frequency distribution of haplogroup G1 in Eurasia is presented in
It is notable that the area of haplogroup G1, including the Eurasian steppes from the North Black Sea region to the Mongolian Altai and South-Western Asian uplands (Iran and historical Great Armenia), corresponds well with the area populated by Iranic speakers in the second and first millennia BC (
On the network (
Arrows mark samples chosen for Y-chromosomal sequencing.
The haplotype diversity of haplogroup G1 varies drastically from 92% in Iran to zero in Mongolia (
Population | N | NHT | FMAX | HD | Reference |
Iranians and Azeris (Iran) | 16 | 15 | 0.125 | 0.9297 | this study |
Armenians (Turkey) | 60 | 31 | 0.250 | 0.9056 | this study |
Lebanese and Jordanians | 8 | 7 | 0.250 | 0.8438 | this study |
Kazakhs (North Kazakhstan) | 116 | 35 | 0.448 | 0.7794 | this study |
Tajiks (Afghanistan, Tajikistan) | 6 | 5 | 0.333 | 0.7778 | this study |
Armenians (Armenia) | 7 | 5 | 0.286 | 0.7755 | this study |
Kazakhs (Central Kazakhstan) | 100 | 26 | 0.490 | 0.7394 | this study |
Kazakhs (South Kazakhstan) | 14 | 8 | 0.500 | 0.7143 | this study |
Bashkirs (Russia) | 15 | 6 | 0.467 | 0.6933 | this study |
Kazakhs (East Kazakhstan) | 9 | 4 | 0.444 | 0.6667 | this study |
Kazakhs (Altaian) | 6 | 2 | 0.833 | 0.2778 | this study |
Mongols (Mongolia) | 7 | 1 | 1.000 | 0.0000 | this study |
N—number of G1 samples genotyped by 17 Y-STRs;
NHT—number of different Y-chromosomal STR haplotypes;
FMAX—frequency of the most frequent haplotype;
HD—haplotype diversity; the populations were sorted according to the level of HD.
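The tabulated HD values can be checked with a short calculation. The sketch below assumes HD is computed as one minus the sum of squared haplotype frequencies, without the n/(n−1) sample-size correction. That assumption is ours, since the cited method is not shown here, but it reproduces the table for the two populations whose haplotype counts follow unambiguously from N, NHT and FMAX. The function name and the haplotype labels are hypothetical.

```python
from collections import Counter


def haplotype_diversity(haplotypes, unbiased=False):
    """Return 1 - sum(p_i^2) over haplotype frequencies p_i.

    With unbiased=True the result is multiplied by n / (n - 1).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    counts = Counter(haplotypes)
    hd = 1.0 - sum((c / n) ** 2 for c in counts.values())
    if unbiased:
        if n < 2:
            raise ValueError("the unbiased estimator needs at least two samples")
        hd *= n / (n - 1)
    return hd


# Kazakhs (Altaian): N = 6, NHT = 2, FMAX = 0.833, i.e. haplotype counts 5 and 1.
altaian = ["ht1"] * 5 + ["ht2"]
# Mongols (Mongolia): N = 7, NHT = 1, i.e. a single shared haplotype.
mongols = ["ht1"] * 7

hd_altaian = round(haplotype_diversity(altaian), 4)
hd_mongols = round(haplotype_diversity(mongols), 4)
```

Under this assumption the Altaian Kazakh value is 0.2778 and the Mongol value is 0.0000, matching the table. The corrected estimator would give 0.3333 for the Altaian Kazakhs.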
The black points represent the populations for which diversity values were calculated. Abbreviations in the statistical legend indicate the following: MIN and MAX, the minimal and maximum values on the map.
We sequenced ~11 Mb of the Y-chromosome in 19 samples selected using three criteria to cover the maximum diversity within haplogroup G1. The average coverage was 67x, ranging from 48x to 88x. Among the 766 SNPs in the filtered dataset (see
The phylogenetic trees created by parsimony (
This tree corresponds in general with the pattern revealed by the STR-based network (
The presence of additional clusters was confirmed when we included two GIH (Gujarati Indians from Houston) samples from the 1000 Genomes Project, which are the only publicly available data on haplogroup G1. Including the low-coverage sequences halved the number of SNPs called in all samples (
The tree combines the high-coverage dataset reported in this study with data from 1000 Genomes Project. Dotted lines indicate the approximate phylogenetic position of two previously reported G1 branches which were absent among our samples.
The Kazakh cluster fits the previously described G-L1323 branch (
The Argyn tribe in which haplogroup G1 predominates is believed to descend from a single male common ancestor (Argyn) and is divided into 12 clans (
A) Genetic tree reconstructed from Y-chromosome sequences of the Kazakh samples. B) Genealogical tree of the Argyn tribe of the Kazakhs. Each sequenced Kazakh sample is attributed to the clan it originates from. The genealogical ancestor with the known historical date is marked in grey.
The genetic tree based on high-throughput sequencing of the Kazakh G1 chromosomes (
While this paper was under review, we obtained experimental data from three additional samples, representing clans claiming their origin from Karakhoja’s brother Somdyk (Kazakh7, Kazakh8, and Kazakh9,
We also applied the same approach to Y-STR data. Including data on 15-Y-STR haplotypes in Argyns (
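The arithmetic behind a calibration against a dated genealogical ancestor can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes the rate is the mean number of mutations accumulated per lineage since the ancestor, divided by the elapsed time and the number of sites or loci. Only the 1405 AD date comes from the text. The sampling year, the generation length, the callable region size and the mutation counts are placeholder values, not the study's data.

```python
ANCESTOR_YEAR = 1405      # historical date given in the text
SAMPLING_YEAR = 2010      # assumption: approximate year of sampling
GENERATION_YEARS = 30.0   # assumption: years per male generation


def snp_rate_per_bp_per_year(mean_snps_per_lineage, callable_bp, years):
    """SNP mutation rate from the mean number of derived SNPs per lineage."""
    return mean_snps_per_lineage / (callable_bp * years)


def str_rate_per_locus_per_generation(mean_mutations_per_haplotype, n_loci,
                                      years, generation_years):
    """STR mutation rate from the mean number of STR mutations per haplotype."""
    generations = years / generation_years
    return mean_mutations_per_haplotype / (n_loci * generations)


years_elapsed = SAMPLING_YEAR - ANCESTOR_YEAR  # 605 years under the assumptions above

# Placeholder inputs (hypothetical, for illustration only).
snp_rate = snp_rate_per_bp_per_year(
    mean_snps_per_lineage=5.0, callable_bp=10_000_000, years=years_elapsed)
str_rate = str_rate_per_locus_per_generation(
    mean_mutations_per_haplotype=0.6, n_loci=15,
    years=years_elapsed, generation_years=GENERATION_YEARS)
```

Both results scale inversely with the assumed elapsed time, so the accuracy of the historical date and, for STRs, the assumed generation length directly limit the accuracy of the calibrated rate.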
The pattern of geographic distribution of haplogroup G1-M285 is to some degree exceptional, as it cannot be called either a West-Eurasian or an East-Eurasian lineage (
The question arises of whether the homeland of G1 was in the steppe or the mountains. Much higher STR variation in the western part of the Iranian-Armenian plateau makes the mountain homeland a more probable candidate. This conclusion fits the Anatolian theory of Indo-European origins, and the pattern of STR diversity (
The expansion in Kazakhs is genetically dated to an interval of 470–750 YBP, using the range of published mutation rate point estimates [
The expansion in the Hamsheni Armenians is genetically dated to 1150 YBP using our rate (
The expansion in the Kangly tribe of Bashkirs is genetically dated to the 15th century AD (
We note that despite geographic proximity, the ancestor of the G1 cluster in Bashkirs had no close genetic relationship to the corresponding ancestor in Kazakhs. These branches (and the third branch detected in Mongolians) have survived in the Eurasian steppe perhaps since the Scythian epoch.
The remarkable coincidence between the genealogical tree of the Argyn Kazakh clan (
(ZIP)
This scale is typically used in the GeneGeo software for frequency distribution maps of all haplogroups, thus allowing easy comparisons of different maps. The black points represent the populations analyzed. Abbreviations in the statistical legend indicate the following: K, number of the populations studied; MIN and MAX, the minimal and maximum frequencies on the map.
(TIFF)
The tree is based on the high quality filtered dataset from this study consisting of 20 samples and 636 SNPs. The Build 37 coordinates of the SNPs are shown along branches. ISOGG marker names are shown in red. Further details of these mutations are reported in
(PDF)
The tree is based on the high quality filtered dataset from this study consisting of 20 samples and 636 SNPs. The tree was created in the BEAST software. The mean age estimates are shown for all branches.
(TIFF)
Data on haplogroup G1 Y-STRs in the Argyn tribe of the Kazakhs came from both this study and [
(TIFF)
We thank David Mittelman and Carter Cole for help in analyzing BigY data, YFull team (