A glimpse at the intricate mosaic of ethnicities from Mesopotamia: Paternal lineages of the Northern Iraqi Arabs, Kurds, Syriacs, Turkmens and Yazidis

Widely considered one of the cradles of human civilization, Mesopotamia is largely situated in the Republic of Iraq, which is also the birthplace of the Sumerian, Akkadian, Assyrian and Babylonian civilizations. These lands were subsequently ruled by the Persians, Greeks, Romans, Arabs, Mongolians, Ottomans and finally the British prior to independence. As a direct consequence of this rich history, the contemporary Iraqi population comprises a true mosaic of different ethnicities, which includes Arabs, Kurds, Turkmens, Assyrians and Yazidis among others. As such, the genetics of the contemporary Iraqi populations are of anthropological and forensic interest. In an effort to contribute to a better understanding of the genetic basis of this ethnic diversity, a total of 500 samples were collected from Northern Iraqi volunteers belonging to five major ethnic groups, namely: Arabs (n = 102), Kurds (n = 104), Turkmens (n = 102), Yazidis (n = 106) and Syriacs (n = 86). Y-STR analyses at 17 loci were carried out using the AmpFlSTR Yfiler system, and in silico haplogroup assignments were subsequently made to gain insights from a molecular anthropology perspective. The paternal lineages of these five Northern Iraqi ethnic groups were then systematically compared, not only among themselves but also in the context of the larger genetic landscape of the Near East and beyond, using two different genetic distance metrics and the associated data visualization methods. Taken together, the results suggest the presence of intricate Y-chromosomal lineage patterns among the five ethnic groups analyzed, wherein both interconnectivity and independent microvariation were observed in parallel, albeit in a differential manner. Notably, the novel Y-STR data on Turkmens, Syriacs and Yazidis from Northern Iraq are the first of their kind in the literature. The data presented herein are expected to contribute to further population and forensic investigations in Northern Iraq in particular and the Near East in general.

Introduction
In silico Y-chromosomal haplogroup assignment tools have also become available, which allow haplogroup assignment for a given paternal lineage based on Y-STR data alone and with accuracies over 95% [11].
The aim of the current study was to contribute to a better understanding of the genetic basis of the Northern Iraqi ethnic diversity through a comparative analysis of the paternal lineages belonging to five of the most populous ethnicities from the region. To achieve this, a total of 500 samples were collected from the Arab, Kurdish, Turkmen, Yazidi and Syriac communities, and each was analyzed by 17-loci Y-STR haplotyping followed by in silico haplogroup assignment. Systematic comparisons of the paternal lineages, not only among themselves but also in the context of the larger genetic landscape of the Near East and beyond, revealed the presence of intricate Y-chromosomal lineage patterns among the five ethnic groups analyzed, wherein both interconnectivity and independent microvariation were observed in parallel, albeit in a differential manner.

Materials and methods
A total of 500 buccal swab samples were collected from healthy and unrelated individuals, each of whom was aged 18 and above and belonged to one of the five major ethnic groups in Northern Iraq as follows: Arabs (n = 102), Kurds (n = 104), Syriacs (n = 86), Turkmens (n = 102) and Yazidis (n = 106). Determination of ethnicity was based on that of both parents. While the Arab, Kurdish and Turkmen samples were largely collected from among the students of the Salahaddin University in Erbil, the Syriac and Yazidi samples were mostly collected at various refugee camps in Erbil. Yet, the actual birthplaces of the volunteers encompassed a wider geography from Northern Iraq as depicted in Fig 1. All samples were collected with written informed consent and according to the principles of the Helsinki Declaration of the World Medical Association. Local translators were also available to ensure informed consent. Approvals for the study were provided by the Ethics Committee of the Department of Genetics and Bioengineering, as well as that of the Faculty of Engineering and Information Systems, both at the International Burch University. All sample collections in Northern Iraq were carried out through the College of Education-Scientific Department at the University of Salahaddin, which also approved the project, procured the requisite permissions from the local authorities and actively participated in the realization of the project.
Genomic DNA extractions and 17-loci Y-STR haplotyping (DYS19, DYS385a/b, DYS389I/II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635 and Y-GATA-H4) were carried out with the Life Technologies PureLink™ Genomic DNA Mini Kit and AmpFlSTR® Yfiler™ Kit, respectively. Capillary gel electrophoreses were conducted on a Life Technologies ABI 3130 Genetic Analyzer. Alleles were assigned according to the current International Society for Forensic Genetics (ISFG) guidelines for forensic Y-STR analysis [12]. Samples with Y-STR haplotypes bearing bi-allelic patterns at loci other than DYS385a/b were further typed with autosomal STRs (Life Technologies AmpFlSTR® Identifiler™ Kit) to ascertain their single-source status. All DNA extractions and typing were conducted at the Turkish Cypriot DNA Laboratory as previously described [13,14]. Y-STR haplotyping and autosomal STR genotyping proficiencies were certified through participation in the YHRD Quality Control Exercise (2013) and the ISFG English-Speaking Working Group Relationship Testing Workshop (2015). The following YHRD Accession Numbers were assigned for the five novel Y-STR datasets from the current study: Northern Iraq [
Haplotype and allele frequencies were calculated using the direct counting method. Statistical parameters of forensic interest, such as gene diversity (GD) and haplotype diversity (HD), were both calculated according to Nei's formula [15]. Analysis of molecular variance (AMOVA) and the subsequent visualization by multi-dimensional scaling (MDS) were carried out using the YHRD online tool [16]. The AMOVA/MDS genetic distance measures were based on Slatkin's R_ST values, the significance of which was ascertained with probability (P) values (10,000 permutations), which were revised following a Bonferroni correction to account for potential Type I errors [17].
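The diversity calculation referred to above can be sketched as follows, assuming the commonly used unbiased form of Nei's estimator, n/(n-1) × (1 - Σp_i²), where p_i is the directly counted frequency of the i-th allele (for GD) or haplotype (for HD). The function name and the toy allele calls are hypothetical and are not taken from the study data.

```python
from collections import Counter


def nei_diversity(observations):
    """Unbiased diversity estimate: n/(n-1) * (1 - sum of squared frequencies).

    `observations` holds one hashable entry per sample: an allele call at a
    single locus (gene diversity) or a whole haplotype tuple (haplotype diversity).
    """
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are required")
    counts = Counter(observations)  # direct counting of each distinct type
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Toy allele calls at one locus (illustrative values only).
toy_alleles = [15, 15, 16, 17, 17, 17, 18, 16]
gd = nei_diversity(toy_alleles)

# Toy haplotypes, all distinct, so the estimate reaches its maximum of 1.
toy_haplotypes = [(14, 23), (15, 23), (14, 24), (16, 22)]
hd = nei_diversity(toy_haplotypes)
```

The same function serves both parameters because only the unit being counted changes between GD and HD.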
In addition to the five novel Y-STR datasets from the current study, the following datasets from nearby and distant populations and with at least 17-loci Y-STR coverage were also included during AMOVA/MDS analysis (population sample size, YHRD
A neighbor-joining (N-J) phylogenetic tree based on the Nei's discriminant analysis (D_A) genetic distance metric and the allele frequencies of each dataset was constructed using the POPTREE2 software [18]. Bootstrap values were calculated based on 10,000 replications. Along with the five novel Y-STR datasets from the current study, the following population datasets with equivalent loci coverages were included during analysis: Cyprus [Greek Cypriot] (n = 344) [19]
In silico haplogroup assignments based on the 17-loci Y-STR data were made using the 21-haplogroup batch processing version of the Whit Athey algorithm [29]. Validation of the in silico haplogroup assignments was carried out using a second algorithm called NevGen Y-DNA Haplogroup Predictor (www.nevgen.org). A stand-alone Python program was implemented, which called the NevGen haplogroup prediction AJAX API directly for each haplotype to allow automated processing of all Y-STR haplotypes. Prior to the NevGen analysis, null alleles, intermediate/partial alleles and multi-allelic patterns (except for DYS385) were each assigned a value of '0'.
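The pre-processing rule applied before the NevGen analysis can be illustrated with a minimal sketch. It is not the authors' program: the input representation (allele calls as strings, comma-separated when more than one peak was observed, and an empty string or 'null' for a null allele) and both function names are assumptions made for illustration. The DYS385 exception is read here as applying to multi-allelic patterns only. The NevGen request itself is omitted because its interface is not described in the text.

```python
def normalise_call(locus, call):
    """Return the allele string to submit, or '0' where the rule above applies."""
    parts = [p.strip() for p in call.split(",")]
    # Null allele: nothing was called at this locus.
    if any(p == "" or p.lower() == "null" for p in parts):
        return "0"
    # Intermediate/partial allele such as 17.2.
    if any("." in p for p in parts):
        return "0"
    # Multi-allelic pattern, tolerated only at DYS385a/b.
    if len(parts) > 1 and not locus.upper().startswith("DYS385"):
        return "0"
    return ",".join(parts)


def normalise_haplotype(haplotype):
    """Apply the rule to every locus of a {locus: call} mapping."""
    return {locus: normalise_call(locus, call) for locus, call in haplotype.items()}


# Hypothetical haplotype fragment covering each case.
example = {
    "DYS19": "14",
    "DYS385a/b": "13,16",
    "DYS448": "19,20",
    "DYS458": "17.2",
    "DYS391": "null",
}
cleaned = normalise_haplotype(example)
```

Keeping the rule in one small function makes it easy to apply the same treatment to every haplotype before automated submission.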
Median-joining network (M-JN) analyses were carried out using the Network v.5.0.0.1 software (www.fluxus-engineering.com) as previously described [13]. Briefly, (a) all haplotypes with intermediate/partial alleles and/or multi-allelic patterns were removed prior to analysis, (b) a default epsilon parameter value of zero was used, and (c) maximum parsimony post-processing was applied again with the default parameters. Time to the most recent common ancestor (TMRCA) estimates were done on the resultant M-JN trees by selecting a proposed central ancestral node and then all the other nodes in the remaining network as the descendant nodes. Each TMRCA estimate was done in duplicate based on a generation time of 25 years, and the genealogical and evolutionary Y-STR mutation rates of 0.00267 and 0.00069, respectively, both per locus per generation [30][31][32][33].
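The arithmetic behind such an estimate can be sketched as follows. The text does not spell out the estimator, so this example assumes the rho statistic (the average number of single-repeat mutational steps separating the descendant haplotypes from the chosen ancestral node), converted to years with the per-locus mutation rate, the number of loci and the 25-year generation time. The function name and the toy haplotypes are hypothetical, and no standard error is computed.

```python
def rho_tmrca(ancestor, descendants, mutation_rate, generation_years=25):
    """Estimate TMRCA in years from the average stepwise distance to the ancestor.

    ancestor      -- tuple of repeat counts for the proposed ancestral haplotype
    descendants   -- list of tuples, one per sampled descendant haplotype
    mutation_rate -- mutations per locus per generation
    """
    n_loci = len(ancestor)
    # Count single-repeat steps between each descendant and the ancestor.
    steps = [sum(abs(a - d) for a, d in zip(ancestor, hap)) for hap in descendants]
    rho = sum(steps) / len(descendants)
    generations = rho / (mutation_rate * n_loci)
    return generations * generation_years


# Toy four-locus network (illustrative values only).
toy_ancestor = (14, 23, 10, 11)
toy_descendants = [(14, 23, 10, 11), (15, 23, 10, 11), (14, 22, 10, 11), (14, 23, 10, 12)]

genealogical = rho_tmrca(toy_ancestor, toy_descendants, 0.00267)
evolutionary = rho_tmrca(toy_ancestor, toy_descendants, 0.00069)
```

Under this model the two estimates differ only by the ratio of the mutation rates, so the evolutionary rate always yields the older date.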

Results
A combined Y-STR dataset with 500 haplotypes from the Northern Iraqi populations was generated (S1 Table), wherein there were 360 different and 280 unique haplotypes, hence yielding a unique haplotype (UH) proportion of 56.0% and a discrimination capacity (DC) of 72.0% for the entire dataset. An overall haplotype diversity of 0.9979 was calculated. A number of haplotypes were observed as replicates, often exclusively within a single ethnic group, but a few of these haplotypes were also found to be shared by two different ethnic groups. Tables A-F in S1 File provide allele frequencies and the associated gene diversity (GD) values for the new combined dataset, as well as those for each of the five ethnic groups analyzed. Table 1 lists the different allelic variants, null alleles and bi-allelic patterns observed among the 500 samples from Northern Iraq: 13 allelic variants at six different loci, eight bi-allelic patterns at five different loci (excluding those at DYS385a/b) and null alleles at three different loci.
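The UH and DC figures follow directly from the haplotype counts: 280/500 = 56.0% and 360/500 = 72.0%. A minimal sketch of that calculation is given below. The function name is hypothetical, and the synthetic list is constructed only to reproduce the reported counts (280 singletons plus 80 replicated haplotypes); it does not reflect the actual distribution of replicates in the dataset.

```python
from collections import Counter


def haplotype_summary(haplotypes):
    """Count different and unique haplotypes and derive DC and UH as proportions."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    n_different = len(counts)  # distinct haplotypes
    n_unique = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    return {
        "n": n,
        "different": n_different,
        "unique": n_unique,
        "DC": n_different / n,
        "UH": n_unique / n,
    }


# Synthetic labels: 280 singletons, 60 haplotypes seen three times, 20 seen twice.
synthetic = (
    [f"single_{i}" for i in range(280)]
    + [f"triple_{i}" for i in range(60) for _ in range(3)]
    + [f"double_{i}" for i in range(20) for _ in range(2)]
)
summary = haplotype_summary(synthetic)
```

Real haplotypes would be passed as tuples of allele calls rather than labels; any hashable representation works.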
Based on the calculated GD values, apart from DYS385a/b, the two most informative loci for the combined dataset are DYS458 (0.8270) and DYS635 (0.7644), while the least informative locus is DYS391 (0.4934) (Table 2). DYS458 also turned out to be the most informative locus for each of the five ethnic groups analyzed.
Table 2. Statistical parameters of forensic interest for the combined Northern Iraqi (n = 500), as well as the Arab (n = 102), Kurdish (n = 104), Syriac (n = 86), Turkmen (n = 102) and Yazidi (n = 106) populations.
Asian, African and European population datasets differentiated in both dimensions from the core cluster, but respective population datasets clustered among themselves as expected (Fig 2).

Paternal lineages of Northern Iraqi ethnic groups
To provide an alternative view on the genetic affinities among the five different ethnic datasets from the current study, a phylogenetic tree was also constructed based on Nei's D_A genetic distance metric and in the context of an even wider genetic landscape (S2 Table and
S3 Table lists the individual 'fitness scores' and 'Bayesian probabilities' for the in silico haplogroup assignment for each sample by the two different algorithms used in the current study. Notably, 96.8% of the in silico haplogroup assignments by the Whit Athey algorithm had 'fitness scores' and 'Bayesian probabilities' above the set thresholds, which were 25 and 50%, respectively. There were no particular trends for the ambiguous haplogroup assignments, i.e. those with the associated fitness score and/or Bayesian probability below the set threshold for this algorithm. A comparison of the in silico haplogroup assignments made by the two different algorithms suggested a 'gross discrepancy rate' of 10.2% (51 discrepancies out of a total of 500 assignments) and a 'corrected discrepancy rate' of only 5.8% (28 discrepancies out of 484 assignments). The 'corrected discrepancy rate' reflects a more accurate picture, because (a) out of a total of 500 haplogroup assignments made by the Whit Athey algorithm, only 484 were assumed to be unambiguous and hence processed any further (S3 Table), and (b) out of the 51 discrepancies observed between the 500 haplogroup assignments made by the two algorithms tested, only 28 corresponded to full discrepancies with the 484 unambiguous haplogroup assignments by the Whit Athey method, while the rest corresponded to discrepancies at only the sub-clade level (e.g. J2a1 versus J2a2, etc.). Table 4 and Fig 4 show distributions of the haplogroup assignments for the combined dataset from Northern Iraq, as well as for each of the five different ethnic groups therein.
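The two discrepancy rates described above reduce to simple proportions (51/500 = 10.2% and 28/484 ≈ 5.8%). The sketch below shows one way such a comparison could be tallied. The record layout, both function names and, in particular, the rule for deciding that two labels differ only at the sub-clade level (a shared three-character prefix) are assumptions made purely for illustration; the text does not state how sub-clade-level discrepancies were identified.

```python
def discrepancy_rates(records, same_clade):
    """Return (gross rate, corrected rate) for paired haplogroup calls.

    records    -- list of (first_call, second_call, first_call_is_unambiguous)
    same_clade -- predicate deciding whether two differing labels disagree
                  only at the sub-clade level
    """
    total = len(records)
    # Gross rate: any difference between the two calls, over all samples.
    gross = sum(1 for a, b, _ in records if a != b)
    # Corrected rate: full discrepancies among unambiguous first calls only.
    unambiguous = [(a, b) for a, b, ok in records if ok]
    full = sum(1 for a, b in unambiguous if a != b and not same_clade(a, b))
    return gross / total, full / len(unambiguous)


def share_three_char_prefix(a, b):
    """Hypothetical sub-clade rule: J2a1 and J2a2 share 'J2a'; R1a and R1b do not."""
    return a[:3] == b[:3]


# Toy paired calls (illustrative labels only).
toy_records = [
    ("J1", "J1", True),
    ("J2a1", "J2a2", True),   # differs at the sub-clade level only
    ("R1a", "R1b", True),     # full discrepancy
    ("R1b", "R1b", True),
    ("G2a", "J1", False),     # ambiguous first call, excluded from the corrected rate
]
gross_rate, corrected_rate = discrepancy_rates(toy_records, share_three_char_prefix)
```

Passing the sub-clade rule in as a predicate keeps the tallying logic independent of whichever haplogroup nomenclature is in use.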
Of the 21 possible haplogroup assignments, 18 were observed in the combined dataset, pointing to the high heterogeneity of the Northern Iraqi populations. However, it must be noted that, without proper haplogroup assignments by Y-SNP typing, such in silico haplogroup assignments should be treated solely as preliminary findings: being based on Y-STR data alone, they may not always be accurate [34]. In other words, caution should always be exercised when drawing conclusions from such in silico data alone.
While the four most prevalent lineages observed in the combined dataset were J1 (17.98%), R1b (12.81%), R1a (12.40%) and J2a1b (12.19%), the distributions among the five ethnic groups were found to vary significantly: (a) 14 different haplogroups were observed in Arabs, with the three most common being J1 (38.
30) or vice versa. Since the ancestral haplotype could not be reliably determined with the available data, four different sets of TMRCA estimates were made with each of the genealogical and evolutionary Y-STR mutation rates, with the DYS448 locus invariably excluded due to the bi-allelic pattern. These suggested time-scales of 468±287 to 936±597 years and 1811±1109 to 3622±2309 years, respectively.

Discussion
HD values ranged from 0.97456 for the Syriac to 0.99739 for the Kurdish population dataset, with intermediate values for the remaining three ethnic groups analyzed (Table 2). An immediate difference between the 17-loci Y-STR datasets was in the number of haplotype replicates observed, at both the intra- and inter-population levels, as reflected by the UH values: Arabs (78.43%), Kurds (80.77%), Syriacs (36.05%), Turkmens (72.55%) and Yazidis (22.64%). The low UH values observed for the Syriac and Yazidi ethnic groups perhaps reflect the well-documented isolation and/or strict religious endogamy in these communities [7,35]. The observed DC values for each population dataset also varied considerably, ranging from 47.17% for Yazidis to 89.42% for Kurds, with intermediate values for the other three ethnicities (Table 2). A somewhat counteracting effect was the observation of numerous rare genetic variations that could potentially help during forensic investigations and may also provide novel insights from an anthropological perspective (Table 1).
Although based on two different genetic distance metrics, namely R_ST and Nei's D_A, and on analyses comprising largely different population datasets, the AMOVA/MDS (Table 3 and Fig 2) and N-J phylogenetic tree (S2 Table and Fig 3) analyses revealed seemingly concordant results, whereby each of the new population datasets from the current study was found to be distinct in the sense that they all exhibited differential clustering with each other and with those from other nearby/distant populations.
To provide further insights from an anthropological perspective, haplogroup assignments were made with the popular Whit Athey haplogroup assignment algorithm, the results of which were then further validated through the use of a second algorithm, namely the NevGen Y-DNA Haplogroup Predictor (S3 Table). Observation of a 'gross discrepancy rate' of 10.2% and a 'corrected discrepancy rate' of only 5.8% suggested that such in silico haplogroup assignment tools could perhaps provide some insights when proper Y-SNP data are not available. The following conclusions were therefore drawn with great caution, being based on such in silico data alone. The R (25%) and J (39%) macrohaplogroups were found to account for over 60% in total for the combined dataset from Northern Iraq, which is consistent with the fact that both macrohaplogroups are thought to originate from the Near East as pre-Last Glacial Maximum events that subsequently spread to Europe during late Mesolithic and early Neolithic times, respectively (Table 4 and Fig 4) [36,37]. In contrast, significant variations were observed in the actual distribution of specific sub-clades of these and other macrohaplogroups among the five different ethnic groups from Northern Iraq, perhaps akin to other highly admixed and/or divergent populations from the Near East [13,37-39].
While there are a number of earlier studies on the paternal lineages of various Kurdish populations, these are based on smaller population samples and/or lower loci coverage than the current study [39-43]. One of these earlier studies included Y-SNP-based haplogroup distributions for four Kurdish populations in total from Turkey, Georgia and Turkmenistan, where J2 and R were observed at up to 32% and 37%, respectively [42]. In a more recent study focusing on different ethnic groups from Iran, haplogroups J2 and R were both observed at 24% in Kurds, wherein R1a alone accounted for 20% [39]. Consequently, results from these earlier studies are in good agreement with those for Northern Iraqi Kurds from the current study, wherein J2 subclades were found to account for 22%, while lineages R1a and R1b together accounted for 21%, with R1a at 17%.
Y-chromosomal data on various Arabic-speaking populations across a wide geography ranging from North Africa to West Asia are also available in the literature, often pointing to the heterogeneous nature of these populations and reflecting their panethnic composition. Y-chromosomal haplogroup distributions in Marsh Arabs from the eastern part of Iraq were also investigated, wherein J1 was found to be the most prevalent lineage, with its three markers accounting for 81% in total [44]. Hence, results from the current study on the Northern Iraqi Arabs are in good agreement with those for Marsh Arabs in that J1 lineages, at around 39%, were the most prevalent not only in this ethnic group but also among all five analyzed. Considering that J1 is thought to originate from a geographical zone that includes northeastern Syria, northern Iraq and eastern Turkey, from where it expanded to the rest of the Near East and North Africa, such a high prevalence of J1 among Iraqi Arabs is indicative of their indigenous nature [45].
There are also a number of earlier investigations on the paternal lineages of various Turkmen populations [25,26,39,46]. However, a distinction should perhaps be made between the Turkic populations from Turkmenistan in Central Asia and those elsewhere, such as in Northern Iraq and Northern Syria. At least the Northern Iraqi Turkmen, although still Turkic and thus with historical links with Central Asia, have even closer links with the Turkic populations from Anatolia and/or Azerbaijan/Northwestern Iran. Earlier investigations on the Turkmen populations in Afghanistan, Uzbekistan and Iran suggested that haplogroup Q was the most prevalent, accounting for 34%, 73% and 43%, in that order [25,26,39]. An earlier study on the population of Turkmenistan itself also exists, albeit of relatively poor Y-SNP typing resolution, whereby the most prevalent haplogroups observed were P(xR1a), J and N(x3) with frequencies of 52%, 24% and 10%, in that order [46]. Results from the current study suggest that the haplogroup distribution for the Northern Iraqi Turkmen population is more similar to that of other Northern Iraqi populations, such as Kurds, as well as Turkish populations in Southeastern Anatolia and Cyprus [13,37].
Results from the current study also suggested that the paternal lineages of the Northern Iraqi Syriacs are rather homogeneous and exhibit signs of a strong population bottleneck, a situation perhaps further accentuated by the strict endogamy known to be practiced in this ethnic group (Table 2). This also seems to be the case for the Northern Iraqi Yazidis, where strict endogamy is also practiced in a relatively small and isolated population of around half a million people [7,47].
In the case of Northern Iraqi Syriacs, significant R_ST genetic distances were observed with all other nearby populations, except for the Yazidis from the current study, and Iraqis, Iranians, Italian (Marche) and Turkish populations from Cukurova, the Marmara Region and Southeastern Anatolia in general (Table 3, Fig 2). In contrast, the Northern Iraqi Yazidis were found to have non-significant R_ST genetic distances with all four other ethnic groups from the current study, with the populations from Albania, Cyprus, Iraq, Iran, Lebanon and Italy (Marche), and with the Turkish populations from the Marmara Region and Southeastern Anatolia (Table 3, Fig 2). Consequently, despite corresponding to isolated and homogeneous populations, contemporary Syriacs and Yazidis from Northern Iraq may in fact have a stronger continuity with the original genetic stock of the Mesopotamian people, which possibly provided the basis for the ethnogenesis of various subsequent Near Eastern populations. Such an observation seems to be in line with genetic distance calculations based on a different method, namely Nei's D_A genetic distance, whereby the Northern Iraqi Syriac and Yazidi populations from the current study were found to be positioned in the middle of a genetic continuum between the Near East and Southeastern Europe. Earlier Y-chromosomal haplogroup distribution data on Syriacs from Northern Iraq (n = 7) and Iran (n = 48 and 55) suggested an overall dominance by the R and J haplogroups [35,39,45]. In particular, in the most recent study with the highest haplogroup resolution (n = 48), R1a, R1b, J1 and J2 sub-clades were found to account for 8%, 29%, 15% and 15%, in that order, among Assyrians from Iran [39]. In this respect, the results from the current study, albeit on Northern Iraqi Syriacs (n = 86), are in good agreement because J and R subclades were observed at 36% and 41%, respectively, where R1a, R1b, J1 and J2 sub-clades accounted for 11%, 30%, 12% and 24%, in that order.
Unfortunately, no previously published data exist on the Y-chromosomal haplogroup distributions in Yazidis from Northern Iraq or elsewhere, hence precluding comparisons with those from the current study. Results from the current study suggest dominance by R haplogroup subclades among Yazidis, where R1a and R1b account for 9% and 21%, respectively. M-JN and associated TMRCA analyses on haplotypes with J1, J2a1b, R1a and R1b haplogroup assignments among Northern Iraqis all suggested in situ radiation as a plausible model to explain the diversity of the corresponding paternal lineages. This is because there were seemingly: (a) a number of star-like descent clusters in the J1 network, exclusively or partially comprised of Arab haplotypes, which dominated the overall network, (b) two star-like descent clusters in the R1b network, one comprising Syriac and the other Yazidi haplotypes, which also both dominated the overall network, and (c) two star-like descent clusters in the J2a1b network, one comprising Syriac/Kurdish and the other Yazidi haplotypes, although the overall network was dominated by Kurdish haplotypes.
(excluded loci are DYS385a/b, DYS389I/II, DYS392, DYS437, DYS438, DYS448 and DYS458); Panel B, the combined R1a and R1b M-JN based on 13 Y-STR loci (excluded loci are DYS385a/b and DYS389I/II), the R1a and R1b networks are in fact split along the right and left of the black arrow, respectively, and just below the proposed ancestral modal haplotype for both haplogroups, which was not sampled; Panel C, J2a1b M-JN based on eight Y-STR loci (excluded loci are DYS385a/b, DYS389I/II, DYS392, DYS437, DYS438, DYS448 and DYS458). Asterisks (*) mark the proposed ancestral modal haplotypes. A scale bar whose length denotes a single mutation event between two neighbouring haplotypes is also provided for each network. https://doi.org/10.1371/journal.pone.0187408.g005
In conclusion, the data presented herein constitute a significant primer for further population studies and forensic investigations in Northern Iraq, such as the missing person identification efforts necessitated by past and present conflicts. Novel insights into the molecular anthropology of Near Eastern populations are also expected, given the hitherto scarcity of genetic data from this corner of the world, which is of immense historical importance. However, it should be noted that the major limitation of this study is the lack of Y-SNP genotyping.
Supporting information
S1 Table.
Table. In silico Y-chromosomal haplogroup assignments for the Northern Iraqi samples by the Whit Athey 21-haplogroup prediction and the NevGen Y-DNA haplogroup predictor algorithms (n = 500). (DOC)
S1 File. Table A: Allele frequencies of the 17 Y-STR loci for the combined Northern Iraqi population (n = 500). Table B: Allele frequencies of the 17 Y-STR loci for the Northern Iraq Arab population (n = 102).