Y chromosomal evidence on the origin of northern Thai people

The Khon Mueang represent the major group of people present in today’s northern Thailand. While linguistic and genetic data seem to support a shared ancestry between Khon Mueang and other Tai-Kadai speaking people, the possibility of an admixed origin with contribution from local Mon-Khmer population could not be ruled out. Previous studies conducted on northern Thai people did not provide a definitive answer and, in addition, have largely overlooked the distribution of paternal lineages in the area. In this work we aim to provide a comprehensive analysis of Y paternal lineages in northern Thailand and to explicitly model the origin of the Khon Mueang population. We obtained and analysed new Y chromosomal haplogroup data from more than 500 northern Thai individuals including Khon Mueang, Mon-Khmer and Tai-Kadai. We also explicitly simulated different demographic scenarios, developed to explain the Khon Mueang origin, employing an ABC simulation framework on both mitochondrial and Y microsatellites data. Our results highlighted a similar haplogroup composition of Khon Mueang and Tai-Kadai populations in northern Thailand, with shared high frequencies of haplogroups O-PK4, O-M117 and O-M111. Our ABC simulations also favoured a model in which the ancestors of modern Khon Mueang originated recently after a split from the other Tai-Kadai populations. Our different analyses concluded that the ancestors of Khon Mueang are likely to have originated from the same source of the other Tai-Kadai groups in southern China, with subsequent admixture events involving native Mon-Khmer speakers restricted to some specific populations.


Introduction
The area of northern Thailand situated in proximity to southern China, northern Myanmar and northern Laos hosts several ethnicities who can be linguistically classified in four groups: Austroasiatic, Tai the Austroasiatic subfamily Mon-Khmer (MK) are spoken today by populations, e.g. Lawa (LW) and Mon (MO), historically and archaeologically recorded as the native inhabitants of this area before the arrival of Tai-Kadai people form southern China 2,000 years ago [1][2][3].
Other ethnicities, such as the Hmong-Mien (e.g. Hmong) and different Sino-Tibetan groups (e.g. Karen), migrated from nearby countries to the mountainous areas of northern and western Thailand no more than 200 years ago [4][5]. In addition, other recent migrations from southern China and/or northern Myanmar are recorded as involving several Tai-Kadai groups such as Lue (LU), Khuen (KH), Yong (YO) and Shan (SH) [6]. Linguistic similarity between populations has often been used to disentangle patterns of relationship, under the assumption that a common language implies a common origin [7,8]. However, genetic similarities between different populations are often more complex than expected from linguistic data due to the effect of processes such as drift and migration [9]. The mountainous area of northern Thailand consists of several river plains surrounded by mountains, which continue from the Shan Hills in bordering Myanmar to Laos. In this region, dissimilar geographic areas are often occupied by different ethnolinguistic groups. The hill tribes, such as LW and SH speaking groups, currently live on the mountain where mobility is quite limited while, on the other hand, populations such as MO and different TK speaking groups inhabit the well-connected lowland regions. Geography, acting in addition to cultural/linguistic isolation, might have been an influential factor in determining the divergence by inbreeding of these populations [10]. Due to the combined effect of these different processes, northern Thailand is an extremely interesting site for studies on human population genetics.
Among the multi-ethnic groups in northern Thailand, the Khon Mueang (KM) are the most represented, with a total number of individuals reaching 6 million [11]. KM is the name with which local northern Thai people, possibly the Yuan (YU), call themselves, and refers more to a past social and political category rather than a distinct population [12,13]. Linguistically speaking, the KM's language is similar to that of the YU, which is classified as belonging to the TK family.
It is widely accepted that genetic markers can be efficiently used to reconstruct past populations' history and interactions [14][15][16][17][18][19][20]. Three types of genetic markers are commonly used to reconstruct past population processes, differing in the modality of inheritance: the maternally inherited mitochondrial DNA (mtDNA), paternally inherited Y chromosome, and the autosomal and X chromosomes that are inherited by both parents. Even after the rise of wholegenome techniques, uniparental markers continue to be widely used in population genetics. Due to their specific patterns of inheritance, the information provided by the Y-chromosome and by the mitochondrial DNA (mtDNA) allows for in depth analyses of sex-specific patterns of population history and demography. These data can be used to locate asymmetric contributions of male and female individuals to a migration, an event of admixture and generally to the pattern of gene-flow in a certain area.
A previous study, based on comparisons among autosomal Short Tandem Repeat (STR) loci, suggested an admixed origin for KM, with a higher contribution from the TK than from the MK groups [21]. On the other hand, a coalescent modelling using mtDNA genome data indicated southern China as the most probable origin of the KM, without admixture with LW groups in northern Thailand [22]. Y chromosomal data of KM and of their linguistic and geographic neighbours in northern Thailand have been reported, but they have been limited to STR markers [10,23]. Here, we investigated newly generated data of single nucleotide polymorphism (SNP) on Y chromosome along with previously published Y-STRs and mtDNA data. We used a combination of classical statistical analyses and model based simulations to shed light on the past population dynamics linked to the origin of KM of northern Thailand.

Material and methods Y binary markers and assembled datasets
A sample of 519 males belonging to 24 populations from northern Thailand was subdivided into three groups: Khon Mueang (KM), Mon-Khmer (MK) and Tai-Kadai (TK) ( Table 1). The DNA samples were obtained from our previous studies with written informed consent [21,23].  [24] using a Sequenom Mass ARRAY iPLEX Platform (Sequenom, Hamburg, Germany). To assign specific Y chromosomal haplogroup or Y lineage to each individual, we employed a phylogenetic hierarchical approach based on Y-DNA Haplogroup Tree YSOGG 2016 [25]. The use of human subjects for this study was ethically approved by Chiang Mai University, Thailand.  20 7 In order to investigate the origin of the KM populations from both maternal and paternal perspectives using a simulations based analysis, we assembled two datasets in which we collected genetic information for 17 Y-STRs loci: DYS19, DYS388, DYS389a, DYS389b, DYS390, DYS391, DYS392, DYS393, DYS426, DYS434, DYS435, DYS436, DYS437, DYS439, DYS460, DYS461, Y-GATA-A10 and mtDNA HVR-I sequence (Table 1). A total of 536 genotypes for Y-STR and 1,109 for mtDNA-HVRI sequences were retrieved from literature [10,23].

Statistical analyses
We employed a discriminant analysis of principal components (DAPC) [26] on both uniparental datasets to define genetic relationship among KM, MK and TK groups. The DAPC analysis can be used to investigate the relationship between populations optimizing the variation between-and within-groups while being free from assumptions about Hardy-Weinberg equilibrium or linkage equilibrium [26]. We first assessed the best number of clusters in the HVR-I and Y-STR datasets using the find.clusters function in adegenet 1.3-1 [27] and compared the results of 5 independent runs using a custom made R script. We then ran the DAPC analysis with 100,000 iterations and checked for the consistency of the groups founded.
An exploratory analysis such as the DAPC however do not account for geographic location of the samples, thus precluding the visualization of geographic locations where gene flow between populations is either hindered or facilitated. Estimated Effective Migration Surfaces (EEMS) which employed individual based migration rates can be used to visualize zones with higher/lower migration with respect to the overall rate [28]. The region under study is first divided in a grid of demes and the individuals are assigned to the deme closest to their sampling location. The matrix of effective migration rates is then computed by EEMS based on the stepping-stone model [29] and on resistance distances [30]. We applied EEMS to a matrix of pairwise F st distances constructed on the mtDNA dataset using Arlequin 3.5 [31] and on the 17 Y-STRs using the script available from Github at https://github.com/dipetkov/eems. We averaged three runs each with 200, 300, 400 and 500 demes to produce the final EEMS surface, as the number of demes simulated during the grid construction phase can influence the scale of the deviation from overall migration detected [28]. Each single run consisted of 200,000 burn-in steps followed by 500,000 MCMC iterations sampled every 1000 steps. We plotted the averaged EEMS and checked for MCMC convergence using the rEEMSplots package in R v 3.2.2.
In order to unravel the origin of the KM population, we employed an Approximate Bayesian Computation (ABC) approach on both the HVR-I and Y-STR datasets. We first constructed two competing models (Fig 1), which are admixture and tree-like models based on our previous results [21,22].
In the admixture model, the KM population originated as a consequence of an admixture event from the parental populations, the MK and TK groups. The tree-like model postulates instead a recent separation of the KM and TK populations and a split of this combined population from the MK ones further back in time. For both the admixture and the tree-like models, we assumed constant effective population sizes based on historical records (S1 Table) and that the prior distributions were all uniform. The ABC methodology allows us to simulate thousands of genetic datasets for both of our competing models by means of the coalescent theory. These datasets are generated taking into account the prior distribution associated with each of the model parameters while being also composed of the same number of individuals and type of genetic markers that characterize the observed ones. The genetic variation in both the observed and simulated datasets is then summarized using a fixed set of statistics and compared using Euclidean distance. The posterior probabilities of each model are computed using a weighted multinomial logistic regression (LR) considering the simulations, which generate summary statistics most similar to the observed ones, as shown by smallest Euclidean distances. In the LR methodology, the model is considered as the categorically dependent variable in the simulations and the summary statistics as the predictive variables. The regression is local around the vector of observed summary statistics and the probability of each model is finally evaluated at the point corresponding to the observed vector of summary statistics. Maximum likelihood is used to estimate the β coefficients of the regression model. To evaluate the stability of the models' posterior probabilities, we considered different thresholds by considering different number of retained simulations for LR (25 000, 50 000, 75 000 and 100 000 best simulations). To generate the simulated datasets, we used the software package ABCtoolbox, running 500 000 simulations for each model. To calculate the models' posterior probabilities, we used R scripts from http://code.google.com/p/popabc/source/browse/#svn%2Ftrunk%2Fscripts, modified by SG. To summarize the genetic information contained in both the HVR-I and Y-STR datasets we calculated two arrays of statistics within and between populations. For the mitochondrial DNA dataset, we considered the number of haplotypes, the number of private polymorphic sites, Tajima's D, the mean number of pairwise differences for each population, the mean number of pairwise differences between populations and pairwise F st . When we analysed Y-STR, we used as summary statistics the mean and the sd. over loci in each population of four parameters: the number of alleles, haplotype diversity, modified Garza-Williamson index and the allelic range. Finally, as the genetic heterogeneity was observed in KM populations [23], we repeated the ABC approach outlined in Fig 1 for each KM population (KM1 to KM10).

Y haplogroups
The haplogroup frequencies in each studied population are listed in Table 2 and represented geographically in Fig 2, while the evolutionary relationships between them are presented in S1     The EEMS surfaces showed an overall pattern of good connectivity between neighbouring populations in northern Thailand with only moderate reductions/increment of migration rates (Fig 4A and 4B). Especially evident in the Y-STR dataset, the geographic outlier population, YU4, was connected with TK and KM populations residing in the central part of northern Thailand by a corridor of high effective migration (Fig 4A). These northern Thai populations were in turn well connected with each other and with the eastern Lue (LU1 and LU2) populations. The strongest barrier in the Y-STR dataset was, not surprisingly, the one separating LW1 and LW2 populations, leading to lower migration rates with surrounding KM and TK populations (KM7, YU3 and YO). The MO did not conform to the isolation pattern presented by the Lawa, showing higher than expected migration rates with TK and KM populations. The EEMS surface based on the mtDNA-HVR1 dataset (Fig 4B) highlighted a weak spatial structure compared with the Y-STR dataset. We observed lower than expected migration rates between LW1, LW2 and the KM populations from the southern part of northern Thailand (KM8, KM9, KM10), as well as a feeble barrier between LU1 and LU2. However, higher migration rates indicate the connection between the SH and populations residing in the central part of northern Thailand.

Model selection
The posterior probabilities from ABC analysis of the two considered evolutionary models are presented in Table 3. For the Y-STR dataset, we found that the tree-like model postulating a recent split between the ancestors of modern KM and TK populations provided a better explanation for the KM origin than an admixture model. The high and stable posterior probabilities  Table), the results weakly supported tree-like model in most of the KM populations. The ABC analysis failed to support the tree-like model in KM9. The results based on mtDNA HVR-I were also less indicative of supporting the tree-like model than the admixture, however, the results obtained from simulations conducted on separated KM populations supported the tree-like model in almost all of the KM populations (e.g. KM2, KM3, KM4, KM6, KM7 and KM10). There was only one instance of KM5, in which the admixture model was preferred for the mtDNA-HVR1 dataset.

Discussion
The investigation of Y chromosomal lineages in northern Thai populations revealed that the majority of the sampled individuals could be assigned to one of three common haplogroups: O-PK4 (O1b1a1), O-M117 (O2a2b1a1) and O-M111 (O1b1a1a1a1a). These lineages are also prevalent in Chinese and other Southeast Asian populations [32][33][34][35]. The overall pattern of haplogroups distribution was also generally homogeneous in our studied populations, with subhaplogroups of O1 and O2 reaching the highest frequencies amongst the studied individuals. This was especially true for TK and KM populations. Interestingly, the MO from northern  Thailand show the presence of haplogroups usually found in South, Central and West Asia: R-P249 (R2a) and J-M172 (J2) [36]. Connection between the ethnic Mon and populations from South and Central Asia was already proposed from previous identification of mtDNA lineage W3a1b [20]. Both groups of the MK speaking LW groups showed high differences between each other and from other populations, presenting low levels of haplogroup diversity (Table 1) with high frequencies of O-PK4 (O1b1a1) (72% in LW1) and N-M231 (56% in LW2). Haplogroup N-M231 is prevalent in today's TK and Hmong-Mien speaking populations, as well as in Han of southern China [37]. We also detected haplogroup C-M130 and its sublineages (C-M356 and C-M217) in almost all KM and YU populations, as well as in one Lue group (LU1). Haplogroup C-M130 has been found mainly in Mongolia and in Korea, while in Southeast Asia it reaches high frequencies in the eastern part of Indonesia. C-M217 is, instead, typically present at high frequencies across northeast Asia [38]. It is worth to note that we observed haplogroup D1-M15 in several TK and KM populations, although at low frequencies. High frequency of this lineage was reported in China especially in Tibet, Quiang and Yao [39]. To account for the presence of C and D lineages in our TK groups, we speculate that paternal admixture among several ethnolinguistic groups in the area of southern China were heavily influenced by Han and Mongol expansion from the north [40][41][42][43]. This would have happened before the southward migration of the TK ancestor to northern Thailand, which could contribute to the presence of C and D lineages in these populations.
In agreement with earlier studies of mtDNA and autosomal STRs variation [20][21][22] that reported genetic differentiation of MK speaking groups in northern Thailand, the two LW groups appeared to differ from other populations and from each other, as indicated by the DAPC results (S2C Fig). A lower level of gene flow caused by the presence of geographic barriers may be the driven factor of this differentiation, as suggested by the EEMS surfaces (Fig 4). On the other hand, the Mon is less genetically differentiated from the KM and TK groups. Although the genetic origin of the Mon is related to South Asia, recent admixture with the Tai sources could have been the main force that shaped the genetic variation of the northern Thai Mon.
To investigate the origin of the KM, we proposed two demographic models, which summarized previous hypothesis on their origin. The first scenario depicted an admixture event involving local MK populations and migrating TK groups, while a second tree-like model suggested that the KM originated following a recent split from the ancestors of modern TK populations (Fig 1). We then proceeded to test these hypotheses with ABC simulations on both maternally and paternally inherited data. When all the KM individuals were pooled together, the most supported model was the one postulating a close relationship between these populations and TK (Table 3). This demographic pattern was clearer when Y-STRs were employed instead of mtDNA, possibly suggesting significant maternal contribution from local MK speaking on current KM populations. This conclusion was reinforced when the single KM populations were considered separately. In some specific cases (KM5, S2 Table), the admixture scenario obtained higher posterior probabilities than the tree-like one, highlighting the importance of considering small scale local processes when investigating the history of a population. In conclusion, we observe contrasting pattern of paternal and maternal genetic variation with a clearer genetic structure in the Y-STRs than in the mtDNA-HVR1 sequences. Our different approaches suggest nonetheless a common origin between KM and TK populations, previously proposed to be located in southern China [22]. After an initial migration southward, fragmentation process in separate villages and contact with local MK speaking people could have promoted secondary events of admixture, especially in the matrilineal lineage (S2 Table). Although some models and populations compared are different from our previous mtDNA genome study [22], the results are confirmed with the additional inclusion of post-settlement contacts with MK populations. Future work employing complete Y sequences and/or in depth autosomal information from northern Thai and surrounding populations will be crucial to determine the migration origin and to evaluate the full impact of secondary admixture.