The Contribution of Genetic Diversity to Subdivide Populations Living in the Silk Road of China

There are several indigenous ethnic populations along the Silk Road in Northwest China that display clear differences in culture and social customs, perhaps as a result of geographic isolation and different linguistic traditions. However, extensive trade and other interactions probably facilitated the admixture of different gene pools between these populations over the last two millennia. To further explore the evolutionary relationships of the 13 ethnic populations residing in Northwest China and to reveal the features of population admixture, the 9 most commonly employed CODIS loci (D3S1358, TH01, D5S818, D13S317, D7S820, CSF1PO, vWA, TPOX, FGA) were selected for genotyping and further analysis. A phylogenetic tree and principal component analysis revealed a clear pattern of population differentiation between the 4 populations living in Sinkiang Uighur Autonomous Region and the other 9 populations dwelling in the upper regions of the Silk Road. R matrix regression showed that high-level gene flow and population admixture do exist among these ethnic populations in the Northwest region of China. Furthermore, the Mantel test suggests that a larger percentage of genetic variance (21.58% versus 2.3%) can be explained by geographic isolation than by linguistic barriers, which matches the contribution of geographic factors reported for other world populations.


Introduction
The Northwest region of China has a very complex geography, encompassing mountains, plateaus and basins, as well as some special landscapes, such as the Gobi desert. There are at least 20 ethnic populations and isolated groups that reside in this region. The Han, Hui and Mongolian people are three of the largest ethnic groups in China. The Han ethnic group has a population of more than 1 billion. The Hui and Mongolian ethnic groups each have populations of more than 5 million and are regarded as typical examples of Chinese ethnic minorities [1]. Of the populations that live in Sinkiang Uighur Autonomous Region, all are aboriginals except for the Han ethnic group, which has migrated to the region from Central China since the 1950s. According to written records, the Uyghur ethnic group has been in frequent contact with both eastern and western populations since the 3rd Century B.C. The immigration of the Uzbek, Kazakh and Kirghiz ethnic groups was the result of the expansion of Mongol Empire in the 13th century, and their ancestors may be the people that inhabited central Asia 2,000 years ago. Of the five ethnic populations living in the Qinghai and Gansu provinces, the Yugur ethnic population has a relatively long history. The other four populations are most probably the product of population admixtures among the Mongolian, Hui, Han and Tibetan ethnic populations [2][3][4] (Figure 1). The ''Silk Road'', which could date back to the Western Han Dynasty, starts geographically from the ancient capital Chang-an, passes through the ''Hosi Corridor'' and Sinkiang Uighur Autonomous Region and extends into Central Asia, India and finally the Mediterranean region. Previous research has suggested that extensive genetic admixture exists in the Silk Road region [5,6]. Evidence from the mitochondrial hyper-variable region showed that populations in central Asia contain gene pool elements of both Eastern and Western Euro-Asians. 
Furthermore, historical records indicate that factors such as religious belief, marriage customs, linguistic traditions and migratory history may have played important roles in shaping the matrilineal genetic structure of the populations living in this region [7]. However, these investigations have seldom examined the genetic structure and population differentiation of the populations living near the starting point of the Silk Road.
Addressing major issues in the field of human genetics requires multiple types of genetic markers, various analytical methods and statistical models, and the consideration of geographic, linguistic and social factors [8]. In recent years, several Chinese investigators have examined population differentiation and admixture patterns for Chinese populations and some Central Asian populations. On the basis of the allele frequency data of 15 to 30 STR loci, Chu et al. constructed a phylogenetic tree for 32 East Asian populations and proposed a hypothesis for the origin of East Asian people [9]. Using Y haplotype features, Su and Jin inferred the origin of the Chinese Han population and East Asian people and hypothesized that the Northern Chinese Han population derives from migrants from the Southern Chinese Han population [10]. Recently, Xie et al. analyzed Y chromosome STRs and SNPs from selected individuals living in Gansu Province and suggested that they might be the offspring of ancient Roman soldiers [11]. Moreover, Zhang et al. have used mitochondrial sequence diversity to study the evolution and origin of Chinese populations. They constructed a phylogenetic tree for the Chinese Han population based on mitochondrial haplogroups that has been widely employed in later investigations on mitochondrial polymorphisms in East Asian populations [12]. Most of the previous studies agree that high genetic differentiation exists among Chinese populations and that the gene flow and genetic admixture are very complex. Samples covering a wider range and larger size are needed to improve the robustness of the statistical analysis, and more sophisticated statistical models and analyses should also render the results more convincing.
Genetic markers on the Y chromosome and on mitochondrial DNA, such as Y-STRs, Y-SNPs and mitochondrial hyper-variable regions I and II, which have low recombination rates and no recombination, respectively, are widely used to address the genetic differentiation between populations [13,14]. However, confounding issues such as low effective sample size and ascertainment bias can be problematic, and genetic markers on the Y chromosome are especially susceptible to genetic drift and to factors related to male reproductive function [10,15]. Microsatellites have been applied in detecting human genome variation, conducting linkage analysis and in forensic applications, such as DNA fingerprinting. Microsatellites have proven to be especially useful in studies of the evolutionary relationships between species or between populations with relatively close genetic relationships [16]. These studies suggest that the behavior of autosomal genetic markers is similar to human linguistic patterns [17].
In this paper, we have selected 13 representative populations (12 different ethnic groups) living in the Northwest region of China, analyzed the statistical distribution of allele frequency at 9 STR loci, and attempted to reconstruct the genetic structure and reveal the respective gene flows. Our analyses also consider geographic and linguistic factors. With these factors in mind, we have quantitatively analyzed the variance components contributed by genetic differentiation, geographic isolation and linguistic differences (Table 1). All individuals were selected randomly with the appropriate informed consent. Confirmation was obtained that all four grandparents of each genotyped individual had been born in the same area. The sample size used is sufficient for a genetic population analysis using microsatellites [18]. Furthermore, as the allele frequencies of the Kazakh, Salar, Tu and Baoan have been published previously by our lab [19][20][21], we adopted the original data instead of repeating the experiment. In addition, the population data for the Han living in Sinkiang Uighur Autonomous Region were acquired from one Chinese study [22].

Samples and population data
The study was approved by the Xi'an Jiaotong University Ethics Committee. All participants provided written informed consent. A previous study using part of these samples has been published [23].

DNA extraction and genotyping
Genomic DNA was extracted using the Chelex-100 protocol as described by Walsh et al. and quantified spectrophotometrically [24]. Multiplex PCR amplification was performed on approximately 1-3 ng of genomic DNA in a total reaction volume of 25 µl, consisting of 9.5 µl of the AmpFlSTR Identifiler PCR reaction mix, 0.5 µl of AmpliTaq Gold DNA polymerase, and 5.0 µl of the AmpFlSTR Identifiler primer set. Amplification was carried out in a 9700 Perkin-Elmer DNA Thermal Cycler (Applied Biosystems) using 28 cycles under the following conditions (after an initial denaturation step of 11 min at 95 °C): 94 °C for 1 min, 59 °C for 1 min, 72 °C for 1 min (following the recommendations from the AmpFlSTR Identifiler PCR kit manufacturer's manual). The amplified DNA products were separated and detected using an ABI Prism 3730 DNA sequencer (Applied Biosystems). One microliter of PCR product was combined with 12 µl of formamide and 0.5 µl of size standard (GeneScan 500 LIZ). The resultant data analysis and allele designation were carried out using the GeneScan and Genotype software programs.

Data analysis
Allele frequencies were estimated by gene counting following exact tests of Hardy-Weinberg equilibrium with Genepop [25]. Gene diversity was estimated as $\frac{n}{n-1}\left(1-\sum_i x_i^2\right)$, where $x_i$ is the estimated frequency of the ith allele in the system. The combined power of exclusion probability of paternity (EPP) and combined probability of matching (PM) for the nine STR systems for each population were calculated as $1-(1-\mathrm{EPP}_1)(1-\mathrm{EPP}_2)\cdots(1-\mathrm{EPP}_9)$ and $\mathrm{PM}_1 \cdot \mathrm{PM}_2 \cdots \mathrm{PM}_9$, respectively, where $\mathrm{EPP}_n$ and $\mathrm{PM}_n$ can be estimated by the Powerstats program [26].
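The three summary statistics above can be illustrated with a minimal Python sketch. This is not the Powerstats or Genepop implementation; the function names are hypothetical, the input values are made up for illustration, and n is assumed to be the number of sampled gene copies, since the text does not define it.

```python
from math import prod


def gene_diversity(allele_freqs, n):
    """Gene diversity n/(n-1) * (1 - sum(x_i^2)) for a single locus.

    allele_freqs: estimated allele frequencies x_i (expected to sum to 1).
    n: sample size for the small-sample correction (assumed here to be the
       number of sampled gene copies; the source text does not define n).
    """
    return n / (n - 1) * (1.0 - sum(x * x for x in allele_freqs))


def combined_epp(epp_per_locus):
    """Combined exclusion power: 1 - product over loci of (1 - EPP_k)."""
    return 1.0 - prod(1.0 - e for e in epp_per_locus)


def combined_pm(pm_per_locus):
    """Combined matching probability: product over loci of PM_k."""
    return prod(pm_per_locus)


# Hypothetical inputs, not values from the study's tables.
example_diversity = gene_diversity([0.5, 0.3, 0.2], n=200)
example_epp = combined_epp([0.5, 0.6])
example_pm = combined_pm([0.1, 0.2])
```

Because each additional locus multiplies the matching probability by a value below 1, the combined PM shrinks quickly as loci are added, while the combined EPP approaches 1.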
Nei's $D_A$ distance was selected as the genetic distance measure, as it best reflects the real differentiation among populations; it was calculated with the Dispan program [27,28]. A neighbor-joining phylogenetic tree based on genetic distance (with 1,000 bootstrap replicates) was constructed with Mega 4.1. Because the matrix of allele frequencies of several STR loci has some defects [29], we transformed it into its variance-covariance matrix with PAST [30]. SPSS 13.0 was then used to perform the principal component analysis and draw the scatter plot. The R matrix model of Harpending and Ward was applied to perform the regression analysis with the formula $E(H_i) = H_t(1 - r_{ii})$, where $r_{ii}$ is the genetic distance of a particular population from the gene frequency centroid, which can be calculated from allele frequency data with the formula $r_{ij} = (p_i - \bar{p})(p_j - \bar{p}) / [\bar{p}(1 - \bar{p})]$. Here $p_i$ and $p_j$ are the allele frequencies in populations $i$ and $j$, $\bar{p}$ is the mean allele frequency over all populations, $H_i$ is the average heterozygosity of the ith population, and $H_t$ is equal to the overall mean heterozygosity of the entire population [31].
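For readers unfamiliar with the $D_A$ measure, the following is a minimal Python sketch of its standard definition, $D_A = 1 - \frac{1}{r}\sum_{j}\sum_{i}\sqrt{x_{ij}y_{ij}}$ over $r$ loci. It is not the Dispan implementation; the function name, the data layout (one dictionary of allele frequencies per locus) and the example values are assumptions made for illustration.

```python
from math import sqrt


def nei_da(pop_x, pop_y):
    """Nei's D_A distance between two populations.

    pop_x, pop_y: lists with one entry per locus (same locus order in both);
    each entry is a dict mapping allele label -> frequency at that locus.
    """
    if len(pop_x) != len(pop_y):
        raise ValueError("both populations must cover the same loci")
    total = 0.0
    for locus_x, locus_y in zip(pop_x, pop_y):
        # Alleles absent from either population contribute sqrt(0) = 0,
        # so only shared alleles need to be summed.
        shared = set(locus_x) & set(locus_y)
        total += sum(sqrt(locus_x[a] * locus_y[a]) for a in shared)
    return 1.0 - total / len(pop_x)


# Hypothetical single-locus populations, not data from the study.
pop_a = [{"12": 0.5, "13": 0.5}]
pop_b = [{"12": 1.0}]
distance_ab = nei_da(pop_a, pop_b)
```

The distance is 0 for identical allele frequencies and 1 when two populations share no alleles at any locus.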
The geographic distances were entered as a matrix of the great-circle distances between pairs of populations and were assessed on the basis of population geographic coordinates [32]. Linguistic distances were estimated as simple dissimilarity indexes ranging from 0 to 4. Languages belonging to different phyla were assigned a value of 4; languages belonging to different branches, 3; languages belonging to different families, 2; different languages, 1; and the same language, 0 [33]. The linguistic classification of the Northwest China languages used in this process was adopted from the Ethnologue online language database (http://www.ethnologue.com).

Table 2 summarizes the genetic polymorphisms of the selected 13 populations. The gene diversity values across the 9 STR loci are all above 0.7 (with a range of 0.7435 to 0.7793). The Hui have the lowest gene diversity value and the Kirghiz the highest. The total number of alleles detected is greater than 60 in every population: the Han in Xi'an and the Hui have the lowest number at 63, and the Tu have the highest number at 80. The overall $G_{ST}$ value for all loci is 0.0142. The combined EPP is commonly used to estimate the application value of a given marker system, and the combined probability of matching is considered an important index for individual identification or discrimination. Together, these two values show that the 9 STR marker system is of great value for the 13 selected populations. The EPP value in all cases is above 0.9999, and the PM value is below $10^{-8}$. Furthermore, more than 500 paternity cases and cases of individual identification have been successfully resolved using the PowerPlex 16 system, which contains the abovementioned 9 STR loci.

Genetic distance and phylogenetic trees
Pairwise genetic distances are shown in Table 3. Among populations from Northwest China, the largest value was found between the Kazakh and Han_XA samples, with a pairwise distance of 0.06, suggesting a relatively remote relationship. A neighbor-joining tree for the 13 population samples was constructed using the pairwise $D_A$ distance with 1,000 bootstrap replicates (Table 3, Figure 2). All of the populations were clustered into two main branches (83% bootstrap support). One included four populations in Sinkiang Uighur Autonomous Region, while the other nine populations were grouped together. Conversely, the closest relationship was between the Tu and Dongxiang, with the lowest distance at only 0.0092 (67% bootstrap support). Other main sub-clusters with >50% bootstrap support included Uyghur and Kirghiz (73%) and Salar and Baoan (55%).

Principal component analyses
Because the gene frequency matrix consists of closed data, the resulting closure effect confounds the analysis of the population genetic structure. We therefore used the model established by Xue et al. [29] to perform the principal component analysis, which uses the averaged covariance matrix calculated from gene frequencies.
The first three principal components account for 31.88%, 17.1% and 15.01% of the variance, respectively, for a total of 63.99%. The two-dimensional scatter plot (Figure 3) revealed that the four populations in Sinkiang Uighur Autonomous Region (Uyghur, Kirghiz, Uzbek and Kazakh) are clearly separated from the other populations by the first principal component (Figure 3a).

R matrix analyses
A regression plot was built to examine the level of genetic exchange and patterns of gene flow within the general region of Northwest China using the R matrix model described by Reddy et al. [34]. As shown in Figure 4, the Kirghiz, Uzbek and Salar populations fall far above the expected regression line, indicating that they have received a higher-than-average level of gene flow from outside. In comparison, the Hui and the Han in Xi'an fall far below the line and are therefore the more isolated populations. The remaining eight populations form one group scattered on either side of, but close to, the regression line, indicating that they received an average level of gene flow within the total region.
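The logic of the regression plot can be sketched in Python as follows. This is an illustrative reading of the Harpending and Ward relation $E(H_i) = H_t(1 - r_{ii})$, not the procedure actually used for Figure 4. Several choices here are assumptions: the centroid is an unweighted mean over populations, $r_{ii}$ is averaged over alleles, $H_t$ is computed from the centroid frequencies, and the function name and input values are hypothetical.

```python
def r_matrix_summary(freqs):
    """Compare observed heterozygosity with the R-matrix expectation.

    freqs[pop][locus] is a list of allele frequencies, with the same locus
    and allele order in every population.
    Returns {pop: (r_ii, observed_H, expected_H)}.
    """
    pops = list(freqs)
    n_loci = len(freqs[pops[0]])
    # Unweighted centroid frequency of every allele (an assumption; the
    # original model can weight populations by size).
    centroid = [
        [sum(freqs[p][l][k] for p in pops) / len(pops)
         for k in range(len(freqs[pops[0]][l]))]
        for l in range(n_loci)
    ]
    # H_t taken as the mean over loci of 1 - sum(p_bar^2) (an assumption).
    h_t = sum(1.0 - sum(q * q for q in locus) for locus in centroid) / n_loci
    summary = {}
    for p in pops:
        terms = []
        for l in range(n_loci):
            for k, q in enumerate(centroid[l]):
                if 0.0 < q < 1.0:  # skip alleles that carry no information
                    terms.append((freqs[p][l][k] - q) ** 2 / (q * (1.0 - q)))
        r_ii = sum(terms) / len(terms)
        h_obs = sum(1.0 - sum(x * x for x in freqs[p][l])
                    for l in range(n_loci)) / n_loci
        summary[p] = (r_ii, h_obs, h_t * (1.0 - r_ii))
    return summary


# Hypothetical two-population, one-locus example (not study data).
example = r_matrix_summary({"A": [[0.6, 0.4]], "B": [[0.4, 0.6]]})
```

A population whose observed heterozygosity exceeds the expected value lies above the regression line, which is the pattern interpreted in the text as higher-than-average gene flow; a population below the line is interpreted as more isolated.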

Mantel tests
The results of the Mantel tests are shown in Table 4 and include correlation and partial correlation for three distance matrices. We performed the analysis for all 13 populations. The correlation between Dgen (genetic distance) and Dgeo (geographic distance) is significant (P = 0.002), with a correlation coefficient of 0.4769 and a 21.58% variance. In contrast, the correlation between Dgen and Dlan (linguistic distance) is not significant, with a low correlation coefficient and variance. When linguistic distance is held constant, the partial correlation coefficient for genetics and geography is 0.4516, with high statistical significance (P = 0.004); conversely, the correlation coefficient for genetics and language is not significant, with a P value of 0.096 and a correlation coefficient of 0.1230.
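As an illustration of how such correlations are obtained, the following Python sketch implements a simple one-sided Mantel permutation test and the usual first-order partial correlation formula. It is not the software used for Table 4, which the text does not name; the function names, the number of permutations, the random seed and the toy matrices are all assumptions.

```python
import random
from math import sqrt


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel(a, b, permutations=999, seed=0):
    """Return (r, p) for distance matrices a and b (one-sided test)."""
    rng = random.Random(seed)
    n = len(a)
    b_vec = upper_triangle(b)
    r_obs = pearson(upper_triangle(a), b_vec)
    hits = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns of a together to keep it a valid matrix.
        permuted = [[a[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), b_vec) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy matrices for four hypothetical populations (not study data).
d_gen = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
d_geo = [[10 * v for v in row] for row in d_gen]
r_value, p_value = mantel(d_gen, d_geo)
```

Rows and columns are permuted jointly because the entries of a distance matrix are not independent observations; shuffling individual cells would overstate significance.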

Discussion
The Northwest region of China was the starting point of the ancient ''Silk Road'' and served to link Central China in the East to Central Asia, South Asia and even Europe in the West. According to historical records, cultural and commercial communication between East and West was frequent in this region. Moreover, inter-population marriage and genetic exchange among the different populations were very common. It is now quite clear that the Uyghur, Uzbek, Kazakh and Kirghiz populations, which originally lived in Central Asia, migrated into the Sinkiang Uighur Autonomous Region in China in approximately the 5th century A.D. This is supported by the genetic distances we calculated using the 13 Northwest Chinese populations and 4 other world populations (Table S1) [35][36][37]. The 4 minority populations from Sinkiang appear closer to the Turkish or Caucasian American populations and more distant from the Japanese.
There is still no definitive answer regarding the origin of the five ethnic populations living in the Gansu and Qinghai Provinces. Historical records support two hypotheses about the origin of the Tu population. The first and more popular hypothesis proposes that their ancestors were actually from Liaoning Province in the East and that they later migrated into Qinghai and Gansu Provinces in the early 4th century and inter-married with local Mongolian, Tibetan and Han populations [3,4,38]. The second hypothesis considers the Tu population to be descendants of 13th century Mongolian soldiers and women from local nomadic groups [2,3,39]. Although controversy exists, there is no doubt that extensive genetic admixture once occurred in the history of the Tu population. The ancestors of the Baoan population are most likely Mongolians who arrived with the Turkistan soldiers after the 13th century. These people first reclaimed land and grazed their cattle in the ''Tongren region'' in Qinghai Province and gradually formed a new ethnic population after long-term fusion and inter-marriage with the local Hui, Dongxiang, Salar, Tibetan and Han populations [2][3][4][34]. The origin of the Yugur population may date back to the ancient ''Uighur population'' that established the ''Uighur Kingdom'' in 745 AD, covering the grasslands south of Lake Baikal, north of Yinshan Mountain, west of Khingan Mountains and east of Altai Mountain [39]. There is also controversy concerning the origin of the Dongxiang population. One hypothesis suggests that the ancient Hui population living in Dongxiang, together with local Mongolian, Han and Tibetan populations, inter-married and formed the current Dongxiang population. A second hypothesis argues for a Mongolian origin [2][3][4]. The Salar population is derived from the Ogus group from the Western Turkic State, which first lived in China and later migrated to the central Asian region. In the 13th century, the Ogus migrated through the Samarkand region to east Qinghai Province, where they settled. They gradually adapted to the new environment and inter-married with local Han, Tibetan, Hui and Mongolian populations, finally forming the current Salar population [2][3][4].
The polymorphisms of the nine selected autosomal microsatellite markers have been reported in different populations of the world [40]. Their application as CODIS markers for personal discrimination and human identification has been evaluated, and it was demonstrated that the use of these forensically accepted loci with high heterozygosity and allele numbers is feasible for the study of population differentiation and admixture. The principal component analysis extracted several PCs as new variables by the dimension reduction method, which can be used to determine the features of and basic reasons for population differentiation. Complementary phylogenetic trees constructed from specific genetic distances are ideal tools to deduce the evolutionary relationships and origins of different populations [41]. We applied these two major statistical approaches to the datasets and found that minorities that live in Sinkiang Uighur Autonomous Region tend to be more differentiated than other populations. Two major elements should be taken into consideration when drawing any conclusions regarding the patterns of gene flow for the 13 populations. One is the demographic size of the population, and the other is the equilibrium of genetic drift and population migration [42]. Therefore, the finding from the R-matrix analysis that the Han in Xi'an and the Hui received a below-average level of gene flow might be a result of their large demographic size. In other words, the effect of marriages between individuals from these two populations and members of other populations would be diluted, especially because the majority of marriages were within the population.
Populations from different continents that are geographically close are also more similar genetically than predicted by the simple hypothesis that they are from their respective continents [43,44]. Recent studies have analyzed the origin and evolutionary relationship of different major world populations and have attempted to explain the genetic variance by geographic and linguistic characteristics using large scale genetic markers. Most of these published papers considered geography to be the main factor and argued that language exerted a secondary but detectable effect [32,33,44,45]. For populations that are geographically close, genetic and geographic distances are often highly correlated. Qasim Ayub et al. [44] suggested that the genetic relationships of the 19 extant human populations around the world, as ascertained by 182 microsatellites, are dictated primarily by geographic proximity, with R = 0.484 (p = 0.05). In a subsequent paper, Elise M. S. Belle et al. [33] pointed out that the genetic differences of 52 world-wide populations, indicated by 377 microsatellites, appear to more closely reflect geographic differentiation, although linguistic differences also have a detectable effect on DNA diversity. The present article is the first to quantify the contributions of geography and language to the genetic differentiation of the populations living in Northwest China, avoiding purely subjective conclusions. Partial correlation from the Mantel test for 13 independent populations suggested that geographical differences have a significant influence on genetic differentiation (partial correlation r = 0.4516, P = 0.0004). Language distance makes a smaller additional contribution that does not reach statistical significance (correlation coefficient 0.123, P = 0.096).
Table 3. Pairwise $D_A$ distance among the 13 populations.
In our current analysis, we have considered only the geographic coordinates for calculating the geographical distance but have not included the complicated terrain of the Northwest, which is characterized by mountains, deserts and plateaus. In the future, we will establish a more complex and precise mathematical model to quantify the geographic isolation.
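To make concrete what a coordinates-only distance involves, the sketch below computes great-circle distances with the haversine formula and assembles them into a pairwise matrix. It is a hedged illustration rather than the authors' procedure: the Earth radius constant, the function names and the coordinates are assumptions, and terrain is ignored entirely.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def distance_matrix(coords):
    """Pairwise distance matrix for {name: (lat, lon)}, in insertion order."""
    names = list(coords)
    return [[great_circle_km(*coords[a], *coords[b]) for b in names]
            for a in names]


# Hypothetical coordinates for illustration only, not the sampling sites.
sites = {"site_1": (34.0, 109.0), "site_2": (43.0, 87.0), "site_3": (36.0, 102.0)}
d_geo = distance_matrix(sites)
```

A terrain-aware model of the kind proposed here would replace `great_circle_km` with a cost that accounts for mountains, deserts and plateaus, while the matrix construction could stay the same.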
In conclusion, our results demonstrate that high-level admixture does exist in the Northwest region of China, which was part of the ancient Silk Road. However, the populations living in Sinkiang Uighur Autonomous Region in northern China, which include the Uyghur, Kazakh, Uzbek and Kirghiz, are closely clustered but quite distant from the other populations living in Qinghai and Gansu and from the subpopulations of the Han, Hui and Mongol. These findings reveal that geographic isolation plays a significant role in population differentiation, whereas language differences exert a much smaller influence.