Forensic characterization and genetic polymorphisms of 19 X-chromosomal STRs in 1344 Han Chinese individuals and comprehensive population relationship analyses among 20 Chinese groups

X-chromosomal short tandem repeats (X-STRs) can help resolve complex forensic kinship cases and complement autosomal and Y-chromosomal STRs in routine forensic practice and population genetics. In the present study, we investigated the allele/haplotype diversity and forensic genetic characteristics of 19 X-STRs in 206 Guizhou Han and 1344 Meta-Han Chinese individuals using the AGCU X19 PCR amplification system. Population relationships within five Han Chinese population groups (1344 individuals), between Guizhou Han and 19 other Chinese reference populations belonging to four language families (5074 individuals), and between Meta-Han Chinese and 15 minority groups (3730 individuals) were analyzed using Reynolds's, Nei's and Fst genetic distances, principal component analysis (PCA), multidimensional scaling (MDS), Structure and a Neighbor-Joining tree. On the basis of allele frequencies, the combined mean paternity exclusion chance (MEC) in Guizhou Han exceeds 0.99999999453588 in duos and 0.99999999999781 in trios, and the combined power of discrimination (PD) exceeds 0.99999999999980. Consistently high MECs and PDs are observed in the Meta-Han Chinese population based on both the allele diversities of the 19 markers and the haplotype diversities of the seven linkage groups (LGs). DXS10135 and LG1 are the most informative and polymorphic in the Han Chinese group. The comprehensive population comparisons reveal that Han Chinese are a homogeneous population with a genetically closer relationship to Hmong-Mien-speaking groups than to Tibeto-Burman-speaking and Turkic-speaking populations. In summary, the AGCU X19 PCR amplification system is highly polymorphic and informative in Guizhou Han and Han Chinese populations. The comprehensive population data from the 20 Chinese populations analyzed in this study may be used as a Chinese reference frequency database of X-STRs for forensic casework applications.

Introduction
Short tandem repeats (STRs), also known as microsatellites, are composed of repeating 2-6 base pair motifs. With approximately 700,000 loci in the human genome, they are highly variable and play a pivotal role in population genetics, anthropology, genetic genealogy and forensics. Previous studies revealed that STRs are associated with the susceptibility to and morbidity of more than 30 Mendelian hereditary disorders [1] and contribute to the heritability of other complex traits by regulating DNA methylation and gene expression [2][3][4][5]. STRs are highly prone to mutation through the gain or loss of single repeat units during DNA replication and under evolutionary pressures (such as UV exposure, hypoxia, limited food sources and cold in Tibetans) [1,6,7]. This mechanism is called the simple stepwise mutation model (SMM) [1,6]. Accumulating evidence from pedigree and population whole-genome sequencing studies shows that the average mutation rate of an STR locus is approximately 10⁻³ to 10⁻⁴ mutations per generation, exceeding that of point mutations (single nucleotide polymorphisms, about 10⁻⁸) by several orders of magnitude [6,8,9].
X-chromosomal STRs (X-STRs) have a unique pattern of inheritance (a father transmits his single X chromosome to each daughter, and a mother transmits one of her two X chromosomes to each child). They can therefore complement autosomal and Y-chromosomal STRs in forensic identification (predominantly of missing persons and mass disaster victims) and in complex kinship analyses, especially deficiency and incest cases [10]. Recently, the AGCU X19 amplification system (AGCU ScienTech Inc., Wuxi, Jiangsu, China) was designed specifically to bring X-STRs into routine forensic casework. This system is a five-dye multiplex that allows co-amplification and fluorescent detection of 19 loci belonging to seven linkage groups (LGs): DXS10148, DXS10135 and DXS8378 comprise LG1 [11,12]; DXS10159, DXS10162 and DXS10164 comprise LG2 [13]; DXS7132, DXS10079, DXS10074 and DXS10075 comprise LG3 [14]; DXS6809 and DXS6789 comprise LG4 [15]; DXS7424 and DXS101 comprise LG5 [16]; DXS10103, HPRTB and DXS10101 comprise LG6 [12]; and DXS10134 and DXS7423 comprise LG7 [14]. This new X-chromosomal STR system includes eleven X-STRs from the Investigator Argus X-12 Kit [17] and eight additional newly selected loci [18]. The impact of this new-generation X-chromosomal STR amplification system depends on the construction of forensic reference databases and on its discriminative ability in personal identification and parentage testing.
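The inheritance pattern described above can be illustrated with a minimal sketch. The function names and the allele values are hypothetical and chosen only for illustration, and mutation is ignored.

```python
def daughter_genotypes(father_allele, mother_alleles):
    """Possible genotypes of a daughter at one X-STR locus.

    The father carries a single X chromosome and passes its allele unchanged;
    the mother passes one of her two alleles.
    """
    return {tuple(sorted((father_allele, m))) for m in mother_alleles}


def son_alleles(mother_alleles):
    """A son is hemizygous: his single allele comes from the mother."""
    return set(mother_alleles)


def father_excluded(alleged_father_allele, daughter_alleles):
    """Ignoring mutation, a true father's allele must appear in the daughter."""
    return alleged_father_allele not in daughter_alleles


# Hypothetical alleles (repeat numbers) at a single locus.
print(daughter_genotypes(21, (19, 23)))
print(father_excluded(20, (19, 21)))
```

Because the father's contribution to a daughter is fixed, a single mismatch at a locus is already informative, which is one reason X-STRs are useful in father-daughter and deficiency cases.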
Tremendous progress has been made in exploring the genetic variation and establishing forensic reference databases for the 12 X-STRs included in the Investigator Argus X-12 Kit in China [14,17], whereas forensic data on the 19 X-STRs included in the AGCU X19 kit remain largely underrepresented in ethnically and geographically diverse Chinese populations [19][20][21][22][23][24][25][26][27][28][29].
The Han Chinese trace a common ancestry to the initial Neolithic Huaxia agricultural confederation residing along the Yellow River and exchanged culture and language with non-Han populations as Huaxia culture continuously expanded toward southern China. They number more than 1.3 billion worldwide and 1.282 billion in China (2010 census) [30,31]. China is a state of considerable cultural, linguistic, genetic and phenotypic diversity across its roughly 9.6 million square kilometers of land. There are at least seven language families: Sino-Tibetan, Tai-Kadai, Hmong-Mien, Altaic, Austroasiatic, Indo-European and Austronesian. Guizhou, located in southwestern China, is demographically one of China's most diverse provinces, with Han Chinese, Miao, Yao, Yi and other minority groups. Han Chinese now account for more than 60% of the population in Guizhou and are mostly the descendants of ancient Han soldiers who moved into Guizhou in large numbers during the 8th and 9th centuries in the Tang Dynasty (https://en.wikipedia.org/wiki/Guizhou).

Ethics statements
This study was approved by the Biomedical Research Ethics Committee of Zunyi Medical University (Approval No. (2014)-1-044). All subjects were informed of the purpose of the study and signed informed consent before sample collection. Each subject was confirmed to be of indigenous Han descent, with no intermarriage with minority groups for at least three generations.

Samples, DNA extraction and quantification
Peripheral blood samples were collected from 206 unrelated Han Chinese individuals (104 females and 102 males) residing in Guizhou province, southwest China. We used the PureLink Genomic DNA Mini Kit (Thermo Fisher Scientific) to extract and isolate human genomic DNA, and measured the DNA concentration with the Quantifiler Human DNA Quantification Kit (Thermo Fisher Scientific) on an Applied Biosystems 7500 Real-time PCR System (Thermo Fisher Scientific) according to the manufacturer's protocol. Finally, we diluted the DNA to 2.0 ng/μL and stored it at -20˚C until amplification.

DNA amplification and genotyping
We genotyped the 206 Guizhou Han individuals on the ProFlex 96-Well PCR System (Thermo Fisher Scientific) using the AGCU X19 kit (DXS8378, DXS7423, DXS10148, DXS10159, DXS10134, DXS7424, DXS10164, DXS10162, DXS7132, DXS10079, DXS6789, DXS101, DXS10103, DXS10101, HPRTB, DXS6809, DXS10075, DXS10074 and DXS10135) following the manufacturer's recommendations. The final PCR reaction volume was 10 μL, comprising 4 μL of reaction mix, 0.2 μL of A-Taq DNA polymerase, 0.8 μL of template DNA, 2 μL of primers and 3 μL of sdH2O (sterile deionized H2O). Thermal cycling consisted of an initial denaturation at 95˚C for 2 min; 10 cycles of 94˚C for 30 s, 60˚C for 1 min and 65˚C for 1 min; 20 cycles of 94˚C for 30 s, 59˚C for 1 min and 72˚C for 1 min; a final extension at 60˚C for 30 min; and a hold at 4˚C. Amplified products were separated by capillary electrophoresis on an Applied Biosystems 3130 Genetic Analyzer (Thermo Fisher Scientific, MA, USA) with POP-7 polymer and a 36 cm capillary array. GeneMapper ID-X v.1.4 software (Thermo Fisher Scientific) was used to analyze the electropherograms and assign the genotypes of the 19 X-STRs.
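As a quick consistency check on the reaction setup, the sketch below sums the listed component volumes and derives the template DNA mass per reaction from the 2.0 ng/μL dilution stated in the sample-preparation section. The `master_mix` helper and its 10% pipetting overage are assumptions added for illustration and are not part of the published protocol.

```python
# Volumes (in microliters) per reaction, as listed in the text.
reaction_mix_ul = {
    "reaction mix": 4.0,
    "A-Taq DNA polymerase": 0.2,
    "template DNA": 0.8,
    "primers": 2.0,
    "sdH2O": 3.0,
}
TEMPLATE_CONC_NG_PER_UL = 2.0  # dilution stated in the sample-preparation section

total_volume_ul = sum(reaction_mix_ul.values())
template_mass_ng = reaction_mix_ul["template DNA"] * TEMPLATE_CONC_NG_PER_UL


def master_mix(n_reactions, overage=0.1):
    """Hypothetical helper: scale every non-template component for a batch.

    The 10% default overage is an assumption, not a value from the protocol.
    """
    factor = n_reactions * (1 + overage)
    return {
        name: round(volume * factor, 2)
        for name, volume in reaction_mix_ul.items()
        if name != "template DNA"
    }


print(f"total = {total_volume_ul:.1f} uL, template = {template_mass_ng:.1f} ng")
print(master_mix(10))
```

The components add up to the stated 10 μL, and each reaction therefore contains about 1.6 ng of template DNA.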

Data analysis
We calculated allele frequencies separately in males, females and the pooled samples of the Guizhou Han Chinese population (206 subjects) and the Meta-Han Chinese population (1344 subjects) using the modified PowerStats V1.2 spreadsheet (Promega, Madison, WI, USA). Arlequin software (version 3.5.2) [32] was used to estimate the genetic differentiation between males and females, to calculate the p values for Hardy-Weinberg equilibrium (HWE) and linkage disequilibrium (LD), and to estimate the observed heterozygosity (Ho) and expected heterozygosity (He) in Guizhou females and Meta-Han females. Haplotype frequencies of the seven linkage groups were calculated using the direct count method. The forensic parameters of polymorphism information content (PIC) and mean paternity exclusion chance (MEC) in trios and duos (MEC_Krüger [33], MEC_Kishida [34], MEC_Desmarais [35] and MEC_Desmarais_Duos [35]) were estimated using StatsX (Statistics for X-STR) v2.0 [36] and the ChrX-STR.org 2.0 database (http://www.chrx-str.org/). Gene diversities (GD) of the X-STRs and haplotype diversities (HD) of the seven linkage groups were calculated using Nei's formula [37], as employed in previous Y-chromosomal STR variation analyses [38,39].
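The following minimal sketch shows how pooled X-STR allele frequencies and a Nei-type diversity can be computed by gene counting. It assumes the common unbiased form n/(n-1) × (1 − Σp²), which may differ in detail from the implementation in the software cited above. The function names and toy genotypes are hypothetical. Combining per-locus values as 1 − Π(1 − v) assumes independent loci, which does not hold for loci within one linkage group; there, haplotype frequencies should be used instead.

```python
from collections import Counter


def pooled_allele_frequencies(female_genotypes, male_alleles):
    """Allele frequencies at one X-STR locus by gene counting.

    Each female contributes two alleles and each male contributes one.
    Returns (frequencies, number of X chromosomes sampled).
    """
    counts = Counter()
    for a, b in female_genotypes:
        counts[a] += 1
        counts[b] += 1
    counts.update(male_alleles)
    n = sum(counts.values())
    return {allele: c / n for allele, c in counts.items()}, n


def nei_diversity(freqs, n):
    """Unbiased diversity n/(n-1) * (1 - sum(p_i^2)) for alleles or haplotypes."""
    return n / (n - 1) * (1 - sum(p * p for p in freqs.values()))


def combined_power(values):
    """Combine per-locus values (for example PD) as 1 - prod(1 - v).

    Assumes the loci are independent.
    """
    remaining = 1.0
    for v in values:
        remaining *= 1 - v
    return 1 - remaining


# Toy data: two females and two males typed at one hypothetical locus.
freqs, n = pooled_allele_frequencies([(10, 11), (11, 11)], [10, 12])
print(freqs, n, nei_diversity(freqs, n))
```

The same `nei_diversity` function applies to haplotype diversity if `freqs` holds haplotype frequencies obtained by direct counting and `n` is the number of haplotypes.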

Quality control
The experiment was conducted at the Institute of Forensic Medicine, West China School of Basic Medical Sciences & Forensic Medicine, Sichuan University. Control DNA 9947A and sdH2O included in the AGCU X19 kit were used as the positive and negative controls, respectively, for allele assignment. This laboratory is accredited under ISO/IEC 17025 and by CNAS (China National Accreditation Service for Conformity Assessment). In addition, our experiments followed the recommendations of the Scientific Working Group on DNA Analysis Methods (SWGDAM) [48] and the guidelines for the publication of population data [49] and for X-STR analysis [50].

Allelic diversity and forensic parameters of Guizhou Han Chinese
The Fst and corresponding p values of the 19 X-STRs between females and males in Guizhou Han are presented in S8 Table.

Allelic diversity and forensic parameters of Meta-Han Chinese
The results of our genetic diversity and forensic characteristic analyses of the Meta-Han Chinese population are shown in S13 Table and Table 1.

Haplotype diversity
The 19 X-chromosomal STRs can be grouped into seven linkage groups on the basis of physical distances, previous linkage analyses and population genetic research. The haplotype distributions of the Guizhou Han population are presented in S18 Table; a total of 16 haplotypes in LG7 are unique (S20 Table).

Intra-population genetic differentiation among Han Chinese
To explore the genetic homogeneity and heterogeneity among Han Chinese populations from different administrative divisions, we calculated Nei's genetic distances between Guizhou Han and four other Han Chinese populations (S22 Table). Guizhou Han is genetically close to Guanzhong Han (Nei's genetic distance: 0.0104) and relatively distinct from Hainan Han, the southernmost Han Chinese population (0.0236). Population differentiation within the five Han subpopulations was further dissected and visualized using principal component analysis, a multidimensional scaling plot, Structure and a Neighbor-Joining tree (Fig 2). In the two-dimensional PCA plot constructed from PC1 (45.08%) and PC2 (23.86%), Guanzhong Han is located in the first quadrant near the X axis, and Sichuan Han and South Han are located in the second quadrant near the Y axis. The remaining two groups are located in the third quadrant (Hainan Han) and the fourth quadrant (Guizhou Han), respectively (Fig 2A). Consistent population distribution patterns are observed in the MDS plot based on the pairwise Nei's genetic distances (Fig 2B). Phylogenetic reconstruction reveals two genetically close clusters: one comprises Guanzhong Han and Guizhou Han, and the other comprises Hainan, Southern China and Sichuan Han (Fig 2C). No population substructure is identified in the model-based genetic structure analysis (Structure, Fig 2D). Genetic cluster analyses among Han Chinese populations show that Han Chinese are relatively homogeneous, with modest levels of genetic differentiation (average±standard deviation (sd): 0.0136±0.0044).
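For readers who want to reproduce a distance of this kind from allele frequency tables, a minimal sketch follows. It assumes Nei's standard (1972) genetic distance, D = −ln(Jxy / √(Jx·Jy)); the text does not state which of Nei's distance measures or which software was used, so this is an illustration rather than the authors' exact procedure. The function name and the toy frequencies are hypothetical.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's standard (1972) distance between two populations.

    pop_x and pop_y are lists with one dict per locus, each mapping
    allele -> frequency, with loci in the same order in both lists.
    """
    jx = jy = jxy = 0.0
    for fx, fy in zip(pop_x, pop_y):
        alleles = set(fx) | set(fy)
        jx += sum(fx.get(a, 0.0) ** 2 for a in alleles)
        jy += sum(fy.get(a, 0.0) ** 2 for a in alleles)
        jxy += sum(fx.get(a, 0.0) * fy.get(a, 0.0) for a in alleles)
    identity = jxy / math.sqrt(jx * jy)  # Nei's genetic identity I
    return -math.log(identity)


# Toy single-locus example with hypothetical frequencies.
pop_a = [{10: 0.5, 11: 0.5}]
pop_b = [{10: 1.0}]
print(nei_standard_distance(pop_a, pop_b))
```

Identical frequency tables give a distance of zero, and the value grows as the two populations share less of their allele frequency spectrum.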

Inter-population genetic differentiation among Guizhou Han and other 19 Chinese groups
The pairwise Reynolds's genetic distances between Guizhou Han and 19 other neighboring Chinese populations were calculated on the basis of the genetic variation of the 19 X-chromosomal STRs and are listed in S23 Table. The Reynolds's genetic distances range from 0.0022 (between Guanzhong Han and Guizhou Gelao) to 0.0161 (between Hainan Li and Xinjiang Uyghur2), with an average±sd of 0.0080±0.0032. Guizhou Han has its smallest genetic distance (0.0028) from Guanzhong Han and its largest (0.0122) from Xinjiang Uyghur2, with an average±sd of 0.0062±0.0027. The first ten principal components capture a total of 85.236% of the genetic variation among the 20 populations. The PCA plot in Fig 3B clearly separates Hainan Li from the other populations. For further validation, we subsequently constructed the MDS plot and N-J tree on the basis of the Reynolds's genetic distance matrix. Consistent population distribution patterns can be observed in the MDS analysis (Fig 4). In the N-J tree, Uyghurs, Kazakhs and one Wuzhong Hui form the lowermost cluster; Tibetans, Yis and one Ili Xibe form the intermediate cluster; and the remaining Hans, Lis, Miaos, Huis and Gelaos form the upper cluster. Guizhou Han first groups with Guanzhong Han and then with the other Sinitic-speaking population sub-cluster (Fig 5).
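A minimal sketch of a Reynolds-type distance and of the average±sd summary over all population pairs is given below. It assumes a commonly used allele-frequency form of the Reynolds, Weir and Cockerham distance, Σ(p1 − p2)² / (2 Σ(1 − Σ p1·p2)), and the sample standard deviation. The estimator and software actually used by the authors are not stated in this section, so the function names, the formula variant and the toy data are assumptions for illustration only.

```python
from itertools import combinations
from statistics import mean, stdev


def reynolds_distance(pop_x, pop_y):
    """Allele-frequency form of the Reynolds, Weir and Cockerham distance.

    pop_x and pop_y are lists with one dict per locus (allele -> frequency).
    """
    numerator = 0.0
    denominator = 0.0
    for fx, fy in zip(pop_x, pop_y):
        alleles = set(fx) | set(fy)
        numerator += sum((fx.get(a, 0.0) - fy.get(a, 0.0)) ** 2 for a in alleles)
        denominator += 1.0 - sum(fx.get(a, 0.0) * fy.get(a, 0.0) for a in alleles)
    if denominator == 0.0:
        # Both populations are fixed for the same allele at every locus.
        return 0.0
    return numerator / (2.0 * denominator)


def pairwise_summary(populations):
    """Distances for every population pair, plus their mean and sample sd.

    Needs at least three populations so that the sd is defined.
    """
    dists = {
        (a, b): reynolds_distance(populations[a], populations[b])
        for a, b in combinations(sorted(populations), 2)
    }
    values = list(dists.values())
    return dists, mean(values), stdev(values)


# Toy single-locus data for three hypothetical populations.
toy = {
    "A": [{10: 0.5, 11: 0.5}],
    "B": [{10: 1.0}],
    "C": [{10: 0.5, 11: 0.5}],
}
dists, avg, sd = pairwise_summary(toy)
print(dists, avg, sd)
```

The resulting symmetric matrix of pairwise distances is the kind of input that MDS and Neighbor-Joining tree construction operate on.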

Genetic relationship between Meta-Han Chinese and other ethnic groups
Considering the homogeneity within Han Chinese populations, we merged the 1344 genotypes from five different geographical divisions into one group, the Meta-Han Chinese population. Pairwise Reynolds's genetic distances, PCA, MDS and an N-J tree were used to assess and dissect the genetic relationships between the Meta-Han Chinese group and 15 Chinese minority populations. At the national scale, the first ten principal components (25.29%, 18.28%, 10.19%, 7.78%, 7.41%, 5.81%, 5.30%, 4.17%, 3.19% and 2.84%) capture a total of 90.265% of the genetic variation. Fig 6A, constructed on the basis of the first two components, reveals that the Meta-Han Chinese population falls in the same genetic group as the admixture-language-speaking population cluster, suggesting a high level of gene flow between Han Chinese populations and adjacent groups (Hui, Xibe, Yi, Gelao and Miao), which may share a common ancestry. As shown in S24 Table, the smallest pairwise Reynolds's genetic distance is observed between the Meta-Han Chinese population and a Guizhou population. MDS (Fig 6B) and phylogenetic tree reconstruction (Fig 6C) also consistently reveal that the Meta-Han group exhibits a closer affinity to other Sinitic/Tai-Kadai/Hmong-Mien-speaking populations. Collectively, we observe genetic differences between the Meta-Han group and Turkic/Tibeto-Burman-speaking populations, and genetic similarities between Han Chinese and Tai-Kadai/Hmong-Mien-speaking populations.
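The reported cumulative proportion can be checked directly from the ten listed percentages. The short sketch below uses only the values quoted in the text; the variable names are illustrative.

```python
from itertools import accumulate

# Percent of variance explained by the first ten principal components,
# as listed in the text.
explained = [25.29, 18.28, 10.19, 7.78, 7.41, 5.81, 5.30, 4.17, 3.19, 2.84]

# Running total after each component, rounded to two decimals.
cumulative = [round(c, 2) for c in accumulate(explained)]

first_two = cumulative[1]   # variance shown in the two-dimensional plot
first_ten = cumulative[-1]  # total for the ten components
print(first_two, first_ten)
```

Summing the rounded values gives 90.26%, which agrees with the reported 90.265% up to rounding, and the first two components used for Fig 6A together account for 43.57% of the variation.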

Discussion
A clear understanding of the patterns of genetic variation in Han Chinese (the largest ethnic group in China and the world) is important for exploring population origin, migration, evolution and admixture in prehistory and history, and for providing investigative leads and evidence in forensic cases. Although autosomal STRs have been the gold standard in forensic science and much effort has been devoted to the genetic variation of autosomal and Y-chromosomal STRs in diverse populations [51][52][53][54][55][56], X-STRs have begun to draw more attention from forensic scientists with the appearance of the Investigator Argus X-12 and AGCU X19 STR kits. In this study, we first established a reference database of the Han Chinese population, extending previously investigated data with an additional 206 unrelated Han Chinese individuals to a total of 1,344 samples typed with the AGCU X19 kit, to promote the implementation of X-STR typing in routine Chinese forensic practice. Allele and haplotype frequencies and the corresponding forensic parameters, as well as HWE and LD, were analyzed in the Guizhou Han and the comprehensive Han meta-population. Together with our previous studies [19,22,27], the combined MECs and PDs in Guizhou Han and Meta-Han Chinese indicate that the cumulative forensic parameters of the 19 X-STRs are high enough for the identification of complex biological relationships in forensic casework. The 19 X-STR PCR amplification system is discriminatory and informative as a complementary tool to autosomal, Y-chromosomal and mitochondrial genetic markers.
The peopling history of East Asia is complex [57,58]. The comprehensive population comparisons (among Han Chinese populations, between Guizhou Han and 19 populations nationwide, and between Meta-Han and 15 Chinese minorities) reflect linguistic, ethnic, geographical and historical relationships. Our results consistently demonstrate that genetic affinity exists within linguistically, ethnically and geographically related populations. Given the complex origin, migration and admixture of Chinese populations, further studies based on high-coverage whole-genome sequencing of anatomically modern humans and Chinese ancient DNA are needed to improve the understanding of Chinese human evolutionary history, dissect the Chinese population structure and reconstruct the population genetic history.

Conclusion
To implement the 19 X-chromosomal STR PCR amplification system in routine forensic practice, we genotyped 206 Guizhou Han Chinese individuals and combined them with 4868 previously reported genotypes from 19 Chinese populations to extend and establish the reference database of Chinese populations along linguistic divisions. We used Nei's genetic distances, PCA, MDS and an N-J tree to test the genetic homogeneity of the Han Chinese populations from different geographical administrative divisions. Because no significant genetic difference exists among them, we estimated the allele and haplotype frequencies and the forensic parameters in both the Guizhou Han and the Meta-Han Chinese populations. DXS10135 and LG1 are the most informative and polymorphic in the Han Chinese group. The cumulative power of discrimination and mean paternity exclusion chance based on allele and haplotype diversity are high enough to complement autosomal and Y-chromosomal STRs in routine forensic practice (complex kinship cases and individual identification) and in population genetics. Subsequently, we compared Guizhou Han with 19 Chinese reference populations, and the Meta-Han Chinese population with 15 Chinese minority groups, based on allele frequency distributions via pairwise Reynolds's genetic distances, PCA, MDS and an N-J tree. Population comparisons revealed tight grouping within the linguistically close Tibeto-Burman-speaking and Turkic-speaking populations. In addition, Han Chinese are a homogeneous population, and both the Guizhou Han and the Meta-Han Chinese populations are genetically close to Tai-Kadai-speaking and Hmong-Mien-speaking populations. We conclude that the reference databases of the AGCU X19 kit in Han Chinese, Tibeto-Burman-speaking and Turkic-speaking populations are suitable and applicable for Chinese forensic casework.