Y-Chromosome Evidence for Common Ancestry of Three Chinese Populations with a High Risk of Esophageal Cancer

High rates of esophageal cancer (EC) are found in people of the Henan Taihang Mountain, Fujian Minnan, and Chaoshan regions of China. Historical records describe great waves of populations migrating from north-central China (the Henan and Shanxi Hans) through coastal Fujian Province to the Chaoshan plain. Although these regions are geographically distant, we hypothesized that EC high-risk populations in these three areas could share a common ancestry. Accordingly, we used 16 East Asian-specific Y-chromosome biallelic markers (single nucleotide polymorphisms; Y-SNPs) and six Y-chromosome short tandem repeat (Y-STR) loci to infer the origin of the EC high-risk Chaoshan population (CSP) and the genetic relationship between the CSP and the EC high-risk Henan Taihang Mountain population (HTMP) and Fujian population (FJP). The predominant haplogroups in these three populations are O3*, O3e*, and O3e1, with no significant difference between the populations in the frequency of these genotypes. Frequency distribution and principal component analysis revealed that the CSP is closely related to the HTMP and FJP, even though the former is geographically nearer to other populations (Guangfu and Hakka clans). The FJP is between the CSP and HTMP in the principal component plot. The CSP, FJP and HTMP are more closely related to Chinese Hans than to minorities, except Manchu Chinese, and are descendants of Sino-Tibetans, not Baiyues. Correlation analysis, hierarchical clustering analysis, and phylogenetic analysis (neighbor-joining tree) all support close genetic relatedness among the CSP, FJP and HTMP. The network for haplogroup O3 (including O3*, O3e* and O3e1) showed that the HTMP have highest STR haplotype diversity, suggesting that the HTMP may be a progenitor population for the CSP and FJP. 
These findings support the potentially important role of shared ancestry in understanding more about the genetic susceptibility in EC etiology in high-risk populations and have implications for determining the molecular basis of this disease.


Introduction
The non-recombining portion of the Y chromosome (NRY) has unique characteristics, including paternal inheritance, absence of recombination at meiosis, and a relatively low probability of recurrent mutations, which endow it with population- and area-specific polymorphisms. The NRY is therefore particularly useful for the study of human evolution and population genetics. Two types of polymorphisms exist on the NRY: single nucleotide polymorphisms (SNPs) and short tandem repeat (STR) loci, each with different mutation rates and mechanisms. Accordingly, combined analysis using these two types of polymorphic markers increases the power of the NRY for tracing human evolution and migration across different geographic locales and time scales, and could therefore also be effective in depicting the paternal structures of populations.
Indeed, in 1999, Su et al. [1] ascertained 17 Y-chromosome haplogroups based on 19 East Asian-specific biallelic markers that reveal the paternal structures of populations in East Asia. By investigating 63 population samples from Asia, Africa, America, Europe, and Oceania, the authors found that southeast populations in East Asia are more genetically diverse than are northern ones and suggested that East Asians originated from the south. On the basis of Y-SNP and Y-STR variance, they concluded that the initial settlement of modern humans in East Asia occurred about 18,000-60,000 years ago [1].
Esophageal cancer (EC) is one of the most common fatal cancers worldwide [2] and has high incidences in some geographical regions. In China, most EC patients live in the so-called "EC belt," which stretches from central China westward through Central Asia to northern Iran [3][4]. The best-known region for high EC risk is the north-central Henan Taihang Mountain area (the HTM population [HTMP]), situated among the Henan, Hebei, and Shanxi Provinces (Fig. 1). Less well-known regions are the southeastern littoral Chaoshan Plain in Guangdong Province (the CS population [CSP]) and the Minnan area of Fujian Province (the FJ population [FJP]). Although the two latter provinces are relatively geographically isolated from the interior of China, they are still considered to reside in the EC belt, and evidence exists for a high EC risk in these areas [5][6]. For example, in Nanao Island, a relatively isolated district within the CSP, the annual average crude incidence of EC was 103.98/100,000 people from 1995 to 2004; the age-standardized incidence rates for EC were 72.150/100,000 males and 26.64/100,000 females, with a significantly increased incidence for males, but not females, between 1995 and 2004 [5]. For the FJP, 18 counties reported a mortality rate for EC greater than 30/100,000 people [7], which is at least twice the national average [8]. The CSP, FJP and HTMP all belong to Northeast Asian groups.
The geographies of south-littoral (Chaoshan and Fujian areas) and north-central China (Henan and Shanxi) are distinct, but populations within these regions share a high risk of EC [9]. According to historical records [10][11], Han inhabitants of north-central China (Henan and Shanxi Hans) continuously migrated into the Chaoshan area through Fujian Province to escape war and famine, and gradually became the predominant inhabitants of Chaoshan. Moreover, familial aggregation of EC has been observed in north-central China and the Chaoshan area [12][13][14]. These findings support the potentially important role of shared ancestry contributing to the genetic susceptibility for EC in high-risk populations in China.
Therefore, we hypothesized that these three EC high-risk populations may be genetically related. To test our hypothesis of common ancestry, we analyzed 16 East Asian-specific biallelic markers (SNPs) [1,15] and seven Y-STR loci to examine the NRY structure in the EC high-risk CSP, FJP and HTMP. Indeed, haplogroup frequencies and principal component, correlation, hierarchical clustering, phylogenetic (neighbor-joining tree) and network analyses all support the close genetic relatedness of the CSP, FJP and HTMP. Ascertaining the genetic background of EC patients can help clarify the molecular genetic mechanisms of esophageal carcinogenesis, improve the evaluation of risk factors in individuals from high-risk populations, and promote the implementation of effective screening and preventive measures.

Distribution of NRY haplogroups in the three EC high-risk populations in China
The haplogroup frequencies of the three EC high-risk populations were based on Y-SNP typing. As shown in Table 1, 10 Y-chromosome haplogroups were identified. The Y-chromosome haplogroups of the three studied populations mainly cluster around O3*, O3e*, and O3e1, which are the characteristic haplogroups for Northern East Asians; the overall frequencies were 65.16%, 66.21% and 60.42% for the CSP, FJP and HTMP, respectively, and did not significantly differ among the populations (two-sided χ² = 4.213, p = 0.122). These results provide evidence for genetic affinity between these three EC high-risk populations. O*, O1*, O2a*, and O2a1, the four common haplogroups in the southern East Asians, were more frequent in the CSP and HTMP than in the FJP (two-sided χ² = 8.355, p = 0.015). The C*, D1, and F* haplogroups were more common in the northern than in the southern group, and their combined frequencies were significantly lower in the CSP than in the other two populations (two-sided χ² = 11.327, p = 0.003). These results support the hypothesis of gene flow between the HTMP, FJP, and CSP.
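As a consistency check on the chi-square statistics quoted above, the following minimal Python sketch converts each statistic into a p-value. It assumes 2 degrees of freedom (three populations compared on a two-category split of haplogroups); the text does not state the degrees of freedom, so this is an assumption. For 2 degrees of freedom the upper-tail probability has the closed form p = exp(-χ²/2). The function name is illustrative only.

```python
import math


def chi2_p_value_df2(chi2_stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 2 degrees of freedom.

    Assumption: df = 2 (3 populations x 2 haplogroup categories); the source
    text does not state the degrees of freedom explicitly.
    """
    if chi2_stat < 0:
        raise ValueError("chi-square statistic must be non-negative")
    # For df = 2 the chi-square survival function is exactly exp(-x / 2).
    return math.exp(-chi2_stat / 2.0)


# Chi-square statistics quoted in the text.
reported_statistics = [4.213, 8.355, 11.327]
p_values = [round(chi2_p_value_df2(x), 3) for x in reported_statistics]
print(p_values)
```

Under this assumption the rounded values agree with the p-values given in the text (0.122, 0.015 and 0.003), which suggests that each comparison was made on a 3 x 2 contingency table.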

Correlation analysis reveals positive associations between the three EC high-risk populations and Chinese Hans
To further elucidate the relationship between populations, we performed correlation analysis based on Y-SNP haplogroup and Y-STR haplotype frequencies (Tables 2 and 3, respectively). Significant positive correlation existed among the three EC high-risk populations. The correlation coefficient between the EC high-risk CSP and FJP was higher than that between the CSP and the EC high-risk HTMP. Moreover, Table 2 shows a positive correlation between the HTMP and most of the Chinese Hans, a positive correlation between the FJP and only the Liaoning Han, and a positive correlation between the CSP and the Hakka, Hebei, and Hunan Hans. Because of poor communication and transportation in the past, the Chaoshan and Fujian areas may have become relatively closed societies, with less gene flow with other populations; because the EC high-risk HTMP is located in central China, this population may have had more opportunity to communicate with other Chinese Hans. Table 3 shows a positive correlation between the three high-risk populations and the Chinese Hans. Unexpectedly, the data show a positive correlation between the three studied populations and the Manchu, which is a northern minority population. No significant correlation was found between other populations and the three EC high-risk populations. The discrepancy between data in Tables 2 and 3 may be related to the compared populations and the different characteristics of STRs and SNPs. SNPs are characterized by a low mutation rate and low probabilities of back and parallel mutation, making SNP analysis suitable for tracing early demographic events in human history. The mutation rate of STRs is much higher than that of SNPs and, therefore, STRs are suitable for investigating details of demographic events occurring on a more recent time scale.
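The correlation analysis described above can be illustrated with a minimal Python sketch that computes a correlation coefficient between two haplogroup frequency profiles. The text does not name the coefficient used, so Pearson's r is an assumption here, and the frequency vectors are invented for illustration only; they are not the values in Tables 2 or 3.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length frequency profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical haplogroup frequencies (one value per haplogroup); NOT study data.
pop_a = [0.20, 0.30, 0.15, 0.05, 0.30]
pop_b = [0.22, 0.28, 0.16, 0.06, 0.28]
pop_c = [0.05, 0.10, 0.40, 0.35, 0.10]

r_ab = pearson_r(pop_a, pop_b)  # similar profiles: strongly positive
r_ac = pearson_r(pop_a, pop_c)  # dissimilar profiles: negative
print(round(r_ab, 3), round(r_ac, 3))
```

In such an analysis each pair of populations yields one coefficient, and only coefficients that pass a significance test would be reported as positive correlations.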

The three EC high-risk populations cluster in the Sino-Tibetan language family
To delineate the genetic relationship among the three EC high-risk populations and populations based on other language families, we performed a further principal component analysis using data provided by the State Key Laboratory of Genetic Engineering and Center for Anthropological Studies (School of Life Sciences, Fudan University), which include Y-SNP data for 64 Chinese populations belonging to five language families: Sino-Tibetan, Hmong-Mien, Daic (Baiyue), Austronesian, and Austroasiatic.
The Y-chromosome haplogroup profiles identified in these populations were treated as input vectors for PC analysis. The cumulative contribution of PC1 and PC2 accounted for 54.85% of the total variance. A PC dot plot was drawn with values for PC1 and PC2 as the X and Y axes, respectively. As shown in Fig. 4, PC1 values are associated with a tighter clustering of populations belonging to the Sino-Tibetan language family; high PC2 values are associated with a more scattered distribution of Sino-Tibetan populations. The three EC high-risk populations are clustered in the rightmost part of the PC plot, among the most intense distribution of Sino-Tibetans.
Correlation analysis was carried out to seek the origin of each PC (Table 4). PC1 showed a significant negative correlation with PC2 (r = -0.44, P < 0.001). Analysis of PC1 showed that the number of negatively correlated haplogroups, despite their weak correlations, was larger than that of the positive groups. O3e* is the only haplogroup with a significantly positive correlation with PC1 (r = 0.61, P < 0.05). For PC2, the number of positively and negatively correlated haplogroups was similar. O2a* and O1 represent the southern aboriginal haplogroups in the East Asian population and O3e* is probably a northern haplogroup. As shown in Table 4, the distribution of O3e*, O2a*, and O1 is opposite for PC1 and PC2. O3e* was positively correlated with PC1, but O2a* and O1 were negatively correlated. O2a* and O1 were positively correlated with PC2, but O3e* was negatively correlated. This finding implies that these three haplogroups are the main components of PC1 and PC2. Of note, analysis of only haplogroup O3e* was consistent with the distribution of Sino-Tibetan populations shown in Fig. 4. Furthermore, the three EC high-risk populations are almost at the peak value of PC1 and cluster with some Sino-Tibetan populations, suggesting that the EC high-risk populations are all typical Sino-Tibetan populations.

Hierarchical cluster analysis reveals isolation of the EC high-risk cluster from other populations
To further elucidate the affinity among the three EC high-risk populations and other Chinese populations, hierarchical cluster analysis was carried out with average linkage (between groups) based on Y-SNP data. To illustrate the cluster of the three high-risk populations and its relationship with other Han, Baiyue, and Hmong-Mien groups, we included 20 Chinese Han populations [16-18] (data for the Liaoning, Guangzhou and Guangxi Hans were provided by the State Key Laboratory of Genetic Engineering and Center for Anthropological Studies, School of Life Sciences, Fudan University), two Hmong-Mien populations (She [19] and Yao [1]) and two Baiyue populations (Zhuang and Dong) [1] (Fig. 5). Details of the comparison data are shown in Table 5. Hierarchical clustering analysis elucidates the genetic distance among populations. In Fig. 5, the EC high-risk CSP is genetically close to the EC high-risk FJP and HTMP. The genetic distance between the FJP and CSP is the shortest. Moreover, the three are isolated from other populations, which implies a particular migration event in ancient times.
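The average-linkage (between-groups) procedure named above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the population labels and pairwise distances are hypothetical, and the distance measure applied to the Y-SNP data is not specified in the text.

```python
from itertools import combinations


def average_linkage(names, dist):
    """Agglomerative clustering with average (between-groups) linkage.

    names: list of population labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the merges in order, as (cluster_1, cluster_2, distance) tuples.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Mean of all pairwise distances between members of c1 and c2.
            total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
            d = total / (len(c1) * len(c2))
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 + c2)
        merges.append((c1, c2, d))
    return merges


# Hypothetical pairwise distances between four populations; NOT study data.
labels = ["P1", "P2", "P3", "P4"]
distances = {
    frozenset(("P1", "P2")): 0.10,
    frozenset(("P1", "P3")): 0.30,
    frozenset(("P2", "P3")): 0.25,
    frozenset(("P1", "P4")): 0.90,
    frozenset(("P2", "P4")): 0.80,
    frozenset(("P3", "P4")): 0.85,
}
merges = average_linkage(labels, distances)
for c1, c2, d in merges:
    print(c1, c2, round(d, 3))
```

In this toy example P1 and P2 merge first, P3 joins them next, and P4 joins last at a much larger distance, which is the pattern a dendrogram would show for a tight cluster that is isolated from an outlying population.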

Genetic distance analysis and construction of a phylogenetic tree
To further investigate the genetic relationships between the three EC high-risk populations, Rst distances between pairs of populations were calculated on the basis of seven Y-STRs using Arlequin 3.1 software. Six additional Chinese populations were included in this analysis: Zhejiang [20], Henan [21], Dongbei [22], Tianjin [23], and Hunan Hans [24], and Tibetan [25], all of which belong to the Sino-Tibetan language family, as do the three EC high-risk populations. From the Rst distance matrix, an unrooted neighbor-joining tree was constructed using MEGA 2.1 software (Fig. 6). The EC high-risk CSP was closely related to the EC high-risk FJP and HTMP. All three are closer to Chinese Hans than to the Tibetan population. The Hunan, Tianjin, Dongbei, and Henan Hans are grouped together.

Network analysis of Y-STR haplogroups of the three EC high-risk populations
The haplogroup with the highest frequency shared by the CSP, FJP and HTMP was haplogroup O3. A network for haplogroup O3 was therefore constructed using Network 4.516 software (www.fluxus-engineering.com), based on all haplogroup O3 individuals (including O3*, O3e* and O3e1*), to analyze the relationship among the CSP, FJP and HTMP. As shown in Fig. 7, 29 individuals from STR frequency of the CSP is slightly higher than that of the FJP, which may be due to the geographical proximity of these two areas, and more frequent gene flow between them.

Discussion
Our hypothesis for this study, that EC high-risk populations in the Henan Taihang Mountain, Fujian Minnan, and Chaoshan regions of China could share a common ancestry, was based on historical records of migration across China. In southern China (Guangdong district) the Baiyue populations formed the earliest settlement in modern history. Before 2200 BC, one branch of the Baiyue population, the Minyue, was the main group living in the Chaoshan littoral areas. The north-to-south strategic expansion started by Emperor Qin Shi Huang initiated large southward migrations of central China Hans from 214 BC onward [26]. During the Han Dynasty (206 BC-220 AD), three waves of large-scale migrations into southern China resulted in a decrease in the native population in this area. As recorded in pedigrees and ancient inscriptions during the Northern Song Dynasty (960-1127 AD), large numbers of southern Fujian people, especially from Quanzhou and Putian, settled in the Chaoshan area [27]. Gradually, over a period of 2000 years, this became the major population in the Chaoshan region (called the Helao or Fulao peoples), coming largely from Henan and Shanxi via Fujian with well-maintained language and customs from north-central China. Because of geographic isolation and the historical difficulty in traveling, the Helao/Fulao became a relatively isolated population. Currently, most Fujian and Chaoshan populations believe they are descendants of north-central China Hans, which is supported by genealogical records, stone tablets, and archeological discoveries [28]. Our team has collected and analyzed 40 genealogies of different surnames in Chaoshan areas. Nearly all the ancestors in these 40 genealogies came from north-central China. Most of them first settled in Fujian, then migrated to the Chaoshan area (unpublished data). This result is consistent with the historical record and our hypothesis. The incidence and mortality rates for EC are very high in the CSP, FJP, and HTMP areas.
We propose that the Chaoshan littoral region is an EC high-risk area because of the genetic background of the CSP and FJP shared with those in north-central China. The ancestors of the EC high-risk CSP may have derived from the EC high-risk HTMP via the FJP. This study provides genetic evidence to support this hypothesis. In general, populations sharing similar patterns of haplogroup distribution are likely to have a relatively close genetic relationship. In this study, the three EC high-risk populations resemble one another in distribution of haplogroups O3*, O3e* and O3e1 (Table 1), which suggests they are relatively closely related. PC and correlation analyses were also carried out to further verify the genetic relatedness among the three EC high-risk populations and other groups. As shown in the PC analysis, the paternal structure for the EC high-risk CSP differs from that for the Guangzhou and Hakka Hans, although they are in geographic proximity and all consider themselves descendants of north-central China Hans. In contrast, the EC high-risk CSP closely clustered with the EC high-risk FJP and HTMP, although the HTMP and CSP are geographically disjunct. In addition, correlation analysis based on the Y-SNP haplogroup and Y-STR haplotype frequencies revealed a positive correlation among the three EC high-risk populations, which further supports their close genetic affinity.
As historically recorded 2 millennia ago, southern China was originally inhabited by the southern natives, including those speaking Daic (Baiyue), Austroasiatic, and Hmong-Mien languages [29-30]. Hence, we included Daic, Hmong-Mien, Sino-Tibetan, Austronesian, and Austroasiatic populations in the PC analysis for comparison. The results in Fig. 4 and the correlation analysis (Table 4) reveal that the EC high-risk CSP is related to Sino-Tibetans, not Baiyues, which is consistent with the migration history of the CSP. The results of hierarchical clustering analysis further support the close genetic affinity among the three EC high-risk populations. Furthermore, our results also revealed that these three populations are closer to the Yunnan Han and to two Baiyue populations (Zhuang and Dong) than to other Chinese Hans, which suggests gene flow among them. The paternal genetic structure of the three EC high-risk populations is distinct from those of other Chinese Hans. The phylogenetic affinity among the three studied populations was further revealed in a neighbor-joining tree (Fig. 6). The network for haplogroup O3 (including O3*, O3e* and O3e1) showed that the HTMP has a higher STR haplotype diversity than the CSP and FJP, while sharing some STR mutations with the CSP and FJP, suggesting a close relationship among them and further suggesting that the HTMP may be a progenitor population for the CSP and FJP. Taken together, these findings support the hypothesis that the EC high-risk CSP shares common genetic traits with the EC high-risk FJP and HTMP, and that they may share a recent common ancestor. Unexpectedly, we found a positive correlation between the three EC populations and the Manchu, a northern minority population. The reason for this affinity is unknown but suggests gene flow between these groups.
Although we used two types of Y-chromosome polymorphic markers and demonstrated consistent results from multiple analyses, populations from other EC high-risk areas were not included in this study, so we cannot ascertain whether all the EC high-risk populations share a common genetic background. To further explore this aspect, a large-scale study of EC high-risk populations is necessary.
In summary, the patrilineal genetic structure of the EC high-risk CSP, HTMP, and FJP sheds light on the origin of the genetic background of the EC high-risk CSP. The three EC high-risk populations in this study appear to share a similar patrilineal genetic background that may explain, at least in part, the high incidence of EC in these areas in China. The extent to which other factors such as environment and customs may contribute to the high incidence of EC remains to be explored.

Sample collection and DNA extraction
Blood samples of 211 unrelated healthy males were collected from the three EC high-risk areas in China during 2002 to 2004; 89 samples were from the Chaoshan area, 48 from the Henan Taihang Mountain area and 74 from the Fujian area (Minnan area). All individuals gave their informed consent before being included in the study. The study was approved by the ethical review committees of the Medical College of Shantou University. Genomic DNA was extracted from whole blood by standard phenol/chloroform methods [31]. DNA samples were stored at -20°C after extraction.

Genotyping of Y chromosome SNPs and STRs
Three strategies were used to type Y-SNPs and Y-STRs. SNPs without length changes (base substitutions) were genotyped by a PCR-based restriction fragment length polymorphism (PCR-RFLP) method (primer information, restriction enzymes, pattern of polymorphism, and PCR conditions are in Table 6) [17,32].
SNPs with length variation (e.g., deletion or insertion) were typed by fluorescence PCR (primer information and PCR conditions are in Table 7), and fluorescent-labeled extension products were electrophoresed on a 3100 Genetic Analyzer (ABI company, USA). Y-STRs were also analyzed by this method (Table 7). The M1 polymorphism (Alu insertion, also called YAP; see Table 6) was analyzed by agarose gel electrophoresis directly after PCR [33]. All primers were synthesized by Sangon Co. (Shanghai, China). Restriction endonucleases were purchased from New England Biolabs, USA. Y-SNP haplogroup assignments were based on the typing results. The phylogenetic diagram of 17 haplogroups defined by 16 Y-SNPs is in Fig. 8 [34].

Data analysis
The Y-SNP haplogroup of every individual was defined according to the genotyping results and Fig. 8. Each Y-chromosome haplogroup can be considered a monophyletic clade in the phylogenetic tree (i.e., a set of haplogroups comprising all descendants of their most recent common ancestor, inferred from the shared mutations). For example, O3*, O3e*, and O3e1* all share a T→C mutation at locus M122. O3* is the ancestral haplogroup of the M122C alleles, whereas O3e* and O3e1* are the two derived haplogroups with additional mutations, M134 and M117, respectively. Haplogroup frequencies were calculated and compared among the three EC high-risk populations. Chi-square tests were performed to evaluate the differences in haplogroup frequency among populations.
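The O3 example above can be expressed as a small decision rule. The Python sketch below is an illustration only: the function name and the boolean encoding of marker states are hypothetical, and it assumes that M117 lies downstream of M134 in the tree (as the nested names O3e* and O3e1* suggest), so that an M117-derived sample must also be derived at M134.

```python
def assign_o3_haplogroup(m122_derived: bool, m134_derived: bool, m117_derived: bool) -> str:
    """Assign a sample to O3*, O3e* or O3e1* from three Y-SNP marker states.

    Each argument is True when the sample carries the derived allele.
    Assumption: the markers are nested as M122 > M134 > M117.
    """
    if not m122_derived:
        if m134_derived or m117_derived:
            raise ValueError("derived at M134/M117 but ancestral at M122")
        return "not O3"
    if m117_derived:
        if not m134_derived:
            raise ValueError("derived at M117 but ancestral at M134")
        return "O3e1*"
    if m134_derived:
        return "O3e*"
    # Derived at M122 only: the paragroup without further typed mutations.
    return "O3*"


# Example marker states (hypothetical samples).
print(assign_o3_haplogroup(True, False, False))
print(assign_o3_haplogroup(True, True, False))
print(assign_o3_haplogroup(True, True, True))
```

Raising an error on inconsistent marker combinations is a design choice made here to flag likely typing errors rather than silently assigning a haplogroup.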