Patrilineal Background of Esophageal Cancer and Gastric Cardia Cancer Patients in a Chaoshan High-Risk Area in China

The Taihang Mountain range of north-central China, the southern region of Fujian province, and the Chaoshan plain of Guangdong province are 3 major regions in China well known for their high incidence of esophageal cancer (EC). These areas also exhibit high incidences of gastric cardia cancer (GCC). The ancestors of the Chaoshanese, now the major inhabitants of the Chaoshan plain, were from north-central China. We hypothesized that EC and GCC patients in Chaoshan areas share a common ancestry with Taihang Mountain patients. We analyzed 16 East Asian-specific Y-chromosome biallelic markers (single nucleotide polymorphisms; Y-SNPs) and 6 Y-chromosome short tandem repeat (Y-STR) loci in 72 EC and 48 GCC patients from Chaoshan and 49 EC and 63 GCC patients from the Taihang Mountain range. We also compared data for 32 Chaoshan Hakka people and 24 members of the aboriginal She minority who live near the Chaoshan area. Y-SNP data were analyzed by frequency distribution and by principal component, correlation, and hierarchical cluster analyses. Chaoshan patients were closely related to Taihang Mountain patients, even though they are geographically distant. Y-STR analysis revealed that the 4 patient groups were more closely related to each other than to other groups. Network analysis of the haplogroup O3a3c1-M117 showed a high degree of patient-specific substructure. We suggest that EC and GCC patients from these 2 areas share a similar patrilineal genetic background, which may be an important genetic factor in EC and GCC in these populations.


Introduction
Esophageal cancer (EC) is one of the most common fatal cancers worldwide. China has geographical "hot spots" of high EC incidence. A well-known high-risk region is the Taihang Mountain area between Henan, Hebei, and Shanxi provinces in north-central China, part of the famous "Asian EC belt" that ranges from the Caucasus Mountains, across northern Iran, all the way to northern China [1]. The incidence of gastric cardia cancer (GCC) is also high in the belt. For example, the world-standardized incidence of EC and GCC in Linxian, Henan province, was 81.96/100,000 and 31.04/100,000, respectively, between 1983 and 2002 [2,3]. The Chaoshan area in southern China is another EC high-risk area. The age-standardized incidence rates on Nanao island for EC and GCC were 74.47/100,000 and 34.81/100,000, respectively, between 1995 and 2004 [4].
The geographic features of south-littoral Chaoshan and the north-central Taihang Mountain area are distinct, but the incidence of EC and GCC is high in both regions [5]. We and others have reported familial aggregation of EC and GCC and increased EC and GCC risk in family members in this high-risk population [6][7][8][9]. In the Chaoshan high-risk area, the incidence of EC and GCC is not uniform across population groups, although they are exposed to a similar environment.
The 3 main populations in the Chaoshan area include 2 Han populations (the Chaoshanese, who speak Chaoshan dialects, and the Hakka, who speak Hakka dialects) and one local aboriginal She population. Beginning in the Qin Dynasty (216-207 BC), Han people from Henan and Shanxi in north-central China migrated into the Chaoshan area of Guangdong province via Fujian province because of war and famine. They gradually became the predominant inhabitants of the Chaoshan area and are called Chaoshanese [10]; as a result, the Chaoshan dialect is similar to ancient Chinese. Hakka Chinese originated from the northern Han Chinese of the Yellow River and Luohe River basin of the Central Plain. From the Jin Dynasty (266-316 AD) to the Song Dynasty (960-1279 AD), they were also forced to move south because of wars. When the Hakkas arrived in the Chaoshan area, the Chaoshanese had already settled in the rich plain area, so the Hakkas had to settle in the mountain area, where they lived with the local aborigines, the She population (Fig. 1).
The Hakka and Chaoshanese populations each have a unique culture [10][11][12][13] that shares many similarities with that of northern Han Chinese, including some features of dialect, lifestyle, customs, and habits [10]. The Chaoshan She population is the only aboriginal minority population. She people mainly work in agriculture, forestry, and animal husbandry; their language and living customs differ from those of the Han population [14]. Although all 3 populations are exposed to a similar geographical environment, only the Chaoshanese have a high incidence of EC and GCC.
Our previous research on Y-chromosome and mtDNA haplogroups concluded that the EC high-risk populations in Taihang Mountain, Fujian Minnan and Guangdong Chaoshan share a similar patrilineal and matrilineal genetic background [15,16]. In the present study, we further explored the patrilineal genetic structure of EC and GCC patients in Chaoshan high-risk areas and compared it with that of matched high-risk populations and corresponding low-risk populations. We aimed to examine whether Chaoshan cancer patients have a common ancestry with Taihang Mountain patients and whether they share the same unique Y-chromosome haplotypes. We also compared these data for Y-chromosome single nucleotide polymorphisms (Y-SNPs) and Y-chromosome short tandem repeats (Y-STRs) with those of other Chinese populations from public databases to explore the relative genetic affinity of the studied populations. We first analyzed the nonrecombining portion of the Y chromosome (NRY) in these 6 populations with 16 East Asian-specific biallelic markers (SNPs) [17,18], which are characterized by a low mutation rate and low probabilities of back and parallel mutation and are therefore suitable for tracing early demographic events in human history. We then investigated the genetic distance among EC and GCC patients with Y-STR loci, which have a relatively high mutation rate and are appropriate for analyzing the relationships among closely related groups and their microevolution [15,16]. Both the Y-SNP and Y-STR results indicate that the Chaoshan patients are closely related genetically to the Taihang Mountain patients and that the patient groups are more closely related to each other than to the high-risk populations.

Distribution of NRY Haplogroups in the 6 Studied Populations in China
Y-SNP genotyping revealed the haplogroup frequencies of the Chaoshan EC and GCC patients, Taihang Mountain EC and GCC patients, and Chaoshan Hakka and She populations. The most frequent haplogroup in Chaoshan patients was O3a3c1-M117, which is the characteristic haplogroup for Northern East Asians (Table 1). Its frequency was also high in Taihang Mountain patients but was significantly lower in the Chaoshan Hakka and She populations than in Chaoshan patients (p<0.05). Both the Chaoshan Hakka and She populations showed a high frequency of O1a*, the characteristic haplogroup for Southeastern Asians; its frequency was significantly higher in Chaoshan Hakka than in Chaoshan patients (p<0.05). The She population showed a uniquely high frequency of O3a3b*, which among the other studied populations was found only in the Chaoshan GCC patients, at a very low frequency of 2.08%.

Principal Component Analysis Revealed Close Affinity among the 4 Patient Groups
Principal component analysis (PCA) is a mathematical procedure that transforms a number of correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. In the principal-component plot, the smaller the distance between two populations, the closer their genetic relationship. Figure 2 shows the results of PCA, with 3 components (PC1, PC2, PC3), for Y-SNP frequencies based on genotyping results of the 6 studied populations and additional data for other Chinese Han. For comparison, the haplotype frequencies of 3 high-risk populations from the Chaoshan (CSHR), Fujian (FJHR) and Taihang Mountain (THHR) areas were included [15]. The 3 components accounted for 86.2% of the total variation in Y-SNP frequencies. The 4 patient groups and 3 high-risk populations clustered together. The Chaoshan She and Hakka populations formed another cluster. The remaining Northern Han and Southern Han populations formed a third group. The Chaoshan patients and high-risk population were separated from the Chaoshan Hakka and She populations and Guangzhou Han.
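The PCA procedure described above can be illustrated with a minimal sketch. It assumes each population is a row of haplogroup frequencies (%), extracts components by power iteration with deflation on the covariance matrix, and uses only the Python standard library. The population names, the frequencies, and the function names (`center_columns`, `covariance_matrix`, `leading_eigenpair`, `pca`) are hypothetical and are not taken from the study, whose software for this step is not stated.

```python
import math

def center_columns(rows):
    """Subtract each column mean so every haplogroup has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance between columns (haplogroups)."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpair(matrix, iterations=1000):
    """Power iteration for the largest eigenvalue of a symmetric matrix."""
    p = len(matrix)
    # Non-uniform start vector: a vector of equal entries is annihilated by
    # the covariance of frequency rows that all sum to the same total.
    vec = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(p) for j in range(p))
    return vec, value

def pca(rows, n_components=2):
    """Return (scores per row, fraction of variance explained per component)."""
    centered = center_columns(rows)
    cov = covariance_matrix(centered)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    components, explained = [], []
    for _ in range(n_components):
        vec, value = leading_eigenpair(cov)
        components.append(vec)
        explained.append(value / total_variance)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(x * w for x, w in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, explained

# Made-up frequencies (%) for three hypothetical haplogroups.
toy_frequencies = {
    "PatientsA": [60, 10, 30],
    "PatientsB": [55, 20, 25],
    "ReferenceC": [20, 50, 30],
    "ReferenceD": [25, 40, 35],
}
names = list(toy_frequencies)
scores, explained = pca([toy_frequencies[name] for name in names])
for name, (pc1, pc2) in zip(names, scores):
    print(f"{name}: PC1={pc1:.2f} PC2={pc2:.2f}")
print("explained variance fractions:", [round(e, 3) for e in explained])
```

Populations with similar frequency profiles receive nearby scores, which is what the distance between points in a principal-component plot reflects. The sketch does not handle the degenerate case in which all rows are identical (zero total variance).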

Positive Correlation between 4 Patient Populations and Chinese Han Populations
Y-SNP haplogroup frequencies for the patient groups and the high-risk population from the same area were positively correlated, and frequencies for all patient groups were positively correlated with those for the Fujian and Chaoshan high-risk populations (Table 2). Frequencies for the Chaoshan EC patients and Chaoshan Hakka were correlated, but the coefficient was the lowest. Frequencies for HC were positively correlated with most of the Chinese Han frequencies and those for HNEC were positively correlated with some of the Chinese Han frequencies.
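As an illustration of how two haplogroup frequency profiles can be correlated, the sketch below computes a Pearson coefficient. The text does not state which correlation coefficient was used, so Pearson is an assumption here, and the group names and frequencies are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length frequency profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a profile with no variation has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Made-up frequencies (%) over four hypothetical haplogroups.
toy_profiles = {
    "PatientGroup1": [50, 20, 20, 10],
    "PatientGroup2": [45, 25, 20, 10],
    "ReferenceGroup": [10, 20, 30, 40],
}
labels = list(toy_profiles)
for i, first in enumerate(labels):
    for second in labels[i + 1:]:
        r = pearson(toy_profiles[first], toy_profiles[second])
        print(f"{first} vs {second}: r = {r:.3f}")
```

A positive coefficient means the two groups tend to be high and low in the same haplogroups; a negative one means their dominant haplogroups differ.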

Hierarchical Cluster Analysis Isolates Patients and High-risk Population from Other Populations
To study the affinity among the 4 patient groups and their relationship with other Han and minority nationalities, we analyzed Y-SNP data by hierarchical cluster analysis with average linkage (between groups). We compared 17 Chinese Han populations (the same populations as in the principal component analysis), 3 southern minority nationalities (Yao, Zhuang and Dong) [19] and 5 northern minority nationalities (Tibetan, Mongol (MG), Hui, Ewenki (EWK), Shui). The Taihang Mountain patients and high-risk population (Taihang) were genetically close and formed a branch; meanwhile, the Chaoshan patients were genetically close to the Chaoshan and Fujian high-risk populations (Chaoshan, Fujian) and formed another branch (Fig. 3). These 2 branches then joined and clustered with the Chaoshan Hakka and She populations. All other populations clustered outside the main branch.

Table 1. Y-chromosome single nucleotide polymorphism (Y-SNP) haplogroup frequencies of the 6 studied populations (%).
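Average-linkage (between-groups) hierarchical clustering, as used in the cluster analysis of Y-SNP frequencies, can be sketched as follows. The distance between two clusters is taken as the mean of all pairwise distances between their members. The Euclidean distance, the function name `average_linkage`, and the toy populations are assumptions for illustration; the distance measure used in the study is not stated.

```python
import math
from itertools import combinations

def average_linkage(profiles):
    """Agglomerative clustering with average linkage between groups.

    profiles maps a population name to its haplogroup frequency vector.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    def cluster_distance(c1, c2):
        # Mean of all member-to-member Euclidean distances.
        total = sum(math.dist(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Made-up frequencies (%) for four hypothetical populations.
toy_profiles = {
    "PopA": [60, 10, 30],
    "PopB": [58, 12, 30],
    "PopC": [20, 50, 30],
    "PopD": [25, 40, 35],
}
merges = average_linkage(toy_profiles)
for left, right, distance in merges:
    print(f"join {left} + {right} at distance {distance:.2f}")
```

The order and height of the joins are what a dendrogram draws: populations joined early, at small distances, sit on the same branch.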

Genetic Distance Analysis and Construction of a Phylogenetic Tree
We used Y-STR data to investigate the genetic relationships between the 4 patient populations. R_ST distances between pairs of populations were calculated on the basis of 6 Y-STRs: DYS389 (I, II), DYS390, DYS391, DYS392, DYS393 and DYS394. We included 6 additional Chinese populations and 3 high-risk populations: Zhejiang [20], Henan [21], Dongbei [22], Tianjing [23], Hunan Han [24], and Tibetan [25], and the Chaoshan, Fujian, and Taihang Mountain high-risk populations, all of which belong to the Sino-Tibetan language family [15], as do the 4 patient groups. From the R_ST distance matrix, we constructed an unrooted neighbor-joining tree (Fig. 4). The patient groups were closer to each other than to the high-risk populations and the other Chinese Han populations.
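To make the idea of an R_ST distance concrete, the sketch below uses a simplified variance-ratio form: the share of the total variance in repeat number that is not explained by variance within populations, summed over loci. This is an illustrative assumption, not the procedure run by the software used in the study, which may differ in weighting and in significance testing. The function name `rst` and the haplotypes are hypothetical.

```python
from statistics import mean, variance

def rst(pop_a, pop_b):
    """Simplified pairwise R_ST from Y-STR haplotypes.

    Each population is a list of haplotypes; each haplotype is a tuple of
    repeat counts, one per locus. Each population needs at least 2 samples.
    """
    n_loci = len(pop_a[0])
    within = 0.0   # average within-population variance, summed over loci
    total = 0.0    # variance of the pooled sample, summed over loci
    for locus in range(n_loci):
        a = [float(h[locus]) for h in pop_a]
        b = [float(h[locus]) for h in pop_b]
        within += mean([variance(a), variance(b)])
        total += variance(a + b)
    if total == 0.0:
        return 0.0  # no variation at all, so no differentiation to measure
    return (total - within) / total

# Made-up haplotypes at three hypothetical loci.
group_x = [(13, 16, 23), (13, 17, 23), (14, 16, 24), (13, 16, 23)]
group_y = [(13, 16, 23), (14, 17, 24), (13, 16, 23), (13, 17, 23)]
group_z = [(15, 19, 25), (15, 19, 26), (16, 19, 25), (15, 20, 25)]

print("x vs y:", round(rst(group_x, group_y), 3))
print("x vs z:", round(rst(group_x, group_z), 3))
```

Values near 1 indicate that almost all variation lies between the two groups; values near 0 (or slightly negative, which can happen with this estimator for very similar samples) indicate little differentiation. A matrix of such pairwise values is the kind of input a neighbor-joining tree is built from.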

Network Analysis of Y-STR Haplogroups of the 4 Patient Groups and 3 High-risk Populations
The haplogroup with the highest frequency shared by the Chaoshan patients was O3a3c1-M117 (Table 1). A network for patients and high-risk populations was therefore constructed for haplogroup O3a3c1-M117. In all, 12 Henan and 15 Chaoshan EC patients, 17 Chaoshan and 9 Henan GCC patients, and 23 Chaoshan, 8 Henan and 24 Fujian high-risk individuals belonged to haplogroup O3a3c1-M117. Individuals with Y-STR frequency <2 were eliminated from the analysis. Finally, data for 55 individuals were included and analyzed (Fig. 5). The central node was represented by 8 Fujian high-risk individuals, 1 Henan high-risk individual and 1 Chaoshan EC patient. All of the other haplogroup O3a3c1-M117 individuals derived from this central node. This central node was connected to 5 one-step neighbors: 2 neighbors represented 5 Fujian high-risk individuals; the third neighbor represented 8 Chaoshan high-risk individuals, 1 Henan high-risk individual, 2 Fujian high-risk individuals and 1 Chaoshan EC patient; the fourth neighbor represented 2 Chaoshan EC patients, 1 Chaoshan high-risk individual and 1 Fujian high-risk individual; and the fifth neighbor represented 1 Chaoshan GCC patient and 1 Chaoshan high-risk individual. Most patients derived from the fifth one-step neighbor and thus clustered mainly in one area (circle in Fig. 5). This area included all GCC patients and 5 EC patients, with the remaining 6 EC patients scattered in other nodes.

county. Disease in all patients was confirmed pathologically. All participants gave written informed consent. The study was approved by the ethical review committee of Shantou University Medical College. Genomic DNA was extracted from whole blood with the TIANamp Blood DNA kit (DP318-03) (Tiangen Biotech Co., Beijing).

Genotyping of Y-SNPs and Y-STRs
Y-SNPs were genotyped with the Sequenom MassARRAY iPLEX Gold module (Sequenom Inc.); PCR primers and extension primers are listed in Table 3. The M1 polymorphism (Alu insertion, also called YAP) was analyzed directly by agarose gel electrophoresis after PCR [26]. STRs were genotyped by fluorescence PCR as previously described [15], and fluorescent-labeled extension products were separated by capillary electrophoresis on an ABI 3730x Genetic Analyzer (ABI, USA). All primers were synthesized by Sangon Co. (Shanghai). In 1999, Su et al. ascertained 17 Y-chromosome haplogroups based on 19 East Asian-specific biallelic markers as the paternal structure of East Asians [19]. The adjusted phylogenetic diagram of Y-SNPs [27] includes nearly 600 SNPs and defines 311 haplogroups. The phylogenetic diagram of the 17 haplogroups defined by 16 Y-SNPs is shown in Figure 6.

Population and Genotyping
Subjects were genotyped for Y-SNP haplogroups, and frequencies were compared among the 4 patient populations and the She and Hakka populations (Table S1). Principal component, correlation and hierarchical cluster analyses were used to analyze the relationships among the 6 populations. Three high-risk populations from the Taihang Mountain, Fujian Minnan, and Chaoshan areas and 25 previously published Chinese populations were included for comparison. The 25 Chinese populations were divided into 4 groups by geographic location and nationality [15]: northern Han (NH), northern minority nationalities (NMN), southern Han (SH), and southern minority nationalities (SMN).
NH populations were Hebei [28], Liaoning (data provided by the State Key Laboratory of Genetic Engineering and Center for Anthropological Studies, School of Life Sciences, Fudan University), Xinjiang, Gansu, Shanxi, Neimeng, Shandong and Henan [28]; SH populations were Hunan, Hubei, Zhejiang, Jiangxi, Shanghai, Anhui, Jiangsu, Sichuan [28], Guangzhou and Guangxi (data provided by Fudan University); NMN populations were Tibetan, Mongol, Hui, Ewenki, and Shui (data provided by Fudan University); and SMN populations were Yao, Zhuang and Dong [19].

STRs can be used to analyze fine-scale genetic diversity among closely related populations, so on the basis of the Y-SNP results, Y-STRs were used to analyze the genetic differentiation and origin of patients and high-risk populations (Table S2). We added Y-STR data for 3 high-risk populations from our previous research [15] and for 6 previously published populations: Zhejiang [20], Henan [21], Dongbei [22], Tianjing [23], Hunan Han [24] and Tibetan people [25].
The extent of genetic differentiation among the populations was estimated by the R_ST statistic on the basis of the Y-STR haplotypes with Arlequin 3.1. A neighbor-joining tree was constructed from the R_ST distance matrix with MEGA 5.1. A network of Y-STR data was constructed with Network 4.6.1.1 (www.fluxus-engineering.com). In the network map, individuals with the same Y-STR haplotype were placed in the same node, and one node could give rise to other nodes through gradual Y-STR mutation [15].
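The node structure of such a network can be illustrated with a small sketch. It groups identical Y-STR haplotypes into nodes, drops haplotypes seen fewer than twice (one reading of the frequency cut-off reported in the Results, taken here as an assumption), and lists pairs of nodes that differ by a single repeat at a single locus, that is, one-step neighbors. It is not the algorithm implemented in the Network software; the function names and haplotypes are hypothetical.

```python
from collections import Counter

def build_nodes(haplotypes, min_count=2):
    """Group identical haplotypes into nodes and drop rare ones.

    Returns a dict mapping each retained haplotype to its number of carriers.
    """
    counts = Counter(haplotypes)
    return {hap: n for hap, n in counts.items() if n >= min_count}

def one_step_neighbors(nodes):
    """Pairs of nodes that differ by one repeat unit at exactly one locus."""
    keys = sorted(nodes)
    pairs = []
    for i, first in enumerate(keys):
        for second in keys[i + 1:]:
            diffs = [abs(x - y) for x, y in zip(first, second) if x != y]
            if diffs == [1]:
                pairs.append((first, second))
    return pairs

# Made-up haplotypes (repeat counts at two hypothetical loci).
toy_haplotypes = [
    (13, 16), (13, 16), (13, 16),
    (13, 17), (13, 17),
    (14, 18),            # singleton, removed by the cut-off
    (12, 16),            # singleton, removed by the cut-off
]
nodes = build_nodes(toy_haplotypes)
print("nodes:", nodes)
print("one-step neighbors:", one_step_neighbors(nodes))
```

In a drawn network, node size reflects the carrier count and each one-step pair becomes an edge of length one mutation.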

Discussion
Chaoshanese are descendants of north-central China Han people. North-central Chinese Han began to migrate into southern China in the Qin Dynasty (216 BC). The Han Dynasty (206 BC-220 AD) saw another 3 waves of large-scale migration into southern China because of the decrease in the native population in this area. Gradually, over 2,000 years, the north-central Chinese Han became the main population of the Chaoshan region, the Chaoshanese. They are called Helao, who migrated directly from north-central China, or Fulao, who first migrated to Fujian Minnan and then to Chaoshan, with well-maintained language and customs from north-central China. The Taihang Mountain area in north-central China and the Fujian Minnan and Chaoshan areas are well known for their high incidence of EC [15].
With the development of diagnostic techniques and improved epidemiology, more GCC cases have been confirmed in these areas. EC and GCC are the 2 most common cancers in these 3 areas. Our previous genetic research showed that high-risk populations in these 3 areas share a common ancestry [15,16]. In the present study, we studied Y-chromosome haplogroups of EC and GCC patients from the Chaoshan and Taihang Mountain areas to further explore the paternal genetic background of the patients. We compared the data with those for 2 low-risk populations (Chaoshan Hakka and She) and 3 high-risk populations. We first analyzed the distribution of Y-SNP haplogroups among the studied populations. The haplogroup with the highest frequency shared by Chaoshan EC and GCC patients was O3a3c1-M117, one of the northern Han dominant haplogroups, which was also high in Taihang Mountain patients but low in the Chaoshan Hakka and She populations. Compared with Chaoshan patients and the high-risk population, the Chaoshan Hakka and She populations showed a relatively higher frequency of the southern native dominant O1*. Like Taihang Mountain patients, Chaoshan patients showed northern Han dominant haplogroups as their highest-frequency haplogroups, so Chaoshan and Taihang Mountain patients are relatively closely related.
On Y-SNP principal component analysis, the paternal structure of Chaoshan patients differed from that of the Chaoshan Hakka and She populations, although they live in geographic proximity and the Chaoshan Hakka are also descendants of north-central Chinese Han. Chaoshan patients clustered closely with the Fujian and Henan high-risk populations and patients, although they are geographically distant. The Chaoshan Hakka and She populations clustered together, which agrees with historical records: the Chaoshan Hakka mainly inhabit the mountain area, which allows more gene flow with the She population, who also live there. Y-SNP haplotype frequencies were positively correlated among patients, which further supports their close genetic affinity. The results of hierarchical cluster analysis also supported the close genetic affinity among patients and high-risk populations. Phylogenetically, the patient groups were more closely related to each other than to the high-risk populations (Fig. 4). Network analysis (Fig. 5) suggested that the ancestral lineage of haplogroup O3a3c1-M117 individuals was represented by the Taihang Mountain and Fujian high-risk individuals and the Chaoshan EC patient who constituted the central node, and that the O3a3c1-M117 patients from the 2 studied areas largely derived from a single one-step neighbor containing 1 Chaoshan high-risk individual and 1 Chaoshan GCC patient. The haplogroup O3a3c1-M117 network analysis revealed variation among populations but also a high degree of patient-specific substructure. All 14 GCC patients and 5 of the 11 EC patients fell into one cluster (Fig. 5, circle). Haplogroup O3a3c1-M117 patients may have originated from the same ancestral haplogroup. Thus, we suggest patrilineal genetic affinity between the 2 geographically separated groups of GCC and EC patients in China.
Recent genome-wide association studies from high-risk areas in China showed a significant association of a variant at 10q23 in PLCE1 with both esophageal squamous cell carcinoma and gastric cardia adenocarcinoma, which highlights the common genetic mechanisms that may contribute to the etiology of both cancers [29]. Although EC and GCC are pathologically distinct, epidemiological studies [2][3][4][5][6][7][8][9], genome-wide association studies and the present study all suggest that EC and GCC may share a common genetic structure. The 2 sites are anatomically adjacent, have similar embryogenesis, and are exposed to similar environmental conditions during life. However, why they may be affected by a common genetic structure is still unknown.
We suggest that EC and GCC do not occur at random in high-risk populations but are closely associated with a certain patrilineal background structure, and that these related patients may have inherited a pathogenic genetic structure from their common ancestors.
In summary, the patrilineal genetic structure of Chaoshan and Taihang Mountain patients is similar, and patients have closer affinity with each other than with the high-risk populations. The EC and GCC patients share a recent common ancestor. In contrast, the Chaoshan Hakka and She populations have a relatively distant relationship with Chaoshanese people, which may explain in part the high incidence of EC and GCC in Chaoshanese people.