The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers

Background
The Koreans are generally considered a northeast Asian group because of their geographical location. However, recent findings from Y chromosome studies showed that the Korean population contains lineages from both southern and northern parts of East Asia. To understand the genetic history and relationships of Korea more fully, additional data and analyses are necessary.
Methodology and Results
We analyzed mitochondrial DNA (mtDNA) sequence variation in the hypervariable segments I and II (HVS-I and HVS-II) and haplogroup-specific mutations in coding regions in 445 individuals from seven East Asian populations (Korean, Korean-Chinese, Mongolian, Manchurian, Han (Beijing), Vietnamese and Thai). In addition, published mtDNA haplogroup data (N = 3307), mtDNA HVS-I sequences (N = 2313), Y chromosome haplogroup data (N = 1697) and Y chromosome STR data (N = 2713) were analyzed to elucidate the genetic structure of East Asian populations. All the mtDNA profiles studied here were classified into subsets of haplogroups common in East Asia, with just two exceptions. In general, the Korean mtDNA profiles revealed similarities to other northeastern Asian populations through analysis of individual haplogroup distributions, genetic distances between populations or an analysis of molecular variance, although a minor southern contribution was also suggested. Reanalysis of Y-chromosomal data confirmed both the overall similarity to other northeastern populations and a larger paternal contribution from southeastern populations.
Conclusion
The present work provides evidence that the peopling of Korea can be seen as a complex process, interpreted as an early northern Asian settlement with at least one subsequent male-biased southern-to-northern migration, possibly associated with the spread of rice agriculture.


Introduction
The evolutionary history of East Asian populations has long been a subject of interest in the field of human evolutionary genetics. Results from classical genetic markers show a significant separation between southern and northern populations of East Asia [1]. This north-south genetic differentiation is likely to have an origin in the early peopling of the region. There have been two major models for early migration routes into East Asia. The first model postulates a southeast Asian origin, followed by a northward migration [2]. Recent genetic surveys using autosomal microsatellite markers [3] and Y-chromosomal binary markers [4] have been interpreted as supporting this model. In contrast, the second model suggests a multidirectional route: one migration through central Asia and one through southeast Asia [1,5,6]. Thus, understanding the genetic origin and history of Korea may be informative for questions concerning prehistoric migration route(s) and population expansions in East Asia.
The Korean Peninsula is located to the north of the Yellow and Yangtze Rivers of China, and bounded to the northeast by Russia. Therefore, the Koreans are geographically a northeast Asian group. Anthropological and archeological evidence suggests that the early Korean population was related to Mongolian ethnic groups who inhabited the general area of the Altai Mountains and Lake Baikal regions of southeast Siberia [7]. According to Korea's founding myths, the Ancient Chosun (the first state-level society of Korea) was established around 2333 BC in the region of southern Manchuria but later moved into the Pyongyang area of northwest Korea. In addition, archeological evidence reveals that rice cultivation had spread to most parts of the Korean Peninsula by around 1,000-2,000 BC, introduced from the Yellow River and/or Yangtze River basin in China [8].
Studies of classical genetic markers showed that Koreans tend to have a close genetic affinity with Mongolians among East Asians [9-11]. In contrast, recent surveys of Y-chromosomal DNA variation revealed that the Korean population contained lineages typical of both southern and northern East Asian populations [6,12,13]. The Koreans appeared to have affinities with Manchurians, Yunnan-Chinese from southern China, and Vietnamese [13].
To understand the genetic history of Korea better, more data from additional genetic markers from Korea and its surrounding regions are necessary. Mitochondrial DNA (mtDNA), like the Y chromosome, can also provide valuable information about the phylogeography of human populations due to its special features of haploidy and uniparental inheritance [14-18]. Although recent investigations of mtDNA variation in East Asia have provided valuable information for constructing a robust phylogenetic tree of mtDNA haplotypes, limited data on the Korean population are available [19-21].
In this study, we present new data on the mtDNA sequence variation of the hypervariable segments I and II (HVS-I and HVS-II) and haplogroup-specific mutations in coding regions in 445 individuals from seven East Asian populations, including Korea. In addition, mtDNA haplogroup data (N = 3307), mtDNA HVS-I sequences (N = 2313), Y chromosome haplogroup data (N = 1697) and Y chromosome STR data (N = 2713) from the literature were analyzed to elucidate wider aspects of the genetic structure of East Asian populations.

DNA samples and reference data
We analyzed a total of 445 individuals, collected from seven East Asian populations (Korean, Korean-Chinese (people of Korean origin now living in China), Mongolian, Manchurian, Chinese Han (Beijing), Vietnamese, and Thai). The DNA samples included subsets of the samples examined by Jin et al. [13] and Kwak et al. [22], although the exact number of subjects for each population occasionally varies between these studies. In addition, we included the following new Korean-Chinese and Mongolian samples: 51 Korean-Chinese from northern China and 47 Mongolians from Ulaanbaatar. This study was approved by the Ethics Committee and institutional review boards of the Institute of Bio-Science and Technology at Dankook University in Cheonan, and separate written informed consent for enrollment was obtained from all participants. DNA was prepared from whole blood by the standard method [23] or was extracted from buccal cells according to the procedure of Richards et al. [24].

mtDNA sequencing and genotyping of RFLP
After PCR amplification, each PCR product was purified using the Wizard® PCR Preps DNA Purification System (Promega, WI, USA) and then sequenced by cycle sequencing using either a MegaBACE 1000 sequencer (Amersham Bioscience, USA) or an ABI PRISM™ 310 Genetic Analyzer (Applied Biosystems, CA, USA) with DYEnamic ET Dye Terminator (Amersham Bioscience, USA) or BigDye™ Terminator (PE Biosystems, USA), respectively. DNA sequences of the PCR amplicons were determined from both forward and reverse sequence data using the original primer pairs. The sequences from nucleotide position (np) 16024 to 16365 in HVS-I and from 73 to 340 in HVS-II were determined, since ambiguous electropherograms for 20-30 nucleotides near the primers were frequently observed.
The intergenic COII/tRNA(Lys) 9-bp deletion was analyzed as described in Jin et al. [42]. In addition, several amplified segments, mainly in the mtDNA coding regions, were analyzed by RFLP typing and additional sequencing, as listed in Table 1.

Sequence alignment and haplogroup analyses
Sequences were aligned and compared with the revised Cambridge Reference Sequence (rCRS) [43] using the Sequencher program ver. 2000 (Gene Codes Corporation, MI, USA). The results were converted into a Microsoft Excel table (Microsoft Corporation, WA, USA). The mtDNAs were classified into (sub-)haplogroups based on haplogroup-specific HVS-I/II motifs as well as coding-region sites, as described in recent surveys [19,20,25,44,45]. HVS-I motif searching and haplogroup-directed comparison with closely related sequences from other databases led us to tentatively assign each mtDNA to a haplogroup. To verify the predicted haplogroup status of each mtDNA lineage, we then compared its HVS-II motif. In general, more than 95% of mtDNA lineages can be reliably classified into specific haplogroups using HVS-I/II motifs without extra information from coding region sequences [44]. In the remaining cases, (sub-)haplogroups were characterized using sequence information from some coding region sites (Table 1). After each mtDNA was assigned to the most-derived named haplogroup, the haplogroup frequencies in each of the seven populations were estimated. For quality assurance purposes, we performed quasi-median network analysis [46,47]. The HVS-I (np 16024-16365) and HVS-II (np 74-340) sequences of the 445 individuals of this study have been submitted to GenBank (Accession Numbers FJ493775-FJ494664).
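The motif-based assignment and frequency estimation described above can be illustrated with a minimal Python sketch. The motif table, the haplogroup names (HgX, HgX1, HgY) and the rule used to pick the most-derived haplogroup are hypothetical placeholders chosen for illustration; they are not the classification scheme of the cited surveys.

```python
from collections import Counter

# Hypothetical motif table: haplogroup name -> set of HVS-I positions that
# differ from the reference sequence. Names and positions are placeholders.
MOTIFS = {
    "HgX": {16223},
    "HgX1": {16223, 16362},
    "HgY": {16189, 16217},
}


def assign_haplogroup(variants, motifs=MOTIFS):
    """Return the most-derived haplogroup whose full motif is present.

    'Most derived' is approximated here (an assumption) as the matching
    motif with the largest number of diagnostic positions.
    Returns None when no motif matches.
    """
    variants = set(variants)
    matches = [(len(m), name) for name, m in motifs.items() if m <= variants]
    if not matches:
        return None
    return max(matches)[1]


def haplogroup_frequencies(samples, motifs=MOTIFS):
    """Map each assigned haplogroup (or None) to its relative frequency."""
    counts = Counter(assign_haplogroup(v, motifs) for v in samples)
    total = sum(counts.values())
    return {hg: n / total for hg, n in counts.items()}
```

In practice, lineages that remain unassigned (None here) would be the ones resolved with coding-region information.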

Data analyses
The genetic differentiation between different population samples and its statistical significance were assessed via FST (mtDNA haplogroups, HVS-I/II sequences and Y-SNPs) and RST (Y-STRs) values. The population genetic structure of the ethnic and/or regional groups was analyzed through the analysis of molecular variance (AMOVA) approach [48]. The calculations of diversity indices, FST, RST and AMOVA were performed using the Arlequin 2.000 package [49]. Population pairwise FST and RST values were visualized by multidimensional scaling (MDS) plot analyses using SPSS 12.0 software.
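To make the idea of a pairwise differentiation measure concrete, the following Python sketch computes a simple heterozygosity-based statistic, (H_T - H_S) / H_T, from haplogroup frequencies. This is a swapped-in textbook estimator used only for illustration; it is not the AMOVA-based calculation performed by Arlequin, and all function names are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for a frequency dictionary."""
    return 1.0 - sum(p * p for p in freqs.values())


def pairwise_fst(freqs_a, freqs_b):
    """Heterozygosity-based differentiation between two populations.

    Both populations are weighted equally (an assumption); the result is
    0 for identical frequencies and 1 for populations sharing no haplogroup
    when each is fixed for a single one.
    """
    keys = set(freqs_a) | set(freqs_b)
    pooled = {k: (freqs_a.get(k, 0.0) + freqs_b.get(k, 0.0)) / 2 for k in keys}
    h_t = heterozygosity(pooled)
    h_s = (heterozygosity(freqs_a) + heterozygosity(freqs_b)) / 2
    if h_t == 0:
        return 0.0
    return (h_t - h_s) / h_t


def pairwise_fst_matrix(populations):
    """Return {(name_a, name_b): value} for every ordered pair of populations."""
    names = sorted(populations)
    return {
        (a, b): pairwise_fst(populations[a], populations[b])
        for a in names
        for b in names
    }
```

A matrix of this kind is the sort of input that an MDS routine would then project onto two dimensions.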
Haplogroup-specific median-joining networks [50] for Y chromosome data were constructed using the NETWORK 4.2 program (www.fluxus-technology.com). Such networks were initially highly reticulated, and we reduced reticulations by first weighting the loci according to the inverse of their variance in the dataset used [51] and subsequently constructing a reduced-median network [52] to form the input of the median-joining network [53].
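The locus weighting step can be sketched as follows. This Python example derives integer weights that are inversely proportional to the variance of repeat counts at each Y-STR locus; the scaling to a maximum weight of 10 and the handling of invariant loci are assumptions made for illustration, not settings reported in the study.

```python
from statistics import pvariance


def inverse_variance_weights(haplotypes, max_weight=10):
    """Return one integer weight per locus, inversely related to its variance.

    haplotypes: list of equal-length tuples of repeat counts.
    The least variable (but still variable) locus receives max_weight;
    more variable loci receive proportionally smaller weights, minimum 1.
    Invariant loci are given max_weight (an arbitrary choice, since they
    cannot create reticulations).
    """
    n_loci = len(haplotypes[0])
    variances = [pvariance([h[i] for h in haplotypes]) for i in range(n_loci)]
    positive = [v for v in variances if v > 0]
    if not positive:
        return [max_weight] * n_loci
    v_min = min(positive)
    weights = []
    for v in variances:
        if v == 0:
            weights.append(max_weight)
        else:
            weights.append(max(1, round(max_weight * v_min / v)))
    return weights
```

Fast-mutating loci, which tend to generate parallel mutations and therefore reticulations, end up with the lowest weights.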
The admixture proportions of the northeast Asian and southeast Asian parental populations in the Korean population were estimated for mtDNA and the Y chromosome using the Admix 2.0 software [54].
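As a rough illustration of what an admixture proportion means, the sketch below fits the model h = m * p1 + (1 - m) * p2 to haplogroup frequencies by least squares, where h, p1 and p2 are the frequencies in the admixed and the two parental populations. This simple frequency-based estimator is an assumption introduced for illustration and is not the estimator implemented in Admix 2.0.

```python
def admixture_proportion(hybrid, parent1, parent2):
    """Least-squares estimate of the contribution m of parent1.

    All arguments map haplogroup names to frequencies. The estimate is
    clipped to the interval [0, 1].
    """
    keys = set(hybrid) | set(parent1) | set(parent2)
    num = 0.0
    den = 0.0
    for k in keys:
        d = parent1.get(k, 0.0) - parent2.get(k, 0.0)
        num += (hybrid.get(k, 0.0) - parent2.get(k, 0.0)) * d
        den += d * d
    if den == 0:
        raise ValueError("parental populations have identical frequencies")
    return min(1.0, max(0.0, num / den))
```

With toy frequencies p1 = {A: 0.8, B: 0.2}, p2 = {A: 0.2, B: 0.8} and an admixed population at {A: 0.5, B: 0.5}, the estimate is an equal contribution from each parent.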

Results and Discussion
Almost all of the mtDNA lineages analyzed here could be assigned to the East Asian-specific (sub)haplogroups described recently [19,20,25,44,45], with the exception of two individuals belonging to the European mtDNA haplogroups T (Manchurian) and U5a (Mongolian) (Table 2). The gene diversity (H), nucleotide diversity (πn), and mean number of pairwise differences of the population samples are listed in Table 3. All seven populations displayed high levels of genetic diversity (H > 0.99), suggesting a relatively large population size and heterogeneity of each mtDNA pool. The haplogroup frequencies observed in each population are summarized in Table 2. Based on these haplogroup assignments, the Koreans share lineages with both the southern and the northern haplogroup complexes of East Asia. We first attempted to quantify these contributions by a detailed consideration of the distribution of each lineage.
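The diversity indices mentioned here can be sketched in Python using their standard textbook definitions: gene diversity as n/(n-1) * (1 - sum of squared haplotype frequencies), the mean number of pairwise differences, and nucleotide diversity as that mean divided by sequence length. The function names are hypothetical, sequences are assumed to be aligned and of equal length, and the exact formulas used by Arlequin may differ in detail.

```python
from collections import Counter
from itertools import combinations


def gene_diversity(haplotypes):
    """Unbiased gene diversity: n/(n-1) * (1 - sum(p_i^2))."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def mean_pairwise_differences(sequences):
    """Average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    diffs = [sum(a != b for a, b in zip(s, t)) for s, t in pairs]
    return sum(diffs) / len(pairs)


def nucleotide_diversity(sequences):
    """Mean pairwise differences per site (assumes aligned sequences)."""
    return mean_pairwise_differences(sequences) / len(sequences[0])
```

Under this definition, gene diversity approaches 1 when nearly every sampled individual carries a distinct haplotype, which is the situation a value above 0.99 reflects.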
The highest frequency (23.8%) in the Korean mtDNA pool was observed for haplogroup D4, which is widespread in northern East Asia, especially in the Korean-Chinese (21.6%) and Manchurians (20.0%). In total, haplogroup D lineages, including the subhaplogroups D4, D4a, D4b, D5, and D5a, accounted for 32.4% of the Korean mtDNA pool. In addition, the Koreans present moderate frequencies of (sub)haplogroup A (8.1%) and (sub)haplogroup G (10.3%) lineages, mostly prevalent in northeast Asia and southeast Siberia [20,55-57]. Other Siberian and Mongolian-prevalent haplogroups from the C, Y and Z lineages make up less than 4% of the Korean mtDNA pool. Haplogroups A5a and Y2 are found almost exclusively in Korea but were present at extremely low frequencies. In total, these northern haplogroups account for ~60% of the mtDNA gene pool of the Koreans. In addition, southeast Asian-prevalent mtDNA lineages of (sub)haplogroups B (14.6%), M7 (10.3%), and F (9.7%) are also found at moderate frequencies in the Korean population (Table 2). These findings suggest that more than 30% of the Korean mtDNA pool is attributable to maternal lineages with a more southern origin. We also found the haplogroup M7a1 exclusively in the Korean population. This result is consistent with previous reports that haplogroup M7a is restricted to Japan and south Korea [18,20]. Thus, the distribution pattern of mtDNA haplogroups leads us to consider that the peopling of Korea is likely to have involved multiple sources.
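The "more than 30%" figure for southern lineages follows directly from the three frequencies quoted in this paragraph, as the short Python check below shows. Only the values stated in the text are used; the variable names are illustrative.

```python
# Frequencies (%) of the southeast Asian-prevalent (sub)haplogroups in Koreans,
# as quoted in the text.
southern = {"B": 14.6, "M7": 10.3, "F": 9.7}

# Sum of the three lineages, rounded to one decimal place.
southern_total = round(sum(southern.values()), 1)

print(f"Southern-prevalent lineages: {southern_total}%")
```

The sum is 34.6%, which is close to the 35% southern contribution reported for mtDNA in the admixture analysis.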
We then investigated the mtDNA and Y-chromosomal relationships between the East Asian populations, using both the new and published data. In these analyses mtDNA haplogroups, mtDNA HVS-I sequences, Y-SNPs and Y-STRs were compared (Supplementary Tables S1, S2, Table S1). The FST distances of mtDNA markers (mtDNA haplogroups and HVS-I sequences) of Korean populations showed close relationships with Manchurians, Japanese, Mongolians and northern Han Chinese but not with southern Asians (Supplementary Tables S4 and S5; Figure 2A, B). In the MDS plots, the Korean samples lay entirely within the cluster of northern populations.
In contrast, the results of Y chromosome analyses (based on Y-SNPs and Y-STRs) of Korean populations revealed close relationships with both northeast and southeast Asian populations (Supplementary Tables S6 and S7; Figure 2C, D). Like the mtDNA distances, Y-chromosomal distances from Manchurian, Japanese and northern Han Chinese populations were usually not significantly greater than zero, but some distances from southern Han populations (e.g. Yunnan Han, Y haplogroups; Meixian Han, Y-STRs) or other southern populations (e.g. Vietnamese, Y haplogroups) were also not significantly above zero (Supplementary Tables S6 and S7), as noted previously [13]. In the MDS plots, the Korean samples lay at the border between the northern and southern clusters, rather than within the northern cluster (Figure 2C, D). To investigate Y-chromosomal relationships in more detail, we visualized STR haplotypes within a common predominantly northern haplogroup (C*) and a southern haplogroup (O3) using networks [50] constructed with the seven Y-STRs common to all datasets (Figure 3). These networks did not show striking geographical structure, so we calculated, for each Korean haplotype, the distance to the closest northern and southern haplotype. In both haplogroups, the mean distance to the southern haplotypes was lower than to the northern haplotypes (C*: Korean-north 5.0 steps, Korean-south 4.5 steps; O3: Korean-north 3.5 steps, Korean-south 2.2 steps). This finding is particularly striking for haplogroup C* because it is far more prevalent in the north (Figure 3A).
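The nearest-haplotype comparison can be sketched as follows. The Python example assumes that distance is counted in single repeat-unit steps, summed over loci (an assumption about how "steps" were measured), and it uses invented toy haplotypes rather than data from the study.

```python
def step_distance(h1, h2):
    """Sum of absolute repeat-count differences across loci."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def mean_nearest_distance(query, reference):
    """Mean, over query haplotypes, of the distance to the closest reference."""
    nearest = [min(step_distance(q, r) for r in reference) for q in query]
    return sum(nearest) / len(nearest)


# Toy three-locus haplotypes (hypothetical values, for illustration only).
korean = [(14, 12, 23), (15, 12, 24)]
north = [(14, 13, 25)]
south = [(15, 12, 23)]

to_north = mean_nearest_distance(korean, north)
to_south = mean_nearest_distance(korean, south)
```

In this toy example the Korean haplotypes lie closer on average to the southern reference set than to the northern one, which is the kind of contrast summarized by the mean step values in the text.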
The genetic differences between the Koreans and other East Asians were examined by AMOVA (Table 4). When samples were grouped into northeast Asians and southeast Asians (excluding Koreans), a highly significant difference was found between the two groups with all markers. Thus there is significant genetic differentiation within the region, and we could then compare each group separately with the Koreans. With mtDNA, Koreans were not significantly different from either group when HVS-I sequences were compared, although they were distinct from the southeast Asians in the haplogroup comparisons. With the Y chromosomes, they were again not distinct from either group when haplogroup comparisons were made, but were distinct from the southeast Asians in the STR-based comparison (Table 4).
Our study documents the genetic relationships of the Koreans with their neighboring populations in unprecedented detail. Two major findings emerge. First, the Koreans are overall more similar to northeast Asians than to southeast Asians. This conclusion would be expected from the general correlation between genetic variation and geography observed for human populations, and is supported here by an examination of individual mtDNA haplogroups (Table 2), genetic distances between populations derived from mtDNA or Y-chromosomal data (Figure 2), and the apportionment of genetic diversity between different groups of populations (Table 4). Second, the conclusions from mtDNA and Y-chromosomal analyses differ. Sex-biased admixture is common in human expansions such as that of Bantu-speaking farmers in Africa [58], the spread of the Han ethnic group in China [59] or the post-Columbian peopling of the Americas [60]. The effects in Korea are more subtle, but show a larger male than female contribution from southern East Asia to the population of Korea, most clearly revealed by the admixture estimates, where a 35% contribution from the south was estimated for mtDNA, compared with an 83% contribution for the Y chromosome (Table 5).
The predominant genetic relationship with northern East Asians is consistent with other lines of evidence. Xue et al. [31] reported that the northern East Asian populations started to expand in number before the last glacial maximum at 21-18 KYA, while the southern populations all started to expand after it, but then grew faster. They suggested that the northern populations expanded earlier because they could exploit the abundant megafauna of the "Mammoth Steppe," while the southern populations could increase in number only when a warmer and more stable climate led to more plentiful plant resources such as tubers. By this criterion, the Koreans, expanding at about 30 KYA [31], also resemble other northern populations. Historical evidence suggests that the Ancient Chosun, the first state-level society, was established in the region of southern Manchuria and later moved into the Pyongyang area of the northwestern Korean Peninsula. Based on archeological and anthropological data, the early Korean population possibly had a northern origin in the Altai-Sayan and Baikal regions of southeast Siberia [7,8,61].
What could be the origin of the male-biased southern contribution to the Korean gene pool illustrated, for example, by haplogroups O-M122 (42.2%) and O-SRY465 (20.1%) [29]? Recent molecular genetic analyses and the geographical distribution of haplogroup O-M122 lineages, found widely throughout East Asia at high frequencies (especially in southern populations and China), have suggested a link between these Y-chromosome expansions and the spread of rice agriculture in East Asia [62-64]. In general, Y chromosomes might be spread via a process of demic diffusion during the early agricultural expansion period [65,66]. If this interpretation were substantiated, the spatial pattern of Y-haplogroup O would imply a genetic contribution to Korea through the male-mediated spread of agriculture. Large-scale genetic analyses thus begin to reveal some of the complexities of the peopling of Korea, and further studies of individual autosomal loci or genome-wide genotyping and sequencing are expected to provide further insights.