Analysis of East Asia Genetic Substructure Using Genome-Wide SNP Arrays

Accounting for population genetic substructure is important in reducing type 1 errors in genetic studies of complex disease. As efforts to understand complex genetic disease are expanded to different continental populations, the understanding of genetic substructure within these continents will be useful in the design and execution of association tests. In this study, population differentiation (Fst) and Principal Components Analyses (PCA) are examined using >200 K genotypes from multiple populations of East Asian ancestry. The population groups included those from the Human Genome Diversity Panel [Cambodian, Yi, Daur, Mongolian, Lahu, Dai, Hezhen, Miaozu, Naxi, Oroqen, She, Tu, Tujia, Xibo, and Yakut], HapMap [Han Chinese (CHB) and Japanese (JPT)], and East Asian or East Asian American subjects of Vietnamese, Korean, Filipino and Chinese ancestry. Paired Fst (Weir and Cockerham) showed close relationships between CHB and several large East Asian population groups (CHB/Korean, 0.0019; CHB/JPT, 0.00651; CHB/Vietnamese, 0.0065) with larger separation from Filipino (CHB/Filipino, 0.014). Low levels of differentiation were also observed between Dai and Vietnamese (0.0045) and between Vietnamese and Cambodian (0.0062). Similarly, small Fst values were observed among different presumed Han Chinese populations originating in different regions of mainland China and Taiwan (Fst values <0.0025 with CHB). For PCA, the first two PCs showed a pattern of relationships that closely followed the geographic distribution of the different East Asian populations. PCA showed substructure both between different East Asian groups and within the Han Chinese population. These studies have also identified a subset of East Asian substructure ancestry informative markers (EASTASAIMS) that may be useful for future complex genetic disease association studies in reducing type 1 errors and in identifying homogeneous groups that may increase the power of such studies.


Introduction
Analysis of population genetic substructure has been enhanced by the ability to perform large genome array studies. The differences and patterns of variation within continental populations are useful for several reasons including recapitulating population migration and origins of ethnic groups, forensic identification, and for defining and applying an understanding of allele frequency variation to genetic association studies. Recent studies by several groups including our own have examined European population substructure [1][2][3][4]. Importantly, these studies have shown that discerning and accounting for differences in substructure can improve error rates in association studies [5]. With the availability of East Asian (EAS) SNP genotypes, we undertook the current study to perform similar studies for this sub-continental region that contains the largest contribution to the world's population. East Asian population genetic structure is particularly important since multiple genetic studies of complex disease are currently underway including studies of autoimmune diseases in Korean, Chinese and Japanese populations [6][7][8][9][10][11]. An understanding of the relationship among these different populations and ascertaining ancestry informative markers (AIMs) that can discern East Asian substructure will undoubtedly facilitate accurate interpretation of such studies [5].
This study combines high density SNP array genotypes from studies of EAS population groups within the Human Genome Diversity Panel (HGDP) [12,13] with those of several additional population groups of EAS ancestry. The use of high density SNP genotypes containing over 200 K common autosomal genotypes allows more comprehensive analyses than those previously performed using a limited number of autosomal genotypes. It also complements studies of mitochondrial and Y chromosome haplogroups as well as classical markers that provide important information with respect to part of the history of particular EAS ethnic groups [14][15][16][17][18][19][20]. Our study expands on previous analyses using HGDP population groups [13] by examining additional parameters of population structure/diversity and by including many additional samples, including those from several of the most populous EAS groups (Korean, Filipino and Vietnamese) and Chinese American participants of diverse origin. We apply the genotypic information to identify a set of SNPs that may be useful in the design and execution of association studies.

Population Differentiation between East Asian Populations
To examine similarities and differences in population differentiation among EAS populations, paired Fst values were determined between 19 EAS population groups that were typed with genome-wide SNP arrays (see Methods). The studies included samples derived from HapMap [21,22], HGDP [13], samples collected in Korea and East Asian American participants (see Methods). The Fst values were obtained using three random non-overlapping sets of 3500 SNPs distributed over the autosomal genome (minimum of 50 kb distance between SNPs). This approach was taken to limit potential bias from SNPs in close linkage disequilibrium and to measure the variability of Fst. The small differences among these independent samplings (mean SD = 0.0015; median SD = 0.0013) indicate that this approach resulted in good estimates of paired Fst values. Relatively large Fst values were evident between many of the relatively small ethnic groups within China (Table 1; see Figure 1 for geographical information). In particular, those population groups derived from Mongolia or nearby provinces, including Oroqen, Hezhen, and Daur, show relatively large differences with Han Chinese. Similarly, two of the ethnic groups in the southeastern region of China, Lahu and Dai, also showed large paired Fst values with Han Chinese. With respect to the most populous groups, the data indicate that Japanese and Korean were very closely related, as were Korean and Han Chinese, but that these groups are much further from the south-east Asian populations (Filipino and Vietnamese). The Han Chinese and Japanese groups showed larger separation from each other than either showed with Korean, although the paired Fst values were still small relative to the Chinese/Filipino Fst. The Fst values also showed a close relationship between the Dai ethnic group in China and the Vietnamese population sample.
Each of the groups had large paired Fst values with the Yakut from Siberia with the exception of the Mongolian, Hezhen and Oroqen ethnic groups that derive from north-eastern China or Mongolia. The relative size of the Fst values also generally corresponded to the geographical separation of the EAS population groups (depicted in Figure 1).
Fis values were also determined for each population sample and did not indicate a strong inbreeding component for any of the tested sample groups (Supplemental Table S1).
Chinese subjects from different regions of origin were also examined. For each of the Chinese American groups with self-reported origin from North China, South China and Taiwan, the paired Fst values with the Han Chinese from Beijing were small (<0.0025) (Supplemental Table S2).
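The sampling scheme described above (three random non-overlapping sets of 3500 SNPs with at least 50 kb between SNPs) can be sketched as follows. This is a minimal illustration and not the procedure actually used in the study: the function name draw_spaced_sets, the tuple layout of the input and the greedy selection order are all assumptions made for the example.

```python
import random


def draw_spaced_sets(snps, n_sets=3, set_size=3500, min_gap=50_000, seed=0):
    """Draw non-overlapping SNP sets with a minimum spacing inside each set.

    snps: list of (snp_id, chromosome, position) tuples (hypothetical layout).
    Returns a list of n_sets lists of SNP ids. A set may be shorter than
    set_size if the pool runs out of SNPs that satisfy the spacing rule.
    """
    rng = random.Random(seed)
    pool = list(snps)
    rng.shuffle(pool)  # random order, so the greedy pass below picks a random set
    used = set()       # SNPs already assigned to an earlier set
    sets = []
    for _ in range(n_sets):
        chosen = []
        kept_positions = {}  # chromosome -> positions already in this set
        for snp_id, chrom, pos in pool:
            if snp_id in used:
                continue
            neighbours = kept_positions.get(chrom, [])
            # Accept the SNP only if it is far enough from every SNP kept so far.
            if all(abs(pos - p) >= min_gap for p in neighbours):
                chosen.append(snp_id)
                kept_positions.setdefault(chrom, []).append(pos)
                used.add(snp_id)
                if len(chosen) == set_size:
                    break
        sets.append(chosen)
    return sets
```

Paired Fst would then be computed separately on each of the three sets, and the standard deviation across the three estimates gives the variability reported in the text.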

Principal Component Analyses Using >200 K SNPs Show Substructure Relationships
To further explore the relationship among EAS population groups and examine population substructure, PCA was performed using the genotype results from a set of >200 K SNPs. Analyses were done with and without the inclusion of the Yakut population thought to originate in central Asia, since PCA results are influenced by the inclusion or exclusion of different population groups and we were interested in the relationship between EAS and central Asian populations. The first two principal components in these analyses display the largest genotype variation (Table 2) and are graphically depicted in Figure 1. Inclusion of the Yakut group showed a possible cline in PC1/PC2 that extends from the current Siberian location of the Yakut to the northern East Asian population groups (Figure 1A). Interestingly, the position of the different population groups shows a remarkable correspondence with the geographic origin of each group. This is more clearly suggested when the Yakut population is excluded (Figure 1B) and is best illustrated by comparing these geographic locations with rotated PCA results (Figure 1C and D). Additional PCA analyses including the central Asian Uygur and Hazara population groups were also performed, but these did not show a clear relationship with the EAS (Supplemental Figure S1).
The PCA results for PC1 and PC2 are generally consistent with the relative paired Fst values with respect to the separation among the different population groups. For example, the Korean group lies approximately midway between the HapMap CHB and JPT groups in the PCA plot (Figure 1), in agreement with the paired Fst values discussed above. The PCA results are also consistent with the closer relationship between the Dai ethnic group and the Vietnamese subjects. However, the first two PCs do not show the full relationships among the population groups. For example, the Lahu ethnic group appears to be closely related to the Cambodian ethnic group (Figure 1), although the paired Fst value is relatively large (Table 1). Examination of additional PCs shows the large difference between the Lahu and Cambodian ethnic groups in PCs 3, 4 and 5 (Figure 2). Using both the Kruskal-Wallis test [23], a nonparametric alternative to the ANOVA, and a split half reliability test (see Methods), substructure was present in multiple principal components (Table 2). Substantial population substructure can be observed in the nonrandom grouping of population groups that extends through PC7.
For the entire set of EAS population groups studied, the majority of substructure variation defined by PCA appears to be within the first 4 PCs (Table 2). The eigenvalues plateau after PC4 with only small differences observed in subsequent PCs (Figure 3a). The proportion of the sum of the eigenvalues above this plateau provides a measure of the relative amount of substructure variation defined by each PC (Figure 3b). For the total EAS group, >90% of the substructure is defined in the first four PCs by this measurement. For the group of the five populations representing the most populous ethnic groups studied, the first two PCs account for 90% of the variation above the plateau. Similar analyses were also performed using population sets restricted to the more closely related Han Chinese, Japanese, and Korean groups, as well as a group restricted to Han Chinese and Chinese Americans (Table 2). These results, as expected, indicated substantially less substructure. However, even the subject set limited to Han Chinese and Chinese Americans showed substructure in PC1 using the split half reliability test and with the self-identified groupings (ANOVA result). The relationship among the Han Chinese can be demonstrated in PCAs performed either including or excluding other EAS populations (Figure 4). Although there is variability in the distribution of many of the self-identified groups, there was a general northwest/southeast gradient within these Chinese participants. In PC1 the North Han Chinese (HGDP from north central China [12]) were most separated from the southern Chinese participants, including the Chinese American participants from Taiwan or with self-reported southern China origin.
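The "variation above the plateau" measurement can be made concrete with a short sketch. It follows the description in the Figure 3 legend, in which the PC10 eigenvalue is subtracted from each eigenvalue and each adjusted eigenvalue is divided by the sum of the adjusted eigenvalues for PC1 through PC10. The function name and argument layout are assumptions for illustration only.

```python
def substructure_proportions(eigenvalues, plateau_index=9):
    """Share of above-plateau variation attributed to each PC.

    eigenvalues: eigenvalues for PC1, PC2, ... in order.
    plateau_index: 0-based index of the PC taken as the plateau (9 = PC10).
    Returns one proportion per PC from PC1 up to the plateau PC.
    """
    if len(eigenvalues) <= plateau_index:
        raise ValueError("need eigenvalues up to the plateau PC")
    plateau = eigenvalues[plateau_index]
    # "Adjusted" eigenvalue: the part of each eigenvalue above the plateau.
    adjusted = [ev - plateau for ev in eigenvalues[: plateau_index + 1]]
    total = sum(adjusted)
    if total <= 0:
        raise ValueError("no variation above the plateau")
    return [a / total for a in adjusted]
```

Summing the first four returned proportions gives the fraction of substructure variation defined by PC1 through PC4 under this measure.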

Informativeness of Smaller Sets of SNPs for Large East Asian Population Groups
We next examined the ability of smaller sets of SNPs to define population genetic structure in EAS populations. Random sets of 20 K, 5 K and 1 K SNPs were used to examine substructure in the combined population set and a subset of subjects from the most populous EAS groups (Han Chinese, Japanese, Korean, Filipino and Vietnamese). Correlation values (r²) were calculated comparing these SNP subsets with the 200 K SNP set. These results, summarized in Table 3, showed that the 20 K random SNP set and 5 K random SNP set corresponded closely with the >200 K SNP set for the first 4 PCs, with decreased correlations observed for the 1 K random SNP set. The relatively poor performance of the 1 K random sets was more pronounced when more closely related population groups were considered, e.g. Japanese and Korean for PC1: 20 K/200 K r² = 0.82 ± 0.12 (mean ± SD), 5 K/200 K r² = 0.69 ± 0.03, and 1 K/200 K r² = 0.28 ± 0.06. These results suggest that random sets of 5 K SNPs may be necessary for resolving and adjusting for substructure in these EAS populations (see discussion).
[Table 1 footnote fragment: ... Figure 1 with the exception of the Yakut that derive from northern Siberia. doi:10.1371/journal.pone.0003862.t001]
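The comparison of a SNP subset with the full array can be illustrated by the squared correlation of per-subject PC scores, summarized over replicate random sets as mean and SD. This is a sketch under assumptions: the text does not state how r² was computed, so a squared Pearson correlation is used here, and both function names are hypothetical. Squaring has the convenient property that the arbitrary sign of a principal component does not affect the result.

```python
from math import fsum
from statistics import mean, stdev


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length score vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = fsum((a - mx) ** 2 for a in x)
    syy = fsum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant vector has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


def summarize(replicate_values):
    """Mean and sample SD over replicate r^2 values (e.g. three random sets)."""
    return mean(replicate_values), stdev(replicate_values)
```

In use, x would hold the PC1 scores of each subject from the full SNP set and y the PC1 scores of the same subjects, in the same order, from one random subset.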

East Asian Substructure Ancestry Informative Markers
AIMs that discern population substructure are likely to be useful in candidate gene and chromosomal position based association studies and in defining homogeneous subject sets [24]. Since these methods are most applicable to large population groups, we restricted our ascertainment to five populations (Han Chinese, Japanese, Korean, Vietnamese and Filipino) (see Methods). To assess the potential usefulness of these AIMs, an independent set of samples was used and the AIMs were compared with the same number of random SNPs. For this assessment we included Cambodian and Dai samples since we had limited samples from the Vietnamese and Filipino populations. The 3 K AIMs showed close correlation with the 200 K results for the first two PCs (Table 3). A set of the best 1.5 K AIMs also showed close correlation (Figure 5 and Table 3). A reduced set of 750 AIMs showed a fall-off in correlation but was still equivalent to 3 K random SNPs. None of the AIM sets correlated with PC3 or PC4 (r² < 0.01, p < 0.05); however, these PCs distinguished the Dai and Cambodian from the other population groups, and these groups were not included in our AIM selections. Nevertheless, for the common EAS populations these data suggest that the EAS-AIMs (Table S3) will be useful for association studies in the majority of EAS and EAS-American populations.

Discussion
The current study extends the definition of EAS population substructure and the relationships among these ethnic groups. The inclusion of participant groups from populous countries in this region with large contributions to the USA population is an important aspect of our study. These population groups complement those included within HapMap studies as well as the HGDP in showing relationships between EAS groups and demonstrating that autosomal genotypes can be used to ascertain membership of various EAS groups. These results emphasize that EAS substructure, similar to that previously shown for European substructure, will likely be important for complex disease association studies in defining study participants and reducing type 1 and type 2 error rates.
Our study extends the results of PCA analyses of EAS populations, including those of HGDP populations that were recently reported [13]. The graphic representation of the first two PCs showed close correspondence to the historical geographical location and/or sample collection site for most of the EAS population groups. Thus, despite admixture and perhaps uncertain migration patterns, overall the largest component of genotypic variation that is discernible by reducing high order data (all genotypes) to lower order variations (PCs) is consistent with the population geography. This finding supports hypotheses that the relationships among the EAS populations are largely explained by clines formed by demic expansion(s). We speculate that the inclusion of many different related ethnic groups has recapitulated the most common events that separated these ethnic groups. The first PC axis, accounting for the largest variation, has a north/south orientation. One major part of this pattern forms a line from Siberia (Yakut) to Mongolia to Eastern China (Figure 1). The PCA analyses also suggest at least two separate clines originating or terminating in eastern China at one end and Cambodia and the Philippines at the other end. In addition, there is another cline extending from Eastern China to the Korean peninsula and Japan.
Multiple previous studies have examined the relationship between and possible origins of different EAS population groups. Analysis of mitochondrial and Y chromosome haplogroups as well as limited numbers of classical markers and microsatellite polymorphisms have also provided results that are generally consistent with a north/south orientation of relationships between different EAS population groups [15][16][17][18]. However, there are exceptions, with some studies failing to show this relationship [19,25]. As summarized by a recent review [26], there are three different postulates regarding the origins of EAS population groups: 1) South East Asian origin [14][15][16][17][18], 2) North Asian origin [27] and 3) a combination of northern and southern origin [19,20]. However, the majority of studies have supported a South-East Asian origin for most EAS populations and include detailed analyses of the age of specific mitochondrial haplogroups, Y chromosome sequences as well as limited marker studies [26]. In contrast, hierarchical trees in the recent HGDP study [13] show branching points consistent with a Yakut derivation. Recent studies using a novel copying model statistical approach appear to suggest an initial northern and southern origin (Cambodians, Mongolians, Xibo, Yi, Tu, Daur, and Naxi receiving large contributions from central-Asian populations) that contribute to Han ancestry [28]. These studies also provide data supporting the derivation of many other EAS groups from a Han expansion (including She, Japanese, Dai, Lahu and Miao).
[Figure 3 legend fragment: Note: the eigenvalues plateau as shown in panel A and there is no discernible substructure beyond PC10 for these analyses (Table 2). For each PC, the PC10 eigenvalue was subtracted to determine an "adjusted" eigenvalue. The % substructure variation measurement was the proportion of each adjusted eigenvalue divided by the sum of the adjusted eigenvalues (PC1 through PC10). doi:10.1371/journal.pone.0003862.g003]
While the current study does not strongly support any of these hypotheses, it does suggest that eastern China is central to the events shaping the population groups in this region.
[Table 3, footnote b, fragment: Summary of analyses is provided for correlations of three independent random marker sets for each random marker group. For each random group the correlation with the full array set (>200 K SNPs) is expressed as the mean r² ± S.D.]
Multiple additional PCs are necessary to define the overall substructure relationships for the entire group of EAS populations studied, as shown in Figure 2. However, most of the variation is discerned in the first four PCs for the EAS populations examined and in the first two PCs for the five most populous EAS groups studied. There was no geographic correspondence of the additional PCs and it is unclear whether these additional patterns correspond to individual or multiple different events in the histories of these population groups. Overall the size of the paired Fst values, as expected, showed a strong correlation with the PC eigenvalues summed over the first four PCs (data not shown). Although Fis values do not provide evidence for inbreeding in the current populations, it is unclear whether inbreeding or other factors, including bottlenecks during the history of particular EAS ethnic groups, may have contributed to the relationships between these populations.
An important aspect of the current study was the identification of EAS-AIM sets. The results show that these AIMs can distinguish the major variation between the populous population groups including Han Chinese, Japanese, Korean, Vietnamese, and Filipino. Additional testing to examine correction for stratification with these population groups was not possible due to the limited genotypes currently available. However, by analogy with previous studies in European population groups, these AIMs, particularly the 1500 EAS-AIM set, should be effective in addressing population stratification. The close correspondence of the relative positions in the first two PCs in individual subjects, even within the Han Chinese group, supports the potential use of these SNP AIMs. Furthermore, the split half reliability test (SHT) analysis suggests that studies within the Han Chinese population and Chinese-Americans will benefit from the use of such AIMs in candidate gene studies.

Populations studied
The populations studied included those from the HGDP, HapMap, the I-control database, a Korean sample set and East Asian Americans. For all but the East Asian American and Korean sample sets, genotypes were available from online databases. For all EAS American and Korean subjects, blood cell samples were obtained according to protocols and informed-consent procedures approved by institutional review boards, and were labeled with an anonymous code number linked only to demographic information.
The Korean participants were recruited in Korea (21 subjects). The EAS American samples were from individuals born in the respective EAS country and were from Vietnam (22 subjects), the Philippines (17 subjects), different regions of the People's Republic of China (23 subjects) and Taiwan (9 subjects). The Filipino American participants included 15 who were recruited as part of the New York Cancer Project (NYCP), a prospective longitudinal study [29], and two recruited in Houston TX. 3 Filipino, 15 Vietnamese and 32 Chinese American samples were recruited in Houston TX. An additional 7 Vietnamese and 3 Korean genotypes were from the I-ControlDB. Of the Chinese American participants (CHA), 28 also indicated their general origin from regions within China (6 north, 10 south, 3 central and 9 from Taiwan).

Genotyping
Genotyping was performed using a 300 K Illumina array according to the Illumina Infinium 2 assay manual (Illumina, San Diego), as previously described [30].

Data Filters
SNPs and individual samples with less than 90% complete genotyping information from any data set were excluded from analyses. SNPs that showed extreme deviation from Hardy Weinberg equilibrium (p<0.00001) in individual population groups were also excluded from analysis. These filters resulted in a total of 215 K autosomal SNPs that were used for these studies. In addition, for samples of non-HGDP origin, individuals with evidence of >10% contribution from other continents were excluded from further study. This was either performed prior to Illumina array genotyping for the Filipino, Vietnamese and CHA subjects using 128 continental AIMs [31]. Samples were also filtered for possible cryptic relationships using the PLINK program [32].
Figure 5. Ability of EAS-AIMs to discern population substructure. A, PCA analysis of tester population samples (see Table 3) using 200 K SNPs. B, PCA analysis of same tester population samples using 1500 EAS-AIMs. doi:10.1371/journal.pone.0003862.g005
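The two SNP-level filters can be sketched for a single SNP within a single population group. The thresholds (90% completeness, p<0.00001) are taken from the text; everything else is an assumption for illustration. In particular, the study computed Hardy-Weinberg equilibrium with HelixTree, whose exact test is not described here, so a 1-degree-of-freedom chi-square goodness-of-fit test is substituted, and the genotype coding (0, 1, 2 copies of one allele, None for missing) is hypothetical.

```python
from math import erfc, sqrt


def call_rate(genotypes):
    """Fraction of non-missing calls; genotypes are 0/1/2, or None if missing."""
    if not genotypes:
        return 0.0
    return sum(g is not None for g in genotypes) / len(genotypes)


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return erfc(sqrt(chi2 / 2.0))


def passes_filters(genotypes, min_call_rate=0.90, hwe_alpha=1e-5):
    """True if the SNP meets the completeness and HWE criteria in one group."""
    if call_rate(genotypes) < min_call_rate:
        return False
    called = [g for g in genotypes if g is not None]
    p_value = hwe_chi2_pvalue(called.count(0), called.count(1), called.count(2))
    return p_value >= hwe_alpha
```

Because the text applies the Hardy-Weinberg criterion in individual population groups, a SNP would be dropped if passes_filters returns False for any one group.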

Statistical Analyses
Fst and Fis were determined using Genetix software [33] that applies the Weir and Cockerham algorithm [34]. A measure of informativeness for each SNP (I_n) was determined using an algorithm previously described [35]. Hardy-Weinberg equilibrium was determined using HelixTree 5.0.2 software (Golden Helix, Bozeman, MT, USA).
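For readers unfamiliar with per-SNP informativeness, the following sketch computes one widely used statistic of this kind for a biallelic SNP. It assumes that I_n here refers to the "informativeness for assignment" measure, I_n = sum over alleles j of [ -p_j ln p_j + (1/K) sum over populations i of p_ij ln p_ij ], where p_j is the mean allele frequency across the K populations. The source only cites reference [35] for the algorithm, so this identification, the function name and the input layout are assumptions.

```python
from math import log


def informativeness_in(freqs_by_pop):
    """I_n for one biallelic SNP, assuming equal weight for each population.

    freqs_by_pop: frequency of one allele in each of K populations.
    Returns a value between 0 (identical frequencies) and ln(K) at most.
    """
    k = len(freqs_by_pop)
    if k < 2:
        raise ValueError("need at least two populations")
    total = 0.0
    # Loop over the two alleles: the given allele and its complement.
    for allele_freqs in (list(freqs_by_pop), [1.0 - f for f in freqs_by_pop]):
        mean_freq = sum(allele_freqs) / k
        if mean_freq > 0:
            total -= mean_freq * log(mean_freq)
        # Terms with frequency 0 contribute nothing (limit of p ln p is 0).
        total += sum(f * log(f) for f in allele_freqs if f > 0) / k
    return total
```

A SNP with the same frequency in every population scores 0, while a fixed difference between two populations scores ln 2, the maximum for K = 2.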
Population structure was examined using STRUCTURE v2.1 [36,37] using parameters and AIMs previously described [31]. This analysis was performed to exclude individuals with evidence of substantial continental admixture from Europe, Africa or the American continent (see Data Filters).
PCA was performed using the EIGENSTRAT statistical package [38]. All analyses were performed after deleting the MHC region on chromosome 6, since regions of high linkage disequilibrium can overly influence PCA results. The Kruskal-Wallis test [23], a nonparametric alternative to the ANOVA, was used to examine the statistical significance of the difference in PC scores among subject groups pre-assigned based on self-identification.
The split half reliability test can determine whether independent (non-overlapping) SNP sets provide the same or different results. The split half reliability test was adjusted by the Spearman-Brown formula [39] and was performed three times using 1) alternate chromosomes, 2) alternate half chromosomes, and 3) half genome SNP sets. These sets were chosen to eliminate any dependency in each test between the two half data sets based on linkage disequilibrium.
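A minimal sketch of the split half calculation follows. The Spearman-Brown step-up formula, r_SB = 2r / (1 + r), is standard; the rest is assumed for illustration, including the reading of "alternate chromosomes" as odd versus even chromosome numbers, the function names, and the use of a Pearson correlation between the PC scores obtained from the two halves. Because the sign of a principal component is arbitrary, the two score vectors are assumed to have been oriented the same way before they are correlated.

```python
from math import fsum, sqrt


def split_by_alternate_chromosomes(snps):
    """Split (snp_id, chromosome) pairs into odd- and even-chromosome halves."""
    odd = [snp_id for snp_id, chrom in snps if chrom % 2 == 1]
    even = [snp_id for snp_id, chrom in snps if chrom % 2 == 0]
    return odd, even


def pearson_r(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = fsum((a - mx) ** 2 for a in x)
    syy = fsum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant vector has no defined correlation")
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step up a half-test correlation to full-length reliability."""
    if r_half <= -1.0:
        raise ValueError("correlation must be greater than -1")
    return 2.0 * r_half / (1.0 + r_half)
```

For a given PC, the scores computed from each half of the SNPs would be passed to pearson_r and the result to spearman_brown; a value near 1 indicates that independent halves of the genome recover the same structure.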

Selection of EAS-AIMs
Genotypes from 32 Han Chinese (CHA and CHB), 36 Japanese (JPT), 19 Korean, 21 Filipino and 14 Vietnamese were used for SNP selection. An initial set of 3000 EAS substructure AIMs (EAS-AIMs) was selected based on either I_n values or SNP scores from PCA. The best performance using a testing panel was observed using a set of SNPs selected using I_n values from a combination of 1) all five population groups (top 600 SNPs), 2) Chinese and Japanese (top 1200 SNPs), and 3) Chinese and Filipino (top 1200 SNPs). The best performance of a 1500 SNP set and a 750 SNP set was observed using a combination of 500 or 250 SNPs, respectively, from each of these three groups. The testing panel consisted of 20 Chinese, 20 Japanese, 4 Korean, 3 Filipino, 1 Vietnamese, 10 Dai and 10 Cambodian. None of the samples in the testing panel overlapped with the ascertainment samples. The Dai and Cambodian samples were included since there were limited numbers of samples available from the Vietnamese group. The performance of the EAS-AIMs was evaluated using correlations in PC1 and PC2 with the >200 K SNP set.
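The combination of top-ranked SNPs from several population comparisons can be sketched as below. The quotas in the usage note mirror the 600/1200/1200 split described above, but the handling of SNPs that rank highly in more than one comparison is not stated in the text. This sketch assumes such SNPs are counted once and the next-ranked SNP is taken instead; the function and dictionary key names are hypothetical.

```python
def select_aims(scores_by_comparison, quotas):
    """Pick the top-scoring SNPs from each comparison, skipping duplicates.

    scores_by_comparison: {comparison name: {snp_id: I_n value}}.
    quotas: {comparison name: number of SNPs to take}, processed in order.
    Returns the selected SNP ids in order of selection.
    """
    selected = []
    seen = set()
    for name, quota in quotas.items():
        ranked = sorted(
            scores_by_comparison[name].items(),
            key=lambda item: item[1],
            reverse=True,  # highest informativeness first
        )
        taken = 0
        for snp_id, _score in ranked:
            if taken == quota:
                break
            if snp_id in seen:
                continue  # already chosen from an earlier comparison
            seen.add(snp_id)
            selected.append(snp_id)
            taken += 1
    return selected
```

With real data, quotas might be written as {"all_five": 600, "chinese_japanese": 1200, "chinese_filipino": 1200} for the 3000-marker set, or 500 (or 250) per comparison for the 1500 (or 750) marker sets.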