Figures
Abstract
The Japanese archipelago is a terminal location for human migration, and the contemporary Japanese people represent a unique population whose genomic diversity has been shaped by multiple migrations from Eurasia. We analyzed the genomic characteristics that define the genetic makeup of the modern Japanese population from a population genetics perspective from the genomic data of 9,287 samples obtained by high-coverage whole-genome sequencing (WGS) by the National Center Biobank Network. The dataset comprised populations from the Ryukyu Islands and other parts of the Japanese archipelago (Hondo). The Hondo population underwent two episodes of population decline during the Jomon period, corresponding to the Late Neolithic, and the Edo period, corresponding to the Early Modern era, while the Ryukyu population experienced a population decline during the shell midden period of the Late Neolithic in this region. Haplotype analysis suggested increased allele frequencies for genes related to alcohol and fatty acid metabolism, which were reported as loci that had experienced positive natural selection. Two genes related to alcohol metabolism were found to be 12,500 years out of phase with the time when they began to increase in the allele frequency; this finding indicates that the genomic diversity of Japanese people has been shaped by events closely related to agriculture and food production.
Author summary
The human population in the Japanese archipelago exhibits significant genetic diversity, with the Ryukyu Islands and other parts of the archipelago (Hondo) having undergone distinct evolutionary paths that have contributed to the genetic divergence of the populations in each region. In this study, whole genome sequencing of healthy individuals from national research hospital biobanks was utilized to investigate the genetic diversity of the Japanese population. Haplotypes were inferred from the genomic data, and a thorough population genetic analysis was conducted. The results indicated not only genetic differentiation between Hondo and the Ryukyu Islands, but also marked differences in past population size. In addition, gene genealogies were inferred from the haplotypes, and the patterns were scrutinized for evidence of natural selection. This analysis revealed unique traces of natural selection in East Asian populations, many of which were believed to be linked to dietary changes brought about by agriculture and food production.
Citation: Kawai Y, Watanabe Y, Omae Y, Miyahara R, Khor S-S, Noiri E, et al. (2023) Exploring the genetic diversity of the Japanese population: Insights from a large-scale whole genome sequencing analysis. PLoS Genet 19(12): e1010625. https://doi.org/10.1371/journal.pgen.1010625
Editor: Takashi Gojobori, King Abdullah University of Science and Technology, SAUDI ARABIA
Received: January 21, 2023; Accepted: October 24, 2023; Published: December 7, 2023
Copyright: © 2023 Kawai et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The allele and genotype frequency data are available in the NBDC human database; Accession: hum0331. https://humandbs.biosciencedbc.jp/en/hum0331-v1 Individual-level genomic data will be made available upon review and approval by the biobanks from which the data is provided. The Secretariat of the NCBN Central Bank (secretariat@ncbiobank.org) will process requests for use.
Funding: This work was supported by grants from the Japan Agency for Medical Research and Development under Grant No. JP19kk0205012 to M.M. and K.T. Y.K. was supported by grants from the Ministry of Education, Culture, Sports, Science and Technology for Grant-in-aid for Scientific Research on Innovative Areas under Grant No. 18H05505. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The Japanese archipelago is located in the eastern part of the Eurasian continent and is one of the final destinations of the human migration out of Africa [1]. While the identity of the first human groups to reach the Japanese archipelago is uncertain, the Jomon people, who were hunter-gatherers known for their pottery, lived in the region after 16,000 years ago [2]. The genetic diversity of the peoples of the Japanese archipelago underwent a dramatic transformation following the Yayoi Period, which began around 3,000 years ago with the migration of agriculturalists from Eurasia. Genome analysis of ancient and modern humans has shown that they admixed with the originally inhabited Jomon people, resulting in the genetic diversity of the modern Japanese population from the Yayoi people [3,4]. This process is thought to have started on Kyushu Island and then gradually spread throughout the archipelago. Although the Ryukyu Islands are separated from Kyushu Island by a significant distance, the agricultural culture known as the Gusuku period began 800 years ago, and it is believed that this agriculture was introduced by migrants from the mainland [5]. These long histories of migration and admixture have shaped the genetic diversity of the peoples of the Japanese archipelago. Previous genome analyses have attempted to reveal this genetic diversity, but most have only sampled specific regions and therefore have been insufficient to fully examine the genetic diversity of the Japanese archipelago as a whole[6,7]. In this study, we used whole-genome sequencing data from subjects from a wide range of regions in the Japanese archipelago to more fully understand the genetic diversity of the peoples of the archipelago.
There are six national research hospitals in Japan that specialize in advanced medical care and research, and each of them maintains its own biobank that collects and stores biological samples from patients. National Center Biobank Network (NCBN) is a federation of these centers that collaborate to provide samples, genomic and clinical information, and public relations [8]. In this study, we performed WGS of 9,850 individual DNA specimens stored in the biobanks of NCBN. These biobanks are located in three distinct regions of Japan: NCGM, NCCHD, and NCNP in the Tokyo area; NCGG in Aichi Prefecture in the central area in the Honshu Island; and NCVC in Osaka Prefecture in the western area in Honshu Island. Therefore, the genomic information obtained in the present study is expected to reflect the regional diversity of Japan to some extent. Here, we characterized the data obtained from this analysis and described the genetic diversity in the Japanese population.
Results
Whole genome sequencing analysis
DNA samples from 9,850 individuals from five National Center Biobanks were analyzed using WGS, and the data in FASTQ format were received from the outsourced laboratory. Of the 9,850 samples, 390 were derived from saliva and the remaining 9,460 samples were derived from blood. The received data were processed through the primary data analysis pipeline to obtain mapping results and variant call results. Quality control (QC) metrices were calculated to confirm that stable quality analysis was obtained (Fig 1). The average read depth per autosomal sample after mapping was 34.0 ± 2.4. The minimum read depth observed was 20.0, but the majority of samples exhibited a read depth of 30 or greater. Notably, samples derived from saliva showed a higher frequency of low read depths. In contrast, among the blood-derived samples, the lowest read depth recorded was 29.3. This observation can be attributed to the fact that non-human foreign DNA present in saliva was subjected to mapping. Indeed, the percentage of unmapped reads in the saliva-derived samples was clearly higher than in the blood-derived samples (S1 Fig). For comparison purposes, we utilized high-depth WGS data obtained from the International 1000 Genomes Project phase 3 [9]. Although this data originated from cell lines, it was generated at a high depth using the same sequencer (Illumina Novaseq6000), enabling straightforward comparison with the NCBN samples. The average depth profile per sample exhibited a similar trend between the NCBN and 1000 Genomes datasets (S1 Fig). There was no difference between NCBN and 1000Genomes in the comparison of base coverage across the genome. For example, on a sample average, 96.8% of the entire genome in both datasets had a coverage of 20 or more, while NCBN and 1000Genomes covered 70.0% and 67.5% of the genome, respectively, with a coverage of 30 or more (S2 Fig). We examined the contamination of DNA from the WGS results with VerifyBamID2 software [10], and found that 22 samples from NCBN had the estimated contamination rates (alpha) greater than 1%. These samples were excluded from further analysis (S3 Fig).
Quality control metrics for each sample are plotted against the sample in the horizontal axis direction; QC indices are (A) average coverage of reads in autosomal loci after excluding duplicated reads, (B) mapping rate, and (C) average insert length. Saliva-derived samples are colored by yellow.
Summary and accuracy of SNP and short insertion and deletions
Variants were characterized by joint calling to integrate individual variant information. In this analysis, we performed joint calling of the gVCFs of 9,287 samples from NCBN, which had been completed at the time of writing manuscript, together with 2,504 samples from the International 1000 Genomes project (S1 Table). The VCF obtained after the joint call contained a total of 208,785,859 records, of which 88.5% (184,864,563) passed the filtering using Variant Quality Score Recalibration (VQSR). We found 122,459,307 variants after focusing only on the variants in the NCBN samples, and 86.3% (105,729,588) of them passed the filter of VQSR. Of the variants that passed the filter, 87,246,166 were single nucleotide variants (SNVs) and 18,483,422 were short insertion and deletions (INDELs); 47% (41,046,547) of the SNVs and 39.8% (7,361,318) of the INDELs that passed the filter were novel variants not registered in dbSNP151. Most of the novel variants were very rare. For example, 34.56% of the known SNVs were singletons found as the heterozygous genotypes of one sample out of 9,287 individuals, and 86.73% were observed at a very low frequency of less than 0.5%. Conversely, 67.46% of the novel variants not registered in dbSNP were singletons, and the percentage of SNVs with a frequency less than 0.5% was more than 99.99%. This is consistent with previous reports that most novel variants are found privately [11–14].
We evaluated the accuracy of the variants using two approaches. First, we performed the genotyping using SNP array to estimate the degree of genotype concordance with WGS results. For this purpose, genome-wide genotyping using the SNP array on the DNA samples of 448 individuals who had undergone WGS analysis was conducted. The 639,508 autosomal SNPs remaining after variant QC in the SNP array were compared with the results obtained after WGS analysis. The number of mismatches ranged from 66 to 7,205 per sample, with an average of 408.7. As a result, the average discordance rate between the two sets of variants was 0.063%. This estimate appears to be a conservative estimate of the error, as it is primarily concentrated in a region that is easy to analyze and for which probes are designed on SNP arrays. Then, we compared the genotypes of the trio samples to estimate the frequency with which the offspring of a trio had heterozygous or non-reference homozygous variants whose parents’ genotypes did not follow the pattern expected from Mendelian low. The sample analyzed in this study contained 148 trios of parents and offspring. We observed the inheritance pattern of genotypes from an average of 4,284,264 variants per trio. Of these, 6,448.4 (0.15%) had an abnormal inheritance. This percentage became more pronounced when stratified by the novelty of the variants, e.g., the known SNVs and INDELs had error rates of 0.09% and 0.42%, respectively, with errors in the inheritance pattern, whereas the novel SNVs and INDELs had error rates of 2.26% and 10.9%, respectively. The Mendelian heritability errors can include the sequencing or genotyping error, de novo mutations, and gene conversions in the parents’ gametes. However, our estimates approximate the error rate of genotyping in this study.
Ancestry inference and allele frequency distribution
We conducted the principal component analysis (PCA) to identify the ancestry of NCBN samples. After the exclusion of 316 samples from NCBN and 11 samples from the 1000 Genomes Project, identified as closely related based on Identical-by-Descent (IBD) analysis, a total of 8,971 unrelated samples from NCBN and 2,493 from the 1000 Genomes Project were selected for subsequent analysis. PCA using these samples detected 21 NCBN samples not belonging to the East Asian populations (Fig 2A). These samples were confirmed to be from outside of East Asia by verifying with the biobank from which they were provided. Significantly, the NCBN sample was found on the far edge of the East Asian cluster, which could support the idea that the Japanese population is a separate group within East Asia. To verify the influence of the notably large size of the NCBN sample within the overall population, PCA was conducted on a downsampled subset of 200 NCBN samples. Similar results were obtained, corroborating the initial findings (S4 Fig). Furthermore, when PCA was performed exclusively on East Asians, the samples segregated into two clusters: one consisting of continental populations (Han Chinese in Beijing; CHB, Han Chinese South; CHS, Kinh Vietnamese; KHV, Chinese Dai in Xishuangbanna, China; CDX) and the other incorporating Japanese in Tokyo (JPT) from 1000 Genomes and NCBN samples (Fig 2B). Moreover, the latter cluster subdivided into larger and smaller subsets. The larger cluster comprised 8,524 individuals, whereas the smaller cluster consisted of 182 individuals (S5A Fig). The Biobank information showed that 140 of the 182 individuals of the small cluster lived in or were born in Okinawa Prefecture, while only 27 individuals of the large cluster lived in or were born in Okinawa Prefecture. This postulates that the larger one represents the Hondo population, while the smaller one corresponds to the Ryukyu population [15]. The division of the PCA cluster within the Japanese population aligns with findings reported in previous studies [15–17]. The same bifurcation into two clusters, Hondo and Ryukyu, was reproduced either by downsampling the NCBN sample (S4 Fig) or by employing UMAP, a different dimensionality reduction method (S5B Fig). When the PCA and UMAP plots were plotted with color coding by donor biobank, there were differences in the distribution among biobanks within the Hondo clusters (S5C and S5D Fig). This indicates the presence of genetic structure within the Hondo cluster because the biobanks were located in different locations. Notably, the cluster with the majority of NCGG samples in UMAP was located at the edge of the Hondo cluster, suggesting the presence of regional genetic structure as NCGG is located in Aichi Prefecture in central Japan. Further detailed analysis is needed to account for individual demographic information such as place of birth. We also mapped the saliva samples to PCA and UMAP plots to examine the effect of data quality on the analysis (S5E and S5F Fig). We found that there were no clear clusters of saliva samples and that they had little impact on the population structure analysis. We compared the allele counts of the Japanese population (GEM Japan Whole-genome Aggregation) estimated based on the WGS analysis in previous studies with those of the Hondo sample and found significant frequency agreement (Fig 3A). While the allele frequencies between the Hondo and Ryukyu populations also showed high agreement, the breadth of the distribution was wider than the comparison between Hondo and GEM Japan (Fig 3B). This could be due to the difference in the Hondo and Ryukyu populations and the subsequent genetic drift.
(A) The first and second principal components are plotted. The continental population of the international 1000 genomes and NCBN are plotted in different colors and shapes. (B) PCA plots of the East Asian population of the International 1000 Genomes and NCBN samples are shown. JPT: Japanese in Tokyo, Japan, CHB: Han Chinese in Beijing, China, CHS: Han Chinese South, KHV: Kinh in Ho Chi Minh City, Vietnam, CDX: Chinese Dai in Xishuangbanna, China.
(A) The non-reference allele frequencies of the Hondo population of NCBN samples (X-axis) and the corresponding variants of GEM Japan (Y-axis) were counted and then the numbers were plotted as density. (B) Same plot for Hondo population (X-axis) and Ryukyu population (Y-axis).
Functional landscape of variants
The variants identified by WGS analysis were annotated for their biological functions by Variant Effect Predictor v102 [18]. The impact of the variants was classified based on the criteria of the annotation software and the database as described in the Methods section. Deleterious mutations are more likely to be kept at low frequencies in the population, as such mutations are less likely to spread in the population because of negative selection. In fact, variants with a high impact on annotation showed a clear tendency to have a low frequency in the population. For example, 70.8% of the “stop gain” variants, which produce a termination codon in the middle of a protein sequence, were singletons observed in heterozygous individuals. On the other hand, in the class of intergenic regions not expected to have a particular impact (upstream & downstream gene variants), the percentage of singletons was about 51%. For a more detailed examination, the LOFTEE plugin of Variant Effect Predictor was used to detect loss-of-function (LoF) variants in the Hondo and Ryukyu populations. For comparison, we also detected LoF variants in 26 populations in the 1000 Genomes Project phase 3 dataset [9]. 14,145 SNVs and 16,823 INDELs were detected as high confident LoF specific to the Hondo population. For the Ryukyu population, 211 SNVs and 288 INDELs were detected. The vast majority of LoF SNVs exhibited a very low frequency in the Hondo population (Fig 4A). In fact, 76.0% of these SNVs exhibited allele frequencies below 0.01%. We compared the number of LoF alleles and the number of homozygous of LoF alleles per individual for Hondo, Ryukyu, and populations in the 1000 Genomes Project (Figs 4B, 4C, and S6). Since homozygous LoFs result in a complete loss of gene function, the number of homozygous LoFs in an individual’s genome can be used to measure the individual’s genetic burden. Both indices were highest in Africa, lowest in West Eurasia, and moderate in Hondo and Ryukyu. The number of homozygous LoF alleles per individual by allele frequency was generally higher in Africa across all allele frequencies (Fig 4D), which is consistent with the trend observed in a previous study [19] except for the high number of low frequency (<0.1%) homozygous LoF alleles.
(A) The allele frequency distribution of newly detected HC LoF SNPs in the Hondo population. (B) The number of LoF alleles and (C) the number of homozygous of LoF alleles per individual for Hondo Ryukyu, and populations of the International 1000 Genomes. (D) The number of homozygous of LoF alleles per individual by allele frequency for Hondo, Ryukyu, and the populations of the International 1000 Genomes.
We compared the variants of NCBN samples with ClinVar registered variants [20]. A total of 103,833 variants found in the Hondo population are registered in ClinVar. Of these, 2,427 were classified as “pathogenic” or “likely pathogenic” variants. Seven variants were found in the four-star category, the most reliable classification based on the ClinVar review status. Only one of them was “pathogenic” and a singleton variant (i.e., heterozygous in a person) of the CFTR gene which are known to cause the disease cystic fibrosis, a condition that affects the respiratory and digestive systems. The remaining six were polymorphic variants related to drug responsiveness of CYP2C19. There were 1,130 variants in the 3-star category reviewed by the expert panel. Of these, 56 were “pathogenic,” and 13 were “likely pathogenic.” The frequencies of these variants were the highest, at 1.0%, and most of them were extremely rare; only a few were observed in the population. Most of the less well-reviewed variants with <3 stars had frequencies of less than 1%, but 34 variants had a frequency of 1% or more.
Allele frequency estimation of HLA loci
Three-field HLA calling results from the WGS dataset in the present study were compared with HLA allele frequencies HLA Foundation Laboratory (Kyoto, Japan) (S7 Fig). All common HLA alleles (frequencies >1%) were concordant between the two datasets with observed differences of less than 1%. To further validate our HLA calling results, a subset of 94 samples was subjected to high-resolution HLA genotyping. Three-field HLA class I (HLA-A, -C, and -B) accuracies were 96.3%, 97.9%, and 96.8%, respectively, and 3-field HLA class II (HLA-DRB1, -DQA1, -DPA1, and -DPB1) accuracies were 98.9%, 100.0%, 98.9%, 100.0%, and 96.8%, respectively. The accuracy of 2-field HLA class I (HLA-A, -C, and -B) increased to 97.9%, 98.4%, and 97.3%, respectively.
Evolutionary perspective of genomic diversity
The recent decrease in population size was detected in Hondo and Ryukyu populations. Fig 5A shows the population histories of Hondo and Ryukyu populations inferred using IBDNe [21], which estimated the changes in population size in recent (~200 generations ago) past based on IBD sharing among individuals. In Hondo, the population size decreased from about 75 to 50 generations ago, and from 17 to 11 generations ago. In the Ryukyu population, a reduction in population size was observed from about 100 to 25 generations ago. The distributions of IBD length were multimodal in both populations, indicating fluctuations in population size (Fig 5B). We also estimated the long-term changes in the effective population size from the genome-wide genealogy using Relate software [22]. We estimated genome-wide genealogy based on the whole-genome data of 1,000 randomly selected samples from Hondo, 182 samples from the Ryukyu, and 103 samples of CHB population from the 1000 Genomes Project. The Ryukyu population showed a bottleneck that peaked around 2,700 years ago (S8 Fig). Hondo/CHB population and Ryukyu population diverged around 3,700 years ago, consistent with previous estimations of the divergence time using SNP arrays [23,24].
(A) Short-term effective population size change in Hondo and Ryukyu populations by IBDNe. (B) Distribution of IBD segment length in Hondo (top) and Ryukyu (bottom).
We detected positive natural selection based on genome-wide genealogy of 1,000 Hondo samples by Relate and found SNPs with p-values below the genome-wide significance level (p < 5.0 × 10−8) (S9 Fig and S2 Table). This analysis tests whether there are more branching after a particular time period in the genealogy estimated from the data than in the null model expected from the coalescent model. As the QQ plot suggested inflation of the test statistics (S10 Fig), it is possible that the results contain false positives. However, the genes reported in previous studies [25–34], which may have undergone positive natural selection, were correctly included in the results. It is therefore important to consider this when interpreting the results. For example, ALDH2 rs671 G/A (p-value = 2.0 × 10−17) and ADH1B rs1229984 T/C (p-value = 6.8 × 10−10), which are associated with alcohol metabolism, showed positive natural selection signals [27,28]. The genealogies of the genes showed that the derived alleles were spreading rapidly through the population (S11 Fig). The second example is signals of positive natural selection on the non-synonymous rs76930569 C/T (p-value = 1.1 × 10−12) variant in the OCA2 gene. This variant is in complete linkage equilibrium with rs1800414 T/C, involved in melanin biosynthesis, and has been shown to be associated with light skin color and tanning ability in Asian populations [26,33,34]. The third example of the positive selection is the FADS gene family. Multiple SNPs (rs174599, rs174600, rs174601, rs97384, rs57535397, rs76996928) showed the signatures of positive selection. FADS1 and FADS2 encode catalytic proteins, which synthesize long-chain fatty acids from short-chain fatty acids [30], and have been subjected to natural selection related to diet in several human populations [25,30–32]. The results confirm the existence of positive natural selection as evidenced by the analysis using integrated haplotype score (iHS) statistics [35] (S12 Fig). The most pronounced normalized iHS within the ADH1B, FADS family, and OCA2 loci were recorded as 3.25, 3.77, and 3.71, respectively. These values are within the top 0.31%, 0.17%, and 0.18% of the entire genome, respectively. Conversely, the peak iHS observed within the ALDH2 locus was 2.48, placing it in the top 2.3%, suggesting a relatively less robust signal of positive natural selection compared to other loci. We further analyzed change in allele frequency with time for these genes under positive natural selection. We used CLUES software [36] to estimate the allele frequency trajectory of SNPs in ALDH2, ADH1B, OCA2, and the FADS gene family. The frequency of the derived alleles in ADH1B rs1229984 increased about 20,000 years ago (Fig 6A). In contrast, the frequency of ALDH2 rs671 increased from about 7,500 years ago (Fig 6B). The allele frequency trajectory of OCA2 rs1800414 (Fig 6C) showed that the frequency of derived allele of OCA2 rs1800414 began to increase due to natural selection around 25,000 years ago. The frequency of derived allele of rs174599 began increasing around 25,000 years ago, slowed down 15,000 years ago, and started increasing again 10,000 years ago (Fig 6D).
Allele frequency trajectories of (A) ADH1B rs1229984, (B) ALDH2 rs671, (C) OCA2 rs1800414, and (D) FADS1 rs174599 are shown.
Discussion
In the present study, we conducted a WGS analysis of samples from five biobanks in Japan. Although the data obtained in this study are intended to be provided as control data for genomic studies of various diseases, the analysis in this study focused on data quality and population genetics properties. A uniform quality of data was obtained through the use of a single procedure that encompassed both sequencing and data analysis. Population-based studies using WGS analysis have been conducted in various populations [12,37,38]. Studies on Japanese populations have already been reported [12], and the allele frequency distributions in previous studies are consistent with the results of the present study (Fig 3A). The samples analyzed in the present study were provided by biobanks in three regions of Japan: NCGM, NCCHD, and NCNP in the Tokyo area; NCGG in Aichi Prefecture in the central area in the Honshu Island; and NCVC in Osaka Prefecture in the western area in Honshu Island. Therefore, the genomic information obtained in the present study is expected to reflect the regional diversity of Japan to some extent. For instance, the population genetic analysis identified two clusters representing the ancestry of Ryukyu Islands, comprising Okinawa Prefecture and the islands of Kagoshima Prefecture and the Hondo region (mainland). This supports the idea that the Hondo and Ryukyu populations are genetically differentiated, as suggested by anthropological studies [15–17]. We further found that past population sizes differed between Hondo and Ryukyu. There was a reduction in the Hondo population from 17 to 11 generations ago (Fig 5A). The corresponding period was 476 and 308 years ago, and the assumption is that each generation spanned 28 years; most of this duration overlaps with the Edo period in Japan. This is consistent with the findings from historical demography studies, which suggest that the population grew at the beginning of the Edo period, but declined or stagnated in the middle period (18th century) due to limited economic growth, population concentration in cities, and famine caused by cold weather-related damage during this period. In contrast, the Ryukyu populations showed population reduction from 100 generations ago to 25 generations ago but then increased until the present (Fig 5A). This population growth about 700 years ago was close to the beginning of farming in the Ryukyu Islands (12th century). Assuming that agriculture was brought to the Ryukyu Islands by migrants from the mainland of Japan, the population decline observed in the Ryukyu population can be considered a bottleneck associated with the migration. The population size estimated from the modern genome reflects the past population of the migrants and should be influenced negligibly by the genetic diversity of the original inhabitants of the Ryukyu Islands. Indeed, although the several human skeletal remains have been discovered from Pleistocene sites in Ryukyu Islands [39,40], the previous population genetic analysis based on genome-wide SNPs suggested minor genetic contribution of the Pleistocene Ryukyu Island population to the modern Ryukyu population [23,24]. The estimated time of divergence between Hondo/CHB and Ryukyu was 3,700 years ago (S8 Fig), suggesting that migration to the Ryukyu Islands occurred recently. This study primarily attributes the genetic differentiation observed within the Japanese archipelago to migration and subsequent isolation, while also underscoring the significance of admixture. Previous studies [6,17,41] has identified distinct ancestral components in both the Hondo and Ryukyu populations, a phenomenon interpreted as the result of admixture between the Jomon, the initial settlers of the Japanese archipelago, and the Yayoi, migrants to Japan approximately 3,000 years ago. The comprehensive sampling across Japan, as undertaken in this study, is anticipated to bring to light novel insights from this perspective. The feasibility of future research in this area hinges on the provision of sample origin data from biobanks.
Studies of rare genetic diseases require data on the frequency of variants in the population. Most of the variants we found in this study were rare, and many of them were newly discovered in this study, as expected from population genetics theory. However, the lower the frequency of the variants, the more difficult it becomes to distinguish them from errors. In this study, we evaluated the accuracy of genotype detection, estimating a discordance rate of 0.063% compared to genotype detection using WGS and SNP arrays. However, this is an overestimation of the error rate due to the combined error of both technologies. We also used the data obtained from the WGS analysis of the trio for validation. We estimated the Mendelian error rate, which is the proportion of genotypes detected in the offspring of a trio that is inconsistent with Mendel’s laws of heredity. This method has the advantage of being able to examine the entire genome compared to the use of SNP arrays. We found that the Mendelian error rate is much higher for novel variants that have not been reported previously. The Mendelian error rate for novel SNVs was 2.26%, much higher than that of the known SNVs (0.09%). The observed overall Mendelian error rate of 0.15% in our study is significantly lower than the 5.25% estimated in a previous whole exome analysis [42]. Drawing direct comparisons between these two studies is difficult due to the divergent methodologies employed. Therefore, while the precise reasons for this discrepancy remain unclear, our findings evidently demonstrate a decreased Mendelian error rate. However, it is important to note that the Mendelian error rates observed here are considerably higher than the theoretical values derived from the estimated de novo mutation rate based on identity by descent (IBD) (1.2 x 10−8 mutations per base pair per meiosis) [43], and thus the majority of these errors are artifacts that arise from sequencing and variant calls. This has important implications for the identification of causative mutations in rare genetic diseases, as many causative mutations for these conditions are newly discovered rare variants. This means that the discovery of such pathological variants in patient sequencing is subject to a non-negligible degree of error.
We conducted the functional annotation of the variants discovered in this study. Consistent with previous studies [12,37,38], variants that were expected to have a high biological impact were less common in the population, confirming that negative natural selection shapes the diversity of variants. Most of the LoF mutations were extremely rare, and most of them were heterozygous (Fig 4A). The number of LoF mutations in the genome was comparable to that in other Eurasian populations (Fig 4B). Although the Ryukyu population has experienced population decline (Fig 5A), the frequency of LoF variants was comparable to that in the Hondo population, and no evidence of differences in the profile of rare functional variants due to the bottleneck effect was noted. The number of LoF sites and homozygous LoF per individual in this study were higher than those detected in a previous study [44]. Among these, the number of stop gained SNVs was consistent with that recorded in the previous study [44], whereas the number of splice site SNVs and frameshift INDELs was higher than that in the previous study [44] (S6 Fig). The number of LoF sites was generally consistent with the number of LoF sites before manual curation in a previous study [45]; thus, it may be possible to remove false-positive homozygous LoFs through manual filtering, as in the previous study.
We also examined pathogenic variants that have been reported in the past. Pathogenic variants assessed by an expert panel (4-star status) on ClinVar were found only in one to a few individuals in the population. On the other hand, some variants that were less reviewed were polymorphic with high frequency. These results reinforce the importance of utilizing the frequency of the variants in the population to evaluate their pathogenicity.
Genes that have undergone positive natural selection in the East Asian populations are related to the metabolism. This study supported that the dietary changes in the ancestors seem to have shaped gene frequencies. Candidate regions undergoing positive natural selection were found on a genome-wide scale using genealogy analysis (S9 Fig). ADH1B is involved in metabolizing alcohol to acetaldehyde, and ALDH2 is involved in metabolizing acetaldehyde. Both the non-synonymous A allele of ALDH2 rs671 and the C allele of ADH1B rs1229984 affect the retention of acetaldehyde in the body and cause alcohol flush in Asians [27,46]. These alleles have been suggested to be associated with Japanese dietary habits and diseases, such as esophageal cancer [47–49]. Previous studies have hypothesized that positive selection may have acted to maintain acetaldehyde in the blood against parasite infection, which correlates with large-scale rice cultivation [28,49–51]. We also observed that the increase in the frequency of ADH1B occurred earlier than that of ALDH2, indicating that positive selection began to act at different times for these two genes (Fig 6A and 6B). Based on the geographic distribution of haplotype structures around ADH1B and ALDH2, according to Koganebuchi et al., positive selection on ADH1B rs1229984 started before the beginning of the Jomon period, while positive natural selection on ALDH2 began around 8,000 years ago, in association with the beginning of rice cultivation in China [28]. Our dating by genome-wide genealogy of the Japanese population genome is consistent with the above consideration. Using HapMap data, OCA2 rs1800414 has been shown in previous studies to be the effect of positive natural selection on East Asians [26]. For the OCA2 gene, positive natural selection signals were found in the European population for skin color-related SNPs other than those detected in this study [26]. As natural selection works for light skin color, a previous study mentioned that it enhances vitamin D synthesis capacity in regions with low sunlight [34]. For East Asians as well, positive natural selection may have operated in relation to vitamin D synthesis in regions with low sunlight. However, since the derived allele of rs1800414 is not necessarily more frequent at the high latitudes of East Asia, other possibilities, such as sexual selection, cannot be ruled out at this time [26]. The derived allele of rs1800414 has been shown to be associated with light skin color and tanning ability in Chinese and Japanese populations [33,34] and is widely observed in modern East Asians [26], suggesting that the derived allele of rs1800414 originated in the common ancestor of East Asians and spread throughout East Asia at very early stages of the East Asian population history. We estimated that the derived allele of OCA2 rs1800414 began to increase in frequency around 25,000 years ago (Fig 6C). Future analyses of older East Asian lineages, such as the ancient genome of the Jomon people, may reveal the original variant of this allele that led to positive natural selection. FADS1 and FADS2 participate in fatty acid metabolism. For example, in the Inuit population, which relies heavily on a marine animal diet, there are positive natural selection signals on SNPs of FADS2 genes, which are responsible for the increase in the concentrations of short-chain fatty acids [52]. Signals of positive natural selection on alleles that promote long-chain fatty acid synthesis have also been identified in African [30], European [25,32,53], and South Asian populations [29]. In particular, studies in European populations have shown that the derived alleles of rs174594 and rs1714546 are associated with increased total cholesterol and LDL cholesterol levels, increased expression of FADS2, and decreased expression of FADS1 [25]. In European populations, increased reliance on plant diets seemed to have resulted in positive natural selection on alleles that promote long-chain fatty acid synthesis pathways of the FADS gene family [25,31,32]. The SNPs in the FADS gene family detected in this study were associated with total cholesterol and LDL cholesterol levels, increased expression of FADS2, and decreased expression of FADS1 (S3 Table), like the SNPs subjected to natural selection in the European population (S3 Table). These results suggest that in Hondo populations, as in Europeans, the dietary change was accompanied by positive natural selection for alleles that promote the long-chain fatty acid synthesis. The frequency of the derived allele of rs174599 in FADS2 began to increase around 25,000 years ago, but the increase was not continuous, and there was a period of stagnation from 15,000 years ago for 5,000 years (Fig 6D). Interestingly, the frequency of this allele varies widely among East Asian populations. The derived allele was major in CHB (64%) and Japanese (63%), whereas it was minor in Dai (Chinese Dai in Xishuangbanna, China) (22%), Han Chinese in South (42%), and Kinh Vietnamese (20%); these data suggest that the positive natural selection of the FADS gene family in East Asians may reflect the association with agriculture and the complex dietary differences among regional populations. Notably, Mathieson and Mathieson (2018) disproved the simple idea that these derived alleles underwent positive natural selection in relation to the introduction of agriculture and speculated that there were complex underlying factors, such as unknown dietary changes [32].
In this study, we demonstrated that the data presented here can be used as a foundation for analysis of human genetics. While this study focused on population genetic characterization of the Japanese population, the data can be used in disease studies, as a resource for genotype imputation in studies of common diseases, and as a control in studies on rare diseases.
Materials and methods
Ethics statement
This study was conducted with approval from the ethics review committee of the NCGM. The approval number is NCGM-S-003413-06. Written informed consent for the analysis of these samples was received from all subjects in each biobank.
Sample preparation
DNA samples stored in the biobanks of five national centers (National Cerebral and Cardiovascular Center; NCVC, National Center for Geriatrics and Gerontology; NCGG, National Center for Global Health and Medicine; NCGM, National Center of Neurology and Psychiatry; NCNP and National Center for Child Health and Development; NCCHD) were submitted for WGS analysis. Samples derived from healthy individuals or patients with some common diseases were selected as control groups for future disease studies. Approximately 50 μl of DNA at a concentration of 80 ng/μl per sample was aliquoted into 96-well plates and shipped to an outsourced laboratory (TakaraBio, Shiga, Japan) for WGS analysis.
WGS
To avoid quality fluctuations and batch effects, all samples were analyzed by a single outsourced laboratory. WGS analysis was performed using NovaSeq6000 (Illumina, San Diego, CA, US), and sample preparation was performed using the procedures and reagents recommended by the manufacturer. DNA molecules were sonicated with a protocol targeting an average size of 550 bp. DNA libraries were prepared using the TruSeq DNA PCR-Free HT Library Prep Kit, and index sequences were added for multiplex analysis. The insert size was confirmed by electrophoresis in the range of 400–750 bp before sequencing runs. WGS was performed at 150 bp paired-end and repeated in multiplex until an output of >90 Gbases without duplicated reads was obtained.
Data analysis
We received the quality controlled FASTQ data from the outsourced laboratory and performed mapping and variant calling in an in-house data analysis pipeline. The mapping and variant calls were performed using the Parabricks v3.1.0 (Nvidia, Santa Clara, CA, US), which provides the capability to perform the analysis recommended by GATK at high speed using a GPU [54]. The GRCh38 was used as the reference sequence. The pipeline used in this study implements algorithms equivalent to those of bwa (v0.7.15) [55] and GATK (v4.1.0). We flagged duplicates from mapped reads but did not perform realignment and base quality score recalibration to reduce the computational time. The mapped data were outputted in BAM format and converted into CRAM format using samtools [56] to reduce the file size. Variant calls were output in gVCF format for joint calling. QC metrics were obtained to evaluate the quality of the analyzed data. The depth and map rate were calculated using GATK’s CollectWgsMetrics tool. These QC metrics were continuously monitored throughout the analysis. The sex chromosomes were analyzed assuming both male and female genders. Variant calls were performed for chromosome X in the diploid (-ploidy 2) model for females and the monoploid (-ploidy 1) model for males. Variant calls of the pseudoautosomal region were performed in the diploid model. Variants of chromosome Y were called in the monoploid mode regardless of the sample’s gender. Finally, the appropriate gVCF file for each sex was used during joint calling. Data from the high-coverage WGS analysis of 2,504 individuals of the International 1000 Genomes Project phase 3 [9] were used as population references for this study. The CRAM files reanalyzed using high depth WGS were downloaded from a public database, and variant calls were performed with the protocol described in this study.
Integrated analyses
To properly estimate the frequencies of variants found after WGS in the population, we integrated the gVCF files. The joint calling was conducted by combining samples from the biobank of NCBN and samples from the International 1000 Genomes Project phase 3. For the joint calling, we used the gVCFtyper program of the Sentieon package [57]. This program produces results equivalent to those of GeomicsDBImport followed by GenotypeGVCFs programs for the joint calling of GATK. To perform efficient computation in a cluster computation environment, we divided the autosomes into 29 regions evenly. Each variant was scored using VQSR to filter the integrated VCF. The VarCal and ApplyVarCal programs of the Sentieon package corresponding to GATK’s VariantRecalibrator and ApplyVQSR, respectively, were used for this process. The HapMap and International 1000 Genomes Omni2.5 sites, the high-confidence SNPs of the International 1000 Genomes Project, and the dbSNP151 sites were used as true, training, and known datasets, respectively. Variants identified as PASS, which correspond to filtering with 99.9% sensitivity, were used in subsequent analyses unless otherwise noted. The INDELs in the present variant set were normalized by performing left align, and multiallelic variants were split into multiple variant records using the norm subcommand in bcftools [44].
Variant annotation
Variants were annotated with the Variant Effect Predictor v102 [18]. We ran the loftee plugin to evaluate the effects of the LoF variants. For the other evaluation of the functional effects, dbNSFP4.1 [58] was used to assign precomputed evaluation values to the variants. The metrics used for the assignment included LRT, SIFT, MutationTaster, and Polyphen2.
Genotyping by SNP array
Using JaponicaArray [59], genome-wide genotyping was performed on a subset of samples for comparison with WGS results. Ninety-four samples each from five biobanks were analyzed using the residual DNA after WGS analysis. The analysis was performed by an outsourced laboratory (Toshiba, Tokyo, Japan), and the raw data in CEL format was received. Four samples were dropped from the genotyping due to a low call rate (<97%) in the first step of genotyping. Clustering for SNP genotyping of variants was performed on the data of 466 individuals using the Analysis Power Tools (ver. 2.10.2.2, Thermo Fisher Scientific, MA, USA). The clustering results for each probes’ intensity were classified using the SNPolisher program bundled with the Analysis Power Tools, and the 639,508 SNPs classified as “Recommended” in autosomes were used for subsequent analyses. The genotype concordance with WGS was estimated using the hap.py software. Comparison was made with 466 samples that overlapped with the WGS analysis samples selected for the calculation of allele frequencies. To compare the positions for which probes were designed in the SNP array, SNVs with the same position as the SNP array were extracted from the results of WGS analysis. The SNP array results were used as true data and the WGS results as query data. The genotype discordance rate between the SNP array and WGS was calculated by dividing the number of false positives by 639,508, which is the total number of SNPs compared.
Allele frequency estimation
To calculate the accurate allele frequencies, the ancestry of the samples was estimated using PCA. Variants were filtered under more stringent criteria for this purpose. Individual’s genotypes were considered no calls if they had a genotype quality (GQ) of less than 20, a depth outside the range of 11 to 64, or if less than 25% of the reads supported the minor allele for heterozygous calls. Then, sites with SNPs that had a VQSR filter of PASS, a minor allele frequency of >1%, and a call rate of >95% were retained. The KING program [60] selected samples of unrelated individuals in the third-degree kinship or more. For this dataset, independent SNPs were extracted using PLINK1.9 [61] with “-indep-pairwise 500 50 0.1”, and PCA was performed to calculate the principal component values for each sample using PLINK1.9. Clusters were identified visually on the scatter plot of the first and second principal components.
Allele and genotype frequencies were estimated for each ancestry group and biobank. The fill-tags plugin of bcftools was used for these calculations. To compare the allele frequencies in the Japanese population, we downloaded the GEM Japan frequency panel information from TogoVar. Since the GEM Japan panel only provides information in hg19 coordinates, we converted it to GRCh38 coordinates. We used GATK’s LiftoverVcf program for the conversion.
HLA analysis
Three-field HLA alleles calling was performed using HLA-HD v1.3.0 [62] based on IPD-IMGT/HLA v3.43.0 [63]; a score based on the weighted read counts considering variations in and outside of the domain for antigen presentation was calculated to select the most suitable pair of alleles amongst the candidate HLA alleles. To validate the accuracies of HLA calling from WGS, HLA allele frequency distribution was compared with the HLA frequency dataset from HLA Foundation Laboratory (Kyoto, Japan). To evaluate the accuracy of HLA calling from the WGS dataset, a subset of the samples (n = 94) was subjected to high-resolution experimental HLA genotyping for eight HLA genes (HLA-A, -C, -B, -DRB1, -DQA1, -DQB1, -DPA1, and -DPB1) using next-generation sequencing and AllType assay (One Lambda, West Hills, CA, US). Experimental HLA genotyping was carried out following the vendor instructions, which consist of HLA gene amplification, HLA library preparation, HLA template preparation, and HLA library loading onto an ion 530v1 chip (Thermo Fisher Scientific) in the Ion Chef (Thermo Fisher Scientific), followed by final sequencing on the Ion S5 machine (Thermo Fisher Scientific). HLA genotype assignments were carried out using HLATypeStream Visual (TSV v2.0; One Lambda, West Hills, CA, US) and NGSengine (v2.18.0.17625, GenDX, Utrecht, the Netherlands).
Haplotype phasing
Variant phasing was performed using shapeit v4.2 [64] in a haplotype-based analysis. SNPs of unrelated samples identified using the ancestry inference were extracted for phasing. The variant phasing was performed by dividing the autosomes into regions containing overlaps for efficient computation. Each region was about 10 Mb in length, with a 500 kb overlap margin at both ends. After phasing, VCFs were concatenated using the concat subcommand in bcftools.
Estimation of recent population size change
We estimated the effective population size change of the Japanese population from IBD sharing, which can estimate the population size change in the recent past (~200 generations ago) using WGS data [21]. Population size change was estimated for each population based on the whole-genome data of Hondo (8,524 individuals) and Ryukyu (182 individuals). First, the hapibd software [65] was used to detect the IBD segments shared by each individual. For the genetic distance, we referred to the HapMap genetic map data distributed with hapibd. IBD segments were detected by separating the short and long arms of the chromosomes so that IBD segments did not include the centromere. We then estimated the population size change of the Hondo and Ryukyu populations using IBDNe (ibdne.23Apr20.ae9.jar). The shortest threshold of the IBD segment length was set at 2 cM.
Estimation of genome-wide genealogy, estimation of population size change, and detection of positive natural selection
We conducted the analysis of gene genealogy using the Relate software [22] to detect long-term population size change and positive natural selection in Hondo and Ryukyu populations. Relate is a software that can estimate genealogy on a genome-wide scale for over 10,000 samples. In this study, we used 1,000 randomly selected individual genomes of Hondo, 182 Ryukyu samples, and 103 CHB samples of 1000 Genomes Project [9]. First, input files (.haps,.samples) were created from vcf files using the PrepareInputFiles.sh script in Relate software. We retrieved the Homo sapiens ancestral sequences (GRCh38) of Ensembl 103 for the ancestral sequence and StrictMask of 1000 Genomes Project for genomic mask. Next, genome-wide genealogy was estimated using the “Relate” command of Relate software packages. The mutation rate was set to 1.25 × 10−8 per base per generation and the effective population size was set to 30,000. We assumed 28 years as the generation time in humans. The estimated genome-wide genealogy (.anc,.mut) was used as input for population size estimation of Hondo and Ryukyu populations using the EstimatePopulationSize.sh script. This script simultaneously conducts estimation of population sizes, re-estimation of branch lengths using the estimated population sizes, and estimation an average mutation rate. Finally, based on genome-wide genealogy, we detected the target SNPs of positive natural selection acting on Hondo and Ryukyu populations. Relate calculates a p-value of each SNP for positive selection that quantifies how quickly a mutation has spread in the population. The p-values were calculated for each SNP using the DetectSelection.sh script using the output genealogies of population size estimation (.anc,.mut). We evaluated the quality of each SNP by “RelateSelection–mode Quality,” and SNPs inferred to be inaccurate tree estimation were excluded. The iHS statistic was computed on the same data set using the program selscan [66]. with the VCF containing phased genotypes as input, the analysis was run separately for the short and long arms of the chromosomes. The iHS obtained from the analysis were normalized stratified by 1% of allele frequencies with the norm program bundled with selscan.
Estimating the allele frequency trajectory
Changes in the allele frequency through the time were estimated using CLUES to infer allele frequency trajectories [36]. CLUES uses the genome-wide genealogy inferred by Relate. First, the sampleBranchLengths.sh script implemented in Relate was used to MCMC sample the gene trees for the focal SNPs. Then, using the sampled tree file (.timeb) as input, we estimated the allele frequency trajectory using CLUES’s inference.py command. The coalescence rate estimated by Relate (.coal file) can be used as an input to modify the population size change using the -coal option. In this study, we estimated the allele frequency trajectory by focusing on ALDH2 rs671, ADH1B rs1229984, OCA2 rs1800414, and FADS2 rs174600 among the SNPs that showed signals of natural selection in Relate.
Web resources
IGSR: The International Genome Sample Resource, https://www.internationalgenome.org
Hap.py, https://github.com/Illumina/hap.py
GEM Japan, https://www.amed.go.jp/en/aboutus/collaboration/ga4gh_gem_japan.html
TogoVar, https://togovar.biosciencedbc.jp/
GATK, https://gatk.broadinstitute.org/hc/en-us
HLA frequency dataset from HLA Foundation Laboratory (Kyoto, Japan), http://hla.or.jp/index.html
Homo sapiens ancestral sequences (GRCh38), ftp://ftp.ensembl.org/pub/current_fasta/ancestral_alleles/homo_sapiens_ancestor_GRCh38.tar.gz
Supporting information
S1 Fig. Comparison of QC metrics.
Comparison of QC metrics between NCBN and 1000 Genome samples. (A)(D) Histogram of average depth per sample. (B)(E) Cumulative frequencies of average depth per sample. Comparisons are made between NCBN and 1000Genome samples in (A)(B)(C); Comparisons are made between blood, saliva (both NCBN) and cell line (1000Genomes) in (D)(E)(F).
https://doi.org/10.1371/journal.pgen.1010625.s001
(PDF)
S2 Fig. Fraction of the genomic region that is mapped with specific depth (10x, 20x, 30x, 40x).
The fraction of genomic regions exhibiting a certain level of mapping depth is computed for each sample. These data are then represented and compared using a boxplot stratified by DNA source.
https://doi.org/10.1371/journal.pgen.1010625.s002
(PDF)
S3 Fig. Breakdown of QC steps and sample size for this study.
The steps in the analysis and QC process are illustrated in the figure, with a breakdown of the number of samples.
https://doi.org/10.1371/journal.pgen.1010625.s003
(PDF)
S4 Fig. PCA performed with the number of NCBN samples reduced to 200.
A PCA was performed on 200 randomly selected NCBN samples under the same conditions as in Fig 2. (A) The first and second principal components are plotted. The continental population of the international 1000 genomes and NCBN are plotted in different colors and shapes. (B) PCA plots of the East Asian population of the International 1000 Genomes and NCBN samples are shown.
https://doi.org/10.1371/journal.pgen.1010625.s004
(PDF)
S5 Fig. Genetic structure of East Asian populations.
(A) The clusters consisting of the NCBN samples in Fig 2 are classified into Hondo (blue), Ryukyu (orange), and others. (B) The six principal component of PCA were dimensionally reduced to a two-dimensional representation via the Uniform Manifold Approximation and Projection (UMAP). The color scheme employed matches that of (A). The UMAP transformation was executed using the UMAP function from the umap-learn package in Python, with the parameters set to n_neighbors = 10 and min_dist = 0.5. (C) (D) Plot of (A) and (B) color-coded by sample donor biobank. Peripheral density distribution is also attached to show the difference in distribution within clusters. (E) (F) Plots (A) and (B) with saliva-derived samples highlighted.
https://doi.org/10.1371/journal.pgen.1010625.s005
(PDF)
S6 Fig. Analysis of Loss-of-function (LoF) variants.
The numbers of LoF sites per individual by category are presented: (A) and (B) stop gained SNV; (C) and (D) splice site SNV; (E) and (F) frameshift INDELs.
https://doi.org/10.1371/journal.pgen.1010625.s006
(PDF)
S7 Fig. HLA alleles frequencies (%) between NCBN vs HLA Foundation Laboratory, Kyoto, Japan.
Comparison for class I HLA genes (HLA-A, -C, -B) (left). Comparison for class II HLA genes (HLA-DRB1, -DQA1, -DQB1, -DPA1, -DPB1) (right). Only common HLA alleles (HLA frequencies > 1%) are included in this analysis.
https://doi.org/10.1371/journal.pgen.1010625.s007
(PDF)
S8 Fig. Long-term effective population size change of Hondo, Ryukyu, and Han Chinese.
The changes in population size were estimated from the gene genealogy across the genome.
https://doi.org/10.1371/journal.pgen.1010625.s008
(PDF)
S9 Fig. Manhattan plot of the selection scan result of the whole-genome SNPs by Relate.
The red line represents the genome-wide significance level (5 × 10−8).
https://doi.org/10.1371/journal.pgen.1010625.s009
(PDF)
S10 Fig. QQ plot of the selection scan result of the whole-genome SNPs by Relate.
The red line denotes y = x.
https://doi.org/10.1371/journal.pgen.1010625.s010
(PDF)
S11 Fig. Gene genealogy estimated by Relate.
Genealogy of (a) ALDH2 rs671, (b) ADH1B rs1229984 (c) OCA2 rs1800414 (d) FADS1 rs174599 are presented. The vertical axis represents the age (years before present). Derived allele carriers are shown in red.
https://doi.org/10.1371/journal.pgen.1010625.s011
(PDF)
S12 Fig. Genome-wide signature of positive selection (his) of the whole genome SNPs by selscan.
The absolute values of iHS for each SNP are plotted on the vertical axis, with red and blue lines drawn at the top 0.1% and 1% of the genome, respectively, in descending order of absolute iHS value.
https://doi.org/10.1371/journal.pgen.1010625.s012
(PDF)
S1 Table. Summary of variants discovered by WGS.
https://doi.org/10.1371/journal.pgen.1010625.s013
(PDF)
S2 Table. List of SNPs for which natural selection was detected by Relate.
https://doi.org/10.1371/journal.pgen.1010625.s014
(XLSX)
S3 Table. Abbreviated GTEx eQTL Results of SNPs affected by positive natural selection in FADS gene family, P Value Cut Off of 10−20.
https://doi.org/10.1371/journal.pgen.1010625.s015
(PDF)
Acknowledgments
We thank the participants who provided biological samples for this study. We thank Ms. Megumi Tatsumi, Ms. Haruna Kaneko, Ms. Ayumi Fujisawa, Ms. Mami Sasaki, Ms. Yukiko Miyashita, and Ms. Chikako Kinjyo at NCNP Biobank; Ms. Sachiyo Ito at NCGG Biobank; Dr. Satoshi Suzuki at NCGM Biobank; the personnel at NCVC Biobank; Ms. Kazuyo Kawamura, Dr. Kumiko Yanagi, Dr. Saki Aoto, and Ms. Fuyuki Hasegawa at NCCHD Biobank; and Ms. Akino Takase at NCBN Central Biobank for their contributions to this study. We also thank Dr. Leo Speidel for discussion on the results of Relate analysis and Dr. Osamu Ogasawara for the technical support for the development of secondary analysis pipelines. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics.
References
- 1. Oppenheimer S. Out-of-Africa, the peopling of continents and islands: tracing uniparental gene trees across the map. Philos Trans R Soc Lond B Biol Sci. 2012;367(1590):770–84. pmid:22312044; PubMed Central PMCID: PMC3267120.
- 2.
Habu J. Ancient Jomon of Japan. Cambridge, UK; New York, NY, USA: Cambridge University Press; 2004. xv, 332 p. p.
- 3. Jinam TA, Kanzawa-Kiriyama H, Saitou N. Human genetic diversity in the Japanese Archipelago: dual structure and beyond. Genes & Genetic Systems. 2015;90(3):147–52. pmid:26510569
- 4. Nakagome S, Sato T, Ishida H, Hanihara T, Yamaguchi T, Kimura R, et al. Model-based verification of hypotheses on the origin of modern Japanese revisited by Bayesian inference based on genome-wide SNP data. Mol Biol Evol. 2015;32(6):1533–43. Epub 20150310. pmid:25758010.
- 5. Pearson R. Ancient Ryukyu An Archaeological Study of Island Communities: University of Hawai’i Press; 2013.
- 6. Jinam T, Kawai Y, Kamatani Y, Sonoda S, Makisumi K, Sameshima H, et al. Genome-wide SNP data of Izumo and Makurazaki populations support inner-dual structure model for origin of Yamato people. Journal of Human Genetics. 2021;66(7):681–7. pmid:33495571
- 7. Yasuda J, Katsuoka F, Danjoh I, Kawai Y, Kojima K, Nagasaki M, et al. Regional genetic differences among Japanese populations and performance of genotype imputation using whole-genome reference panel of the Tohoku Medical Megabank Project. BMC Genomics. 2018;19(1):551–. pmid:30041597
- 8. Omae Y, Goto Y-i, Tokunaga K. National Center Biobank Network. Human Genome Variation. 2022;9(1):38. pmid:36333292
- 9. Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, et al. High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. bioRxiv. 2021:2021.02.06.430068–2021.02.06. pmid:36055201
- 10. Zhang F, Flickinger M, Taliun SAG, In PPGC, Abecasis GR, Scott LJ, et al. Ancestry-agnostic estimation of DNA sample contamination from sequence reads. Genome Res. 2020;30(2):185–94. Epub 20200124. pmid:31980570; PubMed Central PMCID: PMC7050530.
- 11. Fu W O’Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013;493(7431):216–20. pmid:23201682
- 12. Nagasaki M, Yasuda J, Katsuoka F, Nariai N, Kojima K, Kawai Y, et al. Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals. Nature Communications. 2015;6(1):8018–. pmid:26292667
- 13. The Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nature Genetics. 2014;46(8):818–25. pmid:24974849
- 14. The 1000 Genomes Project Consortium An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. pmid:23128226
- 15. Yamaguchi-Kabata Y, Nakazono K, Takahashi A, Saito S, Hosono N, Kubo M, et al. Japanese Population Structure, Based on SNP Genotypes from 7003 Individuals Compared to Other Ethnic Groups: Effects on Population-Based Association Studies. The American Journal of Human Genetics. 2008;83(4):445–56. pmid:18817904
- 16. Jinam T, Nishida N, Hirai M, Kawamura S, Oota H, Umetsu K, et al. The history of human populations in the Japanese Archipelago inferred from genome-wide SNP data with a special reference to the Ainu and the Ryukyuan populations. Journal of human genetics. 2012;57(12):787–95. pmid:23135232
- 17. Watanabe Y, Isshiki M, Ohashi J. Prefecture-level population structure of the Japanese based on SNP genotypes of 11,069 individuals. Journal of Human Genetics. 2021;66(4):431–7. pmid:33051579
- 18. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17(1):122. Epub 20160606. pmid:27268795; PubMed Central PMCID: PMC4893825.
- 19. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285–91. pmid:27535533
- 20. Landrum MJ, Lee JM, Benson M, Brown G, Chao C, Chitipiralla S, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic acids research. 2016;44(D1):D862–8. pmid:26582918
- 21. Browning SR, Browning BL. Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. The American Journal of Human Genetics. 2015;97(3):404–18. pmid:26299365
- 22. Speidel L, Forest M, Shi S, Myers SR. A method for genome-wide genealogy estimation for thousands of samples. Nature Genetics. 2019;51(9):1321–9. pmid:31477933
- 23. Matsunami M, Koganebuchi K, Imamura M, Ishida H, Kimura R, Maeda S. Fine-Scale Genetic Structure and Demographic History in the Miyako Islands of the Ryukyu Archipelago. Molecular Biology and Evolution. 2021;38(5):2045–56. pmid:33432348
- 24. Sato T, Nakagome S, Watanabe C, Yamaguchi K, Kawaguchi A, Koganebuchi K, et al. Genome-Wide SNP Analysis Reveals Population Structure and Demographic History of the Ryukyu Islanders in the Southern Part of the Japanese Archipelago. Molecular Biology and Evolution. 2014;31(11):2929–40. pmid:25086001
- 25. Buckley MT, Racimo F, Allentoft ME, Jensen MK, Jonsson A, Huang H, et al. Selection in Europeans on Fatty Acid Desaturases Associated with Dietary Changes. Molecular biology and evolution. 2017;34(6):1307–18. pmid:28333262
- 26. Donnelly MP, Paschou P, Grigorenko E, Gurwitz D, Barta C, Lu R-B, et al. A global view of the OCA2-HERC2 region and pigmentation. Human genetics. 2012;131(5):683–96. pmid:22065085
- 27. Harada S, Agarwal DP, Goedde HW. Aldehyde dehydrogenase deficiency as cause of facial flushing reaction to alcohol in Japanese. 1981. p. 982–.
- 28. Koganebuchi K, Haneji K, Toma T, Joh K, Soejima H, Fujimoto K, et al. The allele frequency of ALDH2*Glu504Lys and ADH1B*Arg47His for the Ryukyu islanders and their history of expansion among East Asians. American Journal of Human Biology. 2017;29(2):e22933–e. pmid:27801545
- 29. Kothapalli KSD, Ye K, Gadgil MS, Carlson SE, O’Brien KO, Zhang JY, et al. Positive Selection on a Regulatory Insertion-Deletion Polymorphism in FADS2 Influences Apparent Endogenous Synthesis of Arachidonic Acid. Molecular biology and evolution. 2016;33(7):1726–39. Epub 2016/03/29. pmid:27188529
- 30. Mathias RA, Fu W, Akey JM, Ainsworth HC, Torgerson DG, Ruczinski I, et al. Adaptive Evolution of the FADS Gene Cluster within Africa. PLoS ONE. 2012;7(9):e44926–e. pmid:23028684
- 31. Mathieson I, Lazaridis I, Rohland N, Mallick S, Patterson N, Roodenberg SA, et al. Genome-wide patterns of selection in 230 ancient Eurasians. Nature. 2015;528(7583):499–503. pmid:26595274
- 32. Mathieson S, Mathieson I. FADS1 and the Timing of Human Adaptation to Agriculture. Molecular Biology and Evolution. 2018;35(12):2957–70. pmid:30272210
- 33. Shido K, Kojima K, Yamasaki K, Hozawa A, Tamiya G, Ogishima S, et al. Susceptibility Loci for Tanning Ability in the Japanese Population Identified by a Genome-Wide Association Study from the Tohoku Medical Megabank Project Cohort Study. Journal of Investigative Dermatology. 2019;139(7):1605–8.e13. pmid:30690034
- 34. Yang Z, Zhong H, Chen J, Zhang X, Zhang H, Luo X, et al. A Genetic Mechanism for Convergent Skin Lightening during Recent Human Evolution. Molecular biology and evolution. 2016;33(5):1177–87. pmid:26744415
- 35. Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol. 2006;4(3):e72. Epub 20060307. pmid:16494531; PubMed Central PMCID: PMC1382018.
- 36. Stern AJ, Wilton PR, Nielsen R. An approximate full-likelihood method for inferring selection and allele frequency trajectories from DNA sequence data. PLOS Genetics. 2019;15(9):e1008384–e. pmid:31518343
- 37. Genome of the Netherlands C. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet. 2014;46(8):818–25. Epub 20140629. pmid:24974849.
- 38. Gudbjartsson DF, Helgason H, Gudjonsson SA, Zink F, Oddson A, Gylfason A, et al. Large-scale whole-genome sequencing of the Icelandic population. Nature Genetics. 2015;47(5):435–44. pmid:25807286
- 39. Nakagawa R, Doi N, Nishioka Y, Nunami S, Yamauchi H, Fujita M, et al. Pleistocene human remains from Shiraho-Saonetabaru Cave on Ishigaki Island, Okinawa, Japan, and their radiocarbon dating. Anthropological Science. 2010;118(3):173–83.
- 40. Suzuki H. Discoveries of the Fossil Man from Okinawa Island. The Journal of Anthropological Society of Nippon. 1975;83:113–24.
- 41. Jinam TA, Kawai Y, Saitou N. Modern human dna analyses with special reference to the inner dual-structure model of yaponesian. Anthropological Science. 2021;129(1):3–11.
- 42. Lin YL, Chang PC, Hsu C, Hung MZ, Chien YH, Hwu WL, et al. Comparison of GATK and DeepVariant by trio sequencing. Sci Rep. 2022;12(1):1809. Epub 20220202. pmid:35110657; PubMed Central PMCID: PMC8810758.
- 43. Tian X, Browning BL, Browning SR. Estimating the Genome-wide Mutation Rate with Three-Way Identity by Descent. Am J Hum Genet. 2019;105(5):883–93. Epub 20191003. pmid:31587867; PubMed Central PMCID: PMC6848988.
- 44. MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, et al. A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes. Science. 2012;335(6070):823–8. pmid:22344438
- 45. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434–43. pmid:32461654
- 46. Edenberg HJ, McClintick JN. Alcohol Dehydrogenases, Aldehyde Dehydrogenases, and Alcohol Use Disorders: A Critical Review. Alcoholism, clinical and experimental research. 2018;42(12):2281–97. pmid:30320893
- 47. Cui R, Kamatani Y, Takahashi A, Usami M, Hosono N, Kawaguchi T, et al. Functional Variants in ADH1B and ALDH2 Coupled With Alcohol and Smoking Synergistically Enhance Esophageal Cancer Risk. Gastroenterology. 2009;137(5):1768–75. pmid:19698717
- 48. Matoba N, Akiyama M, Ishigaki K, Kanai M, Takahashi A, Momozawa Y, et al. GWAS of 165,084 Japanese individuals identified nine loci associated with dietary habits. Nature Human Behaviour. 2020;4(3):308–16. pmid:31959922
- 49. Oota H, Pakstis AJ, Bonne-Tamir B, Goldman D, Grigorenko E, Kajuna SLB, et al. The evolution and population genetics of the ALDH2 locus: random genetic drift, selection, and low levels of recombination. Annals of human genetics. 2004;68(Pt 2):93–109. pmid:15008789
- 50. Han Y, Gu S, Oota H, Osier MV, Pakstis AJ, Speed WC, et al. Evidence of positive selection on a class I ADH locus. American journal of human genetics. 2007;80(3):441–56. pmid:17273965
- 51. Luo H-R, Wu G-S, Pakstis AJ, Tong L, Oota H, Kidd KK, et al. Origin and dispersal of atypical aldehyde dehydrogenase ALDH2⁎487Lys. Gene. 2009;435(1):96–103. pmid:19393179
- 52. Fumagalli M, Moltke I, Grarup N, Racimo F, Bjerregaard P, Jørgensen ME, et al. Greenlandic Inuit show genetic signatures of diet and climate adaptation. Science (New York, NY). 2015;349(6254):1343–7. pmid:26383953
- 53. Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016;538(7624):201–6. pmid:27654912
- 54. Franke KR, Crowgey EL. Accelerating next generation sequencing data analysis: an evaluation of optimized best practices for Genome Analysis Toolkit algorithms. Genomics & informatics. 2020;18(1):e10–e. pmid:32224843
- 55. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60. pmid:19451168
- 56. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10(2):giab008–giab. pmid:33590861
- 57. Freed D, Aldana R, Weber JA, Edwards JS. The Sentieon Genomics Tools—A fast and accurate solution to variant calling from next-generation sequence data. bioRxiv. 2017:115717–.
- 58. Liu X, Li C, Mou C, Dong Y, Tu Y. dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Medicine. 2020;12(1):103–. pmid:33261662
- 59. Kawai Y, Mimori T, Kojima K, Nariai N, Danjoh I, Saito R. Japonica array: improved genotype imputation by designing a population-specific SNP array with 1070 Japanese individuals. Journal of Human Genetics. 2015;60(10):581–7. pmid:26108142
- 60. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen W-M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26(22):2867–73. pmid:20926424
- 61. Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee J. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4(1):7-. pmid:25722852
- 62. Kawaguchi S, Higasa K, Shimizu M, Yamada R, Matsuda F. HLA-HD: An accurate HLA typing algorithm for next-generation sequencing data. Human Mutation. 2017;38(7):788–97. pmid:28419628
- 63. Robinson J, Barker DJ, Georgiou X, Cooper MA, Flicek P, Marsh SGE. IPD-IMGT/HLA Database. Nucleic Acids Research. 2020;48(D1):D948–D55. pmid:31667505
- 64. Delaneau O, Zagury J-F, Robinson MR, Marchini JL, Dermitzakis ET. Accurate, scalable and integrative haplotype estimation. Nature communications. 2019;10(1):5436–. pmid:31780650
- 65. Zhou Y, Browning SR, Browning BL. A Fast and Simple Method for Detecting Identity-by-Descent Segments in Large-Scale Data. The American Journal of Human Genetics. 2020;106(4):426–37. pmid:32169169
- 66. Szpiech ZA, Hernandez RD. selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol Biol Evol. 2014;31(10):2824–7. Epub 20140710. pmid:25015648; PubMed Central PMCID: PMC4166924.