Evolutionary Analysis of Classical HLA Class I and II Genes Suggests That Recent Positive Selection Acted on DPB1*04∶01 in Japanese Population

The human leukocyte antigen (HLA) genes exhibit the highest degree of polymorphism in the human genome. This high degree of variation at classical HLA class I and class II loci has been maintained by balancing selection for a long evolutionary time. However, little is known about recent positive selection acting on specific HLA alleles in a local population. To detect the signature of recent positive selection, we genotyped six HLA loci, HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQB1, and HLA-DPB1 in 418 Japanese subjects, and then assessed the haplotype homozygosity (HH) of each HLA allele. There were 120 HLA alleles across the six loci. Among the 80 HLA alleles with frequencies of more than 1%, DPB1*04∶01, which had a frequency of 6.1%, showed exceptionally high HH (0.53). This finding raises the possibility that recent positive selection has acted on DPB1*04∶01. The DPB1*04∶01 allele, which was present in the most common 6-locus HLA haplotype (4.4%), A*33∶03-C*14∶03-B*44∶03-DRB1*13∶02-DQB1*06∶04-DPB1*04∶01, seems to have flowed from the Korean peninsula to the Japanese archipelago in the Yayoi period. A stochastic simulation approach indicated that the strong linkage disequilibrium between DQB1*06∶04 and DPB1*04∶01 observed in Japanese cannot be explained without positive selection favoring DPB1*04∶01. The selection coefficient of DPB1*04∶01 was estimated as 0.041 (95% credible interval 0.021–0.077). Our results suggest that DPB1*04∶01 has recently undergone strong positive selection in Japanese population.


Introduction
The crucial immunological function of human leukocyte antigen (HLA) molecules is to present pathogen-derived antigenic peptides to T lymphocytes [1]. The HLA proteins are encoded by genes in the major histocompatibility complex region, which spans approximately 4 megabases (Mb) on the short arm of chromosome 6 (6p21.3) and includes the most polymorphic loci in the human genome [2]. A remarkable feature of the classical HLA class I and class II genes is the high degree of polymorphism. More than 1,750 HLA-A, 2,330 HLA-B, 1,300 HLA-C, 1,060 HLA-DRB1, 160 HLA-DQB1, and 150 HLA-DPB1 alleles have been reported (IMGT/HLA database; http:// www.ebi.ac.uk/imgt/hla/).
Positive selection has been shown as a driving force for the high degree of polymorphism at HLA loci [3,4]. The HLA genes show three remarkable signatures of positive selection: (1) the rate of nonsynonymous (amino acid altering) nucleotide substitution is substantially higher than that of synonymous substitution at antigen-recognition sites [5,6], (2) there are trans-species polymorphisms (i.e., similar alleles are present in multiple species) [7], and (3) there is a significant excess of heterozygosity [8,9]. Balancing selection, including overdominant selection and fre-quency-dependent selection, can easily account for these observations [3,4].
A number of studies have reported common long-range HLA haplotypes [10][11][12][13][14][15][16]. The extended length of common haplotype is a key feature of recent positive selection [17,18]. The HLA alleles on long-range haplotypes may have been subject to recent positive selection. In this study, to identify the signature of recent positive selection that has acted on specific HLA alleles in a local (i.e., geographically restricted) population, we investigated the allele frequencies and haplotype frequencies at HLA-A, HLA-C, HLA-B, HLA-DRB1, HLA-DQB1, and HLA-DPB1 in 418 Japanese individuals. Our theoretical and computer simulation analyses suggested that DPB1*04:01 has recently undergone strong positive selection in Japanese population.
Of the six HLA loci examined, the HLA-B locus showed the highest heterozygosity (0.937), and HLA-DPB1 showed the lowest (0.765) (Tables 1 and 2). None of the HLA class I or II loci  [19,20] of HLA allele frequencies in this study population revealed that the observed distributions of allele frequencies at HLA-C (P = 0.003), HLA-B (P = 0.002), HLA-DRB1 (P = 0.013), and HLA-DQB1 (P = 0.001) differed significantly (i.e., there was excess heterozygosity) from the distributions expected based on the assumption of neutrality, whereas there was no significant difference between the expected and observed distributions of allele frequencies at HLA-A or HLA-DPB1 (Tables 1 and  2).

Pairwise LD between HLA Alleles
The pairwise linkage disequilibrium (LD) parameters, r 2 and |D'| [21], for each possible pair of two HLA alleles were estimated ( Figure 1 and Data S1). Most alleles at HLA-A were not in strong LD with any of the alleles at the other loci because the physical distance from HLA-A to each of the other loci is large. To evaluate the relative strength of LD between two HLA loci, 2-locus r 2 and 2locus |D'| (see Materials and Methods for details), were calculated based on the pairwise LD parameters for all the allelic pairs (Table  S1). The values of 2-locus |D'| for HLA-C and HLA-B (|D'| = 0.91) and for HLA-DRB1 and HLA-DQB1 (|D'| = 0.80) were high, whereas the lowest 2-locus |D'| value was observed for HLA-A and HLA-DPB1 (|D'| = 0.25). These values reflected the physical distances between the respective loci. The values of 2locus |D'| for HLA-DRB1 and HLA-DPB1 and for HLA-DQB1 and HLA-DPB1 were relatively low compared to the values for the other pairs ( Figure 2). These low values probably result from the recombination hotspot in the HLA class II region [22][23][24].

Haplotype Omozygosity
Strong positive selection leads to a rapid increase in the frequency of a selected (target) allele in a population. The number of recombination events between the target allele and the surrounding polymorphic sites is limited while the advantageous allele increases in frequency; therefore, the diversity of haplotypes carrying the advantageous allele becomes low. Accordingly, strong LD is expected in the genomic region bearing the selected allele. In this study, the degree of LD for each HLA allele was measured by haplotype homozygosity (HH); this term is defined as the probability that any two randomly chosen samples of haplotype bearing a focal HLA allele have the same 6-locus HLA haplotype. Like EHH [17], a high HH value can be regarded as a signature of recent positive selection acting on a focal HLA allele.
Origin of DPB1*04:01 in Japanese DPB1*04:01 is common (.30%) in European populations [9,28], whereas the frequency of DPB1*04:01 is 6.1% in Japanese ( Table 2). Given the worldwide distribution of DPB1*04:01, it is unlikely that DPB1*04:01 originated in Japan. DPB1*04:01 seems to have entered Japan. Archaeological studies of Japanese history have suggested that the Yayoi people came from the Korean peninsula circa 300 B.C., and mixed with the indigenous Jomon people. A recent large-scale survey of single nucleotide polymorphisms (SNPs) on autosomal chromosomes [29] revealed that most people presently inhabiting mainland Japan are genetically closer to Koreans than to Ryukuans. Ryukuans are considered to be more pure descendants of the Jomon people than are mainland Japanese. These observations indicate that a large population of Yayoi people migrated from the Korean peninsula.  [28,30,31]. These and similar haplotypes have not been reported in other Asian populations (http://www.allelefrequencies.net) [28].
If the A*33:03-C*14:03-B*44:03-DRB1*13:02-DQB1*06:04-DPB1*04:01 haplotype has a single origin, the current genetic diversity of this haplotype must be low. To assess the genetic diversity of this haplotype, we performed a sliding window analysis of individual heterozygosity, defined as a proportion of heterozygous SNPs to all SNPs in the window ( Figure 5). Reduced individual heterozygosity was only found in the HLA region on the short arm of chromosome 6 in all the three subjects that were   Figure 5A); in contrast, such a reduction was not observed in two subjects that were heterozygous for this haplotype ( Figure 5B). Furthermore, three subjects that were homozygous for the A*33:03-C*14:03-B*44:03-DRB1*13:02-DQB1*06:04-DPB1*04:01 haplotype shared the same SNP haplotype that spanned more than 4 Mb in the HLA region ( Figure 5A). These observations suggest that the A*33:03-C*14:03-B*44:03-DRB1*13:02-DQB1*06:04-DPB1*04:01 haplotype in Japanese has a single origin, and has not been generated repeatedly by recombination.

Computer Simulation
The analysis of EHH revealed that the reduction in EHH for DPB1*04:01 resulted from recombination between DQB1*06:04 and DPB1*04:01 that inhabited the A*33:03-C*14:03-B*44:03-DRB1*13:02-DQB1*06:04-DPB1*04:01 haplotype ( Figure 4). Therefore, the relationship between DQB1*06:04 and DPB1*04:01 was focused in the following analyses. The high HH and EHH values of DPB1*04:01 (Figures 3 and 4) may merely reflect that a neutral random genetic drift, rather than a recent positive selection, occurred after the Yayoi people reached the Japanese archipelago (300 B.C. or 2300 years ago). To assess this possibility, we conducted a computer simulation assuming a twolocus two-allele model in which changes in the frequency of four haplotypes carrying DPB1*04:01 or non-DPB1*04:01 alleles at the HLA-DPB1 locus and DQB1*06:04 or non-DQB1:06:04 alleles at the HLA-DQB1 locus were evaluated. In the simulation, the values of three parameters: selection intensity, s, recombination rate, c, and frequency of DQB1*06:04-DPB1*04:01 haplotype, f 1 (0), in the beginning of the Yayoi period were drawn by a random number generator in every run. Haplotype frequencies were subject to change based on a stochastic model of positive selection, recombination, and random genetic drift. Dominant selection was assumed for DPB1*04:01, and, for the sake of simplicity, no selection (i.e., selectively neutral) was assumed for all alleles at the DQB1 locus. The rejection method [18,32,33] was applied to accept only simulation runs that gave results similar to the observed values (see Materials and Methods for details). The uniform distribution was used for each parameter as a prior distribution (see Materials and Methods for detail). Figure 6A shows 2,500 parameter sets (i.e., posterior distributions) that were accepted in these simulations. The posterior distribution of the initial frequency of DQB1*06:04-DPB1*04:01 haplotype was similar to the prior one, whereas the posterior distributions of selection intensity and recombination rate were different from the prior ones. In the posterior distribution, s ranged from 0.009 to 0.098, and the mean and 95% credible interval of s were 0.041 and 0.02120.077, respectively ( Figure 6B). It should be noted that neutral random genetic drift (i.e., s<0) did not yield the results similar to the observed values. The findings from the simulations indicated that DPB1*04:01 has been subject to relatively strong positive selection in Japanese since the Yayoi period.

Discussion
A number of HLA alleles have been shown to be associated with variations in immune responses to infectious diseases (e.g., human immunodeficiency virus [HIV]/AIDS, malaria, tuberculosis, hepatitis, leprosy, leishmaniasis, and schistosomiasis) caused by pathogenic microorganisms (see review by Blackwell et al. [34]). The most plausible explanation for positive selection favoring DPB1*04:01 would be its function in resistance to infections. A recent genome-wide association study showed that the DPA1*01:03-DPB1*04:01 haplotype confers protection against hepatitis B virus (HBV) infection (OR = 0.57, 95% CI = 0.33-0.96) [35]. Hepatitis B is a deadly infectious disease. Acute hepatitis B, which can cause fatal complications such as fulminant hepatitis, occurs in a percentage of the people infected with HBV. Although the estimated selection coefficient of s (0.0254-0.0550) for Here, the analysis of HH was used to detect a signature of recent positive selection. The advantage of using HH in the analysis of HLA genes is that alleles with similar frequencies not only at the same HLA locus, but also at different loci, can be compared. This feature of analyses based on HH allows us to compare HLA alleles even within the same long-range haplotype. Since the same polymorphic markers are used for all HLA alleles in the calculation of HH, the effect of recombination on the value of HH can be well controlled. However, the HH analysis has a disadvantage in that the empirical distribution of HH value has to be obtained from only those alleles that are in the targeted region. Therefore, unlike conventional long-range haplotype tests based on EHH values [17,36], the statistical test based on HH values cannot be performed using genome-wide data. Nevertheless, HH-based test is thought to be suitable for analysis of HLA genes because each locus has a number of alleles to be examined and strong LD exists between alleles even at distant loci. The use of HH in the analysis of various human populations would help us to detect other HLA  alleles that have been subject to geographically-restricted positive selection and to understand the role of HLA genes in the adaptation of human population to local environments over evolutionary time.
To estimate the selection coefficient of DPB1*04:01, we used a simple two-locus two-allele genetic model that was based on two assumptions, directional selection at DPB1 and selective neutrality at HLA-DQB1. The problem associated with the use of this model was that the Ewens-Watterson test revealed that the allele frequency distribution at HLA-DQB1 in this study population deviated significantly from that expected under neutrality (Table 2); therefore, the assumption of selective neutrality at HLA-DQB1 may not be valid. If balancing selection is operating at HLA-DQB1, the allele frequency of DQB1*06:04 is maintained at a certain frequency, and the change in the allele frequency of DPB1*04:01 must be influenced by this selection at HLA-DQB1, although the effect of balancing selection at HLA-DQB1 on the estimation of s is considered to be much smaller than that of directional selection favoring DPB1*04:01.
In this study, six HLA loci were investigated in 418 Japanese subjects. Of HLA alleles with high population frequencies, DPB1*04:01, which was present in the most common 6-locus HLA haplotype spanning more than 4 Mb, showed exceptionally high HH. A computer simulation estimated the selection coefficient of DPB1*04:01 as 0.041. Taken together with high HH value of DPB1*04:01, we conclude that DPB1*04:01 has recently undergone strong positive selection in Japanese population.

Subjects
All 418 individuals investigated in this study were unrelated Japanese adults living in Tokyo or neighboring areas. The genomic DNAs were extracted from peripheral blood samples using a commercial kit (QIAamp Blood Kit [Qiagen, Hilden, Germany]). All blood and DNA samples were de-identified. Verbal informed consent was obtained from all the participants before 1990. In this study, written informed consent was not obtained because the blood sampling was conducted before the ''Ethical Guidelines for Human Genome and Genetic Sequencing Research'' were established in Japan. Under the condition that DNA sample is permanently de-linked from the individual, this study was approved by the Research Ethics Committee of the Faculty of Medicine, University of Tokyo.

Statistical Analysis
Deviation from HWE for each HLA locus was tested using an exact test available in a web-based software, Genepop 4.0.10 [38]. Using Arlequin version 3.5 [39], the Ewens-Watterson test [40], which is based on Ewens sampling theory of neutral alleles [19], was performed to assess whether the observed distribution of allele frequencies at each HLA locus was different from an expectation that was based on neutrality.
To evaluate the degree of LD between HLA alleles, values of r 2 and D' [21] for all pairwise combinations of HLA alleles were calculated based on the haplotype frequencies estimated using the expectation maximization algorithm [20]. Here, each HLA allele was regarded as a single nucleotide polymorphism (SNP). For example, the A*01:01 allele and the other alleles at the HLA-A locus were designated as ''A'' and ''G'', respectively. Accordingly, the algorithm for the estimation of haplotype frequencies for two loci, each with two alleles, could be applied to the HLA loci with multiple alleles for the purposes of these pairwise comparisons.
The LD parameter, 2-locus |D'|, between any two HLA loci (locus 1 and locus 2) was calculated based on the pairwise LD parameter, D' ij , between ith allele at locus 1 and jth allele at locus 2 as follows: 2-locus DD 0 D~P m i~1 P n j~1 p i q j DD 0 ij D, where p i and q j represent the frequencies of ith allele at locus 1 with m different alleles and jth allele at locus 2 with n different alleles. Spearman's rank correlation coefficient between 2-locus |D'| and the physical distance was calculated. Assuming a model: 2-locus |D'| = 1{0:67|10 {5 |x À Á a , the curve fitting model parameter, a, was estimated using the least squares method; this method minimizes the sum-of-squared residual between an observed value and a fitted value that was determine by a model. In the above equation, the physical distance (Kb) between two loci is denoted by x and the recombination intensity in the HLA region was set at 0.65 cM/Mb [27,41]. The phased haplotypes consisting of two or more HLA loci were estimated using the PHASE program version 2.1 [25,26]. The estimated 6-locus haplotypes were further used for the calculation of extended haplotype homozygosity (EHH) [17] and of haplotype homozygosity (HH). In this study, HH of each HLA allele was defined as the probability that any two randomly chosen samples of haplotype bearing the HLA allele have the same 6-locus HLA haplotype.
A sliding window analysis of individual heterozygosity, which was defined as the proportion of heterozygous SNPs to SNPs genotyped in a single subject, was conducted to examine whether the A*33:03-C*14:03-B*44:03-DRB1*13:02-DQB1*06:04-DPB1*04:01 haplotype had a single origin in Japan. 19,949 SNPs located on 6p were genotyped, and the average SNP density was 0.34 SNP/kb. The window and step sizes were 1 Mb and 200 kb, respectively. This analysis was performed using the SNP data from the five subject included in the SNP typing: three subjects were homozygous for the
Next, to evaluate the similarity between simulated and observed frequencies, was calculated. As the simulated haplotype frequencies, f 1 (t), f 2 (t), f 3 (t), and f 4 (t), approaches values close to the observed frequencies, f 1 , f 2 , f 3 , and f 4 , the value of e approaches 0. The rejection method [18,32,33] was used to accept only simulation runs that resulted in (i) e of less than 0.01, (ii) f 1 (t) of not less than f 1 20.01 nor more than f 1 +0.01, and (iii) t of not less than 92 nor more than 115 generations. A total of 2,500 runs were accepted. The mean and 95% credible interval of s were obtained from the 2,500 accepted runs.

Supporting Information
Data S1 Pairwise LD measures for individual HLA allele pairs. (XLSX)