Genomic Scans of Zygotic Disequilibrium and Epistatic SNPs in HapMap Phase III Populations

Previous theory indicates that zygotic linkage disequilibrium (LD) is more informative than gametic or composite digenic LD in revealing natural population history. Further, the difference between the composite digenic and maximum zygotic LDs can be used to detect epistatic selection for fitness. Here we corroborate the theory by investigating genome-wide zygotic LDs in HapMap phase III human populations. Results show that non-Africa populations have much more significant zygotic LDs than do Africa populations. Africa populations (ASW, LWK, MKK, and YRI) possess more significant zygotic LDs for the double-homozygotes (DAABB) than any other significant zygotic LDs (DAABb, DAaBB, and DAaBb), while non-Africa populations generally have more significant DAaBb’s than any other significant zygotic LDs (DAABB, DAABb, and DAaBB). Average r-squares for any significant zygotic LDs increase generally in an order of populations YRI, MKK, CEU, CHB, LWK, JPT, CHD, TSI, GIH, ASW, and MEX. Average r-squares are greater for DAABB and DAaBb than for DAaBB and DAABb in each population. YRI and MKK can be separated from LWK and ASW in terms of the pattern of average r-squares. All population divergences in zygotic LDs can be interpreted with the model of Out of Africa for modern human origins. We have also detected 19735-95921 SNP pairs exhibiting strong signals of epistatic selection in different populations. Gene-gene interactions for some epistatic SNP pairs are evident from empirical findings, but many more epistatic SNP pairs await evidence. Common epistatic SNP pairs rarely exist among all populations, but exist in distinct regions (Africa, Europe, and East Asia), which helps to understand geographical genomic medicine.


Introduction
Epistasis between loci for fitness is one of genetic mechanisms for multilocus evolution in evolutionary biology [1,2,3]. For instance, epistatic selection enhances the formation of rugged fitness landscape and hence impedes the shifting of a local population from one fitness peak to another [4,5,6]. Epistasis also provides a genetic basis for developing therapeutic strategy in is used to infer Dobzhansky-Muller incompatibility model in a natural mouse hybrid zone [26]. Here, we search for epistatic SNPs by testing the difference between the composite digenic and maximum zygotic LDs in human populations [15,21].
The new approach differs from the existing methods in searching for epistasis in two aspects. One is that the trait refers to the general fitness, not a specific complex disease that may affect fitness. The epistatic selection for fitness naturally occurs in a population [15,21], emphasizing the functional interaction between loci. The existing methods for detecting epistasis often work on specific traits, such as the approach of genome-wide association studies (GWAS) for specific complex diseases [27]. The second aspect is that the new method is zygotic LD-based, and epistatic selection is detected on the basis of individual SNP pairs rather than multiple SNP pairs simultaneously. This minimizes the problem of small sample sizes (n) with a large number of SNPs (p), i.e. n<<p, which is frequently encountered in detecting multiple epistasis. The existing methods that deal with multiple SNPs under the situation of n<<p, often require expensive computations or the assumptions of prior parameter distributions [5,10,28,29]. Thus, the zygotic LD-based method provides an addition to the existing methods in searching for genome-wide epistasis arising from genotypic rather than allelic interactions.
Evidence supporting genotypic interactions is recorded in the literature although few studies link the genotypic interaction with zygotic LDs at the population level [21]. For instance, the double heterozygotes (one locus for insulin receptor and the other locus for insulin receptor substrate-1) produce significant insulin resistance in mice while any single-locus heterozygotes alone cannot [28,30]. Hoh and Ott [28] have reviewed a few examples of genotypic interactions with 100% penetrance, suggesting the presence of zygotic LD at the population level. Many complex diseases, fitness-related traits, might involve epistasis within genes or between genes, such as diabetes, cystic fibrosis, cardiovascular and neurodegenerative diseases [3,29,31,32,33]. It is of medical significance to map epistasis to specific regions at the DNA (epistatic SNPs), transcriptional RNA, and translational protein levels.
The purpose of this study is to (i) investigate the patterns of zygotic LDs in human populations that are focused on in the International HapMap Project, and (ii) search for genome-wide epistatic SNPs that are associated with fitness. These focal populations have distinct ancient originations and represent a broad genetic variation in H. sapiens. Through investigating the patterns of zygotic LDs, we corroborate the utility of zygotic LD in characterizing human population history [15]. By comparing the maximum zygotic LD with the composite digenic LD, we search for common epistatic SNPs in four Africa populations, three Asia populations, and two Europe populations. Because the method is not based on a specific disease trait but on the fitness only, the detected epistatic SNPs are likely associated with common diseases that have been clinically verified or unverified yet in different geographical regions. The findings are of potential significance for suggesting therapeutic strategies of common life-limiting disorders or for genomic medicine.
From the International HapMap Consortium, all online population SNPs have passed the quality control (QC) [34,35,36]. Here, we filter those SNPs (based on rs ID) that are not shared among the eleven populations for each chromosome. The numbers of common SNPs on each chromosome are shown in Table 1. In addition, according to SNP allele frequencies separately estimated in each population, we further filter all the SNPs with the minor allele frequencies (MAF) being smaller than 5% in each population.
To simplify zygotic LD analysis, SNP genotypes are re-coded by one diallelic locus (say, alleles A and a) in each population in the following way. First, we set the genotypes of the individuals with nucleotide bases A/C, A/G, and A/T as heterozygote Aa, the genotypes for the individuals with nucleotide bases G/G, C/C, and T/T as homozygote aa, and the genotypes of the individuals with nucleotide bases A/A as homozygote AA. Then, we set the genotypes of the individuals with nucleotide bases C/G and C/T as heterozygote Aa, the genotypes of the individuals with nucleotide bases G/G and T/T as homozygote aa, and the genotypes of the individuals with nucleotide bases C/C as homozygotes AA. Finally, we set the genotypes of the Table 1. Summary of the common SNPs among eleven populations, the mean total SNP pairs (±Sd) within 10Mb and their ranges among eleven populations, and the mean total SNP pairs (±Sd) each with at least one significant zygotic LD (p-value <10 −9 ) and their ranges among eleven populations.

Chr
Common individuals with nucleotide bases G/T as heterozygote Aa, the genotypes of the individuals with nucleotide bases G/G as homozygote AA, and the genotypes of the individuals with nucleotide bases T/T as homozygote aa. The information on individual nucleotide bases (A, T, G, C) at each locus is not recorded in analysis.

Statistical tests
Because genome-wide gametic LDs have already been extensively examined in the HapMap populations in separate studies [34,35,36], we do not further recalculate them. We concentrate on zygotic LD and its comparison with the composite digenic LD. Consider two diallelic loci A and B, with alleles A and a at locus A and B and b at locus B. Letp ijkl be the maximum likelihood estimate (MLE) of the frequency of genotype ijkl (i, j = A, a; k, l = B,b) in a population, andp ij (orp kl ) be the MLE of the frequency of genotype ij at locus A (or kl at locus B) in the population. MLEs of genotypic and allelic frequencies are obtained by the directly counting method from samples [25]. Let D ijkl be the zygotic LD between genotypes ij at locus A and kl at locus B for genotype ijkl, which is estimated bŷ where n ijkl , n ij , and n kl are the numbers of genotypes ijkl (i, j = A, a; k, l = B,b), ij (i, j = A, a), and kl (k, l = B,b) in the sample of size n. For two diallelic loci, there are nine zygotic LDs in total, but only four of them are random because of the following constraints: D AAkl +D Aakl +D aakl = 0, (k,l = B,b) and D ijBB +D ijBb +D ijbb = 0 (i,j = A, a). Here, we focus on four zygotic LDs, i.e. D AABB , D AABb , D AaBB , and D AaBb . The rest five zygotic LDs are constrained among the nine zygotic LDs. Note that there is one error in describing the total zygotic LDs (nine, not eight) and also an imprecise description of independence of the four zygotic LDs in [15,21]. The four zygotic LDs are correlated owing to their sharing of common alleles in the same samples, which can be seen from the covariances between distinct zygotic LDs, e.g., Eq (2) in [15] Hu and Yeh [21] have derived chi-square statistics to test the four zygotic LDs. Analogous to the commonly used r 2 for the normalized gametic LD (= D 2 AB =p ApapBpb = χ 2 AB /2n where χ 2 AB is the chi-square statistic with one degree of freedom andD AB is the gametic LD), we use the normalized zygotic LDs derived from chi-square statistics. From Hu and Yeh [21], a chi-square statistic with one degree of freedom for testing null hypothesis H 0 : D AjBl = 0 (j = A, a; l = B, b) is set as: The normalized zygotic LD is set as The r-square, r 2 AjBl , ranges from 0 to 1. The composite digenic LD, denoted by Δ AB , is 2p AABB +p AABb +p AaBB +p AaBb /2-2p A p B [25], which can be estimated in terms of zygotic LDs:D AB ¼D AABB þD AABb þD AaBB þD AaBb =2 [21]. Note that the preceding expression is alternative to Weir's formula derived as the sum of gametic LD (D AB ) and the disequilibrium (D A/B ) between two non-alleles from different gametes within individuals, i.e.D AB ¼D AB þD A=B [25]. Weir's formula explicitly indicates that the composite digenic LD remains a low-order LD but is always not smaller than the gametic LD.
From Hu and Yeh [21], a chi-square statistic with one degree of freedom for testing null hypothesis H 0 : Δ AB = 0 is set as The variance of composite digenic LD is given in Hu and Yeh [21], in whichD A ¼p AA Àp 2 A andD B ¼p BB Àp 2 B are the Hardy-Weinberg disequilibrium (HWD) at loci A and B, respectively.
As indicated in [21], Eq (5) is different from Weir and Cockerham's expression in that it is derived in terms of zygotic LDs (rather than gametic LD), HWD, and allele frequency [37]. The feature is that its calculation does not need to estimate gametic LD from diploid genotyping data under HWE or the random mating assumption. To test Δ AB = 0 using Eq (4),D AB is set as zero in VðD AB Þ. Zygotic LD and HWD are set as zero in VðD AB Þ if they are not significantly different from zero. Zygotic LD is tested using Eq (2), and HWD is tested using chisquare statistic [25]. Similarly, the normalized composite digenic LD is set as r 2 Δ = χ 2 Δ /n, which ranges from 0 to 1.
To test the difference between |Δ AB | and the maximum absolute zygotic LD, i.e. d ¼ jD AB j À maxðjD AABB j; jD AABb j; jD AaBB j; jD AaBb jÞ, from Hu and Yeh [21], a chi-square statistic with one degree of freedom for testing null hypothesis H 0 : d = 0 is set as where Vðd AjBlðÀÞ Þ or Vðd AjBlðþÞ Þ is detailed in Eq (24) of Hu and Yeh [21]. Note that the maximum zygotic LD may be a different one of the four zygotic LDs in different SNP pairs, and its sign may be positive or negative. The composite digenic LD may be positive or negative as well.
The subscript AjBl(-) in Eq (6a) refers to the case where both Δ AB and the maximum zygotic LD are positive or both are negative. The subscript AjBl(+) in Eq (6b) refers to the case where Δ AB and the maximum zygotic LD are different in sign. When d for a SNP pair is significantly greater than zero, no epistatic selection is indicated between the two SNPs [15,21]. The normalized r-square for d>0 is denoted as r 2 d>0 that equals χ 2 AjBl(-) /n or χ 2 AjBl(+) /n with d>0. When d for a SNP pair is significantly smaller than zero, strong epistatic selection is indicated between the two SNPs [15,21]. The normalized r--square for d<0 is set as r 2 d<0 that equals χ 2 AjBl(-) /n or χ 2 AjBl(+) /n with d<0. These r-squares range from 0 to 1.
Note that a caution is needed in applying the above methods to testing zygotic and composite digenic LDs, and their differences. Because the variances of the parameters are derived using Fisher's delta method [25] in constructing chi-square statistics or z-scores [21], a large sample size is required in order to obtain good approximation of variance. When MAF is too small or the sample size is too small, say n<30, unstable test results could be produced, which is similar to testing gametic LD using the chi-square statistic [25].
In our calculations, we confine our analysis to syntenic SNP pairs within 10Mb (~10 cM in human genomes) so as to make the data be manageable. Based on Weir and Cockerham's [37] suggestion, the steps for statistical tests used by Hu and Yeh [21] for each population are summarized as follows. First, test individual zygotic LDs for pairwise SNPs using Eq (2). Note that the SNP with MAF<5% is excluded. Then, test the composite digenic LD. Δ AB in VðD AB Þ is set as zero. The specific zygotic LD in VðD AB Þ is dropped if it is not significantly different from zero. HWD at each locus in VðD AB Þ is dropped if it is not significantly different from zero. The composite digenic LD for a SNP pair is tested only when at least one of the four zygotic LDs is significant, which otherwise the test is not conducted further for this SNP pair. Finally, test the difference between the composite digenic and maximum zygotic LDs. The d value in VðdÞ using Eq (6) is set as zero, and non-significant zygotic LDs in VðdÞ are dropped. Similarly, the test is conducted only when the SNP pair has at least one significant zygotic LD.
The above analysis is separately conducted on each chromosome in each population. To derive more conservative conclusions, a very stringent significant level for each test is set as p-value<10 −9 , smaller than the Bonferroni adjusted p-value (0.01/ the maximum number of SNP pairs). All these calculations and tests are conducted with a program in C used by Hu and Yeh [21].
To investigate the pattern of the zygotic and composite digenic LDs along chromosomes, we test Pearson's correlation between the significant r 2 AjBl (j = A, a; l = B, b) or r 2 Δ and the physical distance over all autosomes in each population. The correlations between the significant r 2 d>0 or r 2 d<0 and the physical distance are tested as well. All these tests are conducted in R. Hierarchical cluster analysis among the eleven populations is conducted using the joint information, including the total and the average r-squares of significant zygotic and composite digenic LDs, the average physical distances for significant r-squares, and Pearson's correlations between the significant r-square and the physical distance. Population distance matrix is constructed according to the algorithm of Canberra distance between any two vectors [38]. R-function, pvclust (), is applied for plotting [39].

Total zygotic LDs
The number of common SNPs among the eleven populations generally decreases from Chromosome (Chr) 1 to 22 (Table 1). Chr 2 has the most common SNPs (86440) while Chr 22 has the fewest common SNPs (14481). The total SNP pairs within 10Mb (with MAF!5% at each SNP locus) display the pattern similar to that of the number of common SNPs, with the longer chromosome harbouring more SNP pairs (Table 1). From the statistical tests of zygotic LDs (H 0 : D AjBl = 0; j = A, a; l = B, b), the total SNP pairs with at least one significant zygotic LD generally decrease from Chr 1 to 22. Chr 2 has the most significant zygotic LDs (324909-883086 in different populations). On average, about 47.4% (±6.8%) SNP pairs each with MAF !5% within 10Mb per chromosome per population (a range of 23.1~59.3%) have at least one significant zygotic LD, indicating that extensive zygotic LDs exist in human populations.
The total significant zygotic LDs over all autosomes are unequal among populations or among four zygotic LDs in each population (Fig 1a). Population CEU has the most significant zygotic LDs for each of the four two-locus genotypes (D AjBl 6 ¼0, j = A, a; l = B,b), followed by populations CHB, JPT, CHD, TSI, GIH, MEX, and the four Africa populations (MKK, YRI, ASW, and LWK). With a reference to the specific zygotic LD, populations CEU, CHB, CHD, JPT, GIH, and TSI have more significant zygotic LDs for the double heterozygotes (D AaBb ) than for other three zygotic LDs (D AABB , D AABb , and D AaBB ); while populations ASW, LWK, MEX, MKK, and YRI have more significant zygotic LDs for the double homozygotes (D AABB ) than for other three zygotic LDs (D AABb , D AaBB , and D AaBb ). There are fewer significant zygotic LDs for the homozygote-heterozygote genotypes (D AABb or D AaBB ) than for other two zygotic LDs (D AABB and D AaBb ) in each population. Population CEU has the most significant composite digenic LDs (Δ AB 6 ¼0) over all autosomes, followed by populations CHB, JPT, CHD, TSI, GIH, MKK, YRI, and MEX. Two Africa populations, LWK and ASW, have a relatively fewer significant composite digenic LDs (Fig  1a). Like the patterns of D AABB and D AaBb , population MEX has the fewest SNP pairs with significant composite digenic LDs among the seven non-Africa populations.
A significant correlation exists between the total significant zygotic LDs and the number of common SNPs, or between the total significant composite digenic LDs and the number of common SNPs across all autosomes (Pearson's correlations = 0.99, p-value<10 −16 ; Fig 1b). However, Chrs 6, 11, and 20 have relatively more significant zygotic LDs or more significant composite digenic LDs; while Chrs 3, 4, and 19 have a fewer significant zygotic LDs or a fewer significant composite digenic LDs. A common feature is that each autosome has more significant D AABB 's or D AaBb 's or Δ AB 's than significant D AABb 's or D AaBB 's.
Concerning the difference between the composite digenic and maximum zygotic LDs ðdÞ, about 1.80% (±0.9%) of significant Δ AB 's, ranging from 0.93% to 3.23%, are significantly greater than the maximum zygotic LDs (d>0), indicating strong signals of linear additive effects on fitness. About 1.21% (±0.2%) of significant Δ AB 's, ranging from 0.96% to 1.51%, are significantly smaller than the maximum zygotic LDs (d<0), indicating strong signals of epistatic selection for fitness. A majority of the composite digenic LDs are comparable to the maximum zygotic LDs in magnitude, indicating weak signals of epistatic selection [15], [21]. Populations CEU, MKK, and YRI have more SNP pairs displaying significant d(>0) than do the rest populations (Fig 1a), indicating that the three populations have relative more SNPs pairs strongly exhibiting non epistatic effects on population fitness than do the rest populations (LWK, ASW, CHD, CHB, GIH, JPT, MEX and TSI). Population CEU has the most SNP pairs with the d values being significantly smaller than zero (d<0), while two Africa populations (ASW and LWK) have the fewest SNP pairs with the d values being significantly smaller than zero (d<0; Fig 1a).
Pattern of the total SNP pairs with d >0 over all populations is similar to that of the total composite digenic LD (Δ AB ) across all autosomes (Fig 1b). Each chromosome has more SNP pairs with d >0 than those with d <0 except Chr 5. The main reason for more SNP pairs with d<0 on Chr 5 is because population MEX has more SNP pairs with d<0 on Chr 5 than do other populations (details not shown here).

R-squares
The distribution patterns of significant zygotic LDs over all autosomes are highly heterogeneous among populations (Mann-Whitney test for all pairwise populations, p-value<10 −16 ). One common feature is a generally consistent order of distributions among eleven populations for the normalized zygotic LDs or the normalized composite digenic LDs (Fig 2a, 2b, 2c, 2d, 2e, 2f and 2g for the density distribution and Table 2 for means and standard deviations). The density distributions are bimodal, with one peak close to lower r-squares and another peak at high r-squares (= 1.0). All average r-squares (r 2 AjBl (j = A, a; l = B, b); r 2 Δ , r 2 d<0 , and r 2 d>0 ) increase generally in the order of YRI, MKK, CEU, CHB, LWK, JPT, CHD,TSI, GIH, ASW, and MEX. Although CEU has the most significant zygotic LDs (over all autosomes), it has relatively lower zygotic LDs. Although LWK and ASW have a relatively fewer significant zygotic LDs, their rsquares are not small (Table 2). Population MEX does not have the most significant zygotic LDs, but has high zygotic LDs, indicating that the number of significant zygotic LDs does not reflect their strengths.
The average r-squares for significant zygotic LDs of the double heterozygotes (r 2 AaBb ) or the double homozygotes (r 2 AABB ) over all autosomes are greater than the average r-squares for significant zygotic LDs of the homozygote-heterozygote genotypes (r 2 AABb and r 2 AaBB ). Density distributions of r 2 AABb and r 2 AaBB are essentially identical (Fig 2b and 2c). The average rsquares for significant composite digenic LDs (r 2 Δ ) are large in each population, similar to the average r 2 AaBb 's, but r 2 Δ 's have different density distributions from those of r 2 AaBb 's, with large densities close to 1 (Table 2; Fig 2e).  The average r-squares are generally greater for significant d>0 than for significant d<0 in addition to their distinct density distributions in each population ( Table 2; Fig 2f and 2g).
Concerning with the physical distances between SNP pairs with at least one significant zygotic LD (Table 2), variations exist among populations or among the four zygotic LDs (Table 3). Most standard deviations are greater than means, with the coefficient of variation (CV) being greater than 1, which indicates quite variable distributions in physical distance. The common pattern is that the SNP pairs with significant r 2 AABB generally occur within longer physical distances (0.75~2.67Mb on average) than do the SNPs pars with significant r 2 AABb and r 2 AaBB (< 0.55 Mb on average), which in turn occurs within longer distances than the SNP pairs with significant r 2 AaBb and r 2 Δ (< 0.2 Mb on average). R-squares for significant d >0 occur within relatively longer distances (0.6~3.9 Mb on average) than those for significant d<0 (< 2 Mb on average).
Populations vary in average physical distance among the four zygotic LDs. For the double homozygotes, the four Africa populations (YRI, MKK, LWK, and ASW) generally have significant r 2 AABB within relatively longer distances (1.16~2.66 Mb), but population CEU has significant r 2 AABB within relatively shorter distances (0.75 Mb). The rest populations generally have Table 2. Means and standard deviations over all autosomes for the r-squares of the significant zygotic LDs, the r-squares of the significant composite digenic LD, and the r-squares of the significant differences between the composite digenic and maximum zygotic LDs in each population (p-value <10 −9 ).  (Table 3). For the double heterozygotes, the four Africa populations have significant r 2 AaBb generally within shorter distances (0.053~0.08 Mb), but population CEU has significant r 2 AaBb within relatively longer distances (<0.1 Mb). For the homozygote-heterozygote genotypes, the four Africa populations generally have significant r 2 AABb or r 2 AaBB within relatively longer distances (0.14~0.53 Mb), while population CEU has significant r 2 AABb or r 2 AaBB within~0.12 Mb on average. The rest populations have significant r 2 AABb or r 2 AaBB within the intermediate distances (~0.14 Mb on average). The physical distance for significant r 2 Δ is the longest on average in population ASW (~0.16 Mb), followed by MEX (~0.12 Mb), CEU and MKK (~0.1Mb), and the rest populations (<0.1Mb). This order is distinct from the population order for any significant zygotic LDs. The average physical distances are longer for significant r 2 Δ than for significant r 2 AaBb , but are shorter than for the r-squares of any other significant zygotic LDs (r 2 AABB , r 2 AaBB , and r 2 AABb ) in each population ( Table 3).
The physical distances are quite variable for significant r 2 d>0 and r 2 d<0 , with CV>1.0 in all populations except ASW (CV = 0.79) and MEX (CV = 0.97) for r 2 d>0 ( Table 3). The average distances for significant r 2 d>0 are almost twice as long as those for significant r 2 d<0 in each population. Populations ASW, LWK, and MEX have significant r 2 d>0 within longer distances (>3 Mb), while populations CEU and YRI have significant r 2 d>0 within relatively shorter distances (~0.  Table 3).

Patterns of r-squares along chromosomes
Pearson's correlation tests indicate that r-squares of the four zygotic LDs are significantly negatively correlated with the physical distance ( Table 4). The r-squares of the double homozygote LDs, r 2 AABB 's, are more strongly correlated with the physical distance (Mb) than do the rsquares of other three zygotic LDs in each population except ASW that exhibits a relatively weaker correlation. Although there are significantly different distributions among populations in r 2 AaBb , comparable correlations among them exist between r 2 AaBb and the physical distance (Pearson's correlation~-0.1).
R-squares for the composite digenic LDs (r 2 Δ ) are also significantly negatively correlated with the physical distance (all p-values<10 −9 ), but their strengths are not as strong as the corrections between r 2 AjBl (j = A, a; l = B, b) and the physical distance in each population (Table 4). For instance, although the significant r 2 Δ has the most SNP pairs (7130869) among the significant r 2 AjBl , r 2 d<0 , and r 2 d>0 in CEU, most r 2 Δ 's are distributed within short distances (Fig 3), producing a relatively weaker correlation between r 2 Δ and the physical distance (Pearson correlation = -0.0630, p-value<2.2×10 −16 ).
Unlike the patterns of r 2 AjBl (j = A, a; l = B, b) and r 2 Δ , both r 2 d<0 and r 2 d>0 are significantly positively correlated with the physical distance except population YRI ( Table 4), indicating that SNPs on distant positions trend to have both additive and epistatic effects on fitness. Populations ASW and LWK exhibit relatively stronger correlations between r 2 d<0 and the physical distance than do populations MKK and YRI.
Taken together all above information for cluster analysis (Fig 4), populations ASW and LWK (a probability of 91% in a group) can be separated from populations MKK and YRI (a probability of 99% in a group), with a probability of 63%. Four populations (CHD, JPT, GIH, and TSI) join together, with a probability of more than 90%. Population MEX can be considered as a separate group although it has a probability of 78% to join the group of CHD, JPT, GIH, and TSI. Populations CEU and CHB have a probability of 84% in a group, and join the group of all other non-Africa populations with a probability of more than 50%. Populations MKK and YRI have a probability of more than 60% to join the group of non-Africa populations, representing a "transitional" status from Africa populations ASW and LWK to non-Africa populations.

Common epistatic SNPs
Initial screening indicates that no common SNP pairs of significant r 2 d<0 are present among eleven populations. Most epistatic SNPs are distinct among populations or population specific. To capture the common SNPs among regions, we concentrate on four Africa populations (ASW, LWK, MKK, and YRI), three Asia populations (CHB, CHD, and JPT), and two Europe populations (CEU and TSI).
In the four Africa populations, a few chromosomes have common SNP pairs of significant r 2 d<0 , with two pairs on Chr 2, seventeen on Chr7, and two pairs on Chr 19 that show a strong signal of epistatic selection for population fitness (S1 Table). These significant r 2 d<0 's range from 0.1642±0.0530 to 0.5356±0.4119, and the d values range from -0.0666±0.0203 to -0.0472 ±0.0156. One immediate feature is that these syntenic SNP pairs occur within 0.5 Mb. Gene annotations indicate 2 SNP pairs each with unknown function information, 9 SNP pairs where one SNP is annotated but the other is not annotated yet, 5 SNP pairs each within the same annotated genes, and 5 SNP pairs in three distinct gene pairs on Chr 7. The annotated genes involving the epistatic SNP pairs are VWC2L, von Willebrand factor C domain containing protein 2-like; PTPRN2, protein tyrosine phosphatase, receptor type, N polypeptide 2; ESYT2, extended synaptotagmin-like protein 2; CPAMD8, C3 and PZP-like, alpha-2-macroglobulin domain containing 8; and WDR60, WD repeat domain 60.
In the two Europe populations (CEU and TSI), there are 2521 common SNP pairs of significant r 2 d<0 's in total on all autosomes (S3 Table; Fig 5). Chr 3 has the most common SNP pairs (226) while Chr 21 has the fewest common SNP pairs (11). The physical distances of these epistatic SNPs are within 0.5 Mb, ranging from 16bp between rs500608 and rs3819114 on Chr 11 (d = -0.0625±0.0354; r 2 d<0 = 0.4274±0.1897) to 338385bp between rs2522505 and rs17035289 on Chr2 (d = -0.0475±0.0283; r 2 d<0 = 0.3433±0.0040). Compared with the results in the three Asia populations or in the four Africa populations, the r 2 d<0 values of the common SNP pairs between populations CEU and TSI are higher, on average, with the mean ranging from 0.2768 ±0.0973 to 1.0±0.0. The d values range from -0.1086±0.0074 to -0.0131±0.0045.
Gene annotations indicate there are 751 SNP pairs (out of 2521) where each pair comes from the same annotated gene, 271 SNP pairs where each pair comes from two differently annotated genes, 909 SNP pairs where gene annotations are unknown, and 590 SNP pairs where one SNP comes from one annotated gene and the other from an unannotated gene in each pair. The feature is that these interactions occur between neighbor genes or regions (SNP pairs within 0.5 Mb). Some interactions have been evident in clinical experiments or via GWAS studies, while many more gene-gene interactions are not empirically corroborated yet. For instance, the interaction between BCL2L15 (rs7529353) and PHTF1 (rs1936398) on Chr1 (d = -0.0428±0.0032; r 2 d<0 = 0.6324±0.5199) is reported to have protein-protein interactions coded by these two genes (http://www.bioinfo.mochsl.org.br/miriad/gene/BCL2L15/). Fig 6  lists 19 networks of the interactions among three or more genes on individual chromosomes. For instance, the interactions among genes MAGI3 (rs1230658), PHTF1(rs1936398) and PTPN22 (rs1217410) on Chr 1 are evident, which is associated with Type 1 diabetes [41,42]. Part of ZRANB3 gene (213 bp; rs6742030) on Chr 2 is antisense to spliced RAB3GAP1 gene (rs7422031; d = -0.0483±0.0075; r 2 d<0 = 0.6047±0.5590) according to the information from http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/. Many newly detected gene-gene interactions await experimental evidence (S3 Table).

Discussion
In this study, we investigate the patterns of genome-wide zygotic LDs in eleven human populations and test the differences between the composite digenic and maximum zygotic LDs to search for epistatic SNPs. The results confirm the previous theory that patterns of zygotic LDs are more informative than that of the composite digenic/gametic LD in revealing natural population history [15,21]. The eleven populations can be separated into a few groups in terms of the patterns of both zygotic and composite digenic LDs (Fig 4). Compared with the analysis of genome-wide gametic LD [22], zygotic LD analysis provides additional insights into the history of human populations. The four Africa populations (YRI, LWK, ASW, and MKK) have a fewer significant zygotic LDs than do the non-Africa populations in total. Africa populations contain more significant zygotic LDs for the double-homozygotes (D AABB ) than for any other three zygotic LDs (D AABb , D AaBB , and D AaBb ). The non-Africa populations, especially CEU, have more significant zygotic LDs, and contain more significant zygotic LDs for the double-heterozygotes (D AaBb ) than for any other three zygotyic LDs (D AABB , D AABb , and D AaBB ). Non-Africa populations also have many significant zygotic LDs for the double-homozygotes (D AABB ), which is fewer than the number of significant D AaBb but much more than the number of significant D AABb or D AaBB . In addition, the four Africa populations have a fewer significant composite digenic LDs than do the non-Africa populations. All average r-squares (r 2 AjBl (j = A, a; l = B, b), r 2 Δ , r 2 d<0 , and r 2 d>0 ) increase in a general order of populations YRI, MKK, CEU, CHB, LWK, JPT, CHD,TSI, GIH, ASW, and MEX. Average r-squares are greater for D AABB and D AaBb than for D AaBB and D AABb in each population. Africa populations YRI and MKK can be separated from the other two Africa populations (LWK and ASW) in terms of the patterns of average r-squares. Patterns of r-squares with physical distances also vary among different zygotic LDs and among populations. All these divergences are essentially associated with their distinct population history.
The historical migration and origin of modern human are very complex from previous studies using different approaches (e.g., human anatomical structure and genetic approaches) and different types of molecular markers (e.g., mitochondrial DNA and Y-chromosome) [43,44]. The current non-Africa populations are hypothesized to come out of Africa [45,46] or arise from multiple regions [47,48]. Under the Out of Africa model, the genetic variations in non-Africa populations reflect the historical divergence when tracing back to the precursor samples of original Africa ancestors. Because of the longer history, more extensive recombination could occur in situ modern Africa populations derived from ancient Africa populations, which results in lower LDs between SNPs in current Africa populations than in non-Africa populations [22,49]. The results of zygotic LDs reflect this pattern. More significant zygotic LDs for both the double-homozygotes (D AABB ) and double-heterozygotes (D AaBb ) than for the homozygote-heterozygotes (D AABb and D AaBB ) are mainly due to the joint effects of selection and recombination (the maximum physical distance assayed here is 10 Mb). More significant zygotic LDs for the double-homozygotes occurring within a relative long distances on average (>1Mb, Table 3) suggest the main effects from genetic drift for the samples from original Africa ancestors. Genetic drift reduces the whole genome heterozygosity. Long physical distances enhance the occurrence of recombination events and hence indirectly amplify genetic drift effects [15], leading to over proportions of the double homozygotes. The more significant zygotic LDs for the double-heterozygotes occurring within shorter physical distances (<0.1Mb), on average, suggest that genotypic interactions from heterozygotes remain locally maintained on chromosomes. This could arise from the low recombination events within short physical distances, producing relatively smaller effects from the genetic drift. The original heterozygosities from ancestors could be relatively lightly eroded. Nature selection that enhances heterozygote interactions cannot be excluded. This also implies that ancestors harboured a pretty high level of genetic diversity due to the historical accumulation of mutation effects. The significant zygotic LDs for the heterozygote-homozygote represent an intermediate case. A certain level of recombination together with genetic drift effects can produce a fewer significant D AABb 's or D AaBB 's.
The relatively fewer significant zygotic LDs in the group of ASW and LWK than in the group of YRI and MKK could reflect the difference in their genetic drift effects if all of them had approximately experienced the same length of evolutionary time. This is probably related to the divergence of different lines of descents in Africa, as implied from mitochondrial DNA markers (see review by Goldstein and Chikhi [44]). Our results imply that the effective population sizes (N e ) could be smaller in populations ASW and LWK than in populations YRI and MKK.
Non-Africa populations CEU, CHB, CHD, JPT, and TSI exhibit the different patterns of zygotic LDs from those of the four Africa populations. The more total significant zygotic LDs are also consistent with their relatively younger populations that are reported to maintain higher gametic LDs [22,44]. There are much more significant D AABB and D AaBb than significant D AaBB and D AABb , reflecting the similar patterns to those in Africa populations. However, the pattern of more significant D AaBb than significant D AABB contrasts to the results in Africa populations. This is because D AABB has greater fluctuations than D AaBb when the genetic drift effects are large. These young populations, if derived from ancient Africa populations, could have small effective population sizes and larger genetic drift effects, under the Out of Africa model [45,46] rather than the Multiregional model [47,48]. Furthermore, genetic drift effects enhance the reduction in the difference between D AABB and D AaBb in magnitude. These joint effects produce more significant D AaBb than significant D AABB [15]. The differences among these populations mainly reflect historically different extents of their genetic drift effects.
It is of interest to discuss that the young CEU population has the most significant zygotic LDs (D AABB , D AaBB , D AABb , and D AaBb ) among all eleven populations. One possible reason is that the genetic drift effects in CEU are smaller than other non-Africa populations, as implied from Tenesa et al. [49]. Tenesa and coworkers show that the effective population size is slightly larger in population CEU (N e = 2772±598) than in populations JPT (2517±525) and CHB (2620±557) [49]. As a consequence, the zygotic LDs in population CEU have relatively smaller r-squares, closer to the patterns in Africa populations, and generally occur within shorter physical distances for D AABB , D AABb , and D AaBB but within longer physical distances for D AaBb in non-Africa populations.
Population MEX differs from other non-Africa populations in that it has more significant D AABB 's than significant D AaBb 's. It also differs from the Africa populations in that it has more significant zygotic LDs. One possible reason is that population MEX comes from an admixture of complex groups with potentially distinct ancestors [50]. This can be implied from the cluster analysis where population MEX has a probability of 78% to join two groups (CHD and JPT, GIH and TSI). Such case is analogous to the consequence of genomic admixture via gene flow, which maintains more extents of D AABB than D AaBb [15]. When the effects of gene flow (population admixture) are greater than or comparable to the genetic drift effects, it is expected the occurrence of more significant D AABB 's than significant D AaBb 's.
Our results provide an opportunity to evaluate a sole use of the composite digenic LD in revealing population history. In theory, the composite digenic LD is expected to have the pattern similar to that of gametic LD although it is always not smaller than the gametic LD in magnitude [25]. This trend becomes clearer when gametic LD (D AB ) dominantly contributes to Δ AB in most SNP pairs. It essentially reflects the information of low-order LD rather than higher-order LD [21,51]. Our results confirm that the composite digenic LD generally does not have a stronger "spatial' pattern along chromosomes than do any zygotic LDs (Table 4). Zygotic LDs decrease faster than the composite digenic LD with the physical distance, similar to the pattern of gametic LD [15]. It successfully indicates low LDs in populations YRI and MKK, but high LDs in populations ASW and LWK. This suggests that different ancestry populations might be historically involved in deriving the group of YRI and MKK and the group of ASW and LWK. The composite digenic LDs seem also to support that the admixture of distinct ancestry populations could be the main process for relatively higher LDs in population MEX, different from other non-Africa populations that have relatively lower LDs.
Apart from the utility of zygotic LDs for revealing population history, we search for genome-wide epistatic SNPs for fitness using zygotic LDs, which is now becoming an important issue in personal genetics [52]. This approach differs from the GWAS-based or other methods in that it does not require specific phenotypic traits, such as many complex diseases. Instead, the method works on a general fitness trait that is associated with different kinds of diseases directly or indirectly affecting human survivals. We have detected 19735-95921 SNP pairs that exhibit strong signals of epistatic selection (d<0, p-value<10 −9 ). This accounts for a very small proportion of the total SNP pairs with significant zygotic LDs or significant composite digenic LDs. Nevertheless, variations among populations are substantial. The two Africa populations (ASW and LWK) have a few epistatic SNP pairs while CEU has the most expistatic SNP pairs. No common epistatic SNPs among all eleven populations suggest that these epistatic SNPs are mainly locally specific.
The feature of genome-wide epistatic SNPs is unknown since only a few epistatic SNPs in complex diseases are recorded in the literature [10,27,52]. Most reports refer to the major effects of individual SNPs in different disease traits via GWAS analysis. One common feature from our results is that genome-wide epistatic SNPs pairs generally tend to positively correlate with the physical distance, contrasting to the patterns of zygotic LDs in human populations. Further, epistatic SNPs generally occur within the physical distances that are, on average, longer than those for the composite digenic LDs but shorter than those for D AABB . Average rsquares for epistatic SNPs, r 2 d<0 , have similar patterns to other r-squares of zygotic LDs among populations ( Table 2). This pattern helps to gain insights into the characteristics of genomewide epistatic SNPs for fitness.
Another feature of genome-wide epistatic SNPs from our results is that many intron SNPs are involved in epistatic selection. The interacted intron SNPs are located within genes or between distinct genes (S1, S2 and S3 Tables). Also, gene annotations for many intron SNPs are unknown based on the latest online NCBI dbSNP in human genomes. This phenomenon is likely common in functional annotations in human or other organisms' genomes [53], [54,55]. More evidence that intron variants are associated with complex diseases would be rapidly accumulated in the literature with GWAS studies [56]. How these individual intron variants affect fitness remains complicated, such as altering transcriptions of disease-related genes (altering regulatory elements) [55] or changing the protein level of gene expressions. Elucidation of the molecular mechanisms and genetic pathways of these intron SNPs in forming epistasis remains an important challenge [3].
We have detected common epistatic SNP pairs in three regions (Africa, Europe, and East Asia) although no epistatic SNP pairs are shared among all eleven populations. This also suggests few common medications regarding specific SNP epistasis but the necessity of geographical medication [44]. Owing to the different numbers of populations among three regions investigated, the numbers of common epistatic SNPs are different (S1, S2 and S3 Tables). Gene annotations of all epistatic SNP pairs, including SNP reference sequence IDs and positions on individual chromosomes, are checked from current online NCBI dbSNP. The unannotated SNPs in S1, S2 and S3 Tables could be annotated with the accumulation of future functional gene mapping. The feature for regionally common epistatic SNPs is that they occur within less than 1Mb, including SNPs within genes and between neighbour genes. Some epistatic SNPs are evident from clinical trials or the experiment reports in the literature, but many more epistatic SNP pairs are not verified or newly discovered. These newly found epistases await empirical evidence. All these epistatic SNP pairs are of potential importance for understanding common diseases-common variants (CD-CV) in different regions and for aiding in geographical genomic medicine [52].
Supporting Information S1 Table. Gene annotations of the common SNP pairs among four Africa populations (ASW, LWK, MKK, and YRI), with the composite digenic LD being significantly smaller than the maximum zygotic LD (p-value<10 −9 ). (XLSX) S2 Table. Gene annotations of the common SNP pairs among three Asia populations (CHB, CHD, and JPT), with the composite digenic LD being significantly smaller than the maximum zygotic LD (p-value<10 −9 ). (XLSX) S3 Table. Gene annotations of the common SNP pairs between two Europe populations (CEU and TSI), with the composite digenic LD being significantly smaller than the maximum zygotic LD (p-value<10 −9 ). (XLSX)