The fixation index Fst plays a central role in ecological and evolutionary genetic studies. The estimators of Wright (), Weir and Cockerham (), and Hudson et al. () are widely used to measure genetic differences among different populations, but all have limitations. We propose a minimum variance estimator using and . We tested in simulations and applied it to 120 unrelated East African individuals from Ethiopia and 11 subpopulations in HapMap 3 with 464,642 SNPs. Our simulation study showed that has smaller bias than for small sample sizes and smaller bias than for large sample sizes. Also, has smaller variance than for small Fst values and smaller variance than for large Fst values. We demonstrated that approximately 30 subpopulations and 30 individuals per subpopulation are required in order to accurately estimate Fst.
Citation: Chen G, Yuan A, Shriner D, Tekola-Ayele F, Zhou J, Bentley AR, et al. (2015) An Improved Fst Estimator. PLoS ONE 10(8): e0135368. https://doi.org/10.1371/journal.pone.0135368
Editor: Francesc Calafell, Universitat Pompeu Fabra, SPAIN
Received: November 12, 2014; Accepted: July 21, 2015; Published: August 28, 2015
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication
Data Availability: The real dataset from the Wolaita were obtained in 2007 with written informed consent that limits data access only to the investigators. Therefore, public deposition of data would breach ethical compliance because participants did not give consent for deposition of their genotypes. Please contact Dr. Fasil Tekola Ayele (email@example.com) for real data detail informaiton.
Funding: This research was supported by the Intramural Research Program of the Center for Research on Genomics and Global Health (CRGGH). The CRGGH is supported by the National Human Genome Research Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, the Center for Information Technology, and the Office of the Director at the National Institutes of Health (Z01HG200362). The data from the Wolaita came from a study supported by the Wellcome Trust (Grant #079791). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
The fixation index Fst is widely used as a measure of population differentiation due to genetic structure. Wright [1, 2] defined Fst as the ratio of the observed variance of allele frequencies between subpopulations to the expected variance of allele frequencies assuming panmixis. Wright’s estimator of Fst is biased, because a priori expected allele frequencies are unknown and the numerator and denominator terms in the equation are not independent. In practice, various frameworks have been proposed to improve estimation of Fst. Weir and Cockerham used an analysis of variance (ANOVA) approach to estimate within- and between-population variance components [3, 4]. Weir and Cockerham’s estimator is widely used because their estimator can describe the genetic population structure in a single summary statistic, is asymptotically unbiased with respect to sample size, and can compensate for overestimates particularly at low levels of genetic differentiation unlike Wright’s estimator . However, it can be upwardly biased unless adjustment is done for intralocus sampling error, the number of subpopulations sampled, time of divergence, etc. . In the present study, we propose a method that improves Fst estimation by combining Wright’s and Weir and Cockerham’s estimators to achieve a minimum variance estimate. For comparison, we also include Hudson et al.’s estimator , which recently has been recommended by Bhatia et al. . We demonstrate application of our modified estimator in analysis of real data.
For a diallelic marker, let p be the true minor allele frequency in the total population. Let the true subpopulation allele frequencies be p1, …, pr in r ≥ 2 subpopulations. Let σ2 be the true population variance in allele frequencies across subpopulations. Suppose the observed sample frequencies are and the sample sizes are n1, …, nr. Let and . Let ϑ be the difference in allele frequencies, such that for two subpopulations, .
Wright’s Fst  is defined as and is estimated as with For the special case of two subpopulations, Rosenberg et al.  showed that by algebraic rearrangement Thus, Fst is a function of the difference in allele frequencies and is proportional to ϑ2.
Weir and Cockerham’s estimator , assuming a random union of gametes or equivalently no individual-level inbreeding, is based on and yielding
The definition of Fst of Hudson et al.  is Given observed sample estimates , is a biased estimate of HW, because An unbiased estimate of pj(1 − pj) is thus given by . However, is an unbiased estimate of HB if , i.e., under the null hypothesis. Therefore, we estimate Fst by which is a ratio of unbiased estimates. This estimator generalizes Bhatia et al.’s  version of Hudson et al.’s  estimator for r > 2.
Note that under the null hypothesis of p1 = ⋯ = pr, both and are asymptotically zero. Our goal is to construct an estimator based on a linear combination of and such that the new estimator has the smallest variance among all such linear combinations. Let and be the asymptotic variances of and , and σ12 be the asymptotic covariance. We propose the following weighted version of : where b > 0 is a fixed number to be chosen later. We choose a = a0 such that is minimized: (1) It is seen that and hence is more precise in estimation. From the proof of the Proposition we see that Eq (1) is equivalent to, (2) which gives, with b = (δ − 1)/(δ + 1), At the end of the proof of the following Proposition, we show that δ ≥ 1 with equality if and only if n1 = ⋯ = nr. When n1 = ⋯ = nr, we have and . Let denote convergence in distribution.
Proposition. Assume that 0 < p0 < 1 and that the nj’s are not all equal (so that δ > 1). If p1 = ⋯ = pr, with and we have where λ1, …,λr are the eigenvalues of Ω′1/2 BΩ1/2, Ω = (ωij)r × r with ωij = p0(1 − p0) if i = j and ωij = 0 if i ≠ j, Ω−1/2 is the square root of Ω: Ω = Ω′1/2Ω1/2, and , bj = (−γ1, …, −γj−1, (1 − γj), −γj+1, …, −γr)′.
In the above Proposition, take a = a0, then a0+δ(1 − a0) = 0 and δ (1 − a0) = −δ/(δ − 1), and we get
Corollary 1. Under conditions of the Proposition,
Simulations Under the Balding-Nichols model , the allele frequency in each of r subpopulations conditional on p and Fst is a random deviate from the beta distribution β , which has mean p and variance p(1 − p)Fst = σ2.
Simulation 1. This simulation was designed to estimate bias in the worst case scenario of two subpopulations. We evaluated the relationships between and Fst and between and . First, given the true average allele frequency p for r = 2, Fst reaches its maximum value for pj values of 0 and 2p. The estimator  yields a constrained range for from 0 to 2p. Therefore, we first assigned the true value for p by drawing a random uniform deviate from the interval (0, 0.5) and the true value for Fst by independently drawing a random uniform deviate from the interval (0, 2p). Conditional on the true values of p and Fst, we randomly generated pj from the beta distribution. We next assigned the number of individuals per subpopulation nj = [5, 10, 20, 50, 100, 110]. We then randomly drew alleles from the binomial distribution Bin(2nj, pj). We generated 10,000 independent replicate data sets. Based on the above formulae, the four estimators , , , and were calculated. Linear regression models were used to evaluate the relationship between Fst and and between and . We assessed the fit in a linear regression model with the F-test, r2, and the root mean squared error (RMSE), which is the square root of the sum of the variance and the squared bias.
Simulation 2. This simulation was designed to evaluate variance under sampling conditions approaching unbiasedness, i.e., large numbers of subpopulations and individuals per subpopulation. We evaluated the relationships between and the number of subpopulations (r) and between and the number of individuals per subpopulations (nj). Conditional on the average allele frequency p, Fst, the number of subpopulations r = [5, 10, 20, 50, 100, 250], and the number of individuals per subpopulation nj = [5, 10, 20, 50, 100, 250, 1000], we randomly generated r allele frequencies as in Simulation 1 and calculated , , , and .
Application to data
We included genotype data from a total of 120 unrelated individuals from the Wolaita (WETH) ethnic group from southern Ethiopia who served as controls in a genome-wide association study of podoconiosis . The Wolaita ethnic group speaks an Omotic language, and comparison with HapMap African populations has shown that it has the closest genetic similarity with the Maasai from Kenya and the lowest genetic similarity with the Yoruba in Nigeria . Genotyping was performed by deCODE Genetics using the Illumina HumanHap 610 Bead Chip, which assays > 620,000 single-nucleotide polymorphisms (SNPs). Of the 551,840 autosomal SNPs in the raw genotype data, we excluded 39,249 SNPs that had a minor allele frequency of < 0.05, 378 that were missing in > 0.05 of individuals, and 321 that had a Hardy-Weinberg p-value < 0.001. The remaining 511,892 SNPs were merged with genotype data for ASW (n = 49), CEU (n = 112), CHB (n = 84), CHD (n = 85), GIH (n = 88), JPT (n = 86), LWK (n = 90), MKK (n = 143), MXL (n = 50), TSI (n = 88), and YRI (n = 113) in HapMap phase 3, release 2, which contained 1,440,616 SNPs. A total of 464,642 SNPs were common to both of WETH and HapMap data sets. , , and were calculated per marker.
Simulation 1: We first compared with the true Fst for the worst-case scenario of r = 2. For small sample sizes, was the least biased estimator, followed by , , and (Table 1). For large sample sizes, and were comparably good, and and were identically worse (Table 1). None of the four estimators was strongly sensitive to equal vs. unequal sample sizes (Fig 1). When was close to 0, yielded the most negative estimates, followed by and . As expected, all four estimators showed a quadratic relationship with (Fig 1). With respect to , by all four measures was the best estimator whereas was the worst estimator (Table 1).
The x-axis shows the difference of allele frequencies between two subpopulations (left plots) and (right plots); the y-axis shows values for Wright’s (top row), Weir and Cockerham’s (second row), the modified (third row), and Hudson et al.’s estimators (bottom row), and the legend indicates the sample sizes n1 (before hyphen) and n2 (after hyphen).
An assessment of bias by the total sample size (n1 and n2) for r = 2 is presented in Fig 2. was biased and this bias was constant across total sample size, as expected given that this estimator does not account for nj. In contrast, , , and were less biased as the total sample size increased. When the total sample size exceeded 30, was the least biased estimator; otherwise, was the least biased estimator. For r = 2, the magnitude of bias for all four estimators was constant when the total sample size was at least 60.
The x-axis shows the total sample size (n1 + n2). The y-axis shows (red), (blue), (green), and (orange) for r = 2.
Simulation 2: Given p = 0.2, Fst = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], n = 1000 individuals, and r = 200 subpopulations, mean values are presented in Table 2. The means for , , , and equaled the expected values, consistent with all four estimators being asymptotically unbiased. First, we investigated the relationship between Fst and the variance of the four estimators. Given p = 0.2 and Fst < 0.5, had the smallest variance, followed by and (Fig 3). Given p = 0.2 and Fst > 0.5, had the smallest variance, followed by and . Similar results were obtained for p = 0.1, 0.3, 0.4, and 0.5 (S1 Table).
The x-axis shows the number of subpopulations. The y-axis shows the mean (left) and variance (right) of (red), (blue), (green), and (orange) values, given Fst = 0.5 and average allele frequency p = 0.2. The top plot represents 5 individuals per subpopulation and the bottom plot represents 1000 individuals per subpopulation.
Second, we investigated how the number of subpopulations and the number of individuals per subpopulation affected bias. When the number of subpopulations was approximately 30, no matter the number of individuals per subpopulation, bias was stable (Fig 3). For r > 30 and small nj, all four estimators were biased, with the order of . For r > 30 and large nj, all four estimators were unbiased. For nj > 30, all four estimators were stable and bias decreased as r increased, with the best estimator and the worst estimator (Fig 4).
The x-axis shows the number of individuals per subpopulation. The y-axis shows the mean (left) and variance (right) of (red), (blue), (green), and (orange) values, given Fst = 0.5 and an average allele frequency p = 0.2. From top to bottom, the plots represent the number of subpopulations r = 10, 20, and 40, respectively.
Application to Data: The means and variances of values between the WETH and 11 samples in HapMap 3 are presented in Table 3. The WETH sample was closest to the MKK sample, consistent with shared Cushitic and Nilo-Saharan ancestry . and yielded the same order for all pairs of relationships and all four estimators yielded the same order of relationships for the five HapMap samples closest to the WETH sample. The order of the means was < < < . was approximately 30% larger than and approximately 20% smaller than , which has corresponding effects on divergence time estimates. Given that and are less downward biased than for these sample sizes (Fig 2), the larger values are more likely to be correct.
Fst is directly related to the variance in allele frequencies among subpopulations. The dependence of Fst on allele frequencies and genetic diversity has been observed . In our study, an approximately linear relationship between , , , and with the squared difference of allele frequencies () was observed, as expected. By simulation, we found that all four estimators were unbiased for large numbers of subpopulations and individuals per subpopulation but that no one estimator was uniformly better than the others. For Fst < 0.5, had smaller variances and MSE values. For Fst > 0.5, had smaller variances and MSE values. For Fst ≈ 0.5, , , and had similar variance and MSE values.
The numbers of individuals and markers have been reported to affect Fst estimation . We found that the number of subpopulations was more important than the number of individuals per subpopulation. Estimation of Fst, both in terms of means and variances, stabilized with approximately 30 subpopulations, regardless of the number of individuals per subpopulation. This behavior occurs because there are r estimates of with which to estimate p and σ2. Estimation was biased for r = 2 and improved as r increased, according to the Central Limit Theorem. Estimation was biased for nj < 30 and improved as nj increased (except for Wright’s estimator), also according to the Central Limit Theorem. Our proposed estimator is a minimum variance combination of Wright’s and Weir and Cockerham’s estimators and is less biased than Weir and Cockerham’s estimator for small samples sizes and less biased than Wright’s estimator for large sample sizes.
A modified Fst estimator is proposed, which combines Wright’s and Weir and Cockerham’s estimators. It splits the difference in biases present in Wright’s and Weir and Cockerham’s estimators. We propose the routine use of this new and improved estimator of Fst as a way to reduce the biases and limitations of the classical estimators. We demonstrated that, in order to estimate Fst accurately, at least 30 subpopulations and 30 individuals per subpopulations are required.
Proof of the Proposition
If p1 = ⋯ = p2, . Note the ’s are independent, and Let , p0 = (p0, …, p0)’, γj = nj/n, and bj = (−γ1, …, −γj−1, (1 − γj), −γj+1, …, −γr)′, then for j = 1, …, r, and so by the Central Limit Theorem, where , Ω = (ωij)r×r with ωij = p0(1 − p0) if i = j and ωij = 0 if i ≠ j.
Let Ω−1/2 be the square root of Ω: Ω = Ω′1/2Ω1/2, λ1, …, λr be all the eigenvalues of Ω′1/2 BΩ1/2, and Λ = diag(λ1, …, λr), then there is an orthogonal normal matrix Q such that Ω′1/2 BΩ1/2 = Q′ΛQ, and so where the ’s are independent chi-squared random variables with one degree of freedom. This gives the desired result.
Lastly, we prove In fact, It is known that for r = 1 or 2, with “=” if and only if n1 = ⋯ = nr. Now we use induction to prove this is true for all integer r. In fact, suppose the above conclusion is true for some integer r > 2, then for integer r + 1, and Since by assumption , with “=” if and only if n1 = ⋯nr+1, since , with “=” if and only if nj = nr+1.
This gives δ ≥ 1 with “=” if and only if n1 = ⋯ = nr.
We thank Professor Gail Davey for allowing us to use the Wolaita GWAS dataset in this study.
Conceived and designed the experiments: GC AY DS CNR. Analyzed the data: GC AY DS FT JZ AB YZ CW AA. Contributed reagents/materials/analysis tools: MJN. Wrote the paper: GC AY DS FT AB AA CNR.
- 1. Wright S. Genetical structure of populations. Nature 1950; 66(4215): 247–249.
- 2. Wright S. The genetical structure of populations. Ann Eugen 1951; 15(4): 323–354. pmid:24540312
- 3. Cockerham CC. Variance of Gene Frequencies. Evolution 1969; 23(1): 72–84.
- 4. Weir BS, Cockerham CC. Estimating F-Statistics for the Analysis of Population Structure. Evolution 1984; 38(6): 1358–1370.
- 5. Willing E-M, Dreyer C, van Oosterhout C. Estimates of Genetic Differentiation Measured by Fst Do Not Necessarily Require Large Sample Sizes When Using Many SNP Markers. PLOS ONE 2012; 7(8): e42649. pmid:22905157
- 6. Waples RS. Separating the wheat from the chaff: patterns of genetic differentiation in high gene flow species. J Hered 1998; 89(5): 438–450.
- 7. Hudson RR, Slatkin M, Maddison WP. Estimation of Levels of Gene Flow from DNA Sequence Data. Genetics 1992; 132(2): 583–589. pmid:1427045
- 8. Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting FST: The impact of rare variants. Genome Res 2013; 23: 1514–1521. pmid:23861382
- 9. Rosenberg NA, Li LM, Ward R, Pritchard JK. Informativeness of Genetic Markers for Inference of Ancestry. Am J Hum Genet 2003; 73(6): 1402–1422. pmid:14631557
- 10. Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for identity and paternity. Genetics 1995; 96: 3–11.
- 11. Tekola Ayele F, Adeyemo A, Finan C, Hailu E, Sinnott P, Burlinson ND, et al. HLA class II locus and susceptibility to podoconiosis. N Engl J Med 2012; 366(13): 1200–1208. pmid:22455414
- 12. Tekola-Ayele F, Adeyemo A, Aseffa A, Hailu E, Finan C, Davey G, et al. Clinical and pharmacogenomic implications of genetic variation in a Southern Ethiopian population. Pharmacogenomics J 2014; 15(1): 101–108. pmid:25069476
- 13. Shriner D, Tekola-Ayele F, Adeyemo A, Rotimi CN. Genome-wide genotype and sequence-based reconstruction of the 140,000 year history of modern human ancestry. Sci Rep 2014; 4: 6055 pmid:25116736
- 14. Jakobsson M, Edge MD, Rosenberg NA. The Relationship between Fst and the Frequency of the Most Frequent Allele. Genetics 2013; 193(2): 513–528.