## Figures

## Abstract

The fixation index *F*_{st} plays a central role in ecological and evolutionary genetic studies. The estimators of Wright (), Weir and Cockerham (), and Hudson *et al*. () are widely used to measure genetic differences among different populations, but all have limitations. We propose a minimum variance estimator using and . We tested in simulations and applied it to 120 unrelated East African individuals from Ethiopia and 11 subpopulations in HapMap 3 with 464,642 SNPs. Our simulation study showed that has smaller bias than for small sample sizes and smaller bias than for large sample sizes. Also, has smaller variance than for small *F*_{st} values and smaller variance than for large *F*_{st} values. We demonstrated that approximately 30 subpopulations and 30 individuals per subpopulation are required in order to accurately estimate *F*_{st}.

**Citation: **Chen G, Yuan A, Shriner D, Tekola-Ayele F, Zhou J, Bentley AR, et al. (2015) An Improved *F _{st}* Estimator. PLoS ONE 10(8):
e0135368.
https://doi.org/10.1371/journal.pone.0135368

**Editor: **Francesc Calafell,
Universitat Pompeu Fabra, SPAIN

**Received: **November 12, 2014; **Accepted: **July 21, 2015; **Published: ** August 28, 2015

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication

**Data Availability: **The real dataset from the Wolaita were obtained in 2007 with written informed consent that limits data access only to the investigators. Therefore, public deposition of data would breach ethical compliance because participants did not give consent for deposition of their genotypes. Please contact Dr. Fasil Tekola Ayele (fasil.ayele2@nih.gov) for real data detail informaiton.

**Funding: **This research was supported by the Intramural Research Program of the Center for Research on Genomics and Global Health (CRGGH). The CRGGH is supported by the National Human Genome Research Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, the Center for Information Technology, and the Office of the Director at the National Institutes of Health (Z01HG200362). The data from the Wolaita came from a study supported by the Wellcome Trust (Grant #079791). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

The fixation index *F*_{st} is widely used as a measure of population differentiation due to genetic structure. Wright [1, 2] defined *F*_{st} as the ratio of the observed variance of allele frequencies between subpopulations to the expected variance of allele frequencies assuming panmixis. Wright’s estimator of *F*_{st} is biased, because *a priori* expected allele frequencies are unknown and the numerator and denominator terms in the equation are not independent. In practice, various frameworks have been proposed to improve estimation of *F*_{st}. Weir and Cockerham used an analysis of variance (ANOVA) approach to estimate within- and between-population variance components [3, 4]. Weir and Cockerham’s estimator is widely used because their estimator can describe the genetic population structure in a single summary statistic, is asymptotically unbiased with respect to sample size, and can compensate for overestimates particularly at low levels of genetic differentiation unlike Wright’s estimator [5]. However, it can be upwardly biased unless adjustment is done for intralocus sampling error, the number of subpopulations sampled, time of divergence, *etc*. [6]. In the present study, we propose a method that improves *F*_{st} estimation by combining Wright’s and Weir and Cockerham’s estimators to achieve a minimum variance estimate. For comparison, we also include Hudson et al.’s estimator [7], which recently has been recommended by Bhatia et al. [8]. We demonstrate application of our modified estimator in analysis of real data.

## Methods

For a diallelic marker, let *p* be the true minor allele frequency in the total population. Let the true subpopulation allele frequencies be *p*_{1}, …, *p*_{r} in *r* ≥ 2 subpopulations. Let *σ*^{2} be the true population variance in allele frequencies across subpopulations. Suppose the observed sample frequencies are and the sample sizes are *n*_{1}, …, *n*_{r}. Let and . Let *ϑ* be the difference in allele frequencies, such that for two subpopulations, .

Wright’s *F*_{st} [2] is defined as
and is estimated as
with
For the special case of two subpopulations, Rosenberg et al. [9] showed that by algebraic rearrangement
Thus, *F*_{st} is a function of the difference in allele frequencies and is proportional to *ϑ*^{2}.

Weir and Cockerham’s estimator [4], assuming a random union of gametes or equivalently no individual-level inbreeding, is based on and yielding

The definition of *F*_{st} of Hudson et al. [7] is
Given observed sample estimates , is a biased estimate of *H*_{W}, because
An unbiased estimate of *p*_{j}(1 − *p*_{j}) is thus given by . However, is an unbiased estimate of *H*_{B} if , i.e., under the null hypothesis. Therefore, we estimate *F*_{st} by
which is a ratio of unbiased estimates. This estimator generalizes Bhatia et al.’s [8] version of Hudson et al.’s [7] estimator for *r* > 2.

Note that under the null hypothesis of *p*_{1} = ⋯ = *p*_{r}, both and are asymptotically zero. Our goal is to construct an estimator based on a linear combination of and such that the new estimator has the smallest variance among all such linear combinations. Let and be the asymptotic variances of and , and *σ*_{12} be the asymptotic covariance. We propose the following weighted version of :
where *b* > 0 is a fixed number to be chosen later. We choose *a* = *a*_{0} such that is minimized:
(1)
It is seen that and hence is more precise in estimation. From the proof of the Proposition we see that Eq (1) is equivalent to,
(2)
which gives, with *b* = (*δ* − 1)/(*δ* + 1),
At the end of the proof of the following Proposition, we show that *δ* ≥ 1 with equality if and only if *n*_{1} = ⋯ = *n*_{r}. When *n*_{1} = ⋯ = *n*_{r}, we have and . Let denote convergence in distribution.

**Proposition**. *Assume that 0 < p*_{0} < 1 *and that the n _{j}’s are not all equal (so that δ > 1). If p_{1} = ⋯ = p_{r}, with*

*and*

*we have*

*where λ*

_{1},

*…,λ*Ω′

_{r}are the eigenvalues of^{1/2}

**B**Ω

^{1/2}, Ω = (

*ω*

_{ij})

_{r × r}

*with*

*ω*

_{ij}=

*p*

_{0}(1 −

*p*

_{0})

*if*

*i*=

*j*

*and*

*ω*

_{ij}= 0

*if*

*i*≠

*j*, Ω

^{−1/2}

*is the square root of*Ω: Ω = Ω′

^{1/2}Ω

^{1/2},

*and*,

**b**

_{j}= (−

*γ*

_{1}, …, −

*γ*

_{j−1}, (1 −

*γ*

_{j}), −

*γ*

_{j+1}, …, −

*γ*

_{r})′.

In the above Proposition, take *a* = *a*_{0}, then *a*_{0}+*δ*(1 − *a*_{0}) = 0 and *δ* (1 − *a*_{0}) = −*δ*/(*δ* − 1), and we get

**Corollary 1**. *Under conditions of the Proposition,*

**Simulations** Under the Balding-Nichols model [10], the allele frequency in each of *r* subpopulations conditional on *p* and *F*_{st} is a random deviate from the beta distribution *β* , which has mean *p* and variance *p*(1 − *p*)*F*_{st} = *σ*^{2}.

**Simulation 1**. This simulation was designed to estimate bias in the worst case scenario of two subpopulations. We evaluated the relationships between and *F*_{st} and between and . First, given the true average allele frequency *p* for *r* = 2, *F*_{st} reaches its maximum value for *p*_{j} values of 0 and 2*p*. The estimator [4] yields a constrained range for from 0 to 2*p*. Therefore, we first assigned the true value for *p* by drawing a random uniform deviate from the interval (0, 0.5) and the true value for *F*_{st} by independently drawing a random uniform deviate from the interval (0, 2*p*). Conditional on the true values of *p* and *F*_{st}, we randomly generated *p*_{j} from the beta distribution. We next assigned the number of individuals per subpopulation *n*_{j} = [5, 10, 20, 50, 100, 110]. We then randomly drew alleles from the binomial distribution *Bin*(2*n*_{j}, *p*_{j}). We generated 10,000 independent replicate data sets. Based on the above formulae, the four estimators , , , and were calculated. Linear regression models were used to evaluate the relationship between *F*_{st} and and between and . We assessed the fit in a linear regression model with the *F*-test, *r*^{2}, and the root mean squared error (RMSE), which is the square root of the sum of the variance and the squared bias.

**Simulation 2**. This simulation was designed to evaluate variance under sampling conditions approaching unbiasedness, i.e., large numbers of subpopulations and individuals per subpopulation. We evaluated the relationships between and the number of subpopulations (*r*) and between and the number of individuals per subpopulations (*n*_{j}). Conditional on the average allele frequency *p*, *F*_{st}, the number of subpopulations *r* = [5, 10, 20, 50, 100, 250], and the number of individuals per subpopulation *n*_{j} = [5, 10, 20, 50, 100, 250, 1000], we randomly generated *r* allele frequencies as in Simulation 1 and calculated , , , and .

### Application to data

We included genotype data from a total of 120 unrelated individuals from the Wolaita (WETH) ethnic group from southern Ethiopia who served as controls in a genome-wide association study of podoconiosis [11]. The Wolaita ethnic group speaks an Omotic language, and comparison with HapMap African populations has shown that it has the closest genetic similarity with the Maasai from Kenya and the lowest genetic similarity with the Yoruba in Nigeria [12]. Genotyping was performed by deCODE Genetics using the Illumina HumanHap 610 Bead Chip, which assays > 620,000 single-nucleotide polymorphisms (SNPs). Of the 551,840 autosomal SNPs in the raw genotype data, we excluded 39,249 SNPs that had a minor allele frequency of < 0.05, 378 that were missing in > 0.05 of individuals, and 321 that had a Hardy-Weinberg *p*-value < 0.001. The remaining 511,892 SNPs were merged with genotype data for ASW (*n* = 49), CEU (*n* = 112), CHB (*n* = 84), CHD (*n* = 85), GIH (*n* = 88), JPT (*n* = 86), LWK (*n* = 90), MKK (*n* = 143), MXL (*n* = 50), TSI (*n* = 88), and YRI (*n* = 113) in HapMap phase 3, release 2, which contained 1,440,616 SNPs. A total of 464,642 SNPs were common to both of WETH and HapMap data sets. , , and were calculated per marker.

## Results

Simulation 1: We first compared with the true *F*_{st} for the worst-case scenario of *r* = 2. For small sample sizes, was the least biased estimator, followed by , , and (Table 1). For large sample sizes, and were comparably good, and and were identically worse (Table 1). None of the four estimators was strongly sensitive to equal vs. unequal sample sizes (Fig 1). When was close to 0, yielded the most negative estimates, followed by and . As expected, all four estimators showed a quadratic relationship with (Fig 1). With respect to , by all four measures was the best estimator whereas was the worst estimator (Table 1).

The x-axis shows the difference of allele frequencies between two subpopulations (left plots) and (right plots); the y-axis shows values for Wright’s (top row), Weir and Cockerham’s (second row), the modified (third row), and Hudson et al.’s estimators (bottom row), and the legend indicates the sample sizes *n*_{1} (before hyphen) and *n*_{2} (after hyphen).

An assessment of bias by the total sample size (*n*_{1} and *n*_{2}) for *r* = 2 is presented in Fig 2. was biased and this bias was constant across total sample size, as expected given that this estimator does not account for *n*_{j}. In contrast, , , and were less biased as the total sample size increased. When the total sample size exceeded 30, was the least biased estimator; otherwise, was the least biased estimator. For *r* = 2, the magnitude of bias for all four estimators was constant when the total sample size was at least 60.

The x-axis shows the total sample size (*n*_{1} + *n*_{2}). The y-axis shows (red), (blue), (green), and (orange) for *r* = 2.

Simulation 2: Given *p* = 0.2, *F*_{st} = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], *n* = 1000 individuals, and *r* = 200 subpopulations, mean values are presented in Table 2. The means for , , , and equaled the expected values, consistent with all four estimators being asymptotically unbiased. First, we investigated the relationship between *F*_{st} and the variance of the four estimators. Given *p* = 0.2 and *F*_{st} < 0.5, had the smallest variance, followed by and (Fig 3). Given *p* = 0.2 and *F*_{st} > 0.5, had the smallest variance, followed by and . Similar results were obtained for *p* = 0.1, 0.3, 0.4, and 0.5 (S1 Table).

The x-axis shows the number of subpopulations. The y-axis shows the mean (left) and variance (right) of (red), (blue), (green), and (orange) values, given *F*_{st} = 0.5 and average allele frequency *p* = 0.2. The top plot represents 5 individuals per subpopulation and the bottom plot represents 1000 individuals per subpopulation.

Second, we investigated how the number of subpopulations and the number of individuals per subpopulation affected bias. When the number of subpopulations was approximately 30, no matter the number of individuals per subpopulation, bias was stable (Fig 3). For *r* > 30 and small *n*_{j}, all four estimators were biased, with the order of . For *r* > 30 and large *n*_{j}, all four estimators were unbiased. For *n*_{j} > 30, all four estimators were stable and bias decreased as *r* increased, with the best estimator and the worst estimator (Fig 4).

The x-axis shows the number of individuals per subpopulation. The y-axis shows the mean (left) and variance (right) of (red), (blue), (green), and (orange) values, given *F*_{st} = 0.5 and an average allele frequency *p* = 0.2. From top to bottom, the plots represent the number of subpopulations *r* = 10, 20, and 40, respectively.

Application to Data: The means and variances of values between the WETH and 11 samples in HapMap 3 are presented in Table 3. The WETH sample was closest to the MKK sample, consistent with shared Cushitic and Nilo-Saharan ancestry [13]. and yielded the same order for all pairs of relationships and all four estimators yielded the same order of relationships for the five HapMap samples closest to the WETH sample. The order of the means was < < < . was approximately 30% larger than and approximately 20% smaller than , which has corresponding effects on divergence time estimates. Given that and are less downward biased than for these sample sizes (Fig 2), the larger values are more likely to be correct.

## Discussion

*F*_{st} is directly related to the variance in allele frequencies among subpopulations. The dependence of *F*_{st} on allele frequencies and genetic diversity has been observed [14]. In our study, an approximately linear relationship between , , , and with the squared difference of allele frequencies () was observed, as expected. By simulation, we found that all four estimators were unbiased for large numbers of subpopulations and individuals per subpopulation but that no one estimator was uniformly better than the others. For *F*_{st} < 0.5, had smaller variances and MSE values. For *F*_{st} > 0.5, had smaller variances and MSE values. For *F*_{st} ≈ 0.5, , , and had similar variance and MSE values.

The numbers of individuals and markers have been reported to affect *F*_{st} estimation [5]. We found that the number of subpopulations was more important than the number of individuals per subpopulation. Estimation of *F*_{st}, both in terms of means and variances, stabilized with approximately 30 subpopulations, regardless of the number of individuals per subpopulation. This behavior occurs because there are *r* estimates of with which to estimate *p* and *σ*^{2}. Estimation was biased for *r* = 2 and improved as *r* increased, according to the Central Limit Theorem. Estimation was biased for *n*_{j} < 30 and improved as *n*_{j} increased (except for Wright’s estimator), also according to the Central Limit Theorem. Our proposed estimator is a minimum variance combination of Wright’s and Weir and Cockerham’s estimators and is less biased than Weir and Cockerham’s estimator for small samples sizes and less biased than Wright’s estimator for large sample sizes.

## Conclusion

A modified *F*_{st} estimator is proposed, which combines Wright’s and Weir and Cockerham’s estimators. It splits the difference in biases present in Wright’s and Weir and Cockerham’s estimators. We propose the routine use of this new and improved estimator of *F*_{st} as a way to reduce the biases and limitations of the classical estimators. We demonstrated that, in order to estimate *F*_{st} accurately, at least 30 subpopulations and 30 individuals per subpopulations are required.

## Appendix

### Proof of the Proposition

As , is asymptotically a chi-squared random variable, , , and Thus and

Let and be the asymptotic variance of , then the asymptotic variance of is , and the asymptotic covariance of is . Now we have

If *p*_{1} = ⋯ = *p*_{2}, . Note the ’s are independent, and
Let , **p**_{0} = (*p*_{0}, …, *p*_{0})’, *γ*_{j} = *n*_{j}/*n*, and **b**_{j} = (−*γ*_{1}, …, −*γ*_{j−1}, (1 − *γ*_{j}), −*γ*_{j+1}, …, −*γ*_{r})′, then for *j* = 1, …, *r*, and so by the Central Limit Theorem,
where , Ω = (*ω*_{ij})_{r×r} with *ω*_{ij} = *p*_{0}(1 − *p*_{0}) if *i* = *j* and *ω*_{ij} = 0 if *i* ≠ *j*.

Let Ω^{−1/2} be the square root of Ω: Ω = Ω′^{1/2}Ω^{1/2}, *λ*_{1}, …, *λ*_{r} be all the eigenvalues of Ω′^{1/2} **B**Ω^{1/2}, and Λ = *diag*(*λ*_{1}, …, *λ*_{r}), then there is an orthogonal normal matrix **Q** such that Ω′^{1/2} **B**Ω^{1/2} = **Q**′Λ**Q**, and so
where the ’s are independent chi-squared random variables with one degree of freedom. This gives the desired result.

Lastly, we prove
In fact,
It is known that for *r* = 1 or 2, with “=” if and only if *n*_{1} = ⋯ = *n*_{r}. Now we use induction to prove this is true for all integer *r*. In fact, suppose the above conclusion is true for some integer *r* > 2, then for integer *r* + 1,
and
Since by assumption ,
with “=” if and only if *n*_{1} = ⋯*n*_{r+1}, since , with “=” if and only if *n*_{j} = *n*_{r+1}.

This gives *δ* ≥ 1 with “=” if and only if *n*_{1} = ⋯ = *n*_{r}.

## Supporting Information

### S1 Table. Means, Variances, and MSEs of in simulation 2.

https://doi.org/10.1371/journal.pone.0135368.s001

(PDF)

## Acknowledgments

We thank Professor Gail Davey for allowing us to use the Wolaita GWAS dataset in this study.

## Author Contributions

Conceived and designed the experiments: GC AY DS CNR. Analyzed the data: GC AY DS FT JZ AB YZ CW AA. Contributed reagents/materials/analysis tools: MJN. Wrote the paper: GC AY DS FT AB AA CNR.

## References

- 1. Wright S. Genetical structure of populations. Nature 1950; 66(4215): 247–249.
- 2. Wright S. The genetical structure of populations. Ann Eugen 1951; 15(4): 323–354. pmid:24540312
- 3. Cockerham CC. Variance of Gene Frequencies. Evolution 1969; 23(1): 72–84.
- 4. Weir BS, Cockerham CC. Estimating F-Statistics for the Analysis of Population Structure. Evolution 1984; 38(6): 1358–1370.
- 5.
Willing E-M, Dreyer C, van Oosterhout C. Estimates of Genetic Differentiation Measured by
*F*_{st}Do Not Necessarily Require Large Sample Sizes When Using Many SNP Markers. PLOS ONE 2012; 7(8): e42649. pmid:22905157 - 6. Waples RS. Separating the wheat from the chaff: patterns of genetic differentiation in high gene flow species. J Hered 1998; 89(5): 438–450.
- 7. Hudson RR, Slatkin M, Maddison WP. Estimation of Levels of Gene Flow from DNA Sequence Data. Genetics 1992; 132(2): 583–589. pmid:1427045
- 8.
Bhatia G, Patterson N, Sankararaman S, Price AL. Estimating and interpreting
*F*_{ST}: The impact of rare variants. Genome Res 2013; 23: 1514–1521. pmid:23861382 - 9. Rosenberg NA, Li LM, Ward R, Pritchard JK. Informativeness of Genetic Markers for Inference of Ancestry. Am J Hum Genet 2003; 73(6): 1402–1422. pmid:14631557
- 10. Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for identity and paternity. Genetics 1995; 96: 3–11.
- 11. Tekola Ayele F, Adeyemo A, Finan C, Hailu E, Sinnott P, Burlinson ND, et al. HLA class II locus and susceptibility to podoconiosis. N Engl J Med 2012; 366(13): 1200–1208. pmid:22455414
- 12. Tekola-Ayele F, Adeyemo A, Aseffa A, Hailu E, Finan C, Davey G, et al. Clinical and pharmacogenomic implications of genetic variation in a Southern Ethiopian population. Pharmacogenomics J 2014; 15(1): 101–108. pmid:25069476
- 13. Shriner D, Tekola-Ayele F, Adeyemo A, Rotimi CN. Genome-wide genotype and sequence-based reconstruction of the 140,000 year history of modern human ancestry. Sci Rep 2014; 4: 6055 pmid:25116736
- 14.
Jakobsson M, Edge MD, Rosenberg NA. The Relationship between
*F*_{st}and the Frequency of the Most Frequent Allele. Genetics 2013; 193(2): 513–528.