Estimating FST and kinship for arbitrary population structures

FST and kinship are key parameters often estimated in modern population genetics studies in order to quantitatively characterize structure and relatedness. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently-used estimators of FST and kinship are method-of-moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the independent subpopulations model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure likely do not hold in many populations, including humans. In this work, we analyze the behavior of these estimators in the presence of arbitrarily-complex population structures, which results in an improved estimation framework specifically designed for arbitrary population structures. After generalizing the definition of FST to arbitrary population structures and establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of existing FST and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally-assumed models of structure. We then present our new approach, which consistently estimates kinship and FST when the minimum kinship value in the dataset is estimated consistently. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the independent subpopulations model. Our simulations reveal the potential for severe biases in estimates of existing approaches that are overcome by our new framework. This work may significantly improve future analyses that rely on accurate kinship and FST estimates.

from which the result follows using the continuous mapping theorem [1,2]. The proof forÂ m follows, which applies analogously toB m . Our a i are independent but not identically distributed, since they depend on p T i that varies per locus, so the standard law of large numbers does not apply toÂ m . We show almost sure convergence using Kolmogorov's criterion for the Strong Law of Large Numbers [3], which is satisfied for bounded Var(a i ). Since |a i | ≤ C < ∞ for all i and some C (see main text), then E[a 2 i ] ≤ C 2 , so Var(a i ) ≤ C 2 . Therefore,Â m a.s.

A.2 Order of error of expectations
The error of the ratio of expectations from the expectation of the ratio is given by and expanding the covariance [4]. Previous work on ratio estimators [4,5] assumes IID a i and b i , which does not hold for SNP loci. Assuming independent loci (Cov(a i , b j ) = 0 for i = j) and large m soB m ≈ Bc is practically independent of any given a i and b j , then Since a i , b i are bounded, | Cov(a i , b i )| ≤ C 2 for the same C of the previous section, so for some large enough m and C. Hence m = O 1 m as is for standard ratio estimators [5].

B The Weir-Goudet F ST estimator for subpopulations
Here we show that the relative F ST estimator presented in [6] (denoted byβ WT in that work) for biallelic loci equals the HudsonK F ST estimator in the main text. This estimator-the special case for biallelic loci only-is implemented in the function snpgdsFst with method option set to W&H02

S2
in the R package SNPRelate by the authors of [6]; despite this method name, this estimator does not equal the Weir-Hill estimator in [7], but rather the estimator described below (verified both numerically and by looking at the source code). Note that Weir-Cockerham (W&C84) is the only other F ST estimator implemented in that package as of this writing (other F ST estimators proposed in [6] are not implemented). The earliest presentation of this exact estimator that we are aware of is [8], which came out the same year that we first described HudsonK [9]. The Weir-Goudet relative F ST estimator for subpopulations is given bŷ andp iju is the sample allele frequency at locus i for subpopulation j and allele u. As in the existing F ST methods section, here n is the number of subpopulations and n j is the number of individuals in subpopulation j.
In order to establish the connection between this and other F ST estimators, we first generalize some of our earlier notation to be for more than two alleles. We can calculate the allele frequency variance term per allele u, orσ wherep T iu = 1 n n j=1p iju is the sample allele frequency estimate for each allele u weighing subpopulations equally. Expanding the square and rearranging, we find that the sum of square subpopulation sample allele frequencies is given by Going back to the WG estimator for subpopulations in Eq. (S1), its parts can be restated as

S3
Similarly, Now, taking the special case of biallelic loci (p T iu =p T i for the first allele u, andp T iu = 1 −p T i for the second allele; note thatσ 2 iu =σ 2 i for both alleles; also note repeated use of the identity Thus, the numerator and denominator of Eq. (S1), respectively, are also equal to Thus, the above numerator and denominator are, respectively, twice the values of the HudsonK estimator in the main text, a common factor of two that cancels out. Therefore, in this special case where every locus is biallelic, the WG F ST estimator for subpopulations equals the HudsonK estimator.

S4
C Derivation of existing method-of-moment estimators

C.1 F ST estimator for independent subpopulations
Assuming the coancestry model in Eq. (6) in the main text for independent subpopulations (θ T jk = 0 for j = k), the first and second moments of the IAFs are: θ T jj appears by averaging Eq. (S3) over j: Since Eq. (S2) has the same value for every j, and Eq. (S4) as well for every j = k, we average these to reduce estimation variance. The results are in terms ofp T i = 1 n n j=1 π ij : F ST also appears in Eq. (S7) because j = k terms are introduced in the double sum. Subtracting Eq. (S5) and Eq. (S7) in turn from Eq. (S6) results in: To reduce variance further, we average across loci, giving T and solving for F ST in this system of equations results in the following F ST estimator: This estimator is simplified noting that 1 n n j=1 π 2 ij appears in the IAF sample variance, so substituting it into Eq. (S8) recovers Eq. (13) in the main text as desired:

C.2 Standard kinship estimator
Here we assume the kinship model in Eq. (5) in the main text. Since the first moment is the same for all individuals j, we average these genotypes to reduce variance, which results in the following estimator of p T i : Each ϕ T jk appears once per (j, k) pair in Eq. (5), recast here in terms of the sample covariance: Variance in the kinship estimate is reduced by averaging across loci, yielding: Pluggingp T i into Eq. (S9) and solving for ϕ T jk recovers Eq. (18) in the main text as desired:

D Proofs that F ST and kinship estimator limits are constants with respect to the ancestral population T
In our work we calculate the limits of several estimators, which are given in terms of an arbitrary ancestral population T (not necessarily the MRCA, unless otherwise noted). The apparent paradox that the limit of an estimator would vary depending on the choice of T is resolved since these limits are in fact constant with respect to T . All proofs depend on the following IBD identities for change of ancestral population [10]: where A, B are two possible ancestral populations for the individuals j, k, and A is ancestral to B.

D.1 Proof that the limit ofF indep ST
does not depend on T Here we study the limit ofF indep ST in Eq. (14) in the main text. Let S be a reference population ancestral to the individuals in question and T be another population ancestral to S. Denote the key parameters relative to S by F S ST ,θ S and relative to T by F T ST ,θ T . The equations that relate both quantities satisfy our IBD shift identity (which follows by averaging Eq. (S10) over individuals for F ST or pairs of individuals forθ T ): Solving for the values relative to S gives The desired equality of the limit for both S and T follows: D.2 Proof that the limit ofφ T,std jk does not depend on T Here we study the limit of the standard kinship estimatorφ T,std jk in Eq. (19) in the main text. Let S be a reference population ancestral to the individuals in question and T be another population S7 ancestral to S. The equations that relate the terms relative to S and those relative to T follow from Eq. (S10) just as in the previous subsection: The desired result follows:

E Mean coancestry bounds
Here we prove that, for any weights such that w j > 0, n j=1 w j = 1, and for uniform weights 1 , k), and θ T = 1 n F ST for the independent subpopulations model. The Cauchy-Schwarz inequality for covariances implies θ T jk ≤ θ T jj θ T kk . Therefore, where the second inequality follows from Jensen's inequality, since x 2 is a convex function. Since θ T jj ≤ 1, then F ST ≤ 1 as well. Equality in the second bound requires θ T jj = F ST for all j, and equality in the first bound requires θ T jk = θ T jj = θ T kk , so thatθ T = F ST requires θ T jk = F ST for all (j, k). Since all w j , θ T jk ≥ 0, then where the second inequality follows from dropping j = k terms from the double sum ofθ T . The case w j = 1 n gives 1 n F ST ≤θ T , with equality for the independent subpopulations model by construction.

F Moments of estimator building blocks
Here we calculate first and some second moments for "building block" quantities that recur in our estimators, particularly terms involving x ij andp T i , and which enable us to calculate the limits of S8 our estimators. Below are examples for genotypes, which follow from Eq. (5) in the main text; calculations for IAFs follow analogously from Eq. (6) in the main text (not shown).

G Derivation of new kinship estimator
To begin the method-of-moments derivation, we compute the raw first and second moments from the kinship model of Eq. (5) in the main text.
For obtain a symmetric estimator, we also compute the raw moments of 2 − x ij (which counts the alternative allele): If we solved for p T i using the first moment equations, we would recover the standard kinship estimator of Eqs. (17) and (18), so we shall avoid this strategy.
To proceed, we average the two second moment equations above. Note that Therefore, the symmetric estimator (which gives the same calculation if the reference allele is switched) is A genome-wide estimate is obtained by averaging the previous statistics across loci, resulting in The new kinship estimator follows from obtaining a consistent estimator of the limit of v T m as m goes to infinity, and applying it to solve for ϕ T jk in the above equation for the expectation of A jk , as detailed in the main text.
H Mathematical equivalence between the proposed kinship estimator as described here and our original 2016 proposal The new kinship estimator presented in the main text is algebraically equivalent to the version presented in our 2016 manuscript [9]. As notation differs, here we establish in detail the correspondence. We begin by restating the preprint estimator in notation as close as possible in the original preprint, but with two variables renamed as they clash with our present notation. In the preprint, there are two steps to the final kinship estimator, analogous to the two steps in our updated presentation. The first step consisted of obtaining preadjusted kinship estimates that had a uniform bias, and in the second step the estimates are unbiased using the minimum preadjusted kinship estimate. The preadjusted estimator (referred to asφ T,new jk in the preprint, but which we will denote by C jk here since it clashes with the notation of our final kinship estimator in Eq. (34) in the main text) was given by Thus, bias was determined by the unknownφ T , which plays the same role as the unknown v T m in our present work. The final estimator (denoted in the preprint byφ T,new jk , but which matches our S10 final kinship estimateφ T,new jk in Eq. (34) in the main text, so we use our new notation here) was given byφ Lastly, it was proposed to estimate the unknownφ T from the minimum C jk , since under the hypothesis that the true minimum kinship is zero, then Lastly, the preprint recommends a more stable estimate thanĈ min above. Thus, the general approach was in place that resulted in new kinship estimates given an estimatorĈ min of C min . Now we tie the preprint estimator to our present estimator. First note that Eq. (S11) is the same as where A jk is exactly as in Eq. (33) in the main text, and B = 4 m i=1p T i (1 −p T i ) is a constant shared by all individual pairs j and k. We will show that this B cancels out in the end, so that its form does not matter. Thus, in the present notation we eliminated B, resulting not only in a simpler presentation but also eliminating the only term in the preprint notation that depended on p T i (in the new formulation there is no need to estimate any allele frequencies).
The key is to notice that, since B is the same for all individual pairs, any approach to estimate the limit of the minimum A jk corresponds directly to an approach for estimating the limit of the minimum C jk , and vice versa, using the relation: Solving for the mean kinship in Eq. (S13) results in an estimate given bŷ Lastly, plugging thisφ T into Eq. (S12) results in a final estimator of which equals the present new estimator in Eq. (34) in the main text, as desired.