Confidence Intervals for Population Allele Frequencies: The General Case of Sampling from a Finite Diploid Population of Any Size

The estimation of population allele frequencies using sample data forms a central component of studies in population genetics. These estimates can be used to test hypotheses on the evolutionary processes governing changes in genetic variation among populations. However, existing studies frequently do not account for sampling uncertainty in these estimates, thus compromising their utility. Incorporation of this uncertainty has been hindered by the lack of a method for constructing confidence intervals containing the population allele frequencies, for the general case of sampling from a finite diploid population of any size. In this study, we address this important knowledge gap by presenting a rigorous mathematical method to construct such confidence intervals. For a range of scenarios, the method is used to demonstrate that for a particular allele, in order to obtain accurate estimates within 0.05 of the population allele frequency with high probability (%), a sample size of is often required. This analysis is augmented by an application of the method to empirical sample allele frequency data for two populations of the checkerspot butterfly (Melitaea cinxia L.), occupying meadows in Finland. For each population, the method is used to derive % confidence intervals for the population frequencies of three alleles. These intervals are then used to construct two joint % confidence regions, one for the set of three frequencies for each population. These regions are then used to derive a % confidence interval for Jost's D, a measure of genetic differentiation between the two populations. Overall, the results demonstrate the practical utility of the method with respect to informing sampling design and accounting for sampling uncertainty in studies of population genetics, important for scientific hypothesis-testing and also for risk-based natural resource management.


Introduction
Spatiotemporal patterns of genetic variation among populations are often used to test hypotheses about processes underlying the patterns, such as selection, migration and genetic drift (e.g., [1,2,3,4,5]).This genetic variation captures differences in genetic structure among populations, where the genetic structure of a population is determined by the distribution of alleles among individuals in the population.The allele distribution of a population is the result of a range of biological and environmental processes acting on the population and also on surrounding populations within the same geographical region, which result in non-random mixing of gametes among individuals in all populations.Patterns in allele distributions among populations are generally assessed by consideration of variations in allele frequencies (e.g., [6,7,8,9]).
Logistic and ethical constraints mean that in practice, a population is unlikely to be sampled in its entirety, such that the population allele frequencies have to be estimated using a subset of sampled individuals.For diploid organisms in a large population, it has been established that the frequency of an allele in a sample provides an unbiased estimate of the frequency in the population as a whole [10].Thus, if samples of a given size are repeatedly taken from a large population, then the mean frequency of an allele in a sample converges to the population allele frequency as the number of samples increases.However, many studies only take a single sample from a population and present or use the resulting frequencies of alleles in the sample, without accounting for sampling uncertainty (e.g., [11,12,13,14,15,16,17,18]).Therefore, these studies implicitly assume that the sample allele frequencies are close to the population allele frequencies, which is by no means guaranteed.Interpretation of findings from these studies is therefore complicated by the potential for large sampling uncertainty.Ideally, in this single sample case, uncertainty bounds for the population allele frequencies would be quantified, based on the sample allele frequencies.
A recent study by Hale et al. [19] used computer simulations to draw samples from four diploid populations, with allele frequencies based on four empirical datasets.Subsequently, they used the sample data to derive the means, variances and ranges of allele frequencies in samples of varying size, as well as the means, variances and ranges of indicators that are a function of allele frequencies (specifically, heterozygosity and F ST ).Using these results, Hale et al. [19] concluded that a sample size of 30 is sufficient to accurately estimate population allele frequencies when using microsatellites.However, the authors did not quantify uncertainty bounds for the population allele frequencies using their sample data, such that their assessment of accuracy lacks a rigorous quantitative basis.Furthermore, for each sample size, sample frequencies were calculated using only 100 replicate samples, which may not give a good approximation of the true dispersion in the sample frequencies.This highlights a weakness of using a computational approach that lacks an underlying mathematical theory.Moreover, Hale et al. [19] did not consider the situation identified earlier, where one sample is taken and there is a need to quantify uncertainty of population allele frequencies using just the sample allele frequencies.For this situation, earlier studies have derived confidence intervals in order to capture uncertainty in the population allele frequencies.If a diploid population is of infinite size and is at Hardy-Weinberg equilibrium (HWE), then the frequency of an allele in a sample follows a binomial distribution [10].Thus, Gillespie [20] proposed the use of the Wald confidence interval [21].However, this interval only performs well for sufficiently large sample sizes -with small sample sizes, it is too short [22].For the binomial distribution, other confidence intervals have been derived that do perform well for small sample sizes [22,23], notably the Clopper-Pearson interval that was derived eight decades ago [24].However, these only apply in the limited cases when the population is at HWE. Weir [10] proposed a confidence interval that allows for deviations from HWE, analogous to the Wald confidence interval.However, like the latter, it is expected to perform well only for sufficiently large sample sizes [10].This is problematic because the accuracy at a particular sample size is unknown, so ''sufficiently large'' cannot be rigorously quantified.Moreover, all the confidence intervals considered only apply to cases when the population can be assumed to be of infinite size, i.e. when the population size is much larger than the sample size.This does not reflect the range of scenarios encountered in empirical research (e.g., [13,25,26,27]).
In this study, we build on previous work by constructing confidence intervals for population allele frequencies for the general case where (i) the population is diploid, finite and can be of any size; (ii) the sample can take any size less than or equal to that of the population; and (iii) the population can deviate from HWE to any extent.These confidence intervals are guaranteed to contain the population allele frequency with a probability above a known threshold.The method derived for constructing these intervals is then used to calculate sample sizes required to achieve accurate estimates of population allele frequencies, under a range of scenarios.Here, accuracy is measured as the length of the confidence intervals.The sample sizes derived serve as a guide for determination of suitable sample sizes in future population genetic studies.In particular, we show that a sample size of 30 does not necessarily give accurate estimates, thus refining the conclusion of Hale et al. [19].Lastly, we provide an example of how the method can be applied to microsatellite data for two populations of the checkerspot butterfly (Melitaea cinxia L.) [13], to derive confidence intervals for population allele frequencies and also for Jost's D, a measure of genetic differentiation between two populations that is a function of their population allele frequencies [9].The data comes from a study [13] where it is unclear that the populations can be assumed to be of infinite sizes or at HWE.This example illustrates how the mathematical theory underlying our method can be used to quantify sampling uncertainty not only in population allele frequencies, but also in parameters that are functions of these frequencies, important for hypothesis-testing and also for natural resource management.
Overall, this study provides a rigorous mathematical quantification of sampling uncertainty in population allele frequencies, without the typically unrealistic constraints of assuming that a population has an infinite size and is at HWE.Thus, the results can be applied to a wide range of studies in population genetics.

Methods
In order to construct confidence intervals for the general case of taking samples of any size from a diploid population of any size (larger than or equal to the sample size) and with any degree of deviation from HWE, the sampling distribution of the allele frequencies in this general case is first exactly specified.This distribution is then used to derive formulae specifying confidence intervals that contain the population allele frequencies with probability above a known threshold.Using these formulae, confidence intervals are derived for a range of archetypal scenarios, which are then used to calculate sample sizes that permit accurate estimates of population allele frequencies in these scenarios.In addition, the formulae are applied to a real scenario where samples are taken from two butterfly populations [13], to construct confidence intervals for the population allele frequencies of these two populations.The intervals are then used to derive a corresponding confidence interval for Jost's D [9].

Derivation of sampling distribution of allele frequencies
Consider a population of M diploid individuals, from which a sample of N individuals is randomly drawn (M §N).At the locus of interest, there are nw1 alleles, denoted by A i , i [ 1, 2, . . .,n f g .Let the population allele frequencies be denoted by p i , with the corresponding sample allele frequencies being denoted by p i,N .Also, let P ij be the frequency of individuals in the population with alleles A i and A j (i, j [ 1, 2, . . .,n f g ), such that MP ij is the number of corresponding individuals in the base population.p i is related to P ij by the formula p i ~Pii z P n j~1, j=i P ij =2 À Á .Here, P ii is a measure of the homozygosity of individuals in the population with respect to allele A i .P ii ~pi { P n j~1, j=i P ij =2 À Á , and thus P ii ƒp i .If 0ƒp i ƒ0:5, then the minimum value of P ii is 0, since all copies of allele A i can be distributed among heterozygotes of allele A i .However, if p i w0:5, then this is not possible -there must be at least one homozygote of allele A i .In this case, the minimum number of homozygotes is realized when all heterozygotes have a copy of allele A i , i.e. when P ii z P n j~1, j=i P ij ~1.Rearranging this for P n j~1, j=i P ij and substituting into P ii ~pi { P n j~1; j=i P ij =2 À Á gives the minimum value of P ii as 2p i {1.Thus, overall, Max 0, 2p i {1 f g ƒP ii ƒp i .The frequency of allele A i in the sample of size N is given by p i,N ~Yi,N = 2N ð Þ, where Y i,N is the number of copies of allele A i in the sample.Thus, the probability distribution of p i,N is the same as the probability distribution of Y i,N except with the x-axis scaled by a factor 1= 2N ð Þ. Denote the probability mass function (pmf) for for y i,N outside this range.Thus, for the following calculations, only y i,N [ 0, 2N ½ are considered.Now, Y i,N ~2X ii zX i : , where X i : ~Pn j~1, j=i X ij and X ij is the number of individuals in the sample with alleles A i and A j .X ii , X i : and Z i ~N{X ii {X i : thus represent the number of individuals in the sample with two copies, one copy and no copies of allele A i , respectively.X ii , X i : and Z i follow a multivariate hypergeometric distribution with pmf given by: where terms on the right-hand side in brackets are binomial coefficients.Since the number of individuals in the sample must equal N, x ii zx i : Þfor all those biologically feasible combinations of x ii , x i : and z i satisfying y i,N ~2x ii zx i : .It will now be shown that this summation can be simplified to the sum of an expression that depends only on x ii and y i,N , over all x ii between lower and upper bounds that only depend on y i,N .This considerably eases calculation of P Y i,N ~yi,N ð Þ .Firstly, the number of individuals in the sample with two copies, one copy or no copies of allele A i cannot exceed the corresponding numbers in the sampled population.This gives rise to three inequalities: Secondly, the number of individuals with two copies, one copy or no copies of allele A i in the sample must be non-negative.This gives rise to three more inequalities: Thus, the inequalities ( 2)-( 7) can be rearranged to obtain the double inequality: It is noted that x i : ~yi,N {2x ii , which means that x i : §0 (inequality ( 6)) ensures that x ii ƒy i,N =2ƒN, as required.Denote the upper and lower bounds in ( 8) by L y i,N ð Þ and U y i,N ð Þ respectively.Then P Y i,N ~yi,N ð Þcan be written as: where z i has been rewritten using z i ~N{x ii {x i : and x i : ~yi,N {2x ii , ceiling a ½ is the smallest integer larger than a, and floor a ½ is the largest integer smaller than a.The ceiling and floor functions are introduced to ensure that x ii is an integer, which it must be to have a biological interpretation.
Þ , the pmf for p i,N , and thus defines the probability distribution of p i,N .Equation ( 9) can be used to define the probability distribution of p i,N given only four parameters: M, p i , P ii and N.

Derivation of confidence intervals for sample allele frequencies
For given population size (M), population allele frequency (p i ), population frequency of homozygotes with allele A i (P ii ) and sample size (N), the probability distribution for the sample allele frequency (p i,N ) can be calculated exactly using equation (9).The mean value of this distribution is: as in the case of sampling from an infinite diploid population [10].
In equation (10), the expectation E X ij Â Ã ~NP ij has been used, which follows from the fact that the X ij 's, with i, j [ 1, 2, . . .,n f g , follow a multivariate hypergeometric distribution with parameters M, N and MP ij [28].In addition, it can be proved that the variance of p i,N is (see File S1).The variance thus includes a standard finite correction factor M{N ð Þ = M{1 ð Þ that tends to 1 as M??, as required.
Therefore, as M??, ð Þ, the variance in the case of an infinite population size [10].The cumulative distribution function (cdf) for p i,N is specified by: where y i,N is an integer in the interval 0, 2N ½ .This cdf can be calculated using equation (9) and is a function of p i .To construct a CI for p i given M, N and P ii , consider testing the null hypothesis H 0 : p i ~pi,0 against the alternative H 1 : p i =p i,0 at significance level a for an observed value of Y i,N , denoted by ŷ y i,N .The null hypothesis is not rejected if using p i ~pi,0 , ŷ y i,N falls within an acceptance region defined by B a ~yi,N,a=2 , , where y i,N,a=2 is the largest integer for which P Y i,N vy i,N,a=2 À Á ƒa=2 and B a can be calculated using equation (12).One method of constructing a §100 1{a ð Þ% CI for p i is to determine the set of values of p i,0 for which the null hypothesis is not rejected at significance level a, denoted by , and defining the CI as [29].This hypothesis-testing approach corresponds to the ''testmethod'' described by Talens [30], who applied it to construct CI's for parameters of the univariate hypergeometric distribution.If P i,H 0 ,a ~1, the empty set, then the acceptance region needs to be extended to include ŷ y i,N for at least one value of p i,0 .Thus, in this case, a is decreased until ŷ y i,N is in the acceptance region for at least one p i,0 value.The CI specified by equation ( 13) is not an exact 100 1{a ð Þ% CI because the distribution for Y i,N is discrete and because there may be some p i,0 values between Min P i,H0,a ð Þand Max P i,H0,a ð Þ for which ŷ y i,N = a , since P Y i,N ƒy i,N ð Þis not guaranteed to be a monotonic function of p i (see equations ( 9) and ( 12)).It is noted that for the probability parameter in a binomial distribution, Clopper and Pearson [24] derived §100 1{a ð Þ% CI's using an analogous method.Thus, for the case of a large population size relative to the sample size (MwwN) and HWE, where p i,N approximately follows a binomial distribution, CI's for p i derived using our method would be virtually the same as Clopper-Pearson CI's.This can be verified by explicitly calculating and comparing CI's using the two methods -for example, if M~1,000,000wwN~30 and ŷ y i,N ~10, then both methods give the CI 0:083, 0:285 ½ for p i .In deriving equation (13), it was assumed that P ii is known.However, P ii is generally unknown.In this case, a §100 1{a ð Þ% confidence region (CR) can be derived for p i and P ii in an analogous way, by considering the null hypothesis H' 0 : p i ~pi,0 , P ii ~Pii,0 .The CR can be derived by determining the set of vectors p i,0 , P ii,0 ð Þfor which the null hypothesis is not rejected at significance level a.This set is denoted by , where B a is as defined before.Using the CR, §100 1{a ð Þ% CI's for p i and P ii can be defined as and where P i,H' 0 ,a, j is the set of values consisting of the jth elements in the set of vectors P i,H' 0 ,a .In determining P i,H' 0 ,a , p i,0 and P ii,0 values over the biologically feasible ranges are tested.Since the number of copies of allele A i in the population must be at least the number found in the sample, ŷ y i,N 2M ð Þƒp i,0 .Also, the number of copies of all other alleles in the population must be at least the number found in the sample, 2N{ŷ y i,N ; this constrains the maximum value of p i,0 according to . For a given value of p i,0 , Max 0, 2p i,0 {1 f g ƒP ii,0 ƒp i,0 , as determined earlier.If P i,H' 0 ,a ~1, then the acceptance region is extended to include ŷ y i,N for at least one pair of values of p i,0 and P ii,0 , by decreasing a.
CI's specified by equation ( 13) can be calculated given M, N, P ii and a, whereas those specified by equations (14a) and (14b) can be calculated given just M, N and a.In this paper, computation of any CI's using these equations, incorporating the underlying formulae specified by equations ( 1), ( 8), ( 9) and ( 12), was carried out using the software package Mathematica v5.0 [31].Supporting Webpage 1 provides Mathematica code for computation of the CI's and can be viewed at http://rpubs.com/kkeenan02/Fung-Keenan-Mathematica/.However, other software packages such as MATLAB [32] and R [33] could also be used to implement the formulae; indeed, an R version of the code is provided on Supporting Webpage 2 and can be viewed at http://rpubs.com/kkeenan02/Fung-Keenan-R/.All source code for the composition of Supporting Webpages 1 and 2, including the raw Mathematica and R code used, can be accessed at https://github.com/kkeenan02/Fung-Keenan2013/.

Determining minimum sample sizes for accurate estimation of population allele frequencies
In studies of population genetics, it is desirable to obtain a sample allele frequency close to the population allele frequency with high probability [19], i.e. a short CI of high probability.Thus, under three representative scenarios, we calculate the minimum sample sizes required to obtain §95% CI's with lengths ƒ0:2 and ƒ0:1 across all possible values of the observed sample allele frequency, p p i,N ~ŷ y i,N 2N ð Þ.These minimum sample sizes are denoted by N ƒ0:2 and N ƒ0:1 respectively.They are calculated by starting with N~10, computing CI lengths for all possible values of p p i,N and then taking the maximum value.This is repeated for increasing N in increments of 10 until the maximum CI length becomes ƒ0:1.The N value with maximum CI length closest to 0.2 is then chosen, and then increased or decreased as necessary to find N ƒ0:2 .N ƒ0:1 is derived analogously.Maximum CI lengths of 0.2 and 0.1 represent small maximum absolute errors in the estimated allele frequency of 0.1 and 0.05 respectively, if the estimate is taken as the mid-point of the CI.The first scenario examined, Scenario 1, is sampling from a population of M~1,000 at HWE. M~1,000 is larger than or on the same order of magnitude as the upper bound of the range of size estimates for 15/24 (63%) species populations collated by Frankham et al. [25], covering mammals, birds, insects and plants.Thus, M~1,000 is taken to represent a large population.Later, M is varied over two orders of magnitude to consider populations that range from small to very large.Since there is a HWE, P ii ~p2 i .Also, in the sampled population, the number of homozygotes with allele i, MP ii , must be an integer.This restricts the number of values that p i can take to 11 equally spaced values in the interval 0, 1 ½ (starting from 0).Scenarios 2 and 3 represent situations where HWE does not hold, such that P ii =p 2 i .In Scenario 2, P ii is assumed to take its minimum value, which is Max 0, 2p i {1 f g , whereas in Scenario 3, P ii is assumed to take its maximum value, which is p i .In comparison with Scenario 1, p i can now take a larger number of values -it can take the 2,001 equally spaced values in the interval 0, 1 ½ in Scenario 2 and the 1,001 equally spaced values in the interval 0, 1 ½ in Scenario 3. Since the variance of p i,N is an [ B increasing function of P ii (equation ( 11)), CI's derived using p i,N are expected to increase in length with P ii as well.This means that CI lengths and minimum sample sizes (N ƒ0:2 and N ƒ0:1 ) derived for Scenarios 2 and 3 are expected to encompass the entire ranges of possible values.
In the three scenarios examined, P ii is specified as a function of p i , such that equation ( 13) can be used to calculate a CI for p i given a value of p p i,N .Calculation of this CI involves considering all possible values of p i and determining under which values p p i,N falls within the corresponding acceptance region (as described above).An alternative scenario, Scenario 4, is that the relationship between P ii and p i is unknown.In this case, equation (14a) has to be used to calculate a CI for p i given a value of p p i,N , which involves considering all possible values of p i and P ii .This results in a considerable increase in computation time; for example, when sampling N~30 individuals from a population of M~1,000, 2,001 values of p i need to be considered but 1,002,001 combinations of p i and P ii need to be considered, representing an increase in computational time by two orders of magnitude.However, for a given value of p i , the maximum value of P ii gives the highest variance for p i,N and is thus expected to maximize the length of the acceptance region within which p p i,N could fall within.Thus, CI's for p i derived under Scenario 4 are expected to closely match those derived under the case of maximum homozygosity (Scenario 3).This can be verified by explicit calculation of the CI's -for example, given M~100, N~30 and p p i,N ~15, the CI's derived under Scenario 4 and Scenario 3 are 0:14, 0:4 ½ and 0:15, 0:39 ½ , representing a difference in length of only 0.02.Similar results hold for p p i,N ~10 and 5. Therefore, results from Scenario 3 are used to approximate those for Scenario 4, obviating the need for long computational runtimes to explore all possible combinations of p i and P ii .
Lastly, from equation ( 11), the variance of p i,N is an increasing function of M. Thus, CI lengths for p i are expected to increase with M, which would result in increases in N ƒ0:2 and N ƒ0:1 .To test this explicitly, for a population at HWE and with N~30, maximum lengths for the §95% CI for p i (across all values of p p i,N ) are calculated for M~100, 250, 500, 750, 1,000, 2,500, 5,000, 7,500 and 10,000.This range corresponds to populations that are small to very large [25].

Application to an empirical data set for checkerspot butterflies
To demonstrate how the theory developed can be used in practice, it is applied to microsatellite data for samples from two populations of the checkerspot butterfly (Melitaea cinxia L.) occupying meadows on the A ˚land Islands in Finland [13].Specifically, for the CINX1 locus, §100 1{2a ð Þ% CI's are calculated for the frequencies of alleles A, B and C for the Pra ¨sto änd Finstro ¨m populations, using the corresponding sample allele frequencies (Table 1 of Palo et al. [13]).For each population, a CI is not calculated for the population frequency of the fourth and final allele D, because this is fixed by the population frequencies of the first three alleles.To achieve consistency with the notation used in our study, henceforth, alleles A, B, C and D are referred to as alleles A 1 , A 2 , A 3 and A 4 respectively.The sample size for the Pra ¨sto ¨population is N 1 ~53, whereas that for the Finstro ¨m population is N 2 ~74 [13].p i,N1 and q i,N2 , i [ 1,2,3,4 f g , are used to denote the sample allele frequencies of A i in the Pra ¨sto ¨and Finstro ¨m populations respectively.Similarly, p i and q i are used to denote the population allele frequencies of A i in the two populations, respectively.The two populations consist of two and seven subpopulations respectively.Thus, technically, the two populations can be referred to as metapopulations, although this terminology is not used in our study for clarity.The subpopulations form part of a total of about 536 subpopulations on the A ˚land Islands, with an estimated total size ranging from 35,000 to at least 200,000 [13].Thus, the size of each subpopulation is assumed to be (35,000z200,000)=2 ½ =536~219, such that the Pra ¨sto ¨and Finstro ¨m populations are assumed to have a size of M 1 ~2|219~438 and M 2 ~7|219~1533 respectively.The observed sample allele frequencies in the two populations, denoted by p p i,N1 and q q i,N2 respectively, are taken directly from [13].These are used to calculate the observed number of copies of allele A i in each population, denoted by ŷ y i,N1 and ẑ z i,N2 respectively, using the formulae ŷ y i,N1 ~2N 1 p p i,N1 and ẑ z i,N2 ~2N 2 q q i,N2 .Due to rounding error in p p i,N1 and q q i,N2 values from [13], ŷ y i,N1 and ẑ z i,N2 had to be rounded to the nearest integer.Since the true homozygosity of each allele in each population is unknown [13], conservative CI's are calculated assuming P ii takes its maximum value of p i (this is expected to maximize the lengths of the CI's, as explained in the previous section).In addition, a~0:025=3 is chosen, such that a §98:3% CI is derived for each of the three population allele frequencies in each of the two populations.The reason for this choice of a is because for each population, using the Bonferroni Inequality [34], the cubic region defined by the three CI's can be taken as a §100 1{2b ð Þ% confidence region (CR) for the three population allele frequencies, where 3a §b.The choice of a~0:025=3 allows §95% CR's to be derived for each set of three population allele frequencies in the two populations.
To demonstrate how the theory developed in this paper can be used to derive CI's for genetic indicators that are a function of the population allele frequencies, the §95% CR's are used to calculate a §95% CI for Jost's D for the CINX1 locus and the Pra ¨sto ¨and Finstro ¨m butterfly populations.Jost's D is a measure of genetic distance between populations [9].For the case of two populations, it is given by where, using the notation in this example, and A lower limit to a §95% CI for D Jost can be derived by minimizing the function in (15) given the constraints that the six population allele frequencies are contained within the two corresponding §95% CR's, and the two constraints 1{ P 3 i~1 p i ~p4 §0 and 1{ P 3 i~1 q i ~q4 §0.This lower limit is denoted by L D .Similarly, the upper limit to a §95% CI for D Jost can be derived by maximizing the function in (15) under the same constraints.This is denoted by U D .In this study, the ''Minimize'' and ''Maximize'' functions in Mathematica v5.0 [31] are used to compute L D and U D , but corresponding functions may be used in other software packages, such as the ''solnp'' function in the ''Rsolnp'' R package.A §95% CI for D Jost can then be defined as the interval L D , U D ½ .Supporting Webpage 1 provides Mathematica code that can be used to calculate L D , U D ½ for the butterfly case study examined (http://rpubs.com/kkeenan02/Fung-Keenan-Mathematica/). Corresponding R code is presented on Supporting Webpage 2 (http://rpubs.com/kkeenan02/Fung-Keenan-R/).

Maximum length of §95% confidence interval with increasing sample size
For Scenario 1, where the sampled diploid population of size M~1,000 was at HWE, the maximum length of the §95% CI for the population frequency of allele A i , p i , considering all possible observed values of the sample allele frequency, p p i,N , decreased non-linearly with sample size N (Figure 1).The minimum N required to achieve a maximum length ƒ0:2, N ƒ0:2 , was 22, whereas the minimum N required to achieve a length ƒ0:1, N ƒ0:1 , was 49 (Figure 1).
For given N and p p i,N , the CI for p i consists of all possible values of p i for which p p i,N lies in the corresponding acceptance region, as described in Methods.This acceptance region is expected to increase with the variance of p i,N , s 2 p i,N ½ .At HWE, the frequency of homozygotes of allele A i , P ii , is equal to p 2 i ; thus, according to equation (11), s 2 p i,N ½ takes its highest values at intermediate values of p i .As a result, intermediate values of p p i,N are likely to fall within the acceptance regions of more values of p i , such that the corresponding CI lengths are generally longer.Simulation results for Scenario 1 were broadly in agreement with these theoretical expectations (Figure 2).
In Scenario 2, where the population was no longer at HWE but attained its lowest P ii , the maximum CI length also decreased nonlinearly with increasing N (Figure 1).Compared with Scenario 1, N ƒ0:2 was slightly larger and N ƒ0:1 was larger by about a factor of two, taking values of 26 and 94 respectively (Figure 1).This is contrary to expectations that minimum homozygosity would give smaller N ƒ0:2 and N ƒ0:1 (see Methods).The reason is that although ½ is smaller for given N and p i compared with the case of HWE, which would be expected to result in fewer p i values for which a given p p i,N falls within the corresponding acceptance regions and hence a shorter CI, there are more possible values of p i (2,001 compared with 11 -see Methods), which increased the number of p i values for which p p i,N falls within the corresponding acceptance regions.For Scenario 2, P ii ~Max 0, 2p i {1 f g , such that s 2 p i,N ½ attains its highest values at p i values close to 0.25 and 0.75 (equation ( 11)).Thus, for given N, values of p p i,N near 0.25 and 0.75 are expected to generally exhibit the longest CI lengths.Simulation results for Scenario 2 were broadly in agreement with these theoretical expectations (Figure 3).
For the last scenario, Scenario 3, the population attained its highest P ii , representing the opposite extreme to Scenario 2. As for the previous two scenarios, the maximum CI length decreased non-linearly with increasing N (Figure 1), but this time, N ƒ0:2 and N ƒ0:1 took values of 94 and 285 respectively.These values were approximately four and six times as large as the corresponding values in Scenario 1 (Figure 1).In Scenario 3, P ii ~pi , and  equation (11) shows that intermediate values of p i give the highest values of s 2 p i,N ½ .Therefore, as in Scenario 1, intermediate values of p p i,N are expected to generally exhibit the longest CI lengths for given N. Again, simulation results were consistent with these expectations (Figure 4).

Maximum length of §95% confidence interval with increasing population size
When taking a sample of size N~30 from a diploid population at HWE, the maximum length of the §95% CI for p i , across all p p i,N , increased with population size M (Figure 5).As M was increased from a small value of 100 to 1,000, the maximum CI length remained the same at 0.2 (this is possible because there is nothing to prevent different values of M giving the same maximum CI length, using equation ( 13)).The maximum CI length increased when M was increased from 1,000 to a large value of 2,500, but only by 0.04.Thereafter, the length remained constant up to a very large value of M~7,500, and only increased by a small amount of 0.02 with a further increase in M to 10,000.Thus, simulation results confirm the theoretical expectation that CI length increases with M (as explained in Methods -this expectation arises because s 2 p i,N ½ increases with M, according to equation ( 11)).However, the increase in the CI length was modest as M was increased over two orders of magnitude.§95% confidence interval for Jost's D between two checkerspot butterfly populations For the Pra ¨sto ¨checkerspot butterfly population, the §98:3% CI's for the population frequencies for three of the four alleles at the CINX1 locus were computed using equation ( 13) and data from Palo et al. [13], as described in Methods.The three alleles are denoted by A 1 , A 2 and A 3 respectively, with the population frequencies denoted by p 1 , p 2 and p 3 respectively.The three corresponding §98:3% CI's derived were 0:002, 0:080 ½ , 0:598, 0:872 ½ and 0:114, 0:381 ½ respectively.Using the same methodology, for the Finstro ¨m population, the §98:3% CI's for the population frequencies of A 1 , A 2 and A 3 were calculated as 0:003, 0:109 ½ , 0:714, 0:914 ½ and 0:003, 0:109 ½ respectively.The three §98:3% CI's for the Pra ¨sto ¨population were used to form a cubic region, which corresponds to a §95% CR for A 1 , A 2 and A 3 [34].In the same way, the three §98:3% CI's for the Finstro ¨m population were used to form a §95% CR for A 1 , A 2 and A 3 .Within these two CR's, the maximum and minimum values of D Jost (D Jost is given by equation ( 15)) were calculated and used to derive a §95% CI for D Jost , as described in Methods.This CI for D Jost was derived as 2:34|10 {5 , 0:186 Â Ã .If D Jost was calculated simply using the sample allele frequencies and equation ( 15), then only one value would have been obtained: 0.043.
The application of our method to empirical data for two populations of the checkerspot butterfly [13] demonstrated how the underlying theory can be applied to construct joint §95% CR's for the population frequencies of multiple alleles at a single locus.These CR's were then used to construct a §95% CI for Jost's D, which measures genetic differentiation between the two populations.This illustrates how our method can be used to quantify sampling uncertainty in genetic indicators that are a function of population allele frequencies, thus facilitating hypothesis-testing and also risk-based natural resource management.In the example considered, the single point estimate of Jost's D using the sample allele frequencies was about four times lower than the upper bound of its §95% CI.Thus, use of the single point estimate without accounting for sampling uncertainty could lead to misleading conclusions.The effectiveness of any management measures based on such conclusions would be compromised, hindering the achievement of objectives related to conservation or sustainable use.Our example therefore highlights the practical utility of the method that we have derived.
In conclusion, we have presented a rigorous mathematical method for quantifying sampling uncertainty in estimates of population allele frequencies, for a general case that has hitherto not been analyzed.In addition, we have demonstrated its practical application in informing sampling design and determining uncertainty in genetic indicators.Thus, the method derived advances both theory and practice, with broad implications for a range of disciplines, including: conservation genetics, evolutionary genetics, genetic epidemiology, genome wide association studies (GWAS), forensics and medical genetics.In particular, the method provides exact answers to the question of how many individuals need to be sampled from a population in order to achieve a given level of accuracy in estimates of population allele frequencies.This is a question that has rarely been studied before [19], despite its important practical implications.Previous studies have derived sample sizes required to sample, with high probability, at least one copy of all alleles at a locus above a given frequency [35,36,37], but these do not correspond to sample sizes required to achieve accurate estimates of population allele frequencies.Derivation of the latter requires explicit quantification of sampling uncertainty, as we have done in this study.

Possible future extensions
The CI's and CR's constructed using our method are conservative in the sense that they contain the true values with a probability equal to or greater than a desired threshold.This conservative property is useful in hypothesis-testing if there is a need to decrease the probability of obtaining a false positive below a certain threshold.However, researchers would ideally like to construct CI's and CR's containing true values with a known probability, not just with probability at or above a known threshold.Therefore, future research could attempt to tighten the intervals and regions that we have derived, ideally until they cover a known probability.For example, Cai and Krishnamoorthy [23] devised a method (using their ''Combined Test'' approach) that was shown to give shorter CI's for the probability parameters of both the binomial and univariate hypergeometric distributions, when compared with a hypothesis-testing approach analogous to the one used in this paper.If their method could be extended to the population allele frequency parameter for the more complex distribution used in this paper (equation ( 9)), based on a multivariate hypergeometric distribution (equation ( 1)), then this might result in tighter CI's for population allele frequencies when sampling from a finite diploid population.
In addition, the CI's derived in our paper were designed to quantify the uncertainty in population allele frequencies that arises from taking a random sample of a finite diploid population, which can exhibit any degree of relatedness.The ''random'' refers to equal probability of choosing individuals that may be of any type, which does not imply that once an individual of a particular type has been sampled, the next individual sampled is equally likely to belong to any of the types (i.e., does not imply individuals in the population or sample are unrelated).Relatedness among individuals in the population is implicitly included within the population frequencies of the different genotypes, and is thus accounted for when calculating the sampling distribution for the population frequency of an allele A i , as specified by equation ( 1).However, this representation does not give an explicit quantification of the degree of relatedness among individuals in the population, for example using kinship coefficients.DeGiorgio and Rosenberg [38] used such coefficients to derive an unbiased estimator of heterozygosity (H~1{ P n i~1 p 2 i , for a locus with n alleles) in the case of sampling from a population of diploid individuals that could be related, and DeGiorgio et al. [39] extended these results to the case of individuals with arbitrary ploidy.In their calculations, the number of copies of allele A i in each sampled individual k was treated as a random variable and the covariances of these variables were then related to the kinship coefficients.This is different to calculations in our paper, where the number of sampled individuals with alleles A i and A j was treated as a random variable (see Methods).The method in this paper might be revised by considering the number of copies of allele A i for each sampled individual instead, following [38,39].This could allow explicit quantification of relatedness in the context of deriving CI's for the population allele frequencies.

Figure 1 .
Figure 1.Change in maximum length of $95% confidence interval with increasing sample size.Graph showing how the maximum length of the §95% confidence interval (CI) for the population frequency of an allele A i (p i ) changes with increasing sample size N, when sampling from a diploid population of size M~1,000.For a given N, the maximum CI length was derived by calculating CI lengths for all possible values of the observed sample allele frequency and then taking the maximum length.The three curves correspond to three scenarios where the population is (1) at Hardy-Weinberg equilibrium (HWE), (2) attains its lowest homozygosity value with respect to A i , and (3) attains its highest homozygosity value with respect to A i .For each curve, the two filled circles represent the minimum N values required for the maximum CI length to reach values of ƒ0:2 and ƒ0:1.For visual guidance, the two dashed horizontal lines mark maximum CI lengths of 0.2 and 0.1.The exact values of N tested are described in Methods, and equation (13) was used to calculate the CI lengths.doi:10.1371/journal.pone.0085925.g001

Figure 2 .
Figure 2. Change in length of §95% confidence interval across different observed values, under Hardy-Weinberg equilibrium.For a sample of size N taken from a population of size M~1,000 at Hardy-Weinberg equilibrium, graph showing the length of the §95% confidence interval (CI) for the population frequency of an allele A i (p i ) across all possible observed values of the sample allele frequency (p i,N ).The three curves correspond to N~10, 30 and 50.Equation (13) was used to calculate the length of each CI.doi:10.1371/journal.pone.0085925.g002

Figure 3 .
Figure 3. Change in length of §95% confidence interval across different observed values, under minimum homozygosity.For a sample of size N taken from a population of size M~1,000 with the minimum homozygosity possible for an allele A i , graph showing the length of the §95% confidence interval (CI) for the population frequency of A i (p i ) across all possible observed values of the sample allele frequency (p i,N ).The three curves correspond to N~10, 30 and 100.Equation (13) was used to calculate the length of each CI.doi:10.1371/journal.pone.0085925.g003

Figure 4 .Figure 5 .
Figure 4. Change in length of §95% confidence interval across different observed values, under maximum homozygosity.For a sample of size N taken from a population of size M~1,000 with the maximum homozygosity possible for an allele A i , graph showing the length of the §95% confidence interval (CI) for the population frequency of A i (p i ) across all possible observed values of the sample allele frequency (p i,N ).The four curves correspond to N~10, 30, 100 and 200.Equation (13) was used to calculate the length of each CI.doi:10.1371/journal.pone.0085925.g004