Estimating Effective Population Size from Linkage Disequilibrium between Unlinked Loci: Theory and Application to Fruit Fly Outbreak Populations

There is a substantial literature on the use of linkage disequilibrium (LD) to estimate effective population size using unlinked loci. The estimates are extremely sensitive to the sampling process, and there is currently no theory to cope with the possible biases. We derive formulae for the analysis of idealised populations mating at random with multi-allelic (microsatellite) loci. The ‘Burrows composite index’ is introduced in a novel way with a ‘composite haplotype table’. We show that in a sample of diploid size , the mean value of or from the composite haplotype table is biased by a factor of , rather than the usual factor for a conventional haplotype table. But analysis of population data using these formulae leads to estimates that are unrealistically low. We provide theory and simulation to show that this bias towards low estimates is due to null alleles, and introduce a randomised permutation correction to compensate for the bias. We also consider the effect of introducing a within-locus disequilibrium factor to , and find that this factor leads to a bias in the estimate. However this bias can be overcome using the same randomised permutation correction, to yield an altered with lower variance than the original , and one that is also insensitive to null alleles. The resulting formulae are used to provide estimates on 40 samples of the Queensland fruit fly, Bactrocera tryoni, from populations with widely divergent expectations. Linkage relationships are known for most of the microsatellite loci in this species. We find that there is little difference in the estimated values from using known unlinked loci as compared to using all loci, which is important for conservation studies where linkage relationships are unknown.


Introduction
The magnitude of linkage disequilibrium (LD) can be used to estimate effective population size [1][2][3][4][5]. In general, low populations sizes are expected to give rise to relatively high levels of LD, and similarly high population sizes to low LD levels. An important feature of this means of estimation is that measurement at a single point in time can provide information on effective size. Furthermore closely-linked loci give information on population sizes over historical periods of time, while loosely-linked loci estimate population sizes in the immediate past [3], [4].
Much recent attention has been paid to the use of unlinked loci for estimating population size, for which the term 'Linkage Disequilibrium' will inappropriately be used. There are three major advantages of studying unlinked loci. First, the majority of pairs of loci are unlinked. Secondly, these are the only locus pairs for which it is easy to estimate the recombination frequency, 50%. Finally, in the study of pest populations, and in the area of conservation, it is usually the most recent population sizes that are of interest, for which unlinked loci are the most relevant.
The principal problem in studying unlinked loci comes from the sample sizes needed to obtain accurate LD estimates. The expected disequilibrium is a function of 1=N e , where N e is the effective population size, assumed constant, and 1=S, where S is the sample size [6]. Unless sample sizes are large, the latter can overwhelm the former.
A second complication comes from the usual necessity to use diploid data. Most LD theory is based on haplotypes rather than diploid genotypes, which typically cannot be observed. Although the recognition of haplotypes may seem inappropriate for unlinked loci, the same distinction applies as for linked loci, because the information on population size comes from genes with the same parental origin rather than genes inherited from different parents. The passage from zygotic to to gametic parameters can be made using either the maximum likelihood estimator of Hill [7], or, as will be used here, the Burrows estimator as elaborated by Weir [8].
In preliminary investigations of the size of Queensland fruit fly populations, we found very low N e estimates for populations that are believed to be large. We traced this discrepancy to an excess of homozygous genotypes, believed to be due to the presence of null alleles at some of the microsatellite loci used in the study.
Because of these complications, the problem of finding an adequate estimator of N e is fraught with potential biases. Waples and Do [9] have, however, shown that their LDNe program works well in estimating N e from simulated data. The program uses empirically derived correction factors rather than investigating the underlying reasons for the biases. The purpose of the present paper is to produce an analytical solution to account for the biases. We derive two sets of formulae that do this, depending on whether a 'within-locus disequilibrium factor' is used or not, and compare the application of these two sets to simulated and real data.

Queensland Fruit Fly Samples
Two data sets are analysed in the paper.
(1) East coast Australian populations. The data are from 55 samples from towns in the state of NSW in the years 2002-2004 [10]. Some of these sample come from areas where the flies are endemic, and in other cases the outbreaks appear to be only temporary. (2) NorthWest. These flies were collected during the years 1999-2003 from Northern West Australia and the Northern Territory [11].
The data in the two cited papers have previously been summarised only in terms of single locus statistics. The present paper involves a two-locus analysis, which requires additional information from the original data sets. The original data sets are provided in Supporting Information, Data S1 and Data S2.

Computer Simulation
All simulations reported in the paper are forward Monte-Carlo simulations under the Wright-Fisher model. Parents were chosen randomly in each case, thereby allowing selfing and not assuming permanent mate bonding, an important aspect of population structure [6]. Most simulations involved a starting population with either 16 or 32 loci, each locus having the number of alleles chosen randomly between 2 and 8. Alleles were assigned randomly at different loci, assuming no systematic LD. Populations were simulated for 20 generations, followed by sampling without replacement of 32 individuals from the final population, and calculation of LD levels. Simulations were written in C, and are available on request.

Theory
Most of LD theory applies to gametes rather than genotypes. Fortunately a simple method, the 'Burrows composite LD coefficient', is available for handling genotypes. This coefficient has been defined by Cockerham and Weir [12] in terms of sums of genotype frequencies. It is convenient to introduce here a slightly different but simpler way of relating genotype frequencies to gamete frequencies. See Table 1 for a listing of symbols used. Figure 1 shows the principle for populating a 'composite haplotype table'. Each genotype in Part (i) contributes the four possible gametes to the composite haplotype table in Part (ii). In the case of double heterozygotes, where the phase is usually unknown, each of the four possible haplotypes is represented. For all other genotypes the haplotypes are known, but each genotype nevertheless contributes four haplotypes. Note the use of S rather than N for the diploid sample total to emphasise the distinction between number in a population (N) and number in a sample (S). The normal haploid table cannot be written down from the genotypes in Figure 1, but the total would be 2S, and, for example, the number of a genes = n a~2 n 1: zn 2: . The marginal totals in the composite table are double these.  table for one sample of size 32 from the Eastern Australia fruit fly  data set, where one microsatellite, a, has 3 alleles and a second, b,  has 4. Again the total in the haplotype table of Part (ii) of Figure 2  is 4x the total in the genotype table of Part (i), rather than 2x as  would be found in a table where all haplotypes were known. The usual LD coefficient can be calculated for the numbers in the composite haplotype table of Figure 1, and given the designation D(comp). It is: The LD coefficient of Cockerham and Weir [12], D, is defined in terms of frequencies P ij :: and P i: :j , and given as the sum of two coefficients, D ij :: zD i: :j : It can be seen from the definitions of P ij :: and P i: :j from [12], ignoring the sample-size correction N/(N21), that this LD coefficient is double the value of D(comp) given above.
The intuitive justification for the composite haplotype table is most readily seen in the case of random mating (which is not assumed in the definition of D(comp)). In a genotype such as A 1 A 2 ,B 3 B 4 , the true haplotypes will be either A 1 B 3 and A 2 B 4 or alternatively A 1 B 4 and A 2 B 3 . Under random mating, whichever are the 'false' haplotypes are expected to occur at frequencies that are simply the products of the relevant gene frequencies. The frequencies contributed by the false haplotypes will dilute, but not bias, the haplotype frequencies. It is readily shown that this dilution will be simply a factor of 2. For example, following Figure 1, the frequency of the ab haplotype in the composite table, p ab (comp), is the true frequency of the ab haplotype, p ab , except for the contribution from the double heterozygotes. The true contribution ought to be 1 2 p ab={{ , whereas it is in fact 1 4 ½p ab={{ zp a{={b . Thus the difference between these two is the difference between p ab (comp) and p ab , giving.
Under the assumption of random mating, it can be seen that.
where D is the usual LD parameter, equal to p ab {p a p b . Therefore Subtracting p a p b from each side, The LHS of this equation is, by definition, the disequilibrium coefficient from the composite table, D(comp). So the equation is simply Since this is an expectation under the assumption of random mating, the equation can be written as: where the expectation is taken over replicate populations of the same sample size. The LD measure introduced by Hill and Robertson [13] is r 2~D2 =½p a :(1{p a ):p b :(1{b b ). An equivalent parameter can be calculated from the composite haplotype table. The marginal frequencies are the same as for the regular gamete table. So from (1) it follows that the expectation of r 2 (comp) calculated from the composite table is It is convenient to define a coefficient where, under random mating, the composite r 2 estimates the gametic r 2 , rather than onequarter of the latter. As pointed out above, the LD coefficient of Cockerham and Weir [12] does this. Therefore we define the statistic r 2 c as which from (3), (1) and (2) is calculated as The above definition of r 2 c ignores an extra factor introduced by Weir [8]. This factor arises from the potential covariance of the two alleles at the a locus and similarly at the b locus. These covariances are implemented through a 'single-locus disequilibrium factor', p aa {p 2 a at the a locus and p bb {p 2 b at the b locus, which essentially measure deviations from expected homozygosity. The modified definition of r 2 , r 2 D , is Because of difficulties in implementing this disequilibrium factor, its discussion is deferred to a later section under this label.
x 2 for the composite haplotype table. Owing to doublecounting of genes, the composite gamete table has the property that all marginal totals are multiples of 2, while the overall total is a multiple of 4. Nevertheless a regular x 2 can be calculated for such tables, and the resulting x 2 (comp) values for a r6c table has close to the expected distribution for (r{1)(c{1) degrees of freedom (Appendix S1). It has the advantage of having more power than the x 2 values calculated from the genotype table, owing to the large number of zero and unit values in the genotype table. Its use in independence tests may, however, be limited by its sensitivity to null alleles (see below).
Weighting of r 2 values. The calculation of LD for a microsatellite data set involves two levels of summation. There will usually be many loci, say L, and each of the L(L{1)=2 pairs yields a separate estimate of r 2 . However within each locus pair, say locus l and locus m, there will be separate calculations for each pair of alleles. These two levels may be labelled as 'between locus pairs' and 'within locus pairs'. Each needs to be separately treated in terms of weighting of the r 2 values.
Between locus pairs. It is often the case that, through missing readings, different locus pairs will have reduced numbers of observations. The sample size for loci l and m may be designated as S lm . Furthermore some loci will have large numbers of alleles and therefore provide more information than loci with small numbers of alleles. Waples and Do [9] have suggested the weighting S 2 lm (k l {1)(k m {1) for the different r 2 values, where k l and k m are the number of alleles at the l and m loci respectively. The overall estimate of r 2 then becomes A recent publication [14] suggests a slightly different weighting compared to that of Waples and Do [9], which would make a small difference to the overall N e estimate.
Within locus pairs. r 2 ij values for alleles i at locus l and j at locus m can be simply averaged to provide the r 2 lm value. However this has the undesirable property that rare alleles exert a disproportionate influence on the overall r 2 value. This effect that can be ameliorated by omitting low frequency alleles [9]. A more systematic way of avoiding this problem is to weight alleles according to their frequency. In the case where the frequencies of alleles A i and B j are respectively p i and q j , a suitable weighting is p i q j [15]. The overall r 2 lm value then becomes Since P p i q j = 1, this value does not need to be normalised. And since the marginal frequencies are the same for the regular and composite tables, the same weighting applies to both.
It is interesting to contrast this weighting proportional to gene frequencies to the normal x 2 weighting of allele pairs for a r6c table. The x 2 with (r{1)(c{1) degrees of freedom can be expressed as the sum of r6c individual x 2 values each with 1 df, if the values are weighted by (1{p i )(1{q j ) rather than p i q j . Thus the x 2 weighting gives rare alleles higher weight than common ones. Zhao et al [15] have compared these two measures, amongst others, for their use in QTL mapping, and recommend a standardised x 2 weighting for this case. However the higher weighting for rare alleles, as suggested from x 2 , performs poorly as just a simple measure of LD (Appendix S2).
Because of the different weighting for x 2 (comp) and r 2 c , there is no simple relationship between the two statistics. In general, however, significant values of x 2 (comp) will lead to low estimates of N e and non-significant values of x 2 (comp) will be associated   with high N e estimates. See [16] for a more detailed examination of the x 2 statistic. The estimation of N e . The theory for estimating N e from unlinked loci has been developed by Weir [8], Weir and Hill [6] and Hill [3]. The effective size refers to a model Wright-Fisher population, and departures from this model, such as permanent pair bonding, make a difference of a factor of 2 in N e estimates [6]. Such pair bonding is, of course, unlikely in fruit fly populations. A model assuming discrete generations as considered here is, however, necessarily an approximation to real populations that are likely to have overlapping generations.
Taking no account, for the moment, of the effect of sample size, the key equation relating the expected LD level to N e is where c is the recombination frequency. This reduces to for unlinked loci, c~1 2 . The expectation for r 2 here assumes a balance between increase of r 2 due to finite population size and loss due to recombination. All of the equations below assume this balance between drift and recombination. Equation (8) is derived using the ratio of expectations of r 2 rather than the expectation of the ratio (see Hill [17]). However computer simulation shows that it works well for loosely linked or unlinked genes, those of interest in the present study. It is unbounded for low values of Nc, when the expression given by Sved and Feldman [18]: seems to work better. However for c~1 2 , the RHS of equation (10) reduces to 2=3N e , which is double the value of equation (9) and clearly inaccurate at this end of the scale.
Equations (8)-(10) assume the measurement of haplotype or gamete frequencies. As previously indicated, diploid data may be taken into account using the composite LD measure. It follows from equations (1) and (4) that the expectation for this measure is identical to that of (8): Sample size is a critical issue in determining LD levels [8], [6], [3]. This is especially the case for unlinked loci, where the levels of x 2 and r 2 cannot be zero even if there is no association of loci in the population being sampled. The usual procedure in estimating true LD levels in the population is simply to subtract the level of r 2 expected for zero LD with a particular sample size. As pointed out in [19], however, there is one circumstance where this procedure will not work. With complete LD in the population, r 2~1 , as commonly found for the most tightly linked SNPs, the subtraction will falsely suggest r 2 levels less than 1.
The effect on the equation for gametes (8) is to increase the expected value of r 2 by a factor of 1=2S, where 2S is the haploid sample size. The r 2 statistic in this case is shown asr r 2 to indicate that it is an estimate that includes the effects of sampling In fact the exact expectation forr r 2 should include the term 1=(2S{1) rather than 1=2S, equivalent to noting that the exact expectation of x 2 is 2S=(2S{1)~1z1=(2S{1) rather than 1 [20]. Weir [8] takes this factor into account in working with the 'unbiased' rather than 'biased' value of r 2 .
As shown in equations (1) and (3) of Appendix S1, the expectation for the composite x 2 , or equivalently the composite LD coefficientr r 2 c , involves the factor 1{ 1 (2S{1) 2 , rather than 1z 1 2S{1 applicable to haploid data. This factor is very close to 1. Similarly the sampling correction factor forr r 2 c for a diploid sample of size S is close to 1 S : or for unlinked loci: The estimate for N e comes from inverting equation (13), wherer r 2 c is calculated according to equation (6) and each r 2 kl is calculated according to equation (7) N N e~1 3fr r 2 The effect of null alleles. Use of the composite disequilibrium index depends critically on the ability to distinguish heterozygous and homozygous genotypes. Unfortunately the presence of any null alleles makes this distinction difficult. Genotypes such as a=a null , will be incorrectly scored as a=a. Homozygous null genotypes are not easily detected, since it is difficult to distinguish between absence of a band and simple failure of the PCR reaction in the rare cases expected for homozygotes.
The expected effect of null alleles on the composite LD statistic can be quantified as in Appendix S3. This shows that a null allele at one of the two loci at a frequency p n alters the expectation of equation (1) to: The statistic r 2 c is increased by the factor 1=(1{p n ) 2 . Although this effect may be small, it can readily be shown to overwhelm the calculations when the expected LD value is small due to high effective population size. In the case of an infinitely large population, the true value of r 2 c is expected to be just the sampling correction, which is approximately 1 S . A null allele at one of the two loci is expected to increase this value to 1 S : 1 (1{pn) 2 . Applying equation (13), the estimated value of r 2 c is then found by subtracting the usual 1 S sampling contribution, giving Applying numerical values to equation (15), for a sample size S~32 and null frequency p n~0 :02, the equation yields a value for N e of 259. The actual population in this case should be infinitely large, so that a null allele frequency as low as 2% can have a strikingly large effect. A null allele at frequency 0.1, still difficult to detect, leads to a N e estimate of 45.
Simulations with null alleles. Simulations with null alleles have been carried out to test these expectations. These are 2-locus simulations with heterozygosities ranging from 50% to 87%. Under these conditions, equation (15) may slightly over-estimate the effect of null alleles. For example, in the above case with S~32 and p n~0 :02, simulation yields a value of N e~2 65 compared to expectation of 259, while p n~0 :1 yields N e~5 6 compared to 45.
Simulation can also be used to check on more realistic cases where the value of N e comes from multiple loci, rather than a twolocus simulation. These show that even low levels of null alleles at a single locus may have measurable effects. For example with 32 loci each with 5 alleles, the presence of just one locus amongst these having a null allele frequency of 10% can have a detectable effect, reducing the expected value of N e from infinitely large to less than 1,000. Much the same result is found for 5 loci each with a null frequency of 2%. Simulations also indicate that 8 out of 16 loci having null alleles at a particular frequency has much the same effect as one out of two loci in the simulations and calculations given above.
Correcting the effect of null alleles through permutation. A general formulation for the estimation of r 2 c may be given as follows: Herer r 2 c is the estimate derived from the data, and r 2 c is the true measure of LD in the population, which is the quantity of interest in estimating N e . The analysis above has shown that in the absence of null alleles, the correction factor is attributable purely to sampling, and is 1 S ½1{ 1 (2S{1) 2 . The analysis on null alleles has shown that these will act as disturbing factors, whose effect can conveniently be subsumed into the correction factor in equation (16).
A randomising procedure can be suggested that will ameliorate the effect of null alleles. If the genotypes at each locus are independently randomly permuted amongst individuals, such as in the exact test of significance of LD, eg. [21], there can be no underlying LD. So the mean value of r 2 c given by the average of many such randomly permuted samples is a direct estimate of the correction factor in (16) taking into account the actual genotype structure. Ifr r 2 c ½permute is the estimated value of r 2 c in such permuted samples, then equation (16)  From equation (9), the estimate of N e is then simply Bothr r 2 c andr r 2 c ½permute can be given with or without the sampling correction factor 1 S ½1{ 1 (2S{1) 2 . In the data tables below, the factor has been subtracted from both in order to use equation (14) to estimate the value of N e with no permutation. However for the value of N e with r 2 ½permute subtracted, the sampling factor cancels out and could have been omitted.
The permutation approach can be tested by simulation. This is shown in the first four lines of Table 2. All, except for the final two rows, involved 16 loci simulated for 20 generations, followed by sampling of 32 individuals. The first row shows the average r 2 c value for a range of population sizes from 32 to 1028. The second row shows the estimated N e values using equation (14), with each of the r 2 c values calculated directly from the composite haplotype table according to equations (1) and (4). The N e values are in good agreement with expectation.
The effect of introducing null alleles is shown in row (3). The simulations here involved choosing 8 of the 16 loci, and replacing 5% of alleles with null alleles in these. The N e values calculated using equation (14) are drastically reduced, especially for the higher population sizes. However the permutation correction in row (4) essentially brings the estimated N e values back to their expected value.
In the case of an infinitely large population, simulation is not necessary to justify the permutation approach for correcting for null alleles. The loci would be in linkage equilibrium in such a population, with a true value of r 2 c of zero. The only contributing factor to the observed value ofr r 2 c must be the correction factor, attributable to null alleles, plus the usual sampling factor of approximately 1=S. Additional permutation of genotypes in a sample from a population with zero LD will not have any effect, so ther r 2 c estimates with and without permutation will be identical and equal to r 2 ½permute.
The case of an infinitely large population also serves to show that the permutation approach will NOT work in removing biases due to non-random mating. For example, a sample might consist of individuals from two independently randomly mating populations, where the substructure has not been recognised. Such a sample will give a reduced estimate of N e due to the induced LD [22] even though there may be no LD within each of the two contributing populations. However permuting the sample cannot resolve this issue. It can be seen that the value of r 2 ½permute from the composite table will be zero, except for the normal sampling component of approximately 1=S, assuming no null alleles. The application of equation (17) would then falsely indicate that the LD within populations was real and attributable to small population size. A valid correction could be produced if the sub-samples from the two populations could be independently permuted, which is possible in computer simulation but not with real data where the substructure is unknown.
Taking account of all types of departure from random mating thus appears difficult. But Waples and England [23] have considered the case of migration into a random mating population, and shown that there is little effect on N e estimates in this case.
Including the single-locus disequilibrium factor. As mentioned above, a homozygosity correction term was suggested by Weir [8], as shown in equation (5). The effects of this term are shown in row (5) of Table 2, the r 2 value, and row (6), the N e value. The latter shows a substantial bias in N e values, especially for the larger population sizes. The size of this discrepancy seems surprising, since, under random mating, the mean value of the homozygosity correction should be zero, and only a small correction should result. However there is a bias due to the fact that, in a finite-size sample, the expectation of aa frequency is less than p 2 a . This is most evident where there is a single a allele, giving p a~1 =2S, but where the frequency of the aa genotype must be zero.
The obvious way of eliminating this bias would seem to be the use of ½n a =2S½(n a {1)=2S] as the expected frequency of homozygotes. But simulation shows that this substantially overcorrects the bias. It is, however, possible, just as in the case of correcting the bias for null alleles, to use a permutation correction. This involves calculation ofr r 2 D from equation (5), random permutation of genotypes in the sample, and calculation of r r 2 D ½permute in permuted samples. The procedure may be summarised as:r From equation (19), the estimate of N e is N e~1 Simulation in row (7) of Table 2 shows that this correction works well for all N e values.  Non-outbreak population. *Significant at 5% level. **Significant at 1% level. ***Significant at 0.1% level. doi:10.1371/journal.pone.0069078.t003 The homozygosity deviation factor, p aa {p 2 a , was not specifically designed in [8] to take into account null alleles. It seems particularly vulnerable to their effect, since p aa may be substantially over-estimated. However simulation shows that this factor dramatically improves rather than worsens the effect of null alleles. In contrast to the bias of the r 2 c considered previously that lacks the disequilibrium correction, row (8), which introduces null alleles at the same frequency of 5% in half of the loci, gives almost the same N e value as row (6) where there are no null alleles. As previously, the bias due to the factor can be eliminated by subtracting the permutation r 2 using equation (20), as shown in row (9).
A second advantage of the disequilibrium factor is that it reduces the variance of estimates. The N e estimates given in Table 2 are based on large numbers of replicates. However the variability between individual simulation runs is high. Estimated standard deviations ofr r 2 c andr r 2 D are given in rows (10) and (11). Both standard deviations are high in relation to the mean, but that associated withr r 2 c is especially so. Of course the magnitude of the standard deviations is heavily dependent on the choice of number of loci and heterozygosity levels. Doubling the number of loci from 16 to 32 substantially reduces standard deviations, row (12) and row (13), but the relativities between the two terms are maintained.
In summary of Table 2, only the original N e estimate from equation (14), where r 2 c lacks the single-locus disequilibrium factor, gives unbiased N e estimates. Nevertheless there is a strong reason for including the faxtor, provided that the bias in N e values is compensated, either by permutation as above, or by empirical correction as implemented in the computer program LDNe [9]. Weir's insight in introducing this factor is vindicated by the increased accuracy of estimation and lowered sensitivity to null alleles.

Results and Discussion
Results for the East coast populations are given in Table 3. Populations with low sample numbers, 15 or less, were omitted from the analysis, leaving 40 out of the original 52 samples. The table includes mostly samples from outbreak areas where the flies were not normally found, but also ten samples where the flies are endemic, including one from Queensland, the home range of the flies. The expectation is that these ten are samples from large populations.
The results are based on 29 microsatellites, a total of 29628/ 2 = 406 locus pairs. Because of missing readings, not all pairs are present in all populations.
Amongst the 29 loci, 5 pairs are known to be closely linked, 51 pairs to be loosely linked, and 197 to be unlinked [24]. For the remaining 153 locus pairs, one or both chromosomes are unknown. Average values of r 2 D for the four classes are 0.0434, 0.0153, 0.0084 and 0.0096 respectively. As expected, average values are higher for the known linked loci.
Values of r 2 c were calculated from the composite haplotype tables, and N e values (column 3) were then calculated from these values using equation (14). All populations, including the eight non-outbreak populations, show very low estimated population sizes. All are highly significantly different from infinite population size. The major conclusion from the above analysis, however, is that the existence of either null alleles or population sub-structure can cause cause N e values to be substantially under-estimated.
A direct test for null alleles is given in Table 4. The signal for null alleles is, eg. [25], excess of homozygotes over expectation. In a data set with multiple populations, a non-parametric test can be carried out based on number of populations where there is such an excess. Table 4 shows the results, revealing at least 10 out of 29 microsatelltes with significant excess of homozygotes, which, in the lack of systematic homozygote excess, can likely be attributed to null alleles rather than to population structure.
Returning to Table 3, column 4 shows the values of N e using r 2 values corrected using equation (17). The correction factor in this case comes from 200,000 simulated populations for each outbreak sample. The N e values clearly have a more realistic mixture of population sizes than the estimates based on the raw r 2 values. Positive values of greater than 1,000 are listed as infinite, as also are the N e estimates associated with negative r 2 estimates. Lower values of N e have been rounded to the nearest 10.
The disequilibrium factor is introduced in column 5. This column is marked as giving the most likely estimate of N e . As expected, all of the really small population size estimates come in the outbreak populations rather than in the endemic populations.
The N e values in columns 3-5 are based on the unlinked locus pairs, including the 153 additional pairs likely to be loosely linked or unlinked. The values in column 6 are the equivalent corrected N e estimates based on all locus pairs. These can be directly compared to the values of N e given by the LDNe program [9], also based on all locus pairs. There is good agreement for the smallest population sizes, although the LDNe program shows infinite sizes in a number of cases where the values of N e in column 5 are finite. N e values in column 5, using unlinked loci, differ very little from values on column 6 using all loci. The expectation is that the use of linked loci will lead to under-estimation of N e . Many, but not all, values in column 6 are slightly below those in column 5, but the differences are not large. This result seems fortuitous, given that linkage relationships are not as well established for many organisms, necessitating the use of all locus pairs.
The final two columns of Table 3 show two different tests of significance, each based on the unlinked plus likely unlinked subsample of locus pairs. The first is the usual genotype likelihood test of LD [21], based on permutation of genotypes, with log likelihoods of the genotype tables summed over all relevant locus pairs. The second is a likelihood test based on permutation of genotypes, with likelihoods calculated on the composite haplotype tables. This test seems much more sensitive. Partly this is because, as indicated above and illustrated in Figure 2, the composite  haplotype table is much denser than the genotype table, where all  the zero and unit values do not contribute to the likelihood. However the second test is influenced by LD, but also by null alleles. The significant values are mostly associated with low population sizes, but there are exceptions to this in both directions. In general, the significance tests seem to be of limited value in judging whether population sizes are infinite or not.
The results from North-West samples [11] are given in Table 5. The results show a comparable proportion of high population numbers compared to the East coast populations of Table 3. Less has been known about these populations, but these results would suggest that, with the exception of the final two samples from Broome and Derby in West Australia, these are well-established outbreaks in most cases.

Summary of the Findings
The Burrows composite index can be equivalently derived from a 'composite haplotype table' in which all genotypes sampled contribute four possible haplotypes.
Although the composite haplotype table has marginal totals that are even numbers due to double counting, a valid r6c x 2 can be calculated for the table. The r 2 value calculated from this table, r 2 (comp), needs to be multiplied by a factor of 4 to give r 2 c , a valid estimator of r 2 .
The expected r 2 value calculated for the table is 1 S :½1{ 1 (2S{1) 2 in the absence of LD. This contrasts with the sampling correction of 1 2S :½1z 1 (2S{1 for r 2 calculated when haplotypes can be recognised.
The overall calculation of r 2 c involves summation of values from different locus pairs. Within locus pairs, it involves summation of r 2 c values for each pair of alleles. The weighting for the former is taken from [9], while a weighting proportional to gene frequencies is proposed for the latter. The results when this formula are applied to data from Queensland fruit fly give low N e values in all samples, including ones from known large endemic populations. Null alleles are suggested as a cause for this discrepancy, and shown to be frequent in the data.
The effect of a null allele at frequency p n is shown to increase the composite r 2 c value by the fraction 1=(1{p n ) 2 . Although this effect seems small, it will nevertheless overwhelm the calculations for large population sizes.
The r 2 c value can be corrected for null alleles using a comparison between the calculatedr r 2 c value and an equivalentr r 2 c value calculated when genotypes in the sample are permuted at random. This correction is verified by simulation.
The single-locus disequilibrium factor suggested by Weir [8], equivalent to a homozygosity correction, is introduced into the calculation. This alters the value of r 2 c to r 2 D . Use of r 2 D is shown to bias the N e values due to the difficulty of calculating the singlelocus disequilibrium factor using p aa {p 2 a in a finite population. Simulation shows that this bias can be rectified using the same permutation approach as for null alleles. r 2 D , and N e calculated from r 2 D , have lower variances than r 2 c , and N e calculated from r 2 c . Simulation shows that the r 2 D values are almost unaffected by null alleles, in sharp contrast to the r 2 c values. The estimates of N e from both East coast and NorthWest populations are, as expected, mostly low for outbreak populations and high for endemic populations.
The calculations are based on loci known to be unlinked, but are not substantially changed when all locus pairs are considered. Linkage information is usually not available for non-laboratory organisms, and this result shows that lack of such information may not be critical in calculating N e based on LD.
Although the LDNe program [9] is empirically based, it uses the single-locus disequilibrium factor, and appears to work well both with and without null alleles. Figure S1 The effect on the estimate of r2 from ?2 weighting compared to allele frequency weighting when introducing a single new mutant.