Coalescent Tree Imbalance and a Simple Test for Selective Sweeps Based on Microsatellite Variation

Selective sweeps are at the core of adaptive evolution. We study how the shape of coalescent trees is affected by recent selective sweeps. To do so we define a coarse-grained measure of tree topology. This measure has appealing analytical properties, its distribution is derived from a uniform, and it is easy to estimate from experimental data. We show how it can be cast into a test for recent selective sweeps using microsatellite markers and present an application to an experimental data set from Plasmodium falciparum.


Introduction
The coalescent process is an established tool to describe the evolutionary history of a sample of genes drawn from a natural population [1][2][3].For a neutrally evolving population of constant size N the coalescent has well understood analytical properties concerning tree shape and mutation frequency spectrum which provide a firm basis for a variety of statistical tests of the neutral evolution hypothesis [4][5][6][7][8].Adding recombination as an evolutionary mechanism, the coalescent is usually studied in the framework of the ancestral recombination graph (ARG) [9].The combined action of selection and recombination has been analyzed first in detail by Hudson and Kaplan [10] and, in terms of genetic hitchhiking, by Kaplan et al. [11].More recently, it was shown that the (non-Markovian) ARG can well be approximated by a simpler, more tractable model, the so-called Sequential Markov Coalescent [12][13][14], which is of particular interest for the efficient simulation of genealogies across large genomic regions.How single recombination events reflect on tree shape under neutrality has recently been analyzed by Ferretti et al. [15].Here, we concentrate on tree shape in the vicinity of a selected locus.
Selection changes the rate by which coalescent events occur and hence can lead to distortions of tree shape.It is well known [6,16] that selective sweeps can produce highly unbalanced trees when selection acts in concert with limited recombination, i.e. at some chromosomal distance from the site under selection.Conversely, observing unbalanced trees should provide information about recent selection in a particular genomic region.In fact, this property is also the basis of Li's MDFM test [16].A practical concern is how such distorted gene genealogies may reliably be estimated or re-constructed using polymorphism data.When working with SNPs a large genomic fragment with many polymorphic sites has to be analyzed to obtain a clear phylogenetic signal.Since for many organisms recombination and mutation rates are on the same order of magnitude [17,Table 4.1], one harvests about as many recombination as polymorphic sites when sampling genomic sequences, thus complicating tree shape estimation.To alleviate this problem one may turn to multi-allelic markers, such as microsatellites, complementing or replacing biallelic SNPs.
In this paper we introduce the statistic V of tree balance and, first, derive theoretical properties of this and derived statistics.Second, we show how a selective sweep affects these statistics.Third, we investigate the possibility and reliability of estimating V from experimental data.Fourth, we define an easily applicable microsatellite based test statistic for selective sweeps.It requires clustering of microsatellite alleles into two disjoint sets and examining whether these sets are sufficiently different in size and/or whether they have a sufficiently large distance from each other.Finally, we demonstrate a practical application.

Terminology
Consider the coalescent tree for a sample of size n.It is a binary tree without left-right orientation, with ordered internal nodes and branch lengths representing a measure of time.All leaves are aligned on the bottom line, representing the present.We use the term tree topology when talking about the branching pattern and tree shape when talking about topology and branch lengths.We remark that topology and shape can be conceptually distinguished, but in practice estimating topology relies on polymorphism patterns.Since these depend on branch lengths, i.e. on shape, topology can usually not be estimated independently.We call the size of a tree the number of leaves and the length of a tree the combined length of all branches.The height is the time interval between present and root, indicated by t 0 in Figure 1.Let the label of the root be n 0 .The n leaves can be grouped into two disjoint sets, L 0 and R 0 , the 'left-' and 'right-descendants' of the root.Let L 0 be the smaller of the two sets and DL 0 D~V 0 .Hence, DR 0 D~n{V 0 §n=2.Let n 1 be the 'right' child of n 0 , i.e. the root of the subtree with leaf set R 0 .The descendants of n 1 can again be grouped into two disjoint subsets, L 1 and R 1 , the left-and right-descendants of n 1 .Again, without loss of generality, let DL 1 DƒDR 1 D and denote DL 1 D~V 1 .
Hence, DR 1 D~n{V 0 {V 1 .Proceed in this way to define subsets L 2 , R 2 , and so on.For any tree there are h such pairs (L i ,R i ) where log 2 (n)ƒhƒn{1, with h depending on the topology of the tree.The set R 0 , . . .,R h constitutes a -not necessarily uniquetop-down sequence of maximal subtrees.

Tree topology of the neutral coalescent
Consider a coalescent tree of size n under the neutral model with constant population size, where n is assumed to be large.Root imbalance is measured by the random variable V 0 .The distribution of V 0 is 'almost'-uniform [18,19] on 1,2, . . .,tn=2s f g .More precisely, where d.,.denotes here the Kronecker symbol.The expectation is The variance is and the standard deviation

Author Summary
It is one of the major interests in population genetics to contrast the properties and consequences of neutral and non-neutral modes of evolution.As is well-known, positive Darwinian selection and genetic hitchhiking drastically change the profile of genetic diversity compared to neutral expectations.The present-day observable genetic diversity in a sample of DNA sequences depends on events in their evolutionary history, and in particular on the shape of the underlying genealogical tree.In this paper we study how the shape of coalescent trees is affected by the presence of positively selected mutations.We define a measure of tree topology and study its properties under scenarios of neutrality and positive selection.We show that this measure can reliably be estimated from experimental data, and define an easy-to-compute statistical test of the neutral evolution hypothesis.We apply this test to data from a population of the malaria parasite Plasmodium falciparum and confirm the signature of recent positive selection in the vicinity of a drug resistance locus.s(V 0 )&n=(2 ), provided n is sufficiently large.The compound random variables V i , iw0, have support which depends on V j , jvi.More precisely, the distribution of V i , given V j , jvi, is almost-uniform on 1,2, . . .,tn i =2s f g with where n i ~n{v 0 {:::{v i{1 (iw0) is a random variable which is bounded below by n=2 i and above by n{i.The moments are somewhat more complicated.For instance, Continuing this way, evaluating sums iteratively and using the above approximation, one derives Similary, one can obtain the second moments and combine these to Define now the normalized random variables To calculate the moments of V Ã i , iw0, we replace n i by E(n i ).Simulations suggest that this is acceptable, as long as n i is not too small.Figure 2 shows this fact for iƒ3.Here we focus on V Ã i for iƒkvh, where k is small and n is large (k~2, n §100, say).Since, ) i , we obtain Similarly, and It is very convenient to work with the normalized random variables V Ã i instead of V i .Their support is bounded by 0 and 1 for all i and they are well approximated by independent continuous uniforms on the unit interval.This considerably facilitates the handling of sums and products of V Ã i .For instance, the joint distribution F (kz1) of V Ã 0 ,V Ã 1 ,:::,V Ã k is then approximated by the continuous uniform product with distribution function F (kz1) (k,u 0 ,:::,u k )&F (k,u~u 0 : ::: : expectation and variance As is well known, the normalized sum of continuous uniforms converges in distribution to a normal random variable rather quickly.In fact, we have for the standardized sum In practice, already k~2 yields a distribution which is reasonably close to a normal (see Suppl. Figure S1).Linked trees.Consider now a sample of recombining chromosomes.Coalescent trees along a recombining chromosome are not independent.In particular, tree height and tree topology of closely linked trees are highly correlated.However, under conditions of the standard neutral model, correlation breaks down on short distances (Figure 3) [15].Roughly 10 recombination events in the sample history reduce correlation by about 50%.Under neutrality and when N is constant, a sample of size n has experienced on average 4Nra n{1 recombination events [20] (Suppl.Figure S2), where a n is the n-th harmonic number and represents the length of the tree.Assuming a recombination rate of 1 cM/Mb, population size N~10 4 and sample size n~200, this amounts to roughly 10 recombination events per 4 kb.If N~10 5 , in an interval of only about 400 bp correlation is reduced to 50% (Figure 3).Thus, if correlation half-life is determined by roughly 10 events in the sample, we estimate the correlation length L half as where c is the recombination rate per bp per unit time.Hence, trees may be regarded as essentially uncorrelated when considering physical distances of some 10 kb and sufficiently large populations and samples.Eq (9) may be violated if population size N is not constant.As a biologically relevant example we consider a population bottleneck, during which the population is reduced to size N b .A bottleneck is characterized by three parameters, time of onset, duration (both in units of 4N) and depth (d~N b =N).A bottleneck induces time dependent changes of the coalescent rate [21] and a reduction of effective population size.Particularly drastic effects on the genealogy are observed when the duration is similar to or larger than the depth [22].Given biologically reasonable parameters, this inflation may even be larger under a bottleneck than under a selective sweep (Figure 3).

Tree topology in the vicinity of a selective sweep
A positively selected allele sweeping through a population leads to a drastic reduction of tree height due to its short fixation time t Ã (see Figure 1C).The fixation time depends on the selection coefficient s and population size N.In units of 4N, t Ã &(1=a)log(a), where a~2Ns [23].This is much smaller than the neutral average fixation time t Ã &1.The reduced fixation time leads to a severe reduction of genetic variability.Furthermore, external branches of the tree are elongated relative to internal branches, yielding a star-like phylogeny of an approximate length of nt Ã .Replacing the neutral tree length a n{1 in eq (9) by this For the parameters used in Figure 3, we have L half &3300 bp, which agrees well with the simulation result.
In contrast to tree height and length, tree topology at the selected site does not necessarily differ from a neutral tree; only when moving away from the sweep site, and with recombination, topology may drastically change.In fact, given a shallow tree, recombination leads with high probability to an increase of tree height and to unbalanced trees [15].Thus, recombination events next to the selected site tend to increase tree height (see sketch in Figure 1B and C) and to create a bias in favour of unbalanced trees, i.e. trees with small V 0 (Figure 4A).The expected proximal distance x p from the selected site of such a recombination event can be estimated as x p &1=r u , ð11Þ where r u ~c n t Ã =2, c is the per site recombination rate, and n t Ã is the length of a star-like phylogeny; the factor 1=2 accounts for the fact that it is more likely to recombine with an ancestral chromosome (thereby increasing tree height) as long as these are more abundant than the derived chromosomes carrying the selected allele.Roughly, this is the case during the first half of the fixation time t Ã .Assuming instead of the star phylogeny a random tree topology of average length a n{1 t Ã at the selected site, one obtains the larger (call it distal) estimate where r l ~c a n{1 t Ã =2.
Unbalanced trees tend to have strongly elongated root branches and harbor an over-abundance of high frequency derived SNP alleles [6,16].With microsatellites it is usually not possible to determine the ancestral and derived states of an allele, because they mutate at a high rate and possibly undergo back-mutation.However, under the symmetric single step mutation model, the expected distance between a pair of alleles (in terms of motif copy numbers) behaves as the distance in a one-dimensional symmetric random walk and therefore increases at a rate proportional to the square root of the scaled mutation rate h (see Methods).Thus, alleles which are separated by long root branches tend to form two distinct allele clusters.

Estimating V
Tree topology is ususally not directly observable and has to be estimated from data.We focus on estimating V i , iƒkvh, from microsatellite data.Given a sample of n microsatellite alleles with tandem repeat counts DA j D, 1ƒjƒn, we use UPGMA [24] to construct a hierarchical cluster diagram.If subtree topology within a particular cluster node should not be uniquely re-solvable, for instance if alleles are identical, we randomly assign the alleles of the subtree under consideration to two clusters with equal probability.This gives preference to clusters of balanced size in case of insufficient resolution.We then use the inferred tree topology b V V i to estimate V i of the true tree.This procedure is conservative for the test statistics described below, since it gives preference to large values b v v when the true value v is small (Figure 4, column A).For a cluster pair C 1 , C 2 , define the distance as We find that UPGMA clustering gives good estimates of V 0 when clusters are clearly separated from each other, i.e. when distw1.
Let I be the indicator variable for this event.Then, we have for the median (Figure 4, column B).Without requiring distw1 the estimate b V V 0 is more biased.In part, this is due to the conservative UPGMA strategy mentioned above.However, estimation of V 0 is very accurate when root branches are strongly elongated, i.e. under conditions of selective sweeps or certain bottlenecks (Figure 4, bottom).

Application: Testing the neutral evolution hypothesis
We now turn to an application of the above results and explain how a new class of microsatellite based tests of the neutral evolution hypothesis can be defined.
Consider a sample of n alleles at a microsatellite marker and record their motif repeat numbers.Applying UPGMA clustering to the alleles, we obtain estimates b V V i , iƒk as described above.
These are transformed to c Then, we determine the following test statistics Thus, the test variable T (sum) k in eq (14) is the estimate of S k given in eq (8).Similarly, T (product) k and T (dist) 0 are the estimates of the product P k i~0 V Ã i and of V Ã 0 \I.We now test the null hypothesis T (:) wq for a critical value q~q(a).For a given level a we obtain the critical value q for T (sum) from the standard normal distribution and for T (product)  from the uniform product distribution in eq (7) (Table 1).For T (dist) we use the critical value of the normalized version of eq (1).Generally, these critical values are conservative, since V Ã i tends to over-estimate V i , when small (Figure 4).In particular, statistic T (dist) is very conservative due to the additional condition on the distance.The true critical values for level a would be larger than those shown in Table 1.
False positive rates and power.First, we analyzed the false positive rates under the standard neutral scenario (i.e., constant N) for different mutation rates h and varying sample sizes n.As reference parameter settings for simulations with msmicro (see Methods) we use sample size n~200, microsatellite mutation rate h~40 and recombination rate r~400.The latter corresponds to a recombination rate of 10 {8 per bp per chromosome, when one assumes a population size of N~10 5 and a size of the investigated genomic region of 10 5 bp (r~4N : 10 {8: 10 5 ).We placed 15 microsatellite markers at positions 1, 10, 30, 60, 70, 80, 85, 87, 88, 89, 90, 91, 92, 95, 100 kb.As expected, we find that the false positive rates remain below their theoretical expectation for all parameter choices h and n (Figure 5 top; Tables 2 and 3).For the simulations with selection we assumed that a site at position 89 kb was undergoing a selective sweep with selection coefficient s~0:005 or s~0:01.The time since completion of the sweep Table 1.Critical values for the tests considered in eqs ( 14)-( 16).was an adjustable parameter t, with the reference setting t~10 {4 .We simulated hard selective sweeps, i.e. the selected allele is introduced as a single copy and fixed with probability about 2s.
The test statistic T (sum) 2 is shown in Figure 5 and power profiles for all three tests in Figure 6.We find that maximum power of the tests is attained within the interval given by eqs (11) and ( 12) (Figure 6 and Tables 4 and S1).Depending on the strength of selection, maximum power is close to the upper interval bound at x d (s~0:005, Table S1), or removed from x d towards the interior of the interval (s~0:01, Table 4).This is in agreement with the expectation that only very strong selective sweeps generate a starlike phylogeny, which lead to the proximal estimate x p in eq (11).Thus, the location of the power maximum depends on the strength of selection and the details of the tree topology at the selected site.Maximum power for the compound tests T (sum) 2 and T (product) 2 is more removed from the selected site than for the simple test T (dist) 0 .The latter measures imbalance only at the root node n 0 and is most sensitive to single recombination events between marker and selected site, while multiple events blur the effect.The power of all tests is sensitive to the mutation rate and to sample size (Tables S2  and S3).For the parameters tested, the power of the simple T (dist) 0 increases when h or n increase.For T (sum) 2 , maximum power is reached for h&20.Very small, as well as very high, mutation rates produce little power.Realistic mutation rates in insects and vertebrates are between h~5 and 50 [25][26][27], thus within the powerful domain.Importantly, power can be increased by increasing sample size: all of the above tests become more powerful for large samples (Tables S3, S4 and S5).Since the tests consistently underscore the theoretical false positive rate, relaxed singnificance levels (for instance a~0:05) can be applied.At level a~0:05 test T (sum) 2 has power of more than 80% to detect recent selective sweeps (Figure 6 and Table 4).For intermediate mutation rates power of test T (sum) 2 is somewhat higher than of T (product) 2 (Table S2).Generally, power profiles of T (sum) 2 and T (product) 2 follow qualitatively the same pattern.In contrast, power of test T (dist) 0 may be quite different.Interestingly, T (dist) 0 performs better than T (sum) 2 or T (product) 2 when selection is only moderately strong.Unsurprisingly, power of all tests depends heavily on the strength of selection.Also, the time since completion of the selective sweep influences power.Reasonable power can be reached if tv10 {3 in coalescent units.
We also examined how much the tests are confounded by deviations from the standard neutral model.First, we determined the false positive rates under a population bottleneck.From other studies it is known that bottlenecks with a severity (duration divided by depth) around 1 are particularly problematic [16,28].We find that tests T (sum) 2 and T (product) 2 can produce substantially more false positives than expected, in particular if bottlenecks are recent (Table S6).Interestingly, test T (dist) 0 is very robust against these disturbances and the false positive rate remains clearly under the theoretical value for all onset parameters tested (Table S6).
We note that the false positive rates of T (sum) 2 and T (product) 2 depend strongly on the bottleneck duration even when the severity is kept fixed (Table S7).Very short (duration 0:001), but heavy reductions of N are more disturbing for T (sum) 2 and T (product) 2 than long, but shallow bottlenecks (duration 0:1).In contrast, T (dist) 0 is fairly insensitive to changes of bottleneck duration (Table S7).
Under a model of fast population expansion (expansion rate 10), all tests remain below, or close to, their theoretical false positive rate.Again, test T (dist) 0 is unsensitive to population expansion and varying onset times (Table S8).
We expected that our topology based tests would yield many false positives under a model of population subdivision.As a potentially critical case we examined sampling from a population divided into two sub-populations which split 2N generations ago and which exchange migrants at rate m.We analyzed both varying migration rates and varying sampling schemes (Tables S9  to S12).The false positive rate for tests T (sum) 2 and T (product) 2 remains clearly under its theoretical expectation, even if sampling is heavily biased (sample size of sub-population 1 was n 1 ~195 and of sub-population 2 was n 2 ~5; Table S9).In contrast, test T (dist) 0 , which only measures tree imbalance at the root node, is more vulnerable to biased sampling from a sub-divided population.The false-positive rate grows up to 17% if n 1 ~195 and n 2 ~5.In general, we find test T (dist) 0 to be less vulnerable to population bottlenecks, but tests T (sum) 2 and T (product) 2 to be more robust under population substructure.
Finally, we examined how deviation from the single step mutation model would influence our tests.We modified the mutation model and allowed occasional jumps (probability p) of larger steps.We tested jumps of step size 2 (Table S13) and 7 (Table S14).All tests, eminently the compound tests T (sum) 2 and T (product)

2
, remain clearly below their theoretical false positive rate.

Case study
Emergence of drug resistance in malaria parasites is among the best documented examples for recent selective sweeps.We reanalyzed 16 microsatellite markers surrounding a well studied drug resistance locus of malaria parasites [29] (Figure 7).The signature of recent positive selection is consistently detected by all tests on two markers somewhat downstream of the drug resistance locus pfmdr1 (marker l-35 and l-37 in the notation of [29]; Table 5).Highest significance is reported by test T (product) 2 (p-value close to 0:001).T (dist) 0 reports a p-value of 0:006 and T (sum) 2 reports p-values slightly above 0:010.In addition, T (product) 2 reports locus l-29 (located upstream of pfmdr1) to be significant at p~0:025.This locus is also detected by T (dist) 0 (p~0:038).Other four loci are reported only by T (dist) 0 (l-30 (p~0:006), l-31 (p~0:025), l-32 (p~0:006), l-30 (p~) and l-40 (p~0:031)).Discrepancies in the test results are due to their different sensitivities to various parameters.The simple and compound tests have different power profiles with power peaks at different positions from the selected site (Figure 6).Plasmodium in South-East Asia is most likely expanding and sub-structured; however, there is only limited knowledge about the details.
As shown above, T (dist) 0 is quite sensitive to biased sampling from different sub-populations.Some of the significant results of T (dist) 0 may be inflated due to sub-structure.There is also some disagreement between tests T (sum) 2 and T (product) 2 regarding significance, although both test imbalance at tree nodes n 0 , n 1 and n 2 .In fact, the cases reported by the two tests may still differ in their details.Comparing the three components v with respect to their maximum and minimum, we find that the cases reported as significant by , the maximum is close to 1:0 while the minimum tends to be less than 0:04 (Figure S4).Thus, test 2 have to be small to yield a significant result.T (product) 2 is more permissive and accepts that one of the three components may be large.
All tests agree on significance of two markers close to a site which was previously shown to have experienced a selective sweep.They also agree all on strongly increased p-values in the immediate vicinity of the selected site (l-33, l-34).Together, these results confirm the accuracy and practical utility of our tests.

Discussion
The binary coalescent has a number of well-studied combinatoric and analytic properties [1,30,31].Here we only concentrate on tree topology and use a classic result of Tajima [19] to define a simple measure, V i , of tree balance.It is the minimum of the left and right subtree sizes under internal node n i .Its normalized version is approximately uniform on the unit interval and the summation over internal nodes n i , i~1,::k, is close to normal.Another summary statistic of tree balance is Colless' index C [32].It also depends on the sizes of left-and right subtrees of the internal nodes, but its distribution is more complicated.C has received attention in the biological literature before [33] and, more recently, in theoretical studies, for instance by Blum&Janson [34].A problem with Colless' index is that it is difficult to estimate if the true tree structure is unknown.But, limiting attention to the tree structure close to the root, we show that the balance measure V can be estimated, for instance, from microsatellite allele data by a clustering method.We found that a version of UPGMA clustering gives most reliable results.
Coalescent trees for linked loci are not independent.However, correlation dissipates with recombinational distance.In fact, under neutral conditions only about ten recombination events are sufficient to reduce correlation in tree topology by 50%.Thus, estimating tree imbalance at multiple microsatellites can be performed independently for each marker, if they are sufficienty distant from each other.Conversely, with a very small number of recombination events, V is not drastically altered on average [15].Thus, when working with SNPs, one may afford to consider haplotype blocks containing a few more recombination events than segragting sites and still be able to reconstruct a reliable gene genealogy.This possibility will be explored in more detail elsewhere.Microsatellites have been used before as markers for selective sweeps.Schlo ¨tterer et al. [35] have proposed the lnRH statistic to detect traces of selection and Wiehe et al. [28] have shown that a multi-locus vesion of lnRH for linked markers can yield high power while keeping false positive rates low.However, a severe practical problem with the lnRH statistic is that it requires data from two populations, and for each of them two additional and independent sets of neutral markers for standardization.There are a few methods to detect deviations from the standard neutral model based on single microsatellite locus data from one Table 4. Power of T (sum) , T (product) and T (dist) in dependence of distance to selected site.population.For instance, the test by Cornuet and Luikart [36], which compares observed and expected heterozygosity, is designed to detect population bottlenecks.A test by Schlo ¨tterer et al. [37] uses the number of alleles at a microsatellite locus and determines whether an excess of the number of alleles is due to positive selection (SKD test).However, as the authors pointed out, the test depends critically on a reliable locus-specific estimate of the scaled mutation rate.We have compared SKD and the test proposed here with respect to power and false positive rates.While the SKD-test is generally more powerful, especially at larger distances from the selected site (Table 4 and Suppl.Tables S1, S5), it has higher false positive rates than the tests proposed here, in particular when compared to T (dist) 0 (Suppl.Table S6), and for non-standard mutation models (Suppl.Tables S13, S14).Note also that under population sub-structure SKD yields up to 100 times more false positives than our tests (Suppl.Tables S9 to S12).
It should be emphasized that it is the topology of the underlying genealogical tree, not the genetic variation, which constitutes the basis for the test statistics proposed here.The two steps, estimating topology, and performing the test are two distinct tasks.The quality of the tests hinges on the quality of the re-constructed genealogy.With a perfectly re-constructed genealogy the false positive rates are completely independent from any evolutionary mechanisms which do not affect the average topology, such as historic changes of population size.However, simulations show that power would still remain under 100% in this case.The robustness of topology based tests with respect to demographic changes has been shown before by Li [16] for a similar test which uses SNP data to reconstruct V 0 .But Li's test can only be performed if an additional non-topological criterion is satisfied and thus can only test a subset of trees with V 0 .The tests T (sum) and T (product) defined here rely only on topological properties of the  ) and on the product uniform (for T (product)

2
) distributions.Values for T (dist) 0 are given as raw data (c v 0 v 0 , n, d).The p-value is 2c v 0 v 0 =n.5% (single star) and 1% (double star) significance are indicated.Marker positions are taken from [29].The region analyzed (about 17 kb) corresponds to about 1 cM (site under selection in bold). 1 defined in eq (13).doi:10.1371/journal.pcbi.1003060.t005genelaogy and we argue that multi-allelic markers, such as microsatellites, help estimating the true genealogy and improving test results.Although our analyses and simulations are based on the binary Kingman [1] coalescent, we expect that the new test statistics should be robust also under more general coalescent models, for instance when multiple mergers during the selective sweep phase are allowed [38].
Despite a shift to high throughput sequencing technologies in the last decade, microsatellite typing continues to be a costefficient and fast alternative to survey population variability in many experimental studies.This is in particular true for projects directed towards parasite typing, e.g. of Plasmodium, and projects with non-standard model organisms, e.g.social insects [39,40], but also for many biomedical studies.

Coalescent simulations
We simulated population samples under neutral and hitchhiking models with modified versions of the procedures described by Kim and Stephan [41] and Li and Stephan [42] and of ms [43], termed msmicro.In the modified versions we incorporated evolution of microsatellite loci under the symmetric, single step and multi-step mutation models.Microsatellite mutations are modeled as changes to the number of motif repeats, where only numbers but not particular sequence motifs are recorded.Output data comprise coalescent trees in Newick format and the state of microsatellite alleles for each of n sequences.With msmicro also multiple linked microsatellites can be modeled.Coalescent simulations were run under different evolutionary conditions: neutral with constant population size (N~10

Tree topology
Realizations v i of the 'true' random variables V i , 0ƒiƒk were extracted from the simulation results.Estimation of v v i was performed by UPGMA hierarchical clustering.If a cluster node could not be uniquely resolved then we gave preference to a bipartite partition in which the left and right subtrees were of equal or similar size.This was accomplished by randomly assigning alleles to two clusters with equal probability.To estimate v v 0 we also explored a simple clustering method which works in the following way: we first sorted alleles by size; then we divided the sorted list into two halfs.The separator was placed between those two alleles which had maximal distance (in terms of microsatellite repeat units) from each other.If this was not unique, the separator was placed between those two alleles that resulted in two sets of most similar size.While this clustering method is very effective in estimating v 0 , it is less accurate than UPGMA clustering for v i , iw0.

Distance between microsatellite alleles
The single step symmetric mutation model behaves as a onedimensional symmetric random walk of step size one.The theory of random walks (e.g.[44]) tells that the average distance between the origin of the walk and the current position scales with the square root of the number k of steps.More precisely, The variance is linear in k.Here, steps are represented by mutational events occuring at rate h.Thus, E dist ~ffiffiffiffiffiffiffiffiffiffi 2h=p p and V dist &h=e, where e is Euler's constant.The empirical distance between two clusters C 1 and C 2 can be calculated as Given that the rate of the first recombination event adjacent to a selective sweep site is r l ~an{1 ct f =2 (in case of a neutral topology) or r u ~nct f =2 (in case of a star phylogeny) the distance between the selected site and the 'first' recombination event is described by a Poisson process with rate r l x or r u x.Shown is the probability that the Poissson variable is 0 (i.e., for a ''recombination free zone'') for r l (upper curve) and r u (lower curve).(EPS) .Given a test is significant at level a~0:01, the plots show the maximum (x-axis) and the minimum (y-axis) of the three terms v Ã 1 , v Ã 2 and v Ã 3 , which enter into the sum and product in T (sum) 2 and T (product) 2 , respectively.The sum-and product-tests may yield different results, because the summands are differently constrained (here (A), the maximum v * 0:4) than the factors (here (B), the maximum may reach almost 1, but the minimum is smaller than in the sum-test).(PDF)

Supporting Information
Table S1 Power of T (sum) , T (product) and T (dist) in dependence of distance to selected site.Moderate selection strength.(PDF) Table S2 Power of T (sum) , T (product) and T (dist) in dependence of mutation rate h.

(PDF)
Table S3 Power of T (sum) , T (product) and T (dist) in dependence of sample size n.(PDF) Table S4 Power of T (sum) , T (product) and T (dist) in dependence of distance to selected site.Small sample size.(PDF) Table S5 Power of T (sum) , T (product) and T (dist) in dependence of distance to selected site.Large sample size.(PDF) Table S6 Empirical false positive rate.Bottleneck model with varying onset t of the bottleneck.Strength is fixed at 0:01N.Significance levels a are based on theoretical formulae according to eqs (7) and (8).(PDF) Table S7 Empirical false positive rate.Bottleneck model with varying duration of the bottleneck.Severity (duration divided by strength) is fixed at 1. Significance levels a are based on theoretical formulae according to eqs (7) and ( 8).(PDF) Table S8 Empirical false positive rate.Population expansion with varying onset t of the expansion.Expansion rate is fixed at 10.

(PDF)
Table S9 Empirical false positive rate.Population substructure with two sub-populations, split time t~1 in the past and sampling scheme n 1 ~195, n 2 ~5.Varying migration rate m per generation per 4N individuals.Significance levels a are based on theoretical formulae according to eqs (7) and ( 8).(PDF) Table S10 Empirical false positive rate.Population substructure with two sub-populations, split time t~1 in the past and sampling scheme n 1 ~190, n 2 ~10.Varying migration rate m per generation per 4N individuals.Significance levels a are based on theoretical formulae according to eqs (7) and (8).(PDF) Table S11 Empirical false positive rate.Population substructure with two sub-populations, split time t~1 in the past and sampling scheme n 1 ~180, n 2 ~20.Varying migration rate m per generation per 4N individuals.Significance levels a are based on theoretical formulae according to eqs (7) and (8).(PDF) Table S12 Empirical false positive rate.Population substructure with two sub-populations, split time t~1 in the past and sampling scheme n 1 ~150, n 2 ~50.Varying migration rate m per generation per 4N individuals.Significance levels a are based on theoretical formulae according to eqs (7) and (8).(PDF) Table S13 Empirical false positive rate.Mutation model with jumps of size 2. Varying probability p for a step of size 2. With probability 1{p the step size is 1.Significance levels a are based on theoretical formulae according to eqs (7) and (8).(PDF) Table S14 Empirical false positive rate.Mutation model with jumps of size 7. Varying probability p for a step of size 7.With probability 1{p the step size is 1.Significance levels a are based on theoretical formulae according to eqs ( 7) and ( 8).(PDF)

Figure 1 .
Figure 1.Coalescent trees under recombination and selection.A: Sketch of a neutral coalescent tree with tree size n~20.B and C: A selective sweep in locus C leads to a tree of low height (t 0 small).The selective sweep was initiated by a beneficial mutation at time t Ã .At some distance from C, a single lineage (circled branch in C) has ''recombined away'' leading to the unbalanced tree shown at locus B. Note that tree height between trees B and C changes drastically and that V 0 ~4 at locus C and V 0 ~1 at locus B. Multiple recombination events (indicated by the crosses at the bottom line) between loci A and B lead to essentially uncorrelated trees at A and B. doi:10.1371/journal.pcbi.1003060.g001

Figure 2 .
Figure 2. Mean and standard deviation of V and V Ã for coalescent trees of size n~200.Shown are the values for 10 4 independent realizations.x-axis: values of V (black circles) and V Ã (red squares) are determined for the subtrees originating at node n i , i~0,:::,3.The solid gray line shows the theoretical expectation according to eq (3).doi:10.1371/journal.pcbi.1003060.g002

Figure 5 .
Figure 5. Profile of S 2 and Ŝ S 2 along a recombining chromosome.Plots in column A show the distribution of S 2 ~P2 i~0 2(V Ã i {1=2), i.e. when the tree topology is known.Plots in column B show the distribution of the estimate Ŝ S 2 ~P2 i~0 2( c V Ã i V Ã i {1=2) when the tree topology is unknown, but estimated from microsatellite polymorphism data.Each boxplot corresponds to one of 15 marker loci located at the positions indicated on the x{axis.The regions spans 100 kb in total.Symmetric step-wise mutation model with h~40.Other parameters: n~200, N~10 5 and recombination rate per bp c~10 {8 (corresponding to 1 cM/Mb).First row: standard neutral model with constant N. Second row: bottleneck model with severity 1 and onset t~0:01.Third row: Selective sweep at locus x~0 with s~0:005 which was completed t~10 {4 time units ago.For comparison with the theoretical expectation, the leftmost boxplot in each panel shows the standard normal distribution (labeled 'N').doi:10.1371/journal.pcbi.1003060.g005

Figure 7 .
Figure 7. Traces of selection around a drug resistance locus in Plasmodium.Results of tests T (sum) (stars), T (product) (circles) and T (dist) (triangles) applied to a 17 kb region sorrounding the pfmdr1 locus in P.falciparum.Shown are significant results on the 5% (open symbols) and 1% (filled symbols) levels.Positions of the examined microsatellite markers are indicated by arrows.Data from [29].doi:10.1371/journal.pcbi.1003060.g007

Figure
Figure S1Agreement of S k with the standard normal.Shown are the distribution functions for the standard normal distribution (green line), and for (see eq(8))S k ~ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi 12=(kz1) p P k i~0 (V Ã i {1=2), k~2 (red line) and k~0 (blue line).The latter is uniform on {1:73, 1:73.Obviously, already for k~2 the agreement between the standard normal and S k is quite good.(EPS)FigureS2Average number of recombination events in neutral coalescent trees.(A) in dependence of sample size n (4Nr~10) and (B) of the scaled recombination rate 4Nr (n~100).Red: simulation results obtained from 1000 replicates of ms[43].Shown are average (bullets) and standard deviation (whiskers).Black: theoretical value E(n r )~4Nra n{1 .(EPS) FigureS3Distance from sweep site to first recombination site.Given that the rate of the first recombination event adjacent to a selective sweep site is r l ~an{1 ct f =2 (in case of a neutral topology) or r u ~nct f =2 (in case of a star phylogeny) the distance between the selected site and the 'first' recombination event is described by a Poisson process with rate r l x or r u x.Shown is the probability that the Poissson variable is 0 (i.e., for a ''recombination free zone'') for r l (upper curve) and r u (lower curve).(EPS)

Table 2 .
Empirical false positive rate for varying h.

Table 3 .
Empirical false positive rate for varying sample size n.

Table 5 .
Test statistics and p-values for the empirical data set of P.falciparum.