
HL and WS conceived the study and wrote the paper. HL designed and performed the analysis and simulations.

The authors have declared that no competing interests exist.

An important goal of population genetics is to determine the forces that have shaped the pattern of genetic variation in natural populations. We developed a maximum likelihood method that allows us to infer demographic changes and detect recent positive selection (selective sweeps) in populations of varying size from DNA polymorphism data. Applying this approach to single nucleotide polymorphism data at more than 250 noncoding loci on the X chromosome of

The authors provide evidence for the recent action of positive selection in the fruit fly

A long-standing interest in evolutionary biology has been to estimate the rate of adaptive substitution. Adaptive events can be inferred from interspecific data by comparing nonsynonymous and synonymous substitution rates [

Footprints of

Reduced polymorphism due to hitchhiking will be restored after about 0.1_{e}

Demographic change affects the genome-wide polymorphism pattern in a species or population. Thus, we used the whole dataset to infer demographic processes in the two populations. For the African population, the dataset is given in terms of the mutation frequency spectrum (MFS), where the MFS is the distribution describing the relative abundance of derived mutations occurring

Following Nielsen [_{k}_{A}_{0} generations, where _{A}_{0} is the current effective population size for the X chromosome in the African population; _{k}_{ik}_{ik})_{ik}_{ik})_{ik}_{ik})_{Ak}_{k}_{Ak}_{A}_{0}ξ_{ik}_{k}

To infer the demographic change in the derived European population, we used the _{A}_{E}_{A}_{E}_{ij}_{00} ω_{nA}_{nE}

(A) The demographic histories are plotted together.

(B) The demographic histories are plotted for both populations separately.

(C) The joint MFS for an example genealogy where the sample size of European lines (indicated by E) is 3, and that of African lines (A) is 4. ω_{ij}

Finally, we assume that the out-of-Africa migration does not affect the genetic polymorphism in the African population (

Before entering the analysis, it is crucial to examine whether the mutation rate among the noncoding loci is homogeneous. We found that the level of genetic polymorphism of a locus (measured by Watterson's θ_{W}_{k}_{k}

Pearson's _{W}

Compared with the standard neutral model, a genome-wide excess of rare derived mutations in the African sample is observed [ The null hypothesis is N_{A0} = N_{A1} (the simple model), and the alternative hypothesis is N_{A0} ≥ N_{A1} (the complex model). The likelihood ratio test (LRT) (

For the European population, the expected MFSs under our estimated bottleneck scenario and a previously proposed bottleneck scenario [

In the expansion model, the estimated _{W}^{−9} per site per generation (assuming ten generations per year). Then ^{6}, and the estimated time to the expansion in the past

An assumption of the above analysis is that all single nucleotide polymorphisms evolve neutrally. However, it may well be that part of the excess of rare mutations seen in the African population is due to purifying selection. For this reason, following Fu [

To gauge the error in estimating the likelihood function, we calculated the variance of log-likelihood (given the estimated parameters) for the African population. It is 0.0087, a rather small value relative to the max log(

After combining the European and African datasets, the demographic history of the European population is inferred using the joint MFS (^{6}. The African and European populations diverged approximately 15,800 y ago (12 to 19 kya). The effective size of the founder population was only 2,200 (540 to 10,800), and the duration of the bottleneck about 340 y (20 to 1,000 y).

In the following, we discuss the results of our demographic analyses. The methods for inferring the demographic scenarios for the African and European populations go beyond our previous study [

Our estimate of the divergence time (but not of the duration of the bottleneck and the size of the European founder population) agrees with a value recently reported [

To further evaluate the estimated demographic scenarios of the two populations, we compared four summary statistics and their standard deviations calculated from the data to those expected under the demographic scenarios. The four summary statistics were chosen because they are components of three well-known neutrality tests [

The standard deviation (SD) of a summary statistic is calculated as the standard deviation among loci. The values of the solid bars are obtained from the data. VMR, varying mutation rate model; CMR, constant mutation rate model. The 95% CI of the SD of each summary statistic obtained from 10^{3} simulated datasets is also given.

The time of the out-of-Africa migration estimated in this study is probably too recent because gene exchange between Africa and Europe was neglected in our model. On the other hand, the rate of gene flow was probably low because the estimated size of the founder population is very small. Furthermore, it is interesting to note that the out-of-Africa migration happened about 44,000 y after the range expansion in Africa. This relatively long interval suggests that the out-of-Africa migration may not be directly related to the range expansion in the ancestral African population.

The inferred demographic scenario is characterized by the best choice of parameter values (given the model) to explain the genome-wide polymorphism pattern in the data. However, we found that the MFSs expected under the demographic scenario still differ from the observed MFSs (

Since we observed an excess of rare and high-frequency derived mutations in both populations (relative to the inferred demographic scenarios), positive selection may play a role [

Besides inferring candidate regions affected by positive selection, it is of great interest to estimate the rate of adaptive substitution. However, due to false positives and relatively low power to detect weak and/or old selection events (

To obtain the rate of adaptive substitution and the distribution of selection coefficients in the African population, we summarized the polymorphism data of the _{w}

Let δ be the rate of adaptive substitution, and the distribution of substitutions with selection coefficient

By dividing the outcomes for a window into neutral and selected cases, we have _{w}_{w}_{w}_{w}_{w}

An obvious advantage of our approach is that we do not make any assumption about _{1}, _{2}). We have _{1} > 0, _{1} < _{i}

Furthermore, an LRT is used to test neutrality (i.e., whether

We conducted a nonoverlapping window analysis to detect evidence for hitchhiking events (

In the African sample, we observed that in 27.8% of the windows (15 of 54), the null hypothesis is rejected. Since the LRT may not be χ^{2} distributed, neutral simulated data are used to estimate the false-positive rate caused by the method. The false-positive rate (averaged over the windows) is 15.1%, and the multinomial test suggests that not all rejections of the null hypothesis are due to false positives (

In the African sample, 13 loci have a high mutation rate but relatively low diversity (

For the European population, we compared the windows in which the null hypothesis is rejected (

We also tested the null hypothesis (δ = 0) against the alternative hypothesis (δ ≥ 0). The LRT suggests that the demographic scenario alone cannot account for the polymorphism patterns of either the African or the European population, because the hypothesis δ = 0 is rejected in both populations (critical value χ^{2}_{0.05} = 3.84).

Following the methods outlined above and more fully described in Materials and Methods, the frequency spectra of selection coefficients for both populations were obtained (

The _{1}_{2}

In the African population, the estimated rate of adaptive substitution (^{−9} per site per generation with a 95% CI of (0.0025 × 10^{−9}, 0.0167 × 10^{−9}). Here (and also for the European population, see below), the upper bound of the CI is determined by the assumption that at most one hitchhiking event occurs within a window (100 kb). Furthermore, since the CIs obtained do not include the uncertainty in the inferred demographic scenarios, they may be too narrow. We extrapolated these results to the entire euchromatic portion of the X chromosome of ^{−9} (0.013 × 10^{−9}, 0.087 × 10^{−9}) per site per generation. The size of the CI is mainly determined by the number of beneficial mutations that have been fixed during the last 60,000 y within the windows considered.

In the European population, the estimated rate of adaptive substitution is 0.0168 × 10^{−9} (0.0026 × 10^{−9}, 0.0646 × 10^{−9}) per site per generation. Thus, we estimated that about 60 beneficial mutations fixed on the X chromosome in the derived European population. If these substitutions occurred exclusively in coding regions, the rate of adaptive substitution is 0.088 × 10^{−9} (0.014 × 10^{−9}, 0.336 × 10^{−9}) per site per generation. Since the number of beneficial substitutions in the derived European population is smaller than that in the African population, it is reasonable that the CI of

Our analysis suggests that positive selection may have been important for the African population in the past 60,000 y since it expanded its size. This is not surprising as during that time period dramatic environmental changes occurred in Africa with a transition from a full glacial to an interglacial period (70 to 55 kya [

Because the European population is derived very recently, it is reasonable that we observed a smaller number of beneficial substitutions than in the African population, although the estimated rate of adaptive substitution appears to be higher in Europe. Our analysis also suggests that the genome-wide reduction of variability in the European population (relative to the African one,

It is likely that the estimated rate of adaptive substitution does not depend on the lengths and positions of the windows, since similar estimates for the African population were obtained when longer windows of 120 and 150 kb were used. Furthermore, since only a few beneficial mutations with strong selection coefficients exist (

The frequent occurrence of weak (perhaps undetected) sweeps in the African population may have an impact on the estimated demographic expansion scenario because they contribute to the genome-wide skew of the MFS. A large number of undetectable sweeps was also found using other methods [

Our inference of positive selection could also be biased because of the presence of purifying selection [

There are several other reasons why the rates of adaptive substitution could be underestimated. (a) Using our approach, we cannot infer weak selection events (

Despite these caveats, our estimated rates of strong adaptive substitutions are only slightly lower than the published rate (0.092 × 10^{−9} per site per generation [

We analyzed noncoding DNA polymorphism on the X chromosome from two regional

The divergence between _{W}_{1}, and θ_{H}_{W}_{1} is the number of singletons (derived mutations that are carried by one sampled chromosome), and θ_{H}
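The summary statistics named above can be computed directly from an unfolded MFS. The following is a minimal sketch, not the authors' code: it assumes the standard definitions of Watterson's θ_{W} (number of segregating sites divided by the harmonic sum a_n) and of Fay and Wu's θ_{H}, and the function name and the list layout of `xi` are hypothetical.

```python
def summary_statistics(xi):
    """Return (theta_W, singletons, theta_H) from an unfolded MFS.

    xi[i] is the number of sites whose derived allele is carried by i of the
    n sampled chromosomes (assumed layout); xi[0] and xi[n] are ignored.
    """
    n = len(xi) - 1
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    # Number of segregating sites: all classes 1..n-1.
    segregating = sum(xi[1:n])
    # Harmonic sum a_n = 1 + 1/2 + ... + 1/(n-1).
    a_n = sum(1.0 / i for i in range(1, n))
    theta_w = segregating / a_n
    # Derived mutations carried by exactly one sampled chromosome.
    singletons = xi[1]
    # Fay and Wu's theta_H weights each class by i^2.
    theta_h = sum(2.0 * i * i * xi[i] for i in range(1, n)) / (n * (n - 1))
    return theta_w, singletons, theta_h
```

For a locus, the three values would be computed per sample and then compared across loci, for instance through their standard deviation among loci.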

We assume that the demographic history of the Zimbabwe population is characterized by an expansion model (_{A1}_{A0}_{A0}_{A0}_{A0}_{A}_{0}/_{A}_{1}).

For inferring the demographic history of the European population, a two-population model is used (_{E0}_{E1}_{E0}_{E1}_{E1}_{E0}_{E0}_{E1}_{E}_{0}/_{E}_{1}).

It is assumed that there is no recombination within a locus; thus each locus is treated as a point in the ancestral recombination graph (ARG). An ARG for the partly linked loci within the considered DNA region is constructed by coalescence [

The proposed methods for constructing an ARG under the Wright-Fisher model [

The coalescence under the hitchhiking model is divided into three phases: a neutral phase, a selective phase, and a second neutral phase. To include the proposed demographic models, the treatment of the selective phase follows previous work [_{0}, and _{0} is the current effective population size for the X chromosome. The neutral phases are modeled as described above.

A chromosomal region in the European sample could be affected by two hitchhiking events: a recent one that happened in the European population after the two populations split, and another, independent one that occurred in the ancestral African population before the split. To simulate this case, we divide the coalescent into five phases accordingly: a neutral phase, a recent selective phase in the European population, a second neutral phase, a selective phase in the ancestral African population, and a third neutral phase.

The LRT is a statistical test of the goodness-of-fit between two models. If the null model (the simple model) and the alternative model (the complex model) are hierarchically nested, and the null model has one parameter fewer than the alternative model, then χ^{2} = −2 ln(max L_{null} / max L_{alternative}) approximately follows a χ^{2} distribution with 1 degree of freedom.
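As an illustration of the test just described, the sketch below computes the LRT statistic from two maximized log-likelihoods and compares it with the 5% critical value of 3.84 quoted in the text. The function names are hypothetical, and the χ^{2} approximation with 1 degree of freedom is assumed to hold, which the text notes may not always be the case.

```python
import math

def lrt_statistic(max_loglik_null, max_loglik_alt):
    """-2 ln(max L_null / max L_alternative), from maximized log-likelihoods."""
    return -2.0 * (max_loglik_null - max_loglik_alt)

def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    if x <= 0.0:
        return 1.0
    # A chi-square(1) variable is the square of a standard normal,
    # so the tail probability is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

def reject_null(max_loglik_null, max_loglik_alt, critical_value=3.84):
    """True if the simple model is rejected at the 5% level."""
    return lrt_statistic(max_loglik_null, max_loglik_alt) > critical_value
```

When the χ^{2} approximation is in doubt, the critical value can instead be taken from statistics computed on simulated neutral datasets.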

In case of a multiparameter model (_{1}, _{2}, …, _{k}_{1}. Let _{1} [

Similarly, when we are interested in the CI of _{sum}

The MFS of 262 loci represents the data for inferring the demographic scenario of the African population, and the joint MFS of 272 loci the data for inferring the demographic scenario of the European population.
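To make the two data summaries concrete, the following sketch tabulates an MFS and a joint MFS from per-site counts of the derived allele. It is an illustration only: the function names, the input layout, and the index order of the joint MFS (African count first, European count second) are assumptions, and sites that are not polymorphic in the sample are skipped.

```python
def mutation_frequency_spectrum(derived_counts, n):
    """xi[i] = number of sites whose derived allele is carried by i of n lines."""
    xi = [0] * (n + 1)
    for count in derived_counts:
        # Keep only sites that are polymorphic in the sample.
        if 0 < count < n:
            xi[count] += 1
    return xi

def joint_mfs(count_pairs, n_a, n_e):
    """omega[i][j] = number of sites with i derived copies among the n_a
    African lines and j derived copies among the n_e European lines."""
    omega = [[0] * (n_e + 1) for _ in range(n_a + 1)]
    for i, j in count_pairs:
        # Skip sites that are monomorphic in the combined sample.
        if 0 < i + j < n_a + n_e:
            omega[i][j] += 1
    return omega
```

With four African and three European lines, as in the example genealogy, `joint_mfs(pairs, 4, 3)` returns a 5 × 4 table whose corner cells `omega[0][0]` and `omega[4][3]` stay empty.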

When analyzing the African sample, the genealogies are simulated conditional on the joint parameters of the demographic expansion model. The method can be easily modified when the outgroup sequence is not available. In such a case, we cannot distinguish mutations carried by

The likelihood of the joint MFS is used to infer the demographic scenario of the European population given the demographic scenario of the African population. The likelihood for the _{Ek}_{ijk}_{ijk})_{ijk}_{ijk})_{Ek}_{Ek}_{E}_{0}μ_{k}_{E}_{0} generations. Since loci are independent given the expected branch lengths, the likelihood for all loci is

In this study, the expected branch length is obtained through Monte-Carlo simulations. Obviously, the accuracy of the estimation can be improved using large numbers of simulations ^{6} simulations for inferring the demographic scenarios of the African and European populations.

The likelihood-based CI of a parameter is obtained by the method described above. We evaluated the validity of the likelihood-based CI using simulated data. By comparing the likelihood-based CI with the CI estimated from simulated data, we found that the difference between the two CIs is reasonably small (results not shown).
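A likelihood-based CI of this kind can be read off a grid of profile log-likelihood values. The sketch below assumes the usual construction, in which the 95% CI contains every grid value whose profile log-likelihood lies within 3.84 / 2 units of the maximum; the function name and the grid representation are hypothetical, and the details of the procedure used in this study may differ.

```python
def likelihood_ci(grid, profile_loglik, critical_value=3.84):
    """Return (lower, upper) bounds of the likelihood-based CI on a grid.

    grid[k] is a candidate value of the parameter of interest and
    profile_loglik[k] is the log-likelihood maximized over all other
    parameters at that value.
    """
    if len(grid) != len(profile_loglik) or not grid:
        raise ValueError("grid and profile_loglik must be non-empty and aligned")
    best = max(profile_loglik)
    # Keep grid points that the LRT does not reject against the maximum.
    inside = [g for g, ll in zip(grid, profile_loglik)
              if 2.0 * (best - ll) <= critical_value]
    return min(inside), max(inside)
```

The resolution of the interval is limited by the grid spacing, so a finer grid near the bounds gives tighter limits.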

The loci within the same window are partly linked, and the recombination rate between loci is given by [

We denote _{X}_{i}

Following Felsenstein and his colleagues [^{c}_{k}

Obtaining the likelihood requires a summation over a huge number of topologies, each of which has an infinite number of possible branch lengths. Therefore, rather than summing over all genealogies, we consider a large random sample of ^{c}

Simulate genealogies (topology without mutation) for

Compute the value of _{G} as
_{i}

Repeat steps 1 and 2 ^{5}.
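The steps above amount to a Monte Carlo average over simulated genealogies. The following sketch shows one way such an average can be computed stably in log space. It is a generic illustration under stated assumptions: `simulate_genealogy` and `log_prob_data_given` are hypothetical callables standing in for the coalescent simulator and for the probability of the data given one genealogy, neither of which is specified here.

```python
import math

def log_mean_exp(log_values):
    """log of the arithmetic mean of exp(log_values), computed without underflow."""
    m = max(log_values)
    if m == -math.inf:
        return -math.inf
    # Factor out the largest term before exponentiating.
    total = sum(math.exp(v - m) for v in log_values)
    return m + math.log(total / len(log_values))

def monte_carlo_log_likelihood(simulate_genealogy, log_prob_data_given, k):
    """Approximate log L by averaging P(data | genealogy) over k simulated genealogies."""
    logs = [log_prob_data_given(simulate_genealogy()) for _ in range(k)]
    return log_mean_exp(logs)
```

Working with logarithms matters here because the probability of the data given a single genealogy can be far smaller than the smallest positive floating-point number.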

In this study, we used six hitchhiking models that differ in ^{3} simulated neutral data sets (conditioned on the recombination rates among loci and the demographic scenario). It is the probability that one or more hitchhiking models are accepted by the LRTs (based on simulated neutral data).

The multinomial test is used to test whether all candidates of selective sweeps are due to false positives. This is given by the probability that the number of false positives in the windows is larger than or equal to the number of candidates.
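The probability described above can be sketched as the upper tail of the number of false positives across windows. The example below assumes that windows are independent and that each window has its own false-positive rate, so the count of false positives follows a Poisson-binomial distribution. The function name is hypothetical, and the exact form of the test used in this study may differ.

```python
def prob_at_least(false_positive_rates, n_candidates):
    """P(number of false positives >= n_candidates) for independent windows."""
    # dist[k] = probability of exactly k false positives among the windows seen so far.
    dist = [1.0]
    for p in false_positive_rates:
        new_dist = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new_dist[k] += q * (1.0 - p)   # this window is not a false positive
            new_dist[k + 1] += q * p       # this window is a false positive
        dist = new_dist
    return sum(dist[n_candidates:])

# Illustrative call with the African figures quoted in the text: 54 windows,
# 15 rejections, and the averaged false-positive rate of 15.1% applied to
# every window (a simplification; per-window rates would be used in practice).
p_value = prob_at_least([0.151] * 54, 15)
```

A small value indicates that false positives alone are unlikely to explain all candidate windows.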

Since simulations suggest that the variance of the summary statistics θ_{W}_{1}, and θ_{H}

Let _{w}_{1},…,_{6}] comprise the summary statistics of the LRTs (at the 5% significance level) for the _{i}_{i}

Thus,

We assume that the windows are independent of one another. Then _{w}_{w}_{w}

Step 1. The polymorphism data of loci within a window are simulated conditional on the demographic scenario, the recombination rates among loci [_{A0}_{A0}_{e}_{1,sim},…,_{6,sim}].

Step 2. We introduce the indicator variable

Step 3. Repeat steps 1 and 2

Therefore,

Then we partition the value of _{0.05,0.1}, δ_{0.1,0.3}, δ_{0.3,0.5}, δ_{0.5,0.7}, δ_{0.7,1}, and δ_{1,20}), and δ is given by their summation. We treat the cases (^{−9} per site per generation (see above), respectively. The spacing of the grid of parameter values is 0.0003 × 10^{−9} per site per generation.

The genetic polymorphism in the European sample could be affected by sweeps that occurred in the ancestral African population before the split. Thus, we need to consider the effect of “old” sweeps when estimating δ_{E}_{E})

For hitchhiking events that occurred in the derived European population, τ is uniformly distributed within [0, _{E0}_{E}_{E}),

Given a sweep originated in the African population, the probability that the sweep occurred before the split is η = (_{A}_{0} − _{E}_{0} − _{E}_{1})/_{A}_{0}. Then, the probability _{E}, s_{A}_{w}_{E}, s_{A}

The related _{E}, s_{A}

The length of each window is 100 kb, and the power is obtained by averaging over the windows.

Maximum likelihood estimates:

The sequences used in this study were obtained from the EMBL Nucleotide Sequence Database (

We thank D. De Lorenzo, S. Glinka, and L. Ometto for providing their alignments, J. Hermisson, P. Pfaffelhuber, M. Przeworski, and three reviewers for insightful comments on a previous version of the manuscript, and S. Hutter for help with extracting information on coding sequences.

ARG, ancestral recombination graph

CI, confidence interval

kya, 1,000 years ago

LRT, likelihood ratio test

MFS, mutation frequency spectrum