JN, APG, and MS conceived and designed the experiments. JN performed the experiments and analyzed the data. JN, APG, and MS wrote the paper.

The authors have declared that no competing interests exist.

The Δ32 mutation at the CCR5 locus is a well-studied example of natural selection acting in humans. The mutation is found principally in Europe and western Asia, with higher frequencies generally in the north. Homozygous carriers of the Δ32 mutation are resistant to HIV-1 infection because the mutation prevents functional expression of the CCR5 chemokine receptor normally used by HIV-1 to enter CD4+ T cells. HIV has emerged only recently, but population genetic data strongly suggest Δ32 has been under intense selection for much of its evolutionary history. To understand how selection and dispersal have interacted during the history of the Δ32 allele, we implemented a spatially explicit model of the spread of Δ32. The model includes the effects of sampling, which we show can give rise to local peaks in observed allele frequencies. In addition, we show that with modest gradients in selection intensity, the origin of the Δ32 allele may be relatively far from the current areas of highest allele frequency. The geographic distribution of the Δ32 allele is consistent with previous reports of a strong selective advantage (>10%) for Δ32 carriers and of dispersal over relatively long distances (>100 km/generation). When selection is assumed to be uniform across Europe and western Asia, we find support for a northern European origin and long-range dispersal consistent with the Viking-mediated dispersal of Δ32 proposed by G. Lucotte and G. Mercier. However, when we allow for gradients in selection intensity, we estimate the origin to be outside of northern Europe and selection intensities to be strongest in the northwest. Our results describe the evolutionary history of the Δ32 allele and establish a general methodology for studying the geographic distribution of selected alleles.

A spatially explicit model of the Δ32 mutation, which confers resistance to HIV-1 infection, reveals its spread across Europe and provides a general method for tracking the geographic spread of selected alleles.

The geographic spread of advantageous alleles is fundamental to evolutionary processes, including the geographic distribution of adaptive traits, the cohesiveness of species, and the spatial dynamics of coevolution between pathogens and their hosts. Various theoretical models describe the dynamics of how advantageous alleles spread within a population, but few well-studied examples exist, particularly in humans, of how advantageous alleles spread geographically.

The CCR5 Δ32 mutation is a good example of an advantageous allele with a well-characterized geographic distribution. The Δ32 mutation currently plays an important role in HIV resistance because heterozygous carriers have reduced susceptibility to infection and delayed onset of AIDS, while homozygous carriers are resistant to HIV infection [

To understand the origin and spread of Δ32, we modeled the effects of selection and dispersal on the allele. The Δ32 mutation is found only in European, West Asian, and North African populations. The allele frequency exhibits a north–south cline with frequencies ranging from 16% in northern Europe to 6% in Italy and 4% in Greece (

The sampling locations are marked by black points. The interpolation is masked in regions where data are unavailable.

Previous discussion of the geographic distribution of Δ32 has focused on the north–south cline in frequency. Lucotte and Mercier [

One alternative is that a northern origin coupled with typical levels of dispersal in Europe is adequate to explain the geographic distribution of Δ32. Under this hypothesis, rare long-distance dispersal events, such as Viking dispersal, play a minor role in the spread of the advantageous allele. Another alternative is that the allele may have arisen in central Europe and increased to a higher frequency in the north because of a geographical gradient in selection intensity [

A further question regarding the geographic spread of Δ32 is whether the historical selective agent acted only in Europe and western Asia or on a larger geographic scale. In the former case, the restriction of Δ32 to Europe and western Asia is explained by spatially varying selection, and in the latter, by insufficient time for the allele to have dispersed farther.

Here we fit a simple population genetic model to the geographic distribution of Δ32 in order to infer features of the processes of dispersal and selection that shaped the historical spread of the allele. In particular we conclude that given current estimates of the age of the Δ32 allele, the allele must have spread rapidly via long-range dispersal and intense selection to attain its current range. We find the Δ32 allele is likely restricted geographically because of limited time to disperse rather than local selection pressures. In addition, we show that the data are consistent with origins of the mutation outside of northern Europe and modest gradients in selection.

To examine the geographic distribution of Δ32, we adapted Fisher's deterministic “wave of advance” [_{0} and _{0}, respectively), and the ratio ^{2}) to the additive selection coefficient _{c}_{NS}_{EW}

To apply the model to allele frequency data sampled from different locations, we combined the spatially explicit PDE model with a binomial sampling scheme (see

(A) The underlying allele frequency surface generated by the PDE model using MLEs for the parameters. The coarseness of the surface and irregular coastlines are due to the resolution of the simulated habitat (see

(B and C) Two replicates of simulated data obtained using the same sampling locales and sample sizes as in the dataset and displayed using the same interpolation methods and contours as in

To estimate the parameters of the model, we derived a likelihood function based on binomial sampling from the deterministic allele frequency surface. Estimating parameters via maximum likelihood requires an optimization step, and here we use a simple grid search. We found in applications to simulated data that the likelihood method with a grid search is able to estimate parameters with reasonable accuracy (

The result of the maximum likelihood estimation is that values of ^{2}/^{5} and 10^{6} km^{2} have the highest likelihood. The maximum likelihood estimate (MLE) of _{NS}_{EW}^{5} km^{2} with the profile likelihood falling off nearly symmetrically for higher and lower values (_{NS}_{EW}^{6} km^{2} with a steep drop in likelihood for values less than 10^{5} and a gradual decline for values greater than 10^{6} (^{4} km^{2} or smaller result in an expected geographic distribution of Δ32 that is too restricted to fit the data, and values of ^{7} result in a distribution that is far too broad.

The grey line shows the log profile likelihood for _{NS}_{EW}^{5} with a log-likelihood of −263.0. The black line shows the profile likelihood when selection gradients are incorporated into the model (_{NS}_{EW}^{6} with a log likelihood of −247.7.

The curves are drawn for the two MLEs of ^{1}Based on estimates in [^{2}From

The time required in the model to reach a frequency of 16% for a fixed value of

We next investigated whether the data reject the hypothesis of uniform selection gradients. We used a likelihood ratio test that compares the maximum likelihood achieved when _{NS}_{EW}_{NS}_{EW}_{NS}_{EW}_{NS}_{EW}^{−5}), such that we can reject the null hypothesis of uniform selection.
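As a worked check of this likelihood ratio test, the sketch below uses the two maximized log-likelihoods reported with the profile likelihood curves (−263.0 under uniform selection and −247.7 with selection gradients). It assumes the test statistic is referred to a chi-square distribution with two degrees of freedom, one for each gradient parameter freed in the alternative model; that reference distribution and the function name are assumptions made for illustration.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt):
    """Return (statistic, p_value), assuming a chi-square reference with 2 df."""
    statistic = 2.0 * (loglik_alt - loglik_null)
    # With exactly 2 degrees of freedom, the chi-square survival
    # function has the closed form exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Maximized log-likelihoods reported for the two models.
statistic, p_value = likelihood_ratio_test(loglik_null=-263.0, loglik_alt=-247.7)
print(f"statistic = {statistic:.1f}, p-value = {p_value:.2e}")
```

Under these assumptions the statistic is 30.6 and the p-value falls well below 10^{−5}.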

In the model with selection gradients, selection was inferred to be stronger in the north and in the west, with the north–south selection gradient being steeper than the east–west selection gradient. In particular, the estimates of _{NS}_{EW}_{NS}^{−4} and _{EW}^{−4}. The profile likelihood surface for _{EW}_{NS}_{EW}_{NS}_{NS}^{−4} km^{−1} implies a selection intensity at the latitude of Oslo that is a 21% increase on the selection intensity at the latitude of Milan (e.g., _{EW}^{−4} km^{−1} generates a selection intensity in Copenhagen that is 5% greater than in Moscow (e.g.,

The plus signs indicate locations where the likelihood was evaluated. The dark contour at −250 marks the −2 log-likelihood support region for the estimates of _{NS}_{EW}

Regarding the geographic origin of Δ32, we found that if selection is constrained to be spatially uniform, the origin is localized to a region east of the Baltic (

(A) Assuming selection intensity is uniform spatially (i.e., _{NS}_{EW}

(B) Allowing for north–south and east–west spatial gradients in selection (i.e., _{NS}_{EW}

Likelihoods were calculated at each of the black points and the surface was obtained by interpolation.

The underlying allele frequency surface generated by the MLE parameters is a qualitative indicator of the goodness of fit of the model (see ^{−5}). We also found that the data are overdispersed relative to the variance expected under a binomial distribution. The overdispersion parameter ϕ (see

Finally, to better understand the history of Δ32, we considered an extension of the model that included the dispersal of the allele to Iceland. Iceland was colonized principally from Scandinavia at approximately 900 CE [

We can draw several conclusions from our analysis of the geographic distribution of the Δ32 allele. First, the results suggest that strong selection (^{2}/^{6}, a value of

Second, we conclude that if selection is spatially uniform, Δ32 arose by mutation in northeast Europe as suggested by Libert et al. [

Third, our results show that the geographically restricted distribution of Δ32 reflects insufficient time for the allele to disperse more widely, rather than a geographic restriction of the selection favoring it. Given more time and no change in the selection affecting Δ32, the allele would have spread over a wider area.

Our large estimates of dispersal are consistent with the Viking hypothesis of Lucotte and Mercier [

Our analysis makes a number of simplifying assumptions. Our model does not incorporate genetic drift. To examine the effect of ignoring drift, we simulated a stepping stone model with local deme sizes of _{e}^{−2}; [_{e}

Another assumption of our approach is that the allele under selection has an additive effect. We tested the robustness of our results to deviations from additivity by generating allele frequency surfaces in which the fitness advantage of Δ32 heterozygotes was held constant while the fitness advantage of the Δ32/Δ32 genotype was varied over a range of values. We found that varying the degree of dominance had little effect on the geographic distribution of Δ32 (results not shown). The fitness of Δ32 homozygotes matters little because homozygotes remain sufficiently rare throughout the history of Δ32 that assumptions about their fitness have only a minor effect.
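The intuition behind this robustness can be illustrated with a minimal, non-spatial sketch. It is not the spatial surface comparison described above. It iterates the standard single-locus deterministic recursion with the heterozygote advantage held fixed and the homozygote advantage varied, and counts the generations needed to reach a frequency of 16%. The starting frequency and the selection coefficients are hypothetical values chosen only for illustration.

```python
def generations_to_reach(target, s_het, s_hom, p0=1e-4, max_gen=100_000):
    """Generations for the allele to rise from p0 to target.

    Genotype fitnesses are 1 (non-carrier), 1 + s_het (heterozygote),
    and 1 + s_hom (homozygote). All values are illustrative assumptions.
    """
    p = p0
    for generation in range(max_gen):
        if p >= target:
            return generation
        q = 1.0 - p
        mean_fitness = p * p * (1 + s_hom) + 2 * p * q * (1 + s_het) + q * q
        # Marginal fitness of the allele divided by the population mean fitness.
        p = p * (p * (1 + s_hom) + q * (1 + s_het)) / mean_fitness
    raise RuntimeError("target frequency not reached")


# Heterozygote advantage fixed at 0.1; homozygote advantage varied.
times = [(s_hom, generations_to_reach(0.16, s_het=0.1, s_hom=s_hom))
         for s_hom in (0.1, 0.2, 0.4)]
for s_hom, generations in times:
    print(f"s_hom = {s_hom}: {generations} generations to reach 16%")
```

Because homozygotes make up at most about 2.6% of genotypes while the allele frequency is below 16%, the time to reach that frequency changes only modestly across these homozygote fitness values.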

Our use of diffusion equations assumes that only the mean and variance of the dispersal distribution are needed to model the effects of dispersal and that higher central moments such as kurtosis are negligible. Studies of Fisher's wave of advance in the ecology literature have shown that if kurtosis is non-negligible, as in the case of “fat-tailed” dispersal distributions (distributions whose tails are not exponentially bounded), the asymptotic behavior of the wave of advance changes so that the speed of the wave continually accelerates [

While violations of each of these simplifying assumptions (no genetic drift, additivity of the selective effect, a diffusion approximation for dispersal) are unlikely to have important effects on our estimates, the variance introduced by violations may contribute to the overdispersion observed in the data and the significant

In summary, we present an approach to analyzing the geographic distribution of a selected allele. The approach allows us to estimate the ratio of dispersal to selection as well as fit gradients in selection to the observed allele frequency data. Our analysis confirms Δ32 has been under strong selection, and furthermore shows that long-range dispersal and selection gradients have been important processes in determining the spread of this advantageous allele. The results provide an insight into the history of Δ32 and into the processes that affect the geographic spread of advantageous alleles in humans.

We focused our analysis on the region extending from 22°N to 75°N and 10°W to 154°E. Topographic data were obtained from the ETOPO5 data assembled by the National Geophysical Data Center. The exact dataset used was a version with 1° latitude/longitude resolution that is provided as a standard dataset in MATLAB 7. The coastline data were extracted by taking all values above sea level to be land. A land bridge between Denmark and Sweden was added to model migration between the two closely separated land masses. An image of the habitat is available as

A summary of Δ32 allele frequency data was constructed by pooling data from multiple published papers [

To model the frequency of the Δ32 allele across Europe we used an approach based on Fisher's wave-of-advance theory [

where Δ(^{2} denotes the variance of the parent–offspring dispersal distance distribution. To calculate Δ(

To incorporate gradients in selection,

where _{c}, x_{c},_{c}

To represent the occurrence of the mutation at a single location in space with an initial local frequency of _{0}, we specified the initial conditions of the PDE solution to be

where δ(_{0} was calculated by the formula 1/

For the application of the equations to a geographic habitat, we set the _{0}.

To solve the PDE for ^{(n)} with elements ^{(n)}_{j,k}
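For readers who want a feel for the wave-of-advance dynamics, the following is a one-dimensional sketch only. It assumes a Fisher-type equation of the form ∂p/∂t = (σ²/2) ∂²p/∂x² + s p(1 − p) and integrates it with a simple explicit finite-difference step, which is a swapped-in scheme and not necessarily the one used for the two-dimensional habitat here. The function name and all numerical values (σ = 100 km, s = 0.1, 200-km cells, initial frequency 10^{−3}) are hypothetical.

```python
def simulate_wave(sigma_km=100.0, s=0.1, dx_km=200.0, n_cells=201,
                  generations=100, p0=1e-3):
    """Explicit 1D finite-difference sketch of a Fisher-type wave of advance."""
    dt = 1.0                                  # one generation per step
    diffusion = sigma_km ** 2 / 2.0           # sigma^2 / 2
    coeff = diffusion * dt / dx_km ** 2
    if coeff > 0.5:
        raise ValueError("explicit scheme unstable: reduce dt or increase dx")

    p = [0.0] * n_cells
    origin = n_cells // 2
    p[origin] = p0                            # mutation introduced in one cell

    for _ in range(generations):
        new_p = p[:]
        for i in range(n_cells):
            # Reflecting boundaries: an edge cell sees itself as its neighbour.
            left = p[i - 1] if i > 0 else p[i]
            right = p[i + 1] if i < n_cells - 1 else p[i]
            laplacian = left - 2.0 * p[i] + right
            new_p[i] = p[i] + coeff * laplacian + dt * s * p[i] * (1.0 - p[i])
        p = new_p
    return p


frequencies = simulate_wave()
centre = len(frequencies) // 2
print(f"frequency at origin: {frequencies[centre]:.3f}")
print(f"frequency 2,000 km away: {frequencies[centre + 10]:.3f}")
```

The frequency is highest at the point of origin and declines with distance, which is the qualitative pattern the full two-dimensional model produces when selection is uniform.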

To model the frequency of Δ32 in Iceland, we used a standard single-population deterministic model of selection in which the additive selection intensity was set to _{c}

To obtain MLEs for the parameters of the model, we used a fixed allele age _{a}_{a}

This likelihood approach benefits from taking into account the sample size at each sampling locale, so that the discrepancy between predicted and observed allele frequencies is penalized less at locations with smaller sample sizes. The value of _{a}
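The sample-size weighting described here can be seen in a minimal sketch of a binomial log-likelihood. The function name, the clamping of predicted frequencies away from 0 and 1, and the example counts are assumptions for illustration. The predicted frequencies stand in for values read off the model's allele frequency surface.

```python
import math


def binomial_log_likelihood(counts, sample_sizes, predicted_freqs):
    """Sum of binomial log-probabilities over sampling locales."""
    total = 0.0
    for k, n, p in zip(counts, sample_sizes, predicted_freqs):
        # Clamp to avoid log(0); this guard is an implementation choice.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1))
        total += log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total


def penalty(k, n, predicted):
    """Log-likelihood lost when the prediction differs from the observed k/n."""
    best = binomial_log_likelihood([k], [n], [k / n])
    return best - binomial_log_likelihood([k], [n], [predicted])


# Same observed frequency (10%) and same prediction (15%), different sample sizes.
small_sample_penalty = penalty(k=4, n=40, predicted=0.15)
large_sample_penalty = penalty(k=40, n=400, predicted=0.15)
print(small_sample_penalty, large_sample_penalty)
```

The same gap between observed and predicted frequency costs roughly ten times more log-likelihood at the locale with ten times the sample size.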

A grid-based method was used to produce a joint likelihood surface over _{NS}, G_{EW}, x_{0}, and _{0}. We used a grid for ^{4} and 2 × 10^{6} spaced evenly on a logarithmic scale; a grid for the geographic origins _{0} and _{0} that contained the 29 locations indicated by the points in _{NS}^{−5} to 22 × 10^{−5} with increments of 3 × 10^{−5}; and a grid for _{EW}^{−5} to 17.5 × 10^{−5} with increments of 5 × 10^{−5}. The resulting grid contained 14,848 points in the five dimensions of _{NS}, G_{EW}, x_{0}, and _{0}.
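A grid search of this kind can be sketched as follows. The grid sizes and candidate origins below are hypothetical and much smaller than the grid described above. The log-likelihood is a toy stand-in with an arbitrary peak, not the PDE-based likelihood, so the result only demonstrates the search mechanics.

```python
import itertools
import math


def log_spaced(low, high, n_points):
    """Return n_points values evenly spaced on a log10 scale, endpoints included."""
    step = (math.log10(high) - math.log10(low)) / (n_points - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(n_points)]


def grid_search(log_likelihood, axes):
    """Evaluate log_likelihood at every grid point and return the best one."""
    best_point, best_value = None, -math.inf
    for point in itertools.product(*axes):
        value = log_likelihood(*point)
        if value > best_value:
            best_point, best_value = point, value
    return best_point, best_value


def toy_log_likelihood(ratio, g_ns, g_ew, origin):
    """Toy surface with an arbitrary peak; NOT the model likelihood."""
    x0, y0 = origin
    return -((math.log10(ratio) - 5.5) ** 2
             + (g_ns * 1e4 - 1.0) ** 2
             + (g_ew * 1e4 - 0.5) ** 2
             + ((x0 - 20.0) ** 2 + (y0 - 50.0) ** 2) / 100.0)


axes = [
    log_spaced(1e4, 2e6, 6),                      # dispersal-to-selection ratio (km^2)
    [i * 5e-5 for i in range(5)],                 # north-south gradient (km^-1)
    [i * 5e-5 for i in range(4)],                 # east-west gradient (km^-1)
    [(10.0, 45.0), (20.0, 50.0), (30.0, 60.0)],   # candidate origins (lon, lat)
]
best_point, best_value = grid_search(toy_log_likelihood, axes)
print(best_point, best_value)
```

An exhaustive grid is simple and yields the whole likelihood surface for profile plots, at the cost of one model evaluation per grid point.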

To assess the goodness of fit of the model we performed a standard _{n} −L_{5}) where _{5} is the log-likelihood computed using the MLE values in our full five-parameter diffusion-based model, and _{n}_{n}_{5} = −247.7. We also estimated the overdispersion parameter ϕ by the ratio of the

To evaluate the effect of kurtosis we used simulations on a two-dimensional stepping stone habitat of 121 × 121 demes placed on a torus-shaped habitat arranged in a uniform distribution on (−6,000 km, 6,000 km) along the _{0} = 0 and _{0} = 0, and the initial frequency of the allele was set to 10^{−5}. The simulations were stopped when the allele frequency became greater than 16%. Selection was incorporated with an additive allele with

The second was a modified double exponential that when

The third was a double gamma distribution that was used by Cavalli-Sforza et al. [

All three distributions were parameterized to have a standard deviation equal to 100 km, so that the effect of kurtosis alone could be assessed. For the shape parameter of the double gamma distribution we used the value of 0.0419 estimated by Cavalli-Sforza et al. [_{NS}, G_{EW}, x_{0}, and _{0} all fixed to zero, so that ^{4} to 1 × 10^{5}. The mean and standard error for estimates of

The grey squares mark the geographic area used for numerical solution of the PDE. Coastlines are overlaid in the figure only for reference.


Funding for this work comes from the Howard Hughes Medical Institute (JN), the Miller Foundation (APG and MS), and National Institutes of Health grant NIH-GM-40282 (MS). We thank S. Limborska for providing geographic coordinates of his published data, as well as Eric Anderson, Laurent Excoffier, Gerard Lucotte, and two anonymous reviewers for helpful comments regarding the manuscript.

MLE: maximum likelihood estimate

PDE: partial differential equation