Recent Selective Sweeps in North American Drosophila melanogaster Show Signatures of Soft Sweeps

Adaptation from standing genetic variation or recurrent de novo mutation in large populations should commonly generate soft rather than hard selective sweeps. In contrast to a hard selective sweep, in which a single adaptive haplotype rises to high population frequency, in a soft selective sweep multiple adaptive haplotypes sweep through the population simultaneously, producing distinct patterns of genetic variation in the vicinity of the adaptive site. Current statistical methods were expressly designed to detect hard sweeps and most lack power to detect soft sweeps. This is particularly unfortunate for the study of adaptation in species such as Drosophila melanogaster, where all three confirmed cases of recent adaptation resulted in soft selective sweeps and where there is evidence that the effective population size relevant for recent and strong adaptation is large enough to generate soft sweeps even when adaptation requires mutation at a specific single site at a locus. Here, we develop a statistical test based on a measure of haplotype homozygosity (H12) that is capable of detecting both hard and soft sweeps with similar power. We use H12 to identify multiple genomic regions that have undergone recent and strong adaptation in a large population sample of fully sequenced Drosophila melanogaster strains from the Drosophila Genetic Reference Panel (DGRP). Visual inspection of the top 50 candidates reveals that in all cases multiple haplotypes are present at high frequencies, consistent with signatures of soft sweeps. We further develop a second haplotype homozygosity statistic (H2/H1) that, in combination with H12, is capable of differentiating hard from soft sweeps. Surprisingly, we find that the H12 and H2/H1 values for all top 50 peaks are much more easily generated by soft rather than hard sweeps. We discuss the implications of these results for the study of adaptation in Drosophila and in species with large census population sizes.

substantially higher than H12 o values calculated under models with recombination rates even modestly higher than 10 -7 cM/bp. Therefore, H12 o values calculated under low recombination rates may be too conservative for most genomic regions. Hence, we used the H12 o value obtained from regions with a low, but not extremely low, ρ = 5×10 -7 cM/bp, filtering out all regions with a recombination rate lower than 5×10 -7 cM/bp from the data.

Robustness of the H12 scan
To ensure that the H12 peaks identified in our genomic scan are robust to any peculiarities of the DGRP data set such as inversions, unaccounted substructure within the data, or sequencing quality, we performed a number of tests. The individual strains of the DGRP data set contain a number of inversions, seven of which are shared across multiple strains (S5A Table) (The locations of inversion breakpoints were identified by Spencer Koury, personal communication). One possibility is that elevated peaks of homozygosity could result from inversions suppressing recombination. To test for this possibility, we performed a binomial twosided test for enrichment of the top 50 peaks in regions with inversions versus a model of a uniform distribution of the peaks genome-wide. We found no significant enrichment in any inversions except for an inversion on chromosome 3R, In(3R)K (P-value=6.44E-06) (S5A Table). We further performed a Chi-square test for a correlation between members of haplotype groups in each peak and haplotypes potentially linked to an inversion on the same chromosome, as inversions have been shown to affect polymorphisms chromosome-wide [1]. We did not find any enrichment for strains bearing inversions in any single haplotype cluster group for the top 50 peaks (S5B Table), suggesting that the enrichment of peaks in the In(3R)K inversion cannot be attributable to the inversion per se. Finally, even after removing regions of the genome overlapping major cosmopolitan inversions, there continues to be an elevation and long tail of H12 values in DGRP data relative to expectations under any neutral demographic model (S6 We repeated the analysis in the DGRP version 2 data set as well. In the DGRP data set, there are at least five pairs of strains with genome-wide identity by descent (IBD) values > 50% suggesting twin or sibling relationships [2], and three of these complete pairs were among our data set of 145 strains. Since related strains can increase haplotype homozygosity, in our new DGRP v2 scan, we removed one of the members of each closely related pair to ensure that the top 50 H12 peaks are robust to any homozygosity contributed by related pairs of flies. In addition, we removed strains with the most missing data, and down sampled to 145 lines to match the number of strains in the original scan. Forty of the top 50 DGRPv2 peaks overlapped 34 unique peaks among the top 50 peaks in the DGRP scan (S7B Fig.). Since related pairs in a sample of 145 individuals can increase homozygosity at most by (2/145) 2 = 0.00019, we did not exclude these lines from the final analysis of the DGRP data.
We scanned the remaining 63 strains that were non-overlapping with the original 145 strains to determine if we could recover the peaks in a completely independent data set, and observed that 12 peaks among the top 50 peaks in this scan overlap 11 unique peaks among the top 50 peaks identified in the DGRP data set (S7C Fig.).
Finally, we sub-sampled the DGRP data set to 40 strains 10 times and plotted the resulting distributions of H12 values (S8 Fig.). In contrast to H12 distributions observed in the six tested neutral demographic models also sampled at 40 strains, there is an elevation and long tail of genome-wide H12 values, indicating that the elevation in haplotype homozygosity observed in the DGRP data are population-wide and not specific to any subset of the strains.

Estimation of θ A for the top 50 peaks
The monotonic relationship between the softness of a sweep and both H12 and H2/H1 over the interval (0.01 < θ A < 100) in Fig. 5 and 10 suggests that these two statistics are informative for the purpose of inferring the softness of a sweep. Here, we estimate the softness of a sweep by varying the parameter θ A . We developed a Bayesian approach for inferring θ A by sampling the posterior distribution of θ A conditional on the observed values H12 obs and H2 obs /H1 obs from a candidate sweep. Given that sampling this true posterior distribution is computationally intractable, we used approximate Bayesian computation (ABC) for our inference procedure. Specifically, we drew θ A values from a prior distribution, simulated a large data set under each θ A value, and then kept 1000 parameter values which produce sweeps with H12 and H2/H1 values close to the observed values H12 obs and H2 obs /H1 obs from the candidate sweep (differences <10% for each statistic). From these posterior distributions, we inferred the maximum a posteriori (θ A MAP ) value of the given candidate sweep to estimate its softness (Methods).
We estimated the softness of the top 50 peaks detected in our H12 scan in Fig. 8A by inferring the θ A MAP value that generates haplotype structure best resembling the spectra observed for each peak using the above ABC procedure. We first considered the N e = 10 peaks is shown in S10B Fig. S4 Table lists all θ A MAP values and their 95% confidence intervals.
The minimum θ A MAP value among all 50 top peaks is θ A MAP = 6.8, which is obtained for the peak centered at Cyp6g1.
We also estimated θ A MAP for our top 50 peaks under the admixture model proposed by Duchen et al. [3] to determine the effect of admixture on our estimates (Methods). S10A Fig