A New Method to Scan Genomes for Introgression in a Secondary Contact Model

Secondary contact between divergent populations or incipient species may result in the exchange and introgression of genomic material. We develop a simple DNA sequence measure, called G min, which is designed to identify genomic regions experiencing introgression in a secondary contact model. G min is defined as the ratio of the minimum between-population number of nucleotide differences in a genomic window to the average number of between-population differences. Although it is conceptually simple, one advantage of G min is that it is computationally inexpensive relative to model-based methods for detecting gene flow and it scales easily to the level of whole-genome analysis. We compare the sensitivity and specificity of G min to those of the widely used index of population differentiation, F ST, and suggest a simple statistical test for identifying genomic outliers. Extensive computer simulations demonstrate that G min has both greater sensitivity and specificity for detecting recent introgression than does F ST. Furthermore, we find that the sensitivity of G min is robust with respect to both the population mutation and recombination rates. Finally, a scan of G min across the X chromosome of Drosophila melanogaster identifies candidate regions of introgression between sub-Saharan African and cosmopolitan populations that were previously missed by other methods. These results show that G min is a biologically straightforward, yet powerful, alternative to F ST, as well as to more computationally intensive model-based methods for detecting gene flow.


Introduction
Secondary contact occurs when sympatry is restored between two or more populations that have evolved for some amount of time in allopatry. For evolutionary biologists, secondary contact between diverging populations can provide a compelling natural experiment. For example, the frequency and symmetry of hybrid matings can yield insight into the roles of sexual selection [1] and/or reinforcement [2] in speciation. Likewise, the frequency of backcrossing and subsequent introgression can reveal the extent to which postzygotic isolating mechanisms have accumulated [3]. In this context, studies of naturally occurring secondary contact offer a distinct advantage over laboratory-based studies of reproductive isolation-the patterns of introgression represent the fitness of hybrid genotypes in natural environments, replete with a variety of ecological selection pressures. Lastly, studies of secondary contact are not limited merely to satisfying the intellectual curiosity of evolutionary biologists: hybridization and introgression from closely related invasive populations can be a significant extinction threat for endangered endemic populations [4,5].
With the advent of comparative population genomics, there is now the potential to 1) quantify the frequency and tempo of introgression between natural populations experiencing secondary contact at the level of entire genomes, and 2) identify which genomic regions are exchanged. A variety of methods have been developed to estimate the rate and directionality of gene flow between diverging populations [6][7][8][9] Generally, these estimate historical population demography to assess if the observed data fit with an isolation model, and if not, estimate the direction and magnitude of gene flow necessary to explain the observed data. Comparatively fewer methods have been developed to localize introgression-identifying which genomic regions have experienced exchange-and most are tailored to have utility in particular taxa, for example requiring both "pure" and admixed samples or requiring that one population was formed by a recent dispersal event [10,11]. Many investigators rely upon unusually low observed values of the traditional fixation index, F ST [12], to identify introgressing genomic regions (e.g., [13][14][15]). We suggest that F ST may not be ideally suited for this particular application: it is derived from the variance in allele frequencies among populations and may lack power to detect introgression in cases of secondary contact [16]. This is because for F ST to take on values close to zero following secondary contact, alleles must not only be shared across populations, but their frequencies in the two populations must also be equal. This is not necessarily expected in a secondary contact model, in which introgression is either very recent or otherwise limited. In this paper, we consider whether whole-genome sequence data can be leveraged to obtain both greater sensitivity and specificity to detect introgression than using F ST alone.
While there are a variety of alternatives to F ST for detecting introgression [6,[8][9][10][11][17][18][19], our aim is to develop a method that fulfills seven criteria: 1) it has minimal prior assumptions, 2) is sensitive to recent gene flow, 3) has a low rate of false positives, 4) has a straightforward biological interpretation, 5) is applicable to a wide range of taxa, 6) can localize tracts of introgression in the genome, and 7) is fast to compute on large genomic datasets. To this end, we propose a simple haplotype-based sequence measure called G min , which is can be quickly calculated in sliding windows across whole-genome alignments. G min is the ratio of the minimum between-population haplotype distance to the mean between-population haplotype distance, calculated in windows across the genome. We present the results of extensive computer simulations demonstrating that G min is more sensitive to recent introgression than F ST in a secondary contact model. We also use G min on a previously published dataset to scan the X chromosome for introgression between sub-Saharan African and cosmopolitan populations of the commensal fruit fly Drosophila melanogaster.

Materials and Methods
Rationale for the G min measure Assume that we have nucleotide sequences of multiple individuals sampled from two populations, such that there are a total n 1 sequences from population 1 and n 2 sequences from population 2. The average number of pairwise nucleotide differences between sequences from the two populations is defined as in which d XY is the Hamming distance (or, p-distance) between sequence X from population 1 and sequence Y from population 2 [20]. Similarly, let min(d XY ) be the minimum value of d XY among all n 1 ×n 2 comparisons. We can then define the ratio, The ratio G min ranges from zero to unity and has the property that if n 1 = 1 and n 2 = 1, then. G min = 1. Under a strict model of isolation (i.e., no historical gene flow), a lower bound is imposed upon G min by the divergence time between the two populations. However, for population divergence models that include recent gene flow the lower bound is determined by the timing of the most recent gene flow event (for example, see Fig 1). A coalescent approximation for the expectation of G min is provided in S1 Text. We performed coalescent simulation to contrast G min with F ST calculated with the expression given by [21], Behavior of the G min ratio To characterize the behavior of the G min ratio, two sets of coalescent simulations were generated. The first set was intended only to examine the distribution of G min under the null model of neutral population divergence with no gene flow (isolation). The second set of simulations was designed to contrast the sensitivity and specificity of G min with those of F ST , using a binary classification procedure. This second set considers a large parameter space for a secondary contact Illustration of the average and minimum between population coalescent times for models that include A) population divergence in isolation, and B) secondary contact. For sufficiently high rates of mutation, these two times are the main determinants for the observable quantities: the mean number of between population nucleotide differences, d XY and the minimum between population differences min(d XY ).
model, which includes an ancestral population of size N that splits into two descendant populations at time τ D (measured in units of N generations). We focus on cases in which each of the descendant populations also has size N (however, for treatment of the effects of varying population size in secondary contact models, see [22]). Subsequently, at time τ M (also measured in units of N generations) before the present, the source population is allowed to send migrants instantaneously to the other population. Instantaneous migration was assumed, rather than specifying a time for the onset of continuous gene flow, because it more discretely captures the effect of the timing of secondary contact. The number of migrating lineages is governed by the "migration probability" parameter, λ. For example, at time τ M , let there be k ancestral lineages present in the source population, so that the number of lineages chosen to migrate is a binomial distributed random variable with expectation kλ. We assume that gene flow is unidirectional. This model is implemented in a modified version of the coalescent simulation software MS [23], called MSMOVE [24]. This modified version has the added feature of recording which simulated genealogies experienced a migration event.
Since G min is intended to be measured in a sliding window scan of whole-genome sequence alignments, we performed simulations that approximate variably sized genomic windows. This was achieved by varying both the population mutation rate (θ = 4Nμ where μ is the mutation rate for a given window) and the population crossing-over rate (ρ = 4Nc, where c is the rate of crossing-over per window). Specifically, we used values of θ ∊ {10,20,50,100,150} and ρ ∊ {0,1,10,20,50,100,150}. To provide a more familiar frame of reference for these simulation parameters we provide the following expected values calculated as if our simulated data were derived from population sampling of DNA sequences. For a sample size of 10 individuals, it is expected that θ = 10 corresponds to a window size with 28 segregating sites, while θ = 150 approximates a window with 424 segregating sites. Similarly, ρ approximates the size of haplotypes within windows. For example, when θ = 150 and ρ = 0, all 424 segregating sites would be partitioned among haplotypes that span the length of the window. However, when θ = 150 and ρ = 150, there are also 424 expected recombination events, therefore each segregating site would have its own non-recombining coalescent history, on average.
For each pairwise combination of parameter values, a total of 10 4 independent windows were simulated. This scheme assumes that large windows are being used to scan the genome for gene flow, such that genealogical histories within windows can be correlated, but that adjacent windows contain independent genealogies. Additionally, we considered two different sample size configurations. The first configuration is one in which only a single source-population sequence is available (n 1 = 10 and n 2 = 1) and the second sample configuration assumes that polymorphism data are available from both populations (n 1 = 10 and n 2 = 10). For both sample size configurations, the direction of the gene flow is from population 2 into population 1, going forward in time.
For the first set of simulations, which characterizes the behavior of G min under the null isolation model, we considered a range of population divergence times, τ D ∊ {1/25,2/25,3/ 25,. . .,8}. We performed a variance partitioning analysis to quantify the effects of the n 2 , θ, ρ, and τ parameters (as well as their interactions) on the mean and variance of both G min and F ST . We first fit a linear model that includes all parameters and their interactions. We then quantified the variance explained by each parameter by calculating the partitioned sum of squares.
For all analyses, we tested the non-independence of parameters and for any potential bias-inducing effects of model complexity by comparing variance partitioning for each parameter after 1) iterating the order of parameters in the model, 2) running models both with and without interaction terms, and 3) serially removing parameters. All post-processing and analyses of simulated data was performed using the R statistical environment [25].

Sensitivity and specificity
To contrast the sensitivities of G min and F ST to gene flow under the alternative secondary contact model, we examine the proportion of simulated true migrant genealogies that are deemed outliers using a simple designation criterion. While this is not meant to be a formal statistical test of gene flow versus isolation, it is a convenient procedure for approximating the sensitivity and specificity of G min and F ST . Using this procedure, we classify a genomic window as being "positive" for gene flow on the basis of its standardized deviation from the genome-wide mean (Z-score). We defined three levels of stringency for considering an individual window as positive for gene flow, Z < −1.645, Z < −2.326, and Z < −3.090. Let the set of windows with a Zscore less than the threshold be denoted as Q. Furthermore, simulated windows are classified as "true" gene flow windows if they contain a genealogy in which an ancestral lineage has switched populations. Therefore, any particular parameterization of the secondary contact model will yield the set M of true gene flow windows. Let M\Q represent the set of true gene flow windows with a Z-score below the threshold value. The sensitivity of the test (φ) can therefore be defined as the proportion Thus, φ = 1, when all true gene flow windows have an outlying Z-score. Conversely, we define specificity (ψ) as such that if ψ = 1, then all windows with an outlying Z-score are true gene flow windows. For the analysis of sensitivity and specificity, the simulated parameter combinations were the same as those used in the first set of simulations described in the previous subsection. The only exceptions were that we simulated a narrower range of divergence times τ D ∊ {1/100,2/100,3/ 100,. . .,1} and added two additional parameters: the relative time of gene flow, which had the range τ M ∊ {τ D /100,2τ D /100,3τ D /100,. . .,τ D } (for τ D > 0) and migration probability in the set, λ ∊ {0.001,0.005,0.01,0.05,0.1}. In addition to assessing the sensitivity and specificity of G min and F ST , we also evaluated the effect of each varied parameter on sensitivity and specificity. Variance partitioning was performed as described in the previous subsection.

Application to Drosophila melanogaster data
We developed G min in anticipation of high-quality short-read assemblies of population-level samples from more than one population. Such data have just begun to emerge from a variety of organisms. To contrast the sensitivity of G min with that of F ST , we apply it to a subset of the highest quality available resequence dataset: X chromosome polymorphism of two populations of Drosophila melanogaster [10]. The two populations include a cosmopolitan population from France and a sub-Saharan African population from Rwanda. While these two populations generally show low levels of sequence divergence (chromosome average F ST = 0.183 and d XY ¼ 0:0085), a recent study was able to detect a signal of recent cosmopolitan admixture in several African populations, including the deeply sampled Rwandan population [10]. We obtained 76 bp paired-end Illumina reads from seven French and nine Rwandan lines from the NCBI short read archive (see S1 Table for details on the sampled lines). All reads were aligned to the reference genome of D. melanogaster, build 5.45 (http://flybase.org), using the BWA software, version 0.6.2 [26]. The resulting alignments for individual lines in the BAM format were merged using the SAMTOOLS software package [27]. The values of F ST and G min were calculated in non-overlapping 50 kb windows using the POPBAM software package [28]. We only analyzed nucleotide sites that met the following criteria: read depth per line greater than 5, Phred-scaled scores for the minimum root-mean squared mapping quality greater than or equal to 25, and a SNP quality that is at least 25; we also only incorporated reads with a minimum mapping quality of 20 and an individual base quality of at least 13. Of the 443 X chromosome 50 kb windows, seven (1.58%) had less than 25% of the reference genome positions passing the above filters and were subsequently ignored. Lastly, we construct neighbor-joining trees based on uncorrected Hamming distance in 50 kb windows using POPBAM. For the sake of consistency, individual windows were identified as outliers if Z < −1.645. We compare our analysis to that of Pool et al. [10], who utilized a Hidden Markov Model method based on the pairwise distances between sub-Saharan African and cosmopolitan genomes. In windows of 1000 non-singleton SNPs, each Rwandan line was assigned a posterior probability of admixture. We identified previously known admixed regions as those whose sum of posterior probabilities across lines is greater than 0.50 (see S5

Results
Behavior of the G min ratio under an isolation model G min is the ratio of min(d XY ), the minimum number of nucleotide differences between haplotypes sampled from different populations, to d XY the average number of between-population differences (Eq. 2). In a strict isolation model of divergence, we expect that both min(d XY ) and d XY will increase as a function of the population divergence time, τ D . Ultimately, G min is expected to approach unity for very ancient divergence times (τ D >> 4N), because there is a high probability of only a single ancestral lineage remaining in each population. Conversely, for very recent divergence times, G min is expected to be much less than unity, since it is unlikely that all coalescent events will occur only between ancestral lineages from the same population before a single coalescent occurs between lineages from different populations. Computer simulations show that both G min and F ST increase asymptotically to unity as the divergence time increases, but also that G min increases at a faster rate and plateaus at an earlier divergence time (Fig 2).
In the isolation model, the variance of G min is most strongly affected by the time of population divergence, τ D . Variation in τ D alone explains approximately half of the simulated variance for both G min and F ST ( Table 1). When the population mutation rate θ 10, G min becomes downwardly biased (Fig 2B). We suspect that this bias arises for low mutation rates because, when few mutations occur on a set of correlated genealogies, G min does not always capture the minimum time of the between-population coalescent events, rather it may reflect a randomly chosen between-population coalescent event that, by chance, has fewer mutations separating them than the true minimum event. Finally, whether a single source-population sequence is available (n 2 = 1) or polymorphism data are available (n 2 = 10) has a minor, but predictable, effect: G min is always closer to unity when n 2 = 1 than when n 2 = 10 (Fig 2). It should be noted that although we report on the results for F ST in the case of n 2 = 1, this is obviously not a situation in which F ST (as a measure of difference in allele frequencies) would be applicable. Finally, we found no evidence of bias in any of the variance partitioning analyses, so that the full models with all parameters and interaction terms have been included.

Sensitivity and specificity
When we consider a secondary contact model, the two parameters that exert the strongest influence on the behavior of both G min and F ST are the time of migration relative to divergence (τ M ) and the magnitude (λ) of the migration event (S2 and S3 Tables). Our simulations show that G min has increased sensitivity and specificity compared to F ST for all combinations of the τ M and λ parameters, regardless of the values of nuisance parameters, such as θ and ρ (Fig 3). The sensitivity of G min is greatest when τ M is recent and λ is small (S1 Fig). It is interesting to note that the sensitivity of G min decreases with increasing λ because large amounts of migration tends to reduce the average between-population sequence distance, thereby also reducing the expected G min and increasing its variance (S4 Table). However, for F ST , λ does not have a profound effect on its sensitivity (S4 Table). In contrast, increased λ results in a greater specificity for G min (Fig 3). This means that although high λ results in a lower proportion of the migrant genealogies appearing in the negative Z-score tail, a greater proportion of all genealogies in the tail are true migrant genealogies.  Surprisingly, the rate of recombination has only a mild effect on the sensitivity of G min and F ST (S2 Fig). This may be due to the relatively intermediate levels of recombination used in the computer simulations, since the recombination rate must be very high (ρ > 50) to break up introgressed haplotypes when τ M is very recent. This is also true of specificity (S3 Fig). Likewise, increasing the population mutation rate also slightly increases both the sensitivity (S4 Fig) and the specificity (S5 Fig). These results suggest that sensitivity and specificity of G min are optimal when large genomic windows (θ > 10) with relatively low levels of recombination (ρ < 20) are considered.
A trade-off between sensitivity and specificity occurs when we contrast results from simulations of divergence from a single source population sequence (n 2 = 1) with those from polymorphism data from both populations (n 2 = 10). G min has increased sensitivity when n 2 = 1 compared to when n 2 = 10 (S6 Fig). In contrast, the specificity of G min is substantially greater when n 2 = 10 (S7 Fig). Therefore, situations in which only a single source-population sequence is available results in G min having increased power to detect migrant genealogies at any given locus in the genome, while polymorphism data from two populations yields increased power to detect gene flow across the genome. The specificity result is intuitive from a biological standpoint: if low levels of gene flow occur, then having more sequences per population will increase the probability of recovering an introgressed haplotype. Sensitivity increases when n 2 = 1 because there is less variance in the coalescent process in the ancestral population for genealogies that do not experience gene flow and the expected G min in an isolation model is closer to unity; this results in a higher proportion of migrant genealogies significantly departing from a genome-wide distribution.

Application to cosmopolitan admixture in Drosophila melanogaster
We compare the ability of G min versus F ST to detect cosmopolitan admixture in a Rwandan population of D. melanogaster. We used POPBAM to calculate the two statistics in 436 nonoverlapping 50 kb windows on the X chromosome in a sample of seven French and nine Rwandan lines (Fig 4A). The mean and standard error for G min is 0.6500 ± 0.0311 and for F ST is 0.1725 ± 0.0083. Interestingly, the range of G min (0.0982-0.9833) is more than twice as large as that of F ST (0.0170-0.5107) (Fig 4B). This expanded range of G min is consistent with a greater sensitivity of G min , even for relatively low levels of population divergence. The outliers from the chromosome-wide G min distribution identified cosmopolitan admixture in all of the previously identified admixture windows (Fig 4A). In contrast, outlier values of F ST appear in only one of the six previously identified tracts (Fig 4A). The outliers of G min also reveal two additional candidate introgression tracts on the X chromosome-a region consisting of five significant windows between coordinates 1.65-2.05 Mb, and a single window located at 12.95-13 Mb just above our arbitrary cut-off (Z = −1.6352); neither region was previously identified by Pool et al. [10]. The first region near the 2 Mb coordinate harbors a low frequency introgressed haplotype carried by Rwandan line, RG35. Neighbor-joining trees indicate that the RG35 sample is nested within the French samples, although the particular French line(s) with which it clusters varies across windows (S8 Fig). The second marginally significant window involves a similar scenario where RG35 is nested within the clade of French lines, sister to the French line FR229 (S9 Fig). These inferred low frequency introgressions went undetected in both our F ST scan and the Hidden Markov Model analysis performed by Pool et al. [10] The window size used by Pool et al. [10] was based on the number of SNPs, rather than physical distance, such that windows in this sub-telomeric region are larger than 100 kb, on average. Therefore, it is possible that the large windows analyzed by Pool et al. [10] contain conflicting genealogical histories, resulting in the distance between RG35 and any particular French line not being reduced, on average.

Discussion
Comparative population genomic datasets, or whole genome alignments of many individuals from multiple populations within a species or between closely related species, are finally becoming realized in evolutionary genetics. One of the many potential uses of these new data is to estimate the degree to which introgression occurs between populations coming into secondary contact. Also of interest is pinpointing the genomic location of introgression and characterizing the functional properties of introgressing coding material, if any. Many of the first studies to make use of whole-genome datasets rely on the traditional fixation index, F ST , to identify introgressed genomic regions. However, we have shown that F ST has a number of inherent weaknesses for detecting introgression in a secondary contact model.
Our analyses focus on phased haplotype data, which can be especially useful for inferring details of historical population demography and gene flow [18,29,30] and haplotype sharing among populations is often used as a criterion for detecting introgression [19,31,32]. We show that haplotype-based measures of within-and between-population sequence differences, such as G min , offer better sensitivity and specificity over allele frequency measures such as F ST . Furthermore, our simulations show that G min is robust to local variation in mutation rate and, to a lesser extent, recombination rate. The robustness of G min to the local recombination rate primarily occurs when gene flow is both recent and limited, in which case there is a limited opportunity for recombination to break up introgressed haplotypes (S2 and S3 Figs). This result suggests that choice of window size offers an avenue for distinguishing recent versus older introgression events (S10 Fig). Larger windows with more mutation and recombination events offer greater power to identify very recent introgression events, whereas smaller windows can identify older introgression events, albeit with less specificity than larger windows. In practice, the most useful window size will vary by the particular taxa of interest. Due to the relative ease in calculating G min , optimal window size can be rapidly evaluated over a range of genomic intervals.
Like F ST or d XY ; G min is not a formal test statistic, rather it is a sequence measure designed to identify a distinctly bimodal pattern of between-population coalescence that is expected under models of secondary contact, but not expected in models of strict population isolation. We were unable to derive a closed-form expression for the variance of the G min ratio in a pure isolation model, due in part to the fact that we observe a non-zero positive covariance between the numerator, min(d XY ), and the denominator, d XY (data not shown). Therefore, using G min as the basis for a simple single-locus test is not currently feasible. However, like F ST , G min can be readily incorporated into other inferential frameworks, such as approximate likelihood methods [33]. Our approach differs from more formal inferential frameworks, such as those used by the IM program [9], in that IM tests the hypothesis of whether or not gene flow has occurred; the goal of G min is less formal, seeking instead to localize introgression genealogies in otherwise diverging genomes. In practice, a G min scan may be an extremely useful first step for identifying candidate regions for introgression. Unlike many likelihood-based methods for detecting gene flow in a population divergence model, G min can be quickly applied to large whole-genome datasets and interpretation of G min requires a minimal set of assumptions. The fundamental assumption is that the individuals in the analysis came from either one population or a different population. This is in contrast to some methods for detecting admixed regions of the genome, which rely on investigators being able to assign individuals to two pure parental populations, as well as a third population of hybrid individuals [11]. Of course, knowing the hybrid status of individuals, or having more detailed information of sample geographical distribution, may enable more advanced analysis [6,17].
While G min is more sensitive to recent gene flow than F ST , it has additional desirable properties that distinguish it from other recently proposed haplotype-based methods. For example, Harris and Nielsen [8] describe a method for detecting recent gene flow by measuring the genomic length distribution of tracts of identity-by-state. The computer simulations presented by Harris and Nielsen [8] demonstrate that their method can accurately infer the timing and magnitude of admixture events, as well as other demographic parameters, over a range of time scales. However, the identity-by-state method of Harris and Nielsen [8] may also be sensitive to 1) low quality reads and sequencing error, 2) reductions in effective population size due to background selection, and 3) accuracy of the required modeling of historical population bottlenecks. In contrast, we argue that G min is not as sensitive to errors in sequencing or assembly, because G min does not explicitly depend upon uninterrupted runs of shared polymorphic sites. Additionally, the lower tail of G min is not expected to be strongly affected by background selection under a secondary contact model. This is because background selection does not affect the tempo of neutral divergence [34] and can skew within-population polymorphism towards an excess of rare alleles [35], neither of which affects G min (however, for the effect of reductions in the effective population size, see below).
Besides recent introgression, the primary factor affecting G min is the number of ancestral lineages present at the time of the initial population split. As a result, the distribution of G min will be affected by any force that alters the probability density of within-population coalescent events, including changes in the effective population size or natural selection. If natural selection acts to reduce diversity in one population exclusively or, if the effective population size of one population is smaller than that of the other, we expect there to be fewer ancestral lineages present at the time of the initial population divergence. To consider the performance of G min in these cases, we can extrapolate from our computer simulation results of different sampling schemes, in particular when n 1 = 10 and n 2 = 1. We find that when only a single source-population genome is used, G min has greater sensitivity (S6 Fig), but reduced specificity compared to when n 2 = 10 (S7 Fig). This suggests that forces acting to increase the rate of coalescence within populations, such as population bottlenecks, will result in increased confidence that small values of G min can be attributed to recent gene flow, but also a diminished ability to recover all of the introgressed regions in a genome. Similarly, the reduced specificity of F ST when there is a reduction in within-population variation is well-known [21,36,37], however G min does not appear to be as strongly affected as F ST (S7 Fig). In conclusion, we do not wish to argue that G min is in any way a panacea for the longstanding problem of distinguishing models of gene flow from those of pure isolation [38]. Indeed, G min lacks sensitivity when gene flow occurred more than halfway back to the time of the population divergence or when there is a large amount of gene flow (S1 Fig). For example, if a genomic region is sweeping across species boundaries [39], G min is not expected to be as informative as F ST . Therefore, it is also important to caution that genomic intervals with low G min should be subsequently vetted to ensure that the region does not have unusually low absolute values of d XY . However, in cases of recent secondary contact, and when the rates of gene flow are not extremely high, we have shown that G min performs well and is more reliable than F ST (Fig 3). In addition, we illustrate how a simple statistical procedure employing G min to scan the X chromosome of recently diverged cosmopolitan and sub-Saharan African populations of Drosophila melanogaster performs as well as more sophisticated methods (Fig 4). However, unlike many more sophisticated methods, the calculation of G min is fast and broadly applicable to any taxa for which haploid genome sequences are available. G min can be easily calculated from population genomic data using the software package POPBAM [28]. We anticipate that with the continued emergence of new haplotype sequencing methods [40,41], these types of data will be increasingly used for evolutionary studies. In this case, G min can be an effective and biologically straightforward addition to the suite of tools available to evolutionary biologists.