Estimation of Isolation Times of the Island Species in the Drosophila simulans Complex from Multilocus DNA Sequence Data

Background The Drosophila simulans species complex continues to serve as an important model system for the study of new species formation. The complex is comprised of the cosmopolitan species, D. simulans, and two island endemics, D. mauritiana and D. sechellia. A substantial amount of effort has gone into reconstructing the natural history of the complex, in part to infer the context in which functional divergence among the species has arisen. In this regard, a key parameter to be estimated is the initial isolation time (t) of each island species. Loci in regions of low recombination have lower divergence within the complex than do other loci, yet divergence from D. melanogaster is similar for both classes. This might reflect gene flow of the low-recombination loci subsequent to initial isolation, but it might also reflect differential effects of changing population size on the two recombination classes of loci when the low-recombination loci are subject to genetic hitchhiking or pseudohitchhiking Methodology/Principal Findings New DNA sequence variation data for 17 loci corroborate the prior observation from 13 loci that DNA sequence divergence is reduced in genes of low recombination. Two models are presented to estimate t and other relevant parameters (substitution rate correction factors in lineages leading to the island species and, in the case of the 4-parameter model, the ratio of ancestral to extant effective population size) from the multilocus DNA sequence data. Conclusions/Significance In general, it appears that both island species were isolated at about the same time, here estimated at ∼250,000 years ago. It also appears that the difference in divergence patterns of genes in regions of low and higher recombination can be reconciled by allowing a modestly larger effective population size for the ancestral population than for extant D. simulans.


Introduction
The study of speciation genetics can be imperfectly divided into two categories: the study of genes that directly contribute to speciation and population genetic analyses that aim to reconstruct the natural historical context in which speciation occurs. In an ever-increasing assemblage of model systems for the study of the genetics of speciation, the Drosophila simulans species complex has maintained its prominence, due in part to focused genome sequencing efforts that complement decades of study on the complex and its sister species, D. melanogaster.
The D. simulans complex is comprised of two island species, D. mauritiana and D. sechellia, along with cosmopolitan D. simulans. Calibrating a molecular clock by the divergence of the complex from D. melanogaster, and assuming that this split occurred ,3 mya [1], prior studies of the D. simulans complex indicated that D. sechellia and D. mauritiana were initially isolated from the mainland species approximately 500 and 250 kya, respectively [2]. The aim of the current study is to better estimate the times at which the incipient island species were first isolated from the ancestor of D.
simulans. There are two rationales for this work. First, functional divergence among these species, at least in terms of genetic incompatibilities in hybrids, is considerable [3][4][5][6][7][8][9]. Thus, reliable estimates of isolation times are required to estimate the rate at which functional divergence can evolve in incipient species.
A second rationale relates to prior failure to reject a strict isolation model (SIM) (i.e., no gene flow subsequent to initial isolation) for the members of the D. simulans complex [2,10] that may reflect lack of statistical power. Levels of DNA sequence divergence in genomic regions of high and low recombination are very different within the complex, although their divergences from D. melanogaster are similar [10]. It is possible that genes in low recombination regions, especially those on the fourth chromosome, are less diverged within the complex due to introgression subsequent to initial isolation. This would be consistent with the finding that hybrid sterility does not map to this chromosome [11], making introgression more likely if the opportunity were present.
There is, however, an alternative explanation for reduced divergence that can be reconciled with the SIM: a higher effective population size (N e ) in the common ancestor than in extant D.
simulans coupled with a shallower coalescent for low-recombination genes in the ancestral population. The latter is predicted under various models under which selection on one site also reduces N e over an extended region of linked sites. This is perhaps most easily envisioned under Maynard Smith and Haigh's [12] genetic hitchhiking model, whereby the rapid fixation of a new, beneficial mutation reduces local nucleotide diversity at adjacent sites that can not segregate independently due to insufficient opportunity for recombination during the selective sweep. However, other models (e.g., background selection [13], pseudohitchhiking/''genetic draft'' [14]) also predict reduced polymorphism in regions of low recombination. For either the hitchhiking or pseudohitchhiking model to dramatically reduce polymorphism in regions of reduced recombination, the frequency of selective sweeps must be high enough to make recent sweeps likely in genomic regions of low recombination. Estimates of the frequency of selective sweeps indicate that these are quite common in Drosophila [15][16][17]. For example, using the estimate of Smith and Eyre-Walker [15] of one adaptive amino acid substitution every ,45 years in D. simulans and D. yakuba (assuming 10 generations/year), a selective sweep on the fourth chromosome, which accounts for ,1.06% of the codons in D. melanogaster, would be expected every ,4250 years.
Here, we present two simple models to estimate isolation times from multilocus data. The first, a 2-parameter model, was used to estimate isolation times separately for loci in regions of high and low recombination, along with a substitution rate correction factor for the island species. This model provides a straightforward approach to estimating isolation times from multilocus data that should be broadly applicable to model and non-model species pairs that include an appropriate outgroup (here, D. melanogaster). Given no compelling reason to do otherwise, the analyses were performed under the assumption that the N e of the ancestor of the island species and D. simulans was the same as that of extant D. simulans. Although separate analyses on the two recombination classes of loci each assume a SIM, there appears to be a substantially more recent isolation time for low-recombination loci than for high-recombination loci. Thus, a single SIM (i.e., a historical model with a single isolation time) may not be compatible with both data sets. This led us to develop the second model, a 4-parameter model, which estimates a single isolation time along with the required change in N e and rate correction factors for both recombination classes. Confidence intervals for the parameters were estimated by bootstrapping. We find that the 4-parameter model can accommodate the difference in divergence between the two recombination classes by assuming an ancestral N e that is only slightly larger than the N e of extant D. simulans. The results of the analyses indicate that both island species arose at approximately the same time.

Strains
Genomic DNA was extracted from 5 long-standing isofemale lines of D. mauritiana, 4 North American lines of D. simulans (courtesy of J. Coyne) and 1 line of D. sechellia using a DNeasy kit (Qiagen, Inc.).

Loci
In addition to 13 loci previously studied [2,10], 17 loci were sequenced in all of the strains; D. mauritiana sequences for 14 of the new loci were used in a separate study on codon usage bias [18]. Most of the new loci were sequenced completely from stop through start codons (see Table 1). D. melanogaster sequences for the 17 newly sequenced loci were obtained from Genbank [19]; sources for the 13 previously sequenced loci are given in the related publications. Initial PCR products were purified using a Qiaquick kit (Qiagen, Inc.), and purified PCR products served as templates for cycle-sequencing reactions using the DTCS kit (Beckman-Coulter, Inc.). Sequencing reactions were analyzed using a Beckman-Coulter CEQ 8000 Genetic Analysis System. Internal sequencing primers were designed as needed for complete sequencing of both strands. Sequence data were checked and manually aligned using Sequencher software (Gene Codes, Inc.). Estimates of polymorphism and divergence were based only on sites for which bases were reported for all strains (i.e., any sites with indel variation or unresolved bases were excluded); results were not qualitatively affected by removing such sites. A total of 33,026 aligned sites were analyzed. All sequences have been submitted to Genbank (see Table 1 for accession numbers).

2-parameter model
For a given trio of species (the island species, D. simulans and D. melanogaster), three pairwise measures of divergence were obtained. d is is the estimate of the average number of pairwise differences between the island species and D. simulans; d im is this estimate for the island species and D. melanogaster; and d sm is this estimate for D. simulans and D. melanogaster. The model assumes an unchanged substitution rate throughout the gene tree, with the exception that a rate change is permitted along the lineage leading to the island species subsequent to isolation. The cause of the rate change (e.g., changing generation time, change in the effectiveness of negative selection), while interesting in its own right, is not important for the model. If T is the divergence time for D. simulans and D. melanogaster lineages, d sm = 2Tm; this allows an estimate of m = d sm /2T. Here, T is assumed to be 3 mya [1]; estimates of isolation time will scale approximately linearly with T. If t is the isolation time for D. simulans and the island species, d is = (1+f)tm+gh s , where f is a substitution rate scalar for the island species, g is the ratio of the ancestral population's N e to that of extant D. simulans, and h s is the level of polymorphism in D. simulans. The term (1+f)tm represents the divergence expected since isolation (i.e., tm on the lineages leading to D. simulans and ftm on lineages leading to the island species). The term gh s represents the divergence prior to isolation. For the purposes of our analyses, g is assumed to be 1.0 for the 2-parameter model, although other values can be used when appropriate. [Kliman et al. [2] used a similar approach, assuming g to be 1 and assigning values of f based on available data.] Finally, d im = (2T2t+tf)m. Substituting d sm /2T for m, the latter formula can be rearranged to give an estimate of f in terms of observed divergence estimates and t: Rearranging the formula d is = (1+f)tm+gh s to solve for t, and substituting the formula for f (Eq. 1) and d sm /2T for m, gives an estimate of t based on observed divergence and polymorphism (nucleotide diversity) estimates: Thus, t and f are easily estimated simultaneously from polymorphism and divergence data given arbitrary g and T.

4-parameter model
Equations 1 and 2 can be applied to distinct data sets, here genes in low-recombination (LR) and high-recombination (HR) regions, respectively. LR regions, in which recombination is dramatically reduced over an extended region, include the entire fourth chromosome, the tip of the X chromosome and regions near centromeres [20]. Let t 0 and f 0 refer to estimates from HR loci; let t 1 and f 1 refer to estimates from LR loci. Under the SIM, t 0 = t 1 . Applying the same subscripts to divergence and polymorphism data for HR and LR loci, and following equation 2, which can be rearranged to yield From here, t is calculated by substituting ĝ into equation 2, and fˆ0 and fˆ1 are calculated by substituting t into equation 1 for HR and LR loci, respectively.

Confidence intervals for parameter estimates
To obtain 95% confidence intervals for the parameters, sites were sampled with replacement from the full data set to produce 10,000 bootstrap replicates from which parameters were calculated. Bootstrapping of sites was performed three ways: by sampling from all sites regardless of locus or class (HR vs. LR), by sampling while holding the number of sites per class constant, and by sampling while holding the number of sites per locus constant. The three methods were first applied to the original data set. To better capture stochastic variance in time depths of genealogies, which applies here to both polymorphism and divergence, all three methods were applied to data sets for which loci were sampled with replacement within recombination class. One hundred bootstrapped locus lists were generated, and each was used for 100 bootstraps of sites. As expected, confidence intervals were wider when loci, as well as sites, were re-sampled. We found that 100 re-samplings of loci is sufficient to construct the distributions of parameter estimates, as the distributions are essentially the same as those derived from 1,000 re-samplings.

DNA sequence variation
Contributions of individual loci to measures of divergence and polymorphism for all sites are shown in Table 2. Two estimates of h s are shown: an estimate based on average pairwise nucleotide diversity, ĥ p [21], and Watterson's estimator, ĥ W , which is based on the number of polymorphic sites [22]. As expected, polymorphism at all LR loci is low for both D. simulans and D. mauritiana (Tables 2 and 3); this also suggests that these regions of low recombination are shared by the species. It is worth noting that divergence at the LR locus CG15216, which is located near the centromere of chromosome 2, is not reduced within the D. simulans complex. Analyses were performed both with and without this locus. It is immediately apparent that divergence between D. simulans and the island species is reduced at LR loci (Students t test: p = 0.00396 for D. mauritiana, p = 0.00243 for D. sechellia; when CG15216 is excluded, p = 0.00396 for D. mauritiana, p = 0.00156 for D. sechellia; corresponding Mann-Whitney U tests were also significant at p,0.01). However, divergence is similar for the two recombination classes between either species and D. melanogaster (Table 3); differences are not statistically significant and the means for LR loci exceed those for HR loci.
These findings indicate that the reduction in divergence within the complex at LR loci is probably not due to greater selective constraint. The pattern also contrasts with findings in primates [23], where a positive correlation between divergence and recombination rate is observed in several species pairs, including distantly related pairs for which the contribution of the coalescent in the common ancestral pair should be minor (i.e., in accordance with the model used by Birky and Walsh [24]). Thus, while recombination rate and mutation rate may be associated in primates, our data are not explained by such an association. Ometto et al. [25] also observed a positive correlation between recombination rate and intron divergence of D. melanogaster and D. simulans (although not for divergence in intergenic regions), which may have a non-neutral explanation; however, as noted above, we do not observe a corresponding pattern in our data when comparing the two recombination classes.

2-parameter model
Estimates of t and f for HR and LR loci are shown in Table 4. Distributions of the parameter estimates, as obtained by bootstrapping, are shown for estimates using D. simulans ĥ p in Figures 1  and 2 for the pairs D. simulans/D. mauritiana and D. simulans/D. sechellia, respectively. Distributions for estimates using ĥ W were nearly identical (data not shown). For both species pairs, isolation times are more recent for LR loci (,250-270 kya for both species) than for HR loci (,350 kya for D. simulans/D. mauritiana and ,500 kya for D. simulans/D. sechellia), although the confidence intervals are broad. The method used to bootstrap sites had virtually no effect on the shapes of the distributions; therefore, only   Figures 1 and 2), distributions were broader than when loci were not re-sampled (closed markers). This indicates that to better capture stochastic error with these data, loci must be re-sampled. As noted earlier, levels of DNA sequence divergence among closely related species vary widely across loci. This is probably due, in part, to differences in selective constraint. It may also reflect stochastic variance in the time depths of the coalescent of the ancestral population into which the lineages from the extant species descend. Thus, the low divergence within the D. simulans complex at fourth chromosome loci could simply reflect stochastic variance in the ancestral population; it could be coincidental that ase, located at the tip of the X chromosome, also shows low divergence within the complex. However, low divergence at LR loci could also reflect a shallow coalescent in the ancestral population due to Hill-Robertson effects predicted for regions of low recombination (e.g., hitchhiking following selective sweeps, background selection) [12,13,26,27]. Still, if reduced divergence at LR loci reflects selective sweeps, and if gene flow ceased at the same time for all loci, then we should obtain an estimate of the isolation time (t) that is similar to that for HR loci.
Regardless of the bootstrapping approach, while there is overlap in distributions of bootstrapped estimates of t for LR and HR loci for both species pairs, the distributions are clearly significantly different (Mann-Whitney U test, p,10 26 ).
One interpretation of these findings is that LR loci were, in fact, genetically isolated more recently -i.e., gene flow between the incipient island species and incipient D. simulans continued longer for LR loci than it did for HR loci. If hybridization were occurring subsequent to initial geographic isolation of the species, this explanation would require that HR loci face a greater barrier to gene flow than do LR loci. This is possible if loci contributing to hybrid incompatibility are numerous and broadly distributed across the genome, but absent from the regions of low recombination sampled in this study. Coyne and Berry [11] have shown that the D. simulans fourth chromosome does not harbor loci contributing to sterility in hybrids with D. mauritiana and D. sechellia. In their study of D. mauritiana introgression into a D. simulans background, True, Laurie and Weir [4] found that markers associated with hybrid sterility were distributed broadly throughout the genome; however, introgression of markers located at the tip of the X chromosome did not reduce fertility. Similarly, introgression of the tip of the D. mauritiana X chromosome into a Under this interpretation of the findings, low divergence at LR loci could reflect gene flow, i.e., rejection of the SIM, whereby gene flow ceased at the same time for all loci, perhaps at the time of initial geographic isolation. We applied a pair of tests of the SIM that allow stochastic variance to the species pair for which polymorphism was estimated for all thirty loci (D. simulans and D. mauritiana) and were unable to reject a SIM (p = 0.6150 [28]; p = 0.1020 [2]). However, it should be noted that these tests of the SIM are conservative.

Sensitivity of 2-parameter model isolation time estimates to f
Given the expectation that d is = (1+f)tm+gh s , t can be estimated from polymorphism and divergence: Since not all data sets will include loci in two distinct classes that allow appropriate use of the 4-parameter model, and since the confidence intervals for f in our analyses are broad, it is worth exploring the effect of varying f on estimates of t. Figure 3 shows estimates of t under the 2-parameter model for LR and HR loci. Clearly, estimates of t are more sensitive to change at low values of f. Although the limit of t as f approaches infinity is zero, the curve is flatter. Since both island species have diverged more from D. melanogaster than has D. simulans, it is unlikely that true values of f are below 1.0 for either.

4-parameter model
An alternative interpretation of the findings from analyses using the 2-parameter model allows for gene flow to cease for all loci at the same time -i.e., it allows for a single SIM applicable to all loci. It requires that the ratio of coalescent time of (i) lineages descending from D. simulans and the island species within the ancestral population to (ii) lineages within extant D. simulans be greater for HR loci than for LR loci. This is expected if coalescent time is markedly reduced in LR regions (and essentially the same in extant D. simulans and the common ancestral population) and if the N e of the ancestral population is greater than that of extant D. simulans (i.e., if this ratio, g, is greater than 1). As N e of the ancestral population increases, the contribution of the coalescent in the common ancestor to total divergence increases proportionately. Birky and Walsh [24] predicted that recombination rate should have little effect on divergence; however, they noted the important exception: that the coalescent in the common ancestor becomes relevant when divergent taxa are very closely related. Consistent with Birky and Walsh [24], hitchhiking and pseudohitchhiking, by dramatically reducing N e in the common ancestral and extant populations, will reduce both total divergence between closely related species and nucleotide diversity within species in regions of low recombination.
Estimates of t, g, f 0 and f 1 are shown in Table 5. Distributions of the parameters (using ĥ p ) are shown in Figures 4 and 5 for analyses of D. simulans/D. mauritiana and D. simulans/D. sechellia, respectively.
The isolation times of both species are estimated at ,250 kya. The large differences obtained for estimates of t for LR and HR loci under the 2-parameter model are reconciled by a very modest decrease in the N e of extant D. simulans since isolation, with g estimated at ,1.2 for D. mauritiana and ,1.4 for D. sechellia using all loci. If the two island species arose at the same time, then g should be similar for both species, and there is considerable overlap in the confidence intervals. However, it is worth noting that if the species arose at different times, g can be different since N e can change over time for the ancestral population.
Thus, it is possible to reconcile the differences in divergence for LR and HR loci by allowing for an ancestral N e that exceeds that of extant D. simulans. In a constant N model, two randomly chosen lineages in a population should coalesce in pN generations (where p is the effective ploidy). Therefore, polymorphism (h s ) in extant D. simulans should be 2pN e m. Working back from the time of initial isolation, lineages in the ancestral population should have a mean coalescent time of gpN e generations. If, as introduced above, h is reduced in LR loci due to strong regional N e -reducing effects of selection, both in extant D. simulans and in the ancestral population, then if g exceeds 1, divergence would be disproportionately greater in HR genes than would be expected when g is 1.
In our analyses, g does not have to be much greater than 1.0 to explain the difference in divergence.
On the other hand, this reconciliation appears less reasonable when CG15216, the near-centromeric locus, is left out of the analyses. In contrast to the other five LR loci, divergence (per bp) between D. simulans and the island species at CG15216 is only slightly reduced (D.  (Table 4) and for all loci under the 4-parameter model (Table 5), i.e., between 100 and 200 kya. However, estimates of substitution rate correction factors rise quite a bit under the 4parameter model. Therefore, reconciliation of the different levels of divergence at LR and HR with the SIM intrinsic to the 4-parameter model requires much more highly elevated substitution rates along the lineages leading to the island species than are required by allowing different isolation times for the two classes of loci.

Inclusion of additional African D. simulans data
As noted earlier, the seventeen newly sequenced loci were sampled in four North American lines of D. simulans. African lines were generally included in the data sets of the thirteen previously analyzed loci. Given the possibility that inclusion of African lines could lead to increased estimates of h, we repeated some of the analyses with D. simulans sequence data included in the Drosophila  Population Genomics Project [29]. The ''vertical multiple alignment'' (VMA) files of syntenic assemblies for two Madagascar lines (md106 and md199) and one line from Kenya (c167.4) were downloaded from http://www.dpgp.org. FASTA files for each chromosome arm were generated using the perl program (vma2fas-ta2.pl) provided at the site. This program allows filtering of the VMA files in order to include only base calls that exceed a threshold quality score; for our analyses, this was set to 30. No data were available for the near-centromere locus CG15216 or for the chromosome 4 loci; however, little polymorphism was observed at these loci in the North American lines (or in D. mauritiana), so the lack of data from African lines probably does not affect our analyses. For egh, base calls with quality scores exceeding 30 were limited to 23.8% of sites in c167.4, 12.6% in md106 and 1.0% in md199. For ftz and pc, only 13.0% and 11.8% of sites were called, respectively, in c167.4. No bases were called for Prosa6 in any of the African lines. Since we only include sites that are resolved in all strains, we excluded these sequences. The analyses were first repeated with data from c167.4 (when available) and the Madagascar strain with the most base calls. They were then repeated with the single African strain with the most base calls. The latter always included more sites, but decreased the number of pairwise contrasts with African lines. As before, analyses were performed both with and without CG15216. Bootstrapping was performed by resampling loci and sites within loci as described earlier.
Parameter estimates, with percentage change from analyses without the added data, are shown in Table 6. In general, inclusion of African lines increased estimates of h, but did not increase estimates of divergence from D. melanogaster, indicating that the increased estimates of h do not reflect any systematic sequencing error. Instead, these estimates might reflect a decrease in h in North American populations, consistent with a proposed population bottleneck [30]. Consequently, estimates of g under the 4-parameter model, while still above 1.0, are reduced relative to those estimated prior to inclusion of the data from African lines. However, the isolation time estimates under the 4-parameter model are nearly unchanged (with percentage changes ranging from +1.2% to +4.6%). The isolation time estimates for HR loci under the 2parameter model are somewhat reduced. However, these remain significantly higher (Mann-Whitney U test, p,10 26 ) than estimates  for LR loci. Therefore, our findings are qualitatively unchanged. Perhaps of greater interest, it appears that a single isolation time estimate for HR and LR loci can be reconciled under a SIM by allowing an even more modest change in N e than indicated by analyses without the added data from African lines.

Sensitivity of parameter estimates to site type
The thirty loci used in this study differ in the proportion of synonymous, replacement and noncoding sites. It is reasonable to ask whether parameter estimates change if analyses are limited to a specific type of site. Parameter estimates for specific site types (using ĥ p ) are given in Table 7. Estimates of t from replacement sites tend to be lower, with concomitantly higher estimates of f. For the 4-parameter model, estimates of f are sometimes exceedingly high, particularly for replacement sites. If this simply reflected a marked reduction of the efficacy of purifying selection on amino acid sequences in the island species (perhaps due to reduced N e ), the model might be acceptable. However, given such high estimates of f from the current data for replacement sites, it is probably inadvisable to accept the corresponding estimates of t.  Table 6. Parameter estimates following inclusion of additional D. simulans data from African lines.

Analysis
t 0 (2-p) a f 0 (2-p) a t (4-p) g (4-p) f 0 (4-p) f 1 (4-p) Relevance to efforts to infer the history of the D. simulans species complex The evolutionary history of the D. simulans complex has been debated (e.g., [2,31]), meriting attention at least in part because of the importance of the complex as a model for speciation. A generally accepted strict bifurcating phylogeny has not emerged. In general, D. sechellia is more diverged at the DNA sequence level from the other two species, although this must be influenced to some extent by the higher substitution rate in D. sechellia. Our analyses indicate that the two island species arose at about the same time. In comparison to previous estimates scaled by a 3 mya divergence time for D. melanogaster and the D. simulans complex, the estimated isolation time for D. mauritiana is essentially unchanged (,250 kya). However, this date is considerably more recent than that estimated previously for D. sechellia [2]. This may be contributing to the difficulty in reconstructing the phylogeny of this species complex. Given that the dates are similar for the two island species, it also lends support to the possibility that they arose from a common ancestor itself isolated from the mainland species, a scenario previously suggested by Caccone et al. [32].