Evidence for Pervasive Adaptive Protein Evolution in Wild Mice

The relative contributions of neutral and adaptive substitutions to molecular evolution has been one of the most controversial issues in evolutionary biology for more than 40 years. The analysis of within-species nucleotide polymorphism and between-species divergence data supports a widespread role for adaptive protein evolution in certain taxa. For example, estimates of the proportion of adaptive amino acid substitutions (α) are 50% or more in enteric bacteria and Drosophila. In contrast, recent estimates of α for hominids have been at most 13%. Here, we estimate α for protein sequences of murid rodents based on nucleotide polymorphism data from multiple genes in a population of the house mouse subspecies Mus musculus castaneus, which inhabits the ancestral range of the Mus species complex and nucleotide divergence between M. m. castaneus and M. famulus or the rat. We estimate that 57% of amino acid substitutions in murids have been driven by positive selection. Hominids, therefore, are exceptional in having low apparent levels of adaptive protein evolution. The high frequency of adaptive amino acid substitutions in wild mice is consistent with their large effective population size, leading to effective natural selection at the molecular level. Effective natural selection also manifests itself as a paucity of effectively neutral nonsynonymous mutations in M. m. castaneus compared to humans.


Introduction
Several approaches have revealed evidence for adaptation at the molecular level. Local reductions in diversity, indicating selective sweeps, have been identified in populations of a number of species [1,2]. In Drosophila, reductions in neutral diversity have also been associated with increased divergence at amino acid sites, indicative of recurrent selective sweeps of advantageous amino-acid changing substitutions [3]. There is also evidence for a general reduction in diversity close to conserved sequence features in protein-coding genes and noncoding elements of hominids [4]. These reductions can be attributed either to genetic hitchhiking of positively selected alleles or background selection against negatively selected alleles. Higher F ST within genic than nongenic regions of the human genome suggests that genic regions are subject to local adaptation within populations [5], (although see [6]). Further evidence for adaptation has come from attempts to identify sites or genes subject to recurrent positive selection by looking for an excess of substitutions in sites of interest over that expected. For example, a nonsynonymous to synonymous divergence ratio exceeding one at a locus may be evidence for positive selection at nonsynonymous sites. An approach that tests for an excess of substitutions at selected sites is the McDonald-Kreitman test [7], which contrasts levels of polymorphism with divergence at selected sites (e.g., nonsynonymous sites) and linked putatively neutral sites (e.g., synonymous sites). A recent extension of this test quantifies molecular adaptation as the fraction of substitutions driven to fixation by positive selection (a) [8,9] by comparing the observed number of selected substitutions to the number expected, based on levels of polymorphism and divergence at neutral sites.
The application of derivatives of the McDonald-Kreitman test to amino-acid changing sites has resulted in a wide range of estimates of a, the causes of which may be multifaceted. Relatively high estimates of a have been obtained for enteric bacteria [10] and consistently high estimates have been obtained for Drosophila [9,[11][12][13], suggesting that a may be 50% or more in these species. On the other hand, estimates from yeast [14,15], Arabidopsis [16] and hominids have been low. In the case of hominids, several independent estimates have found a for hominids to be 13% at most (with the exception of an estimate by Fay et al. [8]) [17][18][19][20].
Some of the observed variation in estimates of a may also be attributable to differences in the methods used. Specifically, some estimates of a are compromised by slightly deleterious mutations, since these contribute proportionately more than neutral polymorphisms to diversity than divergence. If slightly deleterious mutations are prevalent and not properly accounted for then a could be substantially underestimated [21]. This may partially explain low estimates of a obtained using methods that do not incorporate explicit population genetics models (e.g. yeast [14,15] and Arabidopsis, [16]). Recently, improved methods to estimate a have been developed that model the contribution of slightly deleterious mutations to polymorphism and divergence [20,22].
However, estimates of a obtained from hominids are low, even when based on methods that attempt to model for slightly deleterious mutations [20,22], whereas estimates from Drosophila are high [22]. It is possible that this observation is a consequence of differences in effective population size (N e ). Drosophila melanogaster is estimated to have an N e of 1-2 million, whereas hominids seem to have an unusually low recent N e [23]. In D. miranda, which is estimated to have a lower N e than D. melanogaster, N e is probably still an order of magnitude larger than that for hominids, and a high estimate of a was observed [13]. The proportion of adaptive substitutions is expected to depend on N e for two reasons. Firstly, a higher proportion of both advantageous and deleterious mutations are expected to be effectively neutral in species with low N e , because selection for/against slightly advantageous/deleterious mutations becomes less effective (see [24]). In such species, a higher proportion of slightly deleterious mutations, and conversely, a smaller proportion of slightly advantageous mutations, are expected to contribute to divergence. Secondly, if the rate of adaptation is limited by the supply of mutations, then species with low N e will adapt more slowly simply because they have to wait longer for each new advantageous mutation to appear in the population.
Currently there are no estimates for rates of adaptive evolution of protein-coding genes in mammals other than hominids, particularly for species with higher N e . M. m. castaneus populations from NW India, a region believed to be part of the ancestral range of the house mouse sub-species complex [25], have silent-site diversity for the X-chromosome of the order of 1% [26]. When combined with an estimate of the per nucleotide mutation rate per generation for murids [27], this level of diversity suggests that M. m. castaneus N e is two orders of magnitude higher than recent N e of hominids, and comparable to N e typically seen in Drosophila. We hypothesised that the murid protein-coding genes would therefore show pervasive natural selection, whereas its impact is much reduced in hominid orthologs. We tested the hypothesis by estimating a from protein-coding genes of murid rodents by comparing nucleotide polymorphism data of M. m. castaneus sampled from the NW Indian population with nucleotide divergence to M. famulus and the rat.

Results/Discussion
To infer levels of negative and positive selection in murid proteincoding genes, we analysed nucleotide diversity within a sample of 15 wild, unrelated M. m. castaneus from the NW Indian population together with the nucleotide divergence between M. m. castaneus and either M. famulus or the rat. We sequenced amplicons from a sample of 77 autosomal loci that are part of the Environmental Genome Project (EGP) [28] (details of the genes sequenced are presented in Table S1). These loci are not a random sample, since they are associated with human genetic diseases whose susceptibility is influenced by environmental challenge. However, they show low rates of adaptive amino acid substitution that are typical of hominids [17][18][19][20]. Summary statistics concerning nucleotide diversity at intronic, 4-fold degenerate, 2-fold degenerate and 0-fold degenerate sites are shown in Table 1 and Table S2 and the allele frequency distributions (or site frequency spectra, SFS) are plotted in Figure 1. As expected, zero-fold degenerate sites have the lowest nucleotide diversity, lowest divergence, the most negatively skewed SFS, and the most negative estimate of Tajima's D, a statistic related to the skew in the distribution of allele frequencies [29]. This is consistent with purifying selection keeping most amino acid mutations at low frequencies and reducing the number of fixations. Nucleotide diversity is higher for synonymous than intronic sites, as is Tajima's D. Together with a slightly higher synonymous than intronic divergence between M. m. castaneus and M. famulus (Table 1), this suggests somewhat weaker purifying selection acting on synonymous than intronic sites in murids, and that synonymous sites are likely to be the most appropriate neutral reference.
Recent N e in wild mice, humans and Drosophila can be compared by equating synonymous site nucleotide diversity (h p ) to 4N e m, where m is an estimate of the mutation rate per site per generation ( Table 2). Using estimates of m based on synonymous site divergence and an assumption of two generations per year, our estimate of N e for wild mice of 580,000 is similar to that obtained for African D. melanogaster, whereas, in African populations of humans, N e is nearly two orders of magnitude smaller. Our estimate for M. m. castaneus is consistent with, although marginally higher than, a recent estimate of 400,000 (also assuming two generations per year) [30], based on smaller sample of loci. Nucleotide diversity in NW Indian M. m. castaneus is approximately one order of magnitude higher than

Author Summary
The prevalence of natural selection at the DNA level remains a controversial issue in evolutionary biology. In particular, estimates of the proportion of adaptive amino acid changes (a) vary greatly between taxa, being 50% or more in bacteria and fruit flies, but at most 13% in hominids. Here, we infer the frequencies of polymorphisms in protein-coding genes of 15 Mus musculus castaneus individuals sampled from the ancestral range of the house mouse species complex. By combining the polymorphism data with nucleotide divergence to the related murid species M. famulus and the rat, we obtain an estimate for a of 57%. This represents the first estimate of a for a mammal other than humans. The high rate of adaptive protein evolution in wild mice and other taxa implies that hominids may be somewhat unusual in having low rates of adaptive protein evolution. One possible cause of this is the low effective population size in humans, which is predicted to lead to less effective natural selection and fewer adaptive mutations. This is consistent with the higher frequency of nearly neutral deleterious amino acid mutations in hominids than murids that we infer in our analysis.  [26,30,31].
To estimate parameters of the distribution of fitness effects of deleterious amino acid-changing mutations we used a maximum likelihood (ML) procedure [32] that contrasts the SFS at putatively neutral sites (four-fold degenerate or intronic sites in this case) with sites assumed to be subject to purifying selection (nonsynonymous sites). The procedure fits a gamma distribution of deleterious mutational effects to the nonsynonymous SFS, and a demographic model to both the neutral and nonsynonymous SFSs that allows a step change in population size at some time in the past. The method assumes that positively selected mutations make a negligible contribution to polymorphism. Selective effects (s) of new amino acid mutations are estimated as the product of N e and s (see Materials and Methods for details of the method). Assuming four-fold sites as the neutral reference, estimates of proportions of amino acid mutations that have fitness effects in different N e s ranges under the best-fitting mutation effect distributions are compared in Table 3 for our M. m. castaneus data set, three African or African-American human data sets ( [28]; the ''Seattle SNPs'' Programs for Genomic Applications (PGA) [33]; the dataset of Boyko et al. [20]) and an African D. melanogaster data set [11]. Similar results are obtained if intronic sites are used as the neutral reference (Table 3). Nearly neutral deleterious amino acid mutations (i.e., mutations for which N e s,1), which have an appreciable chance of drifting to fixation, are relatively uncommon in both mice and Drosophila (10% and 6% of amino acid mutations, respectively), whereas they make up ,20% of amino acid mutations in humans (maximum P = 0.038 for mouse versus human comparison, see Table 4 for details; P = 0.25 for mouse vs. Drosophila comparison). Strongly deleterious mutations (N e s.10), which essentially never become fixed, are inferred to be somewhat more frequent in mice and Drosophila (79% and 87%, respectively; P = 0.21) than humans (,70%; P,0.05 for all contrasts with mice except one, see Table 4 for details). Whilst it is possible that these differences between the species in the relative frequencies of mutations in different N e s categories reflect differences in the distribution of absolute selection coefficients (s) between species, it is more likely that they reflect differences in N e . For example, a lower long term N e in humans would allow more deleterious mutations to segregate at higher frequencies than in either mice or Drosophila. ML estimates of the demographic parameters of the model using four-fold sites as the neutral reference imply that there has been a recent increase in N e in M. m. castaneus (Table S3), as well as African D. melanogaster [32].
A lack of neutral diversity in fast evolving genes has previously been interpreted as evidence for the effects of selective sweeps and therefore adaptation in Drosophila [13,34,35]. However, in contrast to these results, we found a nonsignificant positive correlation between synonymous site diversity and nonsynonymous divergence (Spearman r = 0.21, p = 0.084 for d N vs. h p and r = 0.16, p = 0.084 for d N vs h S ). Therefore, unlike in Drosophila, our data do not suggest that selective sweeps in genes undergoing high rates of adaptive evolution reduce local neutral diversity, although our relatively small data set limits the power of this analysis.
To further investigate evidence for adaptation we estimated the fraction of adaptive amino acid substitutions, a, between M. m. castaneus and either M. famulus or the rat by a method related to the McDonald-Kreitman test for adaptive evolution [7] that contrasts polymorphism with divergence [22]. The method attempts to account for nearly neutral amino acid mutations, which, when compared to strongly deleterious mutations, contribute proportionately more to polymorphism than divergence. The parameters of  The estimated fraction of adaptive substitutions is the difference between this expected number and the observed number of amino acid substitutions, scaled by the observed number (see [22] and Methods). Simulations suggest that the method produces close to unbiased estimates of a if the assumptions of the model are met, and is robust to substantial departures from the model assumptions, including complex demographic scenarios and linkage between sites [22]. However, in common with all McDonald-Kreitman based approaches, it is sensitive to long-term population size changes, a point that is discussed later.
Our estimate of a for wild mice, assuming four-fold sites as the neutral reference, is 57% (Table 3). This is somewhat higher than, but non-significantly different to, an estimate of 52% for African D. melanogaster with divergence to D. simulans (P = 0.54). However, the estimate for a in mice is very much higher than estimates for hominids for all the human polymorphism data sets, including the EGP data (P = 0.014 for a comparison only involving the subset of gene orthologs sequenced in mice using divergence to macaque) and PGA (P = 0.11 using divergence to macaque), and the data set of Boyko et al. [20] (P = 0.020 using divergence to chimpanzee). CpG dinucleotides have an elevated mutation rate in mammals and differ in frequency between coding and non-coding DNA. However, using only sites that are unlikely to be part of a CpG dinucleotide (non-CpG-prone sites) yields estimates of a that are similar to those based on all sites ( Table 5).
Estimates of a could also be affected by more complex demographic scenarios, such as admixture between differentiated Table 3. Estimated percentages of amino acid mutations in different N e s ranges and estimates of a, the fraction of substitutions driven to fixation by positive selection.   sub-species and/or population subdivision, that are not modelled in our algorithm. We tested for evidence of population structure or admixture using the program Structure [36] using one randomly sampled four-fold degenerate or intronic SNP per sequenced locus. For both intronic and 4-fold degenerate synonymous sites we found no evidence for population subdivision in the M. m. castaneus sample, since the ''no-admixture'' model gives P = 1 in all but one case for a number of populations parameter K = 1 (see Table S4 for details). However, under the ''admixture'' model there is better support for two populations (K = 2) than one population (Table S4), suggesting population subdivision. Figure 2 shows an ancestry plot for one randomly selected run that provided support for K = 2. In this plot each individual shows ancestries in both putative populations (with roughly equal proportions in the population), suggesting that they are admixed. However, we do not find any individuals that have ancestries in just one of the two populations (i.e., there are no individuals that are purely from population 1 or purely from population 2), suggesting that this result can be explained by a violation of an assumption in Structure, namely that genotype frequencies are at Hardy-Weinberg equilibrium. House mice are known to inbreed in the wild [37], potentially causing an elevated inbreeding coefficient (F is ). To determine whether such an effect can be observed in our sample, we calculated F is values [38] for each SNP using the program Genepop (http://genepop.curtin.edu.au/). We found a substantial excess of loci showing positive F is values, indicating a deficiency of heterozygotes and a deficiency of negative F is values, indicating an excess of homozygotes ( Figure S1). Thus, our sample shows evidence for inbreeding. In summary, our interpretation of these results is that the M. m. castaneus population shows no evidence for hidden population substructure or admixture between differentiated subspecies, but there is evidence that inbreeding is a feature of all individuals used in the study. We attempted to account for the effect of inbreeding, and therefore the possibility that alleles from the same individual are not independent samples from the population, by repeating the analysis for 20 datasets created by randomly selecting one allele from each individual for each site, such that each sequence analysed in each data set was a composite, derived from a single individual. We then calculated mean estimates of a by averaging over the 20 randomly generated datasets. When calculated by this method, our estimates of mean a are only marginally lower than estimates using the complete data set (see Table S5).
Estimates of a obtained using rat as the outgroup are generally somewhat lower than those using M. famulus, but are still close to 40% (estimates range from 0.33 to 0.51, see Table 5), suggesting that the estimate of a in murids is robust to the choice of outgroup. An earlier method to estimate a [8] attempts to remove the influence of nearly neutral deleterious mutations by excluding polymorphisms at a frequency below an arbitrary threshold (e.g., 10%). Estimates of a produced by this method are somewhat lower than the estimates from our method ( Table 6), but this is expected because estimates are likely to be downwardly biased [21]. However, they are substantially higher than estimates of a using this method in hominids [8]. If all sites are included in this analysis, irrespective of their frequency, estimates of a in wild mice are close to zero, or even negative (Table 6). This is likely to be due to slightly deleterious mutations, which contribute low frequency polymorphisms but have little chance of fixation, and lead to downwardly biased a estimates. Indeed even when low-frequency, segregating at a frequency of ,10%, are excluded, analyses suggest that estimates of a may still be downwardly biased.
Currently available estimates of a from a variety of species vary widely. The estimates of the fractions of adaptive substitutions in microbes [10], Drosophila [9,11,12] and now mice present a  serious challenge to the neutralist view of protein evolution. Taken together these results suggest that most amino acid substitutions are caused by positive selection, and that genetic drift is therefore not the most important cause of protein evolution. However, estimates of a obtained for yeast [14,15], Arabidopsis [16], and hominids (which are ,10% at most [17][18][19][20]) suggest the opposite. There are several possible explanations for these discrepancies. One possibility is that the estimates obtained for yeast and Arabidopsis are not based on an explicit population genetics model, and even though attempts have been made to reduce the impact of slightly deleterious mutations, the estimates may still be downwardly biased. Nevertheless, these results are hard to reconcile with estimates from microbes, Drosophila and mice, since both of these species are thought to have a relatively high N e . On the other hand, the low estimated proportion of adaptive substitutions in hominids may reflect their low N e , since this will increase the proportion of effectively neutral advantageous and deleterious mutations. Low N e will also reduce the rate of adaptive evolution if the rate is limited by the supply of mutations. This is consistent with the low recent N e estimates for humans [23], chimpanzees [39] and gorillas [40]. It is also possible that most adaptive evolution occurs in noncoding regions in primates [41]. Alternatively, changes in effective population size can lead to bias in the estimate of a [7,42]. It can be shown that if the true value of a is independent of N e , but that the current N e (which affects the level of polymorphism) is different to the average N e over the evolution of the species (which affects the level of divergence) then the relationship between the true (a true ) and estimated (a est ) values of a is given by [22], if the distribution of fitness effects is gamma, where l is the ratio of the current and ancestral N e and b is the shape parameter of the gamma distribution of mutational effects. Thus, a contraction in N e will lead to an underestimate of a and an increase will lead to an overestimate. It is therefore possible that the difference in the estimate of a between hominids and rodents is due to recent demography; if the current N e of humans was much smaller than the ancestral population size, and/or the current N e of M. m. castaneus was much larger than the ancestral, then a true could be very similar in the two species. Recent evidence suggests that ancestral great ape N e may have been substantially bigger than current [43]. So, for example, assuming b for humans is 0.2 [20,32,44], a est = 0.1 implies a true = 0.35 and 0.43 for 5-and 10-fold reductions in long-term N e , respectively (equation 1). However, we also infer from our polymorphism data that M. m. castaneus has undergone a recent increase in N e , although our evidence for this is modest. Nevertheless, assuming that current M. m. castaneus N e is 5and 10-fold larger than the ancestral N e , our estimates of a est = 0.57 and b = 0.31 (see Table S3) would imply that a true = 0.29 and 0.12 respectively (consistent with the estimates from humans). More estimates of a from other murid and mammalian species will help to determine whether the high rate of adaptive evolution we have inferred is widespread amongst murid species and therefore not an artefact of demography.

Selection of amplicons
DNA sequences were generated for genes sampled from a set whose human orthologs have been sequenced in the Environmental Genome Project (EGP) [28]. All genes sequenced as part of EGP are autosomal and many had polymorphism data available for African human populations at the time of searching, allowing us to make a direct comparison of the results we obtained in mice with results based on the same set of genes in a human population. Loci sequenced in Africans (618 as of 7th August 2007) whose orthologs could be identified in the mouse genome (585 genes, using NCBI Homologene) were considered. For 77 loci, DNA sequences were generated for the 15 M. m. castaneus individuals and one M. famulus individual. Primers were designed to amplify regions that captured coding and intronic DNA using Primer3 [45].

DNA sequences
PCR reactions were performed using GoTaq DNA polymerase (Promega) using a touchdown program consisting of 95uC for 15 minutes, followed by 28 cycles of 95uC for 30 seconds, 62uC for 45 seconds (reducing by 0.5uC every cycle), 72uC for 2 minutes, then 12 cycles of 95uC for 30 seconds, 52 uC for 45 seconds and 72uC for 2 minutes, with a final extension at 72uC for 10 minutes. Following evaluation on 1% agarose gels, products were purified using ExoSAP-IT (USB), or, if product indicated non-specific priming, the appropriate band was cut from a gel and extracted using Qiaquick gel extraction kit (Qiagen). Forward and reverse sequences were generated using Big Dye Terminator Sequencing Kits (Applied Biosystems) on an ABI Prism 3730 DNA Analyzer.
Sequence analysis and variant detection was carried out using CodonCode Aligner version 2.0.6 (http://www.codoncode.com/ aligner/). Sequences had an average Phred score of .60. All sequence traces were manually checked. CodonCode was set to highlight any site with a Phred score ,30 (which could include low quality sequence or heterozygous sites). All such sites were manually checked, but in order to avoid excluding heterozygotes, were not automatically excluded. Alignments between the 15 M. m. castaneus, the M. famulus individual and the M. m. musculus reference sequence were obtained. All alignments were manually checked before further analysis.
Processing of sequence data. Orthologous Rattus norvegicus sequences were obtained for each amplicon using a combination of approaches. Initially, the mouse reference sequence was BLASTed [46] against two different assemblies of the rat genome, downloaded from UCSC genome browser. The first was produced by the Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC) as part of the Rat Genome Sequencing Consortium, and the second (referred to as the alternative assembly) was produced by Celera Genomics. If no reciprocal-best-hit was found in the standard assembly, a reciprocal-best-hit search was done in the alternative assembly. The mouse and rat sequences were then aligned using MAVID [47], the alignments checked by eye, and poorly aligned sections masked in rat. If the reciprocal-best-hits approach failed to identify an orthologous section in rat, the relevant section from the ''multiz30way'' whole genome sequence alignments of 30 vertebrates (available at the UCSC genome browser http:// genome.ucsc.edu/) was checked. If the sequence of interest was located entirely within a single unbroken alignment for mouse and rat, then the rat sequence was considered as orthologous, and the relevant section of the alignment was realigned, checked and masked as before. Using this procedure, we successfully identified rat sequences orthologous to at least part of each mouse amplicon.
Alignments for each amplicon were generated between M. m. musculus release mm9, the M. m. castaneus individuals, M. famulus and rat. Then, using the M. m. musculus annotation, sites were categorised as ''0-fold'' (0-fold degenerate sites, or nonsynonymous), ''2-fold'' (2-fold degenerate sites), ''4-fold'' (4-fold degenerate synonymous sites, or synonymous) or ''intron''. We excluded potential splice sites (defined as the first 6bp or last 16bp of an intron). Sites were also scored for their CpG-prone status (defined as being preceded by a C or followed by a G in any species).
Summary statistics. We calculated two standard estimates of diversity, h S , the number of segregating sites divided by a normalising factor a n~X n{1 i~1 1=i ð Þ, where n is the number of sequences [48], and h p , the average heterozygosity, also known as nucleotide diversity or p [49]. Both estimators, when normalised per site, are unbiased estimates of 4N e m (under a Fisher-Wright neutral of drift and an infinite sites model of mutation, such that each site experiences no more than one mutation). However, they are expected to differ if there is an excess or deficit of low or high frequency variants. This difference and therefore the level of skew in the SFS can be quantified with Tajima's D [29]. However, since D can only be calculated when the same number of chromosomes has been sequenced at each site (and this is not the case in the data presented here, see Table S6), we sampled sequences 20 times without replacement at each site (rejecting sites with fewer than 20 sequences) such that the number of chromosomes sampled per site was constant. We also calculated divergence between the M. m. castaneus sequences and both M. famulus and rat sequences using a Jukes-Cantor correction for multiple hits. 95% confidence intervals for h p , h S , Tajima's D and divergence were calculated by bootstrapping 1,000 times by locus.

Inference of the distribution of fitness effects of new amino acid mutations
We extended a maximum likelihood approach [32] to estimate parameters of the distribution of fitness effects of new amino acid mutations using the allele frequency distributions (the site frequency spectra, SFSs) for 0-fold and putatively neutrally evolving sites (either 4-fold or intronic sites). We assumed that effects (s) of amino acid mutations are unconditionally deleterious, and sampled from a gamma distribution with shape and scale parameters a and b, respectively. These parameters were estimated along with the fraction of unmutated sites, f 0 , and demographic parameters N 1 , N 2 and t, corresponding to ancestral population size, current population size and the number of generations since a population size change, respectively. Polymorphism data were summed across loci using folded SFSs. We extended the method to allow variation in the number of alleles at each site. We generated SFSs for sites with the same numbers of alleles, computed the log likelihood for each SFS, and summed these to compute the overall log likelihood. Selective effects are estimated on a scale Ns, where N is a measure of the population size at the time that the polymorphism data are censured. Under the assumption of a single step change in population size, there may be little information to estimate the relative values of the population size before (N 1 ) and after (N 2 ) the size change if, for example, t&N 2 or t%N 2 . We therefore computed a weighted recent population size from Estimating the proportion of adaptive substitutions, a We estimated a for 0-fold substitutions using a method that attempts to account for the segregation of slightly deleterious mutations and recent population size changes [22]. The divergence at neutrally evolving sites (d S ) is proportional to the mutation rate per site. At selectively evolving sites, the expected divergence due to fixation of deleterious mutations is proportional to the product of the mutation rate and the fixation probability, u(N, N e s), of a new mutation [50]. Defining d N as the observed divergence at the selectively evolving sites, and equating N with N e (because these are equivalent under the transition matrix method under which population size is estimated) a is proportional to the difference between the observed and expected divergence: (note that d S and ds are different quantities). a was initially estimated using all sequenced alleles (i.e. a total of 30 alleles, two per individual).
To test whether our estimates of alpha could be biased as a result of assuming that different alleles from the same individual are independent, we also estimated a using only a selected and neutral reference sequence for each individual. Specifically we created 20 data sets in which, for each individual and for every site, we randomly picked a single base from the individuals' two alleles. Each data set therefore consisted of up to 15 selected/ neutral reference sequences, each one corresponding to a different individual. Mean estimates of a were then computed by averaging over the 20 randomly generated datasets.
We also used Fay, Wyckoff and Wu's [8,9] extension of the McDonald-Kreitman test to estimate the fraction of 0-fold substitutions driven to fixation by positive selection, a FWW : where D N (P N ) and D S (P S ) are numbers of divergent (polymorphic) sites for selected and neutral classes, respectively, and the summation is over genes. To reduce the influence of nearly neutral alleles on a FWW , we excluded sites where the rare variant was at a frequency of 10% or less. This method assumes that the number of alleles sampled is constant, so we again sampled the alignments to give 20 alleles per site and ignored sites that lacked an orthologous base in the outgroup. 95% confidence intervals for all statistics were obtained by bootstrapping 1,000 times by locus.

Inference of population structure
House mice, especially those originating from the ancestral range, could exhibit a complex genetic composition, reflecting either incomplete lineage sorting or admixture between the different subspecies [30,51]. We used our multilocus SNP dataset to determine if the population sample of M. m. castaneus that was used in our study shows evidence for admixture or hidden population substructure. We randomly selected one SNP from each amplicon, excluding SNPs that cover known splice sites and only including SNPs where at least 10 of the 15 individuals could be sequenced. We excluded any SNPs covering indel sites. We separately analysed SNPs from intronic sites and SNPs from 4-fold degenerate sites. Altogether 84 (intronic data) or 82 (4-fold degenerate sites) unlinked SNP loci were included in the analysis.
We used the program Structure [36] to identify the presence of different subpopulations in the sample, if any, and to estimate the ancestry of the sampled individuals in each of these subpopulations. The number of subpopulations is inferred by calculating the probability P(X|K) of the data given a certain prior value of K (number of subpopulations) over a number of Monte Carlo Markov Chain (MCMC) iterations. The posterior probabilities P(K|X) can be calculated following Bayes' rule. The subpopulations are characterised by different allele frequencies, and, according to their multilocus genotypes, individuals are probabilistically assigned to one or more subpopulations. The scores of individuals in the subpopulations correspond to the probability of ancestry in any one of them. In this study we assumed prior values of K from 1 to 4. We considered two models for the ancestry of individuals. In the first, the ''no-admixture model'', individuals are assumed to be drawn purely from one of K populations. In the second, the ''admixture model'', individuals are allowed to have mixed ancestry, that is, some fraction of an individual's genome comes from different subpopulations. Both of those models assume that all the markers are unlinked and provide independent information on an individual's ancestry. Inferences of the number of subpopulations and ancestries of individuals are based on 1,000,000 iterations of the MCMC, after a ''burn-in'' period of 100,000 iterations. We ran the program without incorporation of prior population information. We performed 3 independent runs of the Markov chain for each parameter set to check for convergence of the chains.

Table S5
Estimates of the fraction of substitutions driven to fixation by positive selection obtained using only a single allele from each individual. Mean estimates of a were computed by averaging over the results from 20 randomly generated datasets, where each data set contains a single sequence for each individual constructed by sampling a single base from the individuals' two alleles at every site. Calculations are performed using two different classes of sites, both rat and M. famulus as outgroups and using both 4-fold degenerate synonymous sites and intron sites as the neutral reference.