Molecular Evolution of Immune Genes in the Malaria Mosquito Anopheles gambiae

Background As pathogens that circumvent the host immune response are favoured by selection, so are host alleles that reduce parasite load. Such evolutionary processes leave their signature on the genes involved. Deciphering modes of selection operating on immune genes might reveal the nature of host-pathogen interactions and factors that govern susceptibility in host populations. Such understanding would have important public health implications. Methodology/Findings We analyzed polymorphisms in four mosquito immune genes (SP14D1, GNBP, defensin, and gambicin) to decipher selection effects, presumably mediated by pathogens. Using samples of Anopheles arabiensis, An. quadriannulatus and four An. gambiae populations, as well as published sequences from other Culicidae, we contrasted patterns of polymorphisms between different functional units of the same gene within and between populations. Our results revealed selection signatures operating on different time scales. At the most recent time scale, within-population diversity revealed purifying selection. Between populations and between species variation revealed reduced differentiation (GNBP and gambicin) at coding vs. noncoding- regions, consistent with balancing selection. McDonald-Kreitman tests between An. quadriannulatus and both sibling species revealed higher fixation rate of synonymous than nonsynonymous substitutions (GNBP) in accordance with frequency dependent balancing selection. At the longest time scale (>100 my), PAML analysis using distant Culicid taxa revealed positive selection at one codon in gambicin. Patterns of genetic variation were independent of exposure to human pathogens. Significance and Conclusions Purifying selection is the most common form of selection operating on immune genes as it was detected on a contemporary time scale on all genes. Selection for “hypervariability” was not detected, but negative balancing selection, detected at a recent evolutionary time scale between sibling species may be rather common. Detection of positive selection at the deepest evolutionary time scale suggests that it occurs infrequently, possibly in association with speciation events. Our results provided no evidence to support the hypothesis that selection was mediated by pathogens that are transmitted to humans.


Introduction
Infection in a susceptible host leads to parasite development or amplification, enabling disease transmission. In a resistant host, parasite development is halted. As pathogens that circumvent the host immune response are favoured by natural selection, so are host alleles that reduce parasite load. Such evolutionary processes leave their signature on the molecular makeup of the genes involved. In vertebrates, analysis of genetic diversity of the MHC (HLA in humans) genes showed that selection maintains exceptionally high allelic diversity [1][2][3]. Similar patterns were found in several members of the pathogen recognition encoding R gene family of plants [4]. Diversifying selection on these genes fits well with their known role in immune recognition, confirming that selection maintains ''excess'' (and ancient) alleles that differ in their capacity to recognize pathogens [5] by frequency dependent or overdominant balancing selection. If alleles conferring resistance to infection reduce the fitness of uninfected individuals, it is possible that balancing selection will maintain resistant and susceptible alleles as if they both conferred resistance to specific pathogens [4,6]. An alternative scenario for host-pathogen interactions is the arms race [7], in which a series of selective sweeps alternate in pathogen and host populations, reflecting host genotypes that confer resistance and pathogen genotypes that facilitate infection. Selective sweeps reduce diversity within populations but enhance inter-population diversity. Unlike purifying selection, an arms race will be associated with a higher rate of substitutions that results in amino acid (aa) changes (K A ) over that resulting in synonymous substitutions (K S ) in alleles from different populations [8,9]. Evidence for this form of positive selection has been found in surface antigens of many pathogens including Plasmodium spp. [1,10,11]. Molecular evolution of insect immunity genes has been studied primarily in Drosophila. Most studies have revealed weak evidence for adaptive evolution in general and especially in antimicrobial peptides [12][13][14]. Evidence of diversifying selection, as exemplified by the vertebrate MHC locus, was not found in these studies, and the arms race scenario was rarely supported. Studies on mosquito immune genes are in their infancy [15][16][17][18], and findings to date echo those on Drosophila. Understanding the forces and factors that govern pathogen susceptibility in host populations remain enigmatic [19][20][21][22][23] especially in arthropods whose innate immunity is thought to be their prime defense [24,25]; many of which transmit pathogens to humans and domestic animals. Increased understanding of arthropod-pathogen relationships would have important public health implications for vector-borne diseases.
Recent advances in understanding the immune system of insect disease vectors have resulted in the identification of many genes whose products play key roles in these responses [26][27][28][29]. We selected four genes encoding molecules with different roles in the immune response mounted against eukaryotic and prokaryotic pathogens (Table 1). They include genes coding for defensin, gramnegative bacteria-binding protein (GNBP), a serine protease gene (SP14D1) and gambicin. These genes were implicated in An. gambiae responses to infection including with Plasmodium parasites (Table 1), although they probably do not include the main determinant locus of the mosquito natural susceptibility to malaria; which remains unknown to date.
Here, we describe and decipher patterns of molecular variation at each gene within and between populations and sibling species of Anopheles gambiae, the principal vectors of malaria in Africa. We evaluate if different modes of selection shaped variation on these genes, and assess whether selection could be mediated by mosquito-transmitted human parasites i.e., selection by the protozoan agent of malaria, Plasmodium falciparum and the nematode agent of lymphatic filariasis, Wuchereria bancrofti. Here, we extend our limited study on defensin (Simard et al. 2007), while including the defensin data to enhance the scope of the current analysis. Comparing signatures of selection based on intrapopulation data, between conspecific populations and between sibling species, as well as between distant Culicid taxa (over 100 mya) might provide insights into the modes of selection operating on different time scales.
To evaluate selection mediated by ''human'' pathogens, we contrast patterns of molecular variation between anthropophilic vector (An. gambiae s.s. and An. arabiensis) and zoophilic non-vector (A. quadriannulatus) sibling species (Table 2). Similarly, we included four A. gambiae populations that differ in their exposure to human pathogens and span the range of geographical and genetic distances within this species e.g., [30][31][32][33][34]. For example, the transmission of W. bancrofti by An. gambiae and An. arabiensis is very high in Nigeria and moderate in eastern Kenya, but it is nonexistent in western Kenya and Senegal (Table 2). Betweenpopulation variation in exposure to these pathogens is expected to correlate with selection pressure mediated by them. If selection mediated by human parasites dominated the evolution of a gene, we predict that divergence between anthropophilic species (An. gambiae and An. arabiensis) will be small in functional domains (e.g., exons), but high in neutral domains (e.g., introns) of the same gene, whereas, divergence between anthropophilic and zoophilic (An. quadriannulatus) species will be high across all domains. Likewise, we predict that patterns of within-gene differentiation between An. gambiae populations will be correlated with their exposure rate to human pathogens.

Results
Population characteristics are summarized in Table 2. Examination of protein variation might help delimit the modes of selection although it is less amenable for statistical tests. Therefore, variation in the mature protein (excluding signal peptide and cleaved domain, Table 1) is briefly described. No length variation was found across species in all proteins encoded by each gene. A single mature protein was shared across all three species in defensin ( Figure 1, Table S1). The two common proteins found in gambicin ( Figure 1, Table S1) were also shared across all three species. Two of the three common proteins of SP14D1, were shared between An. gambiae and An. arabiensis and one was shared between An. gambiae and An. quadriannulatus ( Figure 1, Table S1). In GNBP, however, protein diversity was large ( Figure 1, Table S1). Within populations, typically only one or two proteins had a frequency greater than one. Such common proteins were separated by only 1-2 aa changes from each other, whereas 1-3 aa changes separated all proteins from the most common one in that population (Table S1). With the possible exception of GNBP, these patterns are inconsistent with selection for hypervariability. Neutral evolution may explain protein variation in gambicin, SP14D1, and even in GNBP, because increased protein diversity in Anthropophily b Very low [71] Moderate [71,72] High [31,71,72] High [32,71] High [33,71] High [70,71] Local malaria transmission c None [71] Moderate 400 [31] High 400 [31] Low 10 [32] Moderate 120 [73] Moderate 260 [70] Local filaria Transmission d None [71] None None Moderate [34] High [33] None a Collection method included IR: Indoor-resting adult mosquitoes collected by pyrethrum-spray or aspiration; IR-bednet: blood fed and blood-seeking females collected by aspiration from net traps hung over the beds of sleeping volunteers; and HL: blood-seeking mosquitoes were collected by human landing catches. b Refers to the mosquito preference to feed exclusively on human blood.  GNBP is expected, under neutrality, due to its length (Table S1). The lack of protein diversity across species in defensin, however, suggests that purifying selection is involved (Simard et al. 2007). Protein diversity in the zoophilic An. quadriannulatus showed no distinct features compared to those of the anthropophilic An. gambiae and An. arabiensis.

Within Population Genetic Diversity
A sliding window examination of nucleotide diversity across the genes revealed over a ten fold difference between maxima and minima of every species (Figure 2). Diversity in coding regions was significantly lower than that in non-coding regions for every gene in all populations (except defensin in An. arabiensis and An. gambiae from Senegal, Figure 3), in accordance with purifying selection. Diversity at non-coding (NC) regions differed significantly among genes (at all populations except An. gambiae from Senegal), but it did not predict among-gene diversity in coding regions, which did not differ significantly in any population ( Figure 3). The correlation between recombination rates (between neighboring nucleotides) and nucleotide diversity in the coding region was not significant (r = 0.-19, P.0.38, df = 1/22), as was the total diversity (Table 3). High NC diversity and low coding diversity (e.g., SP14D1) is consistent with purifying selection, but where NC diversity is also low (e.g., GNBP, gambicin), positive selection, i.e., a recent selective sweep, cannot be ruled out.
Under neutrality, a similar pattern of polymorphism is expected across functional regions. Comparing site frequency spectra between coding and non coding regions provided a comprehensive test of that variation. Frequency spectra were grouped into 'rare alleles' (singleton sites), 'moderate alleles' (sites where the rare nucleotide numbered two or three), and 'common alleles' (sites where the rare nucleotide was observed four or more times). Invariant sites were included to accommodate total length variation between regions. Contingency table analyses were used to assess the effect of functional region (coding vs NC), population, and their interactions on the frequency spectra. Within population differences in the polymorphism spectra between coding and NC regions were highly significant across all populations (P,0.01, Table 4). Heterogeneity x 2 tests showed no differences between populations (P.0.1) in all genes, providing no indication for local adaptation regardless of exposure to human pathogens, ie., comparing the zoophilic An. quadriannulatus with the anthropophilic An. gambiae and An. arabiensis. In coding regions, moderate and rare allele frequencies were particularly reduced (Table 4), as expected under purifying selection because it acts more strongly against rare polymorphisms, which include most deleterious mutations. Reduction in the frequencies of all allele classes (including common alleles) as detected in the coding regions of SP14D1and GNBP (Table 4) could indicate severe constraints or positive selection.

Within Population variation in Synonymous and Nonsynonymous Sites
Diversity of nonsynonymous (K A ) sites was lower than that of synonymous (K S ) sites across species in all genes, although, it was not significantly lower in gambicin (and GNBP in An. arabiensis,   Table 3). Heterogeneity among species in K A /K S ratios was detected (Table 3; P,0.029, ANOVA, F = 6.8, df = 2/6), but contrary to expectations based on the degree of anthropophily, this ratio was higher in An. arabiensis than in An. quadriannulatus (An. gambiae was intermediate despite being most anthropophilic species). Heterogeneity among genes in K A /K S ratios (Table 3; P,0.007, ANOVA, F = 11.0, df = 3/6) showed higher ratios in gambicin (across species). Higher K A /K S ratio in gambicin may reflect elevated K A due to the low intensity of purifying selection (relaxed constraints). However, K A did not differ among genes (P.0.5, ANOVA, F = 0.6, df = 3/6) and gambicin's K A was ranked the second highest. To evaluate if K S of gambicin was reduced, we used a covariance analysis regressing diversity in synonymous sites over diversity in nonsynonymous sites, species, and gene. Contrary to relaxed constraints, K S of gambicin -was lower than that in all other genes (P,0.048, multiple least squares means comparison test). Since relaxed functional constraint does not account for these results, a better explanation is provided by negative balancing selection (see below).

McDonald Kreitman Test
The McDonald Kreitman test (1991) compares the ratios of fixed to polymorphic substitutions of nonsynonymous (NS) and silent (both synonymous and NC) substitutions between species. These fixation rates are expected to be equal under neutrality, whereas positive selection is expected to increase the fixation rate Table 3. Nucleotide diversity (p610 23 ), number of polymorphic sites (S), recombination parameter between adjacent position (R = 4Nr)610 23 , and ratio of nucleotide diversity in nonsynonymous/synonymous sites (v = K a /K s ) in coding regions in each population.  Table 4. Frequency spectra in coding (C) and non-coding (NC) regions across species at each gene.   in NS sites. The test could not be performed between An. gambiae and An. arabiensis because there were no fixed differences between them across all four genes (Table 5) in accordance with other evidence suggesting gene exchange (introgression) between them [46][47][48]. Departures from neutrality were detected only in GNBP in comparisons of both species with An. quadriannulatus (Table 5). In both cases, the ratios of fixed to polymorphic sites were lower in NS sites than those in silent sites. These results are inconsistent with positive selection operating by fixing different aa in each species at GNBP. Notably, the fixation rates of NS substitutions were not lower than those of other genes, as might be expected under purifying selection. Instead, the rates of fixation of silent substitutions were substantially higher than those of the other genes -as if positive selection operated on silent, rather than on NS substitutions in GNBP.

Divergence/Differentiation Between Species and Populations
Within-gene heterogeneity in divergence, measured by F ST , is evidence for selection [52]. Heterogenic differentiation across functional domains of the same gene were observed in five out of twelve tests (P,0.05 in individual test, and Binomial multiple test: P,0.0002). In all comparisons, divergence in coding regions (including all polymorphic sites) was lower than that in NC regions. The most pronounced heterogeneity was observed in gambicin across all three species pairs, but a similar, less extreme pattern was found in GNBP, and in SP14D1, between An. arabiensis and An. gambiae (Figure 4). Zero divergence (F ST = 0) in the coding region of gambicin as opposed to its high divergence in intronic and flanking regions (F ST .0.4, Fig 4) is remarkable, given that the polymorphism in the coding region was comparable to other genes (Table 3 and 4). The divergent ''haplotype'' of synonymous mutations and the distinct introns across species (not shown) do not support positive selection driving one allele across all three species.
Despite the dramatic within-gene heterogeneity in divergence observed in gambicin, HKA tests [42] between coding and noncoding regions were not significant in all four genes. Insignificant results persisted even when the average number of substitutions per site between species (Dxy) in the coding region was set to zero, indicating that the test had low power [8].
Within-gene heterogeneity in differentiation between An. gambiae populations, was detected in five out of 24 tests ( Figure 5; P,0.05 in individual test, and Binomial multiple test: P,0.006). Heterogeneity across functional domains of the same gene were observed in SP14D1 (3 tests) and GNBP (2 tests), whereas differentiation in gambicin was minimal across its functional regions. In genes showing heterogeneity, differentiation in coding region(s) was lower than corresponding NC region(s), as expected under balancing selection. Contrary to predictions (see Introduction), differentiation heterogeneity pattern was not correlated with population exposure to human pathogens. Instead, it appears to be correlated with the overall magnitude of differentiation between populations, probably reflecting higher power to detect heterogeneity when expected (neutral) differentiation is high.

Selection During Culicidae Evolution
Tests of positive selection were performed using the codeml program in the package PAML 3.15 [49] based on gene trees of members of the Culicidae. Counting nonsynonymous and synonymous substitutions separately in every codon along the branches of the tree, the likelihood of positive selection (v.1, where v = K A /K S ) is estimated allowing for heterogeneity in the mode and intensity of selection among codons. Considering that the time of divergence between the Culicinae and the Anophelinae exceeds 100 my [53], this analysis was aimed at evolutionary changes that occurred on a considerably ''deeper'' time scale than previous analyses, based on variation within and between populations of sibling species.
Positive selection was not detected for Defensin, GNBP and SP14D1 (Table 6). Strong evidence for positive selection, however, was found at gambicin, where v exceeded 11 at one codon (codon 72, Table 6). Six variants of the mature protein were observed among 64 sequences representing members of the An. gambiae complex and three of these variants were common (frequency.2, Figure 1). All three common proteins had substitutions in the same codon. Phenylalanine and valine were shared by all three members of An. gambiae, whereas isoleucine was found only in An. gambiae. An. funestus had a similar nonpolar aa -leucine. Unlike these variants, formed by conservative substitutions, An. darlingi shared the polar aa tyrosine with Culex pipiens (and Cx. quinquefasciatus), whereas Aedes aegypti and Armigeres subalbatus had alanine in this site. Amino acid diversity in this site was exceptionally high both within An. gambiae and between distant taxa, but reversal mutations were not common.

Discussion
Variation in the susceptibility to pathogens in insects and to malaria parasites in mosquitoes has been amply demonstrated  [27,[54][55][56][57] and immunity factors have been repeatedly linked to the variation in susceptibility [26][27][28][29]. Drosophila innate immune genes diverged between species (on average) faster than nonimmune genes but no evidence for positive balancing selection maintaining higher protein diversity (hypervariability) has been found by most studies [13,14,58,59]. In addition, only a few examples of positive selection have been described [12,14,60], providing support for the arms race or the diversifying selection models of insect-pathogen interactions. Similarly, recent studies on mosquitoes detected only faint signals of positive selection or none [15][16][17][18]. We described and analyzed polymorphisms in four mosquito immune genes to decipher selection effects, presumably mediated by pathogen-mosquito interactions. Inference on selection relied on within-gene heterogeneity i.e., in synonymous vs. nonsynonymous substitution rates. Within-gene heterogeneity is not confounded by factors such as demographic history, introgression, shared ancestral polymorphism and inversions which are known to confound comparisons between genes. Focusing the analyses on different taxonomic units afforded the opportunity to examine processes that have shaped genetic variation at several evolutionary time scales. Our main results can be summarized as follows. At the most contemporary time scale, probed by within-population variation, purifying selection alone was detected. At a deeper time scale, probed by between populations and sibling species variation, signatures of negative frequency-dependent balancing selection were detected on two (maybe three) genes. At the deepest time scale, spanning anopheline evolution, positive selection was detected on a single gene -gambicin. Our evidence does not support the hypothesis that selection was mediated by pathogens that are transmitted to man.
At the most contemporary time scale, intra-population polymorphisms revealed ample evidence for purifying selection on all genes. This evidence included lower diversity in coding vs. NC regions, a deficit of rare and moderate frequency SNPs at the coding regions, and K A /K S ratios below one across all populations. An inconclusive signal of negative balancing selection was detected on gambicin by an elevated K A /K S ratio (0.4, not statistically lower than one) due to reduced K S .
At a slightly longer time scale, intra-species variation revealed reduced differentiation at the mature protein compared with the same gene's NC regions (GNBP and SP14D1). Within-gene heterogeneity among functional regions in differentiation in both genes persisted at inter-species level between An. gambiae and An. arabiensis. Such heterogeneity cannot be explained by variations in mutation, recombination, introgression, or shared ancestral polymorphism because these effects are unlikely to be divided among functional domains of the same gene. Given considerable polymorphism within-populations (Tables 3 and 4), purifying selection poorly explains the observed pattern because it affects polymorphism and divergence rather than divergence alone. Correspondingly, the same significant pattern was obtained by bootstrapping the average number of substitutions per site (Dxy, not shown). The observed pattern is better explained by balancing selection on coding regions [52] regardless if the selection operated before or after speciation (see below about alternative explanations). Patterns of divergence between sibling species, extending the time scale of analysis, showed remarkable heterogeneity among functional regions of gambicin across all three species pairs, with over ten fold reduced divergence in coding as opposed to NC regions. Likewise, frequency dependent (negative) balancing selection provides a compelling explanation for the MK test on GNPB between An. quadriannulatus and both An. gambiae and An. arabiensis, showing high rate of fixation of synonymous substitutions. Accordingly, the aa under selection remain protected from loss because selection increases their frequency as they become rare, but consequent fluctuations in protein frequencies increase drift and fixation of partially linked silent substitutions. GNBP's high protein diversity and its role in pathogen recognition fit well with this explanation. Nonetheless, positive selection on silent substitutions affecting transcription and expression cannot be ruled out, although it is unlikely.
Whether these results can be more parsimoniously explained by neutral or purifying selection needs to be addressed, especially because the HKA test, applied to coding and NC regions of each gene detected no significant results. Notably, the HKA test considers independent genealogy for each ''gene'', even though this does not apply for exons and introns of the same gene. Thus, it appears to be overly conservative for within gene testing. Clearly, significant heterogeneity in differentiation and divergence among functional regions of the same gene cannot be reconciled with a neutral explanation. Purifying selection due to functional constraints limits variation in coding regions by removing deleterious mutations. Hence, it limits both polymorphism and divergence, but the fewer neutral (e.g., synonymous) or minimally deleterious mutations that attain moderate or high frequencies are subject to drift -similarly to mutations in NC regions. Therefore, unless polymorphism in the mature protein is near zero, purifying selection primarily limits the number of polymorphic sites, whilst drift continues to shape differentiation and divergence as it does for neutral loci. Strong purifying selection might even increase drift in coding regions and so, elevate differentiation due to smaller effective population size. Because polymorphism in the coding regions was not exhausted as our data showed, purifying selection cannot explain the ten fold reduced divergence in coding as opposed to NC regions at gambicin. In other words, why has the strong drift on NC regions (F ST .0.4) not fixed the common multiple proteins shared across species? Likewise it cannot explain why heterogeneity in divergence was not observed in defensin despite being subjected to purifying selection more than the other genes as indicated by finding a single mature protein across all species (see also Simard et al. 2007).
At the longest time scale, spanning over 100 my of Culicidae evolution [53,61], PAML analysis detected strong positive selection on gambicin. At a single codon, nonsynonymous mutations occurred at a rate over 10 fold higher than the rate of synonymous mutations. No evidence for positive selection was detected in the other genes.
Consistent with previous studies on vectors, our results confirm that purifying selection is the most common mode of selection operating on immune genes [15][16][17][18] as it operated on all genes at the contemporary time scale. Signatures of negative frequencydependent balancing selection were detected at least on gambicin, and GNBP during recent evolutionary time scales, suggesting that a Figure 4. Divergence between species measured by F ST in functional regions of each gene. The 95% CI were estimated by bootstrapping over positions (1000 bootstrap replications) provided that there were ten or more variable positions in that region across the pair of populations compared. An. gambiae is represented by its western Kenya population (GA). Defensin, gambicin, GNBP, and SP14D1 are denoted by Df, Gm BP, and SP, respectively. NC denotes noncoding regions, C denotes coding regions, F denotes flanking regions, I denotes intronic region, M denotes mature protein, and SC, denotes signal and cleaved propetdide segment. doi:10.1371/journal.pone.0004549.g004 diverse (maybe fluctuating) body of pathogens mediate balancing selection to maintain several alleles of immune genes. Positive selection was detected at the longest time scale spanning over 100 my on gambicin, suggesting that an arms race occurs rather rarely in accord with previous studies that detected no positive selection on recent evolutionary time scales [15][16][17][18]. Positive selection may be associated with speciation events following exposure to new pathogens. The low specificity of an innate system faced with myriad targets may constrain evolution of immune genes because enhanced defense against one pathogen may reduce defense against another [23,62]. Clearly, such interpretations based on an exploratory investigation using four genes and a few species are merely tentative. These results add to the growing body of studies on immune genes of vector species that found little evidence for positive or classical diversifying selection [15][16][17][18] and of other insects [13,14,58,59].
Finally, our results do not support the view that selection on these genes was mediated by human pathogens because overall, patterns of genetic variation are homogenous across the zoophilic An. quadriannulatus and the anthropophilic An. gambiae and An. arabiensis as well as across population of An. gambiae that differ in their exposure to human pathogens. Contrasting these results with corresponding patterns from the gene(s) that confer resistance to human pathogens might provide useful insights on Plasmodiumvector interactions. Identification of such gene(s) appears to be very near.

Mosquito Samples
Anopheles gambiae mosquito collections were made between 1994 and 1999 (Table 2). Collection sites include Asembo Bay in western Kenya, Jego in eastern Kenya, Gwamlar in central Nigeria, and Barkedji in Senegal. For brevity, population names used hereafter are western and eastern Kenya, Nigeria, and Senegal, respectively. An. arabiensis specimens were collected in Asembo Bay. An. quadriannulatus DNA was kindly provided by F.
H. Collins and Nora Besansky from specimens collected in a rural area of southern Zimbabwe in 1986 [35]. At each site, mosquitoes were collected within one period from houses less than 5 km apart. Further details are found in Lehmann et al. [30].

DNA extraction, species identification, and sequencing
Anopheline mosquitoes were visually identified as members of the An. gambiae complex. Genomic DNA was extracted from whole mosquitoes as described previously [30] and suspended in 100 ml of TE. Species identification was carried out using the PCR assay [36]. Molecular form of the An. gambiae specimens was determined using the PCR-RFLP assay [37]. An. gambiae specimens collected from Kenya and Nigeria were all of the S form, while those from Senegal were of the M form. PCR reactions to amplify the full target gene were carried out using 2 ml of template DNA (from an aliquot of whole-mosquito extracts diluted 1:20 in distilled water) in 50 ml reaction containing 5 units Taq polymerase (Boehringer Mannheim or Gibco BRL) in manufacturer's buffer, 1.5 mM MgCl 2 , 200 mM each dNTP (PE Applied Biosystems) and 50 pmol each forward and reverse primers. To minimize PCR errors, amplification of SP14D1 and GNBP were performed using a mixture of Taq polymerase and (Pfu Promerga) mixed 1:7, respectively. Amplification of Gambicin was performed using Pfu only.
Primers were designed based on the published sequence of each gene. Cycling conditions for amplification included denaturation at 94uC for 5 minutes, followed by 35 cycles at 94uC for 30 seconds, 52uC for 30 seconds and 72uC for 1 minute, with a final extension step at 72uC for 5 minutes. PCR products were examined on a 1% agarose gel, and cloned using the pGem Tvector kit (Promega). Individual transformed colonies (white) were selected. The size of the DNA insert was determined by PCR using pUC/M13 forward and reverse primers. In most cases, a single appropriately sized insert was chosen at random, and sequenced in both directions after purification with the Wizard PCR Purification Kit (Promega). In addition to the previous forward and reverse primers, internal nested primers were used as sequencing primers. Cycle sequencing was performed using PE BigDye Terminator Ready Reaction Kit according to manufacturer's recommendations (PE Applied Biosystems). Sequencing reaction products were analyzed on an ABI 377 sequencer (PE Applied Biosystems). Sequences were checked for accuracy on both strands using Sequence Navigator (PE Applied Biosystems). Multiple alignments were performed with the Pileup program of GCG (Genetics Computer Group, 1999) using default options, and were adjusted by eye. To avoid sampling bias, a single allele (haplotype sequence) was arbitrarily selected from each specimen for the analysis. Alignments of variable positions are provided in supporting information figures ( Figure S1, S2, S3, S4). DNA sequences have been deposited in GenBank (Defensin sequences have been deposited under the accession numbers DQ211988-DQ212056; Gambicin, GNBP, and SP14D1 were deposited under accession numbers FJ653713-FJ653911).

PCR error
Because multiple insertion/deletion (indels) were common in SP14D, GNBP and defensin, direct sequencing was not possible. Sequences were determined from 2-4 independent clones of the same allele, to identify errors resulting from mis-incorporation of nucleotides by Taq polymerase during the PCR amplification. We estimated PCR error rate to be 0.001 per bp in accordance with published records (Kwiatowski et al., 1991). High variation between alleles, allowed distinguishing different alleles and different clones of the same allele. Gambicin was amplified using Pfu only, which practically eliminates PCR errors. Few indels in gambicin facilitated direct sequencing, which was used to verify sequences derived from clones (as above).
Although we used statistics that are less sensitive to the effect of PCR errors (e.g., nucleotide diversity instead of the number of segregating sites and theta derived based on the latter), the polymorphism reported here is slightly biased upwards because of PCR errors. Nevertheless, our inference is unbiased because instead of relying on the absolute values of polymorphism, we compared polymorphism between different functional regions of the gene that have the same probability to include a PCR error once differences in sequence length were accommodated (below).

Data analysis
Nucleotide diversity (p) was estimated using DnaSp 4.10 [38]. The 95% confidence interval (CI) of p was estimated using bootstrapping over positions in programs written in SAS (SAS Institute Inc., 1990). To evaluate if recombination rate differed between genes and determined their diversity the recombination parameter (R = 4Nr) between adjacent nucleotide positions for each gene was estimated using DnaSp. A more complete summary of polymorphism was obtained by the site frequency spectra [39,40], which describes the frequency of sites that are invariant (f = 0), singleton (f = 1), and polymorphic (f = 2, 3, … n/2), where f is the frequency of the rare nucleotide at this site/position and n is the number of sequences. These spectra distinguish between rare (e.g., singletons) and common mutations (sites where the rarest nucleotide was observed 4-7 times, which is the maximum possible frequency given 9-14 sequences per population). The frequency of neutral mutations increases slowly compared with positively selected mutations but faster than deleterious mutations. Hence, rare mutations represent a greater fraction of new and mildly deleterious mutations, whereas common ones represent a greater fraction of ancient and neutral mutations. The site frequency spectrum is especially useful to compare polymorphism in different regions of a gene without bias due to PCR errors, because it accounts for sequence length variation. We compared and tested equality of nucleotide diversity of synonymous and nonsynonymous sites using bootstrapping in MEGA 3.1 [41].
The Hudson, Kreitman and Aguadé's test (HKA test) compares within and between species divergence and polymorphism in two (or more) loci, accommodating different rate of neutral polymorphism between loci [42]. This test was designed to detect positive and positive-balancing selection. It was performed using DnaSP. The McDonald and Kreitman's Test (1991) compares the ratios of fixed to polymorphic substitutions of nonsynonymous and silent (both synonymous and NC) substitutions between species. Under neutrality, fixation rate is expected to be equal, but positive selection would increase the rate of fixation in nonsynonymous sites. This test was performed using DnaSP.
Differentiation between populations was assessed by sequencebased F statistics analogous to Wright F statistics [43], calculated according to [44] and tested (for being greater than zero) by a permutation test using DnaSP. Confidence intervals around F ST values were calculated by bootstrapping over nucleotide positions using programs written in SAS [45]. To avoid the effect of unequal sample size due to pooling four An. gambiae populations compared with single populations of An. arabiensis and An. quadriannulatus, inter-species comparisons were performed using the population of An. gambiae from western Kenya, which is sympatric with An. arabiensis. The binomial test (which estimates the probability of obtaining the observed number of significant tests at the 0.05 level given the total number of tests) was used to detect significant departures from null hypothesis across multiple tests, such between pairwise population comparisons across genes.
The evolutionary relationship between the sibling species is not fully resolved probably because introgression between An. gambiae and An. arabiensis affected genes unprotected by fixed inversions [46][47][48]. Because of uncertain phylogeny and introgression, we did not classify mutations as ancestral, shared, and derived and our selection analysis relied on within-gene comparisons. Comparisons between different functional regions of a gene (defined below) and synonymous vs. non-synonymous mutations provide robust evidence for selection and avoid confounding effects of population demography, inversion, introgression, and PCR errors because they affect all regions of the gene equally. Likewise, such comparison is not susceptible to variation in mutation and recombination rates between unlinked loci across the genome. This approach is conservative because polymorphism in shorter DNA fragments is subject to higher sampling variation, reducing the power to detect differences between regions. Physical linkage between adjacent regions may further reduce the differences between them even if selection operated on only one region. The advantage of this approach, however, is that significant differences represent robust evidence for selection.
Test of positive selection on single codons was performed using the codeml program in the package PAML 3.15 [49]. It estimates the per site ratio of nonsynonymous to synonymous substitutions in every codon along the branches of a phylogenetic tree by fitting nested maximum likelihood models with different parameters. Analyses were performed on coding regions of all homologue genes from the family Culicidae available in Genbank (searched using tblastx) and all unique sequences obtained in this study. GNBP alignment was 171 aa long and included eight species (An. gambiae, An. arabiensis, An. quadriannulatus, Ae. aegypti, Ae, albopictus, Ae. triseriatus, Cx.quinquefasciatus, and Armigeres subalpatus). SP14D1 alignment was 246 aa long and included six species (An. gambiae, An. arabiensis, An. quadriannulatus, Ae. aegypti, Cx.quinquefasciatus, and Ar. subalpatus). Gambicin alignment was 81 aa long and included nine species (An. gambiae, An. arabiensis, An. quadriannulatus, An. funestus, An. darlingi, Ae. aegypti, Cx.quinquefasciatus, Cx.pipens, and Ar. subalpatus). Defensin alignment was 101 aa long and included seven species (An. gambiae, An. arabiensis, An. quadriannulatus, An. funestus, An. darlingi, Ae. aegypti and Ar. subalpatus). Multiple alignment of coding regions was done using ClustalW [50] followed by hand alignments before removal of all gaps. For GNBP and SP14D, pairwise local alignment were obtained in tblastx instead of Clustal and final alignment was performed manually in Genedoc (version 2.700). Neighbor Joining trees were produced using the program Neighbor (PHYLIP 3.66) based on a distance matrix computed by Dnadist (PHYLIP 3.66), run under default parameters [51].