A gene-based capture assay for surveying patterns of genetic diversity and insecticide resistance in a worldwide group of invasive mosquitoes

Understanding patterns of diversification, genetic exchange, and pesticide resistance in arthropod disease vectors is necessary for effective population management. With the availability of next-generation sequencing technologies, one of the best approaches for surveying such patterns involves the simultaneous genotyping of many samples for a large number of genetic markers. To this end, the targeting of gene sequences of known function can be a cost-effective strategy. One insect group of substantial health concern are the mosquito taxa that make up the Culex pipiens complex. Members of this complex transmit damaging arboviruses and filariae worms to humans, as well as other pathogens such as avian malaria parasites that are detrimental to birds. Here we describe the development of a targeted, gene-based assay for surveying genetic diversity and population structure in this mosquito complex. To test the utility of this assay, we sequenced samples from several members of the complex, as well as from distinct populations of the relatively under-studied Culex quinquefasciatus. The data generated was then used to examine taxonomic divergence and population clustering between and within these mosquitoes. We also used this data to investigate genetic variants present in our samples that had previously been shown to correlate with insecticide-resistance. Broadly, our gene capture approach successfully enriched the genomic regions of interest, and proved effective for facilitating examinations of taxonomic divergence and geographic clustering within the Cx. pipiens complex. It also allowed us to successfully survey genetic variation associated with insecticide resistance in Culex mosquitoes. This enrichment protocol will be useful for future studies that aim to understand the genetic mechanisms underlying the evolution of these ubiquitous and increasingly damaging disease vectors.


Introduction
The brown, dusk-biting mosquitoes collectively classified within the Culex pipiens complex (Diptera: Culicidae), include two globally distributed invasive species, the temperate Culex pipiens, and the tropical Cx. quinquefasciatus, along with several additional taxa with more restricted distributions [1]. Specific populations of these two species are critical urban vectors of the nematode that causes human periodic filariasis (Wuchereria bancrofti), and several epidemic encephalitides such as West Nile virus [2] and Usutu virus [3]. These mosquitoes also vector avian malaria, a group of parasites that are of significant concern to island bird communities in Hawaii, the Galapagos, and elsewhere [4][5][6][7].
Rapid human movements around the globe likely facilitated the spread of many now cosmopolitan mosquito species such as several in the Cx. pipiens complex, and accordingly these distributions are a relatively recent phenomenon [8]. One of the best-studied invasive species is the yellow fever mosquito, Aedes aegypti. Outside its source location in Africa, populations of Ae. aegypti all share the same basic genotype, revealing its rapid, human-facilitated expansion [9]. Interestingly, in contrast to this pattern, microsatellite analyses of populations of Cx. pipiens and Cx. quinquefasciatus from across the world have uncovered unexpectedly high levels of genetic diversity. For example, continental populations of Cx. quinquefasciatus flanking the Pacific Ocean are highly differentiated [10]. Furthermore, although historical records pinpoint an original introduction of Cx. quinquefasciatus into the Hawaiian Islands from the Americas [11], current Hawaiian Cx. quinquefasciatus have a distinct Australasian signature [10]. The mechanisms underlying the likely replacement of the first population in Hawaii by the second are unknown and understanding this process will require a better understanding of the specific genetic makeup (i.e., which genes and their capabilities) of the population(s) involved.
Another important aspect of the Cx. pipiens complex is the extent to which genetic exchange (hybridization) has contributed to ecological divergence and patterns of disease transmission. For example, inter-taxonomic hybridization between the two forms of Cx. pipiens may have significant negative consequences for arboviral transmission to humans [12].
Several studies have also found evidence of extensive hybrid zones between temperate Cx. pipiens or Cx. pipiens pallens (a subspecies limited to northeastern Asia) and tropical Cx. quinquefasciatus [13,14]. Finally, analysis of genetic variation at the acetylcholinesterase locus 2 (ACE2) across members of the complex indicated that the hybridization event that may have resulted in formation of the temperate Cx. pipiens pallens was unidirectional which is surprising since patterns of hybridization of contemporary Cx. p. pallens with Cx. quinquefasciatus appear bidirectional [13].
To address these and other questions specific to the Cx. pipiens complex, it will be necessary to extensively survey population and taxonomic samples at a large number of independently segregating molecular markers. Such an analysis would provide greater clarity to patterns of evolutionary divergence, global movement, and genetic exchange within these mosquitoes. Next-generation sequencing (NGS) has enabled vast amounts of genetic data to be collected at relatively low cost [15,16]. However, challenges for sample-specific data collection and analysis are created by the presence of diverse microbial symbionts such as Wolbachia and endogenous viral elements in these mosquitoes [17]. Furthermore, mosquito genomes like those of Culex are often riddled with repetitive DNA [18]. Of a recent assembly of the 567. 56 Mb Cx. pipiens pallens genome, 60.63% (344.11 Mb) was found to consist of repetitive elements [19]. Such elements make whole genome data collection and analysis expensive and wasteful since only a small proportion of the genetic variation observed can be confidently compared across all specimens.
Capitalizing on recent technological advancements, a capture approach where DNA or RNA probes designed to match known genes are hybridized to DNA libraries of individual specimens and sequenced has been gaining traction [20][21][22]. Because it bypasses large amounts of DNA of unknown function and heritability, targeted gene enrichment allows users to pool tens or even hundreds of indexed specimens, and cost-effectively sequence thousands of homologous loci simultaneously. However, such enrichment methodologies have so far been minimally applied in mosquitos for examining population genetics or evolutionary patterns (but see [23]).
Here we describe the design and use of a genetic baits assay targeting 512 genes annotated in the Cx. quinquefasciatus genome including regions that have been shown to harbor genetic variation that correlates with insecticide resistance. We examined the utility of these baits for taxonomic differentiation and patterns of admixture by sequencing samples from four taxa of the Cx. pipiens species complex, samples of known hybrid origin, and one sample of a closely related, outgroup taxon, Culex torrentium. To further examine the potential of these baits for exploring finer scale, intra-taxonomic population structure and differentiation, we included samples of Cx. quinquefasciatus from multiple geographic sources. Finally, within our samples we investigated the presence and frequency of alleles previously found to correlate with insecticide resistance. This was done to test the utility of these baits for surveying genetic variation that may contribute to a reduced efficacy of chemical control efforts. Such information can be critical for developing effective strategies to mitigate disease transmission by these mosquitoes [24].

Bait design and screening
We designed an in-solution capture assay targeting 131 rapidly evolving Culex genes obtained from a previous comparison of de novo-assembled transcriptomes from multiple samples of Cx. pipiens f. pipiens and Cx. pipiens f. molestus [25]. These 'rapidly evolving' genes were enriched for seven GO terms, of which five terms (chitin metabolic process, chitin binding, serine-type endopeptidase activity, proteolysis and odorant binding) were also enriched along the 'fly' branch [26]. This indicates they may represent a genetic 'core' for adaptive evolution within the Diptera. To facilitate estimates of genotyping error rates, we also included 28 identified 'slow-evolving' genes [25]. To these 131 rapidly evolving and 28 slow evolving genes, we also added 353 genes potentially involved in insecticide resistance. These included annotated P450s, alpha and beta esterases, sodium channel genes, and acetylcholinesterase genes [27]. In total, our capture assay targeted 512 genes (S1 Table). These genes were then extracted from the Cx. quinquefasciatus genome (v. CpipJ2.5) [28] using their VectorBase annotations (https://vectorbase.org/vectorbase/app) [29].
To ensure optimal enrichment, we commissioned Daicel Arbor Biosciences (https:// arborbiosci.com/) to design 39,953 120 bp baits with~1.5x flexible tiling density (~80bp probe spacing) across our targeted genes. These baits covered the complete exonic and intronic regions for each gene, allowing for simultaneous investigation of both adaptive and neutral evolution. These candidates were then assessed using BLAST v. 2.12.0 [30]. Bait candidates were accepted when they satisfied one of the following conditions: a) no BLAST hit with a melting temperature (T m ) above 60˚C, b) no more than two hits at T m 62.5-65˚C, or 10 hits in the same interval and at least one neighbor candidate being rejected. c) no more than 2 hits at T m 65-67.5˚C and 10 hits at T m 62.5-65˚C and two neighbor candidates on at least one side being rejected. d) no more than a single hit at or above T m 70˚C or e) no more than one hit at T m 65-67.5˚C and 2 hits at T m 62.5-65˚C and two neighbor candidates on at least one side being rejected. The baits were synthesized as a myBaits version 3 kit. After stringent filtration, 29,992 baits were retained, covering all 512 target genes with at least one bait. The targeted sequences total 2,524,269 bp in length, and are well distributed across the three Cx. quinquefasciatus chromosomes (S1 Fig).

Target enrichment and sample sequencing
To test our targeted enrichment approach, we chose specimens representative of the genetic diversity observed across the complex (S2 Table). Specifically, we included specimens of the two Culex pipiens forms from Europe and North America (f. pipiens and f. molestus), specimens of the subspecies Cx. pipiens pallens from the Republic of Korea and, to assess the power of the assay to discern intraspecific patterns of diversity, specimens of Cx. quinquefasciatus from six distinct geographic regions: east-southeast Asia, Samoa, Hawaii, North America (including the Caribbean), Brazil and Nigeria. We also included known hybrids of Cx. pipiens and Cx. quinquefasciatus from California and North Carolina. Most specimens had previously been examined using a panel of microsatellite loci [10,12,13,31]. Finally, we included one sample of the closely related species Cx. torrentium for outgroup comparisons.
We extracted DNA from individual mosquitoes using a phenol-chloroform method previously described [32]. We then performed an initial step to clean and concentrate DNA by using Omega Mag-Bind TotalPure NGS beads at 0.9 ratio following the manufacturer's protocol. For library preparation, we used the Illumina DNA library prep (formerly Nextera DNA Flex), again following the manufacturer's protocol. Each sample was given a unique, barcoded adapter in this step to allow library multiplexing prior to sequencing. DNA concentration and quality of the libraries were determined using the Qubit 2.0 Fluorometer and Bioanalyzer High Sensitivity DNA Analysis kit (Agilent), respectively. To create amplicons that did not have affinity to streptavidin, we performed four amplification cycles following instructions in Appendix A2 of the myBaits Hybridization Capture for NGS protocol (v. 4.01). To do this, we used universal P5 and P7 primers. The resulting products were cleaned using Omega Mag-Bind beads and hybridized with our capture biotinylated baits for target enrichment following myBaits protocol (v. 4.01). We used diluted baits to a ratio of 1:6. These libraries were amplified following 12 cycles using KAPA HiFi Hotstart ready mix, and the resulting products were cleaned with AMPure XP beads (Beckman Coulter). Concentration and quality of final libraries were checked using Qubit and Bioanalyzer, and each sample was adjusted to a final concentration of 4 nM (1.33 ng/μl). We obtained libraries with fragment sizes of 600 bp on average. These were 2 × 300 bp paired-end sequenced in multiplexed groups of six or seven samples on an Illumina MiSeq using 600-cycle MiSeq version 3 kits.

Data mapping and variant calling
After sequencing, we first used Trim Galore v. 0.4.1 [33] to trim Illumina sequencing adapters and bases from read ends with a quality score less than 20 (Cutadapt version 1.9.1) [34]. We removed both reads of a pair if either was less than 30 bases long after trimming. We mapped all remaining trimmed reads to the Cx. quinquefasciatus reference genome (v. CpipJ2.5) [28] using BWA-MEM v. 0.7.12 with default settings [35]. Next, we added read groups and sorted the mapped reads using the AddOrReplaceReadGroups function in Picard v. 1.119 [36]. We then marked read duplicates using the tool MarkDuplicates, also with Picard v. 1.119, followed by indel realignment using IndelRealigner in the Genome Analysis Toolkit ('GATK') v. 3.6 [37]. Finally, for each sample, we identified genetic variants using GATK's HaplotypeCaller [38] (specific flags:--emitRefConfidence GVCF,--variant_index_type LINEAR,--variant_in-dex_parameter 128000 -rf BadCigar).
With the resulting raw VCF files (one per sample), we used GATK's GenotypeGVCFs function to produce a single, multi-sample VCF containing all identified variants observed across all samples. This file was filtered to retain only single nucleotide polymorphisms (SNPs), using the SelectVariants tool in GATK v. 4.0.8.1 [39]. This tool was also used to remove any variants that fell outside our designated baits coordinates (S1 Table). Next, we applied a series of hard quality filters, removing all SNPs with any of the following parameters: QD < 11.0, FS > 40.0, MQ < 56.0, MQRankSum < -0.2, ReadPosRankSum < -3.0, and/or SOR > 2.0. These thresholds were based on the observed distribution of variants (S2 Fig), and were equal to, or more stringent than, the recommended values given in GATK's best practices [40]. Finally, we used SnpEff v. 4.3 [41], with a custom database to annotate the remaining SNPs for downstream sorting by variant type.
We did not sequence any unenriched libraries in parallel with our enriched library sequencing efforts. However, for the purpose of comparing the enrichment efficiency of our bait capture assay to unenriched libraries, we used previously published Illumina data from two Cx. pipiens f. pipiens, two Cx. pipiens f. molestus, and one Cx. pipiens pallens (S3 Table). These data were generated using similar methods to those used here, but without the application of any enrichment method [19,42]. Four of these five datasets were prepared from single, wildcaught mosquitoes [42], while the fifth was a pool of laboratory-maintained samples [19]. As each dataset contained substantially more reads than what we obtained from our capture-assay libraries, we used the program Seqtk v. 1.1-r91 [43], to down sample each dataset's reads to three million pairs (after trimming and quality filtering). After read down sampling, we mapped the reads, sorted them, and realigned INDELs as described above. These data were not included in our subsequent clustering analyses nor in our insecticide resistance investigation. For all datasets (both enriched and unenriched), the 'stats' function in SAMtools v. 1.15 [44] was used to determine the number of properly paired reads that mapped to the full genome, the number of properly paired reads that mapped to our target regions, and the percentage of target regions with a depth of coverage equal to or greater than three reads (�3×).

Genetic clustering and admixture
In addition to examining the enrichment efficiency of our bait capture approach, we also wanted to assess our enriched dataset's utility for surveying inter-taxonomic relationships and potential gene flow (admixture) across samples derived from the Cx. pipiens species complex, as well as for surveying intraspecific population relationships. As prior work has shown the importance of using a large number of segregating markers to detect structure from genetic data when divergence between distinct populations is likely to be low [45], we wanted to maximize the number of selectively neutral markers used. Therefore, we selected all variants that were annotated as either 'synonymous' or 'intronic', as they are more likely to be "neutral". Although research in Drosophila suggests that mutations in both of these site categories can experience selection [46][47][48], the strength of this selection is likely far less than that acting on non-synonymous variation.
We used GATK's 'SelectVariants' tool to generate two new VCFs from our VCF database of high quality synonymous and intronic SNPs, one with all samples except the outgroup Cx. torrentium (henceforth 'Cx. pipiens complex' dataset), and a second with only the Cx. quinquefasciatus samples (henceforth 'Cx. quinquefasciatus' dataset). We then removed any variant from both datasets that was not in Hardy-Weinberg equilibrium (p < 0.0001), and any variant in which the minor allele was represented at less than 5% frequency. Both filtering steps were carried out using VCFtools v. 0.1.17 [49]. Finally, for both datasets, we used PLINK v.1.90b6.6 [50] to remove SNPs with a pairwise squared correlation (r 2 ) greater than 50% within sliding windows of 50 SNPs at 10 SNP increments between windows [51]. This was done to reduce the impact of linkage between SNPs on our examinations of population clustering and admixture [52].
We first used principal component analyses (PCAs) to investigate non-parametric clustering among the samples in both datasets. These PCAs were conducted with the program PLINK v. 1.90b6.6 [50], and the results visualized using R v. 4.0.2 [53], focusing on the first two principal components (PC1 & PC2). We also examined patterns of genetic structure within our data using a Discriminant Analysis of Principal Components (DAPC) [54], as well as a maximum likelihood approach with the program ADMIXTURE v. 1.3.0 [55]. The DAPC were carried out with the package adegenet v. 2.1.5 [56] in R. We first used the 'find.clusters' function to identify probable genetic clusters represented in the data, For this analysis, we retained all principal components. To determine the optimal number of clusters (K), we used the Bayesian information criterion (BIC) [57]. If our BIC results indicated the optimal number of clusters was greater than one, then the number of retained principal components was determined by using the 'cross-validation' function in adegenet, with sample assignments determined in the initial clustering analysis. We used 75% of the data for a training set and the remaining 25% for data confirmation. This was repeated for 100 replicates. We used the 'dapc' function to probabilistically assign each sample to a cluster. From the information on discriminant functions, a genotype composition plot (Compoplot) was generated indicating the attributed probabilities of each sample to a cluster [56].
With ADMIXTURE, we examined potential clusters (K) from one to seven in both datasets. Each K value was run 20 independent times with a different seed value for each run. Across K values, we compared the means observed for the standard error of the 10-fold cross-validation (CV) error estimate to determine the number of clusters best supported by the data [58]. We determined the average q-matrix cluster assignment for each sample for each K value using the online version of CLUMPAK [59], with default settings.

Genetic diversity and taxonomic divergence
To examine the amount of genetic diversity harbored within individual samples, populations, and taxa, we used GATK v. 4.0.8.1 [39] to designate all sample-variant combinations with a depth of coverage less than 15× as a 'no call'. A read depth of 15× or greater has been shown to be adequate for assessing the diploid state of an allele (homozygous vs. heterozygous) within a sample with potentially high amounts of heterogeneity [60]. No upper limit was placed on read depth. Next, we used GATK to retain only biallelic SNPs that were annotated as either 'synonymous' or 'intronic' and called in all samples. This variant filtering was done to improve the equivalency of relative diversity estimates across all the samples. Finally, we used VCFtools v. 0.1.17 [49] to count the number of observed homozygous variants. The resulting data were used to calculate the average heterozygosity within a sample across assessed sites [61,62]. We also calculated taxon and population (Cx. quinquefasciatus only) means and standard errors of the means. Although these estimates do not give us absolute estimates of genetic diversity (because they only include known segregating sites), they do allow us to make relative comparisons between groups of samples (e.g., taxa or populations).
To examine relative divergence between sample clusters (e.g., taxa or populations), we used VCFtools v. 0.1.17 and our larger clustering dataset to calculate the pairwise fixation index (F st ) [63]. Comparisons were done between the four complex taxa excluding the known hybrids, and between these and the outgroup Cx. torrentium. We also compared the Cx. quinquefasciatus populations from the six designated geographic regions. All sample taxonomic and population designations were based on their prior assignments (S2 Table). We report both the weighted and unweighted estimates. Weighted estimates may be more strongly impacted by unequal sample sizes, whereas unweighted estimates may be more affected by variants segregating at low frequencies [64].

Phylogenetic analysis
To further examine sample clustering as well as taxonomic relationships amongst all samples, including the outgroup Cx. torrentium, we performed a maximum likelihood phylogenetic analysis. For this analysis, we focused on neutral variants that are likely to have similar mutation probabilities. Therefore, from our annotated variants dataset, we used BCFtools v. 1.9 [65] to select only 4-fold ('silent') segregating sites. Next, we removed variants that were not in Hardy-Weinberg equilibrium using VCFtools v. 0.1.17. We also thinned highly correlated SNPs as described above. The resulting VCF file was converted to PHYLIP format using the vcf2phylip. py v. 1.5 python script [66]. We then used jModelTest 2.1.10 [67,68] with default settings to select the best-fit model of nucleotide substitution for our datasets based on BIC scores. With the best fitting model, we used PhyML v. 3.1 [69] to carry out a maximum-likelihood phylogenetic analysis, with 100 non-parametric bootstrap replicates to determine confidence values. The resulting phylogenetic tree was visualized using the program FigTree v. 1.4.4 [70].

Presence of variants potentially conferring insecticide resistance
To assess the utility of our capture assay for surveying genetic polymorphism that may contribute to insecticide resistance, we first conducted a literature survey to identify known single nucleotide variants that have been shown to be associated with insecticide resistance in Cx. pipiens complex mosquitoes. Specifically, we examined publications that reported the gene and position of a segregating variant that correlated with resistance to one or more active insecticidal products (e.g., organophosphates or pyrethroids). These were exclusively missense mutations that changed the amino acid sequence and likely protein interactions with the insecticide. With their genome coordinates (chromosome and base position), we used VCFtools v. 0.1.17 to calculate the frequencies of the susceptible and resistant alleles across all our samples. We also used VCFtools to examine the sample-specific presence of these variants to compare taxa and populations.

Data mapping and variant calling
The average percentage of total reads mapped to the full genome was very similar between our enriched libraries prepared with a capture assay and unenriched libraries (79.8% vs. 81.3%; Table 1). However, the enriched libraries had an average of 13.82% of the reads mapped to the target gene regions, whereas the unenriched datasets only had 0.76%. This indicates an enrichment factor of 18.2 fold for the target regions. A difference between the enriched and unenriched data was also reflected in the percentage of the target regions covered by three or more reads (Table 1).
We initially called 12,301,010 variants across all samples, including both single nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs). After removing all INDELs and any additional variants not located in our designated baits, we were left with 315,512 SNPs. Quality filtering further reduced this to 132,185 SNPs.

Genetic clustering and admixture
For examining genetic relationships for all the samples within the Culex pipiens complex, we generated a dataset consisting of 14,303 unlinked SNPs annotated as either 'synonymous' or 'intronic'. A principal component analysis with this dataset revealed that the greatest genetic divergence (indicated along PC1) occurred between the samples designated as Cx. quinquefasciatus and all the other samples (Fig 1). PC2 distinguished the Cx. pipiens pallens samples from the other samples. As expected, the two samples known to be admixed between Cx. quinquefasciatus and Cx. pipiens were intermediate between these taxa along PC1. Additionally, along PC2, there appeared a small distinction between the two forms of Cx. pipiens (f. pipiens and f. The analysis of clustering using ADMIXTURE also indicated that a K value of 2 was best supported (S4A Fig). Population clustering at this K value again indicated the genetic distinction between the Cx. quinquefasciatus samples and the other complex samples (Fig 2). However, the two samples known to be hybrids between Cx. pipiens and Cx. quinquefasciatus clearly showed their mixed ancestry. At K = 3 we saw a division between Cx. quinquefasciatus samples from Hawaii and Samoa and all other Cx. quinquefasciatus samples. At K = 4 the Cx.

Table 1. Comparison of read-mapping between enriched and unenriched libraries.
We report the average number of read pairs after quality trimming, the number of read pairs that properly mapped to the full genome, and the percent of the reads which mapped to the full genome. The standard deviations are given in parentheses. Also included are the number of read pairs that mapped to the target bait regions, the percentage of the properly paired reads that mapped to the target regions, and the percentage of the target bait regions with a coverage of three or more reads (�3×). For individual sample statistics see S3 We also looked at sample clustering just in our known Cx. quinquefasciatus samples. This dataset consisted of 9,829 unlinked, segregating variants annotated as 'synonymous' or 'intronic'. All samples clustered within their known geographic region (Fig 1), and more broadly there were three distinct groupings. These corresponded to a cluster of Hawaiian and Samoan samples that were distinct from all the other samples along PC1, and a cluster of east Asian samples that were distinct from the third cluster along PC2. This third cluster consisted of samples from North America and the Caribbean, Brazil, and Nigeria. The DAPC with just the Cx. quinquefasciatus samples suggested they derived from a single cluster (K = 1; BIC = 159.48; S3D Fig). This was not surprising given the limited number of markers used and the low amounts of genetic divergence likely to be present among populations of this species [45]. Given this result, we did not perform additional tests within the DAPC analytical framework for this dataset.
The admixture results for the Cx. quinquefasciatus samples also suggested a single taxonomic group (i.e.; K = 1; S4B Fig). However, when we looked at sample clustering at higher K values, we saw the greatest distinction between specimens deriving from Hawaii and Samoa, and all remaining samples (Fig 3). At K = 3 we saw the east Asian samples form a distinct cluster, recapitulating the results for our principal component analysis. One sample from India (QUE_EAS_01), appeared to be highly admixed with genetic representation from multiple populations across K values. At K = 4, the Hawaiian and Samoan samples formed distinct clusters. The Nigerian and Brazilian samples showed their distinctiveness (and relation to one another) at K = 5. However, this affiliation disappeared at K = 6. Such cluster shifting across K values highlights the overall degree of genetic similarity among these samples and likely reveals

Genetic diversity and taxonomic divergence
To examine relative genetic diversity within all Cx. pipiens complex mosquitoes sequenced, we used 916 biallelic, neutral SNPs which each had a depth of at least 15 reads (15×) in all samples. The mean number of heterozygous sites and the mean sample pairwise heterozygosity (π) for all taxa are given in Table 2, and each sample's individual diversity observations are given in S4 Table. The taxon/group with the highest π values was Cx. pipiens pallens at 0.091 (SE = 0.006). This value means that among the Cx. pipiens pallens, on average 9.1% of the 916 SNPs were found in a heterozygous state. The next highest value of π was observed in the Cx. torrentium sample with 0.084. The known hybrids had an average π of 0.066 (SE = 0.009). The lowest mean π value was observed in the various Cx. quinquefasciatus samples (0.023, SE = 0.002).
To examine relative genetic diversity within just the Cx. quinquefasciatus samples, we used 540 SNPs that were determined to be biallelic and had a depth of at least 15 reads (15×) in the samples under consideration. These SNPs were also considered most likely to be evolving neutrally by virtue of being annotated as 'synonymous' or 'intronic'. The mean number of heterozygous sites and the mean sample pairwise heterozygosity (π) for the six geographic designations of Cx. quinquefasciatus are given in Table 3. The samples from east Asia had the highest mean observed π with a value of 0.150 (SE = 0.015). Hawaiian samples also appeared to be relatively genetically diverse with a π value of 0.103 (SE = 0.017). The lowest mean values  Table 4 gives the pairwise unweighted and weighted estimates of the fixation index (F st ) [63], between each of the four Cx. pipiens complex taxa examined here as well as the outgroup, Cx. torrentium. Weighted estimates were always larger than unweighted estimates. Not surprisingly, the highest values were observed between the Cx. pipiens complex taxa and the Cx. torrentium sample. Among the taxa within the Cx. pipiens complex, the highest unweighted F st value was between Cx. quinquefasciatus and Cx. pipiens f. pipiens (0.2967). With weighted F st  Table 2. Relative genetic diversity within taxa across 916 neutral, bi-allelic, segregating SNPs. Given are the taxonomic designations (including a category for known hybrid samples), sample size for each taxon, the mean number of heterozygous sites observed per sample with standard error, and the corresponding mean pairwise sample heterozygosity value with standard error.

Phylogenetic analysis
The dataset for our phylogenetic analysis consisted of 1,735 unlinked 4-fold synonymous SNPs, all of which were present in at least 75% of the samples. The evaluation of models of nucleotide sequence evolution indicated that a transversional model of mutation with a gamma distribution of rate heterogeneity best fit the data (TVM + Γ) [71]. As expected, the outgroup species Cx. torrentium was clearly distinct from the samples of the Cx. pipiens complex (Fig 4). The Cx. quinquefasciatus samples also clustered with high confidence and overall, clustering in our phylogeny recapitulated the results of the PCAs and ADMIXTURE analyses.

Presence of variants potentially conferring insecticide resistance
After reviewing the literature, we investigated the presence and frequency of seven single nucleotide polymorphisms that have been shown to correlate with insecticide resistance in the Cx. pipiens complex (Table 6). Interestingly, all presumptive resistance-associated alleles were present among the samples we examined. For one of these sites, R213, found in gene CYP6BZ2 (cytochrome P450 6BZ2), there are two allelic changes that are associated with Table 3. Relative genetic diversity within populations of Cx. quinquefasciatus across 540 segregating, neutral, bi-allelic SNPs. Given are the population designation, sample size for each population, the mean number of heterozygous sites observed per sample with standard error, and the corresponding mean pairwise sample heterozygosity value with standard error.

Discussion
We present evidence that targeted gene enrichment in Culex mosquitoes is an effective way to substantially increase the amount of sequence data from non-repetitive genomic regions of

PLOS NEGLECTED TROPICAL DISEASES
known function (i.e.; coding sequences). We also show that this data can be used to survey a large number of segregating genetic sites from across the genomes of several Culex pipiens complex samples. Use of these sites allowed us to successfully examine taxonomic relationships, population structure, and patterns of admixture in these mosquitoes, and recovered similar patterns of population differentiation observed after the analyses of thousands of specimens at 7-12 microsatellite loci [10,[12][13][14]31]. We also showed enrichment approach has utility for surveying the presence and frequency of alleles known to correlate with insecticide resistance.
Perhaps not surprisingly, the genetic reads derived from Cx. quinquefasciatus samples mapped the best to the reference genome, while the outgroup sample, Cx. torrentium, mapped the poorest (S3 Table). Using just the Cx. quinquefasciatus samples to look at the relationship between number of raw reads generated and the number of successfully mapped reads, we observed a small but significant, positive trend (S5 Fig). This suggests that a greater depth of sequencing is advisable, as this would increase the number of reads per sample, but there are likely other factors to consider. These may include the age of the sample (and corresponding DNA degradation), and the relative taxonomic distance from the reference [74]. In the latter case, the number of variants which will be useful in downstream analyses may not be greatly improved by a greater depth of sequencing.
In our clustering analysis using principal components, we observed the greatest genetic distinction between the Cx. quinquefasciatus samples and those of the other taxa (Fig 1). Interestingly, the samples of Cx. quinquefasciatus clustered more tightly than these other samples when considered collectively. This result was also seen in our DAPC (S3B Fig). The more loosely defined cluster for non-Cx. quinquefasciatus samples likely reflects the greater amount of genetic divergence harbored within these taxa, and may support the unique taxonomic designations attributed to them. However, we also observed high levels of genetic diversity within these taxa, particularly Cx. pipiens f. pipiens and Cx. pipiens pallens ( Table 2). It remains to be determined how much of this is true biological diversity, and how much could be an artifact of reference-based mapping biases.
We also observed two primary genetic groups in our ADMIXTURE analysis (Fig 2), with K = 2 being the best supported (S2 Fig). As with our PCA, these correspond to a Cx. quinquefasciatus cluster and a cluster with all other samples. In both the PCA and ADMIXTURE analysis, the hybrid samples showed the expected mixture of lineages. Table 6. Summary of Insecticide Resistance-Associated Allele Frequencies. Given are the genomic position of examined SNPs that were previously found to correlate with resistance (chromosome and base position), gene ID (in the annotated Cx. quinquefasciatus genome), gene name, amino acid change, number of chromosomes examined (e.g., # of samples the variant is called at position � 2), and the frequencies of the susceptible and resistant alleles.

PLOS NEGLECTED TROPICAL DISEASES
While the best supported K value in these analyses indicate the number of confidently discreet taxa or populations, examinations of additional K values can provide important insights into patterns of more nuanced genetic divergences among the samples, as well as indicate samples that may be admixed. Interestingly, in our ADMIXTURE analysis at K = 3, the Cx. quinquefasciatus samples became split between a Hawaiian and Samoan group and the rest of the samples. This was somewhat surprising given the patterns of clustering observed in the PCA, which differentiated Cx. pipiens pallens from the other taxa along the second axis. In the ADMIXTURE analysis, Cx. pipiens pallens only became distinct at K = 5. These differences may reflect differences between the non-parametric approach of a PCA versus the approach of an ADMIXTURE analysis, which utilizes both allele frequency and ancestry fraction parameters [55].
When we examined clustering in just the Cx. quinquefasciatus samples, we again observe the greatest differences between the Hawaiian and Samoan samples and everything else (Figs 1  & 3). However, for both our DAPC and ADMIXTURE analysis, K = 1 was the best supported. This is not surprising given that these represent a single taxon with the potential for high rates of inter-population gene flow. Considering patterns of genetic diversity within Cx. quinquefasciatus populations, the east Asian samples harbored the highest mean number of heterozygous sites and a correspondingly high π value (Table 3). This recapitulates previous examinations of genetic diversity in this species [10]. The lowest genetic diversity was present in the Brazilian samples, which may indicate a relatively recent colonization of South America.
In the quantitative examination of taxonomic differentiation, weighted F st values were always higher than unweighted values (Table 4). Not surprisingly, the greatest F st values observed were between the taxa in the species complex and the outgroup, Cx. torrentium (Table 4). Interestingly, among taxa in the species complex, the highest unweighted value was observed between Cx. pipiens f. molestus and Cx. pipiens pallens, whereas for weighted values it was between Cx. pipiens f. molestus and Cx. quinquefasciatus. The distinctiveness of the Cx. pipiens f. molestus samples from these two taxa is also observed in the principal component analysis (Fig 1). As expected, the lowest weighted and unweighted F st values are both for the comparison of the two forms of Cx. pipiens.
Within Cx. quinquefasciatus, the greatest genetic differentiation was between the samples from Nigeria and those from Samoa (Table 5). This may reflect their relative geographic distance from one another and the corresponding decrease in genetic exchange. However, other factors such as differential selection could also play a role in generating the genetic divergence observed between African and Samoan populations of Cx. quinquefasciatus [75].
In both examinations of taxonomic differentiation using F st values, the number of samples per population being compared was small (Tables 2 and 3). Such small samples sizes can artificially inflate F st estimates [76,77]. However, the large number of variants used in these analyses (916 for all Cx. pipiens complex samples, and 540 for the Cx. quinquefasciatus samples only), should have minimized such effects [78]. Nonetheless, it is possible our estimates of F st may not accurately reflect the levels of genetic differentiation which exists between specific populations within the Cx. pipiens mosquito complex. In the future, analysis of more samples could address this question.
As expected, in the taxonomic analysis the outgroup sample Cx. torrentium was distinct from the other samples (Fig 4). Within the samples of the Cx. pipiens complex there are two major clades comprised of the Cx. quinquefasciatus specimens and everything else. The Cx. quinquefasciatus clade was well supported (99/100 bootstrap support), whereas the second clade was more poorly supported (52/100 bootstrap support). This was likely due to the presence of the hybrid specimens from North America. Of note, the samples of Cx. pipiens f. pipiens and Cx. pipiens f. molestus do not form monophyletic clades in this analysis. This may reflect the low level of genetic differentiation between the two taxa, combined with documented genetic exchange between them [14,25,42,79].
Our assessment of insecticide resistance-associated alleles revealed the presence of all identified variants in at least one of the sequenced samples. This points to the ubiquity and maintenance of these alleles in the Cx. pipiens complex and underscores the importance of careful insecticide resistance management [80]. However, it should be noted that the individual mosquitoes used here were not assayed for resistance to any insecticide, and therefore the presence of these alleles cannot be explicitly associated with resistance.
Another consideration regarding resistance-associated alleles (and observed genetic variation more broadly), is the extent to which the same derived mutation may have arisen independently in multiple complex populations ('genetic homoplasy'). We have assumed a single origin for all examined genetic variation, but such an assumption is unlikely to be true across such a large number of segregating sites. If there is extensive homoplasy in the data examined here, this would likely obscure patterns of population clustering and taxonomic differentiation [81]. Considering homoplasy is of particular importance for mutations that may confer a fitness advantage, such as those related to insecticide resistance [82].
Interestingly, all but one of the resistance-associated variants we surveyed were segregating at low frequencies (< 20% of the samples; Table 6). This suggests there may be counter-acting fitness costs to harboring these variants. Indeed, there are known fitness costs associated with mutations in the acetylcholinesterase gene in the absence of strong selection from insecticide exposure [83,84], and such costs may be extended to cytochrome P450 mutations more broadly [85]. The inclusion of 353 genes in our baits assay that could potentially evolve to confer insecticide resistance (i.e., P450s, alpha and beta esterases, sodium channel genes, and acetylcholinesterase genes) means that in the future the methodologies described here could be used to survey known genetic variation that contributes to resistance. Furthermore, these methods could also be a cost-effective way to screen for novel mutations associated with insecticide resistance in these genes.

Conclusions and future directions
In conclusion, the described bait-based assay is a powerful tool for improving sequencing efficiency and for addressing phylogenomic questions at multiple scales, including questions of taxonomic differentiation and population structure, across the Cx. pipiens complex. It can also be used to uncover the presence and extent of gene flow among populations and admixture. Furthermore, the utility of the data that can be generated using these baits is likely to expand. For example, it will be possible to investigate specific evolutionary drivers of taxonomic differentiation such as drift or selection. Of particular interest will be the identification of variation in specific genes contributing to the extensive ecological and behavioral differences observed among the Cx. pipiens complex taxa.
Supporting information S1