High-throughput genotyping of a full voltage-gated sodium channel gene via genomic DNA using target capture sequencing and analytical pipeline MoNaS to discover novel insecticide resistance mutations

In insects, the voltage-gated sodium channel (VGSC) is the primary target site of pyrethroid insecticides. Various amino acid substitutions in the VGSC protein, which are selected under insecticide pressure, are known to confer insecticide resistance. In the genome, the VGSC gene consists of more than 30 exons sparsely distributed across a large genomic region, which often exceeds 100 kbp. Due to this complex genomic structure, it is often challenging to genotype full coding nucleotide sequences (CDSs) of VGSC from individual genomic DNA (gDNA). In this study, we designed biotinylated oligonucleotide probes from CDSs of VGSC of Asian tiger mosquito, Aedes albopictus. The probe set effectively concentrated (>80,000-fold) all targeted regions of gene VGSC from pooled barcoded Illumina libraries each constructed from individual A. albopictus gDNAs. The probe set also captured all orthologous VGSC CDSs, except some tiny exons, from the gDNA of other Culicinae mosquitos, A. aegypti and Culex pipiens complex, with comparable efficiency as a result of the high nucleotide-level conservation of VGSC. To improve efficiency of the downstream bioinformatic process, we developed an automated pipeline—MoNaS (Mosquito Na+ channel mutation Search)—which calls amino acid substitutions in the VGSC from NGS reads and compares those to known resistance mutations. The proposed method and our bioinformatic tool should facilitate the discovery of novel amino acid variants conferring insecticide resistance on VGSC and population genetic studies on resistance alleles (with respect to the origin, selection, and migration etc.) in both clinically and agriculturally important insect pests.


Introduction
Medically important insect vectors such as mosquitoes undergo strong selective pressure from insecticides in the field where control of vector-borne diseases such as malaria, dengue and zika is applied. This pressure often results in development of resistance against insecticides, which pose a potential risk for public health [1,2]. Although there can be various mechanisms by which insects acquire resistance against insecticides, two main physiological mechanisms are well known. One is enhanced metabolism of insecticide active ingredients by detoxification enzymes including cytochrome P450s, glutathione S-transferases and carboxyl esterases. The other mechanism is target-site insensitivity due to a point mutation(s) in the insecticide's target-protein which reduces the interaction between the two molecules.
Synthetic pyrethroids are the most frequently used insecticide group for the control of clinically important mosquitos. The mode of pyrethroid's toxicity is inhibition of the voltage-gated sodium channel (VGSC) in the nervous system [3]. Development of resistance against pyrethroids, which is known as knockdown resistance (kdr), was first reported in housefly, Musca domestica, in 1950s [4]. The kdr phenotype as well as another distinct phenotype, super-kdr, was eventually linked to amino acid (aa) substitutions on the two positions, L1014F and L1014F+M918T, respectively, on the gene coding VGSC protein [5,6]. Currently, these and other aa substitutions have been found to be associated with resistance in many medical and agricultural insect pests [7,8]. With this historical background, aa substitutions in a variety of insect species are often described with projection to the corresponding aa position in M. domestica VGSC for comparison. The VGSC is highly conserved among insects, and many of the resistance-conferring aa substitutions are seen in different species [9]. Therefore, it is relatively straightforward to infer the effect of certain aa substitutions in any species if the effect of those substitutions has already been elucidated in other species. Analyzing nucleotide sequences of an entire coding sequence (CDS) of the VGSC gene from genomic DNA (gDNA), however, is complicated because VGSC genes typically consist of many (>30) small exons sparsely distributed across a large genomic region which often exceeds 100 kbp. Therefore, most studies employing polymerase chain reaction (PCR) and direct sequencing usually cover only restricted regions where the known resistance-conferring substitutions are frequently found; e.g., IIS5-6 [7,8]. Such a bias may lower the chance of discovering novel resistance mutations existing outside the region investigated.
Aedes albopictus, the Asian tiger mosquito, is a medically important mosquito species ubiquitously present on most continents on the Earth. In some regions where the other effective vector, A. aegypti, is absent, the species often take a main role for transmitting chikungunya and dengue viruses [10,11]. The kdr substitution in A. albopictus had not been reported until the F1534C allele was discovered in Singapore 2009 [12]. Since this discovery, F1534C and other kdr substitutions at the same aa position, F1534S and F1534L, were reported from in A. albopictus in various geographic locations worldwide [13][14][15]. More recently, we also discovered the new kdr substitution V1016G in A. albopictus by extending the region of the search for mutations [16].
Next-generation sequencing (NGS) technology has reduced the cost and time of DNA sequencing by orders of magnitude. The recent Anopheles gambiae 1000 Genomes project, Ag1000G, has uncovered a number of previously unknown nonsynonymous mutations in the VGSC gene in A. gambiae and A. coluzzii [17], some of which have been suspected to cause resistance directly or indirectly. Although whole-genome sequencing (WGS) may discover novel variants of VGSC unequivocally, this naive approach is still too costly per sample just for analyzing VGSC. Alternatively, we considered an enrichment approach involving hybridization of oligo DNA/RNA [18] which is often employed to selectively sequence targeted genomic regions for studies e.g. on genotyping of disease-related genes in humans. This technology is aimed at increasing the depth of reads and the number of samples to be multiplexed per given sequencing capacity in return for limiting the region to be analyzed. In this study, we designed biotinylated oligonucleotide DNA probes from A. albopictus VGSC CDSs. The probe set efficiently concentrated targeted regions from the gDNA of individual A. albopictus. Although the probe set was designed from the A. albopictus VGSC gene, the same probe set captured most CDSs of other important arbovirus vectors A. aegypti and Culex pipiens complex, in which several kdr mutations are already known to exist [8,19], as a result of the high nucleotide conservation of VGSC. This technology allows for full-CDS analysis of the complex VGSC gene in a relatively low-cost and highly multiplexed manner, which is expected to promote discoveries of novel resistance-conferring aa substitutions both in medical and agricultural insect pests.

Results
From the A. albopictus genome assembly AaloF1 [20], 229 oligo DNA probes (xGen Custom Target Capture Probes, IDT) were designed. Of these, 145 probes target VGSC coding regions from gene model refined in this study (see Material & Methods). The rest of probes target exons of other genes located on the same scaffold as VGSC, JXUM01S000562, to detect signal of positive selection by method such as Sabeti, et al [19]. However, in the more recent contiguous assembly of the C6/36 cell line [21], many of those genes do not locate in proximity to VGSC, which suggests possible misassembly in either of the two assemblies. Therefore, in this study, we only evaluate performance of capture for VGSC gene. See Material & Methods for more details.
Targeted sequencing of the VGSC gene was conducted for 56 mosquito gDNA samples including A. albopictus, A. aegypti and Culex pipiens complex subspecies (C. quinquefasciatus, C. pipiens pallens and C. pipiens form molestus) ( Table 1) in a single run of Illumina MiniSeq. Each individual gDNA was indexed with different barcoded adapters, and in total, 4.9 million read pairs (150 bp PE) were assigned to the 56 samples. From those samples, 40-170, 50-120, and 63-100 K read-pairs were obtained from each individual mosquito of A. albopictus and A. aegypti and C. pipiens complex, respectively. The raw read data were deposited to DDBJ Sequence Read Archive (DRA) under BioProjectID: PRJDB7889 and each accession numbers as listed in Table 1. For comparison, random sequence read sets derived from reference genome sequences as a simulated output of WGS approach were mapped in same manner to the captured library data. In A. albopictus, 44% of all reads on average overlapped with the VGSC CDSs targeted (Reads overlapping per kilobase exon and per million sequenced reads: RPKM = 6.5 × 10 4 ), which was approximately 8.3 × 10 4 -fold enrichment compared to the simulated whole-genome shotgun sequencing (Fig 1). The efficiency reduced to 4.2 × 10 4 -fold enrichment after PCR duplicates were removed. Although the probe set was designed based on only the A. albopictus genomic sequence, the same probe set captured VGSC CDSs from  the gDNA of other mosquitos, A. aegypti and C. pipiens complex, at a similar on-target rate to A. albopictus (Fig 1). Fig 2 shows a distribution of median and minimum sequencing depths normalized to total amount of sequencing effort within each CDS in each individual sample after PCR duplicates were removed. In A. albopictus, most nucleotides in all exons were covered deeply with minimum bias in all samples. In A. aegypti and C. quinquefasciatus, however, some exons were covered at relatively low depth partly or entirely. In particular, exons 2 and 16.5 were covered at nearly or absolutely zero depth. Fig 3 presents a distribution of the allele balance in genotypes containing single or multiple nucleotide variants (SNVs or MNVs) called by FreeBayes [22]. The ratio of the first allele in a heterozygous genotype was distributed mostly around 50%, which was substantially different from the homozygous genotype (near 100%) except for one SNV or MNV site in exon 32 of the A. albopictus gene located in the GGT (Gly) trinucleotide tandem repeats variable in length near the C-terminus of VGSC, where accurate calling of the genotype is difficult.

Discussion
In this study, we evaluated the potential of targeted enrichment sequencing technology to genotype mosquito VGSC CDSs from individual gDNA samples. The result of the experiment is quite promising, most nucleotides of VGSC CDSs were covered at sufficient read depths even in samples with less than a 30 Mbp (0.1 million read-pairs) sequencing effort.
Even though the probe set was designed on the basis of the A. albopictus genome sequence only, it successfully enriched VGSC CDSs from the gDNA of two other Culicinae mosquito species, A. aegypti and C. pipiens complex, which are estimated to have diverged 71.4 and 179 million years ago, respectively, from A. albopictus [20]. A. aegypti show almost equal overall enrichment efficiency (relative RPKM of targeted capture sequencing compared to random sequencing) to that in A. albopictus (Fig 1). On the other hand, the overall enrichment efficiency was lower by approximately 50% in C. pipiens samples than that in A. albopictus, though the on-target ratio (absolute RPKM) was still comparable or rather higher in C. pipiens (Fig 1). An effective and cheap method to sequence voltage-gated sodium channel gene This result may be explained by much smaller genome size (579 Mb in the CpipJ2 assembly versus 2.25 Gbp in the C6/36 assembly) of C. pipiens that might compensate for the lower nucleotide identity to the probes. Applicability of a single probe set to multiple species (e.g., the same genus or family) is obviously advantageous because this obviates the need to prepare each custom probe sets specific for each single species and may enable capture even in species lacking prior genome information. Nonetheless, the evolutionary distance will limit the range of species that one probe set can be applied. In this study, the mapping results on each exon indicated that capture efficiency decreased in some exons (Fig 2). The empirical observation in this study suggests that less than 87.5% in identity or less than 60 bp in size for the homology track of targets could decrease the efficiency of capture significantly (Fig 4). Many exons in A. gambiae fall short of these criteria (S2 Fig), indicating the current probe set cannot directly be applied to this distant mosquito group. Our probe set especially failed to capture two optionally used exons, 2 and 16.5, in A. aegypti and C. pipiens complex, which are among the smallest exons targeted (Fig 2). It is assumed that those tiny exons alone do not provide enough thermostability for probe-target DNA duplex during the capture. Because the probes for those small exons contain flanking intronic sequences of the A. albopictus genome, those flanking sequences may have provided enough homology region to capture sequences from this species. Although it is straightforward to optimize our probe set further at least for the two other species of mosquito simply by adding species specific probes for those exons and flanking intronic regions, small exons in general will be a major challenge when a probe set is aimed to be used to a wide-range of species because the homology in an intronic region will decay more rapidly than that in an exonic region during speciation. We also missed another exon corresponding to "exon 12" in Anopheles gambiae described by Davies et al. [9] during probe design (see Material & Methods). Such tiny and rarely used exons may be difficult to annotate without high-quality high-throughput RNA sequencing data. Nevertheless, in mosquitoes, all such tiny exons are actually situated on the N-terminal intracellular loop or the intracellular loop between domains I and domain II in VGSC, where no resistance-associated mutation has been found so far [8]. Therefore, it is not clear whether ignoring those small exons of VGSC from analysis does pose a serious problem for insecticide resistance research.
Although our current probe set is designed only from coding exonic regions except for some small exons, hybridization capture methodology allows sequencing flanking intronic region (up to 200 bp, depending on library insert size) in addition to targets [31] (see Fig 5B). Single nucleotide variations (SNVs), which are generally more frequently found in such noncoding regions, provide valuable information for phylogenetic and population genetic information such as origin of resistance allele and their dynamics [17,32].
Copy number variation (CNV) in the VGSC gene with potential implication to insecticide resistance have been suggested in A. aegypti [33]. Although CNV detection was not attempted in this study because no validated CNVs of VGSC gene were involved in our samples, the ability of target capture sequencing (or exome sequencing) technology for CNV analysis has been explored in many studies [34,35]. Those approaches rely on differences in sequencing depth caused by copy number change. In this study, median sequencing depth in each exon normalized by total sequencing amount showed relatively uniform distribution among samples when the region was covered by sufficient number of reads (Fig 2). Further normalization could be possible from depth of other gene loci which are considered refractory to copy number change by adding such targets in the probe set, which should be explored in a future study.
The process of genotyping VGSC carried out in this study was automated in MoNaS ( Fig  5A). This program sequentially runs tools conducting mapping of NGS reads to a reference, sorting, removal of PCR duplicates, indexing for BAM files, variant calling, variant annotation, and finally integration of these results across multiple samples into a single table with conversion of the aa coordinates to those corresponding to the M. domestica VGSC protein.
The automation in MoNaS allows researchers to process raw NGS reads of many samples via a simple command line operation without expert knowledge of the bioinformatics field. MoNaS can be run locally (https://github.com/ItokawaK/MoNaS) with appropriate genome reference data. Also, a web-service of MoNaS implemented with JBrowse alignment viewer [36] is provided by NIID Pathogen Genomics Center's severer (https://gph.niid.go.jp/monas) (Fig 5B).

Design of custom probes
The full-length VGSC gene (AALF000723-RA in gene set: AaloF1.2) was found in scaffold JXUM01S000562 in the genome assembly of an A. albopictus Foshan strain, AaloF1 [20] hosted on vectorbase.org [37]. Because the annotation missed some CDSs (entire exon 19c for instance), we refined the annotation by aligning shotgun-sequenced NGS reads of VGSC cDNA [16] using Hisat2 [38] and the M. domestica VGSC protein sequence (GenBank accession No.: AAB47604) via BLASTX [39]. Compared to AaloF1.2, the refined annotation included three added CDSs, and four extended CDSs (see detail in S1 Table). Among the 35 coding CDSs in total, sizes of 34 (32 + 2 mutually excluding exons) CDSs matched to the A. aegypti VGSC CDSs annotated by Davies et al. [9]. Therefore, the numbering of exons in this paper was set to be concordant with the A. aegypti VGSC exons described by Davies et al. [9]. The additional optionally used 45 bp small exon, referred to as exon 16.5 here, was found in cDNA data between exons 16 and 17 in the genome. All CDS sequences, some of which contained a flanking intronic region (for tiny exons less than 120 bp in size) were submitted to the IDT website (https://sg.idtdna.com) to design 120 bp xGen Lockdown biotinylated oligonucleotide DNA probes with 2× Tiling density option. We also included some exons of other genes flanking VGSC (AALF020128, AALF020129, AALF020130, AALF000725, AALF000726, AALF000727, AALF000728, and AALF000730) or a gene nested in the intronic region of VGSC (AALF020132) during the probe design to take advantage of the population genetic   analysis in future studies. From 15 kbp genomic regions in total, 229 probes were designed (S2 Table), of which 145 target VGSC. Nonetheless, in the more recent contiguous assembly of the C6/36 cell line (see below), AALF020132, AALF000725, AALF000726, AALF000727, AALF000728, and AALF000730 are not located in the same assembly with the VGSC locus. For this reason, in this paper, we evaluate the performance of the probe set only in terms of VGSC CDS enrichment.

Samples
Fifty-six mosquitos either belonging to species A. albopictus, A. aegypti or Culex pipiens complex-either kept in the laboratory or caught in the wild (Table 1)-served as a source of

Reference genomes and annotation for VGSC
Although the probe sets were designed from assembly AaloF1, we chose a C6/36 cell line genome assembly, canu_80X_arrow2.2 [21], as a reference genome of A. albopictus for further bioinformatic analysis because this assembly has better contiguity and fewer scaffolds than AaloF1 does. In the canu_80X_arrow2.2 assembly, the whole VGSC gene was found in scaffolds MNAF02001058.1 and MNAF02001442.1 annotated as Gene IDs LOC109421922 and LOC109432678, respectively, in the NCBI Aedes albopictus Annotation Release 101. The two VGSC genes were assumed to be redundant haplotigs. To avoid dual mapping of the NGS reads, we purged MNAF02001442 by hard-masking this entire scaffold (replacing all bases with the "N" character) rather than MNAF02001058.1 because LOC109432678 in MNAF02001442.1 has a single frame-shifting nucleotide deletion in the thymine homopolymer track within exon 4 (TTTTTT ! TTTTT), which was suspected due to an uncorrected base-calling error. LOC109421922 was defined by the number of transcriptional variants in the NCBI's annotation because VGSC is known to have complex alternative splicing patterns [9]. Nevertheless, we simplified the VGSC gene model into two possible transcriptional variants to build a GFF3 annotation file for annotating aa changes. These two transcripts include all the regions of mandatory or optional CDSs but differ by the two mutually exclusive exons "19c/k" and "26d/l," where one carries exons "19c" and "26k," and the other contains exons "19d" and "26l." CDSs of all the transcriptional variants of LOC109421922 were merged via overlaps. Those merged CDSs perfectly matched AaloF1 except for LOC109421922 including exon 16.5 and except for one mutually exclusive exon "26k" whose sequence itself was found to be intact in MNAF02001058.1.
The VGSC gene in the chromosome level assembly of the A. aegypti genome, AaegL5.0 [40], was annotated in the same manner. The whole VGSC gene is encoded as AAEL023266 (the NCBI Aedes aegypti Annotation Release 101) on chromosome 3. AAEL023266 has 13 transcripts, these CDSs were merged via overlaps as in the canu_80X_arrow2.2 assembly of A. albopictus. AAEL023266 appeared to be lacking an exon corresponding to exon 16.5, whereas we found its sequence between exons 16 and 17. AAEL023266, however, contains an additional exon between exons 11 and 12. The 21 bp small exon was assumed to correspond to exon "12" in the Anopheles gambiae genome described by Davies et al. [9] and is situated within the intracellular loop between domains I and II. We found the sequence homologous to this exon also in the two A. albopictus genome assemblies, AaloF1 and canu_80X_arrow2.2; which means we had failed to include this exon in the probe design. The complete VGSC sequence was also found in scaffold NIGP01000811 and was assumed to be a redundant haplotigs. This scaffold was purged from the assembly by hard-masking.
In the C. quinquefasciatus genome assembly Cpip_J2 [29], the VGSC gene is located in scaffold supercont3.182. The VGSC gene in supercont3.182, however, contains a shorter exon 13 which was truncated by scaffolding gap and lacks the entire exon 14. Complete exons 13 and 14 were found in another scaffold, supercont3.1170, which contains an incomplete VGSC gene probably as an alternative haplotig. We fused contig AAWU01037504.1 containing exons 13 and 14 of VGSC from supercont3.1170 into supercont3.182 to restore the complete coding sequence of the VGSC gene, thereby creating supercont3.182_2 (S1A Fig). The VGSC in super-cont3.182 (and supercont3.182_2) still contained kdr aa substitutions, L932F and I936V, as already reported by Davies et al. (2007) plus unusual frameshifting deletions in exons 26l and 32 (S1B Fig). For these reasons, supercont3.182_2 was further polished by the consensus module in BCFtools [41] with the "-H 1" option using the variant information for Cpip-JNA-01 in the VCF file generated as described below, thereby finally resulting in supercont3.182_3. The latter scaffold was added to the genome assembly, and the original scaffolds super-cont3.182 and supercont3.1170 were purged by hard-masking. We were not able to find an exon corresponding to "exon 16.5" in A. albopictus and A. aegypti.

Bioinformatic analysis
The FASTQ data were mapped to the reference using BWA mem (v.0.7.17) [42] with default options. The resultant BAM files were sorted by the sort program from the SAMtools suite (v.1.9) [43], and we removed PCR duplicates by the rmdup programs from the SAMtools. Variant calling was performed on the resulting BAM files of each species in the FreeBayes software (v.1.2.0) [22] with default options. The variant annotation (for aa changes) was conducted with the csq program from the BCFtools suite (v.1.9) [41] with options "-l -p a". Finally, the discovered aa changes were projected onto the corresponding position in the M. domestica VGSC protein sequence (GenBank accession No.: AAB47604). Those bioinformatic processes ( Fig 5A) were automated in pipeline tools MoNaS (Mosquito Na + channel mutation Search; https://github.com/ItokawaK/MoNaS) written in the Python3 script language.
For estimating the level of enrichment, five sets of random data on whole-genome shotgun paired-end reads (150 bp × 2, 300 ± 50 bp insert length, 1 million read pairs) from each reference genomic assembly were simulated in the wgsim software (https://github.com/lh3/wgsim). The multicov program from the Bedtools suite (v.2.27.1) [44] was applied to calculate the number of reads overlapping with any targeted CDS regions. Nucleotide identities of exons were calculated using Muscle [45] and BioPython's Phylo package [46]. R (v.3.5.1) [47] and the ggplot2 package [48] were utilized for summarizing and visualizing the data.

Data accessibility
Raw NGS reads obtained in this study were deposited to DDBJ Sequence Read Archive (Bio-ProjectID: PRJDB7889, see Table 1 for accession no. of each sample). Annotation information for VGSC and new reference sequence of VGSC gene in C. pipiens complex used in this study (supercont3.182_3) are provided in S1 Data file. Web service and source codes of MoNaS are hosted in https://gph.niid.go.jp/monas and https://github.com/ItokawaK/MoNaS, respectively.  Davies et al., 2007 (ref. 9 in main text). The green zone represents >60 bp length and >87.5% similarity. (TIF) S1 Table. Annotation for VGSC CDSs used in this study (new) along with annotation in gene set AaloF1.2 (old) in genome assembly AaloF1. The coordinates for start and end are 0-based and 1-based, respectively, as for BED format. (TSV) S2 Table. List of probe sequences and corresponding genomic interval in AaloF1. The coordinates for start and end are 0-based and 1-based, respectively, as for BED format. (TSV) S1 Data. Zipped folder containing gff3 annotation files for VGSC gene model used in this study and fasta file for the sequence of scaffold supercont3.182_3 of C. pipiens complex.

Supporting information
(ZIP)