Complete mitogenome of Anopheles sinensis and mitochondrial insertion segments in the nuclear genomes of 19 mosquito species

Anopheles sinensis is a major malaria vector in China and Southeast Asia. Mitochondria are involved in many important biological functions. Nuclear mitochondrial DNA segments (NUMTs) are common in eukaryotic organisms, but their characteristics are poorly understood. We sequenced and analyzed the complete mitochondrial (mt) genome of An. sinensis. The mt genome is 15,418 bp long and contains 13 protein-coding genes (PCGs), two rRNAs, 22 tRNAs and a large non-coding region. Its gene arrangement is similar to that of previously published mosquito mt genomes. We identified and analyzed the NUMTs of 19 mosquito species for which both nuclear and mt genome sequences are available. The number, total length and density of NUMTs are significantly correlated with genome size. About half of the NUMTs are short (< 200 bp), but larger genomes can house longer NUMTs. NUMTs may help explain genome size expansion in mosquitoes. The expansion due to mitochondrial insertion segments varies among insect groups. PCGs are transferred to nuclear genomes at a higher frequency in mosquitoes, but the origin of NUMTs differs from that in mammals. Larger nuclear genomes carry a wider range of mt genome sequences in both mosquitoes and mammals. The study provides a foundation for functional research on mitochondrial genes in An. sinensis and helps us understand the characteristics and origin of NUMTs and their potential contribution to genome expansion.


Introduction
Mitochondria are eukaryotic cell organelles that are mainly involved in oxidative phosphorylation [1][2]. The conservation, easy alignment, maternal inheritance, and straightforward gene orthology of mitochondrial (mt) genomes have made the mt genome important for studies of phylogeny and evolution [3][4][5][6]. Mt genomes are sometimes associated with insecticide resistance. Several transcripts encoding enzymes such as NADH dehydrogenase and ATP synthase, which are involved in the production of energy within the respiratory chain, were overexpressed in Aedes aegypti larvae exposed to insecticides [7]. Anopheles sinensis is a major [...]
[...] Takara Ex Taq polymerase (Takara, Japan) and 1.5 μl of 25 mM MgCl2, 2.5 μl of 10× PCR Buffer (Mg2+ free), 3 μl of a dNTP mixture (2.5 mM each), 1 μl of each 10 mM primer [34], 0.2 μl of 5 U/μl Taq polymerase and 1 μl of the mt genome DNA template. PCR thermal cycling consisted of an initial denaturation at 94 °C for 5 min, followed by 35 cycles of denaturation at 94 °C for 1 min, annealing at 48 °C to 55 °C for 45 s, and elongation at 68 °C for 1 min, followed by a final elongation at 72 °C for 10 min. The PCR products were electrophoresed on a 1.0% agarose gel and then purified using a QIAquick Gel Extraction Kit (QIAGEN, Hilden, Germany). The purified PCR products were sequenced directly using the primer sets, except for the control region, for which the purified products were ligated into pMD-19T vectors, cloned, and then sequenced. All fragments were sequenced in both directions.
The obtained sequences were edited using DNAMAN (http://www.lynnon.com/) and were identified by reference to annotated mosquito mt genome sequences through alignment using Clustal X [35]. The sequences of PCGs were translated into amino acids using MEGA version 6.0 [36]. Almost all tRNAs were recognized by the online tRNAscan-SE Search Server v.1.21 [37], and the tRNAs that could not be found by tRNAscan-SE (tRNA-Arg and tRNA-Ser(AGN)) were confirmed by sequence homology comparison. The control region (CR) was examined for repeats and special structures with the aid of the Tandem Repeats Finder (http://tandem.bu.edu/trf/trf.html) [38]. The nucleotide composition was calculated using DNA Star (http://www.dnastar.com/). Codon usage bias was calculated using MEGA version 6.0 [36]. Strand asymmetry was evaluated by AT skew and GC skew using the formulae AT skew = [A% − T%]/[A% + T%] and GC skew = [G% − C%]/[G% + C%], respectively [39].
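The two skew formulae above can be illustrated with a minimal Python sketch. The function name `skews` and the toy sequence are hypothetical and are not taken from the study; base counts are used instead of percentages, which gives the same ratios.

```python
def skews(seq: str) -> tuple[float, float]:
    """Return (AT skew, GC skew) for a nucleotide sequence.

    AT skew = (A - T) / (A + T); GC skew = (G - C) / (G + C).
    Counts and percentages give identical ratios.
    """
    s = seq.upper()
    a, t, g, c = (s.count(base) for base in "ATGC")
    at_skew = (a - t) / (a + t) if (a + t) else 0.0
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0
    return at_skew, gc_skew


# Toy sequence (hypothetical, not from the An. sinensis mt genome):
# A=3, T=2, G=1, C=4
at, gc = skews("AAATTGCCCC")
print(f"AT skew = {at:.3f}, GC skew = {gc:.3f}")
```

A positive AT skew indicates more A than T on the analyzed strand, and a negative GC skew indicates more C than G.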

Nuclear mitochondrial DNA insertion analysis
Eighteen mosquito species' nuclear genome sequences in FASTA format were downloaded from VectorBase (https://www.vectorbase.org/). The genome sequence of An. sinensis was obtained from Chongqing Normal University (unpublished data) (S1 Table). The An. sinensis genome has an assembly scaffold size of 194.49 Mb, with gene area coverage 98.90%. Of the 19 mosquito species with genomes analyzed for NUMTs, three species belonged to the Culicinae with two species in the genus Aedes and one in the genus Culex. The remaining 16 species were all in the genus Anopheles and belonged to the Anophelinae. The 18 previously determined mt genome sequences were downloaded from NCBI, and the mt genome sequence of An. sinensis was obtained in the present study (S1 Table).
For each species, the NUMTs were identified by searching the mt genome sequence against the nuclear genome sequence using BLASTN. The significance threshold for the BLASTN search was set to E < 10⁻⁴, with a sliding window of 30 bp [40]. The number, total length and density (NUMTs per Mb of nuclear genome sequence) were calculated with Excel 2010 and used to measure the basic characteristics of NUMT occurrence in the genome. Excel 2010 was also used to count the NUMTs in different length classes. The statistical analyses were carried out using TIBCO Statistica (https://www.tibco.com/products/tibco-statistica). An R package was written to calculate the location and frequency of the different regions of the mt genome sequences that had been transferred to the nuclear genome.
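The summary step can be sketched in Python as follows. This is not the authors' Excel workflow: it assumes the BLASTN hits were saved in the standard 12-column tabular format (`-outfmt 6`), it reports density in bp per Mb as in Table 3, and the hit rows and the 100 Mb genome size are hypothetical. Overlapping hits are not merged in this sketch.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-4  # threshold stated in the text


def summarize_numts(blast_tsv: str, genome_size_mb: float) -> dict:
    """Summarize BLASTN hits given as tabular text.

    Assumed column order (BLAST -outfmt 6): qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore.
    """
    lengths = []
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row:
            continue
        sstart, send, evalue = int(row[8]), int(row[9]), float(row[10])
        if evalue < E_VALUE_CUTOFF:
            # Span on the nuclear scaffold; send < sstart on the minus strand
            lengths.append(abs(send - sstart) + 1)
    total = sum(lengths)
    return {
        "number": len(lengths),
        "total_length_bp": total,
        "density_bp_per_mb": total / genome_size_mb,
    }


# Hypothetical hits: two pass the E-value cutoff, one does not
hits = (
    "mt\tscaf1\t95.0\t100\t5\t0\t1\t100\t501\t600\t1e-30\t150\n"
    "mt\tscaf2\t90.0\t50\t5\t0\t200\t249\t1050\t1001\t1e-6\t60\n"
    "mt\tscaf3\t85.0\t30\t4\t0\t10\t39\t1\t30\t0.5\t20\n"
)
summary = summarize_numts(hits, genome_size_mb=100.0)
print(summary)
```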

An. sinensis mt genome and its organization
We determined the complete mt genome sequence of An. sinensis. It is a typical circular, double-stranded molecule of 15,418 bp (GenBank accession MF322628). The obtained mt genome sequence is 342 bp longer than an earlier report of the An. sinensis mt genome [9], and the difference is mainly due to earlier incomplete sequencing of the mt genome control region (CR). Except for the control region, the gene lengths in our assembly of the An. sinensis mt genome are the same as in the previous assembly (GenBank accession NC028016). Excluding the CR, the two mt genome sequences have a similarity of 98.6%, and a total of 151 single nucleotide polymorphisms (SNPs) were identified between them.
The mt genome sequence of An. sinensis contains a conserved set of 37 genes, including 13 protein-coding genes (PCGs), two rRNA genes (lrRNA and srRNA) and 22 tRNA genes (tRNAs), together with a large non-coding region (CR, also known as the AT-rich region) (Fig 1). Twenty-three genes are located on the majority strand (J-strand), while 14 genes reside on the minority strand (N-strand) (Fig 1 and Table 1). The gene arrangement is the same as in previously published mosquito mt genomes, and the only difference in arrangement from the mt genomes of other dipteran species is that the latter have the order trnA-trnR [41]. This difference in arrangement might be associated with the different adaptations and evolutionary histories of mosquitoes [42].

Characteristics of the An. sinensis mt genome
The A+T content of the An. sinensis mt genome is 78.34%, and the A+T contents of PCGs, tRNAs, rRNAs and CR are 76.85%, 78.59%, 81.46% and 93.58%, respectively (Table 2). This result is similar to the general feature inferred from earlier reported mosquito mt genomes [43][44][45][46][47][48][49][50][51] in that the CR has the highest A+T content, followed by rRNAs. For the 13 PCGs in the An. sinensis mt genome, the third codon position has the highest A+T content (94.24%), followed by the first codon position (69.46%) and the second codon position (66.84%). This result agrees with the data from other known mosquito mt genomes [43][44][45][46][47][48][49][50][51], in which the third codon position has the highest A+T content, followed by the first and second codon positions. AT skew and GC skew have also been widely used to measure the nucleotide compositional behavior of mt genomes [42]. The AT skew and GC skew of the An. sinensis mt genome are 0.026 and -0.155, respectively (Table 2). The AT-skew values are likewise positive and the GC-skew values negative for all other mosquito mt genomes [43][44][45][46][47][48][49][50][51], which indicates an overall preference for A and C in the mt genome. The PCGs of the An. sinensis mt genome show an overall negative AT skew (-0.144) and positive GC skew (0.047). It is a common phenomenon that the PCGs of mosquito mt genomes prefer T and G [43][44][45][46][47][48][49][50][51].
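The per-position A+T content reported above can be computed as in the following minimal sketch. The function name and the toy coding sequence are hypothetical; the study itself used DNA Star and MEGA for these calculations.

```python
def at_content_by_codon_position(cds: str) -> list[float]:
    """Return A+T percentages for codon positions 1, 2 and 3.

    The sequence is assumed to start in frame; trailing bases that do
    not complete a codon (e.g. an incomplete stop codon) are ignored.
    """
    s = cds.upper()
    usable = len(s) - len(s) % 3
    result = []
    for pos in range(3):
        bases = s[pos:usable:3]  # every third base, starting at this position
        at = sum(base in "AT" for base in bases)
        result.append(100.0 * at / len(bases) if bases else 0.0)
    return result


# Toy coding sequence (hypothetical): codons ATG CTA GGT TCA
percentages = at_content_by_codon_position("ATGCTAGGTTCA")
print(percentages)
```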
All 13 PCGs in the An. sinensis mt genome use ATN as the start codon, except for COI, which uses the special start codon TCG, and ND5, which uses GTG. Use of GTG as a start codon has been documented for mtDNA-encoded proteins in various organisms, including Anopheles species [52]. All 13 PCGs use the complete stop codon TAA, except for COII, COIII and ND4, which use an incomplete T as the stop codon. TAG is likewise not used as a stop codon in the other known mosquito mt genomes [43][44][45][46][47][48][49][50][51]. The amino acid usage bias of the 13 PCGs was determined for the An. sinensis mt genome. Leu has the highest percentage (16.05%), followed by Phe (9.67%), Ser (9.30%) and Ile (9.24%), and Cys has the lowest percentage (1.10%) (S1 Fig). This order is similar to that in other mosquito mt genomes [43][44][45][46][47][48][49][50][51]. The high usage frequency of leucine may reflect its role as a hydrophobic amino acid in the many transmembrane proteins of the mitochondria.
There are a total of 3733 codons in the An. sinensis mt genome, excluding termination codons, which is within the codon number range of other insect mt genomes (3585-3746) [53]. In terms of relative synonymous codon usage (RSCU), UUA is the most used codon (RSCU value 5.28) in the An. sinensis mt genome, followed by CGA (3.24), UCA (2.60), GGA (2.50), UCU (2.50), CCU (2.24), GCU (2.14), GUU (2.10) and ACA (2.10), and ACG is the least used codon (0.02) (Fig 2). The third codon position is mostly A (46.45%) or U (47.79%) in the An. sinensis mt genome. This phenomenon is consistent with previously reported mosquito mt genomes [43][44][45][46][47][48][49][50][51]. All tRNAs of the An. sinensis mt genome can be folded into the typical clover-leaf secondary structure, except for tRNA-Ser(AGN), which has lost the DHU stem (S2 Fig). This is a common feature in metazoan mt genomes [42]. Consistent with other known mosquito mt genomes, tRNA lengths range from 64 to 72 bp. A total of 18 mismatches, all G-U base pairs, were found in 12 tRNAs of the An. sinensis mt genome (S2 Fig). The large subunit rRNA (16SrRNA), 1328 bp long, is located between ND1 and tRNA-Val, and the small subunit rRNA (12SrRNA), 797 bp long, is located between tRNA-Val and the control region; both are on the minority strand. The total A+T content of the rRNA genes is 81.46%.
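As a reminder of how RSCU values such as those above are obtained, the sketch below computes RSCU for a single synonymous family: the observed count of each codon is divided by the count expected if all synonymous codons were used equally. The leucine family follows the invertebrate mitochondrial genetic code (NCBI translation table 5); the codon counts are hypothetical, and the study itself used MEGA.

```python
from collections import Counter

# Leucine codons under the invertebrate mitochondrial code (NCBI table 5)
LEU_CODONS = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"]


def rscu(codons, family):
    """Return {codon: RSCU} for one synonymous codon family.

    RSCU = observed count / (family total / number of synonymous codons).
    """
    counts = Counter(codon for codon in codons if codon in family)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in family}
    expected = total / len(family)
    return {codon: counts[codon] / expected for codon in family}


# Hypothetical codon list: 8 UUA, 2 UUG, 2 CUU (12 leucine codons in total)
toy_codons = ["UUA"] * 8 + ["UUG"] * 2 + ["CUU"] * 2
values = rscu(toy_codons, LEU_CODONS)
print(values)
```

With six synonymous codons the RSCU values of a family sum to 6, so a value well above 1, such as the 5.28 reported for UUA, means that a single codon accounts for most leucine residues.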
The CR plays an important role in the regulation of replication and transcription of the mt genome [54][55]. The CR of the An. sinensis mt genome is 577 bp long and is located between 12SrRNA and tRNA-Ile. The A+T content of this region (93.58%) is higher than that of the other regions of the An. sinensis mt genome. An 18 bp poly-T stretch was identified, which may be a recognition site for the initiation of replication in the mt genome [56]. In addition, two 46 bp long tandem repeats were found in the CR. Tandem repeat structures are common in CRs, but the repeat length and copy number vary among the other known mosquito mt genomes [6].
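A poly-T stretch like the one described above can be located with a simple regular-expression scan. This is only an illustrative sketch: the function name, the `min_len` parameter and the toy sequence are hypothetical, and the study used Tandem Repeats Finder to examine the CR.

```python
import re


def longest_poly_t(seq: str, min_len: int = 10):
    """Return (0-based start, length) of the longest T run, or None.

    Only runs of at least min_len consecutive T are considered.
    """
    best = None
    for match in re.finditer(r"T{%d,}" % min_len, seq.upper()):
        run_length = len(match.group())
        if best is None or run_length > best[1]:
            best = (match.start(), run_length)
    return best


# Toy control-region fragment (hypothetical) with an 18 bp poly-T stretch
fragment = "ATAA" + "T" * 18 + "AATA"
hit = longest_poly_t(fragment)
print(hit)
```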

NUMTs and their comparison in mosquitoes
We identified the NUMTs of 19 mosquito species for which both mt genome and nuclear genome sequences are known. In the subfamily Anophelinae, all 16 [...] The NUMT lengths in the six Anophelinae species range from 43 bp to 309 bp, while lengths in the three Culicinae species range from 37 bp to 15,580 bp. If the lengths are divided into three classes, large (> 2 kb), medium (200 bp to 2 kb) and small (< 200 bp), the large NUMTs exist only in the three Culicinae species, with the largest NUMT (15,580 bp) found in Cx. quinquefasciatus (S2 Table). This suggests that larger genomes can house longer NUMTs. When the NUMTs of all species are pooled, the numbers of large, medium and small NUMTs are 12, 157 and 171, respectively. Thus, about half of the NUMTs are small (< 200 bp). Most of the longer NUMTs would be broken up into shorter NUMTs through nucleotide deletions or insertions during evolution, which is the probable reason why half of the NUMTs are small [26,27].
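The three size classes and the "about half" statement can be checked with a short sketch. The class boundaries and the pooled counts (12, 157 and 171) come from the text; the function name is hypothetical, and assigning exactly 2,000 bp to the medium class is an assumption, since the text only gives "200 bp to 2 kb" and "> 2 kb".

```python
def size_class(length_bp: int) -> str:
    """Classify a NUMT by length: small (<200 bp), medium (200 bp to 2 kb), large (>2 kb)."""
    if length_bp < 200:
        return "small"
    if length_bp <= 2000:  # assumption: 2 kb itself counts as medium
        return "medium"
    return "large"


# Pooled counts reported in the text
counts = {"large": 12, "medium": 157, "small": 171}
small_fraction = counts["small"] / sum(counts.values())

print(size_class(43), size_class(309), size_class(15580))
print(f"small NUMTs: {small_fraction:.1%} of {sum(counts.values())}")
```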
In the 16 species of Anophelinae, the total length of NUMTs in each species ranges from 0 bp to 309 bp, and the density ranges from 0 bp/Mb to 1.83 bp/Mb of the nuclear genome. The average total length and density are 66.75 bp and 0.38 bp/Mb, respectively (Table 3 and Fig 3). In the three species of Culicinae, the total length of NUMTs in each species ranges from 28,431 bp to 92,934 bp, and the density ranges from 32.29 bp/Mb to 69.24 bp/Mb, with the average total length and density being 60,562.33 bp and 51.15 bp/Mb, respectively (Table 3 and Fig 3). The Culicinae species have a greater total length and density of NUMTs than the Anophelinae species. Statistical analyses showed that the total length of NUMTs is significantly correlated with genome size (R = 0.9104, p = 6.31E-08) in all 19 species investigated, which is consistent with the NUMT investigation of 85 species [57]. In addition, the density of NUMTs is significantly correlated with genome size in these mosquitoes (R = 0.7667, p = 1.29E-04). The total length and density of NUMTs are also significantly correlated (R = 0.9219, p = 2.04E-08). These results suggest that NUMTs could contribute to the expansion of genome size.
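The link between a correlation coefficient and its significance can be sketched as follows: a Pearson R over n data points corresponds to the statistic t = R·sqrt(n − 2)/sqrt(1 − R²) with n − 2 degrees of freedom. The sketch assumes that Pearson correlation was used and that all 19 mosquito species entered the analysis (n = 19); neither is stated explicitly for the Statistica analysis. The three-point data set is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing R = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Hypothetical toy data, only to exercise pearson_r
r_toy = pearson_r([1, 2, 3], [1, 3, 2])

# Reported R for total NUMT length vs genome size; n = 19 is an assumption
t_total_length = t_statistic(0.9104, 19)
print(f"r_toy = {r_toy:.2f}, t = {t_total_length:.2f} (df = 17)")
```

A t value of this size with 17 degrees of freedom corresponds to a very small two-tailed p-value, in line with the order of magnitude reported in the text.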

Table 3. Number, total length and density (bp/Mb nuclear genome) of nuclear mitochondrial segments (NUMTs) in the nuclear genomes of 19 mosquito species.

Among other insects, the total length and density of NUMTs in Drosophila melanogaster (genome ~170 Mb), Tribolium castaneum (~150 Mb) and Apis mellifera (~230 Mb) [...] [40], respectively. The genome sizes of these three species are comparable with those of the Anophelinae species; however, the total length and density of NUMTs in the three species are larger than those in the Anophelinae species, especially for T. castaneum and A. mellifera. The genome size expansion due to mitochondrial insertion segments thus varies among insect groups. The nuclear genome recombination rate of A. mellifera is much greater than that of D. melanogaster [59], suggesting that the total length and density of NUMTs in the nuclear genome may be related to the genome recombination rate. The largest number of NUMTs in the genomes of the nine mosquito species originated from the COI gene (39 NUMTs), followed by the 16SrRNA (31), CytB (27) and ND5 (22) genes. The fewest NUMTs originated from the CR (6) (Fig 4). In the six Anophelinae species, NUMTs were derived from only seven mitochondrial genes: three from COII in An. atroparvus, two from COIII in An. farauti, one from ND4 in An. sinensis, one from ND5 in An. christyi, one from 12SrRNA in An. minimus and one from tRNA-Cys-tRNA-Tyr in An. darlingi. In the three Culicinae species, NUMTs in Ae. aegypti and Cx. quinquefasciatus cover the whole mt genome. Similarly, NUMTs in Ae. albopictus cover the whole mt genome, with the exception of ND4, 12SrRNA and CR. These data suggest that the PCGs are transferred to the nuclear genome at a higher frequency, and that the larger nuclear genomes in mosquitoes carry a wider range of mt genome sequences.
The largest numbers of NUMTs were derived from the 16SrRNA gene in three mammals, Sus scrofa (6), Pan troglodytes (65) and Homo sapiens (53). In Mus musculus, the largest number of NUMTs originated from the CR (11), followed by ND2 (7) and ND4 (6), but no NUMT was found from tRNA-Val [60]. This suggests that the origin of NUMTs may differ between mosquitoes and mammals. The genome sizes of M. musculus, S. scrofa, P. troglodytes and H. sapiens are 2.7 Gb, 2.3 Gb, 2.9 Gb and 2.9 Gb, and the mt genome coverage rates of NUMTs are 84.7%, 32.6%, 100% and 100%, respectively [60]. The larger nuclear genomes in mammals appear to have a wider range of mt genome sequence coverage rates.
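A coverage rate of this kind can be computed as the union of the mt genome intervals matched by NUMTs, divided by the mt genome length. The sketch below is one possible way to do this and is not the method of reference [60]; the function name, the intervals and the genome length are hypothetical, coordinates are treated as 1-based and inclusive, and hits that wrap around the circular genome are not handled.

```python
def coverage_rate(intervals, genome_length: int) -> float:
    """Percent of the mt genome covered by the union of (start, end) intervals."""
    covered = 0
    current_end = 0  # last position already counted
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip the part already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / genome_length


# Hypothetical NUMT source intervals on a 1,000 bp toy mt genome:
# (1,100) and (50,150) overlap, so the union covers 150 + 101 = 251 bp
rate = coverage_rate([(1, 100), (50, 150), (300, 400)], genome_length=1000)
print(f"{rate:.1f}% of the mt genome is represented in NUMTs")
```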

Conclusion
We characterized the mitochondrial genes of An. sinensis through analysis of the complete mt genome sequence. NUMT analysis of 19 mosquito species showed that the number, total length and density of NUMTs are significantly correlated with genome size. NUMTs may therefore contribute to nuclear genome size expansion in mosquitoes. The genome size expansion due to mitochondrial insertion segments varies among insect groups. PCGs are transferred to the nuclear genome at a higher frequency in mosquitoes, but the origin of NUMTs is quite different from that in mammals. Larger nuclear genomes, in both mosquitoes and mammals, carry a wider range of transferred mt genome sequences.