The Complete Mitochondrial Genome of the Stalk-Eyed Bug Chauliops fallax Scott, and the Monophyly of Malcidae (Hemiptera: Heteroptera)

Chauliops fallax Scott, 1874 (Hemiptera: Heteroptera: Malcidae: Chauliopinae) is one of the most destructive insect pests of soybean and rice fields in Asia. Here we sequenced the complete mitochondrial genome of this pest. This genome is 15,739 bp long, with an A+T content of 73.7%, containing 37 typical animal mitochondrial genes and a control region. All genes were arranged in the same order as most of other Heteroptera. A remarkable strand bias was found for all nine protein coding genes (PCGs) encoded by the majority strand were positive AT-skew and negative GC-skew, whereas the reverse were found in the remaining four PCGs encoded by the minority strand and two rRNA genes. The models of secondary structures for the two rRNA genes of sequenced true bugs and Lygaeoidea were predicted. 16S rRNA consisted of six domains (domain III is absent as in other known arthropod mitochondrial genomes) and 45 helices, while three domains and 27 helices for 12S rRNA. The control region consists of five subregions: a microsatellite-like region, a tandem repeats region and other three motifs. The unusual intergenic spacer between tRNA-H and ND4 only found in the species of Lygaeoidea, not in other heteropteran species, may be the synapomorphy of this superfamily. Phylogenetic analyses were carried out based on all the 13 PCGs showed that Chauliopinae was the sister group of Malcinae and the monophyly of Lygaeoidea.


Introduction
The stalk-eyed bug, Chauliops fallax Scott, 1874, is an important pest of bean plants such as soybean and a minor cause of pecky rice in China, Japan and Korea [1][2][3]. Ecology and controlling methods of this species were studied in past years [2]. However, no molecular markers have been used to investigate the population genetic structure or evolutionary patterns of C. fallax, which might facilitate the managements of this pest.
The genus Chauliops was described by Scott (1874) as a lygaeid genus, then it was included in Heterogastrinae [4]. Breddin (1907) raised Chauliops to a subfamily Chauliopinae [5]. Later on, Chauliopinae and Malcinae were considered closely allied by many authors based on morphological characters [6]. Š tys (1967) gave family status to the Malcidae (only including Chauliopinae and Malcinae) also based on morphological characters [7], and his opinion was accepted by nearly all subsequent authors [8]. Hua et al. (2008) provided the mitochondrial genome (mt-genome) data of Malcinae (genus Malcus) [9], however, no molecular data have been published in Chauliopinae, and the relationships of Chauliopinae and Malcinae have not been confirmed by molecular evidence so far.
In recent years, the numbers of complete mitochondrial (mt) genome sequences of insects have a rapid increase due to its relative short in length (14-20 kb) and easy to get the whole genome, and widely use in inferring phylogenetic relationships [10]. Up to now, the complete mt-genomes of 282 species of insects, and the complete or nearly complete mt-genomes of 35 species among 30 families of Heteroptera, which includes 76 families totally [11,12], are available at NCBI (status September 10, 2012). However, the number of sequenced heteropteran mtgenomes is still very limited relative to the species-richness of Heteroptera.
In this study, we describe the complete mitochondrial genome sequence of the C. fallax and provide analyses of the nucleotide composition, codon usage, compositional biases, RNA secondary structure, and evaluate the phylogenetic relationship of Chauliopinae and Malcinae in Heteroptera based on the sequences of protein coding genes (PCGs). We compare the conserved sequences of RNA secondary structures of sequenced true bugs and Lygaeoidea species respectively, which may be helpful for aligning rRNA sequences correctly and reconstructing improved phylogeny trees in the future. Moreover, we also discover some potential conserved motifs of RNA secondary structure in Lygaeoidea.

Genome organization and structure
The complete mitochondrial genome sequence of C. fallax was a double-stranded circular DNA molecule of 15,739 bp in size and has been deposited in the GenBank (Accession number: JX839706; Figure 1). This mt-genome totally contained the typical 37 genes (two rRNAs, 13 PCGs and 22 tRNAs) and a large non-coding region (control region), with the same gene order as observed in most other true bugs [13] (Table 1). Gene overlaps were observed at 16 gene junctions and involved a total of 67 bp which may make the genome relatively compact; the longest overlap (16 bp) existed between ND4L and tRNA-Thr. The two gene pairs ATP8/ATP6 and ND4L/ND4 overlapped same seven nucleotides (ATGATAA). In addition to the control region, 101 nucleotides were dispersed in eight intergenic spacers, ranging in size from 1 to 59 bp. The longest spacer sequence was located between tRNA-His and ND4.

Transfer RNAs
All of the 22 typical animal tRNA genes were found in C. fallax mt-genome, ranging from 63 to 72 bp. Most of the tRNAs could be folded into the classic cloverleaf secondary structure except for tRNA-Ser (GCT), in which its dihydrouridine (DHU) stem simply formed a loop (see Figure S1). The amino acid acceptor stem (7 bp) and the anticodon loop (7 nt) had extremely low variability, and the most variable in size was the stems and loops of DHU and TYC, which the loop size (3-10 bp) was more variable than the stem size (2-5 bp). The length of the anticodon stems was conservative, with the exception of tRNA-Ser (GCT) which possessed a long optimal base pairing (9 bp in contrast to the normal 5 bp) and a bulged nucleotide in the middle for the AC stem.
A total of 28 unmatched base pairs exist in the C. fallax mitochondrial tRNA secondary structures, 26 of which were G-U pairs located in the amino acid acceptor stem (8 bp), the DHU stem (10 bp), the anticodon stem (4 bp), the TYC stem (4 bp), and the remaining two U-U mismatches in the amino acid acceptor stem of tRNA-Ala and tRNA-Leu (TAG) respectively. Additionally, 24 mismatches were significantly biased in eight tRNA genes which were encoded on the minority strand (N-strand), and the others were found on the majority strand (J-strand).

Ribosomal RNAs
The ends of C. fallax rRNA genes were assumed to extend to the boundaries of flanking genes, because it was impossible to precisely determined by DNA sequencing alone [14]. The 16S rRNA (large rRNA subunits) was assumed to fill up the blanks between tRNA-Val and tRNA-Leu (TAG). The 12S rRNA (small rRNA subunits) was located between tRNA-Val and the non-coding region. 16S rRNA has a length of 1,265 bp with an A+T content of 77.6%, while 12S rRNA has a length of 794 bp with an A+T content of 74.6%. The secondary structure of C. fallax 16S rRNA consisted of six structural domains (domain III is absent as in other arthropod mt-genomes) and 45 helices ( Figure 2). The secondary structure of 12S rRNA consisted of three structural domains and 27 helices ( Figure 3).
In 16S rRNA, domains IV and V are more conserved than domains I, II, and VI by the alignment of the sequenced species in Heteroptera and also in Lygaeoidea. Some helices (H563, H1775, H2064, H2507, and H2588) are highly conserved in both sequence and secondary structure among most heteropteran mtDNA, and only few nucleotides differences are found either in the terminal loops or in the terminal couplets of helices. In domain II, a conserved helix could not be observed within the available heteropteran mtDNA, but the helix H579, H822 and the internal loop of helix H991 are highly conserved in Lygaeoidea mtDNA. Helix H837 forms a long stem structure with a small loop in the terminal as frequently found in other insects [15]. In domain IV, the initial 5 bp of helix H1792 form hydrogen bonds as in most insects [16], and the first nucleotide pair in this helix often forms a UU interaction which has been observed in some helices of the insect 12S rRNA gene [17]. Although the terminal half of H1792 may pair as in many insects, these interactions are less conserved [16]. Accordingly, we leave the terminal half of H1792 unpaired in C. fallax. The secondary structure of the helix H1835 is similar to that proposed for Drosophila melanogaster [15], Apis mellifera [18] and Ruspolia dubia [19], which are different with the structure of some true bugs proposed by Li et al. [20]. In domain V, most helices are highly conserved in secondary structure, with the exception of H2077 and H2347. In terms of secondary structure and alignment, helix H2077 is the most problematic region with no apparent conserved motifs [16]. The helix H2347 is greatly variable among many insects mtDNA, and in C. fallax this region consisting of just the terminal 3 paired bases, which is similar to that proposed for Zygaena sarpedon lusitanica [21]. In addition, some helices (H183, H736, H991, H1057, H1087, H1196, H1648, H2077, H2347 and H2735) are greatly variable in both sequence and secondary structure, and the sequences from H183 to tRNA-Val are most different among most insect mtDNA.
In 12S rRNA, the sequence and secondary structure of domain III is more conserved than the other two parts (domains I and II) among most heteropterans. The 59 end of the 12S rRNA was made up of a long, unpaired sequence followed by a pseudoknot formed by 5 bp stem H9 and the 59 portion of stem H17. The helix H27 is probably ten base pairs long in C. fallax, while the secondary structure assumed identical with the fruit fly model (for Drosophila melanogaster) [15] and the Gutell model (for Apis mellifera) [18]. However, helix H27 was 8 bp long in some other models [22,23] for the reasons of two additional base pairings at the distal end of the helix is unclear. The helix H47 is highly variable among heteropterans and difficult to align with few nucleotides which only conserved in Lygaeoidea. A consistent secondary structure for this region could not be found even within all the available mitochondrial 12S rRNA structures. The possible folds of this section presumed for C. fallax consists a long stem, an internal loop and a short terminal loop, which was predicted by the Mfold web server [24]. From helices H567 to H769, the secondary structure of this circle section is highly variable among the studied taxa and only aligned ambiguously. An exception is the distal section of helix H769 is extremely conserved as in other insects [22]. In domain III, helices from H921 to H960 are highly conserved among Lygaeoidea. However, the most complicated portion of 12S rRNA located in the stem H1047 and the associated stems H1068, H1074 and H1113, possibly because its high AT bias and several non-canonical base pairs across many other insects [25]. Due to the evidence found for helix H1068 in insects [26], a six base-pair-long stem mostly comprising 59-GAAUAU-39 on one side and 59-AUUUUC-39 on the other. Helix H1303 consists of a lone nucleotide pair at the base of the helix, an internal bulge, and a distal stem containing three UU base pairs. Helix H1399 is more conserved than any other helices of 12S rRNA across true bugs, but the terminal loop is highly variable both in length and sequence.

Protein coding genes
Twelve of the 13 PCGs of C. fallax initiated with ATN as start codon (four with ATG, four with ATT, three with ATA and one with ATC) ( Table 1). The only exception was the COI gene, which used TTG as a start codon. This non-traditional start codon for the COI gene was also observed in other true bugs [13], and dipterans [27,28]. Most PCGs stopped with the complete termination codon: ten with TAA (ND2, COI, ATP8, ATP6, ND3, ND5, ND4, ND4L, ND6 and ND1) and one with TAG (CytB). The remaining two (COII and COIII) were terminated with a single T adjacent to a downstream tRNA gene on the same strand. The phenomenon that single T acts as termination codon was common in insect mt-genomes and it had been presumed that the complete termination codon TAA could be generated by posttranscriptional polyadenylation [29].

Nucleotide composition and codon usage
For the whole mt-genome of C. fallax, the nucleotide composition was significantly biased toward A and T. The A+T content was 73.7% (A = 44.8%, T = 28.9%, C = 16.7%, G = 9.6%), which is a common value among known hexapod mt-genomes ranging from 62.4% in Atelura formicaria (Zygentoma) [30] to 87.4% in Diadegma semiclausum (Hymenoptera) [31]. The average A+T content of all PCGs, tRNAs, rRNAs and the control region is 72.9%, 77.2%, 76.4% and 72.4%. The lowest A+T content is 67.5% in COI, while the highest is 81.8% in ATP8 ( Table 2). The nucleotide skew statistics [32] of all PCGs show that the J-strand PCGs (AT-skew = 0.11, GC-skew = 20.22) were much less TAand GC-skewed than the N-strand PCGs (AT-skew = 20.40, GCskew = 0.32), and the N-strand tRNAs had also higher GC-skewed than the J-strand tRNAs. This kind of strand bias of nucleotides composition has been generally related to asymmetric mutational constraints in the process of replication [33].
Besides, in C. fallax, it was interesting that each of the PCGs of Jstrand was positive AT-skew and negative GC-skew, whereas the reverse was observed in each of the PCGs of N-strand (Figure 4). This remarkable phenomenon has not been reported for any insect mt-genome before. Unfortunately, the mechanism of this phenomenon is unclear. However, there were reports that the value of GC skew was associated with replication orientation and AT skew varies with gene direction, replication and codon positions [34]. To deeply understand the mechanism of this phenomenon, more research work about mt-genomes sequences and function are needed to be done.
The nucleotide bias toward AT was also reflected in the codon usage ( Table 2). The analysis of the base composition at each codon position of 13 PCGs showed that the third codon position (82%) was higher in A+T content than the first (69.2%) and second (67.2%) codon positions. The mt-genome of C. fallax contained 3,664 codons totally, while 2,254 codons (61.5%) were found on the J-strand and 1,410 codons (38.5%) on the N-strand. Over all, four most prevalent codons in C. fallax, Ile (ATT) (8.68%), Met (ATA) (7.97%), Phe (TTT) (6.87%) and Leu (TTA) (6.65%) were all composed wholly of A and/or T, which may play an important role for the whole mt-genome high A+T content. In addition, the most infrequently used codons were NNG (267 codons, 7.3%) and the most frequently used codons were NNA (1,523 codons, 41.6%). The fourfold degenerate codon usage presented a strong bias towards adenine (A) at the third codon of J-strand PCGs whereas uridine (U) shows preponderance on the N-strand, except the Ser (AGN) whose most frequently used codons are ended with A ( Figure 5). The twofold degenerate codon usage demonstrated definite bias favoring A/U over G/C at the third codon position on both strands, except the Gln (CAR) and Lys (AAR) of the Nstrand favoring G rather than A. All codons are present on both strands of C. fallax mtDNA PCGs, but AGG and CGC codons are not observed in the J-strand, and CUC, AUC, CCC, CCG and CGA codons in the N-strand, reflecting the influence of a strong biased codon usage [35].

Non-coding regions
The mt-genome of C. fallax includes three major non-coding regions of more than 20 bp: spacer 1 was 59 bp between tRNA-His and ND4, spacer 2 was 20 bp between tRNA-Ser and ND1, and spacer 3 was 1,148 bp with 72.4% A+T content between 12S rRNA and tRNA-Ile (I)-tRNA-Gln (Q)-tRNA-Met (M) gene cluster ( Figure 1). Spacer 1 is a feature common to each of the five Lygaeoidea mtgenomes (38 bp in Berytidae, 40 bp in Malcinae, 59 bp in Chauliopinae, 72 bp in Colobathristidae, and 124 bp in Geocoridae) which have been sequenced to date but is not found in other heteropterans. Additionally, in Malcinae, it has been reported that one subregion of the intergenic spacer between tRNA-His and ND4 has an exactly repeated counterpart in the control region (34 nt, Blast E-value: 2e-15), and thought it may be the autapomorphy of Malcidae [9]. However, in Chauliopinae, the sister group of Malcinae, no copy of this spacer was found across all the mtgenome. Hence, the repetition may be the autapomorphy only for Malcinae, not including the subfamily Chauliopinae.
Spacer 2 is common to most insect mt-genomes. Among these spacers, there are two consensus motifs in a conserved sequence block (CSB) region ( Figure 6), which may indicate that they undergo a common intermediate stage of tandem duplication and random loss (TDRL) process [36]. There is a 5 bp motif, AATGA, which is conserved across the members of Lygaeoidea, and to a lesser extent across the infraorder Pentatomomorpha, WRTGA. The another 5 bp motif, ACTTA, which is conserved across the members of Pentatomomorpha with the exception of Malcidae (Malcinae + Chauliopinae), ACCTA, which may be the autapomorphy of Malcidae and provide another evidence for the monophyly of Malcidae. Spacer region 3 is considered as the control region identified in other mt-genomes [37] which includes the origin sites for transcription and replication [38]. In some arthropods mtgenomes, the control region was reported to have one to four of these four different motifs: tandem repeats, poly-thymine (poly-T) sequence, a subregion of even higher AT richness, and a stem-loop structure [39]. The control region of C. fallax contained all these four motifs and could be divided into five parts ( Figure 7A): (1) at the 59-end of the control region is a 7 bp poly-C structure, which was also found in other insects [20,25]; (2) a 8 bp poly-T stretch and a microsatellite-like region ((TA) 4 (GATATA) 2 ); (3) a 35 bp region heavily biased toward A+T (91.4%); (4) a 460 bp region contained four tandem repeats including three (I-III) 122 bp repeat units and a partial copy of the repeat (IV) 94 bp, which were identified by tandem repeats finder server [40]; (5) a potential stem-loop secondary structure was found at the end of control region, however, without 'TATA' sequence existed at the 59 end and 'G(A)nT' at the 39 end ( Figure 7B). In the second region, the poly-T stretch may play a role in the control of transcriptional or may be the site of replication initiation [41]. The microsatellite-like region, located 188 bp from 12S RNA, was rare and only been reported in Stenopirates sp. [20] among all studied heteropterans. In the fourth region, tandem repeats are common in the control region for most insects, and length variations may be caused by a variable copy number of repetitive elements, which produces obvious size variation in the entire mt-genome [42]. The existence of tandem repeats can be explained by replication slippage mechanism [42,43].

Phylogenetic relationships
Phylogenetic analysis was performed with the large data set, 29 heteropteran species as ingroups and other 3 hemipterans as outgroups (Acyrthosiphon pisum [44], Sivaloka damnosus [45] and Lycorma delicatula [13]). Bayesian inference and ML analyses recovered fully bifurcating trees with the same topology ( Figure 8). In the present study, the topology of infraordinal relationships of Heteroptera is similar to previous work [25]. Two Gerromorpha superfamilies were monophyletic in the basal position of these five infraorders. Within Cimicomorpha, Reduviidae was paraphyletic with respect to Anthocoridae and Miridae. In Pentatomomorpha, our results support that Aradoidea and the Trichophora are sister groups as indicated in Xie et al. [46]. In Eutrichophora, our study was (Lygaeoidea + (Pyrrhocoroidea + Coreoidea)) but poorly supported by ML and Bayesian inferences, while more extensive taxa sampling was needed in further analysis. In Lygaeoidea, our conclusion was (Colobathristidae + (Berytidae + (Geocoridae + Malcidae))), and the sister-relationship of Malcinae and Chauliopinae was confirmed.

Ethics statement
No specific permits were required for the insect collected for this study in Zhejiang Province, China. The insect specimens were collected in the soybean field by net sweeping. The field studies did not involve endangered or protected species. The species in the genus of Chauliops are common small insects and are not included in the ''List of Protected Animals in China''.  After being transported to the laboratory, they were stored at 220uC until used for DNA extraction.

PCR amplification and sequencing
Total genomic DNA was extracted from muscle tissue of thorax by a CTAB-based method [47]. The entire mt-genome of Chauliops fallax was amplified in four overlapping PCR fragments by PCR amplification. The primer were modified from previous work [48], and designed from the sequenced fragments (see Table S1).
PCRs were performed with TaKaRa LA Taq under the following conditions: 1 min initial denaturation at 94uC, followed by 30 cycles of 20 s at 94uC, 1 min at 50uC, and 2-8 min at 68uC, and a final elongation for 10 min at 72uC. The PCR products were electrophoresed in 1% agarose gel, purified, and then sequenced by ABI 3730XL capillary sequencer with the BigDye Terminator Sequencing Kit (Applied Bio Systems). All fragments were sequenced with primer walking on both strands.

Sequence analysis and annotation
Sequence files were proof read and assembled into contigs in BioEdit version 7.0.5.2 [49]. Protein coding regions were identified by ORF Finder implemented by the NCBI website (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) with invertebrate mitochondrial genetic codes. To ensure the accurate boundaries of different genes, protein coding regions and ribosomal RNA genes were compared with published insect mitochondrial sequences with CLUSTAL X version 1.83 [50]. Transfer RNA analysis was conducted using tRNAscan-SE version 1.21 [51] with the invertebrate mitochondrial codon predictors and a cove score cut off of 5. Only a few of tRNA genes that could not be detected by tRNAscan-SE were identified by comparing to other heteropterans. Nucleotide composition and codon usage were analyzed with MEGA 5.0 [52]. Strand asymmetry was calculated using the formulas: AT skew = [A2T]/[A+T] and GC skew = [G2C]/ [G+C] [32]. The putative control region was inferred using the Mfold web server (http://mfold.rna.albany.edu/) [24] with default settings to identify the regions of potential inverted repeats or palindromes. The tandem repeats of the control region were identified by tandem repeats finder server (http://tandem.bu.edu/ trf/trf. html) [40].

Phylogenetic analyses
Phylogenetic analysis was carried out based on the 29 complete or nearly complete mt-genomes of true bugs from GenBank. Three species from Sternorrhyncha and Auchenorrhyncha were selected as outgroups (see Table S2). DNA alignment was inferred from the amino acid alignment of 13 PCGs using MUSCLE as implemented in the MEGA version 5.0 [52]. Alignments of individual genes were then concatenated to be the data set used to reconstruct the phylogeny excluding the stop codon and the third codon. GPU MrBayes [54] and PHYML online web server [55] were employed to reconstruct the phylogenetic trees with the GTR+I+G model estimated by Modeltest Version 3.7 [56]. In Bayesian inference, two simultaneous runs of 5,000,000 generations were conducted for the matrix. Each set was sampled every 100 generations with a burnin of 25%. Trees inferred prior to stationarity were discarded as burnin, and the remaining trees were used to construct a 50% majority-rule consensus tree. In ML analysis, the parameters were estimated during analysis and the node support values were assessed by bootstrap resampling (BP) calculated using 100 replicates.