The Complete Mitochondrial Genome of the Asiatic Cavity-Nesting Honeybee Apis cerana (Hymenoptera: Apidae)

In the present study, we determined the complete mitochondrial DNA (mtDNA) sequence of Apis cerana, the Asiatic cavity-nesting honeybee. We present here an analysis of features of its gene content and genome organization in comparison with Apis mellifera to assess the variation within the genus Apis and among main groups of Hymenoptera. The size of the entire mt genome of A. cerana is 15,895 bp, containing 2 ribosomal RNA genes, 13 protein-coding genes, 22 transfer RNA (tRNA) genes and one control region. These genes are transcribed from both strands and have a nucleotide composition high in A and T. The contents of A+T of the complete genomes are 83.96% for A. cerana. The AT bias had a significant effect on both the codon usage pattern and amino acid composition of proteins. There are a total of 3672 codons in all 13 protein-coding genes, excluding termination codons. The most frequently used amino acid is Leu (15.52%), followed by Ile (12.85%), Phe (10.10%), Ser (9.15%) and Met (8.96%). Intergenic regions in the mt genome of A. cerana are 705 bp in total. The order and orientation of the gene arrangement pattern is identical to that of A. mellifera, except for the position of the tRNA-Ser(AGN) gene. Phylogenetic analyses using concatenated amino acid sequences of 13 protein-coding genes, with three different computational algorithms (NJ, MP and ML), all revealed two distinct groups with high statistical support, indicating that A. cerana and A. mellifera are two separate species, consistent with results of previous morphological and molecular studies. The complete mtDNA sequence of A. cerana provides additional genetic markers for studying population genetics, systematics and phylogeographics of honeybees.


Introduction
The Asiatic honeybee, Apis cerana Fabricius (Hymenoptera: Apidae), is an important honeybee species in Asian countries. For a long time A. cerana was considered a sub-species of A. mellifera. However, on the basis of several genetic, morphological, and behavioural characteristics, it has been established that A. cerana and A. mellifera are two completely isolated species, and they were considered most recently derived and sister taxa [1][2][3][4]. Honeybee has been the object of considerable scientific curiosity and has served as the classical model organism in diverse fields such as navigation, face recognition and sensory biology for almost a century [5][6][7]. Honeybee is regarded as the premier pollinator of major fruit crops. It has a long history of association with mankind because of its cavity-nesting lifestyle and it is used for producing honey, bee pollen, wax, and royal jelly [8].
Most metazoan species possess a compact, circular mitochondrial (mt) genome, which varies in size from 14 to 19 kb that contain 36-37 genes, including 12-13 protein-coding genes, two ribosomal RNAs (rRNA) genes and 22 transfer RNAs (tRNA) genes necessary for translation of the proteins encoded by the mtDNA [9][10][11]. mtDNA has been extensively used for studying phylogenetic and evolutionary relationships among animal species, due to its maternal inheritance, rapid evolutionary rate, and lack of genetic recombination [12][13][14][15][16]. mtDNA has proved to be an important tool in intra-specific and inter-specific phylogenetic studies of honeybees [17][18][19][20][21], and one region of the Apis mt genome, a non-coding sequence located between cytochrome oxidase I (cox1) and cytochrome oxidase II (cox2) has proven to be particularly informative for intra-specific studies [22,23].
The Hymenoptera are one of the largest insect orders belonging to hexapods, comprising over 100,000 described species [24]. In 1993, Crozier reported the complete sequence of the A. mellifera mt genome, this was the first identified mt genome of Hymenopteran [25]. Until recently, only 9 complete and 18 nearly complete mt genome sequences have been determined in Hymenoptera [25][26][27][28][29][30][31][32][33][34][35] (Table 1), and in the genus Apis, only the A. mellifera mt genome has been determined. The lack of knowledge of mt genomics for Hymenoptera is a major limitation for population genetic and phylogenetic studies of the Hymenoptera including the species in the Apidae.
In the present study, we determined the complete nucleotide sequence of the A. cerana mt genome, we performed phylogenetic analyses using selected Hymenoptera species. The new sequence may provide useful information on both genomics and the evolution of Hymenoptera, because there are only a few complete (or nearly complete) mtDNA sequences available from these animals.

Results and Discussion
General features of the mt genome of A. cerana The length of the complete mt genome of A. cerana was 15,895 bp (Figure 1), and the mtDNA sequence was deposited in the GenBank under the accession number GQ162109. The mt genome of A. cerana contains 13 protein-coding genes (cox1-3, nad1-6, nad4L, atp6, atp8 and cob), a small subunit ribosomal RNA gene (rrnS), a large subunit ribosomal RNA gene (rrnL), and 22 transfer RNA genes ( Table 2). The A. cerana mt genome shows a novel arrangement within Hymenoptera compared with the ancestral pancrustacean mt genome organization [36]. All rearranged genes are tRNAs, tRNA gene rearrangements can be classified as translocations, local inversions, or shuffling. A translocation is a movement of a gene to another position across a protein-coding gene. A rearrangement is classified as an inversion when the tRNA is found on the opposite strand and as shuffling when the tRNA gene is on the same mt stand but in a different position compared with the ancestral organization (without movement across a protein-coding gene) [37].
As shown in Figure 2, we identified eight rearrangements shared between A. cerana and A. mellifera. tRNA-Ala, tRNA-Ser (AGN) and tRNA-Glu move to tRNA-Ile-tRNA-Gln-tRNA-Met cluster, and tRNA-Ala changes position relative to tRNA-Ser (AGN) and tRNA-Glu. tRNA-Trp moves across two tRNAs (tRNA-Cys and tRNA-Tyr) boundaries. The order of tRNA-Ile-tRNA-Gln-tRNA-Met becomes tRNA-Met-tRNA-Gln-tRNA-Ile, and two of tRNAs (tRNA-Gln and tRNA-Arg) arrangements are inversions. Finally, the tRNA-Lys and tRNA-Asp swap positions. The orientation and gene order of the A. cerana mt genome were identical to that of A. mellifera except for the tRNA-Ser (AGN) . In A. cerana, tRNA-Ser (AGN) gene was only translocated from nad3-nad5 junction to the junction of nad2 and AT-rich region compared with the ancestral pancrustacean mt genome organization. In A. mellifera, tRNA-Ser (AGN) gene was not only translocated but also shuffled, which swapped positions with tRNA-Glu gene. The duplication/random loss model, and the intramitochondrial genome recombination and duplication/nonrandom loss model are possible mechanisms to explain translocation and shuffling. Of these, intramitochondrial genome recombination is presumed to be more common [37]. The genes atp6, atp8, cox1, cox2, cox3, cob, nad2, nad3 and nad6 are transcribed from one strand, while nad1, nad4, nad4L, and nad5 are transcribed from the other strand. The nucleotide compositions of the entire mtDNA sequences for A. cerana are biased toward A and T, with T being the most favored nucleotide and G the least favored, in accordance with the mt genome of A. mellifera. The content of A+T is 83.9% for A. cerana (42.3% A, 41.6% T, 6.3% G and 9.8% C) ( Table 3), 84.9% for A. mellifera (43.2% A, 41.7% T, 5.5% G and 9.6% C), respectively. The bias of the base composition of an individual strand can be described by skewness [38], which is calculated as (A%2T%)/(A%+T%) and (G%2C%)/(C%+G%), respectively. AT-skews and GC-skews of each of the 13 proteincoding genes and the 2 ribosomal RNA genes and the whole mt genome were calculated for A. cerana and A. mellifera (Table 3). In A. cerana, except for the atp8 gene which has a relatively weak skew of A vs T (AT skew = 0.092), other twelve PCGs have a skew of T vs A (AT skew between 20.002 and 20.209). The PCGs of H-strand have a strong skew of C vs G (GC skew between 20.098 and 20.309), and the nad1, nad4, nad4L and nad5 genes, which coded on the L-strand have a strong skew of G vs C (GC skew between 0.252 and 0.470). In the two rRNA genes, the GC skew displayed the same pattern (GC skew = 0.352 and 0.312 for the rrnS and rrnL genes, respectively). The AT skew displayed an opposite pattern (AT skew = 0.076 and 20.029 for rrnS and rrnL genes, respectively). By comparing the skew values of 13 protein-coding genes and two rRNA genes between A. cerana and A. mellifera, we found that GC skews of different genes coded on the same strand are all positive or negative, whereas the AT skews of different genes coded on the same strand are either positive or negative. This is congruent with the pattern of skew values in other insects [39].  found that the criterion for detecting reversal of strand asymmetry on mt genomes should be the sign of GC skew values, not the AT skew values.
The A. cerana mt genes overlap a total of 27 bp in 4 locations ranging from 1 to 19 bp (Table 2). Gene overlaps on the same strand in the mt genome of A. cerana can be founded in three regions. The longest is a 19 bp overlap between atp8 and atp6. The second case is a 5 bp overlap between cox1 and tRNA-Leu (UUR) . The third is 2 bp overlap occurred between cox2 and tRNA-Asp. Overlaps of genes coded by the different strands also can be found in the mt genome of A. cerana. one bp overlap exists between nad2 and tRNA-Cys. Interestingly, the A. cerana mt genome has the same overlapping nucleotides and locations between genes as those of A. mellifera, where genes also overlap a total of 27 bp in four locations [25].
Apart form the A+T-rich region, the A. cerana mt genes are separated by a total of 705 bp of intergenic spacer sequences, which are spread over 22 regions and range in size form 1 to 231 bp ( Table 2). The longest intergenic region (231 bp) is located between tRNA-Met and tRNA-Gln genes. In A. mellifera mt genome, a total of 813 bp of intergenic spacer sequences are spread over 24 regions, ranging in size from 1 to 193 bp. The longest one is located between tRNA-Leu (UUR) and cox2 [25]. This region has proven to be particularly informative for intraspecific studies, because the sequences do not appear to be subject to strong purifying selection and accumulate numerous base substitutions and insertion/deletions [21]. In comparison, A. cerana has only 89 bp intergenic spacer sequence in this region. Such short intergenic regions were also found in other metazoans [40][41][42]. Furthermore, previous studies have shown that the intergenic regions in the mt genomes of some metazoan contain signals for transcription initiation and replication [43,44]. In the spacer region (tRNA-Ser (UCR) -nad1), we found a 6 bp motif (TACTTA), which is conserved between A. cerana and A. mellifera. This intergenic spacer region may correspond to the binding site of mtTERM, a transcription attenuation factor [45].

Protein-coding genes and codon usage patterns
The boundaries between protein-coding genes in the mt genome of A. cerana were determined by aligning their sequences and by identifying translation initiation and termination codons with comparison to those of A. mellifera. The amino acid sequences inferred for nad3, nad4L, cox1 and cox3 genes are similar in length to those of A. mellifera ( Table 4). The similarity of the amino acid sequences is 51.5%-97.5% between A. cerana and A. mellifera (Table 4). Based on similarity, cox1 is the most conserved proteincoding gene, while the nad6 is the least conserved. Genes in mt genome may have different evolutionary rates, which might be caused by different selection pressures or the restriction of gene function. That is what we found in honeybee that rates of different mitochondrial genes diverged differently.
The predicted translation initiation and termination codons for the protein-coding genes of A. cerana mt genome were compared with those of A. mellifera. All protein-coding sequences started with a typical ATN codon. The start codons inferred in the mt genome of A. cerana are ATC (one of 13 protein-coding genes), ATT (eight of 13 protein-coding genes), ATG (three of 13 protein-coding genes) and ATA (one of 13 protein-coding genes) as a translation initiation codon (Table 2). Furthermore, the anomalous initiation codon and incomplete stop codon frequently occur in protein-coding genes of most of the Hymenoptera mtDNA sequenced to date, such as the gene cox1 has been found to employ TTG as a start codon and nad4 has been found to employ TA as a stop codon in Vanhornia eucnemidarum [27]. The codon TGA, which is a termination codon in the universal code, codes for tryptophan in most mt genomes except those of higher plants [46]. All stop codons used TAA as a termination codon in the mt genome of A. cerana. This is presumably polyadenylation after transcription to complete the termination codon [47]. However, two of the 13 protein-coding genes (cox1, cox2) are predicted to employ T as an initiation codon in the mt genome of A. mellifera [25]. Furthermore, incomplete stop codon frequently occurs in protein-coding genes of most of the Hymenoptera mt DNA sequenced to date [27,28,32,34]. The pattern of codon usage in the A. cerana mtDNA was also studied. Excluding the termination codons, a total of 3672 amino acids are encoded by the A. cerana mt genome, and a total of 3676 amino acids for A. mellifera. The most frequently used amino acids were Leu (15.52%), followed by Ile (12.85%), Phe (10.10%), Ser (9.15%) and Met (8.96%) ( Table 5).

Transfer RNA genes
The 22 tRNA genes encoded in the mt genome of the A. cerana vary in length from 60 to 78 nucleotides with differences in stem and loop sizes of dihydrouridine (D) and TYC loops. The order and orientation of the gene arrangement pattern are identical to that of A. mellifera, except for the position of the tRNA-Ser (AGN) gene. All of the 22 tRNA genes have the typical cloverleaf structure, except for tRNA-Ser (AGN) that lacks DHU arm ( Figure  S1). Their putative secondary structures ( Figure S1) are similar to those of A. mellifera [25], indicating their similar functions. Among Hymenopteran, mismatched base pairs have been reported in mitochondrial tRNAs inferred for A. mellifera, Melipona bicolor, Diadegma semiclausum, Bombus ignitus and Evania appendigaster [25,28,31,32,34]. A total of 6 mismatched base pairs occur in tRNAs of A. cerana. Two tRNA genes, tRNA-Ala and tRNA-Gln,  Table 3. Nucleotide composition and skews of Apis cerana mitochondrial protein-coding and ribosomal RNA genes and comparison with Apis mellifera. have a single G-T and T-T mismatches in the acceptor stem, respectively. tRNA-Val has a single T-T mismatch in the anticodon arm, and tRNA-Cys, tRNA-His and tRNA-Pro all have a single G-T mismatches in the DHU arm. Sequences of three tRNAs overlap with adjacent genes, tRNA-Cys overlaps with the adjacent gene nad2 on the opposite strand for 1 bp at its 39-end. tRNA-Leu (UUR) overlaps with the adjacent gene cox1 on the same strand for 5 bp at its 39-end. tRNA-Asp overlaps with the adjacent gene cox2 on the same strand for 2 bp at its 39-end. Finally, and most importantly, stem mismatches and sequence overlap are not uncommon for mt tRNAs of Hymenopteran, and are probably repaired by a posttranscriptional editing process [48,49]. The 22 tRNA genes are located on both strands, 14 tRNAs are encoded on the H-strand and 8 on the L-strand (Table 2). In these tRNAs, there is a strict conservation of the sizes of the amino acid acceptor stem (13-15 bp) and the anticodon loop (7-9 bp). Their D-loops consist of 4-10 bp. The extra variable loops of 4-5 bp and the TYC loops of 4-10 bp complete the cloverleaf structures. In most metazoan mitochondria, tRNA-Ser (AGN) generally has an unpaired DHU arm, while tRNA-Ser (UCN) has a standard cloverleaf structure [50,51]. The Apis tRNA-Ser (AGN) and tRNA-Ser (UCN) structures conform to this interpretation.

Ribosomal RNA genes
The rrnS and rrnL genes of A. cerana were identified by sequence comparison with those of A. mellifera. The rrnS is located between tRNA-Leu (CUN) and tRNA-Val, and the rrnL is located between tRNA-Val and the A+T-rich region. The length of the rrnS and rrnL gene of A. cerana is 786 bp and 1328 bp, respectively, but relatively shorter than the genes in A. mellifera (827 bp and 1371 bp respectively) ( Table 2). The A+T contents of the rrnS and rrnL for A. cerana are 81.6% and 83.1%, respectively (Table 3). Sequence identity in the rrnS and rrnL genes is 53.9% and 86.0% between A. cerana and A. mellifera, respectively.

A+T-rich region
The A+T-rich region is believed to be involved in the regulation of transcription and control of DNA replication, characterized by five elements: (1) a polyT stretch at the 59 end of the A+T-rich region, which may be involved in the control of transcription and/ or replication initiation; (2) a [TA(A)] n -like stretch following the polyT stretch; (3) a stem and loop structure, which may be associated with the second strand-replication origin; (4) a TATA motif and a G (A) n T motif flanking the stem and loop structure, and (5) a G+A rich sequence downstream of the stem and loop  structure [52]. The A+T-rich region is well known for the initiation of replication in both vertebrates and invertebrates, and the reduced G+C content is one of the most outstanding features of this region [9]. The largest non-coding region (562 bp) in the A. cerana mt genome is flanked by rrnS and tRNA-Ser (AGN) , it is highly enriched in AT (95.5%). In this region, we found some of the elements commonly present in most insects' A+T-rich region, such as PolyT stretch, [TA(A)] n -like stretch, stable stem-loop secondary structures (not shown), and TATA motif, which may function in the initiation of genome replication. Based on these features, it possibly functions as a control region. The AT-region of Diadegma semiclausum has the highest percentage of A+T content (96.4%) and Orussus occidentalis has the lowest A+T content (79.7%) among Hymenopteran sequenced to date ( Table 6).

Phylogenetic analyses
Many systematic and population genetic studies have been based on genetic markers in the mt genomes at both the nucleotide and amino acid levels [53,54]. So far, three mt genes (rrnL, cox2, nad2) and two nuclear gene (EF1-a intron and itpr) have been used for phylogenetic study of members within the genus Apis, yielding discrepant results [3,4]. Usage of complete mt sequences for phylogenetic analyses would be more reliable, but so far only two complete mt genomes (A. cerana and A. mellifera) are available within the genus Apis. To better understand the evolution of genome-level features in A. cerana, phylogenetic relationships among representative members of the Hymenoptera were inferred from concatenated amino acid sequences of the 13 mt protein-coding genes. With addition of the A. cerana mt genome sequences, there are now 28 Hymenopteran sequences available for analysis, and only 16 of these have the complete sequences of the 13 mt protein-coding genes. The final alignment of the amino acid sequences of 13 protein genes for the 16 taxa were subjected to the ML, MP and NJ analyses. The results revealed that A. cerana and A. mellifera were closely related with high statistical support, indicating that A. cerana and A. mellifera are sister relationship (Figure 3), supporting the taxonomic classification by the analyses of morphological and molecular data. Three major clades were recovered within Apidae: Apini (A. cerana and A. mellifera), Meliponini (M. bicolor), and Bombini (B. ignitus+B. hypocrita sapporoensis) that were strongly supported in all (ML, MP and NJ) analyses. These results were congruent with previous studies [33,35]. Our data provide a robust support for the relationship among Hymenopteran.
In conclusion, the mt genome of A. cerana exhibits almost the same characteristics with that of A. mellifera. Phylogenetic analyses indicated that A. cerana and A. mellifera are two separate species, supporting the taxonomic classification by genetic relatedness and phylogeny. The analyses of A. cerana mt genome have added to our knowledge on mt genomes of Hymenoptera, and provided additional genetic markers for studying population genetics, systematics and phylogeographics of honeybees.

Sample origin and DNA extraction
One A. cerana female adult was collected from the Mengla county, Yunnan province, southeastern China, and was preserved in 75% ethanol and stored at 4uC until used for DNA extraction. Total genomic DNA was extracted from its thoracic muscle tissue by treatment with sodium dodecyl sulphate/proteinase K (Merck), followed by purification using Wizard TM DNA Clean-Up System (Promega) and then eluted into 60 ml H 2 O according to the manufacturer's recommendations. DNA samples were stored at 220uC until further use. Amplification and sequencing of partial cox1, nad4 and rrnL Partial fragments of the cox1, nad4 and rrnL mtDNA genes were amplified using three sets of primers (Table 7), designed by the authors according to homological mtDNA fragments of other Hymenopteran deposited in GenBank. PCR reactions were carried out in a 25 ml reaction volume consisting of 14.25 ml sterile deionized water, 2.5 ml 106PCR Buffer (Mg 2+ free), 4.0 ml MgCl 2 (25 mM), 2.0 ml dNTPs (2.5 mM each), 0.25 ml each primer (50 pmol/ml), 0.25 ml ExTaq DNA polymerase (5 U/ml, Takara) and 1.5 ml DNA template (40 ng/ml) with the following conditions: after an initial denaturation at 94uC for 5 min, then 94uC for 30 s (denaturation), 45-50uC for 30 s (annealing), 72uC for 30 s (extension) for 35 cycles, followed by 72uC for 5 min (final extension). Each amplicon (5 mL) was examined by agarose gel electrophoresis to validate amplification efficiency. Then, the partial cox1, nad4 and rrnL amplicons were sent to TaKaRa Company (Dalian, China) for sequencing from both directions by using primers used in the PCR amplifications.

Long-PCR amplification and sequencing
The nucleotide sequences obtained from these three genes were then used to design A. cerana-specific primer sets for long PCR reactions. Three overlapping long PCR fragments covering the entire mt genome of A. cerana were obtained using the long PCR primers sets ( Table 7). The Long-PCR reaction volume amounted 50 ml containing 28.5 ml sterile deionized water, 5.0 ml 106LA PCR Buffer(Mg 2+ free), 5.0 ml MgCl 2 (25 mM), 8.0 ml dNTPs (2.5 mM each), 0.5 ml each primer (50 pmol/ml), 0.5 ml LA Taq DNA polymerase (5 U/ml, Takara) and 2 ml DNA template (40 ng/ml). Long-PCR cycling conditions used were initial denaturation step at 92uC for 2 min, followed by 30 cycles of  denaturation at 92uC for 10 s, primer annealing at 50uC for 30 s and elongation at 60uC for 10 min during the first 10 cycles, and then an additional 10 s per cycle during the last 20 cycles. The final elongation step was continued at 60uC for 10 min. All amplifications were done on a T-Gradient thermocycler (Biometra, Germany). The three long-PCR fragments were sequenced using a primer-walking strategy. All sequencing was performed with ABI PRISM Big Dye terminator chemistry and analyzed on ABI 3730 automated sequencers (Applied Biosystems, USA).

Gene annotation and sequence analysis
Sequences were assembled manually and aligned against the complete mt genome sequence of A. mellifera using the computer program Clustal61.83 [55] to identify gene boundaries. The open-reading frames and codon usage profiles of protein-coding genes were analyzed using the program MacVector 4.1.4 (Kodak, version4.0). Translation initiation and translation termination codons were identified based on comparison with the mt genome of A. mellifera. The amino acid sequences inferred for the mt genes of the A. cerana were aligned with those of A. mellifera by using Clustal61.83. Based on pairwise alignments, amino acid identity (%) was calculated for homologous genes. Codon usage was examined based on the relationships between the nucleotide composition of codon families and amino acid occurrence, where the genetic codons are partitioned into AT rich codons, GC-rich codons and unbiased codons. For analyzing ribosomal RNA genes, putative secondary structures of 22 tRNA genes were identified using tRNAscan-SE [56], of the 22 tRNA genes, 19 were identified using tRNAscan-SE, the other 3 tRNA genes were found by eye inspection, and rRNA genes were identified by comparison with the mt genome of A. mellifera.

Phylogenetic analyses
Phylogenetic relationship among the Hymenoptera were performed using the 14 Hymenoptera species (Table 1) as ingroup, plus the mt DNA sequence of A. cerana obtained in the present study, using one Reduviidae species (Triatoma dimidiata, GenBank accession number AF301594) as the outgroup, based on amino acid sequences of 13 protein-coding genes. Amino acid sequences for each gene were individually aligned using Clus-tal61.83 [55] under default setting, and then concatenated into single alignments for phylogenetic analyses. Three methods, namely neighbor joining (NJ), maximum likelihood (ML) and maximum parsimony (MP), were used for phylogenetic reconstructions. Standard unweighted MP was performed using package Phylip 3.67 [57]. NJ analysis was carried out using PAUP 4.0 Beta 10 programme [58], and ML analysis was performed using PUZZLE 4.1 under the default setting [59]. The consensus tree was obtained after bootstrap analysis, with 1000 replications for NJ and MP trees, and 100 for ML tree, with values above 50% reported.