The Complete Mitochondrial Genome of Bean Goose (Anser fabalis) and Implications for Anseriformes Taxonomy

Mitochondrial DNA plays an important role in living organisms, and has been used as a powerful molecular marker in a variety of evolutionary studies. In this study, we determined the complete mtDNA of Bean goose (Anser fabalis), which is 16,688 bp long and contains 13 protein-coding genes, 2 rRNAs, 22 tRNAs and a control region. The arrangement is similar to that of typical Anseriform species. All protein-coding genes, except for Cyt b, ND5, COI, and COII, start with an ATG codon. The ATG start codon is also generally observed in the 12 other Anseriform species, including 2 Anser species, with sequenced mitochondrial genomes. TAA is the most frequent stop codon, one of three–TAA, TAG, and T- –commonly observed in Anseriformes. All tRNAs could be folded into canonical cloverleaf secondary structures except for tRNASer(AGY) and tRNALeu(CUN), which are missing the dihydrouridine (DHU) arm. The control region of Bean goose mtDNA, with some conserved sequence boxes, such as F, E, D, and C, identified in its central domain. Phylogenetic analysis of complete mtDNA data for 13 Anseriform species supports the classification of them into four major branches: Anatinae, Anserinae, Dendrocygninae and Anseranatidae. Phylogenetic analyses were also conducted on 36 Anseriform birds using combined Cyt b, ND2, and COI sequences. The results clearly support the genus Somateria as an independent lineage classified in its own tribe, the Somaterini. Recovered topologies from both complete mtDNA and combined DNA sequences strongly indicate that Dendrocygninae is an independent subfamily within the family Anatidae and Anseranatidae represents an independent family. Based on the results of this study, we conclude that combining ND2, Cyt b, and COI sequence data is a workable solution at present for resolving phylogenetic relationships among Anseriform species in the absence of sufficient complete mtDNA data.


Introduction
Animal mitochondrial DNA (mtDNA), typically a small (15-20 kb) double-stranded maternally-inherited circular genome, plays an important role in processes associated with metabolism, programmed cell death, illness, and aging [1]. It possesses a remarkably conserved set of 37 genes: 13 protein-coding genes (ND1-6 and ND4L [for NADH dehydrogenase subunits 1-6 and 4L], ATP6 and ATP8 [for ATPase subunits 6 and 8], COI-III [for cytochrome oxidase subunit I-III], and Cyt b [for cytochrome b]), two ribosomal RNA genes (12S rRNA and 16S rRNA), and 22 tRNA genes. In addition, it contains at least one variable length sequence known as the control region (D-loop) [1,2,3]. Compared with nuclear genes, mtDNA is conservative in gene content, abundant in cells, and intronless, but contains much important phylogenetic information [1,4,5,6]. Because of these advantages, mtDNA is valuable for studies of genetic structure, biological identification, taxonomy, and phylogeny [7,8,9]. Some mtDNA regions, such as Cyt b, ND2, COI, and the control region, are of special interest, having been widely used to resolve taxonomy of controversial organismal groups [10,11,12]. Because they undergo moderate rates of evolution, the protein-coding genes Cyt b, ND2, and COI have been particularly useful [13,14,15]. In recent years, combinations of Cyt b, ND2, and COI gene sequences have been applied to phylogenetic problems at a variety of levels, ranging from related species to genera and families; they have been especially valuable for clarifying phylogenetic relationships within some controversial animal groups, especially that of some birds [10,14,15,16,17]. The mitochondrial control region, which contains initiation sites for transcription and replication, is the most variable mtDNA segment and has the fastest evolutionary rate. Because some conserved domains in the control region exhibit rather homogeneous evolutionary rates, this region can be used to determine intra-and interspecific relationships [1,2,3,18].
In recent years, as a result of revolutionary advances in molecular techniques associated with PCR, sequencing, and data analysis, increasing attention has been focused on the complete mitochondrial genome [19,20]. The complete genome, or a subset thereof, has been widely used as an informative molecular marker in a variety of evolutionary studies, including those involving phylogenetics and population genetics [1,21]. Over the past 10 years, comparative analysis of mtDNA sequences and gene order has been employed as a powerful tool for resolving ancient phylogenetic relationships and elucidating population genetic structure [3,22.23]. Because the quantity of mtDNA sequence data used in studies of phylogenetic relationships at species and population levels is continually increasing [22], the mitochondrial genome has become a highly useful molecular marker for the reconstruction of phylogenetic relationships at different animal taxonomic levels. Analyses of complete mitochondrial genomes not only provide sequence information for phylogenetic studies, but also further elucidation of its structure and function [21,24,25].
Anseriformes is a large and complex group comprising approximately 150 species and exhibiting very high worldwide diversity [10,26]. Although one of the best-studied groups of birds, taxonomic statuses and phylogenetic placements within the order remain in dispute [10,16,17,27,28,29]. Various morphological characteristics and molecular data derived from nuclear and mitochondrial genes have been employed to infer phylogenetic relationships within Anseriformes. Many authors divide the order into three general families: Anhimidae (screamers), Anatidae (ducks, geese, swans), and Anseranatidae (magpie geese) [16,30]. Anatidae, the largest and most important group in Anseriformes, is traditionally divided into subfamilies Anatinae and Anserinae [10,16]. Anatinae consists of tribes Mergini, Anatini, Tadornini, and Aythyini, while Anserinae includes geese, swans, and Dendrocygna [10]. Taxonomic debates primarily focus on: (i) classification status of Anseranatidae, (ii) subfamilies of Anatidae, and (iii) phylogenetic relationships among these subfamilies. A major source of conflict is centered around whether Anseranatidae should be considered as an independent family, or rather as a subfamily of the family Anatidae [10,16,17,27,28]. These debates are still ongoing, but mitochondrial sequences are beginning to shed light on the phylogeny of Anseriformes [11,29,31,32,33]. Recent evidence has indicated that complete mtDNA may be able to resolve controversial relationships among Anseriform birds, and a few studies using this approach have been reported [11,32,33]. Complete mtDNA sequences of Anseriform birds are beneficial for correctly solving classification status problems. As the quantity of Anseriform mtDNA sequence data is increased, phylogenetic relationships should be gradually clarified. The main limitation to taxonomic resolution is thus the availability of mitochondrial genome data. There are only 12 Anseriform species for which complete mtDNA has been sequenced: Anas platyrhynchos (NC_009684), Anas formosa (NC_015482), Anser anser (NC_011196), Anser albifrons (NC_004539), Anseranas semipalmata (NC_005933), Aythya americana (NC_000877), Branta canadensis (NC_007011), Cairina moschata (NC_010965), Cygnus columbianus (NC_007691), Cygnus atratus (NC_012843), Dendrocygna javanica (NC_012844), and Mergus squamatus (NC_016723). Consequently, mtDNA genome sequences of additional Anseriform species should provide further insights into their diversity and evolution.
The Bean goose (Anser fabalis), a member of the family Anatidae (Anseriformes), is an important wetland indicator species locally distributed from Eurasia to North Africa. With the occurrence of rapid economic development during the last two decades, this population is suspected to be undergoing a continuous and rapid decline as a result of habitat loss, illegal hunting, and human disturbances [34]. Because systematic studies are essential for its population conservation, acquisition of the complete mtDNA sequence of this species would be of great value. In this study, we sequenced the Bean goose complete mtDNA genome and compared its gene structure with that of other tribal representa-tives to gain insights into mtDNA evolution. To contribute to an understanding of Anseriform evolutionary history, we also performed phylogenetic analyses based on complete mtDNA and combined Cyt b, ND2, and COI sequence data published in GenBank.

Protein-coding Genes
The total length of the 13 protein-coding genes is 11,296 bp, which represents 65.38% of the entire mitochondrial genome. The longest gene is ND5 (1,818 bp), located between tRNA Leu (CUN) and Cyt b, and the shortest is ATP8 (168 bp), which is between tRNA Lys and ATP6. All protein-coding genes start with an ATG codon, except for ND5, COI, and COII, which begin with GTG. TAA, found in 10 genes, is the most frequent stop codon, although ND5 ends with AGA, and ND1 and COI end with AGG.

Ribosomal RNA, Transfer RNA, and Non-coding Regions
Bean goose mtDNA contains a small subunit (12S rRNA) and a large subunit (16S rRNA) of rRNA. They are located between tRNA Phe and tRNA Leu , and are separated by tRNA Val . The 12S rRNA, located between nucleotide positions 68 and 1,051, is 983 bp long; the 16S rRNA is located between nucleotide positions 1,123 and 2,793, and has a length of 1,671 bp.
The mtDNA genome contains 22 tRNAs, ranging in size from 61 to 74 bp; the longest tRNA is tRNA Leu (74 bp), and the shortest is tRNA Val (61 bp). Except for tRNA Ser (AGY) and tRNA Leu (CUN), which lacks a dihydroxyuridine (DHU) arm, they could be folded into the typical cloverleaf structure.
The non-coding regions include a control region (D-loop) and a few intergenic spacers. The D-loop region is located between tRNA Glu and tRNA Phe , and is 1,182 bp long. Intergenic spacers comprise 66 bp spread over 14 regions (Table 1). Areas of gene overlap occupy a total of 89 bp, with the longest region of overlap a 74-bp long sequence located between tRNA Lus and tRNA Val (Table 1).

Phylogenetic Reconstructions
Trees derived from phylogenetic analysis of complete mtDNA for 13 Anseriform species and of the three genes for 36 species have similar topologies (Figs. 2 and 3). The analyzed species are divided into four major clades: Anatinae makes up the first lineage, which is sister to the second group, Anaserinae; Dendrocygninae forms the third group and is sister to Anatinae and Anaserinae. The lineage consisting of these three groups in turn is sister to the fourth clade, Anseranatidae. The outgroup, Gallus gallus, is located at the base of the tree. Trees generated from maximum likelihood (ML) and Bayesian inference (BI) analyses of complete mtDNA for 13 Anseriform species have highly similar topologies, with only slight differences in bootstrap support or posterior probability values (Fig. 2). Four major branches are represented: Anatinae, Anserinae, Dendrocygna javanica, and Anseranas semipalmata. The first lineage, subfamily Anatinae, includes species of tribes Mergini (Mergus squamatus), Anatini (Anas formosa, Anas platyrhynchos), Aythyini (Aythya americana), and Tadonini (Cairina moschata). The second lineage includes tribes Anserini (Branta canadensis, Anser albifrons, Anser anser, and Anser fabalis) and Cygnini (Cygnus atratus and Cygnus columbianus), forming the subfamily Anserinae. Anatinae and Anserinae are sister groups, which then group together with Dendrocygna javanica. Anseranas semipalmata is the sole member of the fourth group.
Topologies recovered from maximum likelihood (ML) and Bayesian inference (BI) analyses of the three genes from 36 Anseriform species are highly congruent, with strong bootstrap support observed for most nodes (Fig. 3). The 36 species are divided into four major lineages: Anatinae, Anserinae, Dendrocygna javanica, and Anseranas semipalmata. The first lineage, subfamily Anatinae, includes species of tribes Mergini, Somaterini, Anatini, Aythyini, and Tadonini. This lineage is a sister group to the second lineage, subfamily Anserinae, which comprises tribes Anserini and Cygnini. These two lineages form a clade that is in a sister group relationship with Dendrocygna javanica; this clade in turn is sister to Anseranas semipalmata.

Mitochondrial Genome Annotation and Features
In many respects, the newly-sequenced mitochondrial genome is nearly identical to those reported for other Anseriform birds. The compact arrangement of Bean goose mtDNA is similar to typical avian mtDNA [34,35], which has no introns, no long intergenic spacers, and only a few overlapping sequences. Its length is similar to that of other birds, and well within the range of avian mitochondrial genomes [5,32,36]. The overall base composition is: A, 30.06%; C, 31.84%; G, 15.36%; and T, 22.74%. A+T content (52.80%) is higher than C+G content (47.20%), similar to other Anseriformes (51.6-55.7%) [37,38]. Compared with mitochondrial genomes of other Anseriform species, the A, T, and A+T compositions of Bean goose mtDNA are similar, and it shares with the other genomes a strong AT bias ( Table 2). Guanine (G) is the rarest nucleotide; the percentage of the other three bases are roughly equal to each other (Table 2), similar to other vertebrate animals [13]. GC and AT skews are a measure of compositional asymmetry; in amniote mtDNA, GC-skew values are all negative (G,C), while AT-skew is positive (A.T) [39]. In Bean goose mtDNA, GC-skew (20.357) and AT-skew (0.143) values are in accord with this principle.

Comparison of Three Anser Mitochondrial Genomes
Mitochondrial genomes sequenced from three Anser species-Anser anser, Anser albifrons, and Anser fabalis-are similar in length; the longest is that of Anser anser (16,738 bp) and the shortest is in Bean goose (16,688 bp). Compared with the other two species, some overlapping sequences are found in Bean goose mtDNA-in particular, in a 74-bp region between rRNA-Lus and tRNA leu . As in most Anseriformes, all genes are encoded on the same strand, and there are no missing or duplicated genes. Among the three mtDNAs, homologous regions comprise 16,239 bp, representing 97.12% of the entire genome. As seen in the three Anser genomes, these regions generally have the highest transition/ transversion ratios and are more significant in closely-related species [14,16]. All three Anser mitochondrial genomes are quite

Comparison of Protein-coding Genes
When we compared the 13 Anseriform species with respect to predicted initiation and termination codons of 13 mitochondrial protein-coding genes (Table 3), we found that most protein-coding genes used ATG as start codons. Proteins in only a few species start with GTG, CTG, or ATA. Stop codons are also similar across species, with TAA, TAG, and T-occurring most frequently. Among the 13 protein-coding genes, specific examples include the following: the COI initiation codon is GTG and the termination codon is AGG in all 13 species; ND6 starts with ATG and ends with TAG, except in Anser fabalis, where it ends with TAA; ND1 starts with ATG and ends with AGG, except in Anser albifrons, where the stop codon is TAA; and ND2 starts with ATG and ends with TAG, except in Anseranas semipalmata and Cygnus atratus, for which the stop codon is TAA.

Control Region Comparisons
The control region (D-Loop), responsible for transcription and replication, is the most variable mtDNA region, sometimes containing variable numbers and lengths of tandem repeats [22]. Three internal portions have been recognized in the control region: the 59-peripheral domain, the central conserved domain, and the 39-peripheral domain [36,39,45]; in Bean goose, these domains span positions 15,507-15,877, 15,878-16,335, and 16,336-16,688, respectively. Within the control region, the central conserved domain exhibits a rather homogeneous evolutionary rate among species [2,18]. It contains conserved sequence blocks (CSB) B, D and F. In most avian mtDNA, C and E boxes are also found in this area [35,40,41]. The central domain is rich in C and T [16,17]. In this region in Bean goose, C+T content (58.95%) is also higher than A+G content (41.05%). Complete sequences of Anseriform mitochondrial control regions have an average size of 1,100 bp, and range in size from 970 bp in the Maned goose (Chenonetta jubata) to 1,230 bp in the Crested screamer (Chauna torquata) [16]. Similar to other birds, there is only one control region, which is 1,182 bp long, in Bean goose mtDNA. Its nucleotide frequencies are not significantly different from other birds, and it does not contain any repeats. Comparative analyses of the structure and organization of control regions can help elucidate relationships in the Anseriformes [41]. Based on consensus alignment with central conserved domains of control region sequences of other Anseriform species, some conserved sequence boxes, such as F, E, D, and C, were identified (Fig. 4). These blocks are similar to those of other vertebrates [39,42]. The average genetic distance (0.318) between control regions of Anseranas semipalmata and other studied species is higher than average (0.277). Anseranas semipalmata thus has the most divergent control region of the studied species (Fig. 4), suggesting that its Dloop sequence is evolving at a faster rate.

Phylogenetic Analyses
Based on the results of our phylogenetic analyses, from which similar topologies were recovered, the inclusion of Lophodytes cucullatus and Mergellus albellus within the genus Mergus of tribe Mergini is supported, as suggested by an earlier study [32]. The genus Somateria is an early-diverging lineage in Mergini that is sometimes classified as a separate tribe, Somaterini [17,43]. This view, supported by Donne-Goussé [16], is congruent with our results. Our phylogenetic trees suggest a close genetic relationship between Anatini and Aythyini, in accordance with the ideas of Gonzalez [10]. Within Anatidae-according to our study-Aythyini diverged earlier than Anatini and is monophyletic [10,16]. Taxonomy and systematic relationships within Tadornini have been extensively debated [10,30]. Tadornini monophyly is supported by phylogenetic analyses of control region data [10], but other molecular studies have shown the group to be nonmonophyletic [16]. In our study, Tadornini contains the genera Tadorna, Cairina, and Chloephaga. Cairina moschata was formerly placed into the paraphyletic ''perching duck'' assemblage, but subsequently moved to the dabbling duck subfamily (Anatinae) [16,30]. Analysis of Cyt b and ND2 mitochondrial gene sequences, Table 3. Predicted initiation and termination codons for 13 mitochondrial protein-coding genes in 13 Anseriform species.

Gene
Predicted initiation and termination   however, indicate that this species might be more closely related to the genus Aix [16,17,44,45], and suggest its placement in Cairinini rather than in Tadornini [16,45]. Relationships among these taxa are not well-supported, however, and the addition of more complete mitochondrial genome sequence data is needed. In our analyses, the Bean goose, a member of tribe Anserini, has a close genetic relationship with fellow tribal members Anser albifrons, Anser anser, and Anser cygnoides. Anserini and Cygnini are shown to be sister groups, together composing the Anserinae. In our study, their placement is similar to that in phylogenetic trees generated from control region gene data [16]. The classification status of Dendrocygna and the Anseranatidae has been a subject of debate. Some authors have placed Dendrocygna inside the subfamily Anserinae and included Anseranatidae in Anatidae [17], while other authors consider them to be two different lineages [46]. Our molecular results, both those based on complete mitochondrial genome data and those derived from combined sequences of Cyt b, ND2, and COI, suggest that Dendrocygna does not belong to Anserinae, but represents an independent subfamily. The situation is similar for Anseranas semipalmata, which appears to be an independent family. Phylogenetic analyses based on complete mtDNA or concatenated genes (ND2+Cytb+COI) both clearly strongly support Dendrocygna (Dendrocygna javanica) as an independent subfamily, Dendrocygninae, within the family Anatidae, as well as the existence of Anseranatidae (Anseranas semipalmata) as an independent family [30].
Comparison of Phylogenies Based on Complete mtDNA with those Derived from Combined ND2, Cyt b, and COI Data Mitochondrial DNA has been extensively used as a molecular marker for resolving phylogenetic relationships at different taxonomic levels in difficult animal groups [47,48]. Results are sometimes very controversial, however, because they are derived from a small set of specific gene markers [10,16,17]. Mitochondrial DNA-either single genomes or a combination of markershas been used to resolve phylogenetic relationships among Anseriform species, but the results are still considered controversial [10,16,36]. Several recent analyses have demonstrated that phylogenies based on complete mtDNA are better supported than those based on individual genes or partial mitochondrial genomes [1,10,23]. Complete mtDNA data may thus provide new insights into Anseriform higher-level systematics. Currently, however, complete mtDNA genomes of only 13 Anseriform species, including Anser fabalis from our study, have been sequenced and deposited in GenBank, limiting the utility of Anseriform mitochondrial genomes as molecular markers for resolving phylogenetic relationships. In our study, tribal-level phylogenetic relationships in Anseriformes based on complete mtDNA are similar to those derived from combined ND2, Cyt b, and COI gene sequence data, with only some minor topological differences due to the limited taxon sampling in the case of complete mtDNA (Figs. 2 and 3). We therefore conclude that in the absence of enough completely sequenced mitochondrial genomes, the use of combined ND2, Cyt b, and COI sequence data is a workable solution for reconstructing phylogenetic relationships among Anseriform species.

Specimen Collection
A Bean goose (Anser fabalis) injured by power lines was found in the wild in the Shengjin Lake National Nature Reserve, Anhui, China in October 2010. The bird was treated for its injuries but later died (The Shengjin Lake National Nature Reserve is authorized to administer medical treatment to animals by the Anhui Provincial Nature and Wildlife Conservation Station, a provincial government agency for wildlife in Anhui Province). Tissues from the dead bird were field-stored at 220uC at the Institute of Biodiversity and Wetland Ecology, Anhui University. Permission to collect the bird in the wild and use the tissue was issued by the Shengjin Lake National Nature Reserve.

DNA Extraction, PCR Amplification, and Sequencing
Whole genomic DNA of Bean goose was isolated from muscle tissue using the phenol/chloroform method. Extracted DNA was examined on a 1.0% agarose/TBE gel and stored at 220uC as a template for PCR.
Based on alignment of complete mtDNA sequences of Cygnus atratus (NC_012843), Anas platyrhynchos (NC_009684), and Aythya americana (NC_000877), we designed three primer pairs (primer sets 9, 11, and 14) using Primer 5.0. We also employed other primers developed for use with Anas platyrhynchos (Jianchang duck) [33]. These primers were used to amplify and sequence the complete mitochondrial genome of Bean goose ( Table 4). All generated sequences were less than 1200 bp each, with each segment overlapping the next by 80-100 bp.
PCR amplifications were carried out in 25-ml volumes containing 100 ng template DNA, 2.5 ml of 106 reaction Buffer, 1 ml of 25 mM MgCl 2 , 2 ml of 2 mM dNTPs, 1 ml of each 10 mM primer, 0.5 U Taq DNA polymerase (Trans Taq-T DNA Polymerase, Beijing, China), and sterile doubly-distilled water to final volume. PCR amplification conditions were as follows: denaturation for 5 min at 94uC, followed by 30 cycles of denaturation for 30 s at 94uC, annealing for 30 s at 49-55uC (depending on primer combinations), and elongation for 2 min at 72uC, and a final extension step for 10 min at 72uC. PCR products were purified using a V-gen PCR clean-up purification kit, and were then bidirectionally sequenced by Sangon Biotech (Shanghai, China).

Sequence Analysis
Sequences were checked and assembled using the programs Seqman (DNASTAR 2001), BioEdit, and Chromas 2.22, and then adjusted manually. Protein-coding genes were identified by comparison with known complete mtDNA sequences of Anseriform birds using Sequin 11.0. The 22 tRNA genes were identified using the software package tRNA Scan-SE 1.21 (http://lowelab. ucsc.edu/tRNAscan-SE), with their cloverleaf secondary structures and anticodon sequences determined using DNASIS (Ver.2.5, Hitachi Software Engineering). Two rRNAs were identified by comparison with complete mtDNA sequences of other Anseriformes available in GenBank. The complete mitochondrial genome sequence of Bean goose has been deposited in GenBank under accession number NC_016922.

Phylogenetic Analyses
To study the phylogeny of Anseriformes, phylogenetic trees were reconstructed using ML and BI methods, with Gallus gallus (NC_001323) used as an outgroup. Two sets of phylogenetic trees were generated, one based on complete mtDNA for 13 Anseriform species, and the other derived from multiple sequence alignments of several mitochondrial gene regions for 36 typical Anseriform species (Table 5). Avian species generally exhibit moderate levels of sequence divergence in mitochondrial gene regions, especially in protein-coding genes such as Cyt b, ND2, and COI; a combination of these three genes is suitable for resolving phylogenetic problems at different taxonomic levels, ranging from related species to genera and families [14,15]. Because we were interested in resolving relationships at these taxonomic levels within Anseriformes, we chose Cyt b, ND2, and COI sequence data for use in the second phylogenetic analysis.
Before reconstructing phylogenetic trees, sequence alignment was carried out using the 13 complete mtDNA Anseriform genomes and the three concatenated data sets for 36 Anseriform species using ClustalX 1.8, followed by manual adjustment. ML analyses were performed in PAUP (version 4.0b8) using TBR branch swapping (10 random addition sequences) and a general time-reversible model with invariant sites and among-site variation (GTR+I+C); this model was selected as the best fit model of evolution using Modeltest (version 3.06) based on the AIC criterion. Support for internal branches in the ML tree was evaluated via the bootstrap test with 100 iterations. Bayesian inference of phylogeny was performed using the program MrBayes 3.1.2, with the same best fit substitution model used as the one selected for the ML analysis. MrBayes 3.1.2 simultaneously initiates two Markov Chain Monte Carlo (MCMC) runs to provide additional confirmation of convergence of posterior probability distributions. Analyses were run for one million generations until the average standard deviation of split frequencies was less than 0.01, which indicated that convergence was reached. Chains were sampled every 1000 generations.