Assembly and comparative analysis of the complete mitochondrial genome of three Macadamia species (M. integrifolia, M. ternifolia and M. tetraphylla)

Background Macadamia is a true dicotyledonous plant that thrives in a mild, humid, low wind environment. It is cultivated and traded internationally due to its high-quality nuts thus, has significant development prospects and scientific research value. However, information on the genetic resources of Macadamia spp. remains scanty. Results The mitochondria (mt) genomes of three economically important Macadamia species, Macadamia integrifolia, M. ternifolia and M. tetraphylla, were assembled through the Illumina sequencing platform. The results showed that each species has 71 genes, including 42 protein-coding genes, 26 tRNAs, and 3 rRNAs. Repeated sequence analysis, RNA editing site prediction, and analysis of genes migrating from chloroplast (cp) to mt were performed in the mt genomes of the three Macadamia species. Phylogenetic analysis based on the mt genome of the three Macadamia species and 35 other species was conducted to reveal the evolution and taxonomic status of Macadamia. Furthermore, the characteristics of the plant mt genome, including genome size and GC content, were studied through comparison with 36 other plant species. The final non-synonymous (Ka) and synonymous (Ks) substitution analysis showed that most of the protein-coding genes in the mt genome underwent negative selections, indicating their importance in the mt genome. Conclusion The findings of this study provide a better understanding of the Macadamia genome and will inform future research on the genus.


Introduction
Macadamia spp belongs in the family Proteaceae, class Magnoliopsida, and order Proteales. The Proteaceae family has five subfamilies, 80 genera, and over 1600 species [1]. Jansenii and M. ternifolia produce non-edible nuts containing high levels of bitter cyanide glycosides, thus has not been used to guide the breeding [5][6][7]. Macadamia seeds are sweet with high nutritional and medicinal value. Therefore, they have enjoyed the reputation of "King of Thousand Fruits". They are also used in international transactions due to their high economic value [8].
Mitochondria (mt) are organelles that primarily convert biomass energy in living cells into chemical energy to fuel biological activities. Besides, they participate in other biological processes, including cell differentiation, cell apoptosis, cell growth, and cell division [9][10][11][12]. Therefore, mt are the core hub of life activities within individual cells and the entire living body [13]. Both plastids and mt harbor genetic information and are thought to have evolved through endosymbiosis of freely living bacteria [14][15][16]. In most seed plants, nuclear genetic information is inherited from both parents, while cp and mt are derived from maternal genes [17]. Thus, we can temporarily ignore the influence of paternal genes, thereby reducing the difficulty of genetic research and promoting the research of genetic mechanisms [18].
Studies have shown that the size of the mt genome varies significantly between different species. For example, plants have a larger mt genome than animals [19].
Furthermore, mt genome size in seed plants can vary by at least one order of magnitude ranging from ~ 222 Kbp in Brassica napus [20] and ~ 316 Kbp in Allium cepa [21] to ~ 3.9 Mbp in Amborella trichopoda [22] and a striking ~ 11.3 Mbp in Silene conica [23]. This phenomenon may be caused by the abundance of non-coding regions and repeated elements in the plant mt genome [24]. DNA recombination between homologous sequences produces small circular sub-genomic DNA. The circular genomic DNA coexists with the complete "master" genome in the cell. These genomes typically have several kb repeats, leading to multiple heterogeneous forms of the genome [25][26][27][28][29][30][31]. The mutation rate of plant mt genomes is very low; however, their rearrangement rate is so high that there is almost no conservation of synteny [27; 32-34].
The development of cost-effective and more efficient DNA sequencing methods like high-throughput sequencing has accelerated mt genome sequencing. So far (until June 2021), the mt genomes of 618 green plant species have been released in the NCBI database. Long-term mutually beneficial symbiosis caused the mt to lose some of the original DNA, possibly by transfer, leaving only the DNA encoding it [35,36].
Mt DNA integrates DNA from various sources by intracellular and horizontal transfer [37]. Therefore, regardless of the length, gene sequence and content, mt genome varies remarkably among different plant species [38]. The mt genome length of the smallest terrestrial plant is about 66 Kb, and that of the largest terrestrial plant is 11.3 Mb [39,40]; the number of genes is usually between 32 and 67 [41]. In this study, the mt genomes of three Macadamia species were sequenced, assembled, and annotated.
Also, their genomic and structural features were analyzed and compared with other angiosperms (and gymnosperms). This study improves our understanding of Macadamia genetics and provides crucial data to inform future research on the evolution of mt genomes of land plants.

Genomic features of the mt genomes of the three Macadamia species
The mt genomes of M. integrifolia, M. ternifolia and M. tetraphylla have a typical terrestrial plant genome ring structure (Fig 1). A total of 71 unique genes were identified in the mt genomes of the three Macadamia species, including 42 protein-coding, 26 tRNA, and 3 rRNA genes (Table 1). In addition, two copies of rRNA26, ccmB, rps19, trnN-GTT, and trnH-GTG, and seven copies of trnM-CAT were identified. It has been established that the mt genomes of land plants contain a variable number of introns [42]. In the present study, the three mt genomes had ten genes with introns, length ranging from 13bp (rps3) to 31841bp (cox2) where ccmFC, rpl2, rps3, and rps10 had two introns, cox2 had three, nad1, nad4, and nad5 had four and nad2 and nad7 had five introns. Besides, in all protein-coding genes, except atp6, cox1, nad1, nad4L, rps4, and rps10, which had ACG as the start codon, all the others had ATG as their start codon. In addition, the stop codons in all the protein-coding genes were: TAA 45.2%, TGA 28.6%, TAG 14.3%, CAA 9.5%, and CGA 2.4%.
The size and GC content of mt genome are the primary characteristics. Here, we compared the size and GC content of mt genomes between three Macadamia species and 36 other green plants, including four phycophyta, three bryophytes, two gymnosperms, four monocots, and 23 dicots (Table S1). The size of the mt genomes ranged from 22,897 bp (Chlamydomonas moewusii) to 2,709,526 bp (Cucumis melo) (Fig 2). Compared to phycophyta and bryophytes, the mt genomes of the three Macadamia species are bigger. The GC content in the mt genomes was also highly variable, ranging from 32.24% in Sphagnum palustre to 50.36% in Ginkgo biloba.
Overall, the GC content of angiosperm mt genome (including monocots and dicots) is higher than that in bryophytes but less than in gymnosperms, implying that the GC contents fluctuated following the angiosperms divergence from bryophytes and gymnosperms. Interestingly, the GC content significantly fluctuated in algae and was mostly conserved in angiosperms, although their genome sizes vary significantly.

Repeat sequences anaysis
Microsatellites or simple sequence repetitions (SSRs) are DNA fragments composed of short sequence repeating units of 1-6 base pairs [43]. Their unique value is created by their polymorphism, relative abundance, codominant inheritance, large-scale genome coverage, and PCR detection simplicity [44]. Based on the SSRs analysis, we identified 87 SSRs with SSRs monomers and dimers accounting for 70.11% of the total SSRs. Adenine (A) was the most repeated monomer with 19 (38%) out of the 50 identified monomer SSRs. The AT repeat was the most common dimer SSR, accounting for 66.67% of all the identified dimers. However, one hexamer [ATTAGG(X3)] was present in the mt genomes of three Macadamia species.
Among the reference mt genome, only Nelumbo nucifera has been published in the NCBI database. N. nucifera belongs to the family Nelumbonaceae and the same order (Proteales) with Macadamia. Therefore, the mt genome of N. nucifera was used as a reference for comparative analysis in the present study. The monomers in N. nucifera were lower than in the three Macadamia species, while pentamers and hexamers in N. nucifera were significantly higher than in the three Macadamia species (Fig 3a). Moreover, the SSRs in mt genomes of M. integrifolia, M. ternifolia, M. tetraphylla, and N. nucifera were mainly single-nucleotide A/T motifs, and dimer AT/TA motifs. Within the Macadamia genus, the mt SSRs among the different species are highly similar (Fig 3b). However, compared with N. nucifera, there were both differences and similarities. For example, the single nucleotide A/T in the three Macadamia species has 23 unit repeats, while N. nucifera has only nine. Nevertheless, their single-nucleotide C/G numbers were the same (two-unit repeats) (Fig 3b). In addition, the AG/CT and AT/AT motifs unit repetitions are the same, although N. nucifera also has an AC/GT motif, lacking in the three Macadamia species.
Interestingly, the pentanucleotide AATGT/ACATT, ACTAG/AGTCT, and ACATT/AGTAT also had the same number of repetitions in the three Macadamia species and N. nucifera. Overall, the greater the nucleotide motif, the greater the difference between the three Macadamia species and N. nucifera.
Core repeating units ranging from 1 to 200 bases (tandem repeats) are widely present in eukaryotes and some prokaryotes genomes [45]. In the present study, 25, 21, and 20 tandem repeats (10-33 bp) were identified in the M. integrifolia, M. ternifolia, and M. tetraphylla with a match greater than 95% (Table S2,

The prediction of RNA editing
RNA editing is a post-transcriptional process entailing the addition, deletion, or conversion of bases in the coding region of a transcribed RNA. The conversion of cytosine to uridine is common in cp and mt genomes of plants [46][47][48][49][50], which improves protein preservation in plants. The accurate detection of ribonucleic acid editing is inseparable from the proteomics data. However, Mower's software PREP computationally predicts the RNA edit site [51].

DNA migration from cp to mt
The cp-like sequences in the mt genome were detected by comparing against the complete cp genome sequence of M. integrifolia obtained from the NCBI database ( Fig 5). We detected 28 fragments in the mt genome of M. integrifolia, ranging in size from 32 bp to 5210 bp. The cp-like sequence had 36,902 bp, accounting for 5.4% of the mt genome. Five complete annotated tRNA genes were detected, namely trnH-GTG, trnM-CAT, trnW-CCA, trnD-GTC, and trnN-GTT, with some fragments of rrn18 genes. The findings also revealed that 28 insertion regions accounted for 23.2% of the cp genome, including seven complete protein-coding genes (petL, petG, ndhE, rps15, rpl23(X2), rpl2) and eight complete tRNA genes (trnH-GUG, trnD-GUC, trnM-CAU, trnW-CCA, trnP-UGG, trnP-GGG, trnI-CAU, trnN-GUU). Besides, several protein-coding genes were also identified, including psbA, rpoB, psbD, psbC, ndhC, rpl2, ycf2(X2), ndhB, rps7(X2), ndhD, ndhB and ycf1, and some tRNA genes (trnI-GAU, trnA-UGC, trnN-GUU ) , which migrated from the cp genome into the mt genome. But, most of these genes lost their integrity during the evolution process, and only their partial sequences were found in the mt genome. Furthermore, most cp-like sequences were located in the spacer region of the mt genome. These findings are consistent with previous research, where during evolution, tRNA genes were more conserved than the protein-coding genes and rRNA genes since they play an important role in mt genome [52].

Phylogenetic analysis within higher plant mt genomes
Phylogenetic analysis was performed to understand the evolution of the three Macadamia species compared to 29 dicots, four monocots, and two gymnosperms (out-groups). The phylogenetic tree was constructed based on the comparisons in the data matrix of 23 conserved protein-coding genes (Fig 6). The findings revealed that the phylogenetic tree strongly supports the separation of Proteales from rosids and asterids, the separation of eudicots from monocots and angiosperms from gymnosperms. The evolutionary relationships among all the taxa separated into 20 families (Leguminosae, Cucurbitaceae, Apiaceae, Apocynaceae, Solanaceae, Rosaceae, Caricaceae, Brassicaceae, Salicaceae, Bataceae, Malvaceae, Vitaceae, Lamiaceae, Nelumbonaceae, Proteaceae, Butomaceae, Arecaceae, Poaceae, Cycadaceae, and Ginkgoaceae) were efficiently deduced in the phylogenetic tree ( Fig   6).

The substitution rates of protein-coding genes
In genetics, non-synonymous (Ka) and synonymous (Ks) substitution rates help understand the evolutionary dynamics of protein-coding genes among similar species since the Ka to Ks ratio indicates gene selection [53,54]. In the present study, N. nucifera was used as a reference species to calculate the Ka/Ks ratio of 40 protein-coding genes present in the mt genome of three Macadamia species. The Ks of atp9 and rps14, and the Ka of rps12 was 0. Besides, in most protein-coding genes, the Ka/Ks ratio was significantly less than 1 (Fig 7). However, the Ka/Ks ratio of nad4, rpl2, rps3, rps4, and rps10 was greater than 1, with the rps3 ratio being 2.34, implying that these genes might have undergone surgery positive selection following Macadamia and N. nucifera differentiation from their last common ancestor [55].
Besides, the ATP synthase, Cytochrome C biogenesis, Ubichinol Cytochrome C reductase, and Maturases of Ka/Ks ratios were below 1, implying that the negative selection acted on these genes (Table 2). Therefore, these genes may be highly conserved during the evolution of higher plants.

CONCLUSIONS
The complete mt genomes of M. integrifolia, M. ternifolia and M. tetraphylla share many common features with angiosperm mt genomes. In this study, we found that the mt genomes of the three Macadamia species were circular like most mt genomes.

Genome sequencing
The three Macadamia species examined in this study were collected from Xishuangbanna. Agarose electrophoresis was used to assess the quality of the extracted DNA samples. Meanwhile, a Nanodrop spectrophotometer was used to detect the concentration and purity of the DNA samples. The qualified DNA samples were used for Illumian DNA library construction, according to the standard procedure.
Subsequently, a paired-end sequencing library with an insert size of 350 bp was constructed. The Illumina Hiseq 4000 high-throughput sequencing platform was used for sequencing. The sequencing strategy involved PE150 (Pair-End 150) and the sequencing data volume of not less than 1 Gb. Illumina high-throughput sequencing results initially existing as original image data files were converted into Raw Reads.
CASAVA software was used for Base Calling.

Genome assembly and annotation
SPAdes v.3.5.0 software was used to splice and assemble mt genome sequences [56].
DOGMA [57] and ORF Finder were used to annotate the mt genome. The Blastn and Blastp method was used to compare mt gene-encoding protein and rRNAs among related species. The accuracy of the results was verified, and corrections were made accordingly. tRNA scan-SE2.0 [58] and ARWEN [59] were used to annotate tRNA.
The tRNAs with unreasonable length and incomplete structure were eliminated.

Analysis of repeat structure and sequence
Microsatellites within the mt genomes of the three Macadamia species were identified using MISA [60]. The minimum number of repeats for the motif length of 1, 2, 3, 4, 5, and 6 were 10, 6, 5, 4, 3, and 3, respectively, were identified in this analysis. The tandem repeats were detected using Tandem Repeats Finder v4.09 software [61] with default parameters.

DNA transformation from cp to mt and RNA editing analyses
The cp genome of M. integrifolia (NC_025288) was downloaded from the NCBI database. Chloroplast-like sequences were identified and the genome was mapped using TBtools [62]. The online program Predictive RNA Editor for Plants (PREP) suite [63] was adopted to identify the possible RNA editing sites in the protein-coding genes of the three Macadamia species. The cutoff value was set as 0.2 to ensure accurate prediction. The protein-coding genes from other plant mt genomes were used as references to reveal the RNA editing sites in the mt genomes of the three Macadamia species.

Phylogenetic tree construction and Ka/Ks analysis
The genome sequences of the three Macadamia species were compared with those of 35 other plant species to further verify their phylogenetic position. Notably, the complete mt genome sequences of these species were available in the NCBI database.