The complete chloroplast genome sequencing analysis revealed an unusual IRs reduction in three species of subfamily Zygophylloideae

Tetraena mongolica, Zygophyllum xanthoxylon, and Z. fabago are three typical dryland plants with important ecological values in subfamily Zygophylloideae of Zygophyllaceae. Studies on the chloroplast genomes of them are favorable for understanding the diversity and phylogeny of Zygophyllaceae. Here, we sequenced and assembled the whole chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago, and performed comparative genomic and phylogenetic analysis. The total size, structure, gene content and orders of these three chloroplast genomes were similar, and the three chloroplast genomes exhibited a typical quadripartite structure with a large single-copy region (LSC; 79,696–80,291 bp), a small single-copy region (SSC; 16,462–17,162 bp), and two inverted repeats (IRs; 4,288–4,413 bp). A total of 107 unique genes were identified from the three chloroplast genomes, including 70 protein-coding genes, 33 tRNAs, and 4 rRNAs. Compared with other angiosperms, the three chloroplast genomes were significantly reduced in overall length due to an unusual 16–24 kb shrinkage of IR regions and loss of the 11 genes which encoded subunits of NADH dehydrogenase. Genome-wide comparisons revealed similarities and variations between the three species and others. Phylogenetic analysis based on the three chloroplast genomes supported the opinion that Zygophyllaceae belonged to Zygophyllales in Fabids, and Z. xanthoxylon and Z. fabago belonged to Zygophyllum. The genome-wide comparisons revealed the similarity and variations between the chloroplast genomes of the three Zygophylloideae species and other plant species. This study provides a valuable molecular biology evidence for further studies of phylogenetic status of Zygophyllaceae.


Chloroplast genome assembly and annotation
The raw data were processed by filtering adapter and low-quality reads using fastQC (http:// www.bioinformatics.babraham.ac.uk/projects/fastqc/), then the clean data were used for genome assembly. GetOrganelle (https://github.com/Kinggerm/GetOrganelle) [21] and SPAdes (v. 3.9.0) [22] were used to assemble the clean data using the default parameter. The chloroplast genome assembly was then identified from the assembled sequences by align to Tribulus terrestris (NC_046758), Arabidopsis and tobacco chloroplast genomes [11,23]. The online annotation tool DOGMA (http://dogma.ccbb.utexas.edu) [24] was utilized to annotate the protein-coding genes, tRNAscan-SE [25,26] software was used to annotate the tRNA gene, and RNAmmer 1.2 server (http://www.cbs.dtu.dk/services/RNAmmer/) [27] was used for rRNA identification. The annotation results were edited using Sequin, and the resulting Sqn file was submitted to the GenBank database. The GenBank accession number of the chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago were MK331720, MZ427318, and MK341052, respectively. The GenBank annotation files were submitted to Organellar Genome DRAW (OGDRAW) [28] to draw the visualized chloroplast genome map.

Loss of ndh genes verification
To verify the loss of ndh genes in the chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago, leaf DNA samples were extracted from tobacco and these three species and PCR experiments were performed on fragment psaC-ndhE-ndhG-ndhI-ndhA-ndhH-rps15 and rps7-ndhB-trnL-CAA of the tobacco chloroplast genome and the fragment psaC-rps15 and rps7-trnL-CAA of the chloroplast genomes in the three species. The PCR products were sequenced (BBI Life Sciences Co., Shanghai, China), and the sequencing results were spliced and compared with the references of the corresponding species. Details of gene fragments selected and primers in PCR were list in S1 Table. Genomic structure analysis The Perl script MISA (https://webblast.ipk-gatersleben.de/misa/) [29] was used to detect microsatellites (mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, hexanucleotides) from three chloroplast genomes of Zygophyllaceae plants with the following thresholds: 10 repeat units of mononucleotide SSR, 6 repeat units of dinucleotide SSR, 5 repeat units of trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide SSR. The online software REPuter (https://bibiserv.cebitec.uni-bielefeld.de/reputer) (University of Bielefeld, Bielefeld, Germany) [30] was utilized to predict the location and size of the repeat sequences, with the parameter set to spread the repeat copy at a percentage of at least 90% similarity, the minimum repeat size parameter was set as 30 bp.

Codon usage analysis
The distribution of codon usage was analyzed using the software CodonW (University of Texas, Houston, TX USA) [34] with the Relative synonymous codon usage (RSCU) value. RSCU value is an efficient index reflecting non-uniform usage of synonymous codons in a given coding sequence. In general, the RSCU value without any codon usage bias equals 1.00, and a RSCU below 1.00 indicates the relative probability of codon utilization is lower than expectation, just as the codon utilization frequency is higher than expectation while the RSCU may be above 1.00.

Comparative genomics analysis
The comparison of gene order between chloroplast genomes of T. mongolica (MK331720), Z.

Genome content and organizations
Approximately 3 G, 3 G, and 7.1 G of 150 bp pair-end clean reads for T. mongolica, Z. xanthoxylon, and Z. fabago, respectively, were got from the Illumina sequencing, while the reads were assembled using GetOrganelle and SPAdes (Fig 1). All the three chloroplast genomes encode 107 unique genes, including 70 protein-coding genes, 4 rRNA genes, and 33 tRNA genes (Tables 1 and 2). It is noteworthy that the rRNA genes located in IRs region in most higher plants present in the SSC region of the three Zygophyllaceae plants, and subsequently the copy number of rRNA genes change from 2 to 1. We compared the three Zygophyllaceae chloroplast genomes with that of Amborella trichopoda, which was thought to be the most primitive group of angiosperms, and the result showed that all the ndh genes encoding subunits of NADH oxidoreductase were lost in T. mongolica, Z. xanthoxylon, and Z. fabago which usually located in SSC and IRs. Moreover, rps16, rpl12, ycf2 and infA, which were common in the chloroplast genomes of most angiosperms, lost in the chloroplast genomes of these three Zygophyllaceae plants.
To verify the loss of ndh genes in chloroplast genomes of these three species, utilizing tobacco as the reference, the gene fragment psaC-ndhE-ndhG-ndhI-ndhA-ndhH-rps15 and rps7-ndhB-trnL-CAA located in SSC and IRA regions of tobacco chloroplast genome and the corresponding fragments in chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago were selected for verification. The results showed that ndhE, ndhG, ndhI, ndhA, ndhH and ndhB genes were lost in the selected fragments of chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago (Fig 2).
We used REPuter [30] and Tandm Repeats Finder [46] to identify the palindrome repeats, forward repeats, reverse repeats, and tandem repeats of chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago (Fig 3). A total of 53, 40, and 38 long repeats were detected in three chloroplast genomes (Fig 3a). The chloroplast genome of T. mongolica contained 49 tandem repeats, 3 palindrome repeats, and 1 reverse repeats. The chloroplast genome of Z. xanthoxylon contained 36 tandem repeats and 4 palindrome repeats, while the chloroplast genome of Z. fabago contained 34 tandem repeats, 3 palindrome repeats, and 1 reverse repeats (Fig 3b). In the three chloroplast genomes, long repeats with the length of 10 bp was the most common category, and then 11 bp and 12 bp categories (Fig 3c).

PLOS ONE
The complete chloroplast genome sequencing analysis revealed an unusual IRs reduction in three species

Codon usage
Codon preference (codon usage bias) indicates the result of combined action of natural selection, species mutations, and genetic drift [47]. In the present study, according to the sequences

PLOS ONE
Relative synonymous codon usage analysis indicated that there was more than one synonym codon for almost all (except methionine) amino acids in the three chloroplast genomes, and the codons of UGG (tryptophan) and AUG (methionine) exhibited no usage bias (RSCU = 1) (Fig 6). About half of the codons have a RSCU value of >1.00 (30,30, and 32 for the chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago, respectively), and all codons with usage bias (RSCU>1) except CGU ended with A or U.

PLOS ONE
The complete chloroplast genome sequencing analysis revealed an unusual IRs reduction in three species

Comparative genomics analysis
To detect gene loss and inversion, we compared the chloroplast genomes of the three Zygophyllaceae species with those of Averrhoa carambola, Linum usitatissimum, Erythroxylum novogranatense, Geranium maderense, and Erodium carvifolium, using MAUVE. The results pointed out that the size of the chloroplast genomes of the three Zygophyllaceae species were approximately (10-60) kb smaller than those of other species (Fig 7), and all 11 genes which encoded the subunits of NADH dehydrogenase (ndh gene) were lost from SSC and IRs. Moreover, the 4 rRNA that appeared in the IR region in most other plant were transferred to the SSC region in the three Zygophyllaceae species. In addition, compared with other species, there were no gene inversions in LSC region, SSC region, and IR region in the chloroplast genomes of the three Zygophyllaceae species. In order to characterize genomic divergence between T. mongolica, Z. xanthoxylon, and Z. fabago, mVISTA software was employed to identify the divergent regions in the chloroplast genomes of the three Zygophyllaceae species, and Tribulus terrestris chloroplast genome was utilized as reference (Fig 8). The two IR regions were more conserved than LSC and SSC region, and the non-coding regions exhibited higher divergence than the coding regions. Moreover, the highest divergent regions in the three chloroplast genomes were detected in the intergenic spacer regions, including trnK-trnQ, trnQ-psbK, trnS-trnG, trnG-trnR, trnR-atpA,

Phylogenetic analysis
To investigate the phylogenetic status of the three Zygophyllaceae species in angiosperms and their interspecific relationships, 50 protein-coding genes from 69 plant species were phylogenetically analyzed using MEGA X software (Fig 9). All the plants chosen belong to the Core Eudicots branch according to the APG classification [6,7,48,49]. The results indicated that Caryophyllales and Santalales were early-divergent angiosperms, and order Vitales was the earliest divergent clade of Rosids. Of Malvids and Fabids clades, Myrtales, Geraniales, and Zygophyllales were early evolutionary groups. As expected, the three Zygophyllaceae species were clustered in the Fabids clade together with Oxalidales, Malpighiales, Celastrales, Rosales, Fagales, Cucurbitales, and Fabales. But unexpected, the four Zygophyllales plants were clustered in one branch with Geraniales and Fabales, considered that Geraniales was classified in Malvids according to the latest APG classification. In Zygophyllaceae, Z. xanthoxylon and Z. fabago formed a monophyletic branch with 100% bootstrap value, and the branch was sister clade to the genus Tetraena.

IR expansion and contraction
Although the IR region is thought to be the most conserved region in chloroplast genome, the contraction and expansion of the IR region boundary is a common phenomenon in the evolution of the chloroplast genome and the main cause of the chloroplast genome size alteration [50][51][52]. Here, we conducted a comparative analysis of the IR/LSC and IR/SSC boundary regions of T. terrestris, T. mongolica, Z. xanthoxylon, and Z. fabago (Fig 10). In these three chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago, no pseudogenes and genes crossing the border were found. The boundary was between rpl22 and trnH-GUG on the IRB/LSC side, and between trnH-GUG and psbA on the IRA/LSC side. In T. mongolica, the

PLOS ONE
The complete chloroplast genome sequencing analysis revealed an unusual IRs reduction in three species PLOS ONE | https://doi.org/10.1371/journal.pone.0263253 February 2, 2022 boundary of IRB/SSC was located between trnL-CAA and trnL-UAG, and the boundary of IRA/SSC was between rpl32 and trnL-CAA. In Z. xanthoxylon and Z. fabago, the boundary of IRB/SSC was located between rpl32 and trnL-CAA, and the boundary of IRA/SSC was between rps7 and trnL-CAA.
Specifically, in the IR region of T. mongolica, Z. xanthoxylon and Z. fabago, trnH-GUG deviates from the IR/LSC boundary by 129 bp, 155 bp, and 164 bp, respectively. trnL-CAA is 566 bp, 449 bp, 594 bp, respectively, from the IR/SSC boundary. The gene rpl22 located in LSC, which was 13-28 bp from the IRB/LSC border, similarly, the gene psbA deviated from the IRA/LSC by 80-130 bp. Among the three species, the genes close to the IR/SSC border in SSC

PLOS ONE
The complete chloroplast genome sequencing analysis revealed an unusual IRs reduction in three species were different. In T. mongolica, trnL-UAG was 554bp from IRB/SSC boundary, and rpl32 was 46bp from IRA/SSC boundary. In both Z. xanthoxylon and Z. fabago, rpl32 and rps7 located close to the border of IRB/SSC and IRA/SSC.

Discussion
The sizes of the three Zygophyllaceae chloroplast genomes are significantly shorter than those of most angiosperms. In majority of angiosperms, the chloroplast genomes are 120-160 kb in length, while the sizes of the chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago range from 104 to 106 kb. The LSC regions of most angiosperms are generally about 80-90 kb in length, while the SSC regions are about 16-27 kb in length, and the size of two IRs are approximately 20-28 kb. Compared with most angiosperms, the sizes of LSC and SSC of T. mongolica, Z. xanthoxylon, and Z. fabago don't change significantly, and the most conspicuous change is occurred in two IRs reduced by about 16-24 kb in size. Thus, the reduced sizes of chloroplast genomes of these three Zygophyllaceae species are mainly associated with the shrinkage of IRs.
Although the chloroplast genome is highly conservative, several chloroplast genomes are significantly smaller than that of most other plants, and some of them are listed in S3 Table. The most common reports of small chloroplast genomes came from studies of chloroplast genomes in parasitic plants, including Taxillus chinensis and T. sutchuenensis in Loranthaceae of Santalales [53], Epifagus virginiana in Orobanchaceae of Lamiales [54], Cuscuta chinensis and C. japonica in Convolvulaceae of Solanales [55]. Smaller chloroplast genomes were also found in some gymnosperms such as Welwitschia mirabilis in Welwitschiaceae of

PLOS ONE
Welwitschiales [56], and Gnetum ula in Gnetaceae of Gnetales [57]. In non-parasitic angiosperms, the chloroplast genome with the size smaller than 130 kb was rarely reported except Astragalus membranaceus, whose chloroplast genome was approximately 124 kb, partly due to the loss of an IR [58]. The shrinkage of chloroplast genomes of the other plant species were associated with significant reduction in size of SSCs, for example, the SSCs of chloroplast genomes of the parasitic plants in S3 Table were less than half of that in tobacco [59], a classical angiosperm chloroplast genome. In the three Zygophyllaceae species, the sizes of LSC and SSC decrease slightly, but the lengths of IRs decrease dramatically. Thus, the three Zygophyllaceae species could be utilized as novel models to investigate the evolution of chloroplast genome structure and size.
Comparison of three Zygophyllaceae chloroplast genomes with those of other plant species reveal that, 4 rRNA genes usually presented in IRs are located in SSC region in these three chloroplast genomes, and thus leading to the reduction of the copy number of rRNA genes. In addition, it had been reported that due to the contraction and expansion of IR regions in the chloroplast genome of Pothos scandens in Araceae, some genes which existed in IR regions transferred to the LSC region becoming single copy and most of genes which appeared in SSC region transferred to the IR regions turning into duplicated, resulting in the change of gene numbers and the increased size of LSC region and the decreased size of SSC region [60]. Different from our study, although the IR regions had contracted and expanded, there was no loss of genes and no significant change in the size of the chloroplast genome in Pothos scandens. Similar to our observation, previous studies had reported rRNA gene displacement in Erodium species [61]. And all ndh genes usually located in SSC and IRs region encoding subunits of NADH oxidoreductase are lost. Moreover, rps16, rpl12, ycf2 and infA, which are common in most angiosperm chloroplast genomes, are lost in chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago. All above may be the possible reasons for the size reduction of IRs region. In addition, the NADH dehydrogenase complex in plant plastids are involved in photosynthesis in response to environmental stress. Although very uncommon, the ndh gene losses or pseudogenization are widespread phenomena in chloroplasts of different lineages of seed plants which are photoautotrophic [62]. The phenomenon had been reported that the ndh genes of plant plastid were specifically lost and NDH subunits which were nuclearencoded were expression in Pinaceae [57], Orchidaceae [63], gnetophytes [64] and Geraniales [61]. Adaptation to the environment is especially critical for plants grow in barren soil in the arid and semi-arid regions. The current result reveals the loss of 11 ndh genes in these chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago, and it is not certain whether ndh genes encoded by plastid have been lost completely or moved to cell nucleus functionally for the three Zygophylloideae species, which deserves to be discussed.
Previously reports had shown that losses of plastid-encoded ndh genes in Pinaceae possibly occurred before the divergence of this lineage (140 MYA) [57,65]. The most recent losses of plastid-encoded ndh genes were found in a long divergent branch with 13 species in Erodium which had been supposed to predate the divergence of this branch (3 MYA) [61,62]. A more recent phenomenon of pseudogenization of 4 ndh genes in genus Melianthus of Geraniales [66]. This branch was found to have diversified about 2 MYA and preserved some translatable sequences in the plastome [62]. In our study, T. mongolica was from the genus Tetraena of Zygophylloideae, and Z. xanthoxylon and Z. fabago belonged to the genus Zygophyllum of Zygophylloideae. All 11 plastid-encoded ndh genes were loss in the three species. However, the ndh genes were intact in chloroplasts of Larrea tridentata of Larreoideae and Tribulus terrestris of Tribuloideae. Subfamily Larreoideae and Tribuloideae were classified into Zygophyllaceae. It might suggest that the loss of plastid-encoded ndh genes in the three species involved in our study had possibly occurred ahead of the divergence of subfamily Zygophylloideae (38 MYA) [67]. However, due to the limited number of species chosen in our study, more species from Zygophyllaceae and Zygophylloideae could be added in subsequent studies which is helpful to further explore the loss of ndh genes and the function of NADH complex in Zygophyllaceae.
In prior studies, the correlations of repeats, SNPs and InDels were analyzed in chloroplast genomes of Malvaceae [68]. It was shown fluctuations in correlations at the family level, the subfamily level and the genus level in quantitative researches. While up to 90% of repeats and SNPs were simultaneous, and 52%-72% of repeats contained InDels at the family and subfamily level in qualitative studies. And it was hypothesized that the correlations among mutation events might be a usual feature in plant chloroplast genomes. This showed the important role of repeats in the generation of SNPs and InDels. 10 polymorphic loci were identified in chloroplast genomes of Blumea species, among which 5 regions were concurrent with repeats [69]. In our current study, we identified 8 polymorphic loci, and 7 were existed in the regions where repeats emerged except rps15-trnN-GUU. The co-occurrence proportion of repeats and polymorphic loci was as high as 87.5%. This result also supported the view that repeats could be utilized to identify the polymorphic loci for future researches on phylogeny and taxonomic status of plant.
Phylogenetic trees based on 50 common protein-coding genes in the chloroplast genomes of 69 plant species provide crucial molecular evidence for exploring the phylogenetic status of the three Zygophyllaceae species. Considered that Zygophyllaceae had been classified in the order of Geraniales in Flora Reipublicae Popularis Sinicae [4] and Flora of China [5], our results support the latest taxonomic classification of Zygophyllaceae described in APG IV in which Zygophyllaceae belongs to Fabids rather than Malvids. T. mongolica, Z. xanthoxylon and Z. fabago are clustered into a single branch with Larrea tridentata, which is another species in Zygophyllaceae, and the four Zygophyllaceae species are clustered in the Fabids clade together with Oxalidales, Malpighiales, Celastrales, Rosales, Fagales, Cucurbitales, and Fabales. Our phylogenetic analyses also reveal the close relationship between Z. fabago and Z. xanthoxylon, and support the incorporation of the Z. xanthoxylon into the genus Zygophyllum.
At the same time, our phylogenetic analysis also raises some new speculations on the evolutionary status of Zygophyllaceae and other related taxonomic branch, which need to be investigated further. First, our results show that Zygophyllales is clustered in a small branch with Fabales, but not with other orders in Fabids like Oxalidales, Malpighiales, and Rosales, indicating a closer relationship between Zygophyllales and Fabales which is not reported in previous reports. Second, it is surprisingly to find Zygophyllales of Fabids are clustered in a single clade with many plant species in Geraniales, which are classified into Malvids according to APG IV [7]. Our results raise a possibility that at least part of species in Geraniales belong to Fabids instead of Malvids, just as Zygophyllales was once classified in Malvids and is now classified in Fabids.
It should be noted that in our study the phylogenetic tree was constructed based on 69 species belonging to 51 genera and 30 families, including plants from Oxalidales, Malpighiales, Celastrales, Rosales, Fagales, Cucurbitales and Fabales which were also classified into Fabids like Zygophyllaceae, and species from Malvids with disputed classification. Four species from two subfamilies (six in total) of Zygophyllaceae, among which three species from two genera (six in total) of Zygophylloideae, were chosen in this study. The three species were T. mongolica, Z. xanthoxylon, and Z. fabago with significant shortage in size of the chloroplast genomes which were concerned in our study. Based on the limited number of species selected, future research could consider more species of Zygophyllaceae to conduct more detailed phylogeny analysis. It will be helpful to explore the phylogenetic status and evolution of Zygophyllaceae.
In brief, we assemble the whole chloroplast genomes of T. mongolica, Z. xanthoxylon, and Z. fabago. Our study reveals the unusual reduction of the three chloroplast genomes, especially IR regions, and the loss of 11 genes cording subunits of NADH dehydrogenase in SSC and IRs region. Comparative genomics identify the genetic variation between the chloroplast genomes of the three Zygophyllaceae species and other plant species. Phylogenetic analysis according to 50 common protein-coding genes of 69 plant chloroplast genomes support current understanding of the phylogenetic status of Zygophyllaceae.
Supporting information S1