Analysis of the Complete Chloroplast Genome of a Medicinal Plant, Dianthus superbus var. longicalycinus, from a Comparative Genomics Perspective

Dianthus superbus var. longicalycinus is an economically important traditional Chinese medicinal plant that is also used for ornamental purposes. In this study, the chloroplast (cp) genome of D. superbus was compared with the cp genomes of closely related Caryophyllales species, namely Lychnis chalcedonica and Spinacia oleracea. D. superbus had the longest large single copy (LSC) region (82,805 bp), with some variations at the inverted repeat A (IRA)/LSC boundary. The IRs underwent both expansion and contraction during evolution of the Caryophyllaceae family; however, intense variations were not identified. A pseudogene of ribosomal protein subunit S19 (rps19) was identified at the IRA/LSC junction, but was not present in the cp genomes of other Caryophyllaceae family members. The translation initiation factor IF-1 (infA) and ribosomal protein subunit L23 (rpl23) genes were absent from the Dianthus cp genome. When the cp genome of Dianthus was compared with those of 31 other angiosperm lineages, the infA gene was found to have been lost in most members of the rosids, in Solanales of the asterids and in Lychnis of Caryophyllales, whereas rpl23 gene loss or pseudogenization had occurred exclusively in Caryophyllales. The cp genomes of Dianthus and Spinacia have two introns in the gene encoding the proteolytic subunit of ATP-dependent protease (clpP), whereas Lychnis has lost the introns from clpP. Furthermore, phylogenetic analysis of the individual protein-coding genes infA and rpl23 revealed that gene loss or pseudogenization occurred independently in the cp genome of Dianthus. Molecular phylogenetic analysis based on 78 protein-coding sequences also demonstrated a sister relationship between Dianthus and Lychnis. The results presented herein will contribute to studies of the evolution, molecular biology and genetic engineering of the medicinal and ornamental plant D. superbus var. longicalycinus.


Introduction
Chloroplasts are double-membrane-bound plant organelles that carry genes essential for photosynthesis and for other biochemical pathways such as the biosynthesis of starch, fatty acids, pigments and amino acids [1]. This organelle possesses its own single circular DNA chromosome, which is highly conserved among species. Most chloroplast genomes carry two copies of an inverted repeat (IR) separated by a large single copy (LSC) region and a small single copy (SSC) region. To date, more than 340 chloroplast (cp) genomes have been completely sequenced and characterized and are available in the Chloroplast Genome Database (http://chloroplast.ocean.washington.edu/tools/cpbase/run). The majority of angiosperm cp genome sequences are highly conserved, and these usually encode four rRNAs, 30 tRNAs and approximately 80 unique proteins. Previous studies based on restriction site mapping reported that gene content, gene order and genome organization are highly conserved within terrestrial plants [2,3]. However, as more chloroplast genomes have become available in the database, comparative genome studies have been carried out. These investigations have revealed many structural gene rearrangements, large IR expansions and the occurrence of gene loss in numerous angiosperm lineages [4,5]. Such studies are essential to the reconstruction of plant phylogenetic trees [6], DNA barcoding [7], and population [8] and transplastomic studies [9].
Angiosperms, the flowering plants, are thought to have originated approximately 160 million years ago [10]. The angiosperms consist of four major groups: basal angiosperms, magnoliids, monocots and eudicots. Caryophyllaceae is considered a large and diverse family of eudicots, consisting of 86 genera and 2,200 species [11][12][13]. These flowering plants are widely distributed in the Mediterranean and bordering regions of Europe and Asia. The Dianthus genus consists of nearly 300 species native to Europe and Asia, with a few species extending to North Africa and arctic North America. The blooms of D. superbus are five petaled with green eyes. The petals are deeply notched, giving them a feathery or fringed appearance, which is why the plant is commonly referred to as fringed pink or large pink. D. superbus includes two varieties, longicalycinus and speciosus. D. superbus var. longicalycinus is a herbaceous evergreen perennial plant that reaches 6-12 inches in height and is commonly grown in East Asian countries, especially China, Japan and Korea. D. superbus var. longicalycinus is a popular garden plant that has been used for its scent and as a Chinese herbal medicine (Qu Mai) for over 2,000 years. Specifically, it is commonly used as an anti-inflammatory agent for urinary infections, carbuncles and carcinoma of the esophagus [14,15]. The ethanol extract of D. superbus has been shown to suppress the production of IgE in a human B cell line, a murine model of peanut allergy, interleukin-4 (IL-4), IL-13 and eotaxin [16]. This medicinal herb stimulates the digestive and urinary systems, lowers blood pressure and reduces fever [16,17]. This plant also acts as an antibacterial agent, abortifacient, contraceptive, diuretic, emmenagogue, ophthalmic, tonic and hair growth promoter, and has the potential for use as an antifertility agent [17].
The plant is taken internally to treat acute urinary tract infections (especially cystitis), urinary stones, constipation and failure to menstruate [14]. It is applied externally to treat skin inflammation and swelling. The leaves are used in the treatment of hemorrhoids, lumbricoid worms, and venereal sores, while the flowers are used as an astringent, diuretic, hemostatic, resolvent and vulnerary [15].
Many genes have been lost from the chloroplast genome during plant evolution [18]. Martin et al. [19] reported that most of these losses happened in the interval between the original endosymbiosis of a cyanobacterium (containing ~2000 protein-coding genes) and the last common ancestor of all existing chloroplast genomes (~210 protein-coding genes). Gene losses and pseudogenes have been observed in several land plants [18]. Examples include the cell viability pseudogene ycf2 in rice and maize [20,21], the ribosomal protein subunit L23 gene (rpl23) in spinach [22] and the translation initiation factor gene (infA) in tobacco, Arabidopsis and Oenothera elata [23][24][25][26]. Previous studies showed that the chloroplast genes for ribosomal protein subunit L22 (rpl22) and ribosomal protein subunit S16 (rps16) and the photosystem I subunit gene ycf4 have been lost from some or all legumes [27,28]. Additionally, nicotinamide adenine dinucleotide (NADH) dehydrogenase F (ndhF) and ycf2 were lost repeatedly from a variety of angiosperms [29][30][31]. Intron loss has occurred in the clpP (proteolytic subunit of ATP-dependent protease) gene of Sileneae [32]. Gene loss, pseudogenes, intron loss, inversions, shifts in inverted repeat boundaries and large insertions and deletions in the cp genomes of land plants therefore provide the most information about the evolutionary mechanisms involved.
Owing to the lack of chloroplast genome information for this important medicinal and ornamental plant, there is a need to develop its genetic resources further. We previously sequenced and reported the cp genome of Dianthus superbus var. longicalycinus [33]. In this study, we characterized and analyzed the cp genome of Dianthus and compared it with the cp genomes of the closely related species Lychnis chalcedonica and Spinacia oleracea. The cp pseudogenes infA and rpl23 and the intron-containing clpP gene of Dianthus were analyzed and compared with those of 31 other angiosperm lineages to understand the evolution of these genes. In addition, molecular phylogenetic analyses were conducted based on 78 protein-coding genes from 32 taxa. The results presented herein will contribute to a better understanding of the molecular biology, genetics and evolution of the Dianthus genus. These data should also be useful for future studies of chloroplast genomes and for phylogenomic studies of Caryophyllales.

Materials and Methods
Comparative genome analysis of the Dianthus chloroplast genome
The complete chloroplast genome of D. superbus var. longicalycinus was compared with those of three other species: L. chalcedonica, S. oleracea and N. tabacum. To visualize the four cp genomes, the annotated genomes were aligned using the Mauve program [34] and plotted with Circos 0.67 [35] to show gene locations, GC skew and GC content. The four cp genomes were also compared using the mVISTA program in Shuffle-LAGAN mode [36], with Dianthus set as the reference.
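The GC content and GC skew tracks mentioned above follow from simple base counts. The following Python sketch is illustrative only and is not the Circos workflow used in this study; the function names, the window and step sizes, and the toy sequence are assumptions made for the example.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases of seq (0.0 if none)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    total = g + c + seq.count("A") + seq.count("T")
    return (g + c) / total if total else 0.0


def gc_skew(seq: str) -> float:
    """GC skew, defined as (G - C) / (G + C); 0.0 if seq has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding_windows(seq: str, window: int, step: int):
    """Yield (start, gc_content, gc_skew) for successive windows of seq."""
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = seq[start:start + window]
        yield start, gc_content(chunk), gc_skew(chunk)


if __name__ == "__main__":
    toy = "ATGCGCGGTTAACCGGATATGGGC"  # hypothetical sequence, not real cp DNA
    for start, gc, skew in sliding_windows(toy, window=8, step=8):
        print(start, round(gc, 3), round(skew, 3))
```

In practice the window and step would be chosen to match the resolution of the plotted track.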
Analysis of tandem repeats and simple sequence repeats (SSRs)
PHOBOS v3.3.12 was used to detect tandem repeats and simple sequence repeats (SSRs). The alignment scores for match, mismatch, gap and N positions were set to 1, -5, -5 and 0, respectively [39].
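To illustrate what an SSR search reports, the sketch below scans a sequence for perfect repeats only. It is not a reimplementation of PHOBOS, which also scores imperfect repeats using the match, mismatch, gap and N scores given above; the function names and the minimum repeat thresholds are hypothetical values chosen for the example.

```python
def is_primitive(motif: str) -> bool:
    """True if motif is not itself a repetition of a shorter unit (e.g. 'AT' but not 'AA')."""
    return motif not in (motif + motif)[1:-1]


def find_perfect_ssrs(seq: str, min_repeats=None):
    """Return (start, motif, repeat_count) for perfect SSRs.

    min_repeats maps unit length to the minimum number of repeats;
    the defaults are illustrative thresholds, not those of the study.
    """
    if min_repeats is None:
        min_repeats = {1: 10, 2: 5, 3: 4}
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        i = 0
        while i + unit_len <= len(seq):
            motif = seq[i:i + unit_len]
            # Skip ambiguous bases and motifs that belong to a shorter unit size.
            if set(motif) - set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            count = 1
            while seq[i + count * unit_len:i + (count + 1) * unit_len] == motif:
                count += 1
            if count >= min_rep:
                hits.append((i, motif, count))
                i += count * unit_len  # continue after the repeat
            else:
                i += 1
    return hits


if __name__ == "__main__":
    toy = "GG" + "A" * 12 + "CC" + "AT" * 6 + "GG"  # hypothetical sequence
    print(find_perfect_ssrs(toy))
```

Start positions are 0-based, and hits are listed by unit length and then by position.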

Analysis of RNA editing
The online program, Predictive RNA Editor for Plants (PREP) suite (http://prep.unl.edu/) [40], was used for the analysis of possible RNA editing sites in protein-coding genes of the Dianthus cp genome. For this analysis, the cut-off value was set at 0.8. The PREP-cp program has 35 reference genes for revealing RNA editing sites in the chloroplast genomes.

Synonymous (Ks) and nonsynonymous (Ka) substitution rate analysis
The completed cp genome sequence of Dianthus was compared with the cp genome sequences of Lychnis and Spinacia. To analyze synonymous (Ks) and nonsynonymous (Ka) substitution rates, the same individual functional protein-coding exons were extracted, translated into protein sequences and aligned separately using Geneious v7.1.7. The Ks and Ka values for each protein-coding exon were estimated in DnaSP [41].

Phylogenetic analysis
The 31 completed cp genome sequences representing the lineages of angiosperms were downloaded from the NCBI Organelle Genome Resource database (S1 Table). The individual protein-coding genes infA, rpl23 and clpP from 32 angiosperms (including Dianthus) were analyzed separately to assess their evolutionary significance. The nucleotide sequences of each gene were subjected to Geneious alignment using Geneious v7.1.7. The 78 protein-coding gene sequences and the three individual gene sequences were aligned separately using MAFFT v7.017 [38] through Geneious v7.1.7. The aligned protein-coding gene sequences were saved in PHYLIP format using Clustal X v2.1 [42] and used to generate a phylogenetic tree. Maximum likelihood (ML) analysis was performed with RAxML v7.0 [43] using the general time-reversible nucleotide substitution model with invariant sites (GTR+I) and the default parameters. The bootstrap support of each branch was calculated from 1000 replicates.
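As an illustration of how per-gene alignments can be combined into a single PHYLIP-style matrix for tree building, a minimal Python sketch follows. It assumes that the gene alignments are concatenated and that a taxon lacking a gene (for example, a lost infA) is padded with gaps; neither step is stated in the methods above, and the function names and toy sequences are hypothetical.

```python
def concatenate_alignments(gene_alignments):
    """Concatenate per-gene alignments (dicts of taxon -> aligned sequence).

    Taxa missing from a gene alignment are padded with gaps (an assumption
    of this sketch, not a documented step of the study).
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment differ in length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def to_relaxed_phylip(alignment):
    """Format an alignment as relaxed sequential PHYLIP text (name, space, sequence)."""
    n_sites = len(next(iter(alignment.values())))
    lines = [f"{len(alignment)} {n_sites}"]
    lines.extend(f"{taxon} {seq}" for taxon, seq in alignment.items())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    gene1 = {"Dianthus": "ATG-", "Spinacia": "ATGC"}  # toy data
    gene2 = {"Spinacia": "TTA"}                       # gene absent in Dianthus
    print(to_relaxed_phylip(concatenate_alignments([gene1, gene2])))
```

Taxon names in this format must not contain whitespace.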

Results
Comparison of the D. superbus chloroplast genome organization and gene contents with other cp genomes
The cp genome of the medicinal plant D. superbus var. longicalycinus was analyzed, characterized and compared with those of closely related species. The genome organization, gene content, GC skew and GC content of the four cp genomes were compared. The Circos diagram demonstrated a close genomic relationship between Dianthus and the other cp genomes (Fig 1). The Dianthus cp genome encodes 78 protein-coding genes, 30 tRNA genes and four rRNA genes (Table 1). Seventeen genes are duplicated in the IR regions. The cp genome also has 17 intron-containing genes, 14 of which (8 protein-coding and 6 tRNA genes) contain one intron and three of which (clpP, rps12 and ycf3) contain two introns (Table 2). All genes had the common start codon ATG, except rps19, which carried ACG as a start codon.
Most of the genes were present in all cp genomes. The other Caryophyllales species, Lychnis and Spinacia, as well as Nicotiana, also encode 30 tRNAs and four rRNAs. The Caryophyllales species share an identical number of protein-coding genes (78), whereas Nicotiana encodes 88 protein-coding genes. The number of intron-containing genes varied among these species: both Dianthus and Spinacia contain 17, whereas Lychnis and Nicotiana have 16 and 15, respectively. The GC content of Dianthus is similar to that of Lychnis (36.3%), while that of Spinacia is 34.8% and that of Nicotiana is 37.8% (Fig 1).

mVISTA was employed to study sequence variations in the Caryophyllaceae family and Nicotiana. This analysis revealed that the coding regions are more highly conserved than the non-coding regions (Fig 2). However, the most dissimilar coding regions of the four chloroplast genomes were clpP, infA, ycf1 and ycf2.

Comparisons of boundary regions of Dianthus with closely related cp genomes
The LSC/IRB/SSC/IRA boundary regions of the Dianthus cp genome were compared to the corresponding regions of the three other cp genomes of Lychnis, Spinacia and Nicotiana (Fig 3). The rps19 gene of Dianthus (133 bp of 279 bp) and Spinacia (135 bp of 279 bp) extended from the IRB into the LSC region with 2 bp variability. In contrast, the rps19 gene of Nicotiana was shifted into the LSC region with a 2 bp gap, and rps19 was absent from Lychnis. At the IRB/SSC boundary, the ycf1 and ndhF genes of Dianthus overlapped, whereas the ycf1 gene of Lychnis was not present. Expansion, contraction and shifting of the ycf1 gene were observed in the boundary regions of SSC/IRA. The size of ycf1 varied from 5394 bp to 6002 bp among the cp genomes. The pseudogene rps19 was only present at the IRA/LSC junction of the Dianthus genome. The trnH gene was located in the LSC region of all genomes, but its distance from the IRA/LSC junction varied from 1 bp to 42 bp. When compared with other closely related cp genomes, the IR region of Dianthus (24,803 bp) was found to be smaller than that of Spinacia (25,073 bp), but larger than the Lychnis IR region (23,540 bp).

Pseudogenization of infA and rpl23 genes
The chloroplast genes infA and rpl23 of Dianthus were compared with those of 31 other angiosperms. Both infA and rpl23 were found to be pseudogenes in the cp genome of Dianthus. Among the 32 angiosperms (including Dianthus), the infA gene was found to be a pseudogene or entirely missing from Dianthus and Lychnis of Caryophyllales, as well as from Brassicales, Cucurbitales, Fabales, Malpighiales, Malvales, Myrtales and Sapindales of the rosids and Solanales of the asterids (Fig 4 and S2 Fig). Comparative analysis of the ribosomal protein gene rpl23 in the 32 angiosperms revealed that it was a pseudogene or lost exclusively in members of Caryophyllales, namely Dianthus, Lychnis and Spinacia (Fig 5 and S3 Fig).

RNA editing
The PREP-cp program predicted 45 RNA editing sites in 16 genes of the Dianthus cp genome. Of these 16 genes, ndhB and ndhD contained 10 RNA editing sites. The RNA editing events in Dianthus were all non-silent, and 100% were C to U (S2 Table). Of these, 75.56% (34) occurred in the second base position of the codon, whereas 24.44% (11) were in the first position of the codon. In every case, the nucleotide substitution in the codon changed the encoded amino acid. Among the 45 amino acid changes, 22 were from hydrophilic to hydrophobic (S to L, S to F and T to I), 12 from hydrophobic to hydrophobic (A to V, P to L and L to F), seven from hydrophilic to hydrophilic (T to M, H to Y and R to W) and four from hydrophobic to hydrophilic (P to S). Among these, 15 amino acids (33.3%) were converted from serine to leucine.
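The counts and percentages reported above can be checked with a few lines of Python, and the effect of a C-to-U edit on a codon can be shown with the standard genetic code. This is a minimal sketch: the counts are taken from the text, whereas the example codons are illustrative choices and are not taken from S2 Table.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, indexed in TCAG order for each codon position.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def edit_c_to_u(codon: str, position: int):
    """Apply a C-to-U edit (written as C-to-T in DNA letters) at a 0-based
    codon position and return (amino acid before, amino acid after)."""
    codon = codon.upper()
    if codon[position] != "C":
        raise ValueError("no C at the requested codon position")
    edited = codon[:position] + "T" + codon[position + 1:]
    return CODON_TABLE[codon], CODON_TABLE[edited]


def percent(part: int, whole: int) -> float:
    """Percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts as reported in the text.
POSITION_COUNTS = {"first": 11, "second": 34}
CLASS_COUNTS = {
    "hydrophilic->hydrophobic": 22,
    "hydrophobic->hydrophobic": 12,
    "hydrophilic->hydrophilic": 7,
    "hydrophobic->hydrophilic": 4,
}

if __name__ == "__main__":
    total = sum(POSITION_COUNTS.values())
    for name, count in POSITION_COUNTS.items():
        print(name, percent(count, total))
    print(edit_c_to_u("TCA", 1))  # illustrative second-position edit
```

Both tallies sum to the 45 predicted sites, which is a quick consistency check on the reported figures.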

Synonymous (Ks) and nonsynonymous (Ka) substitution rate analysis
A total of 76 genes encoding 87 protein-coding exons in the cp genome of Dianthus were used to analyze synonymous and nonsynonymous rates against Lychnis and Spinacia (Fig 6). The

Phylogenetic analysis
A molecular phylogenetic tree was constructed using 78 protein-coding genes from 32 cp genome sequences. Among these 32 taxa, Nelumbo was set as the outgroup. The phylogenetic tree was divided into two clades, rosids and asterids. Within asterids, Caryophyllales (core eudicots) diverged from asterids and formed two sister clades with a 100% bootstrap (BS) value. The Caryophyllales contained two sub sister clades. The first sub clade included Spinacia (Amaranthaceae), whereas Dianthus and Lychnis (Caryophyllaceae) were in the second sub clade with a 100% BS value (Fig 7).

Discussion
Comparative genome analysis revealed several dissimilarities in the Caryophyllaceae family. Comparison of the gene contents of Dianthus with those of the other three cp genomes revealed that the protein-coding, tRNA and rRNA gene numbers were the same as those of Lychnis and Spinacia, namely 78, 30 and 4 genes, respectively. This might reflect gene content that is shared across Caryophyllales. However, the total number of introns in the plastid genome differs within this group. Specifically, Dianthus and Spinacia each have a total of 22 introns in the cp genome, whereas Lychnis contains only 20 introns. This difference was due to the loss of two introns in the clpP gene of Lychnis, which might in turn reflect the rapid evolution of the clpP gene in Lychnis species [32,44]. Conversely, Nicotiana contains 24 introns in the cp genome [23]. The difference in intron number between Nicotiana and Caryophyllaceae was due to the absence of an intron in the rpl2 gene of Caryophyllales. Downie et al. [45] revealed that several lineages of flowering plants had lost introns from the rpl2 gene independently, which could also be considered a distinguishing feature of core members of Caryophyllales [46].
The occurrence of IR regions could help stabilize the cp genome, and the most significant feature of the IR region is its resistance to recombinational loss [47]. Goulding et al. [48] reported that fluctuations have occurred sporadically in the IR regions of angiosperms during evolution. A copy of the IR genes was lost during the rearrangement of cpDNA in the evolution of angiosperms [49]. As shown in Fig 3, the IRs have both expanded and contracted during evolution of the Caryophyllaceae family; however, intense variations were not identified. Nevertheless, some variations were detected in the IRA/LSC regions. Some species encoded two copies of the rps19 gene near the LSC/IRB and IRA/LSC junctions, while the Dianthus cp genome encoded one copy of the rps19 gene at the LSC/IRB junction and an rps19 pseudogene at the IRA/LSC junction. The rps19 pseudogene was shorter (146 bp) than the regular rps19 gene (279 bp). This pseudogenization might have been due to IR fluctuation in the cp genome of Dianthus. Interestingly, the ACG start codon was found in rps19. Neckermann et al. [50] reported that the ACG start codon is converted into the initiation codon AUG in Nicotiana by RNA editing. This might also occur in the D. superbus var. longicalycinus cp genome. Taken together, this evidence indicates that evolutionary rates of cp genomes in the Caryophyllaceae are comparatively mild, based on the relatively minor variations in the IR regions.
The infA and rpl23 genes appeared as pseudogenes or were lost from the cp genome of Dianthus. The sequence of infA was highly variable in Caryophyllales. The infA gene of Dianthus differed from that of Spinacia (Caryophyllales), the asterids (Coffea, Daucus, Helianthus, Jasminum, Lactuca and Panax) and the rosids (Liquidambar and Vitis) because infA is a pseudogene in Dianthus and Lychnis (Fig 7). Spinacia, in contrast, encodes an intact, functional infA gene. Compared with the infA gene of Spinacia, 170 bp were deleted from that of Dianthus, possibly due to a double frameshift mutation (6 bp insert) near the 3' end. Previous studies also suggested that a 124 bp deletion occurred in the infA gene of tomato [18]. Earlier studies revealed that the infA gene was lost independently from multiple angiosperm lineages, including other species within the Caryophyllales [18,46,51,52]. Interestingly, another gene, rpl23, appears as a pseudogene or was lost from Caryophyllales. Earlier studies also suggested that both genes have been lost or subjected to pseudogenization in other Caryophyllales, including S. latifolia, S. vulgaris, S. noctiflora, S. conica and Spinacia [32,53]. Inversions, intron losses and substitution rate accelerations occurred independently in the cp genomes of L. chalcedonica and S. paradoxa [32]. This gene loss might have been due to disruption of the nuclear-encoded DNA replication, recombination and repair machinery that regulates the cp genome [54]. These inversions and intron losses can be attributed to evolution of the plant organelle genome.
The evolution of the infA and rpl23 pseudogenes and of the intron-containing gene clpP of Dianthus was further compared with that in 31 other angiosperms. In the phylogenetic analysis, taxa from different families with gene or intron losses formed a clade, which suggests that the losses arose in independent evolutionary lineages for all three genes (Figs 4 and 5 and S4 Fig). The cp genes chlB, chlL and chlN have been lost independently from Gnetales and Gnetum [55] and Welwitschia [56]. The infA gene in Ipomoea and the rps16 gene in Passiflora and Populus have also been lost independently [57]. Likewise, loss or pseudogenization of the infA and rpl23 genes occurred independently in the cp genome of Dianthus. However, parallel evolution occurred in the cp genome of Lychnis because of loss of the introns from the clpP gene [32]. Intron loss from the clpP gene has also been identified in Cicer arietinum, Poaceae, Onagraceae, Oleaceae and Pinus [57,58]. Ronny et al. [18] also reported that cp infA was lost repeatedly during angiosperm evolution. The cp pseudogene rpl23 in spinach has been functionally replaced by a nuclear gene, which is similar to the homologous cytosolic ribosomal protein gene [59]. Earlier studies reported that the genes responsible for ribosomal proteins or other translocation components are involved in gene loss in both the chloroplast and mitochondrial genomes [60,61]. These events include the transfer of the chloroplast genes infA and rpl22, the substitution of the chloroplast genes rpl21 and rpl23 and uncharacterized losses of several mitochondrial ribosomal protein genes, in addition to the transfer of rps10 [60,61].
Although chloroplast genomes are considered highly conserved in land plants, regions with high sequence polymorphism are frequently observed among closely related species [62]. The presence of several SSR sites in the cp genome of Dianthus superbus indicates that these sites can be evaluated for intraspecific polymorphism, which could support highly sensitive phylogeographic and population structure studies of this species.
RNA editing is a post-transcriptional process that occurs mainly in the mitochondrial and cp genomes of higher plants [63]. This process may introduce substitutions or indels that alter the transcript sequence [9,63-65]. For example, in the ndhD gene, the initiation codon ACG was altered to AUG by this editing process. RNA editing by C to U substitution is common in most angiosperms [66], and the total number of editing sites varied from 20 to 37 [63,67-70]. Comparison with other Caryophyllales members such as Lychnis (48 editing sites) and Spinacia (47 editing sites) showed that the number of RNA editing sites and the editing characteristics of Dianthus were similar. Chen et al. [63] also reported that closely related taxa generally share more RNA editing sites due to evolutionary conservation.
The patterns of synonymous and nonsynonymous nucleotide substitution are important indicators in gene evolution studies [71]. Makalowski and Boguski [72] reported that nonsynonymous substitutions occurred less frequently than synonymous substitutions, and that the Ka/Ks ratio was less than one in most protein-coding regions. In this study, the Ka/Ks ratio was less than one in all protein-coding regions of Dianthus except rpl22, for which the ratio was 1.03407. This small fluctuation might have been due to nonsynonymous substitution in the rpl22 gene and is the result of silent mutation. However, the rpl22 nucleotide identity was less than 70% (66.6%) when compared with Lychnis.
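The interpretation of the Ka/Ks ratio described above can be expressed as a small helper. The sketch below is illustrative only: the tolerance band around one is a hypothetical threshold, not a value used in this study, and the first pair of Ka and Ks values in the usage example is invented; only the rpl22 ratio of 1.03407 comes from the text.

```python
def ka_ks_ratio(ka: float, ks: float):
    """Return Ka/Ks, or None when Ks is zero or negative (ratio undefined)."""
    if ks <= 0:
        return None
    return ka / ks


def classify_selection(ratio, tolerance: float = 0.05) -> str:
    """Label a Ka/Ks ratio.

    Ratios clearly below one suggest purifying selection and ratios clearly
    above one suggest positive selection; 'tolerance' is an assumed band
    within which the ratio is treated as close to neutral.
    """
    if ratio is None:
        return "undefined"
    if ratio < 1 - tolerance:
        return "purifying"
    if ratio > 1 + tolerance:
        return "positive"
    return "near-neutral"


if __name__ == "__main__":
    print(classify_selection(ka_ks_ratio(0.02, 0.10)))  # invented example values
    print(classify_selection(1.03407))                  # rpl22 ratio from the text
```

With the assumed tolerance, a ratio of 1.03407 falls in the near-neutral band rather than indicating clear positive selection.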
Few studies have been conducted to analyze the phylogenetic relationships within the Caryophyllaceae family, and the phylogenetic evolution of D. superbus has yet to be investigated. Cuenoud et al. [73] reported that Caryophyllaceae was a sister clade to Amaranthaceae based on matK analysis. Clement et al. [74] revealed that anthocyanin pigment producing Caryophyllaceae was associated with betalain pigment producing Amaranthaceae. Our results also strongly supported that Dianthus (Caryophyllaceae) formed a sister clade to Spinacia (Amaranthaceae) with a 100% BS value. Additionally, phylogenetic analysis strongly supports the loss or pseudogenization of infA and rpl23 in the cp genome of Dianthus (Fig 7). Because of the loss or absence of the rpl23 gene from Caryophyllales, the clade diverged from asterids into a new separate clade. Another functional gene, infA, was lost from many angiosperms, including Dianthus. Owing to the absence or loss of the infA gene from Dianthus and Lychnis, Spinacia diverged from this clade and formed a subclade. When we investigated the evolution of these genes, taxa from different families with infA and rpl23 gene losses were found to form a clade, which suggested that the losses occurred in independent evolutionary lineages.

Conclusion
In summary, the Dianthus cp genome shares the same overall organization and gene contents as the cp genomes of other Caryophyllaceae family members. However, several unique features were observed in the cp genome of Dianthus, including pseudogenization or loss of the rps19, infA and rpl23 genes. When compared with the other 31 angiosperm lineages, the infA gene has been lost from most members of the rosids, from Solanales of the asterids and from Lychnis of Caryophyllales, whereas rpl23 gene loss or pseudogenization has occurred exclusively in the cp genomes of Caryophyllales. Phylogenetic analysis of the individual protein-coding genes infA and rpl23 has also revealed that gene loss or pseudogenization occurred independently in the cp genome of Dianthus. Molecular phylogenetic analysis of 78 protein-coding genes revealed that Dianthus is most closely related to Lychnis and Spinacia. Overall, the results of this study will contribute to a better understanding of the evolution, molecular biology and genetic improvement of the medicinal and ornamental plant D. superbus var. longicalycinus.