Detecting useful genetic markers and reconstructing the phylogeny of an important medicinal resource plant, Artemisia selengensis, based on chloroplast genomics

Artemisia selengenesis is not only a health food, but also a well-known traditional Chinese medicine. Only a fraction of the chloroplast (cp) genome data of Artemisia has been reported and chloroplast genomic materials have been widely used in genomic evolution studies, molecular marker development, and phylogenetic analysis of the genus Artemisia, which makes evolutionary studies, genetic improvement, and phylogenetic identification very difficult. In this study, the complete chloroplast genome of A. selengensis was compared with that of other species within Artemisia and phylogenetic analyses was conducted with other genera in the Asteraceae family. The results showed that A. selengensis is an AT-rich species and has a typical quadripartite structure that is 151,215 bp in length. Comparative genome analyses demonstrated that the available chloroplast genomes of species of Artemisia were well conserved in terms of genomic length, GC contents, and gene organization and order. However, some differences, which may indicate evolutionary events, were found, such as a re-inversion event within the Artemisia genus, an unequal duplicate phenomenon of the ycf1 gene because of the expansion and contraction of the IR region, and the fast-evolving regions. Repeated sequences analysis showed that Artemisia chloroplast genomes presented a highly similar pattern of SSR or LDR distribution. A total of 257 SSRs and 42 LDRs were identified in the A. selengensis chloroplast genome. The phylogenetic analysis showed that A. selengensis was sister to A. gmelinii. The findings of this study will be valuable in further studies to understand the genetic diversity and evolutionary history of Asteraceae.


Introduction
Asteraceae, the largest and the most diverse flowering plant family, currently has 32,913 accepted species in 1,911 genera and 13 subfamilies [1][2][3]. Artemisia L. (Asteraceae), as the largest genus in the Tribe Anthemideae, is widespread in mid-to high-latitudes and even dominates most cold and many warm deserts in the Northern Hemisphere. Numerous species of Artemisia are used as herbal medicines in many countries. For example, A. annua and A. mexicana produce antimalarial drugs [4][5][6], and artemisinin (from A. annua), first isolated and tested in the 1970s in China, is an active substance against malaria [7]. In particular, having good taste and rich nutrition, A. selengenesis has long been used as a health food source and is sometimes directly eaten. Some extracted substances, especially from the leaves and roots, have antitumor, antioxidant, and free radical scavenging activities, and the plant is also a well-known traditional medicine because of its potent effects [8,9]. Therefore, considering the important medicinal values of A. selengenesis and the importance of Artemisia species as resource plants, comprehensive phylogenetic and genetic/genomic studies to increase our knowledge of this genus are important.
In angiosperms, the chloroplast with conserved quadripartite circular genomic structure [10] is a uniparentally inherited organelle. It originates from a cyanobacteria-like organism through an endosymbiosis event [11] and contains closely arrayed polycistronic transcribed gene clusters [12][13][14]. As a result, large-scale evolutionary events in related species, such as gene deletions or additions and gene order changes, are not common [15]. Therefore, cp genomes are widely used to determine evolutionary patterns [16], phylogenetic analysis [17], and comparative genomic analysis between angiosperm, gymnosperm, and fern families [18].
In the past, because of the number of species, diverse morphological types, ploidy, and complicated genetic relationships of Artemisia, the taxonomic relationships of the genus are controversial and based only on morphological traits, such as the capitula type and floret fertility [19,20]. As a result, considering the conserved structural and relatively compact gene density, chloroplast genomic materials are widely used in genomic evolution studies, molecular marker development, and phylogenetic analysis of the genus Artemisia. Many researchers have used single gene data (matK, ndhF, rps11), IGS data (psbA_trnH, trnS_trnC, trnS_trnfM, trnL_trnF), and shared protein-coding gene data of Artemisia to perform phylogenetic analysis [19][20][21][22][23][24][25][26][27]. However, the cp genomic data of Artemisia are still quite limited and data for only a few species have been reported. Therefore, we sequenced and annotated the complete cp genome of A. selengensis and compared it with other species within Artemisia and other genera (Chrysanthemum, Soliva, Diplostephium, Cynara) within the Asteraceae family. Our study aimed to detect useful genetic markers and genetic materials, and to reconstruct its phylogeny. This study will be useful in further studies in that it will illuminate the genetic diversity and evolutionary history of Asteraceae.

Ethics statement
The plant sampling was collected in areas that were not privately owned or protected in any way and no specific permits were required for this study.

Plant material and high throughput sequencing
The sample was collected from the Dongting Lake region (28˚48 0 46.06@N, 112˚21 0 10.19@E). Firstly, we collected mature leaves of A. selengensis and put them in a container with liquid nitrogen. Then, leaves were stored at -80˚C until sequencing. The extraction of total cp DNA was conducted according to the method of Zhang [28]. method of Zhang [29]. Approximately 2G of raw data were obtained through next generation sequencing with paired-end 125 bp read length. After filtering using Trimmomatic v 0.32, clean data were obtained for subsequent analysis [30].
The quality of the sequencing data of the samples was visually evaluated using the software Fastqc v 0.10.0 and low-quality reads were filtered using quality control [31]. Then, we used SOAP denovo2 to assemble all good-quality paired reads to contigs [32]. Assembled contigs were joined into multiple scaffolding using SSPACE [33] to obtain the whole-genome sequence. In this process, different K-mers were selected firstly for assembly, the best k-mer was obtained to adjust the other parameters (-d -u -R -F, etc.), and then the preliminary assembly results were obtained again. Finally, GapCloser [32] software was used for optimization and gap filling to obtain the final assembly results. We filtered out fragments below 500 bp for evaluation, statistical analysis, and subsequent gene prediction.
The predicted annotation of the complete cp genome was performed by using the programs CpGAVAS and DOGMA [34] with default values. Then, the annotation results were stored in GFF3 format and checked manually, and codon positions were adjusted using Apollo [35]. OGDraw v1.2 [36,37] was used to visualize the gene features of the A. selengensis genome. The other more details about material collection, sequencing, annotation can be obtained from the announcement [38]. Furthermore, codon usage and the relative synonymous codon usage (RSCU) of the A. selengensis cp genome were confirmed using DAMBE6 [39] based on the protein-coding sequences.
To obtain comprehensive knowledge of the genomic variation, pairwise distances of intergenic spacers (IGSs), and introns, protein-coding sequences of the nine Asteraceae species relative to A. selengensis were calculated. First, we extracted a total of 83 IGSs with at least 100 bp, and 17 introns shared by these species, and performed sequence alignment using MAFFT v7.380 [44] under the FFT-NS-2 setting. At the same time, 80 protein-coding sequences were extracted and aligned in MEGA7 [45] with the ClustalW (Codons) program. Then, pairwise distances of IGSs and introns were determined by using MEGA7 [45] with Kimura's two parameter (K2P) model [46]. Additionally, sequence divergence of homologous protein-coding genes was estimated according to Keller's method [47] using the synonymous (Ks) and non-synonymous (Ka) nucleotide substitution rates with the yn00 program [48] from the PAML package [49]. Finally, a two independent samples t-test was performed to evaluate the significance of the Ka/Ks ratio within and outside of the genus Artemisia.
into two categories: (i) simple sequence repeats (SSRs or microsatellites) with 1-6 bp long repeat motifs, (ii) longer dispersed repeats (LDRs) with at least 30 bp long repeat motifs. We used MISA Perl Script [50] that was written by a Perl program to determine SSRs in the A. selengensis cp genome. The minimum number of repeats was set to 8, 4, 4, 3, 3, 3 for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide SSRs, respectively. Then, LDRs, including tandem (T), forward (F), palindrome (P), reverse (R), and complement (C) repeats, were identified. Tandem Repeats Finder version 4.09 [51] with default settings was used to detect tandem repeats. These repeats with n �30 bp and a sequence identity � 90% were selected. REPuter [52] was used to visualize forward, palindrome, reverse, and complement sequences with the parameter settings of 3 for Hamming distance and 30 bp for minimum repeat size. BLAST 2.8.1 [53] was used to align and perform NJ analyses of the complete cp genome, LSC, IR, and SSC DNA sequences, together with MEGA 7.0 [54] for 72 shared protein sequences alignment and NJ analyses. The results were stored as a Newick tree file for constructing a phylogeny tree.

Features of complete chloroplast genome
The A. selengensis cp genome with GenBank accession number: MH042532 was announced by our research group [38]. The complete cp genome of A. selengensis had a typical quadripartite structure and was 151,215 bp in length ( Table 1). The GC content of the whole genome, LSC, SSC, and IR regions were 37.46%, 35.55%, 30.81%, and 43.09%, respectively. The higher GC content of the IR regions was probably caused by the presence of all four ribosomal RNA genes duplicated in these regions [55] (Table 1). Furthermore, AT content of the 1st, 2nd, and 3rd positions of the codons were 54.1%, 61.9%, and 70.2%, respectively (Table 1).
A total of 25,926 codons were translated into 88 protein-coding sequences by 30 unique tRNA genes (Tables 1 and 4). By analyzing codon usage and the relative synonymous codon usage (RSCU) of protein-coding sequences of the A. selengensis cp genome, we found that AUU and UGC accounted for the highest and lowest codon usage, respectively. Furthermore, non-preferred synonymous codons (RSCU < 1) with 32 codons is more than preferred synonymous codons (RSCU > 1) with 28 codons. The start codon AUG and UGG were non-bias codons (RSCU = 1). We also found that all preferred synonymous codons ended with A/T nucleotides and 93.75% non-preferred synonymous codons ended with G/C (Table 4).

Comparative chloroplast genomic analysis
Genome features comparation of nine Asteraceae species. We compared A. selengensis with its related species, including four species from Artemisia and four species from other genera: Chrysanthemum, Soliva, Diplostephium, and Cynara (Table 5). Among them, the length of the cp genomes of the nine species ranged from 150,784 (S. sessilis) bp to 152,585 bp (C. humilis). The genomic length within the Artemisia genus was similar, ranging from 151,056 bp (A. capillaris) to 151,318 bp (A. gmelinii) with only a 255 bp difference. The LSC region accounted for 54.77%-54.89% of the whole genome, whereas the SSC and IRs regions accounted for 12.12%-12.17% and 16.50%-16.53%, respectively. In terms of gene organization, Artemisia species appeared to be well conserved with 21 genes containing introns and 114 unique genes, including 80 protein-coding genes, 30 tRNA genes, and four rRNA genes.
Large-scale evolutionary events in the chloroplast genome of A. selengensis. Additionally, the genomic rearrangement of nine Asteraceae species relative to C. humilis showed that the SSC region of five species within the Artemisia genus had no rearrangement but was inverted in comparison with other genera. All species in our study were highly syntenic and similar in their LSC and IRs regions (Fig 2).
The expansion and contraction of the IR region was the most common evolutionary event in the evolution of the genome, and they are hypothesized to explain size differences between cp genomes [24]. Therefore, we compared the IR/SSC and IR/LSC boundaries of the nine species relative to A. selengensis (Fig 3). The LSC/IRa border generally was positioned at the rps19 gene with 211-218 bp in LSC, 60-67 bp in IRa. Normally, rpl2 and trn-H are positioned at the IRb/SSC boundary, but we also found a pseudogene rps19 at the IRb/SSC boundary of A. frigida, A. montana, S. sessilis, D. glutinosum, and C. humilis. The IRa/SSC and SSC/IRb borders of intro-generic species and inter-generic species were different because of different gene order in SSC. In our study, the ycf1 gene had a duplicate phenomenon in the cp genome, but the length of these two genes were different. The shorter one set as ycf1_1 ranged from 557 to 660 bp, and the longer one set as ycf1_2 ranged from 3,111 to 5,085 bp. In intro-generic species, ycf1_1 and ndhF were located at the IRa/SSC border, whereas rps15 and ycf1_2 were at the SSC/IRb border, which was opposite in inter-generic species. The pseudogene ycf1_1, ranging from 557 to 558 bp, in A. frigida, A. montana, S. sessilis, D. glutinosum, and C. humilis was expressed in four species, ranging from 576 to 660 bp. It is hypothesized that the ycf1 gene plays an important role in genome evolution. We also found that the ycf1 gene overlapped with the ndhF gene at the IRa/SSC boundary in A. capillaris and the SSC-IRb boundary in C. humilis. Sequence divergence between intro-generic species and inter-generic species. To obtain a comprehensive knowledge on the variation in the protein-coding genes, introns, and intergenic spacers in the cp genome, we compared the K2p values of the intergenic spacers and introns and the Ka, Ks, and Ka/Ks ratio of the protein-coding genes of the nine Asteraceae species (Figs 4 and 5; S1, S2, S3 and S4 Tables). These species were divided into intro-generic species (within Artemisia: A. selengensis, A. capillaris, A. frigida, A. gmelinii, and A. montana) and inter-generic species (other genera of Asteraceae: A. selengensis, C. boreale, S. sessilis, D. glutinosum, and C. humilis). As excepted, the IR region was much more conserved than the LSC and SSC regions because of lower K2p values. The sequences differences between species weresignificantly higher than those of the species within the genus (P < 0.05). In intro-generic species, ndhD_psaC (116 bp), psaJ_rpl33 (439 bp), trnH-GUG_psbA (382 bp), rps18_rpl20 (264 bp), ccsA_ndhD (200 bp), and rpl32_trnL-UAG (880 bp) presented higher K2p values. The most variable intron in Artemisia was trnK-UUU and the second intron of clpP. The most divergence intergenic sequences between A. selengensis and species of other genera were rpl32_trnL-UAG, psbI_trnS-GCU, atpA_trnR-UCU, rpl16_rps3, and trnH-GUG_psbA, whereas the most variable intron was rps16 (Fig 4).

Phylogenetic analysis of A. selengensis
The NJ phylogenetic tree of five datasets is presented in Fig 6 and S1 Fig. Except

Fig 5. Ka/Ks ratio of protein-coding genes between intro-generic species (within
https://doi.org/10.1371/journal.pone.0211340.g005 Comparative and phylogenetic analysis of Artemisia selengensis chloroplast genomic to the tribe Anthemideae, were located in the same clade. By analyzing the LSC, SSC, and 72 shared-protein-sequences tree, S. sessilis was the well-supported basal taxon, but the relationship between Artemisia and Chrysanthemum was different. A. annua and A. frigida formed a new branch, which was a sister group with another branch constituted by the remaining six Artemisia species in the LSC and 72 shared-protein-sequences tree. However, this new branch contained five species in the SSC tree: C. boreale, C. indicum, C. x morifolium, A. annua, and A. frigida. The evolutionary distances were also calculated. The results showed that the closest species to A. selengensis was A. capillaris (0.0017), A. argyi (0.0027), A. gmelinii (0.0040), A. montana, and A. fukudo (0.0006), and A. gmelinii and A. montana (0.0042) in the complete cp genome, LSC, SSC, IR, and 72 shared-protein-sequences trees, respectively.
However, when we compared A. selengensis with other genera in the Asteraceae family, some differences which may indicate evolutionary events were found. Normally, the SSC region of most Asteraceae species has been inverted relative to the Nicotiana tabacum chloroplast genome, which is often regarded to be unaltered [57]. However, in our study, we noticed that the SSC region of five species within the Artemisia genus had no rearrangement but was inverted in comparison with other genera in the Asteraceae family. This event is in agreement with a previous study on A. frigida, which has been called "re-inversion" [23]. Actually, except for Artemisia species, this re-inversion event was also found in Carthamus tinctorius (KP404628) [58], Centaurea diffusa (NC024286), and one reported Lactuca sativa (NC007578) [59]. One possible explanation for these results may be that the SSR region is an inversion "hotspot" and the re-inversion event can be noticed in closely related individuals. However, even in individual plants, there will be SSC re-inversion events as well. For example, the SSC regions of two cp genome sequences of Lactuca sativa (NC007578 and DQ383816) presented different orientations [60,61]. Although some hypotheses have been proposed for the mechanism of different SSC orientations within and among individuals, including intramolecular recombination between the two IR regions [60] and recombination-dependent DNA replication of the cp genome [62], the regulation mechanism of the presence of the re-inversion event within and among individuals is still unclear.
The border between four junctions usually differs among plants [63]. Detailed comparisons of IR boundaries of intro-generic and inter-generic species in the Asteraceae family suggested that wide ranges of expansions and contractions of IR are very common evolutionary events. As a result, the pseudogenes, ycf1 and rps19, were present at the IRa/SSC and IRb/LSC boundaries, respectively. We also identified an unequal duplicate phenomenon of the ycf1 gene and overlapped regions between ycf1 and ndhF. Actually, the sizes of IRs can change from 10 kb (in liverworts) to 76 kb (in Pelargonium) in land plants [64,65]. Most angiosperms have a 20-25 kb IRs. Wang et al.(2008) proposed three types to explain the expansion and contraction of IR/LSC junctions in angiosperms. Type I relates to intact trnH and rps19 genes being seated in IRa and IRb, respectively, and rps19 is seated downstream of trnH. In Type II there is a partial rps19 in Ira, which is situated between rpl2 and trnH. This type coincides with our study and has been found in some eudicots. Type III relates to the same trnH-rps19 cluster in IRa or IRb. Several mechanisms have been proposed to explain why successive IR expansions can lead to floating of the four junctions, such as homologous dispersed repeat recombination in Geranium [66].
Except for the large-scale evolutionary events in the cp genome of A. selengensis, we also identified the most variable regions by calculating the pairwise distances of IGSs, introns, and protein-coding sequences of nine Asteraceae species relative to A. selengensis. K2p values are an effective method for estimating evolutionary rates of nucleotide sequences [46]. In our study, the ndhD_psaC (116 bp), psaJ_rpl33 (439 bp), trnH-GUG_psbA (382 bp), rps18_rpl20 (264 bp), ccsA_ndhD (200 bp), and rpl32_trnL-UAG (880 bp), which presented higher K2p values, indicated that these regions exhibited accelerated mutation rates within the Artemisia genus. The Ka/Ks ratio is used to evaluate whether selective pressure acts on protein-coding genes and is an important indicator for studying gene evolution. When Ka/Ks > 1 (= 1; <1), the gene was subjected to positive selection (neutral selection; purifying selection) [46]. In our study, accD evolved under beneficial mutations with a Ka/Ks ratio >1. Three genes (rps12, ycf1_2, ndhD, ranging from 0.5000 to 0.6770) suffered from neutral selection with a Ka/Ks ratio > 0.5.
Repeats play an important role in various rearrangements, such as additions, deletions, or large inversions [47]. Therefore, we analyzed SSRs and LDRs in cp genomes of the nine Asteraceae species and found 220-279 SSRs and 38-52 LDRs in each individual. Mono-nucleotide, palindromic, and forward repeats were the most common repeated sequences. Nine Asteraceae cp genomes presented a highly similar pattern of SSRs or LDRs distribution. Firstly, more than half of the SSRs was present in the LSC region, and approximately 45.5% and 40.8% of SSRs were in IGSs and protein-coding regions, respectively. Secondly, approximately 30% of LDRs were localized in the LSC, IRa, or IRb regions, approximately 39% of LDRs were in IGSs or the protein-coding regions. The same situation is also found in other species, such as Fabaceae [47] and Sapindaceae species. Then, we associated repeat distribution with different regions and found that ycf2, ycf1, ycf3, rrn4.5 and rrn5 were the richest regions (n > 10). In a word, these SSCs and LDRs present in our study represent important genetic maker resources that can be used to expand research on Artemisia species.
Five datasets, including the complete cp genomes, LSC, IR, SSC DNA sequences, and 72 shared protein sequences, reconstructed the Artemisia and Asteraceae phylogenetic relationship. However, different datasets produced different topological structures (Fig 6 and S1 Fig). Among them, LSC and the 72 shared-protein-sequences tree showed the most similar topological structures and were consistent with the phylogeny of 21 Korean Artemisia species reconstructed by trnL_trnF markers [27]. However, although some Artemisia cp data have been published, other studies contained only one to four Artemisia species [22][23][24][25][26], and it is difficult to obtain more phylogenetic data to support our results.
In summary, a new cp genomic resource A. selengensis was presented. This study filled the gap in A. selengensis genomic resources, and provides novel insights into evolutionary dynamics in an important medicinal resource clade: Artemisia. Our results revealed that the available cp genomes of Artemisia were well conserved in terms of genomic length, GC contents, gene organization, and order. Furthermore, some differences, which may indicate evolutionary events, were found. Firstly, a re-inversion event of the SSC region within the Artemisia genus was identified, but the regulation mechanism of the presence of the re-inversion event within and among individuals is still unclear. Secondly, the pseudogenes ycf1 and rps19, an unequal duplicate phenomenon of the ycf1 gene, and overlapping regions between ycf1 and ndhF were identified at the IR/SSC or IR/LSC boundaries because of the expansion and contraction of the IR region. Last but not least, the highly variable regions (ndhD_psaC, psaJ_rpl33, trnH-GUG_psbA, rps18_rpl20, ccsA_ndhD, rpl32_trnL-UAG, accD, rps12, ycf1_2 and ndhD) within Artemisia, which indicated fast-evolving events, were found. The analysis of repeated sequencesshowed that Asteraceae cp genomes presented a highly similar pattern of SSRs or LDRs distribution. The phylogenetic analysis of five datasets showed that LSC and 72 shared-protein-sequences may be more useful in the reconstructed Artemisia and Asteraceae phylogenetic relationship. This study will be useful for further studies to illuminate the genetic diversity and evolutionary history of Asteraceae.
Supporting information S1