Complete chloroplast genomes of two Siraitia Merrill species: Comparative analysis, positive selection and novel molecular marker development

Siraitia grosvenorii fruit, known as Luo-Han-Guo, has been used as a traditional Chinese medicine for many years, and mogrosides are its primary active ingredients. Unfortunately, Siraitia siamensis, its wild relative, might be misused due to its indistinguishable appearance, not only threatening the reliability of the medication but also partly exacerbating wild resource scarcity. Therefore, high-resolution genetic markers must be developed to discriminate between these species. Here, the complete chloroplast genomes of S. grosvenorii and S. siamensis were assembled and analyzed for the first time; they were 158,757 and 159,190 bp in length, respectively, and possessed conserved quadripartite circular structures. Both contained 134 annotated genes, including 8 rRNA, 37 tRNA and 89 protein-coding genes. Twenty divergences (Pi > 0.03) were found in the intergenic regions. Nine protein-coding genes, accD, atpA, atpE, atpF, clpP, ndhF, psbH, rbcL, and rpoC2, underwent selection within Cucurbitaceae. Phylogenetic relationship analysis indicated that these two species originated from the same ancestor. Finally, four pairs of molecular markers were developed to distinguish the two species. The results of this study will be beneficial for taxonomic research, identification and conservation of Siraitia Merrill wild resources in the future.


Introduction
Siraitia plants are important perennial vines belonging to the fourth most economically important plant family, Cucurbitaceae, and the genus has been widely cultivated as economic crops in southern China and northern Thailand [1]. Among these crops, S. grosvenorii, a traditional Chinese medicinal plant native to Guangxi, China, has been cultivated for approximately 200 years. The fruit of S. grosvenorii, Luo-Han-Guo, has been used as a traditional PLOS

Plant materials, DNA extraction, and sequencing
The fresh leaves of two-year-old S. grosvenorii and S. siamensis plants were collected from Guangxi Medicinal Botanical Garden of the Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College (Nanning, China), then frozen at -80˚C until further use. Total DNA was extracted from approximately 100 mg samples using a plant genomic DNA kit (DP305) (Tiangen Biotech Co., Ltd., Beijing, China), DNA quality was assessed in a Nanodrop 2000 spectrophotometer (Thermo Scientific), and DNA integrity was evaluated using a 1.0% (w/v) agarose gel. DNA samples from each species were used to prepare two separate libraries with an average insert size of 500 bp and sequenced using an Illumina HiSeq 4000 (Illumina Inc., San Diego, CA, USA) with a standard protocol.

Chloroplast genome assembly
First, the low-quality reads sequenced from all the samples were filtered by Trimmomatic software [27]. Then, the trimmed reads, including nuclear and organelle genome data, were used to assemble the chloroplast genome. All chloroplast genomes of plants assessed in the National Center for Biotechnology Information (NCBI) were used to search against Illumina paired-end reads using SRA-BLASTN with an E-value cutoff of 1e-5 [28]. Clean reads with high homology were considered plastome reads and used for downstream genome assembly. SPAdes (v3.10.1) and CLC Genomics Workbench (v7) were used for the de novo genome assembly, the SPAdes using for the assembling that the parameters were set as "-k 21,33,55,77,99,127 -careful" [29]. The contigs obtained were identified by Gepard and spanned the entire plastome [30]. All the identified contigs were assembled using the SeqMan module of DNASTAR (v11.0) [31]. Then, three scaffolds, including the large single-copy (LSC), the IR, and small single-copy (SSC) regions, were obtained. The specific de-novo genome assembler NOVOPlasty was also used to reassemble the two species chloroplast genome for verification [32].To verify the assembly accuracy, the four boundaries between the single-copy (SC) regions and IR regions of the assembled sequence were confirmed by PCR amplification and Sanger sequencing, and the sequences of the primers are listed in S1 Table.

Genome annotation, repeats and simple sequence repeats (SSRs) analyses
The online program Dual Organellar GenoMe Annotator (DOGMA, http://dogma.ccbb. utexas.edu/) and the Chloroplast Genome Annotation, Visualization, Analysis, and GenBank Submission (CPGAVAS) were used to annotate the two genomes [33,34]. The protein-coding sequences were verified by Blastp against the GenBank database. The tRNA genes were identified by tRNAscan-SE and DOGMA [33,35]. Then, manual corrections of the positions of the start and stop codons and the intron/exon boundaries were performed based on the entries in the plastome database using the Apollo program (v.  , and the cutoffs of the unit numbers for mono-, di-, tri-, tetra-, penta-, and  hexa-nucleotides were 8, 4, 4, 3, 3, and 3, respectively [43].

Comparative genomic and selective pressure analyses
The mVISTA program in Shuffle-LAGAN mode with default parameters was used to compare the five complete chloroplast genomes using S. grosvenorii chloroplast genomes as a reference [44,45]. The sequence divergence of the chloroplast genomes was analyzed with a sliding window using DnaSP (v5.10) [46], and the step size was set to 200 bp with a 600 bp window length. Moreover, a total of 104 intergenic regions and 77 exons were manually extracted among four species, and the corresponding sequences aligned using ClustalW2 (v2.0.12) [47] were used to calculate the nucleotide variability (Pi) using DnaSP (v5.10). Selective pressure was analyzed for consensus protein-coding genes among twelve genomes from Cucurbitaceae species. Easy-CodeML software with the site model was performed to calculate the nonsynonymous (Ka) and synonymous (Ks) substitution ratios and likelihood ratio tests (LRTs). The values of both Ka/Ks (ω) and the LRTs were coupled to evaluate the selection on amino acid sites [48].

Phylogenetic analyses
A total of 30 chloroplast genomes, including 28 from Cucurbitaceae, were used for the phylogenetic analyses, including Nicotiana tabacum and Arabidopsis thaliana as outgroups. In this study, 28 chloroplast genome sequences were downloaded from NCBI GenBank (S2 Table). The software MAFFT was used to generate alignments of 64 consensus protein-coding gene sequences [49], and then the alignments were manually adjusted by BioEdit [50]. Maximum likelihood (ML) analysis was carried out based on the Tamura-Nei model using a heuristic search for initial trees that were the most appropriate by Modeltest 3.7 [51]. Maximum parsimony (MP) analysis was conducted using PAUP (v4.0a) [52]. Bootstrap analysis was performed with 1000 replicates. Finally, the reconstructed trees were visualized using Figtree (v1.4.3) (http://tree.bio.ed.ac.uk/software/figtree/).

Molecular marker development and validation
The molecular regions between the two Siraitia species were examined by the alignment and comparison of mVISTA similarities. The primers for the molecular markers were designed using the software Primer Premier 5.0 [53]. The accuracy of the molecular markers was verified using PCR amplification, and PCR was conducted with the following program: initial denaturation at 94˚C for 2 minutes; followed by 35 cycles of amplification at 94˚C for 20 seconds, 56˚C for 20 seconds, and 72˚C for 2 minutes; and final extension at 72˚C for 10 minutes. The PCR products were separated with 1.0% (w/v) agarose gel for 20 minutes at 180 volts. Then, the DNA fragments were purified and sequenced.

General features of the chloroplast genomes
The chloroplast genome structures of the two species are similar to those of other Cucurbitaceae species, which display a single circular molecule with a typical quadripartite structure. The complete chloroplast genome of S. grosvenorii was 158,757 bp in length and consists of a pair of IRs (IRa and IRb, each 26,288 bp in length) separated by a LSC (87,625 bp) region and a SSC (18,556 bp) region (Fig 1, S1 Fig and S3 Table). The complete chloroplast genome of S. siamensis possessed the same structure. The IRa and IRb were 26,289 bp each, the LSC was 88,069 bp, and the SSC was 18,543 bp. In total, the length of the whole genome was 159,190 bp. (Fig 1, S3 Table). The length of each region is similar to those of most plant chloroplast genomes reported previously [54]. The sequencing data of S. grosvenorii and S. siamensis were deposited in GenBank under the accession numbers MK755853 and MK755854, respectively.
Further analysis results revealed that the two species had approximately 36.8% GC content (S3 Table), which was distributed unevenly across the whole chloroplast genome. In contrast to the LSC regions, the GC contents in the IR regions displayed a higher value across the whole chloroplast genome, 42.8% in both S. grosvenorii and S. siamensis, possibly resulting from rRNA genes (rrn16, rrn23, rrn4.5 and rrn5, 55.3% GC content in both cases) with high GC content located in the IR regions [55]; the higher GC content in the IR region was regarded as an indicator of species affinity [56]. Moreover, the LSC regions had a GC content of 34.6% in both species, and the lowest values of 30.5% and 30.7% were found in the SSC regions in S. grosvenorii and S. siamensis, respectively. In addition, the GC contents of the protein-coding regions were 37.8% in S. grosvenorii and 37.9% in S. siamensis, and the percentages of GC content for the first, second, and third codon positions were 45.6%, 38.0% and 29.8% in S. grosvenorii and 45.6%, 62.0% and 29.9% in S. siamensis, respectively (S3 Table). A bias toward using thymine (T) and adenine (A) in the third codon position has also been observed in other land plant plastomes [57][58][59]. The skewed GC distribution across the whole genome might be associated with the position of the origin and terminals for gene replication [60][61][62].
The two species exhibited the same gene content and arrangement in the chloroplast genome. A total of 134 genes were identified, including 89 protein-coding genes, 37 transfer RNA (tRNA) genes and four ribosomal RNA (rRNA) genes. Eight protein-coding genes, seven tRNA genes, and four rRNA genes were duplicated in the IR regions in both species (Fig 1); ycf15-orf was duplicated, and its start codon was GTG. The data revealed that 23 genes contain introns in both genomes, 21 with one intron and the rest with two introns, and the genes clpP and ycf3 both contained three exons (S4 and S5 Tables). The gene clpP was related to energy transformation [63], and ycf3 was necessary for the stable accumulation of photosystem I complexes [64]. Introns found in functional genes play a significant role in the regulation of gene expression, which can trigger desirable biological traits at particular times [65,66]. In addition, the rps12 gene contained three exons and one intron because of trans-splicing, which resulted in a 5' end exon located in the LSC region, whereas the remaining exons were located in the IRs (S5 Table). Therefore, rps12 was duplicated in the IR region. Furthermore, the results of gene location analysis revealed twelve genes with partial overlaps in their sequences: trnK-UUU/ matK, psbD/psbC, trnM-CAU/trnT-GGU, atpE/atpB, rps3/rpl22, and trnP-UGG/trnP-GGG.
The basic characteristics of the chloroplast genomes of two Siraitia species and six other species from Cucurbitaceae are shown in S6 Table. Comparative analysis showed that the lengths of the eight genomes ranged from 155,293 bp (Cucumis sativus) to 159,190 bp (S. siamensis), and the overall GC content percentage of C. sativus (37.2%) was higher than that of any other genome (36.7%-37.1%). However, little difference could be found in gene number, gene type or GC content between S. grosvenorii and S. siamensis, which suggests that we should focus on other areas to find variation, such as intergenic spacers.

IR/SC boundaries and IR contraction and expansion
The contraction and expansion of the IR regions account for common evolutionary events and are a major cause of differences in chloroplast genome size, and evaluating them could shed some light on the evolution of some taxa [67,68]. A detailed comparison of the IR/SC boundary regions of twelve species is shown in Fig 2. A. thaliana and N. tabacum were set as outgroups, and the rest were Cucurbitaceae. From the comparison, we noticed that the lengths of ycf1 pseudogenes of both Siraitia species were both 1,181 bp, almost as long as those of the other five Cucurbitaceae species, except Lagenaria siceraria (28 bp) and Momordica charantia (29 bp). In addition, the ndhF gene, located in the SSC, reaches 134-135 bp across the IRb/SSC boundary in both L. siceraria and M. charantia, while the corresponding region was 7-12 bp in most other Cucurbitaceae species. Moreover, the gene rps19 is located in the LSC region with 265-277 bp across the LSC/IRb boundary in Trichosanthes kirilowii, Hemsleya lijiangensis and Gynostemma laxiflorum, whereas in other species, it was located away from the LSC/IRb by 68-208 bp. The rpl2 gene, duplicated in the IRs in most species, was present as only one copy in the IRb region in M. charantia, Cucurbita pepo and Citrullus lanatus, which might result from the location of the LSC/IRb boundary. All of these phenomena were related to the contraction/expansion of two IR regions in the complete chloroplast genomes.

Codon usage
RSCU is a measure of nonuniform synonymous codon usage in coding sequences in which values above 1 indicate that codons are used more frequently than expected [69,70]. All the protein-coding genes were encoded with 26,509 and 26,508 codons in the S. grosvenorii and S. siamensis chloroplast genomes, respectively. Detailed codon analysis revealed that the two cases had similar codon constitutions and RSCU values (S7 Table). Among these codons, leucine (Leu) and cysteine (Cys) were, respectively, the highest (10.48%) and lowest (1.13%) prevalence amino acid codons in the two Siraitia species. In addition, most of the amino acid codons showed preferences, with exception of methionine (Met) and tryptophan (Trp), which both had RSCU values of 1. The chloroplast genomes of the two Siraitia species both had 30 biased codons with RSCU >1, and the third positions of the biased codons were A/U except for Leu (UUG); otherwise, the codons with high frequency (>30%) and fraction were Asp-GAU, Glu-GAA, Ile-AUU, Lys-AAA, Leu-UUA, Asn-AAU, and Tyr-UAU, and the bias toward these seven codons was consistent with the low content of GC in the third codon position. Fig 3 shows that the RSCU value increased with the quantity of codons coding for a specific amino acid. A strong AT bias in codon usage is common in sequences with strong codon preference and was also found in most other land plant chloroplast genomes [71,72].

Repeat structure and SSRs analyses
Repeat units, which are distributed in chloroplast genomes with high frequency, play an important role in genome evolution [73][74][75]. S2(A) Fig shows repeat structures that were longer than 30 bases in eight species. The repeats of the S. grosvenorii chloroplast genome consist of 19 forward, 24 palindromic, two reverse, and one complement. By contrast, a slightly different number of repeats was found in S. siamensis, which contained 17 forward, 20 palindromic, one reverse and no complement. In the eight-species comparison, the number of repeats for C. lanatus (31) was the lowest, and the highest number of repeats was 49 in M. charantia, G. laxiflorum, and Cucurbita maxima.
On the other hand, SSRs, which are also known as microsatellites and are distributed abundantly across genomes, are tracts of repetitive DNA with certain motifs, ranging from 1-6 or more base pairs, that are repeated typically 5-50 times [76,77]. SSRs are widely used as molecular markers for species identification, analysis of phylogenetic relationships and population genetics because of their high polymorphism rates and stable reproducibility [78,79]. Here, a total of 252 and 253 SSRs were identified by MISA software within the chloroplast genomes of S. grosvenorii and S. siamensis, respectively (S2(B) Fig), and mononucleotide repeats were largest in number, 57 and 56, respectively. Moreover, S8 Table shows that A/T mononucleotide repeats (97.2% and 97.8%, respectively) were the most common, and for dinucleotide repeats, AT/ TA (68.4% and 67.8%, respectively) was the majority. However, only one repeat unit (AAT/TTA) was found among trinucleotide repeats. In addition, AT-rich repetitive motifs were high in the remaining SSR types. The SSRs within the chloroplast genomes of both species mainly comprised AT-rich repetitive motifs, consistent with the fact that AT content (63.2%) was very high (GC content was 36.8%) in both cases. Furthermore, these results were also consistent with previous reports that proportions of short poly-A or poly-T repeats were higher than those of poly-G or poly-C within most SSRs in many plant chloroplast genomes [80,81]. Distribution of the SSRs loci in the chloroplast genome of S. grosvenorii and S. siamensis were exhibited in S9 Table.

Sequence divergence and nucleotide diversity
Complete chloroplast genomes are often used to analyze plant taxonomy, phylogenetic relationships, and genetic diversity [82]. In this study, two Siraitia species were compared with other three species in Cucurbitaceae using mVISTA software, and S. grosvenorii was set as the reference. As shown in Fig 4, sequence divergence was similar for the whole sequences of the complete chloroplast genomes. In contrast to the other two Cucurbitaceae species, M. charantia was more similar to the two Siraitia species across the complete chloroplast genomes. The data plot revealed that the noncoding region was more divergent than its coding counterparts. The two IR regions were both less divergent than the single-copy regions, which might be the result of the four highly conserved rRNAs located in the IR regions [55].

Phylogenetic relationships of six genera within Cucurbitaceae
Chloroplast genomes are significant genomic resources for the reconstruction of accurate and high-resolution phylogenetic relationships and taxonomic status in angiosperms [86]. Complete chloroplast genomes and protein-coding genes have been widely employed to determine phylogenetic relationships at almost every taxonomic level [87]. In this study, to identify the phylogenetic positions of the two Siraitia species within the Cucurbitaceae family, we aligned 64 protein-coding sequences from 30 chloroplast genomes; A. thaliana and N. tabacum were set as outgroups, and the alignment length was 62,522 bp. The ML and MP trees displayed similar phylogenetic topologies (Fig 6). All nodes in the ML tree and MP tree were strongly supported by high bootstrap values: 22 of 27 nodes with 100% bootstrap values were found in the MP tree and 21 of 27 in the ML tree. In addition, the results illustrated that two Siraitia species were the most closely related species to M. kirilowii, and these two taxa were grouped with two Gray arrows and thick black lines above the alignment indicate genes with their orientation and the position of the IRs, respectively. Purple bars, blue bars, pink bars, gray bars and white peaks represent exon, Untranslated Region (UTRs), Conserved Noncoding Sequences (CNS), mRNA and genomes differences, respectively. A cut-off of 70% identity was used for the plots, and the Y-scale represents the percent identity ranging from 50% to 100%. https://doi.org/10.1371/journal.pone.0226865.g004 Complete chloroplast genomes of two Siraitia Merrill species species from Sicyoeae, three species from Cucurbiteae, and nine from Benincaseae, which showed a nested evolutionary relationship in the MP and ML trees. Furthermore, all species were clustered into a lineage distinct from the outgroups and strongly supported the new classification system of Cucurbitaceae [54].

Selective pressure events
Synonymous substitutions (Ks) accumulate nearly neutrally, whereas nonsynonymous substitutions (Ka) are subjected to selective pressures of varying degree and direction (positive or negative); values of Ka/Ks (ω) above 1.0 indicate that the corresponding genes experience positive selection, while ω values ranging from 0.5 to 1.0 indicate relaxed selection [88]. In the current study, we performed a selective analysis of the exons of each protein-coding gene using site-specific models with four comparison models (M0 vs. M3, M1a vs. M2a, M7 vs. M8 and Among the eight models, M2a was the positive model, and p (p 0 , p 1 , p 2 ) were the proportions of purifying selection, neutral selection, and positive selection. A total of 58 consensus proteincoding genes from 12 closely related species were evaluated with respect to selective pressure. Nine genes (accD, atpA, atpE, atpF, clpP, ndhF, psbH, rbcL, and rpoC2) were found to have undergone positive selection, and the ω 2 values ranged from 2.17 to 11.44 in the M2a model (Table 1).
To determine which sites were subject to positive selection, naive empirical Bayes (NEB) and Bayes empirical Bayes (BEB) methods were used to analyze the location of consistent selective sites in the alignment of chloroplast genomes in the M7 vs. M8 model. The data analysis revealed that the gene rpoC2 possesses 7 positive selective sites, followed by atpF (6), rbcL (4), atpA (2), atpE (2), clpP (2), ndhF (2), accD (1), and psbH (1). All positively selected sites in these nine genes are shown in Table 1.
Among the nine positively selected genes, rpoC2 encodes the RNA polymerase β". A comparison of rpoC2 between fertile lines of sorghum and cytoplasmic male sterile lines showed that a 165 bp deletion was identified that encodes several protein motifs involved with transcription factors; this region might play an important role in the regulation of developmental pollination [89]. The Siraitia species are dioecious, so the finding that rpoC2 evolved under positive selection might indicate that it is essential for sex differentiation. The chloroplast plays Complete chloroplast genomes of two Siraitia Merrill species important roles in photosynthesis and carbon fixation, and six genes (atpF, rbcL, atpA, atpE, ndhF, and psbH) with essential roles in photosynthesis were positively selected in this study. The Siraitia species are distributed in southeast Asia, so requirements for sufficient light might have exerted strong selective forces on the six genes during plant evolution. This phenomenon was also found in species within the Urophysa genus, which is distributed in southwest China [75]. The clpP gene, encoding the ATP-dependent clp protease, is likely involved in the transformation of chloroplast protein and might be essential for shoot development under clpPmediated protein degradation [63,90,91]. The positive selection of the gene clpP in our study might be associated with the evolution of the vining character within Cucurbitaceae. As for the gene accD, it encodes the β-carboxyl transferase subunit of acetyl-CoA carboxylase [92]; it is a vital gene for leaf development and has effects on leaf longevity and seed yield [93]. Expression of the gene accD might indirectly affect the efficiency of photosynthesis. These nine genes have undergone positive selection, which might be the result of adaptation to their barren environment.

Molecular markers for distinguishing S. grosvenorii and S. siamensis
In this study, several notably variable regions were found in the comparison of the two chloroplast genomes. To develop high-resolution molecular markers for the identification of these two species, the specific divergent regions, including ndhC-trnV-UAC, trnR-UCU-atpA, rpoB-trnC-GCA and the gene ycf1, were chosen as molecular marker regions, and specific primers were designed against the conserved regions ( Table 2). The primers, named GSPC-F/R, GSPR-F/R and GSPB-F/R, were used for amplifying the three intergenic regions and produced Complete chloroplast genomes of two Siraitia Merrill species different-length fragments in S. grosvenorii and S. siamensis, respectively. The gene ycf1, with high divergence, was also chosen for the development of molecular markers, and several SNP sites were found after amplification with GSPY-F/R. (Fig 7). The four molecular markers developed in this study will contribute to the identification of Siraitia species and facilitate efficient utilization and conservation of wild Siraitia resources. The lack of an effective approach made the position of different species in the genus unreliable until the Siraitia Merrill was acknowledged in 1980 [15]. The phylogenetic analysis in this study showed that the two Siraitia species were the most closely related species to M. kirilowii, which explains the reason that these species were placed into Momordica L. with the name of Momordica grosvenorii Swingle initially [94]. To the best of our knowledge, only seven species within the genus Siraitia have been confirmed based on morphological characteristics [15], although some varieties may remain undiscovered. Unfortunately, the increasing demand for high production of these species has brought the wild resources to the verge of depletion. The comparative analysis of chloroplast genomes in this study revealed that several highly variable intergenic regions will contribute to the development of specific markers to support the conservation of wild Siraitia varieties. Among the seven species, S. grosvenorii, which is recorded as both a medicinal and an edible species [4], has been widely cultivated as an important commercial crop. S. siamensis is known to have better disease resistance and setting percentage and considerable siamenoside I content. Different germplasm resources should be distinguished accurately and used for different purposes. Not only do adulterants of raw medicinal materials threaten the safety and reliability of food and medicine, but they also exacerbate the scarcity of wild resources in similar species. Our development of novel molecular markers will be of great use in the species authentication and conservation of Siraitia wild resources in the future.

Conclusions
In the current study, two complete chloroplast genomes of the Siraitia genus were assembled and analyzed for the first time. In a comparison with other species within Cucurbitaceae, several highly divergent noncoding regions were identified that would be beneficial for developing high-resolution molecular markers. Phylogenetic relationship analysis supported that S. grosvenorii and S. siamensis originated from the same ancestor, consistent with previous studies. Furthermore, 9 protein-coding genes were found to undergo selection, which might be the result of adaptation to the environment. Finally, molecular markers (GSPC-F/R, GSPR-F/R, GSPB-F/R and GSPY-F/R) were developed to distinguish the two species. The results in this study will be beneficial for taxonomic research, species identification, and conservation of the genetic diversity of Siraitia wild resources in the future.