The complete chloroplast genome sequences of Lychnis wilfordii and Silene capitata and comparative analyses with other Caryophyllaceae genomes

The complete chloroplast genomes of Lychnis wilfordii and Silene capitata were determined and compared with ten previously reported Caryophyllaceae chloroplast genomes. The chloroplast genome sequences of L. wilfordii and S. capitata contain 152,320 bp and 150,224 bp, respectively. The gene contents and orders among 12 Caryophyllaceae species are consistent, but several microstructural changes have occurred. Expansion of the inverted repeat (IR) regions at the large single copy (LSC)/IRb and small single copy (SSC)/IR boundaries led to partial or entire gene duplications. Additionally, rearrangements of the LSC region were caused by gene inversions and/or transpositions. The 18 kb inversions, which occurred three times in different lineages of tribe Sileneae, were thought to be facilitated by the intermolecular duplicated sequences. Sequence analyses of the L. wilfordii and S. capitata genomes revealed 39 and 43 repeats, respectively, including forward, palindromic, and reverse repeats. In addition, a total of 67 and 56 simple sequence repeats were discovered in the L. wilfordii and S. capitata chloroplast genomes, respectively. Finally, we constructed phylogenetic trees of the 12 Caryophyllaceae species and two Amaranthaceae species based on 73 protein-coding genes using both maximum parsimony and likelihood methods.


Introduction
Chloroplasts are important photosynthetic organelles that provide energy for the synthesis of glucose, fatty acids, and amino acids [1,2]. The chloroplast genome is the smallest of the plant genomes, ranging from 135 to 160 kb in most plants [3][4][5]. Most angiosperm chloroplast genomes have a quadripartite circular structure and contain two copies of inverted repeat (IR) regions, separating a large single copy (LSC) region and small single copy (SSC) region [5]. Recently, with the rapid development of next-generation sequencing platforms, many chloroplast genome sequences have been reported and used to help resolve plant phylogenies [6,7]. Chloroplast genomic data are widely used in various studies, such as those on molecular phylogeny, molecular identification (DNA barcoding), and genetic diversity [8][9][10]. The structure and gene order of the chloroplast genome are stable, and the rates of nucleotide substitution are generally slow in angiosperms [11][12][13][14].
Rearrangements in the chloroplast genome were considered to have occurred rarely enough in evolution that they can be used to demarcate major groups [4]; however, recently some lineages have revealed various patterns of changes in chloroplast genomes, for example large-scale rearrangements, gene duplications, and even loss of IR regions [15][16][17][18][19][20]. Scattered angiosperm lineages show extensive rearrangement of plastid genomes, and these gene order changes are correlated with increased rates of nucleotide substitutions and gene and intron losses [6]. Rearrangements of the chloroplast genome are often associated with repeated sequences [5].
The family Caryophyllaceae consists of 75-80 genera and approximately 2,000 species, which are widely distributed, mainly in the temperate or warm-temperate regions of the northern hemisphere [21]. The genera Lychnis and Silene are sister genera belonging to tribe Sileneae, but the taxonomic identities of, and boundaries between, these two genera remain unclear [21][22][23]; indeed, the genus Lychnis was found to be nested within Silene in analyses of nuclear ribosomal internal transcribed spacer (nrITS), five chloroplast genes, and intergenic spacers (IGS) [24]. Previous studies have shown that the Sileneae underwent accelerated plastid genome evolution, including inversions, shifts in IR boundaries, large indels, intron losses, and rapid rates of amino acid sequence substitution [25,26]. Interestingly, the psaA-ycf3:psaI-ycf4 inversion and intron losses in clpP-1 and clpP-2 were suggested to be independent events that occurred three times [26].
A total of ten Caryophyllaceae chloroplast genomes have been reported [25][26][27][28]. In the genus Lychnis, only the chloroplast genome of L. chalcedonica has been reported [26], whereas in Silene, chloroplast genomes from a total of six species have been reported [25,26]. Therefore, in this study, we sequenced the complete chloroplast genomes of L. wilfordii and S. capitata and then analyzed them to identify their genetic characteristics and differences compared with other Caryophyllaceae species. The specific goals of the present study were to (1) present the complete chloroplast genome sequences of two Sileneae species, (2) investigate any significant characteristics suggesting extensive genome rearrangement in this tribe, and (3) explore significant changes in gene content and intron losses in the tribe Sileneae.

Materials and methods
Plant materials, DNA extraction, sequencing and genome assembly
Leaf materials from Lychnis wilfordii and Silene capitata were obtained from living plants grown from seed in a greenhouse at the Korean Botanical Garden. The voucher specimens of L. wilfordii (NIBRVP0000542331) and S. capitata (NIBRVP0000542433) were deposited in the National Institute of Biological Resources Herbarium (KB). Total genomic DNA was extracted using the Genome Wizard kit (Promega, Madison, WI, USA). Sequencing libraries were prepared using the NEXTflex Rapid DNA-seq kit (Bioo Scientific, Austin, TX, USA). Paired-end sequencing libraries containing insert sizes of approximately 350-450 bp were sequenced on the Illumina HiSeq 2500 platform (Illumina Inc., San Diego, CA, USA) at the National Instrumentation Center for Environmental Management (Seoul, South Korea), yielding 27,739,600 reads from L. wilfordii and 22,127,152 reads from S. capitata, each with a read length of 250 bp. These paired-end reads were screened by alignment with the plastid genome sequence of Silene vulgaris (JF715057), from which 585,206 (2.1%) reads of L. wilfordii and 661,807 (2.9%) reads of S. capitata were extracted, with mean coverage of 980× and 1082×, respectively. De novo assembly was performed using Geneious v. 7.1.3 (Biomatters, Auckland, New Zealand). The consensus sequences were extracted and gap-filled by PCR amplification using specific primers based on the gaps between sequences. The PCR products were purified and sequenced by Sanger sequencing.
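As a rough check on the read-screening figures above, the fraction of plastid-derived reads and an approximate mean coverage can be computed from the read counts, the read length, and the final genome sizes. The following is only a back-of-envelope sketch: the function names are hypothetical, and the simple formula (reads × read length / genome length) is an assumption that ignores trimming, partial alignments, and the fact that reads were aligned to the S. vulgaris reference, so it approximates rather than reproduces the reported 980× and 1082×.

```python
def read_fraction(extracted_reads: int, total_reads: int) -> float:
    """Percentage of all reads that were retained after screening."""
    return 100.0 * extracted_reads / total_reads


def approx_mean_coverage(extracted_reads: int, read_length: int, genome_length: int) -> float:
    """Naive mean depth: total extracted bases divided by genome length.

    Assumption: every read contributes its full length to the genome.
    """
    return extracted_reads * read_length / genome_length


# Values taken from the text: (extracted reads, total reads, genome size in bp).
samples = {
    "L. wilfordii": (585_206, 27_739_600, 152_320),
    "S. capitata": (661_807, 22_127_152, 150_224),
}

for name, (extracted, total, genome_bp) in samples.items():
    fraction = read_fraction(extracted, total)
    depth = approx_mean_coverage(extracted, 250, genome_bp)
    print(f"{name}: {fraction:.2f}% of reads, roughly {depth:.0f}x coverage")
```

The estimates land within a few percent of the reported coverage values, which is about what such a simplification can be expected to deliver.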

Genome annotation and comparative analyses
The initial annotation of the two Caryophyllaceae chloroplast genomes was performed using Dual Organellar GenoMe Annotator (DOGMA) [29]. From this initial annotation, putative starts, stops, and intron positions were determined by comparison with homologous genes in other Caryophyllaceae chloroplast genomes. The tRNA genes were annotated using DOGMA and tRNAscan-SE [30]. The circular chloroplast genome map was drawn using the OGDraw program [31]. The complete chloroplast genomes of L. wilfordii and S. capitata were compared with those of ten other Caryophyllaceae species using the mVISTA program in Shuffle-LAGAN mode (Table 1) [32]. Agrostemma githago (KF527884) was used as a reference.

Repeat sequence analysis
Simple sequence repeats (SSRs or microsatellites; mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats) were detected using Phobos v. 3.3.12 [33] with thresholds of ten repeat units for mononucleotide SSRs, five repeat units for di- and trinucleotide SSRs, and three repeat units for tetra-, penta-, and hexanucleotide SSRs. REPuter [34] was also used to analyze the repeat sequences, which included forward, reverse, palindromic, and complementary sequences with a minimal length of 30 bp and 90% sequence identity (Hamming distance of three). Moreover, we constructed phylogenetic trees based on the sequences of the pairs of repeat regions to investigate the relationship between the distributions of repeat sequences and structural inversions. Maximum parsimony (MP) analysis was conducted using PAUP v. 4.0a150 [35], and branch support was assessed using 1000 bootstrap replicates.
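The SSR thresholds described above can be illustrated with a minimal sketch that scans a sequence for perfect tandem repeats using regular expressions. This is not the algorithm used by Phobos; the function name `find_ssrs` and the toy sequence are hypothetical, and only perfect (uninterrupted) repeats are reported.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text:
# 10 for mono-, 5 for di- and tri-, 3 for tetra-, penta-, and hexanucleotides.
MIN_REPEATS = {1: 10, 2: 5, 3: 5, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(MIN_REPEATS.items()):
        # A motif followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves built from a shorter unit
            # (e.g. "AA" or "ATAT"), so each SSR is counted once only.
            if any(
                motif == motif[:k] * (motif_len // k)
                for k in range(1, motif_len)
                if motif_len % k == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)


# Toy sequence (hypothetical) with one mono-, one di-, and one tetranucleotide SSR.
toy = "G" + "A" * 12 + "CG" + "AT" * 6 + "CC" + "AGGT" * 3 + "C"
print(find_ssrs(toy))
```

Filtering out motifs composed of shorter units keeps an A12 run from also being reported as a dinucleotide or tetranucleotide repeat.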

Phylogenetic analysis
Phylogenetic analyses based on 73 protein-coding genes were also performed for 12 Caryophyllaceae species, using two Amaranthaceae species (Beta vulgaris and Salicornia europaea) as the outgroup (Table 1). Of the 77 protein-coding genes, the ycf1, ycf2, accD, and clpP genes were excluded from the data matrix, since those genes were reported to be fast-evolving genes with high

Genome organization and features
The complete sizes of the L. wilfordii and S. capitata chloroplast genomes are 152,320 and 150,224 bp, respectively (Fig 1, Table 2).  Table 2, S1 Table). The S. capitata chloroplast genome contains 111 unique genes, 19 of which are duplicated in the IR region, giving a total of 130 genes (Fig 1, Table 2, S2 Table). The chloroplast genomes of these two species contain 30 distinct tRNAs, seven of which are duplicated in the IR region. Seventeen genes contain one or two introns: 14 contain one intron and three (rps12, clpP, and ycf3) contain two introns. Six of the genes containing one intron are tRNAs (S1 and S2 Tables).
In addition, while the L. wilfordii and S. capitata chloroplast genomes have both lost the infA gene, the accD gene was pseudogenized only in L. wilfordii. The lack or pseudogenization of the infA gene has been discovered in many taxa outside of Caryophyllaceae, such as the Brassicaceae, Fabaceae, Liliaceae, Malvaceae, and Onagraceae [25,26][40][41][42][43][44]. Loss or pseudogenization of the accD gene in the plastid genome or accD gene transfer to the nucleus has also been reported in various angiosperm lineages, including Poaceae, Orobanchaceae, Ericaceae, and Primulaceae [45][46][47][48].

Comparative chloroplast genomic analysis
We compared gene arrangements in the chloroplast genomes of L. wilfordii and S. capitata with those of the ten previously reported Caryophyllaceae species (Fig 2). The chloroplast genome of S. capitata has a gene order identical to those of the genera Agrostemma, Colobanthus, and Dianthus, but the chloroplast genome of L. wilfordii has unique structural changes compared with previously reported Caryophyllaceae chloroplast genomes (Fig 2). The gene rearrangements present in the LSC regions were a result of inversions and/or transpositions (Fig 2). The chloroplast genome of L. wilfordii revealed an inversion of the trnV-rbcL region compared with the genomes of other genera (Fig 2), whereas the L. chalcedonica genome had twice the number of inversions and transpositions in the accD-psaI and ycf3 regions compared with the genomes of other genera (Fig 2). Interestingly, truncated partial sequences of clpP-2 and accD were found in the IGS region between trnV and psaI. The 5' upstream non-genic region and a partial 348 bp sequence of the accD gene, as well as the exon 1 and partial intron 1 sequences of clpP-2, have remained, but the downstream regions of both genes were truncated in the L. wilfordii chloroplast genome. Compared with the gene orders in other chloroplast genomes, these disruptions in the accD and clpP-2 genes may have occurred by inversion of the trnV-rbcL fragment. Thus, we deduced that duplication of clpP occurred before diversification of L. chalcedonica from L. wilfordii, and that transposition of psaI-accD and the loss of introns in clpP-1 and clpP-2 in L. chalcedonica may have occurred after species diversification.
In the genus Silene, we identified three types of chloroplast genomes (Fig 2). These are a) the common type of chloroplast genome observed in most Caryophyllaceae (Agrostemma, Colobanthus, and Dianthus) (seen in S. capitata, S. latifolia, and S. vulgaris); b) chloroplast genomes exhibiting an inversion of the ycf3-psaI regions (seen in S. paradoxa, S. conoidea, and S. conica); c) chloroplast genomes exhibiting transpositions and/or inversions of the psbD-accD, petL-clpP, trnD-T, and psaI-psbE regions (seen in S. noctiflora). Silene noctiflora currently has the most complicated chloroplast genome among the Caryophyllaceae.
Overall sequence identity among the 12 chloroplast genomes of Caryophyllaceae was analyzed with the mVISTA program, using the Agrostemma githago genome as a reference (Fig 3). The results revealed higher divergence in the LSC regions than in the IRs and SSCs, as a result of gene rearrangements (Figs 2 and 3), and greater conservation in the coding regions than in the non-coding regions (Fig 3). The most divergent coding regions were the ycf1, ycf2, accD, and clpP genes, which showed lower (under 50%) similarity than other protein-coding regions (Fig 3), in agreement with previous studies [25,26,49]. Consequently, we suggest that these genes evolve rapidly in Caryophyllaceae (including the tribe Sileneae). These genes are either absent or highly variable in the genomes of Campanulaceae, Geraniaceae, and Poaceae [6].

Boundaries between single copy and inverted repeat regions
The size variations among angiosperm chloroplast genomes are mostly the result of expansion or contraction of the IR region [50]. Additionally, the expansion or contraction of the IR region differs among various plant species [51]. In this study, the LSC-IR and IR-SSC boundaries of the 12 sequenced Caryophyllaceae genomes were compared (Fig 4). IR locations have changed substantially in Lychnis and Silene as a result of movement of the boundaries between the IR and SC regions (Fig 4). The IR and SC boundaries of S. capitata are consistent with those of the S. vulgaris and S. latifolia genomes, as well as the Caryophyllaceae genera Agrostemma, Colobanthus, and Dianthus (Fig 4).
The expansion of the IR at the SSC/IR boundary that duplicates the entire ycf1 gene was found only in the genome of L. wilfordii and three Silene species (S. conica, S. conoidea, and S. noctiflora). This event was observed in non-core Caryophyllales [52]. In the S. noctiflora chloroplast genome, the ycf1 and rps15 genes are duplicated within the IR region (Fig 4), and this species contains the longest IR region (29,891 bp) among the 12 Caryophyllaceae species. However, the contraction of the IR at the LSC/IR boundary that duplicates a part of the rpl2 gene was found only in Silene (S. conica, S. conoidea, and S. noctiflora) and Lychnis (L. wilfordii and L. chalcedonica). Lychnis chalcedonica has the shortest IR region (23,540 bp) among 12 Caryophyllaceae species due to contraction of the IR region at the LSC/IR boundary and lack of expansion of the IR region at the IR/SSC boundary.
We then analyzed the SSRs (or microsatellites), which are increasingly evaluated in molecular genetic studies because of their high reproducibility, ease of scoring, and fast throughput compared with other marker techniques [53]. In the L. wilfordii and S. capitata chloroplast genomes, the most abundant SSRs were A or T mononucleotide repeats, which accounted for approximately 77.6% and 76.8% of the total SSRs, followed by tetranucleotides (10.4% and 16.1%) and dinucleotides (10.4% and 7.1%), respectively (Table 3, S5 and S6 Tables). SSRs in the chloroplast genome are commonly composed of A or T repeats and rarely G or C repeats [54,55]. Furthermore, the majority of L. wilfordii and S. capitata SSRs are located in IGS regions (49.3% and 55.4%), followed by genes (37.3% and 26.8%) and introns (13.4% and 17.9%), respectively (S5 and S6 Tables). SSRs located in coding regions were found mainly in ycf1 and rpoC2, with the remaining SSRs found in matK, rpoA, psbF, atpB, and atpF. Among the SSRs in genes, part or all of those in matK, rpoC2, rpoA, psbF, ycf1, and rrn23 were shared by the two Caryophyllaceae species.
Under the assumption that the common chloroplast types observed in most Caryophyllaceae (Agrostemma, Colobanthus, Dianthus, S. capitata, S. latifolia, and S. vulgaris) are ancestral, the inversion of the ycf3-psaI fragment might have occurred independently at least three times: in L. chalcedonica, S. noctiflora, and the lineage containing S. conoidea and S. conica, consistent with previous results [26]. Interestingly, loss of introns in the clpP gene is always coupled with these inversions. In Caryophyllaceae, all 12 species possess imperfect palindromic repeats on both sides of the ycf3-psaI fragment (Fig 2), whereas only one homologous sequence corresponding to these repeats was found in the intergenic region between psaI and ycf4 in two Amaranthaceae species. In all cases, the repeat sequences were overlapped by partial ycf4 coding region sequences (63 bp). Thus, the partial ycf4 and upstream sequences might have been duplicated in the IGS between psaA and ycf3 before diversification of Caryophyllaceae. Even in the repeats between psaA and ycf3, expected to be non-genic sequences, intermolecular duplicated sequences were grouped together in A. githago, C. quitensis, D. longicalix, and S. paradoxa based on the maximum parsimony tree (Fig 6). The intermolecular duplicated sequences were not grouped together in the other Silene species. Large fragment inversions mediated by short IRs have been reported in several plant species. The 22 kb inversion in Asteraceae [56], the 42 kb inversion in Abies of the Pinaceae [57], the 21 kb inversion in Jasmineae of the Oleaceae [16], and the 36 kb inversion in the core Genistoids are thought to be induced by IRs in tRNAs or repeat elements several base pairs long. These dispersed repeats were shown to promote inversions via intermolecular recombination [5,58,59].
Thus, we suggest that this short IR in Caryophyllaceae might mediate intramolecular flip-flop recombination events, which could have facilitated identical inversions of the 18 kb ycf3-psaI fragment independently in different lineages.
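The proposed mechanism can be illustrated with a small sketch: when a segment is flanked by a short sequence and its reverse complement (an inverted repeat pair), recombination between the two copies flips the intervening segment, and a second event restores the original orientation. The sequences and the function `invert_between_repeats` are hypothetical, and the toy repeats are perfect, whereas the repeats flanking the ycf3-psaI fragment are described above as imperfect.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def invert_between_repeats(genome, repeat):
    """Invert the segment lying between `repeat` and its reverse complement.

    Reverse-complementing repeat + segment + rc(repeat) yields
    repeat + rc(segment) + rc(repeat), so only the inner segment changes.
    """
    left = genome.find(repeat)
    if left < 0:
        raise ValueError("left repeat copy not found")
    start = left + len(repeat)
    right = genome.find(reverse_complement(repeat), start)
    if right < 0:
        raise ValueError("inverted repeat copy not found")
    return genome[:start] + reverse_complement(genome[start:right]) + genome[right:]


# Toy example (hypothetical sequences): a segment flanked by an inverted repeat pair.
repeat = "ACCGTA"
genome = "TTT" + repeat + "GGGAAACC" + reverse_complement(repeat) + "CCC"

flipped = invert_between_repeats(genome, repeat)
restored = invert_between_repeats(flipped, repeat)
print(genome)
print(flipped)
print(restored == genome)
```

Because the flanking repeats survive each event unchanged, the same inversion can recur, which is consistent with the idea that such repeats make independent, identical inversions more likely.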

Phylogenetic analysis
Both the MP and ML trees of the 12 Caryophyllaceae species and two Amaranthaceae species based on 73 protein-coding genes showed consistent phylogenetic patterns (Fig 7, S1 Fig). In the ML tree, bootstrap analysis indicated that eight of ten nodes were supported by bootstrap values ≥ 99% and the other two nodes by values > 65%. In previous studies, the genus Lychnis was shown to be nested within the genus Silene based on internal transcribed spacer (ITS) sequences of the nuclear genome and chloroplast DNA data [24,60] and based on chloroplast genome data [26]. Lychnis species were nested within Silene, close to S. paradoxa in the subgenus Silene, which is consistent with previous studies [26] (Fig 7). The subgenus Behenantha and monophyly of sect. Melandrium (S. capitata and S. latifolia) were not supported, whereas S. conoidea and S. conica of sect. Conoimorpha formed a monophyletic group and were found to be closely related to S. noctiflora of sect. Elisanthe (Fig 7). However, we need additional chloroplast genome data from more Sileneae species to resolve the relationship between Lychnis and Silene, as well as the infrageneric relationships of Silene.
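To make the parsimony criterion behind the MP tree concrete, the sketch below scores a fixed tree with Fitch's small-parsimony algorithm, which counts the minimum number of character-state changes a given topology requires. It illustrates the optimality criterion only, not the tree search or bootstrap procedure carried out in PAUP; the taxon labels, the four-taxon tree, and the three-column alignment are hypothetical.

```python
def fitch_column(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    `tree` is a rooted binary tree written as nested 2-tuples of leaf names;
    `states` maps each leaf name to its character state in this column.
    """
    if isinstance(tree, str):  # leaf node
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_column(left, states)
    right_set, right_cost = fitch_column(right, states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra change needed
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Total number of changes over all columns of an aligned matrix."""
    n_columns = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_columns):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_column(tree, column)[1]
    return total


# Hypothetical toy data: four taxa, three aligned sites.
toy_tree = ((("taxonA", "taxonB"), "taxonC"), "taxonD")
toy_alignment = {
    "taxonA": "AAG",
    "taxonB": "AAG",
    "taxonC": "ATG",
    "taxonD": "CTG",
}
print(parsimony_length(toy_tree, toy_alignment))
```

An MP analysis compares such lengths across candidate topologies and keeps the shortest trees; bootstrap support is then obtained by repeating the search on resampled alignment columns.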