The Complete Chloroplast Genome of Ye-Xing-Ba (Scrophularia dentata; Scrophulariaceae), an Alpine Tibetan Herb

Scrophularia dentata is an important Tibetan medicinal plant traditionally used to treat exanthema and fever in Traditional Tibetan Medicine (TTM). However, little sequence or genomic information is available for S. dentata. Here we report the complete chloroplast genome sequence of S. dentata, the first sequenced member of Sect. Tomiophyllum within Scrophularia (Scrophulariaceae). The gene order and organization of the S. dentata chloroplast genome are similar to those of other Lamiales chloroplast genomes. The plastome is 152,553 bp in length and includes a pair of inverted repeats (IRs) of 25,523 bp that separate a large single copy (LSC) region of 84,058 bp and a small single copy (SSC) region of 17,449 bp. It has a GC content of 38.0% and includes 114 unique genes, of which 80 are protein-coding, 30 are transfer RNA, and 4 are ribosomal RNA. It also contains 21 forward repeats, 19 palindromic repeats and 44 simple sequence repeats (SSRs). The repeats and SSRs of S. dentata were compared with those of S. takesimensis and show some differences. The chloroplast genome of S. dentata was also compared with five other publicly available Lamiales genomes from different families. All coding and non-coding regions (introns and intergenic spacers) of the six chloroplast genomes were extracted and analysed, and the divergence hotspot regions were identified. Our study provides basic data for the conservation of alpine medicinal species and for molecular phylogenetic research on Scrophulariaceae and Lamiales.


Introduction
Ye-Xing-Ba (Tibetan name), a common Tibetan herb, is traditionally used to treat exanthema and fever in Traditional Tibetan Medicine (TTM). Scrophularia dentata Royle ex Benth. (Scrophulariaceae), a medicinal alpine species [1], is one of the source plants of this herb [2]. Our research team has been conducting ethnobotanical surveys of Tibetan herbs since 1992 [3,4]. Based on our specimen collection and taxonomic identification, we have carried out several chemical and pharmacological studies on S. dentata [5,6,7]. Because of the harsh environment in which this alpine plant grows and the increasing demand for it in medicine, a conservation strategy for the species is urgently needed. Selecting superior germplasm also requires more informative genetic data and molecular markers [8]. However, little sequence or genomic information is available for the species in GenBank, apart from nrDNA ITS sequences and some chloroplast and mitochondrial fragment sequences submitted by our team (http://www.ncbi.nlm.nih.gov/nuccore/?term=Scrophularia+dentata).
The chloroplast contains its own independent genome, which encodes a specific set of proteins [9]. Plastid genomes are typically composed of a large single copy (LSC) region and a small single copy (SSC) region, which are separated by two copies of an inverted repeat (IR) [10]. The small size of the chloroplast genome, about 115 to 165 kb, makes it suitable for complete sequencing, and the data can be further applied to DNA barcoding and phylogeny reconstruction [11,12]. Because complete chloroplast genome sequences contain ample information, whole chloroplast genome sequencing is an important tool for the analysis of plant species [13,14].
S. dentata belongs to Scrophulariaceae, a large family of Lamiales consisting of over 3,000 species [15]. The genus Scrophularia has ca. 200 species and is divided into two sections, Sect. Scrophularia and Sect. Tomiophyllum [16]. To date, the only published complete chloroplast genome in Scrophularia is that of Scrophularia takesimensis, which belongs to Sect. Scrophularia. The S. dentata genome reported in this paper is therefore the first from Sect. Tomiophyllum. Details of the S. dentata chloroplast genome structure and organization are reported and compared with previously annotated chloroplast genomes of other Lamiales species. Our study expands the understanding of the diversity of Scrophularia cp genomes and provides basic data for the conservation of alpine medicinal species and for molecular phylogenetic research on Scrophulariaceae and Lamiales.

Plant material
Samples of S. dentata were collected in Lhasa, Tibet, China. The voucher specimens were deposited in the herbarium of Shanghai University of Traditional Chinese Medicine (field number: XZ201416). The collection site is not within any protected area.

DNA extraction, genome sequencing and validation
Total chloroplast DNA was extracted from 100 g of fresh leaves using a sucrose gradient centrifugation method improved by Li et al. [17]. The genome was sequenced and assembled on an Illumina MiSeq platform following Gogniashvili et al. [18]. The four junctions between the SSC/LSC and IRs were validated by PCR amplification and Sanger sequencing, and nine other fragments were selected to validate the genome sequence further (S1 Table) [19].

Repeat structure
The REPuter program [24] was used to identify repeats (forward, palindromic, complement and reverse sequences). Repeats were required to be at least 30 bp long with a sequence identity of at least 90%, using a Hamming distance of 3 [25,26]. Simple sequence repeats (SSRs) were detected using MISA [27], with the minimum number of repeats set to 10, 5, 4, 3, 3 and 3 for mono-, di-, tri-, tetra-, penta- and hexanucleotides, respectively. The repeats and SSRs of S. dentata were compared with those of S. takesimensis, the only other complete chloroplast genome available in Scrophulariaceae.
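The SSR thresholds above can be illustrated with a short Python sketch. It is not MISA itself and does not reproduce MISA's handling of compound or interrupted SSRs; the function names `is_primitive` and `find_ssrs` are hypothetical, and only perfect repeats on the given strand are reported. The minimum repeat counts are the ones stated in the text.

```python
# Minimum repeat counts per motif length, as stated in the text
# (mono- through hexanucleotide: 10, 5, 4, 3, 3, 3).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    start is 0-based. Motifs such as 'ATAT' are skipped because they are
    already covered by the shorter unit 'AT'.
    """
    seq = seq.upper()
    hits = []
    for k, min_count in sorted(min_repeats.items()):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if not set(motif) <= set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            # Extend the run while the next k bases repeat the motif.
            j = i + k
            while seq[j:j + k] == motif:
                j += k
            count = (j - i) // k
            if count >= min_count:
                hits.append((i, motif, count))
                i = j  # continue after the reported SSR
            else:
                i += 1
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence, not taken from the S. dentata genome.
    toy = "GC" + "A" * 10 + "TATATATAT" + "GG" + "AGC" * 4 + "T"
    for start, motif, count in find_ssrs(toy):
        print(start, motif, count)
```

On a real plastome the input would be the full genome sequence read from a FASTA or GenBank file; the toy string here only exercises the mono-, di- and trinucleotide thresholds.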

Genome organization and features
The junction regions between IRs and SSC/LSC and nine additional regions were confirmed by PCR amplification and Sanger sequencing. We compared these sequences to the assembled genome and no mismatch or indel was observed, which validated the accuracy of genome sequencing and assembly. The chloroplast genome sequence of S. dentata has been submitted to GenBank (accession number: KT428154).
The complete chloroplast genome of S. dentata has a total length of 152,553 bp, with a pair of inverted repeats (IRs) of 25,523 bp that separate a large single copy (LSC) region of 84,058 bp and a small single copy (SSC) region of 17,449 bp (Fig 1). The total GC content is 38.0%, similar to that of published asterid cp genomes [28,29]. The GC content is unevenly distributed across the genome: it is higher in the IRs (43.1%) than in the LSC and SSC regions (36.0% and 32.29%, respectively). The high GC content of the IR regions is due to the four GC-rich rRNA genes located there (rrn5, rrn4.5, rrn23 and rrn16), which is consistent with findings in other chloroplast genomes [30,31].
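The reported figures can be cross-checked with simple arithmetic: the four regions should sum to the total length, and the length-weighted mean of the regional GC contents should agree with the overall GC content. The sketch below uses only the numbers given in the text; the helper names (`total_length`, `weighted_gc`, `gc_content`) are hypothetical, and `gc_content` shows one plausible way to compute GC percentage from a raw sequence, not necessarily the procedure the authors used.

```python
# Region lengths (bp) and GC percentages as reported in the text.
REGIONS = {
    "LSC": {"length": 84058, "copies": 1, "gc": 36.0},
    "SSC": {"length": 17449, "copies": 1, "gc": 32.29},
    "IR": {"length": 25523, "copies": 2, "gc": 43.1},
}


def total_length(regions):
    """Genome length implied by the quadripartite structure."""
    return sum(r["length"] * r["copies"] for r in regions.values())


def weighted_gc(regions):
    """Length-weighted mean GC percentage over all regions."""
    gc_bases = sum(r["gc"] * r["length"] * r["copies"] for r in regions.values())
    return gc_bases / total_length(regions)


def gc_content(seq):
    """GC percentage of a sequence, ignoring non-ACGT characters."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


if __name__ == "__main__":
    print(total_length(REGIONS))           # 84058 + 17449 + 2 * 25523 = 152553
    print(round(weighted_gc(REGIONS), 2))  # about 37.95, i.e. 38.0% to one decimal
```

Both checks agree with the values stated above: the regions sum to 152,553 bp and the weighted GC content rounds to 38.0%.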
The chloroplast genome of S. dentata encodes a total of 114 unique genes, of which 18 are duplicated in the IR regions. Among the 114 genes, there are 80 protein-coding genes (70.2%), 30 tRNA genes (26.3%) and 4 rRNA genes (3.5%) (Table 1). Eighteen genes contain introns: 15 (nine protein-coding and six tRNA genes) contain one intron and three (clpP, ycf3 and rps12) contain two introns (Table 2). The rps12 gene is trans-spliced, with the first exon located in the LSC region and the other two exons duplicated in the IR regions.
Protein-coding regions account for 59.6% of the whole genome, while tRNA and rRNA regions account for 4.6% and 5.9%, respectively. The remaining regions are non-coding sequences, including intergenic spacers, introns and pseudogenes. Two pseudogenes were identified, ycf1 and rps19, located in the boundary regions between IRb/SSC and IRa/LSC, respectively. Their loss of protein-coding ability is due to partial gene duplication.

Comparison to other Lamiales species
The chloroplast genome of S. dentata was compared with five other publicly available Lamiales genomes from different families, i.e., Boea hygrometrica (Gesneriaceae), Hesperelaea palmeri (Oleaceae), Salvia miltiorrhiza (Lamiaceae), Andrographis paniculata (Acanthaceae) and Sesamum indicum (Pedaliaceae) (Table 3). The organization of the Lamiales chloroplast genome is conserved; neither translocations nor inversions were detected in the analyses. However, the genomes differ in size. The A. paniculata chloroplast genome is the shortest (150,249 bp), while that of H. palmeri is the longest (155,820 bp). The variation in genome size is attributed mainly to differences in the length of the LSC region, similar to what has been observed in Asteraceae chloroplast genomes [36]. The average size of the six Lamiales chloroplast genomes is 152,794 bp.
The overall sequence identity of the six Lamiales chloroplast genomes was plotted using mVISTA with S. dentata as the reference (Fig 2). The results show that chloroplast genomes within Lamiales are conserved, although divergent regions could be detected. As expected, the IR regions are more conserved than the LSC and SSC regions in these species, and non-coding regions show higher divergence than coding regions. To examine the divergence hotspots further, all coding and non-coding regions (introns and intergenic spacers) of the six chloroplast genomes were extracted and analysed (S2 Table, Fig 3). The most divergent regions are located in the intergenic spacers, including ccsA-ndhD, rps16-trnQ-UUG, ndhJ-ndhK, ndhE-ndhG, ndhC-trnV-UAC, trnH-GUG-psbA and ndhG-ndhI. These intergenic regions could be used to assess phylogenetic relationships among Lamiales species. Among the coding regions, the most divergent are ycf1, matK, rpl22, rpl32, rps15 and ccsA, all of which are located in single copy regions.
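One common way to rank regions by divergence is the percentage of variable columns in a multiple alignment of each homologous region. The sketch below illustrates that measure; whether the authors computed divergence exactly this way is an assumption, and the function name `percent_variable_sites`, the `count_gaps` flag and the toy alignments (labelled `region_A` and `region_B`) are hypothetical and not data from the paper.

```python
def percent_variable_sites(aligned, count_gaps=False):
    """Percentage of alignment columns with more than one character state.

    aligned    -- list of equal-length aligned sequences for one region
    count_gaps -- if False (default), columns containing gaps or ambiguity
                  codes are skipped; if True, every column is compared and
                  '-' is treated as an additional state
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    variable = compared = 0
    for column in zip(*(s.upper() for s in aligned)):
        states = set(column)
        if not count_gaps and not states <= set("ACGT"):
            continue  # skip columns with gaps or ambiguity codes
        compared += 1
        if len(states) > 1:
            variable += 1
    return 100.0 * variable / compared if compared else 0.0


if __name__ == "__main__":
    # Toy alignments for illustration only.
    regions = {
        "region_A": ["ACGTACGT", "ACGTACGA", "ACG-ACGT"],
        "region_B": ["ACGTACGT", "ACGTACGT", "ACGTACGT"],
    }
    ranked = sorted(
        regions, key=lambda name: percent_variable_sites(regions[name]), reverse=True
    )
    for name in ranked:
        print(name, round(percent_variable_sites(regions[name]), 1))
```

Skipping gapped columns is a simplification: indels are frequent in intergenic spacers, so setting `count_gaps=True` (or scoring indels separately) may give a more realistic picture for those regions.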
Expansion and contraction at the IR/SC borders are common in chloroplast genomes and are the main cause of size variation among angiosperm chloroplast genomes [37]. The LSC/IRb/SSC/IRa junctions of the six Lamiales chloroplast genomes were compared (Fig 4). Although the IR regions of the six species are similar in length, ranging from 25,141 bp to 25,712 bp, some differences in IR expansion and contraction were observed.

Repeat and SSR analysis
Repeat sequences in the S. dentata chloroplast genome were analysed with REPuter. The analysis identified 21 forward repeats and 19 palindromic repeats of at least 30 bp per repeat unit with a sequence identity of at least 90% (Table 4). No complement or reverse repeats were detected. Of all the repeats found, 33 (82.5%) are 30 to 39 bp long and 7 (17.5%) are 40 to 49 bp long; the longest repeat is 44 bp. We compared the repeats of S. dentata with those of S. takesimensis, the only other complete chloroplast genome available in Scrophulariaceae [38] (S3 Table, Fig 5A and 5B). Most (92.5%) of the repeats in S. dentata are conserved and are also found in S. takesimensis, which additionally has 3 complement repeats and 2 reverse repeats.
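The four repeat types distinguished by REPuter differ only in how the second copy relates to the first: unchanged (forward), reversed (reverse), complemented (complement) or reverse-complemented (palindromic). The following sketch classifies a pair of equal-length segments under a maximum Hamming distance of 3, the setting given in the methods. It illustrates the definitions only and is not REPuter's search algorithm; the names `hamming` and `classify_repeat` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return sum(x != y for x, y in zip(a, b))


def classify_repeat(first, second, max_distance=3):
    """Return the repeat types under which `second` matches `first`.

    A type is reported when the transformed first copy differs from the
    second copy at no more than `max_distance` positions.
    """
    first, second = first.upper(), second.upper()
    variants = {
        "forward": first,
        "reverse": first[::-1],
        "complement": first.translate(COMPLEMENT),
        "palindromic": first[::-1].translate(COMPLEMENT),
    }
    return [
        name for name, variant in variants.items()
        if hamming(variant, second) <= max_distance
    ]


if __name__ == "__main__":
    # Toy 10 bp segments for illustration; real repeats here are >= 30 bp.
    print(classify_repeat("AAAACCCCGG", "AAAACCCCGG"))  # ['forward']
    print(classify_repeat("AAAACCCCGG", "CCGGGGTTTT"))  # ['palindromic']
```

For a repeat of the minimum length of 30 bp, a Hamming distance of 3 corresponds to 27/30 = 90% identity, which matches the identity threshold quoted above. The percentages in this paragraph are also mutually consistent: 21 + 19 = 40 repeats, 33/40 = 82.5% and 7/40 = 17.5%.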
In general, repeats are mostly distributed in non-coding regions [39,40]. However, most of the repeats (62.5%) in the S. dentata chloroplast genome are located in coding regions (CDS), mainly in ycf2, which is similar to the pattern in S. takesimensis (Fig 5B). A further 30% of the repeats are located in intergenic spacers (IGS) and introns, and 7.5% span parts of both IGS and CDS.
Simple sequence repeats (SSRs) also exert a significant influence on genome rearrangement and recombination [41]. A total of 44 SSRs were detected in the S. dentata chloroplast genome, accounting for 500 bp of the total sequence (ca. 0.33%); there were 33, 7, 1 and 3 mono-, di-, tri- and tetranucleotide repeats, respectively (Table 5). No penta- or hexanucleotide repeats were found. Most of the SSRs are mononucleotide repeats. Forty-three SSRs (97.7%) are composed of A and T nucleotides, whereas only one consists of a "GTCT" repeat. The high AT content of the SSRs contributes to the AT richness of the chloroplast genome [42,43]. Among the SSRs, 33 are located in IGS and introns, 10 are found in coding genes (ycf1, rpoC2, atpB, rpoA, atpA and ndhD), and 1 spans parts of both IGS and CDS. Compared with S. takesimensis, 12 SSRs were identical, 30 exhibited length polymorphisms and 2 were not detected (S4 Table, Fig 5C and 5D). These repeat sequences may be useful for developing lineage-specific markers, which could be widely used in genetic diversity and evolutionary studies of Scrophularia.

Conclusions
In this paper, we report the complete sequence of the S. dentata cp genome, the first whole cp genome from Sect. Tomiophyllum of the genus Scrophularia. This cp genome sequence was compared with five other genomes from Lamiales species (i.e., B. hygrometrica, H. palmeri, S. miltiorrhiza, A. paniculata and S. indicum). No significant structural changes were detected among the chloroplast genomes, although there were some differences in genome size and in IR expansion or contraction. All coding and non-coding regions (introns and intergenic spacers) of the six chloroplast genomes were extracted and analysed. The most divergent regions are located in the intergenic spacers, including ccsA-ndhD, rps16-trnQ-UUG, ndhJ-ndhK, ndhE-ndhG, ndhC-trnV-UAC, trnH-GUG-psbA and ndhG-ndhI. Among the coding regions, the most divergent are ycf1, matK, rpl22, rpl32, rps15 and ccsA, all of which are located in single copy regions. Repeats and SSRs of S. dentata were compared with those of S. takesimensis, which may provide markers for the analysis of infraspecific genetic differentiation within Scrophularia. In addition, our study provides basic data for the conservation of alpine medicinal species and for molecular phylogenetic research on Scrophulariaceae and Lamiales.
Supporting Information S1