Comparative chloroplast genomes of eleven Schima (Theaceae) species: Insights into DNA barcoding and phylogeny

Schima is an ecologically and economically important woody genus in tea family (Theaceae). Unresolved species delimitations and phylogenetic relationships within Schima limit our understanding of the genus and hinder utilization of the genus for economic purposes. In the present study, we conducted comparative analysis among the complete chloroplast (cp) genomes of 11 Schima species. Our results indicate that Schima cp genomes possess a typical quadripartite structure, with conserved genomic structure and gene order. The size of the Schima cp genome is about 157 kilo base pairs (kb). They consistently encode 114 unique genes, including 80 protein-coding genes, 30 tRNAs, and 4 rRNAs, with 17 duplicated in the inverted repeat (IR). These cp genomes are highly conserved and do not show obvious expansion or contraction of the IR region. The percent variability of the 68 coding and 93 noncoding (>150 bp) fragments is consistently less than 3%. The seven most widely touted DNA barcode regions as well as one promising barcode candidate showed low sequence divergence. Eight mutational hotspots were identified from the 11 cp genomes. These hotspots may potentially be useful as specific DNA barcodes for species identification of Schima. The 58 cpSSR loci reported here are complementary to the microsatellite markers identified from the nuclear genome, and will be leveraged for further population-level studies. Phylogenetic relationships among the 11 Schima species were resolved with strong support based on the cp genome data set, which corresponds well with the species distribution pattern. The data presented here will serve as a foundation to facilitate species identification, DNA barcoding and phylogenetic reconstructions for future exploration of Schima.


Introduction
The chloroplast (cp) is a type of plastid that is critical to the growth of most plants, playing a major role in photosynthesis and fixation of CO 2 [1]. The cp genomes in angiosperms are circular DNA molecules with a highly conserved gene order and gene content, and range from a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 DNA extraction, sequencing, chloroplast genome assembly Total genomic DNA was isolated from fresh leaves (~100 mg) using the modified CTAB method (Doyle and Doyle 1987). Subsequently, the cp genomes were amplified using longrange PCR with fifteen primers [27]. The PCR products were fragmented for constructing short-insert (500 bp) libraries following the Illumina Nextera XT DNA library preparation instructions. Paired-end sequencing (250 bp) was performed on the Illumina MiSeq 2000 at the Laboratory of Molecular Biology of Germplasm Bank of Wild Species in Southwest China. Quality control of the raw sequence reads was performed using the NGS QC Tool Kit [28], with a cut-off value for percentage of read length and PHRED quality score as 80 and 30 following Yang et al. [5]. High-quality reads were assembled into contigs using the de novo assembler in CLC Genomics Workbench v6.5 (CLC Bio), using a k-mer of 64 and a minimum contig length of 500 base pairs (bp). The de novo contigs were assembled into complete chloroplast genomes followed the procedure of Yang et al. [5].

Chloroplast genome annotation and comparisons
The complete cp genomes were annotated with the identification of introns and exons using DOGMA [29]. The positions of start and stop codons and boundaries between introns and exons were investigated according to the published cp genome of Camellia taliensis (NC022264). The annotated GenBank files were used to draw the circular chloroplast genome maps using OrganellarGenomeDRAW [30]. The mVISTA program [31] was employed in the LAGAN mode to detect the variation of the chloroplast genomes. The cp genome of Schima sinensis was used as a reference. Microsatellites (mono-, di-, tri-, tetra-, penta-and hexanucleotide repeats)   Table 1). The cp genome map of Schima superba is presented as a representative (Fig 1). Excluding the duplicated IR region, the 11 Schima cp genomes identically encoded 114 different genes that were arranged in the same order, including 80 proteincoding genes, 30 tRNAs and 4 rRNAs. Seventeen genes were duplicated in the IRs, with six protein-coding genes, four rRNA and seven tRNA genes. Twelve of the protein-coding genes and six of the tRNA genes contained introns. Fifteen out of those eighteen genes contained a single intron, while the other three (clpP, rps12 and ycf3) had two introns. The 11 Schima cp genomes exhibited high similarity at the LSC/IR/SSC boundaries (Fig 2). The rps19 gene crossed the LSC/IR B (J LB ) region with no variation of sequence length within the two parts. The SSC/IR B (J SB ) junction occurred between the ycf1_like (incompletely duplicated in IR B ) and the 3' end of ndhF gene, with the sequence length of ycf1_like gene within IR B as 1388 or 1394. The ycf1 gene crossed the SSC/IR A (J SA ) region, with 1388 or 1394 bp of ycf1 within IR A . The ycf1 related length changes were the only variation detected in these junctions. The LSC/ IR A (J LA ) junction was located at the 3' end of the rps19_like (6 bp; incompletely duplicated in IR A ), with a 14 bp noncoding sequence between J LA and trnH gene. In addition, we identified unusual start codons for four genes, ACG for ndhD, ATC for psbI, ATT for psbT and GTG for rps19.

Chloroplast genome comparisons and divergence hotspots
Sequence identity plots of the 11 Schima cp genomes, generated using mVISTA, are shown in Fig 3. The plots illustrate the high sequence similarity across the Schima cp genomes, with a sequence identity of 99.1%. Two (ccsA and rps15) of the 49 variable protein-coding (>150 bp) genes had a percentage of variation above 1.00% (Table 2), while 19 (>150 bp) had no variation. Both of the two core DNA barcodes (rbcL and matK) [38] showed extremely low sequence divergence (0.21% and 0.33%, respectively). Furthermore, the variation of ycf1, the proposed "most promising chloroplast DNA barcode" of land plants [39], was only 0.67%. Among the 79 noncoding (>150 bp) regions, the percentage of variation ranged from 0.11% to 2.85% (Fig 4 and Table 3). Fourteen fragments (atpI-rps2, trnS (UGA)-psbZ, rps4-trnT -ndhF) did not show any sequence variation. Eight potential mutational hotspots (trnW (CCA)-trnP (UGG), trnT (UGU)-trnL (UAA), trnG (UCC)-trnfM (CAU), petD-rpoA, psbB-psbT, ndhE-ndhG, ndhC-trnV (UAC), rpl32-trnL (UAG)) were identified, with the variation percentage exceeding 2.0% among the 11 sampled species (Fig 4 and Table 3). These eight highly variable hotspots may have the potential to be used as special DNA barcodes for identifying Schima species. The aligned length of the complete cp genome (with one of the IR removed) among the 11 Schima species was 130,508 bp, with the total number of variable and parsimony informative (PI) sites being 1,121 bp and 261 bp, respectively. This data set contained 131 indels with a  total length of 586 bp, and the percent variability was 0.51%. These results indicate that the global variation of the cp genome within Schima is extremely low. A similar pattern was reported in other long-lived plants [40][41][42]. The ability to identify species within the genus using cp genome data needs to be assessed by sampling multiple individuals per species, even though the phylogenetic analyses have most of the species separated from each other (see below).

SSR polymorphisms
In total, 58 cpSSRs, including 55 mononucleotide (A, T), 1 dinucleotide (AT) and 2 trinucleotide (ATT, TTA) repeats were detected within the 11 Schima cp genomes. No tetranucleotide, pentanucleotide or hexanucleotide repeats were observed. The mononucleotide repeat (A, T) was found to be the most abundant, with repeat numbers of 10, 11 and 12 ( Table 4). The proportion of A and T repeats in mononucleotide repeat unit was 43.64% and 56.36%, respectively. Only one SSR locus with a different repeat unit (C) was detected in the trnG (UCC)-trnfM (CAU) intergenic spacer region. Within the 11 Schima cp genomes, SSR loci were primarily located in the LSC region (89.09%), followed by the SSC portion (14.55%), with only one present in the IR region (rrn5-trnR (ACG)) ( Table 4). One SSR locus was detected in the protein-coding gene psbI, with all others located in gene spacers and introns. No SSRs were  found in the tRNAs and rRNAs. The mononucleotide repeat (A) in trnH-psbA was the most variable SSR, with the size ranging from 12 to 42 bp. The cpSSRs of the 11 Schima species represented here showed abundant variation, and could be useful for research at the population    Eleven chloroplast genomes of Schima level. They will provide complementary data to the SSR markers of Schima identified from the nuclear genome [43].

Phylogenetic analyses
The data matrix we used for phylogenetic estimation consisted of an alignment containing entire cp genomes with one of the IRs removed. This data set was comprised of 131,113 nucleotide positions, with 2,508 variable sites (1.91%) and 427 PI sites (0.33%). ML analysis resulted in a well-resolved tree, with eight of the 10 nodes supported by 100% bootstrap values (BS). All Schima species grouped into a strongly supported clade (BS = 100%, Fig 5), indicating Schima is monophyletic. Two main clades were recovered, with Schima sericans being sister to those two lineages. Five species (S. argentea, S. brevipedicellata, S. khasiana, S. noronhae, S. wallichii) formed clade I (BS = 100%, Fig 5). The remaining five species (S. sinensis, S. superba, S. remotiserrata, S. multibracteata and S. crenata) grouped in clade II (BS = 100%, Fig 5). The branch leading to S. superba and three closely related species is extremely short, and bootstrap support values for two internal nodes within this clade are less than 80%.

Chloroplast genome features and comparison within Theaceae
Prior to this study, Camellia was the only genus within Theaceae to have its cp genome sequenced [25,26]. In the present study, we sequenced cp genomes of 11 species from Schima. The cp genomes all displayed typical quadripartite structure (Fig 1), which is consistent amongst most lineages of angiosperms [2]. The expansion and contraction of the IR region is considered to be the primary mechanism affecting length variation of angiosperm cp genomes, as demonstrated in Trochodendraceae [44] and Apiales [45]. However, only minor variation was detected at the SSC/IR A boundary of all of the 11 Schima cp genomes (Fig 2). Although the genes located at the IR junctions are identical in cp genomes of Schima and Camellia, the overall cp genome sequences of Schima are more homogenous as compared to Camellia, which was suggested to show more differences at the junction regions [26]. The cp genomes of Schima encode the same set of protein-coding genes as previously reported Camellia species, with the exception of Orf 42 and Orf 188 which were reported in Camellia [25], but not in other Ericales members such as Actinidia (Actinidiaceae) and Ardisia (Primulaceae) [46,47]. For the whole cp genomes of Schima, 37 tRNA genes were annotated, which is consistent with Huang et al. [26]. However, 38 tRNA genes were found in Yang et al. [25], due to a redundant annotation of trnP (UGG) in their study. As compared with sequences of Camellia, no  Potentially specific DNA barcodes for Schima Since the concept of DNA barcoding was proposed over a decade ago [48], substantial efforts have been made to develop DNA barcodes possessing both high universality and efficiency. Kress et al. [49] suggested that the internal transcribed spacer (ITS) and trnH-psbA spacer region had potential as useful DNA barcode regions for flowering plants. Hollingsworth et al. [38] later advocated matK and rbcL as a two-locus core barcode for land plants after comparing seven leading candidate loci, subsequently the nrDNA ITS was recommended to be incorporated into core barcode based on a large-scale sapmpling of seed plants [50]. Dong et al. [39] proposed that ycf1 was the most variable loci of the cp genome, which might be a promising DNA barcode performing better than existing plastid candidate barcodes of land plants. However, all of the five candidate protein-coding DNA barcodes (matK, rbcL, rpoB, rpoC1 and ycf1) showed extremely low sequence variation (<1.00%), and the other three fragments are also not among the most variable spacers. The eight potential mutational hotspots (trnW (CCA)-trnP (UGG), trnT (UGU)-trnL (UAA), trnG (UCC)-trnfM (CAU), petD-rpoA, psbB-psbT, ndhE-ndhG, ndhC-trnV (UAC), rpl32-trnL (UAG)) (Fig 4 and Table 3) identified in this study could be suitable barcodes for Schima. Recently, using the cp genome as a possible ultraor organelle-scale barcode for efficient plant species identification was discussed [10,11]. The high phylogenetic resolution among closely related species of Schima (Fig 5) suggests that the cp genome may indeed be useful as an organelle-scale barcode for species identification of Schima. Further studies based on sampling at the population scale are needed to evaluate the efficiency of the barcodes mentioned above and also the cp genome as an organelle-scale barcode.

Phylogenetic relationships among species of Schima
The cp genome has been suggested to be useful for phylogenetic reconstructions at low taxonomic levels [7,8,10,51]. Interspecies phylogenetic relationships within Camellia (Theaceae) were well-resolved using cp genome data [25,26]. In the present study, based on a recent classification of the genus [12], 11 out of 13 Schima species occurring in China were represented. The phylogenetic relationships within Schima were well resolved with strong support based on cp genome sequences ( Fig 5). Therefore, our study indicates that the complete cp genome has significant potential to resolve the low level phylogenetic relationships. Schima sericans, the first diverging lineage among sampled species, is distributed in southeastern Xizang and northwestern Yunnan in China. Schima sericans was sister to the remaining taxa, which formed two clades. Clade I includes five species (S. argentea, S. brevipedicellata, S. khasiana, S. noronhae and S. wallichii) that are primarily distributed in southwestern China and Indochina. Clade II comprises five species (S. sinensis, S. superba, S. remotiserrata, S. multibracteata and S. crenata) that mainly occur within central and eastern China (Fig 5). The phylogenetic relationships within Schima found here correspond well with the geographic distribution pattern, but do not match well with morphology. Schima was classified into two groups based on the shape of the leaf margin (entire or serrate) [12]. However, all of the species within clade I possess a serrate leaf margin except S. khasiana. Likewise, S. multibracteata is the only species with an entire leaf margin in clade II (Fig 5). Our results suggest that the taxonomic value of leaf margin shape should be reassessed for classification of Schima. Additionally, the branch length of the clade including S. superba and three closely related species is extremely short, indicating that these four species have recently diversified, or perhaps illustrating past hybridization within the group.
These results indicate that the current treatment of the genus needs to be reevaluated by integrating more types of evidence.