Plastid Genome Sequence of a Wild Woody Oil Species, Prinsepia utilis, Provides Insights into Evolutionary and Mutational Patterns of Rosaceae Chloroplast Genomes

Background Prinsepia utilis Royle is a wild woody oil species of Rosaceae that yields edible oil which has been proved to possess particular benefits for human health and medical therapy. However, the lack of bred varieties has largely impeded exploiting immense potentials for high quality of its seed oil. It is urgently needed to enlarge the knowledge of genetic basis of the species and develop genetic markers to enhance modern breeding programs. Results Here we reported the complete chloroplast (cp) genome of 156,328 bp. Comparative cp sequence analyses of P . utilis along with other four Rosaceae species resulted in similar genome structures, gene orders, and gene contents. Contraction/expansion of inverted repeat regions (IRs) explained part of the length variation in the Rosaceae cp genomes. Genome sequence alignments revealed that nucleotide diversity was associated with AT content, and large single copy regions (LSC) and small single copy regions (SSC) harbored higher sequence variations in both coding and non-coding regions than IRs. Simple sequence repeats (SSRs) were detected in the P . utilis and compared with those of the other four Rosaceae cp genomes. Almost all the SSR loci were composed of A or T, therefore it might contribute to the A-T richness of cp genomes and be associated with AT biased sequence variation. Among all the protein-coding genes, ycf1 showed the highest sequence divergence, indicating that it could accomplish the discrimination of species within Rosaceae as well as within angiosperms better than other genes. Conclusions With the addition of this new sequenced cp genome, high nucleotide substitution rate and abundant deletions/insertions were observed, suggesting a greater genomic dynamics than previously explored in Rosaceae. The availability of the complete cp genome of P . utilis will provide chloroplast markers and genetic information to better enhance the conservation and utilization of this woody oil plant.

Prinsepia utilis, one of the major woody oil plants in Rosaceae, is mainly distributed in mountainous regions with high elevations of 1000-3200 m in southwestern China, Pakistan, India, Nepal and Bhutan [10,11]. . In addition to P. utilis, there are the other three related species, P. sinensis (Oliv.) Oliv. ex Bean, P. uniflora Batal, and P. scandens Hayata, in the genus Prinsepia. Seed kernels of P. utilis averagely yield 37.2% of semi-drying and pale yellow fatty oil which has widely been consumed for cooking [10]. Up to date, the chemical components of its seed oil (e.g., oleic acid 32.6%; linoleic acid 43.6%; palmitic acid 15.2%) have been proved to possess particular benefits or potentials for human health and medical therapy [12,13], which can be undoubtedly comparable with Mediterranean olive oil. It is well known that the majority of woody oil plants grow in tropical and subtropical regions. P. utilis may represent as the best candidate oil plant which is suitable for extensive cultivation with cold weather in higher altitudes of temperate regions. The lack of bred varieties in P. utilis, however, has largely impeded exploiting immense potential for high quality of seed oil. Nowadays, the collection and transplanting of wild plants have seriously destroyed natural populations, making continuous loss of numerous useful germ plasm resources. It is urgently needed to enlarge the knowledge of genetic basis of the species and develop genetic markers that will be fairly necessary to enhance modern breeding programs for the purpose of the selection and breeding of new varieties.
Chloroplasts were derived from endosymbiosis between independent living cyanobacteria and a non-photosynthetic host [18]. This intracellular organelle encodes a number of chloroplast-specific components and has high levels of conservation [19]. Chloroplast DNA (cpDNA) of green plants provides sufficient information for genome-wide evolutionary studies. It has shown great potentials in addressing phylogenetic questions at both high and low taxonomic levels, and sometimes it is necessary to use complete cp genome sequences for resolving complex evolutionary relationships [20][21][22]. However, acquiring large coverage of cp genomes has typically been limited by conventional DNA sequencing. As next-generation sequencing techniques have revolutionized DNA sequencing via high-throughput capabilities and relatively low costs [23], it is now more convenient to obtain cp genome sequences and extend gene-based phylogenetics to phylogenomics.
In this study, we sequenced the complete cp genome sequence of a wild woody oil plant, P. utilis, which boasted great economic and taxonomic values in Rosaceae. Comparative analyses of the five Rosaceae plastomes, including the P. utilis plastome sequenced in this study, were further performed to provide insights into overall and especially evolutionary dynamics of Rosaceae cp genomes by detecting genome-wide sequence variations.

Results and Discussion
Genome organization of P. utilis  Table 1). Among these unique genes, 18 included one or two introns. All of these coding regions accounted for 51.3% of the whole genome. The gene contents and gene order of P. utilis were identical to those of the four previously known Rosaceae plastomes ( Table 2).
From the aspect of genome size, P. utilis, next to F. vesca (155,691 bp), is the second smallest one among the five sequenced Rosaceae cp genomes. It is about 3.6 kb, 1.5 kb, and 0.3 kb smaller than P. pyfifolia, P. persica, and P. rupicola, respectively. The genome size variation can be explained mainly by differences in the length of SSC and IR regions ( Table 2).

Contraction/expansion of IRs among Rosaceae cp genomes
Generally, the lengths of IR (IRa and IRb) regions differ among various plant species. Contraction or expansion of the IR regions often results in length variation of cp genomes [24]. In this study, the detailed IR-SSC and IR-LSC borders, together with the adjacent genes, were compared across the five Rosaceae cp genomes ( Figure 2). Of all the species except F. vesca, the junction of LSC and IRa was located in the coding region of rps19. The end of IRa of P. utilis expanded 178 bp into rps19, creating a truncated rps19 pseudogene of the same length at the IRb-LSC border. A similar feature was also observed in the other three cp genomes that the end of IRa expanded 152 bp, 95 bp, and 21 bp in P. rupicola, P. persica and P. pyfifolia, respectively. In F. vesca, LSC region possessed an intact rps19 gene and expanded to incorporate 10 bp non-coding regions apart from the IRa-LSC border. On the boundary of IRa and SSC, IRa of all the species expanded to ycf1 and resulted in a truncated ycf1 gene with the same length as its homologous part in IRb. In P. rupicola, P. persica, and P. pyfifolia, a part of the ndhF gene was located in IRa; in P. utilis and F. vesca, the entire ndhF gene was located in SSC region with 22 bp and 93 bp apart from the IRa-SSC border, respectively. A similar pattern of expansion was found in IRb. In all five species, the initiation of IRb partially contained the ycf1 gene, which was terminal with the rps19 pseudogene. In F. vesca, the nonfunctional rps19 gene seemed to be lost. Overall, a similar pattern of contraction/expansion was observed when these cp genomes approximately had a comparable genome size, for example, between P. utilis and F. vesca, and among the other three cp genomes.

Genome-wide sequence variations among Rosaceae cp genomes
We first aligned them using MAFFT [25] to identify genomewide sequence variation among these five Rosaceae cp genomes (as shown in Figure 3). As previously studied One and two asterisks indicate one-and two-intron containing genes respectively [26][27][28], most of the sequence variations were located in LSC and SSC regions, while IR regions exhibited comparatively fewer sequence variations. Interestingly, a positive association between A/T content and sequence divergence was observed when base content was introduced into sequence alignment.
Since IR region of cp genomes harbors lower A/T content, this might have led to higher sequence conservation of this region. Furthermore, the low A/T content of rRNA genes (above 50%) was also consistent with its high sequence conservation (data not shown). It has been well known that cp genomes of plants are quite A/T rich (above 60%) [28], and thus base content may be a major reason for determining cp genome variations.
To look sequence divergence into details, we further plotted sequence divergence along the whole genomes with genes annotated ( Figure S1). Consistent with other angiosperms, sequence divergence in intergenic regions was higher than that in genic regions of these five genomes. Intergenic regions with higher levels of divergence were matK-rps16, rps16-psbK, psbI-atpA, atpH-atpI, petN-psbM, ndhC-atpE, and rpl32-ccsA. Genic regions were ycf1, rpl22, and matK. The large sequence divergence in both coding and non-coding regions was mainly due to indels, as the sites where large indels (≥100 bp) occurred were mostly located in LSC or SSC regions. In addition, the IR-SSC and IR-LSC borders also harbored more sequence variations, which coincided with quite often contraction or expansion of the IR regions.
Among these cpSSR nucleotide units, the longest repeat stretch was polyA and (or) polyT with 16-17 bp, and most of the repeat units (> 95%) were also composed of polyA or polyT in these cp genomes; when the threshold was increased, all the repeat units were composed of polyA or polyT (Table 3). This was consistent with the observation in previous studies that cpSSRs were dominated by A or T mononucleotide repeats [38]. Furthermore, cpSSRs were unevenly distributed    along the whole genome with about 70% in noncoding regions ( Table 3). The composition and distribution of cpSSRs were quite similar within these Rosaceae cp genomes. Among these identified cpSSRs, about 80% were shared among these five Rosaceae cp genomes (similar repeat units located in similar genomic regions) (data not shown), suggesting the potential of these cpSSRs to be applied in further studies of more Rosaceae species.
Here, we found that most of the cpSSRs were A or T mononucleotide repeats, and they were mainly located in intergenic regions which harbored the vast majority of genomic variation. Previous studies also suggested that most repeats and indels were located in intergenic regions which were A + Trich in Trifolium cp genome [39]. Therefore, the observation that the genome divergence was associated with AT content in this study may indicate a common mutational pattern in the cp genomes of angiosperms.

Phylogenetic analysis based on the complete cp genome sequences
To explore phylogenetic relationship of P. utilis with other major angiosperms clades, we employed protein-coding sequences to resolve relationships among these divergent angiosperm clades. Since our interest focused on phylogenetic relationship between Rosaceae and its close sister clades, with an especial emphasis on its relationship with other species of rosids, we included most of the angiosperms clades previously studied instead of all the sequenced cp genomes [20,40]. A dataset including 78 protein-coding genes comprised of 65,713 aligned nucleotide characters was finally used for phylogenetic analyses of 78 taxa.
Maximum likelihood (ML) analysis resulted in a single tree with -lnL = 833985.25 ( Figure 4). Bootstrap analyses indicated that most of the nodes (67/74) were supported by values of 95% or greater and 57 of these had a bootstrap value of 100%. Maximum parsimony (MP) analysis also generated a single, fully resolved tree with a length of 74,348, a consistency index of 0.354 (excluding uninformative characters) and a retention index of 0.614 ( Figure S2). Bootstrap analyses indicated that 58 of the 78 nodes were supported by values of 95% or greater, and 56 of these had a bootstrap value of 100%. Both the ML and MP trees produced a similar topology (Figure 4 and Figure S2) and one of major differences concerned the position of Ceratophyllum as described in earlier papers [20,40]. The major clades including basal angiosperms, monocots, eudicots were strongly supported in this study.
In Rosaceae, both ML and MP analyses strongly supported that F. vesca was sister to the rest of the four species. Besides, ML tree divided the rest four species into two sister clades with 100% bootstrap support. The first clade included P. utilis and P. persica, and the second clade was consisted of the other two species. MP tree made a little difference that P. utilis and P. persica were grouped together but P. pyfifolia and P. rupicola did not form one sister clade. By comparing with previous studies, ML tree might represent a more correct topology in constructing Rosaceae phylogeny by using these genes [1,2]. Thus, only the ML tree was included in our latter analyses.
To identify indel variations along the Rosaceae phylogeny, a total of 95 indels in the gene regions of which the largest gap size was over 110 bp were found and mapped to the cp genome-based phylogenetic tree using F. vesca as an outgroup ( Figure 5). These 95 indels were distributed in 23 different genes, three (rpoC2, ycf2, and ycf1) of which contained as many as 12, 14, and 27 indels, respectively. Among these 7 branches, the branch leading to P. utilis contained most of the indels (21 of 95). The second most were contained by the common branch leading to two sister clades (19 of 95). The branch leading to the clade of P. utilis and P. persica only harbored three indels. Up to now, there is no clear consensus about whether indels are useful or not in phylogenetic analyses [41,42], and the doubtfulness is essentially based upon studies in which only one or several DNA fragments were utilized. In this study, 33 of all the indels observed were unambiguously mapped to monophyletic groups and they might have intensive benefit for nucleotide-based phylogenetic analyses. The remained indels might be homoplasies and possibly had negative effects on the reconstruction of phylogenetic tree. Previous studies also suggest that indels should be cautiously used in phylogenetic studies especially when few DNA fragments were analyzed.

Identification of chloroplast molecular markers
Since cp genomes of higher plants are quite conserved, the identification of molecular markers with higher sequence variation could be fairly valuable for population genetic and phylogenetic studies. Our genome-wide comparative analyses found that some genes boasted superior sequence divergence which could be definitely suitable for phylogeny studies ( Figure  S1). The values of overall mean distance (standing for genetic divergence) of 76 protein-coding genes in the five Rosaceae cp genomes, were further calculated ( Figure S3). Among them, ycf1, one of the largest genes in the cp genomes (about 5.5 kb), harbored the highest sequence divergence. To explore its phylogenetic utility, Maximum likelihood (ML) analysis for ycf1 was conducted in 65 species resulting in a single tree with -lnL = 179121.15 and high bootstrap values ( Figure S4). 37 of the 62 nodes possessed bootstrap values of 95% or greater and 31 of which had 100%. Despite the limited number of characters, overall topology of the phylogeny was quite similar to the one constructed with 78 protein-coding genes (Figure 4), indicating that ycf1 showed the potential for phylogenetic analysis in angiosperms. By comparing with another 9 potential molecular markers of which sequence divergence was no lower than 0.65 (rps18, rpl32, ccsA, ndhF, matK, rpl22; Figure S3) or indels were no less than 6 (matK, rpoC1, rpoC2, ycf2; Figure  5), ycf1 still performed the best to differentiate these five Rosaceae species ( Figure S5). While other markers presented diverse topologies in reconstructing phylogeny of Rosaceae and with lower node supports, ycf1 showed the same topology as the subtree constructed by 78 protein-coding genes with comparatively higher node supports ( Figure S5). Thus, ycf1 may serve as the best gene marker in distinguishing phylogenetic relationships within both Rosaceae and angiosperms.
Finding variable and easily aligned DNA sequences has been a challenging task in molecular genetics. Until now, most of DNA sequences used in plant phylogenetic analysis have come from the plastid genes, e.g. matK, rbcL, and trnL-F which have been extensively used. Although utility of these DNA regions has often obtained well-resolved phylogenies, the need for more data to address questions at multiple phylogenetic levels still persists. Previous studies indicated that ycf1 may be the most variable coding region in the cp genome, which is consistent with our analysis in the Rosaceae [43][44][45]. The usage of ycf1 has been confined to only orchids [43], Pinus [44], and magnoliid [45] where it has been demonstrated to be phylogenetically useful in analyses at the species, genus, and subfamily levels. Our study cans broad its uses as a new marker at least in Rosaceae.

Conclusion
In this study, we reported the complete chloroplast genome sequence of P. utilis, a wide woody oil plant in Rosaceae. The plastome was characterized by comparing with other sequenced plastomes in Rosaceae and the evolutionary position of P. utilis in Rosaceae was determined by phylogenetic analysis of concatenated alignments of gene sequences. While comparisons of plastomes in Rosaceae have identified many regions with higher sequence divergence, ycf1 was proved to boast the potential to solve phylogenetic issues ranging from lower taxonomic ranks to the relationships among genera and species. The availability of the complete cp genome of P. utilis will provide new resources for the utilization of cp genome sequence to further evolutionary studies and genetic breeding programs in Rosaceae.

Ethics statement
The species is naturally grown in Kunming Botanical Garden of Kunming Institute of Botany, the Chinese Academy of Sciences. The voucher specimens were deposited at the Herbarium of Kunming Institute of Botany (KUN). Permit for collection of the plant leaves was obtained from Wei-bang Sun, Kunming Botanical Garden, the Chinese Academy of Sciences.

Chloroplast DNA Extraction, Genome Sequencing and Assembly
About 20g fresh mature leaves were sampled from a single P. utihis plant. cpDNA was extracted by use of a modified high salt method reported by our group [46].
After the cpDNA isolation, approximately 5-10µg of DNA was sheared, followed by adapter ligation and library amplification, and then subjected to Illumina Sample Preparation Instructions. The fragmented cpDNAs were sequenced at both single-read using the Illumina Genome Analyzer IIx platform at the inhouse facility at The Germplasm Bank of Wild Species in Southwestern China. Files containing the raw read sequences are available from the National Center for Biotechnology Information (NCBI) Short Read Archive with the accession number: SRX277333. The obtained paired-end reads (2×100 bp read lengths) were assembled using SOAP de novo [47] to the cp genome sequence of P. persica, which was downloaded from GenBank (NC_014697) and served as a reference genome. Regions with ambiguous alignment (conflicted reads mapped to the same genomic region) were trimmed off manually and considered as gaps. Polymerase chain reaction (PCR) amplified fragments yielded by primers (Table S1) derived from the terminal ends of contigs, and the fragments were then sequenced to flank the gap regions. The PCR amplification reactions were template denaturation at 80°C for 5 min followed by 30 cycles of denaturation at 95°C for 30 sec, primer annealing at 55°C for 30 sec, and primer extension at 65°C for 1 min; followed by a final extension step of 5 min at 65°C [48]. PCR products were separated by electrophoresis in 1.5% agarose gel and sequenced on an Applied Biosystems (ABI) 3730 automated sequencer. Finally, the average coverage depth reached approximately 855×, which is greatly higher than the proposed minimum for cpDNAs (30×) [49].

Genome Annotation
Annotation of the assembled genome was performed with Dual Organellar GenoMe Annotator (DOGMA) using default parameters to predict protein-coding genes, transfer RNA (tRNA) genes, and ribosome RNA (rRNA) genes [50]. BLASTX and BLASTN searches against a custom database of previously published cp genomes identified locations of these putative genes. For genes with low sequence identity, manual annotation was performed to determinate the position of start and stop codons depending on the translated amino acid sequence using the chloroplast/bacterial genetic code. Gene map of this plastome was drawn by OGDraw v 1.2 [51]. The plastid genome sequence has been deposited in GenBank (accession number KC571835).

Whole-genome Sequence Alignment and Detection of SSRs
mVISTA program (http://genome.lbl.gov/vista/index.shtml) was utilized for pair-wise whole-genome alignment of the five Rosaceae cp genomes with F. vesca serving as an outgroup. This program takes both fasta format sequences and gene annotation text as input files, but here, only F. vesca gene annotation was used. To further estimate sequence divergences, the common protein-coding genes were extracted from these five species and aligned in MEGA5 [52]. Gene divergences represented by overall mean distance were calculated with Jukes-cantor model with all gaps and missing data treated as complete deletion. SSRs were detected using Msatfinder on-line v.2.0 (http://www.genomics.ceh.ac.uk/cgibin/msatfinder/msatfinder.cgi). We calculated the number of SSRs by using two thresholds of 1) mononucleotide repeats ≥ 8 nt, dinucleotide repeats ≥ 5 nt, and trinucleotide repeats ≥ 5 nt; 2) mononucleotide repeats ≥ 12 nt, dinucleotide repeats ≥ 10 nt, and trinucleotide repeats ≥ 10 nt.

Phylogenetic Analyses
For the purpose of phylogenetic analyses, two data sets were utilized. The first set included 78 common protein-coding genes which were extracted from 76 angiosperm cp genomes and 2 gymnosperm outgroups, representing different plant lineages of which cp genomes were sequenced (see Table S2 for a complete list of the genomes examined). For all the species, every gene was translated into amino acid sequence separately, and then aligned in MSWAT (http:// mswat.ccbb.utexas.edu/). This alignment was next constrained back to the nucleotide alignment. Some genes (e.g., infA, rpl22, and accD) that got lost from the majority of species were excluded from the analysis. The second data set included the ycf1 gene sequence from the same taxa as the 78-gene data set except that those taxa without ycf1 were deleted. These sequences were aligned using the program MAFFT version 5 [25] and adjusted manually where necessary.
Maximum likelihood (ML) analyses were implemented in RAxML version 7.2.6 [53]. RAxML searches relied on the general time reversible (GTR) model of nucleotide substitution with the gamma model of rate heterogeneity. Non-parametric bootstrapping as implemented in the ''fast bootstrap'' algorithm of RAxML used 1,000 replicates. The aligned data matrices are available upon request. MP analyses were performed with PAUP*4.0b10. Heuristic tree searches were conducted with 1,000 random-taxon-addition replicates and tree bisectionreconnection (TBR) branch swapping, with ''multrees'' option in effect. Non-parametric bootstrap analysis was conducted under 1,000 replicates with TBR branch swapping.