The Complete Chloroplast Genome Sequence of Podocarpus lambertii: Genome Structure, Evolutionary Aspects, Gene Content and SSR Detection

Background: Podocarpus lambertii (Podocarpaceae) is a conifer native to the Brazilian Atlantic Forest Biome, which is considered one of the 25 biodiversity hotspots in the world. The advancement of next-generation sequencing technologies has enabled the rapid acquisition of whole chloroplast (cp) genome sequences at low cost. Several studies have demonstrated the potential of cp genomes as tools to understand enigmatic and basal phylogenetic relationships at different taxonomic levels, as well as to further probe the structural and functional evolution of plants. In this work, we present the complete cp genome sequence of P. lambertii.
Methodology/Principal Findings: The P. lambertii cp genome is 133,734 bp in length and, like other sequenced cupressophytes, it lacks one of the large inverted repeat (IR) regions. It contains 118 unique genes and one duplicated tRNA (trnN-GUU), which occurs as an inverted repeat sequence. The rps16 gene was not found, an absence previously reported for the plastid genomes of another member of Podocarpaceae (Nageia nagi) and of a member of Araucariaceae (Agathis dammara). Structurally, P. lambertii shows four inversions of large DNA fragments of ∼20,000 bp each compared to the Podocarpus totara cp genome. These unexpected characteristics may be attributed to geographical distance and different adaptive needs. The P. lambertii cp genome presents a total of 28 tandem repeats and 156 SSRs, with homo- and dipolymers being the most common and tri-, tetra-, penta-, and hexapolymers occurring less frequently.
Conclusion: The complete cp genome sequence of P. lambertii revealed significant structural changes, even between species of the same genus. These results reinforce the apparent loss of the rps16 gene in Podocarpaceae cp genomes. In addition, several SSRs in the P. lambertii cp genome are likely intraspecific polymorphism sites, which may allow highly sensitive phylogeographic and population structure studies, as well as phylogenetic studies of species of this genus.


Introduction
Extant gymnosperms are considered the most ancient group of seed-bearing plants that first appeared approximately 300 million years ago [1]. They consist of four major groups, including Gnetophytes, Conifers, Cycads and Ginkgo. Podocarpaceae are considered the most diverse family of Conifers, and much of this diversity has taken place within the Podocarpus and Dacrydium genera [2]. The Podocarpaceae family comprises 18 genera and 173 species distributed mainly in the Southern Hemisphere, but extending to the north in subtropical China, Japan, Mexico and the Caribbean [3,4].
The genus Podocarpus sensu lato (s.l.) comprises nearly 100 species, widely spread throughout the Southern Hemisphere and extending northward to the West Indies, Mexico, southern China and southern Japan [5]. Ledru et al. [6] reported that Podocarpus populations are widely dispersed in eastern Brazil, from north to south, and three endemic species have been reported: Podocarpus sellowii Klotzch ex Endl, Podocarpus lambertii Klotzch ex Endl, and Podocarpus brasiliensis de Laubenfels [7]. P. lambertii is native to the Araucaria Forest, a subtropical moist forest ecoregion of the Atlantic Forest Biome, which is considered one of the 25 biodiversity hotspots of the world [8]. It is a dioecious, shade-tolerant evergreen tree of variable height (1-10 m), adapted to high frequency and density of undergrowth [9].
Maximum parsimony phylogenetic analyses of the Podocarpaceae family, based on 18S rDNA gene sequences and morphological characteristics, indicated that Podocarpaceae is monophyletic and that the genera Podocarpus s.l. and Dacrydium s.l. are unnatural [2]. The author concluded that single-gene studies rarely result in perfect phylogenies, but that they can provide a basis for choosing between competing hypotheses. Parks et al. [10] suggested chloroplast (cp) genome sequencing as an efficient option for increasing phylogenetic resolution at lower taxonomic levels in plant phylogenetic and population genetic analyses.
Several studies have demonstrated the potential of cp genomes as tools to understand enigmatic and basal phylogenetic relationships at different taxonomic levels, as well as to probe the structural and functional evolution of plants [11,17,18,19,20]. Hirao et al. [12] sequenced the cp genome of the first species in the Cupressaceae family, Cryptomeria japonica. They reported the deletion of one large inverted repeat (IR), numerous genomic rearrangements, and many differences in genomic structure between C. japonica and other land plants, thus supporting the theory that a pair of large IRs can stabilize the cp genome against major structural rearrangements and, in turn, providing new insights into both the evolutionary lineage of coniferous species and the evolution of the cp genome [12,21,22].
Chloroplast genome sequencing in gymnosperms has also provided insights into evolutionary aspects of Gnetophytes. Wu et al. [23] proposed that the reduced cp genome size in Gnetophytes resulted from selection toward a lower-cost strategy through deletions of genes and noncoding sequences, leading to genomic compactness and accelerated substitution rates. More recently, comparative analyses of the cp genomes of cupressophytes and Pinaceae provided inferences about the loss of the large IR [11,20]. On one hand, Wu et al. [20] and Wu and Chaw [16] argue that Pinaceae and cupressophytes each lost a different IR copy. On the other hand, Yi et al. [11] showed that distinct isomers can be considered alternative structures for the ancestral cp genome of the cupressophyte and Pinaceae lineages. Therefore, it is not possible to distinguish between hypotheses favoring retention or independent loss of the same IR region in cupressophyte and Pinaceae cp genomes.
The present study focuses on establishing the complete cp genome sequence of a further member of the Podocarpaceae family, the Brazilian endemic species P. lambertii. Here, we characterize the cp genome organization of P. lambertii and compare its cp genome structure with other conifer species.

Plant material and cp DNA purification
Chloroplast isolation of P. lambertii was performed from young plants collected in a private area located at Lages, Santa Catarina, Brazil (27° 48' 57" S, 50° 19' 33" W), where the species is abundant, with prior permission from the owner (José Antônio Ribas Ribeiro). This species is not considered threatened. The young plants were then transplanted to a greenhouse and kept there until the collection of needles. The cpDNA isolation was performed according to Vieira et al. [24].

Chloroplast genome sequencing, assembling and annotation
Approximately 50 ng of cpDNA was used to prepare sequencing libraries with the Nextera DNA Sample Prep Kit (Illumina Inc., San Diego, CA) according to the manufacturer's instructions. Chloroplast DNA was sequenced using an Illumina MiSeq (Illumina Inc., San Diego, CA) at the Federal University of Paraná, Brazil. In total, 495,071 paired-end reads (2 × 250 bp) were obtained, and de novo assembly was performed using Newbler v. 2.6. The paired-end reads were then mapped onto the P. lambertii cp genome, and the genome coverage was estimated using the CLC Genomics Workbench 5.5 software. By this approach, a total of 377,437 paired-end reads (76.23%) were found to derive from cpDNA, resulting in 1,200-fold genome coverage. Initial annotation of the P. lambertii cp genome was performed using Dual Organellar GenoMe Annotator (DOGMA) [25]. From this initial annotation, putative starts, stops, and intron positions were determined based on comparisons to homologous genes in other cp genomes. The tRNA genes were further verified using tRNAscan-SE [26]. A physical map of the circular cp genome was drawn using OrganellarGenomeDRAW (OGDRAW) [27]. The complete nucleotide sequence of the P. lambertii cp genome was deposited in the GenBank database under accession number KJ010812.
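The read statistics reported above can be cross-checked with simple arithmetic. The following Python sketch is illustrative only and is not part of the original analysis, which used CLC Genomics Workbench. It assumes nominal, untrimmed 2 × 250 bp reads, so the coverage it returns should be read as an upper bound rather than as the reported estimate; the function and constant names are hypothetical.

```python
# Read counts and genome length as reported in the text.
TOTAL_READ_PAIRS = 495_071
CP_READ_PAIRS = 377_437
GENOME_LENGTH_BP = 133_734

# Assumption (not stated as such in the text): every read has the full
# nominal length of 250 bp and both mates of each pair map to the genome.
NOMINAL_READ_LENGTH_BP = 250
READS_PER_PAIR = 2


def mapped_fraction(mapped_pairs: int, total_pairs: int) -> float:
    """Fraction of read pairs assigned to the cp genome."""
    return mapped_pairs / total_pairs


def nominal_coverage(pairs: int, read_length_bp: int, genome_length_bp: int) -> float:
    """Upper-bound mean depth: total sequenced bases divided by genome length."""
    return pairs * READS_PER_PAIR * read_length_bp / genome_length_bp


fraction = mapped_fraction(CP_READ_PAIRS, TOTAL_READ_PAIRS)
coverage = nominal_coverage(CP_READ_PAIRS, NOMINAL_READ_LENGTH_BP, GENOME_LENGTH_BP)
print(f"cp-derived read pairs: {fraction:.1%}")
print(f"nominal coverage (upper bound): {coverage:.0f}x")
```

Under these assumptions the bound comes out somewhat above the reported 1,200-fold coverage, which is what one would expect if reads were quality-trimmed or only partially aligned before the coverage was estimated.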

Repeat sequence analysis and IR identification
Simple sequence repeats (SSRs) were detected using the MISA Perl script (available at http://pgrc.ipk-gatersleben.de/misa/), with thresholds of eight repeat units for mononucleotide SSRs, four repeat units for di- and trinucleotide SSRs, and three repeat units for tetra-, penta- and hexanucleotide SSRs. Tandem repeats were analyzed using Tandem Repeats Finder (TRF) [29] with parameter settings of 2, 7 and 7 for match, mismatch, and indel, respectively. The minimum alignment score and maximum period size were set to 50 and 500, respectively. All of the repeats found were manually verified, and nested or redundant results were removed. REPuter [30] was used to visualize the remaining IRs in P. lambertii by forward vs. reverse complement (palindromic) alignment. The minimal repeat size was set to 30 bp and the identity of repeats to ≥90%.
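To make the SSR thresholds above concrete, the following Python sketch scans a sequence for perfect microsatellites using the same minimum repeat numbers (8 for mono-, 4 for di- and trinucleotide, 3 for tetra-, penta- and hexanucleotide motifs). It is a simplified illustration, not the MISA script itself: it ignores compound SSRs and ambiguous bases, and the names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are hypothetical.

```python
import re

# Minimum number of repeat units per motif length, as given in the text.
MIN_REPEATS = {1: 8, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit
    (e.g. 'AT' is primitive, 'AA' and 'ATAT' are not)."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(sequence: str):
    """Return sorted (start, end, motif, repeat_count) tuples, 0-based, end-exclusive."""
    seq = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # Lookahead so that every start position is tested; the greedy
        # back-reference extends each candidate to its maximal length.
        pattern = re.compile(r"(?=(([ACGT]{%d})\2{%d,}))" % (unit_len, min_rep - 1))
        last_end = 0
        for match in pattern.finditer(seq):
            start = match.start()
            full, motif = match.group(1), match.group(2)
            end = start + len(full)
            # Skip shifted copies of an SSR already recorded and motifs that
            # belong to a shorter repeat class.
            if end <= last_end or not is_primitive(motif):
                continue
            hits.append((start, end, motif, len(full) // unit_len))
            last_end = end
    return sorted(hits)


# Toy sequence (not from the P. lambertii genome): one (A)8 and one (AT)4 repeat.
for hit in find_ssrs("GC" + "A" * 8 + "GCGC" + "AT" * 4 + "GG"):
    print(hit)
```

Filtering on primitive motifs keeps a run such as (A)8 from also being reported as an (AA)4 dinucleotide repeat, which mirrors the manual removal of nested or redundant results described above.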

Results and Discussion
Chloroplast genome sequencing, assembling and annotation
The P. lambertii cp genome size was determined to be 133,734 bp, which is very similar to that of P. totara (133,259 bp) (NC_020361.1) and larger than the sequenced cp genomes of Pinaceae species, which range from 116,479 bp in Pinus monophylla [14] to 124,168 bp in Picea morrisonicola [31]. The P. lambertii cp genome is smaller than the cp genomes of the cycads Cycas taitungensis (163,403 bp) [32] and Cycas revoluta (162,489 bp) (NC_020319.1). The size of the P. lambertii cp genome is consistent with that of non-Pinaceae conifer species, which ranges from 127,665 bp in T. mairei (NC_020321.1) to 136,196 bp in C. wilsoniana [20]. A total of 119 genes were identified in the P. lambertii cp genome, of which 118 genes were single copy and one gene, trnN-GUU, was duplicated and occurred as an inverted repeat sequence. The following genes were identified and are listed in Figure 1 and Table 1: 4 ribosomal RNA genes, 31 unique transfer RNA genes, 20 genes encoding large and small ribosomal subunits, 1 translational initiation factor, 4 genes encoding DNA-dependent RNA polymerases, 50 genes encoding photosynthesis-related proteins, 8 genes encoding other proteins, including the unknown function gene ycf2, and 1 pseudogene, ycf68. Among these 118 single copy genes, 14 contained introns (Table 1). The GC content determined for the P. lambertii cp genome is 37.1%, which is higher than that of C. oliveri (35.2%), C. wilsoniana (35.1%), T. cryptomerioides (34.6%), and C. japonica (35.4%), but lower than that of C. taitungensis (39.5%) and P. thunbergii (38.8%).
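GC content values such as those compared above are simple base-composition ratios. As a minimal sketch, assuming the genome is available as a plain nucleotide string and that ambiguous bases should be excluded from the denominator (a choice the text does not specify), the calculation could look as follows; `gc_content` is a hypothetical helper name.

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


# Toy example (not genome data): 4 of 10 bases are G or C.
print(f"{gc_content('ATGCATGCAT'):.1f}%")
```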

Gene content differences
The gene content of the P. lambertii cp genome is highly similar to that of other conifer cp genomes sequenced to date. However, some differences are observed when P. lambertii cpDNA is compared with that of other non-Pinaceae and Pinaceae conifers. One exception is the rps16 gene, which is absent from the P. lambertii cp genome. This result reinforces the apparent loss of the rps16 gene in the Podocarpaceae and Araucariaceae families. Wu and Chaw [16] reported the rps16 gene loss in Nageia nagi (Podocarpaceae) and Agathis dammara (Araucariaceae). This gene is present in the other non-Pinaceae conifer cp genomes published so far [11,12,20,32]. The rps16 gene loss has already been reported in other gymnosperms, such as Pinaceae and Gnetophyte species [23,32,33]. Wu et al. [20] considered rps16 gene loss a structural mutation unique to the cpDNAs of gnetophytes and Pinaceae, but since the loss of this gene has now been identified in the Podocarpaceae and Araucariaceae families, some cupressophytes may also carry this mutation. This gene is also absent, or nonfunctional, in some angiosperm species of the Fabaceae family, such as Medicago truncatula, in which it is completely absent, and Phaseolus vulgaris and Vigna radiata, in which it is nonfunctional. In this angiosperm family, the coding sequence contains many internal stop codons and a modified initial stop codon [34,35]. Since this gene was shown to be essential for cell survival in tobacco [36], it was probably transferred to the nucleus, as observed for different species of the Fabaceae family [34,35], and has since become a functional nuclear gene required for normal plastid translation.
The trnP-GGG and trnR-CCG genes are considered to be relics of plastid genome evolution in gymnosperms, pteridophytes and bryophytes [37]. The trnP-GGG gene is present in the P. lambertii cp genome, as well as in conifer species such as C. japonica, P. thunbergii, C. oliveri and C. wilsoniana, and in other gymnosperms, such as C. taitungensis, Gnetum and Ginkgo. The trnR-CCG gene is present as a complete and functional tRNA in P. lambertii (Podocarpaceae), as well as in the cp genomes of P. thunbergii (Pinaceae) and C. taitungensis (Cycadaceae) [32], whereas it is absent from C. japonica (Cupressaceae), C. oliveri and C. wilsoniana (Cephalotaxaceae), and T. mairei (Taxaceae) [11,12]. Hirao et al. [12] suggested that trnR-CCG might have been completely lost in the Cupressaceae s.l., which diverged only relatively recently in the long evolutionary history of plants. These data corroborate the hypothesis based on phytochrome phylogenetic trees, in which the most ancient branch of the conifers seems to be the Pinaceae, and the next split appears to have separated Araucariaceae plus Podocarpaceae from the Taxaceae/Taxodiaceae/Cupressaceae group [38]. The trnR-CCG gene may have been lost during this second split, which separated the Araucariaceae and Podocarpaceae taxa. In addition, trnT-GGU occurs as a pseudogene of only 43 bp in the C. japonica cp genome, while it is present and completely functional in P. lambertii, C. oliveri and C. wilsoniana, duplicated in P. thunbergii, and totally absent from the C. taitungensis cp genome. Interestingly, the trnT-GGU gene is highly conserved in angiosperms, and knockout of this gene in tobacco produced viable plants, although their growth was strongly affected, suggesting an important role during plastid translation [39].
The loss of the trnT-GGU gene in several gymnosperm species suggests that a uridine modification in the anticodon position of the trnT-UGU gene occurred during evolution, which would facilitate the reading of threonine codons and make the trnT-GGU gene dispensable in these species [39][40][41][42]. Evolutionarily, the loss of this tRNA gene could be used as a tool, or marker gene, to study the possible ways in which the conifers diverged. However, it remains to be determined whether differences between angiosperms and gymnosperms in the structure of the cp ribosome, or modifications in the structure of this tRNA, would facilitate the decoding.

Comparative analysis of genome structure
Chloroplast genome organization, including the presence of IRs, is highly conserved in angiosperms, with very few exceptions. Terakami et al. [43] detected neither translocations nor inversions among Pyrus, Malus and Nicotiana. In addition, across the many dicot and monocot species considered, only one large inversion was reported [43].
In addition to the loss of the large IR in conifers, many genome rearrangements were observed in the cp genome, and such rearrangements appear to play an important role in their evolution. Dot-plot analyses indicate that the structure of the P. lambertii cp genome differs significantly from cp genomes of other conifer species, and, surprisingly, it has significant differences when compared to P. totara (Figure 2A-H).
For the genus Cephalotaxus s.l., specifically C. wilsoniana and C. oliveri, it was shown that the genome structures were almost the same [11]. Similar results were observed in the present study, as revealed by the high similarity in the dot-plot analyses between the Podocarpus and Cephalotaxus genera, as represented by P. lambertii × C. wilsoniana (Figure 2E) and P. lambertii × C. oliveri (Figure 2F), and between the Podocarpus and Taiwania genera, as represented by P. lambertii × T. flousiana (Figure 2G) and P. lambertii × T. cryptomerioides (Figure 2H). This high similarity in dot-plot analysis indicates the occurrence of exactly the same structural modifications between P. lambertii and these two Cephalotaxus and Taiwania species.
In contrast, between P. lambertii and P. totara (Figure 2A), we observed four large inversions of about 20,000 bp in length each. In both the Cephalotaxus and Taiwania genera, the two sequenced species share the same region of natural occurrence, which is not true for the two Podocarpus species sequenced. Thus, these large inversions may be explained by, and probably result from, the large distance between the natural ranges of these two species: P. lambertii occurs in Brazil, while P. totara occurs in New Zealand. Moreover, podocarps have a rich fossil record that suggests an origin in the Triassic period (about 220 million years ago) and a distribution in both the Northern and Southern Hemispheres through the Cretaceous and earliest Tertiary periods, about 100 million years ago [44][45][46]. Thus, geographic distance and different adaptive traits could explain the structural differences found between these two species of the same genus.
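The way an inversion shows up in a dot-plot comparison can be illustrated with a short Python sketch. This is not the software used to produce Figure 2; it is a minimal k-mer dot plot under stated assumptions (exact word matches, an arbitrary word size of 12, and a randomly generated toy sequence rather than genome data), with hypothetical function names. Forward matches fall on the main diagonal, whereas an inverted segment appears as reverse-complement matches on an anti-diagonal.

```python
import random
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_index(seq: str, k: int):
    """Map every k-mer of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def dot_plot(seq_a: str, seq_b: str, k: int = 12):
    """Return (forward, reverse) lists of (pos_a, pos_b) exact k-mer matches.

    'reverse' holds positions where the reverse complement of a k-mer
    from seq_a occurs in seq_b, i.e. the signal left by an inversion.
    """
    index_b = kmer_index(seq_b, k)
    forward, reverse = [], []
    for i in range(len(seq_a) - k + 1):
        word = seq_a[i:i + k]
        forward.extend((i, j) for j in index_b.get(word, ()))
        reverse.extend((i, j) for j in index_b.get(reverse_complement(word), ()))
    return forward, reverse


# Toy data: a 300 bp random sequence and a copy with bases 100-199 inverted.
rng = random.Random(0)
original = "".join(rng.choice("ACGT") for _ in range(300))
inverted = original[:100] + reverse_complement(original[100:200]) + original[200:]

forward, reverse = dot_plot(original, inverted)
print((0, 0) in forward)      # collinear flank lies on the main diagonal
print((100, 188) in reverse)  # inverted block lies on an anti-diagonal
```

For two real cp genomes the same idea applies at a larger scale: long runs of reverse-complement matches flanked by forward matches mark candidate inversions, which would then need to be confirmed by alignment.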
In addition, the loss of one large IR copy, already reported in other conifer species, was also observed in the P. lambertii cp genome [11,12,20]. However, short remaining IR sequences of 326 bp can be found in P. lambertii, 544 bp in C. oliveri, 530 bp in C. wilsoniana, 277 bp in T. cryptomerioides and 284 bp in C. japonica [11]. These short remaining IR sequences also differ in nucleotide sequence and gene content between different conifer species. In P. lambertii, trnN-GUU remains from the lost IR copy region, while in T. cryptomerioides and C. japonica, trnI-CAU remained after the rearrangements that led to the loss of one IR copy [11]. In C. oliveri and C. wilsoniana, the trnQ-UUG is duplicated; however, this gene is not normally present in the IR region, and its duplication was probably produced by other rearrangements not involving the IR regions [20]. The evidence provided by different conifer plastid genomes suggests that the loss of one IR copy occurred after a reduction in sequence and gene content and that such loss was most likely caused by this reduction [11,12,14,20,23,32,33]. However, this speculation remains to be established. To date, it is not entirely clear whether cupressophytes and Pinaceae species have lost different IR regions [11]. Nevertheless, we can observe in P. lambertii an inversion in the direction of transcription of the ribosomal RNA genes spanning rrn5-rrn16 and of the protein-coding genes ndhB and ycf2, when compared to C. oliveri, C. wilsoniana, T. cryptomerioides and C. japonica (Figure 3).

Repeat sequence analysis
The cp genome mode of inheritance, paternal in most gymnosperms, allows us to elucidate the relative contributions of seed and pollen flow to the genetic structure of natural populations by comparison of nuclear and cp markers [47]. The cp microsatellites, or SSRs, may be identified in completely sequenced plant cp genomes by simple database searches, followed by primers designed to screen for polymorphism. To date, studies of cp microsatellites have revealed much higher levels of diversity than have those of cp restriction fragment length polymorphisms (RFLP) [47][48][49].
We analyzed the occurrence, type, and distribution of SSRs in the P. lambertii cp genome. In total, 156 SSRs were identified. Among them, homo- and dipolymers were the most common with, respectively, 80 and 63 occurrences, whereas tri- (4), tetra- (7), penta- (1), and hexapolymers (1) occur with lower frequency (Table 2). Most homopolymers are constituted by A/T sequences (87.5%), and of the dipolymers, 61.1% were also constituted by multiple A and T bases. In this study, we identified 78 repeats with more than one nucleotide repeat, totaling almost 50% of all SSRs identified. The 13 tri-, tetra-, penta-, and hexapolymers are shown in Table 3, together with their size and location. Of these 13 polymers, 9 are located in intergenic spacers, 3 in coding sequences, and only 1 inside an intron. These results reveal the presence of several SSR sites in P. lambertii. These sites can now be assessed for their intraspecific level of polymorphism, leading to highly sensitive phylogeographic and population structure studies for this species.
Yi et al. [11] attributed the expansion of the accD ORF to the presence of tandemly repeated sequences. In the P. lambertii cp genome, we identified 2 tandem repeats in the accD CDS, totaling 132 bp, or 44 codons. The accD reading frame length in the P. lambertii cp genome is 864 codons, similar to other cupressophyte species, such as C. oliveri (936 codons), C. wilsoniana (1,056 codons), C. japonica (700 codons) and T. cryptomerioides (800 codons). In contrast, the reading frame lengths of cycads, Ginkgo and Pinaceae range from 320 to 359 codons, about half or less of the size found in cupressophytes. These results support the hypothesis of Hirao et al. [12] and Yi et al. [11], which holds that the accD reading frame has displayed a tendency toward enlargement in cupressophytes.
Conclusions
The complete cp genome sequence of P. lambertii revealed significant structural changes occurring in the cp genome, even between species of the same genus. These results reinforce the apparent loss of the rps16 gene in Podocarpaceae cp genomes. In addition, several SSRs in the P. lambertii cp genome are likely intraspecific polymorphism sites, which may allow highly sensitive phylogeographic and population structure studies, as well as phylogenetic studies, of species of this genus.