Complete chloroplast genome sequence and comparative analysis of loblolly pine (Pinus taeda L.) with related species

Pinaceae, the largest family of conifers, has a diversified organization of chloroplast (cp) genomes with two typical highly reduced inverted repeats (IRs). In the current study, we determined the complete sequence of the cp genome of an economically and ecologically important conifer tree, the loblolly pine (Pinus taeda L.), using Illumina paired-end sequencing and compared the sequence with those of other pine species. The results revealed a genome size of 121,531 base pairs (bp) containing a pair of 830-bp IR regions, distinguished by a small single copy (42,258 bp) and large single copy (77,614 bp) region. The chloroplast genome of P. taeda encodes 120 genes, comprising 81 protein-coding genes, four ribosomal RNA genes, and 35 tRNA genes, with 151 randomly distributed microsatellites. Approximately 6 palindromic, 34 forward, and 22 tandem repeats were found in the P. taeda cp genome. Whole cp genome comparison with those of other Pinus species exhibited an overall high degree of sequence similarity, with some divergence in intergenic spacers. Higher and lower numbers of indels and single-nucleotide polymorphism substitutions were observed relative to P. contorta and P. monophylla, respectively. Phylogenomic analyses based on the complete genome sequence revealed that 60 shared genes generated trees with the same topologies, and P. taeda was closely related to P. contorta in the subgenus Pinus. Thus, the complete P. taeda genome provided valuable resources for population and evolutionary studies of gymnosperms and can be used to identify related species.


Introduction
Gymnosperms are represented by a diverse and magnificent group of coniferous species distributed across eight families, consisting of 70 genera containing more than 630 species [1]. They are thought to have arisen from seed plants approximately 300 million years ago and are one of the ancient main plant clades. Gymnosperms possess larger genomes than flowering plants [2][3][4][5]. Recently, rapid progress has been made in angiosperm genome sequencing and a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 been sequenced and deposited in the NCBI Organelle Genome Resources database [51]. The evolution of cp genomes in terrestrial plants can now be studied using these database resources [51]. To date, a total of 16 complete chloroplast genomes in the genus Pinus have been sequenced and submitted to NCBI. In the current study, the complete cp genome of P. taeda (GenBank accession number: KY964286) was sequenced using next-generation sequencing tools. The goal of this study was to determine the cp genome organization of P. taeda and its global pattern of structural and comparative variation in the cp genome of P. taeda with 14 Pinus species (P. koraiensis, P. sibirica, P. armandii, P. lambertiana, P. krempfii, P. bungeana, P. gerardiana, P. monophylla, P. nelsonii, P. contorta, P. massoniana, P. tabuliformis, P. taiwanensis, P. strobus, and P. thunbergii).

Chloroplast genome sequencing and assembly
Plastid DNA was extracted from the fresh needle leaf parts of P. taeda using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany), and the resulting cpDNA was sequenced using an Illumina HiSeq-2000 platform (San Diego, CA, USA) at Macrogen (Seoul, Korea). The P. taeda cp genome was then assembled de novo using a bioinformatics pipeline (http://www.phyzen. com). Specifically, a 400-bp paired-end library was produced according to the Illumina standard method, which generated 28,110,596 bp of sequence data with a 100-bp average read length. Raw reads with Phred scores of 20 were removed from the total PE reads using the CLC-quality trim tool, and de novo assembly of trimmed reads was accomplished using CLC Genomics Workbench v7.0 (CLC Bio, Aarhus, Denmark) with a minimum overlap of 200-600 bp. The resulting contigs were compared against the P. thunbergii and P. contorta plastomes using BLASTN with an E-value cutoff of 1e-5, and five contigs were identified and temporarily arranged based on their mapping positions on the reference genome. After initial assembly, primers were designed (S1 Table) based on the terminal sequences of adjacent contigs, and PCR amplification and subsequent DNA sequencing were conducted to fill in the gaps. PCR amplification was performed in 20-μL reactions containing 1× reaction buffer, 0.4 μL dNTPs (10 mM), 0.1 μL Taq (Solg h-Taq DNA Polymerase), 1 μL (10 pm/μL) primers, and 1 μL (10 ng/μL) DNA, using the following conditions: initial denaturation at 95˚C for 5 min; 32 cycles of 95˚C for 30 s, 60˚C for 20 s, and 72˚C for 30 s; and a final extension step of 72˚C for 5 min. After incorporating the additional sequencing results, the complete cp genome was used as a reference to map the remaining unmapped short reads to improve the sequence coverage of the assembled genome.

Analysis of gene content and sequence architecture
The P. taeda cp genome was annotated using DOGMA [52], checked manually, and the codon positions were adjusted by comparison with homologs in the cp genome of P. taeda and P. contorta. Transfer RNA sequences of the P. taeda cp genome were verified using tRNAscan-SE version 1.21 [53] with default settings, and the structural features were illustrated using OGDRAW [54]. To examine deviations in synonymous codon usage by avoiding the influence of amino acid composition, the relative synonymous codon usage was determined using MEGA 6 software [55], and finally the divergence of the P. taeda cp genome from six other Pinus species (five from subgenus Pinus and one from subgenus Strobus) cp genomes was assessed using mVISTA [56] in Shuffle-LAGAN mode and using the P. taeda genome as a reference.

Elucidation of repeat sequences and simple sequence repeat (SSRs)
Repeat sequences, including direct, reverse, and palindromic repeats, were identified within the cp genome using REPuter [57] with the following settings: Hamming distance of 3, !90% sequence identity, and minimum repeat size of 30 bp. Furthermore, SSRs were detected using Phobos version 3.3.12 [58] with the search parameters set to !10 repeat units for mononucleotide repeats, !8 repeat units for dinucleotide repeats, !4 repeat units for trinucleotide and tetranucleotide repeats, and !3 repeat units for pentanucleotide and hexanucleotide repeats. Tandem repeats were identified using Tandem Repeats Finder version 4.07 b [59] with default settings.

Sequence divergence and phylogenetic analyses
The average pairwise sequence divergence of 60 shared genes and complete plastomes of 15 Pinus species was analyzed, using data from P. taeda, P. koraiensis, P. sibirica, P. armandii, P. lambertiana, P. krempfii, P. bungeana, P. gerardiana, P. monophylla, P. nelsonii, P. contorta, P. massoniana, P. tabuliformis, P. taiwanensis, P. strobus, and P. thunbergii. In cases of missed and unclear genes, annotation was confirmed by comparison with the reference sequence after assembling a multiple sequence alignment tool. The complete genome data set was aligned using MAFFT version 7.222 [60] with default parameters. For pairwise sequence divergence, a Kimura's model was used [61]. Indel polymorphisms among the complete genomes were identified using DnaSP 5.10.01 [62], and a custom Python script (https://www.biostars.org/p/ 119214/) was used to identify SNPs. To resolve the phylogenetic position of P. taeda within the genus Pinus, 14 published Pinus species plastomes were downloaded from the NCBI database for phylogenetic analysis. Multiple alignments of the complete plastomes were constructed based on the conserved structure and gene order of the plastid genomes [63], and four methods were employed to construct phylogenetic trees, including Bayesian inference (BI), which was implemented using MrBayes 3.1.2 [64], maximum parsimony (MP), which was implemented using PAUP 4.0 [65], and maximum likelihood (ML) and neighbor-joining (NJ), which were implemented using MEGA 6 [55] using previously described settings [66,67]. In a second phylogenetic analysis, 60 shared cp genes from 15 Pinus species, including P. taeda, and one outgroup species (Juniperus bermudiana) were aligned using ClustalX with default settings, followed by manual adjustment to preserve the reading frames. Finally, the same four phylogenetic inference methods were used to infer trees from the 60 concatenated genes using the same settings [66,67].

Category
Group of genes Name of genes

Selfreplication
Large subunit of ribosomal proteins rpl2, 14,16,20,22,23,32,33,36 Small subunit of ribosomal proteins rps2, 3,4,7,8,11,12,14,15,18,19 DNA-dependent RNA polymerase  spliced gene, with the N-terminal exon-I located at 92 Kb from C-terminal exons-II and III as reported previously for various gymnosperms [70]. The protein coding regions containing 81 genes were 61,691 bp and accounted for 50.76% of the P. taeda cp genome. In the P. taeda cp genome, the gene proportion for tRNA was 2.18% and for rRNA it was 3.71%. A total of 43.35% of the non-coding region was composed of introns and intergenic spacers. The total protein-coding sequences encoded 20,563 codons ( Table 4). The codon-usage frequency was calculated based on protein-coding and tRNA gene sequences (Table 5). Leucine was the most coded (2,067, 10.1%) and cysteine was the least coded (244, 1.2%) amino acid (Fig 2). Similar ratios for amino acids were found in previously reported cp genomes [71,72]. The maximum GAA (835; 4.06%) and minimum TGC (65; 0.316%) codons used coded for glutamic acid and encoding cysteine, respectively. The A-T content was 50.6%, 59.99%, and 69.97% at the three consecutive codon positions ( Table 4). The preference for the high A-T content at the 3 rd codon position is similar to the A and T concentrations reported in various terrestrial plant cp genomes [72][73][74].

Difference in gene contents of P. taeda
We selected 16 cp genomes in the Pinus genus (P. taeda (old), P. koraiensis, P. sibirica, P. armandii, P. lambertiana, P. krempfii, P. bungeana, P. gerardiana, P. monophylla, P. nelsonii, P. contorta, P. massoniana, P. tabuliformis, P. taiwanensis, P. strobus, and P. thunbergii) for comparison with P. taeda (new) (121,531 bp). Pinus taeda had the largest genome. The differentiation can be ascribed to the variation in size of LSC ( Table 1). Analysis of known genes functions revealed that P. taeda shared 60 different protein-coding genes with 15 other Pinus species. Furthermore, pairwise alignment between the cp genome of P. taeda and six related cp genomes showed the highest synteny. Annotation of the P. taeda cp genome was used for plotting the total sequence identity of the six cp genomes of Pinus species in mVISTA (Fig 3). The results revealed high sequence identity with five species from the subgenus Pinus (P. contorta, P. massoniana, P. tabuliformis, P. taiwanensis, and P. thunbergii) compared to P. armandii from the subgenus Strobus. However, for all species, relatively lower identity was observed in various comparable genomic regions, particularly the trnK-UUU, matK, atpI, rpl16, petB, petD, ycf1, and ycf2 regions (Fig 3). Similarly, non-coding regions exhibited greater bifurcation than the coding-regions. Among the diverging regions, psbA-chlB, psbM-clpP, ycf4-accD, ycf3-psaA, psaC-ccsA, ndhH-psaC, ycf3-psaA, trnG-UUU-chlL, and petL-psbF were significant. The current findings agree with the results previously reported for these genes in angiosperm cp genomes [43,72]. Our results confirmed similar variations among the coding-regions of the investigated species. This was also suggested by Kumar et al. [75]. Furthermore, comparison of the P. taeda whole cp genome with those of related species revealed lower SNP and indel substitutions for the subgenus Pinus cp genomes, which ranged from 809 in P. taeda (old) to 2,636 in P. thunbergii. However, the results revealed higher SNP and indel substitutions within the subgenus Strobus cp genomes, which ranged from 9,211 in P. gerardiana to 19,196 in P. monophylla (S2 Table). These results indicate the presence of interspecific mutations in the highly conservative cp genome that may be useful for analyzing genetic diversity and evolution. Similarly, we evaluated pairwise-sequence differentiation among the 16 pine species (S3 Table).
The results showed that the P. taeda genome had 0.0274 average sequence divergences, high divergence was detected for P. nelsonii (0.0402), and P. taeda (old) had the lowest average sequence divergence (0.00321) followed by P. contorta (0.00807).
The gene organization and gene contents of the cp genomes are generally conserved compared with those in the mitochondrial and nuclear genomes [76]. The cp genome organization and structure are extremely conserved in angiosperms, i.e. there is a distinctive quadripartite structure containing an SSC region and LSC region separated by a pair of inverted repeats  [81][82][83][84][85][86]. In contrast, rps16 is present in the angiosperms Oryza sativa and E. globulus, in the fern Adiantum capillus, and in the gymnosperms C. japonica and C. taitungensis. However, the position of rps16 is different in gymnosperms from that in angiosperm cp genomes. The position is intermediate between chlB and trnK-UUU in the gymnosperm cp genomes and halfway between trnQ-UUG and trnK-UUU and between chlB and matK in angiosperms and ferns, respectively. Doyle et al. [83] suggested the functional transfer of rps16 to the nucleus from chloroplasts and the absence of this gene from various terrestrial plants. Furthermore, it was reported that the loss of rps16 and its functional transfer to the nucleus may have occurred autonomously in gymnosperms, particularly in coniferous species.
trnR-CCG and trnP-GGG are also found in P. taeda cp genomes. These genes are reported as pseudo genes and are likely relics of cp genome evolution in mosses and gymnosperms [29, 87, 88]. trnP-GGG was previously reported in two gymnosperms, C. taitungensis and P. thunbergii, as well as in C. japonica, in the fern A. capillus and liverwort M. polymorpha, and but was absent from the cp genomes of angiosperms. This gene was also identified in Ginkgo and Gnetum [34], revealing that the gene is common in numerous gymnosperm species. Similarly, trnR-CCG in P. taeda was previously reported in C. taitungensis, A. capillus, P. thunbergii, and M. polymorpha. However, the absence of this gene in C. japonica and various cp genomes of angiosperms suggests that trnR-CCG is not well-maintained in the cp genomes of all gymnosperms and may have been lost in various taxa during plant evolution [79].
Furthermore, clpP, which encodes a proteolytic subunit of the ATP-dependent clpP protease, contains no intron in the P. taeda cp genome. Similar results were previously reported for P. thunbergii, P. mugo, P. dabeshanensis, and P. taiwanensis [37,41,68,89]. In contrast, clpP is found in the cp genome of other land plants, such as A. capillus, E. globulus, M. polymorpha, and C. taitungensis with two or three exons [29]. However, in the P. taeda cp genome, only the clpP second exon remained, and as such, it occurs as a pseudogene. Similarly, the rpl20 and clpP order is conserved in the P. taeda cp genome and clpP is co-transcribed with the 5'-end of rps12 and rpl20, as reported previously for the cp genomes of various gymnosperms [90,91] [92]. accD encodes acetyl-CoA-carboxylase and has been found in the P. taeda cp genome. The reading frame length of accD was similar to that of the cp genomes of other Pinaceae members and has 321 codons, which is fewer than that in C. japonica (700 codons) and more than the 309 codons of A. capillus and 316 codons of M. polymorpha. Furthermore, in angiosperms, particularly monocots, the reading-frame size of accD has been reduced from 106 codons in Oryza sativa to none in Zea mays. This has also been suggested as reason for the loss of accD in monocot plant species [93]. In contrast, the accD reading-frame in gymnosperms, particularly in coniferous species and C. japonica, may have diverted in the ascending direction.

Loss of large IR region within the P. taeda cp genome
The large inverted repeat regions, which have been reported in various land plant cp genomes, were reduced to two very short inverted repeat (IRa and IRb) regions of 830 bp in P. taeda, and were separated by a SSC region of 42,258 bp and LSC region of 77,614 bp (Fig 1). However, in the previously sequenced P. taeda cp genome submitted to NCBI, the short inverted repeat regions were 693 bp (Table 1). Similar results were observed in other Pinaceae members, such as P. taiwanensis, P. armandii, and P. dabeshanensis, where the inverted repeat sizes were reduced to 513, 475, and 473 bp, respectively [68,69,89]. The IR of P. taeda contained duplicated psaM and trnS-GCU and partial ycf12, apparently caused by incomplete loss of the large IR, as reported previously for various gymnosperms [36,37]. Detailed comparison of four junctions (J LA , J LB , J SA , and J SB ) between the two IRs (IRa and IRb) and two single-copy regions (LSC and SSC) was performed between Pinus species (P. contorta, P. tabuliformis, P. massoniana, P. taiwanensis, and P. thunbergii) and P. taeda by carefully analyzing the exact IR border positions and adjacent genes (Fig 4). Some IR expansion and contraction were observed in the P. taeda cp genome compared to that of the other five Pinus species, which ranged from 358 bp (P. contorta) to 845 bp (P. tabuliformis) (Fig 4). The genes marking the beginning and end of the IRs were only partially duplicated. psbI in P. taeda was located 9 bp from J LB in the LSC region. In P. contorta, P. tabuliformis, and P. taeda (old), this distance was 6 bp, whereas in P. massoniana and P. taiwanensis the distances were 26 and 338 bp, respectively. However, variation was found in P. thunbergii, and rpl23 was 100 bp away from J LB in the LSC region. Similarly, hypothetical chloroplast ycf12 was partially duplicated by 47 bp (P. taeda) and 35 bp in P. tabuliformis. However, in P. massoniana, ycf12 was located in the SSC region, 385 bp away from J SB . In P. taeda and P. tabuliformis, J LA was located between psaM and psbB and the difference in distance between psaM and J LA was 395 bp. However, in P. contorta and P. taiwanensis, psaM was located in the SSC region, whereas in P. massoniana, it was located at the J SA border (Fig 4). Similarly, in P. taeda, P. contorta, P. tabuliformis, P. massoniana, and P. taiwanensis, psbB was located in the LSC region at 478, 477, 505, 526, and 843 bp away from the J LA border, respectively.
Large IRs play a significant role in stabilizing and maintaining the conserved structure of the cp genomes [94]. Various studies have reported that during the evolutionary process of angiosperms, a copy of an IR was lost, particularly in the subfamily Papilionoideae [95-97], and rearrangement in the chloroplast genome was observed because of IR loss in these genomes as compared to cp genomes with normal IRs [94]. Similarly, in gymnosperms, complete IRs were lost in conifers, particularly in cupressophytes and Pinaceae cp genomes, and greater rearrangement was observed in these genomes compared to in higher plants [33]. The remaining IR parts in various Pinaceae member and cupressophyte cp genomes were shown to differ, suggesting that these two conifer clades lost their large IRs independently during evolution from a common ancestor [78,98]. Previously, it was reported that specific repeats in Pinaceae replaced the reduced IRs [99]. Compared to other conifers, a greater number of rearrangements occurred in Pseudotsuga menziesii and P. radiate cp genomes because of the lack of a large IR in these cp genomes [33]. Therefore, variation in the genome structure between P. taeda and related terrestrial plants, such as C. japonica, suggest that an IR is essential for structural stability of the cp genome.

Repeat analysis
Repeat analysis of the P. taeda cp genome revealed six palindromic repeats, 34 forward repeats, and 22 tandem repeats (S1 Fig and Table 6). Among these, three forward repeats were 45-59 bp in length, with 14 tandem repeats of 15-29 bp in length (S1 Fig). Additionally, two palindromic repeats were 75-89 bp and four repeats were >90 bp (S1 Fig). Overall, 62 repeats were found in the P. taeda cp genome. Among tandem repeats, 12 repeats were in coding regions, eight repeats in intergenic regions, one repeat extending from an intergenic region into a coding region, and one repeat in the petB intron region ( Table 7). The length of tandem repeats in these regions varied between eight and 14, and up to 10 repeat units were present. Various numbers of repeats have been identified in conifer cp genomes [100, 101] and the mechanisms implicit in the origin of these tandem repeats remain unclear. Nevertheless, they are known to be associated with chloroplast DNA rearrangement [102], gene expansion [100, 101], and gene duplication [103]. Previous reports suggested that repeat sequences, which play a role in genome rearrangement, are very helpful in phylogenetic studies [74,104]. Furthermore, analyses of different cp genomes revealed that repeat sequences are important causes of indels and substitutions [101]. Sequence variation and cp genome re-arrangement occurs because of the slipped strand mis-pairing and improper recombination of repeat sequences [104-106]. The presence of such repeats shows that the locus is an important hotspot for cp genome re-configuration [74,107]. In addition, such repeats contain crucial information for developing genetic markers for phylogenetic and population studies [74].

SSR analysis
SSRs are repeating sequences of typically 1-6 bp that are distributed throughout the genome. SSRs generally have a high mutation rate compared to neutral DNA regions because of slipped-strand mispairing. Because these short repeats are uniparentally inherited and haploid, they can be used as molecular markers in genetic studies analyzing population structures [108,109]. In this study, we detected perfect SSRs in the P. taeda cp genome (Fig 5). Specific attributes were set for the analysis because SSRs (10 bp or longer) are exposed to slipped strand mis-pairing, the main mechanism of SSR polymorphisms [110][111][112]. A total of 151 perfect microsatellites were found in the P. taeda cp genome (Fig 5). Most (71) SSRs in this cp genome possessed a mononucleotide repeat motif. Dinucleotide SSRs were the second most common repeat motif (Fig 5B). Using our search criterion, four tetranucleotide SSRs and one hexanucleotide SSR were detected in the P. taeda cp genome (Fig 5A). In P. taeda, most mononucleotide SSRs were A (92.5%) and C (8.45%) motifs, with most dinucleotide SSRs being A/T (47.3%) and A/G (52.63%) motifs ( Fig 5B and Table 8). Approximately 59.60% of SSRs were in non-coding regions, approximately 2.64% were present in rRNA sequences, and 1.98% were in tRNA genes (Fig 5A). These results are similar to those of previous reports showing that SSRs were unevenly distributed in cp genomes, and these findings may provide more information for selecting effective molecular markers for detecting intra-and interspecific polymorphisms [113][114][115][116]. Furthermore, analysis of various gymnosperm cp genomes revealed that most mononucleotides and dinucleotides are composed of A and T, which may contribute to bias in base composition, which is consistent with other cp genomes [117][118][119]. For SSR identification, although different criteria and algorithms were used, their distribution and characteristics were similar to the cp genomes of conifers [71,119], 30 asterid [72], and 14 monocot [112]. Our findings were comparable to those of previous reports in which SSRs in cp genomes were found to be largely composed of polythymine (polyT) or polyadenine (polyA) repeats, and infrequently contained tandem cytosine (C) and guanine (G) repeats [118,120]. Loblolly pine chloroplast genome Therefore, these SSRs contributed to the A-T richness of the P. taeda cp genome, which was also previously observed in the cp genomes of plant species [43, 71,120]. The SSRs identified in the cp genome of P. taeda can be evaluated for polymorphisms at the intra-specific levels and used as markers for evaluating the genetic diversity of wild populations of plants from the Pinaceae family.

Phylogenetic analysis
In plants, the cp genome is a valuable resource for exploring intra-and interspecific evolutionary histories [121][122][123][124][125][126][127]. Compared to nuclear genomes in chloroplasts, the uniparental inheritance (for exceptions, see [122,128]) is systematically striking because a single, independent genealogical history can be readily obtained for developing hypotheses [129][130][131]. Moreover, in some land plants (a few flowering plant lineages and conifers), the chloroplast is paternally inherited and independent of the nuclear and mitochondrial genome [132]. Recently, cp genomes have shown significant power in phylogenetic, evolution, and molecular systematics studies. During the last decade, various analyses have revealed the phylogenetic relationships at deep nodes based on comparisons of multiple protein coding genes, intergenic spacers [133,134], and complete genome sequences in chloroplast genomes [135] that have enhanced our understanding of the evolutionary relationships among angiosperms and gymnosperms. According to the most recent classification, the genus Pinus is comprised of approximately 110 species and is shared by two subgenera, Strobus and Pinus, which are divided into further sections [136]. Furthermore, some evolutionary hypotheses suggest that the subgenera Strobus and Pinus originated from the Eocene [137,138], whereas others indicated these subgenera were already present during the Cretaceous [138][139][140]. The Pinus subgenus has undergone significant distributional as well as environmental changes during their evolution, such as moving multiple times between America and Eurasia [140]. Chloroplast DNA polymorphisms in P. taeda have been used in numerous studies to assess paternal inheritance lineage and cytoplasmic diversity [141][142][143][144][145][146]. Continued efforts have expanded our ability to differentiate and understand the genomic structure and phylogenetic relationships of Pinus species [147]. The phylogeny and taxonomy of Pinus species have largely relied on chloroplast markers [140,148,149]. However, compared to nuclear genes, these markers are linked and offer independent information on species phylogeny. Previously, the phylogenetic study of pine based on multiple nuclear genes was reported by Syring et al. [150], where four lowcopy nuclear loci were analyzed in 12 pine species and combined with internal transcribed spacers and chloroplast data. Various studies revealed that the addition of more genes increased the chance for improving the phylogenetic tree [151][152][153]. However, this does not resolve all phylogenetic problems [154,155].
Complete genome sequencing provides detailed insight into an organism [43, 66,156]. In this study, the phylogenetic position of P. taeda within the Pinus genus was established by employing the complete cp genome and 60 shared genes of 16 species. Phylogenetic analyses using Bayesian inference, maximum parsimony, maximum likelihood, and neighbor-joining methods were performed. The phylogenetic analysis revealed that the complete dataset and 60 shared genes of P. taeda contained the same phylogenetic signals. In the datasets for the genome and 60 shared genes, P. taeda formed a single clade with P. contorta with high Bayesian interference and bootstrap support using the four different methods (Fig 6 and S2 Fig). Moreover, tree topology confirmed the relationship inferred from the phylogenetic work previously conducted based on cp genomes [89, 141,157], in which P. taeda was genetically similar to P. contorta. These results revealed good agreement with classical taxonomy, where similar concordance was observed in the cp genome and mitochondrial genome-based reconstructions of Pinus phylogeny [136,140]. Furthermore, these results are in broad agreement with previous results reported by Niu et al., where P. taeda formed a single clade with P. contorta based on pairwise non-synonymous substitution rates of orthologous transcripts [158]. Additionally, the results suggest that there is no conflict between the entire genome dataset and 60 shared genes in these cp genomes.

Conclusion
The current study determined the complete genome sequence of the chloroplast from P. taeda (121,531 bp). The gene order and genome structure of P. taeda was similar to that of cp genomes of other Pinus species. Furthermore, the distribution and location of repeat sequences were determined, and average pairwise sequence divergences among cp genomes of related species were identified. SSR, SNP, and phylogenetic analyses were performed on 16 Pinus species cp genomes. No major structural rearrangement of Pinus species cp genomes was observed. Phylogenetic analyses revealed that the dataset based on 60 shared genes and that of the entire genome generated trees with the same topologies regarding the placement of P. taeda. Such investigations are an essential source of important information on the complete cp genome of P. taeda and related species, which can be used to facilitate biological study, identify species, and clarify taxonomic questions.
Supporting information S1