Comparative analysis of the complete chloroplast genomes of seven Populus species: Insights into alternative female parents of Populus tomentosa

Populus tomentosa, of section Populus, is distributed mainly in northern China. This species has high resistance to many diseases and insects, and it plays key roles in shelterbelts and urban afforestation in northern China. It has long been suspected to be a hybrid, but its parents remain unknown. In the present study, we report four newly sequenced complete cp genomes from section Populus and comparative genomic analyses of these new sequences and three published cp genome sequences. The seven cp genomes ranged from 155,853 bp (P. tremula var. davidiana) to 156,746 bp (P. adenopoda) in length, and their gene orders, gene numbers and GC contents were similar. We analyzed SNPs, indels, SSRs and repeats among the seven cp genomes, and eight small inversions were detected in the ndhC-trnV, rbcL-accD, petA-psbJ, trnW-trnP, rpl16-rps3, trnL-ycf15, ycf15-trnL, and ndhF-trnL intergenic regions. Furthermore, seven divergent regions (trnH-psbA, matK, psbM-psbD, ndhC-trnV, ycf1, ndhF-ccsA and ccsA-ndhD) were found in more highly variable regions. The phylogenetic tree reveals that P. tomentosa is closely related to P. alba and P. alba var. pyramidalis. Hence, P. alba was involved in the formation of P. tomentosa.


Introduction
Species of the genus Populus, family Salicaceae, are collectively known as poplars and cottonwoods and play important economic and ecological roles due to their rapid growth rates, easy vegetative propagation, small genome size, importance as a timber source and other features [1,2]. The genus Populus is classified into 29 species belonging to six sections (Abaso, Aigeiros, Leucoides, Populus, Tacamahaca and Turanga), and it is the most widely distributed genus of woody plants in the world [3,4]. However, because poplars readily undergo interspecific a1111111111 a1111111111 a1111111111 a1111111111 a1111111111
Based on the five known Populus cp genome sequences (of P. alba, P. balsamifera, P. euphratica, P. tremula and P. trichocarpa) [29][30][31], the cp genomes of four Populus species were amplified with 33 primers (Table 1) using LA-PCR with Takara PrimeSTAR GXL DNA polymerase (TAKARA BIO INC., Dalian, China) following the method described by Yang [32]. A different 16 bp barcode sequences (Pacific Biosciences) was added to the primers of each of the four accessions P. adenopoda, P. alba var. pyramidalis, P. × hopeiensis and P. tomentosa ( Table 2). The PCR products were subjected to next-generation sequencing at Nextomics Biosciences, and gaps were filled by PCR amplification and Sanger sequencing. The assembled genome sequences were preliminarily annotated in Geneious R8, and the start and stop codons were manually adjusted. The tRNA genes were further confirmed through the online tRNAscan-SE web server [33]. The gene map of the annotated Populus cp genome was drawn by OGdraw online [34].

Codon usage
To examine deviations in synonymous codon usage by avoiding the influence of amino acid composition, the relative synonymous codon usage (RSCU) was detected using MEGA 5 software [35]. Because short protein-coding genes (CDS) generally result in large estimation errors for codon usage, CDS shorter than 300 bp in length were excluded from the codon usage calculations to avoid sampling bias [36]. Finally, 58 CDS for each cp genome were analyzed in this study.

SSR and long repeat sequence analysis
Microsatellites in the seven Populus cp genomes were detected using MISA [37] with the minimal repeat number set to 12, 6, 5, 5, 5 and 5 for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotides, respectively. All of the repeats were manually verified. We used the online REPuter software [38] to identify and locate forward repeat (F), reverse repeat (R), complemented repeat (C) and palindromic repeat (P) sequences. The following settings for repeat identification were used: (1) Hamming distance equal to 3; (2) minimal repeat size, 30 bp; and (3) maximum computed repeats, 90 bp.

Sequence divergence analysis
To investigate divergence in cp genomes, identity across the whole cp genomes was visualized using the mVISTA viewer in the Shuffle-LAGAN mode among the seven accessions with P. adenopoda as the reference. MAFFT version 7.037 software [39] was used to align the seven cp genome sequences of Populus: P. adenopoda, P. alba, P. alba var. pyramidalis, P. tremula var. davidiana, P. × hopeiensis, P. tomentosa and P. tremula. After manual adjustment with BioEdit software, we performed sliding window analysis to assess variability (Pi) throughout the cp genomes using DnaSP version 5 software [40]. The window length was set to 600 bp and the step size was set to 200 bp. Single nucleotide polymorphisms (SNPs) and indels were detected using the "find variation" in Geneious R8. Inversions were manually detected using the BioEdit software. There were a total of 21 pairwise alignments for the seven cp genomes.

Complete cp genomes of Populus species
The seven cp genomes ranged in size from 155,853 bp (P. tremula var. davidiana) to 156,746 bp (P. adenopoda) (Fig 1 and (Table 3). Gene content and order were very similar among the cp genomes of the seven accessions and similar to those of other published cp genomes [43,44,48,49].
The four cp genomes all encode 130 genes with the same gene order and gene clusters. Among these genes, 112 are unique genes, including 78 protein-coding genes, 30 tRNA genes and 4 rRNA genes, except in P. alba, which has 111 unique genes (77 protein-coding genes, 30 tRNA genes and 4 rRNA genes). Twelve distinct genes (atpF, ndhA, ndhB, petB, rpl2, rpoC1, trnK-UUU, trnG-UCC, trnL-UAA, trnV-UAC, trnI-GAU, trnA-UGC) contain one intron and three genes (clpP, ycf3 and rps12) contain two introns. All annotated genes are listed in S1 Table. The overall GC contents range from 36.73% (P. × hopeiensis) to 36.79% (P. adenopoda and P. alba var. pyramidalis). However, the GC content is unequally distributed in the Populus cp genomes; it is highest in the IR regions (41.95-42.01%), moderate in the LSC regions (34.52-34.64%) and lowest in the SSC regions (30.40-30.67%) ( Table 3). The IR regions have the highest GC content due to the presence of eight rRNA sequences in the IR regions [19]. Furthermore, the AT content of the seven cp genomes was 55.1%, 62.6% and 70.5% at the first, second, and third codon positions, respectively, within protein-coding positions (Table 4). Overall, these seven cp genomes show a high conservation of all genome features, such as gene content, gene order, exon-intron structure and GC content. The alignment analysis revealed that the cp genomes of the seven Populus accessions were highly conserved, and that no rearrangement of gene organization had occurred (Fig 2).

IR expansion and contraction
The IR regions are known to promote the stability of the other regions of the genome by intramolecular recombination between the two copies of the IRs, thus limiting recombination between the two single copy regions [50,51]. Comparison of the boundaries between the IRs and single copy regions of the seven Populus accessions revealed very small boundary differences. Four junctions (J LA , J LB , J SA and J SB ) lay between the two IRs (IRB and IRA) and the two single copy regions (LSC and SSC). We carefully compared the IR border positions and the adjacent genes among the seven cp genomes. Detailed comparisons of the IR-SSC (J SA ) and IR-LSC (J SB ) boundaries among the cp genomes of the seven species are presented in Fig 3. For J LA , the boundary was located between rps19 and the trnH gene. The variation in distances between rps19 and J LA was from 200 bp to 219 bp, and the distances in the four species P. adenopoda, P. alba, P. alba var. pyramidalis and P. tremula were the same. The distance between trnH and J LA was consistent at 3 bp except in P. adenopoda, where it was 14 bp. The ycf1 gene   (7) 83 (6) 85 (7) 85 (7) 85 (7) 85 (7) 85 (7) Number of tRNA genes 37 (7) 37 (7) 37 (7) 37 (7) 37 (7) 37 (7) 37 (7) Number

Codon usage
Codon usage was calculated for the protein-coding genes present in the Populus cp genome. Most protein-coding genes employ the standard ATG as the initiator codon. However, ATA,  ATC, TTC, and ATT are also used as alternative start codons [52]. Among the Populus protein-coding genes, two genes were used as alternative start codons ATC for rpl16 and GTG for ndhD.
The codon usage patterns of the 58 distinct protein-coding genes in the seven cp genomes were examined. The cp genomes of P. alba var. pyramidalis and P. tomentosa were consistent, each with a length of 75,960 bp and containing 25,320 codons, whereas those of P. adenopoda, P. alba, P. tremula var. davidiana, P. × hopeiensis, P. tremula were 75,588 bp, 74,778 bp, 75,930 bp, 75,840 bp and 75,933 bp, respectively, in size and contained 25,196,24,926,25,310,25,280 and 25,311 codons, respectively, as shown in S2 Table. As an important indicator of codon usage bias, the RSCU value is the frequency observed for a codon divided by its expected frequency [53]. The values are divided into four categories: RSCU value of less than 1.0 (lack of bias), RSCU value between 1.0 and 1.2 (low bias), RSCU value between 1.2 and 1.3 (moderate bias) and RSCU value greater than 1.3 (high bias) [54,55]. Our results showed that the RSCU values corresponding to the usage of 31 codons in the seven accessions showed preferences (<1) for except methionine (Met) and tryptophan (Trp), with 29 codons having A/T in the third codon position. All three stop codons were present, with UAA being the most frequent stop codon in all seven cp genomes (S2 Table). In addition, our results indicated that leucine (Leu: 10.70%, 10.65%, 10.67%, 10.67%, 10.67%, 10.66%, 10.66% and 10.68%) and cysteine (Cys: 1.13%, 1.11%, 1.14%, 1.14%, 1.14%, 1.14%, 1.14% and 1.14%) were the most and least commonly encoded amino acids, respectively, in all seven cp genomes (Fig 4 and S2 Table).

SSR and long repeat analyses
Simple sequence repeats (SSRs) can be used as genetic markers in population genetics and evolutionary studies of closely related species, because of their high variability at the intraspecific level [56][57][58]. The number of cp genomes SSRs (cpSSRs) ranged from 26 to 46 among the seven Populus accessions (Fig 5A). The number of cpSSRs in P. tomentosa (26) was the same as that in P. alba var. pyramidalis (26), and the numbers of cpSSRs in the three accessions P. alba, P. tremula var. davidiana and P. × hopeiensis were similar (S3 Table). The mononucleotide repeat (P1) number with the highest variability ranged from 23 (P. alba var. pyramidalis and P. tomentosa) to 39 (P. tremula), and all of the P1s were composed of poly A (polyadenine) and poly T (polythymine) repeats (Fig 5B and S3 Table). Research has shown that, in the cp genome, SSRs are generally composed of polythymine (poly T) or polyadenine (poly A) repeats and infrequently contain tandem cytosine (C) and guanine (G) repeats [59,60]. In addition, all the dinucleotide repeat (P2) sequences in the seven accessions were AT repeats. In total, 74.83% SSRs were detected in the LSC region, 13.85% in the IR regions and 12.12% in the SSC region ( Fig 5C). In general, the cpSSRs of the seven Populus accessions represented abundant variation and will be useful for assays detecting polymorphisms at the population level for inferring distant phylogenetic relationships among Populus species [61].
Long repeat sequences have important roles in cp genome evolution and genome rearrangements and can be informative in phylogenetic studies [62]. Four repeat types were detected in the cp genome using REPuter software. However, complement repeats (C) were only identified in P. alba, P. alba var. pyramidalis, P. tremula var. davidiana and P. × hopeiensis which had four, two, one and four repeats, respectively. Nineteen forward repeats (F), three reverse repeats (R) and 14 palindrome repeats (P) were discovered in P. tomentosa. The repeat numbers of the other six cp genomes are shown in Fig 6A and S4 Table. The repeats were mostly distributed in the intergenic spacer (IGS) and intron sequences (S4 Table). Among these repeats, 290 (80.78%) had lengths of 30-39 bp, and only five (1.39%) were longer than 100 bp (Fig 6B). The presence of these repeats indicates that the locus is a crucial hotspot for genome reconfiguration [63,64]. Furthermore, these repeats are an informative source for developing genetic markers for phylogenetic and population studies [64].

Genome variation
SNP markers are the most abundant type of mutation and the most important marker for species identification [65]. Indels not only play an important role in elucidating genome evolution [66,67], but also have potential value in constructing phylogenies [68,69]. In this study, we compared these polymorphisms among the seven cp genomes. The numbers of nucleotide substitutions and indels varied from 44 to 274 and 129 to 252, respectively, and most mutations were located in noncoding regions (Table 5, S5 and S6 Tables).
In searching for SNPs and indels, we found little differences among the cp genome sequences of P. tomentosa, P. alba and P. alba var. pyramidalis, which had similar mutation models (S5 and S6 Tables). Interestingly, there were always more transitions (Ts) than transversions (Tv) in Populus except in two pairwise comparisons (P. tomentosa vs. P. alba and P. tomentosa vs. P. alba var. pyramidalis). Transitions (Ts) occurred at higher frequencies than did transversions (Tv) in almost all DNA sequences; transition/transversion (Ts/Tv) bias is a general property of DNA sequence evolution [70]. In the gene coding regions, seven genes (atpB, ndhD, ndhF, rpoB, rpoC2, rps8 and ycf1) were found to have SNP mutations, and four of genes had more synonymous substitutions than nonsynonymous substitutions between P. alba var. pyramidalis and P. tomentosa (S7 Table). In addition, 13 genes had SNP mutations Chloroplast genome of seven Populus species between P. tomentosa and P. alba (S7 Table). Therefore, the phylogenetic relationships of these species may be affected by different mutation models.

Small inversions
Small inversions in the cp genomes of angiosperms are ubiquitous and are commonly associated with a hairpin secondary structure [71,72]. Small inversions are generally detected by performing pairwise comparisons between sequences of closely related taxa [71]. In this study, a total of eight small inversions were uncovered based on the sequence alignment of the seven complete cp genomes, of which five were located in the LSC region, two were located in the IR regions, and one was located in the SSC region. In addition, eight small inversions were detected in the ndhC-trnV, rbcL-accD, petA-psbJ, trnW-trnP, rpl16-rps3, trnL-ycf15, ycf15-trnL and ndhF-trnL intergenic regions. The number of small inversions among the 21 pairwise alignments ranged from one to six. There was one small inversion between P. alba var. pyramidalis and P. tomentosa located in petA-psbJ and one between P. tremula var. davidiana and P. tremula, which was located in ndhF-trnL (S8 Table).  Genome sequence divergence among the seven Populus accessions. We used mVISTA to perform a sequence identity analysis with P. adenopoda as a reference (Fig 7). The alignment revealed high sequence similarity across the cp genomes, which suggests that they are highly conserved. To investigate the levels of sequence divergence, we calculated the levels of genetic divergence among the cp genomes of the seven accessions using DnaSP software. The pairwise nucleotide divergence values between two of the seven cp genomes varied from 0.00028 to 0.00164 (Table 6), with a mean of 0.00103. Using sliding window analysis, we identified the seven most divergent regions (Pi>0.005): trnH-psbA, matK, psbM-psbD, ndhC-trnV, ycf1,  ndhF-ccsA, and ccsA-ndhD (Fig 8). Further work is necessary to determine whether these seven variable loci can be used in Populus phylogenetic analyses or serve as excellent candidate markers for population genetic and phylogenetic analysis [73].

Phylogenetic analysis
Cp genomes provide abundant resources, that are useful for evolutionary, taxonomic, and phylogenetic studies [20,60,74]. Whole cp genomes and protein-coding genes have been successfully used to resolve phylogenetic relationships at almost every taxonomic level during the past decade [60,75].
The complete cp genome sequences of the seven Populus accessions and the three published complete cp genomes of members of section Populus (P. qiongdaoensis (KX534066), P. rotundifolia (KX425853), P. tremula × alba) were used for phylogenetic analysis, with P. euphratica (KJ624919), P. ilicifolia (KX421095), Salix babylonica (KT449800) and Salix paraplesia (MG262366) included as outgroups. ML and BI nucleic acid analyses were performed, and the results are summarized in Fig 9. The two topologies show similar phylogenetic patterns. Each topology divided the 10 Populus accessions into two clades. The first divergent clade contained P. adenopoda, P. alba, P. alba var. pyramidalis, and P. tomentosa, and the second contained P. rotundifolia, P. × hopeiensis, P. tremula var. davidiana, P. qiongdaoensis, P. tremula, and P. tremula × alba. The phylogenetic tree revealed that P. tomentosa was closely related to both P. alba var. pyramidalis and P. alba (bootstrap support = 100% and BI = 1.0).
According to the published Flora of China [76], the species of section Populus share a smooth bark but vary in bark color. In P. tomentosa, the color of the bark on the basal part of the trunk changes with plant age, shifting from dull gray to grayish green or grayish white and then to dark gray. The bark is grayish white in P. alba and P. adenopoda, grayish green or grayish white in P. tremula var. davidiana and P. tremula, and yellowish green to grayish white in P. × hopeiensis. In addition, the shape of the bud in Populus is ovoid or ovoid-globose and the shape of capsule is narrowly conical or long ovoid-ellipsoid; only P. tomentosa, P. alba and P. × hopeiensis have dense, white tomentose buds. In a previous analysis of bract and other characteristics, of the 22 P. tomentosa clones, the clones were divided into three populations, and the authors speculated that P. tomentosa is a natural hybrid of P. alba and P. tremula var. davidiana [15]. Zhang et al [17] compared 26 traits among P. tomentosa and five related species (including several varieties). Their comparisons revealed that P. tremula var. davidiana and P. alba var. pyramidalis may have participated in the formation of some natural types of P. tomentosa, although substantial variation exists among natural types. In addition, although many wild P. tomentosa ecotypes have arisen during the evolution of the species [77], 14 clones collected from throughout the species' entire natural ranges, clustered together in an amplified fragment length polymorphism (AFLP) marker analysis [78]. Wang et al [79] used 24 single-copy nuclear DNA sequences and 12 plastid fragments to reconstruct the phylogeny of Populus, which suggested that section Populus is a monophyletic group. The genus was divided into two distinct clades with maximum bootstrap support and posterior probability. The nuclear DNA phylogeny revealed a close relationship between P. tomentosa and P. adenopoda; however, in the plastid phylogeny, P. tomentosa and P. adenopoda belonged to two different clades. The authors speculated that in the hybridization event giving rise to P. tomentosa, the ancestor of P. tremula var. davidiana and P. × hopeiensis served as the maternal parent and P. adenopoda served as the paternal role. To clarify the origins of P. tomentosa and P. × hopeiensis, Wang et al [80] analyzed 10 nuclear DNA sequences and 6 cpDNA sequences from 392 individuals from 36 populations of 8 taxa (P. × hopeiensis, P. tomentosa, P. alba, P. adenopoda, P. tremula var. davidiana, P. tremula, P. tremuloides, and P. grandidentata). The authors aimed to improve the understanding of hybridization and introgression in section Populus. The results supported the division of P. tomentosa into two genetic types (mb1 and mb2) with different maternal parents; in both genetic types, P. alba acted as the male parent, whereas P. adenopoda and P. tremula var. davidiana acted as the maternal parent in mb1 and mb2, respectively. However, there is always a big controversy about the possible parent of P. tomentosa. RAPD and AFLP analyses have suggested that P. tomentosa is possibly a natural hybrid of P. alba and P. adenopoda [14][15], an interpretation highly consistent with observations of chromosomal behaviors during meiosis and pollen fertility [18]. Our analyses showed that P. alba var. pyramidalis is closely related P. tomentosa, and Yin [81] suggested that P. alba var. pyramidalis could be regarded as a variant of P. alba. Based on these findings, we speculate that P. alba was involved in the formation of P. tomentosa as a common female parent based on the cp genomes.

Conclusion
In the present study, we report four newly sequenced complete cp genomes from section Populus and comparative genomic analyses of these genomes and three other published cp genome sequences. The seven cp genomes were similar in structure and had a high degree of synteny. Comparison of seven cp genomes revealed seven divergent regions (trnH-psbA, matK, psbM-psbD, ndhC-trnV, ycf1, ndhF-ccsA and ccsA-ndhD) in the highly variable regions, which can be utilized as potential molecular markers for population genetic and phylogenetic studies in Populus. The location and distribution of SSRs and long repeat sequences were examined and shown to be similar and conserved among the genomes. In addition, a total of seven small inversions were detected in the ndhC-trnV, rbcL-accD, petA-psbJ, trnW-trnP, rpl16-rps3, trnL-ycf15, ycf15-trnL and ndhF-trnL intergenic regions. ML and BI phylogenetic trees based on the complete cp genome sequences indicated that P. tomentosa is closely related to both P. alba var. pyramidalis and P. alba. Thus, we speculate that P. alba was involved in the formation of P. tomentosa as a common female parent.
Supporting information S1