Characterization of 20 complete plastomes from the tribe Laureae (Lauraceae) and distribution of small inversions

Lindera Thunb. (Lauraceae) consists of approximately 100 species, mainly distributed in the temperate and tropical regions of East Asia. In this study, we report 20 new, complete plastome sequences including 17 Lindera species and three related species, Actinodaphne lancifolia, Litsea japonica and Sassafras tzumu. The complete plastomes of Lindera range from 152,502 bp (L. neesiana) to 154,314 bp (L. erythrocarpa) in length. Eleven small inversion (SI) sites are documented among the plastomes. Six of the 11 SI sites are newly reported and they locate in rpoB-trnC, psbC-trnS, petA-psbJ, rpoA and ycf2 regions. The distribution patterns of SIs are useful for species identification. An average of 83 simple sequence repeats (SSRs) were detected in each plastome. The mono-SSRs accounted for 72.7% of total SSRs, followed by di- (12.4%), tetra- (9.4%), tri- (4.2%), and penta-SSRs (1.3%). Of these SSRs, 64.6% were distributed in an intergenic spacer (IGS) region. In addition, 79.8% of the SSRs are located in a large single copy (LSC) region. In contrast, almost no SSRs are distributed in inverted repeat (IR) regions. The SSR loci are useful to identifying species but the phylogenetic value is low because the majority of them show autapomorphic status or highly homoplastic characteristics. The nucleotide diversity (Pi) values also indicated the conserved nature of the IR region compared to LSC and small single copy (SSC) regions. Five spacer regions with high Pi values, trnH-psbA, petA-psbJ and ndhF-rpl32, rpl32-trnL and Ψycf1-ndhF, have a potential use for the molecular identification study of Lindera and related species. Lindera species form a paraphyletic group in the plastome tree because of the inclusion of related genera such as Actinodaphne, Laurus, Litsea and Neolitsea. A former member of tribe Laureae, Sassafras, forms a clade with the tribe Cinnamomeae. The SIs do not affect the phylogenetic relationship of Laureae. This result indicated that ancient plastome captures may have contribute to the mixed intergeneric relationship of Laureae. Alternatively, the result may indicate that the morphological characters defined the genera of Lauraceae originated for several times.


Introduction
The large inversion (LI) of plastid genomes is occasionally reported from several plant families, such as Asteraceae [31], Fabaceae [32,33], Geraniaceae [34,35], Oleaceae [36,37], Passifloraceae [38], and Poaceae [39]. The LIs are often showed the systematic utilities because they occur in a clade of certain groups. Conversely, small inversion (SI)s occur extensively in any studied angiosperm plastome [31,40]. Previously, based on partial sequences, the presence of SIs was reported only in certain regions of plastome. In particular, it has been reported in the ndhB intron [41], trnH-psbA [42][43][44][45], petA-psbJ [46], rpl16 intron [47] and trnL-F regions [42,48]. SIs are not easily found because they are usually located in spacer regions and do not affect the gene order. Therefore, not only have there been few studies on SIs, but these studies have found a limited number of regions of SIs. As complete plastome studies have been developing, the areas where SIs are found are increasing. For examples, Kim and Lee (2005) compared the complete plastomes of four species of Poaceae and reported 16 SI regions [31]. Thereafter In this study, we report 20 complete plastome sequences from Lauraceae. Seventeen of them belong to Lindera and the other three species are Actinodaphne lancifolia (Blume) Meisn., Litsea japonica (Thunb.) Juss. and Sassafras tzumu (Hemsl.) Hemsl. In addition to our 20 new plastome sequences, all available plastome sequences in the NCBI database were compared and analyzed to investigate the systematic relationships of Lindera and closely related taxa. First, the degrees and locations of SIs of these plastomes were identified, and then whether these inversions affect the construction of phylogenetic trees was evaluated. Second, the hotspot regions of plastomes, which are important for interspecific relationships, were evaluated, and simple sequence repeats (SSRs) and dispersed repeat sequences are reported. Third, the phylogenetic relationships of the tribe Laureae using coding, noncoding, and all sequences, are compared. Finally, the evolution patterns of traits such as evergreen and deciduous are discussed.

Plant materials and DNA extraction
The leaves of 17 Lindera and three outgroup species used in this study were collected from Korea, China and Japan. All voucher specimens were deposited in the Korea University Herbarium (KUS). Their information is summarized in Table 1. Fresh leaves were collected and ground into powder in liquid nitrogen for Korean materials. Collected leaves are dried in silica gel and transported to lab for Chinese and Japanese materials. Total DNAs were extracted using a plant genomic DNA extraction kit (QIAGEN and iNtRON Biotechnology). The genomic DNAs were deposited in the Plant DNA Bank in Korea (PDBK).

Sequencing and annotation
Approximately 100 ng of extracted DNA were used for library construction and raw sequence reads were generated using Illumina MiSeq using reagent kit v3 (600-cycles) (Illumina, Inc. San Diego, CA). The raw read sequence data was deposited on NCBI Sequence Read Archive (SRA) under acc. nos. SRR10278371 -SRR10278390 (Table 1). The numbers of paired-endreads of 20 new complete plastome sequences ranged from 7,740,482 in Sassafras tzumu to 13,233,342 in Lindera glauca (Siebold & Zucc.) Blume ( Table 1). The average read length after trimming ranged from 258 to 287bp depending on the samples. For trimming and normalization of raw reads, BBDuk version 37.64 and BBNorm version 37.64, which were adopted in Geneious v. 11.1.2 (Biomatters Ltd.) [52] were used with kmer length of 27. All repeated reads were removed from trimmed reads by normalization process. The normalized reads are subjected to de novo assembly and then plastome contigs were recovered. All repeated reads were mapped to the plastome contigs and finally a single plastome contig was recovered for L. obtusiloba. The complete plastome of L. obtusiloba was assembled de novo at first. For other 19 plastomes, only the plastid reads were collected from trimmed reads using L. obtusiloba as a reference. The collected plastid reads were subjected to de novo assembly. In this way, a single plastome contig, which covers the whole plastome, was generated for the other 19 species. Annotation and mapping of protein coding genes (including exons and introns) and rRNA genes were performed using a BLAST search in the National Center for Biotechnology Information (NCBI). All tRNA genes were annotated using the tRNAscan-SE program [53]. Pseudogenes and deletions were determined by NCBI BLAST. The circular plastome maps were constructed by OrganellarGenomeDraw (OGDRAW) [54].

Plastome analysis
To locate the SIs of the 20 sequenced plastomes, we first identified the palindromic repeats that form the stem with longer than 4 bp loop regions using REPuter [55]. Their secondary structures and free energy were estimated using the mFOLD program [56]. For the same stem region, we identified it as SI if different species showed distinct loop sequence orientations. All of the 20 whole plastome sequences were aligned using MAFFT v. 7.017 [57].
Sliding window analysis was conducted to generate the nucleotide diversity (Pi) of complete Laureae genomes using DnaSP v. 6.10 software [58]. The step-size was set to 200 bp, with a 600 bp window length. All of the whole plastome sequences were aligned using MAFFT v. 7.017 [57]. The simple sequence repeats (SSRs) were detected with the Phobos v. 3.3.12 program [59]. We counted the SSR if it is repeated more than ten times for mono-, five times for di-, four times for tri-, three times for tetra-, and two times for penta-SSR loci.

Phylogenetic analysis
For the phylogenetic analysis of Lauraceae, we selected and downloaded 29 complete plastome sequences (28 Lauraceae and one Calycanthaceae plastomes) from the NCBI database (S1 Table). Out of the 29 complete plastome sequences, 11 are indicated as unverified sequences in the NCBI database. These sequences were used only for the construction of the Lauraceae phylogenetic tree using 49 taxa. The phylogenetic analysis was performed on a data set that includes 77 protein-coding genes and four rRNA genes. The 81 gene sequences were aligned separately with MUSCLE in Geneious v. 11.1.2 (Biomatters Ltd.) [52] and then concatenated as a single data matrix. We also constructed a phylogenetic tree for the 33 core Lauraceae group using 33 whole plastome sequences including all noncoding regions. The whole plastome sequences including all noncoding regions were aligned as a single data matrix with MAFFT v. 7.017 in Geneious v. 11.1.2. The GTR base substitution model was adopted based on the jModelTest2 [60] for maximum likelihood (ML) tree reconstruction using RAxML v.

Structures of the Lindera plastome
The ratio of plastid reads/total reads are ranged from 0.23% in L. metcalfiana C.K.Allen to 9.38% in L. sericea (Siebold & Zucc.) Blume ( Table 1). The differences are primary due to the leaf developmental stages. The sequencing coverage of 20 complete plastomes ranged from 44x (L. metcalfiana) to 2,020x (L. sericea) ( Table 1). We recovered single plastome contig, which covers whole plastome, on de novo assembly even for the lowest covered L. metcalfiana. It is primarily due to the long lead lengths and high coverage depths of Illumina MiSeq sequencing in our study.
The gene order and structure of the 20 plastomes are similar to those of a typical angiosperm (Fig 1) Table 2). All Lindera plastomes (excluding L. megaphylla Hemsl. and L. metcalfiana) comprised of 111 unique genes (77 protein-coding genes, 30 tRNA genes, and four rRNA genes). Sixteen genes had one intron and two genes (clpP and ycf3) had two introns. Seven protein-coding, seven tRNA, and four rRNA genes were duplicated in the IR regions. The A-T content of the Lindera plastomes was approximately 60.8% (Table 2).
Two genes were pseudogenized in Lindera and related genera. For example, the rpl22 gene is a pseudogene in all species. The rpl23 gene is a pseudogene in L. megaphylla and L. metcalfiana. The pseudogenized rpl23 was also reported in the parasitic Cassytha and non-parasitic Nectandra Rol. ex Rottb. of Lauraceae [20,21].
The contraction/expansion boundaries between SC and IR regions vary among the angiosperm species [64]. This causes differences in angiosperm plastome sizes. Our 20 plastomes  show similar SC and IR boundaries. Therefore, the variations in length among the 20 plastomes are minor. The LSC/IR boundary is located within the ycf2 coding region, and the SSC/ IR boundary is located within the ycf1 coding region, respectively. These results are consistent with previous studies for the Laureae tribe [20, 21, 65].
SIs are not easily found because they are usually located in the spacer regions and do not affect the gene order. Therefore, most plastome research do not mentioned the SIs. The presence of SIs was reported in the trnH-psbA, petA-psbJ and trnL-F regions  suggested regions (trnH-psbA, trnQ-rps16, rpoB-trnC, petA-psbJ and ndhA intron), but five SIs are located in other regions. The majority (nine of 11) of Lindera SIs were located on downstream of genes. The other two SIs are located gene coding region (no. 9) and intron region (no. 11). Six of 11 SIs were located downstream of two adjacent genes where the 3 0 ends of the two genes met tail-to-tail. However, the stem forming regions of SIs were closer to the 3 0 end of one of the genes. The other three SIs (nos. 1-3 in Fig 1) were located in the intergenic spacers between genes that had the same orientation (tail-to head orientation). These locations are generally accorded the previous prediction of SI locations in other plants [36]. The main function of stem-loop forming SI is the stability maintenance of the transcribed mRNA [36].
We also estimated the free energy (-ΔG) of each SI regions using the MFOLD program (Table 3). Two different orientations of all 11 SIs show identical free energy values. As a result, the flip-plop mutations of SIs are selectively neutral in evolution. Therefore, the flip-plop mutations occur easily on the same locus. We also estimated the number of flip-plop mutations for each SI on the ML tree using the ACCTRAN criteria of parsimony analysis (Fig 3). Five (nos. 3, 4, 5, 8, and 9) out of 11 SIs are autapomorphic characters. The other six SIs are synapomorphic characters, but their character states are changed several times ranging from five to seven times (Fig 3). As expected by free-energy values, the multiple changes of each SI character explain by easy flip-plop mutation at the stem region of hairpin structure. Therefore, the SIs are not strong phylogenetic markers to define the monophyletic groups. But, it shows strong molecular identification powers at species level.
SIs are always bounded with the hairpin structure of DNA sequences. The flip-plop mutation at the stem region create different orientation of loop sequences [63]. A single flip-plop mutation at the stem region generate several base pair differences at the loop region. Therefore, care should be used when regions where SIs are included, as these are used in the construction of phylogenetic trees, because incorrect phylogenetic signals may be given by such regions [63]. In particular, some areas, such as trnH-psbA [44,45], which are often used as interspecies markers, might be better to use after the removing the SI(s).
https://doi.org/10.1371/journal.pone.0224622.g003 to be attributable to frequent recombination between two copies in the IR region to continuously remove mutations. In plastid genomes, this positional effect acts more strongly than functional factors such as gene coding sequences (CDS), IGS, and intron regions. However, the functional effects act more strongly in the same LSC and SSC regions. For instance, all the regions with high Pi values mentioned above correspond to the IGS regions, not the CDS region, and most of the regions with Pi values exceeding 0.01 not mentioned above are also located in the IGS region (Fig 4). In the CDS region, Pi values were shown to be relatively high in the ndhF, ycf2 and ycf1 regions located at the LSC-IR-SSC junction, and this is considered attributable to IR contraction/expansion. In order to test the usefulness of the high Pi value regions for phylogenetic and DNA barcoding studies, we constructed the phylogenetic tree using the combined sequences of petA-psbJ, trnH-psbA, ndhF-rpl32 and rpl32-trnL-UAG intergenic spacer regions (IGS). Each of the four IGS regions shows Pi values more than 0.18. The aligned sequence of four regions was 4,734 bp in length and a ML (-ln L = 12859.707713) tree with bootstrap values more than 50% internal nodes were presented in S1 Fig. The tree shows almost fully resolved topology even the bootstrap supporting values are low on some internal nodes. Therefore, using the regions with high Pi values presented in this study will be helpful for studies of genealogy between closely related species or DNA barcoding studies to distinguish species.

Types and distribution of simple sequence repeats
Plastid simple sequence repeats (SSRs) have been used for molecular markers in plant population genetic studies [51, 68, 69] because they show high intraspecific variations. The copy number differences of SSRs are usually due to the slippage-mispairing during DNA replication [70]. In this study, we analyzed the SSRs of 20 newly sequenced plastomes (Fig 5 and S2 Table). An average of 83 SSRs were detected in each plastome. The numbers ranged from 73 in L. nacusua (D.Don) Merr. to 91 in L. angustifolia (W.C.Cheng) Nakai (S2 Table). The majority of SSRs were mono-SSRs and accounted for 72.7% of total SSRs. The di-SSRs comprised 12.4%, followed by tetra-(9.4%), tri-(4.2%), and penta-SSRs (1.3%) (Fig 5). The length of mono-SSRs ranged from 10 to 31 bp. Also, 31 A bases were detected in the ycf3 intron 1 of the L. metcalfiana plastome. The average number of mono-SSRs was 60.2, with the largest number being 65 in L. erythrocarpa, L. neesiana and L. pulcherrima (Nees) Hook.f. var. attenuate C.K. Allen, and the smallest number being 47 in L. nacusua (S2 Table). The length of di-SSRs ranged from 10 to 20 bp, and (AT)n was the most common type of di-SSR (S2 Table). The tetra-SSRs occurred on an average of 7.8 sites of each plastome and were more common than the tri-SSRs. This result is consistent with the previous results of complete plastomes of Lauraceae [15,17]. SSRs were scattered along the Laureae plastomes (Fig 5 and S2 Table). Of these SSRs, 64.6% were located in IGS regions and 21.8% occurred in intron regions. In contrast, only 13.6% were located in CDS regions (Fig 5). These results were similar to other studies of Lauraceae [18]. We also partitioned the distribution of SSRs according to the LSC, SSC and IR regions, and found that 79.8% of the SSRs were located in the LSC region (61.3%) (Fig 5). Only one or two SSRs were located in the IR region (S2 Table). SSRs were completely absent from the IR regions of L. communis Hemsl., L. nacusua and L. pulcherrima var. attenuata (S2 Table).
In order to evaluate the phylogenetic utility of SSRs, we compare the locus of each di-, tri-, tetra-, and penta-SSRs among 20 species (S3 Table). Eight of 44 loci were conserved among all species. Twenty-two loci show autapomorphic status and 14 loci show synapomorphic status. We also plotted each synapomorpic locus on the phylogenetic tree and only two (nos. 16 and 41, S3 Table) of them support monophyly of a clade consisted of L. glauca, L. angustifolia, L. nacusua, and L. communis. Other 12 synapomorphic loci are changed multiple times ranged from two to 10 times (Tree not shown). Therefore, the phylogenetic utilities of the SSR loci are very low in Lindera. But, it is a good maker to the identification of species. Actually, we were confidently identified all 20 species using the 44 SSR locus.

Phylogenetic analysis
To validate the phylogenetic relationships of Lauraceae, we aligned gene coding sequences for 49 Lauraceae taxa (S1 Table). The concatenated 81 gene sequences were 73,386 bp in length. The ML tree was obtained by RAxML with -ln L = 294926.225755. Most internal nodes are supported by 100% ML bootstrap values (Fig 6). Phylogenetic analysis was also performed on a data set that included whole plastome sequences for the core 33 Laureae taxa. The aligned whole plastome sequence including all noncoding regions was 158,484 bp in length. The 33 core Laureae tree was also constructed using the same condition as above (S2 Fig). In addition, we also construct the phylogenetic tree using only the intergenic spacer (IGS) regions of plastomes for 33   The trees suggested that the tribe Laureae was a monophyletic group, and that it is a sister group to the tribe Cinnamomeae. Their close outgroup is the tribe Perseae. In contrast to monophyletic tribes, 17 Lindera species formed paraphyletic assemblages because they include members of other genera such as Laurus (Lau.), Litsea (Lit.), Neolitsea (Neol.), and Actinodaphne (Act.) (Fig 6 and S2 Fig). For example, Act. tricocarpa C.K.Allen forms a sister group with Neol. Sericea (Blume) Koidz., while Act. lancifolia forms a sister group with Lit. japonica. Furthermore, Lit. glutinosa (Lour.) C.B.Rob. forms a clade with L. megaphylla Hemsl. and Lau. nobilis L. Neither Actinodaphne nor Litsea species form a monophyletic group in our tree ( Fig  6 and S2 Fig).
The genus Sassafras is usually treated as a member of the tribe Laureae based on general morphology, but it nested in a clade with Cinnamomum Schaeff. Nectandra was sister genus to the paraphyletic Cinnamomum in our plastome trees (Fig 6 and S2 Fig). Previous phylogenetic studies using partial plastid gene sequences [6,[27][28][29] or complete protein coding gene sequences [20] also reported the same relationships as in our tree. Therefore, our plastome trees agreed that Sassafras should be included in the tribe Cinnamomeae rather than the tribe Laureae [20]. Previous phylogenetic studies of Laureae using different molecular markers also support the paraphyly of Lindera. For examples, plastid matK tree showed the Lindera genus do not form a monophyletic group because some species of Litsea and Actinodaphne nested within a large Lindera clade [26]. The ITS and ETS tree also showed the strong paraphyly of Lindera because Lindera species were occurred at least four different clades [27]. Two of the clades also includes some species of Litsea, Actinodaphne, Parasassafras, Sinosassafras, and Iteadaphne. In addition, the nuclear rpb2 tree also show strong paraphyletic natures of Lindera [71]. Our whole plastome data also supports not only the paraphyly of Lindera but the paraphyly of other genera of Laureae. Therefore, the generic boundaries of tribe Laureae defined by morphological characters should be revised in near future.
Most of the Lauraceae are evergreen trees or shrubs and are distributed in tropical and subtropical regions of East Asia [3, 8-10]. Deciduousness was reported for some temperate Lindera and Litsea species and three Sassafras species [3, [8][9][10]. In order to test the evolution of deciduousness in Lindera and related genera, we plotted the character status on the whole plastome tree (S2 Fig). The tree clearly indicated that deciduousness was emerged at least five times and one reverse evolution from deciduous to evergreen also occurred on a branch leading to L. floribunda (C.K.Allen) H.P.Tsui. The deciduous trees L. obtusiloba Blume and L. erythrocarpa are derived independently from different evergreen ancestors. A core deciduous clade including six Lindera species from L. sericea to L. praecox (Siebold & Zucc.) Blume also include an evergreen, L. floribunda. Furthermore, some of these species also show semi-deciduous status depending on the distribution range [3, 9, 10]. Therefore, distribution range expansion to the north or high elevation and distribution range contraction to the south or lower elevation is the primary driving force of leaf characteristics in the evolution of Lindera. In addition, global climate change such as ice ages and global warming are also responsible for the evolution of leaf characteristics.
To test whether the problem of tangled relationships in which the genera in the Laureae are mixed with each other caused by SIs, a phylogenetic tree was constructed under the same tree building options, excluding the 11 SI regions that were found in this study (S4 Fig). However, there was no effect on the phylogenetic outcome. Similar results can be seen in the analysis using only the gene coding regions containing nine Lindera species instead of the whole plastome sequences used in this study [23]. Therefore, these results along with the previous phylogenetic study of Lindera and related genera probably suggest that hybridizations and plastome captures occurred frequently in the process of differentiation of these genera and species. In order to confirm the degree of plastome capture, additional studies including suitable nuclear markers are needed for Lindera, Litsea, Actinodaphne, and Laurus. Alternatively, the plastome phylogeny may indicate that the morphological characters defined the genus originated for several times.