Improving phylogenetic resolution of the Lamiales using the complete plastome sequences of six Penstemon species

The North American endemic genus Penstemon (Mitchell) has a recent geologic origin of ca. 3.6 million years ago (MYA) during the Pliocene/Pleistocene transition and has undergone a rapid adaptive evolutionary radiation with ca. 285 species of perennial forbs and sub-shrubs. Penstemon is divided into six subgenera occupying all North American habitats including the Arctic tundra, Central American tropical forests, alpine meadows, arid deserts, and temperate grasslands. Due to the rapid rate of diversification and speciation, previous phylogenetic studies using individual and concatenated chloroplast sequences have failed to resolve many polytomic clades. We investigated the efficacy of utilizing the plastid genomes (plastomes) of 29 species in the Lamiales order, including five newly sequenced Penstemon plastomes, for analyzing phylogenetic relationships and resolving problematic clades. We compared whole-plastome based phylogenies to phylogenies based on individual gene sequences (matK, ndhF, psaA, psbA, rbcL, rpoC2, and rps2) and concatenated sequences. We also We found that our whole-plastome based phylogeny had higher nodal support than all other phylogenies, which suggests that it provides greater accuracy in describing the hierarchal relationships among taxa as compared to other methods. We found that the genus Penstemon forms a monophyletic clade sister to, but separate from, the Old World taxa of the Plantaginaceae family included in our study. Our whole-plastome based phylogeny also supports the rearrangement of the Scrophulariaceae family and improves resolution of major clades and genera of the Lamiales.


Introduction
The genus of Penstemon (Mitchell) is a large group of ca. 285 species of flowering plants endemic to North America [1][2][3]. A recently released report by Wolfe,Blischak (3) a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 character as independent. However, this violates the assumption of independence inherent in ML analyses as each locus is dependent and linked on single heritable plastid chromosome that does not undergo recombination [32,33]. Basing phylogenetic analyses on whole organellar sequences rather than individual gene sequences does not violate the assumption of independence. This also improves phylogenetic resolution and confidence can be improved using whole-plastome sequences because the coding, noncoding, single-copy (long and short), and inverted repeat regions each accumulate point mutations at different rates within a plastome [34]. This then creates a phylogeny that represents the whole evolutionary history of the taxa based on the plastid genome.
Here we report the assembled and annotated plastomes of P. cyaneus, P. dissectus, P. palmeri, P. personatus, and P. rostriflorus, representing the Habroanthus, Dissecti, Penstemon, Cryptostemon, and Saccanthera subgenera, respectively. We include the previously published plastome of P. fruticosus [35] in our analyses to represent the basal subgenus, Dasanthera, thus including representatives from all recognized Penstemon subgenera. We had three primary objectives to guide our studies of Penstemon plastomes. First, document the complete plastome sequences for species representing each Penstemon subgenus. Second, evaluate these species for evolutionarily significant similarities and differences in plastome structure, using microsatellite simple sequence repeats (SSRs), repetitive sequences, nucleotide variants, expansion/contraction of the inverted repeats, plastome synteny, and the phylogenetic positions of the subgenera within Penstemon. Lastly, evaluate the efficacy of using whole-plastome sequences to resolve problematic clades within the Lamiales (i.e. gene tree discordance, polytomy, and low nodal support) when compared to phylogenies based on single-gene matK, ndhF, psaA, psbA, rbcL, rpoC2, and rps2 sequences and a concatenated sequence composed of the above listed genes [11][12][13].

DNA extraction, sequencing, and plastome assembly and annotation
We extracted DNA from leaves of greenhouse grown P. cyaneus, P. dissectus, P. fruticosus, P. palmeri, P. personatus, and P. rostriflorus (Fig 1) using a modified CTAB purification method [36]. We diluted the DNA to a minimum concentration of 5 ng and whole genome sequences were generated using the pair-end (2 x 250 bp) Illumina HiSeq platform (Illumina Inc., San Diego, CA) at the Brigham Young University DNA Sequencing Center (Provo, Utah, USA).
To isolate the plastome sequences from the unpaired genomic reads with NOVOPlasty (https://github.com/ndierckx/NOVOPlasty) [37] we used four combinations of seed inputs and reference plastomes: 1) P. fruticosus as the seed without a reference 2) P. fruticosus for both the seed and reference, 3) Erythranthe lutea as the seed without a reference, and 4) E. lutea as the seed and P. fruticosus as the reference. Then, we assembled the isolated sequences from all NOVOPlasty runs using the Geneious v11.0.3 De Novo Assembly tool [38] with the P. fruticosus plastome reference [35] to create a circularized plastome sequence. Next, we evaluated each assembly for coverage and identified sequence gaps (P. cyaneus and P. palmeri) and filled these gaps with additional NOVOPlasty runs using the portion of the P. fruticosus plastome sequence that spanned each gap as a seed reference.
We annotated the assembled plastomes using the GeSeq webserver (https://chlorobox. mpimp-golm.mpg.de/geseq.html) [39] with the default parameters, and NCBI reference sequences from Olea europaea, Scrophularia buergeriana, S. takesimensis, Veronica nakaiana, V. persica, and Veronicastrum sibiricum, and corrected any gene annotations that were missing start/stop codons or introns manually. Once we completed the annotations, we submitted all sequences and the annotations to NCBI.

Simple sequence repeats and repeat structure
To evaluate the assembled plastomes for SSRs and repeat region similarity among Penstemon subgenera we used the MISA microsatellite predicting webserver (http://misaweb.ipkgatersleben.de/) [40]. Our identified SSRs were based on homopolymer and copolymer lengths of five for di-, four for tri-, and three for tetra-, penta-and hexa-nucleotide repeats using the default parameters. Using the REPuter webserver (https://bibiserv.cebitec.uni-bielefeld.de/ reputer) we identified repeat regions in the forward and reverse direction as well as complement and palindrome sequences [41]. We used a 20 base pair minimum repeat size without a minimum distance between repeat sites.

Single nucleotide polymorphisms and codon preference
For the alignment of all sequences to the P. fruticosus reference we utilized MAFFT webserver (https://mafft.cbrc.jp/alignment/server/) [42] and identified the variable sites, SNPs and indels, with Geneious v11.0.3 [38]. To measure the relative synonymous codon usage (RSCU), we used the "Codon Usage Bias" function of DnaSP v6 [43], defined as the ratio between the frequency of use to expected frequency for each codon [44] within aligned protein coding DNA sequence (CDS).

Synteny blocks
To examine structural variants at the inverted repeat (IR) region junction sites, we employed IRScope webserver (https://irscope.shinyapps.io/irapp/) [45] and visually evaluated the structural variants within the Penstemon genus and within a subset of taxa from the Lamiales order (Table 1). To do this, we downloaded the GenBank files for each taxon (Table 1) from NCBI for the IRScope input. The IRScope webserver processes ten plastomes at a time, so we made several submissions of plastome groups to compare the IR junctions of all taxa using the Solanum lycopersicum GenBank file as an outgroup reference for uniform alignment of each submission.

Phylogenetic analysis
For the whole-plastome phylogeny, we aligned the plastome sequences of our six Penstemon species, 22 additional species from the Lamiales order, and So. lycopersicum as an outgroup (Table 1) using the MAFFT webserver [42]. To perform the ML analysis we used the GTR+G4 evolutionary model was performed in IQ-TREE with 1,000 bootstrap support [46,47]. For the individual gene sequence phylogenies, we made sequence alignments for ndhF, rbcL, and rps2 sequences of the same species as the whole-plastome phylogeny using the MUSCLE webserver (https://www.ebi.ac.uk/Tools/msa/muscle/) [48]. For the concatenated phylogeny, we used the aligned sequences of matK, ndhF, psaA, psbA, rbcL, rpoC2, and rps2 from each species, which we concatenated using Python v2.7.5. We performed ML analyses for each gene sequence and concatenated sequences, using the TN+G4 evolutionary model in IQ-TREE with 1,000 bootstrap support. To view and annotate the optimal ML from all ML analyses, we used the TreeGraph 2 v2.15.0-887 software [49].
In our plastome annotations, we identified identical genes for all species. Each had 83 unique CDS genes and four rRNA genes, as does the P. fruticosus reference. However, P. fruticosus has 29 unique tRNA genes and the rest of the plastomes had 30 unique tRNA genes (Table 3). We aligned the five Penstemon plastomes to P. fruticosus and identified 8,577 sequence variants, including 1,729 single nucleotide polymorphisms (SNPs) and 6,848 indels between all species.
We identified possible pseudogenization of the ndhD gene in all lineages except P. fruticosus due to a start-loss missense mutation, which shifted the start codon to the 128 th position in the amino acid sequence. The ndhD gene in the IR regions had an identical mutation in both IRa and IRb sequences. Additionally, we found at least three indels in the final 18 codons of ndhD, which caused frame shifts and early termination of the amino acid sequence. Penstemon dissectus, P. palmeri, and P. cyaneus had identical ndhD gene sequences and indels, which supports the monophyly of this clade as all three taxa inherited this mutation through a common ancestor. We also identified a missense mutation in the stop codon of the rps12 gene in P. palmeri, which extended the protein by 22 amino acids. We only observed this mutation in P. palmeri, which indicates that the mutation occurred after P. palmeri (subgenus Penstemon) diverged from the P. cyaneus (Habroanthus) and P. dissectus (Dissecti) lineages. Without additional taxa sampling from the Penstemon and Habroanthus subgenera we are unable to determine whether this rps12 mutation is unique to P. palmeri, or if it is common in the Penstemon subgenus.

Repeat analysis
Overall, the number and size of repeats were comparable among Penstemon species. Our P. fruticosus reference contained 19 SSRs ranging in size from seven dinucleotide repeats to ten tetranucleotide repeats. The number of SSRs in the species tested were very similar to the reference, ranging from 15 in P. cyaneus and P. palmeri to 18 in P. personatus. Penstemon dissectus had the only pentanucleotide repeat (Table 4). Due to their heritable physical location and length, and ease of amplification protocols, SSRs make ideal markers for population genetic and phylogenetic studies of closely related taxa [51]. The majority of identified Penstemon plastid SSRs appeared to be homologous loci across the different Penstemon lineages. Their exact physical locations and lengths varied due to indel mutations. We observed several incidences where two SSR loci (S1 Table) were physically separated or absent in some linages, but directly adjacent in other taxa. This could indicate that there is an indel mutation between the two loci or point mutations within one or both SSR intervening sequence(s). These loci could be useful to test homology and observe Table 3. Plastome sequence assembly results. Total sizes of the Penstemon plastome, long single copy (LSC), short single copy (SSC), and inverted repeat (IR) for each species, as well as the counts of total genes, coding DNA sequence (CDS), rRNA, tRNA, and duplicated genes (within IR regions), followed by the NCBI sequence and annotation submission ID number.

PLOS ONE
Phylogenetic resolution and complete plastome sequences of Penstemon changes in populations, lineages, and clades over time. Upon validation of universal primers for physical location, SSR length, and point mutations, these SSRs could be excellent population genetics tools, but such evaluations were beyond the scope of this research. REPuter identified 49 total repeats (reverse, complement, forward, and palindrome) for all species including P. fruticosus. However, the types, sizes, and locations of these repeats varied among species (Fig 2). The majority of repeats (88-96%) were smaller than 30 bp for all species and were located in the LSC region (61-55%). Penstemon fruticosus had the most palindromic repeats (14%) and P. personatus and P. rostriflorus had the fewest (8%). Penstemon dissectus had the most complimentary repeats (8%) and P. personatus had the fewest (2%).
The ploidy level within the Penstemon genus ranges from diploid to dodecaploid, with diploid genome sizes ranging from 463 Mbp in P. dissectus to 922 Mbp in P. nitidus [5]. Although the species included in this study are diploid, P. cyaneus has a nuclear genome up to 63% larger than the other taxa in this study ( Table 2). Along with its large nuclear genome size, P. cyaneus has approximately double the number of repetitive elements in its nuclear genome as compared to P. dissectus and P. fruticosus [52]. The causes and origins of diploid genome enlargement in Penstemon remains mostly unstudied, and it is unknown whether gene duplication including ectopic recombination, replication slippage, or retrotransposition also play a role. Since plastome size is uninfluenced by ploidy level or nuclear genome size, the number and composition of the repetitive elements in the chloroplasts are comparable between our Penstemon taxa. All plastome genes in a given lineage are orthologous, inherited from a common maternal ancestor, which make plastomes ideal for phylogenetic studies as determining homologs, paralogs, and pseudogenization unnecessary. Investigations of nuclear genome mutations are ideal for studying the genetic changes that occurred during speciation and diversification, but plastomes are ideal to construct maternal evolutionary histories and phylogenies.

Codon usage
The RSCU for each amino acid was nearly identical among all species (S2 Table). We observed 61 unique codons for the 20 amino acids used in all CDS in our Penstemon plastomes. Seventeen of these codons were preferred over the other codons for the same amino acid. However, our plastome annotations only identified 30 tRNA codon genes in our Penstemon species (29 in P. fruticosus), only 11 of which were for the preferred amino acid codons. This indicates that the remaining 31 tRNA codons, including five preferred tRNA codons, come directly from the host cell's cytosol and are not produced by the chloroplast's genome. Chloroplast genomes commonly encode around 30 tRNA genes, and like mitochondria, import tRNAs from the cytosol which are produced by the nucleus [53]. The overall chloroplast genome size is greatly reduced from the ancestral cyanobacteria species, which have up to 12,000 CDS genes, compared to a chloroplast's 80-230, primarily through horizontal gene transfer [54,55].

Synteny block analyses
The IR junctions were well conserved within the Penstemon genus as we only observed minor expansions/contractions that were not useful to clarify phylogenetic relationships (Fig 3A). Penstemon fruticosus had the largest IR regions at 25,598 bp and P. personatus had the shortest at 25,527 bp. However, we observed several structural variants within the Lamiales order that did clarify phylogeny. We observed major structural changes within the Plantaginaceae family, specifically several expansions of the IR regions into the SSC regions within the genus Plantago (Fig 3B). The sizes of the IR region were highly variable, ranging from 20,336 bp, in Pl. lagopus, to 38,398 bp in Pl. media. The remaining taxa in this family, excluding those within Plantago, have consistent IR region sizes of 25,465 bp to 25,757 bp. One of the most phylogenetically significant structural changes we observed was the complete inversion of the IRb-SSC-IRa regions in the Orobanchaceae taxa Castilleja paramensis and Pedicularis hallaisanensis ( Fig  3C). This inversion has major implications for the placement of this family within the phylogeny of the Lamiales order, which we will discuss in the context of our ML phylogenetic analyses. We also observed the expansion of the LSC region and contraction of the IR regions in Haberlea rhodopensis (Fig 3D).
Although IRScope is a visualization tool and not an analytical tool, it is demonstrably useful for observing and identifying plastome structural, or synteny block, changes near the IR junctions such as inversions, and IR expansion/contraction. These types of structural have previously been identified in plastids of the Orobanchaceae family [14] and in the genera Pelargonium [56] and Plantago [57]. Plastome structural variants are heritable and can provide valuable insight into evolution and speciation [58]. Many of these mutations can be traced through the evolutionary process of land plants and used as a phylogenetic tool to provide evidence of shared ancestry [59,60]. We found that the major structural mutations we observed with IRScope correlated with highly supported clades in our whole-plastome phylogenetic analysis. Unfortunately, these informative mutations are excluded from phylogenetic analyses that use isolated gene sequences, both individual and concatenated, as they are absent of genome structural variants.

Phylogenetic analyses
Our whole-plastome phylogenetic tree supports the revision of the Scrophulariaceae. In previous studies only the basal nodes have consistently had a high degree of resolution and nodal support [14]. While lineages inclusive of more recent diversification, such as observed in Penstemon [2] and Plantago [16], typically have poor resolution and low nodal support (polyphyly, paraphyly, and polytomies) and are typified by low levels of variation in individual gene sequences [17]. These lineages are only now well resolved in our whole-plastome phylogeny with all but two nodes having statistically significant bootstrap support values (BSV) above 95 [61], and all nodes with BSV above 90 (Fig 4A). This high overall nodal support is an indication of a highly stable and reliable phylogeny [62].
The phylogenetic trees derived from individual gene sequences in this study had drastically different topologies with lower overall nodal support as compared to our whole-plastome phylogeny. In our whole-plastome phylogeny, we observed 24 of 27 nodes with BSV of 100 (Table 5). In contrast, the phylogenies based on the individual gene sequences (matK, ndhF, psaA, psbA, rbcL, rpoC2, and rps2) and concatenated sequence observed nodes with BS of 100 ranging from two (psbA) to 18 (concatenated). Although the phylogeny based on concatenated sequences (Fig 4B) had greater nodal support than the phylogenies base on individual sequences, 22% of its nodes were not statistically significant with BSV less than 90. Only two (7%) of the nodes in the whole-plastome phylogeny were not significant, and no nodes had BS less than 90.
The sequences selected for phylogenetic analysis is crucial to the inference of the resulting tree. Genes that are vital to the function of photosynthesis such as psaA and psbA, for example, are highly conserved since mutations in these encoded proteins must not be deleterious to the proper function of photosynthesis. Our ML phylogeny of these two gene sequences had very low resolution because there was very little variation between species (S1 Fig). We also observed polytomic clades within the genera Penstemon and Scrophularia due to little variation    Table 5. Comparison of nodal support in our plastid phylogenetic trees. The node count and percentages for each phylogeny based on whole-plastome, individual gene sequences (matK, ndhF, psaA, psbA, rbcL, rpoC2, and rps2), and concatenated sequences with the specified level of nodal support.

Sequence
Total between species (Fig 4D). Sequence concatenation appears to stabilize resolution and nodal support as the number of sequences in the concatenation grows. However, the resolution is unreliable because concatenation creates an artificial chromosome sequence constructed from a subset of gene sequences in an arbitrary order. Even if all CDS sequences are used, it will still be unreliable and will omit heritable structural variants including unannotated pseudogenes that are crucial to constructing the evolutionary history of a lineage. The origin of the complete inversion of the IRa-SSC-IRb region we observed in Castilleja paramensis and Ped. hallaisanensis with IRScope ( Fig 3C) is a crucial structural variant that may play a key role in understanding the evolution and diversification of Orobanchaceae. However, the locus for the ndhF gene is located within this inversion but is unannotated in Ped. hallaisanensis due to a deletion or pseudogenization, which causes its sequence to be omitted from phylogenetic analyses of the ndhF sequence alone or in concatenation studies (Fig 4C). The placement of this family varies greatly in each phylogenetic analysis we performed. Interestingly, the Orobanchaceae with this inversion is often placed as a basal clade to most of the Lamiales after O. europaea in the rbcL phylogeny (S2 Fig), or to the Phrymaceae and Lamiaceae families in the psaA (S1 Fig), psbA (S1 Fig), rpoC2 (S2 Fig), rps2 (Fig 4D), and concatenated sequence phylogeny (Fig 4B). The nodal support of this family varied from very low in the psbA phylogeny (BS = 7) to significant (BS�95) in the whole-plastome, concatenated sequence, and ndhF phylogenies.
Our method of isolating and extracting plastome sequences from whole-genome sequencing data without additional DNA extraction steps is a cost-efficient method for plastome sequencing and assembly. Whole genome sequencing on the Illumina HiSeq 2500 (2x250) platform can produce 125-150 Gb on a single flow cell lane, with multiplexing possible up to 12 samples, at an approximate cost of $20.00-30.00 per Gb of data. NOVOPlasty recovered 16,000x to 62,000x coverage from our whole genome sequencing runs, indicating that minimal whole genome coverage ( Table 2) will contain sufficient plastome sequences for assembly and analysis.

Conclusion
As a result of this research, we have produced and submitted the assembled and annotated plastome sequences of P. cyaneus, P. dissectus, P. palmeri, P. personatus, and P. rostriflorus to the NCBI GenBank DNA sequence database. These sequences, with the previously published P. fruticosus plastome [35], complete the representation of all Penstemon subgenera.
Whole-plastome based phylogenetic analyses improved the resolution of the Lamiales order, the Plantaginaceae, and the genus Penstemon with high nodal support. Whole-plastome phylogenies are superior to both individual and concatenated chloroplast sequences as they provide more polymorphic markers that add statistical power to the tested hypotheses [34], they provide high statistical nodal support, and they detect heritable genome rearrangements, including inversions and IR expansions/contractions, and group taxa according to these genome structural changes. We found that a major limitation of both individual and concatenated gene sequence-based phylogenies is that heritable structural rearrangements are excluded from the analyses. Since these rearrangements are heritable, they are crucial to accurate phylogenetic relationships and may be critical to the resolution of phylogenetic ambiguities of closely related and recently classified taxa [16,63].
Our findings also suggest that Penstemon represents a unique monophyletic lineage in the Plantaginaceae family and warrant further exploration with a broad sampling of Penstemon taxa along with Old World and New World genera of the Plantaginaceae family to resolve problematic clades; a process particularly challenging using conventional markers and methods. Such work would assuredly further our understanding of the origins, evolution, and diversification of Penstemon within Plantaginaceae.
Supporting information S1 Fig. Maximum likelihood phylogenies of the whole-plastome, matK, psaA, and Table. Simple sequence repeats (SSR) by location in each Penstemon plastome. The physical locations and lengths of each SSR identified using MISA. Sizes and positions of each SSR varies between taxa due to indel mutations. We observed several incidences where two SSR loci were physically separated or absent in some linages, but directly adjacent in other taxa (bold text). (DOCX) S2 Table. Relative synonymous codon usage (RSCU) in each Penstemon plastome. Codons with moderate to high preference, RSCU values above 1.2, for each amino acid are in bold text. Only leucine, arginine, and serine have more than one codon with high preferences. Most amino acids with two codons had a preference for one codon, the exceptions being cysteine, lysine, and asparagine.