The plastome sequence of Bactris gasipaes and evolutionary analysis in tribe Cocoseae (Arecaceae)

The family Arecaceae is distributed throughout tropical and subtropical regions of the world. Among the five subfamilies, Arecoideae is the most species-rich and still contains some ambiguous inter-generic relationships, such as those within subtribes Attaleinae and Bactridineae. The hypervariable regions of plastid genomes (plastomes) are interesting tools to clarify unresolved phylogenetic relationships. We sequenced and characterized the plastome of Bactris gasipaes (Bactridinae) and compared it with eight species from the three Cocoseae sub-tribes (Attaleinae, Bactridinae, and Elaeidinae) to perform comparative analysis and to identify hypervariable regions. The Bactris gasipaes plastome has 156,646 bp, with 113 unique genes. Among them, four genes have an alternative start codon (cemA, rps19, rpl2, and ndhD). Plastomes are highly conserved within tribe Cocoseae: 97.3% identity, length variation of ~2 kb, and a single ~4.5 kb inversion in Astrocaryum plastomes. The LSC/IR and IR/SSC junctions vary among the subtribes: in Bactridinae and Elaeidinae the rps19 gene is completely contained in the IR region; in the subtribe Attaleinae the rps19 gene is only partially contained in the IRs. The hypervariable regions selected according to sequence variation (SV%) and frequency of parsimony informative sites (PIS%) revealed plastome regions with great potential for molecular analysis. The ten regions with greatest SV% showed higher variation than the plastid molecular markers commonly used for phylogenetic analysis in palms. The phylogenetic trees based on the plastomes and the hypervariable regions (SV%) datasets had well-resolved relationships, with consistent topologies within tribe Cocoseae, and confirm the monophyly of the subtribes Bactridinae and Attaleinae.


Introduction
The family Arecaceae contains 181 genera and about 2,600 species distributed throughout tropical and subtropical regions of the world [1]. The most recent taxonomic review in Shi et al. [24]. The DNA was purified with the Genomic DNA Clean & Concentrator™-10 Kit (Zymo Research, Irvine, CA, USA). The purified DNA was quantified using Qubit™ dsDNA HS Assay kit (Thermo Fisher Scientific, Carlsbad, CA, USA) in Qubit™ Fluorometer (Thermo Fisher Scientific). Libraries were prepared with Nextera XT DNA Library Preparation Kit (Illumina, San Diego, CA, USA) and sequenced in Illumina MiSeq 1 (Illumina), obtaining 250 bp paired-end reads. Plastome assembly was performed using CLC Genomics Workbench v. 8.0 (Qiagen, Germantown, MD, USA) software with de novo strategy. The Acrocomia aculeata plastome was used as a reference for the ordering of contigs. Plastome annotation was performed using Geneious Prime 1 (Biomatters, Auckland, New Zealand). For all genes, manual verification was performed, adjusting the initial and terminal codons. The final plastome sequence was deposited in GenBank: MW054718.

Plastome structural analysis in tribe Cocoseae
The comparative analysis to identify structural rearrangements in the plastomes of the Cocoseae species was carried out using eight species (S1 Table), excluding one IR from all plastomes, and using the progressive alignment on Mauve software [25]. The IRScope software [26] was used to visualize and compare the plastome junctions (IRb/LSC; IRb/SSC, SSC/IRa; IRa/LSC).

Identification of hypervariable regions
We estimated the variability of the sequences with the formula proposed by Shaw et al. [27], adapted and used by Zavala-Páez et al. [28]. First, we individually aligned each collinear coding sequence (CDS), intergenic spacers (IGS), and introns of the plastomes (list of species in S1 Table) using MAFFT v.7 software [29]. Then, the alignments were imported into the software DNAsp v6.12.03 [30] to obtain the number of invariable sites (monomorphic), parsimony informative sites (PIS), number of substitutions, and number of InDel events. Sequence variability (SV) was calculated using the formula: SV% = [(number of substitutions + number of InDels) / (number of substitutions + number of InDels + invariable sites)] x 100. The frequency of PIS was calculated using the formula: PIS% = [(number of parsimony informative sites/ number of substitutions + number of InDels + invariable sites)] x 100.
The ten regions with the highest SV% and PIS% values were selected to carry out the subsequent analysis. The plastid markers matK, trnQ-rps16, rps16 intron, trnD-trnT, trnL-trnF [31,32] and the nuclear markers PRK and RPB2 [32,33], commonly used for phylogenetic analysis in Arecaceae, were used for comparative purposes and subjected to the same procedure to obtain the PIS% and SV% values.
Plastome alignment was performed using progressive alignment on Mauve [25] implemented in Geneious Prime 1 v.2020.1.2. The Locally Collinear Blocks (LCBs) identified by Mauve were individually extracted and concatenated. The alignment of the ten regions with the highest SV% and PIS% values was carried out using MAFFT v.7.450 [29] implemented in Geneious Prime 1 v.2020.1.2. Phylogenetic inferences were performed by Maximum Likelihood (ML) using W-IQ-tree [34], with 1,000 bootstrap repetitions. The choice of substitution models, including FreeRate heterogeneity model, was made according to Bayesian information criterion (BIC; Table 1). Branch support analysis was performed with 1,000 repetitions of bootstrap and single branch test SH (-aLTR, 1,000 replicates). The resulting trees were represented using Geneious Prime 1 v.2020.1.2.

Bactris gasipaes plastome
The sequencing of plastid-enriched DNA resulted in 448,600 reads with an average length of 214 bp. Of these, 47,735 were plastome reads (~10%), resulting in an average depth of coverage of 67.64 (SD = 24.32). The assembled plastome has a 21 bp gap in the IGS trnT-UGU/ trnL-UAA (position 46,800 to 46,820). This gap is in an AT-rich region (sequence 22 bp upstream to 119 bp downstream is only 7.8% of GC-content) and is, therefore, difficult to sequence [35]. We calculated this gap length using other species of tribe Cocoseae as reference.
Bactris gasipaes plastome has the quadripartite structure typically found in angiosperms [2], with a pair of inverted repeat (IRs), a large single-copy region (LSC), and a small single-copy region (SSC). The IRs are 27,038 bp in length (each), the LSC is 85,118 bp, and the SSC is 17,452 bp, resulting in a plastome with 156,646 bp.
Bactris gasipaes plastome has an average GC-content of 37.5%. When comparing the plastome regions, the SSC has the lowest GC-content, with 31.3%, followed by LSC with 35.5%. The IRs have the highest value, with 42.6% of GC-content. The rRNA and tRNA show high GC-content, with 55.3% and 53.4%, respectively. Protein-coding genes have an average GCcontent of 37.9%, intergenic spacers (IGS) of 37.5%, and introns of 37.1%. The plastome GCcontent among species from tribe Cocoseae is similar, ranging from 37.40% (Elaeis guineensis) to 37.53% (Acrocomia aculeata).
In the Bactris gasipaes plastome, we annotated 113 unique genes, 79 of which are proteincoding genes, 30 tRNA genes, and 4 rRNA genes ( Table 2). Duplicate genes in IRs include 8 Table 1. Substitution models selected for the phylogenetic inferences using Maximum Likelihood (ML).

Comparative analysis in tribe Cocoseae
Plastomes are highly conserved (97.3% identity) within tribe Cocoseae. Species from subtribes Bactridinae and Elaeidinae have a plastome~2 kb larger than the species from subtribe Attaleinae. This difference in length between the subtribes is mainly in the IRs and LSC regions ( Table 3). The plastomes from Cocoseae species ranged from 154.048 bp (Butia eriospatha) to 156.937 bp (Elaeis guineenses).
The progressive alignment among species from tribe Cocoseae shows evidence for three Locally Collinear Blocks (LCBs) (Fig 1). These three LCBs are a result of the 4.5 kb inversion

PLOS ONE
The plastome sequence of Bactris gasipaes and evolutionary analysis in tribe Cocoseae (Arecaceae) present in the plastomes of Astrocaryum murumuru and Astrocaryum aculeatum (Fig 1). The set of genes that makes up this structural rearrangement is composed of ndhC, ndhK, ndhJ, trnF-GAA, and trnL-UAA.
In the LSC/IR and IR/SSC junctions of the plastomes, there are differences among the subtribes (Fig 2). In Bactridinae (Acrocomia aculeata, Astrocaryum aculeatum, Astrocaryum murumuru, and Bactris gasipaes) and Elaeidinae (Elaeis guineensis) the rps19 gene is completely contained in the IR region and, therefore, there are two copies of the complete gene. In contrast, in the subtribe Attaleinae (Butia eriospatha, Cocos nucifera, and Syagrus coronata) the rps19 gene is only partially contained in the IRs, resulting in a complete rps19 and a partial rps19: the complete rps19 gene starts at IRb and ends at LSC (LSC/IRb); the partial rps19 starts at IR, but does not contain the final portion of the gene (Fig 2).
Similarly, the ycf1 gene is partially contained in IRs, with a complete ycf1 at IRa/SSC and a partial (pseudo) ycf1 at IRb. The ndhF gene has both position and length conserved at the IRb/ SSC junction in tribe Cocoseae, with the portion of the gene contained in the IRb overlapping the ycf1 gene (56 bp) (Fig 2).

Hypervariable regions
We carried out the SV% and PIS% estimates to identify the plastome regions with the greatest variation within tribe Cocoseae. All ten regions selected according to the highest SV% showed greater variation than the plastid molecular markers commonly used for phylogenetic analysis in palms ( Fig 3A). As expected, they have SV% lower than the nuclear markers PRK and RPB2 (Fig 3A). Among the ten regions selected according to the highest PIS% values, all showed greater values than the plastid molecular markers commonly used for phylogenetic analysis in palms ( Fig 3B) and two of them (trnC-petN and psbC-trnS) were more variable than the

PLOS ONE
The plastome sequence of Bactris gasipaes and evolutionary analysis in tribe Cocoseae (Arecaceae) nuclear marker PRK ( Fig 3B). The nuclear marker RPB2 showed the highest values for both SV% and PIS% estimates.
We also calculated the frequency of substitutions and frequency of InDel events for each plastome region. The substitution frequency was~5x higher than InDels in plastomes (Table 4). In the coding sequences (CDSs), substitutions are~80x more common than InDels. In IGSs and introns, substitutions are~4x and~3x more frequent than InDels, respectively (Table 4). In general, these data show that in all regions of Cocoseae plastomes there is a higher frequency of substitutions than InDels.

Phylogenomic analysis
The complete LCB matrix of the aligned plastomes consisted of 137,452 columns, of which 134,244 are constant and 815 parsimony-informative. The SV matrix contained a total of 9614 columns, with 133 parsimony-informative and 9144 constant sites, and the PIS matrix 7538 columns, with 129 parsimony-informative and 7200 constant sites. The phylogenomic trees inferred using Maximum Likelihood (ML) and based on the plastome and the ten selected regions with greatest SV% datasets showed identical topologies (Fig 4) and high bootstrap

PLOS ONE
The plastome sequence of Bactris gasipaes and evolutionary analysis in tribe Cocoseae (Arecaceae)

PLOS ONE
The plastome sequence of Bactris gasipaes and evolutionary analysis in tribe Cocoseae (Arecaceae)

PLOS ONE
The plastome sequence of Bactris gasipaes and evolutionary analysis in tribe Cocoseae (Arecaceae) support (> 86). The phylogenetic tree inferred from the alignment of the ten regions with greatest PIS values showed a similar topology; it differed only by the presence of a polytomy in the Attaleinae clade, generated by the low bootstrap support in intergeneric relationships (S1 Fig). For the topologies generated by ML in the three datasets, the monophyly of the subtribes Bactridinae and Attaleinae was confirmed. Also, the subtribe Elaeidinae appears more closely related to Bactridinae than to Attaleinae. In Bactridinae, Bactris and Astrocaryum are closely related genera, and in Attaleinae, Cocos nucifera as sister to Syagrus coronata.
The gene cemA has an unconventional start codon in Bactris gasipaes, what was previously described in species of subtribes Attaleinae and Elaeidinae, in Podococcus barteri (NC_027276.1), Phoenix dactylifera 'Khalas', and other monocots. However, it is still not clear if this gene, with the unconventional start codon, is translatable to a protein [44]. Although most of the genes encoding proteins have ATG initiation codons [7], some alternative initiation codons are found in plants [45], such as GTG in the rps19 gene, ACG in rpl2 and ATC in ndhD, which were reported in Lilium longiflorum, Phoenix dactylifera 'Khalas' and Amomum compactum, respectively [44,46,47].

Comparative plastome analysis within Cocoseae
Comparative studies using plastomes of species at different taxonomic levels can bring insights into plastome evolution, phylogenetic relationships, and evolutionary rates [2]. Plastomes of the three subtribes analyzed (Attaleinae, Bactridinae, and Elaeidinae), represented here by eight species, provided information to compare sequence variations in tribe Cocoseae. We identified slight differences in plastome size (~2 kb) and an inversion of 4.5 kp that occurs in the ndh complex (LSC region) of the Astrocaryum plastome. Similarly, Barrett et al. [48] reported that Arecaceae plastomes are highly conserved structure, describing only one 1.9 kb inversion located between the rps16 and trnG-UUC genes in Tahina spectabilis.
Also, the variability among Attaleinae, Bactridinae, and Elaeidinae in the LSC/IR junctions, mainly involving the rps19 gene, was previously described in Acrocomia aculeata [38] and Butia eriosphata [40], as well as in Phoenix dactylifera [44,49]. Similarly, the ndhF gene overlapping with ycf1 in~25 bp is commonly observed in palms [38,40,44,49]. Despite the differences in the IR junction, the IR structure and gene content is conserved among palms, corroborating the hypothesis that the IR regions offer an isolation mechanism that stabilizes the structure of the genome [50].

Hypervariable regions
Plastomes have several non-coding regions, but not all of them have been explored for phylogenetic studies [3,4,28]. Among the ten regions with the greatest SV% identified in our study, only five IGSs (e.g., accD-psaI, ndhF-rpl32, trnS-trnG, psaC-ndhE, rps15-ycf1) and one intron (rpl16) were previously used and/or highlighted in angiosperm studies [3,38,51]. Among the ten regions with the greatest PIS%, only three IGS (trnS-trnG, petA-psbJ, and psaC-ndhE) were identified in studies carried out by Shaw et al. [3,27] and Lopes et al. [38]. Thus, in our study, we described four new promising regions based on both SV and PIS values (trnC-petN, psbC-trnS, ccsA-ndhD, petN-psbM) and three new regions based on PIS (petD-rpoA, trnG-trnfM, rps8-rpl14). Also, Scarcelli et al. [51] reported 100 primers for phylogeny in monocots. However, four of the hypervariable regions reported in our study were not contemplated (e.g., psaC-ndhE, petN-psbM, accD-psaI, trnS-trnG). Among them. trnS-trnG and accD-psaI primers designed by Scarcelli et al. [51] showed no amplification in Arecaceae, and petN-psbM and psaC-ndhE were not mentioned, probably due to gene rearrangements in monocots plastomes. This reinforces that the highly variable regions vary between clades, and their identification may be necessary for distinct taxonomic levels. As expected, the nuclear genes PRK and RPB2 showed greater variation than most plastidial regions. These nuclear markers are very informative and produce well-resolved topologies [33]. The combined use of the plastidial regions described here and the nuclear markers PRK and RPB2 have great potential for phylogenetic studies in tribe Cocoseae.

Phylogenomic analysis
The ten regions with the greatest SV% values are suitable for phylogenetic inferences and produce phylogenetic trees with well-resolved and the expected topologies. In the ML analysis, all datasets tested (plastome, ten SV regions, and ten PIS regions) result in subtribe Bactridinae as monophyletic. The monophyly Bactridinae was previouly described by Eiserhardt et al. [32], as well as the sister relationship between the subtribes Elaeidinae and Bactridinae [33,38]. Our results are in contrast with those of Gunn [52], in which the sister relationship between Astrocaryum and Bactris is weakly supported. In all of our datasets this sister relationship is wellsupported. In addition, the monophyly of subtribe Cocoseae was also verified in a plastid DNA analysis [53], in the super-tree method [54] and by the combined analysis with the PRK and RPB2 genes [33]. The sister relationship between Cocos nucifera and Syagrus coronata was also previously described [55], corroborating our results. Thus, both the plastome and the ten regions with greatest SV values were able to produce well-resolved phylogenetic trees and with consistent topologies within tribe Cocoseae.