Comparative Analysis of Begonia Plastid Genomes and Their Utility for Species-Level Phylogenetics

Recent, rapid radiations make species-level phylogenetics difficult to resolve. We used a multiplexed, high-throughput sequencing approach to identify informative genomic regions to resolve phylogenetic relationships at low taxonomic levels in Begonia from a survey of sixteen species. A long-range PCR method was used to generate draft plastid genomes to provide a strong phylogenetic backbone, identify fast evolving regions and provide informative molecular markers for species-level phylogenetic studies in Begonia.


Introduction
Begonia is one of the most species-rich angiosperm genera with c.1900 pantropically distributed species currently identified [1]. Although Begonia species are typical of wet rainforest herbs, the genus also exhibits substantial diversity in ecology, with ranges from dry desert scrub through to wet rainforest, and at altitudes from sea level to over 3000 metres [2]. Begonia also shows wide variations of form between closely related species (Fig 1). High speciation rates may be related to limited seed dispersal mechanisms and low level of gene flow in fragmented populations [3][4][5][6]. The large numbers of species, pantropical distribution and a solid horticultural background make Begonia an excellent system for the study of plant evolution in tropical environments [7].
The Begoniaceae are placed within Cucurbitales along with six other plant families; Cucurbitaceae, Datiscaceae, Tetramelaceae, Anisophylleaceae, Coriariaceae and Corynocarpaceae [18]. Begoniaceae-like traits are thought to have evolved between 69-46 million years ago (Ma), when the Begoniaceae crown group diverged (Goodall-Copestake 2005). A geographic origin for this family has proven elusive owing to the lack of fossil evidence, although Clement et al, who investigated the ancestral relationship between the two genera of Begoniaceae; In this study, we used high-throughput sequencing to address the problem of poor resolution in this group of Begonia species. The aim was to generate complete plastid genomes for a number of Gireoudia species with outgroups from the rest of the genus. Comparative sequence analyses were then be used to identify suitable molecular markers for low-level phylogenetic studies in Begonia.
Since the first plastid genome was sequenced, that of Nicotiana tabacum, [19], there have been a total of 826 whole plastid genome sequences submitted to Genbank (as of September 2015, using search terms 'chloroplast' and 'plastid'). This data provides an overview of structural and sequence diversity amongst angiosperms. The typical plastid genome is circular in structure with a standard size of between 120-160 kb, and generally containing a quadripartite configuration [20,21]. These four components comprise; the large single copy (LSC), the small single copy (SSC) and two inverted repeats (IRa and IRb). Most plant plastid genomes have all these features, however there are a few lineages that differ. One well-studied example is within the legumes where one clade has lost one of the inverted repeats, this group is referred to as the inverted repeat-lacking clade (IRLC), [22,23]. Other losses from plastid genomes can be seen in the gymnosperms where it has been found that conifers have also lost an inverted repeat and in addition, have undergone structural changes compared to angiosperms [24]. Plastome size can also be variable within lineages. Pelargonium x hortorum has a plastid genome of 217.9 kb, 40% bigger than typical plastid genomes and is specific to the genus. The large size is due to extreme expansion in the size of the inverted repeats, where the size of each IR is 75.7 kb, this is in contrast to the average IR size for angiosperms which is normally between 10-30 kb [25].

DNA extraction
Silica dried plant material (approximately 20mg) was disrupted using a Qiagen Tissuelyser system, then total genomic DNA was extracted using Qiagen DNeasy Tissue Kit (Cat No. 69504) following manufacturers recommendations. Total genomic DNA was eluted in 2 x 50μl sterile distilled water and stored at -20°C.

Long-range PCR (LR-PCR) primers
The lack of an available reference Begonia plastid genome led to the development of primers based on conserved angiosperm plastid sequences [27]. Chung et al (2007), designed primers from regions that were conserved between three distantly related plant species (Arabidopsis thaliana, Spinacia oleracea and Nicotiana tabacum) by analysing available whole plastid genomes, and designing their primers to have an amplicon size of 3 kb. In this study, we reshuffled the conserved plastid primers to make new primer combinations in order to produce 10 kb amplicons with 500-1000 bp overlaps between amplicons. The large overlap was designed to reduce the types of errors encountered by Cronn et al (2008) during sequence assemblies, such as complementary forward and reverse primers at a single site precluding them from obtaining genomic sequence for those positions, [28] Eighteen primers pairs were initially designed and tested to amplify plastid regions in Begonia species, (S1 Table). For each Begonia accession, 18 PCRs were performed in separate reactions and pooled. In some cases amplification was poor and additional primer pairs were developed and substituted, (S1 Table).

Long-range Polymerase Chain Reaction (LR-PCR)
LR-PCR was carried out in 25μl reactions containing: approx 20ng genomic DNA, 1x Long-Amp PCR buffer (New England Biolabs, M0323G), 300 μM dNTP, 0.4 μM of each primer, 2.5 units of LongAmp Taq DNA polymerase, Nuclease-free water to 25 μl. PCR amplification involved an initial denaturing step of 94°C for 30 secs, then 94°C for 10 secs, 50°C for 30 secs, 65°C for 9 minutes for 45 cycles, and a final extension period of 65°C for 10 mins followed by a 4°C hold. All reactions were carried out in a MJ Research PTC100 thermocycler.

Verification of plastid amplification
To verify the long-range PCRs had amplified plastid sequences, the amplicons generated from B. nelumbiifolia were pooled and the pooled sample was cloned using standard TOPO 1 cloning vectors and protocols. Twenty-four random clones were sequenced using traditional Sanger sequencing by 'The Genepool' sequencing facility, Edinburgh. The sequencing results were subject to BLAST searches against the non-redundant database in GenBank using megablast parameters for highly similar sequences, (http://www.ncbi.nlm.nih.gov/). In cases where no sequence match was found, BLAST stringency was reduced to blastn default algorithm parameter selection for similar sequences.

Plastid genome assembly
Using an Illumina GAIIx sequencing system, a total of 6.2 Gb of data were generated. Reads were deconvoluted using the barcodes into individual accessions and the barcodes removed. All accessions were subject to de novo assembly using Velvet Software v.1.2.03 [29]. Assembly parameters were optimized by performing several assemblies, using a range of Kmer sizes. The results of the assemblies were assessed through the assembly statistics, (i.e. N50 statistics) alongside BLAST analysis of contig sequences. Final parameters used: kmer length 31; expected coverage 1000; coverage cutoff 5; minimum contig length 100bp; -shortpaired; insert length 200. The following settings were used during compilation of velvet software: CATEGORIES = 2; MAXKMERLENGTH = 65; OPENMP; LONGSEQUENCES; BIGASSEMBLY. Begonia de novo contigs were verified using Blastn against the non-redundant database in GenBank using megablast and default parameters (http://www.ncbi.nlm.nih.gov/).

Gap-closing in the draft B. peltata plastid genome
Custom primers (S2 Table) were designed-based on sequence from the draft B. pelata plastid genome-to confirm the junctions of the inverted repeats (IR) and to close seven gaps where the contigs did not join in the B. peltata draft plastid genome. The products from successful PCR amplifications were Sanger sequenced and used to improve the B. peltata plastid genome.

Multiple genome alignment of Begonia plastid genomes
The draft sequence generated for B. peltata was used as a reference to map each set of assembled contigs for all sixteen Begonia species (including B. peltata). Plastid genome contigs were aligned to B. peltata using MAFFT v6.717 (Multiple Alignment using Fast Fourier Transform) [30] applying the iterative refinement method (FFT-NS-i) and using default parameter settings (gap opening penality: 1.53, offset-value: 0.0) and then visually inspected and manually adjusted in the software program Geneious Pro 5.6.3 (http://www.geneious.com/).

Modeltest and maximum likelihood analysis
The program Modeltest [31] was used to test models of evolution on the plastid genome alignment for maximum likelihood analyses. The molecular substitution model chosen for the plastid genome alignment was GTR+I+G as selected by the Akaike's Information Criterion (AIC).

MrModelTest and Bayesian analysis
The program MrModeltest 2.2 [32] was used to test models of evolution on the plastid genome alignment for Bayesian Inference analyses. The molecular substitution model chosen for the plastid genome alignment was GTR+I+G selected by the AIC.

Maximum parsimony
The aligned plastid DNA matrix for sixteen Begonia species was subjected to maximum parsimony (MP) analysis in PAUP 4.0b10 [33]. A heuristic search strategy of 10,000 random sequence addition replicates with tree bisection-reconnection (TBR) branch swapping, saving 10 trees per replicate, with MULTREES on, steepest descent off, was implemented. A bootstrap analysis [34] was performed to evaluate the robustness of clades using a fast stepwise addition search algorithm with 1000 replicates. A 50% majority-rule consensus tree was calculated from all the most parsimonious trees.

Maximum likelihood
Maximum likelihood (ML) analysis was implemented using PhyML [35] with GTR+I+G and determination of a 50% majority rule consensus tree from 1000 bootstraps. The appropriate substitution model was selected for the dataset using Modeltest [31] based on AIC and hierarchical Likelihood Ratio Test (hLRT).

Bayesian inference
Bayesian inference (BI) analyses were performed in MrBayes v3.1.2 [36] on unordered and equally weighted characters with the following settings. The evolutionary model employed six substitution types ("nst = 6"), with base frequencies set to the empirically observed values ("basefreq = empirical"). Rate variation across sites was modeled using a gamma distribution ("rates = invgamma"). The Markov chain Monte Carlo search was run with 4 chains for 1,100,000 generations, with trees sampled every 200 generations and the first 1000 trees discarded as 'burn-in'. Phylogenetic trees were visualized using Figtree [37].

Support values
For the BI analyses, a 90% posterior probability (PP) lower threshold was considered to indicate moderate support and a 95% lower threshold to indicate well supported relationships. For the MP and ML analyses, a 70% bootstrap support value lower threshold was considered to indicate moderate support, and an 85% lower threshold to indicate well-supported relationships.

A comparison of variation between the LSC, SSC and IR regions
The Begonia plastid alignment was partitioned into the small single copy (SSC), large single copy (LSC) and inverted repeat (IR) regions after consideration of the common boundaries [38,39] and visualization of the discrepancies found in the alignment itself (discrepancies seemed to be a contained in the regions predicted to span the junctions of the inverted repeats). Each region was subjected to phylogenetic analyses along with descriptive statistics produced using Geneious software, as well as manually calculated. In addition, descriptive statistics were produced only for species in sect. Gireoudia for each region to look at variation within sect. Gireoudia.

Selection of phylogenetically informative regions for low-level studies in Begonia
A sliding windows analysis was performed on the large-scale Begonia alignment in the commercial software, Geneious. A window size of 10 bp was used to assess mean pairwise identity over all pairs in the column across the whole alignment. The sliding windows analysis was used to identify regions of the alignment that contain potential phylogenetically informative regions. These regions were then visually inspected to ensure that there was as little missing data as possible and that unambiguously aligned regions were not included in the final analyses. For each new region identified, the sequence alignment was subjected to a maximum likelihood analysis using PhyML with the following evolutionary model, GTR+I+G and determination of a 50% majority rule consensus tree from 1000 bootstrap replicates. The resulting phylogenetic trees were visually compared with the phylogenetic tree from the whole plastid genome alignment, particularly with respect to good bootstrap support and the grouping of the American Begonias. A small selection of the potentially phylogenetically informative regions identified through ML analyses were subjected to further computationally intensive phylogenetic analyses using Bayesian Inference performed in MrBayes v3.1. The phylogenetic trees produced were again compared to the phylogenetic tree from the large-scale plastid genome alignment for similarity and statistical support. Blast searches were performed on the consensus sequence for each alignment to determine/confirm sequence identity.

Plastid genome annotation
Plastid genome annotation was performed using DOGMA [40] and putative annotations were confirmed using blast sequence similarity search. Plastid assemblies and alignments are available in Dryad (doi:10.5061/dryad.cp4mb) and raw reads are in the European Nucleotide Archive, Project PRJEB11898.

Results and Discussion
Recent and extensive speciation in Begonia requires informative markers for full phylogenetic resolution. To identify optimum plastid markers we used a comparison of conserved sequences from angiosperm plastid genomes [27] to design 18 pairs of primers for the generation of 10 kb amplicons with overlapping regions to sequence 16 Begonia plastid genomes.

Plastid genome assembly and features
A total number of 32,611,570 sequence reads were generated, providing 6.2 Gb of data, consisting of 50bp paired-end illumina reads. Expected coverage was determined based on the size of the Cucumis sativas plastid genome and ranged from 382-832 (S3 Table). The reads were assembled de novo and the number of assembled contigs per genome ranged from 132-679 ( Table 2). The contigs were validated by blast searches of the NCBI database. From the sixteen plastid genome datasets, B. peltata was chosen to create a draft genome as initial assembly statistics indicated that this dataset was the most complete and had the best assembly scores with a total of 310 contigs and an N50 of 9, of which 50 contigs were greater than one kb.
The draft B. peltata plastid genome has a typical angiosperm quadripartite structure. A large single copy (LSC) region (84,812 bp) flanked either side by a pair of inverted repeats, termed IRa and IRb, (predicted to be 26,456 bp) and circularized by a small single copy (SSC) region (16,152 bp). The IR region was predicted from the site of contig termination (the start of sequence duplication) in the ycf1 gene in the SSC region, through to contig termination in the rps19 gene in the LSC. The final draft plastid genome for B. peltata covered a total of 127,420 bp excluding the IRb.
The B. peltata genome was made more complete by the development of custom primers to aid with contig orientation and gap-closing. A total of five out of seven gaps were closed using traditional Sanger sequencing. Sequence analysis revealed that these gaps were present because of low complexity and/or homopolymer runs These events are common when assembling sequences de novo with Illumina short read data, as an increase in identical nucleotides or tandem repeats reduces the confidence of the assembly and can result in contig termination [29]. Two gaps were located at the predicted junctions of the IRb region. One end of the IRb is is predicted to reside within the linear gene grouping of rpl2-rps19-rpl22 genes. The contig representing this region in the LSC terminates in the rps19 gene, making this the most likely position for the junction. The rps19 gene is one of the most common sites for the LSC and IRb junction reported in the angiosperms [39]. The rps19 gene in B. pletata is, however, not complete. A megablast search of the non-redundant database using the B. peltata rps19 gene region as the query sequence indicates there is potentially 15 bp missing from the end of the gene sequence in this species, and rpl22 is missing altogether. We were unable to verify the size and/or presence of these genes due to PCR failure in for all Begonia species used in this study. It is possible that these genes are missing or significantly changed in Begonia. Therefore, when calculating the total plastome size these genes are estimates based upon those of Cucumis sativus. The total length of the complete B. peltata plastid genome is predicted to be 153,876 bp.

Genome annotation
Annotation of the draft plastid genome from B. peltata predicts 103 genes, including 45 tRNA genes, eight ribosomal RNA genes and six predicted open reading frames (ORFs), (Fig 2). The LSC contains 70 genes and 26 tRNA genes, whilst the SSC contains 13 genes and one tRNA. The predicted inverted repeats each contain ten genes and nine tRNA genes along with four rRNAs and three predicted ORFs (ycf68, orf42 and orf56). These results fit well with previously annotated angiosperm plastid genomes, and are highly syntenic with known plastid genomes in the Cucurbitales [27].

Base composition and codon usage
The nucleotide base composition of the plastid genome of B. peltata is highly skewed (64.2%), in favour of adenine (A) and thymine (T) bases, whilst poorly represented in cytosine (C) and guanine (G) bases (35.8%). This unequal balance is typical of plastid genomes throughout angiosperm lineages with average AT base composition reported to be around 62% [41], and an ATrich base composition is also seen in more primitive organisms such as the green algae, Oedogonium cardiacum at 70.5% [42]. Theories speculating on the reasoning for high AT-content include the mis-incorporation of A and T bases during replication, a bias in DNA repair machinery and DNA damage through high levels of reactive, and potentially mutagenic species generated during the electron transfer reactions of photosynthesis [43]. The high AT-content is also a general feature of mitochondrial genomes, another endosymbiont with a circular genome.

Comparison with Cucumis sativus
Cucumis sativus was the closest relative to Begonia whose plastid genome had been sequenced [27]. A sequence alignment between C. sativus and B. peltata reveals that the genes are highly syntenic, which is reflective of the general conserved order of plastid genes across land plants [21], however a pairwise identity of 50.1% at the nucleotide level reflects the sequence diversity between these distantly related species. The gene content of both plastid genomes are almost identical except for the loss of the infA gene (which codes for the translation initiation factor 1) in B. peltata. Unfortunately, the infA gene is usually found between the genes rpl36 and rps8, this region is also the same region that is missing from all the sequenced Begonia plastid genomes in this study and this region is positioned closely (~4.  they did not specify which species-with additional separate losses in 24 angiosperm lineages. It is clear that the infA gene is one of the least conserved genes in the plastid genome.

A comparison of variation between the LSC, SSC and IR regions
The analysis of the SSC, LSC and the IR regions show that in the Begonia species sampled, the SSC has evolved almost twice as fast as the LSC, and at five times the rate of the IR, Table 3. The variability present in sect. Gireoudia only, reflects the same results, Table 4. Many previous phylogenetic studies in Begonia have concentrated on molecular markers found in the LSC; matK gene [45], rbcL gene (10), trnL-UAA intron [46], trnL-F spacer [45], although some of these are protein-coding and will therefore be expected to have a slower evolutionary rate.
Recently, Thomas et al, (2011) successfully demonstrated the phylogenetic utility of a selection of sequential non-coding regions found in the SSC (ndhA intron, ndhF-rpl32 spacer, rpl32-trnL spacer) for low-level studies of Asian Begonia. These three regions combined (a total alignment length of 4059 bp) were able to determine the species level resolution for several Asian Begonia sections [13]. Analysis of the IR region reveals a low substitution rate, confirming the findings of Wolfe et al, (1987) who reported that the inverted repeats were more highly conserved than the SSC and the LSC regions. Although their analysis was between highly divergent monocotyledon and eudicot plants, it can be concluded from these analyses that it holds true for species-level comparisons in Begonia [47].

Phylogenetic analysis of the large-scale Begonia plastid alignment
The highly conservative nature of gene synteny in the plastid genome allowed the alignment of plastid contigs for all sixteen Begonia species at a genome-wide level. In this study, only one copy of the IR is included in the large-scale genome alignment because two copies of the inverted repeat could not be confirmed during gap-closure. Phylogenetic analysis was carried out on the large-scale genome alignment of sixteen incomplete plastid genome sequences to determine whether the uncertainty over species-level patterns within sect. Gireoudia revealed in Moonlight et al., (2015) could be resolved using plastid genome data [15]. The aligned matrix for the large-scale Begonia plastid alignment consisted of 16 taxa with an alignment length of 150,255 bp, (139,457 bp unambiguously aligned characters, 143,538 characters are constant, 4804 variable parsimony-uninformative and 1,913 parsimony-informative sites). Phylogenetic analyses (MP, ML & BI) were performed on the large-scale Begonia alignment revealing sufficient phylogenetic variation was present to delineate the relationships of the sixteen Begonia species in this study (Fig 3). All three topologies were congruent with each other although there were minor differences between the levels of support and also the grouping of two taxa, namely B. dregei and B. solananthera. This uncertainty may be linked to evidence for at least two separate introductions of Begonia to the Neotropics and discordance between the mitochondrial, nuclear and plastid phylogenies at the base of the neotropical clades [17,48].

Phylogenetically informative plastid regions for low-level studies in Begonia
Twenty-four small regions (between 800-2000 bp in length) of the plastid genome alignment were chosen for phylogenetic analysis using Maximum likelihood analyses. The phylogenetic trees were assessed based on congruency with the results from the large-scale Begonia plastid alignment and tree node support values. Nine of the initial 24 regions were selected for further phylogenetic analyses using Bayesian Inference ( Table 5). The majority of these plastid regions were part of the SSC, although one of the selected regions, Region 24, was in the LSC. Phylogenetic analyses determined that two regions, Region 21 and Region 24, contained high sequence divergence that could be suitable for phylogenetic analyses at the species level. These regions gave strong support within sect. Gireoudia along with good support outside of this group (Figs 4 and 5).
Region 21 (ndhI-ndhG) has an alignment length of 804 bp and is positioned on the largescale plastid alignment at 138,658-139,461 bp spanning the intergenic region between the genes ndhI and ndhG in the SSC. An important feature of Region 21 is the relatively small size of the alignment meaning a full sequence is likely to be obtained from only one Sanger sequencing reaction, which is cost effective. It has an average pairwise identity of 86.7%, this is slightly higher when compared to an overall pairwise identity of the SSC alignment at 85.3%. Studies are now beginning to highlight the potential of non-coding regions within the SSC as putative sequences for phylogenetic analyses [47,49].
Region 24 (rpoB-psbD) has an alignment length between 1223 bp (mostly sect. Gireoudia species) and 1300 bp (other Begonia species). It is located in the rpoB-psbM intergenic spacer in the LSC, (large-scale Begonia alignment position 27,734-28,931 bp) and has a higher pairwise identity than that of Region 21 at 93.9%. The pairwise identity is also much higher than the whole LSC (75.9%) however this is to be expected as there were large regions of missing data, especially for B. pustulata, B solanathera and B. bogneri and the real figure is likely to be much higher. Although, these analyses did not include indel-coding, it is speculated that Region 24 would be particularly suitable for the analysis of indels as there is a distinct indel structure in sect. Gireoudia that is not seen in the other Begonia sections. An area encompassing region 24 (rpoB-psbD, 28-36 Kb), has been reported by Wang and Messing, (2011) as a highly divergent region in the subfamily Lemnoideae, however their comparative plastid analysis was between genera, hence the resolution of a smaller region suitable for phylogenetic studies was not reported [50]. It is clear from the phylogenetic analyses that both regions give good support and resolution to the American Begonia species and also to the African and Asian species. These two regions, Region 21 (ndhI-ndhG) and Region 24 (rpoB-psbD) are recommended as candidates for further assessment in a wider sampling of the pantropical genus Begonia.

Conclusions
The creation of this dataset provided better phylogenetic resolution of relationships within sect. Gireoudia and between closely related species, providing a strongly supported phylogenetic tree. The inclusion of plastid genomes from African and Asian Begonia species as outgroups, meant a comparative study could be undertaken, gaining new insights into the rates of evolution of different plastid regions and gene order. It also enabled the identification of appropriate informative loci for phylogenetic studies in Begonia at species level. The identification of two suitable molecular markers for the study of phylogenetics in Begonia, particularly Neotropical lineages means a comprehensive phylogenetic analysis can now be undertaken to fully understand how American Begonia have evolved throughout Mesoamerica. In order to determine the evolutionary history of Begonia, it is necessary to understand the full genomic complement in Begonia species and further studies into nuclear and mitochondrial genome evolution are required. This study presents a framework for the discovery and development of new, and more importantly, suitable molecular markers for phylogenetic studies in Begonia. It is an important step in the development and selection of new, phylogenetically informative markers for non-model species for which little information is available.
Supporting Information S1 Table. Primer sequences used in this manuscript to amplify plastid regions in Begonia.
(DOCX) S2 Table. Custom primer sequences developed from B. peltata plastid genome to confirm the junctions of the inverted repeats (IR) and to close gaps between contigs. (DOCX) S3 Table. A total number of 32,611,570 sequence reads were generated, providing 6.2 Gb of data, consisting of 50bp paired-end illumina reads. Expected coverage for each accession was determined based on the size of the Cucumis sativas plastid genome. (DOCX)