First complete chloroplast genomics and comparative phylogenetic analysis of Commiphora gileadensis and C. foliacea: Myrrh producing trees

Commiphora gileadensis and C. foliacea (family Burseraceae) are pantropical in nature and known for producing fragrant resin (myrrh). Both the tree species are economically and medicinally important however, least genomic understanding is available for this genus. Herein, we report the complete chloroplast genome sequences of C. gileadensis and C. foliacea and comparative analysis with related species (C. wightii and Boswellia sacra). A modified chloroplast DNA extraction method was adopted, followed with next generation sequencing, detailed bioinformatics and PCR analyses. The results revealed that the cp genome sizes of C. gileadensis and C. foliacea, are 160,268 and 160,249 bp, respectively, with classic quadripartite structures that comprises of inverted repeat’s pair. Overall, the organization of these cp genomes, GC contents, gene order, and codon usage were comparable to other cp genomes in angiosperm. Approximately, 198 and 175 perfect simple sequence repeats were detected in C. gileadensis and C. foliacea genomes, respectively. Similarly, 30 and 25 palindromic, 15 and 25 forward, and 20 and 25 tandem repeats were determined in both the cp genomes, respectively. Comparison of these complete cp genomes with C. wightii and B. sacra revealed significant sequence resemblance and comparatively highest deviation in intergenic spacers. The phylo-genomic comparison showed that C. gileadensis and C. foliacea form a single clade with previously reported C. wightii and B. sacra from family Burseraceae. Current study reports for the first time the cp genomics of species from Commiphora, which could be helpful in understanding genetic diversity and phylogeny of this myrrh producing species.

Introduction Environment, Muscat, Sultanate of Oman to collect leaf samples for research purpose. The current study did not involve endangered or protected species.

Sample collection
Leaf samples were collected from Wadi Darbaat, Dhofar-Oman (17 31.237'N 55' 12.923'E). The samples include fresh and young photosynthetic leaves of C. gileadensis and C. foliacea. The collected samples were kept immediately in liquid nitrogen and then stored at -80˚C until chloroplast DNA extraction.

Chloroplast DNA extraction and sequencing
Leaf samples of C. gileadensis and C. foliacea were cleaned and washed with sterilized water, air dried and kept in dark for 48 hrs in order to reduce the starch content in leaf tissues. Chloroplast DNA was extracted by the protocol of Shi et al, [23] with modifications to remove the traces of resinous content from tissues. The workflow of Ion Torrent S5 Sequencer (Life Technologies, USA) was used for extracted cp DNA sequencing. Chloroplast DNA were enzymatically sheared for 400 bp using the Ion Shear Plus Reagents and library were prepared following the protocol of Ion S5 with Ion Xpress Plus DNA Fragment Library kit. Prepared libraries were checked on Qubit fluorimeter and bioanalyzer (Agilent 2100, CA, USA) for quality check and standardization. Ion One Touch 2 instrument was used for template amplification, post template amplification, whereas the enrichment process was carried out with Ion One Touch ES enrichment system. The sample was loaded onto the Ion S5 Chip and sequencing were performed according to the protocol of Ion Torrent S5.

Genome assembly
The quality of raw reads were evaluated by using the FastQC [24]. Adapters were removed from both end of the contigs and Platanus_trim (v.1.0.7) [25] with phred score >30 was used to trim high quality reads. The chloroplast genomes of both Commiphora species were first de novo assembled. In order to get contamination free read of chloroplast genome from mitochondrial and nuclear genomes, the Commiphora species genomes paired end reads were obtained by mapping the high quality reads to a selected reference genome of C. wightii (NC036978) with Bowtie2 (v.2.2.3) [26]. The selected resultant reads were assembled using Spades (v.3.7.1) software [27] and the parameters were set to default. The regions which was uncertain in these genomes such as IR junctions region were picked out from the already published genome of C. wightii and B. sacra (NC036978 and NC029420, respectively), to adjust the sequence length, iteration method was used with software MITObim (v.1.8) [28]. The complete genome sequences were deposited in Gene Bank of NCBI, where C. gileadensis and C. foliacea were given MH042752 and MH041484 accession numbers, respectively.

Genome annotation
Chloroplast genomes were annotated by using Dual Organellar Genome Annotator (DOGMA) [29] and BLASTX and BLASTN were used to identify the positions of ribosomal RNAs, transfer RNAs and coding genes, tRNAscan-SE77 software was used to annotate tRNA genes. Furthermore, for manual adjustment, Geneious Pro (v. 10.2.3) [30]and tRNAscan-SE [31] were used to compare it with previously reported C. wightii genome. Similarly, the start and stop codon and intron boundaries were also manually adjusted compared with pre sequenced C. wightii and B. sacra. Furthermore, the structural features of both Commiphora species cp genome were illustrated using OGDRAW [32]. Similarly, MEGA6 software [33] was used to determine the relative synonymous codon usage and divergence in usage of identical codons. The divergence of these two Commiphora species cp genome with other related species were determined by using mVISTA [34] in Shuffle-LAGAN mode and using C. wightii as a reference genome.

Repeat identification
REPuter software [35] was used for the identification of palindromic, tandem and forward repeats present in genome. The criterion was minimum >15 base pairs with sequence identity of 90%. SSRs dataset was determined through PHOBOS ver3. 3

Sequence-divergence and Phylo-genomic analysis
In this analysis, average-pairwise sequence divergence of complete plastomes and shared genes of Commiphora species with related species were determined. Missing and ambiguous gene annotations were confirmed by comparative sequence analysis after a multiple sequence alignment and gene order comparisons using Geneious Pro (v. 10 45]. For Bayesian posterior probabilities (PP) in the BI analyses, the best substitution model GTR + G model was tested according to the Akaike information criterion (AIC) by jModelTest verion 2102. The Markov Chain Monto Carlo (MCMC) was run for 1,000,000 generations with 4 incrementally heated chains, starting from random trees and sampling 1 out of every 100 generations. The first 30% of trees were discarded as burn-in to estimate the value of posterior probabilities. Furthermore, parameters for the ML analysis were optimized with a BIONJ tree as the starting tree with 1000 bootstrap replicates using the Kimura 2-parameter model with gammadistributed rate heterogeneity and invariant sites. MP was run using a heuristic search with 1000 random addition sequence replicates with the tree-bisection-reconnection (TBR) branch-swapping tree search criterion. In the second phylogenetic analysis, 72 shared genes from the cp genomes of the twenty-six members of order Sapindales, were aligned using ClustalX with default settings, followed by manual adjustment to preserve reading frames. Similarly, the above4 mentioned phylogenetic inference models were utilized to build trees using 72 concatenated genes, using the same setting as described above and suggested by Asaf et al [45].
The size of these cp genomes were almost similar with previously reported chloroplast genome of B. sacra (160,543 bp) [46], Azadirachta indica (160,737 bp) [47], Citrus sinensis (160,129 bp) [48] and Ailanthus altissima [49], which belong to order Sapindales. Both of these genomes possess the quadripartite structures comprises a pair of inverted repeats (IRa and IRb) separated by small single copy region (SSC) and large single copy region (LSC). The LSC regions in these genomes varies from 87,885 bp to 88,054 bp, SSC varies from 18,746bp to 18,962bp, and  the inverted repeat region varies from 26,763bp to 26,807bp (Fig 1). Similarly, the length of LSC, SSC and IR regions was also similar with previously reported genomes for order Sapindales [48,49].
Furthermore, the average GC content of C. gileadensis and C. foliacea genomes were found 37.8% which is almost similar to B. sacra (37.8%) and C. wightii (38%). The GC content of these cp genomes were also found similar with previously reported Sesamum indicum L. which is approximately 38% [50]. The AT content of both the cp genomes were 62.2%. This is in correlation to the other species from order Sapindales, for example A. miaotaiense (62.12%) [51], A. davidii (62.10%) [52], C. sinensis (61.52%) [48] and P. amurense (61.60%) [53]. Overall, the A+T content of 62.14% in both the cp genomes are closely related to order Sapindales ( Table 1).
The GC content was unevenly present in the C. gileadensis and C. foliacea cp genomes where it was low (32.3 and 32.4%, respectively) in the SSC regions, high (42.9%) in IR regions and moderate (35.8%) in the LSC regions. In synergy to the previously published reports on cp genomes, the presence of ribosomal RNA (rRNA) sequences enhance the GC contents in the IR regions [54][55][56]. In addition, about 43.72% of C. gileadensis and 46.91% of C. foliacea cp genomes were found noncoding. In case of coding regions, the protein coding genes were 48.81 and 45.62%, tRNA genes were 1.83 and 1.83%, and rRNA genes were 5.64 and 5.64% found in the C. gileadensis and C. foliacea cp genomes, respectively.
The total coding DNA sequences (CDSs) of C. gileadensis and C. foliacea were 78,238 bp and 73,119bp in size which encodes 94 and 93 genes respectively (S1 Table). This also includes 26,078 and 24,273bp codons respectively (S2 Table). Similarly, the codon-usage frequency of the both C. gileadensis and C. foliacea cp genomes were determined on the basis of proteincoding and tRNA-related gene sequence (S3 Table, S4 Table). Like previously reported cp genomes, the cysteine (1.2%) and leucine (10.3%) were the least and most commonly encoded amino acids [39,54]. Furthermore, The AT contents of both C. gileadensis and C. foliacea cp genomes at the 1 st , 2 nd , and 3rd codon position of CDS were 54.6 and 55.1%, 61.4 and 58.4%, and 65.99 and 67.3%, respectively (S2 Tablehttp://journals.plos.org/plosone/article?id = 10.1371/journal.pone.0182281 -pone-0182281-t003). This is in correlation with previous reports showing that the terrestrial plant's cp genome with highest AT-content at the 3 rd codon-position [54,57] The total number of genes in the C. gileadensis and C. foliacea were 140 and 141 respectively, in which 94 and 93 genes were protein coding genes, while 39 were tRNAs and 8 were  [59], and in Meliaceae species has 112 genes [60], which is from the same order Sapindales [51]. Camellia species contains 146 genes [60]. The protein-coding genes present in C. gileadensis and C. foliacea cp genomes include twelve genes-encoding small-ribosomal proteins (rps2, rps3, rps4, rps7, rps8, rps11, rps12, rps14, rps16, rps18, rps19), 9 genes-encoding large ribosomal proteins (rpl2, rpl14, rpl16, rpl20, rpl22, rpl23, rpl32, rpl33, rpl36), 10 genes of photosystem-II, five genes-encoding photosystem-I components, and 6 genes (atpA, atpB, atpE, atpF, atpH, atpI) ATP-synthase and electron-transport chain components (S1 Table). Similarly, the chloroplast genomes of C. gileadensis and C. foliacea contains introns containing genes. There were 11 genes containing intron inclusive of nine which have single-intron and 3 (clpP, ycf3 and rps12) which have two introns ( Table 2). These results are similar with previously reported cp genome of angiosperms. The smallest intron in both C. gileadensis and C. foliacea cp genoemes were 518bp and 526 bp respectively, whereas the longest intron was determined in trnK-UUU (2507 bp) in both cp genomes that included the entire matK gene. Introns can be a useful tool for successful transformational effectiveness and play a vital role in the regulation of gene expression [61]. Like other angiosperms cp genomes, rps12 gene was unequally distributed, with single copy of its 3 0 exon/intron, located at the IR regions and 5 0exon, located in the LSC region. A similar correlation in the results were observed in previously reported cp genomes of C. platymamma [62], C. aurantiifolia [63] and Dipteronia species [62]. Moreover, there are 4 ribosomal RNA genes and 30 transfer RNA genes. The infA gene, which code for transcription factor of initiation was present in both Commiphora species, while it is absent in Citrus sinensis (L.) cp genome [64].

Expansion and contraction of IRs
Expansion and contraction of the IR (a&b) repeats were compared among different species belonging to order Sapindales. The chloroplast genomes of angiosperm are highly conserved, but there is still some variation due to contraction or expansion of SSC and IR boundary region [64]. Due to these contraction and expansion, the size variation and rearrangement occurs in the LSC/SSC/IRA/IRB [64]. In this study we carried out a detail comparison of 4-junctions (JLA, JLB, JSA, and JSB) between LSC and SSC regions and both the IRa and IRb regions of the C. gileadensis and C. foliacea species and five other species from order Sapindales were performed (Fig 2).  foliacea, which is 1 bp in B. sacra and A. indica. Furthermore, variation was observed in the location of ndhF genes which is present at 268 bp, 193 bp and 84 bp away from J SB in SSC regions in C. gileadensis, C. foliacea and B. sacra cp genomes. However, in other four species cp genomes ndhF was located at the junction of IRb-SCC. Furthermore, there is 76 bp variation was observed in location of ycf1 gene at J SB border in both C. gileadensis and C. foliacea. However, in B. sacra cp genome this distance was calculated 1 bp away from J SB border [46]. Similar to previously reported cp genome from Sapindales these cp genomes having wellmaintained genomic structure in term of cp genome length, IR regions, gene order and gene numbers [49]. However, some of the deviation in sequence might be due to the result of boundary contraction and expansion between the boundaries of IR and single copy regions among different plant species as reported by Wang et al. [64].

Structural variation in genomic regions
In order to determine the sequence divergence among the four chloroplast genomes viz. C. gileadensis, C. foliacea, C. wightii and B. sacra, the annotation of C. gileadensis cp genome was used as a reference for determination of the sequence similarity in the cp genomes of the three species through mVISTA program (Fig 3). The results showed that high degree of synteny and comparatively lower sequence similarity were noted among these cp genome of these four species especially in rpoC2, rpoB, petB, psaB, ndhB, ndhF, ccsA, ycf1, ycf2, rpl22 and atpF genes (Fig 3). Furthermore, like previous reported genomes the LSC and SSC regions were more divergent as compared to IR regions in the compared species and less similarity in the coding region were observed. Similarly, various deviating regions included matK, ycf3-psaA, clpP, accD, atpF, rpoC1, petA-psbJ, ycf1-rps15, rps19 and ndhF were reported previously in various cp genomes [54,56]. Differences in the coding regions were similar in this study to the previously analyzed cp genome by Kumar et al. [33]. Similarly, for the shared genes the average pairwise sequence differentiation was calculated among these four species (Fig 3 and S9  Table). The results revealed that the 13 most divergent genes among these genomes were infA, rps8, rpl32, rpl22, rpl16, psaI, ndhH, ndhG, matK, ccsA, atpH, accD and psbN. The rpl22 gene showed the greatest average sequence divergence (0.029), after that rps3 (0.028), ndhH (0.027), and ccsA (0.020), majority of these were located in the LSC region. Similar results were observed in previously reported angiosperm cp genomes [56,65]. Furthermore, comparison of the cp genome of C. gileadensis with C. foliacea, C. wightii and B. sacra revealed 3,032, 8,787 and 5,120 SNPs as well as 3,580, 10,460 and 17,122 Indels respectively (Fig 4). Similarly, the C. foliacea cp genome also showed 8,194 and 5,182 SNPs while 7,632 and 17,970 Indel with C. wightii and B. sacra respectively. These Results shows that even the most conserved genome possesses some interspecific mutations which provides an important information in analyzing the phylogenetic and genetic diversity among the species [56].

SSR Polymorphism in the cp Plastomes
Diversity exist in the copies of SSRs present in the chloroplast genome and these SSRs are vital molecular markers in the plant evolutionary, population genetics and studying the ecology of the plants [66]. In the present study, we detected complete SSRs in C. gileadensis, C. foliacea cp genomes together with C. wightii and B. sacra (Fig 5) and detail SSR analysis of C. gileadensis, C. foliacea, C. wightii and B. sacra were also performed (S5 Table, S6 Table, S7 Table, S8  Table). Specific parameters were set for the SSRs present in genome because SSR of more than 10bp are liable to slip strand mispairing, which is considered to be the basic reason for SSR polymorphism. [67][68][69]. The results reveled a total of 196, 175, 153 and 191 SSRs in the C. gileadensis, C. foliacea, C. wightii and B. sacra cp genomes, respectively. The majority of SSRs 75 (38.2%) in C. gileadensis cp genome was mono-nucleotide repeat motifs. However, in other three cp genome the majority of SSRs were tri nucleotides motif, varying from quantity from 71 (40.57%) in C. foliacea to 75 (39.26%) in B. sacra. Tri-nucleotide repeat motif was found the second most common 69 (35.2%) in C. gileadensis. Using our search criterion, 3, 2 and 2 penta nucleotide were detected in C. gileadensis, C. foliacea and C. wightii cp genome respectively. However, in hexa nucleotide was only detected in B. sacra cp genome. Furthermore, in C. gileadensis and C. foliacea, most common mononucleotide SSRs are A (93.33% and 94.1%) motif, respectively. Approximately, 52% and 67.3% of SSRs are sited in non-coding regions, 2.04% and 5.71% are located in rRNA sequences in both C. gileadensis and C. foliacea respectively. These results suggest that SSRs are irregularly disseminated in the chloroplast genome and provides valuable information to select the effective molecular markers for spotting inter and intra specific polymorphisms [70][71][72]. The abundance of 'A' and 'T' nucleotide in the cp genomes as compared to 'G' and 'C' is due to the fact that mono and dinucleotide is only consist of 'A' and 'T' nucleotide which contributes to the bias in the cp genome base composition [68]. The finding from these Commiphora genomes reveals that SSRs in the cp genomes are Complete chloroplast genomics of two Commiphora species normally composed of polyadenine (polyA) or polythymine (polyT) repeats and irregularly contains the tandems guanine (G) or cytosine (C) repeats [73], which is similar to the previous results thus a possible reason for AT richness [46,55,56]. The presence of SSRs in cp genomes will give useful information for primer designing used for phylogeography and population structure at specie level or SSRs can also be used for obtaining useful and important information used for phylogenetic relationship and population genetics [74]. Previously reported D. viscoa contains 249 SSRs, having the mononucleotide SSRs in highest number followed by tri nucleotide repeats [74,75]. The cp genome of globe Artichoke contains 127 repeats is lesser than our findings [76].

Repeats analysis of Commiphora plastomes
Repetitive sequences in the plastomes plays role in the rearrangement of genomes which provide an important information about phylogenetic studies [50,77] From the previously analyzed cp genomes it is evident that for the induction of indels and substitutions these repeat sequence is essential. Additionally, analysis of different cp-genomes exposed that repeat sequence is important to produce indels/substitutions [78]. Similarly, in our study repeat analysis of the C. gileadensis and C. foliacea identified 30 and 25 palindromic repeat, 15 and 25 forward, 20 and 25 tandem repeat respectively. Similarly, 21 and 20 palindromic repeats, 27 and 20 tandem repeats were spotted in C. wightii and B. sacra respectively. However, in C. wightii only 6 forward repeats were detected while in B. sacra it was 29 in number. Overall 65 and 75 repeats of different length were found in both C. gileadensis and C. foliacea, respectively. In C. gileadensis four palindromic repeats were 75-89bp and 21 repeats were > 90 length. However, in C. foliacea the number of >90 repeats were less and only 2 palindromic repeats were found. On the other hand, among the forward repeats 10 repeats of >90 bp were detected in both C. gileadensis and C. foliacea cp genome (Fig 6). Earlier reports recommend that deviation in sequences and genome arrangement occur due to the slipped-strand mispairing and inappropriate recombination of repetitive sequences [77,79]. Moreover, the occurrence of the repeats shows that this locus is a key hots-pot for re-configuration of the genome [50,80]. Also, the Information from these repeats are a source of valuable information for constructing genetic markers for population studies and phylogenetic analysis [50].

Phylogenomic analysis
Several aspects of Commiphora natural history have impeded efforts to resolve its species-level taxonomy and investigate its systematic biology [81]. Previously, the two species have examined species-level phylogenetic relationships in Commiphora and tested the monophyly of some of these infrageneric taxonomic groups [5,82]. Gostel et al. [83] reconstruct phylogenetic relationship in Commiphora species using genes from nuclear as well as from chloroplast genome. However, hypothesis regarding higher level relationship among Commiphora specie are similarly unresolved [83]. To resolve the phylogenetic relationship among different species, the complete chloroplast genome sequencing provides more detailed information about the phylogenetics [84,85]. Therefore, in this study the phylogenetic position of both C. gileadensis and C. foliacea within order Sapindales was established by analyzing the complete cp genomes (Fig 7 and S1) and 72 shared genes (form all twenty-six species). Phylogenetic analysis using MP, BI, NJ and ML methods were performed. The results revealed that both complete cp Complete chloroplast genomics of two Commiphora species genomes and 72 shared genes of C. gileadensis and C. foliacea contain the same phylogenetic signals and generated phylogenetic trees with identical topologies (Fig 6, S1 Fig). The results show that both C. gileadensis and C. foliacea form a single clade with previously reported C. wightii and B. sacra from family Burseraceae with high BI and bootstrap support values (Fig 7,  S1 Fig). The tree topology showed that these four species from family Burseraceae are more closely related to Spondias species from Family Anacardiaceae and Azadirachta indica from Meliaceae (Fig 7, S1 Fig). Furthermore, the phylogenetic analysis validated the relationship inferred from the phylogenetic work reported by Saina et al. [86] that the families Burseraceae and Anacardiaceae formed a sister group/clade, which further branched forming sister clade with Meliaceae, Rutaceae, Simaroubaceae and Sapindaceae families. Therefore, for future phylogenetic studies must incorporate additional species for better understanding of Commiphora species evolution and phylogeny. This study offers a basis for future phylogenetic of family Burseraceae.