The Chloroplast Genome of Elaeagnus macrophylla and trnH Duplication Event in Elaeagnaceae

Elaeagnaceae, which harbors nitrogen-fixing actinomycetes, is a plant family of the Rosales and is sister to Rhamnaceae, Barbeyaceae and Dirachmaceae. Previous molecular studies have not strongly supported the relationships among Elaeagnaceae, Rhamnaceae, Barbeyaceae and Dirachmaceae. Because chloroplast genomes provide valuable phylogenetic information, we determined the chloroplast genome of Elaeagnus macrophylla and compared features such as the IR junctions and the infA gene with those of other Rosales and rosids. The chloroplast genome of E. macrophylla is 152,224 bp in length, and its infA gene has undergone pseudogenization. Phylogenetic analyses based on 79 genes in 30 species revealed that Elaeagnus is closely related to Morus. Comparison of the IR junctions with those of six other rosids revealed that the trnH gene is located in the LSC region in those species, whereas E. macrophylla contains a trnH gene duplication in the IR region. Comparison of the LSC/IRb (JLB) and the IRa/LSC (JLA) regions of Elaeagnaceae (Elaeagnus and Shepherdia) and Rhamnaceae (Rhamnus) showed that trnH gene duplication occurred only in Elaeagnaceae. The complete chloroplast genome of E. macrophylla thus shows characteristics that are unique among rosids. The infA gene has been lost or transferred to the nucleus in rosids, and E. macrophylla has likewise lost a functional infA gene. Evaluation of the chloroplast genome of Elaeagnus revealed trnH gene duplication for the first time in rosids. The availability of Elaeagnus cp genomes provides information on the relationships among Elaeagnaceae, Barbeyaceae and Dirachmaceae and on IR junctions that will be valuable to future systematics studies.

Elaeagnus L. belongs to Elaeagnaceae, a small family that also contains Hippophae L. and Shepherdia Nutt. [5,6]. Elaeagnus consists of approximately 60 species distributed in Asia, Australia, southern Europe and North America [7]. Elaeagnus macrophylla, which is native to Eastern Asia, is a popular ornamental plant valued for its aesthetic qualities and sweetly scented flowers.
The plant chloroplast (cp) genome consists of two large inverted repeats (IRa and IRb) separated by a large single-copy (LSC) region and a small single-copy (SSC) region [8,9]. The cp genome contains approximately 100-120 genes and is highly conserved [10]. However, inversions have been reported in some species of Asteraceae [11], rearrangements have been observed in Pelargonium [12], and gene loss and IR variations have been found in early-divergent eudicots [13,14].
Recent studies of the IR region have enabled its use as an important marker describing relationships among plants. The IR region of most angiosperms ranges from 20 kb to 28 kb [12]. Previous analyses have shown various expansions or contractions of the IR in some plants, with IR lengths such as 25 kb in Cycas [15], 114 bp in Cryptomeria [16] and 76 kb in Pelargonium [12]. Plunkett and Downie [17] reported IR expansion/contraction in Apioideae. IRa and IRb define four junctions: JLA (LSC/IRa border), JSA (SSC/IRa border), JSB (SSC/IRb border) and JLB (LSC/IRb border). In most angiosperms, the IRb and IRa junction regions contain rps19-rpl2 and rps19-trnH, respectively [18], while in most monocots they contain rps19-trnH and trnH-rps19-psbA, respectively [19].
Previous studies have analyzed the complete chloroplast genome sequences of rosids and identified unique features such as inversions and gene transfer in their plastids [20]. In Fabaceae, gene order has been changed by a 51 kb inversion in Glycine, and the IR region has been lost in Medicago [20]. Genes of some rosid plastomes have been transferred to the nucleus [21,22]. For example, the infA gene (Gossypium [23], Arabidopsis [24], Oenothera [25] and Lotus [26]) and the rpl22 gene (Castanea, Quercus and Passiflora [22]) were transferred to the nucleus.
Here, we report the complete sequence of the chloroplast genome of E. macrophylla, the first for Elaeagnaceae. In this study, we provide comparative analyses of the chloroplast genomes of rosid species, focusing on the infA gene, the rpl22 gene and the IR junctions. Specifically, we describe the structure of the chloroplast genome, the IR junction characteristics and the gene contents, which will help to resolve phylogenetic relationships among rosids and within Rosales.

Materials and Methods
Ethics, plant samples, sequencing, mapping and analysis

Phylogenetic analyses
A total of 79 gene sequences from 30 species (S1 Table) were aligned using MAFFT [31]. Phylogenetic analysis was conducted using maximum likelihood (ML) with the GTR+R+I model in RAxML v. 7.2.6 [32] and 1,000 bootstrap replicates. The ML analysis resulted in a single tree with -lnL = 485,367.343.
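The text does not state how the per-gene alignments were combined into a single data matrix. The following is a minimal sketch of one common approach, concatenation into a supermatrix with gap padding for genes absent from a taxon. The function name, gene names and toy sequences are hypothetical and are not taken from the study.

```python
def concatenate_alignments(gene_alignments, taxa):
    """Join per-gene alignments into one supermatrix.

    gene_alignments maps gene name -> {taxon: aligned sequence}.
    A gene that is absent from a taxon is padded with gap characters.
    """
    matrix = {taxon: [] for taxon in taxa}
    for gene, alignment in gene_alignments.items():
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        width = lengths.pop()
        for taxon in taxa:
            matrix[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in matrix.items()}


# Toy input: two hypothetical, already-aligned genes for three taxa.
toy_alignments = {
    "geneA": {"Elaeagnus": "ATGAC-", "Morus": "ATGACC", "Prunus": "ATGTCC"},
    "geneB": {"Elaeagnus": "GGTA", "Prunus": "GGTT"},  # absent in Morus
}
supermatrix = concatenate_alignments(
    toy_alignments, ["Elaeagnus", "Morus", "Prunus"]
)
for taxon, seq in supermatrix.items():
    print(f"{taxon:10s} {seq}")
```

Every row of the resulting matrix has the same length, which is what an ML program expects of its input alignment.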

PCR amplification and comparative analysis of IR junctions
Six species of Elaeagnaceae (Elaeagnus and Shepherdia) and six species of Rhamnaceae (Rhamnus) were evaluated (S2 Table) using the following primers specific for the LSC/IRb junction and the IRa/LSC junction, designed with Primer3 [33]: 1) rps19-rpl2: Forward, CGCTCGGGACCAAGTTACTA; Reverse, GGGTTATCCTGCACTTGGAA; 2) rpl2-psbA: Forward, ATGTTGGGGTGAACCAGAAA; Reverse, GCTGCTTGGCCTGTAGTAGG. Total DNA was extracted as described by Allan et al. [34] and then subjected to PCR amplification. The PCR cocktail (25 μl) consisted of 250 ng genomic DNA, 1X Diastar™ Taq DNA buffer, 0.2 mM of each dNTP, 10 pM of each primer and 0.025 U of Diastar™ Taq DNA polymerase (SolGent Co., Korea). The amplification conditions were as follows: initial denaturation at 95°C for 2 min, followed by 35 cycles of 95°C for 20 s, 56°C for 40 s and 72°C for 50 s, with a final extension at 72°C for 5 min, after which samples were held at 8°C. Amplification products were purified using a HiGen™ Gel & PCR Purification System (Biofact Inc., Daejeon, Korea). Nucleotide sequences of the rps19-rpl2 and rpl2-psbA regions were aligned with Geneious v. 6.1.7 [30].
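As a quick sanity check on the four primers listed above, their length, G+C count and a rough melting temperature can be computed as sketched below. The Wallace rule (2°C per A/T, 4°C per G/C) is an assumption introduced here as a coarse estimate for short oligonucleotides; it is not the calculation performed by Primer3, and the function names are illustrative.

```python
# Primer sequences as given in the Methods.
PRIMERS = {
    "rps19-rpl2 forward": "CGCTCGGGACCAAGTTACTA",
    "rps19-rpl2 reverse": "GGGTTATCCTGCACTTGGAA",
    "rpl2-psbA forward": "ATGTTGGGGTGAACCAGAAA",
    "rpl2-psbA reverse": "GCTGCTTGGCCTGTAGTAGG",
}


def gc_count(seq):
    """Number of G and C bases in an unambiguous DNA sequence."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence contains non-ACGT characters")
    return sum(base in "GC" for base in seq)


def wallace_tm(seq):
    """Rough melting temperature (deg C) by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    gc = gc_count(seq)
    at = len(seq) - gc
    return 2 * at + 4 * gc


for name, seq in PRIMERS.items():
    print(f"{name}: {len(seq)} nt, "
          f"GC {100 * gc_count(seq) / len(seq):.0f}%, "
          f"Tm ~{wallace_tm(seq)} C")
```

By this estimate all four 20-nt primers fall a few degrees above the 56°C annealing temperature used in the cycling protocol.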

Comparison of the chloroplast genome of Elaeagnus macrophylla to those of other rosids
The cp genome sequence of E. macrophylla was submitted to GenBank and assigned accession number KP211788. The cp genome is 152,224 bp in length and comprises an LSC of 82,136 bp, an SSC of 18,278 bp and two IRs of 25,905 bp each (Fig 1). We identified 113 unique genes in E. macrophylla: 80 protein-coding genes, 29 transfer RNA (tRNA) genes and 4 ribosomal RNA (rRNA) genes. Coding regions account for 73.4% of the genome (111,792 bp), including 60.5% protein-coding genes (92,119 bp), 7% tRNA genes (10,625 bp) and 5.9% rRNA genes (9,048 bp). Among the unique genes of E. macrophylla, 18 contain introns, of which 12 are protein-coding genes and six are tRNA genes. Three protein-coding genes (clpP, ycf3 and rps12) contain two introns. The overall A+T content of E. macrophylla is 63.9%, and the A+T percentage is higher in the SSC region (69.4%) than in the LSC (65%) and IR regions (57.3%).
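The region sizes and coding percentages reported above can be cross-checked with simple arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Region lengths in bp, as reported for E. macrophylla.
LSC, SSC, IR = 82_136, 18_278, 25_905

# The IR is present in two copies (IRa and IRb).
total = LSC + SSC + 2 * IR
print(f"genome length: {total:,} bp")

# Coding content in bp, by gene class.
coding = {"protein-coding": 92_119, "tRNA": 10_625, "rRNA": 9_048}
coding_total = sum(coding.values())

for label, bp in coding.items():
    print(f"{label}: {bp:,} bp = {100 * bp / total:.1f}% of genome")
print(f"all coding: {coding_total:,} bp = {100 * coding_total / total:.1f}% of genome")
```

The summed regions reproduce the stated genome length, and the three gene classes sum to the stated coding total.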
Previous studies placed E. macrophylla in Rosales [2], and complete chloroplast genomes of Rosales have been reported for Morus indica (NC_008359) and Prunus kansuensis [22]. Therefore, the genome features of E. macrophylla were compared to those of M. indica and P. kansuensis (Table 1).

Genes of infA and rpl22 in rosids
The functional gene sequences of infA and rpl22 are highly variable in rosids. The infA gene is present only as a pseudogene in Elaeagnus and other rosids, whereas most asterids (Helianthus, Guizotia, Lactuca and Jacobaea), monocots (Dioscorea), magnoliids (Drimys) and Chloranthales (Chloranthus) encode homologous sequences (Fig 2). The rpl22 gene of Arabidopsis, Glycine and Lotus showed an internal stop codon, whereas the rpl22 gene of the other plants reads from the start codon (methionine) to the stop codon without interruption (data not shown). The size of the rpl22 gene in these other 18 species nevertheless varied, ranging from 252 bp in Cucumis to 552 bp in Guizotia, and it was 423 bp in Elaeagnus.
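The text does not describe how internal stop codons were identified. One simple way to screen a coding sequence is to read it in frame and report any stop codon that occurs before the final codon, as in the sketch below. The helper name and the toy sequences are hypothetical and are not rpl22 sequences from the study; the stop codons used are TAA, TAG and TGA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(cds):
    """Return 0-based nucleotide offsets of in-frame stop codons
    that occur before the last complete codon of the sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [3 * idx for idx, codon in enumerate(codons[:-1])
            if codon in STOP_CODONS]


# Toy sequences (hypothetical):
intact = "ATGGCTAAATAA"          # ATG GCT AAA TAA
interrupted = "ATGGCTTAGAAATAA"  # ATG GCT TAG AAA TAA

print(internal_stop_positions(intact))
print(internal_stop_positions(interrupted))
```

Reading strictly in frame matters: the first toy sequence contains the triplet TAA out of frame (in GCTAAA), which is not reported as a stop.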

Phylogenetic analyses of Elaeagnus and rosids
Phylogenetic analysis was conducted using a data matrix of 79 genes from 30 species comprising 75,370 bp of aligned nucleotides (Fig 4). Rosids and asterids form two monophyletic sister groups with strong support (100% bootstrap values). Rosids are a well-defined group with two strongly supported clades: Fabidae (Prunus, Morus, Elaeagnus, Lotus, Theobroma, Manihot and Populus); and Malvidae (Gossypium, Castanea, Arabidopsis, Citrus and Eucalyptus). The results of the present study confirmed that the genus Elaeagnus belongs to Fabidae and is sister to Morus with 100% bootstrap support.
The infA gene has been lost from the chloroplast genomes of many angiosperms, and Millen et al. [21] suggested that it has been functionally replaced by a nuclear copy. Our results indicate that the infA gene has been lost from rosids (including Elaeagnus). We also found that trnH duplication in the IR region was present only in Elaeagnus.

Comparison of the trnH gene in the IR region between Rhamnaceae and Elaeagnaceae
Previous studies reported that Elaeagnaceae is closely related to Rhamnaceae, Dirachmaceae and Barbeyaceae [2,14,35]. In the present study, the chloroplast genome data revealed that the gene order across the LSC/IRb region of E. macrophylla is rps19, trnH and rpl2, while that across the IRa/LSC region is rpl2, trnH and psbA. Therefore, we compared the LSC/IRb (JLB) and the IRa/LSC (JLA) regions of Elaeagnaceae (Elaeagnus and Shepherdia) and Rhamnaceae (Rhamnus). We amplified and aligned the sequences of the rps19-rpl2 (JLB) and rpl2-psbA (JLA) regions from fourteen species of Elaeagnaceae (Elaeagnus and Shepherdia) and Rhamnaceae (Rhamnus).
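The gene orders given above illustrate why trnH is inferred to lie inside the IR in E. macrophylla: because the IR copies are inverted, a gene within the IR appears at both LSC junctions in mirrored order. The sketch below is a simplified illustration that treats each junction as a list of whole genes and ignores partial duplications. The function name is hypothetical, and the second layout is an assumed, simplified example of a rosid in which trnH lies in the LSC; it is not data from the study.

```python
def ir_genes_at_junction(jlb_order, jla_order):
    """Genes mirrored across the two LSC junctions.

    jlb_order: gene order read from the LSC into IRb.
    jla_order: gene order read from IRa into the LSC.
    Genes inside the IR appear at the end of jlb_order and, in
    reverse order, at the start of jla_order.
    """
    shared = []
    for gene_b, gene_a in zip(reversed(jlb_order), jla_order):
        if gene_b != gene_a:
            break
        shared.append(gene_a)
    return shared


# Gene orders reported for E. macrophylla.
elaeagnus = ir_genes_at_junction(
    ["rps19", "trnH", "rpl2"], ["rpl2", "trnH", "psbA"]
)

# Assumed simplified layout for a rosid with trnH in the LSC.
other_rosid = ir_genes_at_junction(
    ["rps19", "rpl2"], ["rpl2", "trnH", "psbA"]
)

print("E. macrophylla:", elaeagnus)
print("other rosid (assumed layout):", other_rosid)
```

In the first case trnH is mirrored at both junctions and is therefore counted as duplicated; in the second it appears only at the IRa/LSC side and is a single-copy LSC gene.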
The rps19-rpl2 region of Elaeagnaceae differed from that of Rhamnaceae (Fig 5). The rps19 (Fig 5A) and rpl2 (Fig 5C) regions of Elaeagnaceae were highly similar to those of Rhamnaceae, whereas the trnH gene and its flanking regions differed greatly between these families (Fig 5B and 5D). Additionally, the rpl2-psbA regions (Fig 6) of Elaeagnaceae and Rhamnaceae could be distinguished by the ψrps19 gene (Fig 6E). The rpl2, trnH and psbA genes are conserved in Elaeagnaceae and Rhamnaceae, whereas Elaeagnaceae has long gaps between the coding genes and the trnH gene (rpl2-trnH and trnH-psbA).

Rosid gene contents
Previous studies revealed that the infA and rpl22 genes and the atpF intron have been lost or subjected to pseudogenization in rosids. Millen et al. [21] and Jansen et al. [22] found that the chloroplast genes infA and rpl22 have been transferred to the nucleus in rosids. In addition, the intron has been lost from the atpF gene of cassava (Manihot esculenta) [38]. Moreover, the infA gene has been independently lost multiple times from angiosperms, including most rosids [22,26]. Phylogenetic studies placed Elaeagnus sister to Morus in the Rosales clade [2,26], and complete chloroplast genome analysis of Morus did not reveal an infA gene [40]. The rpl22 gene has been lost from Fabaceae (Glycine and Medicago) and Fagaceae (Castanea and Quercus), and in these lineages it has been independently transferred to the nucleus [22].
Our results also revealed the putative loss or pseudogenization of the infA gene in E. macrophylla (Fig 2B). Moreover, loss of the infA gene from 12 rosids was observed in this study (Fig 4). Su et al. [41] showed that Quercus, Francoa and Cucumis contain intact infA genes; however, no infA gene was observed in Cucumis in the present study (Fig 4).
The rpl22 gene of E. macrophylla is intact, while it was lost from Arabidopsis, Glycine, Lotus and Castanea, and present in varying lengths in 18 other species.

Special event in the IR region of Elaeagnus
The chloroplast genome of land plants is highly conserved structurally, but the large inverted repeat (IR) is not essential to chloroplast genome function [18], because the IR region is absent from black pine, from Conopholis and Phelipanche of Orobanchaceae, and from Erodium [42-44]. Nevertheless, the IR region is a variable site of the chloroplast genome with useful features [17,45,46].
As shown in Fig 3, the LSC-IRb junction of the Elaeagnus species shows insertion of the trnH gene, whereas the other rosid species do not contain the trnH gene at this junction. Comparison of the LSC-IRb region in closely related species of Rhamnaceae revealed an approximately 600 bp gap after the rps19 gene (Fig 5). In contrast, the IRa-LSC region contains 600 bp gaps in Elaeagnaceae species. The trnH gene of Elaeagnaceae and Rhamnaceae is the same length, but Elaeagnaceae does not include the rps19 gene (Fig 6). Consequently, Elaeagnaceae and Rhamnaceae have different gene contents and arrangements in the IR region. Comparisons of JLB and JLA in Rosales revealed that the rps19 gene is not duplicated in Morus, whereas Prunus contains a 108 bp duplication of the rps19 gene. The duplicated portion of ycf1 ranges from 1,002 bp in Morus to 1,051 bp in Prunus. In Elaeagnus, however, the trnH gene is duplicated at the JLB and JLA borders (rps19 is not duplicated), and 1,215 bp of ycf1 is duplicated. Hence, the IR of Elaeagnus is longer than those of Morus and Prunus.
Wang et al. [19] suggested two possible mechanisms for the evolution of IR expansions in monocots. In one of these mechanisms, a double-strand break (DSB) occurs within the IRb, after which the free 3' end of the broken strand is repaired against the homologous sequence in IRa. The repaired sequence then extends over the original IR-LSC junction, reaching the area downstream of trnH and resulting in duplication of the trnH gene in the newly repaired IRb. Similarly, the IR region has extended in Elaeagnaceae.

Duplication of the trnH gene in Elaeagnaceae
Our data analyses confirmed IR evolution in Rosales (Fig 7A). An incomplete rps19 gene was duplicated in the IR region of Prunus in Rosaceae (Fig 7E) and Rhamnus in Rhamnaceae (Fig 7D). Conversely, Morus in Moraceae did not contain a duplicated rps19 gene in the IR region (Fig 7B). The trnH gene was duplicated only in Elaeagnaceae (Fig 7C). The trnH gene duplication is therefore a useful marker for Rosales families such as Dirachmaceae, Barbeyaceae and Elaeagnaceae. In a previous study, Richardson et al. [3] suggested a sister relationship between Rhamnaceae, Dirachmaceae and Barbeyaceae. In contrast, Zhang et al. [2] suggested a sister relationship among Elaeagnaceae, Dirachmaceae and Barbeyaceae, but this was not well supported in the Elaeagnaceae clade. Consequently, analyses of trnH duplication at the LSC/IRb junction and the IRa/LSC junction from different Moraceae and Elaeagnaceae would be of great value in systematics studies.

Conclusions
Here, we present the complete chloroplast genome of Elaeagnus macrophylla and compare it to those of other rosids. The infA gene has been lost from the chloroplast genome or transferred to the nucleus in angiosperms [21]. Most rosids, including E. macrophylla, show loss of the infA gene. The chloroplast genome consists of an LSC (large single-copy) region, an SSC (small single-copy) region and two IR (inverted repeat) regions. The IR region is between 20 and 30 kb in length in angiosperms and clearly differs among closely related species. The IR region of E. macrophylla differs owing to trnH gene duplication. Phylogenetic analysis strongly supports a monophyletic group of Rosales (Elaeagnus, Morus and Prunus). Previous studies did not clearly support the relationships among Elaeagnaceae, Rhamnaceae, Dirachmaceae and Barbeyaceae in molecular phylogenetic trees. In the present study, comparison of trnH gene duplication in two closely related families, Elaeagnaceae and Rhamnaceae, showed that duplication occurred in Elaeagnaceae but not in Rhamnaceae. Consequently, trnH gene duplication in Elaeagnaceae offers information that will be useful for systematics and for elucidating the relationships among Elaeagnaceae, Dirachmaceae and Barbeyaceae.
Supporting Information
S1 Table. Phylogenetic study taxa and GenBank accession numbers of references. (DOCX)
S2 Table. IR junction analysis taxa and accession numbers. (DOCX)