The complete chloroplast genome sequence of Epipremnum aureum and its comparative analysis among eight Araceae species

Epipremnum aureum is an important foliage plant in the Araceae family. In this study, we have sequenced the complete chloroplast genome of E. aureum by using Illumina Hiseq sequencing platforms. This genome is a double-stranded circular DNA sequence of 164,831 bp that contains 35.8% GC. The two inverted repeats (IRa and IRb; 26,606 bp) are spaced by a small single-copy region (22,868 bp) and a large single-copy region (88,751 bp). The chloroplast genome has 131 (113 unique) functional genes, including 86 (79 unique) protein-coding genes, 37 (30 unique) tRNA genes, and eight (four unique) rRNA genes. Tandem repeats comprise the majority of the 43 long repetitive sequences. In addition, 111 simple sequence repeats are present, with mononucleotides being the most common type and di- and tetranucleotides being infrequent events. Positive selection pressure on rps12 in the E. aureum chloroplast has been demonstrated via synonymous and nonsynonymous substitution rates and selection pressure sites analyses. Ycf15 and infA are pseudogenes in this species. We constructed a Maximum Likelihood phylogenetic tree based on the complete chloroplast genomes of 38 species from 13 families. Those results strongly indicated that E. aureum is positioned as the sister of Colocasia esculenta within the Araceae family. This work may provide information for further study of the molecular phylogenetic relationships within Araceae, as well as molecular markers and breeding novel varieties by chloroplast genetic-transformation of E. aureum in particular.


Introduction
Chloroplasts are organelles in green plants and algae, where they are responsible for photosynthesis and other housekeeping functions that are essential for nitrate and sulphate assimilation, as well as the synthesis of amino acids, fatty acids, chlorophyll, and carotenoids [1,2]. Over the long period time of evolution, the sizes of chloroplasts genomes have reduced dramatically to 120-130 genes in recent times, due to gene loss and/or the large-scale gene transfer to the PLOS  and Lemnaceae [23][24][25][26][27]. Here, we assembled the cp genome of E. aureum and compared it with that of other plants within Araceae. We then constructed a phylogenetic tree to make comparisons with cp genomes published for other plant species in related families. Our objectives were to provide information for the systematic evolution studies of Araceae and Alismatales, with special interest in the positioning of E. aureum in plant systematics and evolution. We also investigated intergenic spacer and regulatory sequences to obtain information that can be used in future studies for chloroplast genetic engineering.

DNA extraction, genome sequencing, assembly, and annotation
Fresh leaves of Epipremnum aureum were collected and total DNA was prepared following the instruction of Gentra Puregene Tissue Kit (QIAGEN Biotechnology, Germany), for constructing a shotgun library and sequencing on an Illumina Hiseq 2500 platform (Genesky Biotechnologies, Inc., Shanghai, China). The raw data (1.41 G) were filtered and trimmed by CLC Genomics Workbench version 7.5 (CLC Bio, Aarhus, Denmark). In all, 793,730 reads were assembled with MITobim version 1.7 [28] according to the complete cp genome sequence of Colocasia esculenta (KC016753). For filling the gaps, the high-quality flanking sequences (~600 bp of each end) of gaps were selected as a starting seed for the direct reconstructing process, respectively. The gaps could be fixed by aligning the new consensus sequences with the initial assembled cp genome sequence. At last, the quality checks were implemented by mapping the trimmed reads to the assembled cp genome sequence. The assembled sequence was annotated by GENEIOUS R8 (Biomatters Ltd., Auckland, New Zealand), with the complete cp genome sequence of Colocasia esculenta (KC016753) serving as the reference [29]. Statistical analyses of synonymous codon usage for all protein-coding genes and gene sequence alignments were also performed with GENEIOUS R8.

Comparative genome analysis
The synonymous (dS) and nonsynonymous (dN) substitutions of chloroplast genomes in members of Araceae were evaluated by BioEdit [30,31] and DnaSPs software [31,32] The plastid genomic sequences of E. aureum and 37 other species were collected from NCBI (S1 Table). The coding sequences (CDS) of 53 common protein-coding genes (S1 File) from the 38 species were aligned with the MAFFT method based on codons by GENEIOUS R8. The final alignment was concatenated by the 53 CDS alignments (S2 File). The best nucleotide substitution model (GTR+G+I) was tested and a Maximum Likelihood (ML) tree (1000 bootstrap replicates) was constructed with MEGA 6.0 software [36].

Comparisons of chloroplast genomes among eight species in Araceae
We compared the cp genome sequences of E. aureum and other species in the Araceae family and found similar GC contents that ranged from 35.66 to 36.39% (Table 1), which is within the range of 31.0 to 38.0% for typical monocots [31]. Within the Araceae family, the cp genomes are larger for four floating plants (L. minor, S. polyrhiza, W. australiana, and W. lingulata) than for four terrestrial plants (C. esculenta, E. aureum, P. ternata, and D. seguine) ( Table 1). We used mVISTA to study these variations and confirmed that the overall gene content and order in the cp genome of E. aureum is similar to those in other Araceae members (Fig 2), without inversions or translocations. However, the greatest variations were observed in four genes (clpP, accD, ycf1, and ycf2) in the coding regions for E. aureum and Dianthus superbus [40]. In addition, one copy each of ycf1 and rps15 is located in the IRs for our four floating plants whereas those genes are within the SSC regions for the four terrestrial plants. As shown in Fig 2, a feature common to the genomes of land plants is that genes are much more highly conserved in their coding regions than in their non-coding regions [3,40]. ycf1 and ycf2. During the evolution of plastids, genome sizes and individual plant genes have changed, with many functional genes being lost or transferred to nuclear or mitochondrial genomes [3]. For example, the remarkable reduction in genome size and gene contents in Epifagus (a non-photosynthetic parasitic species) is due to the uselessness of most photosynthesis and expression-related genes [3,41]. The possible reason of genome reduction is that the tactic could save resources for their DNA replication quickly under adverse environmental conditions, e.g. genes lost from Pinaceae, Gnetophytes, and intron and spacer regions reduction in Gnetum and Welwitschia [3,41]. Among the 85 ycf (hypothetical open reading frame) genes found in plant cp genomes, functions are still unclear for most of them, and only 18 genes have been characterized as being involved in photosynthesis or other essential processes [42]. In Table 1. Comparison of chloroplast sequences for E. aureum and other species within Araceae family.   contrast to genes that have been lost, both ycf1 and ycf2 have possibly arisen from an ancestor common to land plants and some green algae [3,43]. In decades, researchers have paid much attention to ycf1 and ycf2, the two largest open reading frames in chloroplast genomes of higher plants, which are thought to participate in various cellular activities related to ATPase, chaperones, cell division, structural remodeling, transport of proteins across membranes, or proteolysis [3,[44][45][46][47]. In Arabidopsis, ycf1 is functionally classified as coding TIC214 (a translocon at the chloroplast inner envelope membrane), which is considered essential for plant survival because most exotic proteins are imported into chloroplasts via Tic214/Tic20 [48]. Moreover, the loss of ycf1 genes from monocotyledonous species (e.g., Poales, Acorales, and some parasitic plants) and dicotyledons such as Vaccinium sp. indicates that YCF1 does not act as a TIC in other plants as it does in Arabidopsis [3,[48][49][50], or that is possibly transferred to the nucleus in these species. In contrast, Nakai et al suggest YCF1 is a green TIC and largely conserved among the Chlorophyta and land plants [51]. Hence, whether ycf1 is a bona fide green TIC is still controversial. InfA, ycf15, and ycf68. Our global comparison via mVISTA also revealed homologous fragments of infA and ycf15 in the cp genomes from the majority of Araceae members (Fig 2). However, except for the loss of infA in L. minor, both appear to be pseudogenes based on the presence of stop codons in those sequences (Fig 3A and 3B). Therefore, the cp genome of E. aureum has two pseudogenes: one infA and two ycf15 (in IR regions). Our results are consistent with those reported for infA and ycf15 in C. esculenta [29]. infA is a functional gene in more than 300 angiosperm chloroplasts, which encodes translation initiation factor 1 [43]. During evolution, it was lost entirely or persists as a pseudogene in most angiosperm, or it was transferred to nucleus in some species [43,52]. The ycf15, with unknown function, is usually located downstream of ycf2, with an intact or interrupted form in a small group of land plants [3,53]. The role of ycf15 in Spinacia oleracea may be similar to that of sprA in tobacco, acting as a regulatory sequence or specifying a structural RNA that is non-essential for normal growth, but which involves complex post-transcriptional splicing and does not function as a protein-coding gene [3,[53][54][55].

Length (bp) GC (%) Length (bp) GC (%) Length (bp) GC (%) AT (%) GC (%)
Similarly, the function of ycf68 is also unclear. This gene occurs intact in the IR regions of our four tested land plants but is a pseudogene in the four floating plants examined here. Therefore, expression of ycf68 might be related to plant adaptations to terrestrial circumstances. Other research results have suggested that it is a protein-coding gene in plants such as  Pinus thunbergii, P. koraiensis, Nymphaeales, and grass species; that it participates as an excising intro or regulatory gene in non-protein processes for Aneura mirabilis and several angiosperms; or else exists as a pseudogene in most other species [3,53]. Both Ycf15 and ycf68 are conserved when occurring in the slowly evolving IR region, but not because of any functional requirement [46,53].  Table 4). The dominant start codon is ATG, substituting for CTG in accD and rps19, ACG in ndhD and rpl2, and ATA in rpl22. In most angiosperm plasmids, the start codons of rpl2 and rps19 are common RNA editing sites where ACG changes to AUG [2,31]. The ACG editing site in ndhD transcripts might have been lost through the slow rate of evolution or a C mutation back to T in the ndhD The complete chloroplast genome sequence of Epipremnum aureum start codon. Thus, this loss of ACG as the start codon in ndhD might seriously influence the accumulation of NDH complex in the leaves [2]. Amino acids leucine (10.4%) and cysteine (1.2%) are found in the highest and lowest proportions, respectively (Fig 4), while the codons of AAA (1117) for lysine and UGC (83) for cysteine occur in the highest and lowest proportions, respectively. These features of leucine (10.4%) and UGC (83) for cysteine have also been reported for the cp genomes of Colchicum autumnale, Gloriosa superba, and Metasequoia glyptostroboides [10,31]. However, except for E. aureum, AAA (1117) is the most widely used codon in these species [10]. The codon usage of E. aureum is rather similar to that of Nelumbo nucifera and Colchicum autumnale [2,31]. Our relative synonymous codon usage (RSCU) results (Table 4) demonstrate that codons in the E. aureum cp genome at the third position have a higher nucleotide frequency for A or T than for G or C. Furthermore, the calculated RSCUs for alanine show that the values for GCC (0.635) and GCG (0.431) are dwarfed by those of GCA (1.193) and GCT (1.741). In the cp genomes of numerous land plants, it is common for the preference of AT at the third codon position to be positively correlated with the contents of A and T [10].

Simple sequence repeats and long repetitive sequences
Simple sequence repeats, also known as microsatellites or short tandem repeats, are 1-to 6-bp repeating sequences that are widely distributed over a genome. Because of their co-dominance and high degree of polymorphism, SSRs are useful genetic markers for research involving gene flow, population genetics, molecular breeding, and gene mapping [10]. The cp genome of E. aureum (S2 Table) includes 111 SSRs longer than 10 bp that are repeated 5 to 74 times ( Fig  5A). Mononucleotides, dinucleotides, and tetranucleotides account for 82.9%, 14.4%, and 2.7%, respectively, of those SSRs (Fig 5A and 5B). The majority (98.9%) of the mononucleotide repeats is AT-rich. Likewise, all of the dinucleotides and tetranucleotide are AT-rich, being composed of AT/TA and ATAA/TATT repeats, respectively (Fig 5A). These results demonstrate that chloroplast SSRs usually consist of short repeats of polyA/T [31]. The high A/T frequency in chloroplast SSRs also contributes to the preference in base composition of the E. aureum cp genome, with an A/T abundance of 64.2%.
The SSRs within a cp genome are abundantly distributed in the non-coding regions [10,31]. In E. aureum, those SSRs are more numerous in the non-coding regions (73.0% of all SSRs) than in the protein-coding regions (15.3% of the total; Fig 5C). Most are located in the IGS (Fig 5C), while 17 SSRs occur in protein-coding genes such as rps19, rps7, rpoC1, psbI, ndhD, accD, ycf1, and ycf2 (S2 Table). The genes ycf1 and ycf2 contain more SSR loci than the other genes (S2 Table). Furthermore, 11.7% of the SSRs are in the introns and only 15.3% are in the protein-coding regions, accounting for 49.1% of the entire cp genome. This suggests that the SSRs are distributed unequally. The E. aureum genome has multiple repeat sequences in the trnT(UGU)-trnL(UAA) IGS and in ycf1 (i.e., each with six repeats). This SSR information serves as a useful reference for future studies on the identification, development, and application of molecular markers for examining population genetics.
The distribution of SSRs varies among species, with genomic diversity being smaller in higher plants [56]. We also detected fewer SSRs (111) in E. aureum than in D. superbus (10,543) and Ananas comosus (205). However, our focus species has more than the 58 in Colchicum autumnale and the 56 in Gloriosa superba, and its SSRs are longer than 15 bp in the non-coding regions [13,31,40]. These characters of SSRs in E. aureum are consistent with a previous report that the majority of mononucleotide repeats is distributed universally in cp genomes when compared with other types of SSRs; that the longer SSRs are mainly located in non-coding regions; and that the number, relative abundance, and relative density of SSRs are weakly influenced by GC content in those cp genomes [56].
Repeat motifs are crucial when analyzing genome recombination, rearrangement, and phylogenetic development, or when inducing substitutions and indels in a cp genome [3,10]. The E. aureum chloroplast DNA has 43 large repeats that are longer than 9 bp (S3 Table). They include five forward repeats (11.6%), two palindromic repeats (4.7%), and 36 tandem repeats (83.7%; Fig 6A). Among them, 53.5% are shorter than 50 bp (Fig 6D). Most of the tandem The complete chloroplast genome sequence of Epipremnum aureum repeats are less than 100 bp long while the forward and palindromic repeats are more than 100 bp in length ( Fig 6C). Furthermore, 36 tandem repeats are nine to 115 bp, with most being less than 50 bp. The forward repeats are 127 to 217 bp long, while the two palindromic repeats are 137 and160 bp. Among these 43 repeats, 60.5% are located in the LSC, 16.3% in the SSC, and 23.2% in the IR regions ( Fig 6B). Most (62.8%) are distributed in the IGS while 18.6% are in the protein-coding regions and 9.3% in the introns. The number, composition, and distribution of repeats in E. aureum are similar to those in Ananas comosus [13]. These repeat motifs have been selected for population studies and phylogenetic analysis because they are an informative source for developing markers [10,39].

Comparison of IR junctions among Araceae species
Changes in genes within the IRs significantly influence the boundaries of those particular regions with the LSC or SSC region, as well as expansion and contraction of the genome [31]. We compared IR junctions among members of Araceae (Fig 7) and found that those locations are generally similar for all cp genomes. However, some notable characteristics include the following. First, the LSC/IRa boundary is within trnH-rpl2. The boundary that divides LSC/IRb fluctuates near the rps19 to rpl2 region with one exception, i.e., the boundary for genes from W. australiana is between trnH and rpl23 due to the loss of rpl2 in the IRa region. Second, The complete chloroplast genome sequence of Epipremnum aureum Araceae members can be assigned to two separate groups based on their differences in SSC/IR junctions: Group 1, SSC/IRa and SSC/IRb boundaries occurring at ycf1-trnN and ndhF-trnN in C. esculenta, E. aureum, P. ternata, and D. seguine; and Group 2, SSC/IRa and SSC/IRb boundaries expand to adhH/rps15 and ndhF/rps15, respectively, in the cp genomes of L. minor, S. polyrhiza, W. australiana, and W. lingulata. These findings are in accord with those described for members of Alismatales [22].

Base substitution ratios among Araceae species
The nonsynonymous (dN) and synonymous (dS) substitution ratio, or dN/dS, is an important means for evaluating genome evolution [31,57]. We analyzed dN/dS among Araceae members (Fig 8), using the genome of C. crapnelliana (Theaceae) as our reference. The majority of genes in those genomes showed ratios of less than 1.00 (the exceptions being rpsl2, accD, and ycf2). In the SSC region, the highest and lowest ratios were identified with ycf1 and psaC, respectively, while the other ndh genes ranged from 0.10 to 0.30. Because the ratios for rps2, rps3,  rps4, rps7, rps11, rps12, rps14, rps18, rps19, rpl2, rpl36, ycf1, ycf2, clpP, atpA, and ropB were obviously higher in E. aureum than in the other species, we examined the sequence diversity in genes that were among the exceptions. The highest ratio was calculated for accD, and was approximately 1.00 or larger in the LSC region, thereby indicating that this gene is not conserved in the cp genome. The ratio of 1.27 for rps12 in E. aureum was five times higher than that in other species, and it was has two positively selected sites (including Val 39 and Tyr 43 ), showing that this gene is under positive selection in E. aureum (Table 5). In contrast, the substitution rates were low to zero in the photosynthesis-related genes psb (I, M, L, and F) and petN, which are more conserved than the other gene groups in all cp genomes because of strong functional constraints. The dN/dS ratios of ycf2, ndh, and photosynthetic genes were similar to those of colchicine plants Colchicum autumnale L. and Gloriosa superba L. [31].

Phylogenetic analysis
To examine the phylogenetic position of E. aureum within Araceae (Alismatales), we selected 37 species representing 13 families, i.e., Melanthiaceae, Alstroemeriaceae, Arecaceae, Orchidaceae, Acoraceae, Araceae, Hydrocharitaceae, Poaceae, Salicaceae, Rosaceae, Brassicaceae and Ginkgoaceae. Gingko biloba was set as the outgroup. After the cp genome sequences were downloaded from NCBI, they were aligned and a phylogenetic tree was constructed by the ML method. The position of E. aureum was situated as the sister of C. esculenta, which is a closely related species in the Araceae family (Fig 9); and Acorus (Acorales) was sister to Alismatales and other monocots clades. The result is consistent with the classification of Acorus (Acorales) according to the APG (Angiosperm Phylogeny Group) III system published in 2009 [58]. The previous report of plastid phylogenomic analyses strongly supports Acorales as sister to the remaining monocots and monophyly of Alismatales [22]. The genus Acorus was once placed within the family Araceae (Alismatales), though it was separated from Araceae but within  The complete chloroplast genome sequence of Epipremnum aureum Acorales, there was data that still support to place it within Alismatales [27]. Hence, the positioning of Acorus is still under dispute. Meanwhile, the circumscriptions of Araceae and Alismatales also have a fair amount of conflicting information [24,26]. The controversial placement of Acorus and the disputed relationships within Alismatales mayhave resulted from The complete chloroplast genome sequence of Epipremnum aureum the different data sets used [27]. With the availability of chloroplast sequences for this family, including Alismatales, future research can focus more deeply on phyletic evolution.
Supporting information S1 File. The gene names of 53 protein-coding coding sequences used in phylogenetic tree.
(DOCX) S2 File. The alignment of coding sequences of 53 protein-coding genes for phylogenetic tree.