Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome

  • Hasan Awad Aljohi,

    Contributed equally to this work with: Hasan Awad Aljohi, Wanfei Liu, Qiang Lin

    Affiliations Joint Center for Genomics Research (JCGR), King Abdulaziz City for Science and Technology and Chinese Academy of Sciences, Riyadh, Saudi Arabia, National Center for Genomics Research (NCGR), King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia

  • Wanfei Liu,

    Contributed equally to this work with: Hasan Awad Aljohi, Wanfei Liu, Qiang Lin

    Affiliations Joint Center for Genomics Research (JCGR), King Abdulaziz City for Science and Technology and Chinese Academy of Sciences, Riyadh, Saudi Arabia, CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China

  • Qiang Lin,

    Contributed equally to this work with: Hasan Awad Aljohi, Wanfei Liu, Qiang Lin

    Affiliations Joint Center for Genomics Research (JCGR), King Abdulaziz City for Science and Technology and Chinese Academy of Sciences, Riyadh, Saudi Arabia, CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China

  • Yuhui Zhao,

    Affiliation CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China

  • Jingyao Zeng,

    Affiliation CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China

  • Ali Alamer,

    Affiliations Joint Center for Genomics Research (JCGR), King Abdulaziz City for Science and Technology and Chinese Academy of Sciences, Riyadh, Saudi Arabia, National Center for Genomics Research (NCGR), King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia

  • Ibrahim O. Alanazi,

    Affiliations Joint Center for Genomics Research (JCGR), King Abdulaziz City for Science and Technology and Chinese Academy of Sciences, Riyadh, Saudi Arabia, National Center for Genomics Research (NCGR), King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia

  • Abdullah O. Alawad,

    Affiliation National Center for Genomics Research (NCGR), King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia

  • Abdullah M. Al-Sadi,

    Affiliation Department of Crop Sciences, Sultan Qaboos University, AlKhoud, Oman

  • Songnian Hu,

    Affiliations Joint Center for Genomics Research (JCGR), King Abdulaziz City for Science and Technology and Chinese Academy of Sciences, Riyadh, Saudi Arabia, CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China

  • Jun Yu

    Affiliations Joint Center for Genomics Research (JCGR), King Abdulaziz City for Science and Technology and Chinese Academy of Sciences, Riyadh, Saudi Arabia, CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China


Abstract

Coconut (Cocos nucifera L.), a member of the palm family (Arecaceae), is one of the most economically important crops in the tropics, serving as an important source of food, drink, fuel, medicine, and construction material. Here we report an assembly of the coconut (C. nucifera, Oman local Tall cultivar) mitochondrial (mt) genome based on next-generation sequencing data. This genome, 678,653bp in length and 45.5% in GC content, encodes 72 proteins, 9 pseudogenes, 23 tRNAs, and 3 ribosomal RNAs. Within the assembly, we find that chloroplast (cp) derived regions account for 5.07% of the total assembly length and include 13 proteins, 2 pseudogenes, and 11 tRNAs. The mt genome has a relatively large fraction of repeat content (17.26%), including both forward (tandem) and inverted (palindromic) repeats. Sequence variation analysis shows that the Ti/Tv ratio of the mt genome is lower than both that of the nuclear genome and the neutral expectation. By combining public RNA-Seq data for coconut, we identify 734 RNA editing sites supported by at least two datasets. In summary, our data provide the second complete mt genome sequence in the family Arecaceae, which is essential for further investigations of the mitochondrial biology of seed plants.


Introduction

The plant mitochondrial (mt) genome is considered a remnant of an ancestral α-proteobacterium that lived as a symbiont in the eukaryotic common ancestor [1]. It is involved in cellular energy production by respiration and in the regulation of various cellular functions, such as homeostasis, apoptosis, and metabolite biosynthesis [2]. Since the first mt genome of a land plant was published (Marchantia polymorpha, liverwort) [3], 303 mt genomes had become available in the NCBI organelle database as of December 9, 2015 [4]. Plant mt genomes have several characteristics that make them important for evolutionary studies. First, plant mt gene contents are highly variable across plant taxa [5]: genes are acquired from both plastid and nuclear genomes via intracellular transfer [6–8], as well as from other species via horizontal transfer [9, 10]; plant mt genes can also be transferred to the nuclear genome [11]. Second, plant mt genomes evolve more rapidly in structure, but more slowly in primary sequence [12, 13], than both their chloroplast (cp) and nuclear counterparts. The size expansion of plant mt genomes primarily reflects an increase in intronic and intergenic DNA [13], as plant mt genomes have dramatically lower mutation rates than both their cp and nuclear counterparts [14, 15]. Third, plant mt genomes are present in a large number of copies per cell and show a remarkable amount of rearrangement [16]. A recent study has also shown that copies of the Silene noctiflora mt genome can be gained or lost, which emphasizes the evolutionary differences among the mt, cp, and nuclear genomes [17]. Fourth, plant mt genomes have a large number of intron-containing genes, some of which need trans-splicing to produce complete transcripts [18]. Fifth, plant mt genomes have a high frequency of RNA editing, which contributes to the functional conservation of mt proteins [19, 20]. In plants, RNA editing affects mitochondrial and plastid transcripts by site-specific modification of cytidines to uridines and the reverse [20–23]. Taken together, these characteristics of plant mt genomes make sequence assembly and analysis difficult. Recently, we released a mt genome assembly of date palm (Phoenix dactylifera) as the first of the palm family [24], and now we add that of the coconut (Cocos nucifera L.) as the second.

C. nucifera, or coconut, is one of the most economically important crops in the tropics, serving as a source of food, drink, fuel, medicine, and construction material [25]. Although the plant has significant economic value, there have been only a limited number of studies on its genome. Based on a flow cytometric analysis, the diploid genome of coconut is 5.966 ± 0.111 pg or 5.757 Gb in size, i.e., its haploid counterpart is 2.8785 Gb [26]. Genome sequencing data also support this estimate, showing a genome size of ~2.6 Gb with 50% to 70% repeat content [27]. Recently, several coconut transcriptomic studies have been reported [28–30], providing datasets for de novo transcriptomic assembly and other molecular studies. Coconut cultivars are generally classified into two types, the Tall and the Dwarf. In this study, we present the first coconut mt genome, that of the Oman local Tall variety. We first acquired high-throughput sequences of total cellular DNA using the Roche/454 platform and assembled them into a complete chromosome, and then corrected some of the sequence variations using Illumina HiSeq data. We also analyzed the genome assembly using transcriptomic data for genome structure and functional genes based on various comparative genome analysis tools.

Materials and Methods

Plant materials

Fresh green leaves from an adult coconut plant of the Tall cultivar located at Salalah, Dhofar Governorate, Oman, were collected, washed with double-distilled water, and frozen immediately in a liquid nitrogen container. The farm is owned by one of the co-authors of this work, Dr. Abdullah M. Al-Sadi, who is employed by Sultan Qaboos University and to whom future inquiries should be addressed. This study does not involve endangered or protected species and does not require specific permission from any regulatory authority concerned with wildlife protection. After being transported to the laboratory, the samples were stored in -80°C freezers until use.

Genomic DNA isolation and sequencing

The raw coconut mt genome sequences were extracted from those produced as part of the Palm Plant Genome Project (a joint effort between KACST and BIG, CAS). Genomic DNA was isolated from 50g fresh leaves according to a CTAB-based method [31]. A total of 5mg purified DNA was used to construct both single-read and paired-read libraries (3kb and 8kb insert sizes) according to the manufacturer’s manual for GS FLX Titanium. The libraries were amplified and sequenced on the Roche/454 GS FLX platform. All Roche/454 data were deposited in the BIGD database (accessions CRX007340 and CRX007339). The same purified DNA sample was also used for constructing the Illumina HiSeq libraries. HiSeq paired-end (≤500bp) and mate-pair (1kb to 8kb) libraries were constructed using the Illumina Simple Paired-End Library and Mate-Pair Library Preparation Protocols, respectively. The libraries were sequenced on the Illumina HiSeq 2000 platform. The HiSeq data used for coconut mt genome correction were deposited in the BIGD database (accessions CRX007360, CRX007361 and CRX007362).

Sequence assembly and validation

We first assembled total reads from 13 single-read datasets and 12 paired-read datasets into 573,893 contigs using Newbler 2.6 (with the “-a 0” option and defaults for all others), a de novo sequence assembly program. We then aligned the assembled contigs to 234 published land plant mt genomes downloaded from the NCBI organelle database on September 16, 2015 using BLAST (identity ≥ 80%, E-value ≤ 1e-5, and overlap percent ≥ 30%) [32–34]. We next used 353 annotated contigs (lengths ranging from 102bp to 49,695bp with a median of 399bp) to build scaffolds using bb.454contignet, and manually checked them based on contig coverage and spanning reads in the Newbler assemblies [35]. We finally obtained a single gap-free scaffold of 678,112bp from 143 overlapping contigs.
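The contig-selection thresholds above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Hit` record, its field names, and the reading of "overlap percent" as the aligned fraction of the contig length are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST alignment of a contig against a reference mt genome (hypothetical record)."""
    contig: str
    contig_len: int   # contig length in bp
    identity: float   # percent identity of the alignment
    evalue: float
    aln_start: int    # 1-based, inclusive, on the contig
    aln_end: int      # 1-based, inclusive, on the contig


def passes(hit, min_identity=80.0, max_evalue=1e-5, min_overlap=30.0):
    """Apply the three thresholds quoted in the text.

    Assumption: "overlap percent" is the aligned span as a percentage
    of the contig length.
    """
    aligned = abs(hit.aln_end - hit.aln_start) + 1
    overlap = 100.0 * aligned / hit.contig_len
    return (hit.identity >= min_identity
            and hit.evalue <= max_evalue
            and overlap >= min_overlap)


def mt_like_contigs(hits):
    """Return the sorted names of contigs with at least one passing hit."""
    return sorted({h.contig for h in hits if passes(h)})


example_hits = [
    Hit("contig1", 400, 92.0, 1e-30, 1, 200),    # 50% of the contig aligned: kept
    Hit("contig2", 1000, 95.0, 1e-50, 1, 100),   # only 10% aligned: dropped
    Hit("contig3", 500, 75.0, 1e-20, 1, 500),    # identity below 80%: dropped
]
print(mt_like_contigs(example_hits))
```

Contigs retained by such a filter would still need the manual checks on coverage and spanning reads described above.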

To correct sequence errors in the assembly that are characteristic of the Roche/454 platform, such as homopolymer errors (typical of pyrosequencing), we mapped HiSeq paired-end data (180bp insert size) to the assembly with bowtie2 (version 2.2.4) [36]. The consensus sequence was obtained using samtools (version 1.2) [37, 38] and bcftools (version 1.2) [39]. The length became 678,133bp after this correction. As a byproduct, we identified several pseudogenes due to frame-shifts caused by homopolymers. We checked the final assembly manually based on Roche/454 and HiSeq paired-end data using IGV software (version 2.3.61) and revised 687 loci with 528 indels and 159 SNPs [40, 41]. Finally, we obtained a new length of 678,653bp with average sequence depths of ~42x for Roche/454 data and ~1788x for HiSeq data. We checked this assembly using HiSeq mate-pair data with insert sizes of 5kb and 8kb in 5-kb and 8-kb sliding windows, respectively. On average, our final genome assembly was supported by 59.57% and 58.37% of the mate-pair reads from the 5-kb and 8-kb libraries, respectively. The complete mt genome sequence was deposited in GenBank (accession number KX028885).

Sequence annotation

We aligned our assembly to the mt genes from 18 representative land plants with BLAST (identity ≥ 80% and E-value ≤ 1e-5) and identified ORFs using ORF finder [4]. Introns were predicted using Rfam (v1.1 with default parameters) [42] (S1 Table), and tRNA genes were identified using BLAST (v2.2.26+) and tRNAscan-SE (v1.23) [43]. All rRNA genes were identified similarly. The cp-derived regions were identified by comparing the mt genome with the cp genome (GenBank accession number KX028884) based on BLAST (identity ≥ 80%, E-value ≤ 1e-5, and length ≥ 50bp). REPuter and Tandem Repeats Finder were used to identify forward, palindromic, and tandem repeats [44, 45].

Sequence variants

Sequence variants were identified based on HiSeq paired-end data with a 180-bp insert size. The raw reads were mapped to the final mt genome using bowtie2 (version 2.2.4) [36], and the variants were called using two methods: the RGAAT tool, which was developed in our laboratory, and samtools with bcftools (version 1.2) [37–39]. To eliminate false positives, we kept only the variations identified by both methods. To evaluate the variations between the two palm species, C. nucifera and P. dactylifera, MUMmer3 was used for genome alignment [46].
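Keeping only the variants reported by both callers amounts to a set intersection on normalized variant records. The sketch below assumes each call is a (sequence, position, reference allele, alternative allele) tuple; the function names and the example calls are hypothetical and do not reflect the output formats of RGAAT or bcftools.

```python
def variant_key(seq_id, pos, ref, alt):
    """Normalize one call so that the two callers can be compared directly."""
    return (seq_id, int(pos), ref.upper(), alt.upper())


def shared_variants(calls_a, calls_b):
    """Return the calls from calls_a that are also present in calls_b, in order."""
    keys_b = {variant_key(*call) for call in calls_b}
    return [call for call in calls_a if variant_key(*call) in keys_b]


# Made-up calls for illustration only.
caller_one = [("mt", 1200, "A", "C"), ("mt", 5300, "g", "t"), ("mt", 9100, "C", "T")]
caller_two = [("mt", "5300", "G", "T"), ("mt", 9100, "C", "A"), ("mt", 1200, "A", "C")]

print(shared_variants(caller_one, caller_two))
```

Note that a position reported by both callers with different alternative alleles (9100 in the made-up data) is not counted as shared under this definition.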

RNA editing analysis

We predicted putative RNA editing sites based on 8 public RNA-Seq datasets of coconut palm (SRR1063404, SRR1063407, SRR1137438, SRR1173229, SRR1265939, SRR1273070, SRR1273180, and SRR606452). After filtering out low-quality reads and removing adapter sequences with Trimmomatic (version 0.33) [47], we mapped all high-quality reads to the mt genome using GSNAP (version 2014-12-19) with the options “-N 1” and “--force-xs-dir” (all other options at their defaults) [48]. The candidate RNA editing loci were filtered through read mapping with the following criteria: (1) there are more than 2 aligned reads for each alternative allele, and (2) the percentage of the alternative allele must be equal to or above 50%. We identified 845 RNA editing sites using the REDO tool and predicted putative RNA editing sites in protein-coding genes using the web-based PREP-mt program with a cutoff score of 0.6 [49].
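The two read-support criteria can be written as a single predicate. This is a minimal sketch rather than the REDO implementation; the function name, its parameters, and the simplification to one reference and one alternative allele per site are assumptions.

```python
def is_candidate_edit(ref_reads, alt_reads, min_alt_reads=3, min_alt_fraction=0.5):
    """Apply the two criteria from the text to one site.

    "More than 2 aligned reads" for the alternative allele is encoded as
    alt_reads >= 3, and the alternative allele must make up at least 50%
    of the reads covering the site.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return False
    return alt_reads >= min_alt_reads and alt_reads / total >= min_alt_fraction


print(is_candidate_edit(ref_reads=5, alt_reads=5))   # passes both criteria
print(is_candidate_edit(ref_reads=1, alt_reads=2))   # too few alternative reads
print(is_candidate_edit(ref_reads=7, alt_reads=3))   # alternative allele below 50%
```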

Phylogenetic analysis

Thirty-one representative mt protein-coding genes were extracted from 19 species, including 8 monocots, 6 eudicots, and one each from gymnosperm (Cycas taitungensis), vascular plant (Phlegmariurus squarrosus), liverwort (M. polymorpha), hornwort (Phaeoceros laevis), and moss (Physcomitrella patens). Their amino acid sequences were aligned using clustalw2 (version 2.1) [50]. We used two statistical methods, Maximum Likelihood (ML) with the Jones-Taylor-Thornton (JTT) substitution model and Maximum Parsimony (MP), in MEGA (version 6.06) to infer phylogenies from the concatenated alignments with 1000 bootstrap replicates [51]. Sites with gaps or missing data were eliminated when the site coverage was below 90%. Phylogenetic trees were visualized with the EvolView program [52].
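The site-coverage rule can be sketched as a column filter over an alignment. The code below illustrates the idea only and is not MEGA's implementation; the function name, the dictionary input, and the treatment of "-" and "?" as missing data are assumptions.

```python
def filter_columns(aligned, min_coverage=0.90, missing="-?"):
    """Drop alignment columns in which fewer than min_coverage of the
    sequences carry a residue (anything other than a gap or missing symbol).

    aligned maps a sequence name to its aligned sequence; all sequences
    must have the same length.
    """
    names = list(aligned)
    seqs = [aligned[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("aligned sequences must have equal length")
    keep = [
        i for i in range(length)
        if sum(seq[i] not in missing for seq in seqs) / len(seqs) >= min_coverage
    ]
    return {name: "".join(aligned[name][i] for i in keep) for name in names}


# Toy alignment: columns 2 and 3 each have one gap among three sequences
# (coverage 66.7%), so only columns 1 and 4 survive the 90% cutoff.
toy = {"sp1": "MK-L", "sp2": "MKAL", "sp3": "M-AL"}
print(filter_columns(toy))
```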

Transcriptome analysis

We counted the number of reads for each gene of the mt genome using an in-house Perl script and identified differentially expressed genes using DESeq (version 1.20.0) [53]. To identify novel genes, we used Trinity (version 2.0.6) to construct transcripts based on the GSNAP mapping results [54]. If different mt genes were assembled into one sequence, we assigned them to a polycistronic transcription unit.

Results and Discussion

The C. nucifera mt genome content

We started the C. nucifera (Oman local Tall variety) mt genome assembly based solely on the Roche/454 GS FLX data, including 7,617,799 single reads, 2,884,708 paired reads with 3-kb insert size, and 1,594,036 paired reads with 8-kb insert size. After homopolymer correction using the Illumina reads, we obtained an assembly of 678,653bp in length (Fig 1; see Materials and Methods). It encodes 72 proteins (87 protein-coding genes, 8.62% of mt genome), 9 truncated proteins (codon frameshift mutations; 10 pseudogenes, 0.83% of mt genome), 23 tRNAs (corresponding to 17 amino acid codons and one stop codon, 42 tRNA-coding genes, 0.46% of mt genome), and 3 ribosomal RNAs (6 rRNA-coding genes, 1.51% of mt genome), which together constitute a gene content of 11.43% (77,542bp) (Table 1). Among them, 13 proteins (15 protein-coding genes), 2 truncated proteins (codon frameshift; 3 pseudogenes), 11 tRNAs (corresponding to 10 amino acid codons, 13 tRNA-coding genes) and 3 ribosomal RNAs (3 rRNA-coding genes) are located in the chloroplast-derived regions, which account for 5.07% of the genome sequence. The GC contents of protein-coding genes, pseudogenes, tRNAs, rRNAs, and the remaining non-coding sequences are 44.5% (58,895bp), 47.7% (5,294bp), 41.1% (3,092bp), 53.5% (10,261bp), and 45.5% (601,111bp), respectively. The genome harbors 0.49% tandem repeats (3,310bp) and 17.26% long repeats (≥100bp). In addition, there are 13 co-transcribed gene clusters, including 18S-5S rRNA and nad3-rps12, which are conserved among angiosperm mt genomes [55]. Our phylogenetic analysis shows that C. nucifera clusters with P. dactylifera and Butomus umbellatus among the monocotyledon plants (Fig 2).
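The overall gene content follows directly from the feature lengths quoted alongside the GC contents. The short sketch below reproduces that arithmetic and adds a generic GC-content helper; the variable and function names are illustrative and not taken from the authors' scripts.

```python
GENOME_LENGTH = 678_653  # bp, final assembly length

# Feature lengths (bp) as quoted in the text next to the GC contents.
feature_bp = {
    "protein-coding": 58_895,
    "pseudogene": 5_294,
    "tRNA": 3_092,
    "rRNA": 10_261,
}


def percent(part, whole):
    """Percentage of whole represented by part, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def gc_content(seq):
    """GC percentage of a nucleotide sequence (case-insensitive)."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


total_genic = sum(feature_bp.values())
print(total_genic)                          # genic bp
print(percent(total_genic, GENOME_LENGTH))  # gene content in percent
print(GENOME_LENGTH - total_genic)          # remaining non-coding bp
print(gc_content("GGCCAATT"))               # toy sequence, half G or C
```

The genic total and the non-coding remainder computed this way agree with the 77,542bp and 601,111bp given above.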

Fig 1. Circular display of C. nucifera mt genome.

We display (from outside to inside): physical map scaled in kb; coding sequences transcribed in the clockwise and counterclockwise directions (nad in red; cob, matR and mttB in green; cox in blue; atp in purple; ccm in orange; rpl in yellow; rps in dark red; rRNA in dark green; tRNA in dark blue; orf in dark purple; and others in black); chloroplast-derived regions (green); repeats (forward repeats in green, palindrome repeats in red and tandem repeats in blue); RNA edit sites (synonymous in green and non-synonymous in red); gene conserve scores (black); proper HiSeq mate-pair (MP) reads percent with insert size 5kb and 8kb (blue); and the four regions (thick lines indicate IRs and thin lines indicate LSC and SSC). * indicates pseudogenes.

Fig 2. Phylogenetic trees of 31 mt proteins from 19 plant species.

Shown on the left is a maximum parsimony tree and on the right a maximum likelihood tree, both built with MEGA 6.06. The C. nucifera mt proteins form a cluster with those of P. dactylifera and B. umbellatus among monocotyledons.

Protein-coding, rRNA, and tRNA genes

The C. nucifera mt genome encodes 50 known functional and 22 hypothetical proteins. Among the first group, 23 proteins are related to the electron transport chain, including 9 subunits of nicotinamide adenine dinucleotide dehydrogenase (complex I), one subunit of succinate dehydrogenase (complex II), one apocytochrome b (complex III), 3 subunits of cytochrome c oxidase (complex IV), 5 subunits of ATP synthase F1 (complex V), and 4 cytochrome c biogenesis proteins (Table 1).

First, when comparing the C. nucifera mt proteins to those of 18 other plants (S1 Table and S1 Fig), we identified an sdh gene that is unique to the coconut and absent in the 7 other monocots. Second, as in the cases of Vitis vinifera, S. latifolia, and P. dactylifera, RNA polymerase genes are identified in the mt genome (one RNA polymerase and one DNA-dependent RNA polymerase). Third, the C. nucifera mt genome has the highest copy number of rps19 genes (5 copies) among all 19 inspected species, followed by V. vinifera (3 copies). Fourth, there is no rps3 gene in the C. nucifera mt genome, whereas it exists in the 7 other monocot species. Fifth, rpl10 (pseudogene) and rps11 (protein-coding gene) are found only in P. dactylifera and C. nucifera among all 8 monocots. Last, a few cp-derived genes are identified in this genome, including 15 protein-coding (such as rpl14, rpl33 and rps14), 3 rRNA, and 13 tRNA genes as well as 3 pseudogenes.

The mt genome contains 42 tRNA genes; 12 of them have introns (9 mt tRNAs and 3 cp-derived tRNAs) (Table 2). Together, these tRNA genes correspond to 17 amino acids; tRNA genes for the remaining three (Ala, Leu, and Val) are absent. The tRNA genes for Thr, His, Arg, and Gly, as well as tRNAIle(GAU), are found only in the cp-derived regions. These results are consistent with previous studies suggesting that mt tRNA genes are gradually replaced by cp-derived tRNA genes [24, 56].

Table 2. Codon usage and codon-anticodon recognition pattern in the C. nucifera mt genome.

Cp-derived regions, introns, and repeats

The plant cp and mt genomes are known to have extensive and widespread homologies due to sequence transfer [57, 58]. The transfer of cp genomic DNA to that of the mt genome has been going on for at least 300 million years [59]. In the C. nucifera mt genome, there are 33 cp-derived regions with a length range of 64 to 3,365bp (S2 Table). The total length of cp-derived regions is 34,395bp and the coding region is 37.58% (12,925bp), which is higher than mt gene content (11.41%) but lower than cp gene content (61.17%). The GC content of the cp-derived regions is 41.9%, which is between those of the cp (37.44%) and mt (45.50%) genomes. A similar trend is found in P. dactylifera with GC contents of 37.23%, 37.40%, and 45.1% for cp, cp-derived region, and mt DNA, respectively [24, 60]. These results suggest that cp-derived sequences, to some extent, have evolved to be close to the mt genome sequences in GC contents and gene coding fractions after being transferred into mt genomes.

In the C. nucifera mt genome, there are 28 intron-containing genes (16 protein-coding genes and 12 tRNA genes), and according to the prediction based on Rfam, one group I intron (not located in gene regions) and 23 group II introns were identified. Among the 23 group II introns, 15 are located in 8 protein-coding genes (nad1, nad2, nad4, nad5, nad7, rps10, cox2a and cox2b) and 2 are in 2 tRNA genes (two trnI-GAU). Although mitochondrial tRNA genes do not generally possess introns, we identified 12 intron-containing tRNA genes (including 3 cp-derived tRNA genes) in the assembly. Among 18 other plants (S1 Table), M. polymorpha (liverwort), C. taitungensis (gymnosperm), B. umbellatus (monocot), P. dactylifera (monocot), Zea mays (monocot), and V. vinifera (eudicot) have one (tRNA-Ser), one (tRNA-Val), two (tRNA-Ile and tRNA-Ala), three (tRNA-Lys, tRNA-Asn and tRNA-Stop), three (tRNA-Leu/pseudo, tRNA-Leu/pseudo and tRNA-Ile), and one (tRNA-Lys) intron-containing tRNA genes, respectively. Thus, the C. nucifera mt genome has the largest number of intron-containing tRNA genes among all analyzed sequences.

The C. nucifera mt genome contains 0.49% tandem repeats, which is comparable to that of P. dactylifera (0.33%) (S3 Table). However, it harbors 17.26% long repeats (≥100bp), a proportion that is much higher than that of P. dactylifera (2.3%) but comparable to those of other monocot species, such as Triticum aestivum (15.9%), Sorghum bicolor (16.2%), and Zea mays (19.1%) (S4 Table).

Sequence variation analysis

Based on the HiSeq data, we identified 202 and 157 variations at different positions of the genome using samtools & bcftools and RGAAT, respectively; among the total, 102 variations are discovered by both methods (S5 Table). To reduce false positives, we used only the 102 shared variations (100 SNPs and 2 insertions) for further analysis. First, 48 of the total are found in the cp-derived regions. Among all variations, only 5 SNPs are in protein-coding genes, including 3 synonymous SNPs in rps1, rps2, and rpl14 (cp-derived) and two non-synonymous SNPs in orf247-ct (S6 Table). Another 6 SNPs and 1 insertion are found in 5 tRNA genes, whereas the remaining 89 SNPs and 1 insertion are non-coding. Second, according to the variation types, there are 23 transitions (Ti) and 77 transversions (Tv), leading to a Ti/Tv ratio of 0.30. If we remove the cp-derived regions from the analysis, the ratio becomes 0.06 (3 Ti and 50 Tv SNPs). This is in sharp contrast to the nuclear genome, where the ratios are ~2.0–2.1 genome-wide and 3.0–3.3 in exonic sequences [61, 62]. The Ti/Tv ratio in the coconut mt genome is thus much lower than that of the nuclear genome, as well as that of a random prediction (0.5). It supports the observation that DNA replication and repair mechanisms are very different between mt and nuclear genomes. Third, we further scrutinized the data to exclude other possibilities that may lead to biased results. According to the Roche/454 and Illumina sequence coverage, there are ~2x, ~42x, and ~235x of the Roche/454 reads, as well as ~20x, ~1788x, and ~6000x of the Illumina reads for nuclear, mt, and cp DNA, respectively, which reflects a copy number ratio among them of ~1:55:209 on average. This result indicates that only about 1.79% of the reads mapping to similar sequences in the mt genome datasets may originate from the nuclear genome, and these can be readily excluded during sequence variation identification (alternative allele proportion ≥ 15%). Similarly, for the cp-derived regions, sequence variations are more likely to originate from cp DNA (79.17%) than from the nuclear or mt genomes.
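The Ti/Tv ratios and the copy-number estimate above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only: the function names are illustrative, and reading 1.79% as 1/(1+55) and 79.17% as 209/(55+209) is our interpretation of how those figures arise from the ~1:55:209 ratio.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    if len(pair) != 2:
        raise ValueError("ref and alt must differ")
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv(n_transitions, n_transversions):
    """Ti/Tv ratio from counts, rounded to two decimals."""
    return round(n_transitions / n_transversions, 2)


def copy_ratio(nuclear, mt, cp):
    """Sequence depths expressed relative to the nuclear depth."""
    return (1.0, mt / nuclear, cp / nuclear)


print(ti_tv(23, 77))  # all shared SNPs
print(ti_tv(3, 50))   # cp-derived regions excluded

roche = copy_ratio(2, 42, 235)          # Roche/454 depths
illumina = copy_ratio(20, 1788, 6000)   # Illumina depths
mean_mt = round((roche[1] + illumina[1]) / 2)   # mt copies per nuclear copy
mean_cp = round((roche[2] + illumina[2]) / 2)   # cp copies per nuclear copy
print(mean_mt, mean_cp)

# Assumed interpretation of the two percentages quoted in the text.
nuclear_share = round(100 * 1 / (1 + mean_mt), 2)
cp_share = round(100 * mean_cp / (mean_mt + mean_cp), 2)
print(nuclear_share, cp_share)
```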

When compared with the two taxonomically closest species in this study, P. dactylifera and B. umbellatus, only 54.45% and 14.15% of the C. nucifera mt genome could be aligned, respectively, using bl2seq (S2 Fig) [63]. To further evaluate mt genome variations between the two palm species P. dactylifera and C. nucifera, we used MUMmer to compare the alignable regions and identified 2,442 SNPs and 1,122 indels, giving an average rate of 5 variations per 1,000bp (S3 Fig).

RNA editing

RNA editing is universal to almost all plant mt transcripts [64, 65] and is characterized by tissue-specific and partial editing [66]. Different species have different RNA editing sites, and the number of RNA editing sites ranges from 200 to 600 in angiosperms [67]. The public RNA-Seq data in NCBI are an excellent and untapped resource, in which we found 8 RNA-Seq datasets from coconut (two from Tall cultivars and 6 from Dwarf cultivars) for our RNA editing analysis [68]. To differentiate sequencing errors and SNPs from editing, we kept only the RNA editing sites with more than 2 supporting reads and with at least 50% edited reads. These criteria led to the identification of 845 RNA editing sites in 56 protein-coding genes, 36 of which are in the cp-derived regions (S7 Table). Among the total RNA editing sites (92 synonymous and 753 nonsynonymous), there are 811 C->T, 26 G->A and 8 T->C sites. We compared tissue disparity among the 8 samples: healthy leaf1 has the most RNA editing sites (697, 82.49%, 18 unique) and embryo has the fewest (489, 57.87%, 22 unique). In addition, 297 RNA editing sites are shared by all 8 samples. Since the 8 samples are from two cultivars, we partitioned the editing sites between the Dwarf and Tall cultivars, yielding 835 and 675 RNA editing sites, respectively, of which 665 are shared. Considering the codon-changing edits, we ranked the top six codon changes: TCA->TTA (95, 11.24%), TCT->TTT (67, 7.93%), TCG->TTG (58, 6.86%), CCA->CTA (50, 5.92%), TCC->TTC (45, 5.33%), and CGG->TGG (45, 5.33%); 5 of them change the second codon position. Moreover, the top six edited codons are TTT (135, 15.98%), TTA (118, 13.96%), TTG (72, 8.52%), TTC (58, 6.86%), CTA (51, 6.04%), and CTT (50, 5.92%).
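The codon-level bookkeeping above (which codon position changes, and whether the edit is synonymous) can be sketched as follows. The code assumes the standard genetic code and uses hypothetical names; it is an illustration, not the procedure implemented in REDO.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard code is appropriate for these transcripts).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_edit(codon_before, codon_after):
    """Describe a codon change: edited positions (1-based), amino acid
    change, and whether the edit is synonymous."""
    positions = [i + 1 for i in range(3) if codon_before[i] != codon_after[i]]
    aa_before = CODON_TABLE[codon_before]
    aa_after = CODON_TABLE[codon_after]
    return {
        "positions": positions,
        "aa_change": f"{aa_before}->{aa_after}",
        "synonymous": aa_before == aa_after,
    }


# The six most frequent codon changes reported in the text.
top_changes = [("TCA", "TTA"), ("TCT", "TTT"), ("TCG", "TTG"),
               ("CCA", "CTA"), ("TCC", "TTC"), ("CGG", "TGG")]
for before, after in top_changes:
    print(before, after, classify_edit(before, after))

second_position = sum(classify_edit(b, a)["positions"] == [2] for b, a in top_changes)
print(second_position)  # number of top changes at the second codon position
```

Under the standard code, all six of these changes alter the encoded amino acid, which is in line with the predominance of nonsynonymous sites reported above.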

We also predicted 648 RNA editing sites in 45 genes using the PREP-mt program. Comparing the RNA editing sites identified by the two methods, 591 are shared, 57 are unique to the PREP-mt program, and 212 are unique to our method; the underestimation by the PREP-mt program is evident.

Gene expression analysis based on transcriptome data

Using the RNA-Seq datasets, we obtained mt transcriptomic profiles for the 8 samples (Fig 3 and Table 3). Three healthy leaf samples have the most abundant mapped reads (3.71% to 1.47%), two disease related leaf samples and embryogenic callus fall into the second abundance group (0.29% to 0.24%), whereas endosperm and embryo are the least abundant (0.12% and 0.05%, respectively). Read abundance of mt sequence coincides with tissue characteristics but read coverage shows a different pattern. First, root wilt disease susceptible (RWDS) leaf has the highest read coverage (71.92%) and coconut yellow decline (CYD) leaf has the lowest read coverage (34.94%). Second, healthy leaf samples (54.77% to 68.00%) and embryogenic callus (57.52%) have higher read coverage as opposed to embryo (37.34%) and endosperm (45.28%) (Table 4).

Fig 3. Circular display of C. nucifera mt transcriptomes.

We display (from outside to inside): physical map scaled in kb; coding sequences transcribed in the clockwise and counterclockwise directions (nad in red; cob, matR and mttB in green; cox in blue; atp in purple; ccm in orange; rpl in yellow; rps in dark red; rRNA in dark green; tRNA in dark blue; orf in dark purple; and others in black); histogram of transcriptome data (plus strand in red and minus strand in green, standing for normalized average coverage value per 100 bp ranging from 0 to 100) for sample Health_leaf1, CYD_leaf, Callus, RWDS_leaf, Endosperm, Embryo, Health_leaf2 and Leaf_fruit; coding sequences transcribed in the clockwise and counterclockwise directions; and the four regions (thick lines indicate IRs and thin lines indicate LSC and SSC). * indicates pseudogene.

Of the total 145 genes, 113 are expressed in at least two samples, whereas only 3 genes (rpo, trna-UUA, and trnI-AAU) are expressed in a single sample (Young_leaf) (S8 Table). The number of expressed genes is consistent with read coverage. CYD leaf has the fewest expressed genes (92), whereas RWDS leaf has the most (116). The genes petL and orf247-ct, which have a stop codon in the middle of the gene sequence, are highly expressed; however, we have not found any RNA editing sites that would rescue the normal protein-coding function. Both of them need to be validated in future studies. All pseudogenes other than rpl10 have relatively high expression levels in all samples. According to the transcriptomic profiles, we found 13 polycistronic transcripts among 37 genes (S9 Table). The conserved co-transcribed gene clusters rps12-nad3 and 5SrRNA-18SrRNA are also found in our mt genome.

According to the gene expression profiles (Fig 4), we have observed several obvious features. First, the genes can be divided into three categories according to expression intensity: highly, moderately, and lowly expressed. Second, among 33 highly expressed genes, there are only two tRNA genes (trnI-GAU and trnH-GUG). Third, three of the five rps19 copies are highly expressed and the rest are moderately expressed. Fourth, only nad6 and ccmFn1 of the 25 respiration related genes are lowly expressed.

Fig 4. Expression patterns of mt genes among 8 RNA-Seq datasets.

The expression levels are normalized based on DEseq.

Phylogenetic relationships

Our phylogenetic trees are built based on 31 mt protein-coding genes from 19 selected plants (8 monocots and 6 eudicots, as well as one each of gymnosperm, vascular plant, liverwort, hornwort, and moss; Fig 2). The maximum-likelihood (ML) tree has higher bootstrap values than the maximum parsimony (MP) tree except for the node of S. bicolor and Z. mays and the node between P. squarrosus and the group of M. polymorpha and P. patens. Most nodes have bootstrap values larger than 65% except for one node (49%) among Helianthus annuus, V. vinifera and Carica papaya and another node (58%) between P. squarrosus and the group of M. polymorpha and P. patens from ML method. Both methods have high bootstrap values (97% and 85%) for subgroup of C. nucifera, P. dactylifera and B. umbellatus in monocots. Previous studies indicate that date palm appears to be the most basal among monocots [24, 69]. Moreover, date palm has certain miRNA families only found in eudicots [70]. Taken together, these results suggest that Arecaceae separated from the monocotyledon clade earlier than other plant families.


Conclusions

Although the C. nucifera mt genome is as large as 678,653 bp, we have assembled it using a variety of datasets and information, including all available plant mt genome sequences, C. nucifera mt sequence datasets from different platforms and libraries with variable insert sizes, and specialized bioinformatics tools suited to different purposes. The genome sequence variations and the RNA editing sites identified from transcriptomic data are valuable resources for further biological studies. Phylogenetic analysis indicates that Arecaceae separated from the rest of the monocotyledons early in flowering plant evolution.

Supporting Information

S1 Fig. The homologous mt genes among C. nucifera and 18 other representative plant species.


S2 Fig. A mt genome comparison among C. nucifera, P. dactylifera and B. umbellatus.


S3 Fig. Palm mt and cp genome comparisons between P. dactylifera and C. nucifera (Ref) based on MUMmer.

(A) mt genomes and (B) cp genomes. Variation between the mt genomes is much higher than variation between the cp genomes.


S1 Table. The homologous mt genes among C. nucifera and 18 other representative species.


S2 Table. The cp-derived regions in the C. nucifera mt genome.


S3 Table. Tandem repeats identified in the C. nucifera mt genome by using Tandem Repeats Finder.


S4 Table. Long repeats (≥100 bp, forward and palindromic) identified in the C. nucifera mt genome based on REPuter.


S5 Table. The common variations identified in the C. nucifera mt genome by using SAMtools/BCFtools and RGAAT.


S6 Table. The functional evaluation of common variations in the C. nucifera mt genome.


S7 Table. RNA editing sites identified by using RNA-Seq data and the PREP-mt program.

NT, nucleotide; AA, amino acid.


S8 Table. Reads of mt genes in the 8 coconut RNA-Seq datasets.


S9 Table. 13 polycistronic transcripts identified based on the 8 coconut RNA-Seq datasets.



Acknowledgments

We thank Drs. Xiaowei Zhang and Yuxin Yin for collecting the coconut samples in the early stage of this project. We acknowledge the public RNA-Seq data produced by other research groups. We also thank all members of the Joint Center for Genomics Research (JCGR) for their help in completing this project. Technical support was provided by the CAS Key Laboratory of Genome Science and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, the People's Republic of China. The authors thank the anonymous reviewers and editors for their critical comments and helpful suggestions.

Author Contributions

  1. Conceptualization: WFL QL SNH JY HAA.
  2. Data curation: WFL QL.
  3. Formal analysis: WFL QL YHZ JYZ AA IOA AOA.
  4. Funding acquisition: HAA WFL QL.
  5. Investigation: WFL QL HAA.
  6. Methodology: WFL QL.
  7. Project administration: HAA SNH JY.
  8. Resources: WFL QL HAA.
  9. Software: WFL QL.
  10. Supervision: WFL QL HAA SNH JY.
  11. Validation: WFL QL AMA AA JY.
  12. Visualization: WFL QL.
  13. Writing – original draft: WFL QL HAA SNH JY.
  14. Writing – review & editing: WFL QL HAA SNH JY.


References

  1. Lang BF, Gray MW, Burger G. Mitochondrial genome evolution and the origin of eukaryotes. Annual review of genetics. 1999;33(1):351–97. pmid:10690412
  2. McBride HM, Neuspiel M, Wasiak S. Mitochondria: more than just a powerhouse. Current Biology. 2006;16(14):R551–R60. pmid:16860735
  3. Oda K, Yamato K, Ohta E, Nakamura Y, Takemura M, Nozato N, et al. Gene organization deduced from the complete sequence of liverwort Marchantia polymorpha mitochondrial DNA: a primitive form of plant mitochondrial genome. Journal of molecular biology. 1992;223(1):1–7. pmid:1731062
  4. NCBI RC. Database resources of the National Center for Biotechnology Information. Nucleic acids research. 2013;41(Database issue):D8. pmid:23193264
  5. Adams KL, Palmer JD. Evolution of mitochondrial gene content: gene loss and transfer to the nucleus. Molecular phylogenetics and evolution. 2003;29(3):380–95. pmid:14615181
  6. Cummings MP, Nugent JM, Olmstead RG, Palmer JD. Phylogenetic analysis reveals five independent transfers of the chloroplast gene rbcL to the mitochondrial genome in angiosperms. Current genetics. 2003;43(2):131–8. pmid:12695853
  7. Turmel M, Otis C, Lemieux C. The mitochondrial genome of Chara vulgaris: insights into the mitochondrial DNA architecture of the last common ancestor of green algae and land plants. The Plant Cell. 2003;15(8):1888–903. pmid:12897260
  8. Sloan DB, Wu Z. History of plastid DNA insertions reveals weak deletion and AT mutation biases in angiosperm mitochondrial genomes. Genome biology and evolution. 2014;6(12):3210–21. pmid:25416619
  9. Vaughn JC, Mason MT, Sper-Whitis GL, Kuhlman P, Palmer JD. Fungal origin by horizontal transfer of a plant mitochondrial group I intron in the chimeric coxI gene of Peperomia. Journal of Molecular Evolution. 1995;41(5):563–72. pmid:7490770
  10. Mower JP, Sloan DB, Alverson AJ. Plant mitochondrial genome diversity: the genomics revolution: Springer; 2012.
  11. Adams KL, Qiu Y-L, Stoutemyer M, Palmer JD. Punctuated evolution of mitochondrial gene content: high and variable rates of mitochondrial gene loss and transfer to the nucleus during angiosperm evolution. Proceedings of the National Academy of Sciences. 2002;99(15):9905–12. pmid:12119382
  12. Palmer JD, Herbon LA. Plant mitochondrial DNA evolved rapidly in structure, but slowly in sequence. Journal of Molecular Evolution. 1988;28(1–2):87–97. pmid:3148746
  13. Gray MW, Burger G, Lang BF. Mitochondrial evolution. Science. 1999;283(5407):1476–81. pmid:10066161
  14. Lynch M, Koskella B, Schaack S. Mutation pressure and the evolution of organelle genomic architecture. Science. 2006;311(5768):1727–30. pmid:16556832
  15. Wolfe KH, Li W-H, Sharp PM. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proceedings of the National Academy of Sciences. 1987;84(24):9054–8. pmid:3480529
  16. Kubo T, Newton KJ. Angiosperm mitochondrial genomes and mutations. Mitochondrion. 2008;8(1):5–14. pmid:18065297
  17. Wu Z, Cuthbert JM, Taylor DR, Sloan DB. The massive mitochondrial genome of the angiosperm Silene noctiflora is evolving by gain or loss of entire chromosomes. Proceedings of the National Academy of Sciences. 2015;112(33):10185–91. pmid:25944937
  18. Winkler M, Kück U. The group IIB intron from the green alga Scenedesmus obliquus mitochondrion: molecular characterization of the in vitro splicing products. Current genetics. 1991;20(6):495–502. pmid:1723663
  19. Gualberto JM, Lamattina L, Bonnard G, Weil J-H, Grienenberger J-M. RNA editing in wheat mitochondria results in the conservation of protein sequences. 1989. pmid:2552325
  20. Hiesel R, Combettes B, Brennicke A. Evidence for RNA editing in mitochondria of all major groups of land plants except the Bryophyta. Proceedings of the National Academy of Sciences. 1994;91(2):629–33. pmid:8290575
  21. Malek O, Lättig K, Hiesel R, Brennicke A, Knoop V. RNA editing in bryophytes and a molecular phylogeny of land plants. The EMBO journal. 1996;15(6):1403. pmid:8635473
  22. Freyer R, Kiefer-Meyer M-C, Kössel H. Occurrence of plastid RNA editing in all major lineages of land plants. Proceedings of the National Academy of Sciences. 1997;94(12):6285–90. pmid:9177209
  23. Wu Z, Stone JD, Štorchová H, Sloan DB. High transcript abundance, RNA editing, and small RNAs in intergenic regions within the massive mitochondrial genome of the angiosperm Silene noctiflora. BMC genomics. 2015;16(1):938. pmid:26573088
  24. Fang Y, Wu H, Zhang T, Yang M, Yin Y, Pan L, et al. A complete sequence and transcriptomic analyses of date palm (Phoenix dactylifera L.) mitochondrial genome. PloS one. 2012;7(5):e37164. pmid:22655034
  25. Schnell RJ, Priyadarshan P. Genomics of tree crops: Springer Science & Business Media; 2012.
  26. Gunn BF, Baudouin L, Beulé T, Ilbert P, Duperray C, Crisp M, et al. Ploidy and domestication are associated with genome size variation in Palms. American journal of botany. 2015;102(10):1625–33. pmid:26437888
  27. Alsaihati B. Coconut genome de novo sequencing. Plant and Animal Genome XXII Conference; 2014: Plant and Animal Genome.
  28. Fan H, Xiao Y, Yang Y, Xia W, Mason AS, Xia Z, et al. RNA-Seq analysis of Cocos nucifera: transcriptome sequencing and de novo assembly for subsequent functional genomics approaches. PloS one. 2013;8(3):e59997. pmid:23555859
  29. Huang Y-Y, Lee C-P, Fu JL, Chang BC-H, Matzke AJ, Matzke M. De Novo Transcriptome Sequence Assembly from Coconut Leaves and Seeds with a Focus on Factors Involved in RNA-Directed DNA Methylation. G3: Genes|Genomes|Genetics. 2014;4(11):2147–57. pmid:25193496
  30. Nejat N, Cahill DM, Vadamalai G, Ziemann M, Rookes J, Naderali N. Transcriptomics-based analysis using RNA-Seq of the coconut (Cocos nucifera) leaf in response to yellow decline phytoplasma infection. Molecular Genetics and Genomics. 2015:1–12. pmid:25893418
  31. Gawel N, Jarret R. A modified CTAB DNA extraction procedure for Musa and Ipomoea. Plant Molecular Biology Reporter. 1991;9(3):262–6.
  32. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC bioinformatics. 2009;10(1):421. pmid:20003500
  33. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research. 1997;25(17):3389–402. pmid:9254694
  34. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215(3):403–10. pmid:2231712
  35. Iorizzo M, Senalik D, Szklarczyk M, Grzebelus D, Spooner D, Simon P. De novo assembly of the carrot mitochondrial genome using next generation sequencing of whole genomic DNA provides first evidence of DNA transfer into an angiosperm plastid genome. BMC plant biology. 2012;12(1):61. pmid:22548759
  36. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature methods. 2012;9(4):357–9. pmid:22388286
  37. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. pmid:19505943
  38. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987–93. pmid:21903627
  39. Li H. Improving SNP discovery by base alignment quality. Bioinformatics. 2011;27(8):1157–8. pmid:21320865
  40. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nature biotechnology. 2011;29(1):24–6. pmid:21221095
  41. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in bioinformatics. 2012:bbs017. pmid:22517427
  42. Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, et al. Rfam 12.0: updates to the RNA families database. Nucleic acids research. 2014:gku1063. pmid:25392425
  43. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research. 1997;25(5):955–64. pmid:9023104
  44. Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic acids research. 2001;29(22):4633–42. pmid:11713313
  45. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research. 1999;27(2):573. pmid:9862982
  46. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome biology. 2004;5(2):R12. pmid:14759262
  47. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014:btu170. pmid:24695404
  48. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26(7):873–81. pmid:20147302
  49. Mower JP. The PREP suite: predictive RNA editors for plant mitochondrial genes, chloroplast genes and user-defined alignments. Nucleic acids research. 2009;37(suppl 2):W253–W9. pmid:19433507
  50. Larkin MA, Blackshields G, Brown N, Chenna R, McGettigan PA, McWilliam H, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23(21):2947–8. pmid:17846036
  51. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. MEGA6: molecular evolutionary genetics analysis version 6.0. Molecular biology and evolution. 2013;30(12):2725–9. pmid:24132122
  52. Zhang H, Gao S, Lercher MJ, Hu S, Chen W-H. EvolView, an online tool for visualizing, annotating and managing phylogenetic trees. Nucleic acids research. 2012;40(W1):W569–W72. pmid:22695796
  53. Anders S, Huber W. Differential expression analysis for sequence count data. Genome biol. 2010;11(10):R106. pmid:20979621
  54. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology. 2011;29(7):644.
  55. Binder S, Marchfelder A, Brennicke A. Regulation of gene expression in plant mitochondria. Post-Transcriptional Control of Gene Expression in Plants: Springer; 1996. p. 303–14. pmid:8980484
  56. Tian X, Zheng J, Hu S, Yu J. The discriminatory transfer routes of tRNA genes among organellar and nuclear genomes in flowering plants: a genome-wide investigation of indica rice. Journal of molecular evolution. 2007;64(3):299–307. pmid:17273918
  57. Stern DB, Lonsdale DM. Mitochondrial and chloroplast genomes of maize have a 12-kilobase DNA sequence in common. Nature. 1982;299(5885):698–702. pmid:6889685
  58. Stern DB, Palmer JD. Extensive and widespread homologies between mitochondrial DNA and chloroplast DNA in plants. Proceedings of the National Academy of Sciences. 1984;81(7):1946–50. pmid:16593442
  59. Wang D, Wu Y-W, Shih AC-C, Wu C-S, Wang Y-N, Chaw S-M. Transfer of chloroplast genomic DNA to mitochondrial genome occurred at least 300 MYA. Molecular biology and evolution. 2007;24(9):2040–8. pmid:17609537
  60. Yang M, Zhang X, Liu G, Yin Y, Chen K, Yun Q, et al. The complete chloroplast genome sequence of date palm (Phoenix dactylifera L.). PloS one. 2010;5(9):e12762. pmid:20856810
  61. Ebersberger I, Metzler D, Schwarz C, Pääbo S. Genomewide comparison of DNA sequences between humans and chimpanzees. The American Journal of Human Genetics. 2002;70(6):1490–7. pmid:11992255
  62. Freudenberg-Hua Y, Freudenberg J, Kluck N, Cichon S, Propping P, Nöthen MM. Single nucleotide variation analysis in 65 candidate genes for CNS disorders in a representative sample of the European population. Genome research. 2003;13(10):2271–6. pmid:14525928
  63. Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. Journal of Computational biology. 2000;7(1–2):203–14. pmid:10890397
  64. Covello PS, Gray MW. RNA editing in plant mitochondria. 1989.
  65. Hiesel R, Wissinger B, Schuster W, Brennicke A. RNA editing in plant mitochondria. Science. 1989;246(4937):1632–4. pmid:2480644
  66. Picardi E, Horner DS, Chiara M, Schiavon R, Valle G, Pesole G. Large-scale detection and analysis of RNA editing in grape mtDNA by RNA deep-sequencing. Nucleic acids research. 2010;38(14):4755–67. pmid:20385587
  67. Cuenca A, Petersen G, Seberg O. The complete sequence of the mitochondrial genome of Butomus umbellatus–a member of an early branching lineage of Monocotyledons. 2013. pmid:23637852
  68. Smith DR. RNA-Seq data: a goldmine for organelle research. Briefings in functional genomics. 2013;12(5):454–6. pmid:23334532
  69. Al-Mssallem IS, Hu S, Zhang X, Lin Q, Liu W, Tan J, et al. Genome sequence of the date palm Phoenix dactylifera L. Nature communications. 2013;4. pmid:23917264
  70. Xin C, Liu W, Lin Q, Zhang X, Cui P, Li F, et al. Profiling microRNA expression during multi-staged date palm (Phoenix dactylifera L.) fruit development. Genomics. 2015;105(4):242–51. pmid:25638647